1 Probability Space
1.1 Introduction
1.2 Probability space
1.3 Properties of probability
1.4 Probabilities on a Finite or Countable Space
1.5 Conditional Probability
1.6 Independence
5 Parameter estimation
5.1 Samples and characteristics of samples
5.2 Data display
5.2.1 Stem-and-leaf diagrams
5.2.2 Frequency distribution and histogram
5.2.3 Box plots
5.2.4 Probability plots
5.3 Point estimations
5.3.1 Statistics
5.3.2 Point estimators
5.3.3 Confidence intervals
5.4 Method of finding estimation
7 Regression
7.1 Simple linear regression
7.1.1 Simple linear regression model
7.1.2 Confidence interval for σ2
7.1.3 Confidence interval for β1
7.1.4 Confidence interval for β0
7.1.5 Prediction intervals
Chapter 1
Probability Space
1.1 Introduction
Random experiments are experiments whose outcome cannot be predicted with certainty in advance.
But when one repeats the same experiment a large number of times, one can observe some “regularity”
in the average outcome. A typical example is the toss of a coin: one cannot predict the result
of a single toss, but if we toss the coin many times we get an average of about 50% of “heads” if
the coin is fair. The theory of probability aims at a mathematical theory which describes
such phenomena. This theory contains three main ingredients:
a) The state space: this is the set of all possible outcomes of the experiment, and it is usually
denoted by Ω.
Examples
b) The event: An “event” is a property which can be observed either to hold or not to hold
after the experiment is done. In mathematical terms, an event is a subset of Ω. If A and B are two
events, then A ∪ B (“A or B”), A ∩ B (“A and B”) and the complement Ac (“not A”) are again events.
We denote by A the family of all events. Often (but not always: we will see why later) we have
A = 2Ω , the set of all subsets of Ω. The family A should be “stable” under the logical operations
described above: if A, B ∈ A then we must have Ac ∈ A, A ∩ B ∈ A, A ∪ B ∈ A, and also Ω ∈ A
and ∅ ∈ A.
c) The probability: With each event A one associates a number denoted by P(A) and called
the ”probability of A”. This number measures the likelihood of the event A to be realized a priori,
before performing the experiment. It is chosen between 0 and 1, and the more likely the event is,
the closer to 1 this number is.
To get an idea of the properties of these numbers, one can imagine that they are the limits of
the ”frequency” with which the events are realized: let us repeat the same experiment n times;
the n outcomes might of course be different (think of n successive tosses of the same die, for
instance). Denote by fn (A) the frequency with which the event A is realized (i.e. the number of
times the event occurs, divided by n). Intuitively we have
P(A) = limn→∞ fn (A)
(we will give a precise meaning to this “limit” later). From the obvious properties of frequencies,
we immediately deduce that:
1. 0 ≤ P(A) ≤ 1;
2. P(Ω) = 1;
3. P(A ∪ B) = P(A) + P(B) whenever A and B are disjoint events.
A mathematical model for our experiment is thus a triple (Ω, A, P), consisting of the space Ω,
the family A of all events, and the family of all P(A) for A ∈ A; hence we can consider that P is
a map from A into [0, 1], which satisfies at least the properties (2) and (3) above (plus in fact an
additional property, more difficult to understand, and which is given later).
A fourth notion, also important although less basic, is the following one:
d) Random variable: A random variable is a quantity which depends on the outcome of the
experiment. In mathematical terms, this is a map from Ω into a space E, where often E = R or
E = Rd . Warning: this terminology, which is rooted in the history of Probability Theory going
back 400 years, is quite unfortunate; a random ”variable” is not a variable in the analytical sense,
but a function!
Let X be such a random variable, mapping Ω into E. One can then ”transport” the prob-
abilistic structure onto the target space E, by setting PX (B) = P(X −1 (B)) for B ∈ E, where
X −1 (B) = {w ∈ Ω : X(w) ∈ B} denotes the pre-image of B by X. This formula defines a new
probability, denoted by PX but on the space E instead of Ω. The probability PX is called the law
of the variable X.
Example (toss of two dice): One tosses two fair dice and observes the number of dots appearing
on each die. The sample space is Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}, and it is natural to
take here A = 2Ω and
P(A) = |A|/36 if A ⊂ Ω,
where |A| denotes the number of points in A. One easily verifies the properties (1), (2), (3) above,
and P({w}) = 1/36 for each singleton. The map X : Ω → N defined by X(i, j) = |j − i| is the
random variable “difference of the two dice”, and its law is
PX (B) = (number of pairs (i, j) such that |i − j| ∈ B)/36
(for example PX ({1}) = 5/18, PX ({5}) = 1/18, etc.). We will formalize the concepts of a probability
space and random variable in the following sections.
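As a quick illustration, here is a minimal Python sketch (not part of the original notes) that enumerates the 36 equally likely outcomes and recovers the law of X computed above.

from fractions import Fraction
from collections import Counter

# count the outcomes (i, j) with each value of X(i, j) = |i - j| for two fair dice
counts = Counter(abs(i - j) for i in range(1, 7) for j in range(1, 7))
law = {k: Fraction(c, 36) for k, c in counts.items()}
print(law[1], law[5])  # 5/18 1/18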
1. ∅ ∈ A and Ω ∈ A;
2. if A ∈ A then Ac ∈ A;
3. A is closed under finite unions and finite intersections: that is, if A1 , . . . , An are all in A,
then ∪ni=1 Ai and ∩ni=1 Ai are in A as well (for this it is enough that A be stable under the union and
the intersection of any two sets);
4. A is closed under countable unions and countable intersections: that is, if A1 , A2 , . . . are all in A,
then ∪∞i=1 Ai and ∩∞i=1 Ai are in A as well.
Definition 1.2.1. A is an algebra if it satisfies (1), (2) and (3) above. It is a σ-algebra, (or a σ-field)
if it satisfies (1), (2), and (4) above.
Definition 1.2.3. If C ⊂ 2Ω , the σ-algebra generated by C, and written σ(C), is the smallest σ-
algebra containing C. (It always exists because 2Ω is a σ-algebra, and the intersection of a family
of σ-algebras is again a σ-algebra)
where x1 , . . . , xn ∈ Q.
We can show that B(Rd ) is also the σ-algebra generated by all open subsets (or by all the
closed subsets) of Rd .
Definition 1.2.4. A probability measure (or probability) on (Ω, A) is a map P : A → [0, 1] such that:
1. P(Ω) = 1;
2. for every countable sequence (An )n≥1 of elements of A, pairwise disjoint (that is, An ∩ Am =
∅ whenever n ≠ m), one has
P(∪∞n=1 An ) = Σ∞n=1 P(An ).
Axiom (2) above is called countable additivity; the number P(A) is called the probability of
the event A.
In Definition 1.2.4 one might imagine a more elementary condition than (2), namely simple additivity: for A, B ∈ A with A ∩ B = ∅,
P(A ∪ B) = P(A) + P(B). (1.1)
Theorem 1.2.5. Let (Ω, A, P) be a probability space. The following properties hold:
(i) P(∅) = 0;
(ii) P is additive;
(iii) P(Ac ) = 1 − P(A) for every A ∈ A;
(iv) if A, B ∈ A and A ⊂ B, then P(A) ≤ P(B).
Proof. If in Axiom (2) we take An = ∅ for all n, we see that the number a = P(∅) is equal to an
infinite sum of itself; since 0 ≤ a ≤ 1, this is possible only if a = 0, and we have (i).
For (ii) it suffices to apply Axiom (2) with A1 = A, A2 = B and A3 = A4 = . . . = ∅, plus the
fact that P(∅) = 0, to obtain (1.1).
Applying (1.1) for A ∈ A and B = Ac we get (iii).
To show (iv), suppose A ⊂ B; then applying (1.1) to A and B \ A we obtain P(B) = P(A) + P(B \ A) ≥ P(A).
Theorem 1.3.1. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies P(Ω) = 1 and is additive.
Then the following statements are equivalent:
Proof. The notation An ↓ A means that An+1 ⊂ An for each n and ∩∞n=1 An = A. The notation
An ↑ A means that An ⊂ An+1 for each n and ∪∞n=1 An = A.
(i) ⇒ (v) Let An ∈ A with An ↑ A. We construct a new sequence as follows: B1 = A1 and
Bn = An \ An−1 for n ≥ 2. Then ∪∞n=1 Bn = A, An = ∪ni=1 Bi , and the events (Bn )n≥1 are pairwise disjoint.
Therefore
P(A) = P(∪k≥1 Bk ) = Σ∞k=1 P(Bk ).
Hence
P(An ) = Σnk=1 P(Bk ) ↑ Σ∞k=1 P(Bk ) = P(A).
Conversely, if (Ak )k≥1 are pairwise disjoint events in A and Bn = ∪nk=1 Ak , then Bn ↑ ∪∞k=1 Ak and, by additivity, P(Bn ) = Σnk=1 P(Ak ); hence
P(∪∞k=1 Ak ) = limn→∞ P(Bn ) = limn→∞ Σnk=1 P(Ak ) = Σ∞k=1 P(Ak ).
We can say that An ∈ A converges to A (we write An → A) if limn→∞ IAn (w) = IA (w) for all
w ∈ Ω. Note that if the sequence An increases (resp. decreases) to A, then it also tends to A in the
above sense.
Theorem 1.3.2. Let P be a probability measure and let An be a sequence of events in A which
converges to A. Then A ∈ A and limn→∞ P(An ) = P(A).
Proof. Let Bn = ∩m≥n Am and Cn = ∪m≥n Am . Then Bn increases to A and Cn decreases to
A, thus limn→∞ P(Bn ) = limn→∞ P(Cn ) = P(A) by Theorem 1.3.1. However Bn ⊂ An ⊂ Cn , therefore
P(Bn ) ≤ P(An ) ≤ P(Cn ), so limn→∞ P(An ) = P(A) as well.
Lemma 1.3.3. Let S be a set. Let I be a π-system on S, that is, a family of subsets of S stable under
finite intersection:
I1 , I2 ∈ I ⇒ I1 ∩ I2 ∈ I.
Let Σ = σ(I). Suppose that µ1 and µ2 are probability measures on (S, Σ) such that µ1 = µ2 on I.
Then µ1 = µ2 on Σ.
Proof. Let
D = {F ∈ Σ : µ1 (F ) = µ2 (F )}.
Then D is a d-system, that is,
a) S ∈ D,
b) if A, B ∈ D and A ⊆ B then B \ A ∈ D,
c) if An ∈ D and An ↑ A, then A ∈ D.
Moreover, if F ∈ I then µ1 (F ) = µ2 (F ), so that F ∈ D.
Since D is a d-system and D ⊇ I, then D ⊇ σ(I) = Σ, and the result follows.
This lemma implies that if two probability measures agree on a π-system, then they agree on
the σ-algebra generated by that π-system.
Now, by definition of λ,
Σn µ0 (Fn ) = Σn µ0 (E ∩ Fn ) + Σn µ0 (E c ∩ Fn ) ≥ λ(E ∩ G) + λ(E c ∩ G),
Theorem 1.4.1. Let (pw )w∈Ω be a family of real numbers indexed by the finite or countable set
Ω. Then there exists a unique probability P such that P({w}) = pw for all w ∈ Ω if and only if pw ≥ 0 and
Σw∈Ω pw = 1. In this case, for any A ⊂ Ω,
P(A) = Σw∈A pw .
Suppose first that Ω is finite. Any family of nonnegative terms summing up to 1 gives an exam-
ple of a probability on Ω. But among all these examples the following is particularly important:
Definition 1.4.2. A probability P on the finite set Ω is called uniform if pw = P({w}) does not
depend on w.
In that case pw = 1/|Ω| for every w ∈ Ω, hence P(A) = |A|/|Ω|, and computing the probability of any event A amounts to counting the number of points in A.
On a given finite set Ω there is one and only one uniform probability.
Example: There are 20 balls in an urn, 10 white and 10 red. One draws a set of 5 balls from the
urn. Denote by X the number of white balls in the set. We want to find the probability that X = x,
where x is an arbitrary fixed integer.
We label the white balls from 1 to 10 and the red balls from 11 to 20. Since the balls are drawn
at once, it is natural to consider that an outcome is a subset with 5 elements of the set {1, . . . , 20}
of all 20 balls. That is, Ω is the family of all subsets with 5 points, and the total number of possible
outcomes is |Ω| = C20^5 . Next, it is also natural to consider that all possible outcomes are equally
likely, that is, P is the uniform probability on Ω. The quantity X is a “random variable” because
when the outcome w is known, one also knows the number X(w). The possible values of X range
from 0 to 5, and the set X −1 ({x}) = {X = x} contains C10^x C10^{5−x} points for all 0 ≤ x ≤ 5. Hence
P(X = x) = C10^x C10^{5−x} / C20^5 if 0 ≤ x ≤ 5, and P(X = x) = 0 otherwise.
We thus obtain, when x varies, the distribution or the law, of X. This distribution is called the
hypergeometric distribution.
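As a numerical cross-check (an illustrative Python sketch, not part of the original notes), the hypergeometric probabilities above can be computed directly; they should sum to 1.

from math import comb

# P(X = x) = C(10, x) * C(10, 5 - x) / C(20, 5) for x = 0, ..., 5
probs = {x: comb(10, x) * comb(10, 5 - x) / comb(20, 5) for x in range(6)}
print(probs)
print(sum(probs.values()))  # 1.0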
Definition 1.5.1. Let A, B be events with P(B) > 0. The conditional probability of A given B is
P(A|B) = P(A ∩ B)/P(B).
Theorem 1.5.2. Suppose P(B) > 0. The operation A 7→ P(A|B) from A → [0, 1] defines a new
probability measure on A, called the conditional probability measure given B.
Proof. We define Q(A) = P(A|B), with B fixed. We must show that Q satisfies (1) and (2) of Definition 1.2.4. But
Q(Ω) = P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.
Therefore, Q satisfies (1). As for (2), note that if (An )n≥1 is a sequence of elements of A which are
pairwise disjoint, then
Q(∪∞n=1 An ) = P(∪∞n=1 An |B) = P((∪∞n=1 An ) ∩ B)/P(B) = P(∪∞n=1 (An ∩ B))/P(B),
and the sequence (An ∩ B)n≥1 is pairwise disjoint as well; thus
Q(∪∞n=1 An ) = Σ∞n=1 P(An ∩ B)/P(B) = Σ∞n=1 P(An |B) = Σ∞n=1 Q(An ).
Proof. We use induction. For n = 2, the theorem is simply Definition 1.5.1. Suppose the theorem holds for
n − 1 events. Let B = A1 ∩ . . . ∩ An−1 . Then by Definition 1.5.1, P(B ∩ An ) = P(An |B)P(B); next we replace
P(B) by its value given by the inductive hypothesis, which yields
P(A1 ∩ . . . ∩ An ) = P(A1 )P(A2 |A1 ) · · · P(An |A1 ∩ . . . ∩ An−1 ).
A finite or countable family (En )n≥1 of events is called a partition of Ω if:
1. En ∈ A for each n;
2. En ∩ Em = ∅ whenever n ≠ m;
3. Ω = ∪n En .
Theorem 1.5.5 (Partition Equation). Let (En )n≥1 be a finite or countable partition of Ω. Then if
A ∈ A,
P(A) = Σn P(A|En )P(En ).
Theorem 1.5.6 (Bayes’ Theorem). Let (En ) be a finite or countable partition of Ω, and suppose
P(A) > 0. Then
P(En |A) = P(A|En )P(En ) / Σm P(A|Em )P(Em ).
Example 1.5.7. Because a new medical procedure has been shown to be effective in the early
detection of an illness, a medical screening of the population is proposed. The probability that
the test correctly identifies someone with the illness as positive is 0.99, and the probability that
the test correctly identifies someone without the illness as negative is 0.95. The incidence of the
illness in the general population is 0.0001. You take the test, and the result is positive. What is
the probability that you have the illness? Let D denote the event that you have the illness, and
let S denote the event that the test signals positive. The probability requested is P(D|S).
The probability that the test correctly signals someone without the illness as negative is 0.95.
Consequently, the probability of a positive test without the illness is
P(S|Dc ) = 0.05.
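Putting these figures into Bayes' theorem (a direct computation using only the numbers stated above):
P(D|S) = P(S|D)P(D) / [P(S|D)P(D) + P(S|Dc )P(Dc )] = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.05 × 0.9999) ≈ 0.002.
So even with a positive test result, the probability of actually having the illness is only about 0.2%.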
Example 1.5.8. Suppose that Bob can decide to go to work by one of three modes of transporta-
tion, car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a
0.5 chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes
overcrowded, the probability of being late is only 0.2. The commuter train is almost never late,
with a probability of only 0.01, but is more expensive than the bus.
(a) Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove
to work that day by car. Since he does not know which mode of transportation Bob usually uses,
he gives a prior probability of 1/3 to each of the three possibilities. What is the boss' estimate of the
probability that Bob drove to work?
(b) Suppose that a coworker of Bob's knows that he almost always takes the commuter train
to work, never takes the bus, but sometimes, 0.1 of the time, takes the car. What is the coworker's
probability that Bob drove to work that day, given that he was late?
We have the following information given in the problem:
P(bus) = P(car) = P(train) = 1/3;
P(late|car) = 0.5;
P(late|train) = 0.01;
P(late|bus) = 0.2.
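For part (a), a direct computation with the numbers above (combining the partition equation and Bayes' theorem) gives the boss' estimate:
P(late) = (0.5 + 0.2 + 0.01)/3 = 0.71/3, so P(car|late) = (0.5 · 1/3)/(0.71/3) = 0.5/0.71 ≈ 0.704.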
For part (b), repeat the identical calculations as above, but instead of the prior probabilities being 1/3, use
P(bus) = 0, P(car) = 0.1, and P(train) = 0.9. Plugging these three changes into the same equation,
we get P(car|late) = 0.8475.
1.6 Independence
Definition 1.6.1. 1. Two events A and B are independent if
P(A ∩ B) = P(A)P(B).
2. A (possibly infinite) collection of events (Ai )i∈I is a pairwise independent collection if for
any distinct elements i1 , i2 ∈ I, P(Ai1 ∩ Ai2 ) = P(Ai1 )P(Ai2 ).
3. A (possibly infinite) collection of events (Ai )i∈I is an independent collection if for every
finite subset J of I, one has
P(∩i∈J Ai ) = ∏i∈J P(Ai ).
If events (Ai )i∈I are independent, they are pairwise independent, but the converse is false.
Proposition 1.6.2. a) If A and B are independent, so also are A and B c ; Ac and B; and Ac and B c .
b) If A and B are independent and P(B) > 0, then P(A|B) = P(A).
Proof. a) A and B c : since A and B are independent, P(A ∩ B) = P(A)P(B) = P(A)(1 − P(B c )) =
P(A) − P(A)P(B c ). On the other hand, P(A ∩ B) = P(A) − P(A ∩ B c ). Comparing these two expressions, we obtain
P(A ∩ B c ) = P(A)P(B c ),
so A and B c are independent. The pairs Ac , B and Ac , B c are handled in the same way.
b) By independence,
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),
and, when P(B c ) > 0,
P(A|B c ) = P(A ∩ B c )/P(B c ) = P(A)P(B c )/P(B c ) = P(A).
Examples:
1. Toss a coin 3 times. If Ai is an event depending only on the ith toss, then it is standard to
model (Ai )1≤i≤3 as being independent.
2. One chooses a card at random from a deck of 52 cards. A = ”the card is a heart”, and
B = ”the card is Queen”. A natural model for this experiment consists in prescribing the
probability 1/52 for picking any one of the cards. By additivity, P(A) = 13/52 and P(B) =
4/52 and P(A ∩ B) = 1/52 hence A and B are independent.
3. Let Ω = {1, 2, 3, 4} and A = 2Ω , with P({i}) = 1/4 for i = 1, 2, 3, 4. Let A = {1, 2}, B =
{1, 3}, and C = {2, 3}. Then A, B, C are pairwise independent but are not independent.
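To check this claim concretely (a short verification not written out above): P(A ∩ B) = P({1}) = 1/4 = P(A)P(B), and similarly P(A ∩ C) = P(B ∩ C) = 1/4; but A ∩ B ∩ C = ∅, so P(A ∩ B ∩ C) = 0 ≠ 1/8 = P(A)P(B)P(C).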
Exercises
Axiom of Probability
1.1. Give a possible sample space for each of the following experiments:
2. A student is asked for the month of the year and the day of the week on which her birthday
falls.
1.4. Let (Gα )α∈I be an arbitrary family of σ-algebras defined on an abstract space Ω. Show that
H = ∩α∈I Gα is also a σ-algebra.
1.5. Suppose that Ω is an infinite set (countable or not), and let A be the family of all subsets
which are either finite or have a finite complement. Show that A is an algebra, but not a σ-
algebra.
1.6. Give a counterexample that shows that, in general, the union A ∪ B of two σ-algebras need
not be a σ-algebra.
1.7. Let Ω = {a, b, c} be a sample space. Let P({a}) = 1/2, P({b}) = 1/3, and P({c}) = 1/6. Find
the probabilities for all eight subsets of Ω.
1.10. Let (Bn ) be a sequence of events such that P(Bn ) = 1 for all n ≥ 1. Show that
P(∩n Bn ) = 1.
1.14. Let (Ω, A, P) be a probability space. Show for events Bi ⊂ Ai the following inequality
P(∪i Ai ) − P(∪i Bi ) ≤ Σi [P(Ai ) − P(Bi )].
1.15. If (Bk )1≤k≤n are events such that Σnk=1 P(Bk ) > n − 1, then
P(∩nk=1 Bk ) > 0.
1.16. In the laboratory analysis of samples from a chemical process, five samples from the pro-
cess are analyzed daily. In addition, a control sample is analyzed two times each day to check the
calibration of the laboratory instruments.
1. How many different sequences of process and control samples are possible each day? As-
sume that the five process samples are considered identical and that the two control sam-
ples are considered identical.
2. How many different sequences of process and control samples are possible if we consider
the five process samples to be different and the two control samples to be identical.
3. For the same situation as part (b), how many sequences are possible if the first test of each
day must be a control sample?
1.17. In the design of an electromechanical product, seven different components are to be stacked
into a cylindrical casing that holds 12 components in a manner that minimizes the impact of
shocks. One end of the casing is designated as the bottom and the other end is the top.
2. If the seven components are all identical, how many different designs are possible?
3. If the seven components consist of three of one type of component and four of another
type, how many different designs are possible? (more difficult)
1. How many three-digit phone prefixes that are used to represent a particular geographic
area (such as an area code) can be created from the digits 0 through 9?
2. As in part (a), how many three-digit phone prefixes are possible that do not start with 0 or
1, but contain 0 or 1 as the middle digit?
3. How many three-digit phone prefixes are possible in which no digit appears more than
once in each prefix?
2. If the first bit of a byte is a parity check, that is, the first byte is determined from the other
seven bits, how many different bytes are possible?
1.20. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If four chips are
taken at random and without replacement, find the probability that: (a) each of the 4 chips is
red; (b) none of the 4 chips is red; (c) there is at least 1 chip of each color.
1.21. Three distinct integers are chosen at random from the first 20 positive integers. Compute
the probability that: (a) their sum is even; (b) their product is even.
1.22. There are 5 red chips and 3 blue chips in a bowl. The red chips are numbered 1, 2, 3, 4,
5, respectively, and the blue chips are numbered 1, 2, 3, respectively. If 2 chips are to be drawn
at random and without replacement, find the probability that these chips have either the same
number or the same color.
1.23. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines 5 bulbs, which are
selected at random and without replacement. (a) Find the probability of at least 1 defective bulb
among the 5. (b) How many bulbs should be examined so that the probability of finding at least
1 bad bulb exceeds 0.2 ?
1.24. Three winning tickets are drawn from an urn of 100 tickets. What is the probability of winning
for a person who buys:
1. 4 tickets?
1.25. A drawer contains eight different pairs of socks. If six socks are taken at random and with-
out replacement, compute the probability that there is at least one matching pair among these
six socks.
1.26. There are n students in a class.
1. What is the probability that at least two students have the same birthday?
2. What is the minimum value of n which secures probability 1/2 that at least two have a
common birthday?
1.27. Four mice are chosen (without replacement) from a litter containing two white mice. The
probability that both white mice are chosen is twice the probability that neither is chosen. How
many mice are there in the litter?
1.28. Suppose there are N different types of coupons available when buying cereal; each box
contains one coupon and the collector is seeking to collect one of each in order to win a prize.
After buying n boxes, what is the probability pn that the collector has at least one of each type?
(Consider sampling with replacement from a population of N distinct elements. The sample size
is n > N . Use inclusion-exclusion formula)
1.29. An absent-minded person has to put n personal letters in n addressed envelopes, and he
does it at random. What is the probability pm,n that exactly m letters will be put correctly in their
envelopes?
1.30. N men run out of a men’s club after a fire and each takes a coat and a hat. Prove that:
a) the probability that no one will take his own coat and hat is
ΣNk=0 (−1)^k (N − k)! / (N ! k!) ;
b) the probability that each man takes a wrong coat and a wrong hat is
[ ΣNk=2 (−1)^k / k! ]^2 .
1.31. You throw 6n dice at random. Find the probability that each number appears exactly n
times.
1.32. * Mary tosses n + 1 fair coins and John tosses n fair coins. What is the probability that Mary
gets more heads than John?
Conditional Probability
1.33. Bowl I contains 6 red chips and 4 blue chips. Five of these 10 chips are selected at random
and without replacement and put in bowl II, which was originally empty. One chip is then drawn
at random from bowl II. Given that this chip is blue, find the conditional probability that 2 red
chips and 3 blue chips are transferred from bowl I to bowl II.
1.34. You enter a chess tournament where your probability of winning a game is 0.3 against half
the players (call them type 1), 0.4 against a quarter of the players (call them type 2), and 0.5
against the remaining quarter of the players (call them type 3). You play a game against a ran-
domly chosen opponent. What is the probability of winning?
1.35. We roll a fair four-sided die. If the result is 1 or 2, we roll once more but otherwise, we stop.
What is the probability that the sum total of our rolls is at least 4?
1.36. There are three coins in a box. One is a two-headed coin, another is a fair coin, and the
third is a biased coin that comes up heads 75 percent of the time. When one of the three coins
is selected at random and flipped, it shows heads. What is the probability that it was the two-
headed coin?
1.37. Alice is taking a probability class and at the end of each week she can be either up-to-
date or she may have fallen behind. If she is up-to-date in a given week, the probability that she
will be up-to-date (or behind) in the next week is 0.8 (or 0.2, respectively). If she is behind in a
given week, the probability that she will be up-to-date (or behind) in the next week is 0.6 (or 0.4,
respectively). Alice is (by default) up-to-date when she starts the class. What is the probability
that she is up-to-date after three weeks?
1.38. At the station there are three payphones which accept 20p pieces. One never works, an-
other always works, while the third works with probability 1/2. On my way to the metropolis
for the day, I wish to identify the reliable phone, so that I can use it on my return. The station
is empty and I have just three 20p pieces. I try one phone and it does not work. I try another
twice in succession and it works both times. What is the probability that this second phone is the
reliable one?
1.39. An insurance company insures an equal number of male and female drivers. In any given
year the probability that a male driver has an accident involving a claim is α, independently of
other years. The analogous probability for females is β. Assume the insurance company selects
a driver at random.
a) What is the probability the selected driver will make a claim this year?
b) What is the probability the selected driver makes a claim in two consecutive years?
c) Let A1 , A2 be the events that a randomly chosen driver makes a claim in each of the first
and second years, respectively. Show that P (A2 |A1 ) ≥ P (A1 ).
1.40. Three newspapers A, B and C are published in a certain city, and a survey shows that for
the adult population 20% read A, 16% B, and 14% C, 8% read both A and B, 5% both A and C, 4%
both B and C, and 2% read all three. If an adult is chosen at random, find the probability that
1.41. Customers are used to evaluate preliminary product designs. In the past, 95% of highly
successful products received good reviews, 60% of moderately successful products received good
reviews, and 10% of poor products received good reviews. In addition, 40% of products have been
highly successful, 35% have been moderately successful, and 25% have been poor products.
2. If a new design attains a good review, what is the probability that it will be a highly success-
ful product?
3. If a product does not attain a good review, what is the probability that it will be a highly
successful product?
1.42. An inspector working for a manufacturing company has a 99% chance of correctly identi-
fying defective items and a 0.5% chance of incorrectly classifying a good item as defective. The
company has evidence that its line produces 0.9% of nonconforming items.
1. What is the probability that an item selected for inspection is classified as defective?
1.43. A new analytical method to detect pollutants in water is being tested. This new method
of chemical analysis is important because, if adopted, it could be used to detect three different
contaminants (organic pollutants, volatile solvents, and chlorinated compounds) instead of having
to use a single test for each pollutant. The makers of the test claim that it can detect high levels of
organic pollutants with 99.7% accuracy, volatile solvents with 99.95% accuracy, and chlorinated
compounds with 89.7% accuracy. If a pollutant is not present, the test does not signal. Sam-
ples are prepared for the calibration of the test and 60% of them are contaminated with organic
pollutants, 27% with volatile solvents, and 13% with traces of chlorinated compounds.
A test sample is selected randomly.
2. If the test signals, what is the probability that chlorinated compounds are present?
1.44. Software to detect fraud in consumer phone cards tracks the number of metropolitan areas
where calls originate each day. It is found that 1% of the legitimate users originate calls from two
or more metropolitan areas in a single day. However, 30% of fraudulent users originate calls from
two or more metropolitan areas in a single day. The proportion of fraudulent users is 0.01%. If
the same user originates calls from two or more metropolitan areas in a single day, what is the
probability that the user is fraudulent?
1.45. The probability of getting through by telephone to buy concert tickets is 0.92. For the same
event, the probability of accessing the vendor's Web site is 0.95. Assume that these two ways
to buy tickets are independent. What is the probability that someone who tries to buy tickets
through the Internet and by phone will obtain tickets?
1.46. The British government has stepped up its information campaign regarding foot and mouth
disease by mailing brochures to farmers around the country. It is estimated that 99% of Scot-
tish farmers who receive the brochure possess enough information to deal with an outbreak of
the disease, but only 90% of those without the brochure can deal with an outbreak. After the
first three months of mailing, 95% of the farmers in Scotland received the informative brochure.
Compute the probability that a randomly selected farmer will have enough information to deal
effectively with an outbreak of the disease.
1.47. In an automated filling operation, the probability of an incorrect fill when the process is
operated at a low speed is 0.001. When the process is operated at a high speed, the probability of
an incorrect fill is 0.01. Assume that 30% of the containers are filled when the process is operated
at a high speed and the remainder are filled when the process is operated at a low speed.
2. If an incorrectly filled container is found, what is the probability that it was filled during the
high-speed operation?
1.48. An encryption-decryption system consists of three elements: encode, transmit, and de-
code. A faulty encode occurs in 0.5% of the messages processed, transmission errors occur in
1% of the messages, and a decode error occurs in 0.1% of the messages. Assume the errors are
independent.
2. What is the probability of a message that has either an encode or a decode error?
1.49. It is known that two defective copies of a commercial software program were erroneously
sent to a shipping lot that has now a total of 75 copies of the program. A sample of copies will be
selected from the lot without replacement.
1. If three copies of the software are inspected, determine the probability that exactly one of
the defective copies will be found.
2. If three copies of the software are inspected, determine the probability that both defective
copies will be found.
3. If 73 copies are inspected, determine the probability that both copies will be found. Hint:
Work with the copies that remain in the lot.
1.50. A robotic insertion tool contains 10 primary components. The probability that any com-
ponent fails during the warranty period is 0.01. Assume that the components fail independently
and that the tool fails if any component fails. What is the probability that the tool fails during the
warranty period?
1.51. A machine tool is idle 15% of the time. You request immediate use of the tool on five dif-
ferent occasions during the year. Assume that your requests represent independent events.
1. What is the probability that the tool is idle at the time of all of your requests?
2. What is the probability that the machine is idle at the time of exactly four of your requests?
3. What is the probability that the tool is idle at the time of at least three of your requests?
1.52. A lot of 50 spacing washers contains 30 washers that are thicker than the target dimension.
Suppose that three washers are selected at random, without replacement, from the lot.
1. What is the probability that all three washers are thicker than the target?
2. What is the probability that the third washer selected is thicker than the target if the first
two washers selected are thinner than the target?
3. What is the probability that the third washer selected is thicker than the target?
1.53. Continuation of previous exercise. Washers are selected from the lot at random, without
replacement.
1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?
2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?
1.54. The alignment between the magnetic tape and head in a magnetic tape storage system
affects the performance of the system. Suppose that 10% of the read operations are degraded
by skewed alignments, 5% by off-center alignments, 1% by both skewness and off-center, and the
remaining read operations are properly aligned. The probability of a read error is 0.01 from a
skewed alignment, 0.02 from an off-center alignment, 0.06 from both conditions, and 0.001 from
a proper alignment. What is the probability of a read error?
1.55. Suppose that a lot of washers is large enough that it can be assumed that the sampling is
done with replacement. Assume that 60% of the washers exceed the target thickness.
1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?
2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?
1.56. In a chemical plant, 24 holding tanks are used for final product storage. Four tanks are
selected at random and without replacement. Suppose that six of the tanks contain material in
which the viscosity exceeds the customer requirements.
1. What is the probability that exactly one tank in the sample contains high viscosity material?
2. What is the probability that at least one tank in the sample contains high viscosity material?
3. In addition to the six tanks with high viscosity levels, four different tanks contain material
with high impurities. What is the probability that exactly one tank in the sample contains
high viscosity material and exactly one tank in the sample contains material with high im-
purities?
1.57. Plastic parts produced by an injection-molding operation are checked for conformance
to specifications. Each tool contains 12 cavities in which parts are produced, and these parts
fall into a conveyor when the press opens. An inspector chooses 3 parts from among the 12 at
random. Two cavities are affected by a temperature malfunction that results in parts that do not
conform to specifications.
1. What is the probability that the inspector finds exactly one nonconforming part?
2. What is the probability that the inspector finds at least one nonconforming part?
1.58. A bin of 50 parts contains five that are defective. A sample of two is selected at random,
without replacement.
1. Determine the probability that both parts in the sample are defective by computing a con-
ditional probability.
2. Determine the answer to part (a) by using the subset approach that was described in this
section.
1.59. * The Polya urn model is as follows. We start with an urn which contains one white ball and
one black ball. At each second we choose a ball at random from the urn and replace it together
with one more ball of the same color. Calculate the probability that when n balls are in the urn, i
of them are white.
1.60. You have n urns, the rth of which contains r − 1 red balls and n − r blue balls, r = 1, . . . , n.
You pick an urn at random and remove two balls from it without replacement. Find the prob-
ability that the two balls are of different colors. Find the same probability when you put back a
removed ball.
1.61. A coin shows heads with probability p on each toss. Let πn be the probability that the
number of heads after n tosses is even. Show that πn+1 = (1 − p)πn + p(1 − πn ) and find πn .
1.62. There are n similarly biased dice such that the probability of obtaining a 6 with each one
of them is the same and equal to p (0 < p < 1). If all the dice are rolled once, show that pn , the
probability that an odd number of 6's is obtained, satisfies the difference equation
pn + (2p − 1)pn−1 = p.
1.63. Dubrovsky sits down to a night of gambling with his fellow officers. Each time he stakes u
roubles there is a probability r that he will win and receive back 2u roubles (including his stake).
At the beginning of the night he has 8000 roubles. If ever he has 256000 roubles he will marry the
beautiful Natasha and retire to his estate in the country. Otherwise, he will commit suicide. He
decides to follow one of two courses of action:
(i) to stake 1000 roubles each time until the issue is decided;
Advise him (a) if r = 1/4 and (b) if r = 3/4. What are the chances of a happy ending in each case
if he follows your advice?
Independence
1.64. Let the events A1 , A2 , . . . , An be independent and P (Ai ) = p (i = 1, 2, . . . , n). What is the
probability that:
1.65. Each of four persons fires one shot at a target. Let Ck denote the event that the target is hit
by person k, k = 1, 2, 3, 4. If C1 , C2 , C3 , C4 are independent and if P(C1 ) = P(C2 ) = 0.7, P(C3 ) =
0.9, and P(C4 ) = 0.4, compute the probability that (a) all of them hit the target; (b) exactly one
hits the target; (c) no one hits the target; (d) at least one hits the target.
1.66. The probability of winning on a single toss of the dice is p. A starts, and if he fails, he passes
the dice to B, who then attempts to win on her toss. They continue tossing the dice back and
forth until one of them wins. What are their respective probabilities of winning?
1.67. Two darts players throw alternately at a board and the first to score a bull wins. On each of
their throws player A has probability pA and player B pB of success; the results of different throws
are independent. If A starts, calculate the probability that he/she wins.
1.68. * A fair coin is tossed until either the sequence HHH occurs in which case I win or the
sequence T HH occurs, when you win. What is the probability that you win?
1.69. Let A1 , . . . , An be independent events, with P(Ai ) < 1. Prove that there exists an event B
with P(B) > 0 such that B ∩ Ai = ∅ for 1 ≤ i ≤ n.
1.70. n balls are placed at random into n cells. Find the probability pn that exactly two cells
remain empty.
1.71. An urn contains b black balls and r red balls. One of the balls is drawn at random, but when
it is put back in the urn c additional balls of the same color are put in with it. Now suppose that
we draw another ball. Show that the probability that the first ball drawn was black given that the
second ball drawn was red is b/(b + r + c).
1.72. Suppose every packet of the detergent TIDE contains a coupon bearing one of the letters
of the word TIDE. A customer who has all the letters of the word gets a free packet. All the letters
have the same probability of appearing in a packet. Find the probability that a housewife who
buys 8 packets will get:
Chapter 2
Random Variables and Distributions
2.1.1 Definitions
Since the set Ω is countable, the range of X is also countable. Suppose that X(Ω) = {x1 , x2 , . . .}.
Then the distribution of X is completely determined by the numbers p^X_i = P(X = xi ), i ≥ 1. Indeed, for any event A ∈ A,
PX (A) = Σxi∈A P[X = xi ] = Σxi∈A p^X_i .
Definition 2.1.1. Let X be a real-valued random variable on a countable space Ω. Suppose that
X(Ω) = {x1 , x2 , . . .}. The expectation of X, denoted E[X], is defined to be
E[X] := Σi xi P[X = xi ] = Σi xi p^X_i ,
provided this sum makes sense; this is the case when at least one of the following conditions is satisfied:
1. Ω is finite;
2. Ω is countable and the series Σi xi p^X_i converges absolutely;
3. X ≥ 0 always (in this case the above sum, and hence E[X] as well, may take the value +∞).
Remark 1. Since Ω is countable, we denote by pw the probability that the elementary event w ∈ Ω
happens. Then the expectation of X is given by
E[X] = Σw∈Ω X(w)pw .
Let L1 denote the space of all random variables with finite expectation defined on (Ω, A, P).
The following facts are straightforward from the definition of expectation.
5. Let ϕ : R → R. Then
E[ϕ(X)] = Σi ϕ(xi )p^X_i = Σw∈Ω ϕ(X(w))pw .
Remark 2. If E[X 2 ] = Σi xi^2 p^X_i < ∞, then
E[|X|] = Σi |xi |p^X_i ≤ (1/2) Σi (|xi |^2 + 1)p^X_i = (1/2)(E(X 2 ) + 1) < ∞.
The variance of X, denoted DX, is defined by
DX = E[(X − E[X])2 ],
and a direct computation shows that
DX = E[X 2 ] − (E[X])2 .
Hence
DX = Σi xi^2 p^X_i − ( Σi xi p^X_i )^2 .
2.1.2 Examples
Poisson distribution
X has a Poisson distribution with parameter λ > 0, denoted X ∼ P oi(λ), if X(Ω) = {0, 1, . . .}
and
P[X = k] = e^{−λ} λ^k / k! , k = 0, 1, . . .
The expectation of X is
E[X] = Σ∞k=0 k e^{−λ} λ^k / k! = λe^{−λ} Σ∞j=0 λ^j / j! = λe^{−λ} e^λ = λ.
A similar computation gives DX = λ.
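For completeness, the short computation behind this claim (it is not written out above) runs as follows:
E[X(X − 1)] = Σ∞k=2 k(k − 1) e^{−λ} λ^k / k! = λ^2 Σ∞j=0 e^{−λ} λ^j / j! = λ^2 , hence E[X 2 ] = λ^2 + λ and DX = E[X 2 ] − (E[X])2 = λ.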
Bernoulli distribution
X is Bernoulli with parameter p ∈ [0, 1], denoted X ∼ Ber(p), if it takes only two values 0 and
1 and
P[X = 1] = 1 − P[X = 0] = p.
X corresponds to an experiment with only two outcomes, usually called “success” (X = 1) and
“failure” (X = 0). The expectation and variance of X are E[X] = p and DX = E[X 2 ] − (E[X])2 = p − p2 = p(1 − p).
Binomial distribution
X has a Binomial distribution with parameters p ∈ [0, 1] and n ∈ N, denoted X ∼ B(n, p), if X
takes on the values {0, 1, . . . , n} and
P[X = k] = Cn^k p^k (1 − p)^{n−k} , k = 0, 1, . . . , n.
One has
E[X] = Σnk=0 k P[X = k] = Σnk=0 k Cn^k p^k (1 − p)^{n−k}
= np Σnk=1 C_{n−1}^{k−1} p^{k−1} (1 − p)^{n−k} = np,
and
E[X 2 ] = Σnk=0 k^2 P[X = k] = Σnk=0 k^2 Cn^k p^k (1 − p)^{n−k}
= n(n − 1)p^2 Σnk=2 C_{n−2}^{k−2} p^{k−2} (1 − p)^{n−k} + np Σnk=1 C_{n−1}^{k−1} p^{k−1} (1 − p)^{n−k}
= n(n − 1)p^2 + np,
so that DX = E[X 2 ] − (E[X])2 = n(n − 1)p^2 + np − (np)^2 = np(1 − p).
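An equivalent way to obtain these formulas (a standard remark, not taken from the text): writing X = ξ1 + . . . + ξn with independent ξi ∼ Ber(p), linearity of expectation gives E[X] = np, and since the ξi are pairwise uncorrelated, DX = Σni=1 Dξi = np(1 − p).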
Geometric distribution
One repeatedly performs a sequence of independent Bernoulli trials until achieving the first
success. Let X denote the number of failures before the first success. X has a Geometric
distribution with parameter q = 1 − p ∈ [0, 1], denoted X ∼ Geo(q), if
P[X = k] = q^k p, k = 0, 1, . . .
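As a short worked application of Definition 2.1.1 (the computation is not in the text):
E[X] = Σ∞k=0 k q^k p = pq Σ∞k=1 k q^{k−1} = pq/(1 − q)^2 = q/p.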
2.2.1 Definition
Let X be a map from Ω into R. The following statements are equivalent:
1. X is a random variable;
2. {w : X(w) ≤ a} ∈ A for every a ∈ R.
Proof. Claim (1) ⇒ (2) is self-evident; we prove (2) ⇒ (1). Let
C = {B ∈ B(R) : X −1 (B) ∈ A}.
Then C is a σ-algebra and it contains all sets of the form (−∞, a], a ∈ R. Thus C contains B(R).
On the other hand, C ⊂ B(R), so C = B(R). This concludes our proof.
Example 2.2.3. Let (Ω, A) be a measurable space. For each subset B of Ω one can verify that
IB is a random variable iff B ∈ A. More generally, if xi ∈ R and Ai ∈ A for all i belonging to some
countable index set I, then X(w) = Σi∈I xi IAi (w) is also a random variable. We call such a random
variable X a discrete random variable. When I is finite, X is called a simple random variable.
Definition 2.2.4. A function ϕ : Rd → R is called Borel measurable if ϕ−1 (B) ∈ B(Rd ) for all
B ∈ B(R).
Remark 3. It follows from the above definition that every continuous function is Borel. Consequently,
all the functions (x, y) 7→ x + y, (x, y) 7→ xy, (x, y) 7→ x/y, (x, y) 7→ x ∨ y, (x, y) 7→ x ∧ y
are Borel, where x ∨ y = max(x, y), x ∧ y = min(x, y).
Theorem 2.2.5. Let X1 , . . . , Xd be random variables defined on a measurable space (Ω, A) and
ϕ : Rd → R a Borel function. Then Y = ϕ(X1 , . . . , Xd ) is also a random variable.
Proof. Let X(w) = (X1 (w), . . . , Xd (w)); this is a map from (Ω, A) into Rd . For every a1 , . . . , ad ∈ R we have
X −1 ((−∞, a1 ] × · · · × (−∞, ad ]) = ∩di=1 {w : Xi (w) ≤ ai } ∈ A.
This implies X −1 (B) ∈ A for every B ∈ B(Rd ). Moreover, for every C ∈ B(R), B := ϕ−1 (C) ∈ B(Rd ).
Thus,
Y −1 (C) = X −1 (ϕ−1 (C)) ∈ A,
so Y is a random variable.
Corollary 2.2.6. If X and Y are random variables, so also are X ± Y, XY, X ∧ Y, X ∨ Y, |X|, X + :=
X ∨ 0, X − = (−X) ∨ 0 and X/Y (if Y 6= 0).
Theorem 2.2.7. If X1 , X2 , . . . are random variables then so are supn Xn , inf n Xn , lim supn Xn , lim inf n Xn
It follows from Theorem 2.2.7 that if the sequence of random variables (Xn )n≥1 point-wise
converges to X, i.e. Xn (w) → X(w) for all w ∈ Ω, then X is a random variable.
Theorem 2.2.8. Let X be a random variable defined on a probability space (Ω, A).
1. There exists a sequence of discrete random variables which uniformly point-wise converges
to X.
2. If X is non-negative then there exists a sequence of simple random variables Yn such that
Yn ↑ X.
Proof. 1. For each n ≥ 1, set Xn (w) = k/n if k/n ≤ X(w) < (k + 1)/n for some k ∈ Z. Then Xn is a
discrete random variable and |Xn (w) − X(w)| ≤ 1/n for every w ∈ Ω. Hence the sequence
(Xn ) converges uniformly in w to X.
2. Suppose that X ≥ 0. For each n ≥ 1, set Yn (w) = k/2^n if k/2^n ≤ X(w) < (k + 1)/2^n for some
k ∈ {0, 1, . . . , n2^n − 1}, and Yn (w) = n if X(w) ≥ n. One easily verifies that the sequence
of simple random variables (Yn ) satisfies Yn (w) ↑ X(w) for all w ∈ Ω.
Definition 2.2.9. Let X be a random variable defined on a measurable space (Ω, A). The σ-algebra σ(X) := X −1 (B(R)) = {X −1 (B) : B ∈ B(R)} is called the σ-algebra generated by X.
Theorem 2.2.10. Let X be a random variable defined on a measurable space (Ω, A) and Y a func-
tion Ω → R. Then Y is σ(X)-measurable iff there exists a Borel function ϕ : R → R such that
Y = ϕ(X).
Proof. The sufficient condition is evident; we prove the necessary condition. Firstly, suppose
Y is a discrete random variable taking values y1 , y2 , . . . Since Y is σ(X)-measurable, the sets An =
{w : Y (w) = yn } belong to σ(X). By definition of σ(X), there exists a sequence Bn ∈ B(R) such that
An = X −1 (Bn ). Denote
Cn = Bn \ (B1 ∪ · · · ∪ Bn−1 ) ∈ B(R), n ≥ 1.
The sets Cn are pairwise disjoint and X −1 (Cn ) = An for every n. Consider the Borel function
ϕ defined by
ϕ(x) = Σn≥1 yn ICn (x);
then Y = ϕ(X).
In the general case, by Theorem 2.2.8 there exists a sequence of discrete σ(X)-measurable functions Yn which converges uniformly to Y . Thus, there exist Borel functions ϕn such that Yn = ϕn (X). Denote
B = {x ∈ R : limn ϕn (x) exists}.
Clearly, B ∈ B(R) and B ⊃ X(Ω). Let ϕ(x) = limn ϕn (x)IB (x); we have Y = limn Yn = limn ϕn (X) = ϕ(X).
2.3.1 Definition
On the other hand, for any function F : R → [0, 1] satisfying these three conditions there
exists a (unique) probability measure µ on (R, B(R)) such that F (x) = µ((−∞, x)) for all x ∈ R
(see [13], Section 2.5.2).
If X and Y have the same distribution function, we say that X and Y are equal in distribution and
write X =d Y .
2.3.2 Examples
The distribution whose density is
f (x) = 1/(b − a) if a ≤ x ≤ b, and f (x) = 0 otherwise,
is called the Uniform distribution on [a, b] and is denoted by U [a, b]. The distribution function
corresponding to f is
F (x) = 0 if x < a, F (x) = (x − a)/(b − a) if a ≤ x ≤ b, and F (x) = 1 if x > b.
Suppose λ > 0. X has an exponential distribution with rate λ, denoted X ∼ Exp(λ), if X takes
values in (0, ∞) and its density is given by f (x) = λe^{−λx} , x > 0.
The distribution whose density is
f (x) = (1/√(2πσ^2 )) e^{−(x−a)^2 /(2σ^2 )} , x ∈ R,
is called the Normal distribution with mean a and variance σ 2 and is denoted by N(a, σ 2 ). When
a = 0 and σ 2 = 1, N(0, 1) is called the Standard normal distribution.
The distribution whose density is
fX (x) = x^{α−1} e^{−x/λ} / (Γ(α)λ^α ) · I(0,∞) (x)
is called the Gamma distribution with parameters α, λ (α, λ > 0); Γ denotes the gamma function
Γ(α) = ∫0∞ x^{α−1} e^{−x} dx. In particular, an Exp(λ) distribution is a G(1, λ) distribution. The gamma
distribution is frequently a probability model for waiting times; for instance, in life testing, the
waiting time until “death” is a random variable which is frequently modeled with a gamma distribution.
2.4 Expectation
Definition 2.4.1. Let X be a simple random variable which can be written in the form
X = Σni=1 ai IAi (2.2)
with ai ∈ R and Ai ∈ A. Its expectation is defined by E[X] := Σni=1 ai P(Ai ).
Denote by Ls = Ls (Ω, A, P) the set of simple random variables. It should be noted that a simple
random variable has of course many different representations of the form (2.2); however, E[X]
does not depend on the particular representation chosen for X.
Let X and Y be in Ls . We can write
X = Σni=1 ai IAi and Y = Σni=1 bi IAi
for some sets Ai which form a measurable partition of Ω. Then for any α, β ∈ R, αX + βY is
also in Ls and
αX + βY = Σni=1 (αai + βbi )IAi .
For a general non-negative random variable X, the expectation is defined by
E[X] := sup{E[Y ] : Y ∈ Ls , 0 ≤ Y ≤ X}. (2.3)
This supremum always exists in [0, ∞]. It follows from the positivity of the expectation operator that
the definition above for E[X] coincides with Definition 2.4.1 on Ls .
Note that E[X] ≥ 0, but it may happen that E[X] = +∞ even when X is never equal to +∞.
Definition 2.4.2. 1. A random variable X is called integrable if E[|X|] < ∞. In this case, its
expectation is defined to be
E[X] = E[X + ] − E[X − ]. (2.4)
We also write E[X] = ∫Ω X(w) dP(w) = ∫ X dP.
2. If E[X + ] and E[X − ] are not both equal to +∞ then the expectation of X is still defined and
given by (2.4) where we use the convention that +∞ + a = +∞ and −∞ + a = −∞ for any
a ∈ R.
Lemma 2.4.3. Let X be a non-negative random variable and (Xn )n≥1 a sequence of simple ran-
dom variables increasing to X. Then E[Xn ] ↑ E[X] (even if E[X] = ∞).
Proof. The sequence (E[Xn ])n≥1 is increasing and bounded above by E[X] (by Definition (2.3)), so it
converges to some a ≤ E[X]. To prove that a = E[X], it suffices to show that E[Y ] ≤ a for every
simple random variable Y satisfying 0 ≤ Y ≤ X.
Indeed, suppose Y takes m different values y1 , . . . , ym and let Ak = {w : Y (w) = yk }. For each
ε ∈ (0, 1], consider the sequence Yn,ε = (1 − ε)Y I{(1−ε)Y ≤Xn } . Then Yn,ε is a simple random
variable and Yn,ε ≤ Xn , so
E[Yn,ε ] ≤ E[Xn ] ≤ a for every n. (2.5)
On the other hand, Y ≤ limn Xn , so for every w ∈ Ω there exists n = n(w) such that (1 − ε)Y (w) ≤
Xn (w); that is, Ak ∩ {w : (1 − ε)Y (w) ≤ Xn (w)} → Ak as n → ∞. We have
E[Yn,ε ] = (1 − ε) Σmk=1 yk P(Ak ∩ [(1 − ε)Y ≤ Xn ]) → (1 − ε) Σmk=1 yk P(Ak ) = (1 − ε)E[Y ], as n → ∞.
Together with (2.5), this gives (1 − ε)E[Y ] ≤ a for every ε ∈ (0, 1], i.e. E[Y ] ≤ a.
Using Lemma 2.4.3 one deduces that
E(αX) = αE(X) for every α ∈ R. (2.6)
On the other hand, let Z = X + Y ; we have Z + − Z − = X + Y = X + + Y + − (X − + Y − ), so
Z + + X − + Y − = Z − + X + + Y + . Thus E(Z + ) + E(X − ) + E(Y − ) = E(Z − ) + E(X + ) + E(Y + ),
and therefore E(Z) = E(Z + ) − E(Z − ) = E(X) + E(Y ), i.e. E(X + Y ) = E(X) + E(Y ).
An event A happens almost surely if P(A) = 1. Thus we say X equals Y almost surely if
P[X = Y ] = 1 and denote X = Y a.s.
Corollary 2.4.5. 1. If Y ∈ L1 and |X| ≤ Y , then X ∈ L1 .
Theorem 2.4.6. Let X and Y be integrable random variables. If X = Y a.s. then E[X] = E[Y ].
Proof. Firstly, we consider the case where X and Y are non-negative. Let A = {w : X(w) ≠ Y (w)}; we
have P(A) = 0. Moreover, XIAc = Y IAc , so it suffices to show that E[XIA ] = E[Y IA ] = 0.
Suppose (Yn ) is a sequence of simple random variables increasing to Y . Hence (Yn IA ) is also a
sequence of simple random variables increasing to Y IA . If for each n ≥ 1 the random variable Yn
is bounded by Nn , then E[Yn IA ] ≤ Nn P(A) = 0, and letting n → ∞ gives E[Y IA ] = 0. The same
argument applies to X, which proves the claim in this case.
Theorem 2.4.7 (Monotone convergence theorem). If the random variables Xn are non-negative
and increasing a.s. to X, then limn→∞ E[Xn ] = E[X] (even if E[X] = ∞).
Proof. For each n, let (Yn,k )k≥1 be a sequence of simple random variables increasing to Xn and
let Zk = maxn≤k Yn,k . Then (Zk )k≥1 is an increasing sequence of simple non-negative random variables,
and thus the limit Z = limk→∞ Zk exists. Also
Yn,k ≤ Zk ≤ Xk ≤ X for all n ≤ k.
Letting k → ∞ and then n → ∞ on the left gives X ≤ Z, while Zk ≤ X gives Z ≤ X; since the left
and right sides are the same, X = Z a.s. By Lemma 2.4.3 and Theorem 2.4.6, E[Zk ] ↑ E[Z] = E[X],
and since E[Zk ] ≤ E[Xk ] ≤ E[X], we deduce limn→∞ E[Xn ] = E[X].
Theorem 2.4.8 (Fatou's lemma). If the random variables Xn satisfy Xn ≥ Y a.s. for all n and some
Y ∈ L1 , then
E[lim infn→∞ Xn ] ≤ lim infn→∞ E[Xn ].
Proof. Firstly we prove the theorem in the case Y = 0. Let Yn = infk≥n Xk . Then (Yn ) is a
non-decreasing sequence of random variables with limn Yn = lim infn Xn , and Xn ≥ Yn , so E[Xn ] ≥ E[Yn ].
Applying the monotone convergence theorem to the sequence Yn , we obtain
lim infn→∞ E[Xn ] ≥ limn→∞ E[Yn ] = E[lim infn→∞ Xn ].
The general case follows from applying the above result to the sequence of non-negative random
variables X̂n := Xn − Y .
Theorem 2.4.9 (Lebesgue's dominated convergence theorem). If the random variables Xn converge
a.s. to X and supn |Xn | ≤ Y a.s. for some Y ∈ L1 , then X, Xn ∈ L1 and
limn→∞ E[Xn ] = E[X].
Proof. Since |X| ≤ Y , X ∈ L1 . Let Zn = |Xn − X|. Since Zn ≥ 0 and −Zn ≥ −2Y , applying Fatou's
lemma to Zn and to −Zn , we obtain
0 = E(lim infn→∞ Zn ) ≤ lim infn→∞ E[Zn ] ≤ lim supn→∞ E[Zn ] = − lim infn→∞ E[−Zn ] ≤ −E(lim infn→∞ (−Zn )) = 0.
Hence limn E|Xn − X| = 0, and since |E[Xn ] − E[X]| ≤ E|Xn − X|, the conclusion follows.
Theorem 2.4.10 (Cauchy–Schwarz inequality). If X, Y ∈ L2 , then XY ∈ L1 and |E(XY )|2 ≤ E(X 2 )E(Y 2 ).
Proof. If E(X 2 )E(Y 2 ) = 0 then XY = 0 a.s., and thus |E(XY )|2 = E(X 2 )E(Y 2 ) = 0.
If E(X 2 )E(Y 2 ) ≠ 0, applying the inequality 2|ab| ≤ a2 + b2 with a = X/√(E(X 2 )) and b =
Y /√(E(Y 2 )) and then taking expectations of both sides, we obtain
2 E[|XY |] / √(E(X 2 )E(Y 2 )) ≤ E[X 2 ]/E(X 2 ) + E[Y 2 ]/E(Y 2 ) = 2,
so E|XY | ≤ √(E(X 2 )E(Y 2 )) and the inequality follows.
If X ∈ L2 , we denote
DX = E[(X − EX)2 ].
DX is called the variance of X. Using the linearity of expectation operator, one can verify that
DX = E(X 2 ) − (EX)2 .
Theorem 2.4.11. 1. (Markov's inequality) Suppose X ∈ L1 . Then for any a > 0,
P(|X| ≥ a) ≤ E(|X|)/a.
2. (Chebyshev's inequality) Suppose X ∈ L2 . Then for any a > 0,
P(|X − EX| ≥ a) ≤ DX/a2 .
Proof. 1) Since aI{|X|≥a} (w) ≤ |X(w)|I{|X|≥a} (w) ≤ |X(w)| for every w ∈ Ω, taking expectations
of both sides we obtain aP(|X| ≥ a) ≤ E(|X|).
2) Applying Markov's inequality, we have
P(|X − EX| ≥ a) = P(|X − EX|2 ≥ a2 ) ≤ DX/a2 .
Theorem 2.4.12. Suppose that X has a density function f . Let h : R → R be a Borel function. We have
E(h(X)) = ∫ h(x)f (x) dx.
Example 2.4.13. Let X ∼ Exp(1). Applying Theorem 2.4.12 for h(x) = x and h(x) = x2 respectively, we have
EX = ∫0∞ x e^{−x} dx = 1 and EX 2 = ∫0∞ x2 e^{−x} dx = 2.
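Continuing this example (a brief illustration of the two inequalities above): since DX = E[X 2 ] − (E[X])2 = 2 − 1 = 1, Markov's inequality gives P(X ≥ 3) ≤ E[X]/3 = 1/3 and Chebyshev's inequality gives P(|X − 1| ≥ 2) ≤ 1/4, while the exact value is P(X ≥ 3) = e^{−3} ≈ 0.05.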
2.5.1 Definitions
Let X = (X1 , . . . , Xd ) be a d-dimensional random vector defined on (Ω, A, P). The distribution
function of X is defined by
FX (x1 , . . . , xd ) = P[X1 < x1 , . . . , Xd < xd ], (x1 , . . . , xd ) ∈ Rd .
4. F is left continuous.
In particular, if X has a density f , we have
P[X1 ∈ B1 ] = P[X ∈ B1 × Rd−1 ] = ∫B1 ( ∫Rd−1 f (x1 , . . . , xd ) dx2 . . . dxd ) dx1 for all B1 ∈ B(R).
This implies that if X = (X1 , . . . , Xd ) has a density f , then X1 also has a density, given by
fX1 (x1 ) = ∫Rd−1 f (x1 , x2 , . . . , xd ) dx2 . . . dxd , for all x1 ∈ R. (2.8)
Theorem 2.5.2. Let X = (X1 , . . . , Xd ) be a random vector which has density function f , and let ϕ : Rd →
R be a Borel measurable function. We have
E[ϕ(X)] = ∫Rd ϕ(x)f (x) dx
provided that ϕ is non-negative or ∫Rd |ϕ(x)|f (x) dx < ∞.
2.5.2 Example
Polynomial distribution
P[X1 = k1 , . . . , Xd = kd ] = n!/(k1 ! k2 ! · · · kd+1 !) · p1^{k1} p2^{k2} · · · pd+1^{kd+1} ,
Using Theorem 2.5.2 and the change of variables formula we have the following useful result.
2. The (Ei , Ei )-valued random variables (Xi )i∈I are independent if the σ-algebras (Xi−1 (Ei ))i∈I
are independent.
• A1 , A2 , . . . ∈ D with An ↑ A implies A ∈ D;
Lemma 2.6.2 (Monotone classes). Let C, D be classes of subsets of Ω where C is a π-system and D
is a λ-system such that C ⊂ D. Then σ(C) ⊂ D.
Lemma 2.6.3. Let G and F be sub-σ-algebras of A. Let G1 and F1 be π-systems such that σ(G1 ) = G
and σ(F1 ) = F. Then G is independent of F if F1 and G1 are independent, i.e., if
P(F ∩ G) = P(F )P(G) for all F ∈ F1 , G ∈ G1 .
Proof. Suppose that F1 and G1 are independent. Fix any F ∈ F1 and define
DF = {G ∈ G : P(F ∩ G) = P(F )P(G)}.
Then DF is a λ-system containing the π-system G1 , so DF = G by Lemma 2.6.2; that is, every
F ∈ F1 is independent of every G ∈ G. Now fix any G ∈ G and apply the same argument to
EG = {F ∈ F : P(F ∩ G) = P(F )P(G)}, which is a λ-system containing the π-system F1 , so EG = F,
which yields the desired property.
Theorem 2.6.4. Let X and Y be two random variables. The following statements are equivalent:
(i) X is independent of Y ;
(ii) P[X < x, Y < y] = P[X < x] P[Y < y] for all x, y ∈ R;
(iii) f (X) and g(Y ) are independent for any Borel functions f, g : R → R;
(iv) E[f (X)g(Y )] = E[f (X)]E[g(Y )] for any Borel functions f, g : R → R which are either positive
or bounded.
Proof. (i) ⇒ (ii): Suppose X is independent of Y ; then the two events {w : X(w) < x} and {w : Y (w) <
y} are independent for every x, y ∈ R. We have (ii).
(ii) ⇒ (i): Since the set of events {w : X(w) < x}, x ∈ R, is a π-system generating σ(X) and
{w : Y (w) < y}, y ∈ R, is a π-system generating σ(Y ), applying Lemma 2.6.3 we conclude that X is
independent of Y .
(i) ⇒ (iii): For every A, B ∈ B(R), we have f −1 (A), g −1 (B) ∈ B(R), so {f (X) ∈ A} = {X ∈
f −1 (A)} ∈ σ(X) and {g(Y ) ∈ B} = {Y ∈ g −1 (B)} ∈ σ(Y ); since σ(X) and σ(Y ) are independent,
these events are independent, and hence f (X) and g(Y ) are independent.
For (iv), by (iii) it suffices to show that
E(XY ) = E(X)E(Y ) for independent random variables X and Y which are integrable or non-negative.
Firstly, we suppose that X and Y are non-negative. By Theorem 2.2.8 there exists a sequence of
simple random variables Xn = Σ_{i=1}^{kn} ai IAi increasing to X and Yn = Σ_{j=1}^{ln} bj IBj increasing to Y ,
with Ai ∈ σ(X) and Bj ∈ σ(Y ). Applying the monotone convergence theorem, we have
E(XY ) = limn→∞ E(Xn Yn ) = limn→∞ Σ_{i=1}^{kn} Σ_{j=1}^{ln} ai bj P(Ai Bj ) = limn→∞ Σ_{i=1}^{kn} Σ_{j=1}^{ln} ai bj P(Ai )P(Bj )
= limn→∞ ( Σ_{i=1}^{kn} ai P(Ai ) )( Σ_{j=1}^{ln} bj P(Bj ) ) = limn→∞ E(Xn )E(Yn ) = E(X)E(Y ).
2.7 Covariance
Definition 2.7.1. The covariance of random variables X, Y ∈ L² is defined by cov(X, Y) = E[(X − EX)(Y − EY)], and the correlation coefficient ρ(X, Y) = cov(X, Y)/√(DX · DY) satisfies
|ρ(X, Y)| ≤ 1.
Example 2.7.2. Let X and Y be independent random variables whose distributions are N(0, 1). Denote Z = XY and T = X − Y. We have
cov(Z, T) = E(XY(X − Y)) − E(XY)E(X − Y) = 0,
and
cov(Z, T²) = E(XY(X − Y)²) − E(XY)E((X − Y)²) = −2,
since E(XY) = EX EY = 0, E(X³Y) = E(X³)EY = 0, E(XY³) = EX E(Y³) = 0 and E(X²Y²) = E(X²)E(Y²) = 1. Thus Z and T are uncorrelated random variables but not independent.
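A short simulation (an added illustration; the sample size is an arbitrary choice) estimating cov(Z, T) and cov(Z, T²) agrees with the computation above: the first estimate is close to 0 and the second is close to −2.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z, t = x * y, x - y

def cov(u, v):
    # sample version of E(UV) - E(U)E(V)
    return np.mean(u * v) - np.mean(u) * np.mean(v)

print("cov(Z, T)   ~", round(cov(z, t), 2))      # close to 0
print("cov(Z, T^2) ~", round(cov(z, t**2), 2))   # close to -2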
Proposition 2.7.3. Let (Xn)n≥1 be a sequence of pair-wise uncorrelated random variables. Then
D(X1 + . . . + Xn) = D(X1) + . . . + D(Xn).
Proof. We have
D(X1 + . . . + Xn) = E[((X1 − EX1) + . . . + (Xn − EXn))²]
= Σ_{i=1}^n E[(Xi − EXi)²] + 2 Σ_{1≤i<j≤n} E[(Xi − EXi)(Xj − EXj)]
= Σ_{i=1}^n E[(Xi − EXi)²] = Σ_{i=1}^n D(Xi),
since E[(Xi − EXi)(Xj − EXj)] = E(XiXj) − E(Xi)E(Xj) = 0 for any 1 ≤ i < j ≤ n.
2.8.1 Definition
Definition 2.8.1. Let (Ω, A, P) be a probability space and X an integrable random variable. Let G be a sub-σ-algebra of A. Then there exists a random variable Y such that
1. Y is G-measurable,
2. E[|Y|] < ∞,
3. E[Y I_A] = E[X I_A] for every A ∈ G.
Moreover, if Z is another random variable with these properties then P[Z = Y] = 1. Y is called a version of the conditional expectation E[X|G] of X given G, and we write Y = E[X|G], a.s.
2.8.2 Examples
Example 2.8.2. Let X be an integrable random variable and G = σ(A1, . . . , Am) where (Ai)1≤i≤m is a measurable partition of Ω. Suppose that P(Ai) > 0 for all i = 1, . . . , m. Then
E(X|G) = Σ_{i=1}^m (1/P(Ai)) (∫_{Ai} X dP) I_{Ai}.
Example 2.8.3. Let X and Z be random variables whose joint density is fX,Z(x, z). We know that fZ(z) = ∫_R fX,Z(x, z)dx is the density of Z. Define the elementary conditional density fX|Z of X given Z by
fX|Z(x|z) := fX,Z(x, z)/fZ(z) if fZ(z) ≠ 0, and fX|Z(x|z) := 0 otherwise.
Let h be a Borel function on R such that E[|h(X)|] < ∞. Set
g(z) = ∫_R h(x) fX|Z(x|z) dx.
Theorem 2.8.4. Let ξ, η be integrable random variables defined on (Ω, F, P). Let G be a sub-σ-algebra of F.
5. E(ξ|F) = ξ a.s.
6. E(E(ξ|G)) = E(ξ).
E(ξ|σ(G, H)) = E(ξ|G) a.s.
5. Statement 5 is evident.
6. Using Definition 2.8.1 with A = Ω, we have
∫_Ω E(ξ|G) dP = ∫_Ω ξ dP ⇒ E(E(ξ|G)) = Eξ.
7. If A ∈ G1, we have
∫_A E[E(ξ|G2)|G1] dP = ∫_A E(ξ|G2) dP = ∫_A ξ dP.
From this and Definition 2.8.1, the first equation is proven. The second one follows from Statement 5 and the remark that E(ξ|G1) is G2-measurable.
8. If A ∈ G, then ξ and I_A are independent. Hence we have
∫_A ξ dP = E(ξ I_A) = Eξ · P(A) = ∫_A (Eξ) dP ⇒ E(ξ|G) = E(ξ) a.s.,
which proves the desired relation for indicators, and hence for simple random variables. Next, if {ηn, n ≥ 1} are simple random variables such that ηn ↑ η almost surely as n → ∞, it follows that ηnξ ↑ ηξ and ηnE(ξ|G) ↑ ηE(ξ|G) almost surely as n → ∞, from which the conclusion follows by monotone convergence. The general case follows by the decomposition ξ = ξ⁺ − ξ⁻ and η = η⁺ − η⁻.
Let (ξn), ξ and η be random variables defined on (Ω, F, P). Let G be a sub-σ-algebra of F.
Theorem 2.8.5 (Monotone convergence theorem). a) Suppose that ξn ↑ ξ a.s. and there exists a positive integer m such that E(ξm⁻) < ∞. Then E(ξn|G) ↑ E(ξ|G) a.s.
b) Suppose that ξn ↓ ξ a.s. and there exists a positive integer m such that E(ξm⁺) < ∞. Then E(ξn|G) ↓ E(ξ|G) a.s.
and then
lim_n E(ξn|G) = E(ξ|G) a.s.
Similarly for claim (b).
E(lim inf_n ξn|G) ≤ lim inf_n E(ξn|G) ≤ lim sup_n E(ξn|G) ≤ E(lim sup_n ξn|G), a.s.
Theorem 2.8.7 (Lebesgue's dominated convergence theorem). Suppose that E(η) < ∞, |ξn| ≤ η a.s., and ξn → ξ a.s. Then
lim_n E(ξn|G) = E(ξ|G) a.s.
The proofs of Fatou's lemma and Lebesgue's dominated convergence theorem are analogous to the proofs of Fatou's lemma and the dominated convergence theorem without conditioning.
Theorem 2.8.8 (Jensen’s inequality). Let ϕ : R → R be a convex function such that ϕ(ξ) is inte-
grable. Then
E(ϕ(ξ)|G) ≥ ϕ(E(ξ|G)), a.s.
Proof. A result in real analysis is that if ϕ : R → R is convex, then ϕ(x) = sup_n(an x + bn) for a countable collection of real numbers (an, bn). Then ϕ(ξ) ≥ an ξ + bn for every n, so E(an ξ + bn|G) ≤ E(ϕ(ξ)|G), hence an E(ξ|G) + bn ≤ E(ϕ(ξ)|G) for every n. Taking the supremum in n, we get the result.
In particular, if ϕ(x) = x² then E(ξ²|G) ≥ (E(ξ|G))².
b) Let ϕ : R → R be a Borel function such that both ξ and ξϕ(η) are integrable. Then, the equation
E(ξϕ(η)|η = y) = ϕ(y)E(ξ|η = y)
holds Pη -a.s.
c) If ξ and η are independent, then
E(ξ|η = y) = E(ξ).
2.9 Exercises
2.1. An urn contains five red, three orange, and two blue balls. Two balls are randomly selected.
What is the sample space of this experiment? Let X represent the number of orange balls se-
lected. What are the possible values of X? Calculate expectation and variance of X.
2.2. An urn contains 7 white balls numbered 1, 2, . . . , 7 and 3 black balls numbered 8, 9, 10. Five balls are randomly selected, (a) with replacement, (b) without replacement. For each of cases (a) and (b) give the distribution:
2.3. A machine normally makes items of which 4% are defective. Every hour the producer draws
a sample of size 10 for inspection. If the sample contains no defective items he does not stop
the machine. What is the probability that the machine will not be stopped when it has started producing items of which 10% are defective?
2.4. Let X represent the difference between the number of heads and the number of tails ob-
tained when a fair coin is tossed n times. What are the possible values of X? Calculate expecta-
tion and variance of X.
2.5. An urn contains N1 white balls and N2 black balls; n balls are drawn at random, (a) with re-
placement, (b) without replacement. What is the expected number of white balls in the sample?
2.6. A student takes a multiple choice test consisting of two problems. The first one has 3 possi-
ble answers and the second one has 5. The student chooses, at random, one answer as the right
one from each of the two problems. Find:
b) the V ar(X).
2.7. In a lottery that sells 3,000 tickets the first lot wins $1,000, the second $500, and five other
lots that come next win $100 each. What is the expected gain of a man who pays 1 dollar to buy
a ticket?
2.8. A pays 1 dollar for each participation in the following game: three dice are thrown; if one
ace appears he gets 1 dollar, if two aces appear he gets 2 dollars and if three aces appear he gets
8 dollars; otherwise he gets nothing. Is the game fair, i.e., is the expected gain of the player zero?
If not, how much should the player receive when three aces appear to make the game fair?
2.9. Suppose a die is rolled twice. What are the possible values that the following random vari-
ables can take on?
4. The value of the first roll minus the value of the second roll.
2.10. Suppose X has a binomial distribution with parameters n and p ∈ (0, 1). What is the most
likely outcome of X?
2.11. An airline knows that 5 percent of the people making reservations on a certain flight will
not show up. Consequently, their policy is to sell 52 tickets for a flight that can hold only 50
passengers. What is the probability that there will be a seat available for every passenger who
shows up?
2.12. Suppose that an experiment can result in one of r possible outcomes, the ith outcome having probability pi, i = 1, . . . , r, Σ_{i=1}^r pi = 1. If n of these experiments are performed, and if the outcome of any one of the n does not affect the outcome of the other n − 1 experiments, then show that the probability that the first outcome appears x1 times, the second x2 times, and the rth xr times is
(n!/(x1!x2! · · · xr!)) p1^{x1} p2^{x2} · · · pr^{xr}
when x1 + x2 + . . . + xr = n. This is known as the multinomial distribution.
2.13. A television store owner figures that 50 percent of the customers entering his store will
purchase an ordinary television set, 20 percent will purchase a color television set, and 30 percent
will just be browsing. If five customers enter his store on a certain day, what is the probability that
two customers purchase color sets, one customer purchases an ordinary set, and two customers
purchase nothing?
2.15. If a fair coin is successively flipped, find the probability that a head first appears on the fifth
trial.
2.16. A coin having probability p of coming up heads is successively flipped until the rth head appears. Argue that X, the number of flips required, will be n, n ≥ r, with probability
P[X = n] = C_{n−1}^{r−1} p^r (1 − p)^{n−r}, n ≥ r.
This is known as the negative binomial distribution. Find the expectation and variance of X.
2.17. A fair coin is independently flipped n times, k times by A and n − k times by B. Show that
the probability that A and B flip the same number of heads is equal to the probability that there
are a total of k heads.
2.18. Suppose that we want to generate a random variable X that is equally likely to be either 0
or 1, and that all we have at our disposal is a biased coin that, when flipped, lands on heads with
some (unknown) probability p. Consider the following procedure:
1. Flip the coin, and let O1, either heads or tails, be the result.
(a) Show that the random variable X generated by this procedure is equally likely to be either
0 or 1.
(b) Could we use a simpler procedure that continues to flip the coin until the last two flips are
different, and then sets X = 0 if the final flip is a head, and sets X = 1 if it is a tail?
2.19. Consider n independent flips of a coin having probability p of landing heads. Say a changeover
occurs whenever an outcome differs from the one preceding it. For instance, if the results of the
flips are HHT HT HHT , then there are a total of five changeovers. If p = 1/2, what is the proba-
bility there are k changeovers?
2.20. Let X be a Poisson random variable with parameter λ. What is the most likely outcome of
X?
2.21. * Poisson Approximation to the Binomial. Let P be a Binomial probability with probability of success p and number of trials n. Let λ = np. Show that
P(k successes) = (λ^k/k!) (1 − λ/n)^n (1 − λ/n)^{−k} · (n/n) · ((n − 1)/n) · · · ((n − k + 1)/n).
Let n → ∞ and let p change so that λ remains constant. Conclude that for small p and large n,
P(k successes) ≈ (λ^k/k!) e^{−λ}, where λ = pn.
2.22. * Let X be the Binomial B(n, p).
b) Show for r = 2, 3, 4, . . ., E{X(X − 1) . . . (X − r + 1)} = λ^r.
b) Show for r = 2, 3, 4, . . ., E{X(X − 1) . . . (X − r + 1)} = r!p^r/(1 − p)^r.
2.25. Suppose X takes all its values in N = {0, 1, 2, . . .}. Show that
E[X] = Σ_{n=0}^∞ P[X > n].
2.26. Liam’s bowl of spaghetti contains n strands. He selects two ends at random and joins them
together. He does this until there are no ends left. What is the expected number of spaghetti
hoops in the bowl?
2.27. Sarah collects figures from cornflakes packets. Each packet contains one figure, and n
distinct figures make a complete set. Find the expected number of packets Sarah needs to collect
a complete set.
2.28. Each packet of the breakfast cereal Soggies contains exactly one token, and tokens are
available in each of the three colours blue, white and red. You may assume that each token
obtained is equally likely to be of the three available colours, and that the (random) colours of
different tokens are independent. Find the probability that, having searched the contents of k
packets of Soggies, you have not yet obtained tokens of every colour.
Let N be the number of packets required until you have obtained tokens of every colour. Show that E[N] = 11/2.
2.29. Each box of cereal contains one of 2n different coupons. The coupons are organized into n
pairs, so that coupons 1 and 2 are a pair, coupons 3 and 4 are a pair, and so on.
Once you obtain one coupon from every pair, you can obtain a prize. Assuming that the
coupon in each box is chosen independently and uniformly at random from the 2n possibilities,
what is the expected number of boxes you must buy before you can claim the prize?
2.30. The amount of bread (in hundreds of kilos) that a bakery sells in a day is a random variable with density
f(x) = cx for 0 ≤ x < 3, f(x) = c(6 − x) for 3 ≤ x < 6, and f(x) = 0 otherwise.
b) What is the probability that the number of kilos of bread that will be sold in a day is, (i)
more than 300 kilos? (ii) between 150 and 450 kilos?
c) Denote by A and B the events in (i) and (ii), respectively. Are A and B independent events?
2.31. Suppose that the duration in minutes of long-distance telephone conversations follows an exponential density function:
f(x) = (1/5) e^{−x/5} for x > 0.
Find the probability that the duration of a conversation:
d) will be less than 6 minutes given that it was greater than 3 minutes.
2.32. A number is randomly chosen from the interval (0;1). What is the probability that:
2.33. The height of men is normally distributed with mean µ=167 cm and standard deviation
σ=3 cm.
a) What is the percentage of the population of men that have height, (i) greater than 167 cm,
(ii) greater than 170 cm, (iii) between 161 cm and 173 cm?
ii) two will have height smaller than the mean (and two bigger than the mean)?
2.34. Find the constant k and the mean and variance of the population defined by the probability
density function
f (x) = k(1 + x)−3 for 0 ≤ x < ∞
2.35. A mode of a distribution of one random variable X is a value of x that maximizes the pdf
or pmf. For X of the continuous type, f (x) must be continuous. If there is only one such x, it is
called the mode of the distribution. Find the mode of each of the following distributions
2.37. Let 0 < p < 1. A (100p)th percentile (quantile of order p) of the distribution of a random variable X is a value ζp such that P[X < ζp] ≤ p and P[X ≤ ζp] ≥ p.
Find the pdf f(x), the 25th percentile and the 60th percentile for each of the following cdfs.
3. F(x) = 1/2 + (1/π) arctan(x), −∞ < x < ∞.
2.38. If X is a random variable with the probability density function f, find the probability density function of Y = X² if
(a) f(x) = 2x e^{−x²}, for 0 ≤ x < ∞.
2.40. Let X be a uniform distribution U(0, 1). Find the density of each of the following random variables.
1. Y = −(1/λ) ln(1 − X).
2. Z = ln(X/(1 − X)). This is known as the Logistic distribution.
3. T = √(2 ln(1/(1 − X))). This is known as the Rayleigh distribution.
2.42. Let X be a random variable with distribution function F that is continuous. Show that
Y = F (X) is uniform.
2.43. Let F be a distribution function that is continuous and is such that the inverse function
F −1 exists. Let U be uniform on [0, 1]. Show that X = F −1 (U ) has distribution function F .
2.44. 1. Let X be a non-negative random variable satisfying E[X^α] < ∞ for some α > 0. Show that
E[X^α] = α ∫_0^∞ x^{α−1}(1 − F(x)) dx.
2.46. Let X be a nonnegative random variable with mean µ and variance σ², both finite. Show that for any b > 0,
P[X ≥ µ + bσ] ≤ 1/(1 + b²).
Hint: consider the function g(x) = [(x − µ)b + σ]²/(σ²(1 + b²)²).
2.47. Let X be a random variable with mean µ and variance σ², both finite. Show that for any d > 1,
P[µ − dσ < X < µ + dσ] ≥ 1 − 1/d².
2.48. Divide a line segment into two parts by selecting a point at random. Find the probability
that the larger segment is at least three times the shorter. Assume a uniform distribution.
1. Let (An) be a sequence of events such that limn P(An) = 0. Show that limn→∞ E[XIAn] = 0.
2. Show that for any ε > 0, there exists a δ > 0 such that for any event A satisfying P(A) < δ, E[XIA] < ε.
2.51. Given the probability space (Ω, A, P), suppose X is a non-negative random variable and
E[X] = 1. Define Q : A → R by Q(A) = E[XIA ].
3. Suppose P(X > 0) = 1. Let EQ denote expectation with respect to Q. Show that EQ [Y ] =
EP [Y X].
Random elements
2.52. An urn contains 3 red balls, 4 blue balls and 2 yellow balls. Pick 2 balls at random from the urn and denote by X and Y the number of red and yellow balls among the 2 chosen balls, respectively.
1. P[X ∧ Y ≤ i].
2. P[X = Y ].
3. P[X > Y ].
4. P[X divides Y ].
2.55. Let X and Y be independent geometric random variables with parameters λ and µ.
2.56. Let X and Y be independent random variables with uniform distribution on the set {−1, 1}.
Let Z = XY . Show that X, Y, Z are pairwise independent but that they are not mutually inde-
pendent.
2.57. * Let n be a prime number greater than 2, and let X, Y be independent and uniformly distributed on {0, 1, . . . , n − 1}. For each r, 0 ≤ r ≤ n − 1, define Zr = X + rY (mod n). Show that the random variables Zr, r = 0, . . . , n − 1, are pairwise independent.
2.58. Let (Xn) be a sequence of independent random variables with P[Xn = 1] = P[Xn = −1] = 1/2 for all n. Let Zn = X0X1 . . . Xn. Show that Z1, Z2, . . . are independent.
2.59. Let (a1, . . . , an) be a random permutation of (1, . . . , n), equally likely to be any of the n! possible permutations. Find the expectation of
L = Σ_{i=1}^n |ai − i|.
2.60. A blood test is being performed on n individuals. Each person can be tested separately, but this is expensive. Pooling can decrease the cost. The blood samples of k people can be pooled
and analyzed together. If the test is negative, this one test suffices for the group of k individuals.
If the test is positive, then each of the k persons must be tested separately and thus k + 1 total
tests are required for the k people. Suppose that we create n/k disjoint groups of k people (where
k divides n) and use the pooling method. Assume that each person has a positive result on the
test independently with probability p.
(a) What is the probability that the test for a pooled sample of k people will be positive?
(d) Give an inequality that shows for what values of p pooling is better than just testing every
individual.
2.61. You need a new staff assistant, and you have n people to interview. You want to hire the
best candidate for the position. When you interview a candidate, you can give them a score, with
the highest score being the best and no ties being possible. You interview the candidates one
by one. Because of your company’s hiring practices, after you interview the kth candidate, you
either offer the candidate the job before the next interview or you forever lose the chance to hire
that candidate. We suppose the candidates are interviewed in a random order, chosen uniformly
at random from all n! possible orderings.
We consider the following strategy. First, interview m candidates but reject them all; these candidates give you an idea of how strong the field is. After the mth candidate, hire the first candidate you interview who is better than all of the previous candidates you have interviewed.
1. Let E be the event that we hire the best assistant, and let Ei be the event that the ith candidate is the best and we hire him. Determine P(Ei), and show that
P(E) = (m/n) Σ_{j=m+1}^n 1/(j − 1).
2. Show that
(m/n)(ln n − ln m) ≤ P(E) ≤ (m/n)(ln(n − 1) − ln(m − 1)).
3. Show that m(ln n−ln m)/n is maximized when m = n/e, and explain why this means P(E) ≥
1/e for this choice of m.
2.64. Let X be a normal random variable with µ = 0 and σ² < ∞, and let Θ be uniform on [0, π]. Assume that X and Θ are independent. Find the distribution of Z = X + a cos Θ.
2.65. Let X and Y be independent random variable with the same distribution N (0, σ 2 ).
2.66. (Simulation of Normal Random Variables) Let U and V be two independent uniform random variables on [0, 1]. Let θ = 2πU and S = − ln(V).
Y1 = min{Xi, 1 ≤ i ≤ n},
Y2 = second smallest of X1, . . . , Xn,
. . .
Yn = max{Xi, 1 ≤ i ≤ n}.
Then Y1, . . . , Yn are also random variables, and Y1 ≤ Y2 ≤ . . . ≤ Yn. They are called the order statistics of (X1, . . . , Xn) and are usually denoted Yk = X(k). Assume that the Xi are i.i.d. with common density f.
2.69. Let X and Y be independent and suppose P[X + Y = α] = 1 for some constant α. Show
that both X and Y are constant random variables.
2.70. Let (Xn)n≥1 be iid with common continuous distribution function F(x). Denote Rn = Σ_{j=1}^n I_{{Xj ≥ Xn}} and An = {Rn = 1}.
1. Show that the sequence of random variables (Rn)n≥1 is independent and
P[Rn = k] = 1/n, for k = 1, . . . , n.
2. Show that
P(An) = 1/n.
Chapter 3
Fundamental Limit Theorems
Definition 3.1.1. Let (Xn)n≥1 be a sequence of random variables defined on (Ω, A, P). We say that Xn
• converges almost surely to X, denoted by Xn →a.s. X or limn Xn = X a.s., if
P[w : lim_{n→∞} Xn(w) = X(w)] = 1;
• converges in probability to X, denoted by Xn →P X, if for any ε > 0,
lim_{n→∞} P(|Xn − X| > ε) = 0;
• converges in Lp (p > 0) to X, denoted by Xn →Lp X, if E(|Xn|^p) < ∞ for any n, E(|X|^p) < ∞ and
lim_{n→∞} E(|Xn − X|^p) = 0.
Note that the value of a random variable is a number, so the most natural way to consider the convergence of random variables is via the convergence of a sequence of numbers; this is almost sure convergence. But sometimes this mode of convergence can fail. Convergence in probability then expresses that the larger n is, the smaller the probability that Xn is far away from X becomes; and convergence in Lp is considered in the sense that the average distance between Xn and X must tend to 0.
We have the following example. Let Xn be a random variable taking the value n with probability 1/n² and the value 0 with probability 1 − 1/n². Then for any ε > 0, P(|Xn| > ε) ≤ P(Xn = n) = 1/n² → 0 as n → ∞. It implies that Xn →P 0.
• In order to prove the convergence in Lp for p ∈ (0, 2), we must check that lim_{n→∞} E(|Xn|^p) = 0. This limit can be deduced from the computation E(|Xn|^p) = n^p · (1/n²) = n^{p−2} → 0.
• Usually, in order to prove or disprove almost sure convergence, we use the Borel-Cantelli lemma, which can be stated as follows.
Lemma 3.1.3 (Borel-Cantelli). Let (An) be a sequence of events in a probability space (Ω, F, P). Denote lim sup An = ∩_{n=1}^∞ (∪_{m≥n} Am).
1. If Σ_{n=1}^∞ P(An) < ∞, then P(lim sup An) = 0.
2. If Σ_{n=1}^∞ P(An) = ∞ and the An's are independent, then P(lim sup An) = 1.
Proof. 1. From the definition of lim sup An, it is clear that for every i,
P(lim sup An) ≤ P(∪_{m≥i} Am) ≤ Σ_{m≥i} P(Am),
and the right-hand side tends to 0 as i → ∞ because Σ_{n=1}^∞ P(An) < ∞.
2. We have
1 − P(lim sup An) = P((∩_{n=1}^∞ ∪_{m≥n} Am)^c) = P(∪_{n=1}^∞ (∪_{m≥n} Am)^c) = P(∪_{n=1}^∞ ∩_{m≥n} Am^c).
In order to prove that P(lim sup An) = 1, i.e. 1 − P(lim sup An) = 0, we will show that P(∩_{m≥n} Am^c) = 0 for every n. Indeed, since the An's are independent,
P(∩_{m≥n} Am^c) = Π_{m≥n} P(Am^c) = Π_{m≥n} [1 − P(Am)] ≤ Π_{m≥n} e^{−P(Am)} = e^{−Σ_{m≥n} P(Am)} = e^{−∞} = 0.
The meaning of the event lim sup An is that An occurs for infinitely many n. Therefore P(lim sup An) = 0 means that almost surely An occurs for only finitely many n.
Now let us see the application of the Borel-Cantelli Lemma in our example. Denote the event An = {Xn ≠ 0} = {Xn = n}. Then
Σ_{n=1}^∞ P(An) = Σ_{n=1}^∞ 1/n² < ∞.
It implies that almost surely An occurs for only finitely many n, i.e. the number of n such that Xn differs from zero is finite. Hence, almost surely the limit of Xn exists and it must be zero. So Xn →a.s. 0.
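The following sketch (an added illustration; the horizon N is an arbitrary choice) simulates this example: it draws Xn = n with probability 1/n² and 0 otherwise, and counts how many of the first N variables are nonzero. In agreement with the Borel-Cantelli argument, only a handful of nonzero values appear, all with small indices, so the simulated trajectory is eventually 0.

import numpy as np

rng = np.random.default_rng(2)
N = 100_000
n = np.arange(1, N + 1)
# X_n = n with probability 1/n^2, otherwise 0
x = np.where(rng.random(N) < 1.0 / n**2, n, 0)

nonzero = np.flatnonzero(x)
print("number of nonzero X_n among the first", N, ":", nonzero.size)
print("largest index n with X_n != 0:", nonzero.max() + 1 if nonzero.size else None)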
Proof. ⇒) Suppose that Xn →P X. For any ε > 0 and w ∈ Ω, because of the increasing property of the function x ↦ x/(1 + x) on the interval [0, ∞), we have
The following proposition shows that among the three modes of convergence, the conver-
gence in probability is the weakest form.
P(|Xn − X| > ε) = P(|Xn − X|^p > ε^p) ≤ E(|Xn − X|^p)/ε^p.
Since E(|Xn − X|^p) → 0, by the sandwich theorem P(|Xn − X| > ε) also converges to 0. Therefore Xn →P X.
2. Suppose that Xn →a.s. X. It is clear that
|Xn − X|/(1 + |Xn − X|) ≤ 1 and |Xn − X|/(1 + |Xn − X|) → 0 a.s.,
so by the dominated convergence theorem E[|Xn − X|/(1 + |Xn − X|)] → 0. From Proposition 3.1.4, we have Xn →P X.
In the above example, we can see that convergence in probability is not sufficient for conver-
gence almost surely. However, we have the following result.
Proposition 3.1.6. 1. Suppose Xn →P X. Then there exists a subsequence (nk)k≥1 such that Xnk →a.s. X.
2. Conversely, if for every subsequence (nk)k≥1 there exists a further subsequence (mk)k≥1 such that Xmk →a.s. X, then Xn →P X.
Proof. 1. Suppose Xn →P X. Then from Proposition 3.1.4,
lim_{n→∞} E[|Xn − X|/(1 + |Xn − X|)] = 0.
Hence we can choose a subsequence (nk)k≥1 such that
Σ_{k=1}^∞ E[|Xnk − X|/(1 + |Xnk − X|)] < ∞.
Therefore,
Σ_{k=1}^∞ |Xnk − X|/(1 + |Xnk − X|) < ∞ a.s.
Then, almost surely,
lim_{k→∞} |Xnk − X|/(1 + |Xnk − X|) = 0,
which implies that lim_{k→∞} |Xnk − X| = 0, i.e. lim_{k→∞} Xnk = X.
2. Indeed, if we assume that Xn does not converge in probability to X, then from Proposition 3.1.4 the sequence E[|Xn − X|/(1 + |Xn − X|)] does not converge to 0, i.e. there exist a positive constant ε > 0 and a subsequence (nk) satisfying
E[|Xnk − X|/(1 + |Xnk − X|)] > ε for all k.
It implies that no further subsequence {mk} of {nk} can satisfy Xmk →a.s. X. This contradicts the hypothesis. So we must have Xn →P X.
2. From the second part of Proposition 3.1.6, in order to prove that f(Xn, Yn) →P f(X, Y), we can check that for every subsequence {nk} there exists a further subsequence {mk} such that f(Xmk, Ymk) →a.s. f(X, Y). Indeed, since Xnk →P X and Ynk →P Y, from the first part of Proposition 3.1.6 we can extract a subsequence {mk} satisfying Xmk →a.s. X and Ymk →a.s. Y. Then from the first part of this theorem, the result follows.
Corollary 3.2.2. Let (Xn)n≥1 be a sequence of pairwise uncorrelated random variables satisfying
lim_{n→∞} (D(X1) + . . . + D(Xn))/n² = 0.
Then
(Sn − ESn)/n →P 0, as n → ∞.
Proof. Observe that D(Sn) = D(X1) + . . . + D(Xn) and apply Theorem 3.2.1.
Lemma 3.2.3. Let (Xn)n≥1 be a sequence of i.i.d random variables with finite variance. Then
Sn/n →P EX1, as n → ∞.
Note that when Xn has the Bernoulli law, then Sn is the number of successful trials and
Bernoulli showed that Sn /n converges in probability to the probability of success of a trial. How-
ever, his proof is much more complicated than the one given here.
Theorem 3.2.4. Let (Xn)n≥1 be a sequence of pair-wise uncorrelated random variables satisfying sup_n D(Xn) ≤ σ² < ∞. Then
lim_{n→∞} (Sn − ESn)/n = 0 a.s. and in L².
Proof. At first, we assume that E(Xn) = 0. Denote Yn = Sn/n. Then E(Yn) = 0, and from Proposition 2.7.3,
E(Yn²) = (1/n²) Σ_{i=1}^n DXi ≤ σ²/n.
Hence Yn →L² 0. We also have
Σ_{n=1}^∞ E(Y_{n²}²) ≤ Σ_{n=1}^∞ σ²/n² < ∞,
so that Y_{n²} →a.s. 0 as n → ∞. (3.2)
Next, for each n let p(n) denote the integer part of √n; we have
E[(Yn − (p(n)²/n) Y_{p(n)²})²] ≤ ((n − p(n)²)/n²) σ² ≤ ((2p(n) + 1)/n²) σ² ≤ ((2√n + 1)/n²) σ² ≤ (3/n^{3/2}) σ²,
with the observations n ≤ (p(n) + 1)² and p(n) ≤ √n. By the same argument, since
Σ_{n=1}^∞ E[(Yn − (p(n)²/n) Y_{p(n)²})²] ≤ Σ_{n=1}^∞ (3/n^{3/2}) σ² < ∞,
then
Yn − (p(n)²/n) Y_{p(n)²} →a.s. 0.
From (3.2) and the observation p(n)²/n → 1, we deduce that Yn →a.s. 0.
In general, if E(Xn) ≠ 0, we denote Zn = Xn − E(Xn). Then {Zn} is a sequence of pair-wise uncorrelated random variables with mean zero satisfying the condition of the theorem. Therefore
(Sn − ESn)/n = (Z1 + . . . + Zn)/n →a.s. 0.
In the following, we state without proof two general versions of strong law of large numbers.
Theorem 3.2.5. Let (Xn)n≥1 be a sequence of independent random variables and (bn)n≥1 a sequence of positive numbers satisfying bn ↑ ∞. If
Σ_{n=1}^∞ DXn/bn² < ∞, then (Sn − E(Sn))/bn →a.s. 0.
Theorem 3.2.6. Let (Xn)n≥1 be a sequence of iid random variables. Then
lim_{n→∞} Sn/n = E(X1) a.s. if and only if E(|X1|) < ∞.
Example 3.2.7. Consider (Xn)n≥1 i.i.d with distribution B(1, p). From Theorem 3.2.6,
Sn/n →a.s. E(X1) = p.
Then, to approximate the probability of success of each trial, we can use the approximation Sn/n for n large enough.
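A minimal simulation of this approximation (added as an illustration; the values of p and n are arbitrary choices): generate n Bernoulli(p) trials and compare Sn/n with p.

import numpy as np

rng = np.random.default_rng(3)
p, n = 0.3, 100_000
trials = rng.random(n) < p          # n Bernoulli(p) trials
print("S_n / n =", trials.mean(), " (true p =", p, ")")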
An application of the strong law of large numbers that is quite simple but very useful is the Monte Carlo method. Suppose we want to compute the integral I = ∫_0^1 f(x) dx. If f is smooth enough, I can be approximated well by taking the average (with some weight) of the values of f at some fixed points. For example, if f is twice differentiable, we have
I ≈ (f(t₀ⁿ) + 2f(t₁ⁿ) + . . . + 2f(tⁿ_{n−1}) + f(tⁿ_n))/(2n),
where tᵢⁿ = i/n, i = 0, 1, . . . , n.
However, the above method is not good in the sense that we must take too many points to have a good approximation when f is not smooth enough. In this case, we can use the Monte Carlo method, which in its simplest version can be stated as follows. Let (Uj)j≥1 be a sequence of i.i.d random variables with the uniform distribution over [0, 1] and denote
In = (1/n) Σ_{j=1}^n f(Uj).
Since E[|f(Uj)|] = ∫_0^1 |f(x)| dx < ∞, it follows from Theorem 3.2.6 that In converges almost surely to E[f(U1)] = I as n → ∞. To evaluate the error of the approximation, we assume in addition that
∫_0^1 |f(x)|² dx < ∞. (3.4)
Then the square of the error is
E[(In − I)²] = E[(In − E[In])²] = (1/n) Df(U1) ≤ (1/n) ∫_0^1 |f(x)|² dx.
In practice, we use the computer to generate the sequence (Uj)j≥1 and obtain an approximation of I for any function f satisfying the condition (3.3). Under the condition (3.4), the error of the approximation only depends on the sample size n and not on the smoothness of f. The Monte Carlo method also tends to be more useful than deterministic rules for approximating multiple integrals. The only thing we must care about is the square of the error: if we can reduce it, the calculation will be more accurate and we can also reduce the computing time (see [?]). That is the way one wants to improve the Monte Carlo method.
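A minimal sketch of this estimator in Python (the test function f and the sample sizes below are arbitrary choices, not from the text): draw n uniform points on [0, 1], average f over them, and report the estimate together with the n^{-1/2} error scale discussed above.

import numpy as np

def monte_carlo(f, n, seed=0):
    """Estimate I = integral of f over [0, 1] by averaging f at n uniform points."""
    rng = np.random.default_rng(seed)
    u = rng.random(n)
    values = f(u)
    estimate = values.mean()
    # the error of I_n is of order sqrt(Var f(U1) / n)
    error_scale = values.std() / np.sqrt(n)
    return estimate, error_scale

# example: f(x) = 4*sqrt(1 - x^2), whose integral over [0, 1] is pi
f = lambda x: 4.0 * np.sqrt(1.0 - x**2)
for n in (10**3, 10**5, 10**7):
    est, err = monte_carlo(f, n)
    print(f"n = {n:>8}: I_n = {est:.5f}, error scale ~ {err:.5f}")

The printed error scale shrinks like 1/√n, which matches the bound on E[(In − I)²] above.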
The error of the Monte Carlo method will be analysed in more detail based on the Central
limit theorems that will be explained in the following.
Theorem 3.3.2. For every random variable X, the characteristic function ϕX has the following properties:
1. ϕX(0) = 1;
2. ϕX(−t) is the complex conjugate of ϕX(t);
3. |ϕX(t)| ≤ 1;
4. ϕX is continuous.
Proof. It is easy to see that ϕX(0) = 1. Applying the inequality (EX)² ≤ E(X²),
|ϕX(t)| = √((E cos tX)² + (E sin tX)²) ≤ √(E(cos² tX) + E(sin² tX)) = 1,
so ϕX is bounded. The continuity of ϕX can be deduced from the Lebesgue dominated convergence theorem.
The following theorem shows the connection between the characteristic function of a ran-
dom variable and its moments.
Theorem 3.3.3. If E[|X|^m] < ∞ for some positive integer m, then ϕX has continuous derivatives up to order m, and
ϕX^{(k)}(t) = ∫ (ix)^k e^{itx} dFX(x) = E[(iX)^k e^{itX}], k = 1, . . . , m, (3.5)
ϕX^{(k)}(0) = i^k E(X^k), k = 1, . . . , m. (3.6)
Proof. Since E(|X|^m) < ∞, we have E(|X|^k) < ∞ for all k = 1, . . . , m. Then
∫ sup_t |(ix)^k e^{itx}| dFX(x) ≤ ∫ |x|^k dFX(x) < ∞.
From the Lebesgue theorem, we can take the differentiation under the integral sign and obtain (3.5). In (3.5), let t = 0; then we have (3.6).
Consider the Taylor expansion of the function exp(x) at x = 0,
E(e^{itX}) = E[Σ_{k=0}^{n−1} (itX)^k/k! + ((itX)^n/n!) e^{iθX}]
= Σ_{k=0}^{n−1} ((it)^k/k!) E(X^k) + ((it)^n/n!) (E(X^n) + αn(t)),
where |θ| ≤ |t| and αn(t) = E[X^n(e^{iθX} − 1)]. Therefore |αn(t)| ≤ 2E(|X|^n), i.e. it is bounded. So from
If X ∼ N(a, σ²) then
ϕX(t) = (1/√(2πσ²)) ∫ e^{itx} e^{−(x−a)²/(2σ²)} dx.
Using the change of variable y = (x − a)/σ, we get
ϕX(t) = (e^{ita}/√(2π)) ∫ e^{itσy} e^{−y²/2} dy = e^{ita − t²σ²/2}.
The following theorem shows the meaning of the name ”characteristic function”.
Theorem 3.3.6. Two random vectors have the same distribution if their characteristic functions coincide. Moreover, if ∫ |ϕX(t)| dt < ∞ then X has a bounded continuous density given by
fX(y) = (1/2π) ∫ e^{−ity} ϕX(t) dt.
Example 3.3.7. Let X and Y have Poisson distributions with parameters µ and λ respectively. Assume moreover that X and Y are independent. Let us consider the distribution of the random variable X + Y. We can compute its characteristic function as
ϕX+Y(t) = E(e^{it(X+Y)}) = E(e^{itX})E(e^{itY}) = e^{(λ+µ)(e^{it}−1)}.
This characteristic function agrees with that of Poi(µ + λ). So the random variable X + Y has the Poisson distribution with parameter µ + λ.
We can also use the characteristic function to check whether random variables are independent.
Theorem 3.3.8. The random variables X1, . . . , Xn are independent if and only if
ϕ(X1,...,Xn)(t1, . . . , tn) = ϕX1(t1) . . . ϕXn(tn) for all (t1, . . . , tn) ∈ R^n.
Example 3.3.9. Let X and Y be independent random variables which have standard normal distribution N(0, 1). According to Example 3.3.5, we have
ϕ(X+Y,X−Y)(t, s) = E e^{it(X+Y)+is(X−Y)} = E e^{i(t+s)X} E e^{i(t−s)Y} = e^{−t²−s²}.
Putting s = 0 and t = 0 respectively, we have ϕX+Y(t) = e^{−t²} and ϕX−Y(s) = e^{−s²}. Hence both X + Y and X − Y have normal distribution N(0, 2). Furthermore, they are independent since ϕ(X+Y,X−Y)(t, s) = ϕX+Y(t) ϕX−Y(s).
Note that in the above definition we do not require that the random variables (Xn)n≥1 are defined on the same probability space. We just care about the expectation or the distribution. Therefore weak convergence is sometimes called convergence in distribution (see Exercise 3.27).
If we suppose that Xn ’s and X are defined on the same probability space, we have the follow-
ing propositions.
Proposition 3.3.11. Let (Xn)n≥1 and X be random variables defined on the same probability space (Ω, F, P). If Xn →P X then Xn →w X.
Proof. We prove by contradiction. Assume that Xn →P X but Xn does not converge weakly to X. Then there exist a bounded continuous function f, a constant ε > 0 and a subsequence (nk)k≥1 such that
|E(f(Xnk)) − E(f(X))| > ε for all k. (3.8)
From Proposition 3.1.6, there exists a subsequence (mk)k≥1 of the sequence (nk)k≥1 such that Xmk →a.s. X. Since f is continuous, f(Xmk) →a.s. f(X). By the Dominated Convergence Theorem, E(f(Xmk)) → E(f(X)). This contradicts (3.8), and the result follows.
Proposition 3.3.12. Let (Xn)n≥1 and X be random variables defined on the same probability space (Ω, F, P). If Xn →w X and X = const a.s. then Xn →P X.
Proof. Let X ≡ a a.s. Consider the bounded continuous function f(x) = |x − a|/(|x − a| + 1). Since Xn →w a, E(f(Xn)) → f(a) = 0. From Proposition 3.1.4, Xn →P a.
The following theorem gives us a very useful criterion to verify the weak convergence of random variables by using the characteristic function. Its proof is provided in [13, pages 196-199].
Theorem 3.3.13. Let (Fn)n≥1 be a sequence of distribution functions whose characteristic functions are (ϕn)n≥1 respectively,
ϕn(t) = ∫_R e^{itx} dFn(x).
1. If Fn →w F for some distribution function F, then (ϕn) converges point-wise to the characteristic function ϕ of F.
Example 3.3.14. Let Xn be normal N(an, σn²). Suppose that an → 0 and σn² → 1 as n → ∞. Then the sequence (Xn) converges weakly to N(0, 1) since
ϕXn(t) = e^{itan − σn²t²/2} → e^{−t²/2}.
Example 3.3.15 (Weak law of large numbers). Let (Xk)k≥1 be an i.i.d sequence of random variables whose mean a = EX1 is finite. Then
(1/n)(X1 + . . . + Xn) →P a.
Indeed, denote Sn = X1 + . . . + Xn and let ϕ be the characteristic function of Xk. Then,
Theorem 3.3.17. Let (Xn)n≥1 be a sequence of i.i.d random variables with E(Xn) = µ and DXn = σ² ∈ (0, ∞). Denote Sn = X1 + . . . + Xn. Then Yn = (Sn − nµ)/(σ√n) →w N(0, 1).
Proof. Denote by ϕ the characteristic function of the random variable Xn − µ. Since the Xn's have the same law, ϕ does not depend on n. Moreover, since the Xn's are independent,
ϕYn(t) = E exp(it Σ_{j=1}^n (Xj − µ)/(σ√n)) = Π_{j=1}^n E exp(it (Xj − µ)/(σ√n)) = (ϕ(t/(σ√n)))^n.
It is clear that E(Xj − µ) = 0 and E((Xj − µ)²) = σ². Then from Theorem 3.3.3, ϕ has a continuous second derivative and
ϕ(t) = 1 − σ²t²/2 + t²α(t),
where α(t) → 0 as t → 0. Using the expansion ln(1 + x) = x + o(x) as x → 0,
ln ϕYn(t) = n ln(1 − t²/(2n) + (t²/(nσ²)) α(t/(σ√n))) → −t²/2.
Therefore ϕYn(t) → e^{−t²/2} as n → ∞. Applying Theorem 3.3.13, we have the desired result.
In the following, we give an example of the central limit theorem. In more detail, we will approximate a binomial probability by a normal probability.
Example 3.3.18. We know that a binomial random variable Sn ∼ B(n, p) can be written as the sum of n i.i.d random variables with distribution B(1, p). Then, when n is large enough, by the central limit theorem we can approximate the random variable (Sn − np)/√(np(1 − p)) by the standard normal variable N(0, 1).
Usually, the probability that a ≤ Sn ≤ b can be formulated as
P(a ≤ Sn ≤ b) = Σ_{a≤i≤b} C_n^i p^i (1 − p)^{n−i}.
However, when n is too large, calculating C_n^i for some i is impossible since it exceeds the capacity of the calculator or the computer (consider, for example, 1000! or 5000!). Then in practice we can estimate this probability by
P(a ≤ Sn ≤ b) = P((Sn − np)/√(np(1 − p)) ∈ [(a − np)/√(np(1 − p)), (b − np)/√(np(1 − p))])
≈ P(N(0, 1) ∈ [(a − np)/√(np(1 − p)), (b − np)/√(np(1 − p))]).
Note that to compute the last probability, we can write it down as an integral of the density function of the normal variable. It can be computed or approximated easily.
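The sketch below (an added illustration; the values of n, p, a and b are arbitrary choices) compares the exact binomial probability P(a ≤ Sn ≤ b) with the normal approximation just described, using the standard scipy routines for the two distributions.

import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.3
a, b = 280, 320
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

exact = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)            # P(a <= S_n <= b)
approx = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)

print("exact  P(a <= S_n <= b) =", round(exact, 4))
print("normal approximation    =", round(approx, 4))

The two numbers agree to about two decimal places here; the gap is controlled by the Berry-Esseen bound stated next.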
In order to quantify the rate at which the distribution FYn converges to the normal distribution, we use the Berry-Esseen inequality: suppose E(|X1|³) < ∞; then
sup_{−∞<x<∞} |FYn(x) − ∫_{−∞}^x (e^{−t²/2}/√(2π)) dt| ≤ K_BE E(|X1 − EX1|³)/(σ³√n), (3.9)
where K_BE is some constant in ((√10 + 3)/(6√(2π)), 0.4748) (see [12]).
The condition that the Xn's are iid is too restrictive. Many authors have managed to weaken this condition. In the following, we state Lindeberg's central limit theorem. Its proof can be found in [13, pages 221-225].
Theorem 3.3.19. Let (Xn)n≥1 be a sequence of independent random variables with finite variance. Denote Sn = X1 + . . . + Xn and Bn² = DX1 + . . . + DXn. Suppose that
Ln(ε) := (1/Bn²) Σ_{k=1}^n E[(Xk − EXk)² I_{{|Xk − EXk| > εBn}}] → 0, for all ε > 0. (3.10)
Then Sn* = (Sn − ESn)/Bn →w N(0, 1).
3.4 Exercises
lim_{n→∞} E(|Xn − X| ∧ 1) = 0.
3.4. Consider the probability space ([0, 1], B([0, 1]), P). Let X = 0 and X1, X2, . . . be random variables
Xn(ω) = 0 if 1/n ≤ ω ≤ 1, and Xn(ω) = e^n if 0 ≤ ω < 1/n.
Show that Xn →P X, but E|Xn − X|^p does not converge for any p > 0.
3.5. Consider the probability space ([0, 1], B([0, 1]), P). Let X = 0. For each n = 2^m + k where 0 ≤ k < 2^m, we define
Xn(ω) = 1 if k/2^m ≤ ω ≤ (k + 1)/2^m, and Xn(ω) = 0 otherwise.
Show that Xn →P X, but {Xn} does not converge to X a.s.
3.6. Let (Xn)n≥1 be a sequence of exponential random variables with parameter λ = 1. Show that
P[lim sup_{n→∞} Xn/ln n = 1] = 1.
3.7. Let X1 , X2 , . . . be a sequence of identically distributed random variables with E|X1 | < ∞
and let Yn = n−1 max1≤i≤n |Xi |. Show that limn E(Yn ) = 0 and limn Yn = 0 a.s.
3.8. [5] Let (Xn)n≥1 be random variables with Xn →P X. Suppose |Xn(ω)| ≤ C for a constant C > 0 and all ω. Show that limn→∞ E|Xn − X| = 0.
3.9. [10] Let X1, . . . , Xn be independent and identically distributed random variables such that for x = 3, 4, . . ., P(X1 = ±x) = (2cx² log x)^{−1}, where c = Σ_{x=3}^∞ x^{−2}/log x. Show that E|X1| = ∞ but n^{−1} Σ_{i=1}^n Xi →P 0.
3.10. [10] Let X1, . . . , Xn be independent and identically distributed random variables with Var(X1) < ∞. Show that
(2/(n(n + 1))) Σ_{j=1}^n jXj →P EX1.
3.11. [2] If for every i, Var(Xi) ≤ c < ∞ and Cov(Xi, Xj) < 0 (i ≠ j, i, j = 1, 2, . . .), then the WLLN holds.
3.12. [2](Theorem of Bernstein) Let {Xn } be a sequence of random variables so that V ar(Xi ) ≤
c < ∞ (i = 1, 2, . . .) and Cov(Xi , Xj ) → 0 when |i − j| → ∞ then the WLLN holds.
3.13. [5] Let (Yj)j≥1 be a sequence of independent Binomial random variables, all defined on the same probability space, and with law B(p, 1). Let Xn = Σ_{j=1}^n Yj. Show that Xj is B(p, j) and that Xj/j converges a.s. to p.
3.14. [5] Let {Xj}j≥1 be i.i.d with Xj in L¹. Let Yj = e^{Xj}. Show that (Π_{j=1}^n Yj)^{1/n} converges to a constant α a.s.
3.15. [5] Let (Xj)j≥1 be i.i.d with Xj in L¹ and EXj = µ. Let (Yj)j≥1 be also i.i.d with Yj in L¹ and EYj = ν ≠ 0. Show that
lim_{n→∞} (Σ_{j=1}^n Xj)/(Σ_{j=1}^n Yj) = µ/ν a.s.
3.16. [5] Let (Xj)j≥1 be i.i.d with Xj in L¹ and suppose (1/√n) Σ_{j=1}^n (Xj − ν) converges in distribution to a random variable Z. Show that
lim_{n→∞} (1/n) Σ_{j=1}^n Xj = ν a.s.
3.18. [5] Let (Xj)j≥1 be i.i.d. N(1, 3) random variables. Show that
lim_{n→∞} (X1 + X2 + . . . + Xn)/(X1² + X2² + . . . + Xn²) = 1/4 a.s.
3.19. [5] Let (Xj)j≥1 be i.i.d with mean µ and variance σ². Show that
lim_{n→∞} (1/n) Σ_{i=1}^n (Xi − µ)² = σ² a.s.
3. X ∼ U (a, b);
5. X ∼ Exp(λ);
3.21. Show that if X1, . . . , Xn are independent and uniformly distributed on (−1, 1), then for n ≥ 2, X1 + . . . + Xn has density
f(x) = (1/π) ∫_0^∞ (sin t/t)^n cos(tx) dt.
3.22. Suppose that X has density
f(x) = (1 − cos x)/(πx²).
Show that
ϕX(t) = (1 − |t|)⁺.
3.24. Let X1 , X2 , . . . be independent taking values 0 and 1 with probability 1/2 each.
3.26. Consider the probability space ([0, 1], B([0, 1]), P). Let X and X1, X2, . . . be random variables with
X2n(ω) = 1 if 0 ≤ ω ≤ 1/2, and X2n(ω) = 0 if 1/2 < ω ≤ 1,
and
X2n+1(ω) = 0 if 0 ≤ ω ≤ 1/2, and X2n+1(ω) = 1 if 1/2 < ω ≤ 1.
Does the sequence (Xn) converge in distribution? Does it converge in probability?
3.27. Let (Xn)n≥1 and X be random variables whose distribution functions are (Fn)n≥1 and F, respectively.
1. If Xn →w X then limn→∞ Fn(x) = F(x) for all x ∈ D, where D is a dense subset of R given by
D = {x ∈ R : F(x+) = F(x)}.
2. If limn→∞ Fn(x) = F(x) for any x in some dense subset of R, then Xn →w X.
3.28. If Xn →w X and Yn →P c, then
a) Xn + Yn →w X + c;
b) XnYn →w cX;
c) Xn/Yn →w X/c if Yn ≠ 0 a.s. for all n and c ≠ 0.
3.29. [10] Show that if Xn →d X and X = c a.s. for a real number c, then Xn →P X.
3.30. [10] A family of random variables (Xi)i∈I is called uniformly integrable if
lim_{t→∞} sup_{i∈I} E[|Xi| I_{{|Xi|>t}}] = 0.
Let X1, X2, . . . be random variables. Show that {|Xn|} is uniformly integrable if one of the following conditions holds:
b) P(|Xn| ≥ c) ≤ P(|X| ≥ c) for all n and c > 0, where X is an integrable random variable.
3.31. Let Xn be random variables distributed as N(µn, σn²), n = 1, 2, . . ., and X be a random variable distributed as N(µ, σ²). Show that Xn →d X if and only if limn µn = µ and limn σn² = σ².
3.32. If Yn are random variables with characteristic functions ϕn, then Yn →w 0 if and only if there is a δ > 0 so that ϕn(t) → 1 for |t| ≤ δ.
3.33. [10] Let U1, U2, . . . be independent random variables having the uniform distribution on [0, 1] and Yn = (Π_{i=1}^n Ui)^{−1/n}. Show that √n(Yn − e) →d N(0, e²).
3.34. [10] Suppose that Xn is a random variable having the binomial distribution with size n and probability θ ∈ (0, 1), n = 1, 2, . . . Define Yn = log(Xn/n) when Xn ≥ 1 and Yn = 1 when Xn = 0. Show that limn Yn = log θ a.s. and √n(Yn − log θ) →d N(0, (1 − θ)/θ).
3.35. [2] Show that for the sequence {Xn} of independent random variables with
a) P[Xn = ±1] = (1 − 2^{−n})/2, P[Xn = ±2^n] = 2^{−(n+1)}, n = 1, 2, . . .,
b) P[Xn = ±n²] = 1/2,
(2/σ)(√Sn − √n) →d N(0, 1).
3.37. [5] Show that
lim_{n→∞} e^{−n} Σ_{k=0}^n n^k/k! = 1/2.
3.38. [5] Let (Xj)j≥1 be i.i.d with EXj = 0 and σ²_{Xj} = σ² < ∞. Let Sn = Σ_{j=1}^n Xj. Show that
lim_{n→∞} E[|Sn|/√n] = √(2/π) σ.
3.39. [5] Let (Xj)j≥1 be i.i.d with the uniform distribution on (−1, 1). Let
Yn = (Σ_{j=1}^n Xj)/(Σ_{j=1}^n Xj² + Σ_{j=1}^n Xj³).
Show that √n Yn converges in distribution.
3.40. [5] Let (Xj)j≥1 be independent random variables with Xj uniformly distributed on (−j, j).
a) Show that
Sn/n^{3/2} →d N(0, 1/9).
b) Show that
Sn/√(Σ_{j=1}^n σj²) →d N(0, 1).
Chapter 4
Some Useful Distributions in Statistics
Recall that a random variable X has a Gamma distribution G(α, λ) if its density is given by
fX(x) = (x^{α−1} e^{−x/λ}/(Γ(α)λ^α)) I_{{x>0}}.
Note that G(1, λ) = Exp(λ).
Corollary 4.1.2. Let (Xi)1≤i≤n be a sequence of independent random variables. Suppose that Xi is G(αi, λ) distributed. Then S = X1 + · · · + Xn is G(α1 + · · · + αn, λ) distributed.
Definition 4.1.3. Let (Zi )1≤i≤n be a sequence of independent, standard normal distributed ran-
dom variables. The distribution of V = Z12 + . . . + Zn2 is called chi-square distribution with n
degrees of freedom and is denoted by χ2n .
A notable consequence of the definition of the chi-square distribution is that if U and V are
independent and U ∼ χ2n and V ∼ χ2m , then U + V ∼ χ2m+n .
[Figure: densities of the Gamma distribution G(α, λ) with λ = 1 and α = 7/8, 1, 2, 3.]
[Figure: densities of the chi-square distribution χ²_n for n = 1, 2, 4, 6.]
[Figure: densities of Student's t distribution for n = 1, 2, 8 compared with the standard normal density.]
Definition 4.1.4. If Z ∼ N(0, 1) and U ∼ χ²_n and Z and U are independent, then the distribution of Z/√(U/n) is called Student's t distribution with n degrees of freedom.
Student's t distribution is also called the t distribution.
A direct computation with the density gives the following result.
Proposition 4.1.5. The density function of Student's t distribution with n degrees of freedom is
fn(t) = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + t²/n)^{−(n+1)/2}.
In addition,
fn(t) → (1/√(2π)) e^{−t²/2} as n → ∞.
4.1.4 F distribution
Definition 4.1.6. Let U and V be independent chi-square random variables with m and n degrees of freedom, respectively. The distribution of
W = (U/m)/(V/n)
is called the F distribution with m and n degrees of freedom and is denoted by Fm,n.
[Figure: densities of the F distribution Fm,n for (n, m) = (4, 4), (10, 4), (10, 10), (4, 10).]
Proposition 4.2.1. The random variable X n and the vector of random variables (X1 − X n , X2 −
X n , . . . , Xn − X n ) are independent.
Proof. We write
s X̄n + Σ_{i=1}^n ti(Xi − X̄n) = Σ_{i=1}^n ai Xi,
where ai = s/n + (ti − t̄). Note that
Σ_{i=1}^n ai = s and Σ_{i=1}^n ai² = s²/n + Σ_{i=1}^n (ti − t̄)².
The first factor is the cf of X̄n while the second factor is the cf of (X1 − X̄n, X2 − X̄n, . . . , Xn − X̄n) (this is obtained by letting s = 0 in the formula). This implies the desired result.
Theorem 4.2.3. The distribution of (n − 1)s²n/σ² is the chi-square distribution with n − 1 degrees of freedom.
Also,
(1/σ²) Σ_{i=1}^n (Xi − µ)² = (1/σ²) Σ_{i=1}^n (Xi − X̄n)² + ((X̄n − µ)/(σ/√n))² =: U + V.
Since U and V are independent, ϕW(t) = ϕU(t)ϕV(t), where W denotes the left-hand side. Since W and V both follow chi-square distributions (with n and 1 degrees of freedom, respectively),
ϕU(t) = ϕW(t)/ϕV(t) = (1 − i2t)^{−n/2}/(1 − i2t)^{−1/2} = (1 − i2t)^{−(n−1)/2}.
The last expression is the c.f. of a random variable with a χ²_{n−1} distribution.
Corollary 4.2.4.
(X̄n − µ)/(sn/√n) ∼ t_{n−1}.
4.3 Exercises
4.1. Show that
2. if X ∼ tn then X² ∼ F1,n;
3. the Cauchy distribution and the t distribution with 1 degree of freedom are the same;
4. if X and Y are independent exponential random variables with λ = 1, then X/Y follows an F distribution.
4.2. Show how to use the chi-square distribution to calculate P(a < s2n /σ 2 < b).
4.5. Let X1, X2 and X3 be three independent chi-square variables with r1, r2 and r3 degrees of freedom, respectively.
2. Deduce that
(X1/r1)/(X2/r2) and (X3/r3)/((X1 + X2)/(r1 + r2))
are independent F-variables.
Chapter 5
Parameter estimation
Example 5.1.2. An urn contains m balls, labeled 1, 2, . . . , m, which are identical except for the number. The experiment is to choose a ball at random and record the number. Let X denote the number. Then the distribution of X is given by
P[X = k] = 1/m, for k = 1, . . . , m.
In case m is unknown, to obtain information on m we take a sample of n balls, which we will denote as X = (X1, . . . , Xn), where Xi is the number on the ith ball.
The sample can be drawn in several ways.
1. Sampling with replacement: We randomly select a ball, record its number and put it back in the urn. All the balls are then remixed, and the next ball is chosen. We can see that X1, . . . , Xn are mutually independent random variables and each has the same distribution
as X. Hence (X1 , . . . , Xn ) is a random sample.
2. Sampling without replacement: Here n balls are selected at random. After a ball is selected,
we do not return it to the urn. The X1 , . . . , Xn are not independent, but each Xi has the
same distribution as X.
If m is much greater than n, the sampling schemes are practically the same.
88
6. The sample median is a measure of central tendency that divides the data into two equal
parts, half below the median and half above. If the number of observations is even, the
median is halfway between the two central values. If the number of observations is odd,
the median is the central value.
7. When an ordered set of data is divided into four equal parts, the division points are called
quartiles. The first or lower quartile, q1 , is a value that has approximately 25% of the ob-
servations below it and approximately 75% of the observations above. The second quartile,
q2 , has approximately 50% of the observations below its value. The second quartile is ex-
actly equal to the median. The third or upper quartile, q3 , has approximately 75% of the
observations below its value.
8. The interquartile range is defined as IQR = q3 − q1 . The IQR is also used as a measure of
variability.
A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set
x1 , x2 , . . . , xn , where each number xi consists of at least two digits. To construct a stem-and-leaf
diagram, use the following steps.
1. Divide each number xi into two parts: a stem, consisting of one or more of the leading
digits and a leaf, consisting of the remaining digit.
Example 5.2.1. The weights of 80 students are given in the following table.
59.0 59.5 52.7 47.9 55.7 48.3 52.1 53.1 55.2 45.3
46.5 54.8 48.4 53.1 56.9 47.4 50.2 52.1 49.6 46.4
52.9 41.1 51.0 50.0 56.8 45.9 59.5 52.8 46.7 55.7
48.6 51.6 53.2 54.1 45.8 50.4 54.1 52.0 56.2 62.7
62.0 46.8 54.6 54.7 50.2 45.9 49.1 42.6 49.8 52.1
56.5 53.5 46.5 51.9 46.5 53.5 45.5 50.2 55.1 49.6
47.6 44.8 55.0 56.2 49.4 57.0 52.4 48.4 55.0 47.1
52.4 56.8 53.2 50.5 56.6 49.5 53.1 51.2 55.5 53.7
2. Mark and label the vertical scale with the frequencies or the relative frequencies.
3. Above each bin, draw a rectangle whose height is equal to the frequency (or relative frequency) corresponding to that bin.
[Figure: "Histogram of weight" — histogram of the 80 students' weights; horizontal axis: weight, vertical axis: number of students.]
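As an added sketch of how such a histogram can be produced (assuming the 80 weights from the table have been typed into a Python list named weights; only the first row is shown here, and the 2-unit bin edges are a choice, not prescribed by the text):

import matplotlib.pyplot as plt

# weights: the 80 observations from Example 5.2.1 (first row shown; enter all 80 values)
weights = [59.0, 59.5, 52.7, 47.9, 55.7, 48.3, 52.1, 53.1, 55.2, 45.3]

plt.hist(weights, bins=range(40, 66, 2), edgecolor="black")  # bins of width 2
plt.xlabel("weight")
plt.ylabel("No. of students")
plt.title("Histogram of weight")
plt.show()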
The box plot is a graphical display that simultaneously describes several important features
of a data set, such as center, spread, departure from symmetry, and identification of unusual ob-
servations or outliers.
A box plot displays the three quartiles, the minimum, and the maximum of the data on a rectan-
gular box, aligned either horizontally or vertically. The box encloses the interquartile range with
the left (or lower) edge at the first quartile, q1 , and the right (or upper) edge at the third quartile,
q3 . A line is drawn through the box at the second quartile (which is the 50th percentile or the me-
dian). A line, or whisker, extends from each end of the box. The lower whisker is a line from the
first quartile to the smallest data point within 1.5 interquartile ranges from the first quartile. The
upper whisker is a line from the third quartile to the largest data point within 1.5 interquartile
ranges from the third quartile. Data farther from the box than the whiskers are plotted as indi-
vidual points. A point beyond a whisker, but less than 3 interquartile ranges from the box edge, is
called an outlier. A point more than 3 interquartile ranges from the box edge is called an extreme
outlier.
Example 5.2.3. Consider the sample in Example 5.2.1. The quartiles of the sample are q1 = 48.40, q2 = 52.10, q3 = 54.85. Below is the box plot of the students' weights.
[Figure: box plot of the students' weights.]
158.7 167.6 164.0 153.1 179.3 153.0 170.6 152.4 161.5 146.7
147.2 158.2 157.7 161.8 168.4 151.2 158.7 161.0 147.9 155.5
How do we know if a particular probability distribution is a reasonable model for data? Some
of the visual displays we have used earlier, such as the histogram, can provide insight about the
form of the underlying distribution. However, histograms are usually not really reliable indica-
tors of the distribution form unless the sample size is very large. Probability plotting is a graphi-
cal method for determining whether sample data conform to a hypothesized distribution based
on a subjective visual examination of the data. The general procedure is very simple and can be
performed quickly. It is also more reliable than the histogram for small to moderate size samples.
To construct a probability plot, the observations in the sample are first ranked from smallest to largest. That is, the sample x1, x2, . . . , xn is arranged as x(1) ≤ x(2) ≤ . . . ≤ x(n). The ordered observations x(j) are then plotted against their observed cumulative frequency (j − 0.5)/n.
If the hypothesized distribution adequately describes the data, the plotted points will fall ap-
proximately along a straight line which is approximately between the 25th and 75th percentile
points; if the plotted points deviate significantly from a straight line, the hypothesized model is
not appropriate. Usually, the determination of whether or not the data plot as a straight line is
subjective.
In particular, a normal probability plot can be constructed by plotting the standardized normal scores zj = Φ^{−1}((j − 0.5)/n) against x(j).
2.86, 3.33, 3.43, 3.77, 4.16, 3.52, 3.56, 3.63, 2.43, 2.78.
Since all the points are very close to the straight line, one may conclude that a normal distribution
adequately describes the data.
Remark 4. This is a very subjective method. Please use it at your own risk! Later we will introduce the Shapiro-Wilk test for the normal distribution hypothesis.
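A short sketch of the procedure just described, using the ten observations listed above (the plotting details are illustrative choices): sort the sample, compute the normal scores zj = Φ^{−1}((j − 0.5)/n), and plot them against the ordered data; points lying close to a straight line suggest that a normal model is plausible.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

data = np.array([2.86, 3.33, 3.43, 3.77, 4.16, 3.52, 3.56, 3.63, 2.43, 2.78])

x = np.sort(data)                                   # ordered observations x_(j)
n = len(x)
z = norm.ppf((np.arange(1, n + 1) - 0.5) / n)       # standardized normal scores

plt.plot(x, z, "o")
plt.xlabel("ordered observations")
plt.ylabel("normal score z_j")
plt.title("Normal probability plot")
plt.show()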
5.3.1 Statistics
Example 5.3.1. We continue Example 5.1.2. Recall that we do not know the number of balls m and have to use the sample (X1, . . . , Xn) to obtain information about m.
Since E(X) = (m + 1)/2, using the law of large numbers, we have
(X1 + . . . + Xn)/n →a.s. (m + 1)/2.
Therefore, we get a first estimator for m given by
m̂n := 2(X1 + . . . + Xn)/n − 1 →a.s. m.
Another estimator for m is defined by
m̃n := max{X1, . . . , Xn}.
Since
P[m̃n ≠ m] = P[X1 < m, . . . , Xn < m] = Π_{i=1}^n P[Xi < m] = ((m − 1)/m)^n → 0
as n → ∞, we have m̃n →a.s. m.
The estimators m̂n and m̃n are called statistics: they depend only on the observations X1, . . . , Xn, not on m.
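A small simulation comparing the two estimators (added as an illustration; the values of m, n and the number of repetitions are arbitrary choices): sample n labels with replacement from {1, . . . , m}, then compute m̂n = 2X̄ − 1 and m̃n = max Xi.

import numpy as np

rng = np.random.default_rng(4)
m, n, reps = 500, 100, 10_000

samples = rng.integers(1, m + 1, size=(reps, n))   # sampling with replacement
m_hat = 2 * samples.mean(axis=1) - 1               # moment-based estimator
m_tilde = samples.max(axis=1)                      # maximum estimator

print("m_hat  : mean %.1f, std %.1f" % (m_hat.mean(), m_hat.std()))
print("m_tilde: mean %.1f, std %.1f" % (m_tilde.mean(), m_tilde.std()))

Both estimators concentrate near the true m = 500; m̃n is slightly biased downwards but typically much less variable, in line with the discussion in Example 5.3.9 below.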
Definition 5.3.2. Let X = (X1 , . . . , Xn ) be a sample observed from X and (T, BT ) a measurable
space. Then any function ϕ(X) = ϕ(X1 , . . . , Xn ), where ϕ : (Rn , B(Rn )) → (T, BT ) is a measur-
able function, of the sample is called a statistic.
In the following we only consider the case that (T, BT ) is a subset of (Rd , B(Rd )).
Definition 5.3.3. Let X = (X1 , . . . , Xn ) be a sample observed from a distribution with density
f (x; θ), θ ∈ Θ. Let Y = ϕ(X) be a statistic with density fY (y; θ). Then Y is called a sufficient
statistic for θ if
f(x; θ)/fY(ϕ(x); θ) = H(x),
where x = (x1, . . . , xn), f(x; θ) is the density of X at x, and H(x) does not depend on θ ∈ Θ.
Example 5.3.4. Let (X1 , . . . , Xn ) be a sample observed from a Poisson distribution with param-
eter λ > 0. Then Yn = ϕ(X) = X1 + . . . + Xn has Poisson distribution with parameter nλ. Hence
f(X; θ)/fY(ϕ(X); θ) = (Π_{i=1}^n f(Xi; λ))/fYn(Yn; nλ) = (e^{−nλ} λ^{Σ_{i=1}^n Xi}/Π_{i=1}^n Xi!) · (Yn!/(e^{−nλ}(nλ)^{Yn})) = Yn!/(n^{Yn} Π_{i=1}^n Xi!).
In order to directly verify the sufficiency of a statistic ϕ(X) we need to know the density of ϕ(X), which is not always available in practice. We next introduce the following criterion of Neyman to overcome this difficulty.
Theorem 5.3.5. Let X = (X1, . . . , Xn) be a random sample from a distribution that has density f(x; θ), θ ∈ Θ. The statistic Y1 = ϕ(X) is a sufficient statistic for θ if and only if we can find two nonnegative functions k1 and k2 such that
f(x1; θ) · · · f(xn; θ) = k1(ϕ(x1, . . . , xn); θ) k2(x1, . . . , xn),
where k2(x1, . . . , xn) does not depend on θ.
Example 5.3.6. Let (X1, . . . , Xn) be a sample from the normal distribution N(θ, 1) with θ ∈ Θ = R. Denote x̄ = (1/n) Σ_{i=1}^n xi. The joint density of X1, . . . , Xn at (x1, . . . , xn) is given by
(1/(2π)^{n/2}) exp[−Σ_{i=1}^n (xi − θ)²/2] = exp[−n(x̄ − θ)²/2] · (exp[−Σ_{i=1}^n (xi − x̄)²/2]/(2π)^{n/2}).
Since the first factor on the right hand side depends upon x1, . . . , xn only through x̄ and the second factor does not depend upon θ, the factorization theorem implies that the mean X̄ of the sample is a sufficient statistic for θ, the mean of the normal distribution.
(a) Eθ[ϕ(X1, . . . , Xn)] = θ;
(b) Dθ ϕ(X1, . . . , Xn) ≤ Dθ ϕ̄(X1, . . . , Xn) for any unbiased estimator ϕ̄(X1, . . . , Xn) of θ.
4. a consistent estimator of θ if
ϕ(X1, . . . , Xn) →Pθ θ as n → ∞.
Here Eθ, Dθ and Pθ denote the expectation, variance and probability under the assumption that the distribution of Xi is F(x; θ).
Example 5.3.8. Let (X1 , . . . , Xn ) be a sample from normal distribution N (a, σ 2 ). Using the lin-
earity of expectation and laws of large numbers, we have
Example 5.3.9. In Example 5.3.1, both m̂n and m̃n are consistent estimators of m. Moreover, m̂n is unbiased and m̃n is asymptotically unbiased.
Let X be a random variable whose density is f (x, θ), θ ∈ Θ, where θ is unknown. In the last
section, we discussed estimating θ by a statistic ϕ(X1 , . . . , Xn ) where X1 , . . . , Xn is a sample from
the distribution of X. When the sample is drawn, it is unlikely that the value of ϕ is the true value
of the parameter. In fact, if ϕ has a continuous distribution then Pθ [ϕ = θ] = 0. What is needed is
an estimate of the error of the estimation.
Example 5.3.10. Let (X1 , . . . , Xn ) be a sample from normal distribution N(a; σ 2 ) where σ 2 is
known. We know that X̄n is an unbiased, consistent estimator of a. But how close is X̄n to a?
Since X̄n ∼ N(a; σ 2 /n), the quantity (X̄n − a)/(σ/√n) has a standard normal N(0; 1) distribution. Therefore,
\[
0.954 = P\Big[-2 < \frac{\bar X_n - a}{\sigma/\sqrt n} < 2\Big]
= P\Big[\bar X_n - 2\frac{\sigma}{\sqrt n} < a < \bar X_n + 2\frac{\sigma}{\sqrt n}\Big]. \tag{5.2}
\]
In the following, thanks to the central limit theorem, we present a general method to find the confidence interval for parameters of a large class of distributions. To avoid confusion, let θ0 denote the true, unknown value of the parameter θ. Suppose ϕ is an estimator of θ0 such that
\[
\sqrt n\,(\varphi - \theta_0) \xrightarrow{w} N(0, \sigma_\varphi^2). \tag{5.3}
\]
The parameter σϕ2 is the asymptotic variance of √n ϕ and, in practice, it is usually unknown. For the present, though, we suppose that σϕ2 is known.
Let Z = √n(ϕ − θ0 )/σϕ be the standardized random variable. Then Z is asymptotically N(0, 1). Hence, P[−1.96 < Z < 1.96] = 0.95. This implies
\[
0.95 = P\Big[\varphi - 1.96\frac{\sigma_\varphi}{\sqrt n} < \theta_0 < \varphi + 1.96\frac{\sigma_\varphi}{\sqrt n}\Big]. \tag{5.4}
\]
Because the interval (ϕ − 1.96 σϕ /√n, ϕ + 1.96 σϕ /√n) is a function of the random variable ϕ, we will call it a random interval. The probability that the random interval contains θ0 is approximately 0.95.
In practice, we often do not know σϕ . Suppose that there exists a consistent estimator of σϕ , say Sϕ . It then follows from Slutsky's theorem that
\[
\frac{\sqrt n\,(\varphi - \theta_0)}{S_\varphi} \xrightarrow{w} N(0, 1).
\]
Hence the interval (ϕ − 1.96 Sϕ /√n, ϕ + 1.96 Sϕ /√n) would be a random interval with approximate probability 0.95 of covering θ0 .
In general, we have the following definition.
Definition 5.3.11. Let (X1 , . . . , Xn ) be a sample from a distribution F (x, θ), θ ∈ Θ. A random interval (ϕ1 , ϕ2 ), where ϕ1 and ϕ2 are estimators of θ, is called a (1 − α)-confidence interval for θ if
\[
P(\varphi_1 < \theta < \varphi_2) = 1 - \alpha.
\]
Let X1 , . . . , Xn be a random sample from the distribution of a random variable X which has unknown mean a and unknown variance σ 2 . Let X̄ and s2 be the sample mean and sample variance, respectively. By the Central limit theorem, the distribution of √n(X̄ − a)/s approximates N(0; 1). Hence, an approximate (1 − α) confidence interval for a is
\[
\Big(\bar x - z_{\alpha/2}\frac{s}{\sqrt n},\ \bar x + z_{\alpha/2}\frac{s}{\sqrt n}\Big), \tag{5.5}
\]
where zα/2 is the upper α/2 critical point of the standard normal distribution.
1. Because α < α′ implies that zα/2 > zα′/2 , selecting a higher confidence coefficient (smaller α) leads to a larger error term and hence a longer confidence interval, assuming all else remains the same.
2. Choosing a larger sample size decreases the error term and hence leads to shorter confidence intervals, assuming all else stays the same.
3. Usually the parameter σ is some type of scale parameter of the underlying distribution. In these situations, assuming all else remains the same, an increase in scale (noise level) generally results in larger error terms and, hence, longer confidence intervals.
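As a quick illustration of interval (5.5), the following Python sketch computes the approximate large-sample confidence interval from a data vector; the sample values below are made up purely for illustration.
\begin{verbatim}
import numpy as np
from scipy import stats

def mean_ci(x, alpha=0.05):
    """Approximate (1 - alpha) CI for the mean, formula (5.5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    s = x.std(ddof=1)                      # sample standard deviation
    z = stats.norm.ppf(1 - alpha / 2)      # upper alpha/2 critical point
    half = z * s / np.sqrt(n)
    return xbar - half, xbar + half

# hypothetical data, for illustration only
sample = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
print(mean_ci(sample, alpha=0.05))
\end{verbatim}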
Let X1 , . . . , Xn be a random sample from the Bernoulli distribution with probability of success p. Let p̂ = X̄ be the sample proportion of successes. It follows from the Central limit theorem that p̂ has an approximate N(p; p(1 − p)/n) distribution. Since p̂ and p̂(1 − p̂) are consistent estimators for p and p(1 − p), respectively, an approximate (1 − α) confidence interval for p is given by
\[
\Big(\hat p - z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}},\ \hat p + z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}\Big).
\]
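A minimal sketch of this proportion interval in Python, assuming the only inputs are the number of successes and the sample size; the plug-in numbers are those of the helmet data in Exercise 5.12 below.
\begin{verbatim}
import numpy as np
from scipy import stats

def proportion_ci(successes, n, alpha=0.05):
    """Approximate (1 - alpha) CI for a Bernoulli success probability."""
    p_hat = successes / n
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(proportion_ci(successes=18, n=50, alpha=0.05))  # cf. Exercise 5.12
\end{verbatim}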
In general, the confidence intervals developed so far in this section are approximate. They are based on the Central Limit Theorem and often also require a consistent estimate of the variance. In our next example, we develop an exact confidence interval for the mean when sampling from a normal distribution.
Sometimes confidence intervals on the population variance or standard deviation are needed.
When the population is modelled by a normal distribution, the tests and intervals described in
this section are applicable. The following result provides the basis of constructing these confi-
dence intervals.
Theorem 5.3.14. If s2 is the sample variance from a random sample of n observations from a normal distribution with unknown variance σ 2 , then a 100(1 − α)% CI on σ 2 is
\[
\frac{(n-1)s^2}{c^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{c^2_{1-\alpha/2,\,n-1}},
\]
where c2a,n−1 satisfies P(χ2n−1 > c2a,n−1 ) = a and the random variable χ2n−1 has a chi-square distribution with n − 1 degrees of freedom.
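The quantiles c2a,n−1 are available in scipy; a sketch of Theorem 5.3.14, using n = 15 and s = 0.008 (the rivet-hole setting of Exercise 5.9 later in this chapter) only as plug-in numbers:
\begin{verbatim}
from scipy import stats

def variance_ci(s2, n, alpha=0.05):
    """Exact (1 - alpha) CI for sigma^2 under normal sampling (Theorem 5.3.14)."""
    upper_quantile = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # c^2_{alpha/2, n-1}
    lower_quantile = stats.chi2.ppf(alpha / 2, df=n - 1)      # c^2_{1-alpha/2, n-1}
    return (n - 1) * s2 / upper_quantile, (n - 1) * s2 / lower_quantile

# Exercise 5.9: n = 15 hole diameters, s = 0.008 mm, 99% CI for sigma^2
print(variance_ci(s2=0.008**2, n=15, alpha=0.01))
\end{verbatim}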
A practical problem of interest is the comparison of two distributions; that is, comparing the distributions of two random variables, say X and Y . In this section, we will compare the means of X and Y . Denote the means of X and Y by aX and aY , respectively. In particular, we shall obtain confidence intervals for the difference ∆ = aX − aY . Assume that the variances of X and Y are finite and denote them by σX2 = Var(X) and σY2 = Var(Y ). Let X1 , . . . , Xn be a random sample from the distribution of X and let Y1 , . . . , Ym be a random sample from the distribution of Y . Assume that the samples were gathered independently of one another. Let X̄ and Ȳ be the sample means of X and Y , respectively. Let ∆̂ = X̄ − Ȳ . Next we obtain a large sample confidence interval for ∆ based on the asymptotic distribution of ∆̂.
Proposition 5.3.15. Let N = n + m denote the total sample size. We suppose that
\[
\frac{n}{N} \to \lambda_X \quad\text{and}\quad \frac{m}{N} \to \lambda_Y, \quad\text{where } \lambda_X + \lambda_Y = 1.
\]
Then a (1 − α) confidence interval for ∆ is
1. (if σX2 and σY2 are known)
\[
\Big((\bar X - \bar Y) - z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}},\ \ (\bar X - \bar Y) + z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}\Big); \tag{5.7}
\]
2. (if σX2 and σY2 are unknown)
\[
\Big((\bar X - \bar Y) - z_{\alpha/2}\sqrt{\frac{s^2(X)}{n} + \frac{s^2(Y)}{m}},\ \ (\bar X - \bar Y) + z_{\alpha/2}\sqrt{\frac{s^2(X)}{n} + \frac{s^2(Y)}{m}}\Big), \tag{5.8}
\]
where s2 (X) and s2 (Y ) are the sample variances of (Xn ) and (Ym ), respectively.
Proof. It follows from the Central limit theorem that √n(X̄ − aX ) →w N(0; σX2 ). Thus,
\[
\sqrt N\,(\bar X - a_X) \xrightarrow{w} N\Big(0;\ \frac{\sigma_X^2}{\lambda_X}\Big).
\]
Likewise,
\[
\sqrt N\,(\bar Y - a_Y) \xrightarrow{w} N\Big(0;\ \frac{\sigma_Y^2}{\lambda_Y}\Big).
\]
Since the samples are independent of one another, we have
\[
\sqrt N\,\big((\bar X - \bar Y) - (a_X - a_Y)\big) \xrightarrow{w} N\Big(0;\ \frac{\sigma_X^2}{\lambda_X} + \frac{\sigma_Y^2}{\lambda_Y}\Big).
\]
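A short sketch of interval (5.8) for the difference of means; the summary statistics below are made up and stand in for real data.
\begin{verbatim}
import numpy as np
from scipy import stats

def diff_means_ci(xbar, s2x, n, ybar, s2y, m, alpha=0.05):
    """Large-sample (1 - alpha) CI for Delta = a_X - a_Y, formula (5.8)."""
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(s2x / n + s2y / m)
    delta_hat = xbar - ybar
    return delta_hat - half, delta_hat + half

# hypothetical summary statistics, for illustration only
print(diff_means_ci(xbar=10.2, s2x=4.0, n=60, ybar=9.5, s2y=5.5, m=50))
\end{verbatim}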
Let X and Y be two independent random variables with Bernoulli distributions B(1, p1 ) and
B(1, p2 ), respectively. Let X1 , . . . , Xn be a random sample from the distribution of X and let
Y1 , . . . , Ym be a random sample from the distribution of Y .
The method of maximum likelihood is one of the most popular techniques for deriving estimators. Let X = (X1 , . . . , Xn ) be a random sample from a distribution with pdf/pmf f (x; θ). The likelihood function is defined by
\[
L(x;\theta) = \prod_{i=1}^{n} f(x_i;\theta).
\]
Definition 5.4.1. For each sample point x, let θ̂(x) be a parameter value at which L(x; θ) attains
its maximum as a function of θ, with x held fixed. A maximum likelihood estimator (MLE) of the
parameter θ based on a sample X is θ̂(X).
Example 5.4.2. Let (X1 , . . . , Xn ) be a random sample from the distribution N (θ, 1), where θ is unknown. We have
\[
L(x;\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i-\theta)^2/2}.
\]
A simple calculus shows that the MLE of θ is θ̂ = (1/n) Σni=1 xi . One can easily verify that θ̂ is an unbiased and consistent estimator of θ.
Example 5.4.3. Let (X1 , . . . , Xn ) be a random sample from the Bernoulli distribution with an unknown parameter p. The likelihood function is
\[
L(x;p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}.
\]
A simple calculus shows that the MLE of p is p̂ = (1/n) Σni=1 xi . One can easily verify that p̂ is an unbiased and consistent estimator of p.
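Both MLEs above have closed forms, but the same answers can be recovered by maximizing the log-likelihood numerically, which is how MLEs are usually computed when no closed form exists. A sketch using scipy's optimizer on simulated Bernoulli data (the true p and the sample size are invented for illustration):
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)   # simulated Bernoulli(p = 0.3) sample

def neg_log_likelihood(p):
    # minus log of prod p^{x_i} (1 - p)^{1 - x_i}
    return -(np.sum(x) * np.log(p) + np.sum(1 - x) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # numerical MLE vs closed-form MLE (the sample mean)
\end{verbatim}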
Let X1 , . . . , Xn denote a random sample from the distribution with pdf f (x; θ), θ ∈ Θ. Let θ0 denote the true value of θ. The following theorem gives a theoretical reason for maximizing the likelihood function. It says that the maximum of L(θ) asymptotically separates the true model at θ0 from models at θ ≠ θ0 .
Theorem 5.4.4. Assume that f (·; θ) ≠ f (·; θ0 ) for all θ ≠ θ0 and that the pdfs f (·; θ) have common support for all θ ∈ Θ. Then
\[
\lim_{n\to\infty} P_{\theta_0}\big[L(X;\theta_0) > L(X;\theta)\big] = 1, \quad\text{for all } \theta \neq \theta_0.
\]
Since the function φ(x) = − ln x is strictly convex, it follows from the Law of Large Numbers and Jensen's inequality that
\[
\frac{1}{n}\sum_{i=1}^{n}\ln\frac{f(X_i;\theta)}{f(X_i;\theta_0)}
\xrightarrow{P} E_{\theta_0}\Big[\ln\frac{f(X_1;\theta)}{f(X_1;\theta_0)}\Big]
< \ln E_{\theta_0}\Big[\frac{f(X_1;\theta)}{f(X_1;\theta_0)}\Big] = 0.
\]
Note that the condition f (·; θ) ≠ f (·; θ0 ) for all θ ≠ θ0 is needed to obtain the strict inequality, while the common support is needed to obtain the last equality.
Theorem 5.4.4 says that asymptotically the likelihood function is maximized at the true value θ0 .
So in considering estimates of θ0 , it seems natural to consider the value of θ which maximizes the
likelihood.
We close this section by showing that maximum likelihood estimators, under regularity con-
ditions, are consistent estimators.
Theorem 5.4.5. Suppose that the pdfs f (x; θ) satisfy the regularity conditions (R0) and (R1), and that f (x; θ) is differentiable in θ. Then with probability tending to one as n → ∞, the likelihood equation
\[
\frac{\partial}{\partial\theta} L(\theta) = 0 \quad\Leftrightarrow\quad \frac{\partial}{\partial\theta} \ln L(\theta) = 0
\]
has a solution θ̂n such that θ̂n →P θ0 .
Let (X1 , . . . , Xn ) be a random sample from a distribution with density f (x; θ) where θ =
(θ1 , . . . , θk ) ∈ Θ ⊂ Rk . Method of moments estimators are found by equating the first k sam-
ple moments to the corresponding k population moments, and solving the resulting system of
simultaneous equations. More precisely, define
\[
\mu_j = E[X^j] = g_j(\theta_1,\dots,\theta_k), \quad j = 1,\dots,k,
\]
and
\[
m_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j.
\]
The method of moments estimator (θ̂1 , . . . , θ̂k ) is obtained by solving the system of equations
\[
m_j = g_j(\theta_1,\dots,\theta_k), \quad j = 1,\dots,k.
\]
Example 5.4.6 (Binomial distribution). Let (X1 , . . . , Xn ) be a random sample from the Binomial distribution B(k, p), that is,
\[
P[X_i = x] = \binom{k}{x} p^x (1-p)^{k-x}, \quad x = 0, 1, \dots, k.
\]
Here we assume that p and k are unknown parameters. Equating the first two sample moments to those of the population yields
\[
\begin{cases}
\bar X_n = kp \\[2pt]
\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} X_i^2 = kp(1-p) + k^2p^2
\end{cases}
\quad\Leftrightarrow\quad
\begin{cases}
\hat k = \dfrac{\bar X_n^2}{\bar X_n - \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X_n)^2} \\[8pt]
\hat p = \dfrac{\bar X_n}{\hat k}.
\end{cases}
\]
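A sketch of these method-of-moments formulas in Python; the simulated sample below (with invented true values k = 12, p = 0.4) exists only to exercise the estimator:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(12, 0.4, size=500)          # simulated B(k = 12, p = 0.4) data

xbar = x.mean()
var_hat = np.mean((x - xbar) ** 2)           # (1/n) * sum (X_i - Xbar)^2

k_hat = xbar ** 2 / (xbar - var_hat)         # method of moments estimator of k
p_hat = xbar / k_hat                         # method of moments estimator of p
print(k_hat, p_hat)
\end{verbatim}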
Theorem 5.5.1 (Rao-Cramer Lower Bound). Let X1 , . . . , Xn be iid with common pdf f (x; θ) for θ ∈ Θ. Assume that the regularity conditions (R0)-(R2) hold. Moreover, suppose that Y = u(X1 , . . . , Xn ) is a statistic with mean E[Y ] = k(θ). Then
\[
DY \ge \frac{[k'(\theta)]^2}{n\,I(\theta)}.
\]
Proof. Since
\[
k(\theta) = \int_{\mathbb R^n} u(x_1,\dots,x_n)\, f(x_1;\theta)\cdots f(x_n;\theta)\, dx_1\cdots dx_n,
\]
we have
\[
k'(\theta) = \int_{\mathbb R^n} u(x_1,\dots,x_n)\Big[\sum_{i=1}^{n}\frac{\partial \ln f(x_i;\theta)}{\partial\theta}\Big] f(x_1;\theta)\cdots f(x_n;\theta)\, dx_1\cdots dx_n.
\]
Denote Z = Σni=1 ∂ ln f (Xi ; θ)/∂θ. It is easy to verify that E[Z] = 0 and DZ = nI(θ). Moreover, since E[Z] = 0,
\[
k'(\theta) = E[YZ] = \mathrm{Cov}(Y, Z) = \rho\sqrt{DY}\sqrt{DZ},
\]
where ρ is the correlation coefficient between Y and Z. Since DZ = nI(θ) and ρ2 ≤ 1, we get
\[
\frac{|k'(\theta)|^2}{n\,I(\theta)\, DY} \le 1,
\]
which is the desired inequality.
Definition 5.5.2. Let Y be an unbiased estimator of a parameter θ in the case of point estimation.
The statistic Y is called an efficient estimator of θ if and only if the variance of Y attains the Rao-
Cramer lower bound.
Example 5.5.3. Let X1 , X2 , . . . , Xn denote a random sample from an exponential distribution that has the mean λ > 0. Show that X̄ is an efficient estimator of λ.
Example 5.5.4 (Poisson distribution). Let X1 , X2 , . . . , Xn denote a random sample from a Pois-
son distribution that has the mean θ > 0. Show that X̄ is an efficient estimator of θ.
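For the Poisson case one can check numerically that the variance of X̄ attains the Rao-Cramer bound 1/(nI(θ)) = θ/n (for the Poisson distribution, I(θ) = 1/θ). The simulation below, with an arbitrarily chosen θ and n, is only a sanity check of this claim:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 3.0, 50, 20000

# Monte Carlo estimate of Var(Xbar) for Poisson(theta) samples of size n
samples = rng.poisson(theta, size=(reps, n))
var_xbar = samples.mean(axis=1).var()

crlb = theta / n          # Rao-Cramer lower bound: 1 / (n * I(theta)), I(theta) = 1/theta
print(var_xbar, crlb)     # the two numbers should be close
\end{verbatim}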
In the above examples, we were able to obtain the MLEs in closed form along with their dis-
tributions and, hence, moments. This is often not the case. Maximum likelihood estimators,
however, have an asymptotic normal distribution. In fact, MLEs are asymptotically efficient.
Theorem 5.5.5. Assume X1 , . . . , Xn are iid with pdf f (x; θ0 ) for θ0 ∈ Θ such that the regularity conditions (R0)-(R5) are satisfied. Suppose further that 0 < I(θ0 ) < ∞, and
(R6) The pdf f (x; θ) is three times differentiable as a function of θ. Moreover, for all θ ∈ Θ, there exists a constant c and a function M (x) such that
\[
\Big|\frac{\partial^3 \ln f(x;\theta)}{\partial\theta^3}\Big| \le M(x),
\]
with Eθ0 [M (X)] < ∞, for all θ0 − c < θ < θ0 + c and all x in the support of X.
Then any consistent sequence of solutions θ̂n of the likelihood equation satisfies
\[
\sqrt n\,(\hat\theta_n - \theta_0) \xrightarrow{w} N\Big(0,\ \frac{1}{I(\theta_0)}\Big).
\]
Proof. Expanding the function l′ (θ) into a Taylor series of order two about θ0 and evaluating it at θ̂n , we get
\[
l'(\hat\theta_n) = l'(\theta_0) + (\hat\theta_n - \theta_0)\, l''(\theta_0) + \frac{1}{2}(\hat\theta_n - \theta_0)^2\, l'''(\theta_n^*),
\]
where θn∗ is between θ0 and θ̂n . But l′ (θ̂n ) = 0. Hence,
\[
\sqrt n\,(\hat\theta_n - \theta_0) = \frac{n^{-1/2}\, l'(\theta_0)}{-n^{-1} l''(\theta_0) - (2n)^{-1}(\hat\theta_n - \theta_0)\, l'''(\theta_n^*)}.
\]
Note that |θ̂n − θ0 | < c0 implies that |θn∗ − θ0 | < c0 ; thanks to Condition (R6), we have
\[
\Big|\frac{1}{n}\, l'''(\theta_n^*)\Big| \le \frac{1}{n}\sum_{i=1}^{n}\Big|\frac{\partial^3 \ln f(X_i;\theta_n^*)}{\partial\theta^3}\Big| \le \frac{1}{n}\sum_{i=1}^{n} M(X_i).
\]
Since Eθ0 |M (X)| < ∞, applying the Law of Large Numbers, we have (1/n) Σni=1 M (Xi ) →P Eθ0 [M (X)]. Moreover, since θ̂n →P θ0 , for any ε > 0 there exists N > 0 so that for n ≥ N , P[|θ̂n − θ0 | < c0 ] ≥ 1 − ε/2 and
\[
P\Big[\Big|\frac{1}{n}\sum_{i=1}^{n} M(X_i) - E_{\theta_0}[M(X)]\Big| < 1\Big] \ge 1 - \frac{\varepsilon}{2},
\]
5.6 Exercises
5.1. For a normal population with known variance σ 2 , answer the following questions:
1. What is the confidence level for the interval x̄ − 2.14σ/√n ≤ µ ≤ x̄ + 2.14σ/√n?
2. What is the confidence level for the interval x̄ − 2.49σ/√n ≤ µ ≤ x̄ + 2.49σ/√n?
3. What is the confidence level for the interval x̄ − 1.85σ/√n ≤ µ ≤ x̄ + 1.85σ/√n?
5.2. A confidence interval estimate is desired for the gain in a circuit on a semiconductor device.
Assume that gain is normally distributed with standard deviation σ = 20.
5.3. Following are two confidence interval estimates of the mean µ of the cycles to failure of an
automotive door latch mechanism (the test was conducted at an elevated stress level to acceler-
ate the failure).
3124.9 ≤ µ ≤ 3215.7 3110.5 ≤ µ ≤ 3230.1.
1. What is the value of the sample mean cycles to failure?
2. The confidence level for one of these CIs is 95% and the confidence level for the other is
99%. Both CIs are calculated from the same sample data. Which is the 95% CI? Explain
why.
5.4. n = 100 random samples of water from a fresh water lake were taken and the calcium
concentration (milligrams per liter) measured. A 95% CI on the mean calcium concentration
is 0.49 ≤ µ ≤ 0.82.
1. Would a 99% CI calculated from the same sample data have been longer or shorter?
2. Consider the following statement: There is a 95% chance that µ is between 0.49 and 0.82. Is
this statement correct? Explain your answer.
3. Consider the following statement: If n = 100 random samples of water from the lake were
taken and the 95% CI on µ computed, and this process was repeated 1000 times, 950 of the
CIs will contain the true value of µ. Is this statement correct? Explain your answer.
5.5. A research engineer for a tire manufacturer is investigating tire life for a new rubber com-
pound and has built 16 tires and tested them to end-of-life in a road test. The sample mean and
standard deviation are 60,139.7 and 3645.94 kilometers. Find a 95% confidence interval on mean
tire life.
5.6. An Izod impact test was performed on 20 specimens of PVC pipe. The sample mean is X̄ =
1.25 and the sample standard deviation is S = 0.25. Find a 99% lower confidence bound on Izod
impact strength.
5.7. The compressive strength of concrete is being tested by a civil engineer. He tests 12 speci-
mens and obtains the following data.
2216 2237 2225 2301 2318 2255
2249 2204 2281 2263 2275 2295
1. Is there evidence to support the assumption that compressive strength is normally dis-
tributed? Does this data set support your point of view? Include a graphical display in
your answer.
5.8. A machine produces metal rods. A random sample of 15 rods is selected, and the diameter is measured. The resulting data (in millimetres) are as follows:
8.24 8.25 8.2 8.23 8.24
8.21 8.26 8.26 8.2 8.25
8.23 8.23 8.19 8.28 8.24
5.9. A rivet is to be inserted into a hole. A random sample of n = 15 parts is selected, and the
hole diameter is measured. The sample standard deviation of the hole diameter measurements
is s = 0.008 millimeters. Construct a 99% CI for σ 2 .
5.10. The sugar content of the syrup in canned peaches is normally distributed with standard
deviation σ. A random sample of n = 10 cans yields a sample standard deviation of s = 4.8
milligrams. Find a 95% CI for σ.
5.11. Of 1000 randomly selected cases of lung cancer, 823 resulted in death within 10 years.
2. How large a sample would be required to be at least 95% confident that the error in esti-
mating the 10-year death rate from lung cancer is less than 0.03?
5.12. A random sample of 50 suspension helmets used by motorcycle riders and automobile
race-car drivers was subjected to an impact test, and on 18 of these helmets some damage was
observed.
1. Find a 95% CI on the true proportion of helmets of this type that would show damage from
this test.
2. Using the point estimate of p obtained from the preliminary sample of 50 helmets, how
many helmets must be tested to be 95% confident that the error in estimating the true
value of p is less than 0.02?
3. How large must the sample be if we wish to be at least 95% confident that the error in
estimating p is less than 0.02, regardless of the true value of p?
where α1 + α2 = α. If α1 = α2 = α/2, we have the usual 100(1 − α)% CI for µ. In the above, when α1 ≠ α2 , the CI is not symmetric about µ. The length of the interval is L = σ(zα1 + zα2 )/√n. Prove that the length of the interval L is minimized when α1 = α2 = α/2.
5.14. Let the observed value of the mean X̄ of a random sample of size 20 from a distribution
that is N (µ, 80) be 81.2. Find a 95 percent confidence interval for µ.
5.15. Let X̄ be the mean of a random sample of size n from a distribution that is N (µ, 9). Find n
such that P[X̄ − 1 < µ < X̄ + 1] = 0.90, approximately.
5.16. Let a random sample of size 17 from the normal distribution N (µ, σ 2 ) yield x̄ = 4.7 and
s2 = 5.76. Determine a 90 percent confidence interval for µ.
5.17. Let X̄ denote the mean of a random sample of size n from a distribution that has mean µ and variance σ 2 = 10. Find n so that the probability is approximately 0.954 that the random interval (X̄ − 1/2, X̄ + 1/2) includes µ.
1. If σ 2 is known, find a minimum value for n to guarantee that a 0.95 CI for µ will have length
no more than σ/4.
2. If σ 2 is unknown, find a minimum value for n to guarantee, with probability 0.90, that a 0.95
CI for µ will have length no more than σ/4.
5.21. Let (X1 , . . . , Xn ) be iid uniform U(0; θ). Let Y be the largest order statistic. Show that the distribution of Y /θ does not depend on θ, and find the shortest (1 − α) CI for θ.
5.22. Let X1 , X2 , X3 be a random sample of size three from a uniform (θ, 2θ) distribution, where
θ > 0.
\[
f(x;\theta) = \frac{e^{\theta-x}}{(1+e^{\theta-x})^2}, \quad x \in \mathbb R,\ \theta \in \mathbb R.
\]
5.25. Let X1 , . . . , Xn represent a random sample from each of the distributions having the fol-
lowing pdfs or pmfs:
5.28. Suppose X1 , . . . , Xn are iid with pdf f (x; θ) = e−x/θ I{0<x<∞} . Find the mle of P[X ≥ 3].
3. The length (in millimeters) of cuckoos' eggs found in hedge sparrow nests can be modelled with this distribution. For the data
22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8, 23.1, 23.1, 23.5, 23.0, 23.0,
Yi = βxi + εi , i = 1, . . . , n,
3. Find the mle β̄n of β, and show that it is an unbiased estimator of β. Compare the variances
of β̄n and β̂n .
5.31. Let (X1 , . . . , Xn ) be a random sample from a population with mean µ and variance σ 2 .
2. Among all such unbiased estimator, find the one with minimum variance, and calculate
the variance.
2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an effi-
cient estimator of θ.
3. What is the asymptotic distribution of √n(θ̂ − θ)?
2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an effi-
cient estimator of θ.
3. What is the asymptotic distribution of √n(θ̂ − θ)?
5.35. Let (X1 , . . . , Xn ) be a random sample from a N(0; θ) distribution. We want to estimate the standard deviation √θ. Find the constant c so that Y = c Σni=1 |Xi | is an unbiased estimator of √θ and determine its efficiency.
5.37 (Beta (θ, 1) distribution). Let X1 , X2 , . . . , Xn denote a random sample of size n > 2 from a distribution with pdf
\[
f(x;\theta) = \begin{cases} \theta x^{\theta-1} & \text{for } x \in (0,1), \\ 0 & \text{otherwise,} \end{cases}
\]
where the parameter space is Ω = (0, ∞).
1. Show that θ̂ = −n/(Σni=1 ln Xi ) is the MLE of θ.
4. Is θ̂ an efficient estimator of θ?
5.38. Let X1 , . . . , Xn be iid N(θ, 1). Show that the best unbiased estimator of θ2 is X̄n2 − 1/n. Calculate its variance and show that it is greater than the Cramer-Rao lower bound.
Chapter 6
Hypothesis Testing
6.1 Introduction
Point estimation and confidence intervals are useful statistical inference procedures. Another
type of inference that is frequently used concerns tests of hypotheses. As in the last section,
suppose our interest centers on a random variable X which has density function f (x; θ) where
θ ∈ Θ. Suppose we think, due to theory or a preliminary experiment, that θ ∈ Θ0 or θ ∈ Θ1 where
Θ0 and Θ1 are subsets of Θ and Θ0 ∪ Θ1 = Θ. We label the hypothesis as
H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 . (6.1)
We call H0 the null hypothesis and H1 the alternative hypothesis. A hypothesis of the form θ = θ0
is called a simple hypothesis while a hypothesis of the form θ > θ0 or θ < θ0 is called a composite
hypothesis. A test of the form
H0 : θ = θ0 versus H1 : θ ≠ θ0
is called a two-sided test, while tests of the form
H0 : θ ≤ θ0 versus H1 : θ > θ0 ,
or
H0 : θ ≥ θ0 versus H1 : θ < θ0
are called one-sided tests.
Our goal is to select a critical region which minimizes the probability of making an error. In general, it is not possible to simultaneously reduce Type I and Type II errors because of a see-saw effect: if one takes C = ∅ then H0 would never be rejected, so the probability of Type I error would be 0, but the Type II error would occur with probability 1. A Type I error is usually considered to be worse than a Type II error. Therefore, we will choose a critical region which, on the one hand, bounds the probability of Type I error at a certain level and, on the other hand, minimizes the probability of Type II error.
α is also called the significance level of the test associated with critical region C.
Over all critical regions of size α, we will look for the one which has the lowest probability of Type II error. Equivalently, for θ ∈ Θ1 , we want to maximize the probability Pθ [(X1 , . . . , Xn ) ∈ C] that H0 is rejected; this probability is called the power of the test at θ. So our task is to find, among all the critical regions of size α, the one with the highest power.
We define the power function of a critical region by
In particular, if we have µ0 = 0, µ1 = 1, n = 100, then at the 5% significance level we would reject H0 in favor of H1 if X̄n > 0.1965, and the power of the test is 1 − Φ(−8.135) = 0.9999.
Example 6.1.3 (Large Sample Test for the Mean). Let X1 , . . . , Xn be a random sample from the
distribution of X with mean µ and finite variance σ 2 . We want to test the hypotheses
H0 : µ = µ0 versus H1 : µ > µ0
where µ0 is specified. To illustrate, suppose µ0 is the mean level on a standardized test of students
who have been taught a course by a standard method of teaching. Suppose it is hoped that a new
method which incorporates computers will have a mean level µ > µ0 , where µ = E[X] and X
is the score of a student taught by the new method. This conjecture will be tested by having n
students (randomly selected) to be taught under this new method.
Because X̄n → µ in probability, an intuitive decision rule is given by:
Reject H0 in favor of H1 if X̄n is much larger than µ0 .
In general, the distribution of the sample mean cannot be obtained in closed form. So we will use the Central Limit Theorem to find the critical region. Indeed, since
\[
\frac{\bar X_n - \mu}{S/\sqrt n} \xrightarrow{w} N(0,1),
\]
we obtain a test with an approximate size α:
Reject H0 in favor of H1 if (X̄n − µ0 )/(S/√n) ≥ zα .
The power of the test is also approximated by using the Central Limit Theorem:
\[
\gamma(\mu) = P\big[\bar X_n \ge \mu_0 + z_\alpha\,\sigma/\sqrt n\,\big]
\approx \Phi\Big(-z_\alpha - \frac{\sqrt n\,(\mu_0 - \mu)}{\sigma}\Big).
\]
So if we have some reasonable idea of what σ equals, we can compute the approximate power
function.
Finally, note that if X has a normal distribution, then (X̄n − µ)/(S/√n) has a t distribution with n − 1 degrees of freedom. Thus we can establish a rejection rule having exact level α:
Reject H0 in favor of H1 if T = (X̄n − µ0 )/(S/√n) ≥ tα,n−1 ,
where tα,n−1 is the upper α critical point of a t distribution with n − 1 degrees of freedom.
One way to report the results of a hypothesis test is to state that the null hypothesis was or
was not rejected at a specified α-value or level of significance. For example, we can say that
H0 : µ = 0 was rejected at the 0.05 level of significance. This statement of conclusions is often
inadequate because it gives the decision maker no idea about whether the computed value of
the test statistic was just barely in the rejection region or whether it was very far into this region.
Furthermore, stating the results this way imposes the predefined level of significance on other
users of the information. This approach may be unsatisfactory because some decision makers
might be uncomfortable with the risks implied by α = 0.05.
To avoid these difficulties the P-value approach has been adopted widely in practice. The
P -value is the probability that the test statistic will take on a value that is at least as extreme as
the observed value of the statistic when the null hypothesis H0 is true. Thus, a P -value conveys
much information about the weight of evidence against H0 , and so a decision maker can draw a
conclusion at any specified level of significance. We now give a formal definition of a P -value.
Definition 6.1.4. The P -value is the smallest level of significance that would lead to rejection of
the null hypothesis H0 with the given data.
This means that if α > P -value, we would reject H0 , while if α < P -value, we would not reject H0 .
Let L(x; θ) be the likelihood function of the sample (X1 , . . . , Xn ) from a distribution with density p(x; θ). The likelihood ratio test statistic for testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 is defined by
\[
\lambda(x) = \frac{\sup_{\theta\in\Theta_0} L(x;\theta)}{\sup_{\theta\in\Theta} L(x;\theta)}.
\]
A likelihood ratio test is any test that has a rejection region of the form C = {x : λ(x) ≤ c} for some c ∈ [0, 1].
The motivation of likelihood ratio test comes from the fact that if θ0 is the true value of θ then,
asymptotically, L(θ0 ) is the maximum value of L(θ). Therefore, if H0 is true, λ should be close to
1; while if H1 is true, λ should be smaller.
Example 6.2.2 (Likelihood Ratio Test for the Exponential Distribution). Suppose X1 , . . . , Xn are iid with pdf f (x; θ) = θ−1 e−x/θ I{x>0} and θ > 0. Let's consider the hypotheses
H0 : θ = θ0 versus H1 : θ ≠ θ0 ,
where θ0 > 0 is a specified value. The likelihood ratio test statistic simplifies to
\[
\lambda(X) = \Big(\frac{\bar X_n}{\theta_0}\Big)^{n} e^{n}\, e^{-n\bar X_n/\theta_0}.
\]
The decision rule is to reject H0 if λ ≤ c. Using differential calculus, it is easy to show that λ ≤ c iff X̄ ≤ c1 θ0 or X̄ ≥ c2 θ0 for some positive constants c1 , c2 . Note that under the null hypothesis H0 , the statistic (2/θ0 ) Σni=1 Xi has a χ2 distribution with 2n degrees of freedom. Therefore, the following decision rule gives a test of exact size α:
Reject H0 if (2/θ0 ) Σni=1 Xi ≤ χ21−α/2 (2n) or (2/θ0 ) Σni=1 Xi ≥ χ2α/2 (2n),
where χ21−α/2 (2n) is the lower α/2 quantile of a χ2 distribution with 2n degrees of freedom and χ2α/2 (2n) is the upper α/2 quantile of a χ2 distribution with 2n degrees of freedom.
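A short numerical sketch of this decision rule; the data are simulated from an exponential distribution purely to illustrate the computation, and the hypothesized θ0 is chosen arbitrarily:
\begin{verbatim}
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
theta0, n, alpha = 2.0, 30, 0.05
x = rng.exponential(scale=2.0, size=n)           # simulated data (true theta = 2.0)

test_stat = 2 * x.sum() / theta0                 # ~ chi-square with 2n df under H0
lower = stats.chi2.ppf(alpha / 2, df=2 * n)      # lower alpha/2 quantile
upper = stats.chi2.ppf(1 - alpha / 2, df=2 * n)  # upper alpha/2 quantile

reject = (test_stat <= lower) or (test_stat >= upper)
print(test_stat, (lower, upper), reject)
\end{verbatim}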
If ϕ(X) is a sufficient statistic for θ with pdf or pmf g(t; θ), then we might consider constructing a likelihood ratio test based on ϕ and its likelihood function L∗ (t; θ) = g(t; θ) rather than on the sample X and its likelihood function L(x; θ).
Theorem 6.2.3. If ϕ(X) is a sufficient statistic for θ and λ∗ (t) and λ(x) are the likelihood ratio test
statistics based on ϕ and X, respectively, then λ∗ (ϕ(x)) = λ(x) for every x in the sample space.
Proof. From the Factorization Theorem, the pdf or pmf of X can be written as f (x; θ) = g(ϕ(x); θ)h(x), where g(t; θ) is the pdf or pmf of ϕ(X) and h(x) does not depend on θ. Thus
\[
\lambda(x) = \frac{\sup_{\theta\in\Theta_0} g(\varphi(x);\theta)\,h(x)}{\sup_{\theta\in\Theta} g(\varphi(x);\theta)\,h(x)}
= \frac{\sup_{\theta\in\Theta_0} L^*(\varphi(x);\theta)}{\sup_{\theta\in\Theta} L^*(\varphi(x);\theta)} = \lambda^*(\varphi(x)).
\]
Now we consider a test of a simple hypothesis H0 versus a simple alternative H1 . Let f (x; θ)
denote the density of a random variable X where θ ∈ Θ = {θ0 , θ1 }. Let X = (X1 , . . . , Xn ) be a
random sample from the distribution of X.
Definition 6.3.1. A subset C of the sample space is called a best critical region of size α for testing the simple hypothesis
H0 : θ = θ0 versus H1 : θ = θ1
if Pθ0 [(X1 , . . . , Xn ) ∈ C] = α and, for every subset A of the sample space with Pθ0 [(X1 , . . . , Xn ) ∈ A] = α, we have Pθ1 [(X1 , . . . , Xn ) ∈ C] ≥ Pθ1 [(X1 , . . . , Xn ) ∈ A].
The following theorem of Neyman and Pearson provides a systematic method of determining
a best critical region.
Theorem 6.3.2. Let (X1 , . . . , Xn ) be a sample from a distribution that has density f (x; θ).
Then the likelihood of X1 , X2 , . . . , Xn is
\[
L(x;\theta) = \prod_{i=1}^{n} f(x_i;\theta), \quad\text{for } x = (x_1,\dots,x_n).
\]
Let θ0 and θ1 be distinct fixed values of θ so that Θ = {θ0 , θ1 }, and let k be a positive number. Let C be a subset of the sample space such that
(a) L(x; θ0 )/L(x; θ1 ) ≤ k for each x ∈ C;
(b) L(x; θ0 )/L(x; θ1 ) ≥ k for each x ∈ D\C, where D denotes the sample space;
(c) α = Pθ0 [(X1 , . . . , Xn ) ∈ C].
Then C is a best critical region of size α for testing the simple hypothesis
H0 : θ = θ0 versus H1 : θ = θ1 .
Proof. We prove the theorem when the random variables are of the continuous type. If A is an-
other critical region of size α, we will show that
\[
\int_C L(x;\theta_1)\,dx \ \ge\ \int_A L(x;\theta_1)\,dx.
\]
Write C as the disjoint union of C ∩ A and C ∩ Ac , and A as the disjoint union of A ∩ C and A ∩ C c . We have
\[
\int_C L(x;\theta_1)\,dx - \int_A L(x;\theta_1)\,dx
= \int_{C\cap A^c} L(x;\theta_1)\,dx - \int_{A\cap C^c} L(x;\theta_1)\,dx
\ge \frac{1}{k}\int_{C\cap A^c} L(x;\theta_0)\,dx - \frac{1}{k}\int_{A\cap C^c} L(x;\theta_0)\,dx,
\]
where the last inequality follows from conditions (a) and (b). Moreover, we have
\[
\int_{C\cap A^c} L(x;\theta_0)\,dx - \int_{A\cap C^c} L(x;\theta_0)\,dx
= \int_{C} L(x;\theta_0)\,dx - \int_{A} L(x;\theta_0)\,dx = \alpha - \alpha = 0.
\]
Example 6.3.3. Let X = (X1 , . . . , Xn ) denote a random sample from the distribution N (θ, 1). It
is desired to test the simple hypothesis
H0 : θ = 0 versus H1 : θ = 1.
We have
\[
\frac{L(0;x)}{L(1;x)}
= \frac{(2\pi)^{-n/2}\exp\big(-\tfrac12\sum_{i=1}^n x_i^2\big)}{(2\pi)^{-n/2}\exp\big(-\tfrac12\sum_{i=1}^n (x_i-1)^2\big)}
= \exp\Big(-\sum_{i=1}^{n} x_i + \frac{n}{2}\Big).
\]
It follows that a set of the form C = {(x1 , . . . , xn ) : x̄n ≥ c} is a best critical region, where c is a constant that can be determined so that the size of the critical region is α. Since X̄n ∼ N (0, 1/n) under H0 , c can be computed from the normal distribution.
For example, if n = 25, α = 0.05 then c = 0.329. Thus the power of this best test of H0 against H1 at θ = 1 is
\[
\int_{0.329}^{\infty} \frac{1}{\sqrt{2\pi/25}}\exp\Big(-\frac{(x-1)^2}{2/25}\Big)\,dx = 1 - \Phi(-3.355) = 0.999.
\]
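The constant c = 0.329 and the power reported above can be reproduced with a few lines of scipy; this is only a numerical check of the computation in Example 6.3.3:
\begin{verbatim}
from scipy import stats
import numpy as np

n, alpha = 25, 0.05
sd = 1 / np.sqrt(n)                              # standard deviation of Xbar under N(theta, 1)

c = stats.norm.ppf(1 - alpha, loc=0, scale=sd)   # size-alpha cutoff under H0: theta = 0
power = 1 - stats.norm.cdf(c, loc=1, scale=sd)   # P[Xbar >= c] under H1: theta = 1
print(round(c, 3), round(power, 4))              # approximately 0.329 and 0.9996
\end{verbatim}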
We now define a critical region which, when it exists, is a best critical region for testing a simple hypothesis H0 against each simple hypothesis in a composite alternative hypothesis H1 .
Definition 6.3.4. The critical region C is a uniformly most powerful (UMP) critical region of size α for testing the simple hypothesis H0 against an alternative composite hypothesis H1 if the set C is a best critical region of size α for testing H0 against each simple hypothesis in H1 . A test defined by this critical region C is called a uniformly most powerful (UMP) test, with significance level α, for testing the simple hypothesis H0 against the alternative composite hypothesis H1 .
It is well-known that uniformly most powerful tests do not always exist. However, when they
do exist, the Neyman-Pearson theorem provides a technique for finding them.
Example 6.3.5. Let (X1 , X2 , . . . , Xn ) be a random sample from the distribution N (0, θ), where
the variance θ is an unknown positive number. We will show that there exists a uniformly most
powerful test with significance level α for testing
H0 : θ = θ0 versus H1 : θ > θ0 .
Let θ′ be a number greater than θ0 and let k denote a positive number. Let C be the set of points where
\[
\frac{L(\theta_0;x)}{L(\theta';x)} \le k
\quad\Leftrightarrow\quad
\sum_{i=1}^{n} x_i^2 \ \ge\ \frac{2\theta_0\theta'}{\theta'-\theta_0}\Big(\frac{n}{2}\ln\frac{\theta'}{\theta_0} - \ln k\Big) = c.
\]
The set C = {(x1 , . . . , xn ) : Σni=1 x2i ≥ c} is then a best critical region for our testing problem. It remains to determine c so that the critical region has size α. This can be done using the observation that (1/θ0 ) Σni=1 Xi2 has a χ2 -distribution with n degrees of freedom. Note that for each number θ′ > θ0 , the foregoing argument holds. It means that C is a uniformly most powerful critical region of size α.
In conclusion, if Σni=1 Xi2 ≥ c, H0 is rejected at the significance level α and H1 is accepted; otherwise, H0 is accepted.
Example 6.3.6. Let (X1 , . . . , Xn ) be a sample from the normal distribution N (a, 1), where a is
unknown. We will show that there is no uniformly most powerful test of the simple hypothesis
H0 : a = a0 versus H1 : a 6= a0 .
However, if the alternative composite hypothesis is either H1 : a > a0 or H1 : a < a0 , a uniformly
most powerful test will exist in each instance.
Let a1 be a number not equal to a0 . Let k be a positive number and consider
\[
\frac{(2\pi)^{-n/2}\exp\big(-\tfrac12\sum_{i=1}^n (x_i-a_0)^2\big)}{(2\pi)^{-n/2}\exp\big(-\tfrac12\sum_{i=1}^n (x_i-a_1)^2\big)} \le k
\quad\Leftrightarrow\quad
(a_1 - a_0)\sum_{i=1}^{n} x_i \ \ge\ \frac{n}{2}\,(a_1^2 - a_0^2) - \ln k.
\]
In this section we introduce general forms of uniformly most powerful tests for these hypotheses
when the sample has a so called monotone likelihood ratio.
Definition 6.3.7. Let X = (X1 , . . . , Xn ) be a random sample with common pdf (or pmf) f (x; θ), θ ∈ Θ. We say that its likelihood function L(x; θ) = Πni=1 f (xi ; θ) has a monotone likelihood ratio in the statistic y = u(x) if, for θ1 < θ2 , the ratio L(x; θ1 )/L(x; θ2 ) is a monotone function of y = u(x).
Theorem 6.3.8. Assume that L(x; θ) has a monotone decreasing likelihood ratio in the statistic
y = u(x). The following test is uniformly most powerful of level α for the hypotheses (6.2):
In case L(x; θ) has a monotone increasing likelihood ratio in the statistic y = u(x) we can
construct a uniformly most powerful test in a similar way.
Proof. We first consider the simple null hypothesis H0′ : θ = θ0 . Let θ1 > θ0 be arbitrary but fixed. Let C denote the most powerful critical region for θ0 versus θ1 . By the Neyman-Pearson Theorem, C is defined by
\[
\frac{L(X;\theta_0)}{L(X;\theta_1)} \le k \quad\text{if and only if}\quad X \in C.
\]
Since the likelihood ratio is monotone decreasing in u(X),
\[
\frac{L(X;\theta_0)}{L(X;\theta_1)} = g(u(X)) \le k \quad\Leftrightarrow\quad u(X) \ge g^{-1}(k),
\]
where g(u(x)) = L(x; θ0 )/L(x; θ1 ). Since α = Pθ0 [u(X) ≥ g −1 (k)], we have c = g −1 (k). Hence, the Neyman-Pearson test is equivalent to the test defined by (6.3). Moreover, the test is uniformly most powerful for θ0 versus θ1 > θ0 because it does not depend on the particular θ1 > θ0 and g −1 (k) is uniquely determined under θ0 .
Let γ(θ) denote the power function of the test (6.3). For any θ′ < θ′′ , the test (6.3) is the most powerful test for testing θ′ versus θ′′ at level γ(θ′ ), so we have γ(θ′′ ) ≥ γ(θ′ ). Hence γ(θ) is a nondecreasing function. This implies maxθ≤θ0 γ(θ) = α.
Example 6.3.9. Let X1 , . . . , Xn be a random sample from a Bernoulli distribution with parameter
p = θ, where 0 < θ < 1. Let θ0 < θ1 . Consider the ratio of likelihood,
\[
\frac{L(x_1,\dots,x_n;\theta_0)}{L(x_1,\dots,x_n;\theta_1)}
= \Big(\frac{\theta_0(1-\theta_1)}{\theta_1(1-\theta_0)}\Big)^{\sum x_i}\Big(\frac{1-\theta_0}{1-\theta_1}\Big)^{n}.
\]
By Theorem 6.3.8, the uniformly most powerful level α decision rule is given by
Reject H0 if Y = Σni=1 Xi ≥ c,
In this section, we will assume that a random sample X1 , X2 , . . . , Xn has been taken from a
normal N (µ, σ 2 ) population. It is known that X̄ is an unbiased point estimator of µ.
Null hypothesis: H0 : µ = µ0
Test statistic: Z0 = (X̄ − µ0 )/(σ/√n)
Example 6.4.1. The following data give the score of 10 students in a certain exam.
75 64 75 65 72 80 71 68 78 62.
Assume that the score is normally distributed with mean µ and known variance σ 2 = 36. Test the following hypotheses at the 0.05 level of significance and find the P -value of each test:
(a) H0 : µ = 70 against H1 : µ ≠ 70;
(b) H0 : µ = 75 against H1 : µ < 75.
Solution: (a) We may solve the problem by following the six-step procedure as follows.
6. Since |Z0 | < zα/2 we do not reject H0 : µ = 70 in favour of H1 : µ ≠ 70 at the 0.05 level of significance. That is, based on a sample of 10 measurements, there is no significant evidence that the mean score differs from 70.
The P -value of this test is 2(1 − Φ(|Z0 |)) = 2(1 − Φ(0.5270)) = 0.598.
(b)
6. Since Z0 < −zα we reject H0 : µ = 75 in favour of H1 : µ < 75 at the 0.05 level of significance. That is, based on a sample of 10 measurements, we conclude that the mean score is less than 75.
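The arithmetic behind this example is easy to reproduce; the following sketch recomputes Z0 and the two P -values for the score data, using σ = 6 and the hypothesized means from parts (a) and (b):
\begin{verbatim}
import numpy as np
from scipy import stats

scores = np.array([75, 64, 75, 65, 72, 80, 71, 68, 78, 62])
sigma, n = 6.0, scores.size
xbar = scores.mean()                               # 71.0

# part (a): H0: mu = 70 vs H1: mu != 70
z_a = (xbar - 70) / (sigma / np.sqrt(n))
p_a = 2 * (1 - stats.norm.cdf(abs(z_a)))           # two-sided P-value, approximately 0.598

# part (b): H0: mu = 75 vs H1: mu < 75
z_b = (xbar - 75) / (sigma / np.sqrt(n))
p_b = stats.norm.cdf(z_b)                          # one-sided (lower tail) P-value

print(round(z_a, 4), round(p_a, 3), round(z_b, 4), round(p_b, 4))
\end{verbatim}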
There is a close relationship between the test of a hypothesis about any parameter, say θ, and
the confidence interval for θ. If [l, u] is a 100(1 − α)% confidence interval for the parameter θ, the
test of size α of the hypothesis
H0 : θ = θ0 , H1 : θ 6= θ0
will lead to rejection of H0 if and only if θ0 is not in the 100(1 − α)% confidence interval [l, u].
Although hypothesis tests and CIs are equivalent procedures insofar as decision making or
inference about µ is concerned, each provides somewhat different insights. For instance, the
confidence interval provides a range of likely values for µ at a stated confidence level, whereas
hypothesis testing is an easy framework for displaying the risk levels such as the P -value associ-
ated with a specific decision.
In testing hypotheses, the analyst directly selects the type I error probability. However, the
probability of type II error β depends on the choice of sample size. In this section, we will show
how to calculate the probability of type II error β. We will also show how to select the sample size
to obtain a specified value of β.
In the following we will derive the formula for β of the two-sided test. The ones for one-sided
tests can be derived in a similar way and we leave it as an exercise for the reader.
Finding the probability of type II error β: Consider the two-sided hypotheses
H0 : µ = µ0 , H1 : µ ≠ µ0 .
Suppose the null hypothesis is false and that the true value of the mean is µ = µ0 + δ for some δ. The test statistic Z0 is then
\[
Z_0 = \frac{\bar X - \mu_0}{\sigma/\sqrt n} = \frac{\bar X - (\mu_0+\delta)}{\sigma/\sqrt n} + \frac{\delta\sqrt n}{\sigma} \ \sim\ N\Big(\frac{\delta\sqrt n}{\sigma},\ 1\Big).
\]
Hence the probability of a type II error is
\[
\beta = P\big(-z_{\alpha/2} \le Z_0 \le z_{\alpha/2}\big)
= \Phi\Big(z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big) - \Phi\Big(-z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big). \tag{6.4}
\]
Sample size formula. There is no closed-form solution for n from equation (6.4). However, we can estimate n as follows.
Case 1: If δ > 0, then Φ(−zα/2 − δ√n/σ) ≈ 0, and
\[
\beta \approx \Phi\Big(z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big)
\quad\Leftrightarrow\quad
n \approx \frac{(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{\delta^2}.
\]
Case 2: If δ < 0, then Φ(zα/2 − δ√n/σ) ≈ 1, and
\[
\beta \approx 1 - \Phi\Big(-z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big)
\quad\Leftrightarrow\quad
n \approx \frac{(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{\delta^2}.
\]
Therefore, the sample size formula is
\[
n \approx \frac{(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{\delta^2}.
\]
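This sample-size approximation is a one-liner; the sketch below evaluates it for an arbitrarily chosen detectable shift δ, noise level σ, and error probabilities:
\begin{verbatim}
import math
from scipy import stats

def sample_size(delta, sigma, alpha=0.05, beta=0.10):
    """Approximate n for a two-sided z-test with type I error alpha and type II error beta."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(1 - beta)
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# e.g. detect a shift of 2 units when sigma = 8, with alpha = 0.05 and beta = 0.10
print(sample_size(delta=2.0, sigma=8.0))
\end{verbatim}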
Large-sample test
We have developed the test procedure for the null hypothesis H0 : µ = µ0 assuming that the
population is normally distributed and that σ 2 is known. In many if not most practical situations
σ 2 will be unknown. Furthermore, we may not be certain that the population is well modeled
by a normal distribution. In these situations if n is large (say n > 40) the sample variance s2
can be substituted for σ 2 in the test procedures with little effect. Thus, while we have given a
test for the mean of a normal distribution with known σ 2 , it can be easily converted into a large-
sample test procedure for unknown σ 2 that is valid regardless of the form of the distribution of
the population. This large-sample test relies on the central limit theorem just as the large-sample confidence interval on µ presented in the previous chapter did. Exact treatment of the case where the population is normal, σ 2 is unknown, and n is small involves use of the t distribution and is deferred to the next section.
We assume again that a random sample X1 , X2 , . . . , Xn has been taken from a normal N (µ, σ 2 )
population. Recall that X and s(X)2 are sample mean and sample variance of the random sam-
ple X1 , X2 , . . . , Xn , respectively. It is known that
\[
t_{n-1} = \frac{\bar X - \mu}{s(X)/\sqrt n}
\]
has a t distribution with n − 1 degrees of freedom. This fact leads to the following test on the mean µ.
Null hypothesis: H0 : µ = µ0
Test statistic: T0 = (X̄ − µ0 )/(s(X)/√n)
Because the t-table in the Appendix contains a few critical values for each t distribution, com-
putation of the exact P -value directly from the table is usually impossible. However, it is easy to
find upper and lower bounds on the P -value from this table.
Example 6.4.2. The following data give the IQ score of 10 students.
112 116 115 120 118 125 118 113 117 121.
Suppose that the IQ score is normally distributed N(µ, σ 2 ), test the following hypotheses at the
0.05 level of significance and estimate the P -value of each test.
(a) H0 : µ = 115 against H1 : µ 6= 115.
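Part (a) can be carried out directly with scipy's one-sample t-test; a sketch using the IQ data above:
\begin{verbatim}
import numpy as np
from scipy import stats

iq = np.array([112, 116, 115, 120, 118, 125, 118, 113, 117, 121])

# part (a): H0: mu = 115 vs H1: mu != 115
t0, p_value = stats.ttest_1samp(iq, popmean=115)
print(round(t0, 3), round(p_value, 3), p_value < 0.05)
\end{verbatim}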
When the true value of the mean is µ = µ0 + δ, the distribution of T0 is called the non-central t distribution with n − 1 degrees of freedom and non-centrality parameter δ√n/σ. Therefore, the type II error of the two-sided alternative would be
β = P(|T0′ | ≤ tα/2,n−1 ),
where T0′ denotes a random variable having this non-central t distribution.
We assume that a random sample X1 , X2 , . . . , Xn has been taken from a normal N (µ, σ 2 ) population. Since (n − 1)s2 (X)/σ 2 follows the chi-square distribution with n − 1 degrees of freedom, we obtain the following test for the value of σ 2 .
Null hypothesis: H0 : σ 2 = σ02
Test statistic: χ20 = (n − 1)s2 (X)/σ02
Example 6.4.3. An automatic filling machine is used to fill bottles with liquid detergent. A ran-
dom sample of 20 bottles results in a sample variance of fill volume of s2 = 0.0153 (fluid ounces)2 .
If the variance of fill volume exceeds 0.01 (fluid ounces)2 , an unacceptable proportion of bottles
will be underfilled or overfilled. Is there evidence in the sample data to suggest that the manu-
facturer has a problem with underfilled or overfilled bottles? Use α = 0.05, and assume that fill
volume has a normal distribution.
Solution
6. Since χ20 < cα,19 , we conclude that there is no strong evidence that the variance of fill vol-
ume exceeds 0.01 (fluid ounces)2 .
Since $P(\chi^2_{19} > 27.2) = 0.10$ and $P(\chi^2_{19} > 30.4) = 0.05$, we conclude that the P -value of the test is in the interval (0.05, 0.10). Note that the actual P -value is 0.0649.
Let (X1 , . . . , Xn ) be a random sample observed from a random variable X with B(1, p) distribution. Then p̂ = X̄ is a point estimator of p. By the Central limit theorem, when n is large, p̂ is approximately normal with mean p and variance p(1 − p)/n. We thus obtain the following test for the value of p.
6. Since Z0 < −zα , we reject H0 and conclude that the process fraction defective p is less than 0.05. The P -value for this value of the test statistic is Φ(−1.947) = 0.0256, which is less than α = 0.05. We conclude that the process is capable.
Suppose that p is the true value of the population proportion. The approximate β-error is
defined as follows
These equations can be solved to find the approximate sample size n that gives a test of level α
that has a specified β risk. The sample size is defined as follows.
6.5.1 Inference for a difference in means of two normal distributions, variances known
\[
Z = \frac{\bar X_1 - \bar X_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \ \sim\ N(0,1).
\]
Null hypothesis: H0 : µ1 − µ2 = ∆0
Test statistic:
\[
Z_0 = \frac{\bar X_1 - \bar X_2 - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
\]
When the population variances are unknown, the sample variances s21 and s22 can be substi-
tuted into the test statistic Z0 to produce a large-sample test for the difference in means. This
procedure will also work well when the populations are not necessarily normally distributed.
However, both n1 and n2 should exceed 40 for this large-sample test to be valid.
Example 6.5.2. A product developer is interested in reducing the drying time of a primer paint.
Two formulations of the paint are tested; formulation 1 is the standard chemistry, and formu-
lation 2 has a new drying ingredient that should reduce the drying time. From experience, it
is known that the standard deviation of drying time is 8 minutes, and this inherent variability
should be unaffected by the addition of the new ingredient. Ten specimens are painted with for-
mulation 1, and another 10 specimens are painted with formulation 2; the 20 specimens are
painted in random order. The two sample average drying times are X 1 = 121 minutes and
X 2 = 112 minutes, respectively. What conclusions can the product developer draw about the
effectiveness of the new ingredient, using α = 0.05?
Solution:
6. Since Z0 = (121 − 112)/√(64/10 + 64/10) ≈ 2.52 > z0.05 = 1.645, we reject H0 at the α = 0.05 level and conclude that adding the new ingredient to the paint significantly reduces the drying time.
6.5.2 Inference for the difference in means of two normal distributions, variances
unknown
Suppose we have two independent normal populations with unknown means µ1 and µ2 , and
unknown but equal variances σ 2 . Assume that assumptions (6.5) hold.
The pooled estimator of σ 2 , denoted by Sp2 , is defined by
\[
S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 + n_2 - 2}.
\]
Theorem 6.5.3. Under all the assumptions mentioned above, the quantity
\[
T = \frac{\bar X_1 - \bar X_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
\]
has a t distribution with n1 + n2 − 2 degrees of freedom.
Null hypothesis: H0 : µ1 − µ2 = ∆0
Test statistic:
\[
T_0 = \frac{\bar X_1 - \bar X_2 - \Delta_0}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
\]
Example 6.5.4. The IQ’s of 9 children in a district of a large city have empirical mean 107 and
standard deviation 10. The IQs of 12 children in another district have empirical mean 112 and
standard deviation 9. Test the equality of means at the 0.05 level of significance.
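Example 6.5.4 can be checked from the summary statistics alone; scipy provides ttest_ind_from_stats for exactly this situation (equal variances assumed, as in the pooled test above):
\begin{verbatim}
from scipy import stats

# district 1: n = 9, mean 107, sd 10;  district 2: n = 12, mean 112, sd 9
t0, p_value = stats.ttest_ind_from_stats(
    mean1=107, std1=10, nobs1=9,
    mean2=112, std2=9, nobs2=12,
    equal_var=True,
)
print(round(t0, 3), round(p_value, 3), p_value < 0.05)
\end{verbatim}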
Example 6.5.5. Two catalysts are being analyzed to determine how they affect the mean yield of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable. Since catalyst 2 is cheaper, it should be adopted, provided it does not change the process yield. A test is run in the pilot plant and results in the data shown in the following table.
Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.
In some situations, we cannot reasonably assume that the unknown variances σ12 and σ22 are equal. There is not an exact t-statistic available for testing H0 : µ1 − µ2 = ∆0 in this case. However, if H0 is true, the statistic
\[
T_0^* = \frac{\bar X_1 - \bar X_2 - \Delta_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
\]
is distributed approximately as t with degrees of freedom given by
\[
\nu = \frac{\Big(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\Big)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}.
\]
Therefore, if σ12 ≠ σ22 , the hypotheses on differences in the means of two normal distributions are tested as in the equal variances case, except that T0∗ is used as the test statistic and n1 + n2 − 2 is replaced by ν in determining the degrees of freedom for the test.
A special case of the two-sample t-tests of previous section occurs when the observations on
the two populations of interest are collected in pairs. Each pair of observations, say (Xj , Yj ), is
taken under homogeneous conditions, but these conditions may change from one pair to an-
other. For example, suppose that we are interested in comparing two different types of tips for
a hardness-testing machine. This machine presses the tip into a metal specimen with a known
force. By measuring the depth of the depression caused by the tip, the hardness of the specimen
can be determined. If several specimens were selected at random, half tested with tip 1, half tested with tip 2, and the pooled or independent t-test of the previous section was applied, the results of
the test could be erroneous. The metal specimens could have been cut from bar stock that was
produced in different heats, or they might not be homogeneous in some other way that might
affect hardness. Then the observed difference between mean hardness readings for the two tip
types also includes hardness differences between specimens.
A more powerful experimental procedure is to collect the data in pairs - that is, to make two
hardness readings on each specimen, one with each tip. The test procedure would then consist of
analyzing the differences between hardness readings on each specimen. If there is no difference
between tips, the mean of the differences should be zero. This test procedure is called the paired
t-test.
Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be a set of n paired observations where we assume that the mean and variance of the population represented by X are µX and σX2 , and the mean and variance of the population represented by Y are µY and σY2 . Define the differences between each pair of observations as Dj = Xj − Yj , j = 1, 2, . . . , n. The Dj 's are assumed to be normally distributed with mean µD = µX − µY and variance σD2 , so testing hypotheses about the difference between µX and µY can be accomplished by performing a one-sample t-test on the Dj 's.
Null hypothesis: H0 : µD = ∆0
Test statistic: T0 = (D̄ − ∆0 )/(SD /√n)
Example 6.5.6. An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares several
methods for predicting the shear strength for steel plate girders. Data for two of these methods,
the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are shown in the
following table.
Karlsruhe Method 1.186 1.151 1.322 1.339 1.200 1.402 1.365 1.537 1.559
Lehigh Method 1.061 0.992 1.063 1.062 1.065 1.178 1.037 1.086 1.052
Difference Dj 0.119 0.159 0.259 0.277 0.138 0.224 0.328 0.451 0.507
Test whether there is any difference (on the average) between the two methods?
Solution:
D̄ = 0.2736, SD2 = 0.018349, T0 = 6.05939, t0.025,8 = 2.306.
We conclude that there is a difference between the two methods at the 0.05 level of significance.
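The paired analysis above is a one-sample t-test on the differences; scipy's ttest_rel reproduces it directly from the two rows of the table:
\begin{verbatim}
import numpy as np
from scipy import stats

karlsruhe = np.array([1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559])
lehigh    = np.array([1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052])

t0, p_value = stats.ttest_rel(karlsruhe, lehigh)   # paired t-test on D_j = X_j - Y_j
print(round(t0, 3), round(p_value, 5), p_value < 0.05)
\end{verbatim}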
A hypothesis-testing procedure for the equality of two variances is based on the following
result.
Theorem 6.5.7. Let X11 , X12 , . . . , X1n1 be a random sample from a normal population with mean
µ1 and variance σ12 and let X21 , X22 , . . . , X2n2 be a random sample from a second normal popula-
tion with mean µ2 and variance σ22 . Assume that both normal populations are independent. Let
s21 and s22 be the sample variances. Then the ratio
\[
F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}
\]
has an F distribution with n1 − 1 and n2 − 1 degrees of freedom.
This result is based on the fact that (n1 − 1)s21 /σ12 is a chi-square random variable with n1 − 1 degrees of freedom, that (n2 − 1)s22 /σ22 is a chi-square random variable with n2 − 1 degrees of freedom, and that the two normal populations are independent. Clearly under the null hypothesis H0 : σ12 = σ22 , the ratio F0 = s21 /s22 has an Fn1 −1,n2 −1 distribution. Let fα,n1 −1,n2 −1 be a constant satisfying
\[
P[F_0 > f_{\alpha,\, n_1-1,\, n_2-1}] = \alpha.
\]
Example 6.5.8. Oxide layers on semiconductor wafers are etched in a mixture of gases to achieve
the proper thickness. The variability in the thickness of these oxide layers is a critical character-
istic of the wafer, and low variability is desirable for subsequent processing steps. Two different
mixtures of gases are being studied to determine whether one is superior in reducing the variabil-
ity of the oxide thickness. Twenty wafers are etched in each gas. The sample standard deviations
of oxide thickness are s1 = 1.96 angstroms and s2 = 2.13 angstroms, respectively. Is there any
evidence to indicate that either gas is preferable? Use α = 0.05.
Solution: At the α = 0.05 level of significance we need to test
level of significance. Therefore, there is no strong evidence to indicate that either gas results in a
smaller variance of oxide thickness.
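The computation for Example 6.5.8 amounts to comparing F0 = s12 /s22 with two F quantiles; a sketch with scipy, using the standard deviations reported above:
\begin{verbatim}
from scipy import stats

s1, s2, n1, n2, alpha = 1.96, 2.13, 20, 20, 0.05

f0 = s1**2 / s2**2
f_upper = stats.f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)
f_lower = stats.f.ppf(alpha / 2, dfn=n1 - 1, dfd=n2 - 1)

reject = (f0 > f_upper) or (f0 < f_lower)
print(round(f0, 3), (round(f_lower, 3), round(f_upper, 3)), reject)
\end{verbatim}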
We now consider the case where there are two binomial parameters of interest, say, p1 and p2 ,
and we wish to draw inferences about these proportions. We will present large-sample hypothe-
sis testing based on the normal approximation to the binomial.
Suppose that two independent random samples of sizes n1 and n2 are taken from two pop-
ulations, and let X1 and X2 represent the number of observations that belong to the class of
interest in samples 1 and 2, respectively. Furthermore, suppose that the normal approximation
to the binomial is applied to each population, so the estimators of the population proportions
P̂1 = X1 /n1 and P̂2 = X2 /n2 have approximate normal distribution. Moreover, under the null
hypothesis H0 : p1 = p2 = p, the random variable
\[
Z = \frac{\hat P_1 - \hat P_2}{\sqrt{p(1-p)\Big(\dfrac{1}{n_1} + \dfrac{1}{n_2}\Big)}}
\]
has approximately a standard normal distribution.
Null hypothesis: H0 : p1 = p2
Test statistic:
\[
Z_0 = \frac{\hat P_1 - \hat P_2}{\sqrt{\hat p(1-\hat p)\Big(\dfrac{1}{n_1} + \dfrac{1}{n_2}\Big)}}, \qquad \hat p = \frac{X_1 + X_2}{n_1 + n_2}.
\]
Example 6.5.9. Two different types of polishing solution are being evaluated for possible use in a
tumble-polish operation for manufacturing interocular lenses used in the human eye following
cataract surgery. Three hundred lenses were tumble-polished using the first polishing solution, and of this number 253 had no polishing-induced defects. Another 300 lenses were tumble-polished using the second polishing solution, and 196 lenses were satisfactory upon completion.
Is there any reason to believe that the two polishing solutions differ? Use α = 0.01.
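A sketch of the two-proportion z statistic for Example 6.5.9, computed directly from the counts given in the example:
\begin{verbatim}
import numpy as np
from scipy import stats

x1, n1 = 253, 300     # defect-free lenses, solution 1
x2, n2 = 196, 300     # defect-free lenses, solution 2

p1_hat, p2_hat = x1 / n1, x2 / n2
p_hat = (x1 + x2) / (n1 + n2)                     # pooled estimate under H0: p1 = p2

z0 = (p1_hat - p2_hat) / np.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
p_value = 2 * (1 - stats.norm.cdf(abs(z0)))       # two-sided P-value

print(round(z0, 2), p_value, p_value < 0.01)
\end{verbatim}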
Suppose that a large population consists of items of k different types, and let pi denote the probability that an item selected at random will be of type i (i = 1, . . . , k). Suppose that the following hypothesis is to be tested:
H0 : pi = p0i for all i = 1, . . . , k, versus H1 : pi ≠ p0i for at least one i,
where p01 , . . . , p0k are specified probabilities. We shall assume that a random sample of size n is to be taken from the given population. That is, n independent observations are to be taken, and there is probability pi that each observation will be of type i (i = 1, . . . , k). On the basis of these n observations, the hypothesis is to be tested. For each i, we denote by Ni the number of observations in the random sample that are of type i.
Pearson's theorem asserts that the statistic
\[
Q = \sum_{i=1}^{k}\frac{(N_i - n\,p_i^0)^2}{n\,p_i^0}
\]
has the property that if H0 is true and the sample size n → ∞, then Q converges in distribution to the χ2 distribution with k − 1 degrees of freedom.
Suppose that we observe an i.i.d. sample X1 , . . . , Xn of random variables that take a finite number of values B1 , . . . , Bk with unknown probabilities p1 , . . . , pk . Consider the hypothesis
H0 : pi = p0i for all i = 1, . . . , k versus H1 : pi ≠ p0i for some i.
Under H0 , Pearson's statistic Q = Σki=1 (Ni − np0i )2 /(np0i ) is approximately χ2k−1 , where Ni is the number of Xj equal to Bi . On the other hand, if H1 holds, then for some index i∗ , pi∗ ≠ p0i∗ . We write
\[
\frac{N_{i^*} - n p^0_{i^*}}{\sqrt{n p^0_{i^*}}}
= \sqrt{\frac{p_{i^*}}{p^0_{i^*}}}\cdot\frac{N_{i^*} - n p_{i^*}}{\sqrt{n p_{i^*}}}
+ \sqrt n\,\frac{p_{i^*} - p^0_{i^*}}{\sqrt{p^0_{i^*}}}.
\]
The first term converges to N(0, (1 − pi∗ )pi∗ /p0i∗ ) by the central limit theorem while the second term diverges to plus or minus infinity. It means that if H1 holds then Q → ∞. Therefore, we will reject H0 if Q ≥ cα,k−1 , where cα,k−1 is chosen such that the type I error equals the level of significance α:
\[
\alpha = P_0(Q > c_{\alpha,k-1}) \approx P(\chi^2_{k-1} > c_{\alpha,k-1}).
\]
This test is called the chi-squared goodness-of-fit test.
Example 6.6.2. A study of blood types among a sample of 6004 people gives the following result
Blood type A B AB O
Number of people 2162 738 228 2876.
A previous study claimed that the proportions of people whose blood is of type A, B, AB and O are 33.33%, 12.5%, 4.17% and 50%, respectively.
We can use the data in Table 6.2 to test the null hypothesis H0 that the probabilities (p1 , p2 , p3 , p4 ) of the four blood types equal (1/3, 1/8, 1/24, 1/2). The χ2 test statistic is then
\[
Q = \frac{(2162 - 6004\cdot\frac13)^2}{6004\cdot\frac13}
+ \frac{(738 - 6004\cdot\frac18)^2}{6004\cdot\frac18}
+ \frac{(228 - 6004\cdot\frac1{24})^2}{6004\cdot\frac1{24}}
+ \frac{(2876 - 6004\cdot\frac12)^2}{6004\cdot\frac12} = 20.37.
\]
To test H0 at the level α0 , we would compare Q to the 1 − α0 quantile of the χ2 distribution
with three degrees of freedom. Alternatively, we can compute the P -value, which would be the
smallest α0 at which we could reject H0 . In general, the P -value is 1 − F (Q) where F is the
cumulative distribution function of the χ2 distribution with k − 1 degrees of freedom. In this
example k = 4 and Q = 20.37 then the p-value is 1.42 × 10−4 .
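scipy's chisquare function reproduces both the statistic Q = 20.37 and the P -value for the blood-type data:
\begin{verbatim}
import numpy as np
from scipy import stats

observed = np.array([2162, 738, 228, 2876])
probs = np.array([1/3, 1/8, 1/24, 1/2])
expected = observed.sum() * probs

q, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(round(q, 2), p_value)      # approximately 20.37 and 1.4e-4
\end{verbatim}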
Let X1 , . . . , Xn be an i.i.d. sample from unknown distribution P and consider the following
hypotheses
H0 : P = P0 vs H1 : P 6= P0
for some particular, possibly continuous distribution P0 . To do this, we will split a set of all pos-
sible outcomes of Xi , say X, into a finite number of intervals I1 , . . . , Ik . The null hypothesis H0
implies that for all intervals
P(X ∈ Ij ) = P0 (X ∈ Ij ) = p0j ;
call this hypothesis H0′ . It is clear that H0 implies H0′ . However, the fact that H0′ holds does not guarantee that H0 holds: there are many distributions different from P0 that assign the same probabilities to the intervals I1 , . . . , Ik as P0 . On the one hand, if we group into more and more intervals, our discrete approximation of P will get closer and closer to P , so in some sense H0′ will get closer to H0 . However, we cannot split into too many intervals either, because the χ2k−1 -distribution approximation for the statistic Q in Pearson's theorem is asymptotic. The rule of thumb is to group the data in such a way that the expected count np0i in each interval is at least 5.
Example 6.6.3. Suppose that we wish to test the null hypothesis that the logarithms of the life-
time of ball bearings are an i.i.d. sample from the normal distribution with mean ln(50) = 3.912
and variance 0.25. The observed logarithms are
The number of observations in each of the four intervals is then 3, 4, 8 and 8. We then calculate Q = 3.609.
Our table of the χ2 distribution with three degrees of freedom indicates that 3.609 is between the 0.6 and 0.7 quantiles, so we would not reject the null hypothesis at levels less than 0.3 and would reject the null hypothesis at levels greater than 0.4. (Actually, the P -value is 0.307.)
We can extend the goodness-of-fit test to deal with the case in which the null hypothesis
is that the distribution of our data belongs to a particular parametric family. The alternative hy-
pothesis is that the data have a distribution that is not a member of that parametric family. There
are two changes to the test procedure in going from the case of a simple null hypothesis to the
case of a composite null hypothesis. First, in the test statistic Q, the probabilities p0i are replaced
by estimated probabilities based on the parametric family. Second, the degrees of freedom are
reduced by the number of parameters.
Let us start with a discrete case when a random variable takes a finite number of values
B1 , . . . , Bk and
pi = P(X = Bi ), i = 1, . . . , k.
We would like to test a hypothesis that this distribution comes from a family of distributions
{Pθ : θ ∈ Θ}. In other words, if we denote
pj (θ) = Pθ (X = Bj ),
we want to test
H0 : pj = pj (θ) for all j = 1, . . . , k, for some θ ∈ Θ, versus H1 : otherwise.
The situation now is more complicated, since we want to test whether pj = pj (θ), j = 1, . . . , k, at least for some θ ∈ Θ, which means that we may have many candidates for θ. One way to approach this problem is as follows.
Step 1: Assuming that hypothesis H0 holds, we can find an estimator θ∗ of this unknown θ.
Step 2: Try to test if, indeed, the distribution P is equal to Pθ∗ by using the statistic
\[
Q^* = \sum_{i=1}^{k}\frac{(N_i - n\,p_i(\theta^*))^2}{n\,p_i(\theta^*)}.
\]
This approach looks natural; the only question is which estimate θ∗ to use and how the fact that θ∗ also depends on the data affects the convergence of Q∗. It turns out that if we let θ∗ be the maximum likelihood estimate, i.e. the value of θ that maximizes the grouped likelihood function
p1(θ)^{N1} · · · pk(θ)^{Nk},
then Q∗ converges in distribution to χ²_{k−s−1}, where s is the dimension of the parameter set Θ. We then reject H0 if Q∗ > c, where the threshold c is determined from the condition P(χ²_{k−s−1} > c) = α.
Example 6.6.4. Suppose that a gene has two possible alleles A1 and A2 and the combinations of
these alleles define three genotypes A1 A1 , A1 A2 and A2 A2 . We want to test a theory that
Probability to pass A1 to a child = θ,
Probability to pass A2 to a child = 1 − θ,
so that the genotype probabilities are
p1 (θ) = P(A1 A1 ) = θ2
p2 (θ) = P(A1 A2 ) = 2θ(1 − θ)
p3 (θ) = P(A2 A2 ) = (1 − θ)2 .
Suppose that given a random sample X1 , . . . , Xn from the population the counts of each geno-
type are N1 , N2 and N3 . To test the theory we want to test the hypothesis
H0 : pi = pi (θ), i = 1, 2, 3 vs H1 : otherwise.
First of all, the dimension of the parameter set is s = 1 since the distributions are determined by
one parameter θ. To find the MLE θ∗ we have to maximize the (grouped) likelihood function
p1(θ)^{N1} p2(θ)^{N2} p3(θ)^{N3} = θ^{2N1} (2θ(1 − θ))^{N2} (1 − θ)^{2N3}.
Setting the derivative of its logarithm equal to 0 and solving for θ, we get
θ∗ = (2N1 + N2) / (2n).
Therefore, under the null hypothesis H0 the statistic
Q∗ = Σ_{i=1}^{3} (Ni − n pi(θ∗))² / (n pi(θ∗))
converges in distribution to χ²_{k−s−1} = χ²_{3−1−1} = χ²_1.
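As an illustration, the following sketch carries out these two steps for hypothetical genotype counts (the text does not give numerical data here; N1, N2, N3 below are made-up values):

```python
# Composite goodness-of-fit test for the genotype model of Example 6.6.4.
from scipy.stats import chi2

N = [45, 40, 15]                       # hypothetical counts of A1A1, A1A2, A2A2
n = sum(N)
theta = (2 * N[0] + N[1]) / (2 * n)    # Step 1: MLE of theta under H0

p = [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]
expected = [n * pi for pi in p]

Q_star = sum((Ni - ei) ** 2 / ei for Ni, ei in zip(N, expected))   # Step 2
p_value = chi2.sf(Q_star, df=3 - 1 - 1)                            # k - s - 1 = 1
print(theta, Q_star, p_value)
```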
In the case when the distributions Pθ are continuous or, more generally, have an infinite number of values that must be grouped in order to use the chi-squared test (for example, the normal or Poisson distributions), it can be a difficult numerical problem to maximize the grouped likelihood function
Pθ(I1)^{N1} · · · Pθ(Ik)^{Nk}.
It is tempting to use the usual non-grouped MLE θ̂ of θ instead of the above θ∗ because it is often easier to compute; in fact, for many distributions we know explicit formulas for these MLEs. However, if we use θ̂ in the statistic
Q̂ = Σ_{i=1}^{k} (Ni − n pi(θ̂))² / (n pi(θ̂)),
then it will no longer converge to the χ²_{k−s−1} distribution. It has been shown¹ that typically this Q̂ converges to a distribution “in between” χ²_{k−s−1} and χ²_{k−1}. Thus, a conservative decision rule is to reject H0 if Q̂ > c, where c is chosen such that P(χ²_{k−1} > c) = α.
Example 6.6.5 (Testing Whether a Distribution Is Normal). Consider now a problem in which
a random sample X1 , ..., Xn is taken from some continuous distribution for which the p.d.f. is
unknown, and it is desired to test the null hypothesis H0 that this distribution is a normal dis-
tribution against the alternative hypothesis H1 that the distribution is not normal. To perform a
χ2 test of goodness-of-fit in this problem, divide the real line into k subintervals and count the
number Ni of observations in the random sample that fall into the ith subinterval (i = 1, ..., k).
If H0 is true, and if µ and σ² denote the unknown mean and variance of the normal distribution, then the parameter vector is the two-dimensional vector θ = (µ, σ²). The probability πi(θ), or πi(µ, σ²), that an observation will fall within the ith subinterval is the probability assigned to that subinterval by the normal distribution with mean µ and variance σ². In other words, if the ith subinterval is the interval from ai to bi , then
πi(µ, σ²) = Φ((bi − µ)/σ) − Φ((ai − µ)/σ).
It is important to note that in order to calculate the value of the statistic Q∗ , the M.L.E.s µ∗ and
σ 2∗ must be found by using the numbers N1 , ..., Nk of observations in the different subintervals.
The M.L.E.s should not be found by using the observed values of X1 , ..., Xn themselves. In other
words, µ∗ and σ²∗ will be the values of µ and σ² that maximize the grouped likelihood function
L(µ, σ²) = π1(µ, σ²)^{N1} · · · πk(µ, σ²)^{Nk}.
Because of the complicated nature of the function πi (µ, σ 2 ) a lengthy numerical computation
would usually be required to determine the values of µ and σ 2 that maximize L(µ, σ 2 ). On the
other hand, we know that the M.L.E.s of µ and σ 2 based on the n observed values X1 , ..., Xn in
the original sample are simply the sample mean X n and the sample variance s2n . Furthermore,
if the estimators that maximize the likelihood function L(µ, σ 2 ) are used to calculate the statis-
tic Q∗ , then we know that when H0 is true, the distribution of Q∗ will be approximately the χ2
distribution with k − 3 degrees of freedom. On the other hand, if the M.L.E.s X n and s2n , which
are based on the observed values in the original sample, are used to calculate Q̂, then this χ2
approximation to the distribution of Q̂ will not be appropriate. Indeed, the distribution of Q̂ is
asymptotically “in between” χ2k−3 and χ2k−1 .
¹ Chernoff, H., and Lehmann, E. L. (1954). The use of maximum likelihood estimates in χ² tests for goodness of fit. Ann. Math. Statist. 25, 579–586.
Return to Example 6.6.3. We are now in a position to try to test the composite null hypothesis
that the logarithms of ball bearing lifetimes have some normal distribution. We shall divide the
real line into the subintervals (−∞, 3.575], (3.575, 3.912], (3.912, 4.249], and (4.249, +∞). The counts for the four intervals are 3, 4, 8, and 8. The M.L.E.s based on the original data are µ̂ = 4.150 and σ̂² = 0.2722. The probabilities of the four intervals are (π1 , π2 , π3 , π4 ) = (0.1350, 0.1888, 0.2511, 0.4251), which makes the value of Q̂ equal to 1.211.
The tail area corresponding to 1.211 needs to be computed for χ2 distributions with k − 1 = 3
and k − 3 = 1 degrees of freedom. For one degree of freedom, the p-value is 0.2711, and for three
degrees of freedom the p-value is 0.7504. So, our actual p-value lies in the interval [0.2711, 0.7504].
Although this interval is wide, it tells us not to reject H0 at any level α < 0.2711.
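A sketch reproducing these numbers from the reported non-grouped MLEs µ̂ = 4.150 and σ̂² = 0.2722 and the interval endpoints above (SciPy assumed available):

```python
# Composite test of normality for the ball-bearing example.
from math import sqrt
from scipy.stats import norm, chi2

counts = [3, 4, 8, 8]
n = sum(counts)
cuts = [3.575, 3.912, 4.249]
mu, sigma = 4.150, sqrt(0.2722)

# Interval probabilities under N(mu, sigma^2)
cdf = [0.0] + [norm.cdf((c - mu) / sigma) for c in cuts] + [1.0]
pi = [cdf[i + 1] - cdf[i] for i in range(4)]

Q_hat = sum((N - n * p) ** 2 / (n * p) for N, p in zip(counts, pi))
# The asymptotic law of Q_hat lies between chi2 with k-3 and k-1 degrees of freedom.
print(Q_hat, chi2.sf(Q_hat, df=1), chi2.sf(Q_hat, df=3))   # about 1.21, 0.271, 0.750
```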
In this section we will consider a situation when our observations are classified by two differ-
ent features and we would like to test if these features are independent. For example, we can ask
if the number of children in a family and family income are independent. Our sample space X
will consist of a × b pairs.
X = {(i, j) : i = 1, . . . , a, j = 1, . . . , b}
where the first coordinate represents the first feature that belongs to one of a categories and the
second coordinate represents the second feature that belongs to one of b categories. An i.i.d.
sample X1 , . . . , Xn can be represented by the contingency table below, where Nij is the number of observations in the cell (i, j).
                 Feature 2
Feature 1      1      2     · · ·     b
    1         N11    N12    · · ·    N1b
    2         N21    N22    · · ·    N2b
    ⋮          ⋮      ⋮               ⋮
    a         Na1    Na2    · · ·    Nab
We would like to test the independence of the two features, which means that
P[X = (i, j)] = P[X¹ = i] P[X² = j] for all i, j.
Denote θij = P[X = (i, j)], pi = P[X¹ = i], qj = P[X² = j]. Then we want to test
H0 : θij = pi qj for all i, j vs H1 : otherwise.
We can see that this null hypothesis H0 is a special case of the composite hypotheses from previ-
ous lecture and it can be tested using the chi-squared goodness-of-fit test. The total number of
groups is k = a × b. Since the pi's and the qj's each add up to one, one parameter in each sequence, for example pa and qb , can be computed in terms of the other probabilities, and we can take (p1 , . . . , pa−1 )
and (q1 , . . . , qb−1 ) as free parameters of the model. This means that the dimension of the parameter set is
s = (a − 1) + (b − 1).
Therefore, if we find the maximum likelihood estimates for the parameters of this model, then the chi-squared statistic satisfies
Q = Σ_{ij} (Nij − n p∗i q∗j)² / (n p∗i q∗j)  −→w  χ²_{k−s−1} = χ²_{(a−1)(b−1)}.
To formulate the test it remains to find the maximum likelihood estimates of the parameters. We need to maximize the likelihood function
∏_{ij} (pi qj)^{Nij} = ∏_i pi^{Ni+} ∏_j qj^{N+j},
where Ni+ = Σ_j Nij and N+j = Σ_i Nij. Since the pi's and qj's are not related to each other, maximizing the likelihood function above is equivalent to maximizing ∏_i pi^{Ni+} and ∏_j qj^{N+j} separately.
We have
ln ∏_i pi^{Ni+} = Σ_{i=1}^{a−1} Ni+ ln pi + Na+ ln(1 − p1 − · · · − pa−1).
An elementary computation shows that
p∗i = Ni+ / n, i = 1, . . . , a.
Similarly, the MLE for qj is
q∗j = N+j / n, j = 1, . . . , b.
Therefore, the chi-squared statistic Q in this case can be written as
Q = Σ_{ij} (Nij − Ni+ N+j / n)² / (Ni+ N+j / n).
We reject H0 if Q > cα,(a−1)(b−1) , where the threshold cα,(a−1)(b−1) is determined from the condition P(χ²_{(a−1)(b−1)} > cα,(a−1)(b−1)) = α.
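A minimal sketch of this test for a hypothetical 2 × 3 table of counts (the table entries below are invented purely for illustration):

```python
# Chi-squared test of independence for an a x b contingency table.
from scipy.stats import chi2

table = [[20, 30, 25],
         [35, 25, 15]]                              # hypothetical counts N_ij
n = sum(sum(row) for row in table)
row_sums = [sum(row) for row in table]              # N_{i+}
col_sums = [sum(col) for col in zip(*table)]        # N_{+j}

Q = sum((table[i][j] - row_sums[i] * col_sums[j] / n) ** 2
        / (row_sums[i] * col_sums[j] / n)
        for i in range(len(row_sums)) for j in range(len(col_sums)))

df = (len(row_sums) - 1) * (len(col_sums) - 1)
alpha = 0.05
c = chi2.ppf(1 - alpha, df)                          # threshold c_{alpha,(a-1)(b-1)}
print(Q, c, Q > c)
```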
Suppose that the population is divided into R groups and each group (or the entire popula-
tion) is divided into C categories. We would like to test whether the distribution of categories in
each group is the same.
            Category 1   Category 2   · · ·   Category C     Σ
Group 1        N11          N12       · · ·      N1C        N1+
Group 2        N21          N22       · · ·      N2C        N2+
   ⋮            ⋮            ⋮                    ⋮           ⋮
Group R        NR1          NR2       · · ·      NRC        NR+
   Σ           N+1          N+2       · · ·      N+C         n
If we denote
pij = P(Category j | Group i),
so that for each group i we have
Σ_{j=1}^{C} pij = 1,
then the hypothesis of homogeneity states that these conditional distributions are the same for all groups:
H0 : p1j = p2j = · · · = pRj for all j = 1, . . . , C.
If observations X1 , ..., Xn are sampled independently from the entire population then homo-
geneity over groups is the same as independence of groups and categories. Indeed, if we have homogeneity,
P(Categoryj |Groupi ) = P(Categoryj ),
then we have
P(Categoryj , Groupi ) = P(Categoryj )P(Groupi ).
This means that to test homogeneity we can use the test of independence above, with the statistic
Q = Σ_{i=1}^{R} Σ_{j=1}^{C} (Nij − Ni+ N+j / n)² / (Ni+ N+j / n)  −→w  χ²_{(C−1)(R−1)}.
We reject H0 at the significance level α if Q > cα,(C−1)(R−1) , where the threshold cα,(C−1)(R−1) is determined from the condition P(χ²_{(C−1)(R−1)} > cα,(C−1)(R−1)) = α.
Example 6.6.6. In this example, 100 people were asked whether the service provided by the fire
department in the city was satisfactory. Shortly after the survey, a large fire occurred in the city.
Suppose that the same 100 people were asked whether they thought that the service provided by
the fire department was satisfactory. The results are in the following table:
Satisfactory Unsatisfactory
Before fire 80 20
After fire 72 28
Suppose that we would like to test whether the opinions changed after the fire by using a chi-
squared test. However, the i.i.d. sample consists of pairs of opinions of 100 people, (Xi1 , Xi2 ), i = 1, . . . , 100, where the first coordinate/feature is a person's opinion before the fire, belonging to one of the two categories
{“Satisfactory”, “Unsatisfactory”},
and the second coordinate/feature is a person's opinion after the fire, also belonging to one of the two categories
{“Satisfactory”, “Unsatisfactory”}.
So the correct contingency table corresponding to the above data and satisfying the assumption
of the chi-squared test would be the following:
Before \ After     Satisfactory   Unsatisfactory
Satisfactory            70              10
Unsatisfactory           2              18
In order to use the first contingency table, we would have to poll 100 people after the fire inde-
pendently of the 100 people polled before the fire.
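For the paired 2 × 2 table above, the same statistic can also be obtained with SciPy's chi2_contingency (the continuity correction is switched off so that the output matches the formula derived earlier); this is only a sketch of the computation, not a full analysis of the survey.

```python
# Chi-squared test of independence for the (before, after) opinion table.
from scipy.stats import chi2_contingency

table = [[70, 10],
         [2, 18]]                  # rows: before the fire, columns: after the fire
Q, p_value, dof, expected = chi2_contingency(table, correction=False)
print(Q, p_value, dof)
```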
6.7 Exercises
6.1. Suppose that X has a pdf of the form f (x; θ) = θx^{θ−1} I{0<x<1} where θ ∈ {1, 2}. To test the simple hypotheses H0 : θ = 1 against H1 : θ = 2, one uses a random sample X1 , X2 of size n = 2 and defines the critical region to be C = {(x1 , x2 ) : x1 x2 ≥ 3/4}. Find the power function of the test.
6.2. Suppose that X has a binomial distribution with the number of trials n = 10 and with p ∈ {1/4, 1/2}. The simple hypothesis H0 : p = 1/2 is rejected, and the alternative simple hypothesis H1 : p = 1/4 is accepted, if the observed value of X1 , a random sample of size 1, is less than or equal to 3. Find the significance level and the power of the test.
6.3. Suppose the life of a light bulb, say X, is normally distributed with mean θ and standard deviation 5000. Past experience indicates that θ = 30000. The manufacturer claims that light bulbs made by a new process have mean θ > 30000, and it is possible that θ = 35000. Check this claim by testing H0 : θ = 30000 against H1 : θ > 30000. We shall observe n independent values of X, say X1 , . . . , Xn , and we shall reject H0 (thus accept H1 ) if and only if x̄ ≥ c. Determine n and c so that the power function γ(θ) of the test has the values γ(30000) = 0.01 and γ(35000) = 0.98.
6.4. Suppose that X has a Poisson distribution with mean λ. Consider the simple hypothesis H0 : λ = 1/2 and the alternative composite hypothesis H1 : λ < 1/2. Let X1 , . . . , X12 denote a random sample of size 12 from this distribution. One rejects H0 if and only if the observed value of Y = X1 + · · · + X12 ≤ 2. Find γ(λ) for λ ∈ (0, 1/2] and the significance level of the test.
6.5. Let Y1 < Y2 < Y3 < Y4 be the order statistics of a random sample of size n = 4 from a
distribution with pdf f (x; θ) = 1/θ, 0 < x < θ, zero elsewhere, where θ > 0. The hypothesis
H0 : θ = 1 is rejected and H1 : θ > 1 is accepted if the observed Y4 ≥ c.
6.6. Let X1 , . . . , Xn be a random sample from a N (a0 , σ²) distribution where 0 < σ² < ∞ and a0 is known. Show that the likelihood ratio test of H0 : σ² = σ0² versus H1 : σ² ≠ σ0² can be based upon the statistic W = Σ_{i=1}^{n} (Xi − a0)²/σ0². Determine the null distribution of W and give the rejection rule of the test.
6.7. Let X1 , . . . , Xn be a random sample from a Poisson distribution with mean λ > 0.
1. Show that the likelihood ratio test of H0 : λ = λ0 versus H1 : λ ≠ λ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .
2. For λ0 = 2 and n = 5, find the significance level of the test that rejects H0 if Y ≤ 4 or Y ≥ 17.
6.8. Let X1 , . . . , Xn be a random sample from a Bernoulli B(1, θ) distribution, where 0 < θ < 1.
1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .
2. For n = 100 and θ0 = 1/2, find c1 so that the test that rejects H0 when Y ≤ c1 or Y ≥ c2 = 100 − c1
has the approximate significance level α = 0.05.
6.9. Let X1 , . . . , Xn be a random sample from a Γ(α = 3, β = θ) distribution, where 0 < θ < ∞.
1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of 2Y /θ0 .
2. For θ0 = 3 and n = 5, find c1 and c2 so that the test that rejects H0 when Y ≤ c1 or Y ≥ c2
has significance level 0.05.
6.10. Let X1 , X2 be a random sample of size 2 from a random variable X having the pdf f (x; θ) = (1/θ) e^{−x/θ} I{0<x<∞}. Consider the simple hypothesis H0 : θ = θ′ = 2 and the alternative hypothesis H1 : θ = θ″ = 4. Show that the best test of H0 against H1 may be carried out by use of the statistic X1 + X2 .
6.11. Let X1 , . . . , X10 be a random sample of size 10 from a normal distribution N (0, σ²). Find a best critical region of size α = 0.05 for testing H0 : σ² = 1 against H1 : σ² = 2. Is this a best critical region of size 0.05 for testing H0 : σ² = 1 against H1 : σ² = 4? Against H1 : σ² = σ1² > 1?
6.12. If X1 , . . . , Xn is a random sample from a distribution having pdf of the form f (x; θ) = θx^{θ−1}, 0 < x < 1, zero elsewhere, show that a best critical region for testing H0 : θ = 1 against H1 : θ = 2 is C = {(x1 , . . . , xn ) : c ≤ x1 x2 · · · xn }.
6.13. Let X1 , . . . , Xn denote a random sample from a normal distribution N (θ, 100). Show that
C = {(x1 , . . . , xn ) : x̄ ≥ c} is a best critical region for testing H0 : θ = 75 against H1 : θ = 78. Find
n and c so that
PH0 [(X1 , . . . , Xn ) ∈ C] = PH0 [X̄ ≥ c] = 0.05
and
PH1 [(X1 , . . . , Xn ) ∈ C] = PH1 [X̄ ≥ c] = 0.90,
approximately.
6.14. Let X1 , . . . , Xn be iid with pmf f (x; p) = p^x (1 − p)^{1−x}, x = 0, 1, zero elsewhere. Show that
C = {(x1 , . . . , xn ) : Σ xi ≤ c}
is a best critical region for testing H0 : p = 1/2 against H1 : p = 1/3. Use the Central Limit Theorem to find n and c so that approximately PH0 [Σ Xi ≤ c] = 0.10 and PH1 [Σ Xi ≤ c] = 0.80.
6.15. Let X1 , . . . , X10 denote a random sample of size 10 from a Poisson distribution with mean λ. Show that the critical region C defined by Σ_{i=1}^{10} xi ≥ 3 is a best critical region for testing H0 : λ = 0.1 against H1 : λ = 0.5. Determine, for this test, the significance level α and the power at λ = 0.5.
6.16. Let X have the pmf f (x; θ) = θ^x (1 − θ)^{1−x}, x = 0, 1, zero elsewhere. We test the simple hypothesis H0 : θ = 1/4 against the alternative composite hypothesis H1 : θ < 1/4 by taking a random sample of size 10 and rejecting H0 : θ = 1/4 iff the observed values x1 , . . . , x10 of the sample observations are such that Σ_{i=1}^{10} xi ≤ 1. Find the power function γ(θ), 0 < θ ≤ 1/4, of this test.
Tests on mean
6.17. (a) The sample mean and standard deviation from a random sample of 10 observations
from a normal population were computed as x̄ = 23 and s = 9. Calculate the value of the test
statistic of the test required to determine whether there is enough evidence to infer at the 5%
significance level that the population mean is greater than 20.
(b) Repeat part (a) with n = 30.
(c) Repeat part (b) with n = 40.
6.18. (a) A statistics practitioner is in the process of testing to determine whether there is enough
evidence to infer that the population mean is different from 180. She calculated the mean and
standard deviation of a sample of 200 observations as x̄ = 175 and s = 22. Calculate the value
of the test statistic of the test required to determine whether there is enough evidence at the 5%
significance level.
(b) Repeat part (a) with s = 45.
(c) Repeat part (a) with s = 60.
6.19. A courier service advertises that its average delivery time is less than 6 hours for local deliv-
eries. A random sample of times for 12 deliveries to an address across town was recorded. These
data are shown here. Is this sufficient evidence to support the courier's advertisement, at the 5%
level of significance?
3.03, 6.33, 7.98, 4.82, 6.50, 5.22, 3.56, 6.76, 7.96, 4.54, 5.09, 6.46.
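One possible way to carry out this test numerically (a one-sided, one-sample t-test of H0 : µ = 6 against H1 : µ < 6, assuming the delivery times are roughly normal):

```python
# t-test for Exercise 6.19.
import numpy as np
from scipy.stats import t

x = np.array([3.03, 6.33, 7.98, 4.82, 6.50, 5.22, 3.56, 6.76, 7.96, 4.54, 5.09, 6.46])
n = len(x)
t_stat = (x.mean() - 6) / (x.std(ddof=1) / np.sqrt(n))
p_value = t.cdf(t_stat, df=n - 1)      # left-tailed P-value
print(t_stat, p_value, p_value < 0.05)
```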
6.20. Aircrew escape systems are powered by a solid propellant. The burning rate of this pro-
pellant is an important product characteristic. Specifications require that the mean burning rate
must be 50 centimeters per second. We know that the standard deviation of burning rate is σ = 2
centimeters per second. The experimenter decides to specify a type I error probability or signif-
icance level of α = 0.05 and selects a random sample of n = 25 and obtains a sample average
burning rate of x̄ = 51.3 centimeters per second. What conclusions should be drawn?
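A sketch of the corresponding two-sided z-test with known σ = 2 (H0 : µ = 50 against H1 : µ ≠ 50):

```python
# z-test for Exercise 6.20.
from scipy.stats import norm

z = (51.3 - 50) / (2 / 25 ** 0.5)      # z = 3.25
p_value = 2 * norm.sf(abs(z))
print(z, p_value, p_value < 0.05)
```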
6.21. The mean water temperature downstream from a power plant cooling tower discharge pipe
should be no more than 100◦ F . Past experience has indicated that the standard deviation of
temperature is 2◦ F . The water temperature is measured on nine randomly chosen days, and the
average temperature is found to be 98◦ F.
(a) Should the water temperature be judged acceptable with α = 0.05?
(b) What is the P -value for this test?
(c) What is the probability of accepting the null hypothesis at α = 0.05 if the water has a true
mean temperature of 104◦ F ?
6.23. Cloud seeding has been studied for many decades as a weather modification procedure.
The rainfall in acre-feet from 20 clouds that were selected at random and seeded with silver ni-
trate follows:
18.0, 30.7, 19.8, 27.1, 22.3, 18.8, 31.8, 23.4, 21.2, 27.9,
31.9, 27.1, 25.0, 24.7, 26.9, 21.8, 29.2, 34.8, 26.7, 31.6.
(a) Can you support a claim that mean rainfall from seeded clouds exceeds 25 acre-feet? Use
α = 0.01.
(b) Compute the power of the test if the true mean rainfall is 27 acre-feet.
(c) What sample size would be required to detect a true mean rainfall of 27.5 acre-feet if we
wanted the power of the test to be at least 0.9?
6.24. The life in hours of a battery is known to be approximately normally distributed, with stan-
dard deviation σ = 1.25 hours. A random sample of 10 batteries has a mean life of x = 40.5 hours.
(a) Is there evidence to support the claim that battery life exceeds 40 hours? Use α = 0.05.
(b) What is the P -value for the test in part (a)?
(c) What is the power for the test in part (a) if the true mean life is 42 hours?
(d) What sample size would be required to ensure that the probability of making type II error
6.25. Medical researchers have developed a new artificial heart constructed primarily of tita-
nium and plastic. The heart will last and operate almost indefinitely once it is implanted in the
patient's body, but the battery pack needs to be recharged about every four hours. A random
sample of 50 battery packs is selected and subjected to a life test. The average life of these batter-
ies is 4.05 hours. Assume that battery life is normally distributed with standard deviation σ = 0.2
hour.
(a) Is there evidence to support the claim that mean battery life exceeds 4 hours? Use α = 0.05.
(b) Compute the power of the test if the true mean battery life is 4.5 hours.
(c) What sample size would be required to detect a true mean battery life of 4.5 hours if we wanted
the power of the test to be at least 0.9?
(d) Explain how the question in part (a) could be answered by constructing a one-sided confi-
dence bound on the mean life.
6.26. After many years of teaching, a statistics professor computed the variance of the marks on
her final exam and found it to be σ 2 = 250. She recently made changes to the way in which the
final exam is marked and wondered whether this would result in a reduction in the variance. A
random sample of this year's final exam marks is listed here. Can the professor infer at the 10%
significance level that the variance has decreased?
57 92 99 73 62 64 75 70 88 60.
6.27. With gasoline prices increasing, drivers are more concerned with their cars’ gasoline con-
sumption. For the past 5 years, a driver has tracked the gas mileage of his car and found that the
variance from fill-up to fill-up was σ 2 = 23 mpg2 . Now that his car is 5 years old, he would like to
know whether the variability of gas mileage has changed. He recorded the gas mileage from his
last eight fill-ups; these are listed here. Conduct a test at a 10% significance level to infer whether
the variability has changed.
28 25 29 25 32 36 27 24.
Tests on proportion
6.28. (a) Calculate the P -value of the test of the following hypotheses given that p̂ = 0.63 and
n = 100:
H0 : p = 0.6 vs H1 : p > 0.6.
(b) Repeat part (a) with n = 200.
(c) Repeat part (a) with n = 400.
(d) Describe the effect on P -value of increasing sample size.
6.29. Has the recent drop in airplane passengers resulted in better on-time performance? Before
the recent economic downturn, one airline bragged that 92% of its flights were on time. A random
sample of 165 flights completed this year reveals that 153 were on time. Can we conclude at the
5% significance level that the airline's on-time performance has improved?
6.30. In a random sample of 85 automobile engine crankshaft bearings, 10 have a surface finish
roughness that exceeds the specifications. Does this data present strong evidence that the pro-
portion of crankshaft bearings exhibiting excess surface roughness exceeds 0.10? State and test
the appropriate hypotheses using α = 0.05.
6.31. An article in Fortune claimed that nearly one-half of all engineers continue academic studies beyond
the B.S. degree, ultimately receiving either an M.S. or a Ph.D. degree. Data from an article in
Engineering Horizons (Spring 1990) indicated that 117 of 484 new engineering graduates were
planning graduate study.
(a) Are the data from Engineering Horizons consistent with the claim reported by Fortune? Use
α = 0.05 in reaching your conclusions.
(b) Find the P -value for this test.
(c) Discuss how you could have answered the question in part (a) by constructing a two-sided
confidence interval on p.
6.32. A researcher claims that at least 10% of all football helmets have manufacturing flaws that
could potentially cause injury to the wearer. A sample of 200 helmets revealed that 16 helmets
contained such defects.
(a) Does this finding support the researcher's claim? Use α = 0.01.
(b) Find the P -value for this test.
6.33. In random samples of size 12 from each of two normal populations, we found the following statistics: x̄1 = 74, s1 = 18 and x̄2 = 71, s2 = 16.
(a) Test with α = 0.05 to determine whether we can infer that the population means differ.
(b) Repeat part (a) increasing the standard deviation to s1 = 210 and s2 = 198.
(c) Describe what happens when the sample standard deviations get larger.
(d) Repeat part (a) with sample size 150.
(e) Discuss the effects of increasing the sample size.
6.34. Random sampling from two normal populations produced the following results
(c) Describe what happens when the sample standard deviations get smaller.
(d) Repeat part (a) with samples of size 20.
(e) Discuss the effects of decreasing the sample size.
(f) Repeat part (a) changing the mean of sample 1 to x1 = 409.
6.35. Two machines are used for filling plastic bottles with a net volume of 16.0 ounces. The fill
volume can be assumed normal, with standard deviation σ1 = 0.020 and σ2 = 0.025 ounces. A
member of the quality engineering staff suspects that both machines fill to the same mean net
volume, whether or not this volume is 16.0 ounces. A random sample of 10 bottles is taken from
the output of each machine.
Machine 1 Machine 2
16.03 16.01 16.02 16.03
16.04 15.96 15.97 16.04
16.05 15.98 15.96 16.02
16.05 16.02 16.01 16.01
16.02 15.99 15.99 16.00
6.36. Every month a clothing store conducts an inventory and calculates losses from theft. The
store would like to reduce these losses and is considering two methods. The first is to hire a
security guard, and the second is to install cameras. To help decide which method to choose,
the manager hired a security guard for 6 months. During the next 6-month period, the store
installed cameras. The monthly losses were recorded and are listed here. The manager decided
that because the cameras were cheaper than the guard, he would install the cameras unless there
was enough evidence to infer that the guard was better. What should the manager do?
Paired t-test
6.37. Many people use scanners to read documents and store them in a Word (or some other
software) file. To help determine which brand of scanner to buy, a student conducts an experi-
ment in which eight documents are scanned by each of the two scanners he is interested in. He
records the number of errors made by each. These data are listed here. Can he infer that brand A
(the more expensive scanner) is better than brand B?
Document 1 2 3 4 5 6 7 8
BrandA 17 29 18 14 21 25 22 29
BrandB 21 38 15 19 22 30 31 37
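One possible computation is a paired t-test on the differences (A − B), with the one-sided alternative that brand A makes fewer errors; a sketch:

```python
# Paired t-test for Exercise 6.37.
import numpy as np
from scipy.stats import t

a = np.array([17, 29, 18, 14, 21, 25, 22, 29])
b = np.array([21, 38, 15, 19, 22, 30, 31, 37])
d = a - b
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p_value = t.cdf(t_stat, df=len(d) - 1)   # left-tailed: evidence that A < B
print(t_stat, p_value)
```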
6.38. In an effort to determine whether a new type of fertilizer is more effective than the type cur-
rently in use, researchers took 12 two-acre plots of land scattered throughout the county. Each
plot was divided into two equal-sized subplots, one of which was treated with the current fertil-
izer and the other with the new fertilizer. Wheat was planted, and the crop yields were measured.
Plot 1 2 3 4 5 6 7 8 9 10 11 12
Current fertilizer 56 45 68 72 61 69 57 55 60 72 75 66
New fertilizer 60 49 66 73 59 67 61 60 58 75 72 68
(a) Can we conclude at the 5% significance level that the new fertilizer is more effective than the
current one?
(b) Estimate with 95% confidence the difference in mean crop yields between the two fertilizers.
(c) What is the required condition(s) for the validity of the results obtained in parts (a) and (b)?
6.39. Random samples from two normal populations produced the following statistics
(a) Can we infer at the 10% significance level that the two population variances differ?
(b) Repeat part (a) changing the sample sizes to n1 = 15 and n2 = 15.
(c) Describe what happens to the test statistics and the conclusion when the sample sizes de-
crease.
6.40. A statistics professor hypothesized that not only would the means vary but also so would
the variances if the business statistics course was taught in two different ways but had the same
final exam. He organized an experiment wherein one section of the course was taught using
detailed PowerPoint slides whereas the other required students to read the book and answer
questions in class discussions. A sample of the marks was recorded and listed next. Can we
infer that the variances of the marks differ between the two sections?
Class 1 64 85 80 64 48 62 75 77 50 81 90
Class 2 73 78 66 69 79 81 74 59 83 79 84
6.41. An operations manager who supervises an assembly line has been experiencing problems
with the sequencing of jobs. The problem is that bottlenecks are occurring because of the in-
consistency of sequential operations. He decides to conduct an experiment wherein two differ-
ent methods are used to complete the same task. He measures the times (in seconds). The data
are listed here. Can he infer that the second method is more consistent than the first method?
Method 1 8.8 9.6 8.4 9.0 8.3 9.2 9.0 8.7 8.5 9.4
Method 2 9.2 9.4 8.9 9.6 9.7 8.4 8.8 8.9 9.0 9.7
6.42. Random samples from two binomial populations yielded the following statistics:
(a) Calculate the P -value of a test to determine whether we can infer that the population propor-
tions differ.
(b) Repeat part (a) increasing the sample sizes to 400.
(c) Describe what happens to the p-value when the sample sizes increase.
6.43. Random samples from two binomial populations yielded the following statistics:
(a) Calculate the P -value of a test to determine whether there is evidence to infer that the
population proportions differ.
(b) Repeat part (a) with p̂1 = 0.95 and p̂2 = 0.90.
(c) Describe the effect on the P -value of increasing the sample proportions.
(d) Repeat part (a) with p̂1 = 0.10 and p̂2 = 0.05.
(e) Describe the effect on the P -value of decreasing the sample proportions.
6.44. Many stores sell extended warranties for products they sell. These are very lucrative for
store owners. To learn more about who buys these warranties, a random sample was drawn
of a store's customers who recently purchased a product for which an extended warranty was available. Among other variables, each respondent reported whether he or she paid the regular price or a sale price and whether he or she purchased an extended warranty.
Can we conclude at the 10% significance level that those who paid the regular price are more
likely to buy an extended warranty?
6.45. Surveys have been widely used by politicians around the world as a way of monitoring the
opinions of the electorate. Six months ago, a survey was undertaken to determine the degree
of support for a national party leader. Of a sample of 1100, 56% indicated that they would vote
for this politician. This month, another survey of 800 voters revealed that 46% now support the
leader.
(a) At the 5% significance level, can we infer that the national leader's popularity has decreased?
(b) At the 5% significance level, can we infer that the national leader's popularity has decreased by more than 5%?
6.46. A random sample of 500 adult residents of Maricopa County found that 385 were in favour
of increasing the highway speed limit to 75 mph, while another sample of 400 adult residents of
Pima County found that 267 were in favour of the increased speed limit. Do these data indicate
that there is a difference in the support for increasing the speed limit between the residents of
the two counties? Use α = 0.05. What is the P -value for this test?
6.47. Two different types of injection-molding machines are used to form plastic parts. A part
is considered defective if it has excessive shrinkage or is discolored. Two random samples, each
of size 300, are selected, and 15 defective parts are found in the sample from machine 1 while 8
defective parts are found in the sample from machine 2. Is it reasonable to conclude that both
machines produce the same fraction of defective parts, using α = 0.05? Find the P -value for this
test.
6.48. A new casino game involves rolling 3 dice. The winnings are directly proportional to the
total number of sixes rolled. Suppose a gambler plays the game 100 times, with the following
observed counts:
Number of Sixes 0 1 2 3
Number of Rolls 48 35 15 3
The casino becomes suspicious of the gambler and wishes to determine whether the dice are fair.
What do they conclude?
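One way to attack this is a chi-squared goodness-of-fit test against the Binomial(3, 1/6) distribution implied by fair dice; note that the expected count in the last cell is below 5, so by the rule of thumb stated earlier one might prefer to merge the last two categories. A sketch:

```python
# Goodness-of-fit test for Exercise 6.48.
from scipy.stats import binom, chi2

counts = [48, 35, 15, 3]                         # observed numbers of games
n = sum(counts)
p0 = [binom.pmf(k, 3, 1 / 6) for k in range(4)]  # P(number of sixes = k) for fair dice
expected = [n * p for p in p0]

Q = sum((N - e) ** 2 / e for N, e in zip(counts, expected))
print(Q, chi2.sf(Q, df=len(counts) - 1))
```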
6.49. Suppose that the distribution of the heights of men who reside in a certain large city is
the normal distribution for which the mean is 68 inches and the standard deviation is 1 inch.
Suppose also that when the heights of 500 men who reside in a certain neighbourhood of the city
were measured, the distribution in the following table was obtained. Test the hypothesis that,
with regard to height, these 500 men form a random sample from all the men who reside in the
city.
6.50. The 50 values in the following table are intended to be a random sample from the standard
normal distribution.
1.28 1.22 0.32 0.80 1.38 1.26 2.33 0.34 1.14 0.64
0.41 0.01 0.49 0.36 1.05 0.04 0.35 2.82 0.64 0.56
0.45 1.66 0.49 1.96 3.44 0.67 1.24 0.76 0.46 0.11
0.35 1.39 0.14 0.64 1.67 1.13 0.04 0.61 0.63 0.13
0.72 0.38 0.85 1.32 0.85 0.41 0.11 2.04 1.61 1.81
a) Carry out a χ2 test of goodness-of-fit by dividing the real line into five intervals, each of which
has probability 0.2 under the standard normal distribution.
b) Carry out a χ2 test of goodness-of-fit by dividing the real line into ten intervals, each of which
has probability 0.1 under the standard normal distribution.
Chapter 7
Regression
Suppose that we have a pair of variables (X, Y ) where the variable Y is a linear function of X plus random noise:
Y = f (X) + ε = β0 + β1 X + ε,
where the random noise ε is assumed to have the normal distribution N(0, σ²). The variable X is called a predictor variable, Y a response variable, and the function f (x) = β0 + β1 x a linear regression function.
Suppose that we are given a sequence of pairs (X1 , Y1 ), . . . , (Xn , Yn ) described by the above model:
Yi = β0 + β1 Xi + εi ,
where ε1 , . . . , εn are i.i.d. N(0, σ²). We have three unknown parameters β0 , β1 and σ², and we want to estimate them using the given sample. The points X1 , . . . , Xn can be either random or non-random, but from the point of view of estimating the linear regression function the nature of the Xs is in some sense irrelevant, so we will think of them as fixed and non-random and assume that the randomness comes from the noise variables εi . For a fixed Xi , the distribution of Yi is N(f (Xi ), σ²). The likelihood function of the sequence (Y1 , . . . , Yn ) is
L(Y1 , . . . , Yn ; β0 , β1 , σ²) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (Yi − f (Xi ))² ).
Let us find the m.l.e.s β̂0 , β̂1 , σ̂² that maximize this likelihood function L. First of all, it is clear that (β̂0 , β̂1 ) also minimizes
L∗(β0 , β1 ) := Σ_{i=1}^{n} (Yi − β0 − β1 Xi )².
Denote
X̄ = (1/n) Σ_i Xi ,   Ȳ = (1/n) Σ_i Yi ,   \overline{X²} = (1/n) Σ_i Xi² ,   \overline{XY} = (1/n) Σ_i Xi Yi .
Setting the partial derivatives of L∗ equal to zero, we obtain
β̂0 = Ȳ − β̂1 X̄,
β̂1 = ( \overline{XY} − X̄ Ȳ ) / ( \overline{X²} − X̄² ),
σ̂² = (1/n) Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi )².
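A short sketch of these estimators on a synthetic data set (the data below are generated purely for illustration):

```python
# Least-squares / maximum-likelihood estimates in the simple linear regression model.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30)
Y = 1.0 + 2.0 * X + rng.normal(0, 1, size=X.size)    # true beta0 = 1, beta1 = 2

Xbar, Ybar = X.mean(), Y.mean()
beta1 = (np.mean(X * Y) - Xbar * Ybar) / (np.mean(X ** 2) - Xbar ** 2)
beta0 = Ybar - beta1 * Xbar
sigma2 = np.mean((Y - beta0 - beta1 * X) ** 2)
print(beta0, beta1, sigma2)
```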
Proposition 7.1.1. 1. The vector (β̂0 , β̂1 ) has a normal distribution with mean (β0 , β1 ) and covariance matrix
Σ = (σ² / (n σx²)) [ \overline{X²}   −X̄ ; −X̄   1 ],   where σx² = \overline{X²} − X̄².
2. The statistic nσ̂²/σ² has the χ² distribution with n − 2 degrees of freedom and is independent of (β̂0 , β̂1 ).
Consequently, if we choose the quantiles cα/2,n−2 and c1−α/2,n−2 so that
P[χ²_{n−2} > c1−α/2,n−2 ] = 1 − α/2,   P[χ²_{n−2} > cα/2,n−2 ] = α/2,
then
P[ nσ̂²/cα/2,n−2 ≤ σ² ≤ nσ̂²/c1−α/2,n−2 ] = 1 − α.
Therefore the (1 − α) confidence interval for σ² is
nσ̂²/cα/2,n−2 ≤ σ² ≤ nσ̂²/c1−α/2,n−2 .
Similarly, the confidence intervals for β1 and β0 are based on the quantile xα of the t-distribution with n − 2 degrees of freedom, defined by
P[|tn−2 | < xα ] = 1 − α.
Suppose now that we have a new observation X for which Y is unknown, and we want to predict Y or find a prediction interval for Y . According to the simple regression model,
Y = β0 + β1 X + ε,
and it is natural to take Ŷ = β̂0 + β̂1 X as the prediction of Y . Let us find the distribution of the difference Ŷ − Y .
One can show that
(Ŷ − Y ) / √( (σ̂²/(n − 2)) ( n + 1 + (X̄ − X)²/σx² ) )
has the t-distribution with n − 2 degrees of freedom, which can be used to construct a prediction interval for Y .
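Continuing the synthetic example above, a sketch of a (1 − α) prediction interval Ŷ ± t·se based on this statistic (x_new is an arbitrary illustrative point):

```python
# Prediction interval for a new observation in simple linear regression.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30)
Y = 1.0 + 2.0 * X + rng.normal(0, 1, size=X.size)

n, Xbar = X.size, X.mean()
sx2 = np.mean(X ** 2) - Xbar ** 2
beta1 = (np.mean(X * Y) - Xbar * Y.mean()) / sx2
beta0 = Y.mean() - beta1 * Xbar
sigma2 = np.mean((Y - beta0 - beta1 * X) ** 2)

x_new = 5.0
y_hat = beta0 + beta1 * x_new
se = np.sqrt(sigma2 / (n - 2) * (n + 1 + (Xbar - x_new) ** 2 / sx2))
alpha = 0.05
t_q = t.ppf(1 - alpha / 2, df=n - 2)
print(y_hat - t_q * se, y_hat + t_q * se)    # 95% prediction interval for Y at x_new
```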
Table of the Normal distribution Φ(z) = ∫_{−∞}^{z} e^{−x²/2}/√(2π) dx
z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.5398 0.5438 0.54776 0.55172 0.55567 0.55966 0.5636 0.56749 0.57142 0.57535
0.2 0.5793 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91308 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999
Table of the Student t distribution (quantiles); for example, P[T1 < 1.376] = 0.8 and P[|T1 | < 1.376] = 0.6.
Table of the χ² distribution: the entry in row n and column p is the value x such that P[χ²_n > x] = p.
DF: n  0.995  0.975  0.2  0.1  0.05  0.025  0.02  0.01  0.005  0.002  0.001
1 0.00004 0.001 1.642 2.706 3.841 5.024 5.412 6.635 7.879 9.55 10.828
2 0.01 0.0506 3.219 4.605 5.991 7.378 7.824 9.21 10.597 12.429 13.816
3 0.0717 0.216 4.642 6.251 7.815 9.348 9.837 11.345 12.838 14.796 16.266
4 0.207 0.484 5.989 7.779 9.488 11.143 11.668 13.277 14.86 16.924 18.467
5 0.412 0.831 7.289 9.236 11.07 12.833 13.388 15.086 16.75 18.907 20.515
6 0.676 1.237 8.558 10.645 12.592 14.449 15.033 16.812 18.548 20.791 22.458
7 0.989 1.69 9.803 12.017 14.067 16.013 16.622 18.475 20.278 22.601 24.322
8 1.344 2.18 11.03 13.362 15.507 17.535 18.168 20.09 21.955 24.352 26.124
9 1.735 2.7 12.242 14.684 16.919 19.023 19.679 21.666 23.589 26.056 27.877
10 2.156 3.247 13.442 15.987 18.307 20.483 21.161 23.209 25.188 27.722 29.588
11 2.603 3.816 14.631 17.275 19.675 21.92 22.618 24.725 26.757 29.354 31.264
12 3.074 4.404 15.812 18.549 21.026 23.337 24.054 26.217 28.3 30.957 32.909
13 3.565 5.009 16.985 19.812 22.362 24.736 25.472 27.688 29.819 32.535 34.528
14 4.075 5.629 18.151 21.064 23.685 26.119 26.873 29.141 31.319 34.091 36.123
15 4.601 6.262 19.311 22.307 24.996 27.488 28.259 30.578 32.801 35.628 37.697
16 5.142 6.908 20.465 23.542 26.296 28.845 29.633 32 34.267 37.146 39.252
17 5.697 7.564 21.615 24.769 27.587 30.191 30.995 33.409 35.718 38.648 40.79
18 6.265 8.231 22.76 25.989 28.869 31.526 32.346 34.805 37.156 40.136 42.312
19 6.844 8.907 23.9 27.204 30.144 32.852 33.687 36.191 38.582 41.61 43.82
20 7.434 9.591 25.038 28.412 31.41 34.17 35.02 37.566 39.997 43.072 45.315
21 8.034 10.283 26.171 29.615 32.671 35.479 36.343 38.932 41.401 44.522 46.797
22 8.643 10.982 27.301 30.813 33.924 36.781 37.659 40.289 42.796 45.962 48.268
23 9.26 11.689 28.429 32.007 35.172 38.076 38.968 41.638 44.181 47.391 49.728
24 9.886 12.401 29.553 33.196 36.415 39.364 40.27 42.98 45.559 48.812 51.179
25 10.52 13.12 30.675 34.382 37.652 40.646 41.566 44.314 46.928 50.223 52.62
26 11.16 13.844 31.795 35.563 38.885 41.923 42.856 45.642 48.29 51.627 54.052
27 11.808 14.573 32.912 36.741 40.113 43.195 44.14 46.963 49.645 53.023 55.476
28 12.461 15.308 34.027 37.916 41.337 44.461 45.419 48.278 50.993 54.411 56.892
29 13.121 16.047 35.139 39.087 42.557 45.722 46.693 49.588 52.336 55.792 58.301
30 13.787 16.791 36.25 40.256 43.773 46.979 47.962 50.892 53.672 57.167 59.703