0% found this document useful (0 votes)

12 views164 pages

Inbound 8969254549211759123

The document consists of lecture notes on Probability and Statistics by Ngo Hoang Long, covering various topics such as probability space, random variables, distributions, fundamental limit theorems, and parameter estimation. It includes detailed sections on definitions, properties, examples, and exercises related to each topic. The content is structured to facilitate understanding of statistical concepts and methods.

Uploaded by

Adrien Marie Legendre

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views164 pages

Inbound 8969254549211759123

Uploaded by

Adrien Marie Legendre

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 164

Lecture notes on

Probability and Statistics

Ngo Hoang Long

Division of Applied Mathematics

Faculty of Mathematics and Informatics
Hanoi National University of Education
Email: ngolong@hnue.edu.vn
Contents

1 Probability Space 5
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Probability space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Properties of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Probabilities on a Finite or Countable Space . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Random Variables and Distributions 29

2.1 Random variables on a countable space . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Random variables on a general probability space . . . . . . . . . . . . . . . . . . . . 32
2.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.2 Structure of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.1 Construction of expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.2 Some limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.3 Some inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.4 Expectation of random variable with density . . . . . . . . . . . . . . . . . . 41
2.5 Random elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3 Density of function of random vectors . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Independent random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.7 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.8 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.8.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.8.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

1
CONTENTS 2

2.8.3 Properties of conditional expectation . . . . . . . . . . . . . . . . . . . . . . 47

2.8.4 Convergence theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.8.5 Conditional expectation given a random variable . . . . . . . . . . . . . . . . 50
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3 Fundamental Limit Theorems 62

3.1 Convergence of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.1 Weak laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.2 Strong laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 Central limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3.1 Characteristic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3.2 Weak convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.3 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1 Convergence of random variables . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.2 Law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.3 Characteristic function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.4 Weak convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4.5 Central limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4 Some useful distributions in statistics 82

4.1 Gamma, chi-square, student and F distributions . . . . . . . . . . . . . . . . . . . . 82
4.1.1 Gamma distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.2 Chi-square distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.3 Student’s t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.4 F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Sample mean and sample variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5 Parameter estimation 88
5.1 Samples and characteristic of sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Data display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2.1 Stem-and-leaf diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2.2 Frequency distribution and histogram . . . . . . . . . . . . . . . . . . . . . . 91
5.2.3 Box plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2.4 Probability plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 Point estimations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.2 Point estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.3 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Method of finding estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
CONTENTS 3

5.4.1 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.4.2 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Lower bound for variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.1 Confidence interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.2 Point estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6.3 Lower bound for variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6 Hypothesis Testing 113

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Method of finding test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2.1 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3 Method of evaluating test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3.1 Most powerful test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3.2 Uniformly most powerful test . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.3.3 Monotone likelihood ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.4 Some well-known tests for a single sample . . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.1 Hypothesis test on the mean of a normal distribution, variance σ 2 known . 122
6.4.2 Hypothesis test on the mean of a normal distribution, variance σ 2 unknown 125
6.4.3 Hypothesis test on the variance of a normal distribution . . . . . . . . . . . 127
6.4.4 Test on a population proportion . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.5 Some well-known tests for two samples . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.5.1 Inference for a difference in means of two normal distributions, variances
known . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.5.2 Inference for the difference in means of two normal distributions, variances
unknown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.5.3 Paired t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5.4 Inference on the variance of two normal populations . . . . . . . . . . . . . 134
6.5.5 Inference on two population proportions . . . . . . . . . . . . . . . . . . . . 135
6.6 The chi-square test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.6.1 Goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.6.2 Tests of independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.6.3 Test of homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.7.1 Significance level and power function . . . . . . . . . . . . . . . . . . . . . . 145
6.7.2 Null distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.7.3 Best critical region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.7.4 Some tests for single sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.7.5 Some tests for two samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.7.6 Chi-squared tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
CONTENTS 4

7 Regression 156
7.1 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.1.1 Simple linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.1.2 Confidence interval for σ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.1.3 Confidence interval for β1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.1.4 Confidence interval for β0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.1.5 Prediction intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Chapter 1

Probability Space

1.1 Introduction
Random experiments are experiments whose output cannot be surely predicted in advance.
But when one repeats the same experiment a large number of times one can observe some ”regu-
larity” in the average output. A typical example is the toss of a coin: one cannot predict the result
of a single toss, but if we toss the coin many times we get an average of about 50% of ”heads” if
the coin is fair. The theory of probability aims towards a mathematical theory which describes
such phenomena. This theory contains three main ingredients:
a) The state space: this is the set of all possible outcomes of the experiment, and it is usually
denoted by Ω.
Examples

1. A toss of a coin: Ω = {H, T }.

2. Two successive tosses of a coin: Ω = {HH, HT, T H, T T }.

3. A toss of two dice: Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.

4. The measurement of a length L, with a measurement error: Ω = R+ , where R+ denotes

the positive real numbers; ω ∈ Ω denotes the result of the measurement, and w − L is the
measurement error.

5. The lifetime of a light-bulb: Ω = R+ .

b) The event: An ”event” is a property which can be observed either to hold or not to hold
after the experiment is done. In mathematical terms, an event is a subset of Ω. If A and B are two
events, then

• the contrary event is interpreted as the complement set Ac ;

1
This is a draft version which contains many errors. Comments are very welcome

5
CHAPTER 1. PROBABILITY SPACE 6

• the event ”A or B” is interpreted as the union A ∪ B;

• the event ”A and B” is interpreted as the intersection A ∩ B;

• the sure event is Ω;

• the impossible event is the empty set ∅;

• an elementary event is a ”singleton*’, i.e. a subset {w} containing a single outcome w of Ω.

We denote by A the family of all events. Often (but not always: we will see why later) we have
A = 2Ω the set of all subsets of Ω. The family A should be ”stable” by the logical operations
described above: if A, B ∈ A then we must have Ac ∈ A, A ∩ B ∈ A, A ∪ B ∈ A, and also Ω ∈ A
and ∅ ∈ A.
c) The probability: With each event A one associates a number denoted by P(A) and called
the ”probability of A”. This number measures the likelihood of the event A to be realized a priori,
before performing the experiment. It is chosen between 0 and 1, and the more likely the event is,
the closer to 1 this number is.
To get an idea of the properties of these numbers, one can imagine that they are the limits of
the ”frequency” with which the events are realized: let us repeat the same experiment n times;
the n outcomes might of course be different (think of n successive tosses of the same die, for
instance). Denote by fn (A) the frequency with which the event A is realized (i.e. the number of
times the event occurs, divided by n). Intuitively we have:

P(A) = limit of fn (A) as n → ∞

(we will give a precise meaning to this ”limit” later). From the obvious properties of frequencies,
we immediately deduce that:

1. 0 ≤ P(A) ≤ 1;

2. P(Ω) = 1;

3. P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.

A mathematical model for our experiment is thus a triple (Ω, A, P), consisting of the space Ω,
the family A of all events, and the family of all P(A) for A ∈ A; hence we can consider that P is
a map from A into [0, 1], which satisfies at least the properties (2) and (3) above (plus in fact an
additional property, more difficult to understand, and which is given later).
A fourth notion, also important although less basic, is the following one:
d) Random variable: A random variable is a quantity which depends on the outcome of the
experiment. In mathematical terms, this is a map from Ω into a space E, where often E = R or
E = Rd . Warning: this terminology, which is rooted in the history of Probability Theory going
back 400 years, is quite unfortunate; a random ”variable” is not a variable in the analytical sense,
but a function!
CHAPTER 1. PROBABILITY SPACE 7

Let X be such a random variable, mapping Ω into E. One can then ”transport” the prob-
abilistic structure onto the target space E, by setting PX (B) = P(X −1 (B)) for B ∈ E, where
X −1 (B) = {w ∈ Ω : X(w) ∈ B} denotes the pre-image of B by X. This formula defines a new
probability, denoted by PX but on the space E instead of Ω. The probability PX is called the law
of the variable X.
Example (toss of two dice): One tosses two fair dice and observers the number of dots ap-
pearing on each dice. The sample space is Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ i ≤ 6}, and it is natural to
take here A = 2Ω and
|A|
P(A) = if A ⊂ Ω,
36
where |A| denotes the number of points in A. One easily verifies the properties (1), (2), (3) above,
and P({w}) = 1/36 for each singleton. The map X : Ω → N defined by X(i, j) = |j − i| is the
random variable ”different of the two dice”, and its law is
number of pairs (i, j) such that |i − j| ∈ B
PX (B) =
36
(for example PX ({1}) = 5/18, PX ({5}) = 1/18, etc . . .). We will formalize the concepts of a prob-
ability space and random variable in following sections.

1.2 Probability space

Let Ω be a non-empty set without any special structure. Let 2Ω denote all subsets of Ω, includ-
ing the empty set denoted by ∅. With A being a subset of 2Ω , we consider the following properties:

1. ∅ ∈ A and Ω ∈ A;

2. If A ∈ A then Ac := Ω\A ∈ A; Ac is called the complement of A;

3. A is closed under finite unions and finite intersections: that is, if A1 , . . . , An are all in A,
then ∪ni=1 and ∩ni=1 Ai are in A as well (for this it is enough that A be stable by the union and
the intersection of any two sets);

4. A is closed under countable unions and intersections: that is, if A1 , A2 . . . is a countable

sequence of events in A then ∪i Ai and ∩i Ai are both also in A.

Definition 1.2.1. A is an algebra if it satisfies (1), (2) and (3) above. It is a σ-algebra, (or a σ-field)
if it satisfies (1), (2), and (4) above.

Definition 1.2.2. If A is a σ-algebra on Ω then (Ω, A) is called a measurable space.

Definition 1.2.3. If C ⊂ 2Ω , the σ-algebra generated by C, and written σ(C), is the smallest σ-
algebra containing C. (It always exists because 2Ω is a σ-algebra, and the intersection of a family
of σ-algebras is again a σ-algebra)
CHAPTER 1. PROBABILITY SPACE 8

Example: (i) A = {∅, Ω} (the trivial σ-algebra).

(ii) If A is a subset; then σ(A) = {∅, A, Ac , Ω}.
(iii) If Ω = Rd the Borel σ-algebra, written B(Rd ), is the σ-algebra generated by all the intervals A
of the following type
A = (−∞, x1 ] × (−∞, x2 ] × . . . × (−∞, xn ],

where x1 , . . . , xn ∈ Q.
We can show that B(Rd ) is also the σ-algebra generated by all open subsets (or by all the
closed subsets) of Rd .

Definition 1.2.4. A probability measure defined on a σ-algebra A of Ω is a function P : A → [0, 1]

that satisfies:

1. P(Ω) = 1;

2. For every countable sequence (An )n≥1 of elements of A, pairwise disjoint (that is, An ∩Am =
∅ whenever n 6= m), one has
[∞ X ∞
P An = P(An ).
n=1 n=1

Axiom (2) above is called countable additivity; the number P(A) is called the probability of
the event A.
In Definition 1.2.4 one might imagine a more elementary condition than (2), namely:

P(A ∪ B) = P(A) + P(B) (1.1)

for any disjoint sets A, B ∈ A.

This property is called additivity (or “finite additivity”) and, by an elementary induction, it
implies that for every finite A1 , . . . , Am of pairwise disjoint events Ai ∈ A, we have
m
[ m
X
P Ai = P(Ai ).
i=1 i=1

Theorem 1.2.5. Let (Ω, A, P) be a probability space. The following properties hold:

(i) P(∅) = 0;

(ii) P is additive.

(iii) P(Ac ) = 1 − P(A);

(iv) If A, B ∈ A and A ⊂ B then P(A) ≤ P(B).

Proof. If in Axiom (2) we take An = ∅ for all n, we see that the number a = P(∅) is equal to an
infinite sum of itself; since 0 ≤ a ≤ 1, this is possible only if a = 0, and we have (i).
For (ii) it sufffices to apply Axiom (2) with A1 = A and A2 = B and A3 = A4 = ... = ∅, plus the
fact that P(∅) = 0, to obtain (1.1).
Applying (1.1) for A ∈ A and B = Ac we get (iii).
CHAPTER 1. PROBABILITY SPACE 9

To show (iv), suppose A ⊂ B then applying (1.1) for A and B\A we have

P(B) = P(A) + P(B\A) ≥ P(A).

1.3 Properties of probability

Countable additivity is not implied by additivity. In fact, in spite of its intuitive appeal, addi-
tivity is not enough to handle the mathematical problems of the theory. The next theorem shows
exactly what is extra when we assume countable additivity instead of just finite additivity.

Theorem 1.3.1. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies P(Ω) = 1 and is additive.
Then the following statements are equivalent:

(i) Axiom (2) of Definition 1.2.4.

(ii) If An ∈ A and An ↓ A, then P(An ) ↓ 0.

(iii) If An ∈ A and An ↓ A, then P(An ) ↓ P(A).

(iv) If An ∈ A and An ↑ Ω, then P(An ) ↑ 1.

(v) If An ∈ A and An ↑ A, then P(An ) ↑ P(A).

Proof. The notation An ↓ A means that An+1 ⊂ An , each n, and ∩∞ i=1 An = A. The notation
∞
An ↑ A means that An ⊂ An+1 and ∪i=1 An = A, ∀n ≥ 2.
(i) ⇒ (v) Let An ∈ A with An ↑ A. We construct a new sequence as follows: B1 = A1 and
Bn = An+1 \An .Then ∪∞
i=1 Bn = A; An = ∪ni=1 Bi and the events (Bn )n≥1 are pairwise disjoint.
Therefore
∞
X
P(A) = P(∪k≥1 Bk ) = Bk .
k=1

Hence
n
X ∞
X
P(An ) = P(Bk ) ↑ P(Bk ) = P(A).
k=1 k=1

(v) ⇒ (i) Let An ∈ A be pairwise disjoint. Define Bn = ∪nk=1 Ak . We have Bn ↑ ∪∞

k=1 Ak . Hence

n
X ∞
X
P(∪∞
k=1 Ak ) = lim P(Bn ) = lim P(Ak ) = P(Ak ).
n→∞ n→∞
k=1 k=1

(v) → (iii): Suppose that An ↓ A. Then Acn ↑ Ac . Hence, we have

P(An ) = 1 − P(Acn ) ↓ 1 − P(Ac ) = P(A).

CHAPTER 1. PROBABILITY SPACE 10

(iii) → (ii) is obvious.

(ii) → (iv): Let An ∈ A with An ↑ Ω. Thus P(Acn ) → 0. Therefore P(An ) = 1 − P(Acn ) ↑ 1.
(iv) → (v): Suppose An ↑ A. Denote Bn = An ∪ Ac . One gets Bn ↑ Ω, it implies P(Bn ) ↑ 1. Hence

P(An ) = P(Bn ) − P(Ac ) ↑ 1 − P(Ac ) = P(A).

If A ∈ 2Ω , we define the indicator function by


1 if w ∈ A
IA (w) =
0 if w 6∈ A.

We can say that An ∈ A converges to A (we write An → A) if limn→∞ IAn (w) = IA (w) for all
w ∈ Ω. Note that if the sequence An increases (resp. decreases) to A, then it also tends to A in the
above sense.

Theorem 1.3.2. Let P be a probability measure and let An be a sequence of events in A which
converges to A. Then A ∈ A and limn→∞ P(An ) = P(A).

Proof. Now let Bn = ∩m≥n Am and Cn = ∪m≤n Am . Then Bn increases to A and Cn decreases to
A, thus lim P(Bn ) = lim P(Cn ) = P(A), by Theorem 1.3.1. However Bn ⊂ An ⊂ Cn , therefore
n→∞ n→∞
P(Bn ) ≤ P(An ) ≤ P(Cn ), so lim P(An ) = P(A) as well.
n→∞

Lemma 1.3.3. Let S be a set. Let I be a π-system on S, that is, a family of subsets of S stable under
finite intersection:
I1 , I2 ∈ I ⇒ I1 ∩ I2 ∈ I.
Let Σ = σ(I). Suppose that µ1 and µ2 are probability measure on (S, Σ) such that µ1 = µ2 on I.
Then µ1 = µ2 on Σ.

Proof. Let

D = {F ∈ Σ : µ1 (F ) = µ2 (F )}.

Then D is a d-system on S, that is, a family of subsets of S satisfied:

a) S ∈ D,
b) if A, B ∈ D and A ⊆ B then B \ A ∈ D,
c) if An ∈ D and An ↑ A, then A ∈ D.

Indeed, the fact that S ∈ D is given. If A, B ∈ D, then

µ1 (B \ A) = µ1 (B) − µ1 (A) = µ2 (B) − µ2 (A) = µ2 (B \ A).

so that B \ A ∈ D. Finally, if Fn ∈ D and Fn ↑ F , then

µ1 (F ) =↑ lim µ1 (Fn ) =↑ lim µ2 (Fn ) = µ2 (F ),

so that F ∈ D.
Since D is a d-system and D ⊇ I then D ⊇ σ(I) = Σ, and the result follows.
CHAPTER 1. PROBABILITY SPACE 11

This lemma implies that if two probability measure agree on a π-system, then they agree on
the σ-algebra generated by that π-system.

Theorem 1.3.4 (Carathéodory’s Extension Theorem). Let S be a set and Σ0 be an algebra on S,

and let Σ = σ(Σ0 ). If µ0 is a countably additive map µ0 : Σ0 → [0, 1], then there exists a unique
measure µ on (S, Σ) such that µ = µ0 on Σ0 .

Proof. Step 1: Let G be the σ−algebra of all subsets of S. For G ∈ G, define

X
λ(G) := inf µ0 (Fn ),
n

where the infimum is taken over all sequences (Fn ) in Σ0 with G ⊆ ∪n Fn .

We now prove that
(a) λ is an outer measure on (S, G).
The facts that λ() = 0 and λ is increasing are obvious. Suppose that (Gn ) is a sequence in G, such
that each λ(Gn ) is finite. Let > 0 be given. For each n, choose a sequence (Fn,k : k ∈ N) of
elements of Σ0 such that
[ X
Gn ⊆ Fn,k , µ0 (Fn,k ) < λ(Gn ) + 2−n .
k k
S SS
Then G := Gn ⊆ Fn,k , so that
n k
XX X
λ(G) ≤ µ0 (Fn,k ) < λ(Gn ) + .
n k n

Since is arbitrary, we have proved result (a).

Step 2: We have λ is a measure on (S, L), where L is a σ-algebra of λ−sets in G. All we need show
is that
(b) Σ0 ⊆ L, and λ = µ0 on σ0 ;
for then Σ := σ(Σ0 ) ⊆ L and we can define µ to be the restriction of λ to (S, Σ).
Step 3: Proof that λ = µ0 on Σ0 .
S
Let F ∈ Σ0 . Then, clearly, λ(F ) ≤ µ0 (F ). Now suppose that F ⊆ Fn , where Fn ∈ Σ0 . As usual,
n
we can define a sequence (En ) of disjoint sets:
[ c
E1 := F1 , En = Fn ∩ Fk
k<n
S S
such that En ⊆ Fn and En = Fn ⊇ F. Then
[ X
µ0 (F ) = µ0 ( (F ∩ En )) = µ0 (F ∩ En ),

by using the countable additivity of µ0 on Σ0 . Hence

X X
µ0 (F ) ≤ µ0 (En ) ≤ µ0 (Fn ),
CHAPTER 1. PROBABILITY SPACE 12

so that λ(F ) ≥ µ0 (F ). Step 3 is complete.

Step 4: Proof that Σ0 ⊆ L. Let E ∈ Σ0 and G ∈ G. Then there exists a sequence (Fn ) in Σ0 such
S
that G ⊆ Fn , and
n X
µ0 (Fn ) ≤ λ(G) + .
n

Now, by definition of λ,
X X X
µ0 (Fn ) = µ0 (E ∩ Fn ) + µ0 (E c ∩ Fn )
n n n
c
≥ λ(E ∩ G) + λ(E ∩ G),

(E ∩ Fn ) and E c ∩ G ⊆ (E c ∩ Fn ). Thus, since is arbitrary,

S S
since E ∩ G ⊆

λ(G) ≥ λ(E ∩ G) + λ(E c ∩ G).

However, since λ is subadditive,

λ(G) ≤ λ(E ∩ G) + λ(E c ∩ G).

We see that E is indeed a λ−set.

CHAPTER 1. PROBABILITY SPACE 13

1.4 Probabilities on a Finite or Countable Space

We suppose that Ω is finite or countable and consider A = 2Ω . Then a probability on Ω is
characterized by its values on the atoms pw = P({w}), w ∈ Ω. Indeed, one can easily verify the
following theorem.

Theorem 1.4.1. Let (pw )w∈Ω be a family of real numbers indexed by the finite or countable set
Ω. Then there exists a unique probability P such that P({w}) = pw if and only if pw ≥ 0 and
P
w∈Ω pw = 1. In this case for any A ⊂ Ω,
X
P(A) = pw .
w∈A

Suppose first that Ω is finite. Any family of nonnegative terms summing up to 1 gives an exam-
ple of a probability on Ω. But among all these examples the following is particularly important:

Definition 1.4.2. A probability P on the finite set Ω is called uniform if pw = P({w}) does not
depend on w.

In this case, it is immediate that

|A| number of outcomes in favor of A
P(A) = = .
|Ω| total number of possible outcomes

Then computing the probability of any event A amounts to counting the number of points in A.
On a given finite set Ω there is one and only one uniform probability.
Example: There are 20 balls in the urn, 10 white 10 red. One draws a set of 5 balls from the
urn. Denote X the number of white ball in the set. We want to find the probability that X = x,
where x is an arbitrary fixed integer.
We label from 1 to 10 for white balls and from 11 to 20 for red balls. Since the balls are drawn
at once, it is natural to consider that an outcome is a subset with 5 elements of the set {1, . . . , 20}
of all 20 balls. That is, Ω is the family of all subsets with 5 points, and the total number of possible
outcomes is |Ω| = C20 5 . Next, it is also natural to consider that all possible outcomes are equally

likely, that is P is the uniform probability on Ω. The quantity X is a “random variable” because
when the outcome w is known, one also knows the number X(w). The possible values of X is
from 0 to 5 and the set X −1 ({x}) = {X = x} contains C10 x C 5−x points for all 0 ≤ x ≤ 5. Hence
10
 x 5−x
 C10 C510 if 0 ≤ x ≤ 5
C20
P(X = x) =
0 otherwise.

We thus obtain, when x varies, the distribution or the law, of X. This distribution is called the
hypergeometric distribution.
CHAPTER 1. PROBABILITY SPACE 14

1.5 Conditional Probability

We have known how to answer questions of the following kind: If there are 5 balls in an urn,
2 white and 3 black, what is the probability P(A) of the event A that a selected ball is white? With
the classical approach, P(A) = 2/5.
The concept of conditional probability, which will be introduced below, let us answer ques-
tions of the following kind: What is the probability that the second ball is white (event B) under
the condition that the first ball was also white (event A)? (We are thinking of sampling without
replacement.) It is natural to reason as follows: if the first ball is white, then at the second step
we have an urn containing 4 balls, of which 1 is white and 3 black; hence it seems reasonable to
suppose that the (conditional) probability in question is 1/4.
In general, computing the probability of an event A, given that an event B occurs, means
finding which fraction of the probability of A is also in the event B.

Definition 1.5.1. Let A, B be events, P(B) > 0. The conditional probability of A given B is
P(A ∩ B)
P(A|B) = .
P(B)
Theorem 1.5.2. Suppose P(B) > 0. The operation A 7→ P(A|B) from A → [0, 1] defines a new
probability measure on A, called the conditional probability measure given B.

Proof. We define Q(A) = P(A|B), with B fixed. We must show Q satisfies (1) and (2) of 1.2.4. But
P(Ω ∩ B) P(B)
Q(Ω) = P(Ω|B) = = = 1.
P(B) P(B)
Therefore, Q satisfies (1). As for (2), note that if (An )n≥1 is a sequence of elements of A which are
pairwise disjoint, then
P((∪∞
n=1 An ) ∩ B) P(∪∞
n=1 (An ∩ B))
Q(∪∞ ∞
n=1 An ) = P(∪n=1 An |B) = =
P(B) P(B)
and also the sequence (An ∩ B)n≥1 is pairwise disjoint as well; thus
∞ ∞ ∞
X P(An ∩ B) X X
Q(∪∞
n=1 An ) = = P(An |B) = Q(An ).
P(B)
n=1 n=1 n=1

Theorem 1.5.3. If A1 , . . . , An ∈ A and if P(A1 ∩ . . . ∩ An−1 ) > 0, then

P(A1 ∩ . . . ∩ An ) = P(A1 )P(A2 |A1 ) . . . P(An |A1 ∩ . . . ∩ An−1 ).

Proof. We use induction. For n = 2, the theorem is simply 1.5.1. Suppose the theorem holds for
n − 1 events. Let B = A1 ∩ . . . ∩ An−1 . Then by 1.5.1 P(B ∩ An ) = P(An |B)P(B); next we replace
P(B) by its value given in the inductive hypothesis:

P(B) = P(A1 )P(A2 |A1 ) . . . P(An−1 |A1 ∩ . . . ∩ An−2 ),

and we get the result.

CHAPTER 1. PROBABILITY SPACE 15

Definition 1.5.4. A collection of events (En ) is called a partition of Ω in A if

1. En ∈ A and P(En ) > 0, each n,

2. they are pairwise disjoint,

3. Ω = ∪n En .

Theorem 1.5.5 (Partition Equation). Let (En )n≥1 be a finite or countable partition of Ω. Then if
A ∈ A,
X
P(A) = P(A|En )P(En ).
n

Proof. Note that

A = A ∩ Ω = ∪n (A ∩ En ).

Since the En are pairwise disjoint so also are (A ∩ En )n≥1 , hence

X X
P(A) = P ∪n (A ∩ En ) = P(A ∩ En ) = P(A|En )P(En ).
n n

Theorem 1.5.6 (Bayes’ Theorem). Let (En ) be a finite or countable partition of Ω, and suppose
P(A) > 0. Then
P(A|En )P(En )
P(En |A) = P .
m P(A|Em )P(Em )

Proof. Applying partition equation, we have that the denominator

X
P(A|Em )P(Em ) = P(A).
m

Hence the formula becomes

P(A|En )P(En ) P(A ∩ En )
= = P(En |A).
P(A) P(A)

Example 1.5.7. Because a new medical procedure has been shown to be effective in the early
detection of an illness, a medical screening of the population is proposed. The probability that
the test correctly identifies someone with the illness as positive is 0.99, and the probability that
the test correctly identifies someone without the illness as negative is 0.95. The incidence of the
illness in the general population is 0.0001. You take the test, and the result is positive. What is
the probability that you have the illness? Let D denote the event that you have the illness, and
let S denote the event that the test signals positive. The probability requested can be denoted as
. The probability that the test correctly signals someone without the illness as negative is 0.95.
Consequently, the probability of a positive test without the illness is

P(S|Dc ) = 0.05.
CHAPTER 1. PROBABILITY SPACE 16

From Bayes’s Theorem,

P(S|D)P(D)
P(D|S) = = 0.002.
P(S|D)P(D) + P(S|Dc )P(Dc )
Surprisingly, even though the test is effective, in the sense that P(S|D) is high and P(S|Dc ) is low,
because the incidence of the illness in the general population is low, the chances are quite small
that you actually have the disease even if the test is positive.

Example 1.5.8. Suppose that Bob can decide to go to work by one of three modes of transporta-
tion, car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a
0.5 chance he will be late. If he goes by bus, which has special reserved lanes but is somtimes
overcrowded, the probability of being late is only 0.2. The commuter train is almost never late,
with a probability of only 0.01, but is more expensive than the bus.
(a) Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove
to work that day by car. Since he does not know which mode of transportation Bod usually uses,
he gives a prior probability of 13 to each of the three possibilities. What is the boss’ estimate of the
probability that Bob drove to work?
(b) Suppose that the coworker of Bob’s knows that he almost always takes the commuter train
to work, never take the bus, but somtimes, 0.1 of the time, takes the car. What is the coworkers
probability that Bob drove to work that day, given that he was late?
We have the following information given in the problem:
1
P(bus) = P(car) = P(train) =
3
P(late|car) = 0.5;
P(late|train) = 0.01;
P(late|bus) = 0.2.

By Bayes Theorem, this is

Repeat the identical calculations as above, but instead of the prior probabilities being 13 , we use
pr(bus) = 0, P(car) = 0.1, and P(train) = 0.9. Plugging in to the same equation with these three
changes, we get P(car|late) = 0.8475.

1.6 Independence
Definition 1.6.1. 1. Two events A and B are independent if

P(A ∩ B) = P(A)P(B).
CHAPTER 1. PROBABILITY SPACE 17

2. A (possibly infinite) collection of events (Ai )i∈I is a pairwise independent collection if for
any distinct elements i1 , i2 ∈ I,

P(Ai1 ∩ Ai2 ) = P(Ai1 )P(Ai2 ).

3. A (possibly infinite) collection of events (Ai )i∈I is an independent collection if for every
finite subset J of I, one has
Y
P(∩i∈J Ai ) = P(Ai ).
i∈J

If events (Ai )i∈I are independent, they are pairwise independent, but the converse is false.

Proposition 1.6.2. a) If A and B are independent, so also are A and B c ; Ac and B; and Ac and B c .
b) If A and B are independent and P(B) > 0, then

P(A|B) = P(A|B c ) = P(A).

Proof. a) A and B c . Since A and B are independent, then P(A ∩ B) = P(A)P(B) = P(A)(1 −
P(B c )) = P(A) − P(A)P(B c ). We have P(A ∩ B) = P(A) − P(A ∩ B c ). Substituting these into the
equation P(A ∩ B) = P(A)P(B), we obtain

P(A) − P(A ∩ B c ) = P(A) − P(A)P(B c ).

Hence

P(A ∩ B c ) = P(A)P(B c ).

Therefore, A and B c are independent.

Ac and B c . We have P(A)P(B c ) = [1 − P(Ac )]P(B c ) = P(B c ) − P(Ac )P(B c ) and P(A ∩ B c ) =
P(B c ) − P(Ac ∩ B c ). Substituting into the equation P(A ∩ B c ) = P(A)P(B c ), we obtain

P(B c ) − P(Ac ∩ B c ) = P(B c ) − P(Ac )P(B c ).

P(Ac ∩ B c ) = P(Ac )P(B c ).

Therefore, Ac and B c are independent.

b) We have
P(A ∩ B) P(A)P(B)
P(A|B) = = = P(A),
P(B) P(B)

and
P(A ∩ B c ) P(A)P(B c )
P(A|B c ) = = = P(A).
P(B c ) P(B c )
CHAPTER 1. PROBABILITY SPACE 18

Examples:

1. Toss a coin 3 times. If Ai is an event depending only on the ith toss, then it is standard to
model (Ai )1≤i≤3 as being independent.

2. One chooses a card at random from a deck of 52 cards. A = ”the card is a heart”, and
B = ”the card is Queen”. A natural model for this experiment consists in prescribing the
probability 1/52 for picking any one of the cards. By additivity, P(A) = 13/52 and P(B) =
4/52 and P(A ∩ B) = 1/52 hence A and B are independent.

3. Let n = {1, 2, 3, 4}, and A = 2Ω . Let P(i) = 1/4, where i = 1, 2, 3, 4. Let A = {1, 2}, B =
{1, 3}, and C = {2, 3}. Then A, B, C are pairwise independent but are not independent.

Exercises

Axiom of Probability

1.1. Give a possible sample space for each of the following experiments:

1. A two-sided coin is tossed.

2. A student is asked for the month of the year and the day of the week on which her birthday
falls.

3. A student is chosen at random from a class of ten students.

4. You receive a grade in this course.

1.2. Let A be a σ-algebra of subsets of Ω and let B is a subset of Ω. Show that F = {A ∩ B : A ∈ A}

is a σ-algebra of subsets of B.

1.3. Let f be a function mapping Ω to another space E with a σ-algebra E. Let A = {A ⊂ Ω :

there exists B ∈ E with A = f −1 (B)}. Show that A is a σ-algebra on Ω.

1.4. Let (Gα )α∈I be an arbitrary family of σ-algebras defined on an abstract space Ω. Show that
H = ∩α∈I Gα is also a σ-algebra.

1.5. Suppose that Ω is an infinite set (countable or not), and let A be the family of all subsets
which are either finite or have a finite complement. Show that A is an algebra, but not a σ-
algebra.

1.6. Give a counterexample that shows that, in general, the union A ∪ B of two σ-algebras need
not be a σ-algebra.

1.7. Let Ω = {a, b, c} be a sample space. Let P({a}) = 1/2, P({b}) = 1/3, and P({c}) = 1/6. Find
the probabilities for all eight subsets of Ω.

1.8. For A, B ∈ A, show

CHAPTER 1. PROBABILITY SPACE 19

1. P(A ∩ B) = P(A) − P(A ∩ B).

2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

3
1.9. Suppose P(A) = 4 and P(B) = 13 . Show that always 1
12 ≤ P(A ∩ B) ≤ 13 .

1.10. Let (Bn ) be a sequence of events such that P(Bn ) = 1 for all n ≥ 1. Show that
\
P Bn = 1.
n

1.11. Let A1 , . . . , An be given events. Show the inclusion-exclusion formula:

X X
P( ∪ni=1 Ai ) = P(Ai ) − P(Ai ∩ Aj )
i i<j
X
P(Ai ∩ Aj ∩ Ak ) − . . . + (−1)n+1 P(A1 ∩ A2 ∩ . . . ∩ An )
i<j<k
P
where (for example) i<j means to sum over all ordered pairs (i, j) with i < j.

1.12. Let Ai ∈ A be a sequence of events. Show that

n
X
P (∪ni=1 Ai ) ≤ P (Ai ),
i=1

each n, and also

∞
X
P (∪∞
i=1 Ai ) ≤ P (Ai ).
i=1

1.13. (Bonferroni Inequalities) Let Ai ∈ A be a sequence of events. Show that

Pn
1. P (∪ni=1 Ai ) ≥
P
i=1 P (Ai ) − i<j P (Ai ∩ Aj ),
Pn
2. P (∪ni=1 Ai ≤
P P
i=1 P (Ai ) − i<j P (Ai ∩ Aj ) + i<j<k P (Ai ∩ Aj ∩ Ak ).

1.14. Let (Ω, A, P) be a probability space. Show for events Bi ⊂ Ai the following inequality
X
P(∪i Ai ) − P(∪i Bi ) ≤ P(Ai ) − P(Bi ) .
i
Pn
1.15. If (Bk ) are events such that k=1 P(Bk ) > n − 1, then
n
\
P Bk ) > 0.
k=1

Classical definition of probability

1.16. In the laboratory analysis of samples from a chemical process, five samples from the pro-
cess are analyzed daily. In addition, a control sample is analyzed two times each day to check the
calibration of the laboratory instruments.
CHAPTER 1. PROBABILITY SPACE 20

1. How many different sequences of process and control samples are possible each day? As-
sume that the five process samples are considered identical and that the two control sam-
ples are considered identical.

2. How many different sequences of process and control samples are possible if we consider
the five process samples to be different and the two control samples to be identical.

3. For the same situation as part (b), how many sequences are possible if the first test of each
day must be a control sample?

1.17. In the design of an electromechanical product, seven different components are to be stacked
into a cylindrical casing that holds 12 components in a manner that minimizes the impact of
shocks. One end of the casing is designated as the bottom and the other end is the top.

1. How many different designs are possible?

2. If the seven components are all identical, how many different designs are possible?

3. If the seven components consist of three of one type of component and four of another
type, how many different designs are possible? (more difficult)

1.18. The design of a communication system considered the following questions:

1. How many three-digit phone prefixes that are used to represent a particular geographic
area (such as an area code) can be created from the digits 0 through 9?

2. As in part (a), how many three-digit phone prefixes are possible that do not start with 0 or
1, but contain 0 or 1 as the middle digit?

3. How many three-digit phone prefixes are possible in which no digit appears more than
once in each prefix?

1.19. A byte is a sequence of eight bits and each bit is either 0 or 1.

1. How many different bytes are possible?

2. If the first bit of a byte is a parity check, that is, the first byte is determined from the other
seven bits, how many different bytes are possible?

1.20. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If four chips are
talmn at random and without replacement, find the probability that: (a) each of the 4 chips is
red; (b) none of the 4 chips is red; (c) there is at least 1 chip of each color.

1.21. Three distinct integers are chosen at random from the first 20 positive integers. Compute
the probability that: (a) their stun is even; (b) their product is even.

1.22. There are 5 red chips and 3 blue chips in a bowl. The red chips are numbered 1, 2, 3, 4,
5, respectively, and the blue chips are numbered 1, 2, 3, respectively. If 2 chips are to be drawn
at random and without replacement, find the probability that these chips have either the same
number or the same color.
CHAPTER 1. PROBABILITY SPACE 21

1.23. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines 5 bulbs, which are
selected at random and without replacement. (a) Find the probability of at least 1 defective bulb
among the 5. (b) How many bulbs should be examined so that the probability of finding at least
1 bad bulb exceeds 0.2 ?

1.24. Three winning tickets are drawn from urn of 100 tickets. What is the probability of winning
for a person who buys:

1. 4 tickets?

2. only one ticket?

1.25. A drawer contains eight different pairs of socks. If six socks are taken at random and with-
out replacement, compute the probability that there is at least one matching pair among these
six socks.

1.26. In a classroom there are n students.

1. What is the probability that at least two students have the same birthday?

2. What is the minimum value of n which secures probability 1/2 that at least two have a
common birthday?

1.27. Four mice are chosen (without replacement) from a litter containing two white mice. The
probability that both white mice are chosen is twice the probability that neither is chosen. How
many mice are there in the litter?

1.28. Suppose there are N different types of coupons available when buying cereal; each box
contains one coupon and the collector is seeking to collect one of each in order to win a prize.
After buying n boxes, what is the probability pn that the collector has at least one of each type?
(Consider sampling with replacement from a population of N distinct elements. The sample size
is n > N . Use inclusion-exclusion formula)

1.29. An absent-minded person has to put n personal letters in n addressed envelopes, and he
does it at random. What is the probability pm,n that exactly m letters will be put correctly in their
envelopes?

1.30. N men run out of a men’s club after a fire and each takes a coat and a hat. Prove that:

a) the probability that no one will take his own coat and hat is
N
X (N − k)!
(−1)k ;
N !k!
k=1

b) the probability that each man takes a wrong coat and a wrong hat is
"N #2
X 1
(−1)k .
k!
k=2
CHAPTER 1. PROBABILITY SPACE 22

1.31. You throw 6n dice at random. Find the probability that each number appears exactly n
times.

1.32. * Mary tosses n + 1 fair coins and John tosses n fair coins. What is the probability that Mary
gets more heads than John?

Conditional Probability

1.33. Bowl I contains 6 red chips and 4 blue chips. Five of these 10 chips are selected at random
and without replacement and put in bowl II, which was originally empty. One chip is then drawn
at random from bowl II. Given that this chip is blue, find the conditional probability that 2 red
chips and 3 blue chips are transferred from bowl I to bowl II.

1.34. You enter a chess tournament where your probability of winning a game is 0.3 against half
the players (call them type 1), 0.4 against a quarter of the players (call them type 2), and 0.5
against the remaining quarter of the players (call them type 3). You play a game against a ran-
domly chosen opponent. What is the probability of winning?

1.35. We roll a fair four-sided die. If the result is 1 or 2, we roll once more but otherwise, we stop.
What is the probability that the sum total of our rolls is at least 4?

1.36. There are three coins in a box. One is a two-headed coin, another is a fair coin, and the
third is a biased coin that comes up heads 75 percent of the time. When one of the three coins
is selected at random and flipped, it shows heads. What is the probability that it was the two-
headed coin?

1.37. Alice is taking a probability class and at the end of each week she can be either up-to-
date or she may have fallen behind. If she is up-to-date in a given week, the probability that she
will be up-to-date (or behind) in the next week is 0.8 (or 0.2, respectively). If she is behind in a
given week, the probability that she will be up-to-date (or behind) in the next week is 0.6 (or 0.4,
respectively). Alice is (by default) up-to-date when she starts the class. What is the probability
that she is up-to-date after three weeks?

1.38. At the station there are three payphones which accept 20p pieces. One never works, an-
other always works, while the third works with probability 1/2. On my way to the metropolis
for the day, I wish to identify the reliable phone, so that I can use it on my return. The station
is empty and I have just three 20p pieces. I try one phone and it does not work. I try another
twice in succession and it works both times. What is the probability that this second phone is the
reliable one?

1.39. An insurance company insure an equal number of male and female drivers. In any given
year the probability that a male driver has an accident involving a claim is α, independently of
other years. The analogous probability for females is β. Assume the insurance company selects
a driver at random.

a) What is the probability the selected driver will make a claim this year?
CHAPTER 1. PROBABILITY SPACE 23

b) What is the probability the selected driver makes a claim in two consecutive years?

c) Let A1 , A2 be the events that a randomly chosen driver makes a claim in each of the first
and second years, respectively. Show that P (A2 |A1 ) ≥ P (A1 ).

d) Find the probability that a claimant is female.

1.40. Three newspapers A, B and C are published in a certain city, and a survey shows that for
the adult population 20% read A, 16% B, and 14% C, 8% read both A and B, 5% both A and C, 4%
both B and C, and 2% read all three. If an adult chosen at random, find the probability that

a) he reads none of these paper;

b) he reads only one of these papers; and

c) he reads at least A and B if is known that he reads at least one paper.

1.41. Customers are used to evaluate preliminary product designs. In the past, 95% of highly
successful products received good reviews, 60% of moderately successful products received good
reviews, and 10% of poor products received good reviews. In addition, 40% of products have been
highly successful, 35% have been moderately successful, and 25% have been poor products.

1. What is the probability that a product attains a good review?

2. If a new design attains a good review, what is the probability that it will be a highly success-
ful product?

3. If a product does not attain a good review, what is the probability that it will be a highly
successful product?

1.42. An inspector working for a manufacturing company has a 99% chance of correctly identi-
fying defective items and a 0.5% chance of incorrectly classifying a good item as defective. The
company has evidence that its line produces 0.9% of nonconforming items.

1. What is the probability that an item selected for inspection is classified as defective?

2. If an item selected at random is classified as nondefective, what is the probability that it is

indeed good?

1.43. A new analytical method to detect pollutants in water is being tested. This new method
of chemical analysis is important because, if adopted, it could be used to detect three different
contaminantsorganic pollutants, volatile solvents, and chlorinated compoundsinstead of having
to use a single test for each pollutant. The makers of the test claim that it can detect high levels of
organic pollutants with 99.7% accuracy, volatile solvents with 99.95% accuracy, and chlorinated
compounds with 89.7% accuracy. If a pollutant is not present, the test does not signal. Sam-
ples are prepared for the calibration of the test and 60% of them are contaminated with organic
pollutants, 27% with volatile solvents, and 13% with traces of chlorinated compounds.
A test sample is selected randomly.
CHAPTER 1. PROBABILITY SPACE 24

1. What is the probability that the test will signal?

2. If the test signals, what is the probability that chlorinated compounds are present?

1.44. Software to detect fraud in consumer phone cards tracks the number of metropolitan areas
where calls originate each day. It is found that 1% of the legitimate users originate calls from two
or more metropolitan areas in a single day. However, 30% of fraudulent users originate calls from
two or more metropolitan areas in a single day. The proportion of fraudulent users is 0.01%. If
the same user originates calls from two or more metropolitan areas in a single day, what is the
probability that the user is fraudulent?

1.45. The probability of getting through by telephone to buy concert tickets is 0.92. For the same
event, the probability of accessing the vendors Web site is 0.95. Assume that these two ways
to buy tickets are independent. What is the probability that someone who tries to buy tickets
through the Internet and by phone will obtain tickets?

1.46. The British government has stepped up its information campaign regarding foot and mouth
disease by mailing brochures to farmers around the country. It is estimated that 99% of Scot-
tish farmers who receive the brochure possess enough information to deal with an outbreak of
the disease, but only 90% of those without the brochure can deal with an outbreak. After the
first three months of mailing, 95% of the farmers in Scotland received the informative brochure.
Compute the probability that a randomly selected farmer will have enough information to deal
effectively with an outbreak of the disease.

1.47. In an automated filling operation, the probability of an incorrect fill when the process is
operated at a low speed is 0.001. When the process is operated at a high speed, the probability of
an incorrect fill is 0.01. Assume that 30% of the containers are filled when the process is operated
at a high speed and the remainder are filled when the process is operated at a low speed.

1. What is the probability of an incorrectly filled container?

2. If an incorrectly filled container is found, what is the probability that it was filled during the
high-speed operation?

1.48. An encryption-decryption system consists of three elements: encode, transmit, and de-
code. A faulty encode occurs in 0.5% of the messages processed, transmission errors occur in
1% of the messages, and a decode error occurs in 0.1% of the messages. Assume the errors are
independent.

1. What is the probability of a completely defect-free message?

2. What is the probability of a message that has either an encode or a decode error?

1.49. It is known that two defective copies of a commercial software program were erroneously
sent to a shipping lot that has now a total of 75 copies of the program. A sample of copies will be
selected from the lot without replacement.
CHAPTER 1. PROBABILITY SPACE 25

1. If three copies of the software are inspected, determine the probability that exactly one of
the defective copies will be found.

2. If three copies of the software are inspected, determine the probability that both defective
copies will be found.

3. If 73 copies are inspected, determine the probability that both copies will be found. Hint:
Work with the copies that remain in the lot.

1.50. A robotic insertion tool contains 10 primary components. The probability that any com-
ponent fails during the warranty period is 0.01. Assume that the components fail independently
and that the tool fails if any component fails. What is the probability that the tool fails during the
warranty period?

1.51. A machine tool is idle 15% of the time. You request immediate use of the tool on five dif-
ferent occasions during the year. Assume that your requests represent independent events.

1. What is the probability that the tool is idle at the time of all of your requests?

2. What is the probability that the machine is idle at the time of exactly four of your requests?

3. What is the probability that the tool is idle at the time of at least three of your requests?

1.52. A lot of 50 spacing washers contains 30 washers that are thicker than the target dimension.
Suppose that three washers are selected at random, without replacement, from the lot.

1. What is the probability that all three washers are thicker than the target?

2. What is the probability that the third washer selected is thicker than the target if the first
two washers selected are thinner than the target?

3. What is the probability that the third washer selected is thicker than the target?

1.53. Continuation of previous exercise. Washers are selected from the lot at random, without
replacement.

1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?

2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?

1.54. The alignment between the magnetic tape and head in a magnetic tape storage system
affects the performance of the system. Suppose that 10% of the read operations are degraded
by skewed alignments, 5% by off-center alignments, 1% by both skewness and offcenter, and the
remaining read operations are properly aligned. The probability of a read error is 0.01 from a
skewed alignment, 0.02 from an off-center alignment, 0.06 from both conditions, and 0.001 from
a proper alignment. What is the probability of a read error.
CHAPTER 1. PROBABILITY SPACE 26

1.55. Suppose that a lot of washers is large enough that it can be assumed that the sampling is
done with replacement. Assume that 60% of the washers exceed the target thickness.

1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?

2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?

1.56. In a chemical plant, 24 holding tanks are used for final product storage. Four tanks are
selected at random and without replacement. Suppose that six of the tanks contain material in
which the viscosity exceeds the customer requirements.

1. What is the probability that exactly one tank in the sample contains high viscosity material?

2. What is the probability that at least one tank in the sample contains high viscosity material?

3. In addition to the six tanks with high viscosity levels, four different tanks contain material
with high impurities. What is the probability that exactly one tank in the sample contains
high viscosity material and exactly one tank in the sample contains material with high im-
purities?

1.57. Plastic parts produced by an injection-molding operation are checked for conformance
to specifications. Each tool contains 12 cavities in which parts are produced, and these parts
fall into a conveyor when the press opens. An inspector chooses 3 parts from among the 12 at
random. Two cavities are affected by a temperature malfunction that results in parts that do not
conform to specifications.

1. What is the probability that the inspector finds exactly one nonconforming part?

2. What is the probability that the inspector finds at least one nonconforming part?

1.58. A bin of 50 parts contains five that are defective. A sample of two is selected at random,
without replacement.

1. Determine the probability that both parts in the sample are defective by computing a con-
ditional probability.

2. Determine the answer to part (a) by using the subset approach that was described in this
section.

1.59. * The Polya urn model is as follows. We start with an urn which contains one white ball and
one black ball. At each second we choose a ball at random from the urn and replace it together
with one more ball of the same color. Calculate the probability that when n balls are in the urn, i
of them are white.
CHAPTER 1. PROBABILITY SPACE 27

1.60. You have n urns, the rth of which contains r − 1 red balls and n − r blue balls, r = 1, . . . , n.
You pick an urn at random and remove two balls from it without replacement. Find the prob-
ability that the two balls are of different colors. Find the same probability when you put back a
removed ball.

1.61. A coin shows heads with probability p on each toss. Let πn be the probability that the
number of heads after n tosses is even. Show that πn+1 = (1 − p)πn + p(1 − πn ) and find πn .

1.62. There are n similarly biased dice such that the probability of obtaining a 6 with each one
of them is the same and equal to p (0 < p < 1). If all the dice are rolled once, show that pn , the
probability that an odd number of 6’s is obtained, satisfies the difference equation

pn + (2p − 1)pn−1 = p,

and hence derive an explicit expression for pn .

1.63. Dubrovsky sits down to a night of gambling with his fellow officers. Each time he stakes u
roubles there is a probability r that he will win and receive back 2u roubles (including his stake).
At the beginning of the night he has 8000 roubles. If ever he has 256000 roubles he will marry the
beautiful Natasha and retire to his estate in the country. Otherwise, he will commit suicide. He
decides to follow one of two courses of action:

(i) to stake 1000 roubles each time until the issue is decided;

(ii) to stake everything each time until the issue is decided.

Advise him (a) if r = 1/4 and (b) if r = 3/4. What are the chances of a happy ending in each case
if he follows your advice?

Independence

1.64. Let the events A1 , A2 , . . . , An be independent and P (Ai ) = p (i = 1, 2, . . . , n). What is the
probability that:

a) at least one of the events will occur?

b) at least m of the events will occur?

c) exactly m of the events will occur?

1.65. Each of four persons fires one shot at a target. Let Ck denote the event that the target is hit
by person k, k = 1, 2, 3, 4. If C1 , C2 , C3 , C4 are independent and if P(C1 ) = P(C2 ) = 0.7, P(C3 ) =
0.9, and P(C4 ) = 0.4, compute the probability that (a) all of them hit the target; (b) exactly one
hits the target; (c) no one hits the target; (d) at least one hits the target.

1.66. The probability of winning on a single toss of the dice is p. A starts, and if he fails, he passes
the dice to B, who then attempts to win on her toss. They continue tossing the dice back and
forth until one of them wins. What are their respective probabilities of winning?
CHAPTER 1. PROBABILITY SPACE 28

1.67. Two darts players throw alternately at a board and the first to score a bull wins. On each of
their throws player A has probability pA and player B pB of success; the results of different throws
are independent. If A starts, calculate the probability that he/she wins.

1.68. * A fair coin is tossed until either the sequence HHH occurs in which case I win or the
sequence T HH occurs, when you win. What is the probability that you win?

1.69. Let A1 , . . . , An be independent events, with P(Ai ) < 1. Prove that there exists an event B
with P(B) > 0 such that B ∩ Ai = ∅ for 1 ≤ i ≤ n.

1.70. n balls are placed at random into n cells. Find the probability pn that exactly two cells
remain empty.

1.71. An urn contains b black balls and r red balls. One of the balls is drawn at random, but when
it is put back in the urn c additional balls of the same color are put in with it. Now suppose that
we draw another ball. Show that the probability that the first ball drawn was black given that the
second ball drawn was red is b/(b + r + c).

1.72. Suppose every packet of the detergent TIDE contains a coupon bearing one of the letters
of the word TIDE. A customer who has all the letters of the word gets a free packet. All the letters
have the same possibility of appearing in a packet. Find the probability that a housewife who
buys 8 packets will get:

a) one free packet,

b) two free packets.

Chapter 2

Random Variables and Distributions

2.1 Random variables on a countable space

2.1.1 Definitions

Throughout this section we suppose that Ω is countable and A = 2Ω . A random variable X in

this case is defined as a map from Ω into R. A random variable stands for an observation of the
outcome of a random event. Before the random event we may know the range of X but we do
not know its exact value until the random event happens. The distribution of a random variable
X is defined by

PX (B) = P({w : X(w) ∈ B}) = P(X −1 (B)) = P[X ∈ B], B ∈ B(R).

Since the set Ω is countable, the range of X is also countable. Suppose that X(Ω) = {x1 , x2 , . . .}.
Then the distribution of X is completely determined by the following numbers pX i = P(X =
xi ), i ≥ 1. Indeed, for any event A ∈ A,
X X
PX (A) = P[X = xi ] = pX
i .
xi ∈A xi ∈A

Definition 2.1.1. Let X be a real-valued random variable on a countable space Ω. Suppose that
X(Ω) = {x1 , x2 , . . .}. The expectation of X, denoted E[X], is defined to be
X X
E[X] := xi P[X = xi ] = xi pX
i
i i

provided this sum makes sense: this is the case when at least one of the following conditions is
satisfied

1. Ω is finite;
X
P
2. Ω is countable and the series i xi p i absolutely convergence;

3. X ≥ 0 always (in this case, the above sum and hence E[X] as well may take value +∞.

29
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 30

Remark 1. Since Ω is countable, we denote pw the probability that the elementary event w ∈ Ω
happens. Then the expectation of X is given by
X
E[X] = X(w)pw .
w∈Ω

Let L1 denote the space of all random variables with finite expectation defined on (Ω, A, P).
The following facts are straightforward from the definition of expectation.

Theorem 2.1.2. Let X, Y ∈ L1 . The following statements hold:

1. L1 is a vector space over R and

E[aX + bY ] = aE[X] + bE[Y ], ∀ a, b ∈ R;

2. If X ≥ 0 then EX ≥ 0. Moreover, if X ≥ Y then E[X] ≥ E[Y ];

3. If Z is a bounded random variable then Z ∈ L1 . Furthermore, if Z 0 ∈ L1 and |Z 0 | ≤ X ∈ L1

then Z 0 ∈ L1 ;

4. If X = IA is the indicator function of an event A, then E[X] = P(A);

5. Let ϕ : R → R. Then
X X
E[ϕ(X)] = ϕ(xi )pX
i = ϕ(X(w))pw
i w∈Ω

if the above series is absolutely convergent.

Remark 2. If E[X 2 ] = 2 X
P
i xi p i < ∞, then
X 1X 1
E[|X|] = |xi |pX
i ≤ (|xi |2 + 1)pX 2
i = (E(X ) + 1) < ∞.
2 2
i i

Definition 2.1.3. Variation of a random variable X is defined to be

DX = E[(X − E[X])2 ]

It follows from the linearity of expectation operator that

DX = E[X 2 ] − (E[X])2 .

Hence X 2
X
DX = x2i pX
i − xi pX
i .
i i
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 31

2.1.2 Examples

Poisson distribution

X has a Poisson distribution with parameter λ > 0, denoted X ∼ P oi(λ), if X(Ω) = {0, 1, . . .}
and
e−λ λk
P[X = k] = , k = 0, 1, . . .
k!
The expectation of X is
∞ ∞
X e−λ λk −λ
X λj
E[X] = k = λe = λeλ e−λ = λ.
k! j!
k=0 j=0

A similar calculation gives us the variance of X,

DX = λ.

Bernoulli distribution

X is Bernoulli with parameter p ∈ [0, 1], denoted X ∼ Ber(p), if it takes only two values 0 and
1 and
P[X = 1] = 1 − P[X = 0] = p.
X corresponds to an experiment with only two outcomes, usually called “success” (X = 1) and
“failure” (X = 0). The expectation and variance of X are

E[X] = p, DX = p(1 − p).

Binomial distribution

X has Binomial distribution with parameters p ∈ [0, 1] and n ∈ N, denoted X ∼ B(n, p), if X
takes on the values {0, 1, . . . , n} and

P[X = k] = Cnk pk (1 − p)n−k , k = 0, 1, . . . , n.

One has
n
X n
X
E[X] = kP[X = k] = kCnk pk (1 − p)n−k
k=0 k=0
n
X
k−1 k−1
= np Cn−1 p (1 − p)n−k = np,
k=1

and
n
X n
X
E[X 2 ] = k 2 P[X = k] = k 2 Cnk pk (1 − p)n−k
k=0 k=0
n
X n
X
k−2 k−2 k−1 k−1
= n(n − 1)p2 Cn−2 p (1 − p)n−k + np Cn−1 p (1 − p)n−k
k=2 k=1
2
= n(n − 1)p + np,

thus DX = np(1 − p).

CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 32

Geometric distribution

One repeatedly performs a sequence of independent Bernoulli trials until achieving the first
sucesses. Let X denote the number of failures before reaching the first success. X has a Geomet-
ric distribution with parameter q = 1 − p ∈ [0, 1], denoted X ∼ Geo(q),

P[X = k] = q k p, k = 0, 1, . . .

where p is the probability of success. Then we have

∞ ∞
X X q
E[X] = kP[X = k] = kpq k = .
p
k=0 k=0

Moreover, one can easily shows that

q
DX = .
p2

2.2 Random variables on a general probability space

2.2.1 Definition

Let (Ω, A) be a measurable space and B(R) the σ-algebra Borel on R.

Definition 2.2.1. A function X : Ω → R is called A-measurable if

X −1 (B) := {w : X(w) ∈ B} ∈ A for all B ∈ B(R).

An A-measurable function X is called a random variable.

Theorem 2.2.2. Let X : Ω → R. The following statements are equivalent

1. X is a random variable.

2. {w : X(w) ≤ a} ∈ A for all a ∈ R.

Proof. Claim (1) ⇒ (2) is self-evident, we will prove: (2) ⇒ (1). Let

C = {B ∈ B(R) : X −1 (B) ∈ A}.

We have C is a σ-algebra and it contains all sets with the form (−∞, a] for every a ∈ R. Thus C
contains B(R). On the other hand, C ⊂ B(R), so C = B(R). This concludes our proof.

Example 2.2.3. Let (Ω, A) be a measurable space. For each subset B of Ω one can verifies that
IB is a random variable iff B ∈ A. More general, if xi ∈ R and Ai ∈ A for all i belongs to some
P
countable index set I, then X(w) = i∈I xi IAi (w) is also a random variable. We call such random
variable X discrete random variable. When I is finite then X is called simple random variable.

Definition 2.2.4. A function ϕ : Rd → R is called Borel measurable if X −1 (B) ∈ B(Rd ) for all
B ∈ B(R).
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 33

Remark 3. It implies from the above definition that every continuous function is Borel. Conse-
quently, all the functions (x, y) 7→ x + y, (x, y) 7→ xy, (x, y) 7→ x/y, (x, y) 7→ x ∨ y, (x, y) 7→ x ∧ y
are Borel, where x ∨ y = max(x, y), x ∧ y = min(x, y).

Theorem 2.2.5. Let X1 , . . . , Xd be random variables defined on a measurable space (Ω, A) and
ϕ : Rd → R a Borel function. Then Y = ϕ(X1 , . . . , Xd ) is also a random variable.

Proof. Let: X(w) = (X1 (w), . . . , Xd (w)) is the function on (Ω, A) and takes values in Rd . For every
a1 , . . . , ad ∈ R we have:
d
Y \d
X −1 (−∞, ai ] = {w : Xi (w) ≤ ai } ∈ A.
i=1 i=1

This implies X −1 (B) ∈ A for every B ∈ B(Rd ). Hence, for every C ∈ B(Rd ), B := ϕ−1 (C) ∈ B(Rd ).
Thus,
Y −1 (C) = X −1 (ϕ−1 (C)) ∈ A,

i.e. Y is the random variable.

Corollary 2.2.6. If X and Y are random variables, so also are X ± Y, XY, X ∧ Y, X ∨ Y, |X|, X + :=
X ∨ 0, X − = (−X) ∨ 0 and X/Y (if Y 6= 0).

Theorem 2.2.7. If X1 , X2 , . . . are random variables then so are supn Xn , inf n Xn , lim supn Xn , lim inf n Xn

It follows from Theorem 2.2.7 that if the sequence of random variables (Xn )n≥1 point-wise
converges to X, i.e. Xn (w) → X(w) for all w ∈ Ω, then X is a random variable.

2.2.2 Structure of random variables

Theorem 2.2.8. Let X be a random variable defined on a probability space (Ω, A).

1. There exists a sequence of discrete random variables which uniformly point-wise converges
to X.

2. If X is non-negative then there exists a sequence of simple random variables Yn such that
Yn ↑ X.

Proof. 1. For each n ≥ 1, denote Xn (w) = nk if nk ≤ X(w) < k+1n for some k ∈ Z. Xn is a
1
discrete random variable and |Xn (w) − X(w)| ≤ n for every w ∈ Ω. Hence, the sequence
(Xn ) converges uniformly in w to X.

2. Suppose that X ≥ 0. For each n ≥ 1, denote Yn (w) = 2kn if 2kn ≤ X(w) < k+1 2n for some
n n n
k ∈ {0, 1, . . . , n2 − 1} and Yn (w) = 2 if X(w) ≥ 2 . We can easily verify that the sequence
of simple random variables (Yn ) satisfying Yn (w) ↑ X(w) for all w ∈ Ω.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 34

Definition 2.2.9. Let X be a random variable defined on a measurable space (Ω, A).

σ(X) := {X −1 (B) : B ∈ B(R)}

is called σ-algebra generated by X.

Theorem 2.2.10. Let X be a random variable defined on a measurable space (Ω, A) and Y a func-
tion Ω → R. Then Y is σ(X)-measurable iff there exists a Borel function ϕ : R → R such that
Y = ϕ(X).

Proof. The sufficient condition is evident. We prove the necessary condition. Firstly, suppose
Y is a discrete random variable taking values y1 , y2 , . . . Since Y is σ(X)-measurable, sets An =
{w : Y (w) = yn } ∈ σ(X). By definition of σ(X), there exists a sequence Bn ∈ B(R) such that
An = X −1 (Bn ). Denote
i=1 Bi ∈ B(R), n ≥ 1.
Cn = Bn \ ∪n−1

We have sets Cn are pairwise disjoint and X −1 (Cn ) = An for every n. Consider the Borel function
ϕ defined by
X
ϕ(x) = yn ICn (x),
n≥1

we have Y = ϕ(X).
In general case, by Theorem 2.2.8, there exists a sequence of discrete σ(X)-measurable func-
tions Yn which uniformly converges to Y . Thus, there exists Borel functions ϕn such that Yn =
ϕn (X). Denote
B = {x ∈ R : ∃ lim ϕn (x)}.
n

Clearly, B ∈ B(R) and B ⊃ X(Ω). Let: ϕ(x) = limn ϕn (x)IB (x). We have Y = limn Yn =
limn ϕn (X) = ϕ(X).

2.3 Distribution Functions

2.3.1 Definition

Definition 2.3.1. Let X be a real valued random variable.

FX (x) = P[X < x], x ∈ R,

is called distribution funtion of X.

One can verifies that F = FX satisfies the following properties

1. F is non-decreasing: if x ≤ y then F (x) ≤ F (y);

2. F is left continuous and has right limit at any point;

3. limx→−∞ F (x) = 0, limx→+∞ F (x) = 1.

CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 35

On the other hand, for any function F : R → [0, 1] satisfying the these three conditions there
exists a (unique) probability measure µ on (R, B(R)) such that F (x) = µ((−∞, x)), for all x ∈ R
(See [13], section 2.5.2).
If X and Y has the same distribution function we say X and Y are equal in distribution and
d
denote X = Y .

Definition 2.3.2. If the distribution function FX has the form

Z a
FX (a) = P[X < a] = fX (x)dx, ∀a ∈ R
−∞

we say that X has a density function f .

The density function f = fX has the following properties:

R +∞
1. f (x) ≥ 0 for all x ∈ R and −∞ f (x)dx = 1.
Rb
2. P[a < X < b] = a f (x)dx for any a < b. Moreover, for any A ∈ B(R), it holds
Z
P[X ∈ A] = f (x)dx. (2.1)
A

As a consequence we see that if X has a density then P[X = a] = 0 for all a ∈ R.

2.3.2 Examples

Uniform distribution U [a, b]


1

b−a if a ≤ x ≤ b,
f (x) =
0 otherwise,
is called the Uniform distribution on [a, b] and denoted by U [a, b]. The distribution function cor-
responds to f is 


 0 if x < a,

F (x) = x−a b−a if a ≤ x ≤ b,


1 if x > b.


Exponential distribution Exp(λ)

Suppose λ > 0. X has a exponential distribution with rate λ, denoted X ∼ Exp(λ), if X takes
values in (0, ∞) and its density is given by

fX (x) = λ−1 e−x/λ I(0,∞) (x).

The distribution function of X is

FX (x) = (1 − λ−1 e−x/λ )I(0,∞) (x).

CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 36

Normal distribution N(a, σ 2 )

1 (x−a)2
f (x) = √ e− 2σ 2 , x ∈ R,
2πσ 2
is called the Normal distribution with mean a and variance σ 2 and denoted by N(a, σ 2 ). When
a = 0 and σ 2 = 1, N(0, 1) is called the Standard normal distribution.

Gamma distributiton G(α, λ)

xα−1 e−x/λ
fX (x) = I (x)
Γ(α)λα (0,∞)
is called the Gamma distribution with parameters α, λ(α, λ > 0); Γ denotes the gamma function
R∞
Γ(α) = 0 xα−1 e−x dx. In particular, an Exp(λ) distribution is G(1, λ) distribution. The gamma
distribution is frequently a probability model for waiting times; for instance, in life testing, the
waiting time until ”death” is a random variable which is frequently modeled with a gamma dis-
tribution.

2.4 Expectation

2.4.1 Construction of expectation

Definition 2.4.1. Let X be a simple random variable which can be written in the form
n
X
X= ai IAi (2.2)
i=1

where ai ∈ R and Ai ∈ A for all i = 1, . . . , n. Expectation of X (or integration of X with respect to

probability measure P) is defined to be
n
X
E[X] := ai P(Ai ).
i=1

Denote Ls = Ls (Ω, A, P) the set of simple random variable. It should be noted that a simple
random variable has of course many different representations of the form (2.2). However, E[X]
does not depend on the particular representation chosen for X.
Let X and Y be in Ls . We can write
n
X n
X
X= ai IAi , and Y = bi IAi .
i=1 i=1

for some subsets Ai which form a measurable partition of Ω. Then for any α, β ∈ R, αX + βY is
also in Ls and
Xn
αX + βY = (ai + bi )IAi .
i=1
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 37

Thus E[αX + βY ] = αE[X] + βE[Y ]; that is expectation is linear on Ls . Furthermore, expectation

is a positive operator, i.e., if X ≤ Y , we have ai ≤ bi for all i, and thus E[X] ≤ E[Y ].
Next we define expetation for non-negative random variables. For X non-negative, i.e. X(Ω) ⊂
[0, ∞], denote
E[X] = sup{EY : Y ∈ Ls and 0 ≤ Y ≤ X}. (2.3)

This supremum always exists in [0, ∞]. It follows from the positivity of expectation operator that
the definition above for E[X] coincides with Definition 2.4.1 on Ls .
Note that EX ≥ 0 but it may happen that EX = +∞ even when X is never equal to +∞.

Definition 2.4.2. 1. A random variable X is called integrable if E[|X|] < ∞. In this case, its
expectation is defined to be
E[X] = E[X + ] − E[X − ]. (2.4)
R R
We also write E[X] = X(w)dP(w) = XdP.

2. If E[X + ] and E[X − ] are not both equal to +∞ then the expectation of X is still defined and
given by (2.4) where we use the convention that +∞ + a = +∞ and −∞ + a = −∞ for any
a ∈ R.

If X ≥ 0 then X = X + and X − = 0. Therefore Definition 2.4.2 again coincides with defi-

nition (2.3) on set of non-negative random variables. We denote by L1 = L1 (Ω, A, P) the set of
integrable random variables.

Lemma 2.4.3. Let X be a non-negative random variable and (Xn )n≥1 a sequence of simple ran-
dom variables increasing to X. Then E[Xn ] ↑ E[X] (even if E[X] = ∞).

Proof. We have (EXn )n≥1 is the increasing sequence and upper bounded by EX by Definition
(2.3) so (EXn )n≥1 is convergent to a with a ≤ EX. To prove a = EX, we only show that for every
simple random variable Y satisfying 0 ≤ Y ≤ X, we have EY ≤ a.
Indeed, suppose Y takes m different values y1 , . . . , ym . Let Ak = {w : Y (w) = yk }. For each
∈ (0, 1], consider the sequence Yn, = (1 − )Y I{(1−)Y ≤Xn } . We have Yn, is simple random
variable, Yn, ≤ Xn so
EYn, ≤ EXn ≤ a for every n. (2.5)

On the other hand, Y ≤ limn Xn so for every w ∈ Ω, there exists n = n(w) such that (1 − )Y (w) ≤
Xn (w), i.e. Ak ∩ {w : (1 − )Y (w) ≤ Xn (w)} → Ak as n → ∞. We have
m
X
EYn, = (1 − ) yk P Ak ∩ [(1 − )Y ≤ Xn ]
k=1
Xm
→ (1 − ) yk P(Ak ) = (1 − )EY, as n → ∞.
k=1

Asscociate with (2.5), we have (1 − )EY ≤ a for every ∈ (0, 1], i.e. EY ≤ a.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 38

Theorem 2.4.4. 1. L1 is a vector space on R and expectation is an linear operator on L1 , i.e.,

for any X, Y ∈ L1 and x, y ∈ R, one has αX + βY ∈ L1 and

E(αX + βY ) = αEX + βEY.

2. If 0 ≤ X ≤ Y and Y ∈ L1 then X ∈ L1 and EX ≤ EY .

Proof. Statement 2 follows exactly from equation 2.3. To prove statement 1, firstly we remark
that if X and Y are two non-negative random variables and α, β ≥ 0, by Theorem 2.2.8 there exist
two increasing non-negative sequences (Xn ) and (Yn ) in Ls converging to X and Y respectively.
Hence, αXn + βYn are also simple non-negative random variables, and convege to αX + βY .
Applying linear and non-negative properties of expectation operator on Ls and Lemma 2.4.3, we
have E(αX + βY ) = αEX + βEY.
Now we prove Theorem 2.4.4. Consider two random variables X, Y ∈ L1 . Since |αX + βY | ≤
|α||X| + |β||Y |, αX + βY ∈ L1 . We have: if α > 0,

E(αX) = E((αX)+ ) − E((αX)− ) = E(α(X + )) − E(α(X − )) = αE(X + ) − αE(X − ) = αEX.

Similarly to α < 0, we also have

E(αX) = E((αX)+ ) − E((αX)− ) = E(−α(X − )) − E(−α(X + )) = −αE(X − ) + αE(X + ) = αEX,

i.e.
E(αX) = αE(X) for every α ∈ R. (2.6)
On the other hand, let Z = X + Y we have Z + − Z − = X + Y = X + + Y + − (X − + Y − ), so
Z + + X − + Y − = Z − + X + + Y + . Thus E(Z + ) + E(X − ) + E(Y − ) = E(Z − ) + E(X + ) + E(Y + ), then

EZ = E(Z + ) − E(Z − ) = E(X + ) + E(Y + ) − E(X − ) − E(Y − ) = EX + EY.

Asscociate with (2.6) we obtain

E(αX + βY ) = E(αX) + E(βY ) = αEX + βEY.

An event A happens almost surely if P(A) = 1. Thus we say X equals Y almost surely if
P[X = Y ] = 1 and denote X = Y a.s.
Corollary 2.4.5. 1. If Y ∈ L1 and |X| ≤ Y , then X ∈ L1 .

2. If X ≥ 0 a.s. and E(X) < ∞, then X < ∞ a.s.

3. If E(|X|) = 0 then X = 0 a.s.

Proof. 2) Let A = {w : X(w) = ∞}. For every n, we have X(w) ≥ X(w)IA (w) ≥ nIA (w) so
E(X) ≥ nP(A) for every n. Thus P(A) ≤ E(X)n → 0 as n → ∞. From this, we have P(A) = 0.
3) Let An = {w : |X(w)| ≥ 1/n}. We have (An )n≥1 is the decreasing sequence and P(X 6= 0) =
limn→∞ P(An ). Moreover,
1
IA (w) ≤ |X(w)|IAn (w) ≤ |X(w)|
n n
so P(An ) ≤ nE|X| = 0 for every n. Thus P(A) = 0 i.e. X = 0 a.s.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 39

Theorem 2.4.6. Let X and Y be integrable random variables. If X = Y a.s. then E[X] = E[Y ].

Proof. Firstly, we consider the case: X and Y are non-negative. Let A = {w : X(w) 6= Y (w)}. We
have P(A) = 0. Moreover,

EY = E(Y IA + Y IAc ) = E(Y IA ) + E(Y IAc ) = E(Y IA ) + E(XIAc ).

Suppose (Yn ) is a sequence of simple random variables increasing to Y . Hence, (Yn IA ) is also a
sequence of simple random variables increasing to (Y IA ). Suppose for each n ≥ 1, the random
variable Yn is bouned by Nn , so

0 ≤ E(Yn IA ) ≤ E(Nn IA ) = Nn P(A) = 0

for each n. Hence E(Y IA ) = 0. Similarly, E(XIA ) = 0. Thus EY = EX.

In general case, from X = Y a.s. we can easily find that X + = Y + and X − = Y − a.s.. Thus,
we also have EX = E(X + ) − E(X − ) = E(Y + ) − E(Y − ) = EY.

2.4.2 Some limit theorems

Theorem 2.4.7 (Monotone convergence theorem). If the random variables Xn are non-negative
and increasing a.s. to X, then limn→∞ E[Xn ] = E[X] (even if E[X] = ∞).

Proof. For each n, let (Yn,k )k≥1 be a sequence of simple random variables increasing to Xn and
let Zk = maxn≤k Yn,k . Then (Zk )k≥1 is the sequence of simple non-negative random variables,
and thus there exists Z = limk→∞ Zk . Also

Yn,k ≤ Zk ≤ X ∀n ≤ k

which implies that

Xn ≤ Z ≤ X a.s.

Next let n → ∞ we have Z = X a.s. Since expectation is a positive operator, we have

EYn,k ≤ EZk ≤ EXk ∀n ≤ k.

Fix n and let k → ∞, using Lemma 2.4.3 we obtain

EXn ≤ EZ ≤ lim EXk .

k→∞

Now let n → ∞ to obtain

lim EXn ≤ EZ ≤ lim EXk .
n→∞ k→∞

Since the left and right sides are the same, X = Z a.s., we deduce the result.

Theorem 2.4.8 (Fatou’s lemma). If the random variables Xn satisfy Xn ≥ Y a.s for all n and some
Y ∈ L1 no . Then
E[lim inf Xn ] ≤ lim inf E[Xn ].
n→∞ n→∞
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 40

Proof. Firstly we prove Theorem to the case Y = 0. Let Yn = inf k≥n Xk . We have (Yn ) is the
sequence of non-decreasing random variables and

lim Yn = lim inf Xn .

n→∞ n→∞

Since Xn ≥ Yn , we have EXn ≥ EYn . Asscociate with monotone convergence theorem to the
sequence Yn , we obtain

lim inf EXn ≥ lim EYn = E( lim Yn ) = E(lim inf Xn ).

n→∞ n→∞ n→∞ n→∞

The general case follows from appling the above result to the sequence of non-negative random
variables X̂n := Xn − Y.

Theorem 2.4.9 (Lebesgue’s dominated convergence theorem). If the random variables Xn con-
verge a.s. to X and supn |Xn | ≤ Y a.s. for some Y ∈ L1 . We have X, Xn ∈ L1 and

lim E[|Xn − X|] = 0.

n→∞

Proof. Since |X| ≤ Y , X ∈ L1 . Let Zn = |Xn − X|. Since Zn ≥ 0 and −Zn ≥ −2Y , applying Fatou
Lemma to Zn and −Zn , we obtain

0 = E(lim inf Zn ) ≤ lim inf EZn ≤ lim sup EZn = − lim inf E(−Zn ) ≤ −E(lim inf (−Zn )) = 0.
n→∞ n→∞ n→∞ n→∞ n→∞

Thus limn→∞ EZn = 0 i.e. limn→∞ E(|Xn − X|) = 0.

2.4.3 Some inequalities

Theorem 2.4.10. 1. (Cauchy-Schwarz’s inequality) If X, Y ∈ L2 then XY ∈ L1 and

|E(XY )|2 ≤ E(X 2 )E(Y 2 ). (2.7)

2. L2 ⊂ L1 and if X ∈ L2 , then (EX)2 ≤ E(X 2 ).

3. L2 is a vector space on R, i.e., for any X, Y ∈ L2 and α, β ∈ R, we have αX + βY ∈ L2 .

Proof. If E(X 2 )E(Y 2 ) = 0 then XY = 0 a.s. Thus |E(XY )|2 = E(X 2 )E(Y 2 ) = 0.
p
If E(X 2 )E(Y 2 ) 6= 0, applying the inequality 2|ab| ≤ a2 + b2 for a = X/ E(X 2 ) and b =
p
Y / E(Y 2 ) and then taking expectation for two sides, we obtain
XY X2 Y2
2E p ≤E + E = 2.
E(X 2 )E(Y 2 ) E(X 2 ) E(Y 2 )

Hence we have (2.7).

Applying (2.7) for Y = 1 we obtain the second claim. The third claim follows from (2.7) and
the linearity of expectation.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 41

If X ∈ L2 , we denote
DX = E[(X − EX)2 ].
DX is called the variance of X. Using the linearity of expectation operator, one can verify that
DX = E(X 2 ) − (EX)2 .
Theorem 2.4.11. 1. (Markov’s inequality) Suppose X ∈ L1 , then for any a > 0, it holds
E(|X|)
P(|X| ≥ a) ≤ .
a
2. (Chebyshev’s inequality) Suppose X ∈ L2 , then for any a > 0, it holds
DX
P(|X − EX| ≥ a) ≤ .
a2
Proof. 1) Since aI{|X|≥a} (w) ≤ |X(w)|I{|X|≥a} (w) ≤ |X(w)| for every w ∈ Ω. Taking expectation
for two sides, we obtain aP(|X| ≥ a) ≤ E(|X|).
2) Applying Markov’s inequality’, we have
DX
P(|X − EX| ≥ a) = P(|X − EX|2 ≥ a2 ) ≤ .
a2

2.4.4 Expectation of random variable with density

Theorem 2.4.12. Suppose that X has a density function f . Let h : R → R be a Borel function. We
have Z
E(h(X)) = h(x)f (x)dx.

provided that either E(|h(X)|) < ∞ or h non-negative.

Proof. Firstly, we consider the case h ≥ 0. Then there exists a sequence of simple non-negative
Borel functions (hn ) increasing to h. Suppose hn = ki=1 ai IAni for ani ∈ R+ and Ani ∈ B(R) for
P n n

every i. By monotone convergence theorem

kn
X
E(h(X)) = E(lim hn (X)) = lim E(hn (X)) = lim hn (ani )P[X ∈ Ani ].
n n n
i=1

Applying the property (2.1) and monotone convergence theorem, we obtain

Xkn Z Z Z
n
E(h(X)) = lim hn (ai ) f (x)dx = lim f (x)hn (x)dx = f (x)h(x)dx.
n An n
i=1 i
R
Thus, if h is non-negative, we usually have E(h(X)) = h(x)f (x)dx.
In general case, applying above result for h+ and h− we deduce this proof.

Example 2.4.13. Let X ∼ Exp(1). Applying Theorem 2.4.12 for h(x) = x and h(x) = x2 respec-
tively, we have Z ∞
EX = xe−x dx = 1,
0
and Z ∞
EX 2 = x2 e−x dx = 2.
0
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 42

2.5 Random elements

2.5.1 Definitions

Definition 2.5.1. Let (E, E) be a measure space. A function X : Ω → E is called A/E-measurable

or random element if X −1 (B) ∈ A for all B ∈ E. The function

PX (B) = P(X −1 (B)), B ∈ E,

is called probablity distribution of X on (E, E).

When (E, E) = (Rd , B(Rd )), we call X a random vector.

Let X = (X1 , . . . , Xd ) be a d-dimensional random vector defined on (Ω, A, P). The distribu-
tion function of X is defined by

F (x) = P[X < x] = P[X1 < x1 , . . . , Xd < xd ], x ∈ Rd .

We can easily verify that F satisfying the following properties:

1. 0 ≤ F (x) ≤ 1 for all x ∈ Rd .

2. limxk →−∞ F (x) = 0 for all k = 1, . . . , d.

3. limx1 →+∞,...,xd →+∞ F (x) = 1.

4. F is left continuous.

The random vector X has a density f : Rd → R+ if f is a non-negative Borel measurable

function satisfying Z
F (x) = f (u)du, for any x ∈ Rd .
u<x
This equation implies that
Z
P[X ∈ B] = f (x)dx, vi mi B ∈ B(Rd ).
B

In particular, we have
Z Z
P[X1 ∈ B1 ] = P[X ∈ B1 × R d−1
]= f (x1 , . . . , xd )dx2 . . . dxd dx1 for all B1 ∈ B(Rd ).
B1 Rd−1

This implies that if X = (X1 , . . . , Xd ) has a density f then X1 also has a density given by
Z
fX1 (x1 ) = f (x1 , x2 , . . . , xd )dx2 . . . dxd , for all x1 ∈ R. (2.8)
Rd−1

A similar argument as the proof Theorem 2.4.12 yields,

Theorem 2.5.2. Let X = (X1 , . . . , Xd ) be a random vector which has density function f , ϕ : Rd →
R a Borel measurable function. We have
Z
E[ϕ(X)] = ϕ(x)f (x)dx
Rd
R
provided that ϕ is non-negative or Rd |ϕ(x)|f (x)dx < ∞.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 43

2.5.2 Example

Multivariate normal distribution

Let a = (a1 , . . . , ad ) be a d-dimensional vector and M = (mi,j )di,j=1 a d × d-square matrix.

Suppose that M is symmetric and positive define. Denote A = M −1 . The random vector X =
(X1 , . . . , Xd ) has normal distribution N(a, M ) if its density p verifies
√
det A n 1
∗
o
p(x) = exp − (x − a)A(x − a) ,
(2π)d/2 2

where (x − a)A(x − a)∗ = i j aij (xi − ai )(xj − aj ).

P P

Polynomial distribution

The d-dimensional random vector X has a polynomial distribution with parameters n, p1 , . . . , pd ,

denoted by X ∼ M U T (n; p1 , . . . , pd ), for n ∈ N∗ and p1 , . . . , pd ≥ 0, if

n! kd+1
P[X1 = k1 , . . . , Xd = kd ] = pk11 pk22 . . . pd+1 ,
k1 !k2 ! . . . kd+1 !

where pd+1 = 1 − (p1 + . . . + pd ), 0 ≤ ki ≤ n and kd+1 = n − (k1 + . . . + kd ) ≥ 0.

2.5.3 Density of function of random vectors

Using Theorem 2.5.2 and the change of variables formula we have the following useful result.

Theorem 2.5.3. Let X = (X1 , . . . , Xn ) have a joint density f . Suppose g : Rn → Rn is continuously

differentiable and injective, with Jacobian given by

∂gi
Jg (x) = (x)
∂xj i,j=1,...,d

never vanishes. Then Y = g(X) has density


f (g −1 (y))| det J −1 (y)| if y ∈ g(Rd )
X g
fY (y) =
0 otherwise.

2.6 Independent random variables

Definition 2.6.1. 1. The sub-σ-algebras (Ai )i∈I of A are independent if for all finite subset J
of I and for all Ai ∈ Ai ,
Y
P(∩i∈J Ai ) = P(Ai ).
i∈J

2. The (Ei , Ei )-valued random variables (Xi )i∈I are independent if the σ-algebras (Xi−1 (Ei ))i∈I
are independent.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 44

A class C of subsets of Ω is called a π-system if is closes under finite intersections, so that

A ∩ B ∈ C implies A, B ∈ C. Furthermore, a class D is a λ-system if contains Ω and is closed
under proper differences and increasing limits, i.e.,

• A1 , A2 , . . . ∈ D with An ↑ A implies A ∈ D;

• A, B ∈ D with A ⊂ B implies B\A ∈ D.

Lemma 2.6.2 (Monotone classes). Let C, D be classes of subsets of Ω where C is a π-system and D
is a λ-system such that C ⊂ D. Then σ(C) ⊂ D.

Lemma 2.6.3. Let G and F be sub-σ-algebras of A. Let G1 and F1 be π-systems such that σ(G1 ) = G
and σ(F1 ) = F. Then G is independent of F if F1 and G are independent, i.e.,

P(F ∩ G) = P(F )P(G), F ∈ F1 , G ∈ G1 .

Proof. Suppose that F1 and G1 are independent. We fix any F ∈ F1 and define

σF = {G ∈ G : P(F ∩ G) = P(F )P(G)}.

Then σF is a λ-system containing π-system G1 . Applying monotone classes theorem, we have

σF = G, it means that
P(F ∩ G) = P(F )P(G), F ∈ F1 , G ∈ G.

Next, for any G ∈ G we define

σG = {F ∈ F : P(F ∩ G) = P(F )P(G)}.

We also have that σG is a λ-system containing π-system F1 so that σG = F, which yields the
desired property.

Theorem 2.6.4. Let X and Y be two random variables. The following statements are equivalent:

(i) X is independent of Y ;

(ii) FX,Y (x, y) = FX (x)FY (y) for all x, y ∈ R;

(iii) f (X) and g(Y ) are independent for any Borel functions f, g : R → R;

(iv) E[f (X)g(Y )] = E[f (X)]E[g(Y )] for any Borel function f, g : R → R which are either positive
or bounded.

Proof. (i) ⇒ (ii): Suppose X be independent of Y , then two events {w : X(w) < x} v {w : Y (w) <
y} are also independent for every x, y ∈ R. We have (ii).
(ii) ⇒ (i): Since the set of events {w : X(w) < x}, x ∈ R, is a π-system generating σ(X) and
{w : X(w) < y}, y ∈ R, is a π-system generating σ(Y ) , so applying Lemma 2.6.3 we have X is
independent of Y .
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 45

(i) ⇒ (iii): For every A, B ∈ B(R), we have f −1 (A), g −1 (B) ∈ B(R) then

P(f (X) ∈ A, g(Y ) ∈ B) = P(X ∈ f −1 (A), Y ∈ g −1 (B))

= P(X ∈ f −1 (A))P(Y ∈ g −1 (B)) = P(f (X) ∈ A)P(g(Y ) ∈ B).

Thus, f (X) is independent of g(Y ).

(iii) ⇒ (i): We choose f (x) = g(x) = x.
(i) ⇒ (iv): Since (i) is equivalent of (iii), we only prove

E(XY ) = E(X)E(Y ) for every random variable which is integrable or non-negative X and Y.

Firstly, we suppose that: X and Y are non-negative. By Theorem 2.2.8 there exists a sequence of
simple random variables Xn = ki=1 ai IAi increasing to X and Yn = lj=1
P n Pn
bj IBj increasing to Y
for Ai ∈ σ(X) v Bi ∈ σ(Y ). Applying monotone convergence theorem, we have
kn X
X ln kn X
X ln
E(XY ) = lim E(Xn Yn ) = lim ai bj P(Ai Bj ) = lim ai bj P(Ai )P(Bj )
n→∞ n→∞ n→∞
i=1 j=1 i=1 j=1
kn
X ln
X
= lim ai P(Ai ) bj P(Bj ) = lim E(Xn )E(Yn ) = E(X)E(Y ).
n→∞ n→∞
i=1 j=1

In general case, we write X = X + − X − v Y = Y + − Y − . Since (iii), we have X ± are independent

of Y ± , then

E(XY ) = E(X + Y + ) + E(X − Y − ) − E(X + Y − ) − E(X − Y + )

= E(X + )E(Y + ) + E(X − )E(Y − ) − E(X + )E(Y − ) − E(X − )E(Y + ) = E(X)E(Y ).

(iv) ⇒ (i): Choose f = I(−∞,x) and g = I(−∞,y) .

(iv) ⇒ (v) is evident.

2.7 Covariance
Definition 2.7.1. The covariance of random variables X, Y ∈ L2 is defined by

cov(X, Y ) = E[(X − EX)(Y − EY )].

The correlation coefficient of X, Y ∈ L2 is

cov(X, Y )
ρ(X, Y ) = √ .
DXDY
Using the linearity of expectation operator, we have

cov(X, Y ) = E(XY ) − EXEY.

√
Furthermore, it follows from Cauchy-Schwarz’s inequality that |cov(X, Y )| ≤ DXDY , it means

|ρ(X, Y )| ≤ 1.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 46

Example 2.7.2. Let X and Y be independent random variables whose distributions are N (0, 1).
Denote Z = XY and T = X − Y . We have

cov(Z, T ) = E(XY (X − Y )) − E(XY )E(X − Y ) = 0,

and
cov(Z, T 2 ) = E(XY (X − Y )2 ) − E(XY )E((X − Y )2 ) = −2,

since E(XY ) = EXEY = 0, E(X 3 Y ) = E(X 3 )EY = 0, E(XY 3 ) = EXE(Y 3 ) = 0 and E(X 2 Y 2 ) =
E(X 2 )E(Y 2 ) = 1. Thus Z and T are uncorrelated random variables but not independent.

Proposition 2.7.3. Let (Xn )n≥1 be a sequence of pair-wise uncorrelated random variables. Then

D(X1 + . . . + Xn ) = D(X1 ) + . . . + D(Xn ).

Proof. We have
h 2 i
D(X1 + . . . + Xn ) = E (X1 − EX1 ) + . . . (Xn − EXn )
n
X X
= E[(Xi − EXi )2 ] + 2 E[(Xi − EXi )(Xj − EXj )]
i=1 1≤i<j≤n
Xn n
X
= E[(Xi − EXi )2 ] = D(Xi ),
i=1 i=1

since E[(Xi − EXi )(Xj − EXj )] = E(Xi Xj ) − E(Xi )E(Xj ) = 0 for any 1 ≤ i < j ≤ n.

2.8 Conditional Expectation

2.8.1 Definition

Definition 2.8.1. Let (Ω, A, P) be a probability space and X an integrable random variable. Let G
be a sub-σ-algebra of A. Then there exists a random variable Y such that

1. Y is G-measurable,

2. E[|Y |] < ∞,

3. for every set G ∈ G, we have Z Z

Y dP = XdP.
G G

Moreover, if Z is another random variable with these properties then P[Z = Y ] = 1. Y is called a
version of the conditional expectation E[X|G] of X given G, and we write Y = E[X|G], a.s.

We often write E[X|Z1 , Z2 , . . .] for E[X|σ(Z1 , Z2 , . . .)].

CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 47

2.8.2 Examples

Example 2.8.2. Let X be an integrable random variable and G = σ(A1 , . . . , Am ) where (Ai )1≤i≤m
is a measurable partition of Ω. Suppose that P(Ai ) > 0 for all i = 1, . . . , m. Then
n Z
X 1
E(X|G) = XdP IAi .
P(Ai ) Ai
i=1

Example 2.8.3. Let X and Z be random variables whose joint density is fX,Z (x, z). We know
R
that fZ (z) = R fX,Z (x, z)dx is density of Z. Define the elementary conditional density fX|Z of X
given Z by 
 fX,Z (x,z) if f (z) 6= 0,
fZ (z) Z
fX|Z (x|z) :=
0 otherwise.
Let h be a Borel function on R such that E[|h(X)|] < ∞. Set
Z
g(z) = h(x)fX|Z (x|z)dx.
R

Then Y = g(Z) is a version of the conditional expectation E[h(X)|Z].

Indeed, for a typical element of σ(Z) which has the form {w : Z(w) ∈ B}, where B ∈ B, we
have
Z Z Z
E[h(X)IB (Z)] = h(x)IB (z)fX,Z dxdz = g(z)IB (z)fZ (z)dz = E[g(Z)IB (Z)].

2.8.3 Properties of conditional expectation

Theorem 2.8.4. Let ξ, η be integrable random variables defined on (Ω, F, P). Let G be a sub-σ-
algebras of F.

1. If c is a constant, then E[c|G] = c a.s.

2. If ξ ≥ η a.s. then E(ξ|G) ≥ E(η|G) a.s.

3. If a, b are constants, then

E(aξ + bη|G) = aE(ξ|G) + bE(η|G).

4. If G = {∅, Ω}, then E(ξ|G) = E(ξ) a.s.

5. E(ξ|F) = ξ a.s.

6. E(E(ξ|G)) = E(ξ).

7. (Tower property) Let G1 ⊂ G2 be sub-σ-algebras of F then

E(E(ξ|G1 )|G2 ) = E(E(ξ|G2 )|G1 ) = E(ξ|G1 ) a.s.

8. If ξ is independent of G, then E(ξ|G) = E(ξ) a.s.

CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 48

9. If η is G-measurable and E(|ξη|) < ∞, then

E(ξη|G) = ηE(ξ|G) a.s.

10. Let H be a sub-σ-algebras of F which is independent of σ(G, ξ), then

E ξ|σ(G, H) = E(ξ|G)

a.s.

Proof. 1. Statement 1 is evident.

2. Since ξ ≥ η a.s. so A ξdP ≥ A ηdP for every A ∈ G. Hence, A E(ξ|G)dP ≥ A E(η|G)dP for every
R R R R

A ∈ G. Thus, E(ξ|G) ≥ E(η|G) a.s.

3. If A ∈ G, Z Z Z
(aξ + bη)dP = a ξdP + b ηdP
A A A
Z Z Z
=a E(ξ|G)dP + b E(η|G)dP = (aE(ξ|G) + bE(η|G))dP
A A A
From this, we have proof.
4. Since Eξ is measurable with respect to σ-algebra G = {∅, Ω} and if A = ∅ or A = Ω, we have
Z Z
ξdP = EξdP ⇒ E(ξ|G) = E(ξ) a.s.
A A

5. Statement 5 is evident.
6. Using Definition 2.8.1 for G = Ω, we have:
Z Z
E(ξ|G)dP = ξdP ⇒ E(E(ξ|G)) = Eξ a.s.
Ω Ω

7. If A ∈ G, we have: Z Z Z
E[E(ξ|G2 )|G1 ]dP = E(ξ|G2 )dP = ξdP.
A A A
From this and Definition 2.8.1, the first equation is proven. The second one follows from State-
ment 5 and remark that E(ξ|G1 ) is G2 -measurable.
8. If A ∈ G, X and IA are independent. Hence, we have:
Z Z
ξdP = E(ξIA ) = Eξ.P(A) = (Eξ)dP ⇒ E(ξ|G) = E(ξ) a.s.
A A

9. First suppose that ξ and η are non-negative. For η = IA , where A is G-measurable, B ∩ A ∈ G,

so that, by the defining relation,
Z Z Z Z
ηE(ξ|G)dP = E(ξ|G)dP = ξdP = ξηdP,
A B∩A B∩A B

which proves the desired relation for indicators, and hence for simple random variables. Next,
if {ηn , n ≥ 1} are simple random variables, such that ηn ↑ η almost surely as n → ∞ , it follows
that ηn ξ ↑ ηξ and ηn E(ξ|G) ↑ ηE(ξG|) almost surely as n → ∞, from which the conclusion follows
by monotone convergence. The general case follows by the decomposition ξ = ξ + − ξ − and
η = η+ − η−.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 49

2.8.4 Convergence theorem

Let (ξn ), ξ and η be random variables defined on (Ω, F, P). Let G be a sub-σ-algebras of F.

Theorem 2.8.5 (Monotone convergence theorem). a) Suppose that ξn ↑ ξ a.s. and there exists a
positive integer m such that E(ξm− ) < ∞. Then, E(ξ |G) ↑ E(ξ|G) a.s.
n
+ ) < ∞, then E(ξ |G) ↓
b) Suppose that ξn ↓ ξ a.s. and there exists a positive integer m such that E(ξm n
E(ξ|G) a.s.

Proof. Suppose Eξn−0 < ∞. Hence 0 ≤ ξn + ξn−0 ↑ ξ + ξn−0 , by Theorem ??

Z Z
−
lim E(ξn + ξn0 |G)dP = lim E(ξn + ξn−0 |G)dP
A n n A
Z Z Z
= lim (ξn + ξn−0 )dP = lim(ξn + ξn−0 )dP = (ξ + ξn−0 )dP
n A A n A
By linearity we have
Z Z Z
lim E(ξn |G)dP = ξdP = E(ξ|G)dP, ∀A ∈ G
A n A A

and then
lim E(ξn |G) = E(ξ|G) a.s.
n
Similarly to claim (b).

Theorem 2.8.6 (Fatou’s lemma). a) If ξn ≤ η, ∀n ≥ 1 a.s., and E(η) < ∞ then

lim sup E(ξn |G) ≤ E(lim sup ξn |G), a.s.

n n

b) If ξn ≥ η, ∀n ≥ 1 a.s., and E(η) > −∞, then

lim inf E(ξn |G) ≥ E(lim inf ξn |G), a.s.

n n

c) If |ξn | ≤ η, ∀n ≥ 1 a.s., and E(η) < ∞, then

E(lim inf ξn |G) ≤ lim inf E(ξn |G) ≤ lim sup E(ξn |G) ≤ E(lim sup ξn |G), a.s.
n n n n

Theorem 2.8.7 (Lebesgue’s dominated convergence theorem). Suppose that E(η) < ∞, |ξn | ≤ η
a.s.
a.s., and ξn −→ ξ. Then,

lim E(ξn |G) = E(ξ|G) and lim E(|ξn − ξ| G) = 0, a.s.

n n

The proofs of Fatou’s lemma and Lebesgue’s dominated convergence theorem are analogous
in a similar vein to the proofs of Fatou’s lemma and the Dominated convergence theorem without
conditioning.

Theorem 2.8.8 (Jensen’s inequality). Let ϕ : R → R be a convex function such that ϕ(ξ) is inte-
grable. Then
E(ϕ(ξ)|G) ≥ ϕ(E(ξ|G)), a.s.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 50

Proof. A result in real analysis is that if ϕ : R → R is convex, then ϕ(x) = supn (an x + bn ) for a
countable collection of real numbers (an , bn ). Then

E(an ξ + bn |G) = an E(ξ|G) + bn .

2.8.5 Conditional expectation given a random variable

Since E(ξ|η) is σ(η)-measurable, there exists a measurable function f : R → R such that

E(ξ|η) = f (η). We denote f (x) = E(ξ|η = x).
R
a) Since E(ξ) = E(f (η)) = R f (x)dFη (x),
Z
E(ξ) = E(ξ|η = x)dFη (x). (2.9)
R

b) Let ϕ : R → R be a Borel function such that both ξ and ξϕ(η) are integrable. Then, the equation

E(ξϕ(η)|η = y) = ϕ(y)E(ξ|η = y)

holds Pη -a.s.
c) If ξ and η are independent, then

E(ξ|η = y) = E(ξ).

Moreover, let ϕ : R2 → R satisfy E|ϕ(ξ, η)| < ∞, then

E(ϕ(ξ, η)|η = y) = E(ϕ(ξ, y)) (Pη − a.s.). (2.10)

2.9 Exercises

Discrete random variables

2.1. An urn contains five red, three orange, and two blue balls. Two balls are randomly selected.
What is the sample space of this experiment? Let X represent the number of orange balls se-
lected. What are the possible values of X? Calculate expectation and variance of X.

2.2. An urn contains 7 white balls numbered 1,2,...,7 and 3 black ball numbered 8,9,10. Five balls
are randomly selected, (a) with replacement, (b) without replacement. For each of cases (a) and
(b) give the distribution:

1. of the number of white balls in the sample;

2. of the minimum number in the sample;

CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 51

3. of the maximum number in the sample;

4. of the minimum number of balls needed for selecting a white ball.

2.3. A machine normally makes items of which 4% are defective. Every hour the producer draws
a sample of size 10 for inspection. If the sample contains no defective items he does not stop
the machine. What is the probability that the machine will not be stopped when it has started
producing items of which 10% are defective.

2.4. Let X represent the difference between the number of heads and the number of tails ob-
tained when a fair coin is tossed n times. What are the possible values of X? Calculate expecta-
tion and variance of X.

2.5. An urn contains N1 white balls and N2 black balls; n balls are drawn at random, (a) with re-
placement, (b) without replacement. What is the expected number of white balls in the sample?

2.6. A student takes a multiple choice test consisting of two problems. The first one has 3 possi-
ble answers and the second one has 5. The student chooses, at random, one answer as the right
one from each of the two problems. Find:

a) the expected number, E(X) of the right answers X of the student;

b) the V ar(X).

2.7. In a lottery that sells 3,000 tickets the first lot wins $1,000, the second $500, and five other
lots that come next win $100 each. What is the expected gain of a man who pays 1 dollar to buy
a ticket?

2.8. A pays 1 dollar for each participation in the following game: three dice are thrown; if one
ace appears he gets 1 dollar, if two aces appear he gets 2 dollars and if three aces appear he gets
8 dollars; otherwise he gets nothing. Is the game fair, i.e., is the expected gain of the player zero?
If not, how much should the player receive when three aces appear to make the game fair?

2.9. Suppose a die is rolled twice. What are the possible values that the following random vari-
ables can take on?

1. The maximum value to appear in the two rolls.

2. The minimum value to appear in the two rolls.

3. The sum of the two rolls.

4. The value of the first roll minus the value of the second roll.

2.10. Suppose X has a binomial distribution with parameters n and p ∈ (0, 1). What is the most
likely outcome of X?
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 52

2.11. An airline knows that 5 percent of the people making reservations on a certain flight will
not show up. Consequently, their policy is to sell 52 tickets for a flight that can hold only 50
passengers. What is the probability that there will be a seat available for every passenger who
shows up?

2.12. Suppose that an experiment can result in one of r possible outcomes, the ith outcome
having probability pi , i = 1, . . . , r, ri=1 pi = 1. If n of these experiments are performed, and if the
P

outcome of any one of the n does not affect the outcome of the other n1 experiments, then show
that the probability that the first outcome appears x1 times, the second x2 times, and the rth xr
times is
n!
px1 px2 · · · pxr r
x1 !x2 ! · · · xr ! 1 2
when x1 + x2 + . . . + xr = n. This is known as the multinomial distribution.

2.13. A television store owner figures that 50 percent of the customers entering his store will
purchase an ordinary television set, 20 percent will purchase a color television set, and 30 percent
will just be browsing. If five customers enter his store on a certain day, what is the probability that
two customers purchase color sets, one customer purchases an ordinary set, and two customers
purchase nothing?

2.14. Let X be Geometric. Show that for i, j > 0,

P[X > i + j|X > i] = P[X > j].

2.15. If a fair coin is successively flipped, find the probability that a head first appears on the fifth
trial.

2.16. A coin having probability p of coming up heads is successively flipped until the rth head
appears. Argue that X, the number of flips required, will be n, n ≥ r, with probability

r−1 r
P[X = n] = Cn−1 p (1 − p)n−r , n ≥ r.

This is known as the negative binomial distribution. Find the expectation and variance of X.

2.17. A fair coin is independently flipped n times, k times by A and n − k times by B. Show that
the probability that A and B flip the same number of heads is equal to the probability that there
are a total of k heads.

2.18. Suppose that we want to generate a random variable X that is equally likely to be either 0
or 1, and that all we have at our disposal is a biased coin that, when flipped, lands on heads with
some (unknown) probability p. Consider the following procedure:

1. Flip the coin, and let 01 , either heads or tails, be the result.

2. Flip the coin again, and let 02 be the result.

3. If 01 and 02 are the same, return to step 1.

CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 53

4. If 02 is heads, set X = 0, otherwise set X = 1.

(a) Show that the random variable X generated by this procedure is equally likely to be either
0 or 1.

(b) Could we use a simpler procedure that continues to flip the coin until the last two flips are
different, and then sets X = 0 if the final flip is a head, and sets X = 1 if it is a tail?

2.19. Consider n independent flips of a coin having probability p of landing heads. Say a changeover
occurs whenever an outcome differs from the one preceding it. For instance, if the results of the
flips are HHT HT HHT , then there are a total of five changeovers. If p = 1/2, what is the proba-
bility there are k changeovers?

2.20. Let X be a Poisson random variable with parameter λ. What is the most likely outcome of
X?

2.21. * Poisson Approximation to the Binomial Let P be a Binomial probability with probability
of success p and number of trial n. Let λ = np. Show that

λk λ −k

λ n n−1 n−k+1
P (k successes) = 1− ... 1− .
k! n n n n n

Let n → ∞ and let p change so that λ remains constant. Conclude that for small p and large n,

λk −λ
P (k successes) = e , where λ = pn.
k!
2.22. * Let X be the Binomial B(n, p).

a) Show that for λ > 0 and ε > 0 then

P (X − np > nε) ≤ E[exp(λ(X − np − nε))].

b) With a > 0 show that

p
X p(1 − p) p √
P (| − p |> a) ≤ 2
min{ p(1 − p, a n}.
n a n

2.23. Let X be Poisson (λ).

2λλ e−λ
a) With λ a positive integer. Show E{|X − λ|} = (λ−1)! ,

b) Show for r = 2, 3, 4, . . .,
E{X(X − 1) . . . (X − r + 1)} = λr .

2.24. Let X be Geometric (p).

n o p
1
a) Show E 1+X = log (1 − p) p−1 .

r!pr
b) Show for r = 2, 3, 4, . . ., E{X(X − 1) . . . (X − r + 1)} = (1−p)r .
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 54

2.25. Suppose X takes all its values in N = {0, 1, 2, . . .}. Show that
∞
X
E[X] = P[X > n].
n=0

The following exercises use the additivity of expectation

2.26. Liam’s bowl of spaghetti contains n strands. He selects two ends at random and joins them
together. He does this until there are no ends left. What is the expected number of spaghetti
hoops in the bowl?

2.27. Sarah collects figures from cornflakes packets. Each packet contains one figure, and n
distinct figures make a complete set. Find the expected number of packets Sarah needs to collect
a complete set.

2.28. Each packet of the breakfast cereal Soggies contains exactly one token, and tokens are
available in each of the three colours blue, white and red. You may assume that each token
obtained is equally likely to be of the three available colours, and that the (random) colours of
different tokens are independent. Find the probability that, having searched the contents of k
packets of Soggies, you have not yet obtained tokens of every colour.
Let N be the number of packets required until you have obtained tokens of every colour.
Show that E[N ] = 11
2 .

2.29. Each box of cereal contains one of 2n different coupons. The coupons are organized into n
pairs, so that coupons 1 and 2 are a pair, coupons 3 and 4 are a pair, and so on.
Once you obtain one coupon from every pair, you can obtain a prize. Assuming that the
coupon in each box is chosen independently and uniformly at random from the 2n possibilities,
what is the expected number of boxes you must buy before you can claim the prize?

Continuous random variables

2.30. The amount of bread (in hundreds of kilos) that a bakery sells in a day is a random variable
with density 
cx

 for 0 ≤ x < 3,

f (x) = c(6 − x) for 3 ≤ x < 6,


0 otherwise.


a) Find the value of c which makes f a probability density function.

b) What is the probability that the number of kilos of bread that will be sold in a day is, (i)
more than 300 kilos? (ii) between 150 and 450 kilos?

c) Denote by A and B the events in (i) and (ii), respectively. Are A and B independent events?
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 55

2.31. Suppose that the duration in minutes of long-distance telephone conversations follows an
exponential density function:
1
f (x) = e−x/5 for x > 0.
5
Find the probability that the duration of a conversation:

a) will exceed 5 minutes;

b) will be between 5 and 6 minutes;

c) will be less than 3 minutes;

d) will be less than 6 minutes given that it was greater than 3 minutes.

2.32. A number is randomly chosen from the interval (0;1). What is the probability that:

a) its first decimal digit will be a 1;

b) its second decimal digit will be a 5;

c) the first decimal digit of its square root will be a 3?

2.33. The height of men is normally distributed with mean µ=167 cm and standard deviation
σ=3 cm.

a) What is the percentage of the population of men that have height, (i) greater than 167 cm,
(ii) greater than 170 cm, (iii) between 161 cm and 173 cm?

b) In a random sample of four men what is the probability that:

i) all will have height greater than 170 cm;

ii) two will have height smaller than the mean (and two bigger than the mean)?

2.34. Find the constant k and the mean and variance of the population defined by the probability
density function
f (x) = k(1 + x)−3 for 0 ≤ x < ∞

and zero otherwise.

2.35. A mode of a distribution of one random variable X is a value of x that maximizes the pdf
or pmf. For X of the continuous type, f (x) must be continuous. If there is only one such x, it is
called the mode of the distribution. Find the mode of each of the following distributions

1. f (x) = 12x2 (1 − x), 0 < x < 1, zero elsewhere.

2. f (x) = 12 x2 e−x , 0 < x < ∞, zero elsewhere.

1
2.36. A median of a distribution of a random variable X is a value x such that P[X < x] ≤ 2 and
P[X ≤ x] ≥ 12 . Find the median of each of the following distribution:
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 56

1. f (x) = 3x2 , 0 < x < 1, zero elsewhere.

1
2. f (x) = π(1+x2 )
.

2.37. Let 0 < p < 1. A (100p)th percentile (quantile of order p) of the distribution of a random
variable X is a value ζp such that

P[X < ζp ] ≤ p, and P[X ≤ ζp ] ≥ p.

Find the pdf f (x), the 25th percentile and the 60th percentile for each of the the followin cdfs.

1. F (x) = (1 + ex )−1 , −∞ < x < ∞.

−x
2. F (x) = e−e , −∞ < x < ∞.

3. F (x) = 1
2 + 1
π tan−1 (x), −∞ < x < ∞.

2.38. If X is a random variable with the probability density function f , find the probability den-
sity function of Y = X 2 if
2
(a) f (x) = 2xe−x , for 0 ≤ X < ∞

(b) f (x) = (1 + x)/2, for −1 ≤ X ≤ 1

(c) f (x) = 21 , for − 21 ≤ X ≤ 32 .

2.39. Let X be a standard normal random variable. Denote Y = eX .

1. Find the density of Y . This is known as log-normal distribution.

2. Find the expectation and variance of Y .

2.40. Let X be a uniform distribution U (0, 1). Find the density of the following random variable.

1. Y = − λ1 ln(1 − X).
X
2. Z = ln 1−X . This is known as Logistic distribution.
q
1
3. T = 2 ln 1−X . This is known as Rayleigh distribution.

2.41. Let X have the uniform distribution U (− π2 , − π2 ).

1. Find the pdf of Y = tan X. This is the pdf of a Cauchy distribution.

2. Show that Y is not integrable.

3. Denote Z = (X ∨ (−a)) ∧ a for some a > 0. Find E[Z].

2.42. Let X be a random variable with distribution function F that is continuous. Show that
Y = F (X) is uniform.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 57

2.43. Let F be a distribution function that is continuous and is such that the inverse function
F −1 exists. Let U be uniform on [0, 1]. Show that X = F −1 (U ) has distribution function F .

2.44. 1. Let X be a non-negative random variable satisfying E[X α ] < ∞ for some α > 0. Show
that Z ∞
E[X α ] = α xα−1 (1 − F (x))dx.
0

2. Let Y be a continuous random variable. Show that

Z +∞
E[Y ] = (P[Y > t] − P[Y < −t])dt.
0
Rb
2.45. Suppose that the density function of X satisfies a f (x)dx = 1 for some real constants a < b.
2
Show that a < E[X] < b and DX ≤ (b−a) 4 .

2.46. Let X be a nonnegative random variable with mean µ and variance σ 2 , both finite. Show
that for any b > 0,
1
P[X ≥ µ + bσ] ≤ .
1 + b2
[(x−µ)b+σ]2
Hint: consider the function g(x) = σ 2 (1+b2 )2
.

2.47. Let X be a random variable with mean µ and variance σ 2 , both finite. Show that for any
d > 1,
1
P[µ − dσ < X < µ + dσ] ≥ 1 − 2 .
d
2.48. Divide a line segment into two parts by selecting a point at random. Find the probability
that the larger segment is at least three times the shorter. Assume a uniform distribution.

2.49. Let X be an integrable random variable.

1. Let (An ) be a sequence of events such that limn P(An ) = 0. Show that limn→∞ E[XIAn ] = 0.

2. Show that for any > 0, there exists a δ > 0 such that for any event A satisfying P(A) < δ,
E[XIA ] < .

2.50. Let (Xn ) be a sequence of non-negative random variable. Show that

∞
X ∞
X
E[ Xn ] = E[Xn ].
n=1 n=1

2.51. Given the probability space (Ω, A, P), suppose X is a non-negative random variable and
E[X] = 1. Define Q : A → R by Q(A) = E[XIA ].

1. Show that Q defines a probability measure on (Ω, A).

2. Show that if P(A) = 0, then Q(A) = 0.

3. Suppose P(X > 0) = 1. Let EQ denote expectation with respect to Q. Show that EQ [Y ] =
EP [Y X].
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 58

Random elements

2.52. An urn contains 3 red balls, 4 blue balls and 2 yellow balls. Pick up randomly 2 ball from
that urn and denote X and Y the number of red and yellow balls in the 2 balls, respectively.

1. Make the joint distribution table of X and Y .

2. Are X and Y independent?

3. Find the distribution of Z = XY .

2.53. Suppose that the joint pmf of X and Y is

P[X = i, Y = j] = Cji e−2λ λj /j!, 0 ≤ i ≤ j.

1. Find the probability mass function of Y .

2. Find the probability mass function of X.

3. Find the probability mass function of Y − X.

2.54. Let X and Y be independent random variables taking values in N with

1
P[X = i] = P[Y = i] = , i = 1, 2, . . .
2i
Find the following probability

1. P[X ∧ Y ≤ i].

2. P[X = Y ].

3. P[X > Y ].

4. P[X divides Y ].

5. P[X ≥ kY ] for a given positive integer k.

2.55. Let X and Y be independent geometric random variables with parameters λ and µ.

1. Let Z = X ∧ Y . Show that Z is geometric and find its parameter.

2. Find the probability that X = Y .

2.56. Let X and Y be independent random variables with uniform distribution on the set {−1, 1}.
Let Z = XY . Show that X, Y, Z are pairwise independent but that they are not mutually inde-
pendent.

2.57. * Let n be a prime number greater than 2; and X, Y be independent and uniformly dis-
tributed on {0, 1, . . . , n − 1}. For each r, 0 ≤ r ≤ n − 1, define Zr = X + rY ( mod n). Show that
the random variable Zr , r = 0, . . . , n − 1, are pairwise independent.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 59

1
2.58. Let (Xn ) be a sequence of independent random variables with P[Xn = 1] = P[Xn = −1] = 2
for all n. Let Zn = X0 X1 . . . Xn . Show that Z1 , Z2 , . . . are independent.

2.59. Let (a1 , . . . , an ) be a random permutation of (1, . . . , n), equally likely to be any of the n!
possible permutations. Find the expectation of
n
X
L= |ai − i|.
i=1

2.60. A blood test is being performed on n individuals. Each person can be tested separately. but
this is expensive. Pooling can decrease the cost. The blood samples of k people can be pooled
and analyzed together. If the test is negative, this one test suffices for the group of k individuals.
If the test is positive, then each of the k persons must be tested separately and thus k + 1 total
tests are required for the k people. Suppose that we create n/k disjoint groups of k people (where
k divides n) and use the pooling method. Assume that each person has a positive result on the
test independently with probability p.

(a) What is the probability that the test for a pooled sample of k people will be positive?

(b) What is the expected number of tests necessary?

(c) Describe how to find the best value of k.

(d) Give an inequality that shows for what values of p pooling is better than just testing every
individual.

2.61. You need a new staff assistant, and you have n people to interview. You want to hire the
best candidate for the position. When you interview a candidate, you can give them a score, with
the highest score being the best and no ties being possible. You interview the candidates one
by one. Because of your company’s hiring practices, after you interview the kth candidate, you
either offer the candidate the job before the next interview or you forever lose the chance to hire
that candidate. We suppose the candidates are interviewed in a random order, chosen uniformly
at random from all n! possible orderings.
We consider the following strategy. First, interview m candidates but reject them alL these
candidates give you an idea of how strong the field is. After the mth candidate. hire the first
candidate you interview who is better than all of the previous candidates you have interviewed.

1. Let E be the event that we hire the best assistant, and let Ei be the event that ith candidate
is the best and we hire him. Determine P(Ei ), and show that
n
m X 1
P(E) = .
n j−1
j=m+1

2. Show that
m m
(ln n − ln m) ≤ P(E) ≤ (ln(n − 1) − ln(m − 1)).
n n
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 60

3. Show that m(ln n−ln m)/n is maximized when m = n/e, and explain why this means P(E) ≥
1/e for this choice of m.

2.62. Let X and Y have the joint pdf


6(1 − x − y) if x + y < 1, x > 0, y > 0,
f (x, y) =
0 otherwise.

Compute P[2X + 3Y < 1] and E[XY + 2X 2 ].

2.63. Let X and Y have the joint pdf


10xy if 0 < x < y < 1
fX,Y (x, y) = .
0 otherwise

Find the joint pdf of X/Y and Y .

2.64. Let X be a normal with µ = 0 and σ 2 < ∞, and let Θ be uniform on [0, π]. Assume that X
and θ are independent. Find the distribution of Z = X + a cos Θ.

2.65. Let X and Y be independent random variable with the same distribution N (0, σ 2 ).

1. Let U = X + Y and V = X − Y . Show that U and V are independent.

√
2. Let Z = X 2 + Y 2 and W = arctan X π π
Y ∈ (− 2 , 2 ). Show that X has a Rayleight distribution,
that W is uniform, and that Z and W are independent.

2.66. (Simulation of Normal Random Variables) Let U and V be two independent uniform ran-
dom variable on [0, 1]. Let θ = 2πU and S = − ln(V ).

1. Show that S has an exponential distribution, and that R

2.67. Let (X1 , . . . , Xn ) be random variables. Define

Y1 = min{Xi , 1 ≤ i ≤ n},
Y2 = second smallest of X1 , . . . , Xn ,
..
.
Yn = max{Xi , 1 ≤ i ≤ n}.

Then Y1 , . . . , Yn are also random variables, and Y1 ≤ Y2 ≤ . . . ≤ Yn . They are called the or-
der statistics of (X1 , . . . , Xn ) and are usually denoted Yk = X(k) . Assume that Xi are i.i.d. with
common density f .

1. Show that the joint density of the order statistics is given by


n! Qn f (y ) for y < y < . . . < y
i=1 i 1 2 n
f(X(1) ,X(2) ,...,X(n) ) (y1 , . . . , yn ) =
0 otherwise.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS 61

2. Show that X(k) has density

f(k) (y) = kCnk f (y)(1 − F (y))n−k F (y)k−1

where F is distribution function of Xk .

2.68. Show that the function 

0 for x + y < 1,
F (x, y) =
1 for x + y ≥ 1,

is not a joint distribution function.

2.69. Let X and Y be independent and suppose P[X + Y = α] = 1 for some constant α. Show
that both X and Y are constant random variables.

2.70. Let (Xn )n≥1 be iid with common continuous distribution function F (x). Denote Rn =
Pn
j=1 I{Xj ≥Xn } , and An = {Rn = 1}.

1. Show that the sequence of random variables (Rn )n≥1 is independent and

1
P[Rn = k] = , for k = 1, . . . , n.
n

2. The sequence of events (An )n≥1 is independent and

1
P(An ) = .
n
Chapter 3

Fundamental Limit Theorems

3.1 Convergence of random variables

In this section, we study about convergence of random variables. This is an important concept
in probability theory and it has many applications in statistics. Here we study a sequence of
random events or variables and we consider whether it obeys some behavior. Such a behavior
can be characterized in two cases: the limit is a constant value or the limit is still random but we
can describe its law.
When discussing the convergence of random variables, we need to define the metric between
two random variables or the manner that the random variables are close to each other. Then in
the following, we give some ”manners” or some modes of convergence.

Definition 3.1.1. Let (Xn )n≥1 be a sequence of random variables defined on(Ω, A, P). We say that
Xn
a.s.
• converges almost surely to X and denoted by Xn −→ X or limn Xn = X a.s., if

P w : lim Xn (w) = X(w) = 1;
n→∞

P
• converges in probability to X and denoted by Xn −→ X, if for any > 0,

lim P(|Xn − X| > ) = 0;

n→∞

Lp
• converges in Lp (p > 0) to X and denoted by Xn −→ X if E(|Xn |p ) < ∞ for any n, E(|X|p ) <
∞ and
lim E(|Xn − X|p ) = 0.
n→∞

Note that, the value of a random variable is a number, so the most natural way to consider
the convergence of random variables is via the convergence of a sequence of numbers; and here
comes the convergence almost surely. But sometimes this mode of convergence can fail, then the
convergence in probability is defined in the meaning that the larger n is, the smaller and smaller

62
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 63

the probability that Xn is far away from X becomes; and the convergence in Lp is considered in
the sense that the average distance between Xn and X must tends to 0.
We have the following example.

Example 3.1.2. Let {Xn } be a sequence of random variables such that

1 1
P(Xn = 0) = 1 − and P(Xn = n) = .
n2 n2
Then Xn converges to 0 in probability, in Lp for 0 < p < 2 and almost surely.
• At first, we consider the convergence in probability. For any > 0, observe that the event
{|Xn − 0| > } is included in the event {Xn 6= 0} = {Xn = n}. Then
1
0 ≤ P(|Xn − 0| > ) ≤ P(Xn = n) = .
n2
Therefore from the sandwich theorem,

lim P(|Xn − 0| > ) = 0.

n→∞

P
It implies that Xn −→ 0.
• In order to prove the convergence in Lp for p ∈ (0, 2), we must check that

lim E(|Xn − 0|p ) = 0.

n→∞

This limit can be deduced from the computation that E (|Xn |p ) = np−2 .
• Usually, in order to prove or disprove the convergence almost surely, we use the Borel-
Cantelli lemma that can be stated as follows.

Lemma 3.1.3 (Borel-Cantelli). Let An be a sequence of events in a probability space {Ω, F, P}. De-
note lim sup An = ∩∞
n=1 (∪m≥n Am ) .

1. If Σ∞
n=1 P(An ) < ∞, then P(lim sup An ) = 0.

2. If Σ∞
n=1 P(An ) = ∞ and An ’s are independent, then P(lim sup An ) = 1.

Proof. 1. From the definition of lim sup An , it is clear that for every i,

lim sup An ⊂ ∪m≥i Am .

So P(lim sup An ) ≤ P(∪m≥i Am ) ≤ Σ∞ ∞

m=i P(Am ). Since Σn=1 P(An ) < ∞, the right hand side
can be arbitrary small for suitable i. Then P(lim sup An ) = 0.

2. We have

1 − P(lim sup An ) = P ∩∞

n=1 ∪m≥n Am

= P ∪∞

n=1 ∪m≥n Am

= P ∪∞

n=1 ∩m≥n Am .
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 64

In order to prove that P(lim sup An ) = 1, i.e 1 − P(lim sup An ) = 0, we will show that

P ∩m≥n Am = 0 for every n. Indeed, since An ’s are independent,

P ∩m≥n Am = Πm≥n P(Am )
= Πm≥n [1 − P(Am )]
≤ Πm≥n e−P(Am ) = e−Σm≥n P(Am ) = e−∞ = 0.

Then the result follows.

The meaning of the event lim sup An is that An occurs for an infinite number of n. Therefore
P(lim sup An ) = 0 means that almost surely there exists just a finite number of n that we can see
An .
Now, let’s see the application of the Borel-Cantelli Lemma in our example. We denote the
event An = {Xn 6= 0} = {Xn = n}. Then,
1
Σ∞ ∞
n=1 P(An ) = Σn=1 < ∞.
n2
It implies that almost surely An occurs a finite number of n, i.e the number of n such that Xn
differs from zero is finite. Hence, almost surely the limit of Xn exists and it must be zero. So
a.s.
Xn −→ X.

For any random variables X v Y , we denote

h |X − Y | i
dP (X, Y ) = E .
|X − Y | + 1

The following proposition characterizes the convergence in probability via metric dP .1

Proposition 3.1.4. Xn converges in probability to X iff

lim dP (Xn , X) = 0. (3.1)

n→∞

P
Proof. ⇒) Suppose that Xn → X. For any > 0 and w ∈ Ω, because of the increasing property of
x
the function x 7→ x+1 on the interval [0, ∞), we have

|Xn (w) − X(w)|

≤ I (w) + I{|Xn −X|<} (w).
|Xn (w) − X(w)| + 1 + 1 {|Xn −X|≤}

Taking expectation of both sides, we have

dP (Xn , X) ≤ P(|Xn − X| ≤ ) + P(|Xn − X| > ).
+1
Hence
lim sup dP (Xn , X) ≤ + lim sup P(|Xn − X| > ) = for all > 0.
n→∞ n→∞
1 0
dP is indeed a metric on L .
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 65

This implies (3.1).

⇐) On the other hand, we suppose that condition (3.1) holds. Then for any > 0, it follows
from Markov’s inequality that
|X − X| + 1 h |Xn − X| i
n
P(|Xn − X| ≥ ) = P ≥ ≤ E → 0 as n → ∞.
|Xn − X| + 1 +1 |Xn − X| + 1

The following proposition shows that among the three modes of convergence, the conver-
gence in probability is the weakest form.

Proposition 3.1.5. Let (Xn )n≥1 be a sequence of random variables.

Lp P
1. If Xn −→ X for some p > 0 then Xn −→ X.
a.s. P
2. If Xn −→ X then Xn −→ X.
Lp
Proof. 1. Suppose that Xn −→ X. Then by Markov inequality, for each > 0,

E(|Xn − X|p )
P(|Xn − X| > ) = P(|Xn − X|p > p ) ≤ .
p
Since E(|Xn − X|p ) → 0, by the sandwich theorem, P(|Xn − X| > ) converges also to 0.
P
Therefore Xn −→ X.
a.s.
2. Suppose that Xn −→ X. It is clear that

|Xn − X|
≤ 1,
1 + |Xn − X|

then by Lebesgue’s Dominated Convergence Theorem (see ??);

|Xn − X| |Xn − X|
limE = E lim = E(0) = 0.
n→ 1 + |Xn − X| n→ 1 + |Xn − X|

P
From the Proposition 3.1.4, we have Xn −→ X.

In the above example, we can see that convergence in probability is not sufficient for conver-
gence almost surely. However, we have the following result.
P
Proposition 3.1.6. 1. Suppose Xn −→ X. Then there exists a subsequence (nk )k≥1 such that
a.s.
Xnk −→ X.

2. On the contrary, if for all subsequence (nk )k≥1 , there exists a further subsequence (mk )k≥1
a.s. P
such that Xmk −→ X then Xn −→ X.
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 66

P
Proof. 1. Suppose Xn −→ X. Then from Proposition 3.1.4,

|Xn − X|
limE = 0.
n→ 1 + |Xn − X|

So there exists a subsequence {nk } such that

|Xnk − X| 1
E < k.
1 + |Xnk − X| 2

It implies that for all subsequence {mk } of {nk }, Xnk can not converge almost surely to X.
This is in contradiction with the hypothesis.
P
So we must have that Xn −→ X.

We have the following elementary but useful proposition.

Proposition 3.1.7. Let f : R2 → R be a continuous function.

a.s. a.s. a.s.
1. If Xn −→ X and Yn −→ Y then f (Xn , Yn ) −→ f (X, Y ).
P P P
2. If Xn −→ X and Yn −→ Y then f (Xn , Yn ) −→ f (X, Y ).

Proof. 1. Denote by A = {w ∈ Ω : limn→∞ Xn (w) = X(w)} ∩ {w ∈ Ω : limn→∞ Yn (w) =

Y (w)}. It is clear that P(A) = 1 and for all w ∈ A, we have limn→∞ f (Xn (w), Yn (w)) =
a.s
f (X(w), Y (w)) since f is continuous. Then f (Xn ) −→ f (X).
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 67

P
2. From the second part of Proposition 3.1.6, in order to prove that f (Xn , Yn ) −→ f (X, Y ),
we can check that for all subsequence {nk }, there exists a subsequence {mk } such that
a.s P P
f (Xmk , Ymk ) −→ f (X, Y ). Indeed, since Xnk −→ X and Ynk −→ Y , then from the first
a.s
part of Proposition 3.1.6, we can extract a subsequence {mk } satisfying Xmk −→ X and
a.s
Ymk −→ Y . Then from the first part of this theorem, the result follows.

3.2 Laws of large numbers

In this section, we study the first special and classical limit theorem named ”Law of large
number”. It was first stated but without proof by Cardano. Later, the first proof was given by
Bernoulli when he considered the binary random variables. This theorem shows that in some
cases, the limit behaviour of the average of some random variables is a constant.
More precise, throughout this section, we consider (Xn )n≥1 a sequence of random variables
defined on (Ω, F, P) and denote
Sn = X1 + . . . + Xn .

We have the ”weak law” and the ”strong law”.

3.2.1 Weak laws of large numbers

In the weak law, we have the convergence in probability as follows.

Theorem 3.2.1. Suppose that

D(Sn )
lim = 0.
n→∞ n2
Then
Sn − ESn P
−→ 0, as n → ∞.
n
Proof. For every > 0, by the Chebyshev inequality,
S − ES D(S )
n n n
P ≥ ≤ 2 2 → 0 khi n → ∞.
n n
It implies the result.

We have a recent corollary.

Corollary 3.2.2. Let (Xn )n≥1 be a sequence of pairwise uncorrelated random variables satisfying

D(X1 ) + . . . + D(Xn )
lim = 0.
n→∞ n2
Then
Sn − ESn P
−→ 0, asn → ∞.
n
Proof. Observe that D(Sn ) = D(X1 ) + . . . + D(Xn ) and apply Theorem 3.2.1.
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 68

In a special case, when Xn ’s are i.i.d with finite variance, we have

D(X1 ) + . . . + D(Xn ) n.D(X1 ) D(X1 ))
lim 2
= lim 2
= lim = 0.
n→∞ n n→∞ n n→∞ n
So the condition in Theorem 3.2.1 is met. Moreover, by linearity,

ESn = EX1 + EX2 + . . . + EXn = nEX1 .

So we have proved the following corollary

Lemma 3.2.3. Let (Xn )n≥1 be a sequence of i.i.d random variables with finite variance. Then
Sn P
−→ EX1 , as n → ∞.
n
Note that when Xn has the Bernoulli law, then Sn is the number of successful trials and
Bernoulli showed that Sn /n converges in probability to the probability of success of a trial. How-
ever, his proof is much more complicated than the one given here.

3.2.2 Strong laws of large numbers

We first claim a simple version of the strong laws.

Theorem 3.2.4. Let (Xn )n≥1 be a sequence of pair-wise uncorrelated random variables satisfying
supn D(Xn2 ) ≤ σ 2 < ∞. Then
Sn − ESn
lim = 0 a.s and in L2 .
n→∞ n
Proof. At first, we assume that E(Xn ) = 0. Denote Yn = Sn /n. Then E(Yn ) = 0, and from
Proposition 2.7.3,
n
1 X σ2
E(Yn2 ) = 2 DXi ≤ .
n n
i=1
L2
Hence Yn → 0. We also have
∞ ∞
X X σ2
E(Yn22 ) ≤ < ∞.
n2
n=1 n=1

From the Monotone Convergence Theorem,

"∞ # ∞
X X
2
E Yn2 = E(Yn22 ) < ∞,
n=1 n=1
P∞ 2
so n=1 Yn2 < ∞ almost surely. It implies that
a.s
Yn2 → 0. (3.2)
√
For each n ∈ N, denote by p(n) the integer part of n. Since
n
p(n)2 1 X
Yn − Yp(n)2 = Xj ,
n n
j=p(n)2 +1
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 69

we have
√
h p(n)2 2 i n − p(n)2
2 2p(n) + 1 2 2 n + 1 2 3
E Yn − Yp(n)2 ≤ 2
σ ≤ 2
σ ≤ 2
σ ≤ 3/2 σ 2 ,
n n n n n
√
with the observations n ≤ (p(n) + 1)2 and p(n) ≤ n. By the same argument, since
∞ ∞
X h p(n)2 2 i X 3 2
E Yn − Yp(n)2 ≤ 3/2
σ < ∞,
n n
n=1 n=1

then
p(n)2 h.c.c
Yn − Yp(n)2 → 0.
n
2 a.s
From (3.2) and the observation p(n) n → 1, we deduce that Yn → 0.
In general, if E(Xn ) 6= 0, we denote Zn = Xn − E(Xn ). Then {Zn } is a sequence of pair-wise
uncorrelated random variables with mean zero satisfying the condition of the theorem. There-
fore
Sn − ESn Z1 + . . . + Zn a.s
= → 0.
n n

In the following, we state without proof two general versions of strong law of large numbers.

Theorem 3.2.5. Let (Xn )n≥1 be a sequence of independent random variable and, (bn )n≥1 a se-
quence of positive numbers satisfying bn ↑ ∞. If
∞
X DXn Sn − E(Sn ) a.s.
< ∞ then −→ 0.
b2n bn
n=1

Theorem 3.2.6. Let (Xn )n≥1 be a sequence of iid random variables. Then
Sn
lim = E(X1 ) iff E(|X1 |) < ∞.
n→∞ n

i.i.d
Example 3.2.7. Consider (Xn )n≥1 ∼ B(1, p). From Theorem 3.2.6,
Sn h.c.c
−→ E(X1 ) = p.
n
Then, to approximate the probability of success of each trial, we can use the approximation Sn /n
for n large enough.

An application of Strong law of large numbers that is quite simple but very useful is the Monte
Carlo method.

Example 3.2.8. Let f be an integrable function over [0, 1], i.e.

Z 1
|f (x)|dx < ∞. (3.3)
0
R1
In most of the practical applications, the quantity I = 0 f (x)dx can not be calculated exactly by
the analytical method. Therefore one usually approximate it by the numerical method. When f
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 70

is smooth enough, I can be approximated well by taking the average (with some weight) of the
values of f at some fixed points. For example, if f is twice differentiable, we have
f (tn0 ) + 2f (tn1 ) + . . . + 2f (tnn−1 ) + f (tnn )
I≈ ,
2n
where tni = ni , i = 0, 1, . . . , n.
However, the above method is not good in the sense that we must take too many points to
have a good approximation when f is not smooth enough. In this case, we can use the Monte
Carlo method that can be stated in the simplest version as follows. Let (Uj )j≥1 be a sequence of
i.i.d random variables of the uniform distribution over [0, 1] and denote
n
1X
In = f (Uj ).
n
j=1
R1
Since E[|f (Uj )|] = 0 |f (x)|dx < ∞, then from Theorem 3.2.6, In converges almost surely to
E[f (U1 )] = I as n → ∞. To evaluate the error of the approximation, we assume more that
Z 1
|f (x)|2 dx < ∞. (3.4)
0
Then, the square of the error is
Z 1
2 12 1
E[(In − I) ] = E[(In − E[In ]) ] = Df (U1 ) ≤ |f (x)|2 dx.
n n 0
In practical, we use the computer to generate the sequence (Uj )j≥1 and obtain an approximation
of I for any function f satisfying the condition (3.3). Under the condition (3.4), the error of the
approximation only depends on the size n and not on the smoothness of f . The Monte Carlo
method also seems to be more useful than the other deterministic ones in approximating the
multiple integral. The only thing we must care about is the square of the error. If we can reduce
it, then the calculation will be more accurate and we can also reduce the time on computer (see
[?]). That is the way one wants to improve the Monte Carlo method.
The error of the Monte Carlo method will be analysed in more detail based on the Central
limit theorems that will be explained in the following.

3.3 Central limit theorems

In this section, we will state and prove the second classical limit theorem in probability the-
ory named ”Central limit theorem”. It is the most beautiful pearl of probability and has a lot of
applications in statistics. However, to understand this theorem, we need to define a new mode
of convergence and the tools to study it.

3.3.1 Characteristic functions

Sometimes to analyse a quantity or a function, it is better to transform it in another form.

Since a random variable can be seen as a special function, we can do the same. In this section,
we study the Fourier transformation of a random variable. We have the following definition.
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 71

Definition 3.3.1. 1. Let X be a random variable. We define its characteristics function by

Z
ϕX (t) = E[eitX ] = eitx dFX (x).
R

2. The characteristic function of random vector X = (X1 , . . . , Xn ) is defined by

h Xn i
ϕX (t1 , . . . , tn ) = E exp i tj Xj .
j=1

Theorem 3.3.2. For every random variable X, the characteristic function ϕX has the following
properties;

1. ϕX (0) = 1;

2. ϕX (−t) = ϕX (t);

3. |ϕX (t)| ≤ 1;

4. |ϕX (t + h) − ϕX (t)| ≤ E[|eihX − 1|], so ϕX is uniformly continuous on (−∞, +∞);

5. E[eit(aX+b) ] = eitb ϕX (at).

Proof. It is easy to see that ϕX (0) = 1. Applying the inequality (EX)2 ≤ E(X 2 ),
p q
|ϕX (t)| = (E cos tX)2 + (E sin tX)2 ≤ E(cos2 tX) + E(sin2 tX) = 1,

then ϕX is bounded. And the continuity of can be deduced by Lebesgue dominated convergence
theorem.

The following theorem shows the connection between the characteristic function of a ran-
dom variable and its moments.

Theorem 3.3.3. If E[|X|m ] < ∞ for some positive integer m. Then ϕX has continuous derivatives
up to order m, and

ϕ(k) (t) = ik E[X k eitX ], (3.5)

ϕ(k) (0)
E[X k ] = , (3.6)
ik
n
X (it)k (it)n
ϕX (t) = E[X k ] + αn (t), (3.7)
k! n!
k=0

where |αn (t)| ≤ 2E(|X n |) and αn (t) → 0 as t → 0.

On the other hand, if ϕ(2m) (0) exists and is finite for some positive integer m, then E[X 2m ] < ∞.

Proof. Since E(|X|m ) < ∞, we have E(|X|k ) < ∞ for all k = 1, . . . , m. Then
Z Z
sup |(ix) e |dFX (x) ≤ |x|k dFX (x) < ∞.
k itx
t
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 72

From Lebesgue theorem, we can take the differentation under the integral sign and obtain (3.5).
In (3.5), let t = 0 then we have (3.6).
Consider the Taylor expansion of function exp(x) at x = 0,

n−1
X (itX)k (itX)n iθX
itX
E(e )=E + e
k! n!
k=0
n−1
X (it)k (it)n
= E(X k ) + E(X n ) + αn (t) ,
k! n!
k=0

where |θ| ≤ |t|, αn (t) = E X n (eiθX − 1) . Therefore |αn (t)| ≤ 2E(|X|n ), i.e it is bounded. So from

the Dominated convergence theorem, we have αn (t) → 0 as t → 0.

The inverse statement can be proved by concurrence, see [13, page 190-193].

We consider the characteristic function of some usual distributions.

Example 3.3.4. Let X ∼ P oi(λ). We have

∞ −λ k
X e λ (λeit )k it
ϕX (t) = eitk = e−ld = eλ(e −1) .
k! k!
k=0

Example 3.3.5. Let X ∼ N (0, 1). We have

∞ 2 ∞ ∞
e−x /2
Z Z Z
cos tx −x2 /2 sin tx −x2 /2
ϕX (t) = eitx √ dx = √ e dx + √ e dx.
−∞ 2π −∞ 2π −∞ 2π
2
Since the funtion x 7→ e−x /2 sin tx is an odd and integrable function, the second integral equals
to 0. Thanks to Theorem 3.3.3, we can take the derivative of both side with respect to t and get
Z ∞
0 sin tx −x2 /2
ϕX (t) = − √ xe dx.
−∞ 2π
By integration by parts,
Z ∞
cos tx −x2 /2
ϕ0X (t) =− √ te dx = −tϕX (t).
−∞ 2π
ϕ0X
The differential equation ϕX = −t with initial condition ϕX (0) = 1 has a solution
2 /2
ϕX (t) = e−t .

If X ∼ N (a, σ 2 ) then
(x−a)2
Z
1
ϕX (t) = √ dx. eitx e− 2σ 2
2πσ 2
Using the change of variable: y = (x − a)/σ, we get

eita
Z
2 2 2
ϕX (t) = √ eitσy e−y /2 dy = eita−t σ /2 .
2π
The following theorem shows the meaning of the name ”characteristic function”.
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 73

Theorem 3.3.6. Two random vectors have the same distribution if their characteristic functions
R
coincide. Moreover, if |ϕX (t)|dt < ∞ then X has bounded continuous density given by
1 −ity
fX (y) = e ϕ(t)dt.
2π
Example 3.3.7. Let X and Y have Poisson distribution with the corresponding parameters µ and
λ. Assume more that X and Y are independent. Let us consider the distribution of the random
variable X + Y . We can compute its characteristic function as
it −1)
ϕX+Y (t) = E(eit(X+Y ) ) = E(eitX )E(eitY ) = e(λ+µ)(e .

Then this characteristic function agrees with the one of P oi(µ+λ). So the random variable X +Y
has the Poisson distribution with the parameter µ + λ.

We can also use the characteristic function to check whether the random variables are inde-
pendent.

Theorem 3.3.8. X1 , . . . , Xn are independent random variables iff

ϕ(X1 ,...,Xn ) (t1 , . . . , tn ) = ϕX1 (t1 ) . . . ϕXn (tn ) for all (t1 , . . . , tn ) ∈ Rn .

Example 3.3.9. Let X and Y be independent random variables which have standard normal
distribution N (0, 1). According to Example 3.3.5, we have
2 −s2
ϕ(X+Y,X−Y ) (t, s) = Eeit(X+Y )+is(X−Y ) = Eei(t+s)X Eei(t−s)Y = e−t .
2 2
Put s = 0 and t = 0, we have ϕX+Y (t) = e−t and ϕX−Y (s) = e−s . Hence both X + Y and X − Y
have normal distribution N (0, 2). Furthermore, they are independent since

ϕ(X+Y,X−Y ) (t, s) = ϕX+Y (t)ϕX−Y (s) for all t, s ∈ R.

3.3.2 Weak convergence

In this section, we consider another mode of convergence of a sequence of random variables

that is the weak convergence. In the first three modes of convergence, we can see the trace of
analysis such as the limit of a sequence of numbers or a Cauchy sequence. Here the weak con-
vergence totally comes from the probability theory.
w
Definition 3.3.10. Xn converges weakly to X and denoted by Xn −→ X, if limn→∞ E[f (Xn )] =
E[f (X)] for each f which is real-valued, continuous and bounded.
w w
When Xn −→ X we also say that FXn converges weakly to FX and denote FXn −→ FX .

Note that we do not require or suppose that the random variables (Xn )n≥1 are defined on
the same probability space in the above definition. We just care about the expectation or the
distribution. Therefore sometimes we call weakly convergence by convergence in distribution
(See Exercise 3.27).
If we suppose that Xn ’s and X are defined on the same probability space, we have the follow-
ing propositions.
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 74

Proposition 3.3.11. Let (Xn )n≥1 and X be random variables defined on the same probability space
w
(Ω, F, P). If Xn −→ X then Xn −→ X.
P

P w
Proof. We prove by contradiction. Assume that Xn −→ X but Xn 6−→ X. Then there exist a
bounded continuous function f , a constant > 0 and a subsequence (nk )k≥1 such that

|E(f (Xnk )) − E(f (X))| > for all k ≥ 1. (3.8)

From Proposition 3.1.6, there exists a subsequence (mk )k≥1 of the sequence (nk )k≥1 such that
a.s h.c.c
Xmk −→ X. Since f is continuous, f (Xmk ) −→ f (X). By Dominated Convergence Theorem,
E(f (Xmk )) → E(f (X)). It is in contradiction with (3.8). Then the result follows.

Proposition 3.3.12. Let (Xn )n≥1 and X be random variables defined on the same probability space
w
(Ω, F, P). If Xn −→ X and X = const a.s then Xn −→ X.
P

|x−a| w
Proof. Let X ≡ a a.s. Consider the bounded continuous function f (x) = |x−a|+1 . Since Xn −→ a,
P
E(f (Xn )) → f (a) = 0. From Proposition 3.1.4, Xn → a.

The following theorem gives us a very useful criterion to verify the weak convergence of ran-
dom variables by using the characteristic function. Its proof is provided in [13, page 196-199].

Theorem 3.3.13. Let (Fn )n≥1 be a sequence of distribution function whose characteristic functions
are (ϕn )n≥1 respectively, Z
ϕn (t) = eitx dFn (x).
R
w
1. If Fn → F for some distribution function F then (ϕn ) converges point-wise to the character-
istic function ϕ of F .

2. If ϕn (t) → ϕ(t), t ∈ R. Then the following statements are equivalent.

w
(a) ϕ(t) is a characteristic function and Fn → F where F is a distribution function whose
characteristic function is ϕ;
(b) ϕ is continuous at t = 0.

Example 3.3.14. Let Xn be normal N (an , σn2 ). Suppose that an → 0 and σn2 → 1 as n → ∞. Then
the sequence (Xn ) converges weakly to N (0, 1) since
2 2 /2 2 /2
ϕXn (t) = eitan −σn t → e−t .

Example 3.3.15 (Weak laws of large numbers). Let (Xk )k≥1 be a iid sequence of random variables
whose mean is finite. Then
1 P
(X1 + . . . + Xn ) −→ a.
n
Indeed, denote Sn = X1 + . . . + Xn and ϕ is characteristic function of Xk . Then,

ϕSn /n (t) = ϕSn (t/n) = [ϕ(t/n)]n .

CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 75

Thanks to Theorem 3.3.3, we have

ita t
ϕ(t/n) = 1 + + α(t/n),
n n
with α(t) → 0 as t → 0. Thus,
ita t n
ϕSn /n (t) = 1 + + α(t/n) → eita
n n
w
as n → ∞. Note that if X ≡ a then ϕX (t) = eita . Hence, S/n → a. Applying Proposition 3.3.12 we
obtain the desired result.

The following theorem is very useful in statistics.

w P P
Theorem 3.3.16 (Slutsky’s theorem). Suppose that Xn → X, An → a and Bn → b where a and b
are some constants. Then
w
An + Bn Xn → a + bX.

3.3.3 Central limit theorem

The central limit theorem is stated as follows.

Theorem 3.3.17. Let (Xn )n≥1 be a sequence of i.i.d random variables and E(Xn ) = µ and DXn =
−nµ w
σ 2 ∈ (0, ∞). Denote Sn = X1 + . . . + Xn . Then Yn = Sσn√ n
→ N (0, 1).

Proof. Denote ϕ by the characteristic function of the random variable Xn − µ. Since Xn ’s have
the same law, ϕ does not depend on n. Moreover, since Xn ’s are independent,
n n
X Xj − µ Y X − µ
j t n
ϕYn (t) = E exp it √ = E exp it √ = ϕ √ .
σ n σ n σ n
j=1 j=1

It is clear that E(Xj − µ) = 0 and E((Xj − µ)2 ) = σ 2 . Then from Theorem 3.3.3, ϕ has the
continuous second derivative and
σ 2 t2
ϕ(t) = 1 − + t2 α(t),
2
where α(t) → 0 as t → 0. Using the expansion ln(1 + x) = x + o(x) as x → 0,
t2 t2 t t2
ln ϕYn (t) = n ln 1 − + α √ → − .
2n nσ 2 σ n 2
2 /2
Therefore ϕYn (t) → e−t as n → ∞. Applying Theorem 3.3.13, we have the desired result.

In the following, we give an example of the central limit theorem. More detail, we will approx-
imate the binomial probability by the normal probability.

Example 3.3.18. We know that a binomial random variable Sn ∼ B(n, p) can be written as the
sum of n i.i.d random variables ∼ B(1, p). Then as n large enough, from the central limit theorem,
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 76
p
we can approximate the random variable (Sn − np)/ np(1 − p) by the standard normal variable
N(0, 1).
Usually, the probability that a ≤ Sn ≤ b can be formulated as

Σbi=a Cni pi (1 − p)n−i .

However, when n is too large, calculating Cni for some i is impossible since it exceeds the capacity
of the calculator or the computer (please, consider 1000! or 5000!). Then in practical, we can
estimate this probability by
" #!
Sn − np a − np b − np
P(a ≤ Sn ≤ b) = P p ∈ p ,p
np(1 − p) np(1 − p) np(1 − p)
" #!
∼ a − np b − np
= P N(0, 1) ∈ p ,p .
np(1 − p) np(1 − p)

Note that to compute the last probability, we can write down it as an integral from the density
function of the normal variable. It can be computed or approximated easily.

In order to define the rate that the distribution of FYn converges to normal distribution, we
use the Berry-Esseen’s inequality: Suppose E(|X1 |3 ) < ∞ then
2
x
e−t /2 E(|X1 − EX1 |3 )
Z
sup FYn (x) − √ dt ≤ KBE √ , (3.9)
−∞<x<∞ −∞ 2π σ3 n
√
where KBE is some constant in ( 610+3√
2π
, 0.4748) (see[12]).
The condition that Xn ’s are iid is too restrictive. Many authors manage to weaken this condi-
tion. In the following, we state the Lindeberg’s central limit theorem. Its proof can be found in
[13, page 221-225].

Theorem 3.3.19. Let (Xn )n≥1 be a sequence of independent random variables with finite variance.
Denote Sn = X1 + . . . + Xn , Bn = DX1 + . . . + DXn . Suppose that
n
1 X 2

Ln () := E (Xk − EXk ) I {|X −EX |>B } → 0, ∀ > 0. (3.10)
Bn2 k k n
k=1

Sn −ESn w
Then Sn∗ = Bn → N (0, 1).

3.4 Exercises

3.4.1 Convergence of random variables

P P
3.1. Prove that if Xn → X and, at the same time, Xn → Y , then X and Y are equivalent, in the
sense that P[X 6= Y ] = 0.

3.2. Show that dP is a metric on L0 , it means that

1. d(X, Y ) ≥ 0 and d(X, Y ) = 0 iff X = Y a.s.;

CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 77

2. d(X, Y ) = d(Y, X);

3. d(X, Y ) ≤ d(X, Z) + d(Z, Y );

for any random variables X, Y, Z.

3.3. Show that (Xn )n≥1 converges in probability to X iff

lim E(|Xn − X| ∧ 1) = 0.
n→∞

3.4. Consider the probability space ([0; 1], B([0; 1]), P ). Let X = 0 and X1 , X2 , . . . be random
variables 
0 if n1 ≤ ω ≤ 1
Xn (ω) =
en if 0 ≤ ω < 1/n.

P
Show that X −→ X, but E|Xn − X|p does not converge for any p > 0.

3.5. Consider the probability space ([0; 1], B([0; 1]), P ). Let X = 0. For each n = 2m + k where
0 ≤ k < 2m , we define 
1 if k ≤ ω ≤ k+1
2m 2m
Xn (ω) =
0 otherwise.

P
Show that X −→ X, but {Xn } does not converge to X a.s.

3.6. Let (Xn )n≥1 be a sequence of exponential random variables with parameter λ = 1. Show
that h Xn i
P lim sup = 1 = 1.
n→∞ ln n

3.7. Let X1 , X2 , . . . be a sequence of identically distributed random variables with E|X1 | < ∞
and let Yn = n−1 max1≤i≤n |Xi |. Show that limn E(Yn ) = 0 and limn Yn = 0 a.s.

P
3.8. [5] Let (Xn )n≥1 be random variables with Xn −→ X. Suppose |Xn (ω)| ≤ C for a constant
C > 0 and all ω. Show that limn→∞ E|Xn − X| = 0.

3.4.2 Law of large numbers

3.9. [10] Let X1 , . . . , Xn be independent and identically distributed random variables such that
for x = 3, 4, . . . , P (X1 = ±x) = (2cx2 log x)−1 , where c = ∞ −2
P
x=3 x / log x. Show that E|X1 | = ∞
P
but n−1 ni=1 Xi −→ 0.
P

3.10. [10] Let X1 , . . . , Xn be independent and identically distributed random variables with V ar(S1 ) <
∞. Show that
n
1 X P
jXj −→ EX1 .
n(n + 1)
j=1
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 78

3.11. [2] If for every n, V ar(Xi ) ≤ c < ∞ and Cov(Xi , Xj ) < 0 (i, j = 1, 2, . . .), then the WLLN
holds.

3.12. [2](Theorem of Bernstein) Let {Xn } be a sequence of random variables so that V ar(Xi ) ≤
c < ∞ (i = 1, 2, . . .) and Cov(Xi , Xj ) → 0 when |i − j| → ∞ then the WLLN holds.

3.13. [5] Let (Yj )j≥1 be a sequence of independent Binomial random variables, all defined on the
same probability space, and with law B(p, 1). Let Xn = nj=1 Yj . Show that Xj is B(p, j) and that
P
Xj
j converges a.s to p.
Q 1
n n
3.14. [5] Let {Xj }j≥1 be i.i.d with Xj in L1 . Let Yj = eXj . Show that j=1 Yj converges to a
constant α a.s.

3.15. [5] Let (Xj )j≥1 be i.i.d with Xj in L1 and EXj = µ. Let (Yj )j≥1 be also i.i.d with Yj in L1 and
EYj = ν 6= 0. Show that
n
1 X µ
lim Pn Xj = a.s.
j=1 Yj ν
n→∞
j=1
Pn
3.16. [5] Let (Xj )j≥1 be i.i.d with Xj in L1 and suppose √1n j=1 (Xj −ν) converges in distribution
to a random variable Z. Show that
n
1X
lim Xj = ν a.s.
n→∞ n
j=1

3.17. [5] Let (Xj )j≥1 be i.i.d with Xj in Lp . Show that

n
1X p
lim Xj = EX p a.s.
n→∞ n
j=1

3.18. [5] Let (Xj )j≥1 be i.i.d. N (1; 3) random variables. Show that

X1 + X2 + . . . Xn 1
lim 2 2 2
= a.s.
n→∞ X + X + . . . + Xn
1 2 4

3.19. [5] Let (Xj )j≥1 be i.i.d with mean µ and variance σ 2 . Show that
n
1X
lim (Xi − µ)2 = σ 2 a.s.
n→∞ n
i=1

3.4.3 Characteristic function

3.20. Find the characteristic function of X,

1. P[X = 1] = P[X = −1] = 1/2;

2. P[X = 1] = P[X = 0] = 1/2;

3. X ∼ U (a, b);
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 79

4. the density of X is f (x) = (1 − |x|)I|x|<1 ;

5. X ∼ Exp(λ);

3.21. Show that if X1 , . . . , Xn are independent and uniformly distribution on (−1, 1), then for
n ≥ 2, X1 + . . . + Xn has density
1 ∞ sin t n
Z
f (x) = cos txdt.
π 0 t
3.22. Suppose that X has density
1 − cos x
f (x) = .
πx2
Show that
ϕX (t) = (1 − |t|)+ .

3.23. 1. Suppose that X has Cauchy distribution with density

1
f (x) = .
π(1 + x2 )
Show that
ϕX (t) = e−|t| .

2. Let X1 , . . . , Xn be a sequence of independent Cauchy random variables. Find the distribu-

tion of (X1 + . . . + Xn )/n.

3.24. Let X1 , X2 , . . . be independent taking values 0 and 1 with probability 1/2 each.

1. Find the distribution of ξ = ∞ Xi

P
i=1 2i .
P∞ Xi
2. Find the characteristic function of ζ = 2 i=1 3i . We say that ζ has the Cantor distribution.

3.4.4 Weak convergence

w w w
3.25. Show that if Xn and Yn are independent for 1 ≤ n, Xn → X and Yn → Y , then Xn + Yn →
X +Y.

3.26. Consider the probability space ([0; 1], B([0; 1]), P ). Let X and X1 , X2 , . . . be random vari-
ables 
1 if 0 ≤ ω ≤ 1/2
X2n (ω) =
0 if 1/2 < ω ≤ 1.

and 
0 if 0 ≤ ω ≤ 1/2
X2n+1 (ω) =
1 if 1/2 < ω ≤ 1.
Show that the sequence (Xn ) converges in distribution? Does it converge in probability?

3.27. Let (Xn )n≥1 and X are random variables whose distribution functions are (Fn )n≥1 and F ,
respectively.
CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 80

w
1. If Xn −→ X then limn→∞ Fn (x) = F (x) for all x ∈ D where D is a dense subset of R given
by
D = {x ∈ R : F (x+) = F (x)}.
w
2. If limn→∞ Fn (x) = F (x) for any x in some dense subset of R then Xn −→ X.
w P
3.28. If Xn −→ X, Yn −→ c, then
w
a) Xn + Yn −→ X + c
w
b) Xn Yn −→ cX
w
c) Xn /Yn −→ X/c if Yn 6= 0 a.s for all n and c 6= 0.
d P
3.29. [10] Show that if Xn −→ X and X = c a.s for a real number c, then Xn −→ X.

3.30. [10] A family of random variable (Xi )i∈I is called uniformly integrable if

lim sup E[|Xi |I{|Xi |≥N } = 0.

N →∞ i∈I

Let X1 , X2 , . . . be random variables. Show that {|Xn |} is uniformly integrable if one of the follow-
ing condition holds:

a) supn E|Xn |1+δ < ∞ for a δ > 0.

b) P (|Xn | ≥ c) ≤ P (|X| ≥ c) for all n and c > 0, where X is an integrable random variable.

3.31. Let Xn be random variable distributed as N (µn , σn2 ), n = 1, 2, . . . and X be a random vari-
d
able distributed as N (µ, σ 2 ). Show that Xn −→ X if and only if limn µn = µ and limn σn2 = σ 2 .
w
3.32. If Yn are random variables with characteristic function ϕn , then Yn → 0 iff there is a δ > 0
so that ϕn (t) → 1 for |t| ≤ δ.

3.4.5 Central limit theorems

3.33. [10] Let U1 , U2 , . . . be independent random variables having the uniform distribution on
√ d
[0;1] and Yn = ( ni=1 Ui )−1/n . Show that n(Yn − e) −→ N (0, e2 ).
Q

3.34. [10] Suppose that Xn is a random variable having the binomial distribution with size n and
probability θ ∈ (0, 1), n = 1, 2, . . . Define Yn = log(Xn /n) when Xn ≥ 1 and Yn = 1 when Xn = 0.
√ d
Show that limn Yn = log θ a.s and n(Yn − log θ) −→ N (0, 1−θ θ ).

3.35. [2] Show that for the sequence {Xn } of independent random variables with
1−2−n 1
a) P [Xn = ±1] = 2 , P [Xn = ±2n ] = 2n+1
, n = 1, 2, . . . ,

b) P [Xn = ±n2 ] = 12 ,

the CLT holds.

CHAPTER 3. FUNDAMENTAL LIMIT THEOREMS 81

2 = σ 2 ∈ (0; ∞). Show that

3.36. [5] Let (Xj )j≥1 be i.i.d with EX1 = 1 and σX1

2 p √ d
( Sn − n) −→ N (0, 1).
σ
3.37. [5] Show that
n
!
X nk 1
lim e−n = .
n→∞ k! 2
k=0

2 = σ 2 < ∞. Let S =
Pn
3.38. [5] Let (Xj )j≥1 be i.i.d with EXj = 0 and σXj n j=1 Xj . Show that
r
|Sn | 2
lim E √ = σ.
n→∞ n n

3.39. [5] Let (Xj )j≥1 be i.i.d with the uniform distribution on (-1;1). Let
Pn
j=1 Xj
Yn = Pn 2
Pn 3.
j=1 Xj + j=1 Xj
√
Show that nYn converges in distribution.

3.40. [5] Let (Xj )j≥1 be i.i.d with the uniform distribution on (−j; j).

a) Show that
Sn d 1
−→ N (0; ).
3
n 2 9

b) Show that
S d
qP n −→ N (0, 1).
n 2
j=1 σj
Chapter 4

Some useful distributions in statistics

4.1 Gamma, chi-square, student and F distributions

4.1.1 Gamma distribution

Recall that a random variable X has a Gamma distribution G(α, λ) if its density is given by

xα−1 e−x/λ
fX (x) = I .
Γ(α)λα {x>0}
Note that G(1, λ) = Exp(λ).

Proposition 4.1.1. If X is G(α, λ) distributed, then

E[X] = αλ, DX = αλ2 .

Moreover, the characteristic function of X is given by

Z ∞
xα−1 e−x/λ 1 α
ϕX (t) = eitx dx = .
0 Γ(α)λα 1 − iλt

Corollary 4.1.2. Let (Xi )1≤i≤n be a sequence of independent random variables. Suppose that Xi is
G(αi , λ) distributed. Then S = X1 + · · · + Xn is G(α1 + · · · + αi , λ) distributed.

4.1.2 Chi-square distribution

Definition 4.1.3. Let (Zi )1≤i≤n be a sequence of independent, standard normal distributed ran-
dom variables. The distribution of V = Z12 + . . . + Zn2 is called chi-square distribution with n
degrees of freedom and is denoted by χ2n .

Note that since Zi2 is G( 12 , 2) distributed, χ2n is G( n2 , 2) distributed. Moreover,

E[χ2n ] = n, Dχ2n = 2n.

A notable consequence of the definition of the chi-square distribution is that if U and V are
independent and U ∼ χ2n and V ∼ χ2m , then U + V ∼ χ2m+n .

82
CHAPTER 4. SOME USEFUL DISTRIBUTIONS IN STATISTICS 83

2.2
alpha =7/8, lambda = 1
2 alpha = 1,lambda = 1
alpha = 2, lambda = 1
1.8 alpha = 3, lambda = 1

1.6

1.4

1.2

0.8

0.6

0.4

0.2

0
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

Figure 4.1: Density of gamma distribution

0.9
n=1
n=2
0.8
n=4
n=6
0.7

0.6

0.5

0.4

0.3

0.2

0.1

0
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10

Figure 4.2: Density of χ2 distribution

CHAPTER 4. SOME USEFUL DISTRIBUTIONS IN STATISTICS 84

0.4
n=1
n=2
0.35 n=8
normal

0.3

0.25

0.2

0.15

0.1

0.05

0
−6 −5 −4 −3 −2 −1 0 1 2 3 4 5 6

Figure 4.3: Density of student distribution

4.1.3 Student’s t distribution

Definition 4.1.4. If Z ∼ N (0; 1) and U ∼ χ2n and Z and U are independent, then the distribution
Z
of p is called student’s t distribution with n degrees of freedom.
U/n
Student’s t distribution is also call t distribution.
A direct computation with the density gives the following result.

Proposition 4.1.5. The density function of the student’s t distribution with n degrees of freedom is

Γ n+1
2
t2 −(n+1)/2
fn (t) = √ 1+ .
nπΓ n n
2

In addition,
n→∞ 1 2
fn (t) −→ √ e−t /2 .
2π

4.1.4 F distribution

Definition 4.1.6. Let U and V be independent chi-square random variables with m and n degrees
of freedom, respectively. The distribution of
U/m
W =
V /n
is called the F distribution with m and n degrees of freedom and is denoted by Fm,n .

The density of W is given by

Γ m+n
2
m m/2 m −(m+n)/2
f (x) = xm/2−1 1 + x , x ≥ 0.
Γ m n n n
2 Γ 2
CHAPTER 4. SOME USEFUL DISTRIBUTIONS IN STATISTICS 85

0.8
n = 4, m = 4
n = 10, m = 4
0.7 n = 10, m = 10
n=4, m= 10

0.6

0.5

0.4

0.3

0.2

0.1

0
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6

Figure 4.4: Density of F distribution

4.2 Sample mean and sample variance

Let (Xn ) be independent N(µ, σ 2 ) random variables, and
n n−1
1X 1 X
Xn = Xi , s2n = (Xi − X n )2
n n−1
i=1 i=1

Proposition 4.2.1. The random variable X n and the vector of random variables (X1 − X n , X2 −
X n , . . . , Xn − X n ) are independent.

Proof. We write
n
X n
X
sX n + ti (Xi − X n ) = ai Xi
i=1 i=1
s
where ai = n + (ti − t). Note that
n n n
X X s2 X
ai = s and a2i = + (ti − t)2 .
n
i=1 i=1 i=1

Therefore, the characteristic function of (X n , X1 − X n , X2 − X n , . . . , Xn − X n ) is

n n
X Y σ2
= E[exp(isX n + i tj (Xj − X n )] = exp iµaj − a2j
2
j=1 j=1
n
σ2 σ2 X
= exp iµs − s2 exp − (ti − t)2 .
2n 2
i=1

The first factor is the cf of X n while the second factor is the cf of (X1 − X n , X2 − X n , . . . , Xn − X n )
(this is obtained by let s = 0 in the formula). This implies the desired result.

Corollary 4.2.2. X n and s2n are independently distributed.

CHAPTER 4. SOME USEFUL DISTRIBUTIONS IN STATISTICS 86

Theorem 4.2.3. The distribution of (n − 1)s2n /σ 2 is the chi-square distribution with n − 1 degrees
of freedom.

Proof. We first note that

n n
1 X 2
X Xi − µ 2
(Xi − µ) = ∼ χ2n .
σ2 σ
i=1 i=1

Also,
n n X − µ 2
1 X 2 1 X 2
(Xi − µ) = (Xi − X n ) + √ =: U + V.
σ2 σ2 σ/ n
i=1 i=1

Since U and V are independent, ϕW (t) = ϕU (t)ϕV (t). Since W and V both follow chi-square
distribution, so
ϕW (t) (1 − i2t)−n/2
ϕU (t) = = = (1 − i2t)−(n−1)/2 .
ϕV (t) (1 − i2t)−1/2
The last expression is the c.f. of a random variable with a χ2n−1 distribution.

We end up with the following result.

Corollary 4.2.4.
Xn − µ
√ ∼ tn−1 .
sn / n

4.3 Exercises
4.1. Show that

1. if X ∼ Fn,m then X −1 ∼ Fm,n .

2. if X ∼ tn then X 2 ∼ F1,n .

3. the Cauchy distribution and the t distribution with 1 degree of freedom are the same.

4. Iif X and Y are independent exponential random variable with λ = 1, then X/Y follows an
F distribution.

4.2. Show how to use the chi-square distribution to calculate P(a < s2n /σ 2 < b).

4.3. Let X1 , . . . , Xn be a sequence of independent and N(µX , σ 2 ) distributed random variables.

and Y1 , . . . , Ym be a sequence of independent and N(µY , σ 2 ) distributed random variables. Show
how to use F distribution to find P(s(X))2n /s(Y )2n > c), for some positive constant c.
m
4.4. Let W ∼ Fn,m and denote Y = m+nW . Show that Y has a beta distribution.

4.5. Let X1 , X2 and X3 be three independent chi-square variables with r1 , r2 and r3 degrees of
freedom, respectively.

1. Show that Y1 = X1 /X2 and Y2 = X1 + X2 are independent.

CHAPTER 4. SOME USEFUL DISTRIBUTIONS IN STATISTICS 87

2. Deduce that
X1 /r1 X2 /r3
and
X2 /r2 (X1 + X2 )/(r1 + r2 )
are independent F -variables.
Chapter 5

Parameter estimation

5.1 Samples and characteristic of sample

Definition 5.1.1 (Random samples). A sequence of random variable X1 , . . . , Xn is called a ran-
dom sample observing from a random variable X if

• (Xi )1≤i≤n are independent;

• Xi has the same distribution as X for all i = 1, . . . , n.

We call n the sample size.

Example 5.1.2. An urn contains m balls, labeled from 1, 2, . . . , m and are identical except for the
number. The experiment is to choose a ball at random and record the number. Let X denote the
number. Then the distribution of X is given by
1
P[X = k] = , f or x = 1, . . . , m.
m
In case m is unknown, to obtain information on m we take a sample of n balls, which we will
denote as X = (X1 , . . . , Xn ) where Xi is the number on the ith ball.
The sample can be drawn in several ways.

1. Sampling with replacement: We randomly select a ball, record its number and put it back
to the urn. All the ball are then remixed, and the next ball is chosen. We can see that
X1 , . . . , Xn are mutually independent random variables and each has the same distribution
as X. Hence (X1 , . . . , Xn ) is a random sample.

2. Sampling without replacement: Here n balls are selected at random. After a ball is selected,
we do not return it to the urn. The X1 , . . . , Xn are not independent, but each Xi has the
same distribution as X.

If m is much greater than n, the sampling schemes are practically the same.

Definition 5.1.3. The empirical distribution function is defined by

n
1X
Fn (x) = IXi <x , x ∈ R.
n
i=1

88
CHAPTER 5. PARAMETER ESTIMATION 89

1. Fn (x) is non-dercreasing with respect to x;

2. Fn (x) is left continuous and has right limit at any points;

3. limx→−∞ Fn (x) = 0, limx→+∞ Fn (x) = 1.

a.s.
4. Fn (x) −→ F (x) for any x ∈ R. Indeed, applying law of large numbers for the iid sequence
IXi <x , we have
n
1X a.s.
Fn (x) = IXi <x −→ E[IX1 <x ] = F (x).
n
i=1

Definition 5.1.4. Let (X1 , . . . , Xn ) be a random sample.

1. Sample mean
X1 + . . . + Xn
X̄n = .
n
2. Population variance
n
1X
Sn2 (X) = (Xi − X̄n )2 ,
n
i=1
and sample variance
n
1 X
s2n (X) = (Xi − X̄n )2 .
n−1
i=1

3. kth sample moment

n
1X k
mk = Xi ,
n
i=1
and centralized kth sample moment
n
1X
vk = (Xi − X̄n )k .
n
i=1

4. Sample covariance of 2-dimensional sample (X1 , Y1 ), . . . , (Xn , Yn )

1 Pn
n i=1 (Xi − X̄n )(Yi − Ȳn )
r= .
Sn (X)Sn (Y )

5. The sample mode is the most frequently occurring data value.

6. The sample median is a measure of central tendency that divides the data into two equal
parts, half below the median and half above. If the number of observations is even, the
median is halfway between the two central values. If the number of observations is odd,
the median is the central value.

7. When an ordered set of data is divided into four equal parts, the division points are called
quartiles. The first or lower quartile, q1 , is a value that has approximately 25% of the ob-
servations below it and approximately 75% of the observations above. The second quartile,
q2 , has approximately 50% of the observations below its value. The second quartile is ex-
actly equal to the median. The third or upper quartile, q3 , has approximately 75% of the
observations below its value.
CHAPTER 5. PARAMETER ESTIMATION 90

8. The interquartile range is defined as IQR = q3 − q1 . The IQR is also used as a measure of
variability.

5.2 Data display

Well-constructed data display is essential to good statistical thinking, because it helps us ex-
ploring important features of the data and providing insight about the type of model that should
be used in solving the problem. In this section we will briefly introduce some methods to display
data.

5.2.1 Stem-and-leaf diagrams

A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set
x1 , x2 , . . . , xn , where each number xi consists of at least two digits. To construct a stem-and-leaf
diagram, use the following steps.

1. Divide each number xi into two parts: a stem, consisting of one or more of the leading
digits and a leaf, consisting of the remaining digit.

2. List the stem values in a vertical column.

3. Record the leaf for each observation beside its stem.

4. Write the units for stems and leaves on the display.

It is usually best to choose between 5 and 20 stems.

Example 5.2.1. The weights of 80 students are given in the following table.

59.0 59.5 52.7 47.9 55.7 48.3 52.1 53.1 55.2 45.3
46.5 54.8 48.4 53.1 56.9 47.4 50.2 52.1 49.6 46.4
52.9 41.1 51.0 50.0 56.8 45.9 59.5 52.8 46.7 55.7
48.6 51.6 53.2 54.1 45.8 50.4 54.1 52.0 56.2 62.7
62.0 46.8 54.6 54.7 50.2 45.9 49.1 42.6 49.8 52.1
56.5 53.5 46.5 51.9 46.5 53.5 45.5 50.2 55.1 49.6
47.6 44.8 55.0 56.2 49.4 57.0 52.4 48.4 55.0 47.1
52.4 56.8 53.2 50.5 56.6 49.5 53.1 51.2 55.5 53.7

Construct a stem-and-leaf diagram for their weight.

CHAPTER 5. PARAMETER ESTIMATION 91

Stem Leaf Frequency

41 1 1
42 6 1
44 8 1
45 3 5 8 9 9 5
46 4 5 5578 6
47 1 4 69 4
48 3 4 4 6 4
49 1 4 56 6 8 6
50 0 2 22 45 6
51 0 2 6 9 4
52 0 1 1 1 447 8 9 9
53 1 1 1 2 2 5 5 7 8
54 1 1 6 7 8 5
55 0 0 12 5 7 7 7
56 2 2 5 6 8 8 9 7
57 1 1
59 0 5 5 3
62 0 7 2

5.2.2 Frequency distribution and histogram

A frequency distribution is a more compact summary of data than a stem-and-leaf diagram.

To construct a frequency distribution, we must divide the range of the data into intervals, which
are usually called class intervals, cells, or bins. If possible, the bins should be of equal width in
order to enhance the visual information in the frequency distribution. Some judgment must be
used in selecting the number of bins so that a reasonable display can be developed. The number
of bins depends on the number of observations and the amount of scatter or dispersion in the
data. A frequency distribution that uses either too few or too many bins will not be informative.
We usually find that between 5 and 20 bins is satisfactory in most cases and that the number of
bins should increase with n. Choosing the number of bins approximately equal to the square
root of the number of observations often works well in practice.
The histogram is a visual display of the frequency distribution. The stages for constructing a
histogram follow.

1. Label the bin (class interval) boundaries on a horizontal scale.

2. Mark and label the vertical scale with the frequencies or the relative frequencies.

3. Above each bin, draw a rectangle where height is equal to the frequency (or relative fre-
quency) corresponding to that bin.

Example 5.2.2. Histogram of the students’ weight given in Example 5.2.1.

CHAPTER 5. PARAMETER ESTIMATION 92

Histogram of weight

15
No. of student

10
5
0

40 45 50 55 60

weight

5.2.3 Box plots

The box plot is a graphical display that simultaneously describes several important features
of a data set, such as center, spread, departure from symmetry, and identification of unusual ob-
servations or outliers.
A box plot displays the three quartiles, the minimum, and the maximum of the data on a rectan-
gular box, aligned either horizontally or vertically. The box encloses the interquartile range with
the left (or lower) edge at the first quartile, q1 , and the right (or upper) edge at the third quartile,
q3 . A line is drawn through the box at the second quartile (which is the 50th percentile or the me-
dian). A line, or whisker, extends from each end of the box. The lower whisker is a line from the
first quartile to the smallest data point within 1.5 interquartile ranges from the first quartile. The
upper whisker is a line from the third quartile to the largest data point within 1.5 interquartile
ranges from the third quartile. Data farther from the box than the whiskers are plotted as indi-
vidual points. A point beyond a whisker, but less than 3 interquartile ranges from the box edge, is
called an outlier. A point more than 3 interquartile ranges from the box edge is called an extreme
outlier.

Example 5.2.3. Consider the sample in Example 5.2.1. The quantiles of the sample are q1 =
48.40, q2 = 52.10, q3 = 54.85. Bellow is the box plot of the students’ weight.
CHAPTER 5. PARAMETER ESTIMATION 93

60
55
50
45

Example 5.2.4. Construct a box plot of the following data.

158.7 167.6 164.0 153.1 179.3 153.0 170.6 152.4 161.5 146.7
147.2 158.2 157.7 161.8 168.4 151.2 158.7 161.0 147.9 155.5

The quantiles of this sample are q1 = 152.85, q2 = 158.45, q3 = 162.35

180
170
160
150

5.2.4 Probability plots

How do we know if a particular probability distribution is a reasonable model for data? Some
of the visual displays we have used earlier, such as the histogram, can provide insight about the
form of the underlying distribution. However, histograms are usually not really reliable indica-
tors of the distribution form unless the sample size is very large. Probability plotting is a graphi-
cal method for determining whether sample data conform to a hypothesized distribution based
on a subjective visual examination of the data. The general procedure is very simple and can be
performed quickly. It is also more reliable than the histogram for small to moderate size samples.
CHAPTER 5. PARAMETER ESTIMATION 94

To construct a probability plot, the observations in the sample are first ranked from smallest
to largest. That is, the sample x1 , x2 , . . . , xn is arranged as x(1) ≤ x(2) < . . . ≤ x(n) . The or-
dered observations x(j) are then plotted against their observed cumulative frequency (j − 0.5)/n.
If the hypothesized distribution adequately describes the data, the plotted points will fall ap-
proximately along a straight line which is approximately between the 25th and 75th percentile
points; if the plotted points deviate significantly from a straight line, the hypothesized model is
not appropriate. Usually, the determination of whether or not the data plot as a straight line is
subjective.
In particular, a normal
probability plot can be constructed by plotting the standardized nor-
j−0.5
mal scores zj = Φ−1 n against x(j) .

Example 5.2.5. Consider the following sample:

2.86, 3.33, 3.43, 3.77, 4.16, 3.52, 3.56, 3.63, 2.43, 2.78.

We construct a normal probability plot for this sample as follows.

j x(j) (j − 0.5)/10 Φ−1 j−0.5 n
1 2.43 0.05 -1.64
2 2.78 0.15 -1.04
3 2.86 0.25 -0.67
4 3.33 0.35 -0.39
5 3.43 0.45 -0.13
6 3.52 0.55 0.13
7 3.56 0.65 0.39
8 3.63 0.75 0.67
9 3.77 0.85 1.04
10 4.16 0.95 1.64

Since all the points are very close to the straight line, one may conclude that a normal distribution
adequately describes the data.

Remark 4. This is very surjective method. Please use it at your own risk! Later we will introduce
the Shapiro and Wilcoxon tests for the normal distribution hypothesis.
CHAPTER 5. PARAMETER ESTIMATION 95

5.3 Point estimations

5.3.1 Statistics

Example 5.3.1. We continue Example 5.1.2. Recall that we do not know the number of balls m
and have to use the sample (X1 , . . . , Xn ) to obtain information about m.
Since E(X) = m+12 , using laws of large numbers, we have

X1 + . . . + Xn a.s. m + 1
−→ .
n 2
Therefore, we get the first estimator for m given by
X1 + . . . + Xn a.s.
m̂n := 2 − 1 −→ m.
n
Another estimation for m is defined by

m̃n := max{X1 , . . . , Xn }.

Since
n
Y m − 1 n
P[m̃n 6= m] = P[X1 < m, . . . , Xn < m] = P[Xi < m] = →0
m
i=1
a.s.
as n → ∞, we have m̃n −→ m.
The estimator m̂n and m̃n are called statistics which depend only on the observations X1 , . . . , Xn
not m.

Definition 5.3.2. Let X = (X1 , . . . , Xn ) be a sample observed from X and (T, BT ) a measurable
space. Then any function ϕ(X) = ϕ(X1 , . . . , Xn ), where ϕ : (Rn , B(Rn )) → (T, BT ) is a measur-
able function, of the sample is called a statistic.

In the following we only consider the case that (T, BT ) is a subset of (Rd , B(Rd )).

Definition 5.3.3. Let X = (X1 , . . . , Xn ) be a sample observed from a distribution with density
f (x; θ), θ ∈ Θ. Let Y = ϕ(X) be a statistic with density fY (y; θ). Then Y is called a sufficient
statistic for θ if
f (x; θ)
= H(x),
fY (ϕ(x); θ)
where x = (x1 , . . . , xn ), f (x; θ) is density of X at x, and H(x) does not depend on θ ∈ Θ.

Example 5.3.4. Let (X1 , . . . , Xn ) be a sample observed from a Poisson distribution with param-
eter λ > 0. Then Yn = ϕ(X) = X1 + . . . + Xn has Poisson distribution with parameter nλ. Hence
Qn Pn
f (X; θ) i=1 f (Xi ; θ) e−nλ λ i=1 Xi Yn ! Yn !
= = Qn −nλ Y
= Yn Q n .
fY (ϕ(X); θ) fYn (Yn ; nθ) i=1 Xi ! e (nλ) n i=1 Xi !

Therefore Yn is a sufficient statistic for λ.

In order to directly verify the sufficiency of statistic ϕ(X) we need to know the density of ϕ(X)
which is not always the case in practice. We next introduce the following criterion of Neyman to
overcome this difficulty.
CHAPTER 5. PARAMETER ESTIMATION 96

Theorem 5.3.5. Let X = (X1 , . . . , Xn ) be a random sample from a distribution that has
density f (x; θ), θ ∈ Θ. The statistic Y1 = ϕ(X) is a sufficient statistic for θ iff we can find
two nonnegative functions k1 and k2 such that

f (x; θ) = k1 (ϕ(x); θ)k2 (x) (5.1)

where k2 does not depend upon θ.

Example 5.3.6. Let (X1 , . . . , Xn ) be a sample from normal distribution N (θ, 1) with θ ∈ Θ = R.
Denote x̄ = n1 ni=1 xi . The joint density of X1 , . . . , Xn at (x1 , . . . , xn ) is given by
P

h i
n 2
2
exp − Pn (xi −x̄)2
1 h X (xi − θ) i n(x̄ − θ) i=1 2
n/2
exp − = exp − n/2
.
(2π) 2 2 (2π)
i=1

We see that the first factor on the right hand side depends upon x1 , . . . , xn only through x̄ and
the second factor does not depend upon θ, the factorization theorem implies that the mean X̄ of
the sample is a sufficient statistic for θ, the mean of the normal distribution.

5.3.2 Point estimators

Let X = (X1 , . . . , Xn ) be a sample from distribution F (x; θ) which depends on a unknown

parameter θ ∈ Θ. Even thought function ϕ does not depend on the unknown parameter θ, the
statistic ϕ(X) may convey information about θ. In such cases, we may call the statistic a point
estimator of θ.

Definition 5.3.7. A statistic ϕ(X1 , . . . , Xn ) is called

1. an unbiased estimator of θ if Eθ [ϕ(X1 , . . . , Xn )] = θ;

2. an asymptotic unbiased estimator of θ if limn→∞ Eθ [ϕ(X1 , . . . , Xn )] = θ;

3. a best unbiased estimator of θ if

(a) Eθ [ϕ(X1 , . . . , Xn )] = θ;
(b) Dθ ϕ(X1 , . . . , Xn ) ≤ Dθ ϕ̄(X1 , . . . , Xn ) for any unbiased estimator ϕ̄(X1 , . . . , Xn ) of θ.

4. a consistent estimator of θ if
P θ
ϕ(X1 , . . . , Xn ) −→ θ khi n → ∞.

Here we denote Eθ , Dθ , Pθ the expectation, variance and probability under the condition that
the distribution of Xi is F (x; θ).

Example 5.3.8. Let (X1 , . . . , Xn ) be a sample from normal distribution N (a, σ 2 ). Using the lin-
earity of expectation and laws of large numbers, we have
CHAPTER 5. PARAMETER ESTIMATION 97

• X̄n is an unbiased estimator of a;

• s2n (X) is an unbiased and consistent estimator of σ 2 .

• Sn2 (X) is an asymptotic unbiased and consistent estimator of σ 2 .

Example 5.3.9. In Example 5.3.1, both m̂n and m̃n are consistent estimators of m. Moreover, m̂n
is unbiased and m̃n is asymptotic unbiased.

5.3.3 Confidence intervals

Let X be a random variable whose density is f (x, θ), θ ∈ Θ, where θ is unknown. In the last
section, we discussed estimating θ by a statistic ϕ(X1 , . . . , Xn ) where X1 , . . . , Xn is a sample from
the distribution of X. When the sample is drawn, it is unlikely that the value of ϕ is the true value
of the parameter. In fact, if ϕ has a continuous distribution then Pθ [ϕ = θ] = 0. What is needed is
an estimate of the error of the estimation.

Example 5.3.10. Let (X1 , . . . , Xn ) be a sample from normal distribution N(a; σ 2 ) where σ 2 is
known. We know that X̄n is an unbiased, consistent estimator of a. But how close is X̄n to a?
√
Since X̄n ∼ N(a; σ 2 /n), we have (X̄n − a)/(σ/ n) has a standard normal N(0; 1) distribution.
Therefore,
h X̄n − a i h σ σ i
0.954 = P − 2 < √ < 2 = P X̄n − 2 √ < a < X̄n + 2 √ . (5.2)
σ/ n n n

Expression (5.2) says that before the sample

is drawn the probability that a belongs to the random
σ σ
interval X̄n − 2 √n < a < X̄n + 2 √n is 0.954. After the sample is drawn the realized interval

x̄n − 2 √σn < a < x̄n + 2 √σn has either trapped a or it has not. But because of the high probability

of success before the sample is drawn, we call the interval X̄n − 2 √σn < a < X̄n + 2 √σn a 95.4%
confidence interval for a. We can say, with some confidence, that x̄ is within 2 √σn from a. The
number 0.954 = 95.4% is called a confidence coefficient. Instead of using 2, we could use, say,
1.645, 1.96 or 2.576 to obtain 90%, 95% or 99% confidence intervals for a. Note that the lengths of
these confidence intervals increase as the confidence increases; i.e., the increase in confidence
implies a loss in precision. On the other hand, for any confidence coefficient, an increase in
sample size leads to shorter confidence intervals.

In the following, thanks to the central limit theorems, we will present a general method to
find the confident interval for parameters of a large class of distribution. To avoid confusion, let
θ0 denote the true, unknown value of the parameter θ. Suppose ϕ is an estimator of θ0 such that
√ w
n(ϕ − θ0 ) → N(0, σϕ2 ). (5.3)
√
The parameter σϕ2 is the asymptotic variance of nϕ and, in practice, it is usually unknown. For
the present, though, we suppose that σϕ2 is known.
CHAPTER 5. PARAMETER ESTIMATION 98

√
Let Z = n(ϕ−θ0 )/σϕ be the standardized random variable. Then Z is asymptotically N(0, 1).
Hence, P[−1.96 < Z < 1.96] = 0.95. This implies
h σϕ σϕ i
0.95 = P ϕ − 1.96 √ < θ0 < ϕ + 1.96 √ (5.4)
n n

σ σ
Because the interval ϕ − 1.96 √ϕn < θ0 < ϕ + 1.96 √ϕn is a function of the random variable ϕ,
we will call it a random interval. The probability that the random interval contains θ is approxi-
mately 0.95.
Since in practice, we often do not know σϕ . Suppose that there exists a consistent estimator
of σϕ , say Sϕ . It then follows from Slutsky’s theorem that
√
n(ϕ − θ0 ) w
→ N (0, 1).
Sϕ
√ √
Hence the interval ϕ − 1.96Sϕ / n, ϕ − 1.96Sϕ / n would be a random interval with approxi-
mate probability 0.95% of covering θ0 .
In general, we have the following definition.

Definition 5.3.11. Let (X1 , . . . , Xn ) be a sample from a distribution F (x, θ), θ ∈ Θ. A random
interval (ϕ1 , ϕ2 ), where ϕ1 and ϕ2 are some estimator of θ, is called a (1 − α)-confidence interval
for θ if
P(ϕ1 < θ < ϕ2 ) = 1 − α,

for some α ∈ [0, 1].

Confidence interval for the Mean a

Let X1 , . . . , Xn be a random sample from the distribution of a random variable X which has
unknown mean a and unknown variance σ 2 . Let X̄ and s2 be sample mean and sample variance,
√
respectively. By the Central limit theorem, the distribution of n(X̄ − a)/s approximates N(0; 1).
Hence, an approximated (1 − α) confidence interval for a is
s s
x̄ − zα/2 √ , x̄ + zα/2 √ , (5.5)
n n

where zα/2 = Φ−1 (1 − α/2).

1. Because α < α0 implies that xα/2 > xα0 /2 , selection of higher values for confidence coeffi-
cients leads to larger error terms and hence, longer confidence intervals, assuming all else
remains the same.

2. Choosing a larger sample size decreases the error part and hence, leads to shorter confi-
dence intervals, assuming all else stays the same.

3. Usually the parameter σ is some type of scale parameter of the underlying distribution.
In these situations, assuming all else remains the same, an increase in scale (noise level),
generally results in larger error terms and, hence, longer confidence intervals.
CHAPTER 5. PARAMETER ESTIMATION 99

Confidence interval for p

Let X1 , . . . , Xn be a random sample from the Bernoulli distribution with probability of suc-
cess p. Let p̂ = X̄ be the sample proportion of successes. It follows from the Central limit theorem
that p̂ has an approximate N(p; p(1−p) n ) distribution. Since p̂ and p̂(1 − p̂) are consistent estimators
for p and p(1 − p), respectively, an approximate (1 − α) confidence interval for p is given by
r r
p̂(1 − p̂ p̂(1 − p̂
p̂ − zα/2 , p̂ + zα/2 .
n n

Confidence interval for mean of normal distribution

In general, the confidence intervals developed so far in this section are approximate. They are
based on the Central Limit Theorem and also, often require a consistent estimate of the variance.
In our next example, we develop an exact confidence interval for the mean when sampling from
a normal distribution

Theorem 5.3.12. Let X1 , . . . , Xn be a random sample from a N(a, σ 2 ) distribution. Recall

that X̄ and s2 are sample mean and sample variance, respectively. The random variable
√
T = (X̄ − a)/(s/ n) has a t-distribution with n − 1 degrees of freedom. a
a
In statistics, the t-distribution was first derived as a posterior distribution in 1876 by Helmert and
Lroth. The t-distribution also appeared in a more general form as Pearson Type IV distribution in Karl
Pearson’s 1895 paper.
In the English-language literature the distribution takes its name from William Sealy Gosset’s 1908 paper
in Biometrika under the pseudonym “Student”.

For each α ∈ (0, 1), denote tα/2,n−1 satisfying

α
= P T > tα/2,n−1 .
2
Thanks to the symmetry of t-distribution, we have
X̄ − a
1 − α = P − tα/2,n−1 < T < tα/2,n−1 = P − tα/2,n−1 < √ < tα/2,n−1
S/ n
s s
= P X̄ − tα/2,n−1 √ < a < X̄ + tα/2,n−1 √ .
n n
Thus, a (1 − α) confidence interval for a is given by
S S
X̄ − tα/2,n−1 √ , X̄ + tα/2,n−1 √ . (5.6)
n n
Note that the only difference between this confidence interval and the large sample confidence
interval (5.5) is that tα/2,n−1 replaces zα/2 . This one is exact while (5.5) is approximate. Of course,
we have to assume we are sampling a normal population to get the exactness. In practice, we
often do not know if the population is normal. Which confidence interval should we use? Gen-
erally, for the same α, the intervals based on tα/2,n−1 are larger than those based on zα/2 . Hence,
the interval (5.6) is generally more conservative than the interval (5.5). So in practice, statisticians
generally prefer the interval (5.6).
CHAPTER 5. PARAMETER ESTIMATION 100

Confidence interval on the variance and standard deviation of a normal population

Sometimes confidence intervals on the population variance or standard deviation are needed.
When the population is modelled by a normal distribution, the tests and intervals described in
this section are applicable. The following result provides the basis of constructing these confi-
dence intervals.

Theorem 5.3.13. Let (X1 , X2 , . . . , Xn ) be a random sample from a normal distribution

with mean µ and variance σ 2 , and let s2 be the sample variance, i.e,
n
1 X
s2 = (Xi − X̄i )2 .
n−1
i=1

Then the random variable

(n − 1)s2
χ2n−1 =
σ2
has a χ2 -distribution with n − 1 degrees of freedom.

We recall that the pdf of a χ2 random variable with k degree of freedom is

1 k x
f (x) = x 2 −1 e− 2 , x > 0.
2k/2 Γ(k/2)

Theorem 5.3.13 leads to the following construction of the CI for σ 2 .

Theorem 5.3.14. If s2 is the sample variance from a random sample of n observation from a nor-
mal distribution with unknown variance σ 2 , then a 100(1 − α)% CI on σ 2 is

(n − 1)s2 2 (n − 1)s2
≤ σ ≤ ,
c2α/2,n−1 c21−α/2,n−1

where c2a,n−1 satisfies P(χ2n−1 > c2a,n−1 ) = a and the random variable χ2n−1 has a chi-square distri-
bution with n − 1 degrees of freedom.

Confidence interval for differences in means

A practical problem of interest is the comparison of two distributions; that is, comparing the
distributions of two random variables, say X and Y . In this section, we will compare the means
of X and Y . Denote the means of X and Y by aX and aY , respectively. In particular, we shall
obtain confidence intervals for the difference ∆ = aX − aY . Assume that the variances of X and
Y are finite and denote them as σX 2 = V ar(X) and let σ 2 = V ar(Y ). Let X ...., X be a random
Y 1 n
sample from the distribution of X and let Y1 , ..., Ym be a random sample from the distribution of
Y. Assume that the sample were gathered independently of one another. Let X̄ and Ȳ the sample
means of X and Y , respectively. Let ∆ ˆ = X̄ − Ȳ . Next we obtain a large sample confidence
interval for ∆ based on the asymptotic distribution of ∆. ˆ
CHAPTER 5. PARAMETER ESTIMATION 101

Proposition 5.3.15. Let N = n + m denote the total sample size. We suppose that
n m
→ λX , and → λY where λX + λY = 1.
N N
Then a (1 − α) confidence interval for ∆ is
2 and σ 2 are known)
1. (if σX Y
r r
2
σX σ2 2
σX σ2
(X̄ − Ȳ ) − zα/2 + Y , (X̄ − Ȳ ) + zα/2 + Y ; (5.7)
n m n m
2 and σ 2 are unknown)
2. (if σX Y
r r
s2 (X) s2 (Y ) s2 (X) s2 (Y )
(X̄ − Ȳ ) − zα/2 + , (X̄ − Ȳ ) + zα/2 + , (5.8)
n m n m
where s2 (X) and s2 (Y ) are sample variances of (Xn ) and (Ym ), respectively.

√ w
Proof. It follows from the Central limit theorem that n(X̄ − aX ) −→ N(0; σX
2 ). Thus,

√ w
2
σX
N (X̄ − aX ) −→ N(0; ).
λX
Likewise,
√ w σY2
N (Ȳ − aY ) −→ N(0; ).
λY
Since the samples are independent of one another, we have
√
w σ2 σ2
N (X̄ − Ȳ ) − (aX − aY ) −→ N(0; X + Y ).
λX λY

This implies (5.7). Since S ∗2 (X) and S ∗2 (Y ) are consistent estimators of σX

2 and σ 2 , applying
Y
Slutsky’s theorem, we obtain (5.8).

Confidence interval for difference in proportions

Let X and Y be two independent random variables with Bernoulli distributions B(1, p1 ) and
B(1, p2 ), respectively. Let X1 , . . . , Xn be a random sample from the distribution of X and let
Y1 , . . . , Ym be a random sample from the distribution of Y .

Proposition 5.3.16. A (1 − α) confidence interval for p1 − p2 is

r r
p̂1 (1 − p̂1 ) p̂2 (1 − p̂2 ) p̂1 (1 − p̂1 ) p̂2 (1 − p̂2 )
p̂1 − p̂2 − zα/2 + , p̂1 − p̂2 + zα/2 + ,
n m n m
where p̂1 = X̄ and p̂2 = Ȳ .
CHAPTER 5. PARAMETER ESTIMATION 102

5.4 Method of finding estimation

5.4.1 Maximum likelihood estimation

The method of maximum likelihood is one of the most popular technique for deriving esti-
mators. Let X = (X1 , . . . , Xn ) be a random sample from a distribution with pdf/pdm f (x; θ). The
likelihood function is defined by
n
Y
L(x; θ) = f (xi ; θ).
i=1

Definition 5.4.1. For each sample point x, let θ̂(x) be a parameter value at which L(x; θ) attains
its maximum as a function of θ, with x held fixed. A maximum likelihood estimator (MLE) of the
parameter θ based on a sample X is θ̂(X).

Example 5.4.2. Let (X1 , . . . , Xn ) be a random sample from the distribution N (θ, 1), where θ is
unknown. We have
n
Y 1 2
L(x; θ) = √ e−(xi −θ) /2 ,
i=1
2π

A simple calculus shows that the MLE of θ is θ̂ = n1 ni=1 xi . One can easily verify that θ̂ is an
P

unbiased and consistent estimator of θ.

Example 5.4.3. Let (X1 , . . . , Xn ) be a random sample from the Bernoulli distribution with a un-
known parameter p. The likelihood function is
n
Y
L(x; p) = pxi (1 − p)1−xi .
i=1

1 Pn
A simple calculus shows that the MLE of p is θ̂ = n i=1 xi . One can easily verify that θ̂ is an
unbiased and consistent estimator of θ.

Let X1 , . . . , Xn denote a random sample from the distribution with pdf f (x; θ), θ ∈ Θ. Let θ0
denote the true value of θ. The following theorem gives a theoretical reason for maximizing the
likelihood function. It says that the maximum of L(θ) asymptotically separates the true model at
θ0 from models at θ 6= θ0 .

Theorem 5.4.4. Suppose that

(R0) f (.; θ) 6= f (.; θ0 ) for all θ 6= θ0 ;

(R1) all f (.; θ), θ ∈ Θ have common support for all θ.

Then
lim Pθ0 [L(X; θ0 ) > L(X; θ)] = 1, for all θ 6= θ0 .
n→∞
CHAPTER 5. PARAMETER ESTIMATION 103

Proof. By taking logs, we have

n
h1 X f (X ; θ) i
i
Pθ0 [L(X; θ0 ) > L(X; θ)] = P ln <0 .
n f (Xi ; θ0 )
i=1

Since the function φ(x) = − ln x is strictly convex, it follows from the Law of Large Numbers and
Jensen’s inequality that
n
1 X f (Xi ; θ) P h f (X ; θ) i
1
h f (X ; θ) i
1
ln → Eθ0 ln < ln Eθ0 = 0.
n f (Xi ; θ0 ) f (X1 ; θ0 ) f (X1 ; θ0 )
i=1

Note that condition f (.; θ) 6= f (.; θ0 ) for all θ 6= θ0 is needed to obtain the last strict inequality
while the common support is needed to obtain the last equality.

Theorem 5.4.4 says that asymptotically the likelihood function is maximized at the true value θ0 .
So in considering estimates of θ0 , it seems natural to consider the value of θ which maximizes the
likelihood.
We close this section by showing that maximum likelihood estimators, under regularity con-
ditions, are consistent estimators.

Theorem 5.4.5. Suppose that the pdfs f (x, θ) satisfying (R0), (R1) and

(R2) The point θ0 is an interior point in Θ.

(R3) f (x; θ) is differentiable with respect to θ in Θ.

Then the likelihood equation,

∂ ∂
L(θ) = 0 ⇔ ln L(θ) = 0,
∂θ ∂θ
P
has a solution θ̂n such that θ̂n → θ0 .

5.4.2 Method of Moments

Let (X1 , . . . , Xn ) be a random sample from a distribution with density f (x; θ) where θ =
(θ1 , . . . , θk ) ∈ Θ ⊂ Rk . Method of moments estimators are found by equating the first k sam-
ple moments to the corresponding k population moments, and solving the resulting system of
simultaneous equations. More precisely, define

µj = E[X j ] = gj (θ1 , . . . , θk ), j = 1, . . . , k.

and
n
1X j
mj = Xi .
n
i=1

The moments estimator (θ̂1 , . . . , θ̂k ) is obtained by solving the system of equations

mj = gj (θ1 , . . . , θk ), j = 1, . . . , k.
CHAPTER 5. PARAMETER ESTIMATION 104

Example 5.4.6 (Binomial distribution). Let (X1 , . . . , Xn ) be a random sample from the Binomial
distribution B(k, p), that is,

Pk,p [Xi = x] = Ckx px (1 − p)k−x , x = 0, 1, . . . , k.

Here we assume that p and k are unknown parameters. Equating the first two sample moments
to those of the population yields
 
X̄ = kp k = k̂ = X̄n2
n 1 Pn
X̄n − n 2
⇔ i=1 (Xi −X̄n )
 1 n X 2 = kp(1 − p) + k 2 p2
P p = p̂ = X̄n .
n i=1 i k̂

5.5 Lower bound for variance

In this section we establish a remarkable inequality called the Rao-Cramer lower bound which
gives a lower bound on the variance of any unbiased estimate. We then show that, under regu-
larity conditions, the variances of the maximum likelihood estimates achieve this lower bound
asymptotically.

Theorem 5.5.1 (Rao-Cramer Lower Bound). Let X1 , . . . , Xn be iid with common pdf
f (x; θ) for θ ∈ Θ. Assume that the regularity conditions (R0)-(R2) hold. Moreover, suppose
that

(R4) The pdf f (x; θ) is twice differentiable as a function of θ.

R
(R5) The integral f (x; θ)dx can be differentiated twice under integral sign as a function
of θ.

Let Y = u(X1 , . . . , Xn ) be a statistic with mean E[Y ] = E[u(X1 , . . . , Xn )] = k(θ). Then

[k 0 (θ)]2
DY ≥ ,
nI(θ)

where I(θ) is called Fisher information and given by

Z ∞ 2
∂ ln f (x; θ) h ∂ ln f (X; θ) i
I(θ) = − f (x; θ)dx = D .
−∞ ∂θ2 ∂θ

Proof. Since Z
k(θ) = u(x1 , . . . , xn )f (x1 ; θ) . . . f (xn ; θ)dx1 . . . dxn ,
Rn
we have
Z n
0
X ∂ ln f (xi ; θ)
k (θ) = u(x1 , . . . , xn ) f (x1 ; θ) . . . f (xn ; θ)dx1 . . . dxn .
Rn ∂θ
i=1
CHAPTER 5. PARAMETER ESTIMATION 105

Denote Z = ni=1 ∂ ln f∂θ(xi ;θ) . It is easy to verify that E[Z] = 0 and DZ = nI(θ). Moreover, k 0 (θ) =
P

E[Y Z]. Hence, we have

k 0 (θ) = E[Y Z] = E[Y ]E[Z] + ρ nI(θ)DY ,

where ρ is the correlation coefficient between Y and Z. Since E[Z] = 0 and ρ2 ≤ 1, we get

|k 0 (θ)|2
≤ 1,
nI(θ)DY

which implies the desired result.

Definition 5.5.2. Let Y be an unbiased estimator of a parameter θ in the case of point estimation.
The statistic Y is called an efficient estimator of θ if and only if the variance of Y attains the Rao-
Cramer lower bound.

Example 5.5.3. Let X1 , X2 , . . . , Xn denote a random sample from a exponential distribution that
has the mean λ > 0. Show that X̄ is an efficient estimator of λ.

Example 5.5.4 (Poisson distribution). Let X1 , X2 , . . . , Xn denote a random sample from a Pois-
son distribution that has the mean θ > 0. Show that X̄ is an efficient estimator of θ.

In the above examples, we were able to obtain the MLEs in closed form along with their dis-
tributions and, hence, moments. This is often not the case. Maximum likelihood estimators,
however, have an asymptotic normal distribution. In fact, MLEs are asymptotically efficient.

Theorem 5.5.5. Assume X1 , . . . , Xn are iid with pdf f (x; θ0 ) for θ0 ∈ Θ such that the regularity
condition (R0)-(R5) are satisfied. Suppose further that 0 < I(θ0 ) < ∞, and

(R6) The pdf f (x; θ) is three times differentiable as a function of θ. Moreover, for all θ ∈ Θ, there
exists a constant c and a function M (x) such that

∂ 2 ln f (x; θ)
≤ M (x),
∂θ3
with Eθ0 [M (X)] < ∞, for all θ0 − c < θ < θ0 + c and all x in the support of X.

Then any consistent sequence of solutions of the mle equations satisfies

√ w
n(θ̂n − θ0 ) → N (0, I(θ0 )−1 ).

Proof. Expanding the function l0 (θ) into a Taylor series of order two about θ0 and evaluating it at
θ̂n , we get
1
l0 (θ̂n ) = l0 (θ0 ) + (θ̂n − θ0 )l00 (θ0 ) + (θ̂n − θ0 )2 l000 (θn∗ ),
2
where θn∗ is between θ0 and θ̂n . But l0 (θ̂n ) = 0. Hence,

√ n−1/2 l0 (θ0 )
n(θ̂n − θ0 ) = .
−n−1 l00 (θ0 ) − (2n)−1 (θ̂n − θ0 )l000 (θn∗ )
CHAPTER 5. PARAMETER ESTIMATION 106

By the Central Limit Theorem,

n
1 1 X ∂ ln f (Xi ; θ0 ) w
√ l0 (θ0 ) = √ → N (0, I(θ0 )).
n n ∂θ
i=1

Also, by the Law of Large Numbers,

n
1 1 X ∂ 2 ln f (Xi ; θ0 ) P
− l00 (θ0 ) = − → I(θ0 ).
n n ∂θ2
i=1

Note that |θ̂n − θ0 | < c0 implies that |θn∗ − θ0 | < c0 , thanks to Condition (R6), we have
n n
1 1 X ∂ 2 ln f (Xi ; θ) 1X
− l000 (θn∗ ) ≤ ≤ M (Xi ).
n n ∂θ3 n
i=1 i=1

1 Pn P
Since Eθ0 |M (X)| < ∞, applying Law of Large Numbers, we have n i=1 M (Xi ) → Eθ0 [M (X)].
P
Moreover, since θ̂n → θ0 , for any > 0, there exists N > 0 so that P[|θ̂n − θ0 | < c0 ] ≥ 1 − 2 and
h 1Xn i
P M (Xi ) − Eµ0 [M (X)] < 1 ≥ 1 − ,
n 2
i=1

for all n ≥ N . Therefore,

h 1 i
P − l000 (θn∗ ) ≤ 1 + Eθ0 [M (X)] ≥ 1 − ,
n 2
hence n−1 l000 (θn∗ ) is bounded in probability. This implies the desired result.

5.6 Exercises

5.6.1 Confidence interval

5.1. For a normal population with known variance σ 2 , answer the following questions:
√ √
1. What is the confidence level for the interval x̄ − 2.14σ/ n ≤ µ ≤ x̄ + 2.14σ/ n.
√ √
2. What is the confidence level for the interval x̄ − 2.49σ/ n ≤ µ ≤ x̄ + 2.49σ/ n.
√ √
3. What is the confidence level for the interval x̄ − 1.85σ/ n ≤ µ ≤ x̄ + 1.84σ/ n.

5.2. A confidence interval estimate is desired for the gain in a circuit on a semiconductor device.
Assume that gain is normally distributed with standard deviation σ = 20.

1. Find a 95% CI for µ when n = 10 and x̄ = 1000.

2. Find a 95% CI for µ when n = 25 and x̄ = 1000.

3. Find a 99% CI for µ when n = 10 and x̄ = 1000.

4. Find a 99% CI for µ when n = 25 and x̄ = 1000.

CHAPTER 5. PARAMETER ESTIMATION 107

5.3. Following are two confidence interval estimates of the mean µ of the cycles to failure of an
automotive door latch mechanism (the test was conducted at an elevated stress level to acceler-
ate the failure).
3124.9 ≤ µ ≤ 3215.7 3110.5 ≤ µ ≤ 3230.1.
1. What is the value of the sample mean cycles to failure?

2. The confidence level for one of these CIs is 95% and the confidence level for the other is
99%. Both CIs are calculated from the same sample data. Which is the 95% CI? Explain
why.
5.4. n = 100 random samples of water from a fresh water lake were taken and the calcium
concentration (milligrams per liter) measured. A 95% CI on the mean calcium concentration
is 0.49 ≤ µ ≤ 0.82.
1. Would a 99% CI calculated from the same sample data been longer or shorter?

2. Consider the following statement: There is a 95% chance that µ is between 0.49 and 0.82. Is
this statement correct? Explain your answer.

3. Consider the following statement: If n = 100 random samples of water from the lake were
taken and the 95% CI on µ computed, and this process was repeated 1000 times, 950 of the
CIs will contain the true value of µ. Is this statement correct? Explain your answer.
5.5. A research engineer for a tire manufacturer is investigating tire life for a new rubber com-
pound and has built 16 tires and tested them to end-of-life in a road test. The sample mean and
standard deviation are 60,139.7 and 3645.94 kilometers. Find a 95% confidence interval on mean
tire life.
5.6. An Izod impact test was performed on 20 specimens of PVC pipe. The sample mean is X̄ =
1.25 and the sample standard deviation is S = 0.25. Find a 99% lower confidence bound on Izod
impact strength.
5.7. The compressive strength of concrete is being tested by a civil engineer. He tests 12 speci-
mens and obtains the following data.
2216 2237 2225 2301 2318 2255
2249 2204 2281 2263 2275 2295
1. Is there evidence to support the assumption that compressive strength is normally dis-
tributed? Does this data set support your point of view? Include a graphical display in
your answer.

2. Construct a 95% confidence interval on the mean strength.

5.8. A machine produces metal rods. A random sample of 15 rods is selected, and the diameter
is measured. The resulting date (in millimetres) are as follows
8.24 8.25 8.2 8.23 8.24
8.21 8.26 8.26 8.2 8.25
8.23 8.23 8.19 8.28 8.24
CHAPTER 5. PARAMETER ESTIMATION 108

1. Check the assumption of normality for rod diameter.

2. Find a 95% CI on mean rod diameter.

5.9. A rivet is to be inserted into a hole. A random sample of n = 15 parts is selected, and the
hole diameter is measured. The sample standard deviation of the hole diameter measurements
is s = 0.008 millimeters. Construct a 99% CI for σ 2 .

5.10. The sugar content of the syrup in canned peaches is normally distributed with standard
deviation σ. A random sample of n = 10 cans yields a sample standard deviation of s = 4.8
milligrams. Find a 95% CI for σ.

5.11. Of 1000 randomly selected cases of lung cancer, 823 resulted in death within 10 years.

1. Construct a 95% CI on the death rate from lung cancer.

2. How large a sample would be required to be at least 95% confident that the error in esti-
mating the 10-year death rate from lung cancer is less than 0.03?

5.12. A random sample of 50 suspension helmets used by motorcycle riders and automobile
race-car drivers was subjected to an impact test, and on 18 of these helmets some damage was
observed.

1. Find a 95% CI on the true proportion of helmets of this type that would show damage from
this test.

2. Using the point estimate of p obtained from the preliminary sample of 50 helmets, how
many helmets must be tested to be 95% confident that the error in estimating the true
value of p is less than 0.02?

3. How large must the sample be if we wish to be at least 95% confident that the error in
estimating p is less than 0.02, regardless of the true value of p?

5.13. Consider a CI for the mean µ when σ is known,

√ √
x̄ − zα1 σ/ n ≤ µ ≤ x̄ + zα2 σ/ n

where α1 + α2 = α. If α1 = α2 = α/2, we have the usual 100(1 − α)% CI for µ. In the above, when
√
α1 6= α2 , the CI is not symmetric about µ. The length of the interval is L = σ(zα1 + zα2 )/ n. Prove
that the length of the interval L is minimized when α1 = α2 = α/2.

5.14. Let the observed value of the mean X̄ of a random sample of size 20 from a distribution
that is N (µ, 80) be 81.2. Find a 95 percent confidence interval for µ.

5.15. Let X̄ be the mean of a random sample of size n from a distribution that is N (µ, 9). Find n
such that P[X̄ − 1 < µ < X̄ + 1] = 0.90, approximately.

5.16. Let a random sample of size 17 from the normal distribution N (µ, σ 2 ) yield x̄ = 4.7 and
s2 = 5.76. Determine a 90 percent confidence interval for µ.
CHAPTER 5. PARAMETER ESTIMATION 109

5.17. Let X̄ denote the mean of a random sample of size n from a distribution that has mean
µ and variance σ 2 = 10. Find n so that the probability is approximately 0.954 that the random
interval (X̄ − 21 , X̄ + 21 ) includes µ.

5.18. Find a 1 − α confidence interval for θ, given X1 , . . . , Xn iid with pdf

1
1. f (x; θ) = 1 if θ − 2 < x < θ + 12 .

2. f (x; θ) = 2x/θ2 if 0 < x < θ, θ > 0.

5.19. Let (X1 , . . . , Xn ) be a random sample from a N(0, σX

2 ), and let (Y , . . . , Y ) be a random
1 m
sample from a N(0, σY2 ), independent of the Xs. Define λ = σY2 /σX 2 . Find a (1 − α) CI for λ.

5.20. Suppose that X1 , . . . , Xn is a random sample from a N(µ, σ 2 ) population.

1. If σ 2 is known, find a minimum value for n to guarantee that a 0.95 CI for µ will have length
no more than σ/4.

2. If σ 2 is unknown, find a minimum value for n to guarantee, with probability 0.90, that a 0.95
CI for µ will have length no more than σ/4.

5.21. Let (X1 , . . . , Xn ) be iid uniform U(0; θ). Let Y be the largest order statistics. Show that the
distribution of Y /θ does not depend on θ, and find the shortest (1 − α) CI for θ.

5.6.2 Point estimator

5.22. Let X1 , X2 , X3 be a random sample of size three from a uniform (θ, 2θ) distribution, where
θ > 0.

1. Find the method of moments estimator of θ.

2. Find the MLE of θ.

5.23. Let X1 , . . . , Xn be a random sample from the pdf

f (x; θ) = θx−2 , 0 < θ < x < ∞.

1. What is a sufficient statistics for θ.

2. Find the mle of θ.

3. Find the method of moments estimator of θ.

5.24. Let X1 , . . . , Xn be iid with density

eθ−x
f (x; θ) = , x ∈ R, θ ∈ R.
(1 + eθ−x )2

Show that the mle of θ exists and is unique.

CHAPTER 5. PARAMETER ESTIMATION 110

5.25. Let X1 , . . . , Xn represent a random sample from each of the distributions having the fol-
lowing pdfs or pmfs:

1. f (x; θ) = θx e−θ /x!, x = 0, 1, 2, . . . , 0 ≤ θ < ∞, zero elsewhere.

2. f (x; θ) = 1θ I{0<x<θ} , θ > 0.

3. f (x; θ) = θxθ−1 I{0<x<1} , 0 < θ < ∞.

e−x/θ
4. f (x; θ) = θ I{x>0} , 0 < θ < ∞.

5. f (x; θ) = eθ−x I{x>θ} , −∞ < θ < ∞.

6. f (x; θ) = 12 e−|x−θ| , −∞ < x < ∞, −∞ < θ < ∞. Find the mle of θ.

In each case find the mle θ̂ of θ.

5.26. Let X1 , . . . , Xn be a sample from the inverse Gaussian pdf

λ 1/2
exp − λ(x − µ)2 /(2µ2 x) ,

f (x; µ, λ) = 3
x > 0.
2πx
Find the mles of µ and λ.
2x
5.27. Suppose X1 , . . . , Xn are iid with pdf f (x; θ) = I
θ2 {0<x≤θ}
. Find

1. the mle θ̂ for θ;

2. the constant c so that E[cθ̂] = θ;

3. the mle for the median of the distribution.

5.28. Suppose X1 , . . . , Xn are iid with pdf f (x; θ) = e−x/θ I{0<x<∞} . Find the mle of P[X ≥ 3].

5.29. The independent random variables X1 , . . . , Xn have the common distribution




 0 if x < 0

P(Xi ≤ x|α, β) = (x/β)α if 0 ≤ x ≤ β


1 if x > β,


where the parameter α and β are positive.

1. Find a two dimensional sufficient statistics for (α, β).

2. Find the mles of α and β.

3. The length (in millimeters) of cuckoos’ eggs found in hedge sparrow nests can be modelled
with this distribution. Fot the data

22, 0, 23, 9, 20, 9, 23, 8, 25, 0, 24, 0 21, 7, 23, 8, 22, 8, 23, 1, 23, 1, 23, 5, 23, 0, 23, 0,

find the mles of α and β.

CHAPTER 5. PARAMETER ESTIMATION 111

5.30. Suppose that the random variables Y1 , . . . , Yn satisfy

Yi = βxi + i , i = 1, . . . , n,

where x1 , . . . , xn are fixed constants, and ! , . . . , n are iid N(0, σ 2 ), σ 2 unknown.

P P
1. Show that β̂n := Yi / xi is an unbiased estimator of β. Find the variance of β̂.

2. Find a two-dimensional sufficient statistics for (β, σ 2 ).

3. Find the mle β̄n of β, and show that it is an unbiased estimator of β. Compare the variances
of β̄n and β̂n .

4. Find the distribution of the mle of β.

5.6.3 Lower bound for variance

5.31. Let (X1 , . . . , Xn ) be a random sample from a population with eman µ and variance σ 2 .

1. Show that the estimator ni=1 ai Xi is an unbiased estimator of µ if ni=1 ai = 1.

P P

2. Among all such unbiased estimator, find the one with minimum variance, and calculate
the variance.

5.32. Given the pdf

1
f (x; θ) = , x ∈ R, θ ∈ R,
π(1 + (x − θ)2 )
show that the Rao-Cramér lower bound is n2 , where n is the sample size. What is the asymptotic
√
distribution of n(θ̂ − θ) if θ̂ is the mle of θ?

5.33. Let X have a gamma distribution with α = 4 and β = θ > 0.

1. Find the Fisher information I(θ).

2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an effi-
cient estimator of θ.
√
3. What is the asymptotic distribution of n(θ̂ − θ).

5.34. Let X be N(0; θ), 0 < θ < ∞.

1. Find the Fisher information I(θ).

2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an effi-
cient estimator of θ.
√
3. What is the asymptotic distribution of n(θ̂ − θ).

5.35. Let (X1 , . . . , Xn ) be a random sample from a N(0; θ) distribution. We want to estimate the
√
standard deviation θ. Find the constant c so that Y = c ni=1 |Xi | is an unbiased estimator of
P
√
θ and determine its efficiency.
CHAPTER 5. PARAMETER ESTIMATION 112

5.36. If (X1 , . . . , Xn ) is a random sample from a distribution with pdf


 3θ3 0 < x < ∞, 0 < θ < ∞
4
f (x; θ) = (x+θ)
0 otherwise,

show that Y = 2X̄n is an unbiased estimator of θ and determine its efficiency.

5.37 (Beta (θ, 1) distribution). Let X1 , X2 , . . . , Xn denote a random sample of size n > 2 from a
distribution with pdf 
θxθ−1 for x ∈ (0, 1)
f (x; θ) =
0 otherwise
where the parameter space Ω = (0, ∞).
n
1. Show that θ̂ = − Pn ln Xi
is the MLE estimator of θ.
i=1

2. Show that θ̂ is gamma distributed.

3. Show that θ̂ is asymptotic unbiased estimator of θ.

4. Is θ̂ an efficient estimator of θ?
2
5.38. Let X1 , . . . , Xn be iid N(θ, 1). Show that the best unbiased estimator of θ2 is X n − n1 . Calcu-
late its variance and show that it is greater thatn the Cramer-Rao lower bound.
Chapter 6

Hypothesis Testing

6.1 Introduction
Point estimation and confidence intervals are useful statistical inference procedures. Another
type of inference that is frequently used concerns tests of hypotheses. As in the last section,
suppose our interest centers on a random variable X which has density function f (x; θ) where
θ ∈ Θ. Suppose we think, due to theory or a preliminary experiment, that θ ∈ Θ0 or θ ∈ Θ1 where
Θ0 and Θ1 are subsets of Θ and Θ0 ∪ Θ1 = Θ. We label the hypothesis as

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 . (6.1)

We call H0 the null hypothesis and H1 the alternative hypothesis. A hypothesis of the form θ = θ0
is called a simple hypothesis while a hypothesis of the form θ > θ0 or θ < θ0 is called a composite
hypothesis. A test of the form

H0 : θ = θ0 versus H1 : θ 6= θ0

is called a two-sided test. A test of the form

H0 : θ ≤ θ0 versus H1 : θ > θ0 ,

or
H0 : θ ≥ θ0 versus H1 : θ < θ0

is called a one-sided test.

Often the null hypothesis represents no change or no difference from the past while the al-
ternative represents change or difference. The alternative is often referred to as the research
worker’s hypothesis. The decision rule to take H0 or H1 is based on a sample X1 , . . . , Xn from
the distribution of X and hence, the decision could be right or wrong. There are only two types
of statistical errors we may commit: rejecting H0 when H0 is true (called the Type I error) and
accepting H0 when H0 is wrong (called the Type II error).
Let D denote the sample space. A test of H0 versus H1 is based on a subset C of D. This set C
is called the critical region and its corresponding decision rule is:

113
CHAPTER 6. HYPOTHESIS TESTING 114

Table 6.1: Decision Table for a Test of Hypothesis

Decision H0 is True H1 is True

Reject H0 Type I Error Correct Decision
Accept H0 Correct Decision Type II Error

• Reject H0 (Accept H1 ) if (X1 , . . . , Xn ) ∈ C;

• Retain H0 (Reject H0 ) if (X1 , . . . , Xn ) 6∈ C.

Our goal is to select a critical region which minimizes the probability of making error. In general,
it is not possible to simultaneously reduce Type I and Type II errors because of a see-saw effect:
if one takes C = ∅ then H0 would be never rejected so the probability of Type I error would be 0,
but the Type II error occurs with probability 1. Type I error is usually considered to be worse than
Type II. Therefore, we will choose a critical regions which, on one hand, bound the probability of
Type I error at a certain level, and on the other hand, minimizes the probability of Type II error.

Definition 6.1.1. A critical region C is called of size α if

α = max Pθ [(X1 , . . . , Xn ) ∈ C].

θ∈Θ0

α is also called the significance level of the test associated with critical region C.

Over all critical regions of size α, we will look for the one whom has the lowest probability of
Type II error. It also means that for θ ∈ Θ1 , we want to maximize

1 − Pθ [Type II Error] = Pθ [(X1 , . . . , Xn ) ∈ C].

We call the probability on the right side of this equation the power of the test at θ. So our task is
to find among all the critical region of size α the one with highest power.
We define the power function of a critical region by

γC (θ) = Pθ [(X1 , . . . , Xn ) ∈ C], θ ∈ Θ1 .

Example 6.1.2. Suppose X1 , . . . , Xn is a random sample from a N (µ, 1) distribution. Consider

the hypotheses
H0 : µ = µ0 versus H1 : µ = µ1
where µ0 < µ1 are specified. Let’s consider a critical region C of the form C = {X̄n > k}. Since
X̄n has the N (µ, n1 ) distribution, the size of critical regions is
√
α = Pµ0 [X̄n > k] = 1 − Φ( n(k − µ0 )).

The power function of the critical region C is

√
γC (µ1 ) = Pµ1 [X̄n > k] = 1 − Φ( n(k − µ1 )).

In particular, if we have µ0 = 0, µ1 = 1, n = 100 then at the significant level 5%, we would reject
H0 in favor of H1 if X̄n > 0.1965 and the power of the test is 1 − Φ(−8.135) = 0.9999.
CHAPTER 6. HYPOTHESIS TESTING 115

Example 6.1.3 (Large Sample Test for the Mean). Let X1 , . . . , Xn be a random sample from the
distribution of X with mean µ and finite variance σ 2 . We want to test the hypotheses

H0 : µ = µ0 versus H1 : µ > µ0

where µ0 is specified. To illustrate, suppose µ0 is the mean level on a standardized test of students
who have been taught a course by a standard method of teaching. Suppose it is hoped that a new
method which incorporates computers will have a mean level µ > µ0 , where µ = E[X] and X
is the score of a student taught by the new method. This conjecture will be tested by having n
students (randomly selected) to be taught under this new method.
Because X̄n → µ in probability, an intuitive decision rule is given by
Reject H0 in favor of H1 if X̄n is much large than µ0 .
In general, the distribution of the sample mean cannot be obtained in closed form. So we will
use the Central Limit Theorem to find the critical region. Indeed, since
X̄n − µ w
√ → N (0, 1),
S/ n
we obtain a test with an approximate size α:
X̄n −µ
Reject H0 in favor of H1 if √ 0
S/ n
≥ xα .
The power of the test is also approximated by using the Central Limit Theorem
√
√ n(µ0 − µ)
γ(µ) = P[X̄n ≥ µ0 + xα σ/ n] ≈ Φ − xα − .
σ
So if we have some reasonable idea of what σ equals, we can compute the approximate power
function.
−µ
Finally, note that if X has normal distribution then X̄n√
S/ n
has a t distribution with n−1 degrees
of freedom. Thus we can establish a rejection rule having exact level α:
−µ
X̄n√
Reject H0 in favor of H1 if T = S/ n
≥ tα,n−1 ,
where tα,n−1 is the upper α critical point of a t distribution with n − 1 degrees of freedom.
One way to report the results of a hypothesis test is to state that the null hypothesis was or
was not rejected at a specified α-value or level of significance. For example, we can say that
H0 : µ = 0 was rejected at the 0.05 level of significance. This statement of conclusions is often
inadequate because it gives the decision maker no idea about whether the computed value of
the test statistic was just barely in the rejection region or whether it was very far into this region.
Furthermore, stating the results this way imposes the predefined level of significance on other
users of the information. This approach may be unsatisfactory because some decision makers
might be uncomfortable with the risks implied by α = 0.05.
To avoid these difficulties the P-value approach has been adopted widely in practice. The
P -value is the probability that the test statistic will take on a value that is at least as extreme as
the observed value of the statistic when the null hypothesis H0 is true. Thus, a P -value conveys
much information about the weight of evidence against H0 , and so a decision maker can draw a
conclusion at any specified level of significance. We now give a formal definition of a P -value.
CHAPTER 6. HYPOTHESIS TESTING 116

Definition 6.1.4. The P -value is the smallest level of significance that would lead to rejection of
the null hypothesis H0 with the given data.

This mean that if α > P -value, we would reject H0 while if α < P -value, we would not reject
H0 .

6.2 Method of finding test

6.2.1 Likelihood Ratio Tests

Let L(x; θ) be the likelihood function of the sample (X1 , . . . , Xn ) from a distribution with den-
sity p(x; θ).

Definition 6.2.1. The likelihood test statistic for testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 is

supθ∈Θ0 L(x; θ)
λ(x) = .
supθ∈Θ L(x; θ)

A likelihood ratio test is any test that has a rejection region of the form C = {x : λ(x) ≤ c} for
some c ∈ [0, 1].

The motivation of likelihood ratio test comes from the fact that if θ0 is the true value of θ then,
asymptotically, L(θ0 ) is the maximum value of L(θ). Therefore, if H0 is true, λ should be close to
1; while if H1 is true, λ should be smaller.

Example 6.2.2 (Likelihood Ratio Test for the Exponential Distribution). Suppose X1 , . . . , Xn are
iid with pdf f (x; θ) = θ−1 e−x/θ I{x>0} and θ > 0. Let’s consider the hypotheses

H0 : θ = θ0 versus H1 : θ 6= θ0 ,

where θ0 > 0 is a specified value. The likelihood ratio test statistic simplifies to
X̄ n
n
λ(X) = en e−nX̄n /θ0 .
θ0
The decision rule is to reject H0 if λ ≤ c. Using differential calculus, it is easy to show that λ ≤ c iff
X̄ ≤ c1 θ0 or X̄ ≥ c2 θ0 for some positive constants c1 , c2 . Note that under the null hypothesis, H0 ,
the statistic θ20 ni=1 Xi has a χ2 distribution with 2n degrees of freedom. Therefore, the following
P

decision rule results in a level α test:

2 Pn 2 Pn
Reject H0 if θ0 i=1 Xi ≤ χ21−α/2 (2n) or θ0 i=1 Xi ≥ χ2α/2 (2n),

where χ21−α/2 (2n) is the lower α/2 quantile of a χ2 distribution with 2n degrees of freedom and
χ2α/2 (2n) is the upper α/2 quantile of a χ2 distribution with 2n degrees of freedom.

If ϕ(X) is a sufficient statistic for θ with pdf or pmf g(t; θ), then we might consider construct-
ing an likelihood ratio test based on ϕ and its likelihood function L∗ (t; θ) = g(t; θ) rather than on
the sample X and its likelihood function L(x; θ).
CHAPTER 6. HYPOTHESIS TESTING 117

Theorem 6.2.3. If ϕ(X) is a sufficient statistic for θ and λ∗ (t) and λ(x) are the likelihood ratio test
statistics based on ϕ and X, respectively, then λ∗ (ϕ(x)) = λ(x) for every x in the sample space.

Proof. From the Factorization Theorem, the pdf or pmf of X can be written as f (x; θ) = g(T (x); θ)h(x),
where g(t; θ) is the pdf or pmf of T and h(x) does not depend on θ. Thus

supΘ0 L(x; θ) supΘ0 f (x; θ) supΘ0 g(T (x); θ)h(x)

λ(x) = = =
supΘ L(x; θ) supΘ f (x; θ) supΘ g(T (x); θ)h(x)
∗
supΘ0 L (T (x); θ)
= = λ∗ (T (x)).
supΘ L∗ (x; θ)

6.3 Method of evaluating test

6.3.1 Most powerful test

Now we consider a test of a simple hypothesis H0 versus a simple alternative H1 . Let f (x; θ)
denote the density of a random variable X where θ ∈ Θ = {θ0 , θ1 }. Let X = (X1 , . . . , Xn ) be a
random sample from the distribution of X.

Definition 6.3.1. A subset C of the sample space is called a best critical region of size α for testing
the simple hypothesis
H0 : θ = θ0 versus H1 : θ = θ1 ,

if Pθ0 [X ∈ C] = α and for every subset A of the sample space

Pθ0 [X ∈ A] = α implies Pθ1 [X ∈ C] ≥ Pθ1 [X ∈ A].

The following theorem of Neyman and Pearson provides a systematic method of determining
a best critical region.

Theorem 6.3.2. Let (X1 , . . . , Xn ) be a sample from a distribution that has density f (x; θ).
Then the likelihood of X1 , X2 , . . . , Xn is
n
Y
L(x; θ) = f (xi ; θ), for x = (x1 , . . . , xn ).
i=1

Let θ0 and θ1 be distinct fixed values of θ so that Θ = {θ0 , θ1 }, and let k be a positive
number. Let C be a subset of the sample space such that
L(x;θ0 )
(a) L(x;θ1 ) ≤ k for each x ∈ C;

L(x;θ0 )
(b) L(x;θ1 ) ≥ k for each x ∈ D\C;

(c) α = Pθ0 [X ∈ C].

CHAPTER 6. HYPOTHESIS TESTING 118

Then C is a best critical region of size α for testing the simple hypothesis

H0 : θ = θ0 versus H1 : θ = θ1 .

Proof. We prove the theorem when the random variables are of the continuous type. If A is an-
other critical region of size α, we will show that
Z Z
L(x; θ1 )dx ≥ L(x; θ1 )dx.
C A

Write C as the disjoint union of C ∩ A and C ∩ Ac and A as the disjoint union of A ∩ C and A ∩ C c ,
we have
Z Z Z Z
L(x; θ1 )dx − L(x; θ1 )dx = L(x; θ1 )dx − L(x; θ1 )dx
C A C∩A c A∩CZc
Z
1 1
≥ L(x; θ0 )dx − L(x; θ0 )dx,
k C∩Ac k A∩C c

where the last inequality follows from conditions (a) and (b). Moreover, we have
Z Z Z Z
L(x; θ0 )dx − L(x; θ0 )dx = L(x; θ0 )dx − L(x; θ0 )dx = α − α = 0.
C∩Ac A∩C c C A

This implies the desired result.

Example 6.3.3. Let X = (X1 , . . . , Xn ) denote a random sample from the distribution N (θ, 1). It
is desired to test the simple hypothesis

H0 : θ = 0 versus H1 : θ = 1.

We have

1
exp − 12 ni=1 x2i
P
n
L(0; x) (2π)n/2
X n
= = exp − x i + .
L(1; x) 1
exp − 12 ni=1 (xi − 1)2
P 2
(2π)n/2 i=1

If k > 0, the set of all points (x1 , . . . , xn ) such that

n n
X n 1X 1 ln k
exp − xi + ≤k⇔ xi ≥ − =c
2 n 2 n
i=1 i=1

is a best critical region, where c is a constant that can be determined so that the size of the critical
region is α. Since X̄n ∼ N (0, 1/n),

Pθ0 (X̄n ≥ c) = α ⇔ c = Φ−1 (1 − α),

Rx 2
where Φ−1 is the reverse function of Φ(x) = √12π −∞ e−t /2 dt..
If X̄n ≥ c, the simple hypothesis H0 : θ = 0 would be rejected at the significance level α; if
X̄n < c, the hypothesis H0 would be accepted.
CHAPTER 6. HYPOTHESIS TESTING 119

The probability of rejecting H0 when H0 is true is α; the probability of rejecting H0 , when H0

is false, is the value of the power of the test at θ = 1. That is
Z ∞ (x − 1)2
1
Pθ1 [X̄n ≥ c] = p exp − dx.
c 2π/n 2/n

For example, if n = 25, α = 0.05 then c = 0.329. Thus the power of this best test of H0 against H1
is 0.05 at θ = 1 is
Z ∞ (x − 1)2
1
p exp − dx = 1 − Φ(−3.355) = 0.999.
0.329 2π/25 2/25

6.3.2 Uniformly most powerful test

We now define a critical region when it exists, which is a best critical region for testing a simple
hypothesis H0 against an alternative composite hypothesis H1 .

Definition 6.3.4. The critical region C is a uniformly most powerful (UMP) critical region of size
α for testing the simple hypothesis H0 against an alternative composite hypothesis H1 if the set
C is a best critical region of size a for testing H0 against each simple hypothesis in H1 . A test
defined by this critical region C is called a uniformly most powerful (UMP) test, with significance
level α, for testing the simple hypothesis H0 against the alternative composite hypothesis H1 .

It is well-known that uniformly most powerful tests do not always exist. However, when they
do exist, the Neyman-Pearson theorem provides a technique for finding them.

Example 6.3.5. Let (X1 , X2 , . . . , Xn ) be a random sample from the distribution N (0, θ), where
the variance θ is an unknown positive number. We will show that there exists a uniformly most
powerful test with significance level α for testing

H0 : θ = θ0 versus H1 : θ > θ0 .

The joint density of X1 , . . . , Xn is

n
1 1 X 2
L(θ; x1 , . . . , xn ) = exp − xi .
(2nθ)n/2 2θ
i=1

Let θ0 be a number greater than θ0 an let k denote a positive number. Let C be the set of points
where
n
L(θ0 ; x) X
2 2θ0 θ0 n θ0
≤ k ⇔ x i ≥ ln − ln k = c.
L(θ0 ; x) θ 0 − θ0 2 θ0
i=1
n o
The set C = (x1 , . . . , xn ) : ni=1 x2i ≥ c is then a best critical region for our testing problem. It
P

remains to determine c so that this critical region has size α, i.e.,

n
hX i
α = Pθ0 Xi2 ≥ c .
i=1
CHAPTER 6. HYPOTHESIS TESTING 120

This can be done using the observation that θ10 ni=1 Xi2 has a χ2 -distribution with n degrees of
P

freedom. Note that for each number θ0 > θ0 , the foregoing argument holds. It means that C is a
uniformly most powerful critical region of size α.
In conclusion, if ni=1 Xi2 ≥ c, H0 is rejected at the significance level α and H1 is accepted;
P

otherwise, H0 is accepted.
Example 6.3.6. Let (X1 , . . . , Xn ) be a sample from the normal distribution N (a, 1), where a is
unknown. We will show that there is no uniformly most powerful test of the simple hypothesis
H0 : a = a0 versus H1 : a 6= a0 .
However, if the alternative composite hypothesis is either H1 : a > a0 or H1 : a < a0 , a uniformly
most powerful test will exist in each instance.
Let a1 be a number not equal to a0 . Let k be a positive number and consider

1 1 Pn 2
(2π)n/2
exp − 2 (x
i=1 i − a0 ) n
X n
≤ k ⇔ (a 1 − a 0 ) xi ≥ (a21 − a20 ) − ln k.
1
exp − 1 n (x − a )2
P 2
(2π)n/2 2 i=1 i 1 i=1

This last inequality is equivalent to

n
X n ln k
xi ≥ (a1 − a0 ) − ,
2 a1 − a0
i=1
provided that a1 > a0 , and it is equivalent to
n
X n ln k
xi ≤ (a1 − a0 ) − ,
2 a1 − a0
i=1
if a1 < a0 . The first of these two expressions defines a best critical region for testing H0 : a = a0
against the hypothesis a = a1 provided that a1 > a0 , while the second expression defines a best
critical region for testing H0 : a = a0 against the hypothesis a = a1 provided that a1 < a0 . That is,
a best critical region for testing the simple hypothesis against an alternative simple hypothesis,
say a = a0 + 1, will not serve as a best critical region for testing H0 : a = a0 against the alternative
simple hypothesis a = a0 − 1. By definition, then, there is no uniformly most powerful test in the
case under consideration. However, if the alternative composite hypothesis is either H1 : a > a0
or H1 : a < a0 , a uniformly most powerful test will exist in each instance.
Remark 5. The sufficiency is importance for finding a test. Indeed, let X1 , . . . , Xn be a random
sample from a distribution that has pdf f (x, θ), θ ∈ Θ. Suppose that Y = u(X1 , . . . , Xn ) is a
sufficient statistic for θ. It follows from the factorization theorem that the joint pdf of X1 , . . . , Xn
may be written
L(x1 , . . . , xn ; θ) = k1 (u(x1 , . . . , xn ); θ)k2 (x1 , . . . , xn ),
where k2 (x1 , . . . , xn ) does not depend upon θ. It implies that the ratio
L(x1 , . . . , xn ; θ0 ) k1 (u(x1 , . . . , xn ); θ0 )
=
L(x1 , . . . , xn ; θ00 ) k1 (u(x1 , . . . , xn ), θ00 )
depends upon x1 , . . . , xn only through u(x1 , . . . , xn ). Consequently, if there is a sufficient statistic
Y = u(X1 , . . . , Xn ) for θ and if a best test or a uniformly most powerful test is desired, there is no
need to consider tests which are based upon any statistic other than the sufficient statistic.
CHAPTER 6. HYPOTHESIS TESTING 121

6.3.3 Monotone likelihood ratio

Consider the general one-sided hypotheses of the form

H0 : θ ≤ θ 0 versus H1 : θ > θ0 . (6.2)

In this section we introduce general forms of uniformly most powerful tests for these hypotheses
when the sample has a so called monotone likelihood ratio.

Definition 6.3.7. Let X0 = (X1 , . . . , Xn ) be a random sample with common pdf (or pmf ) f (x; θ), θ ∈
Θ. We say that its likelihood function L(x; θ) = ni=1 f (xi ; θ) has monotone likelihood ratio in the
Q

statistic y = u(x) if for θ1 < θ2 , the ratio

L(x; θ1 )
L(x; θ2 )
is a monotone function of y = u(x).

Theorem 6.3.8. Assume that L(x; θ) has a monotone decreasing likelihood ratio in the statistic
y = u(x). The following test is uniformly most powerful of level α for the hypotheses (6.2):

Reject H0 if u(X) ≥ c, (6.3)

where c is determined by α = Pθ0 [u(X) ≥ c].

In case L(x; θ) has a monotone increasing likelihood ratio in the statistic y = u(x) we can
construct a uniformly most powerful test in a similar way.

Proof. We first consider the simple null hypothesis: H00 : θ = θ0 . Let θ1 > θ0 be arbitrary but
fixed. Let C denote the most powerful critical region for θ0 versus θ1 . By the Neyman-Pearson
Theorem, C is defined by,

L(X; θ0 )
≤ k, if and only if X ∈ C,
L(X; θ1 )

where k is determined by α = Pθ0 [X ∈ C]. However, since θ1 > θ0 ,

L(X; θ0 )
= g(u(X)) ≤ k ⇔ u(X) > g −1 (k),
L(X; θ1 )
L(x;θ0 )
where g(u(x)) = L(x;θ 1)
. Since α = Pθ0 [u(X) ≥ g −1 (k), we have c = g −1 (k). Hence, the Neyman-
Pearson test is equivalent to the test defined by (6.3). Moreover, the test is uniformly most pow-
erful for θ0 versus θ1 > θ0 because the test only depends on θ1 > θ0 and g −1 (k) is uniquely
determined under θ0 .
Let γ(θ) denote the power function of the test (6.3). For any θ0 < θ00 , the test (6.3) is the most
powerful test for testing θ0 versu θ00 with the level γ(θ0 ), we have γ(θ00 ) > γ(θ0 ). Hence γ(θ) is a
nondecreasing function. This implies maxθ<θ0 γ(θ) = α.
CHAPTER 6. HYPOTHESIS TESTING 122

Example 6.3.9. Let X1 , . . . , Xn be a random sample from a Bernoulli distribution with parameter
p = θ, where 0 < θ < 1. Let θ0 < θ1 . Consider the ratio of likelihood,
P
L(x1 , . . . , xn ; θ0 ) θ0 (1 − θ1 ) xi 1 − θ0 n
= .
L(x1 , . . . , xn ; θ1 ) θ1 (1 − θ0 ) 1 − θ1

Since θθ10 (1−θ

(1−θ1 ) P
0)
< 1, the ratio is an decreasing function of y = xi . Thus we have a monotone
P
likelihood ratio in the statistic Y = Xi .
Consider the hypotheses

H0 : θ < θ 0 versus H1 : θ > θ0 .

By Theorem 6.3.8, the uniformly most powerful level α decision rule is given by
n
X
Reject H0 if Y = Xi ≥ c,
i=1

where c is such that α = Pθ0 [Y ≥ c].

6.4 Some well-known tests for a single sample

6.4.1 Hypothesis test on the mean of a normal distribution, variance σ 2 known

In this section, we will assume that a random sample X1 , X2 , . . . , Xn has been taken from a
normal N (µ, σ 2 ) population. It is known that X̄ is an unbiased point estimator of µ.

Hypothesis tests on the mean

Null hypothesis: H0 : µ = µ0
Test statistic: Z0 = X̄−µ
√0
σ/ n

Alternative hypothesis Rejection criteria P value

H1 : µ 6= µ0 |Z0 | > zα/2 2[1 − Φ(|Z0 |)]
H1 : µ > µ0 Z0 > zα 1 − Φ(Z0 )
H1 : µ < µ0 Z0 < −zα Φ(Z0 )

Example 6.4.1. The following data give the score of 10 students in a certain exam.

75 64 75 65 72 80 71 68 78 62.

Assume that the score is normally distributed with mean µ and known variance σ 2 = 36, test the
following hypotheses at the 0.05 level of significance and find the P -value of each test.

(a) H0 : µ = 70 against H1 : µ 6= 70.

(b) H0 : µ = 68 against H1 : µ > 68.

CHAPTER 6. HYPOTHESIS TESTING 123

(c) H0 : µ = 75 against H1 : µ < 75.

Solution: (a) We may solve the problem by following the six-step procedure as follows.

1. The parameter of interest is µ, the score.

2. We are going to test: H0 : µ = 70, H1 : µ 6= 70.

3. Sample size n = 10,

1
and sample mean X = (75 + 64 + 75 + 65 + 72 + 80 + 71 + 68 + 78 + 62) = 71.
10
4. Significance level α = 0.05 so zα/2 = 1.96.

5. The test statistic is

X − µ0 71 − 70
Z0 = √ = √ = 0.5270.
σ/ n 6/ 10

6. Since |Z0 | < zα/2 we do not reject H0 : µ = 70 in favour of H1 : µ 6= 70 at the 0.05 level of
significance. More precisely, we conclude that the mean score is 70 based on a sample of 10
measurements.

The P -value of this test is 2(1 − Φ(|Z0 |)) = 2(1 − Φ(0.5270)) = 0.598.
(b)

1. The parameter of interest is µ, the score.

2. We are going to test: H0 : µ = 68, H1 : µ > 68.

3. Sample size n = 10,

1
and sample mean X = (75 + 64 + 75 + 65 + 72 + 80 + 71 + 68 + 78 + 62) = 71.
10
4. Significance level α = 0.05 so zα = 1.645.

5. The test statistic is

X − µ0 71 − 68
Z0 = √ = √ = 1.581.
σ/ n 6/ 10

6. Since Z0 < zα we do not reject H0 : µ = 68 in favour of H1 : µ > 68 at the 0.05 level of

significance. More precisely, we conclude that the mean score is 68 based on a sample of 10
measurements.

The P -value of this test is 1 − Φ(Z0 ) = 1 − Φ(1.581) = 0.057.

(c)

1. The parameter of interest is µ, the score.

2. We are going to test: H0 : µ = 75, H1 : µ < 75.

3. Sample size n = 10,

1
and sample mean X = (75 + 64 + 75 + 65 + 72 + 80 + 71 + 68 + 78 + 62) = 71.
10
CHAPTER 6. HYPOTHESIS TESTING 124

4. Significance level α = 0.05 so zα = 1.645.

5. The test statistic is

X − µ0 71 − 75
Z0 = √ = √ = −2.108.
σ/ n 6/ 10

6. Since Z0 < −zα we reject H0 : µ = 75 in favour of H1 : µ < 75 at the 0.05 level of significance.
More precisely, we conclude that the mean score is less than 75 based on a sample of 10
measurements.

The P -value of this test is Φ(Z0 ) = Φ(−2.108) = 0.018.

Connection between hypothesis tests and confidence intervals

There is a close relationship between the test of a hypothesis about any parameter, say θ, and
the confidence interval for θ. If [l, u] is a 100(1 − α)% confidence interval for the parameter θ, the
test of size α of the hypothesis
H0 : θ = θ0 , H1 : θ 6= θ0

will lead to rejection of H0 if and only if θ0 is not in the 100(1 − α)% confidence interval [l, u].
Although hypothesis tests and CIs are equivalent procedures insofar as decision making or
inference about µ is concerned, each provides somewhat different insights. For instance, the
confidence interval provides a range of likely values for µ at a stated confidence level, whereas
hypothesis testing is an easy framework for displaying the risk levels such as the P -value associ-
ated with a specific decision.

Type II error and choice of sample size

In testing hypotheses, the analyst directly selects the type I error probability. However, the
probability of type II error β depends on the choice of sample size. In this section, we will show
how to calculate the probability of type II error β. We will also show how to select the sample size
to obtain a specified value of β.
In the following we will derive the formula for β of the two-sided test. The ones for one-sided
tests can be derived in a similar way and we leave it as an exercise for the reader.
Finding the probability of type II error β: Consider the two-sided hypothesis

H0 : µ = µ 0 , H1 : µ 6= µ0 .

Suppose the null hypothesis is false and that the true value of the mean is µ = µ0 + δ for some δ.
The test statistic Z0 is
√ δ √n
X − µ0 X − (µ0 + δ) δ n
Z0 = √ = √ ∼N , 1).
σ/ n σ/ n σ σ

Therefore, the probability of type II error is β = Pµ0 +δ (|Z0 | ≤ zα/2 ), i.e.,

CHAPTER 6. HYPOTHESIS TESTING 125

Type II error for two-sided test

√ √
δ n δ n
β = Φ zα/2 − − Φ − zα/2 − . (6.4)
σ σ

Sample size formula There are no closed form for n from equation (6.4). However, we can esti-
mate n as follows. √
Case 1 If δ > 0, then Φ(−zα/2 − δ σ n ) ≈ 0 then
√
δ n (zα/2 + zβ )2 σ 2
β ≈ Φ zα/2 − ⇔n≈ .
σ δ2
√
δ n
Case 2 If δ < 0, then Φ(zα/2 − σ ) ≈ 1 then
√
δ n (zα/2 + zβ )2 σ 2
β ≈ 1 − Φ − zα/2 − ⇔n≈ .
σ δ2
Therefore, the sample size formula is defined by

Sample size formula for two-sided test

(zα/2 + zβ )2 σ 2
n≈
δ2

Large-sample test

We have developed the test procedure for the null hypothesis H0 : µ = µ0 assuming that the
population is normally distributed and that σ 2 is known. In many if not most practical situations
σ 2 will be unknown. Furthermore, we may not be certain that the population is well modeled
by a normal distribution. In these situations if n is large (say n > 40) the sample variance s2
can be substituted for σ 2 in the test procedures with little effect. Thus, while we have given a
test for the mean of a normal distribution with known σ 2 , it can be easily converted into a large-
sample test procedure for unknown σ 2 that is valid regardless of the form of the distribution of
the population. This large-sample test relies on the central limit theorem just as the large-sample
confidence interval on σ 2 that was presented in the previous chapter did. Exact treatment of
the case where the population is normal, σ 2 is unknown, and n is small involves use of the t
distribution and will be deferred in the next section.

6.4.2 Hypothesis test on the mean of a normal distribution, variance σ 2 unknown

Hypothesis test on the mean

We assume again that a random sample X1 , X2 , . . . , Xn has been taken from a normal N (µ, σ 2 )
population. Recall that X and s(X)2 are sample mean and sample variance of the random sam-
ple X1 , X2 , . . . , Xn , respectively. It is known that
X −µ
tn−1 = √
s(X)/ n
CHAPTER 6. HYPOTHESIS TESTING 126

has a t distribution with n − 1 degree of freedom. This fact leads to the following test on the mean
µ.

Null hypothesis: H0 : µ = µ0
X̄−µ√
Test statistic: T0 = s(X)/ 0
n

Alternative hypothesis Rejection criteria P -value

H1 : µ 6= µ0 |T0 | > tα/2,n−1 2P(tn−1 > |T0 |)
H1 : µ > µ0 T0 > tα,n−1 P(tn−1 > T0 )
H1 : µ < µ0 T0 < −tα,n−1 P(tn−1 < −T0 )

Where ta,n−1 satisfies P[tn−1 > ta,n−1 ] = a.

Because the t-table in the Appendix contains a few critical values for each t distribution, com-
putation of the exact P -value directly from the table is usually impossible. However, it is easy to
find upper and lower bounds on the P -value from this table.
Example 6.4.2. The following data give the IQ score of 10 students.
112 116 115 120 118 125 118 113 117 121.
Suppose that the IQ score is normally distributed N(µ, σ 2 ), test the following hypotheses at the
0.05 level of significance and estimate the P -value of each test.
(a) H0 : µ = 115 against H1 : µ 6= 115.

(b) H0 : µ = 115 against H1 : µ > 115.

(c) H0 : µ = 120 against H1 : µ < 120.

Solution (a)
1. The parameter of interest is the mean IQ score µ.

2. We are going to test H0 : µ = 115 against H1 : µ 6= 115.

3. Sample size n = 10,

sample mean X = 117.5,
sample variance s2 (X) = 14.944.

4. Significance level α = 0.05 so tα/2,9 = 2.262.

5. The test statistic is

X̄ − µ0 117.5 − 115
T0 = √ =√ √ = 2.04.
s(X)/ n 14.944/ 10
6. Since |T0 | < tα/2,9 we do not reject H0 : µ = 115 in favour of H1 : µ 6= 115 at the 0.05 level
of significance. More precisely, we conclude that the average IQ score is 115 based on a
sample of 10 measurements.
Based on the table of Student distribution, we know that the P -value of this test is 2P(t9 > 2.04) ∈
(0.05; 0.1). The actual value of the P -value is 0.072.
CHAPTER 6. HYPOTHESIS TESTING 127

Type II error and choice of sample size

When the true value of the mean is µ = µ0 + δ, the distribution for T0 is called the non-central
√
t distribution with n − 1 degrees of freedom and non-centrality parameter δ n/σ. Therefore, the
type II error of the two-sided alternative would be

β = P(|T00 | ≤ tα/2,n−1 )

where T00 denotes the non-central t random variable.

6.4.3 Hypothesis test on the variance of a normal distribution

The hypothesis testing procedures

We assume that a random sample X1 , X2 , . . . , Xn has been taken from a normal N (µ, σ 2 ) pop-
ulation. Since (n − 1)s2 (X)/σ 2 follows the chi-square distribution with n − 1 degrees of freedom,
we obtain the following test for value of σ 2 .

Null hypothesis: H0 : σ = σ0
2 (X)
Test statistic: χ20 = (n−1)s
σ20

Alternative hypothesis Rejection criteria P -value

H1 : σ 6= σ0 χ20 > cα/2,n−1 or T0 < c1−α/2,n−1 1 − |2P(χ2n−1 > χ20 ) − 1|
H1 : σ > σ0 χ20 > cα,n−1 P(χ2n−1 > χ20 )
H1 : σ < σ 0 χ20 < c1−α,n−1 P(χ2n−1 < χ20 )

Where ca,n−1 satisfy P[χ2n−1 > ca,n−1 ] = a.

Example 6.4.3. An automatic filling machine is used to fill bottles with liquid detergent. A ran-
dom sample of 20 bottles results in a sample variance of fill volume of s2 = 0.0153 (fluid ounces)2 .
If the variance of fill volume exceeds 0.01 (fluid ounces)2 , an unacceptable proportion of bottles
will be underfilled or overfilled. Is there evidence in the sample data to suggest that the manu-
facturer has a problem with underfilled or overfilled bottles? Use α = 0.05, and assume that fill
volume has a normal distribution.
Solution

1. The parameter of interest is the population variance σ 2 .

2. We are going to test H0 : σ 2 = 0.01 against H1 : σ 2 > 0.01.

3. Sample size n = 20,

sample variance s2 (X) = 0.0153.

4. Significance level α = 0.05 so cα,19 = 30.14.

CHAPTER 6. HYPOTHESIS TESTING 128

5. The test statistic is

(n − 1)s2 (X) 19 × 0.0153
χ20 = = = 29.07.
σ02 0.01

6. Since χ20 < cα,19 , we conclude that there is no strong evidence that the variance of fill vol-
ume exceeds 0.01 (fluid ounces)2 .

Since P(χ21 9 > 27.2) = 0.10 and P(χ21 9 > 30.4) = 0.05, we conclude that the P -value of the test is
in the interval (0.05, 0.10). Note that the actual P -value is 0.0649.

6.4.4 Test on a population proportion

Large-Sample tests on a proportion

Let (X1 , . . . , Xn ) be a random sample observing from a random variable X with B(1, p) dis-
tribution. Then p̂ = X is a point estimator of p. By the Central limit theorem, when n is large, p̂
is approximately normal with mean p and variance p(1 − p)/n. We thus obtain the following test
for value of p.

Null hypothesis: H0√: p = p0

Test statistic: Z0 = √n(X̄−p0 )
p0 (1−p0 )

Alternative hypothesis Rejection criteria P -value

H1 : p 6= p0 |Z0 | > zα/2 2(1 − Φ(|Z0 |)
H1 : p > p0 Z0 > zα 1 − Φ(Z0 )
H1 : p < p0 Z0 < −zα Φ(Z0 )

Example 6.4.4. A semiconductor manufacturer produces controllers used in automobile engine

applications. The customer requires that the process fallout or fraction defective at a critical
manufacturing step not exceed 0.05 and that the manufacturer demonstrate process capability
at this level of quality using α = 0.05. The semiconductor manufacturer takes a random sample
of 200 devices and finds that four of them are defective. Can the manufacturer demonstrate
process capability for the customer?
Solution

1. The parameter of interest is the process fraction defective p.

2. H0 : p = 0.05 against H1 : p < 0.05.

4
3. The sample size n = 200, and sample proportion X = 200 = 0.02.

4. Significance level α = 0.05 so zα = 1.645.

5. The test statistic is

√ √
n(X − p0 ) 200(0.02 − 0.05)
Z0 = p = √ = −1.947.
p0 (1 − p0 ) 0.05 × 0.95
CHAPTER 6. HYPOTHESIS TESTING 129

6. Since Z0 < −zα , we reject H0 and conclude that the process fraction defective p is less than
0.05. The P -value for this value of the test statistic is Φ(−1.947)) = 0.0256, which is less than
α = 0.05. We conclude that the process is capable.

Type II error and choice of sample size

Suppose that p is the true value of the population proportion. The approximate β-error is
defined as follows

• the two-sided alternative H1 : p 6= p0

p p
p0 − p + z
α/2 p0 (1 − p0 )/n
p0 − p − z
α/2 p0 (1 − p0 )/n

β≈Φ p −Φ p
p(1 − p)/n p(1 − p)/n

• the one-sided alternative H1 : p < p0

p
p0 − p − z
α/2 p0 (1 − p0 )/n

β ≈1−Φ p
p(1 − p)/n

• the one-sided alternative H1 : p > p0

p
p0 − p + z
α/2 p0 (1 − p0 )/n

β≈Φ p
p(1 − p)/n

These equations can be solved to find the approximate sample size n that gives a test of level α
that has a specified β risk. The sample size is defined as follows.

• the two-sided alternative H1 : p 6= p0

p p
α/2 p0 (1 − p0 ) + zβ p(1 − p) 2
hz i
n=
p − p0

• the one-sided alternative

h z pp (1 − p ) + z pp(1 − p) i2
α 0 0 β
n=
p − p0

6.5 Some well-known tests for two samples

6.5.1 Inference for a difference in means of two normal distributions, variances known

In this section we consider statistical inferences on the difference in means µ1 − µ2 of two

normal distributions, where the variances σ12 and σ2 are known. The assumptions for this section
CHAPTER 6. HYPOTHESIS TESTING 130

are summarized as follows.

X11 , X12 , . . . , X1n1 is a random sample from population 1.

X21 , X22 , . . . , X2n2 is a random sample from population 2. (6.5)
The two populations represented by X1 and X2 are independent.
Both populations are normal.

The inference for µ1 − µ2 is based on the following result.

Theorem 6.5.1. Under the assumptions stated above, the quantity

X 1 − X 2 − (µ1 − µ2 )
Z= q 2 ∼ N(0, 1).
σ1 σ22
n1 + n2

Null hypothesis: H0 : µ1 − µ2 = ∆0
X 1 − X 2 − ∆0
Test statistic: Z0 = q 2
σ1 σ22
n1 + n2

Alternative hypothesis Rejection criteria

H1 : µ1 − µ2 =
6 ∆0 |Z0 | > zα/2
H1 : µ1 − µ2 > ∆0 Z0 > zα
H1 : µ1 − µ2 < ∆0 Z0 < zα

When the population variances are unknown, the sample variances s21 and s22 can be substi-
tuted into the test statistic Z0 to produce a large-sample test for the difference in means. This
procedure will also work well when the populations are not necessarily normally distributed.
However, both n1 and n2 should exceed 40 for this large-sample test to be valid.

Example 6.5.2. A product developer is interested in reducing the drying time of a primer paint.
Two formulations of the paint are tested; formulation 1 is the standard chemistry, and formu-
lation 2 has a new drying ingredient that should reduce the drying time. From experience, it
is known that the standard deviation of drying time is 8 minutes, and this inherent variability
should be unaffected by the addition of the new ingredient. Ten specimens are painted with for-
mulation 1, and another 10 specimens are painted with formulation 2; the 20 specimens are
painted in random order. The two sample average drying times are X 1 = 121 minutes and
X 2 = 112 minutes, respectively. What conclusions can the product developer draw about the
effectiveness of the new ingredient, using α = 0.05?
Solution:

1. The quantity of interest is the difference in mean drying time, µ1 − µ2 , and ∆0 = 0.

2. We are going to test: H0 : µ1 − µ2 = 0 vs H1 : µ1 > µ2 .

CHAPTER 6. HYPOTHESIS TESTING 131

3. The sample means n1 = n2 = 10.

4. The significance level α = 0.05 so zα = 1.645.

5. The test statistic is

121 − 112
Z0 = q = 2.52.
82 82
10 + 10

6. Since Φ−1 (α) = Φ−1 (0.05) = 1.645 < Z0 , we reject H0 at the α = 0.05 level and conclude
that adding the new ingredient to the paint significantly reduces the drying time.

Alternatively, we can find the P -value for this test as

P -value = 1 − Φ(2.52) = 0.0059.

Therefore H0 : µ1 = µ2 would be rejected at any significance level α ≥ 0.0059.

Type 2 error and choice of sample size

6.5.2 Inference for the difference in means of two normal distributions, variances
unknown

Case 1: σ12 = σ22 = σ 2

Suppose we have two independent normal populations with unknown means µ1 and µ2 , and
unknown but equal variances σ 2 . Assume that assumptions (6.5) hold.
The pooled estimator of σ 2 , denoted by Sp2 is defined by

(n1 − 1)s21 + (n2 − 1)s22

Sp2 = .
n1 + n2 − 2
The inference for µ1 − µ2 is based on the following result.

Theorem 6.5.3. Under all the assumption mentioned above, the quantity

X 1 − X 2 − (µ1 − µ2 )
T = q
Sp n11 + n12

has a student’s t distribution with n1 + n2 − 2 degrees of freedom.

Null hypothesis: H0 : µ1 − µ2 = ∆0
X 1 − X 2 − ∆0
Test statistic: T0 = q
Sp n11 + n12
CHAPTER 6. HYPOTHESIS TESTING 132

Alternative hypothesis Rejection criteria

H1 : µ1 − µ2 6= ∆0 |T0 | > tα/2,n1 +n2 −2
H1 : µ1 − µ2 > ∆0 T0 > tα,n1 +n2 −2
H1 : µ1 − µ2 < ∆0 T0 < −tα,n1 +n2 −2

Example 6.5.4. The IQ’s of 9 children in a district of a large city have empirical mean 107 and
standard deviation 10. The IQs of 12 children in another district have empirical mean 112 and
standard deviation 9. Test the equality of means at the 0.05 significance of level.

Example 6.5.5. Two catalysts are being analyzed to determine how they affect the mean yield of
a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable. Since
catalyst 2 is cheaper, it should be adopted, providing it does not change the process yield. A test
is run in the pilot plant and results in the data shown in the following table.

Observation Num. Catalyst 1 Catalyst 2

1 91.50 89.19
2 94.18 90.95
3 92.18 90.46
4 95.39 93.21
5 91.79 97.19
6 89.07 97.04
7 94.72 91.07
8 89.21 92.75

Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.

Cases 2: σ12 6= σ22

In some situations, we cannot reasonably assume that the unknown variances σ12 andσ22 are
equal. There is not an exact t-statistic available for testing H0 : µ1 −µ2 = ∆0 in this case. However,
if H0 is true, the statistic
X 1 − X 2 − ∆0
T0∗ = q 2
s1 s22
n1 + n2

is distributed approximately as t with degrees of freedom given by

2
s21 s22

n1 + n2
ν= (s21 /n1 )2 (s22 /n2 )2
.
n1 −1 + n2 −1

Therefore, if σ12 6= σ22 , the hypotheses on differences in the means of two normal distribution are
tested as in the equal variances case, except that T0∗ is used as the test statistic and n1 + n2 − 2 is
replaced by ν in determining the degrees of freedom for the test.
CHAPTER 6. HYPOTHESIS TESTING 133

6.5.3 Paired t-test

A special case of the two-sample t-tests of previous section occurs when the observations on
the two populations of interest are collected in pairs. Each pair of observations, say (Xj , Xj ), is
taken under homogeneous conditions, but these conditions may change from one pair to an-
other. For example, suppose that we are interested in comparing two different types of tips for
a hardness-testing machine. This machine presses the tip into a metal specimen with a known
force. By measuring the depth of the depression caused by the tip, the hardness of the specimen
can be determined. If several specimens were selected at random, half tested with tip 1, half
tested with tip 2, and the pooled or independent t-test in the previous was applied, the results of
the test could be erroneous. The metal specimens could have been cut from bar stock that was
produced in different heats, or they might not be homogeneous in some other way that might
affect hardness. Then the observed difference between mean hardness readings for the two tip
types also includes hardness differences between specimens.
A more powerful experimental procedure is to collect the data in pairs - that is, to make two
hardness readings on each specimen, one with each tip. The test procedure would then consist of
analyzing the differences between hardness readings on each specimen. If there is no difference
between tips, the mean of the differences should be zero. This test procedure is called the paired
t-test.
Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be a set of n paired observations where we assume that the
mean and variance of the population represented by X are µX and σX 2 , and the mean and vari-

ance of the population represented by Y are µY and σY2 . Define the differences between each pair
of observations as Dj = Xj − Yj , j = 1, 2, . . . , n. The Dj s are assumed to be normally distributed
with mean µD = µX − µY and variance σD 2 , so testing hypotheses about the difference between

µX and µY can be accomplished by performing a one-sample t-test on µD .

Null hypothesis: H0 : µD = ∆0
D − ∆0
Test statistic: T0 = √
SD / n

Alternative hypothesis Rejection criteria

H1 : µD 6= ∆0 |T0 | > tα/2,n−1
H1 : µD > ∆ 0 T0 > tα,n−1
H1 : µD < ∆ 0 T0 < −tα,n−1

Example 6.5.6. An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares several
methods for predicting the shear strength for steel plate girders. Data for two of these methods,
the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are shown in the
following table.

Karlsruhe Method 1.186 1.151 1.322 1.339 1.200 1.402 1.365 1.537 1.559
Lehigh Method 1.061 0.992 1.063 1.062 1.065 1.178 1.037 1.086 1.052
Difference Dj 0.119 0.159 0.259 0.277 0.138 0.224 0.328 0.451 0.507
CHAPTER 6. HYPOTHESIS TESTING 134

Test whether there is any difference (on the average) between the two methods?
Solution:

2
D = 0.2736, SD = 0.018349, T0 = 6.05939, t0.025,8 = 2, 306.

We conclude that there is difference between the two method at the 0.05 level of significance.

6.5.4 Inference on the variance of two normal populations

A hypothesis-testing procedure for the equality of two variances is based on the following
result.

Theorem 6.5.7. Let X11 , X12 , . . . , X1n1 be a random sample from a normal population with mean
µ1 and variance σ12 and let X21 , X22 , . . . , X2n2 be a random sample from a second normal popula-
tion with mean µ2 and variance σ22 . Assume that both normal populations are independent. Let
s21 and s22 be the sample variances. Then the ratio

s21 /σ12
F =
s22 /σ22

has an F distribution with n1 − 1 numerator degrees of freedom and n2 − 1 denominator degrees

of freedom.

This result is based on the fact that (n1 − 1)s21 /σ12 is a chi-square random variable with n1 − 1
degrees of freedom, that (n2 − 1)s21 /σ22 is a chi-square random variable with n2 − 1 degrees of free-
dom, and that the two normal populations are independent. Clearly under the null hypothesis
H0 : σ12 = σ22 , the ratio F0 = s21 /s22 has an Fn1 −1,n2 −1 distribution. Let fα,n1 −1,n2 −1 be a constant
satisfying
P[F0 > fα,n1 −1,n2 −1 ] = α.

It follows from the property of F distribution that

1
f1−α,n1 −1,n2 −1 = .
fα,n1 −1,n2 −1

Null hypothesis: H0 : σ12 = σ22

s2
Test statistic: F0 = 12
s2

Alternative hypothesis Rejection criteria

H1 : σ12 = σ22 F0 > fα/2,n1 −1,n2 −1 or F0 < f1−α/2,n1 −1,n2 −1
H1 : σ12 > σ22 F0 > fα,n1 −1,n2 −1
H1 : σ12 < σ22 F0 < f1−α,n1 −1,n2 −1
CHAPTER 6. HYPOTHESIS TESTING 135

Example 6.5.8. Oxide layers on semiconductor wafers are etched in a mixture of gases to achieve
the proper thickness. The variability in the thickness of these oxide layers is a critical character-
istic of the wafer, and low variability is desirable for subsequent processing steps. Two different
mixtures of gases are being studied to determine whether one is superior in reducing the variabil-
ity of the oxide thickness. Twenty wafers are etched in each gas. The sample standard deviations
of oxide thickness are s1 = 1.96 angstroms and s2 = 2.13 angstroms, respectively. Is there any
evidence to indicate that either gas is preferable? Use α = 0.05.
Solution: At the α = 0.05 level of significance we need to test

H0 : σ12 = σ22 vs H1 : σ12 6= σ22

s2 1
Since n1 = n2 = 20, we will reject H0 if F0 = s12 > f0.025,19,19 = 2.53 or F0 < f0.975,19,19 = 2.53 =
2
0.40.
1.962
Computation: F0 = 2.13 2 = 0.85. Hence we cannot reject the null hypothesis H0 at the 0.05

level of significance. Therefore, there is no strong evidence to indicate that either gas results in a
smaller variance of oxide thickness.

6.5.5 Inference on two population proportions

We now consider the case where there are two binomial parameters of interest, say, p1 and p2 ,
and we wish to draw inferences about these proportions. We will present large-sample hypothe-
sis testing based on the normal approximation to the binomial.
Suppose that two independent random samples of sizes n1 and n2 are taken from two pop-
ulations, and let X1 and X2 represent the number of observations that belong to the class of
interest in samples 1 and 2, respectively. Furthermore, suppose that the normal approximation
to the binomial is applied to each population, so the estimators of the population proportions
P̂1 = X1 /n1 and P̂2 = X2 /n2 have approximate normal distribution. Moreover, under the null
hypothesis H0 : p1 = p2 = p, the random variable

P̂1 − P̂2
Z=r
n1 + n2
p(1 − p)
n1 n2

is distributed approximately N (0, 1).

This leads to the test procedures described below.

Null hypothesis: H0 : p1 = p2
P̂1 − P̂2 X1 + X2
Test statistic: Z0 = r with p̂ = .
n1 + n2 n1 + n2
p̂(1 − p̂)
n1 n2
CHAPTER 6. HYPOTHESIS TESTING 136

Alternative hypothesis Rejection criteria

H1 : p1 6= p2 |Z0 | > zα/2
H1 : p1 > p2 Z0 > zα
H1 : p1 < p2 Z0 < −zα

Example 6.5.9. Two different types of polishing solution are being evaluated for possible use in a
tumble-polish operation for manufacturing interocular lenses used in the human eye following
cataract surgery. Three hundred lenses were tumble- polished using the first polishing solution,
and of this number 253 had no polishing-induced defects. Another 300 lenses were tumble-
polished using the second polishing solution, and 196 lenses were satisfactory upon completion.
Is there any reason to believe that the two polishing solutions differ? Use α = 0.01.

6.6 The chi-square test

6.6.1 Goodness-of-fit test

Suppose that a large population consists of items of k different types, and let pi denote the
probability that an item selected at random will be of type i(i = 1, . . . , k). Suppose that the
following hypothesis is to be tested

H0 : pi = p0i for i = 1, . . . k vs H1 : pi = p0i for at least one value of i

We shall assume that a random sample of size n is to be taken from the given population. That
is, n independent observations are to be taken, and there is probability pi that each observation
will be of type i(i = 1, ..., k). On the basis of these n observations, the hypothesis is to be tested.
For each i, we denote Ni the number of observations in the random sample that are of type i.

Theorem 6.6.1 (Pearson’s theorem). The following statistic

k
X (Ni − np0 )2 i
Q=
i=1
np0i

has the property that if H0 is true and the sample size n → ∞, then Q converges in distri-
bution to the χ2 distribution with k − 1 degrees of freedom.

Chi-squared goodness-of-fit test for simple hypothesis

Suppose that we observer an i.i.d. sample X1 , . . . , Xn of random variables that take a finite
number of values B1 , . . . , Bk with unknown probability p1 , . . . , pk . Consider the hypothesis

H0 : pi = p0i for i = 1, . . . k vs H1 : pi = p0i for at least one value of i

CHAPTER 6. HYPOTHESIS TESTING 137

If the null hypothesis H0 is true then by Pearson’s theorem,

k
X (Ni − np0 )2 d
Q= i
−→ χ2k−1
i=1
np0i

where Ni is number of Xj equal to Bj . On the other hand, if H1 holds, then for some index i∗ ,
pi∗ 6= p0i . We write
νi∗ − np0i∗ pi∗ νi∗ − npi∗ √ pi∗ − p0i∗
r
= √ + n q .
p0i∗
q
np0i∗ npi∗ p0i∗
the first term converges to N(0, (1 − pi∗ )pi∗ /p0i∗ ) by the central limit theorem while the second
term diverges to plus or minus infinity. It means that if H1 holds then Q → ∞. Therefore, we will
reject H0 if Q ≥ cα,k−1 where cα,k−1 is chosen such that the error of type 1 is equal to the level of
significance α:
α = P0 (Q > cα,k−1 ) ≈ P(χ2k−1 > cα,k−1 ).
This test is called chi-squared goodness-of-fit test

Null hypothesis: H0 : pi = p0i for i = 1, . . . k

Test statistic
k
X (Ni − np0 )2 i
Q=
i=1
np0i

Alternative hypothesis Rejection criteria P -value

H1 : pi = p0i for at least one value of i Q ≥ cα,k−1 P(χ2k−1 > Q)

Example 6.6.2. A study of blood types among a sample of 6004 people gives the following result

Blood type A B AB O
Number of people 2162 738 228 2876.

Table 6.2: Blood types

A previous study claimed that the proportions of people whose blood of types A, B, AB and O
are 33.33%, 12.5%, 4.17% and 50%, respectively.
We can use the data in Table 6.2 to test the null hypothesis H0 that the probabilities (p1 , p2 , p3 , p4 )
of the four blood type equal ( 13 , 81 , 24
1 1
, 2 ). The χ2 test statistic is then
(2162 − 6004 × 13 )2 (738 − 6004 × 18 )2 (228 − 6004 × 1 2
24 ) (2876 − 6004 × 12 )2
Q= + + + = 20.37
6004 × 31 6004 × 18 1
6004 × 24 6004 × 12
To test H0 at the level α0 , we would compare Q to the 1 − α0 quantile of the χ2 distribution
with three degrees of freedom. Alternatively, we can compute the P -value, which would be the
smallest α0 at which we could reject H0 . In general, the P -value is 1 − F (Q) where F is the
cumulative distribution function of the χ2 distribution with k − 1 degrees of freedom. In this
example k = 4 and Q = 20.37 then the p-value is 1.42 × 10−4 .
CHAPTER 6. HYPOTHESIS TESTING 138

Goodness-of-fit for continuous distribution

Let X1 , . . . , Xn be an i.i.d. sample from unknown distribution P and consider the following
hypotheses
H0 : P = P0 vs H1 : P 6= P0

for some particular, possibly continuous distribution P0 . To do this, we will split a set of all pos-
sible outcomes of Xi , say X, into a finite number of intervals I1 , . . . , Ik . The null hypothesis H0
implies that for all intervals
P(X ∈ Ij ) = P0 (X ∈ Ij ) = p0j .

Therefore, we can do a chi-squared test for

H00 : P(X ∈ Ij ) = p0j for all j ≤ k vs H10 : otherwise.

It is clear that H0 implies H00 . However, the fact that H00 holds does not guarantee that H0 hold.
There are many distribution different from P that have the same distribution on the intervals
I1 , . . . , Ik as P . On one hand, if we group into more and more intervals, our discrete approxi-
mation of P will get closer and closer to P , so in some sense H00 will get closer to H0 . However,
we can not split into too many intervals either, because the χ2k1 -distribution approximation for
statistic T in Pearsons theorem is asymptotic. The rule of thumb is to group the data in such a
way that the expected count in each interval np0i is at least 5.

Example 6.6.3. Suppose that we wish to test the null hypothesis that the logarithms of the life-
time of ball bearings are an i.i.d. sample from the normal distribution with mean ln(50) = 3.912
and variance 0.25. The observed logarithms are

2.88 3.36 3.95 3.99 4.53 4.59 3.50 3.73

4.02 4.22 4.66 4.66 3.74 3.82 4.23 4.23
4.85 4.85 5.16 3.88 3.95 4.23 4.43
In order to have the expected count in each interval be at least 5, we can use at most k = 4 in-
tervals. We shall make these intervals each have probability 0.25 under the null hypothesis. That
is, we shall divide the intervals at the 0.25, 0.5, and 0.75 quantiles of the hypothesized normal
distribution. These quantiles are

3.912 + 0.5Φ−1 (0.25) = 3.575

3.912 + 0.5Φ−1 (0.5) = 3.912
3.912 + 0.5Φ−1 (0.75) = 4.249.

The number of observation in each of the four intervals are then 3, 4, 8 and 8. We then calculate

Q = 3.609.

Our table of the χ2 distribution with three degrees of freedom indicates that 3.609 is between the
0.6 and 0.7 quantiles, so we would not reject the null hypothesis at levels less 0.3 and reject the
null hypothesis at levels greater than 0.4. (Actually, the P -value is 0.307.)
CHAPTER 6. HYPOTHESIS TESTING 139

Goodness-of-fit for composite hypotheses

We can extend the goodness-of-fit test to deal with the case in which the null hypothesis
is that the distribution of our data belongs to a particular parametric family. The alternative hy-
pothesis is that the data have a distribution that is not a member of that parametric family. There
are two changes to the test procedure in going from the case of a simple null hypothesis to the
case of a composite null hypothesis. First, in the test statistic Q, the probabilities p0i are replaced
by estimated probabilities based on the parametric family. Second, the degrees of freedom are
reduced by the number of parameters.
Let us start with a discrete case when a random variable takes a finite number of values
B1 , . . . , Bk and
pi = P(X = Bi ), i = 1, . . . , k.

We would like to test a hypothesis that this distribution comes from a family of distributions
{Pθ : θ ∈ Θ}. In other words, if we denote

pj (θ) = Pθ (X = Bj ),

we want to test

H0 : pj = pj (θ) for all j ≤ r for some θ ∈ Θ vs H1 : otherwise.

The situation now is complicated since we want to test if pj = pj (θ), j ≤ r at least for some θ ∈ Θ
which means that we may have many candidates for θ. One way to approach this problem is as
follows.

Step 1: Assuming that hypothesis H0 holds, we can find an estimator θ∗ of this unknown θ.

Step 2: Try to test if, indeed, the distribution P is equal to Pθ∗ by using the statistics
k
X (Ni − npi (θ∗ ))2
Q∗ =
npi (θ∗ )
i=1

in chi-squared goodness-of-fit test.

This approach looks natural, the only question is what estimate θ∗ to use and how the fact
that θ∗ also depends on the data will affect the convergence of Q. It turns out that if we let θ∗ be
the maximum likelihood estimate, i.e. θ that maximizes the likelihood function

φ(θ) = p1 (θ)N1 . . . pk (θ)Nk ,

then the statistics Q∗ converges in distribution to a χ2r−s−1 distribution with r − s − 1 degrees of

freedom, where s is the dimension of the parameter set Θ. Note that we must assume s ≤ r − 2
so that we have at least one degree of freedom. Very informally, by dimension we understand the
number of free parameters that describe the set

{(p1 (θ), . . . , pk (θ)) : θ ∈ Θ}.

CHAPTER 6. HYPOTHESIS TESTING 140

The we will reject H0 if T ≤ c where the threshold c is determined from the condition

P(T > c|H0 ) = α

where α ∈ [0, 1] is the level of significance.

Example 6.6.4. Suppose that a gene has two possible alleles A1 and A2 and the combinations of
these alleles define three genotypes A1 A1 , A1 A2 and A2 A2 . We want to test a theory that

Probability to pass A to a child = θ
1
Probability to pass A to a child = 1 − θ
2

and that the probabilities of genotypes are given by

p1 (θ) = P(A1 A1 ) = θ2
p2 (θ) = P(A1 A2 ) = 2θ(1 − θ)
p3 (θ) = P(A2 A2 ) = (1 − θ)2 .

Suppose that given a random sample X1 , . . . , Xn from the population the counts of each geno-
type are N1 , N2 and N3 . To test the theory we want to test the hypothesis

H0 : pi = pi (θ), i = 1, 2, 3 vs H1 : otherwise.

First of all, the dimension of the parameter set is s = 1 since the distributions are determined by
one parameter θ. To find the MLE θ∗ we have to maximize the likelihood function

p1 (θ)N1 p2 (θ)N2 p3 (θ)N3 .

After computing the critical point by setting the derivative equal to 0, we get

2N1 + N2
θ∗ = .
2n
Therefore, under the null hypothesis H0 the statistics
3
∗
X (Ni − npi (θ∗ ))2
Q =
npi (θ∗ )
i=1

converges to χ2 distribution with 1 degree of freedom. Therefore, if α = 0.05 we will reject H0 if

Q∗ > 3.841.

In the case when the distributions Pθ are continuous or, more generally, have infinite number
of values that must be grouped in order to use chi-squared test (for example, normal or Poisson
distribution), it can be a difficult numerical problem to maximize the grouped likelihood func-
tion
Pθ (I1 )N1 · · · Pθ (Ik )Nk .
CHAPTER 6. HYPOTHESIS TESTING 141

It is tempting to use a usual non-grouped MLE θ̂ of θ instead of the above θ∗ because it is often
easier to compute, in fact, for many distributions we know explicit formulas for these MLEs.
However, if we use θ̂ in the statistic
k
X (Ni − npi (θ̂))2
Q̂ =
i=1 npi (θ̂)

then it will no longer converges to χ2r−s−1 distribution. It has been shown that typically this Q̂
will converge to a distribution “in between” χ2k−s−1 and χ2k−s 1 . Thus, a conservative decision rule
is to reject H0 whether Q̂ > c where c is chosen such that P(χ2k−1 > c) = α.

Example 6.6.5 (Testing Whether a Distribution Is Normal). Consider now a problem in which
a random sample X1 , ..., Xn is taken from some continuous distribution for which the p.d.f. is
unknown, and it is desired to test the null hypothesis H0 that this distribution is a normal dis-
tribution against the alternative hypothesis H1 that the distribution is not normal. To perform a
χ2 test of goodness- of-fit in this problem, divide the real line into k subintervals and count the
number Ni of observations in the random sample that fall into the ith subinterval (i = 1, ..., k).
If H0 is true, and if µ and σ 2 denote the unknown mean and variance of the normal distribu-
tion, then the parameter vector θ is the two-dimensional vector θ(µ, σ 2 ). The probability πi (θ),
or πi (µ, σ 2 ), that an observation will fall within the ith subinterval, is the probability assigned to
that subinterval by the normal distribution with mean µ and variance σ 2 . In other words, if the
ith subinterval is the interval from ai to bi , then
b − µ a − µ
2 i i
πi (µ, σ ) = Φ −Φ .
σ σ
It is important to note that in order to calculate the value of the statistic Q∗ , the M.L.E.s µ∗ and
σ 2∗ must be found by using the numbers N1 , ..., Nk of observations in the different subintervals.
The M.L.E.s should not be found by using the observed values of X1 , ..., Xn themselves. In other
words, µ∗ and σ 2∗ will be the values of µ and σ 2 that maximize the likelihood function

L(µ, σ 2 ) = [π1 (µ, σ 2 )]N1 · · · [πk (µ, σ 2 )]Nk . (6.6)

Because of the complicated nature of the function πi (µ, σ 2 ) a lengthy numerical computation
would usually be required to determine the values of µ and σ 2 that maximize L(µ, σ 2 ). On the
other hand, we know that the M.L.E.s of µ and σ 2 based on the n observed values X1 , ..., Xn in
the original sample are simply the sample mean X n and the sample variance s2n . Furthermore,
if the estimators that maximize the likelihood function L(µ, σ 2 ) are used to calculate the statis-
tic Q∗ , then we know that when H0 is true, the distribution of Q∗ will be approximately the χ2
distribution with k − 3 degrees of freedom. On the other hand, if the M.L.E.s X n and s2n , which
are based on the observed values in the original sample, are used to calculate Q̂, then this χ2
approximation to the distribution of Q̂ will not be appropriate. Indeed, the distribution of Q̂ is
asymptotically “in between” χ2k−3 and χ2k−1 .
1
Chernoff, Herman; Lehmann, E. L. (1954) The use of maximum likelihood estimates in χ2 tests for goodness of fit.
Ann. Math. Statistics 25, pp. 579-586.
CHAPTER 6. HYPOTHESIS TESTING 142

Return to Example 6.6.3. We are now in a position to try to test the composite null hypothesis
that the logarithms of ball bearing lifetimes have some normal distribution. We shall divide the
real line into the subintervals (∞, 3.575], (3.575, 3.912], (3.912, 4.249], and (4.249, +∞). The counts
for the four intervals are 3, 4, 8, and 8. The M.L.E.s based on the original data gives µ̂ = 4.150 and
σ̂ 2 = 0.2722. The probabilities of the four intervals are (π1 , π2 , π3 , π4 ) = (0.1350, 0.1888, 0.2511, 0.4251).
This make the value of Q̂ equal to 1.211.
The tail area corresponding to 1.211 needs to be computed for χ2 distributions with k − 1 = 3
and k − 3 = 1 degrees of freedom. For one degree of freedom, the p-value is 0.2711, and for three
degrees of freedom the p-value is 0.7504. So, our actual p-value lies in the interval [0.2711, 0.7504].
Although this interval is wide, it tells not to reject H0 at level α if α < 0.2711.

6.6.2 Tests of independence

In this section we will consider a situation when our observations are classified by two differ-
ent features and we would like to test if these features are independent. For example, we can ask
if the number of children in a family and family income are independent. Our sample space X
will consist of a × b pairs.
X = {(i, j) : i = 1, . . . , a, j = 1, . . . , b}

where the first coordinate represents the first feature that belongs to one of a categories and the
second coordinate represents the second feature that belongs to one of b categories. An i.i.d.
sample X1 , ..., Xn can be represented by a contingency table below where Nij is the number all
observations in a cell (i, j).

Feature 2
Feature 1 1 2 ··· b
1 N11 N12 ··· N1b
2 N21 N22 ··· N2b
.. .. .. .. ..
. . . . .
a Na1 Na2 ··· Nab

We would like to test the independence of two features which means that

P[X = (i, j)] = P[X 1 = i]P[X 2 = j].

Denote θij = P[X = (i, j)]; pi = P[X 1 = i]; qj P[X 2 = j]. Then we want to test

H0 : θij = pi qj for all (i, j) for some (p1 , . . . , pa ) and (q1 , . . . , qb )

H1 : otherwise.

We can see that this null hypothesis H0 is a special case of the composite hypotheses from previ-
ous lecture and it can be tested using the chi-squared goodness-of-fit test. The total number of
groups is k = a × b. Since pi s and qj s should add up to one, one parameter in each sequence, for
example pa and qb , can be computed in terms of other probabilities and we can take (p1 , ..., pa−1 )
CHAPTER 6. HYPOTHESIS TESTING 143

and (q1 , ..., qb−1 ) as free parameters of the model. This means that the dimension of the parame-
ter set is
s = (a − 1) × (b − 1).
Therefore, if we find the maximum likelihood estimates for the parameters of this model then
the chi-squared statistic satisfies
X (Nij − np∗i qj∗ )2 w
Q= −→ χ2k−s−1 = χ2(a−1)(b−1)
np∗i qj∗
ij

To formulate the test it remains to find the maximum likelihood estimates of the parameters. We
need to maximize the likelihood function
Y Y
N N
Y
(pi qj )Nij = pi i+ pj +j ,
ij i j
P P
where Ni+ = j Nij and N+j = Since pi s and qj s are not related to each other, maximiz-
i Nij .
Q N Q N
ing the likelihood function above is equivalent to maximizing i pi i+ and j pj +j separately.
We have
Y X a−1
N
ln pi i+ = Ni+ ln pi + Na+ ln(1 − p1 − · · · − pa−1 ).
i i=1
An elementary computation shows that
Ni+
p∗i = , i = 1, . . . , a.
n
Similarly, the MLE for qj is
N+j
qj∗ =
, ij = 1, . . . , b.
n
Therefore, chi-square statistic Q in this case can be written as
Ni+ N+j 2

X Nij − n
Q= N N
.
i+ +j
ij n

We reject H0 if Q > cα,(a−1)(b−1) where the threshold cα,(a−1)(b−1) is determined from the condition

P[χ2(a−1)(b−1) > cα,(a−1)(b−1) ] = α.

6.6.3 Test of homogeneity

Suppose that the population is divided into R groups and each group (or the entire popula-
tion) is divided into C categories. We would like to test whether the distribution of categories in
each group is the same.
Category 1 Category 2 ··· Category C Σ
Group 1 N11 N12 ··· N1C N1+
Group 2 N21 N22 ··· N2b N2+
.. .. .. .. .. ..
. . . . . .
Group R NR1 NR2 ··· NRC NR+
Σ N+1 N+2 ··· N+C n
CHAPTER 6. HYPOTHESIS TESTING 144

If we denote
pij = P(Categoryj |Groupi )
so that for each group i we have
C
X
pij = 1,
j=1

then we want to test the following hypotheses

H0 : pij = pj for all groupsi ≤ R

H1 : otherwise.

If observations X1 , ..., Xn are sampled independently from the entire population then homo-
geneity over groups is the same as independence of groups and categories. Indeed, if have ho-
mogeneity
P(Categoryj |Groupi ) = P(Categoryj ),
then we have
P(Categoryj , Groupi ) = P(Categoryj )P(Groupi ).
This means that to test homogeneity we can use the test of independence above. Denote
2
N N
R X
X C Nij − i+n +j w
Q= N N
−→ χ2(C−1)(R−1) .
i+ +j
i=1 j=1 n

We reject H0 at the significance level α if Q > cα,(C−1)(R−1) where the threshold cα,(C−1)(R−1) is
determined from the condtion

P[χ2(C−1)(R−1) > cα,(C−1)(R−1) ] = α.

Example 6.6.6. In this example, 100 people were asked whether the service provided by the fire
department in the city was satisfactory. Shortly after the survey, a large fire occured in the city.
Suppose that the same 100 people were asked whether they thought that the service provided by
the fire department was satisfactory. The result are in the following table:

Satisfactory Unsatisfactory
Before fire 80 20
After fire 72 28
Suppose that we would like to test whether the opinions changed after the fire by using a chi-
squared test. However, the i.i.d. sample consisted of pairs of opinions of 100 people (Xi1 , Xi2 ), i =
1, . . . , 100 where the first coordinate/feature is a persons opinion before the fire and it belongs to
one of two categories
{“Satisf actory”, “U nsatisf actory”},
and the second coordinate/feature is a persons opinion after the fire and it also belongs to one
of two categories
{“Satisf actory”, “U nsatisf actory”},
CHAPTER 6. HYPOTHESIS TESTING 145

So the correct contingency table corresponding to the above data and satisfying the assumption
of the chi-squared test would be the following:

Satisfactory Unsatisfactory
Satisfactory 70 10
Unsatisfactory 2 18

In order to use the first contingency table, we would have to poll 100 people after the fire inde-
pendently of the 100 people polled before the fire.

6.7 Exercises

6.7.1 Significance level and power function

6.1. Suppose that X has a pdf of the form f (x; θ) = θxθ−1 I{0<x<1} where θ ∈ {1, 2}. To test the
simple hypotheses H0 : θ = 1 against H1 : θ = 2}, one uses a random sample X1 , X2 of size n = 2
and define the critical region to be C = {(x1 , x2 ) : x1 x2 ≥ 43 }. Find the power function of the test.

6.2. Suppose that X has a binomial distribution with the number of trials n = 10 and with p ∈
{ 14 , 12 }. The simple hypothesis H0 : p = 12 is rejected, and the alternative simple hypothesis
H1 : p = 41 is accepted, if the observed value of X1 , a random sample of size 1, is less than or
equal to 3. Find the significance level and the power of the test.

6.3. Let us say the life of a light bulb, say X, is normally distributed with mean θ and standard
deviation 5000. Past experience indicates that θ = 30000. The manufacturer claims that the light
bulb made by a new process have mean θ > 30000. It is possible that θ = 35000. Check his claim
by testing H0 : θ = 30000 against H1 : θ > 30000. We shall observe n independent values of X,
say X1 , . . . , Xn , and we shall reject H0 (thus accept H1 ) if and only if x̄ ≥ c. Determine n and c so
that the power function γ(θ) of the test has the values γ(30000) = 0, 01 and γ(35000) = 0, 98.

6.4. Suppose that X has a Poisson distribution with mean λ. Consider the simple hypothesis
H0 : λ = 21 and the alternative composite hypothesis H1 : λ < 12 . Let X1 , . . . , X12 denote a
random sample of size 12 from this distribution. One rejects H0 if and only if the observed value
of Y = X1 + . . . + X12 ≤ 2.. Find γ(λ) for λ ∈ (0, 12 ] and the significance level of the test.

6.5. Let Y1 < Y2 < Y2 < Y4 be the order statistics of a random sample of size n = 4 from a
distribution with pdf f (x; θ) = 1/θ, 0 < x < θ, zero elsewhere, where θ > 0. The hypothesis
H0 : θ = 1 is rejected and H1 : θ > 1 is accepted if the observed Y4 ≥ c.

1. Find the constant c so that the significance level is α = 0.05.

2. Determine the power function of the test.

CHAPTER 6. HYPOTHESIS TESTING 146

6.7.2 Null distribution

6.6. Let X1 , . . . , Xn be a random sample from a N (a0 , σ 2 ) distribution where 0 < σ 2 < ∞ and a0
is known. Show that the likelihood ratio test of H0 : σ 2 = σ02 versus H1 : σ 2 6= σ02 can be based
upon the statistics W = ni=1 (Xi − a0 )2 /σ02 . Determine the null distribution of W and give the
P

rejection rule for a level α test.

6.7. Let X1 , . . . , Xn be a random sample from a Poisson distribution with mean λ > 0.

1. Show that the likelihood ratio test of H0 : λ = λ0 versus H1 : λ 6= λ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .

2. For λ0 = 2 and n = 5, find the significance level of the test that rejects H0 if Y ≤ 4 or Y ≥ 17.

6.8. Let X1 , . . . , Xn be a random sample from a Bernoulli B(1, θ) distribution, where 0 < θ < 1.

1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ 6= θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .

2. For n = 100 and θ0 = 1/2, find c1 so that the test reject H0 when Y ≤ c1 or Y ≥ c2 = 100 − c1
has the approximate significance level α = 0.05.

6.9. Let X1 , . . . , Xn be a random sample from a Γ(α = 3, β = θ) distribution, where 0 < θ < ∞.

1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ 6= θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of 2Y /θ0 .

2. For θ0 = 3 and n = 5, find c1 and c2 so that the test that rejects H0 when Y ≤ c1 or Y ≥ c2
has significance level 0.05.

6.7.3 Best critical region

6.10. Let X1 , X2 be a random sample of size 2 from a random variable X having the pdf f (x; θ) =
e−x/θ 0
θ I{0<x<∞} . Consider the simple hypothesis H0 : θ = θ = 2 and the alternative hypothesis
H1 : θ = θ00 = 4. Show that the best test of H0 against H1 may be carried out by use of the statistics
X1 + X2 .

6.11. Let X1 , . . . , Xn be a random sample of size 10 from a normal distribution N (0, σ 2 ). Find a
best critical region of size α = 0.05 for testing H0 : σ 2 = 1 against H1 : σ 2 = 2. Is this a best critical
region of size 0.05 for testing H0 : σ 2 = 1 against H1 : σ 2 = 4? Against H1 : σ 2 = σ12 > 1.

6.12. If X1 , . . . , Xn is a random sample from a distribution having pdf of the form f (x; θ) =
θxθ−1 , 0 < x < 1,nzero elsewhere, show that aobest critical region for testing H0 : θ = 1 against
H1 : θ = 2 is C = (x1 , . . . , xn ) : c ≤ x1 x2 . . . xn .

6.13. Let X1 , . . . , Xn denote a random sample from a normal distribution N (θ, 100). Show that

C = (x1 , . . . , xn ) : x̄ ≥ c is a best critical region for testing H0 : θ = 75 against H1 : θ = 78. Find
n and c so that
PH0 [(X1 , . . . , Xn ) ∈ C] = PH0 [X̄ ≥ c] = 0.05
CHAPTER 6. HYPOTHESIS TESTING 147

and
PH1 [(X1 , . . . , Xn ) ∈ C] = PH1 [X̄ ≥ c] = 0.90,
approximately.

6.14. Let X1 , . . . , Xn be iid with pmf f (x; p) = px (1 − p)1−x , x = 0, 1, zero elsewhere. Show that
xi ≤ c is a best critical region for testing H0 : p = 21 against H1 : p = 13 .
P
C = (x1 , . . . , xn ) :
P
Use the Central Limit Theorem to find n and c so that approximately PH0 [ Xi ≤ c] = 0.10 and
P
PH1 [ Xi ≤ c] = 0.80.

6.15. Let X1 , . . . , X10 denote a random sample of size 10 from a Poisson distribution with mean
λ. Show that the critical region C defined by 10
P
i=1 xi ≥ 3 is a best critical region for testing
H0 : λ = 0.1 against H1 : λ = 0.5. Determine, for this test, the significance level α and the power
at θ = 0.5.

6.16. Let X have the pmf f (x; θ) = θx (1 − θ)1−x , x = 0, 1, zero elsewhere. We test the simple
hypothesis H0 : λ = 41 against the alternative composite hypothesis H1 : θ < 14 by taking a
random sample of size 10 and rejecting H0 : θ = 14 iff the observed values x1 , . . . , x1 0 of the
sample observations are such that 10 1
P
i=1 xi ≤ 1. Find the power function γ(θ), 0 < θ ≤ 4 , of this
test.

6.7.4 Some tests for single sample

Tests on mean

6.17. (a) The sample mean and standard deviation from a random sample of 10 observations
from a normal population were computed as x = 23 and σ = 9. Calculate the value of the test
statistic of the test required to determine whether there is enough evidence to infer at the 5%
significance level that the population mean is greater than 20.
(b) Repeat part (a) with n = 30.
(c) Repeat part (b) with n = 40.

6.18. (a) A statistics practitioner is in the process of testing to determine whether there is enough
evidence to infer that the population mean is different from 180. She calculated the mean and
standard deviation of a sample of 200 observations as x = 175 and σ = 22. Calculate the value
of the test statistic of the test required to determine whether there is enough evidence at the 5%
significance level.
(b) Repeat part (a) with s = 45.
(c) Repeat part (a) with s = 60.

6.19. A courier service advertises that its average delivery time is less than 6 hours for local deliv-
eries. A random sample of times for 12 deliveries to an address across town was recorded. These
data are shown here. Is this sufficient evidence to support the couriers advertisement, at the 5%
level of significance?

3.03, 6.33, 7.98, 4.82, 6.50, 5.22, 3.56, 6.76, 7.96, 4.54, 5.09, 6.46.
CHAPTER 6. HYPOTHESIS TESTING 148

X = 5, 6875; s2 = 2, 1325; T0 = −0.7413.

6.20. Aircrew escape systems are powered by a solid propellant. The burning rate of this pro-
pellant is an important product characteristic. Specifications require that the mean burning rate
must be 50 centimeters per second. We know that the standard deviation of burning rate is σ = 2
centimeters per second. The experimenter decides to specify a type I error probability or signif-
icance level of α = 0.05 and selects a random sample of n = 25 and obtains a sample average
burning rate of X = 51.3 centimeters per second. What conclusions should be drawn?

6.21. The mean water temperature downstream from a power plant cooling tower discharge pipe
should be no more than 100◦ F . Past experience has indicated that the standard deviation of
temperature is 2◦ F . The water temperature is measured on nine randomly chosen days, and the
average temperature is found to be 98◦ F.
(a) Should the water temperature be judged acceptable with α = 0.05?
(b) What is the P -value for this test?
(c) What is the probability of accepting the null hypothesis at α = 0.05 if the water has a true
mean temperature of 104◦ F ?

6.22. A study reported body temperatures (◦ F ) for 25 female subjects follow:

97.8, 97.2, 97.4, 97.6, 97.8, 97.9, 98.0, 98.0, 98.0, 98.1, 98.2, 98.3,
98.3, 98.4, 98.4, 98.4, 98.5, 98.6, 98.6, 98.7, 98.8, 98.8, 98.9, 98.9, and 99.0.
(a) Test the hypotheses H0 : µ = 98.6 versus H1 : µ 6= 98.6, using α = 0.05. Find the P -value.
(b) Compute the power of the test if the true mean female body temperature is as low as 98.0.
(c) What sample size would be required to detect a true mean female body temperature as low as
98.2 if we wanted the power of the test to be at least 0.9?

6.23. Cloud seeding has been studied for many decades as a weather modification procedure.
The rainfall in acre-feet from 20 clouds that were selected at random and seeded with silver ni-
trate follows:
18.0, 30.7, 19.8, 27.1, 22.3, 18.8, 31.8, 23.4, 21.2, 27.9,
31.9, 27.1, 25.0, 24.7, 26.9, 21.8, 29.2, 34.8, 26.7, 31.6.
(a) Can you support a claim that mean rainfall from seeded clouds exceeds 25 acre-feet? Use
α = 0.01.
(b) Compute the power of the test if the true mean rainfall is 27 acre-feet.
(c) What sample size would be required to detect a true mean rainfall of 27.5 acre-feet if we
wanted the power of the test to be at least 0.9?

6.24. The life in hours of a battery is known to be approximately normally distributed, with stan-
dard deviation σ = 1.25 hours. A random sample of 10 batteries has a mean life of x = 40.5 hours.
(a) Is there evidence to support the claim that battery life exceeds 40 hours? Use α = 0.05.
(b) What is the P -value for the test in part (a)?
(c) What is the power for the test in part (a) if the true mean life is 42 hours?
(d) What sample size would be required to ensure that the probability of making type II error
CHAPTER 6. HYPOTHESIS TESTING 149

does not exceed 0.10 if the true mean life is 44 hours?

(e) Explain how you could answer the question in part (a) by calculating an appropriate confi-
dence bound on life.

6.25. Medical researchers have developed a new artificial heart constructed primarily of tita-
nium and plastic. The heart will last and operate almost indefinitely once it is implanted in the
patients body, but the battery pack needs to be recharged about every four hours. A random
sample of 50 battery packs is selected and subjected to a life test. The average life of these batter-
ies is 4.05 hours. Assume that battery life is normally distributed with standard deviation σ = 0.2
hour.
(a) Is there evidence to support the claim that mean battery life exceeds 4 hours? Use α = 0.05.
(b) Compute the power of the test if the true mean battery life is 4.5 hours.
(c) What sample size would be required to detect a true mean battery life of 4.5 hours if we wanted
the power of the test to be at least 0.9?
(d) Explain how the question in part (a) could be answered by constructing a one-sided confi-
dence bound on the mean life.

Tests on population variance

6.26. After many years of teaching, a statistics professor computed the variance of the marks on
her final exam and found it to be σ 2 = 250. She recently made changes to the way in which the
final exam is marked and wondered whether this would result in a reduction in the variance. A
random sample of this years final exam marks are listed here. Can the professor infer at the 10%
significance level that the variance has decreased?

57 92 99 73 62 64 75 70 88 60.

6.27. With gasoline prices increasing, drivers are more concerned with their cars’ gasoline con-
sumption. For the past 5 years, a driver has tracked the gas mileage of his car and found that the
variance from fill-up to fill-up was σ 2 = 23 mpg2 . Now that his car is 5 years old, he would like to
know whether the variability of gas mileage has changed. He recorded the gas mileage from his
last eight fill-ups; these are listed here. Conduct a test at a 10% significance level to infer whether
the variability has changed.

28 25 29 25 32 36 27 24.

Tests on proportion

6.28. (a) Calculate the P -value of the test of the following hypotheses given that p̂ = 0.63 and
n = 100:
H0 : p = 0.6 vs H1 : p > 0.6.
(b) Repeat part (a) with n = 200.
(c) Repeat part (a) with n = 400.
(d) Describe the effect on P -value of increasing sample size.
CHAPTER 6. HYPOTHESIS TESTING 150

6.29. Has the recent drop in airplane passengers resulted in better on-time performance? Before
the recent economic downturn, one airline bragged that 92% of its flights were on time. A random
sample of 165 flights completed this year reveals that 153 were on time. Can we conclude at the
5% significance level that the airlines on-time performance has improved?

6.30. In a random sample of 85 automobile engine crankshaft bearings, 10 have a surface finish
roughness that exceeds the specifications. Does this data present strong evidence that the pro-
portion of crankshaft bearings exhibiting excess surface roughness exceeds 0.10? State and test
the appropriate hypotheses using α = 0.05.

6.31. An study claimed that nearly one-half of all engineers continue academic studies beyond
the B.S. degree, ultimately receiving either an M.S. or a Ph.D. degree. Data from an article in
Engineering Horizons (Spring 1990) indicated that 117 of 484 new engineering graduates were
planning graduate study.
(a) Are the data from Engineering Horizons consistent with the claim reported by Fortune? Use
α = 0.05 in reaching your conclusions.
(b) Find the P -value for this test.
(c) Discuss how you could have answered the question in part (a) by constructing a two-sided
confidence interval on p.

6.32. A researcher claims that at least 10% of all football helmets have manufacturing flaws that
could potentially cause injury to the wearer. A sample of 200 helmets revealed that 16 helmets
contained such defects.
(a) Does this finding support the researchers claim? Use α = 0.01.
(b) Find the P -value for this test.

6.7.5 Some tests for two samples

Compare two means

6.33. In random samples 12 from each of two normal populations, we found the following statis-
tics: x1 = 74, s1 = 18 and x2 = 71, s2 = 16.
(a) Test with α = 0.05 to determine whether we can infer that the population means differ.
(b) Repeat part (a) increasing the standard deviation to s1 = 210 and s2 = 198.
(c) Describe what happens when the sample standard deviations get larger.
(d) Repeat part (a) with sample size 150.
(e) Discuss the effects of increasing the sample size.

6.34. Random sampling from two normal populations produced the following results

x1 = 412 s1 = 128 n1 = 150

x2 = 405 s2 = 54 n2 = 150.
(a) Can we infer that at the 5% significance level that µ1 is greater than µ2 .
(b) Repeat part (a) decreasing the standard deviation to s1 = 31, s2 = 16.
CHAPTER 6. HYPOTHESIS TESTING 151

(c) Describe what happens when the sample standard deviations get smaller.
(d) Repeat part (a) with samples of size 20.
(e) Discuss the effects of decreasing the sample size.
(f) Repeat part (a) changing the mean of sample 1 to x1 = 409.

6.35. Two machines are used for filling plastic bottles with a net volume of 16.0 ounces. The fill
volume can be assumed normal, with standard deviation σ1 = 0.020 and σ2 = 0.025 ounces. A
member of the quality engineering staff suspects that both machines fill to the same mean net
volume, whether or not this volume is 16.0 ounces. A random sample of 10 bottles is taken from
the output of each machine.

Machine 1 Machine 2
16.03 16.01 16.02 16.03
16.04 15.96 15.97 16.04
16.05 15.98 15.96 16.02
16.05 16.02 16.01 16.01
16.02 15.99 15.99 16.00

(a) Do you think the engineer is correct? Use α = 0.05.

(b) What is the P -value for this test?
(c) What is the power of the test in part (a) for a true difference in means of 0.04?
(d) Find a 95% confidence interval on the difference in means. Provide a practical interpretation
of this interval.
(e) Assuming equal sample sizes, what sample size should be used to assure that the probability
of making type II error is 0.05 if the true difference in means is 0.04? Assume that α = 0.05.

6.36. Every month a clothing store conducts an inventory and calculates losses from theft. The
store would like to reduce these losses and is considering two methods. The first is to hire a
security guard, and the second is to install cameras. To help decide which method to choose,
the manager hired a security guard for 6 months. During the next 6-month period, the store
installed cameras. The monthly losses were recorded and are listed here. The manager decided
that because the cameras were cheaper than the guard, he would install the cameras unless there
was enough evidence to infer that the guard was better. What should the manager do?

Security guard 355 284 401 398 477 254

Cameras 486 303 270 386 411 435

Pair t-test

6.37. Many people use scanners to read documents and store them in a Word (or some other
software) file. To help determine which brand of scanner to buy, a student conducts an experi-
ment in which eight documents are scanned by each of the two scanners he is interested in. He
records the number of errors made by each. These data are listed here. Can he infer that brand A
(the more expensive scanner) is better than brand B?
CHAPTER 6. HYPOTHESIS TESTING 152

Document 1 2 3 4 5 6 7 8
BrandA 17 29 18 14 21 25 22 29
BrandB 21 38 15 19 22 30 31 37

6.38. In an effort to determine whether a new type of fertilizer is more effective than the type cur-
rently in use, researchers took 12 two-acre plots of land scattered throughout the county. Each
plot was divided into two equal-sized subplots, one of which was treated with the current fertil-
izer and the other with the new fertilizer. Wheat was planted, and the crop yields were measured.

Plot 1 2 3 4 5 6 7 8 9 10 11 12
Current fertilizer 56 45 68 72 61 69 57 55 60 72 75 66
New fertilizer 60 49 66 73 59 67 61 60 58 75 72 68

(a) Can we conclude at the 5% significance level that the new fertilizer is more effective than the
current one?
(b) Estimate with 95% confidence the difference in mean crop yields between the two fertilizers.
(c) What is the required condition(s) for the validity of the results obtained in parts (a) and (b)?

Compare two variances

6.39. Random samples from two normal population produced the following statistics

s21 = 350, n1 = 30, s22 = 700, n2 = 30.

(a) Can we infer at the 10% significance level that the two population variances differ?
(b) Repeat part (a) changing the sample sizes to n1 = 15 and n2 = 15.
(c) Describe what happens to the test statistics and the conclusion when the sample sizes de-
crease.

6.40. A statistics professor hypothesized that not only would the means vary but also so would
the variances if the business statistics course was taught in two different ways but had the same
final exam. He organized an experiment wherein one section of the course was taught using
detailed PowerPoint slides whereas the other required students to read the book and answer
questions in class discussions. A sample of the marks was recorded and listed next. Can we
infer that the variances of the marks differ between the two sections?
Class 1 64 85 80 64 48 62 75 77 50 81 90
Class 2 73 78 66 69 79 81 74 59 83 79 84

6.41. An operations manager who supervises an assembly line has been experiencing problems
with the sequencing of jobs. The problem is that bottle- necks are occurring because of the in-
consistency of sequential operations. He decides to conduct an experiment wherein two differ-
ent methods are used to complete the same task. He measures the times (in seconds). The data
are listed here. Can he infer that the second method is more consistent than the first method?
Method 1 8.8 9.6 8.4 9.0 8.3 9.2 9.0 8.7 8.5 9.4
Method 2 9.2 9.4 8.9 9.6 9.7 8.4 8.8 8.9 9.0 9.7
CHAPTER 6. HYPOTHESIS TESTING 153

Compare two proportions

6.42. Random samples from two binomial populations yielded the following statistics:

p̂1 = 0.45 n1 = 100 p̂2 = 0.40 n2 = 100.

(a) Calculate the P -value of a test to determine whether we can infer that the population propor-
tions differ.
(b) Repeat part (a) increasing the sample sizes to 400.
(c) Describe what happens to the p-value when the sample sizes increase.

6.43. Random samples from two binomial populations yielded the following statistics:

p̂1 = 0.60 n1 = 225 p̂2 = 0.55 n2 = 225.

(a) Calculate the P -value of a test to determine whether we there is evidence to infer that the
population proportions differ.
(b) Repeat part (a) p̂1 = 0.95 and p̂2 = 0.90.
(c) Describe the effect on the P -value of increasing the sample proportions.
(d) Repeat part (a) p̂1 = 0.10 and p̂2 = 0.05.
(e) Describe the effect on the P -value of decreasing the sample proportions.

6.44. Many stores sell extended warranties for products they sell. These are very lucrative for
store owners. To learn more about who buys these warranties, a random sample was drawn
of a stores customers who recently purchased a product for which an extended warranty was
available. Among other variables, each respondent reported whether he or she paid the regular
price or a sale price and whether he or she purchased an extended warranty.

Regular Price Sale Price

Sample size 229 178
Number who bought extended warranty 47 25

Can we conclude at the 10% significance level that those who paid the regular price are more
likely to buy an extended warranty?

6.45. Surveys have been widely used by politicians around the world as a way of monitoring the
opinions of the electorate. Six months ago, a survey was undertaken to determine the degree
of support for a national party leader. Of a sample of 1100, 56% indicated that they would vote
for this politician. This month, another survey of 800 voters revealed that 46% now support the
leader.
(a) At the 5% significance level, can we infer that the national leaders popularity has decreased?
(b) At the 5% significance level, can we infer that the national leaders popularity has decreased
by more than 5%?
CHAPTER 6. HYPOTHESIS TESTING 154

6.46. A random sample of 500 adult residents of Maricopa County found that 385 were in favour
of increasing the highway speed limit to 75 mph, while another sample of 400 adult residents of
Pima County found that 267 were in favour of the increased speed limit. Do these data indicate
that there is a difference in the support for increasing the speed limit between the residents of
the two counties? Use α = 0.05. What is the P -value for this test?

6.47. Two different types of injection-molding machines are used to form plastic parts. A part
is considered defective if it has excessive shrinkage or is discolored. Two random samples, each
of size 300, are selected, and 15 defective parts are found in the sample from machine 1 while 8
defective parts are found in the sample from machine 2. Is it reasonable to conclude that both
machines produce the same fraction of defective parts, using α = 0.05? Find the P -value for this
test.

6.7.6 Chi-squared tests

Chi-squared test on distribution

6.48. A new casino game involves rolling 3 dice. The winnings are directly proportional to the
total number of sixes rolled. Suppose a gambler plays the game 100 times, with the following
observed counts:
Number of Sixes 0 1 2 3
Number of Rolls 48 35 15 3

The casino becomes suspicious of the gambler and wishes to determine whether the dice are fair.
What do they conclude?

6.49. Suppose that the distribution of the heights of men who reside in a certain large city is
the normal distribution for which the mean is 68 inches and the standard deviation is 1 inch.
Suppose also that when the heights of 500 men who reside in a certain neighbourhood of the city
were measured, the distribution in the following table was obtained. Test the hypothesis that,
with regard to height, these 500 men form a random sample from all the men who reside in the
city.

Height (in inch) <66 66-67.5 67.7-68.5 68.5-70 >70

Number of men 18 177 198 102 5

6.50. The 50 values in the following table are intended to be a random sample from the standard
normal distribution.
1.28 1.22 0.32 0.80 1.38 1.26 2.33 0.34 1.14 0.64
0.41 0.01 0.49 0.36 1.05 0.04 0.35 2.82 0.64 0.56
0.45 1.66 0.49 1.96 3.44 0.67 1.24 0.76 0.46 0.11
0.35 1.39 0.14 0.64 1.67 1.13 0.04 0.61 0.63 0.13
0.72 0.38 0.85 1.32 0.85 0.41 0.11 2.04 1.61 1.81
CHAPTER 6. HYPOTHESIS TESTING 155

a) Carry out a χ2 test of goodness-of-fit by dividing the real line into five intervals, each of which
has probability 0.2 under the standard normal distribution.
b) Carry out a χ2 test of goodness-of-fit by dividing the real line into ten intervals, each of which
has probability 0.1 under the standard normal distribution.
Chapter 7

Regression

7.1 Simple linear regression

7.1.1 Simple linear regression model

Suppose that we have a pair of variables (X, Y ) and a variable Y is a linear function of X plus
random noise:
Y = f (X) + = β0 + β1 X + ,
where a random noise is assumed to have normal distribution N(0, σ 2 ). A variable X is called a
predictor variable, Y - a response variable and a function f (x) = β0 + β1 x - a linear regeression
function.
Suppose that we are given a sequence of pairs (X1 , Y1 ), . . . , (Xn , Yn ) that are described by the
above model:
Yi = β0 + β1 Xi + i ,
and 1 , . . . , n are i.i.d. N(0, σ 2 ). We have three unknown parameters β0 , β1 and σ 2 and we want
to estimate them using a given sample. The points X1 , . . . , Xn can be either random or non
random, but from the point of view of estimating linear regression function the nature of Xs is
in some sense irrelevant so we will think of them as fixed and non random and assume that the
randomness comes from the noise variables i . For a fixed Xi , the distribution of Yi is equal to
N(f (Xi ), σ 2 ). The likelihood function of the sequence (Y1 , . . . , Yn ) is
1 Pn 2
L(Y1 , . . . , Yn ; β0 , β1 , σ 2 ) = (2πσ)−n/2 e− 2σ2 i=1 (Yi −f (Xi )) .
Let us find the m.l.e. of β̂0 , β̂1 , σ̂ 2 that maximize this likelihood function L. First of all, it is obvious
that (β̂0 , β̂1 ) is also minimized
n
X
∗
L (β0 , β1 ) := (Yi − β0 − β1 Xi )2
i=1

so β̂0 , β̂1 are solution to

 ∗
∂L
= − ni=1 2(Yi − (β0 + β1 Xi )) = 0
 P

∂β0
∂L∗
= − ni=1 2(Yi − (β0 + β1 Xi ))Xi = 0.
 P

∂β1

156
CHAPTER 7. REGRESSION 157

Denote
1X 1X 1X 2 1X
X̄ = Xi , Ȳ = Yi , X2 = Xi , XY = Xi Yi ,
n n n n
i

we obtain

β̂0 = Ȳ − β̂1 X̄
XY − X̄ Ȳ
β̂1 =
X 2 − X̄ 2
n
1X
σ̂ 2 = (Yi − β̂0 − β̂1 Xi )2
n
i=1

Denote Ŷi = β̂0 + β̂1 Xi and Pn

2 (Yi − Ŷi )2
R = 1 − Pi=1
n 2
.
i=1 (Yi − Ȳ )
The numerator in the last sum is the sum of squares of the residuals and the numerator is the
variance of Y and R2 is usually interpreted as the proportion of variability in the data explained
by the linear model. The higher R2 the better our model explains the data. Next, we would like
to do statistical inference about the linear model.
1. Construct confidence intervals for parameters of the model β0 , β1 and σ 2 .
2. Construct prediction intervals for Y given any point X.
3. Test hypotheses about parameters of the model.
The distribution of β̂0 , β̂1 and σ̂ 2 are defined by the following result.

Proposition 7.1.1. 1. Vector (β̂0 , β̂1 ) has a normal distribution with mean (β0 , β1 ) and
covariance matrix
!
σ2 X 2 −X̄
Σ= , where σx2 = X 2 − X̄ 2 .
nσx2 −X̄ 1

2. σ̂ 2 is independent of β̂0 , β̂1 .

nσ̂ 2
3. has χ2n−2 distribution with n − 2 degrees of freedom.
σ2

7.1.2 Confidence interval for σ 2

nσ̂ 2
It follows from Proposition 7.1.1 that 2 is χ2n−2 distributed, so if we choose c1−α/2,n−1 , cα/2,n−1
σ
such that

α α
P[χ2n−1 > c1−α/2,n−1 ] = 1 − , P[χ2n−1 > cα/2,n−1 ] = ,
2 2
CHAPTER 7. REGRESSION 158

then
h nσ̂ 2 nσ̂ 2 i
P ≤ σ2 ≤ = 1 − α.
cα/2,n−1 c1−α/2,n−1
Therefore the (1 − α) CI for σ 2 is
nσ̂ 2 nσ̂ 2
≤ σ2 ≤ .
cα/2,n−1 c1−α/2,n−1

7.1.3 Confidence interval for β1

It follows from Proposition 7.1.1 that

r
nσx2 nσ̂ 2
(β̂1 − β1 ) ∼ N(0, 1) and ∼ χ2n−2
σ2 σ2
and β̂1 is independent of σ̂ 2 , so
q
nσx2 r
σ2
(β̂1 − β1 ) (n − 2)σx2
= (β̂1 − β1 )
σ̂ 2
q
1 nσ̂ 2
n−2 σ 2

has a tn−2 distribution with n − 2 degrees of freedom. Choose xα such that

P[|tn−2 | < xα ] = 1 − α

we obtain the (1 − α) CI for β1 as follows

s s
σ̂ 2 σ̂ 2
βˆ1 − xα ≤ β 1 ≤ ˆ1 + xα
β .
(n − 2)σx2 (n − 2)σx2

7.1.4 Confidence interval for β0

A similar argument as above yields

s
βˆ0 − β0 1 nσ̂ 2 βˆ0 − β0
: =
n − 2 σ2
s r
X̄ 2 2
1
σ̂ 2 X̄ 2
+ σ n−2 1 + σ 2
x
n nσx2
has a Student’s t distribution with n − 2 degrees of freedom. Thus the (1 − α) CI for β0 is
s s
σ̂ 2 X̄ 2 σ̂ 2 X̄ 2
β̂0 − xα 1 + 2 ≤ β0 ≤ β̂0 + xα 1+ 2 .
n−2 σx n−2 σx

7.1.5 Prediction intervals

Suppose now that we have a new observation X for which Y is unknown and we want to
predict Y or find the confidence interval for Y . According to simple regression model,

Y = β0 + β1 X +

and it is natural to take Ŷ = β̂0 + β̂1 X as the prediction of Y . Let us find the distribution of their
difference Ŷ − Y .
CHAPTER 7. REGRESSION 159

Proposition 7.1.2. The random variable

Ŷ − Y
r
σ̂ 2 (X̄−X)2
n−2 n+1+ σx2

has a Student’s t distribution with n − 2 degrees of freedom.

Choose xα such that P[|tn−2 | < xα ] = 1 − α we obtain the (1 − α) CI for Y as follows

s s
σ̂ 2 (X̄ − X)2 σ̂ 2 (X̄ − X)2
Ŷ − xα n+1+ ≤ Y ≤ Ŷ + x α n + 1 + .
n−2 σx2 n−2 σx2
Appendies

2
e−x /2
Rz
Table of Normal distribution Φ(z) = −∞
√
2π
dx
z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.5398 0.5438 0.54776 0.55172 0.55567 0.55966 0.5636 0.56749 0.57142 0.57535
0.2 0.5793 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91308 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999

160
CHAPTER 7. REGRESSION 161

Table of Student distribution1

1 side 75% 80% 85% 90% 95% 97.5% 99% 99.5% 99.75% 99.9% 99.95%
2 side 50% 60% 70% 80% 90% 95% 98% 99% 99.5% 99.8% 99.9%
1 1 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6
2 0.816 1.08 1.386 1.886 2.92 4.303 6.965 9.925 14.09 22.33 31.6
3 0.765 0.978 1.25 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92
4 0.741 0.941 1.19 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.61
5 0.727 0.92 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.44 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.86 2.306 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.1 1.383 1.833 2.262 2.821 3.25 3.69 4.297 4.781
10 0.7 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.93 4.318
13 0.694 0.87 1.079 1.35 1.771 2.16 2.65 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.14
15 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16 0.69 0.865 1.071 1.337 1.746 2.12 2.583 2.921 3.252 3.686 4.015
17 0.689 0.863 1.069 1.333 1.74 2.11 2.567 2.898 3.222 3.646 3.965
18 0.688 0.862 1.067 1.33 1.734 2.101 2.552 2.878 3.197 3.61 3.922
19 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20 0.687 0.86 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.85
21 0.686 0.859 1.063 1.323 1.721 2.08 2.518 2.831 3.135 3.527 3.819
22 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23 0.685 0.858 1.06 1.319 1.714 2.069 2.5 2.807 3.104 3.485 3.767
24 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25 0.684 0.856 1.058 1.316 1.708 2.06 2.485 2.787 3.078 3.45 3.725
26 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.69
28 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659
30 0.683 0.854 1.055 1.31 1.697 2.042 2.457 2.75 3.03 3.385 3.646
40 0.681 0.851 1.05 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
50 0.679 0.849 1.047 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496
60 0.679 0.848 1.045 1.296 1.671 2 2.39 2.66 2.915 3.232 3.46
80 0.678 0.846 1.043 1.292 1.664 1.99 2.374 2.639 2.887 3.195 3.416
100 0.677 0.845 1.042 1.29 1.66 1.984 2.364 2.626 2.871 3.174 3.39
120 0.677 0.845 1.041 1.289 1.658 1.98 2.358 2.617 2.86 3.16 3.373
∞ 0.674 0.842 1.036 1.282 1.645 1.96 2.326 2.576 2.807 3.09 3.291

1
P[T1 < 1.376] = 0.8 v P[|T1 | < 1.376] = 0.6
CHAPTER 7. REGRESSION 162

Table of χ2 -distribution P[χ2n > α]

DF: n 0.995 0.975 0.2 0.1 0.05 0.025 0.02 0.01 0.005 0.002 0.001
1 0.00004 0.001 1.642 2.706 3.841 5.024 5.412 6.635 7.879 9.55 10.828
2 0.01 0.0506 3.219 4.605 5.991 7.378 7.824 9.21 10.597 12.429 13.816
3 0.0717 0.216 4.642 6.251 7.815 9.348 9.837 11.345 12.838 14.796 16.266
4 0.207 0.484 5.989 7.779 9.488 11.143 11.668 13.277 14.86 16.924 18.467
5 0.412 0.831 7.289 9.236 11.07 12.833 13.388 15.086 16.75 18.907 20.515
6 0.676 1.237 8.558 10.645 12.592 14.449 15.033 16.812 18.548 20.791 22.458
7 0.989 1.69 9.803 12.017 14.067 16.013 16.622 18.475 20.278 22.601 24.322
8 1.344 2.18 11.03 13.362 15.507 17.535 18.168 20.09 21.955 24.352 26.124
9 1.735 2.7 12.242 14.684 16.919 19.023 19.679 21.666 23.589 26.056 27.877
10 2.156 3.247 13.442 15.987 18.307 20.483 21.161 23.209 25.188 27.722 29.588
11 2.603 3.816 14.631 17.275 19.675 21.92 22.618 24.725 26.757 29.354 31.264
12 3.074 4.404 15.812 18.549 21.026 23.337 24.054 26.217 28.3 30.957 32.909
13 3.565 5.009 16.985 19.812 22.362 24.736 25.472 27.688 29.819 32.535 34.528
14 4.075 5.629 18.151 21.064 23.685 26.119 26.873 29.141 31.319 34.091 36.123
15 4.601 6.262 19.311 22.307 24.996 27.488 28.259 30.578 32.801 35.628 37.697
16 5.142 6.908 20.465 23.542 26.296 28.845 29.633 32 34.267 37.146 39.252
17 5.697 7.564 21.615 24.769 27.587 30.191 30.995 33.409 35.718 38.648 40.79
18 6.265 8.231 22.76 25.989 28.869 31.526 32.346 34.805 37.156 40.136 42.312
19 6.844 8.907 23.9 27.204 30.144 32.852 33.687 36.191 38.582 41.61 43.82
20 7.434 9.591 25.038 28.412 31.41 34.17 35.02 37.566 39.997 43.072 45.315
21 8.034 10.283 26.171 29.615 32.671 35.479 36.343 38.932 41.401 44.522 46.797
22 8.643 10.982 27.301 30.813 33.924 36.781 37.659 40.289 42.796 45.962 48.268
23 9.26 11.689 28.429 32.007 35.172 38.076 38.968 41.638 44.181 47.391 49.728
24 9.886 12.401 29.553 33.196 36.415 39.364 40.27 42.98 45.559 48.812 51.179
25 10.52 13.12 30.675 34.382 37.652 40.646 41.566 44.314 46.928 50.223 52.62
26 11.16 13.844 31.795 35.563 38.885 41.923 42.856 45.642 48.29 51.627 54.052
27 11.808 14.573 32.912 36.741 40.113 43.195 44.14 46.963 49.645 53.023 55.476
28 12.461 15.308 34.027 37.916 41.337 44.461 45.419 48.278 50.993 54.411 56.892
29 13.121 16.047 35.139 39.087 42.557 45.722 46.693 49.588 52.336 55.792 58.301
30 13.787 16.791 36.25 40.256 43.773 46.979 47.962 50.892 53.672 57.167 59.703
Bibliography

[1] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury,
2002.

[2] Cacoullos, T. (1989) Exercises in probability, Springer-Verlag New York Inc.

[3] DeGroot, M., & Mark J. Schervish. Probability and Statistics. 3rd ed. Boston, MA: Addison-
Wesley, 2002.

[4] Hogg, R., McKean, J.W., & Craig, A.T. (2005) Introduction to Mathematical Statistics, 6th Edi-
tion. Pearson Education International.

[5] Jacod, J., Protter, P. (2003) Probability Essential. Springer.

[6] Montgomery, D. C., & Runger, G. C. (2010). Applied statistics and probability for engineers.
John Wiley & Sons.

[7] Panchenko, D. (2006) Lecture note “Statistics for Applications”.

http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-
2006/readings/

[8] Rahman N. A. (1983) Theoretical exercises in probability and statistics, second edition.
Macmillan Publishing.

[9] Rice, John. Mathematical statistics and data analysis. Nelson Education, 2006.

[10] Shao, J. (2005) Mathematical Statistics: Exercises and Solutions. Springer

[11] Yuri, S. & Kelbert, M. (2008) Probability and Statistics by Example: Volume 1 and 2. Cam-
bridge University Press.

[12] Shevtsova, I. (2011). On the absolute constants in the Berry-Esseen type inequalities for
identically distributed summands. arXiv preprint arXiv:1111.6554.

[13] Nguyen Duy Tien, Vu Viet Yen (2001) Probability Theory (in Vietnamese). Educational Pub-
lishing House.

163

Stat230 Spring
No ratings yet
Stat230 Spring
345 pages
STAT Exercises
No ratings yet
STAT Exercises
258 pages
Lecture Notes - Probability Theory: Manuel Cabral Morais
No ratings yet
Lecture Notes - Probability Theory: Manuel Cabral Morais
297 pages
Intuition To Probability (Version 1.19)
No ratings yet
Intuition To Probability (Version 1.19)
396 pages
Probability and Statistics
No ratings yet
Probability and Statistics
580 pages
Ee230 Lectures
No ratings yet
Ee230 Lectures
103 pages
Probability
No ratings yet
Probability
67 pages
A First Course in Probability Notes
No ratings yet
A First Course in Probability Notes
103 pages
A Level Statistics-2
No ratings yet
A Level Statistics-2
108 pages
Public International Law
100% (1)
Public International Law
17 pages
Probability and Statistics Explorations With Maple
No ratings yet
Probability and Statistics Explorations With Maple
287 pages
Cours-Corrigé de Proba&Statis
No ratings yet
Cours-Corrigé de Proba&Statis
79 pages
MA-202: Probability & Statistics: Class Notes
No ratings yet
MA-202: Probability & Statistics: Class Notes
221 pages
Statistical Models
No ratings yet
Statistical Models
248 pages
ST2133 ASDT 2021 Guide
No ratings yet
ST2133 ASDT 2021 Guide
242 pages
JB Ies 109 Exercises Answers
No ratings yet
JB Ies 109 Exercises Answers
246 pages
ST104B Subject Guide 2023 Final
No ratings yet
ST104B Subject Guide 2023 Final
274 pages
Ma 202
No ratings yet
Ma 202
219 pages
Lectnotemat 5
No ratings yet
Lectnotemat 5
346 pages
Statistics For Econometrics
No ratings yet
Statistics For Econometrics
100 pages
Probability
No ratings yet
Probability
180 pages
Probability and Statistics With Examples Using R Siva Athreya, Deepayan Sarkar, and Steve Tanner
No ratings yet
Probability and Statistics With Examples Using R Siva Athreya, Deepayan Sarkar, and Steve Tanner
258 pages
Subject Guide (2019) PDF
100% (1)
Subject Guide (2019) PDF
398 pages
Summary I 2018-2019
No ratings yet
Summary I 2018-2019
72 pages
Book Statistik Non Parametrik, Komang Suardika
No ratings yet
Book Statistik Non Parametrik, Komang Suardika
492 pages
ST102/ST109 Elementary Statistical Theory Course Pack 2022/23 (Michaelmas Term)
100% (1)
ST102/ST109 Elementary Statistical Theory Course Pack 2022/23 (Michaelmas Term)
235 pages
Review Notes - Probability
No ratings yet
Review Notes - Probability
16 pages
Data Analysis
No ratings yet
Data Analysis
51 pages
Fundamentals of Statistics 2
No ratings yet
Fundamentals of Statistics 2
168 pages
Stat Book
No ratings yet
Stat Book
413 pages
Master List: Private Process Servers Certified by The Supreme Court of Texas
100% (2)
Master List: Private Process Servers Certified by The Supreme Court of Texas
78 pages
st104b Statistics 2 Revision Guide
100% (1)
st104b Statistics 2 Revision Guide
394 pages
+an Introduction To The Science of Statistics PDF
No ratings yet
+an Introduction To The Science of Statistics PDF
383 pages
An Introduction To The Science of Statis PDF
No ratings yet
An Introduction To The Science of Statis PDF
430 pages
Statbook PDF
No ratings yet
Statbook PDF
433 pages
ECE 313 Course Notes: Probability With Engineering Applications
No ratings yet
ECE 313 Course Notes: Probability With Engineering Applications
188 pages
Probabilityjan13 PDF
No ratings yet
Probabilityjan13 PDF
281 pages
ST104B Statistics 2 PDF
No ratings yet
ST104B Statistics 2 PDF
108 pages
STAT 251 Course Text
No ratings yet
STAT 251 Course Text
179 pages
Lectnotemat 2
No ratings yet
Lectnotemat 2
348 pages
Principles of Statistical Analysis - V1
No ratings yet
Principles of Statistical Analysis - V1
426 pages
Stat 230 No Tess 16 Print
No ratings yet
Stat 230 No Tess 16 Print
359 pages
Stat Book
No ratings yet
Stat Book
383 pages
Probability and Statistical Inference 9t PDF
100% (1)
Probability and Statistical Inference 9t PDF
30 pages
2020-2021 EDA 101 Handout
No ratings yet
2020-2021 EDA 101 Handout
192 pages
Doc-Cours MathsV
No ratings yet
Doc-Cours MathsV
69 pages
EC400Stats Lecturenotes2021
No ratings yet
EC400Stats Lecturenotes2021
101 pages
Probability and Stats For Data Science PDF
100% (1)
Probability and Stats For Data Science PDF
237 pages
Mathematical Statistics
No ratings yet
Mathematical Statistics
271 pages
Volume - 1 Part 3
No ratings yet
Volume - 1 Part 3
219 pages
Arbitration MCQ Judiciary Exams
No ratings yet
Arbitration MCQ Judiciary Exams
51 pages
Unfair Labour Practices and Victimisation
100% (1)
Unfair Labour Practices and Victimisation
9 pages
Contents
No ratings yet
Contents
10 pages
5ENT2054 - LECTURE - (Slides) 1. Aviation Law & Legislation - NOTES - $ - 14FEB24
No ratings yet
5ENT2054 - LECTURE - (Slides) 1. Aviation Law & Legislation - NOTES - $ - 14FEB24
36 pages
Book Solutions
No ratings yet
Book Solutions
17 pages
Motion To Quash: Municipal Trial Court in Cities
No ratings yet
Motion To Quash: Municipal Trial Court in Cities
2 pages
NUS ST2334 Lecture Notes
No ratings yet
NUS ST2334 Lecture Notes
56 pages
ST2334 Notes (Probability and Statistics - NUS)
No ratings yet
ST2334 Notes (Probability and Statistics - NUS)
55 pages
Final Examination - Nov 2016 Admit Card
No ratings yet
Final Examination - Nov 2016 Admit Card
14 pages
Complaint, For Our Rights v. Ige, No. 1:20-cv-00268 (D. Haw. June 9, 2020)
No ratings yet
Complaint, For Our Rights v. Ige, No. 1:20-cv-00268 (D. Haw. June 9, 2020)
170 pages
LAND TENURE - Revised Version
No ratings yet
LAND TENURE - Revised Version
31 pages
Probability and Statistics With Examples Using R: Siva Athreya, Deepayan Sarkar, and Steve Tanner April 25, 2016
No ratings yet
Probability and Statistics With Examples Using R: Siva Athreya, Deepayan Sarkar, and Steve Tanner April 25, 2016
4 pages
Vector Shipping Vs American Home Assurance
No ratings yet
Vector Shipping Vs American Home Assurance
3 pages
Letter N Worksheet Set
No ratings yet
Letter N Worksheet Set
9 pages
Introduction To International Human Rights Law and Humanitarian Law
No ratings yet
Introduction To International Human Rights Law and Humanitarian Law
6 pages
XYZ (Ontel) v. (Mystery Defendants) - Complaint
No ratings yet
XYZ (Ontel) v. (Mystery Defendants) - Complaint
27 pages
Commercial Law ASSIGNMENT 1: Case Note
No ratings yet
Commercial Law ASSIGNMENT 1: Case Note
4 pages
Suraj Verma 12 KW Quote TATA
No ratings yet
Suraj Verma 12 KW Quote TATA
10 pages
Brngy Reso
No ratings yet
Brngy Reso
2 pages
Law Society of Zim V Min of Transport
No ratings yet
Law Society of Zim V Min of Transport
14 pages
Membership Form v2017-01
No ratings yet
Membership Form v2017-01
2 pages
Law and Information Technology Assignemnt
No ratings yet
Law and Information Technology Assignemnt
18 pages
01 People v. Crisanto Haya, G.R. No. 230718, September 16, 2020
No ratings yet
01 People v. Crisanto Haya, G.R. No. 230718, September 16, 2020
4 pages
8143 Companies Income Tax (Amendment) Act, 2007
No ratings yet
8143 Companies Income Tax (Amendment) Act, 2007
10 pages
Syllabus
No ratings yet
Syllabus
29 pages
Leela's Friend 11
No ratings yet
Leela's Friend 11
5 pages
CSample - Chaves v. Gonzales, GR No. L-27454 (Typewriter Repair, Obligation To Repair, Nonperformance, Fixing The Period) - Synthesis Legis
No ratings yet
CSample - Chaves v. Gonzales, GR No. L-27454 (Typewriter Repair, Obligation To Repair, Nonperformance, Fixing The Period) - Synthesis Legis
15 pages
ACCT 304 Auditing: Session 4 - Fundamental Principles of Auditing
No ratings yet
ACCT 304 Auditing: Session 4 - Fundamental Principles of Auditing
22 pages
G.R. No. 117604 March 26, 1997 China Banking Corporation, Petitioner, Court of Appeals, and Valley Golf and Country Club, Inc., Respondents
No ratings yet
G.R. No. 117604 March 26, 1997 China Banking Corporation, Petitioner, Court of Appeals, and Valley Golf and Country Club, Inc., Respondents
14 pages
Veronica Fernandez Court of Appeals
No ratings yet
Veronica Fernandez Court of Appeals
2 pages
The State of Rajasthan Vs Shri G Chawla and DR Pohumal On 16 December 1958
No ratings yet
The State of Rajasthan Vs Shri G Chawla and DR Pohumal On 16 December 1958
5 pages
Watch Daddys Head in Cinemas Horror Reel Cinemas UAE
No ratings yet
Watch Daddys Head in Cinemas Horror Reel Cinemas UAE
1 page

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Inbound 8969254549211759123

Uploaded by

Inbound 8969254549211759123

Uploaded by

Lecture notes on

Probability and Statistics

Ngo Hoang Long

Division of Applied Mathematics

2 Random Variables and Distributions 29

2.8.3 Properties of conditional expectation . . . . . . . . . . . . . . . . . . . . . . 47

3 Fundamental Limit Theorems 62

4 Some useful distributions in statistics 82

5.4.1 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . 102

6 Hypothesis Testing 113

1. A toss of a coin: Ω = {H, T }.

2. Two successive tosses of a coin: Ω = {HH, HT, T H, T T }.

3. A toss of two dice: Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.

4. The measurement of a length L, with a measurement error: Ω = R+ , where R+ denotes

5. The lifetime of a light-bulb: Ω = R+ .

• the contrary event is interpreted as the complement set Ac ;

• the event ”A or B” is interpreted as the union A ∪ B;

• the event ”A and B” is interpreted as the intersection A ∩ B;

• the sure event is Ω;

• the impossible event is the empty set ∅;

• an elementary event is a ”singleton*’, i.e. a subset {w} containing a single outcome w of Ω.

P(A) = limit of fn (A) as n → ∞

3. P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.

1.2 Probability space

2. If A ∈ A then Ac := Ω\A ∈ A; Ac is called the complement of A;

4. A is closed under countable unions and intersections: that is, if A1 , A2 . . . is a countable

Definition 1.2.2. If A is a σ-algebra on Ω then (Ω, A) is called a measurable space.

Example: (i) A = {∅, Ω} (the trivial σ-algebra).

Definition 1.2.4. A probability measure defined on a σ-algebra A of Ω is a function P : A → [0, 1]

P(A ∪ B) = P(A) + P(B) (1.1)

for any disjoint sets A, B ∈ A.

(iii) P(Ac ) = 1 − P(A);

(iv) If A, B ∈ A and A ⊂ B then P(A) ≤ P(B).

P(B) = P(A) + P(B\A) ≥ P(A).

1.3 Properties of probability

(i) Axiom (2) of Definition 1.2.4.

(ii) If An ∈ A and An ↓ A, then P(An ) ↓ 0.

(iii) If An ∈ A and An ↓ A, then P(An ) ↓ P(A).

(iv) If An ∈ A and An ↑ Ω, then P(An ) ↑ 1.

(v) If An ∈ A and An ↑ A, then P(An ) ↑ P(A).

(v) ⇒ (i) Let An ∈ A be pairwise disjoint. Define Bn = ∪nk=1 Ak . We have Bn ↑ ∪∞

(v) → (iii): Suppose that An ↓ A. Then Acn ↑ Ac . Hence, we have

P(An ) = 1 − P(Acn ) ↓ 1 − P(Ac ) = P(A).

(iii) → (ii) is obvious.

P(An ) = P(Bn ) − P(Ac ) ↑ 1 − P(Ac ) = P(A).

If A ∈ 2Ω , we define the indicator function by

Then D is a d-system on S, that is, a family of subsets of S satisfied:

Indeed, the fact that S ∈ D is given. If A, B ∈ D, then

µ1 (B \ A) = µ1 (B) − µ1 (A) = µ2 (B) − µ2 (A) = µ2 (B \ A).

so that B \ A ∈ D. Finally, if Fn ∈ D and Fn ↑ F , then

µ1 (F ) =↑ lim µ1 (Fn ) =↑ lim µ2 (Fn ) = µ2 (F ),

Theorem 1.3.4 (Carathéodory’s Extension Theorem). Let S be a set and Σ0 be an algebra on S,

Proof. Step 1: Let G be the σ−algebra of all subsets of S. For G ∈ G, define

where the infimum is taken over all sequences (Fn ) in Σ0 with G ⊆ ∪n Fn .

Since  is arbitrary, we have proved result (a).

by using the countable additivity of µ0 on Σ0 . Hence

so that λ(F ) ≥ µ0 (F ). Step 3 is complete.

(E ∩ Fn ) and E c ∩ G ⊆ (E c ∩ Fn ). Thus, since  is arbitrary,

λ(G) ≥ λ(E ∩ G) + λ(E c ∩ G).

However, since λ is subadditive,

λ(G) ≤ λ(E ∩ G) + λ(E c ∩ G).

We see that E is indeed a λ−set.

1.4 Probabilities on a Finite or Countable Space

In this case, it is immediate that

1.5 Conditional Probability

Theorem 1.5.3. If A1 , . . . , An ∈ A and if P(A1 ∩ . . . ∩ An−1 ) > 0, then

P(A1 ∩ . . . ∩ An ) = P(A1 )P(A2 |A1 ) . . . P(An |A1 ∩ . . . ∩ An−1 ).

P(B) = P(A1 )P(A2 |A1 ) . . . P(An−1 |A1 ∩ . . . ∩ An−2 ),

and we get the result.

Definition 1.5.4. A collection of events (En ) is called a partition of Ω in A if

1. En ∈ A and P(En ) > 0, each n,

2. they are pairwise disjoint,

Proof. Note that

Since the En are pairwise disjoint so also are (A ∩ En )n≥1 , hence

Proof. Applying partition equation, we have that the denominator

Hence the formula becomes

Since is arbitrary, we have proved result (a).

(E ∩ Fn ) and E c ∩ G ⊆ (E c ∩ Fn ). Thus, since is arbitrary,