
Lecture notes on

Probability and Statistics

Ngo Hoang Long

Division of Applied Mathematics


Faculty of Mathematics and Informatics
Hanoi National University of Education
Email: ngolong@hnue.edu.vn
Contents

1 Probability Space 5
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Probability space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Properties of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Probabilities on a Finite or Countable Space . . . . . . . . . . . . . . . . . . . . . 17
1.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2 Random Variables and Distributions 47


2.1 Random variables on a countable space . . . . . . . . . . . . . . . . . . . . . . . 47
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.2 Random variables on a general probability space . . . . . . . . . . . . . . . . . . 52
2.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.2.2 Structure of random variables . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3 Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.1 Construction of expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.2 Some limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.4.3 Some inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.4.4 Expectation of random variable with density . . . . . . . . . . . . . . . . 67
2.5 Random elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.5.3 Density of function of random vectors . . . . . . . . . . . . . . . . . . . . 71
2.6 Independent random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.7 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.8 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


2.8.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.8.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.8.3 Properties of conditional expectation . . . . . . . . . . . . . . . . . . . . 77
2.8.4 Convergence theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.8.5 Conditional expectation given a random variable . . . . . . . . . . . . . . 82
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

3 Fundamental Limit Theorems 102


3.1 Convergence of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.2 Laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.2.1 Weak laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.2.2 Strong laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.3 Central limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.3.1 Characteristic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.3.2 Weak convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.3.3 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.4.1 Convergence of random variables . . . . . . . . . . . . . . . . . . . . . . . 126
3.4.2 Law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.4.3 Characteristic function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.4.4 Weak convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.4.5 Central limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5.1 Designing Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5.2 Collecting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.5.3 Describing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.5.4 Analyzing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.5.5 Making Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.6 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.6.1 Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
3.6.2 Counts and Percents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.6.3 Measures of Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

4 Some useful distributions in statistics 148


4.1 Multivariate normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.2 Gamma, chi-square, student and F distributions . . . . . . . . . . . . . . . . . . 149
4.2.1 Gamma distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.2.2 Chi-square distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.2.3 Student’s t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.2.4 F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

4.3 Sample mean and sample variance . . . . . . . . . . . . . . . . . . . . . . . . . . 153


4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

5 Parameter estimation 157


5.1 Samples and characteristic of sample . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.2 Data display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.2.1 Stem-and-leaf diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.2.2 Frequency distribution and histogram . . . . . . . . . . . . . . . . . . . . 162
5.2.3 Box plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.2.4 Probability plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.3 Point estimations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.3.1 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.3.2 Point estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.3.3 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
5.4 Method of finding estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.4.1 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 181
5.4.2 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.5 Lower bound for variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.6.1 Confidence interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.6.2 Point estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.6.3 Lower bound for variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

6 Hypothesis Testing 199


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.2 Method of finding test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.2.1 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.3 Method of evaluating test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3.1 Most powerful test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3.2 Uniformly most powerful test . . . . . . . . . . . . . . . . . . . . . . . . . 210
6.3.3 Monotone likelihood ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.4 Some well-known tests for a single sample . . . . . . . . . . . . . . . . . . . . . . 215
6.4.1 Hypothesis test on the mean of a normal distribution, variance σ 2 known 215
6.4.2 Hypothesis test on the mean of a normal distribution, variance σ 2 un-
known . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.4.3 Hypothesis test on the variance of a normal distribution . . . . . . . . . 224
6.4.4 Test on a population proportion . . . . . . . . . . . . . . . . . . . . . . . 226
6.5 Some well-known tests for two samples . . . . . . . . . . . . . . . . . . . . . . . 228
6.5.1 Inference for a difference in means of two normal distributions, vari-
ances known . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

6.5.2 Inference for the difference in means of two normal distributions, vari-
ances unknown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.5.3 Paired t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.5.4 Inference on the variance of two normal populations . . . . . . . . . . . 236
6.5.5 Inference on two population proportions . . . . . . . . . . . . . . . . . . 238
6.6 The chi-square test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.6.1 Goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.6.2 Tests of independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
6.6.3 Test of homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
6.7.1 Significance level and power function . . . . . . . . . . . . . . . . . . . . 256
6.7.2 Null distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.7.3 Best critical region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
6.7.4 Some tests for single sample . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.7.5 Some tests for two samples . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.7.6 Chi-squared tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

7 Regression 277
7.1 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.1.1 Simple linear regression model . . . . . . . . . . . . . . . . . . . . . . . . 277
7.1.2 Confidence interval for σ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.1.3 Confidence interval for β1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.1.4 Confidence interval for β0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.1.5 Prediction intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Chapter 1

Probability Space

1.1 Introduction
Random experiments are experiments whose outcome cannot be predicted with certainty in advance. But when one repeats the same experiment a large number of times, one can observe some "regularity" in the average outcome. A typical example is the toss of a coin: one cannot predict the result of a single toss, but if we toss the coin many times we get an average of about 50% "heads" if the coin is fair. Probability theory aims at a mathematical description of such phenomena. This theory contains three main ingredients:
a) The state space: this is the set of all possible outcomes of the experiment, and it is
usually denoted by Ω.
Examples

1. A toss of a coin: Ω = {H, T }.

2. Two successive tosses of a coin: Ω = {HH, HT, T H, T T }.

3. A toss of two dice: Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.

4. The measurement of a length L, with a measurement error: Ω = R+, where R+ denotes the positive real numbers; w ∈ Ω denotes the result of the measurement, and w − L is the measurement error.

5. The lifetime of a light-bulb: Ω = R+ .

b) The event: An ”event” is a property which can be observed either to hold or not to hold
after the experiment is done. In mathematical terms, an event is a subset of Ω. If A and B are
two events, then

• the contrary event is interpreted as the complement set Ac ;

• the event ”A or B” is interpreted as the union A ∪ B;

• the event ”A and B” is interpreted as the intersection A ∩ B;

• the sure event is Ω;

• the impossible event is the empty set ∅;

• an elementary event is a "singleton", i.e. a subset {w} containing a single outcome w of Ω.

We denote by A the family of all events. Often (but not always: we will see why later) we have A = 2Ω, the set of all subsets of Ω. The family A should be "stable" under the logical operations described above: if A, B ∈ A then we must have Ac ∈ A, A ∩ B ∈ A, A ∪ B ∈ A, and also Ω ∈ A and ∅ ∈ A.
c) The probability: With each event A one associates a number denoted by P(A) and called
the ”probability of A”. This number measures the likelihood of the event A to be realized a
priori, before performing the experiment. It is chosen between 0 and 1, and the more likely
the event is, the closer to 1 this number is.
To get an idea of the properties of these numbers, one can imagine that they are the limits
of the ”frequency” with which the events are realized: let us repeat the same experiment n
times; the n outcomes might of course be different (think of n successive tosses of the same
die, for instance). Denote by fn (A) the frequency with which the event A is realized (i.e. the
number of times the event occurs, divided by n). Intuitively we have:

P(A) = limit of fn (A) as n → ∞

(we will give a precise meaning to this ”limit” later). From the obvious properties of frequen-
cies, we immediately deduce that:

1. 0 ≤ P(A) ≤ 1;

2. P(Ω) = 1;

3. P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.
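To make the frequency interpretation above concrete, here is a minimal simulation sketch in Python; the coin, the event, the helper name and the repetition counts are illustrative choices of ours, not part of the notes.

```python
import random

def empirical_frequency(n_trials, event, experiment):
    """Estimate P(A) by the frequency f_n(A): occurrences of A divided by n."""
    hits = sum(1 for _ in range(n_trials) if event(experiment()))
    return hits / n_trials

# Toss of a fair coin: Omega = {"H", "T"}; the event A = "the toss shows heads".
toss = lambda: random.choice("HT")
is_heads = lambda outcome: outcome == "H"

for n in (100, 10_000, 1_000_000):
    print(n, empirical_frequency(n, is_heads, toss))
# The printed frequencies settle near P(A) = 0.5 as n grows.
```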

A mathematical model for our experiment is thus a triple (Ω, A, P), consisting of the space
Ω, the family A of all events, and the family of all P(A) for A ∈ A; hence we can consider that
P is a map from A into [0, 1], which satisfies at least the properties (2) and (3) above (plus in
fact an additional property, more difficult to understand, and which is given later).
A fourth notion, also important although less basic, is the following one:
d) Random variable: A random variable is a quantity which depends on the outcome of
the experiment. In mathematical terms, this is a map from Ω into a space E, where often
E = R or E = Rd . Warning: this terminology, which is rooted in the history of Probability


Theory going back 400 years, is quite unfortunate; a random ”variable” is not a variable in the
analytical sense, but a function!
Let X be such a random variable, mapping Ω into E. One can then ”transport” the prob-
abilistic structure onto the target space E, by setting PX (B) = P(X −1 (B)) for B ∈ E, where
X −1 (B) = {w ∈ Ω : X(w) ∈ B} denotes the pre-image of B by X. This formula defines a new
probability, denoted by PX but on the space E instead of Ω. The probability PX is called the
law of the variable X.
Example (toss of two dice): One tosses two fair dice and observes the number of dots appearing on each die. The sample space is Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}, and it is natural to take here A = 2Ω and

P(A) = |A|/36 if A ⊂ Ω,

where |A| denotes the number of points in A. One easily verifies the properties (1), (2), (3) above, and P({w}) = 1/36 for each singleton. The map X : Ω → N defined by X(i, j) = |j − i| is the random variable "difference of the two dice", and its law is

PX(B) = (number of pairs (i, j) such that |i − j| ∈ B) / 36

(for example PX({1}) = 5/18, PX({5}) = 1/18, etc.). We will formalize the concepts of a probability space and random variable in the following sections.
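The law PX above can be recovered by brute-force counting over Ω; a short Python sketch (the variable names are ours):

```python
from fractions import Fraction
from collections import Counter

# Sample space of the two-dice experiment and the random variable X(i, j) = |j - i|.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
counts = Counter(abs(j - i) for (i, j) in omega)

# Law of X: P_X({x}) = (number of pairs with |i - j| = x) / 36.
law = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
print(law)  # e.g. law[1] == Fraction(5, 18) and law[5] == Fraction(1, 18)
```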

1.2 Probability space


Let Ω be a non-empty set without any special structure. Let 2Ω denote all subsets of Ω,
including the empty set denoted by ∅. With A being a subset of 2Ω , we consider the following
properties:

1. ∅ ∈ A and Ω ∈ A;

2. If A ∈ A then Ac := Ω\A ∈ A; Ac is called the complement of A;

3. A is closed under finite unions and finite intersections: that is, if A1, . . . , An are all in A, then ∪_{i=1}^n Ai and ∩_{i=1}^n Ai are in A as well (for this it is enough that A be stable under the union and the intersection of any two sets);

4. A is closed under countable unions and intersections: that is, if A1 , A2 . . . is a countable


sequence of events in A then ∪i Ai and ∩i Ai are both also in A.

Definition 1.2.1. A is an algebra if it satisfies (1), (2) and (3) above. It is a σ-algebra, (or a
σ-field) if it satisfies (1), (2), and (4) above.

Definition 1.2.2. If A is a σ-algebra on Ω then (Ω, A) is called a measurable space.

Definition 1.2.3. If C ⊂ 2Ω , the σ-algebra generated by C, and written σ(C), is the smallest
σ-algebra containing C. (It always exists because 2Ω is a σ-algebra, and the intersection of a
family of σ-algebras is again a σ-algebra)

Example: (i) A = {∅, Ω} (the trivial σ-algebra).


(ii) If A is a subset of Ω, then σ(A) = {∅, A, Ac, Ω}.
(iii) If Ω = Rd, the Borel σ-algebra, written B(Rd), is the σ-algebra generated by all the intervals A of the following type:

A = (−∞, x1] × (−∞, x2] × . . . × (−∞, xd],

where x1, . . . , xd ∈ Q.
We can show that B(Rd ) is also the σ-algebra generated by all open subsets (or by all the
closed subsets) of Rd .

Definition 1.2.4. A probability measure defined on a σ-algebra A of Ω is a function P : A → [0, 1] that satisfies:

1. P(Ω) = 1;

2. For every countable sequence (An)n≥1 of elements of A, pairwise disjoint (that is, An ∩ Am = ∅ whenever n ≠ m), one has

P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).

Axiom (2) above is called countable additivity; the number P(A) is called the probability of
the event A.
In Definition 1.2.4 one might imagine a more elementary condition than (2), namely:

P(A ∪ B) = P(A) + P(B) (1.1)

for any disjoint sets A, B ∈ A.


This property is called additivity (or "finite additivity") and, by an elementary induction, it implies that for every finite collection A1, . . . , Am of pairwise disjoint events Ai ∈ A, we have

P(∪_{i=1}^m Ai) = Σ_{i=1}^m P(Ai).

Theorem 1.2.5. Let (Ω, A, P) be a probability space. The following properties hold:

(i) P(∅) = 0;

(ii) P is additive.

(iii) P(Ac ) = 1 − P(A);

(iv) If A, B ∈ A and A ⊂ B then P(A) ≤ P(B).

Proof. If in Axiom (2) we take An = ∅ for all n, we see that the number a = P(∅) is equal to an
infinite sum of itself; since 0 ≤ a ≤ 1, this is possible only if a = 0, and we have (i).
For (ii) it suffices to apply Axiom (2) with A1 = A and A2 = B and A3 = A4 = . . . = ∅, plus
the fact that P(∅) = 0, to obtain (1.1).
Applying (1.1) to A ∈ A and B = Ac we get (iii).
To show (iv), suppose A ⊂ B; then applying (1.1) to A and B\A we have

P(B) = P(A) + P(B\A) ≥ P(A).

1.3 Properties of probability


Countable additivity is not implied by additivity. In fact, in spite of its intuitive appeal,
additivity is not enough to handle the mathematical problems of the theory. The next the-
orem shows exactly what is extra when we assume countable additivity instead of just finite
additivity.

Theorem 1.3.1. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies P(Ω) = 1 and is
additive. Then the following statements are equivalent:

(i) Axiom (2) of Definition 1.2.4.

(ii) If An ∈ A and An ↓ ∅, then P(An) ↓ 0.

(iii) If An ∈ A and An ↓ A, then P(An ) ↓ P(A).

(iv) If An ∈ A and An ↑ Ω, then P(An ) ↑ 1.

(v) If An ∈ A and An ↑ A, then P(An ) ↑ P(A).

Proof. The notation An ↓ A means that A_{n+1} ⊂ An for each n and ∩_{n=1}^∞ An = A. The notation An ↑ A means that An ⊂ A_{n+1} for each n and ∪_{n=1}^∞ An = A.
(i) ⇒ (v): Let An ∈ A with An ↑ A. We construct a new sequence as follows: B1 = A1 and Bn = An \ A_{n−1} for n ≥ 2. Then ∪_{n=1}^∞ Bn = A, An = ∪_{i=1}^n Bi, and the events (Bn)n≥1 are pairwise disjoint. Therefore

P(A) = P(∪_{k≥1} Bk) = Σ_{k=1}^∞ P(Bk).

Hence

P(An) = Σ_{k=1}^n P(Bk) ↑ Σ_{k=1}^∞ P(Bk) = P(A).

(v) ⇒ (i): Let An ∈ A be pairwise disjoint. Define Bn = ∪_{k=1}^n Ak. We have Bn ↑ ∪_{k=1}^∞ Ak. Hence

P(∪_{k=1}^∞ Ak) = lim_{n→∞} P(Bn) = lim_{n→∞} Σ_{k=1}^n P(Ak) = Σ_{k=1}^∞ P(Ak).

(v) ⇒ (iii): Suppose that An ↓ A. Then An^c ↑ A^c. Hence, we have

P(An) = 1 − P(An^c) ↓ 1 − P(A^c) = P(A).

(iii) ⇒ (ii) is obvious.
(ii) ⇒ (iv): Let An ∈ A with An ↑ Ω. Then An^c ↓ ∅, so P(An^c) → 0. Therefore P(An) = 1 − P(An^c) ↑ 1.
(iv) ⇒ (v): Suppose An ↑ A. Denote Bn = An ∪ A^c. One gets Bn ↑ Ω, which implies P(Bn) ↑ 1. Since An ⊂ A, the events An and A^c are disjoint, hence

P(An) = P(Bn) − P(A^c) ↑ 1 − P(A^c) = P(A).

If A ∈ 2Ω, we define the indicator function of A by

IA(w) = 1 if w ∈ A, and IA(w) = 0 if w ∉ A.

We can say that An ∈ A converges to A (we write An → A) if lim_{n→∞} IAn(w) = IA(w) for all w ∈ Ω. Note that if the sequence An increases (resp. decreases) to A, then it also tends to A in the above sense.

Theorem 1.3.2. Let P be a probability measure and let An be a sequence of events in A which
converges to A. Then A ∈ A and limn→∞ P(An ) = P(A).

Proof. Let Bn = ∩_{m≥n} Am and Cn = ∪_{m≥n} Am. Then Bn increases to A and Cn decreases to A (in particular A = ∪n Bn ∈ A), thus lim_{n→∞} P(Bn) = lim_{n→∞} P(Cn) = P(A) by Theorem 1.3.1. However Bn ⊂ An ⊂ Cn, therefore P(Bn) ≤ P(An) ≤ P(Cn), so lim_{n→∞} P(An) = P(A) as well.

Lemma 1.3.3. Let S be a set. Let I be a π-system on S, that is, a family of subsets of S stable
under finite intersection:
I1 , I2 ∈ I ⇒ I1 ∩ I2 ∈ I.
Let Σ = σ(I). Suppose that µ1 and µ2 are probability measures on (S, Σ) such that µ1 = µ2 on I.
Then µ1 = µ2 on Σ.

Proof. Let

D = {F ∈ Σ : µ1 (F ) = µ2 (F )}.

Then D is a d-system on S, that is, a family of subsets of S satisfying:

a) S ∈ D,
b) if A, B ∈ D and A ⊆ B then B \ A ∈ D,
c) if An ∈ D and An ↑ A, then A ∈ D.

Indeed, the fact that S ∈ D is given. If A, B ∈ D and A ⊆ B, then

µ1(B \ A) = µ1(B) − µ1(A) = µ2(B) − µ2(A) = µ2(B \ A),

so that B \ A ∈ D. Finally, if Fn ∈ D and Fn ↑ F, then

µ1(F) = lim_{n→∞} µ1(Fn) = lim_{n→∞} µ2(Fn) = µ2(F),

so that F ∈ D.
Since D is a d-system containing the π-system I, Dynkin's π–λ lemma gives D ⊇ σ(I) = Σ, and the result follows.

This lemma implies that if two probability measures agree on a π-system, then they agree on the σ-algebra generated by that π-system.

Theorem 1.3.4 (Carathéodory’s Extension Theorem). Let S be a set and Σ0 be an algebra on S,


and let Σ = σ(Σ0 ). If µ0 is a countably additive map µ0 : Σ0 → [0, 1], then there exists a unique
measure µ on (S, Σ) such that µ = µ0 on Σ0 .

Proof. Step 1: Let G be the σ-algebra of all subsets of S. For G ∈ G, define

λ(G) := inf Σ_n µ0(Fn),

where the infimum is taken over all sequences (Fn) in Σ0 with G ⊆ ∪n Fn.


We now prove that
(a) λ is an outer measure on (S, G).
The facts that λ(∅) = 0 and λ is increasing are obvious. Suppose that (Gn) is a sequence in G such that each λ(Gn) is finite. Let ε > 0 be given. For each n, choose a sequence (Fn,k : k ∈ N) of elements of Σ0 such that

Gn ⊆ ∪_k Fn,k,   Σ_k µ0(Fn,k) < λ(Gn) + ε 2^{−n}.
Then G := ∪_n Gn ⊆ ∪_n ∪_k Fn,k, so that

λ(G) ≤ Σ_n Σ_k µ0(Fn,k) < Σ_n λ(Gn) + ε.

Since ε is arbitrary, we have proved result (a).


Step 2: λ is a measure on (S, L), where L is the σ-algebra of λ-sets in G. All we need to show is that
(b) Σ0 ⊆ L, and λ = µ0 on Σ0;
for then Σ := σ(Σ0) ⊆ L and we can define µ to be the restriction of λ to (S, Σ).
Step 3: Proof that λ = µ0 on Σ0.
Let F ∈ Σ0. Then, clearly, λ(F) ≤ µ0(F). Now suppose that F ⊆ ∪_n Fn, where Fn ∈ Σ0. As usual, we can define a sequence (En) of disjoint sets:

E1 := F1,   En := Fn ∩ (∪_{k<n} Fk)^c,

such that En ⊆ Fn and ∪_n En = ∪_n Fn ⊇ F. Then

µ0(F) = µ0(∪_n (F ∩ En)) = Σ_n µ0(F ∩ En),

by using the countable additivity of µ0 on Σ0 . Hence


µ0(F) ≤ Σ_n µ0(En) ≤ Σ_n µ0(Fn),

so that λ(F ) ≥ µ0 (F ). Step 3 is complete.


Step 4: Proof that Σ0 ⊆ L. Let E ∈ Σ0 and G ∈ G. Given ε > 0, there exists a sequence (Fn) in Σ0 such that G ⊆ ∪_n Fn and

Σ_n µ0(Fn) ≤ λ(G) + ε.

Now, by definition of λ,

Σ_n µ0(Fn) = Σ_n µ0(E ∩ Fn) + Σ_n µ0(E^c ∩ Fn) ≥ λ(E ∩ G) + λ(E^c ∩ G),

since E ∩ G ⊆ ∪_n (E ∩ Fn) and E^c ∩ G ⊆ ∪_n (E^c ∩ Fn). Thus, since ε is arbitrary,

λ(G) ≥ λ(E ∩ G) + λ(E^c ∩ G).

However, since λ is subadditive,

λ(G) ≤ λ(E ∩ G) + λ(E^c ∩ G).

We see that E is indeed a λ−set.



1.4 Probabilities on a Finite or Countable Space


We suppose that Ω is finite or countable and consider A = 2Ω . Then a probability on Ω is
characterized by its values on the atoms pw = P({w}), w ∈ Ω. Indeed, one can easily verify the
following theorem.

Theorem 1.4.1. Let (pw )w∈Ω be a family of real numbers indexed by the finite or countable set
Ω. Then there exists a unique probability P such that P({w}) = pw if and only if pw ≥ 0 and
Σ_{w∈Ω} pw = 1. In this case, for any A ⊂ Ω,

P(A) = Σ_{w∈A} pw.

Suppose first that Ω is finite. Any family of nonnegative terms summing up to 1 gives an
example of a probability on Ω. But among all these examples the following is particularly
important:

Definition 1.4.2. A probability P on the finite set Ω is called uniform if pw = P({w}) does not
depend on w.

In this case, it is immediate that


P(A) = |A|/|Ω| = (number of outcomes in favor of A)/(total number of possible outcomes).

Then computing the probability of any event A amounts to counting the number of points in
A. On a given finite set Ω there is one and only one uniform probability.
Example: There are 20 balls in an urn, 10 white and 10 red. One draws a set of 5 balls from the urn. Denote by X the number of white balls in the set. We want to find the probability that X = x, where x is an arbitrary fixed integer.
We label the white balls from 1 to 10 and the red balls from 11 to 20. Since the balls are drawn at once, it is natural to consider that an outcome is a subset with 5 elements of the set {1, . . . , 20} of all 20 balls. That is, Ω is the family of all subsets with 5 points, and the total number of possible outcomes is |Ω| = C_{20}^5. Next, it is also natural to consider that all possible outcomes are equally likely, that is, P is the uniform probability on Ω. The quantity X is a "random variable" because when the outcome w is known, one also knows the number X(w). The possible values of X range from 0 to 5, and the set X^{−1}({x}) = {X = x} contains C_{10}^x C_{10}^{5−x} points for 0 ≤ x ≤ 5. Hence

P(X = x) = C_{10}^x C_{10}^{5−x} / C_{20}^5 if 0 ≤ x ≤ 5, and P(X = x) = 0 otherwise.

We thus obtain, when x varies, the distribution or the law, of X. This distribution is called the
hypergeometric distribution.
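As a numerical sanity check of the hypergeometric probabilities above, here is a minimal Python sketch (the function name and default parameters are our own choices):

```python
from math import comb

def hypergeom_pmf(x, white=10, red=10, draws=5):
    """P(X = x) when drawing `draws` balls without replacement
    from an urn with `white` white and `red` red balls."""
    if not 0 <= x <= draws:
        return 0.0
    return comb(white, x) * comb(red, draws - x) / comb(white + red, draws)

probs = [hypergeom_pmf(x) for x in range(6)]
print(probs)        # the individual probabilities P(X = 0), ..., P(X = 5)
print(sum(probs))   # sums to 1.0, as a distribution must
```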

1.5 Conditional Probability


We know how to answer questions of the following kind: if there are 5 balls in an urn, 2 white and 3 black, what is the probability P(A) of the event A that a ball selected at random is white? With the classical approach, P(A) = 2/5.
The concept of conditional probability, which will be introduced below, lets us answer
questions of the following kind: What is the probability that the second ball is white (event
B) under the condition that the first ball was also white (event A)? (We are thinking of sam-
pling without replacement.) It is natural to reason as follows: if the first ball is white, then at
the second step we have an urn containing 4 balls, of which 1 is white and 3 black; hence it
seems reasonable to suppose that the (conditional) probability in question is 1/4.
In general, computing the probability of an event A, given that an event B occurs, means finding what fraction of the probability of B also lies in the event A.

Definition 1.5.1. Let A, B be events, P(B) > 0. The conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).
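The informal urn argument above (2 white, 3 black, drawing without replacement) can be checked against this definition by enumerating all ordered draws; a small Python sketch, with labels of our choosing:

```python
from itertools import permutations
from fractions import Fraction

# Urn with 2 white (W) and 3 black (B) balls; an outcome is an ordered draw of all 5 balls.
balls = ["W1", "W2", "B1", "B2", "B3"]
outcomes = list(permutations(balls))   # uniform probability on the 5! = 120 outcomes

A = [w for w in outcomes if w[0].startswith("W")]                              # first ball white
AB = [w for w in outcomes if w[0].startswith("W") and w[1].startswith("W")]    # first two balls white

p_A = Fraction(len(A), len(outcomes))
p_AB = Fraction(len(AB), len(outcomes))
print(p_AB / p_A)   # P(second white | first white) = P(A ∩ B)/P(A) = 1/4
```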

Theorem 1.5.2. Suppose P(B) > 0. The operation A ↦ P(A|B) from A to [0, 1] defines a new probability measure on A, called the conditional probability measure given B.

Proof. We define Q(A) = P(A|B), with B fixed. We must show Q satisfies (1) and (2) of Definition 1.2.4. But

Q(Ω) = P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.

Therefore, Q satisfies (1). As for (2), note that if (An)n≥1 is a sequence of elements of A which are pairwise disjoint, then

Q(∪_{n=1}^∞ An) = P(∪_{n=1}^∞ An | B) = P((∪_{n=1}^∞ An) ∩ B)/P(B) = P(∪_{n=1}^∞ (An ∩ B))/P(B),

and the sequence (An ∩ B)n≥1 is pairwise disjoint as well; thus

Q(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An ∩ B)/P(B) = Σ_{n=1}^∞ P(An|B) = Σ_{n=1}^∞ Q(An).

Theorem 1.5.3. If A1 , . . . , An ∈ A and if P(A1 ∩ . . . ∩ An−1 ) > 0, then

P(A1 ∩ . . . ∩ An ) = P(A1 )P(A2 |A1 ) . . . P(An |A1 ∩ . . . ∩ An−1 ).



Proof. We use induction. For n = 2, the theorem is simply Definition 1.5.1. Suppose the theorem holds for n − 1 events. Let B = A1 ∩ . . . ∩ An−1. Then by Definition 1.5.1, P(B ∩ An) = P(An|B)P(B); next we
replace P(B) by its value given in the inductive hypothesis:

P(B) = P(A1 )P(A2 |A1 ) . . . P(An−1 |A1 ∩ . . . ∩ An−2 ),

and we get the result.

Definition 1.5.4. A collection of events (En ) is called a partition of Ω in A if

1. En ∈ A and P(En ) > 0, each n,

2. they are pairwise disjoint,

3. Ω = ∪n En .

Theorem 1.5.5 (Partition Equation). Let (En )n≥1 be a finite or countable partition of Ω. Then if
A ∈ A,
P(A) = Σ_n P(A|En)P(En).

Proof. Note that


A = A ∩ Ω = ∪n (A ∩ En ).
Since the En are pairwise disjoint, so also are (A ∩ En)n≥1, hence

P(A) = P(∪_n (A ∩ En)) = Σ_n P(A ∩ En) = Σ_n P(A|En)P(En).

Theorem 1.5.6 (Bayes’ Theorem). Let (En ) be a finite or countable partition of Ω, and suppose
P(A) > 0. Then
P(En|A) = P(A|En)P(En) / Σ_m P(A|Em)P(Em).

Proof. Applying the partition equation, the denominator equals

Σ_m P(A|Em)P(Em) = P(A).

Hence the right-hand side becomes

P(A|En)P(En)/P(A) = P(A ∩ En)/P(A) = P(En|A).

Example 1.5.7. Because a new medical procedure has been shown to be effective in the early
detection of an illness, a medical screening of the population is proposed. The probability
that the test correctly identifies someone with the illness as positive is 0.99, and the probability
that the test correctly identifies someone without the illness as negative is 0.95. The incidence
of the illness in the general population is 0.0001. You take the test, and the result is positive.
What is the probability that you have the illness? Let D denote the event that you have the
illness, and let S denote the event that the test signals positive. The probability requested can
be denoted as . The probability that the test correctly signals someone without the illness as
negative is 0.95. Consequently, the probability of a positive test without the illness is

P(S|Dc ) = 0.05.

From Bayes’s Theorem,

P(S|D)P(D)
P(D|S) = = 0.002.
P(S|D)P(D) + P(S|Dc )P(Dc )

Surprisingly, even though the test is effective, in the sense that P(S|D) is high and P(S|Dc )
is low, because the incidence of the illness in the general population is low, the chances are
quite small that you actually have the disease even if the test is positive.
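A minimal Python sketch of the computation in Example 1.5.7 (the helper name is ours); it simply evaluates Bayes' Theorem over the two-event partition {D, D^c}:

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(D | S) for a test with P(S|D) = sensitivity and P(S|D^c) = false_positive_rate."""
    p_signal = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_signal

# Numbers from Example 1.5.7: incidence 0.0001, P(S|D) = 0.99, P(S|D^c) = 0.05.
print(bayes_posterior(prior=0.0001, sensitivity=0.99, false_positive_rate=0.05))
# about 0.00198, i.e. roughly 0.002 despite the accurate test
```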

Example 1.5.8. Suppose that Bob can decide to go to work by one of three modes of trans-
portation, car, bus, or commuter train. Because of high traffic, if he decides to go by car, there
is a 0.5 chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 0.2. The commuter train is almost never late, with a probability of only 0.01, but is more expensive than the bus.
(a) Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. What is the boss's estimate of the probability that Bob drove to work, given that he was late?
(b) Suppose that Bob's coworker knows that he almost always takes the commuter train to work, never takes the bus, but sometimes, 0.1 of the time, takes the car. What is the coworker's probability that Bob drove to work that day, given that he was late?
We have the following information given in the problem:
P(bus) = P(car) = P(train) = 1/3;
P(late|car) = 0.5;
P(late|train) = 0.01;
P(late|bus) = 0.2.

By Bayes' Theorem, the answer to (a) is

P(car|late) = P(late|car)P(car) / [P(late|car)P(car) + P(late|bus)P(bus) + P(late|train)P(train)]
            = (0.5 × 1/3) / (0.5 × 1/3 + 0.2 × 1/3 + 0.01 × 1/3)
            ≈ 0.7042.

For (b), repeat the identical calculation, but instead of the prior probabilities being 1/3, we use P(bus) = 0, P(car) = 0.1, and P(train) = 0.9. Plugging these into the same equation, we get P(car|late) ≈ 0.8475.
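The two answers in Example 1.5.8 differ only in the prior; a short Python sketch (the helper name is hypothetical, our own) makes this prior sensitivity explicit:

```python
def posterior_car(priors, late_given):
    """P(car | late) from prior probabilities and conditional lateness probabilities."""
    total = sum(priors[m] * late_given[m] for m in priors)
    return priors["car"] * late_given["car"] / total

late_given = {"car": 0.5, "bus": 0.2, "train": 0.01}

boss_prior = {"car": 1/3, "bus": 1/3, "train": 1/3}
coworker_prior = {"car": 0.1, "bus": 0.0, "train": 0.9}

print(posterior_car(boss_prior, late_given))      # about 0.7042 (part a)
print(posterior_car(coworker_prior, late_given))  # about 0.8475 (part b)
```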

1.6 Independence
Definition 1.6.1. 1. Two events A and B are independent if

P(A ∩ B) = P(A)P(B).

2. A (possibly infinite) collection of events (Ai )i∈I is a pairwise independent collection if for
any distinct elements i1 , i2 ∈ I,

P(Ai1 ∩ Ai2 ) = P(Ai1 )P(Ai2 ).

3. A (possibly infinite) collection of events (Ai )i∈I is an independent collection if for every
finite subset J of I, one has
P(∩_{i∈J} Ai) = Π_{i∈J} P(Ai).

If events (Ai )i∈I are independent, they are pairwise independent, but the converse is false.

Proposition 1.6.2. a) If A and B are independent, so also are A and B c ; Ac and B; and Ac and
Bc.
b) If A and B are independent and 0 < P(B) < 1, then

P(A|B) = P(A|B^c) = P(A).

Proof. a) A and B c . Since A and B are independent, then P(A ∩ B) = P(A)P(B) = P(A)(1 −
P(B c )) = P(A) − P(A)P(B c ). We have P(A ∩ B) = P(A) − P(A ∩ B c ). Substituting these into the
equation P(A ∩ B) = P(A)P(B), we obtain

P(A) − P(A ∩ B c ) = P(A) − P(A)P(B c ).

Hence

P(A ∩ B c ) = P(A)P(B c ).

Therefore, A and B^c are independent. Exchanging the roles of A and B shows that A^c and B are independent as well.


Ac and B c . We have P(A)P(B c ) = [1 − P(Ac )]P(B c ) = P(B c ) − P(Ac )P(B c ) and P(A ∩ B c ) =
P(B c ) − P(Ac ∩ B c ). Substituting into the equation P(A ∩ B c ) = P(A)P(B c ), we obtain

P(B c ) − P(Ac ∩ B c ) = P(B c ) − P(Ac )P(B c ).

So

P(Ac ∩ B c ) = P(Ac )P(B c ).

Therefore, Ac and B c are independent.


b) We have

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),

and

P(A|B^c) = P(A ∩ B^c)/P(B^c) = P(A)P(B^c)/P(B^c) = P(A).

Examples:

1. Toss a coin 3 times. If Ai is an event depending only on the ith toss, then it is standard
to model (Ai )1≤i≤3 as being independent.

2. One chooses a card at random from a deck of 52 cards. A = ”the card is a heart”, and
B = ”the card is Queen”. A natural model for this experiment consists in prescribing
the probability 1/52 for picking any one of the cards. By additivity, P(A) = 13/52 and
P(B) = 4/52 and P(A ∩ B) = 1/52 hence A and B are independent.

3. Let Ω = {1, 2, 3, 4}, and A = 2Ω. Let P({i}) = 1/4 for i = 1, 2, 3, 4. Let A = {1, 2}, B = {1, 3}, and C = {2, 3}. Then A, B, C are pairwise independent but are not independent.
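Example 3 can be checked by direct enumeration; a small Python verification (our own, not from the notes) that A, B, C are pairwise independent while P(A ∩ B ∩ C) ≠ P(A)P(B)P(C):

```python
from fractions import Fraction
from itertools import combinations

omega = {1, 2, 3, 4}

def P(E):
    """Uniform probability on omega: P({i}) = 1/4."""
    return Fraction(len(E), len(omega))

A, B, C = {1, 2}, {1, 3}, {2, 3}
events = {"A": A, "B": B, "C": C}

# Pairwise independence: P(X ∩ Y) = P(X)P(Y) for every pair.
for (x, X), (y, Y) in combinations(events.items(), 2):
    print(x, y, P(X & Y) == P(X) * P(Y))      # True for all three pairs

# But not (mutually) independent: A ∩ B ∩ C is empty while P(A)P(B)P(C) = 1/8.
print(P(A & B & C) == P(A) * P(B) * P(C))     # False
```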

Exercises

Axiom of Probability
1.1. Give a possible sample space for each of the following experiments:

1. A two-sided coin is tossed.

2. A student is asked for the month of the year and the day of the week on which her birth-
day falls.

3. A student is chosen at random from a class of ten students.

4. You receive a grade in this course.

1.2. Let A be a σ-algebra of subsets of Ω and let B be a subset of Ω. Show that F = {A ∩ B : A ∈


A} is a σ-algebra of subsets of B.

1.3. Let f be a function mapping Ω to another space E with a σ-algebra E. Let A = {A ⊂ Ω :


there exists B ∈ E with A = f −1 (B)}. Show that A is a σ-algebra on Ω.

1.4. Let (Gα )α∈I be an arbitrary family of σ-algebras defined on an abstract space Ω. Show that
H = ∩α∈I Gα is also a σ-algebra.

1.5. Suppose that Ω is an infinite set (countable or not), and let A be the family of all subsets
which are either finite or have a finite complement. Show that A is an algebra, but not a σ-
algebra.

1.6. Give a counterexample that shows that, in general, the union A ∪ B of two σ-algebras
need not be a σ-algebra.

1.7. Let Ω = {a, b, c} be a sample space. Let P({a}) = 1/2, P({b}) = 1/3, and P({c}) = 1/6. Find
the probabilities for all eight subsets of Ω.

1.8. For A, B ∈ A, show

1. P(A ∩ B^c) = P(A) − P(A ∩ B).

2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).


1.9. Suppose P(A) = 3/4 and P(B) = 1/3. Show that always 1/12 ≤ P(A ∩ B) ≤ 1/3.

1.10. Let (Bn ) be a sequence of events such that P(Bn ) = 1 for all n ≥ 1. Show that
P(∩_n Bn) = 1.

1.11. Let A1 , . . . , An be given events. Show the inclusion-exclusion formula:

P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − . . . + (−1)^{n+1} P(A1 ∩ A2 ∩ . . . ∩ An),

where (for example) Σ_{i<j} means to sum over all ordered pairs (i, j) with i < j.

1.12. Let Ai ∈ A be a sequence of events. Show that

P(∪_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai)

for each n, and also

P(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ P(Ai).

1.13. (Bonferroni Inequalities) Let Ai ∈ A be a sequence of events. Show that

1. P(∪_{i=1}^n Ai) ≥ Σ_{i=1}^n P(Ai) − Σ_{i<j} P(Ai ∩ Aj),

2. P(∪_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak).

1.14. Let (Ω, A, P) be a probability space. Show for events Bi ⊂ Ai the following inequality
P(∪_i Ai) − P(∪_i Bi) ≤ Σ_i (P(Ai) − P(Bi)).
1.15. If (Bk)1≤k≤n are events such that Σ_{k=1}^n P(Bk) > n − 1, then

P(∩_{k=1}^n Bk) > 0.

Classical definition of probability


1.16. In the laboratory analysis of samples from a chemical process, five samples from the
process are analyzed daily. In addition, a control sample is analyzed two times each day to
check the calibration of the laboratory instruments.

1. How many different sequences of process and control samples are possible each day?
Assume that the five process samples are considered identical and that the two control
samples are considered identical.

2. How many different sequences of process and control samples are possible if we con-
sider the five process samples to be different and the two control samples to be identical.

3. For the same situation as part (b), how many sequences are possible if the first test of
each day must be a control sample?

1.17. In the design of an electromechanical product, seven different components are to be


stacked into a cylindrical casing that holds 12 components in a manner that minimizes the
impact of shocks. One end of the casing is designated as the bottom and the other end is the
top.

1. How many different designs are possible?



2. If the seven components are all identical, how many different designs are possible?

3. If the seven components consist of three of one type of component and four of another
type, how many different designs are possible? (more difficult)

1.18. The design of a communication system considered the following questions:

1. How many three-digit phone prefixes that are used to represent a particular geographic
area (such as an area code) can be created from the digits 0 through 9?

2. As in part (a), how many three-digit phone prefixes are possible that do not start with 0
or 1, but contain 0 or 1 as the middle digit?

3. How many three-digit phone prefixes are possible in which no digit appears more than
once in each prefix?

1.19. A byte is a sequence of eight bits and each bit is either 0 or 1.

1. How many different bytes are possible?

2. If the first bit of a byte is a parity check, that is, the first byte is determined from the other
seven bits, how many different bytes are possible?

1.20. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If four chips are
taken at random and without replacement, find the probability that: (a) each of the 4 chips is
red; (b) none of the 4 chips is red; (c) there is at least 1 chip of each color.

1.21. Three distinct integers are chosen at random from the first 20 positive integers. Com-
pute the probability that: (a) their sum is even; (b) their product is even.

1.22. There are 5 red chips and 3 blue chips in a bowl. The red chips are numbered 1, 2, 3, 4, 5,
respectively, and the blue chips are numbered 1, 2, 3, respectively. If 2 chips are to be drawn at
random and without replacement, find the probability that these chips have either the same
number or the same color.

1.23. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines 5 bulbs, which are
selected at random and without replacement. (a) Find the probability of at least 1 defective
bulb among the 5. (b) How many bulbs should be examined so that the probability of finding
at least 1 bad bulb exceeds 0.2 ?

1.24. Three winning tickets are drawn from an urn of 100 tickets. What is the probability of win-
ning for a person who buys:

1. 4 tickets?

2. only one ticket?



1.25. A drawer contains eight different pairs of socks. If six socks are taken at random and
without replacement, compute the probability that there is at least one matching pair among
these six socks.

1.26. In a classroom there are n students.

1. What is the probability that at least two students have the same birthday?

2. What is the minimum value of n which secures probability 1/2 that at least two have a
common birthday?

1.27. Four mice are chosen (without replacement) from a litter containing two white mice.
The probability that both white mice are chosen is twice the probability that neither is chosen.
How many mice are there in the litter?

1.28. Suppose there are N different types of coupons available when buying cereal; each box
contains one coupon and the collector is seeking to collect one of each in order to win a prize.
After buying n boxes, what is the probability pn that the collector has at least one of each type?
(Consider sampling with replacement from a population of N distinct elements. The sample
size is n > N . Use inclusion-exclusion formula)

1.29. An absent-minded person has to put n personal letters in n addressed envelopes, and
he does it at random. What is the probability pm,n that exactly m letters will be put correctly in
their envelopes?

1.30. N men run out of a men’s club after a fire and each takes a coat and a hat. Prove that:

a) the probability that no one will take both his own coat and his own hat is

Σ_{k=0}^N (−1)^k (N − k)!/(N! k!);

b) the probability that each man takes a wrong coat and a wrong hat is

[Σ_{k=2}^N (−1)^k / k!]^2.

1.31. You throw 6n dice at random. Find the probability that each number appears exactly n
times.

1.32. * Mary tosses n + 1 fair coins and John tosses n fair coins. What is the probability that
Mary gets more heads than John?

Conditional Probability
1.33. Bowl I contains 6 red chips and 4 blue chips. Five of these 10 chips are selected at ran-
dom and without replacement and put in bowl II, which was originally empty. One chip is
then drawn at random from bowl II. Given that this chip is blue, find the conditional proba-
bility that 2 red chips and 3 blue chips are transferred from bowl I to bowl II.

1.34. You enter a chess tournament where your probability of winning a game is 0.3 against
half the players (call them type 1), 0.4 against a quarter of the players (call them type 2), and
0.5 against the remaining quarter of the players (call them type 3). You play a game against a
randomly chosen opponent. What is the probability of winning?

1.35. We roll a fair four-sided die. If the result is 1 or 2, we roll once more but otherwise, we
stop. What is the probability that the sum total of our rolls is at least 4?

1.36. There are three coins in a box. One is a two-headed coin, another is a fair coin, and
the third is a biased coin that comes up heads 75 percent of the time. When one of the three
coins is selected at random and flipped, it shows heads. What is the probability that it was the
two-headed coin?

1.37. Alice is taking a probability class and at the end of each week she can be either up-to-
date or she may have fallen behind. If she is up-to-date in a given week, the probability that
she will be up-to-date (or behind) in the next week is 0.8 (or 0.2, respectively). If she is behind
in a given week, the probability that she will be up-to-date (or behind) in the next week is 0.6
(or 0.4, respectively). Alice is (by default) up-to-date when she starts the class. What is the
probability that she is up-to-date after three weeks?

1.38. At the station there are three payphones which accept 20p pieces. One never works, an-
other always works, while the third works with probability 1/2. On my way to the metropolis
for the day, I wish to identify the reliable phone, so that I can use it on my return. The station
is empty and I have just three 20p pieces. I try one phone and it does not work. I try another
twice in succession and it works both times. What is the probability that this second phone is
the reliable one?

1.39. An insurance company insures an equal number of male and female drivers. In any given
year the probability that a male driver has an accident involving a claim is α, independently
of other years. The analogous probability for females is β. Assume the insurance company
selects a driver at random.

a) What is the probability the selected driver will make a claim this year?

b) What is the probability the selected driver makes a claim in two consecutive years?

c) Let A1 , A2 be the events that a randomly chosen driver makes a claim in each of the first
and second years, respectively. Show that P (A2 |A1 ) ≥ P (A1 ).

d) Find the probability that a claimant is female.

1.40. Three newspapers A, B and C are published in a certain city, and a survey shows that for
the adult population 20% read A, 16% B, and 14% C, 8% read both A and B, 5% both A and C,
4% both B and C, and 2% read all three. If an adult is chosen at random, find the probability that

a) he reads none of these paper;

b) he reads only one of these papers; and

c) he reads at least A and B if it is known that he reads at least one paper.

1.41. Customers are used to evaluate preliminary product designs. In the past, 95% of highly
successful products received good reviews, 60% of moderately successful products received
good reviews, and 10% of poor products received good reviews. In addition, 40% of products
have been highly successful, 35% have been moderately successful, and 25% have been poor
products.

1. What is the probability that a product attains a good review?

2. If a new design attains a good review, what is the probability that it will be a highly suc-
cessful product?

3. If a product does not attain a good review, what is the probability that it will be a highly
successful product?

1.42. An inspector working for a manufacturing company has a 99% chance of correctly iden-
tifying defective items and a 0.5% chance of incorrectly classifying a good item as defective.
The company has evidence that its line produces 0.9% of nonconforming items.

1. What is the probability that an item selected for inspection is classified as defective?

2. If an item selected at random is classified as nondefective, what is the probability that it


is indeed good?

1.43. A new analytical method to detect pollutants in water is being tested. This new method
of chemical analysis is important because, if adopted, it could be used to detect three different
contaminants—organic pollutants, volatile solvents, and chlorinated compounds—instead
of having to use a single test for each pollutant. The makers of the test claim that it can detect
high levels of organic pollutants with 99.7% accuracy, volatile solvents with 99.95% accuracy,
and chlorinated compounds with 89.7% accuracy. If a pollutant is not present, the test does
not signal. Samples are prepared for the calibration of the test and 60% of them are contami-
nated with organic pollutants, 27% with volatile solvents, and 13% with traces of chlorinated
compounds.
A test sample is selected randomly.

1. What is the probability that the test will signal?

2. If the test signals, what is the probability that chlorinated compounds are present?

1.44. Software to detect fraud in consumer phone cards tracks the number of metropolitan
areas where calls originate each day. It is found that 1% of the legitimate users originate calls
from two or more metropolitan areas in a single day. However, 30% of fraudulent users origi-
nate calls from two or more metropolitan areas in a single day. The proportion of fraudulent
users is 0.01%. If the same user originates calls from two or more metropolitan areas in a
single day, what is the probability that the user is fraudulent?

1.45. The probability of getting through by telephone to buy concert tickets is 0.92. For the
same event, the probability of accessing the vendor’s Web site is 0.95. Assume that these two
ways to buy tickets are independent. What is the probability that someone who tries to buy
tickets through the Internet and by phone will obtain tickets?

1.46. The British government has stepped up its information campaign regarding foot and
mouth disease by mailing brochures to farmers around the country. It is estimated that 99%
of Scottish farmers who receive the brochure possess enough information to deal with an out-
break of the disease, but only 90% of those without the brochure can deal with an outbreak.
After the first three months of mailing, 95% of the farmers in Scotland received the informa-
tive brochure. Compute the probability that a randomly selected farmer will have enough
information to deal effectively with an outbreak of the disease.

1.47. In an automated filling operation, the probability of an incorrect fill when the process is
operated at a low speed is 0.001. When the process is operated at a high speed, the probability
of an incorrect fill is 0.01. Assume that 30% of the containers are filled when the process is
operated at a high speed and the remainder are filled when the process is operated at a low
speed.

1. What is the probability of an incorrectly filled container?

2. If an incorrectly filled container is found, what is the probability that it was filled during
the high-speed operation?

1.48. An encryption-decryption system consists of three elements: encode, transmit, and de-
code. A faulty encode occurs in 0.5% of the messages processed, transmission errors occur in
1% of the messages, and a decode error occurs in 0.1% of the messages. Assume the errors are
independent.

1. What is the probability of a completely defect-free message?

2. What is the probability of a message that has either an encode or a decode error?

1.49. It is known that two defective copies of a commercial software program were erro-
neously sent to a shipping lot that has now a total of 75 copies of the program. A sample
of copies will be selected from the lot without replacement.

1. If three copies of the software are inspected, determine the probability that exactly one
of the defective copies will be found.

2. If three copies of the software are inspected, determine the probability that both defec-
tive copies will be found.

3. If 73 copies are inspected, determine the probability that both copies will be found.
Hint: Work with the copies that remain in the lot.

1.50. A robotic insertion tool contains 10 primary components. The probability that any com-
ponent fails during the warranty period is 0.01. Assume that the components fail indepen-
dently and that the tool fails if any component fails. What is the probability that the tool fails
during the warranty period?

1.51. A machine tool is idle 15% of the time. You request immediate use of the tool on five
different occasions during the year. Assume that your requests represent independent events.

1. What is the probability that the tool is idle at the time of all of your requests?

2. What is the probability that the machine is idle at the time of exactly four of your re-
quests?

3. What is the probability that the tool is idle at the time of at least three of your requests?

1.52. A lot of 50 spacing washers contains 30 washers that are thicker than the target dimen-
sion. Suppose that three washers are selected at random, without replacement, from the lot.

1. What is the probability that all three washers are thicker than the target?

2. What is the probability that the third washer selected is thicker than the target if the first
two washers selected are thinner than the target?

3. What is the probability that the third washer selected is thicker than the target?

1.53. Continuation of previous exercise. Washers are selected from the lot at random, without
replacement.

1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?

2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?

1.54. The alignment between the magnetic tape and head in a magnetic tape storage system
affects the performance of the system. Suppose that 10% of the read operations are degraded
by skewed alignments, 5% by off-center alignments, 1% by both skewness and off-center alignment, and
the remaining read operations are properly aligned. The probability of a read error is 0.01
from a skewed alignment, 0.02 from an off-center alignment, 0.06 from both conditions, and
0.001 from a proper alignment. What is the probability of a read error?

1.55. Suppose that a lot of washers is large enough that it can be assumed that the sampling
is done with replacement. Assume that 60% of the washers exceed the target thickness.

1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?

2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?

1.56. In a chemical plant, 24 holding tanks are used for final product storage. Four tanks are
selected at random and without replacement. Suppose that six of the tanks contain material
in which the viscosity exceeds the customer requirements.

1. What is the probability that exactly one tank in the sample contains high viscosity ma-
terial?

2. What is the probability that at least one tank in the sample contains high viscosity ma-
terial?

3. In addition to the six tanks with high viscosity levels, four different tanks contain ma-
terial with high impurities. What is the probability that exactly one tank in the sample
contains high viscosity material and exactly one tank in the sample contains material
with high impurities?

1.57. Plastic parts produced by an injection-molding operation are checked for conformance
to specifications. Each tool contains 12 cavities in which parts are produced, and these parts
fall into a conveyor when the press opens. An inspector chooses 3 parts from among the 12 at
random. Two cavities are affected by a temperature malfunction that results in parts that do
not conform to specifications.

1. What is the probability that the inspector finds exactly one nonconforming part?

2. What is the probability that the inspector finds at least one nonconforming part?

1.58. A bin of 50 parts contains five that are defective. A sample of two is selected at random,
without replacement.

1. Determine the probability that both parts in the sample are defective by computing a
conditional probability.

2. Determine the answer to part (a) by using the subset approach that was described in
this section.

1.59. * The Polya urn model is as follows. We start with an urn which contains one white ball
and one black ball. At each second we choose a ball at random from the urn and replace it
together with one more ball of the same color. Calculate the probability that when n balls are
in the urn, i of them are white.

1.60. You have n urns, the rth of which contains r − 1 red balls and n − r blue balls, r =
1, . . . , n. You pick an urn at random and remove two balls from it without replacement. Find
the probability that the two balls are of different colors. Find the same probability when you
put back a removed ball.

1.61. A coin shows heads with probability p on each toss. Let πn be the probability that the
number of heads after n tosses is even. Show that πn+1 = (1 − p)πn + p(1 − πn ) and find πn .

1.62. There are n similarly biased dice such that the probability of obtaining a 6 with each one
of them is the same and equal to p (0 < p < 1). If all the dice are rolled once, show that pn , the
probability that an odd number of 6’s is obtained, satisfies the difference equation

pn + (2p − 1)pn−1 = p,

and hence derive an explicit expression for pn .

1.63. Dubrovsky sits down to a night of gambling with his fellow officers. Each time he stakes
u roubles there is a probability r that he will win and receive back 2u roubles (including his
stake). At the beginning of the night he has 8000 roubles. If ever he has 256000 roubles he will
marry the beautiful Natasha and retire to his estate in the country. Otherwise, he will commit
suicide. He decides to follow one of two courses of action:

(i) to stake 1000 roubles each time until the issue is decided;

(ii) to stake everything each time until the issue is decided.

Advise him (a) if r = 1/4 and (b) if r = 3/4. What are the chances of a happy ending in each
case if he follows your advice?
1.6. INDEPENDENCE 29

Independence
1.64. Let the events A1 , A2 , . . . , An be independent and P (Ai ) = p (i = 1, 2, . . . , n). What is
the probability that:

a) at least one of the events will occur?

b) at least m of the events will occur?

c) exactly m of the events will occur?

1.65. Each of four persons fires one shot at a target. Let Ck denote the event that the tar-
get is hit by person k, k = 1, 2, 3, 4. If C1 , C2 , C3 , C4 are independent and if P(C1 ) = P(C2 ) =
0.7, P(C3 ) = 0.9, and P(C4 ) = 0.4, compute the probability that (a) all of them hit the target;
(b) exactly one hits the target; (c) no one hits the target; (d) at least one hits the target.

1.66. The probability of winning on a single toss of the dice is p. A starts, and if he fails, he
passes the dice to B, who then attempts to win on her toss. They continue tossing the dice
back and forth until one of them wins. What are their respective probabilities of winning?

1.67. Two darts players throw alternately at a board and the first to score a bull wins. On each
of their throws player A has probability pA and player B pB of success; the results of different
throws are independent. If A starts, calculate the probability that he/she wins.

1.68. * A fair coin is tossed until either the sequence HHH occurs in which case I win or the
sequence T HH occurs, when you win. What is the probability that you win?

1.69. Let A1 , . . . , An be independent events, with P(Ai ) < 1. Prove that there exists an event B
with P(B) > 0 such that B ∩ Ai = ∅ for 1 ≤ i ≤ n.

1.70. n balls are placed at random into n cells. Find the probability pn that exactly two cells
remain empty.

1.71. An urn contains b black balls and r red balls. One of the balls is drawn at random, but
when it is put back in the urn c additional balls of the same color are put in with it. Now
suppose that we draw another ball. Show that the probability that the first ball drawn was
black given that the second ball drawn was red is b/(b + r + c).

1.72. Suppose every packet of the detergent TIDE contains a coupon bearing one of the letters
of the word TIDE. A customer who has all the letters of the word gets a free packet. All the let-
ters have the same possibility of appearing in a packet. Find the probability that a housewife
who buys 8 packets will get:

a) one free packet,

b) two free packets.


Chapter 2

Random Variables and Distributions

2.1 Random variables on a countable space

2.1.1 Definitions
Throughout this section we suppose that Ω is countable and A = 2Ω . A random variable
X in this case is defined as a map from Ω into R. A random variable stands for an observation
of the outcome of a random event. Before the random event we may know the range of X but
we do not know its exact value until the random event happens. The distribution of a random
variable X is defined by

PX (B) = P({w : X(w) ∈ B}) = P(X −1 (B)) = P[X ∈ B], B ∈ B(R).

Since the set Ω is countable, the range of X is also countable. Suppose that X(Ω) = {x1 , x2 , . . .}.
Then the distribution of X is completely determined by the following numbers pX i = P(X =
xi ), i ≥ 1. Indeed, for any event A ∈ A,
X X
PX (A) = P[X = xi ] = pXi .
xi ∈A xi ∈A

Definition 2.1.1. Let X be a real-valued random variable on a countable space Ω. Suppose


that X(Ω) = {x1 , x2 , . . .}. The expectation of X, denoted E[X], is defined to be
X X
E[X] := xi P[X = xi ] = x i pX
i
i i

provided this sum makes sense: this is the case when at least one of the following conditions
is satisfied

1. Ω is finite;

xi p X
P
2. Ω is countable and the series i i absolutely convergence;

3. X ≥ 0 always (in this case, the above sum and hence E[X] as well may take value +∞.

30
2.1. RANDOM VARIABLES ON A COUNTABLE SPACE 31

Remark 1. Since Ω is countable, we denote pw the probability that the elementary event w ∈ Ω
happens. Then the expectation of X is given by
X
E[X] = X(w)pw .
w∈Ω

Let L1 denote the space of all random variables with finite expectation defined on (Ω, A, P).
The following facts are straightforward from the definition of expectation.

Theorem 2.1.2. Let X, Y ∈ L1 . The following statements hold:

1. L1 is a vector space over R and

E[aX + bY ] = aE[X] + bE[Y ], ∀ a, b ∈ R;

2. If X ≥ 0 then EX ≥ 0. Moreover, if X ≥ Y then E[X] ≥ E[Y ];

3. If Z is a bounded random variable then Z ∈ L1 . Furthermore, if Z 0 ∈ L1 and |Z 0 | ≤ X ∈


L1 then Z 0 ∈ L1 ;

4. If X = IA is the indicator function of an event A, then E[X] = P(A);

5. Let ϕ : R → R. Then
X X
E[ϕ(X)] = ϕ(xi )pX
i = ϕ(X(w))pw
i w∈Ω

if the above series is absolutely convergent.

Remark 2. If E[X 2 ] = x2i pX


P
i i < ∞, then

X 1X 1
E[|X|] = |xi |pX
i ≤ (|xi |2 + 1)pX 2
i = (E(X ) + 1) < ∞.
i
2 i 2

Definition 2.1.3. Variation of a random variable X is defined to be

DX = E[(X − E[X])2 ]

It follows from the linearity of expectation operator that

DX = E[X 2 ] − (E[X])2 .

Hence X X 2
DX = x2i pX
i − xi p X
i .
i i
2.1. RANDOM VARIABLES ON A COUNTABLE SPACE 32

2.1.2 Examples
Poisson distribution

X has a Poisson distribution with parameter λ > 0, denoted X ∼ P oi(λ), if X(Ω) =


{0, 1, . . .} and
e−λ λk
P[X = k] = , k = 0, 1, . . .
k!
The expectation of X is
∞ ∞
X e−λ λk X λj
E[X] = k = λe−λ = λeλ e−λ = λ.
k=0
k! j=0
j!

A similar calculation gives us the variance of X,

DX = λ.

Bernoulli distribution

X is Bernoulli with parameter p ∈ [0, 1], denoted X ∼ Ber(p), if it takes only two values 0
and 1 and
P[X = 1] = 1 − P[X = 0] = p.
X corresponds to an experiment with only two outcomes, usually called “success” (X = 1)
and “failure” (X = 0). The expectation and variance of X are

E[X] = p, DX = p(1 − p).

Binomial distribution

X has Binomial distribution with parameters p ∈ [0, 1] and n ∈ N, denoted X ∼ B(n, p), if
X takes on the values {0, 1, . . . , n} and

P[X = k] = Cnk pk (1 − p)n−k , k = 0, 1, . . . , n.

One has
n
X n
X
E[X] = kP[X = k] = kCnk pk (1 − p)n−k
k=0 k=0
n
X
k−1 k−1
= np Cn−1 p (1 − p)n−k = np,
k=1
2.2. RANDOM VARIABLES ON A GENERAL PROBABILITY SPACE 33

and
n
X n
X
2 2
E[X ] = k P[X = k] = k 2 Cnk pk (1 − p)n−k
k=0 k=0
n
X n
X
2 k−2 k−2 n−k k−1 k−1
= n(n − 1)p Cn−2 p (1 − p) + np Cn−1 p (1 − p)n−k
k=2 k=1
2
= n(n − 1)p + np,

thus DX = np(1 − p).

Geometric distribution

One repeatedly performs a sequence of independent Bernoulli trials until achieving the
first sucesses. Let X denote the number of failures before reaching the first success. X has a
Geometric distribution with parameter q = 1 − p ∈ [0, 1], denoted X ∼ Geo(q),

P[X = k] = q k p, k = 0, 1, . . .

where p is the probability of success. Then we have


∞ ∞
X X q
E[X] = kP[X = k] = kpq k = .
k=0 k=0
p

Moreover, one can easily shows that


q
DX = .
p2

2.2 Random variables on a general probability space

2.2.1 Definition
Let (Ω, A) be a measurable space and B(R) the σ-algebra Borel on R.

Definition 2.2.1. A function X : Ω → R is called A-measurable if

X −1 (B) := {w : X(w) ∈ B} ∈ A for all B ∈ B(R).

An A-measurable function X is called a random variable.

Theorem 2.2.2. Let X : Ω → R. The following statements are equivalent

1. X is a random variable.

2. {w : X(w) ≤ a} ∈ A for all a ∈ R.


2.2. RANDOM VARIABLES ON A GENERAL PROBABILITY SPACE 34

Proof. Claim (1) ⇒ (2) is self-evident, we will prove: (2) ⇒ (1). Let

C = {B ∈ B(R) : X −1 (B) ∈ A}.

We have C is a σ-algebra and it contains all sets with the form (−∞, a] for every a ∈ R. Thus C
contains B(R). On the other hand, C ⊂ B(R), so C = B(R). This concludes our proof.

Example 2.2.3. Let (Ω, A) be a measurable space. For each subset B of Ω one can verifies
that IB is a random variable iff B ∈ A. More general, if xi ∈ R and Ai ∈ A for all i belongs to
P
some countable index set I, then X(w) = i∈I xi IAi (w) is also a random variable. We call such
random variable X discrete random variable. When I is finite then X is called simple random
variable.

Definition 2.2.4. A function ϕ : Rd → R is called Borel measurable if X −1 (B) ∈ B(Rd ) for all
B ∈ B(R).

Remark 3. It implies from the above definition that every continuous function is Borel. Con-
sequently, all the functions (x, y) 7→ x+y, (x, y) 7→ xy, (x, y) 7→ x/y, (x, y) 7→ x∨y, (x, y) 7→ x∧y
are Borel, where x ∨ y = max(x, y), x ∧ y = min(x, y).

Theorem 2.2.5. Let X1 , . . . , Xd be random variables defined on a measurable space (Ω, A) and
ϕ : Rd → R a Borel function. Then Y = ϕ(X1 , . . . , Xd ) is also a random variable.

Proof. Let: X(w) = (X1 (w), . . . , Xd (w)) is the function on (Ω, A) and takes values in Rd . For
every a1 , . . . , ad ∈ R we have:
d
Y d
 \
X −1
(−∞, ai ] = {w : Xi (w) ≤ ai } ∈ A.
i=1 i=1

This implies X −1 (B) ∈ A for every B ∈ B(Rd ). Hence, for every C ∈ B(Rd ), B := ϕ−1 (C) ∈
B(Rd ). Thus,
Y −1 (C) = X −1 (ϕ−1 (C)) ∈ A,
i.e. Y is the random variable.

Corollary 2.2.6. If X and Y are random variables, so also are X ±Y, XY, X ∧Y, X ∨Y, |X|, X + :=
X ∨ 0, X − = (−X) ∨ 0 and X/Y (if Y 6= 0).

Theorem 2.2.7. If X1 , X2 , . . . are random variables then so are supn Xn , inf n Xn , lim supn Xn ,
lim inf n Xn

It follows from Theorem 2.2.7 that if the sequence of random variables (Xn )n≥1 point-wise
converges to X, i.e. Xn (w) → X(w) for all w ∈ Ω, then X is a random variable.
2.2. RANDOM VARIABLES ON A GENERAL PROBABILITY SPACE 35

2.2.2 Structure of random variables


Theorem 2.2.8. Let X be a random variable defined on a probability space (Ω, A).
1. There exists a sequence of discrete random variables which uniformly point-wise con-
verges to X.

2. If X is non-negative then there exists a sequence of simple random variables Yn such that
Yn ↑ X.
Proof. 1. For each n ≥ 1, denote Xn (w) = nk if nk ≤ X(w) < k+1
n
for some k ∈ Z. Xn is a
1
discrete random variable and |Xn (w) − X(w)| ≤ n for every w ∈ Ω. Hence, the sequence
(Xn ) converges uniformly in w to X.

2. Suppose that X ≥ 0. For each n ≥ 1, denote Yn (w) = 2kn if 2kn ≤ X(w) < k+1 2n
for some
n n n
k ∈ {0, 1, . . . , n2 −1} and Yn (w) = 2 if X(w) ≥ 2 . We can easily verify that the sequence
of simple random variables (Yn ) satisfying Yn (w) ↑ X(w) for all w ∈ Ω.

Definition 2.2.9. Let X be a random variable defined on a measurable space (Ω, A).

σ(X) := {X −1 (B) : B ∈ B(R)}

is called σ-algebra generated by X.


Theorem 2.2.10. Let X be a random variable defined on a measurable space (Ω, A) and Y a
function Ω → R. Then Y is σ(X)-measurable iff there exists a Borel function ϕ : R → R such
that Y = ϕ(X).
Proof. The sufficient condition is evident. We prove the necessary condition. Firstly, suppose
Y is a discrete random variable taking values y1 , y2 , . . . Since Y is σ(X)-measurable, sets An =
{w : Y (w) = yn } ∈ σ(X). By definition of σ(X), there exists a sequence Bn ∈ B(R) such that
An = X −1 (Bn ). Denote
i=1 Bi ∈ B(R), n ≥ 1.
Cn = Bn \ ∪n−1
We have sets Cn are pairwise disjoint and X −1 (Cn ) = An for every n. Consider the Borel func-
tion ϕ defined by
X
ϕ(x) = yn ICn (x),
n≥1

we have Y = ϕ(X).
In general case, by Theorem 2.2.8, there exists a sequence of discrete σ(X)-measurable
functions Yn which uniformly converges to Y . Thus, there exists Borel functions ϕn such that
Yn = ϕn (X). Denote
B = {x ∈ R : ∃ lim ϕn (x)}.
n
Clearly, B ∈ B(R) and B ⊃ X(Ω). Let: ϕ(x) = limn ϕn (x)IB (x). We have Y = limn Yn =
limn ϕn (X) = ϕ(X).
2.3. DISTRIBUTION FUNCTIONS 36

2.3 Distribution Functions

2.3.1 Definition
Definition 2.3.1. Let X be a real valued random variable.

FX (x) = P[X < x], x ∈ R,

is called distribution funtion of X.

One can verifies that F = FX satisfies the following properties

1. F is non-decreasing: if x ≤ y then F (x) ≤ F (y);

2. F is left continuous and has right limit at any point;

3. limx→−∞ F (x) = 0, limx→+∞ F (x) = 1.

On the other hand, for any function F : R → [0, 1] satisfying the these three conditions there
exists a (unique) probability measure µ on (R, B(R)) such that F (x) = µ((−∞, x)), for all x ∈ R
(See [13], section 2.5.2).
If X and Y has the same distribution function we say X and Y are equal in distribution
d
and denote X = Y .

Definition 2.3.2. If the distribution function FX has the form


Z a
FX (a) = P[X < a] = fX (x)dx, ∀a ∈ R
−∞

we say that X has a density function f .

The density function f = fX has the following properties:


R +∞
1. f (x) ≥ 0 for all x ∈ R and −∞ f (x)dx = 1.
Rb
2. P[a < X < b] = a f (x)dx for any a < b. Moreover, for any A ∈ B(R), it holds
Z
P[X ∈ A] = f (x)dx. (2.1)
A

As a consequence we see that if X has a density then P[X = a] = 0 for all a ∈ R.


2.3. DISTRIBUTION FUNCTIONS 37

2.3.2 Examples
Uniform distribution U [a, b]


1

b−a
if a ≤ x ≤ b,
f (x) =
0 otherwise,
is called the Uniform distribution on [a, b] and denoted by U [a, b]. The distribution function
corresponds to f is 
0


 if x < a,
F (x) = x−a b−a
if a ≤ x ≤ b,


1 if x > b.

Exponential distribution Exp(λ)

Suppose λ > 0. X has a exponential distribution with rate λ, denoted X ∼ Exp(λ), if X


takes values in (0, ∞) and its density is given by

fX (x) = λ−1 e−x/λ I(0,∞) (x).

The distribution function of X is

FX (x) = (1 − λ−1 e−x/λ )I(0,∞) (x).

Normal distribution N(a, σ 2 )

1 (x−a)2
f (x) = √ e− 2σ 2 , x ∈ R,
2πσ 2
is called the Normal distribution with mean a and variance σ 2 and denoted by N(a, σ 2 ). When
a = 0 and σ 2 = 1, N(0, 1) is called the Standard normal distribution.

Gamma distributiton G(α, λ)

xα−1 e−x/λ
fX (x) = I(0,∞) (x)
Γ(α)λα
is called the Gamma distribution with parameters α, λ(α, λ > 0); Γ denotes the gamma func-
R∞
tion Γ(α) = 0 xα−1 e−x dx. In particular, an Exp(λ) distribution is G(1, λ) distribution. The
gamma distribution is frequently a probability model for waiting times; for instance, in life
testing, the waiting time until ”death” is a random variable which is frequently modeled with
a gamma distribution.
2.4. EXPECTATION 38

2.4 Expectation

2.4.1 Construction of expectation


Definition 2.4.1. Let X be a simple random variable which can be written in the form
n
X
X= ai I A i (2.2)
i=1

where ai ∈ R and Ai ∈ A for all i = 1, . . . , n. Expectation of X (or integration of X with respect


to probability measure P) is defined to be
n
X
E[X] := ai P(Ai ).
i=1

Denote Ls = Ls (Ω, A, P) the set of simple random variable. It should be noted that a simple
random variable has of course many different representations of the form (2.2). However,
E[X] does not depend on the particular representation chosen for X.
Let X and Y be in Ls . We can write
n
X n
X
X= ai IAi , and Y = bi IAi .
i=1 i=1

for some subsets Ai which form a measurable partition of Ω. Then for any α, β ∈ R, αX + βY
is also in Ls and
Xn
αX + βY = (ai + bi )IAi .
i=1

Thus E[αX + βY ] = αE[X] + βE[Y ]; that is expectation is linear on Ls . Furthermore, expecta-


tion is a positive operator, i.e., if X ≤ Y , we have ai ≤ bi for all i, and thus E[X] ≤ E[Y ].
Next we define expetation for non-negative random variables. For X non-negative, i.e.
X(Ω) ⊂ [0, ∞], denote

E[X] = sup{EY : Y ∈ Ls and 0 ≤ Y ≤ X}. (2.3)

This supremum always exists in [0, ∞]. It follows from the positivity of expectation operator
that the definition above for E[X] coincides with Definition 2.4.1 on Ls .
Note that EX ≥ 0 but it may happen that EX = +∞ even when X is never equal to +∞.

Definition 2.4.2. 1. A random variable X is called integrable if E[|X|] < ∞. In this case, its
expectation is defined to be

E[X] = E[X + ] − E[X − ]. (2.4)


R R
We also write E[X] = X(w)dP(w) = XdP.
2.4. EXPECTATION 39

2. If E[X + ] and E[X − ] are not both equal to +∞ then the expectation of X is still defined
and given by (2.4) where we use the convention that +∞ + a = +∞ and −∞ + a = −∞
for any a ∈ R.

If X ≥ 0 then X = X + and X − = 0. Therefore Definition 2.4.2 again coincides with


definition (2.3) on set of non-negative random variables. We denote by L1 = L1 (Ω, A, P) the
set of integrable random variables.

Lemma 2.4.3. Let X be a non-negative random variable and (Xn )n≥1 a sequence of simple
random variables increasing to X. Then E[Xn ] ↑ E[X] (even if E[X] = ∞).

Proof. We have (EXn )n≥1 is the increasing sequence and upper bounded by EX by Definition
(2.3) so (EXn )n≥1 is convergent to a with a ≤ EX. To prove a = EX, we only show that for
every simple random variable Y satisfying 0 ≤ Y ≤ X, we have EY ≤ a.
Indeed, suppose Y takes m different values y1 , . . . , ym . Let Ak = {w : Y (w) = yk }. For each
 ∈ (0, 1], consider the sequence Yn, = (1 − )Y I{(1−)Y ≤Xn } . We have Yn, is simple random
variable, Yn, ≤ Xn so
EYn, ≤ EXn ≤ a for every n. (2.5)
On the other hand, Y ≤ limn Xn so for every w ∈ Ω, there exists n = n(w) such that (1 −
)Y (w) ≤ Xn (w), i.e. Ak ∩ {w : (1 − )Y (w) ≤ Xn (w)} → Ak as n → ∞. We have
m
X  
EYn, = (1 − ) yk P Ak ∩ [(1 − )Y ≤ Xn ]
k=1
Xm
→ (1 − ) yk P(Ak ) = (1 − )EY, as n → ∞.
k=1

Asscociate with (2.5), we have (1 − )EY ≤ a for every  ∈ (0, 1], i.e. EY ≤ a.

Theorem 2.4.4. 1. L1 is a vector space on R and expectation is an linear operator on L1 , i.e.,


for any X, Y ∈ L1 and x, y ∈ R, one has αX + βY ∈ L1 and

E(αX + βY ) = αEX + βEY.

2. If 0 ≤ X ≤ Y and Y ∈ L1 then X ∈ L1 and EX ≤ EY .

Proof. Statement 2 follows exactly from equation 2.3. To prove statement 1, firstly we remark
that if X and Y are two non-negative random variables and α, β ≥ 0, by Theorem 2.2.8 there
exist two increasing non-negative sequences (Xn ) and (Yn ) in Ls converging to X and Y re-
spectively. Hence, αXn + βYn are also simple non-negative random variables, and convege
to αX + βY . Applying linear and non-negative properties of expectation operator on Ls and
Lemma 2.4.3, we have E(αX + βY ) = αEX + βEY.
2.4. EXPECTATION 40

Now we prove Theorem 2.4.4. Consider two random variables X, Y ∈ L1 . Since |αX +
βY | ≤ |α||X| + |β||Y |, αX + βY ∈ L1 . We have: if α > 0,

E(αX) = E((αX)+ ) − E((αX)− ) = E(α(X + )) − E(α(X − )) = αE(X + ) − αE(X − ) = αEX.

Similarly to α < 0, we also have

E(αX) = E((αX)+ ) − E((αX)− ) = E(−α(X − )) − E(−α(X + )) = −αE(X − ) + αE(X + ) = αEX,

i.e.
E(αX) = αE(X) for every α ∈ R. (2.6)
On the other hand, let Z = X + Y we have Z + − Z − = X + Y = X + + Y + − (X − + Y − ), so
Z + + X − + Y − = Z − + X + + Y + . Thus E(Z + ) + E(X − ) + E(Y − ) = E(Z − ) + E(X + ) + E(Y + ),
then
EZ = E(Z + ) − E(Z − ) = E(X + ) + E(Y + ) − E(X − ) − E(Y − ) = EX + EY.
Asscociate with (2.6) we obtain

E(αX + βY ) = E(αX) + E(βY ) = αEX + βEY.

An event A happens almost surely if P(A) = 1. Thus we say X equals Y almost surely if
P[X = Y ] = 1 and denote X = Y a.s.

Corollary 2.4.5. 1. If Y ∈ L1 and |X| ≤ Y , then X ∈ L1 .

2. If X ≥ 0 a.s. and E(X) < ∞, then X < ∞ a.s.

3. If E(|X|) = 0 then X = 0 a.s.

Proof. 2) Let A = {w : X(w) = ∞}. For every n, we have X(w) ≥ X(w)IA (w) ≥ nIA (w) so
E(X) ≥ nP(A) for every n. Thus P(A) ≤ E(X)
n
→ 0 as n → ∞. From this, we have P(A) = 0.
3) Let An = {w : |X(w)| ≥ 1/n}. We have (An )n≥1 is the decreasing sequence and P(X 6=
0) = limn→∞ P(An ). Moreover,
1
IA (w) ≤ |X(w)|IAn (w) ≤ |X(w)|
n n
so P(An ) ≤ nE|X| = 0 for every n. Thus P(A) = 0 i.e. X = 0 a.s.

Theorem 2.4.6. Let X and Y be integrable random variables. If X = Y a.s. then E[X] = E[Y ].

Proof. Firstly, we consider the case: X and Y are non-negative. Let A = {w : X(w) 6= Y (w)}.
We have P(A) = 0. Moreover,

EY = E(Y IA + Y IAc ) = E(Y IA ) + E(Y IAc ) = E(Y IA ) + E(XIAc ).


2.4. EXPECTATION 41

Suppose (Yn ) is a sequence of simple random variables increasing to Y . Hence, (Yn IA ) is also a
sequence of simple random variables increasing to (Y IA ). Suppose for each n ≥ 1, the random
variable Yn is bouned by Nn , so

0 ≤ E(Yn IA ) ≤ E(Nn IA ) = Nn P(A) = 0

for each n. Hence E(Y IA ) = 0. Similarly, E(XIA ) = 0. Thus EY = EX.


In general case, from X = Y a.s. we can easily find that X + = Y + and X − = Y − a.s.. Thus,
we also have EX = E(X + ) − E(X − ) = E(Y + ) − E(Y − ) = EY.

2.4.2 Some limit theorems


Theorem 2.4.7 (Monotone convergence theorem). If the random variables Xn are non-negative
and increasing a.s. to X, then limn→∞ E[Xn ] = E[X] (even if E[X] = ∞).

Proof. For each n, let (Yn,k )k≥1 be a sequence of simple random variables increasing to Xn and
let Zk = maxn≤k Yn,k . Then (Zk )k≥1 is the sequence of simple non-negative random variables,
and thus there exists Z = limk→∞ Zk . Also

Yn,k ≤ Zk ≤ X ∀n ≤ k

which implies that


Xn ≤ Z ≤ X a.s.
Next let n → ∞ we have Z = X a.s. Since expectation is a positive operator, we have

EYn,k ≤ EZk ≤ EXk ∀n ≤ k.

Fix n and let k → ∞, using Lemma 2.4.3 we obtain

EXn ≤ EZ ≤ lim EXk .


k→∞

Now let n → ∞ to obtain


lim EXn ≤ EZ ≤ lim EXk .
n→∞ k→∞

Since the left and right sides are the same, X = Z a.s., we deduce the result.

Theorem 2.4.8 (Fatou’s lemma). If the random variables Xn satisfy Xn ≥ Y a.s for all n and
some Y ∈ L1 . Then
E[lim inf Xn ] ≤ lim inf E[Xn ].
n→∞ n→∞

Proof. Firstly we prove Theorem to the case Y = 0. Let Yn = inf k≥n Xk . We have (Yn ) is the
sequence of non-decreasing random variables and

lim Yn = lim inf Xn .


n→∞ n→∞
2.4. EXPECTATION 42

Since Xn ≥ Yn , we have EXn ≥ EYn . Asscociate with monotone convergence theorem to the
sequence Yn , we obtain

lim inf EXn ≥ lim EYn = E( lim Yn ) = E(lim inf Xn ).


n→∞ n→∞ n→∞ n→∞

The general case follows from appling the above result to the sequence of non-negative ran-
dom variables X̂n := Xn − Y.

Theorem 2.4.9 (Lebesgue’s dominated convergence theorem). If the random variables Xn


converge a.s. to X and supn |Xn | ≤ Y a.s. for some Y ∈ L1 . We have X, Xn ∈ L1 and

lim E[|Xn − X|] = 0.


n→∞

Proof. Since |X| ≤ Y , X ∈ L1 . Let Zn = |Xn − X|. Since Zn ≥ 0 and −Zn ≥ −2Y , applying
Fatou Lemma to Zn and −Zn , we obtain

0 = E(lim inf Zn ) ≤ lim inf EZn ≤ lim sup EZn = − lim inf E(−Zn ) ≤ −E(lim inf (−Zn )) = 0.
n→∞ n→∞ n→∞ n→∞ n→∞

Thus limn→∞ EZn = 0 i.e. limn→∞ E(|Xn − X|) = 0.

2.4.3 Some inequalities


Theorem 2.4.10. 1. (Cauchy-Schwarz’s inequality) If X, Y ∈ L2 then XY ∈ L1 and

|E(XY )|2 ≤ E(X 2 )E(Y 2 ). (2.7)

2. L2 ⊂ L1 and if X ∈ L2 , then (EX)2 ≤ E(X 2 ).

3. L2 is a vector space on R, i.e., for any X, Y ∈ L2 and α, β ∈ R, we have αX + βY ∈ L2 .

Proof. If E(X 2 )E(Y 2 ) = 0 then XY = 0 a.s. Thus |E(XY )|2 = E(X 2 )E(Y 2 ) = 0.
p
If E(X 2 )E(Y 2 ) 6= 0, applying the inequality 2|ab| ≤ a2 + b2 for a = X/ E(X 2 ) and b =
p
Y / E(Y 2 ) and then taking expectation for two sides, we obtain
 XY   X2   Y2 
2E p ≤E + E = 2.
E(X 2 )E(Y 2 ) E(X 2 ) E(Y 2 )

Hence we have (2.7).


Applying (2.7) for Y = 1 we obtain the second claim. The third claim follows from (2.7)
and the linearity of expectation.

If X ∈ L2 , we denote
DX = E[(X − EX)2 ].
DX is called the variance of X. Using the linearity of expectation operator, one can verify that
DX = E(X 2 ) − (EX)2 .
2.4. EXPECTATION 43

Theorem 2.4.11. 1. (Markov’s inequality) Suppose X ∈ L1 , then for any a > 0, it holds
E(|X|)
P(|X| ≥ a) ≤ .
a

2. (Chebyshev’s inequality) Suppose X ∈ L2 , then for any a > 0, it holds


DX
P(|X − EX| ≥ a) ≤ .
a2
Proof. 1) Since aI{|X|≥a} (w) ≤ |X(w)|I{|X|≥a} (w) ≤ |X(w)| for every w ∈ Ω. Taking expectation
for two sides, we obtain aP(|X| ≥ a) ≤ E(|X|).
2) Applying Markov’s inequality’, we have
DX
P(|X − EX| ≥ a) = P(|X − EX|2 ≥ a2 ) ≤ .
a2

2.4.4 Expectation of random variable with density


Theorem 2.4.12. Suppose that X has a density function f . Let h : R → R be a Borel function.
We have Z
E(h(X)) = h(x)f (x)dx.

provided that either E(|h(X)|) < ∞ or h non-negative.

Proof. Firstly, we consider the case h ≥ 0. Then there exists a sequence of simple non-
Pkn n n +
negative Borel functions (hn ) increasing to h. Suppose hn = i=1 ai IAi for ai ∈ R and
n

Ani ∈ B(R) for every i. By monotone convergence theorem


kn
X
E(h(X)) = E(lim hn (X)) = lim E(hn (X)) = lim hn (ani )P[X ∈ Ani ].
n n n
i=1

Applying the property (2.1) and monotone convergence theorem, we obtain


kn
X Z Z Z
E(h(X)) = lim hn (ani ) f (x)dx = lim f (x)hn (x)dx = f (x)h(x)dx.
n An n
i=1 i

R
Thus, if h is non-negative, we usually have E(h(X)) = h(x)f (x)dx.
In general case, applying above result for h+ and h− we deduce this proof.

Example 2.4.13. Let X ∼ Exp(1). Applying Theorem 2.4.12 for h(x) = x and h(x) = x2 respec-
tively, we have Z ∞
EX = xe−x dx = 1,
0
and Z ∞
EX = 2
x2 e−x dx = 2.
0
2.5. RANDOM ELEMENTS 44

2.5 Random elements

2.5.1 Definitions
Definition 2.5.1. Let (E, E) be a measure space. A function X : Ω → E is called A/E-measurable
or random element if X −1 (B) ∈ A for all B ∈ E. The function
PX (B) = P(X −1 (B)), B ∈ E,
is called probablity distribution of X on (E, E).
When (E, E) = (Rd , B(Rd )), we call X a random vector.
Let X = (X1 , . . . , Xd ) be a d-dimensional random vector defined on (Ω, A, P). The distri-
bution function of X is defined by
F (x) = P[X < x] = P[X1 < x1 , . . . , Xd < xd ], x ∈ Rd .
We can easily verify that F satisfying the following properties:
1. 0 ≤ F (x) ≤ 1 for all x ∈ Rd .

2. limxk →−∞ F (x) = 0 for all k = 1, . . . , d.

3. limx1 →+∞,...,xd →+∞ F (x) = 1.

4. F is left continuous.
The random vector X has a density f : Rd → R+ if f is a non-negative Borel measurable
function satisfying Z
F (x) = f (u)du, for any x ∈ Rd .
u<x
This equation implies that
Z
P[X ∈ B] = f (x)dx, for all B ∈ B(Rd ).
B
In particular, we have
Z Z 
P[X1 ∈ B1 ] = P[X ∈ B1 × R d−1
]= f (x1 , . . . , xd )dx2 . . . dxd dx1 for all B1 ∈ B(Rd ).
B1 Rd−1

This implies that if X = (X1 , . . . , Xd ) has a density f then X1 also has a density given by
Z
fX1 (x1 ) = f (x1 , x2 , . . . , xd )dx2 . . . dxd , for all x1 ∈ R. (2.8)
Rd−1
A similar argument as the proof Theorem 2.4.12 yields,
Theorem 2.5.2. Let X = (X1 , . . . , Xd ) be a random vector which has density function f , ϕ :
Rd → R a Borel measurable function. We have
Z
E[ϕ(X)] = ϕ(x)f (x)dx
Rd
R
provided that ϕ is non-negative or Rd |ϕ(x)|f (x)dx < ∞.
2.6. INDEPENDENT RANDOM VARIABLES 45

2.5.2 Example
Multivariate normal distribution

Let a = (a1 , . . . , ad ) be a d-dimensional vector and M = (mi,j )di,j=1 a d × d-square matrix.


Suppose that M is symmetric and positive define. Denote A = M −1 . The random vector
X = (X1 , . . . , Xd ) has normal distribution N(a, M ) if its density p verifies

det A n 1

o
p(x) = exp − (x − a)A(x − a) ,
(2π)d/2 2

where (x − a)A(x − a)∗ = i j aij (xi − ai )(xj − aj ).


P P

Polynomial distribution

The d-dimensional random vector X has a polynomial distribution with parameters n, p1 , . . . , pd ,


denoted by X ∼ M U T (n; p1 , . . . , pd ), for n ∈ N∗ and p1 , . . . , pd ≥ 0, if

n! kd+1
P[X1 = k1 , . . . , Xd = kd ] = pk11 pk22 . . . pd+1 ,
k1 !k2 ! . . . kd+1 !

where pd+1 = 1 − (p1 + . . . + pd ), 0 ≤ ki ≤ n and kd+1 = n − (k1 + . . . + kd ) ≥ 0.

2.5.3 Density of function of random vectors


Using Theorem 2.5.2 and the change of variables formula we have the following useful
result.

Theorem 2.5.3. Let X = (X1 , . . . , Xn ) have a joint density f . Suppose g : Rn → Rn is continu-


ously differentiable and injective, with Jacobian given by
 
∂gi
Jg (x) = (x)
∂xj i,j=1,...,d

never vanishes. Then Y = g(X) has density



f (g −1 (y))| det J −1 (y)| if y ∈ g(Rd )
X g
fY (y) =
0 otherwise.

2.6 Independent random variables


Definition 2.6.1. 1. The sub-σ-algebras (Ai )i∈I of A are independent if for all finite subset
J of I and for all Ai ∈ Ai , Y
P(∩i∈J Ai ) = P(Ai ).
i∈J
2.6. INDEPENDENT RANDOM VARIABLES 46

2. The (Ei , Ei )-valued random variables (Xi )i∈I are independent if the σ-algebras (Xi−1 (Ei ))i∈I
are independent.

A class C of subsets of Ω is called a π-system if is closes under finite intersections, so that


A ∩ B ∈ C implies A, B ∈ C. Furthermore, a class D is a λ-system if contains Ω and is closed
under proper differences and increasing limits, i.e.,

• A1 , A2 , . . . ∈ D with An ↑ A implies A ∈ D;

• A, B ∈ D with A ⊂ B implies B\A ∈ D.

Lemma 2.6.2 (Monotone classes). Let C, D be classes of subsets of Ω where C is a π-system and
D is a λ-system such that C ⊂ D. Then σ(C) ⊂ D.

Lemma 2.6.3. Let G and F be sub-σ-algebras of A. Let G1 and F1 be π-systems such that σ(G1 ) =
G and σ(F1 ) = F. Then G is independent of F if F1 and G are independent, i.e.,

P(F ∩ G) = P(F )P(G), F ∈ F1 , G ∈ G 1 .

Proof. Suppose that F1 and G1 are independent. We fix any F ∈ F1 and define

σF = {G ∈ G : P(F ∩ G) = P(F )P(G)}.

Then σF is a λ-system containing π-system G1 . Applying monotone classes theorem, we have


σF = G, it means that
P(F ∩ G) = P(F )P(G), F ∈ F1 , G ∈ G.
Next, for any G ∈ G we define

σG = {F ∈ F : P(F ∩ G) = P(F )P(G)}.

We also have that σG is a λ-system containing π-system F1 so that σG = F, which yields the
desired property.

Theorem 2.6.4. Let X and Y be two random variables. The following statements are equiva-
lent:

(i) X is independent of Y ;

(ii) FX,Y (x, y) = FX (x)FY (y) for all x, y ∈ R;

(iii) f (X) and g(Y ) are independent for any Borel functions f, g : R → R;

(iv) E[f (X)g(Y )] = E[f (X)]E[g(Y )] for any Borel function f, g : R → R which are either posi-
tive or bounded.
2.7. COVARIANCE 47

Proof. (i) ⇒ (ii): Suppose X be independent of Y , then two events {w : X(w) < x} và {w :
Y (w) < y} are also independent for every x, y ∈ R. We have (ii).
(ii) ⇒ (i): Since the set of events {w : X(w) < x}, x ∈ R, is a π-system generating σ(X)
and {w : X(w) < y}, y ∈ R, is a π-system generating σ(Y ) , so applying Lemma 2.6.3 we have
X is independent of Y .
(i) ⇒ (iii): For every A, B ∈ B(R), we have f −1 (A), g −1 (B) ∈ B(R) then

P(f (X) ∈ A, g(Y ) ∈ B) = P(X ∈ f −1 (A), Y ∈ g −1 (B))


= P(X ∈ f −1 (A))P(Y ∈ g −1 (B)) = P(f (X) ∈ A)P(g(Y ) ∈ B).

Thus, f (X) is independent of g(Y ).


(iii) ⇒ (i): We choose f (x) = g(x) = x.
(i) ⇒ (iv): Since (i) is equivalent of (iii), we only prove

E(XY ) = E(X)E(Y ) for every random variable which is integrable or non-negative X and Y.

Firstly, we suppose that: X and Y are non-negative. By Theorem 2.2.8 there exists a sequence
of simple random variables Xn = ki=1 ai IAi increasing to X and Yn = lj=1
Pn Pn
bj IBj increasing to
Y for Ai ∈ σ(X) và Bi ∈ σ(Y ). Applying monotone convergence theorem, we have
X ln
kn X kn X
X ln
E(XY ) = lim E(Xn Yn ) = lim ai bj P(Ai Bj ) = lim ai bj P(Ai )P(Bj )
n→∞ n→∞ n→∞
i=1 j=1 i=1 j=1
kn
X ln
 X 
= lim ai P(Ai ) bj P(Bj ) = lim E(Xn )E(Yn ) = E(X)E(Y ).
n→∞ n→∞
i=1 j=1

In general case, we write X = X + − X − và Y = Y + − Y − . Since (iii), we have X ± are indepen-


dent of Y ± , then

E(XY ) = E(X + Y + ) + E(X − Y − ) − E(X + Y − ) − E(X − Y + )


= E(X + )E(Y + ) + E(X − )E(Y − ) − E(X + )E(Y − ) − E(X − )E(Y + ) = E(X)E(Y ).

(iv) ⇒ (i): Choose f = I(−∞,x) and g = I(−∞,y) .


(iv) ⇒ (v) is evident.

2.7 Covariance
Definition 2.7.1. The covariance of random variables X, Y ∈ L2 is defined by

cov(X, Y ) = E[(X − EX)(Y − EY )].

The correlation coefficient of X, Y ∈ L2 is


cov(X, Y )
ρ(X, Y ) = √ .
DXDY
2.8. CONDITIONAL EXPECTATION 48

Using the linearity of expectation operator, we have

cov(X, Y ) = E(XY ) − EXEY.



Furthermore, it follows from Cauchy-Schwarz’s inequality that |cov(X, Y )| ≤ DXDY , it
means
|ρ(X, Y )| ≤ 1.

Example 2.7.2. Let X and Y be independent random variables whose distributions are N (0, 1).
Denote Z = XY and T = X − Y . We have

cov(Z, T ) = E(XY (X − Y )) − E(XY )E(X − Y ) = 0,

and
cov(Z, T 2 ) = E(XY (X − Y )2 ) − E(XY )E((X − Y )2 ) = −2,
since E(XY ) = EXEY = 0, E(X 3 Y ) = E(X 3 )EY = 0, E(XY 3 ) = EXE(Y 3 ) = 0 and E(X 2 Y 2 ) =
E(X 2 )E(Y 2 ) = 1. Thus Z and T are uncorrelated random variables but not independent.

Proposition 2.7.3. Let (Xn )n≥1 be a sequence of pair-wise uncorrelated random variables. Then

D(X1 + . . . + Xn ) = D(X1 ) + . . . + D(Xn ).

Proof. We have
h 2 i
D(X1 + . . . + Xn ) = E (X1 − EX1 ) + . . . (Xn − EXn )
n
X X
= E[(Xi − EXi )2 ] + 2 E[(Xi − EXi )(Xj − EXj )]
i=1 1≤i<j≤n
Xn n
X
= E[(Xi − EXi )2 ] = D(Xi ),
i=1 i=1

since E[(Xi − EXi )(Xj − EXj )] = E(Xi Xj ) − E(Xi )E(Xj ) = 0 for any 1 ≤ i < j ≤ n.

2.8 Conditional Expectation

2.8.1 Definition
Definition 2.8.1. Let (Ω, A, P) be a probability space and X an integrable random variable.
Let G be a sub-σ-algebra of A. Then there exists a random variable Y such that

1. Y is G-measurable,

2. E[|Y |] < ∞,
2.8. CONDITIONAL EXPECTATION 49

3. for every set G ∈ G, we have Z Z


Y dP = XdP.
G G

Moreover, if Z is another random variable with these properties then P[Z = Y ] = 1. Y is called
a version of the conditional expectation E[X|G] of X given G, and we write Y = E[X|G], a.s.

We often write E[X|Z1 , Z2 , . . .] for E[X|σ(Z1 , Z2 , . . .)].

2.8.2 Examples
Example 2.8.2. Let X be an integrable random variable and G = σ(A1 , . . . , Am ) where (Ai )1≤i≤m
is a measurable partition of Ω. Suppose that P(Ai ) > 0 for all i = 1, . . . , m. Then
n  Z
X 1 
E(X|G) = XdP IAi .
i=1
P(Ai ) Ai

Example 2.8.3. Let X and Z be random variables whose joint density is fX,Z (x, z). We know
R
that fZ (z) = R fX,Z (x, z)dx is density of Z. Define the elementary conditional density fX|Z of
X given Z by 
 fX,Z (x,z) if f (z) 6= 0,
fZ (z) Z
fX|Z (x|z) :=
0 otherwise.
Let h be a Borel function on R such that E[|h(X)|] < ∞. Set
Z
g(z) = h(x)fX|Z (x|z)dx.
R

Then Y = g(Z) is a version of the conditional expectation E[h(X)|Z].


Indeed, for a typical element of σ(Z) which has the form {w : Z(w) ∈ B}, where B ∈ B, we
have
Z Z Z
E[h(X)IB (Z)] = h(x)IB (z)fX,Z dxdz = g(z)IB (z)fZ (z)dz = E[g(Z)IB (Z)].

2.8.3 Properties of conditional expectation


Theorem 2.8.4. Let ξ, η be integrable random variables defined on (Ω, F, P). Let G be a sub-σ-
algebras of F.

1. If c is a constant, then E[c|G] = c a.s.

2. If ξ ≥ η a.s. then E(ξ|G) ≥ E(η|G) a.s.

3. If a, b are constants, then

E(aξ + bη|G) = aE(ξ|G) + bE(η|G).


2.8. CONDITIONAL EXPECTATION 50

4. If G = {∅, Ω}, then E(ξ|G) = E(ξ) a.s.

5. E(ξ|F) = ξ a.s.

6. E(E(ξ|G)) = E(ξ).

7. (Tower property) Let G1 ⊂ G2 be sub-σ-algebras of F then

E(E(ξ|G1 )|G2 ) = E(E(ξ|G2 )|G1 ) = E(ξ|G1 ) a.s.

8. If ξ is independent of G, then E(ξ|G) = E(ξ) a.s.

9. If η is G-measurable and E(|ξη|) < ∞, then

E(ξη|G) = ηE(ξ|G) a.s.

10. Let H be a sub-σ-algebras of F which is independent of σ(G, ξ), then

E ξ|σ(G, H) = E(ξ|G) a.s.




Proof. 1. Statement 1 is evident.


2. Since ξ ≥ η a.s. so A ξdP ≥ A ηdP for every A ∈ G. Hence, A E(ξ|G)dP ≥ A E(η|G)dP for
R R R R

every A ∈ G. Thus, E(ξ|G) ≥ E(η|G) a.s.


3. If A ∈ G, Z Z Z
(aξ + bη)dP = a ξdP + b ηdP
A A A
Z Z Z
= a E(ξ|G)dP + b E(η|G)dP = (aE(ξ|G) + bE(η|G))dP
A A A
From this, we have proof.
4. Since Eξ is measurable with respect to σ-algebra G = {∅, Ω} and if A = ∅ or A = Ω, we have
Z Z
ξdP = EξdP ⇒ E(ξ|G) = E(ξ) a.s.
A A

5. Statement 5 is evident.
6. Using Definition 2.8.1 for G = Ω, we have:
Z Z
E(ξ|G)dP = ξdP ⇒ E(E(ξ|G)) = Eξ a.s.
Ω Ω

7. If A ∈ G, we have: Z Z Z
E[E(ξ|G2 )|G1 ]dP = E(ξ|G2 )dP = ξdP.
A A A
From this and Definition 2.8.1, the first equation is proven. The second one follows from
Statement 5 and remark that E(ξ|G1 ) is G2 -measurable.
8. If A ∈ G, X and IA are independent. Hence, we have:
Z Z
ξdP = E(ξIA ) = Eξ.P(A) = (Eξ)dP ⇒ E(ξ|G) = E(ξ) a.s.
A A
2.8. CONDITIONAL EXPECTATION 51

9. First suppose that ξ and η are non-negative. For η = IA , where A is G-measurable, B ∩ A ∈ G,


so that, by the defining relation,
Z Z Z Z
ηE(ξ|G)dP = E(ξ|G)dP = ξdP = ξηdP,
A B∩A B∩A B

which proves the desired relation for indicators, and hence for simple random variables. Next,
if {ηn , n ≥ 1} are simple random variables, such that ηn ↑ η almost surely as n → ∞ , it follows
that ηn ξ ↑ ηξ and ηn E(ξ|G) ↑ ηE(ξG|) almost surely as n → ∞, from which the conclusion
follows by monotone convergence. The general case follows by the decomposition ξ = ξ + −ξ −
and η = η + − η − .

2.8.4 Convergence theorem


Let (ξn ), ξ and η be random variables defined on (Ω, F, P). Let G be a sub-σ-algebras of F.

Theorem 2.8.5 (Monotone convergence theorem). a) Suppose that ξn ↑ ξ a.s. and there exists

a positive integer m such that E(ξm ) < ∞. Then, E(ξn |G) ↑ E(ξ|G) a.s.
+
b) Suppose that ξn ↓ ξ a.s. and there exists a positive integer m such that E(ξm ) < ∞, then
E(ξn |G) ↓ E(ξ|G) a.s.

Proof. Suppose Eξn−0 < ∞. Hence 0 ≤ ξn + ξn−0 ↑ ξ + ξn−0 , by Theorem ??


Z Z

lim E(ξn + ξn0 |G)dP = lim E(ξn + ξn−0 |G)dP
A n n A
Z Z Z
= lim (ξn + ξn−0 )dP = lim(ξn + ξn−0 )dP = (ξ + ξn−0 )dP
n A A n A
By linearity we have
Z Z Z
lim E(ξn |G)dP = ξdP = E(ξ|G)dP, ∀A ∈ G
A n A A

and then
lim E(ξn |G) = E(ξ|G) a.s.
n
Similarly to claim (b).

Theorem 2.8.6 (Fatou’s lemma). a) If ξn ≤ η, ∀n ≥ 1 a.s., and E(η) < ∞ then

lim sup E(ξn |G) ≤ E(lim sup ξn |G), a.s.


n n

b) If ξn ≥ η, ∀n ≥ 1 a.s., and E(η) > −∞, then

lim inf E(ξn |G) ≥ E(lim inf ξn |G), a.s.


n n

c) If |ξn | ≤ η, ∀n ≥ 1 a.s., and E(η) < ∞, then

E(lim inf ξn |G) ≤ lim inf E(ξn |G) ≤ lim sup E(ξn |G) ≤ E(lim sup ξn |G), a.s.
n n n n
2.8. CONDITIONAL EXPECTATION 52

Theorem 2.8.7 (Lebesgue’s dominated convergence theorem). Suppose that E(η) < ∞, |ξn | ≤
a.s.
η a.s., and ξn −→ ξ. Then,

lim E(ξn |G) = E(ξ|G) and lim E(|ξn − ξ| G) = 0, a.s.


n n

The proofs of Fatou’s lemma and Lebesgue’s dominated convergence theorem are analo-
gous in a similar vein to the proofs of Fatou’s lemma and the Dominated convergence theorem
without conditioning.

Theorem 2.8.8 (Jensen’s inequality). Let ϕ : R → R be a convex function such that ϕ(ξ) is
integrable. Then
E(ϕ(ξ)|G) ≥ ϕ(E(ξ|G)), a.s.

Proof. A result in real analysis is that if ϕ : R → R is convex, then ϕ(x) = supn (an x + bn ) for a
countable collection of real numbers (an , bn ). Then

E(an ξ + bn |G) = an E(ξ|G) + bn .

But E(an ξ + bn |G) ≤ E(ϕ(ξ)|G), hence an E(ξ|G) + bn ≤ E(ξ|G), for every n. Taking the supremum
in n, we get the result.
2
In particular, if ϕ(x) = x2 then E(ξ 2 |G) ≥ E(ξ|G) .

2.8.5 Conditional expectation given a random variable


Since E(ξ|η) is σ(η)-measurable, there exists a measurable function f : R → R such that
E(ξ|η) = f (η). We denote f (x) = E(ξ|η = x).
R
a) Since E(ξ) = E(f (η)) = R f (x)dFη (x),
Z
E(ξ) = E(ξ|η = x)dFη (x). (2.9)
R

b) Let ϕ : R → R be a Borel function such that both ξ and ξϕ(η) are integrable. Then, the
equation
E(ξϕ(η)|η = y) = ϕ(y)E(ξ|η = y)
holds Pη -a.s.
c) If ξ and η are independent, then

E(ξ|η = y) = E(ξ).

Moreover, let ϕ : R2 → R satisfy E|ϕ(ξ, η)| < ∞, then

E(ϕ(ξ, η)|η = y) = E(ϕ(ξ, y)) (Pη − a.s.). (2.10)


2.9. EXERCISES 53

2.9 Exercises

Discrete random variables


2.1. An urn contains five red, three orange, and two blue balls. Two balls are randomly se-
lected. What is the sample space of this experiment? Let X represent the number of orange
balls selected. What are the possible values of X? Calculate expectation and variance of X.

2.2. An urn contains 7 white balls numbered 1,2,...,7 and 3 black ball numbered 8,9,10. Five
balls are randomly selected, (a) with replacement, (b) without replacement. For each of cases
(a) and (b) give the distribution:

1. of the number of white balls in the sample;

2. of the minimum number in the sample;

3. of the maximum number in the sample;

4. of the minimum number of balls needed for selecting a white ball.

2.3. A machine normally makes items of which 4% are defective. Every hour the producer
draws a sample of size 10 for inspection. If the sample contains no defective items he does
not stop the machine. What is the probability that the machine will not be stopped when it
has started producing items of which 10% are defective.

2.4. Let X represent the difference between the number of heads and the number of tails
obtained when a fair coin is tossed n times. What are the possible values of X? Calculate
expectation and variance of X.

2.5. An urn contains N1 white balls and N2 black balls; n balls are drawn at random, (a) with
replacement, (b) without replacement. What is the expected number of white balls in the
sample?

2.6. A student takes a multiple choice test consisting of two problems. The first one has 3
possible answers and the second one has 5. The student chooses, at random, one answer as
the right one from each of the two problems. Find:

a) the expected number, E(X) of the right answers X of the student;

b) the V ar(X).

2.7. In a lottery that sells 3,000 tickets the first lot wins $1,000, the second $500, and five other
lots that come next win $100 each. What is the expected gain of a man who pays 1 dollar to
buy a ticket?
2.9. EXERCISES 54

2.8. A pays 1 dollar for each participation in the following game: three dice are thrown; if
one ace appears he gets 1 dollar, if two aces appear he gets 2 dollars and if three aces appear
he gets 8 dollars; otherwise he gets nothing. Is the game fair, i.e., is the expected gain of the
player zero? If not, how much should the player receive when three aces appear to make the
game fair?

2.9. Suppose a die is rolled twice. What are the possible values that the following random
variables can take on?

1. The maximum value to appear in the two rolls.

2. The minimum value to appear in the two rolls.

3. The sum of the two rolls.

4. The value of the first roll minus the value of the second roll.

2.10. Suppose X has a binomial distribution with parameters n and p ∈ (0, 1). What is the
most likely outcome of X?

2.11. An airline knows that 5 percent of the people making reservations on a certain flight will
not show up. Consequently, their policy is to sell 52 tickets for a flight that can hold only 50
passengers. What is the probability that there will be a seat available for every passenger who
shows up?

2.12. Suppose that an experiment can result in one of r possible outcomes, the ith outcome
having probability pi , i = 1, . . . , r, ri=1 pi = 1. If n of these experiments are performed, and if
P

the outcome of any one of the n does not affect the outcome of the other n − 1 experiments,
then show that the probability that the first outcome appears x1 times, the second x2 times,
and the rth xr times is
n!
px1 px2 · · · pxr r
x1 !x2 ! · · · xr ! 1 2
when x1 + x2 + . . . + xr = n. This is known as the multinomial distribution.

2.13. A television store owner figures that 50 percent of the customers entering his store will
purchase an ordinary television set, 20 percent will purchase a color television set, and 30
percent will just be browsing. If five customers enter his store on a certain day, what is the
probability that two customers purchase color sets, one customer purchases an ordinary set,
and two customers purchase nothing?

2.14. Let X be Geometric. Show that for i, j > 0,

P[X > i + j|X > i] = P[X > j].

2.15. If a fair coin is successively flipped, find the probability that a head first appears on the
fifth trial.
2.9. EXERCISES 55

2.16. A coin having probability p of coming up heads is successively flipped until the rth head
appears. Argue that X, the number of flips required, will be n, n ≥ r, with probability
r−1 r
P[X = n] = Cn−1 p (1 − p)n−r , n ≥ r.

This is known as the negative binomial distribution. Find the expectation and variance of X.

2.17. A fair coin is independently flipped n times, k times by A and n − k times by B. Show that
the probability that A and B flip the same number of heads is equal to the probability that
there are a total of k heads.

2.18. Suppose that we want to generate a random variable X that is equally likely to be either
0 or 1, and that all we have at our disposal is a biased coin that, when flipped, lands on heads
with some (unknown) probability p. Consider the following procedure:

1. Flip the coin, and let 01 , either heads or tails, be the result.

2. Flip the coin again, and let 02 be the result.

3. If 01 and 02 are the same, return to step 1.

4. If 02 is heads, set X = 0, otherwise set X = 1.

(a) Show that the random variable X generated by this procedure is equally likely to be
either 0 or 1.

(b) Could we use a simpler procedure that continues to flip the coin until the last two flips
are different, and then sets X = 0 if the final flip is a head, and sets X = 1 if it is a tail?

2.19. Consider n independent flips of a coin having probability p of landing heads. Say a
changeover occurs whenever an outcome differs from the one preceding it. For instance, if
the results of the flips are HHT HT HHT , then there are a total of five changeovers. If p = 1/2,
what is the probability there are k changeovers?

2.20. Let X be a Poisson random variable with parameter λ. What is the most likely outcome
of X?

2.21. * Poisson Approximation to the Binomial Let P be a Binomial probability with probabil-
ity of success p and number of trial n. Let λ = np. Show that
−k
λk
        
λ n n−1 n−k+1 λ
P (k successes) = 1− ... 1− .
k! n n n n n

Let n → ∞ and let p change so that λ remains constant. Conclude that for small p and large n,

λk −λ
P (k successes) = e , where λ = pn.
k!
2.9. EXERCISES 56

2.22. * Let X be the Binomial B(n, p).

a) Show that for λ > 0 and ε > 0 then

P (X − np > nε) ≤ E[exp(λ(X − np − nε))].

b) With a > 0 show that


p
X p(1 − p) p √
P (| − p |> a) ≤ min{ p(1 − p, a n}.
n a2 n

2.23. Let X be Poisson (λ).


2λλ e−λ
a) With λ a positive integer. Show E{|X − λ|} = (λ−1)!
,

b) Show for r = 2, 3, 4, . . .,

E{X(X − 1) . . . (X − r + 1)} = λr .

2.24. Let X be Geometric (p).


 1  p

a) Show E 1+X = log (1 − p) p−1 .
r!pr
b) Show for r = 2, 3, 4, . . ., E{X(X − 1) . . . (X − r + 1)} = (1−p)r
.

2.25. Suppose X takes all its values in N = {0, 1, 2, . . .}. Show that

X
E[X] = P[X > n].
n=0

The following exercises use the additivity of expectation

2.26. Liam’s bowl of spaghetti contains n strands. He selects two ends at random and joins
them together. He does this until there are no ends left. What is the expected number of
spaghetti hoops in the bowl?

2.27. Sarah collects figures from cornflakes packets. Each packet contains one figure, and n
distinct figures make a complete set. Find the expected number of packets Sarah needs to
collect a complete set.

2.28. Each packet of the breakfast cereal Soggies contains exactly one token, and tokens are
available in each of the three colours blue, white and red. You may assume that each token
obtained is equally likely to be of the three available colours, and that the (random) colours
of different tokens are independent. Find the probability that, having searched the contents
of k packets of Soggies, you have not yet obtained tokens of every colour.
Let N be the number of packets required until you have obtained tokens of every colour.
Show that E[N ] = 11
2
.
2.9. EXERCISES 57

2.29. Each box of cereal contains one of 2n different coupons. The coupons are organized
into n pairs, so that coupons 1 and 2 are a pair, coupons 3 and 4 are a pair, and so on.
Once you obtain one coupon from every pair, you can obtain a prize. Assuming that the
coupon in each box is chosen independently and uniformly at random from the 2n possibili-
ties, what is the expected number of boxes you must buy before you can claim the prize?

Continuous random variables


2.30. The amount of bread (in hundreds of kilos) that a bakery sells in a day is a random
variable with density 
cx


 for 0 ≤ x < 3,
f (x) = c(6 − x) for 3 ≤ x < 6,


0 otherwise.

a) Find the value of c which makes f a probability density function.

b) What is the probability that the number of kilos of bread that will be sold in a day is, (i)
more than 300 kilos? (ii) between 150 and 450 kilos?

c) Denote by A and B the events in (i) and (ii), respectively. Are A and B independent
events?

2.31. Suppose that the duration in minutes of long-distance telephone conversations follows
an exponential density function:
1
f (x) = e−x/5 for x > 0.
5
Find the probability that the duration of a conversation:

a) will exceed 5 minutes;

b) will be between 5 and 6 minutes;

c) will be less than 3 minutes;

d) will be less than 6 minutes given that it was greater than 3 minutes.

2.32. A number is randomly chosen from the interval (0;1). What is the probability that:

a) its first decimal digit will be a 1;

b) its second decimal digit will be a 5;

c) the first decimal digit of its square root will be a 3?


2.9. EXERCISES 58

2.33. The height of men is normally distributed with mean µ=167 cm and standard deviation
σ=3 cm.

a) What is the percentage of the population of men that have height, (i) greater than 167
cm, (ii) greater than 170 cm, (iii) between 161 cm and 173 cm?

b) In a random sample of four men what is the probability that:

i) all will have height greater than 170 cm;

ii) two will have height smaller than the mean (and two bigger than the mean)?

2.34. Find the constant k and the mean and variance of the population defined by the prob-
ability density function
f (x) = k(1 + x)−3 for 0 ≤ x < ∞
and zero otherwise.

2.35. A mode of a distribution of one random variable X is a value of x that maximizes the pdf
or pmf. For X of the continuous type, f (x) must be continuous. If there is only one such x, it
is called the mode of the distribution. Find the mode of each of the following distributions

1. f (x) = 12x2 (1 − x), 0 < x < 1, zero elsewhere.

2. f (x) = 12 x2 e−x , 0 < x < ∞, zero elsewhere.


1
2.36. A median of a distribution of a random variable X is a value x such that P[X < x] ≤ 2
and P[X ≤ x] ≥ 12 . Find the median of each of the following distribution:

1. f (x) = 3x2 , 0 < x < 1, zero elsewhere.


1
2. f (x) = π(1+x2 )
.

2.37. Let 0 < p < 1. A (100p)th percentile (quantile of order p) of the distribution of a random
variable X is a value ζp such that

P[X < ζp ] ≤ p, and P[X ≤ ζp ] ≥ p.

Find the pdf f (x), the 25th percentile and the 60th percentile for each of the the followin cdfs.

1. F (x) = (1 + ex )−1 , −∞ < x < ∞.


−x
2. F (x) = e−e , −∞ < x < ∞.

3. F (x) = 1
2
+ 1
π
tan−1 (x), −∞ < x < ∞.

2.38. If X is a random variable with the probability density function f , find the probability
density function of Y = X 2 if
2.9. EXERCISES 59

2
(a) f (x) = 2xe−x , for 0 ≤ X < ∞

(b) f (x) = (1 + x)/2, for −1 ≤ X ≤ 1

(c) f (x) = 12 , for − 12 ≤ X ≤ 32 .

2.39. Let X be a standard normal random variable. Denote Y = eX .

1. Find the density of Y . This is known as log-normal distribution.

2. Find the expectation and variance of Y .

2.40. Let X be a uniform distribution U (0, 1). Find the density of the following random vari-
able.

1. Y = − λ1 ln(1 − X).
X
2. Z = ln 1−X . This is known as Logistic distribution.
q
1
3. T = 2 ln 1−X . This is known as Rayleigh distribution.

2.41. Let X have the uniform distribution U (− π2 , − π2 ).

1. Find the pdf of Y = tan X. This is the pdf of a Cauchy distribution.

2. Show that Y is not integrable.

3. Denote Z = (X ∨ (−a)) ∧ a for some a > 0. Find E[Z].

2.42. Let X be a random variable with distribution function F that is continuous. Show that
Y = F (X) is uniform.

2.43. Let F be a distribution function that is continuous and is such that the inverse function
F −1 exists. Let U be uniform on [0, 1]. Show that X = F −1 (U ) has distribution function F .

2.44. 1. Let X be a non-negative random variable satisfying E[X α ] < ∞ for some α > 0.
Show that Z ∞
α
E[X ] = α xα−1 (1 − F (x))dx.
0

2. Let Y be a continuous random variable. Show that


Z +∞
E[Y ] = (P[Y > t] − P[Y < −t])dt.
0
Rb
2.45. Suppose that the density function of X satisfies a
f (x)dx = 1 for some real constants
2
a < b. Show that a < E[X] < b and DX ≤ (b−a)
4
.
2.9. EXERCISES 60

2.46. Let X be a nonnegative random variable with mean µ and variance σ 2 , both finite. Show
that for any b > 0,
1
P[X ≥ µ + bσ] ≤ .
1 + b2
[(x−µ)b+σ]2
Hint: consider the function g(x) = σ 2 (1+b2 )2
.

2.47. Let X be a random variable with mean µ and variance σ 2 , both finite. Show that for any
d > 1,
1
P[µ − dσ < X < µ + dσ] ≥ 1 − 2 .
d
2.48. Divide a line segment into two parts by selecting a point at random. Find the probability
that the larger segment is at least three times the shorter. Assume a uniform distribution.

2.49. Let X be an integrable random variable.

1. Let (An ) be a sequence of events such that limn P(An ) = 0. Show that limn→∞ E[XIAn ] = 0.

2. Show that for any  > 0, there exists a δ > 0 such that for any event A satisfying P(A) < δ,
E[XIA ] < .

2.50. Let (Xn ) be a sequence of non-negative random variable. Show that



X ∞
X
E[ Xn ] = E[Xn ].
n=1 n=1

2.51. Given the probability space (Ω, A, P), suppose X is a non-negative random variable and
E[X] = 1. Define Q : A → R by Q(A) = E[XIA ].

1. Show that Q defines a probability measure on (Ω, A).

2. Show that if P(A) = 0, then Q(A) = 0.

3. Suppose P(X > 0) = 1. Let EQ denote expectation with respect to Q. Show that EQ [Y ] =
EP [Y X].

Random elements
2.52. An urn contains 3 red balls, 4 blue balls and 2 yellow balls. Pick up randomly 2 ball from
that urn and denote X and Y the number of red and yellow balls in the 2 balls, respectively.

1. Make the joint distribution table of X and Y .

2. Are X and Y independent?

3. Find the distribution of Z = XY .


2.9. EXERCISES 61

2.53. Suppose that the joint pmf of X and Y is

P[X = i, Y = j] = Cji e−2λ λj /j!, 0 ≤ i ≤ j.

1. Find the probability mass function of Y .

2. Find the probability mass function of X.

3. Find the probability mass function of Y − X.

2.54. Let X and Y be independent random variables taking values in N with


1
P[X = i] = P[Y = i] = , i = 1, 2, . . .
2i
Find the following probability

1. P[X ∧ Y ≤ i].

2. P[X = Y ].

3. P[X > Y ].

4. P[X divides Y ].

5. P[X ≥ kY ] for a given positive integer k.

2.55. Let X and Y be independent geometric random variables with parameters λ and µ.

1. Let Z = X ∧ Y . Show that Z is geometric and find its parameter.

2. Find the probability that X = Y .

2.56. Let X and Y be independent random variables with uniform distribution on the set
{−1, 1}. Let Z = XY . Show that X, Y, Z are pairwise independent but that they are not mutu-
ally independent.

2.57. * Let n be a prime number greater than 2; and X, Y be independent and uniformly dis-
tributed on {0, 1, . . . , n − 1}. For each r, 0 ≤ r ≤ n − 1, define Zr = X + rY ( mod n). Show that
the random variable Zr , r = 0, . . . , n − 1, are pairwise independent.

2.58. Let (Xn ) be a sequence of independent random variables with P[Xn = 1] = P[Xn =
−1] = 21 for all n. Let Zn = X0 X1 . . . Xn . Show that Z1 , Z2 , . . . are independent.

2.59. Let (a1 , . . . , an ) be a random permutation of (1, . . . , n), equally likely to be any of the n!
possible permutations. Find the expectation of
n
X
L= |ai − i|.
i=1
2.9. EXERCISES 62

2.60. A blood test is being performed on n individuals. Each person can be tested separately.
but this is expensive. Pooling can decrease the cost. The blood samples of k people can be
pooled and analyzed together. If the test is negative, this one test suffices for the group of k
individuals. If the test is positive, then each of the k persons must be tested separately and
thus k + 1 total tests are required for the k people. Suppose that we create n/k disjoint groups
of k people (where k divides n) and use the pooling method. Assume that each person has a
positive result on the test independently with probability p.
(a) What is the probability that the test for a pooled sample of k people will be positive?

(b) What is the expected number of tests necessary?

(c) Describe how to find the best value of k.

(d) Give an inequality that shows for what values of p pooling is better than just testing every
individual.
2.61. You need a new staff assistant, and you have n people to interview. You want to hire the
best candidate for the position. When you interview a candidate, you can give them a score,
with the highest score being the best and no ties being possible. You interview the candidates
one by one. Because of your company’s hiring practices, after you interview the kth candidate,
you either offer the candidate the job before the next interview or you forever lose the chance
to hire that candidate. We suppose the candidates are interviewed in a random order, chosen
uniformly at random from all n! possible orderings.
We consider the following strategy. First, interview m candidates but reject them alL these
candidates give you an idea of how strong the field is. After the mth candidate. hire the
first candidate you interview who is better than all of the previous candidates you have in-
terviewed.
1. Let E be the event that we hire the best assistant, and let Ei be the event that ith candi-
date is the best and we hire him. Determine P(Ei ), and show that
n
m X 1
P(E) = .
n j=m+1 j − 1

2. Show that
m m
(ln n − ln m) ≤ P(E) ≤ (ln(n − 1) − ln(m − 1)).
n n
3. Show that m(ln n − ln m)/n is maximized when m = n/e, and explain why this means
P(E) ≥ 1/e for this choice of m.
2.62. Let X and Y have the joint pdf

6(1 − x − y) if x + y < 1, x > 0, y > 0,
f (x, y) =
0 otherwise.
2.9. EXERCISES 63

Compute P[2X + 3Y < 1] and E[XY + 2X 2 ].

2.63. Let X and Y have the joint pdf



10xy if 0 < x < y < 1
fX,Y (x, y) = .
0 otherwise

Find the joint pdf of X/Y and Y .

2.64. Let X be a normal with µ = 0 and σ 2 < ∞, and let Θ be uniform on [0, π]. Assume that
X and θ are independent. Find the distribution of Z = X + a cos Θ.

2.65. Let X and Y be independent random variable with the same distribution N (0, σ 2 ).

1. Let U = X + Y and V = X − Y . Show that U and V are independent.



2. Let Z = X 2 + Y 2 and W = arctan X Y
∈ (− π2 , π2 ). Show that X has a Rayleight distribu-
tion, that W is uniform, and that Z and W are independent.

2.66. (Simulation of Normal Random Variables) Let U and V be two independent uniform
random variable on [0, 1]. Let θ = 2πU and S = − ln(V ).

1. Show that S has an exponential distribution, and that R

2.67. Let (X1 , . . . , Xn ) be random variables. Define

Y1 = min{Xi , 1 ≤ i ≤ n},
Y2 = second smallest of X1 , . . . , Xn ,
..
.
Yn = max{Xi , 1 ≤ i ≤ n}.

Then Y1 , . . . , Yn are also random variables, and Y1 ≤ Y2 ≤ . . . ≤ Yn . They are called the order
statistics of (X1 , . . . , Xn ) and are usually denoted Yk = X(k) . Assume that Xi are i.i.d. with
common density f .

1. Show that the joint density of the order statistics is given by



n! Qn f (y ) for y < y < . . . < y
i=1 i 1 2 n
f(X(1) ,X(2) ,...,X(n) ) (y1 , . . . , yn ) =
0 otherwise.

2. Show that X(k) has density

f(k) (y) = kCnk f (y)(1 − F (y))n−k F (y)k−1

where F is distribution function of Xk .


2.9. EXERCISES 64

2.68. Show that the function



0 for x + y < 1,
F (x, y) =
1 for x + y ≥ 1,

is not a joint distribution function.

2.69. Let X and Y be independent and suppose P[X + Y = α] = 1 for some constant α. Show
that both X and Y are constant random variables.

2.70. Let (Xn )n≥1 be iid with common continuous distribution function F (x). Denote Rn =
Pn
j=1 I{Xj ≥Xn } , and An = {Rn = 1}.

1. Show that the sequence of random variables (Rn )n≥1 is independent and
1
P[Rn = k] = , for k = 1, . . . , n.
n

2. The sequence of events (An )n≥1 is independent and


1
P(An ) = .
n
Chapter 3

Fundamental Limit Theorems

3.1 Convergence of random variables


In this section, we study about convergence of random variables. This is an important con-
cept in probability theory and it has many applications in statistics. Here we study a sequence
of random events or variables and we consider whether it obeys some behavior. Such a behav-
ior can be characterized in two cases: the limit is a constant value or the limit is still random
but we can describe its law.
When discussing the convergence of random variables, we need to define the metric be-
tween two random variables or the manner that the random variables are close to each other.
Then in the following, we give some ”manners” or some modes of convergence.

Definition 3.1.1. Let (Xn )n≥1 be a sequence of random variables defined on(Ω, A, P). We say
that Xn
a.s.
• converges almost surely to X and denoted by Xn −→ X or limn Xn = X a.s., if
 
P w : lim Xn (w) = X(w) = 1;
n→∞

P
• converges in probability to X and denoted by Xn −→ X, if for any  > 0,

lim P(|Xn − X| > ) = 0;


n→∞

Lp
• converges in Lp (p > 0) to X and denoted by Xn −→ X if E(|Xn |p ) < ∞ for any n,
E(|X|p ) < ∞ and
lim E(|Xn − X|p ) = 0.
n→∞

Note that, the value of a random variable is a number, so the most natural way to consider
the convergence of random variables is via the convergence of a sequence of numbers; and
here comes the convergence almost surely. But sometimes this mode of convergence can fail,

65
3.1. CONVERGENCE OF RANDOM VARIABLES 66

then the convergence in probability is defined in the meaning that the larger n is, the smaller
and smaller the probability that Xn is far away from X becomes; and the convergence in Lp is
considered in the sense that the average distance between Xn and X must tends to 0.
We have the following example.

Example 3.1.2. Let {Xn } be a sequence of random variables such that


1 1
P(Xn = 0) = 1 − 2
and P(Xn = n) = 2 .
n n
p
Then Xn converges to 0 in probability, in L for 0 < p < 2 and almost surely.
• At first, we consider the convergence in probability. For any  > 0, observe that the event
{|Xn − 0| > } is included in the event {Xn 6= 0} = {Xn = n}. Then
1
0 ≤ P(|Xn − 0| > ) ≤ P(Xn = n) = .
n2
Therefore from the sandwich theorem,

lim P(|Xn − 0| > ) = 0.


n→∞

P
It implies that Xn −→ 0.
• In order to prove the convergence in Lp for p ∈ (0, 2), we must check that

lim E(|Xn − 0|p ) = 0.


n→∞

This limit can be deduced from the computation that E (|Xn |p ) = np−2 .
• Usually, in order to prove or disprove the convergence almost surely, we use the Borel-
Cantelli lemma that can be stated as follows.

Lemma 3.1.3 (Borel-Cantelli). Let An be a sequence of events in a probability space {Ω, F, P}.
Denote lim sup An = ∩∞
n=1 (∪m≥n Am ) .

1. If Σ∞
n=1 P(An ) < ∞, then P(lim sup An ) = 0.

2. If Σ∞
n=1 P(An ) = ∞ and An ’s are independent, then P(lim sup An ) = 1.

Proof. 1. From the definition of lim sup An , it is clear that for every i,

lim sup An ⊂ ∪m≥i Am .

So P(lim sup An ) ≤ P(∪m≥i Am ) ≤ Σ∞ ∞


m=i P(Am ). Since Σn=1 P(An ) < ∞, the right hand side
can be arbitrary small for suitable i. Then P(lim sup An ) = 0.

2. We have

1 − P(lim sup An ) = P ∩∞
n=1 ∪m≥n Am


= P ∪n=1 ∪m≥n Am
= P ∪∞

n=1 ∩m≥n Am .
3.1. CONVERGENCE OF RANDOM VARIABLES 67

In order to prove that P(lim sup An ) = 1, i.e 1 − P(lim sup An ) = 0, we will show that

P ∩m≥n Am = 0 for every n. Indeed, since An ’s are independent,

P ∩m≥n Am = Πm≥n P(Am )
= Πm≥n [1 − P(Am )]
≤ Πm≥n e−P(Am ) = e−Σm≥n P(Am ) = e−∞ = 0.

Then the result follows.

The meaning of the event lim sup An is that An occurs for an infinite number of n. Therefore
P(lim sup An ) = 0 means that almost surely there exists just a finite number of n that we can
see An .
Now, let’s see the application of the Borel-Cantelli Lemma in our example. We denote the
event An = {Xn 6= 0} = {Xn = n}. Then,
1
Σ∞ ∞
n=1 P(An ) = Σn=1 < ∞.
n2
It implies that almost surely An occurs a finite number of n, i.e the number of n such that Xn
differs from zero is finite. Hence, almost surely the limit of Xn exists and it must be zero. So
a.s.
Xn −→ X.

For any random variables X và Y , we denote


h |X − Y | i
dP (X, Y ) = E .
|X − Y | + 1

The following proposition characterizes the convergence in probability via metric dP .1

Proposition 3.1.4. Xn converges in probability to X iff

lim dP (Xn , X) = 0. (3.1)


n→∞

P
Proof. ⇒) Suppose that Xn → X. For any  > 0 and w ∈ Ω, because of the increasing property
x
of the function x 7→ x+1 on the interval [0, ∞), we have

|Xn (w) − X(w)| 


≤ I{|Xn −X|≤} (w) + I{|Xn −X|<} (w).
|Xn (w) − X(w)| + 1 +1

Taking expectation of both sides, we have



dP (Xn , X) ≤ P(|Xn − X| ≤ ) + P(|Xn − X| > ).
+1
1
dP is indeed a metric on L0 .
3.1. CONVERGENCE OF RANDOM VARIABLES 68

Hence
lim sup dP (Xn , X) ≤  + lim sup P(|Xn − X| > ) =  for all  > 0.
n→∞ n→∞

This implies (3.1).


⇐) On the other hand, we suppose that condition (3.1) holds. Then for any  > 0, it follows
from Markov’s inequality that
 |X − X|    + 1 h |Xn − X| i
n
P(|Xn − X| ≥ ) = P ≥ ≤ E → 0 as n → ∞.
|Xn − X| + 1 +1  |Xn − X| + 1

The following proposition shows that among the three modes of convergence, the conver-
gence in probability is the weakest form.

Proposition 3.1.5. Let (Xn )n≥1 be a sequence of random variables.


Lp P
1. If Xn −→ X for some p > 0 then Xn −→ X.
a.s. P
2. If Xn −→ X then Xn −→ X.
Lp
Proof. 1. Suppose that Xn −→ X. Then by Markov inequality, for each  > 0,

E(|Xn − X|p )
P(|Xn − X| > ) = P(|Xn − X|p > p ) ≤ .
p
Since E(|Xn − X|p ) → 0, by the sandwich theorem, P(|Xn − X| > ) converges also to 0.
P
Therefore Xn −→ X.
a.s.
2. Suppose that Xn −→ X. It is clear that

|Xn − X|
≤ 1,
1 + |Xn − X|

then by Lebesgue’s Dominated Convergence Theorem (see ??);


   
|Xn − X| |Xn − X|
limE = E lim = E(0) = 0.
n→ 1 + |Xn − X| n→ 1 + |Xn − X|

P
From the Proposition 3.1.4, we have Xn −→ X.

In the above example, we can see that convergence in probability is not sufficient for con-
vergence almost surely. However, we have the following result.
P
Proposition 3.1.6. 1. Suppose Xn −→ X. Then there exists a subsequence (nk )k≥1 such that
a.s.
Xnk −→ X.
3.1. CONVERGENCE OF RANDOM VARIABLES 69

2. On the contrary, if for all subsequence (nk )k≥1 , there exists a further subsequence (mk )k≥1
a.s. P
such that Xmk −→ X then Xn −→ X.
P
Proof. 1. Suppose Xn −→ X. Then from Proposition 3.1.4,
 
|Xn − X|
limE = 0.
n→ 1 + |Xn − X|

So there exists a subsequence {nk } such that


 
|Xnk − X| 1
E < k.
1 + |Xnk − X| 2

It is clear that  
|Xnk − X|
Σ∞
k=1 E < ∞.
1 + |Xnk − X|
Therefore,
|Xnk − X|
Σ∞
k=1 <∞ a.s.
1 + |Xnk − X|
Then, almost surely
|Xnk − X|
lim = 0,
k→∞ 1 + |Xnk − X|

it implies that
lim |Xnk − X| = 0,
k→∞

i.e, lim Xnk = X.


k→∞

2. Indeed, if we assume that X


 n does not converge in probability to X, then from Proposi-
|Xn −X|
tion 3.1.4, the sequence E 1+|Xn −X| does not converge to 0, i.e. there exists a positive
constant  > 0 such that we can find a subsequence nk satisfying
 
|Xnk − X|
E > , ∀k.
1 + |Xnk − X|

It implies that for all subsequence {mk } of {nk }, Xnk can not converge almost surely to
X. This is in contradiction with the hypothesis.
P
So we must have that Xn −→ X.

We have the following elementary but useful proposition.

Proposition 3.1.7. Let f : R2 → R be a continuous function.


a.s. a.s. a.s.
1. If Xn −→ X and Yn −→ Y then f (Xn , Yn ) −→ f (X, Y ).
P P P
2. If Xn −→ X and Yn −→ Y then f (Xn , Yn ) −→ f (X, Y ).
3.2. LAWS OF LARGE NUMBERS 70

Proof. 1. Denote by A = {w ∈ Ω : limn→∞ Xn (w) = X(w)} ∩ {w ∈ Ω : limn→∞ Yn (w) =


Y (w)}. It is clear that P(A) = 1 and for all w ∈ A, we have limn→∞ f (Xn (w), Yn (w)) =
a.s
f (X(w), Y (w)) since f is continuous. Then f (Xn ) −→ f (X).
P
2. From the second part of Proposition 3.1.6, in order to prove that f (Xn , Yn ) −→ f (X, Y ),
we can check that for all subsequence {nk }, there exists a subsequence {mk } such that
a.s P P
f (Xmk , Ymk ) −→ f (X, Y ). Indeed, since Xnk −→ X and Ynk −→ Y , then from the first
a.s
part of Proposition 3.1.6, we can extract a subsequence {mk } satisfying Xmk −→ X and
a.s
Ymk −→ Y . Then from the first part of this theorem, the result follows.

3.2 Laws of large numbers


In this section, we study the first special and classical limit theorem named ”Law of large
number”. It was first stated but without proof by Cardano. Later, the first proof was given by
Bernoulli when he considered the binary random variables. This theorem shows that in some
cases, the limit behaviour of the average of some random variables is a constant.
More precise, throughout this section, we consider (Xn )n≥1 a sequence of random vari-
ables defined on (Ω, F, P) and denote

Sn = X1 + . . . + Xn .

We have the ”weak law” and the ”strong law”.

3.2.1 Weak laws of large numbers


In the weak law, we have the convergence in probability as follows.

Theorem 3.2.1. Suppose that


D(Sn )
lim = 0.
n→∞ n2
Then
Sn − ESn P
−→ 0, as n → ∞.
n
Proof. For every  > 0, by the Chebyshev inequality,
 S − ES  D(S )
n n n
P ≥  ≤ 2 2 → 0 khi n → ∞.
n n
It implies the result.

We have a recent corollary.


3.2. LAWS OF LARGE NUMBERS 71

Corollary 3.2.2. Let (Xn )n≥1 be a sequence of pairwise uncorrelated random variables satisfy-
ing
D(X1 ) + . . . + D(Xn )
lim = 0.
n→∞ n2
Then
Sn − ESn P
−→ 0, asn → ∞.
n
Proof. Observe that D(Sn ) = D(X1 ) + . . . + D(Xn ) and apply Theorem 3.2.1.

In a special case, when Xn ’s are i.i.d with finite variance, we have

D(X1 ) + . . . + D(Xn ) n.D(X1 ) D(X1 ))


lim 2
= lim 2
= lim = 0.
n→∞ n n→∞ n n→∞ n
So the condition in Theorem 3.2.1 is met. Moreover, by linearity,

ESn = EX1 + EX2 + . . . + EXn = nEX1 .

So we have proved the following corollary

Lemma 3.2.3. Let (Xn )n≥1 be a sequence of i.i.d random variables with finite variance. Then

Sn P
−→ EX1 , as n → ∞.
n
Note that when Xn has the Bernoulli law, then Sn is the number of successful trials and
Bernoulli showed that Sn /n converges in probability to the probability of success of a trial.
However, his proof is much more complicated than the one given here.

3.2.2 Strong laws of large numbers


We first claim a simple version of the strong laws.

Theorem 3.2.4. Let (Xn )n≥1 be a sequence of pair-wise uncorrelated random variables satisfy-
ing supn D(Xn2 ) ≤ σ 2 < ∞. Then

Sn − ESn
lim = 0 a.s and in L2 .
n→∞ n
Proof. At first, we assume that E(Xn ) = 0. Denote Yn = Sn /n. Then E(Yn ) = 0, and from
Proposition 2.7.3,
n
2 1 X σ2
E(Yn ) = 2 DXi ≤ .
n i=1 n
L2
Hence Yn → 0. We also have
∞ ∞
X X σ2
E(Yn22 ) ≤ < ∞.
n=1 n=1
n2
3.2. LAWS OF LARGE NUMBERS 72

From the Monotone Convergence Theorem,


"∞ # ∞
X X
2
E Yn2 = E(Yn22 ) < ∞,
n=1 n=1
P∞
so n=1 Yn22 < ∞ almost surely. It implies that
a.s
Yn2 → 0. (3.2)

For each n ∈ N, denote by p(n) the integer part of n. Since
n
p(n)2 1 X
Yn − Yp(n)2 = Xj ,
n n
j=p(n)2 +1

we have

h p(n)2 2 i n − p(n)2
2 2p(n) + 1 2 2 n + 1 2 3
E Yn − Yp(n)2 ≤ 2
σ ≤ 2
σ ≤ 2
σ ≤ 3/2 σ 2 ,
n n n n n

with the observations n ≤ (p(n) + 1)2 and p(n) ≤ n. By the same argument, since
∞ ∞
X h p(n)2 2 i X 3 2
E Yn − Yp(n)2 ≤ 3/2
σ < ∞,
n=1
n n=1
n

then
p(n)2 h.c.c
Yn − Yp(n)2 → 0.
n
2 a.s
From (3.2) and the observation p(n)n
→ 1, we deduce that Yn → 0.
In general, if E(Xn ) 6= 0, we denote Zn = Xn − E(Xn ). Then {Zn } is a sequence of pair-
wise uncorrelated random variables with mean zero satisfying the condition of the theorem.
Therefore
Sn − ESn Z1 + . . . + Zn a.s
= → 0.
n n

In the following, we state without proof two general versions of strong law of large num-
bers.

Theorem 3.2.5. Let (Xn )n≥1 be a sequence of independent random variable and, (bn )n≥1 a se-
quence of positive numbers satisfying bn ↑ ∞. If

X DXn Sn − E(Sn ) a.s.
< ∞ then −→ 0.
n=1
b2n bn

Theorem 3.2.6. Let (Xn )n≥1 be a sequence of iid random variables. Then
Sn
lim = E(X1 ) iff E(|X1 |) < ∞.
n→∞ n
3.2. LAWS OF LARGE NUMBERS 73

i.i.d
Example 3.2.7. Consider (Xn )n≥1 ∼ B(1, p). From Theorem 3.2.6,

Sn h.c.c
−→ E(X1 ) = p.
n
Then, to approximate the probability of success of each trial, we can use the approximation
Sn /n for n large enough.

An application of Strong law of large numbers that is quite simple but very useful is the
Monte Carlo method.

Example 3.2.8. Let f be an integrable function over [0, 1], i.e.


Z 1
|f (x)|dx < ∞. (3.3)
0
R1
In most of the practical applications, the quantity I = 0 f (x)dx can not be calculated exactly
by the analytical method. Therefore one usually approximate it by the numerical method.
When f is smooth enough, I can be approximated well by taking the average (with some
weight) of the values of f at some fixed points. For example, if f is twice differentiable, we
have
f (tn0 ) + 2f (tn1 ) + . . . + 2f (tnn−1 ) + f (tnn )
I≈ ,
2n
where tni = ni , i = 0, 1, . . . , n.
However, the above method is not good in the sense that we must take too many points to
have a good approximation when f is not smooth enough. In this case, we can use the Monte
Carlo method that can be stated in the simplest version as follows. Let (Uj )j≥1 be a sequence
of i.i.d random variables of the uniform distribution over [0, 1] and denote
n
1X
In = f (Uj ).
n j=1
R1
Since E[|f (Uj )|] = 0 |f (x)|dx < ∞, then from Theorem 3.2.6, In converges almost surely to
E[f (U1 )] = I as n → ∞. To evaluate the error of the approximation, we assume more that
Z 1
|f (x)|2 dx < ∞. (3.4)
0

Then, the square of the error is


Z 1
2 2 1 1
E[(In − I) ] = E[(In − E[In ]) ] = Df (U1 ) ≤ |f (x)|2 dx.
n n 0

In practical, we use the computer to generate the sequence (Uj )j≥1 and obtain an approxima-
tion of I for any function f satisfying the condition (3.3). Under the condition (3.4), the error
of the approximation only depends on the size n and not on the smoothness of f . The Monte
3.3. CENTRAL LIMIT THEOREMS 74

Carlo method also seems to be more useful than the other deterministic ones in approximat-
ing the multiple integral. The only thing we must care about is the square of the error. If we
can reduce it, then the calculation will be more accurate and we can also reduce the time on
computer (see [?]). That is the way one wants to improve the Monte Carlo method.
The error of the Monte Carlo method will be analysed in more detail based on the Central
limit theorems that will be explained in the following.

3.3 Central limit theorems


In this section, we will state and prove the second classical limit theorem in probability
theory named ”Central limit theorem”. It is the most beautiful pearl of probability and has a
lot of applications in statistics. However, to understand this theorem, we need to define a new
mode of convergence and the tools to study it.

3.3.1 Characteristic functions


Sometimes to analyse a quantity or a function, it is better to transform it in another form.
Since a random variable can be seen as a special function, we can do the same. In this section,
we study the Fourier transformation of a random variable. We have the following definition.

Definition 3.3.1. 1. Let X be a random variable. We define its characteristics function by


Z
itX
ϕX (t) = E[e ] = eitx dFX (x).
R

2. The characteristic function of random vector X = (X1 , . . . , Xn ) is defined by


h  Xn i
ϕX (t1 , . . . , tn ) = E exp i tj Xj .
j=1

Theorem 3.3.2. For every random variable X, the characteristic function ϕX has the following
properties;

1. ϕX (0) = 1;

2. ϕX (−t) = ϕX (t);

3. |ϕX (t)| ≤ 1;

4. |ϕX (t + h) − ϕX (t)| ≤ E[|eihX − 1|], so ϕX is uniformly continuous on (−∞, +∞);

5. E[eit(aX+b) ] = eitb ϕX (at).


3.3. CENTRAL LIMIT THEOREMS 75

Proof. It is easy to see that ϕX (0) = 1. Applying the inequality (EX)2 ≤ E(X 2 ),
p q
|ϕX (t)| = (E cos tX)2 + (E sin tX)2 ≤ E(cos2 tX) + E(sin2 tX) = 1,

then ϕX is bounded. And the continuity of can be deduced by Lebesgue dominated conver-
gence theorem.

The following theorem shows the connection between the characteristic function of a ran-
dom variable and its moments.

Theorem 3.3.3. If E[|X|m ] < ∞ for some positive integer m. Then ϕX has continuous deriva-
tives up to order m, and

ϕ(k) (t) = ik E[X k eitX ], (3.5)


ϕ(k) (0)
E[X k ] = , (3.6)
ik
n
X (it)k (it)n
ϕX (t) = E[X k ] + αn (t), (3.7)
k=0
k! n!

where |αn (t)| ≤ 2E(|X n |) and αn (t) → 0 as t → 0.


On the other hand, if ϕ(2m) (0) exists and is finite for some positive integer m, then E[X 2m ] < ∞.

Proof. Since E(|X|m ) < ∞, we have E(|X|k ) < ∞ for all k = 1, . . . , m. Then
Z Z
sup |(ix) e |dFX (x) ≤ |x|k dFX (x) < ∞.
k itx
t

From Lebesgue theorem, we can take the differentation under the integral sign and obtain
(3.5). In (3.5), let t = 0 then we have (3.6).
Consider the Taylor expansion2 of function exp(x) at x = 0,
n−1
itX
X (itX)k (itX)n iθX 
E(e )=E + e
k=0
k! n!
n−1
X (it)k (it)n  
= E(X k ) + E(X n ) + αn (t) ,
k=0
k! n!

where |θ| ≤ |t|, αn (t) = E X n (eiθX − 1) . Therefore |αn (t)| ≤ 2E(|X|n ), i.e it is bounded. So
from the Dominated convergence theorem, we have αn (t) → 0 as t → 0.
The inverse statement can be proved by concurrence, see [13, page 190-193].
2
Taylor expansion: Suppose that ϕ has continuous derivatives up to order m, then
m−1
X ϕ(k) (0) k ϕ(m) (θ) m
ϕ(x) = x + x ,
k! m!
k=0

where |θ| ≤ |x|.


3.3. CENTRAL LIMIT THEOREMS 76

We consider the characteristic function of some usual distributions.

Example 3.3.4. Let X ∼ P oi(λ). We have



X e−λ λk (λeit )k it
ϕX (t) = eitk = e−ld = eλ(e −1) .
k=0
k! k!

Example 3.3.5. Let X ∼ N (0, 1). We have


Z ∞ −x2 /2 Z ∞ Z ∞
itx e cos tx −x2 /2 sin tx −x2 /2
ϕX (t) = e √ dx = √ e dx + √ e dx.
−∞ 2π −∞ 2π −∞ 2π
2
Since the funtion x 7→ e−x /2 sin tx is an odd and integrable function, the second integral
equals to 0. Thanks to Theorem 3.3.3, we can take the derivative of both side with respect
to t and get Z ∞
0 sin tx −x2 /2
ϕX (t) = − √ xe dx.
−∞ 2π
By integration by parts,
Z ∞
0 cos tx −x2 /2
ϕX (t) = − √ te dx = −tϕX (t).
−∞ 2π
ϕ0X
The differential equation ϕX
= −t with initial condition ϕX (0) = 1 has a solution
2 /2
ϕX (t) = e−t .

If X ∼ N (a, σ 2 ) then Z
1 (x−a)2
ϕX (t) = √ eitx e− 2σ2 dx.
2πσ 2

Using the change of variable: y = (x − a)/σ, we get


eita
Z
2 2 2
ϕX (t) = √ eitσy e−y /2 dy = eita−t σ /2 .

The following theorem shows the meaning of the name ”characteristic function”.

Theorem 3.3.6. Two random vectors have the same distribution if their characteristic functions
R
coincide. Moreover, if |ϕX (t)|dt < ∞ then X has bounded continuous density given by
1 −ity
fX (y) = e ϕ(t)dt.

Example 3.3.7. Let X and Y have Poisson distribution with the corresponding parameters µ
and λ. Assume more that X and Y are independent. Let us consider the distribution of the
random variable X + Y . We can compute its characteristic function as
it −1)
ϕX+Y (t) = E(eit(X+Y ) ) = E(eitX )E(eitY ) = e(λ+µ)(e .

Then this characteristic function agrees with the one of P oi(µ + λ). So the random variable
X + Y has the Poisson distribution with the parameter µ + λ.
3.3. CENTRAL LIMIT THEOREMS 77

We can also use the characteristic function to check whether the random variables are
independent.

Theorem 3.3.8. X1 , . . . , Xn are independent random variables iff

ϕ(X1 ,...,Xn ) (t1 , . . . , tn ) = ϕX1 (t1 ) . . . ϕXn (tn ) for all (t1 , . . . , tn ) ∈ Rn .

Example 3.3.9. Let X and Y be independent random variables which have standard normal
distribution N (0, 1). According to Example 3.3.5, we have
2 −s2
ϕ(X+Y,X−Y ) (t, s) = Eeit(X+Y )+is(X−Y ) = Eei(t+s)X Eei(t−s)Y = e−t .
2 2
Put s = 0 and t = 0, we have ϕX+Y (t) = e−t and ϕX−Y (s) = e−s . Hence both X + Y and X − Y
have normal distribution N (0, 2). Furthermore, they are independent since

ϕ(X+Y,X−Y ) (t, s) = ϕX+Y (t)ϕX−Y (s) for all t, s ∈ R.

3.3.2 Weak convergence


In this section, we consider another mode of convergence of a sequence of random vari-
ables that is the weak convergence. In the first three modes of convergence, we can see the
trace of analysis such as the limit of a sequence of numbers or a Cauchy sequence. Here the
weak convergence totally comes from the probability theory.
w
Definition 3.3.10. Xn converges weakly to X and denoted by Xn −→ X, if limn→∞ E[f (Xn )] =
E[f (X)] for each f which is real-valued, continuous and bounded.
w w
When Xn −→ X we also say that FXn converges weakly to FX and denote FXn −→ FX .

Note that we do not require or suppose that the random variables (Xn )n≥1 are defined on
the same probability space in the above definition. We just care about the expectation or the
distribution. Therefore sometimes we call weakly convergence by convergence in distribution
(See Exercise 3.27).
If we suppose that Xn ’s and X are defined on the same probability space, we have the
following propositions.

Proposition 3.3.11. Let (Xn )n≥1 and X be random variables defined on the same probability
w
space (Ω, F, P). If Xn −→ X then Xn −→ X.
P

P w
Proof. We prove by contradiction. Assume that Xn −→ X but Xn 6−→ X. Then there exist a
bounded continuous function f , a constant  > 0 and a subsequence (nk )k≥1 such that

|E(f (Xnk )) − E(f (X))| >  for all k ≥ 1. (3.8)

From Proposition 3.1.6, there exists a subsequence (mk )k≥1 of the sequence (nk )k≥1 such that
a.s h.c.c
Xmk −→ X. Since f is continuous, f (Xmk ) −→ f (X). By Dominated Convergence Theorem,
E(f (Xmk )) → E(f (X)). It is in contradiction with (3.8). Then the result follows.
3.3. CENTRAL LIMIT THEOREMS 78

Proposition 3.3.12. Let (Xn )n≥1 and X be random variables defined on the same probability
w
space (Ω, F, P). If Xn −→ X and X = const a.s then Xn −→ X.
P

|x−a| w
Proof. Let X ≡ a a.s. Consider the bounded continuous function f (x) = |x−a|+1
. Since Xn −→
P
a, E(f (Xn )) → f (a) = 0. From Proposition 3.1.4, Xn → a.

The following theorem gives us a very useful criterion to verify the weak convergence of
random variables by using the characteristic function. Its proof is provided in [13, page 196-
199].

Theorem 3.3.13. Let (Fn )n≥1 be a sequence of distribution function whose characteristic func-
tions are (ϕn )n≥1 respectively, Z
ϕn (t) = eitx dFn (x).
R
w
1. If Fn → F for some distribution function F then (ϕn ) converges point-wise to the charac-
teristic function ϕ of F .

2. If ϕn (t) → ϕ(t), t ∈ R. Then the following statements are equivalent.


w
(a) ϕ(t) is a characteristic function and Fn → F where F is a distribution function whose
characteristic function is ϕ;
(b) ϕ is continuous at t = 0.

Example 3.3.14. Let Xn be normal N (an , σn2 ). Suppose that an → 0 and σn2 → 1 as n → ∞.
Then the sequence (Xn ) converges weakly to N (0, 1) since
2 2 /2 2 /2
ϕXn (t) = eitan −σn t → e−t .

Example 3.3.15 (Weak laws of large numbers). Let (Xk )k≥1 be a iid sequence of random vari-
ables whose mean is finite. Then
1 P
(X1 + . . . + Xn ) −→ a.
n
Indeed, denote Sn = X1 + . . . + Xn and ϕ is characteristic function of Xk . Then,

ϕSn /n (t) = ϕSn (t/n) = [ϕ(t/n)]n .

Thanks to Theorem 3.3.3, we have


ita t
ϕ(t/n) = 1 + + α(t/n),
n n
with α(t) → 0 as t → 0. Thus,
 ita t n
ϕSn /n (t) = 1 + + α(t/n) → eita
n n
w
as n → ∞. Note that if X ≡ a then ϕX (t) = eita . Hence, S/n → a. Applying Proposition 3.3.12
we obtain the desired result.
3.3. CENTRAL LIMIT THEOREMS 79

The following theorem is very useful in statistics.


w P P
Theorem 3.3.16 (Slutsky’s theorem). Suppose that Xn → X, An → a and Bn → b where a and
b are some constants. Then
w
An + Bn Xn → a + bX.

3.3.3 Central limit theorem


The central limit theorem is stated as follows.

Theorem 3.3.17. Let (Xn )n≥1 be a sequence of i.i.d random variables and E(Xn ) = µ and
−nµ w
DXn = σ 2 ∈ (0, ∞). Denote Sn = X1 + . . . + Xn . Then Yn = Sσn√ n
→ N (0, 1).

Proof. Denote ϕ by the characteristic function of the random variable Xn − µ. Since Xn ’s have
the same law, ϕ does not depend on n. Moreover, since Xn ’s are independent,
n n
 X Xj − µ  Y  X − µ 
j t n
ϕYn (t) = E exp it √ = E exp it √ = ϕ √ .
j=1
σ n j=1
σ n σ n

It is clear that E(Xj − µ) = 0 and E((Xj − µ)2 ) = σ 2 . Then from Theorem 3.3.3, ϕ has the
continuous second derivative and
σ 2 t2
ϕ(t) = 1 − + t2 α(t),
2
where α(t) → 0 as t → 0. Using the expansion ln(1 + x) = x + o(x) as x → 0,
 t2 t2 t  t2
ln ϕYn (t) = n ln 1 − + 2α √ →− .
2n nσ σ n 2
2 /2
Therefore ϕYn (t) → e−t as n → ∞. Applying Theorem 3.3.13, we have the desired result.

In the following, we give an example of the central limit theorem. More detail, we will
approximate the binomial probability by the normal probability.

Example 3.3.18. We know that a binomial random variable Sn ∼ B(n, p) can be written as the
sum of n i.i.d random variables ∼ B(1, p). Then as n large enough, from the central limit the-
p
orem, we can approximate the random variable (Sn − np)/ np(1 − p) by the standard normal
variable N(0, 1).
Usually, the probability that a ≤ Sn ≤ b can be formulated as

Σbi=a Cni pi (1 − p)n−i .

However, when n is too large, calculating Cni for some i is impossible since it exceeds the
capacity of the calculator or the computer (please, consider 1000! or 5000!). Then in practical,
3.4. EXERCISES 80

we can estimate this probability by


" #!
S − np a − np b − np
P(a ≤ Sn ≤ b) = P p n ∈ p ,p
np(1 − p) np(1 − p) np(1 − p)
" #!
∼ a − np b − np
= P N(0, 1) ∈ p ,p .
np(1 − p) np(1 − p)

Note that to compute the last probability, we can write down it as an integral from the density
function of the normal variable. It can be computed or approximated easily.

In order to define the rate that the distribution of FYn converges to normal distribution, we
use the Berry-Esseen’s inequality: Suppose E(|X1 |3 ) < ∞ then
Z x −t2 /2
e E(|X1 − EX1 |3 )
sup FYn (x) − √ dt ≤ KBE √ , (3.9)
−∞<x<∞ −∞ 2π σ3 n

where KBE is some constant in ( 610+3


, 0.4748) (see[12]).
The condition that Xn ’s are iid is too restrictive. Many authors manage to weaken this
condition. In the following, we state the Lindeberg’s central limit theorem. Its proof can be
found in [13, page 221-225].

Theorem 3.3.19. Let (Xn )n≥1 be a sequence of independent random variables with finite vari-
ance. Denote Sn = X1 + . . . + Xn , Bn = DX1 + . . . + DXn . Suppose that
n
1 X  2

Ln () := E (X k − EX k ) I{|Xk −EXk |>Bn } → 0, ∀ > 0. (3.10)
Bn2 k=1

Sn −ESn w
Then Sn∗ = Bn
→ N (0, 1).

3.4 Exercises

3.4.1 Convergence of random variables


P P
3.1. Prove that if Xn → X and, at the same time, Xn → Y , then X and Y are equivalent, in the
sense that P[X 6= Y ] = 0.

3.2. Show that dP is a metric on L0 , it means that

1. d(X, Y ) ≥ 0 and d(X, Y ) = 0 iff X = Y a.s.;

2. d(X, Y ) = d(Y, X);

3. d(X, Y ) ≤ d(X, Z) + d(Z, Y );

for any random variables X, Y, Z.


3.4. EXERCISES 81

3.3. Show that (Xn )n≥1 converges in probability to X iff

lim E(|Xn − X| ∧ 1) = 0.
n→∞

3.4. Consider the probability space ([0; 1], B([0; 1]), P ). Let X = 0 and X1 , X2 , . . . be random
variables 
0 if n1 ≤ ω ≤ 1
Xn (ω) =
en if 0 ≤ ω < 1/n.

P
Show that X −→ X, but E|Xn − X|p does not converge for any p > 0.

3.5. Consider the probability space ([0; 1], B([0; 1]), P ). Let X = 0. For each n = 2m + k where
0 ≤ k < 2m , we define 
1 if k ≤ ω ≤ k+1
2m 2m
Xn (ω) =
0 otherwise.

P
Show that X −→ X, but {Xn } does not converge to X a.s.

3.6. Let (Xn )n≥1 be a sequence of exponential random variables with parameter λ = 1. Show
that h Xn i
P lim sup = 1 = 1.
n→∞ ln n

3.7. Let X1 , X2 , . . . be a sequence of identically distributed random variables with E|X1 | < ∞
and let Yn = n−1 max1≤i≤n |Xi |. Show that limn E(Yn ) = 0 and limn Yn = 0 a.s.

P
3.8. [5] Let (Xn )n≥1 be random variables with Xn −→ X. Suppose |Xn (ω)| ≤ C for a constant
C > 0 and all ω. Show that limn→∞ E|Xn − X| = 0.

3.4.2 Law of large numbers


3.9. [10] Let X1 , . . . , Xn be independent and identically distributed random variables such
P∞ −2
that for x = 3, 4, . . . , P (X1 = ±x) = (2cx2 log x)−1 , where c = x=3 x / log x. Show that
−1
Pn P
E|X1 | = ∞ but n i=1 Xi −→ 0.

3.10. [10] Let X1 , . . . , Xn be independent and identically distributed random variables with
V ar(S1 ) < ∞. Show that
n
1 X P
jXj −→ EX1 .
n(n + 1) j=1

3.11. [2] If for every n, V ar(Xi ) ≤ c < ∞ and Cov(Xi , Xj ) < 0 (i, j = 1, 2, . . .), then the WLLN
holds.
3.4. EXERCISES 82

3.12. [2](Theorem of Bernstein) Let {Xn } be a sequence of random variables so that V ar(Xi ) ≤
c < ∞ (i = 1, 2, . . .) and Cov(Xi , Xj ) → 0 when |i − j| → ∞ then the WLLN holds.

3.13. [5] Let (Yj )j≥1 be a sequence of independent Binomial random variables, all defined on
the same probability space, and with law B(p, 1). Let Xn = nj=1 Yj . Show that Xj is B(p, j)
P
X
and that jj converges a.s to p.
Q  n1
n
3.14. [5] Let {Xj }j≥1 be i.i.d with Xj in L1 . Let Yj = eXj . Show that j=1 Yj converges to a
constant α a.s.

3.15. [5] Let (Xj )j≥1 be i.i.d with Xj in L1 and EXj = µ. Let (Yj )j≥1 be also i.i.d with Yj in L1
and EYj = ν 6= 0. Show that
n
1 X µ
lim Pn Xj = a.s.
n→∞
j=1 Yj j=1
ν
Pn
3.16. [5] Let (Xj )j≥1 be i.i.d with Xj in L1 and suppose √1 − ν) converges in distribu-
n j=1 (Xj
tion to a random variable Z. Show that
n
1X
lim Xj = ν a.s.
n→∞ n
j=1

3.17. [5] Let (Xj )j≥1 be i.i.d with Xj in Lp . Show that


n
1X p
lim Xj = EX p a.s.
n→∞ n
j=1

3.18. [5] Let (Xj )j≥1 be i.i.d. N (1; 3) random variables. Show that

X1 + X2 + . . . Xn 1
lim 2 2 2
= a.s.
n→∞ X1 + X2 + . . . + Xn 4

3.19. [5] Let (Xj )j≥1 be i.i.d with mean µ and variance σ 2 . Show that
n
1X
lim (Xi − µ)2 = σ 2 a.s.
n→∞ n
i=1

3.4.3 Characteristic function


3.20. Find the characteristic function of X,

1. P[X = 1] = P[X = −1] = 1/2;

2. P[X = 1] = P[X = 0] = 1/2;

3. X ∼ U (a, b);
3.4. EXERCISES 83

4. the density of X is f (x) = (1 − |x|)I|x|<1 ;

5. X ∼ Exp(λ);

3.21. Show that if X1 , . . . , Xn are independent and uniformly distribution on (−1, 1), then for
n ≥ 2, X1 + . . . + Xn has density
1 ∞  sin t n
Z
f (x) = cos txdt.
π 0 t
3.22. Suppose that X has density
1 − cos x
f (x) = .
πx2
Show that
ϕX (t) = (1 − |t|)+ .

3.23. 1. Suppose that X has Cauchy distribution with density


1
f (x) = .
π(1 + x2 )
Show that
ϕX (t) = e−|t| .

2. Let X1 , . . . , Xn be a sequence of independent Cauchy random variables. Find the distri-


bution of (X1 + . . . + Xn )/n.

3.24. Let X1 , X2 , . . . be independent taking values 0 and 1 with probability 1/2 each.

1. Find the distribution of ξ = ∞


P Xi
i=1 2i .

2. Find the characteristic function of ζ = 2 ∞


P Xi
i=1 3i . We say that ζ has the Cantor distribu-
tion.

3.4.4 Weak convergence


w w w
3.25. Show that if Xn and Yn are independent for 1 ≤ n, Xn → X and Yn → Y , then Xn + Yn →
X +Y.

3.26. Consider the probability space ([0; 1], B([0; 1]), P ). Let X and X1 , X2 , . . . be random vari-
ables 
1 if 0 ≤ ω ≤ 1/2
X2n (ω) =
0 if 1/2 < ω ≤ 1.

and 
0 if 0 ≤ ω ≤ 1/2
X2n+1 (ω) =
1 if 1/2 < ω ≤ 1.
Show that the sequence (Xn ) converges in distribution? Does it converge in probability?
3.4. EXERCISES 84

3.27. Let (Xn )n≥1 and X are random variables whose distribution functions are (Fn )n≥1 and
F , respectively.
w
1. If Xn −→ X then limn→∞ Fn (x) = F (x) for all x ∈ D where D is a dense subset of R given
by
D = {x ∈ R : F (x+) = F (x)}.
w
2. If limn→∞ Fn (x) = F (x) for any x in some dense subset of R then Xn −→ X.
w P
3.28. If Xn −→ X, Yn −→ c, then
w
a) Xn + Yn −→ X + c
w
b) Xn Yn −→ cX
w
c) Xn /Yn −→ X/c if Yn 6= 0 a.s for all n and c 6= 0.
d P
3.29. [10] Show that if Xn −→ X and X = c a.s for a real number c, then Xn −→ X.

3.30. [10] A family of random variable (Xi )i∈I is called uniformly integrable if

lim sup E[|Xi |I{|Xi |≥N } = 0.


N →∞ i∈I

Let X1 , X2 , . . . be random variables. Show that {|Xn |} is uniformly integrable if one of the
following condition holds:

a) supn E|Xn |1+δ < ∞ for a δ > 0.

b) P (|Xn | ≥ c) ≤ P (|X| ≥ c) for all n and c > 0, where X is an integrable random variable.

3.31. Let Xn be random variable distributed as N (µn , σn2 ), n = 1, 2, . . . and X be a random


d
variable distributed as N (µ, σ 2 ). Show that Xn −→ X if and only if limn µn = µ and limn σn2 = σ 2 .
w
3.32. If Yn are random variables with characteristic function ϕn , then Yn → 0 iff there is a δ > 0
so that ϕn (t) → 1 for |t| ≤ δ.

3.4.5 Central limit theorems


3.33. [10] Let U1 , U2 , . . . be independent random variables having the uniform distribution on
−1/n √ d
[0;1] and Yn = ( ni=1 Ui ) . Show that n(Yn − e) −→ N (0, e2 ).
Q

3.34. [10] Suppose that Xn is a random variable having the binomial distribution with size n
and probability θ ∈ (0, 1), n = 1, 2, . . . Define Yn = log(Xn /n) when Xn ≥ 1 and Yn = 1 when
√ d
Xn = 0. Show that limn Yn = log θ a.s and n(Yn − log θ) −→ N (0, 1−θ θ
).

3.35. [2] Show that for the sequence {Xn } of independent random variables with
3.5. INTRODUCTION 85

1−2−n 1
a) P [Xn = ±1] = 2
, P [Xn = ±2n ] = 2n+1
, n = 1, 2, . . . ,

b) P [Xn = ±n2 ] = 21 ,

the CLT holds.


2
3.36. [5] Let (Xj )j≥1 be i.i.d with EX1 = 1 and σX 1
= σ 2 ∈ (0; ∞). Show that

2 p √ d
( Sn − n) −→ N (0, 1).
σ
3.37. [5] Show that !
n
X nk 1
lim e−n = .
n→∞
k=0
k! 2
2
Pn
3.38. [5] Let (Xj )j≥1 be i.i.d with EXj = 0 and σX j
= σ 2 < ∞. Let Sn = j=1 Xj . Show that
  r
|Sn | 2
lim E √ = σ.
n→∞ n n

3.39. [5] Let (Xj )j≥1 be i.i.d with the uniform distribution on (-1;1). Let
Pn
j=1 Xj
Y n = Pn 2
Pn 3
.
j=1 Xj + j=1 Xj

Show that nYn converges in distribution.

3.40. [5] Let (Xj )j≥1 be i.i.d with the uniform distribution on (−j; j).

a) Show that
Sn d 1
3−→ N (0; ).
n2 9
b) Show that
S d
qP n −→ N (0, 1).
n
j=1 σj2

3.5 Introduction
3

Statistics is a process of using the scientific method to answer questions and make deci-
sions. That process involves designing studies, collecting good data, describing the data with
numbers and graphs, analyzing the data, and then making conclusions. We now review each
of these steps and show where statistics plays the all-important role.
3
This part is borrowed from D. Rumsey, Statistics Essentials for Dummies (2010) Wiley Publishing, Inc.
3.5. INTRODUCTION 86

3.5.1 Designing Studies


Once a research question is defined, the next step is designing a study in order to answer
that question. This amounts to figuring out what process is used to get the needed data. This
section overviews the two major types of studies: observational studies and experiments.

Survey

An observational study is one in which data are collected on individuals in a way that
doesn’t affect them. The most common observational study is the survey. Surveys are ques-
tionnaires that are presented to individuals who have been selected from a population of in-
terest. Surveys may be administered in a variety of ways, e.g. personal interview, telephone
interview, and self-administered questionnaire.
If conducted properly, surveys can be very useful tools for getting information. However,
if not conducted properly, surveys can result in bogus information. Some problems include
improper wording of questions, which can be misleading, people who were selected to par-
ticipate but do not respond, or an entire group in the population who had no chance of even
being selected. These potential problems mean a survey has to be well thought-out before it’s
given.
A downside of surveys is that they can only report relationships between variables that are
found; they cannot claim cause and effect. For example, if in a survey researchers notice that
the people who drink more than one Diet Coke per day tend to sleep fewer hours each night
than those who drink at most one per day, they cannot conclude that Diet Coke is causing the
lack of sleep. Other variables might explain the relationship, such as number of hours worked
per week.

Experiments

An experiment imposes one or more treatments on the participants in such a way that
clear comparisons can be made. Once the treatments are applied, the response is recorded.
For example, to study the effect of drug dosage on blood pressure, one group might take 10 mg
of the drug, and another group might take 20 mg. Typically, a control group is also involved,
where subjects each receive a fake treatment (a sugar pill, for example).
Experiments take place in a controlled setting, and are designed to minimize biases that
might occur. Some potential problems include: researchers knowing who got what treatment;
a certain condition or characteristic wasn’t accounted for that can affect the results (such as
weight of the subject when studying drug dosage); or lack of a control group. But when de-
signed correctly, if a difference in the responses is found when the groups are compared, the
researchers can conclude a cause and effect relationship.
3.5. INTRODUCTION 87

It is perhaps most important to note that no matter what the study, it has to be designed
so that the original questions can be answered in a credible way.

3.5.2 Collecting Data


Once a study has been designed, be it a survey or an experiment, the subjects are chosen
and the data are ready to be collected. This phase of the process is also critical to producing
good data.

Selecting a good sample

First, a few words about selecting individuals to participate in a study. In statistics, we have
a saying: “Garbage in equals garbage out.” If you select your subjects in a way that is biased
— that is, favouring certain individuals or groups of individuals — then your results will also
be biased.
Suppose Bob wants to know the opinions of people in your city regarding a proposed
casino. Bob goes to the mall with his clipboard and asks people who walk by to give their
opinions. What’s wrong with that? Well, Bob is only going to get the opinions of a) people
who shop at that mall; b) on that particular day; c) at that particular time; d) and who take the
time to respond. That’s too restrictive - those folks don’t represent a cross-section of the city.
Similarly, Bob could put up a Web site survey and ask people to use it to vote. However, only
those who know about the site, have Internet access, and want to respond will give him data.
Typically, only those with strong opinions will go to such trouble. So, again, these individuals
don’t represent all the folks in the city.
In order to minimize bias, you need to select your sample of individuals randomly - that
is, using some type of “draw names out of a hat” process. Scientists use a variety of methods
to select individuals at random, but getting a random sample is well worth the extra time and
effort to get results that are legitimate.

Avoiding bias in your data

Say you’re conducting a phone survey on job satisfaction of Americans. If you call them at
home during the day between 9 a.m. and 5 p.m., you’ll miss out on all those who work during
the day; it could be that day workers are more satisfied than night workers, for example. Some
surveys are too long - what if someone stops answering questions halfway through? Or what if
they give you misinformation and tell you they make $100,000 a year instead of $45,000? What
if they give you an answer that isn’t on your list of possible answers? A host of problems can
occur when collecting survey data. Experiments are sometimes even more challenging when
it comes to collecting data. Suppose you want to test blood pressure; what if the instrument
you are using breaks during the experiment? What if someone quits the experiment half- way
3.5. INTRODUCTION 88

through? What if something happens during the experiment to distract the subjects or the
researchers? Or they can’t find a vein when they have to do a blood test exactly one hour after
a dose of a drug is given? These are just some of the problems in data collection that can arise
with experiments.

3.5.3 Describing Data


Once data are collected, the next step is to summarize it all to get a handle on the big pic-
ture. Statisticians describe data in two major ways: with pictures (that is, charts and graphs)
and with numbers, called descriptive statistics.

Descriptive statistics

Data are also summarized (most often in conjunction with charts and/or graphs) by using
what statisticians call descriptive statistics. Descriptive statistics are numbers that describe a
data set in terms of its important features.
If the data are categorical (where individuals are placed into groups, such as gender or po-
litical affiliation) they are typically summarized using the number of individuals in each group
(called the frequency) or the percentage of individuals in each group (the relative frequency).
Numerical data represent measurements or counts, where the actual numbers have mean-
ing (such as height and weight). With numerical data, more features can be summarized be-
sides the number or percentage in each group. Some of these features include measures of
center (in other words, where is the “middle” of the data?); measures of spread (how diverse
or how concentrated are the data around the center?); and, if appropriate, numbers that mea-
sure the relationship between two variables (such as height and weight).
Some descriptive statistics are better than others, and some are more appropriate than
others in certain situations. For example, if you use codes of 1 and 2 for males and females,
respectively, when you go to analyze that data, you wouldn’t want to find the average of those
numbers — an “average gender” makes no sense. Similarly, using percentages to describe the
amount of time until a battery wears out is not appropriate.

Charts and graphs

Data are summarized in a visual way using charts and/or graphs. Some of the basic graphs
used include pie charts and bar charts, which break down variables such as gender and which
applications are used on teens’ cell phones. A bar graph, for example, may display opinions on
an issue using 5 bars labeled in order from “Strongly Disagree” up through “Strongly Agree.”
But not all data fit under this umbrella. Some data are numerical, such as height, weight,
time, or amount. Data representing counts or measurements need a different type of graph
3.5. INTRODUCTION 89

that either keeps track of the numbers themselves or groups them into numerical groupings.
One major type of graph that is used to graph numerical data is a histogram.

3.5.4 Analyzing Data


After the data have been collected and described using pictures and numbers, then comes
the fun part: navigating through that black box called the statistical analysis. If the study
has been designed properly, the original questions can be answered using the appropriate
analysis, the operative word here being appropriate. Many types of analyses exist; choosing
the wrong one will lead to wrong results.
This course covers the major types of statistical analyses encountered in introductory statis-
tics. Scenarios involving a fixed number of independent trials where each trial results in ei-
ther success or failure use the binomial distribution. In the case where the data follow a bell-
shaped curve, the normal distribution is used to model the data.
Chapter ?? deals with confidence intervals, used when you want to make estimates involv-
ing one or two population means or proportions using a sample of data. Chapter ?? focuses on
testing someone’s claim about one or two popu- lation means or proportions these analyses
are called hypothesis tests. If your data set is small and follows a bell-shape, the t-distribution
might be in order; see Chapter 9.
Chapter ?? examines relationships between two numerical variables (such as height and
weight) using correlation and simple linear regression. Chapter 11 studies relationships be-
tween two categorical variables (where the data place individuals into groups, such as gender
and political affiliation).

3.5.5 Making Conclusions


Researchers perform analysis with computers, using formulas. But neither a computer nor
a formula knows whether it’s being used properly, and they don’t warn you when your results
are incorrect. At the end of the day, computers and formulas can’t tell you what the results
mean. It’s up to you.
One of the most common mistakes made in conclusions is to overstate the results, or to
generalize the results to a larger group than was actually represented by the study. For exam-
ple, a professor wants to know which Super Bowl commercials viewers liked best. She gathers
100 students from her class on Super Bowl Sunday and asks them to rate each commercial
as it is shown. A top 5 list is formed, and she concludes that Super Bowl viewers liked those
5 commercials the best. But she really only knows which ones her students liked best - she
didn’t study any other groups, so she can’t draw conclusions about all viewers.
Statistics is about much more than numbers. It’s important to understand how to make
appropriate conclusions from studying data, and that’s something I discuss throughout the
3.6. DESCRIPTIVE STATISTICS 90

course.

3.6 Descriptive Statistics


Descriptive statistics are numbers that summarize some characteristic about a set of data.
They provide you with easy-to-understand information that helps answer questions. They
also help researchers get a rough idea about what’s happening in their experiments so later
they can do more formal and targeted analyses. Descriptive statistics make a point clearly
and concisely.
In this section you see the essentials of calculating and evaluating common descriptive
statistics for measuring center and variability in a data set, as well as statistics to measure the
relative standing of a particular value within a data set.

3.6.1 Types of Data


Data come in a wide range of formats. For example, a survey might ask questions about
gender, race, or political affiliation, while other questions might be about age, income, or the
distance you drive to work each day. Different types of questions result in different types of
data to be collected and analyzed. The type of data you have determines the type of descrip-
tive statistics that can be found and interpreted. There are two main types of data: categorical
(or qualitative) data and numerical (or quantitative data). Categorical data record qualities or
characteristics about the individual, such as eye color, gender, political party, or opinion on
some issue (using categories such as agree, disagree, or no opinion). Numerical data record
measurements or counts regarding each individual, which may include weight, age, height,
or time to take an exam; counts may include number of pets, or the number of red lights you
hit on your way to work. The important difference between the two is that with categorical
data, any numbers involved do not have real numerical mean- ing (for example, using 1 for
male and 2 for female), while all numerical data represents actual numbers for which math
operations make sense.
A third type of data, ordinal data, falls in between, where data appear in categories, but
the categories have a meaningful order, such as ratings from 1 to 5, or class ranks of freshman
through senior. Ordinal data can be analyzed like categorical data, and the basic numerical
data techniques also apply when categories are represented by numbers that have meaning.

3.6.2 Counts and Percents


Categorical data place individuals into groups. For example, male/female, own your home/don’t
own, or Democrat/ Republican/Independent/Other. Categorical data often come from sur-
vey data, but they can also be collected in experi- ments. For example, in a test of a new med-
3.6. DESCRIPTIVE STATISTICS 91

ical treatment, researchers may use three categories to assess the outcome: Did the patient
get better, worse, or stay the same? Categorical data are typically summarized by reporting
either the number of individuals falling into each category, or the percentage of individuals
falling into each category. For example, pollsters may report the percentage of Republicans,
Democrats, Independents, and others who took part in a survey. To calculate the percentage
of individuals in a certain category, find the number of individuals in that category, divide by
the total number of people in the study, and then multiply by 100%. For example, if a survey
of 2,000 teenagers included 1,200 females and 800 males, the resulting percent- ages would
be (1,200 : 2,000) * 100% = 60% female and (800 : 2,000) * 100% = 40% male.
You can further break down categorical data by creating crosstabs. Crosstabs (also called
two-way tables) are tables with rows and columns. They summarize the information from
two categorical variables at once, such as gender and political party, so you can see (or easily
calculate) the percentage of individuals in each combination of categories. For example, if
you had data about the gender and political party of your respondents, you would be able to
look at the percentage of Republican females, Democratic males, and so on. In this example,
the total number of possible combinations in your table would be the total number of gender
categories times the total number of party affiliation categories. The U.S. government calcu-
lates and summarizes loads of categorical data using crosstabs. (see Chapter 11 for more on
two-way tables.) If you’re given the number of individuals in each category, you can always
calculate your own percents. But if you’re only given percentages without the total number in
the group, you can never retrieve the original number of individuals in each group. For exam-
ple, you might hear that 80% of people surveyed prefer Cheesy cheese crackers over Crummy
cheese crackers. But how many were surveyed? It could be only 10 people, for all you know,
because 8 out of 10 is 80%, just as 800 out of 1,000 is 80%. These two fractions (8 out of 10 and
800 out of 1,000) have different meanings for statisticians, because the first is based on very
little data, and the second is based on a lot of data. (See Chapter 7 for more information on
data accuracy and margin of error.)

3.6.3 Measures of Center


The most common way to summarize a numerical data set is to describe where the center
is. One way of thinking about what the center of a data set means is to ask, “What’s a typical
value?” Or, “Where is the middle of the data?” The center of a data set can be measured in
different ways, and the method chosen can greatly influence the conclusions people make
about the data. In this section I present the two most common measures of center: the mean
(or average) and the median.
The mean (or average) of a data set is simply the average of all the numbers. Its formula is
1
x= (x1 + . . . + xn ).
n
3.6. DESCRIPTIVE STATISTICS 92

When it comes to measures of center, the average doesn’t always tell the whole story and may
be a bit misleading. Take NBA salaries. Every year, a few top-notch players (like Shaq) make
much more money than anybody else. These are called outliers (numbers in the data set that
are extremely high or low compared to the rest). Because of the way the average is calcu-
lated, high outliers drive the average upward (as Shaq’s salary did in the preceding example).
Similarly, outliers that are extremely low tend to drive the average downward. What can you
report, other than the average, to show what the salary of a “typical” NBA player would be?
Another statistic used to measure the center of a data set is the median. The median of a data
set is the place that divides the data in half, once the data are ordered from smallest to largest.
It is denoted by M or x̃. To find the median of a data set:

1. Order the numbers from smallest to largest.

2. If the data set contains an odd number of numbers, the one exactly in the middle is the
median.

3. If the data set contains an even number of numbers, take the two numbers that appear
exactly in the middle and average them to find the median.

For example, take the data set 4, 2, 3, 1. First, order the numbers to get 1, 2, 3, 4. Then note this
data has an even number of numbers, so go to Step 3. Take the two numbers in the middle 2
and 3 and find their average: 2.5.
Note that if the data set is odd, the median will be one of the numbers in the data set itself.
However, if the data set is even, it may be one of the numbers (the data set 1, 2, 2, 3 has median
2); or it may not be, as the data set 4, 2, 3, 1 (whose median is 2.5) shows.
Chapter 4

Some useful distributions in statistics

4.1 Multivariate normal distribution


Definition 4.1.1. An Rn -valued random variable X = (X1 , . . . , Xn ) is Gaussian (or multivariate
normal) if every linear combination nj=1 aj Xj is either a constant or a (one-dimensional)
P

normal distributed random variable.

Characteristic function is a useful tool to study Gaussian random variables.

Theorem 4.1.2. X is an Rd -valued Gaussian random variable if and only if its characteristic
function has the form
 1 
ϕX (u) = exp ihu, µi − hu, Qui ,
2
n
where µ ∈ R and Q is an n × n symmetric nonnegative semi-definite matrix. Q is then the
covariance matrix of X and µ is the mean of X, that is

µj = E[Xj ], Qk,j = cov(Xk , Xj ), for all j and k.

Example 4.1.3. Let X1 , . . . , Xn be R-valued independent random variable with law N(µj , σj2 )
then X = (X1 , . . . , Xn ) is Gaussian. Moreover, for any constant matrix A ∈ Rm×n , then Y =
QX ∗ is an m-dimensional Gaussian random variable.

Proposition 4.1.4. Let X be an Rn -valued Gaussian random variable. The components Xj are
independent if and only if the covariance matrix Q of X is diagonal.

Proposition 4.1.5. Let X be an Rn-valued Gaussian random variable. X has a density on Rn if and only if the covariance matrix Q is non-degenerate (i.e., det Q ≠ 0), and in that case the probability density function of X is given by

fX(x) = (2π)^{−n/2} (det Q)^{−1/2} exp( −(1/2)⟨x − µ, Q^{−1}(x − µ)⟩ ).
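As an illustration, the density above can be evaluated directly and checked against scipy's implementation; the mean vector and covariance matrix below are arbitrary choices, not values taken from the notes.

import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary mean vector and symmetric positive definite covariance matrix.
mu = np.array([1.0, -2.0])
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def gaussian_pdf(x, mu, Q):
    # Density of Proposition 4.1.5, written directly from the formula.
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Q, diff)   # <x - mu, Q^{-1}(x - mu)>
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Q)))

x = np.array([0.5, -1.0])
print(gaussian_pdf(x, mu, Q))
print(multivariate_normal(mean=mu, cov=Q).pdf(x))   # the two values agree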


Figure 4.1: Density of the gamma distribution G(α, λ) for α = 7/8, 1, 2, 3 and λ = 1.

4.2 Gamma, chi-square, student and F distributions

4.2.1 Gamma distribution


Recall that a random variable X has a Gamma distribution G(α, λ) if its density is given by

fX(x) = x^{α−1} e^{−x/λ} / (Γ(α) λ^α) · I_{x>0}.

Note that G(1, λ) = Exp(λ).

Proposition 4.2.1. If X is G(α, λ) distributed, then

E[X] = αλ,  DX = αλ².

Moreover, the characteristic function of X is given by

ϕX(t) = ∫_0^∞ e^{itx} x^{α−1} e^{−x/λ} / (Γ(α) λ^α) dx = ( 1/(1 − iλt) )^α.

Corollary 4.2.2. Let (Xi)_{1≤i≤n} be a sequence of independent random variables. Suppose that Xi is G(αi, λ) distributed. Then S = X1 + · · · + Xn is G(α1 + · · · + αn, λ) distributed.

4.2.2 Chi-square distribution


Definition 4.2.3. Let (Zi )1≤i≤n be a sequence of independent, standard normal distributed
random variables. The distribution of V = Z12 + . . . + Zn2 is called chi-square distribution with
n degrees of freedom and is denoted by χ2n .

Figure 4.2: Density of the χ² distribution for n = 1, 2, 4, 6.

Note that since Zi² is G(1/2, 2) distributed, χ²_n is G(n/2, 2) distributed. Moreover,

E[χ²_n] = n,  Dχ²_n = 2n.

A notable consequence of the definition of the chi-square distribution is that if U and V are independent, U ∼ χ²_n and V ∼ χ²_m, then U + V ∼ χ²_{m+n}.

4.2.3 Student’s t distribution


Definition 4.2.4. If Z ∼ N(0; 1), U ∼ χ²_n, and Z and U are independent, then the distribution of Z/√(U/n) is called Student's t distribution with n degrees of freedom.
Student's t distribution is also called simply the t distribution.
A direct computation with the density gives the following result.
Proposition 4.2.5. The density function of Student's t distribution with n degrees of freedom is

fn(t) = Γ((n+1)/2) / ( √(nπ) Γ(n/2) ) · (1 + t²/n)^{−(n+1)/2}.

In addition,

fn(t) → (1/√(2π)) e^{−t²/2} as n → ∞.

4.2.4 F distribution
Definition 4.2.6. Let U and V be independent chi-square random variables with m and n
degrees of freedom, respectively. The distribution of
W = (U/m) / (V/n)

Figure 4.3: Density of Student's t distribution for n = 1, 2, 8, together with the standard normal density.

Figure 4.4: Density of the F distribution for (n, m) = (4, 4), (10, 4), (10, 10), (4, 10).

is called the F distribution with m and n degrees of freedom and is denoted by Fm,n .

The density of W is given by

f(x) = Γ((m+n)/2) / ( Γ(m/2) Γ(n/2) ) · (m/n)^{m/2} x^{m/2−1} ( 1 + (m/n)x )^{−(m+n)/2},  x ≥ 0.

4.3 Sample mean and sample variance


Let (Xn) be independent N(µ, σ²) random variables, and

X̄n = (1/n) ∑_{i=1}^n Xi,   s²n = (1/(n−1)) ∑_{i=1}^n (Xi − X̄n)².

Proposition 4.3.1. The random variable X n and the vector of random variables (X1 − X n , X2 −
X n , . . . , Xn − X n ) are independent.

Proof. We write

sX̄n + ∑_{i=1}^n ti(Xi − X̄n) = ∑_{i=1}^n ai Xi,

where ai = s/n + (ti − t̄). Note that

∑_{i=1}^n ai = s  and  ∑_{i=1}^n ai² = s²/n + ∑_{i=1}^n (ti − t̄)².

Therefore, the characteristic function of (X̄n, X1 − X̄n, X2 − X̄n, . . . , Xn − X̄n) is

E[ exp( isX̄n + i ∑_{j=1}^n tj(Xj − X̄n) ) ] = ∏_{j=1}^n exp( iµaj − (σ²/2) aj² )
= exp( iµs − (σ²/(2n)) s² ) · exp( −(σ²/2) ∑_{i=1}^n (ti − t̄)² ).

The first factor is the characteristic function of X̄n while the second factor is the characteristic function of (X1 − X̄n, X2 − X̄n, . . . , Xn − X̄n) (the latter is obtained by letting s = 0 in the formula). This implies the desired result.

Corollary 4.3.2. X n and s2n are independently distributed.

Theorem 4.3.3. The distribution of (n−1)s2n /σ 2 is the chi-square distribution with n−1 degrees
of freedom.

Proof. We first note that

(1/σ²) ∑_{i=1}^n (Xi − µ)² = ∑_{i=1}^n ( (Xi − µ)/σ )² ∼ χ²_n.

Also,

(1/σ²) ∑_{i=1}^n (Xi − µ)² = (1/σ²) ∑_{i=1}^n (Xi − X̄n)² + ( (X̄n − µ)/(σ/√n) )² =: U + V.

Denote the left-hand side by W, so W = U + V with W ∼ χ²_n and V ∼ χ²_1. Since U and V are independent, ϕW(t) = ϕU(t)ϕV(t). Since W and V both follow chi-square distributions,

ϕU(t) = ϕW(t)/ϕV(t) = (1 − i2t)^{−n/2} / (1 − i2t)^{−1/2} = (1 − i2t)^{−(n−1)/2}.

The last expression is the c.f. of a random variable with a χ²_{n−1} distribution.

We end up with the following result.

Corollary 4.3.4.

(X̄n − µ)/(sn/√n) ∼ t_{n−1}.
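A small Monte Carlo sketch of Theorem 4.3.3 and Corollary 4.3.4: for normal samples, (n − 1)s²n/σ² should behave like χ²_{n−1} and (X̄n − µ)/(sn/√n) like t_{n−1}. The parameter values and sample size below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)                 # sample variance with n - 1 in the denominator

u = (n - 1) * s2 / sigma**2                # should be chi-square with n - 1 df
t = (xbar - mu) / np.sqrt(s2 / n)          # should be Student t with n - 1 df

print(u.mean(), n - 1)                     # both close to n - 1
print(u.var(), 2 * (n - 1))                # both close to 2(n - 1)
print(t.var(), (n - 1) / (n - 3))          # variance of t_{n-1} is (n-1)/(n-3)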

4.4 Exercises
4.1. Show that

1. if X ∼ Fn,m then X −1 ∼ Fm,n .

2. if X ∼ tn then X 2 ∼ F1,n .

3. the Cauchy distribution and the t distribution with 1 degree of freedom are the same.

4. If X and Y are independent exponential random variables with λ = 1, then X/Y follows an F distribution.

4.2. Show how to use the chi-square distribution to calculate P(a < s2n /σ 2 < b).

4.3. Let X1, . . . , Xn be a sequence of independent and N(µX, σ²) distributed random variables, and let Y1, . . . , Ym be a sequence of independent and N(µY, σ²) distributed random variables. Show how to use the F distribution to find P( s²n(X)/s²m(Y) > c ) for some positive constant c.
4.4. Let W ∼ F_{n,m} and denote Y = m/(m + nW). Show that Y has a beta distribution.

4.5. Let X1 , X2 and X3 be three independent chi-square variables with r1 , r2 and r3 degrees of
freedom, respectively.

1. Show that Y1 = X1 /X2 and Y2 = X1 + X2 are independent.

2. Deduce that
X1 /r1 X2 /r3
and
X2 /r2 (X1 + X2 )/(r1 + r2 )
are independent F -variables.
Chapter 5

Parameter estimation

5.1 Samples and characteristic of sample


Definition 5.1.1 (Random samples). A sequence of random variables X1, . . . , Xn is called a random sample observed from a random variable X if

• (Xi )1≤i≤n are independent;

• Xi has the same distribution as X for all i = 1, . . . , n.

We call n the sample size.

Example 5.1.2. An urn contains m balls, labeled 1, 2, . . . , m, which are identical except for the number. The experiment is to choose a ball at random and record the number. Let X denote the number. Then the distribution of X is given by

P[X = k] = 1/m,  for k = 1, . . . , m.
In case m is unknown, to obtain information on m we take a sample of n balls, which we will
denote as X = (X1 , . . . , Xn ) where Xi is the number on the ith ball.
The sample can be drawn in several ways.

1. Sampling with replacement: We randomly select a ball, record its number and put it back into the urn. All the balls are then remixed, and the next ball is chosen. We can see that X1, . . . , Xn are mutually independent random variables and each has the same distribution as X. Hence (X1, . . . , Xn) is a random sample.

2. Sampling without replacement: Here n balls are selected at random. After a ball is se-
lected, we do not return it to the urn. The X1 , . . . , Xn are not independent, but each Xi
has the same distribution as X.

If m is much greater than n, the sampling schemes are practically the same.


Definition 5.1.3. The empirical distribution function is defined by

Fn(x) = (1/n) ∑_{i=1}^n I_{Xi<x},  x ∈ R.

1. Fn(x) is non-decreasing with respect to x;

2. Fn(x) is left continuous and has a right limit at every point;

3. lim_{x→−∞} Fn(x) = 0, lim_{x→+∞} Fn(x) = 1.


4. Fn(x) → F(x) almost surely for any x ∈ R. Indeed, applying the law of large numbers to the iid sequence I_{Xi<x}, we have

Fn(x) = (1/n) ∑_{i=1}^n I_{Xi<x} → E[I_{X1<x}] = F(x)  a.s.
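A short sketch of the empirical distribution function; note that the definition above uses the left-continuous version with the indicator I_{Xi<x}. The sample below is arbitrary.

import numpy as np

def ecdf(sample):
    # Empirical distribution function F_n(x) = (1/n) * #{i : X_i < x}.
    data = np.sort(np.asarray(sample))
    n = len(data)
    def Fn(x):
        # side='left' counts the observations strictly smaller than x
        return np.searchsorted(data, x, side="left") / n
    return Fn

Fn = ecdf([3.1, 0.5, 2.2, 2.2, 4.0])
print(Fn(2.2), Fn(10.0))   # 0.2 (only 0.5 is < 2.2) and 1.0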

Definition 5.1.4. Let (X1 , . . . , Xn ) be a random sample.

1. Sample mean

X̄n = (X1 + . . . + Xn)/n.

2. Population variance

S²n(X) = (1/n) ∑_{i=1}^n (Xi − X̄n)²,

and sample variance

s²n(X) = (1/(n−1)) ∑_{i=1}^n (Xi − X̄n)².

3. kth sample moment

mk = (1/n) ∑_{i=1}^n Xi^k,

and centralized kth sample moment

vk = (1/n) ∑_{i=1}^n (Xi − X̄n)^k.

4. Sample correlation coefficient of a 2-dimensional sample (X1, Y1), . . . , (Xn, Yn)

r = [ (1/n) ∑_{i=1}^n (Xi − X̄n)(Yi − Ȳn) ] / ( Sn(X) Sn(Y) ).

5. The sample mode is the most frequently occurring data value.



6. The sample median is a measure of central tendency that divides the data into two equal
parts, half below the median and half above. If the number of observations is even, the
median is halfway between the two central values. If the number of observations is odd,
the median is the central value.

7. When an ordered set of data is divided into four equal parts, the division points are called
quartiles. The first or lower quartile, q1 , is a value that has approximately 25% of the ob-
servations below it and approximately 75% of the observations above. The second quar-
tile, q2 , has approximately 50% of the observations below its value. The second quartile
is exactly equal to the median. The third or upper quartile, q3 , has approximately 75% of
the observations below its value.

8. The interquartile range is defined as IQR = q3 − q1 . The IQR is also used as a measure
of variability.
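These characteristics can be computed directly; below is a sketch with numpy for an arbitrary small data set. Note that software packages may use slightly different conventions for the quartiles.

import numpy as np
from collections import Counter

x = np.array([4.1, 3.7, 5.0, 4.4, 3.9, 4.4, 6.2, 4.8])
y = np.array([2.0, 1.8, 2.9, 2.3, 2.0, 2.4, 3.5, 2.6])

xbar = x.mean()                                  # sample mean
S2 = x.var(ddof=0)                               # population variance S_n^2 (divides by n)
s2 = x.var(ddof=1)                               # sample variance s_n^2 (divides by n - 1)
m2 = np.mean(x**2)                               # 2nd sample moment
v3 = np.mean((x - xbar)**3)                      # centralized 3rd sample moment
r = np.corrcoef(x, y)[0, 1]                      # sample correlation coefficient
mode = Counter(x.tolist()).most_common(1)[0][0]  # sample mode (most frequent value)
median = np.median(x)                            # sample median
q1, q2, q3 = np.percentile(x, [25, 50, 75])      # quartiles
iqr = q3 - q1                                    # interquartile range

print(xbar, S2, s2, r, mode, median, q1, q3, iqr)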

5.2 Data display


A well-constructed data display is essential to good statistical thinking, because it helps us explore important features of the data and provides insight about the type of model that should be used in solving the problem. In this section we briefly introduce some methods to display data.

5.2.1 Stem-and-leaf diagrams


A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set
x1 , x2 , . . . , xn , where each number xi consists of at least two digits. To construct a stem-and-
leaf diagram, use the following steps.

1. Divide each number xi into two parts: a stem, consisting of one or more of the leading
digits and a leaf, consisting of the remaining digit.

2. List the stem values in a vertical column.

3. Record the leaf for each observation beside its stem.

4. Write the units for stems and leaves on the display.

It is usually best to choose between 5 and 20 stems.
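A rough sketch of this construction in Python, taking the integer part of each value as the stem and the tenths digit as the leaf (the convention used in the example below).

from collections import defaultdict

def stem_and_leaf(data):
    # Stem = integer part, leaf = tenths digit (e.g. 59.5 -> stem 59, leaf 5).
    stems = defaultdict(list)
    for value in sorted(data):
        stem = int(value)
        leaf = int(round(10 * (value - stem)))
        stems[stem].append(leaf)
    for stem in sorted(stems):
        leaves = stems[stem]
        print(f"{stem:>3} | {' '.join(map(str, leaves))}  ({len(leaves)})")

stem_and_leaf([59.0, 59.5, 52.7, 47.9, 55.7, 48.3, 52.1, 53.1, 55.2, 45.3])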

Example 5.2.1. The weights of 80 students are given in the following table.

59.0 59.5 52.7 47.9 55.7 48.3 52.1 53.1 55.2 45.3
46.5 54.8 48.4 53.1 56.9 47.4 50.2 52.1 49.6 46.4
52.9 41.1 51.0 50.0 56.8 45.9 59.5 52.8 46.7 55.7
48.6 51.6 53.2 54.1 45.8 50.4 54.1 52.0 56.2 62.7
62.0 46.8 54.6 54.7 50.2 45.9 49.1 42.6 49.8 52.1
56.5 53.5 46.5 51.9 46.5 53.5 45.5 50.2 55.1 49.6
47.6 44.8 55.0 56.2 49.4 57.0 52.4 48.4 55.0 47.1
52.4 56.8 53.2 50.5 56.6 49.5 53.1 51.2 55.5 53.7

Construct a stem-and-leaf diagram for their weight.

Stem  Leaf                 Frequency
41    1                    1
42    6                    1
44    8                    1
45    3 5 8 9 9            5
46    4 5 5 5 7 8          6
47    1 4 6 9              4
48    3 4 4 6              4
49    1 4 5 6 6 8          6
50    0 2 2 2 4 5          6
51    0 2 6 9              4
52    0 1 1 1 4 4 7 8 9    9
53    1 1 1 2 2 5 5 7      8
54    1 1 6 7 8            5
55    0 0 1 2 5 7 7        7
56    2 2 5 6 8 8 9        7
57    0                    1
59    0 5 5                3
62    0 7                  2

(Stem: integer part; leaf: tenths digit.)

5.2.2 Frequency distribution and histogram


A frequency distribution is a more compact summary of data than a stem-and-leaf dia-
gram. To construct a frequency distribution, we must divide the range of the data into inter-
vals, which are usually called class intervals, cells, or bins. If possible, the bins should be of
equal width in order to enhance the visual information in the frequency distribution. Some
judgment must be used in selecting the number of bins so that a reasonable display can be
developed. The number of bins depends on the number of observations and the amount
of scatter or dispersion in the data. A frequency distribution that uses either too few or too
many bins will not be informative. We usually find that between 5 and 20 bins is satisfactory

in most cases and that the number of bins should increase with n. Choosing the number of
bins approximately equal to the square root of the number of observations often works well
in practice.
The histogram is a visual display of the frequency distribution. The stages for constructing
a histogram follow.

1. Label the bin (class interval) boundaries on a horizontal scale.

2. Mark and label the vertical scale with the frequencies or the relative frequencies.

3. Above each bin, draw a rectangle whose height is equal to the frequency (or relative frequency) corresponding to that bin.
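A minimal sketch of these steps with matplotlib, using the square-root rule for the number of bins; the file name is a hypothetical placeholder for the 80 recorded weights.

import math
import numpy as np
import matplotlib.pyplot as plt

weights = np.loadtxt("weights.txt")        # hypothetical file containing the 80 weights
bins = round(math.sqrt(len(weights)))      # roughly sqrt(n) bins

plt.hist(weights, bins=bins, edgecolor="black")
plt.xlabel("weight")
plt.ylabel("No. of students")
plt.title("Histogram of weight")
plt.show()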

Example 5.2.2. Histogram of the students’ weight given in Example 5.2.1.

[Figure: histogram of the students' weight; horizontal axis: weight (40 to 60), vertical axis: number of students (0 to 15).]

5.2.3 Box plots


The box plot is a graphical display that simultaneously describes several important fea-
tures of a data set, such as center, spread, departure from symmetry, and identification of
unusual observations or outliers.
A box plot displays the three quartiles, the minimum, and the maximum of the data on a rect-
angular box, aligned either horizontally or vertically. The box encloses the interquartile range
with the left (or lower) edge at the first quartile, q1 , and the right (or upper) edge at the third
quartile, q3 . A line is drawn through the box at the second quartile (which is the 50th percentile
or the median). A line, or whisker, extends from each end of the box. The lower whisker is a
line from the first quartile to the smallest data point within 1.5 interquartile ranges from the
first quartile. The upper whisker is a line from the third quartile to the largest data point within
1.5 interquartile ranges from the third quartile. Data farther from the box than the whiskers
are plotted as individual points. A point beyond a whisker, but less than 3 interquartile ranges

from the box edge, is called an outlier. A point more than 3 interquartile ranges from the box
edge is called an extreme outlier.
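A sketch computing the quantities a box plot displays, following the description above; quartile values may differ slightly from those quoted in the examples because of different quartile conventions.

import numpy as np

def box_plot_summary(data):
    x = np.asarray(data)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_whisker = x[x >= q1 - 1.5 * iqr].min()   # smallest point within 1.5 IQR of q1
    upper_whisker = x[x <= q3 + 1.5 * iqr].max()   # largest point within 1.5 IQR of q3
    outliers = x[(x < lower_whisker) | (x > upper_whisker)]
    return q1, q2, q3, lower_whisker, upper_whisker, outliers

print(box_plot_summary([158.7, 167.6, 164.0, 153.1, 179.3, 153.0, 170.6,
                        152.4, 161.5, 146.7, 147.2, 158.2, 157.7, 161.8,
                        168.4, 151.2, 158.7, 161.0, 147.9, 155.5]))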

Example 5.2.3. Consider the sample in Example 5.2.1. The quartiles of the sample are q1 = 48.40, q2 = 52.10, q3 = 54.85. Below is the box plot of the students' weight.

[Figure: box plot of the students' weight; vertical axis from 45 to 60.]

Example 5.2.4. Construct a box plot of the following data.

158.7 167.6 164.0 153.1 179.3 153.0 170.6 152.4 161.5 146.7
147.2 158.2 157.7 161.8 168.4 151.2 158.7 161.0 147.9 155.5

The quartiles of this sample are q1 = 152.85, q2 = 158.45, q3 = 162.35.

[Figure: box plot of the data; vertical axis from 150 to 180.]
5.2.4 Probability plots


How do we know if a particular probability distribution is a reasonable model for data?
Some of the visual displays we have used earlier, such as the histogram, can provide insight

about the form of the underlying distribution. However, histograms are usually not reliable indicators of the distribution form unless the sample size is very large. Probability plot-
ting is a graphical method for determining whether sample data conform to a hypothesized
distribution based on a subjective visual examination of the data. The general procedure is
very simple and can be performed quickly. It is also more reliable than the histogram for small
to moderate size samples.
To construct a probability plot, the observations in the sample are first ranked from small-
est to largest. That is, the sample x1, x2, . . . , xn is arranged as x(1) ≤ x(2) ≤ . . . ≤ x(n). The
ordered observations x(j) are then plotted against their observed cumulative frequency (j −
0.5)/n. If the hypothesized distribution adequately describes the data, the plotted points will fall approximately along a straight line (often judged using the line through the 25th and 75th percentile points); if the plotted points deviate significantly from a straight line, the hypothe-
sized model is not appropriate. Usually, the determination of whether or not the data plot as
a straight line is subjective.
In particular, a normal probability plot can be constructed by plotting the standardized normal scores zj = Φ^{−1}((j − 0.5)/n) against x(j).
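A sketch of this construction for the sample of Example 5.2.5 below: the ordered observations are plotted against the normal scores.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

sample = np.array([2.86, 3.33, 3.43, 3.77, 4.16, 3.52, 3.56, 3.63, 2.43, 2.78])

x_ordered = np.sort(sample)
n = len(sample)
z = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # z_j = Phi^{-1}((j - 0.5)/n)

plt.scatter(z, x_ordered)
plt.xlabel("normal score z_j")
plt.ylabel("ordered observation x_(j)")
plt.show()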

Example 5.2.5. Consider the following sample:

2.86, 3.33, 3.43, 3.77, 4.16, 3.52, 3.56, 3.63, 2.43, 2.78.

We construct a normal probability plot for this sample as follows.


 
j    x(j)    (j − 0.5)/10    zj = Φ^{−1}((j − 0.5)/10)

1 2.43 0.05 -1.64


2 2.78 0.15 -1.04
3 2.86 0.25 -0.67
4 3.33 0.35 -0.39
5 3.43 0.45 -0.13
6 3.52 0.55 0.13
7 3.56 0.65 0.39
8 3.63 0.75 0.67
9 3.77 0.85 1.04
10 4.16 0.95 1.64

Since all the points are very close to the straight line, one may conclude that a normal distri-
bution adequately describes the data.

Remark 4. This is a very subjective method. Please use it at your own risk! Later we will introduce the Shapiro and Wilcoxon tests for the normal distribution hypothesis.

5.3 Point estimations

5.3.1 Statistics
Example 5.3.1. We continue Example 5.1.2. Recall that we do not know the number of balls
m and have to use the sample (X1 , . . . , Xn ) to obtain information about m.
Since E(X) = (m + 1)/2, using the law of large numbers, we have

(X1 + . . . + Xn)/n → (m + 1)/2  a.s.

Therefore, we get a first estimator for m given by

m̂n := 2(X1 + . . . + Xn)/n − 1 → m  a.s.

Another estimator for m is defined by

m̃n := max{X1, . . . , Xn}.

Since

P[m̃n ≠ m] = P[X1 < m, . . . , Xn < m] = ∏_{i=1}^n P[Xi < m] = ((m − 1)/m)^n → 0

as n → ∞, we have m̃n → m almost surely.
The estimators m̂n and m̃n are called statistics: they depend only on the observations X1, . . . , Xn, not on m.

Definition 5.3.2. Let X = (X1 , . . . , Xn ) be a sample observed from X and (T, BT ) a measur-
able space. Then any function ϕ(X) = ϕ(X1 , . . . , Xn ), where ϕ : (Rn , B(Rn )) → (T, BT ) is a
measurable function, of the sample is called a statistic.

In the following we only consider the case that (T, BT ) is a subset of (Rd , B(Rd )).

Definition 5.3.3. Let X = (X1 , . . . , Xn ) be a sample observed from a distribution with density
f (x; θ), θ ∈ Θ. Let Y = ϕ(X) be a statistic with density fY (y; θ). Then Y is called a sufficient
statistic for θ if
f (x; θ)
= H(x),
fY (ϕ(x); θ)
where x = (x1 , . . . , xn ), f (x; θ) is density of X at x, and H(x) does not depend on θ ∈ Θ.

Example 5.3.4. Let (X1, . . . , Xn) be a sample observed from a Poisson distribution with parameter λ > 0. Then Yn = ϕ(X) = X1 + . . . + Xn has a Poisson distribution with parameter nλ. Hence

f(X; λ)/f_{Yn}(Yn; λ) = [ ∏_{i=1}^n f(Xi; λ) ] / f_{Yn}(Yn; λ) = [ e^{−nλ} λ^{∑ Xi} / ∏_{i=1}^n Xi! ] · [ Yn! / ( e^{−nλ}(nλ)^{Yn} ) ] = Yn! / ( n^{Yn} ∏_{i=1}^n Xi! ).

Therefore Yn is a sufficient statistic for λ.

In order to directly verify the sufficiency of statistic ϕ(X) we need to know the density of
ϕ(X) which is not always the case in practice. We next introduce the following criterion of
Neyman to overcome this difficulty.

Theorem 5.3.5. Let X = (X1 , . . . , Xn ) be a random sample from a distribution that


has density f (x; θ), θ ∈ Θ. The statistic Y1 = ϕ(X) is a sufficient statistic for θ iff we can
find two nonnegative functions k1 and k2 such that

f (x; θ) = k1 (ϕ(x); θ)k2 (x) (5.1)

where k2 does not depend upon θ.

Example 5.3.6. Let (X1, . . . , Xn) be a sample from the normal distribution N(θ, 1) with θ ∈ Θ = R. Denote x̄ = (1/n) ∑_{i=1}^n xi. The joint density of X1, . . . , Xn at (x1, . . . , xn) is given by

(1/(2π)^{n/2}) exp( −∑_{i=1}^n (xi − θ)²/2 ) = exp( −n(x̄ − θ)²/2 ) · [ exp( −∑_{i=1}^n (xi − x̄)²/2 ) / (2π)^{n/2} ].

Since the first factor on the right-hand side depends upon x1, . . . , xn only through x̄ and the second factor does not depend upon θ, the factorization theorem implies that the mean X̄ of the sample is a sufficient statistic for θ, the mean of the normal distribution.

5.3.2 Point estimators


Let X = (X1, . . . , Xn) be a sample from a distribution F(x; θ) which depends on an unknown parameter θ ∈ Θ. Even though the function ϕ does not depend on the unknown parameter θ, the statistic ϕ(X) may convey information about θ. In such cases, we call the statistic a point estimator of θ.

Definition 5.3.7. A statistic ϕ(X1 , . . . , Xn ) is called

1. an unbiased estimator of θ if Eθ [ϕ(X1 , . . . , Xn )] = θ;

2. an asymptotic unbiased estimator of θ if limn→∞ Eθ [ϕ(X1 , . . . , Xn )] = θ;



3. a best unbiased estimator of θ if

(a) Eθ [ϕ(X1 , . . . , Xn )] = θ;
(b) Dθ ϕ(X1 , . . . , Xn ) ≤ Dθ ϕ̄(X1 , . . . , Xn ) for any unbiased estimator ϕ̄(X1 , . . . , Xn ) of θ.

4. a consistent estimator of θ if

ϕ(X1, . . . , Xn) → θ in Pθ-probability as n → ∞.

Here Eθ, Dθ and Pθ denote the expectation, variance and probability under the condition that the distribution of Xi is F(x; θ).

Example 5.3.8. Let (X1 , . . . , Xn ) be a sample from normal distribution N (a, σ 2 ). Using the
linearity of expectation and laws of large numbers, we have

• X̄n is an unbiased estimator of a;

• s2n (X) is an unbiased and consistent estimator of σ 2 .

• Sn2 (X) is an asymptotic unbiased and consistent estimator of σ 2 .

Example 5.3.9. In Example 5.3.1, both m̂n and m̃n are consistent estimators of m. Moreover,
m̂n is unbiased and m̃n is asymptotic unbiased.

5.3.3 Confidence intervals


Let X be a random variable whose density is f (x, θ), θ ∈ Θ, where θ is unknown. In the last
section, we discussed estimating θ by a statistic ϕ(X1 , . . . , Xn ) where X1 , . . . , Xn is a sample
from the distribution of X. When the sample is drawn, it is unlikely that the value of ϕ is the
true value of the parameter. In fact, if ϕ has a continuous distribution then Pθ [ϕ = θ] = 0.
What is needed is an estimate of the error of the estimation.

Example 5.3.10. Let (X1, . . . , Xn) be a sample from the normal distribution N(a; σ²) where σ² is known. We know that X̄n is an unbiased, consistent estimator of a. But how close is X̄n to a? Since X̄n ∼ N(a; σ²/n), the quantity (X̄n − a)/(σ/√n) has a standard normal N(0; 1) distribution. Therefore,

0.954 = P[ −2 < (X̄n − a)/(σ/√n) < 2 ] = P[ X̄n − 2σ/√n < a < X̄n + 2σ/√n ].   (5.2)

Expression (5.2) says that before the sample is drawn the probability that a belongs to the random interval (X̄n − 2σ/√n, X̄n + 2σ/√n) is 0.954. After the sample is drawn the realized interval (x̄n − 2σ/√n, x̄n + 2σ/√n) has either trapped a or it has not. But because of the high probability of success before the sample is drawn, we call the interval (X̄n − 2σ/√n,

X̄n + 2 √σn a 95.4% confidence interval for a. We can say, with some confidence, that x̄ is within
2 √σn from a. The number 0.954 = 95.4% is called a confidence coefficient. Instead of using 2,
we could use, say, 1.645, 1.96 or 2.576 to obtain 90%, 95% or 99% confidence intervals for a.
Note that the lengths of these confidence intervals increase as the confidence increases; i.e.,
the increase in confidence implies a loss in precision. On the other hand, for any confidence
coefficient, an increase in sample size leads to shorter confidence intervals.

In the following, thanks to the central limit theorem, we present a general method to find confidence intervals for parameters of a large class of distributions. To avoid confusion, let θ0 denote the true, unknown value of the parameter θ. Suppose ϕ is an estimator of θ0 such that

√n(ϕ − θ0) →w N(0, σϕ²).   (5.3)

The parameter σϕ² is the asymptotic variance of √n ϕ and, in practice, it is usually unknown. For the present, though, we suppose that σϕ² is known.
Let Z = √n(ϕ − θ0)/σϕ be the standardized random variable. Then Z is asymptotically N(0, 1). Hence, P[−1.96 < Z < 1.96] = 0.95. This implies

0.95 = P[ ϕ − 1.96 σϕ/√n < θ0 < ϕ + 1.96 σϕ/√n ].   (5.4)

Because the interval (ϕ − 1.96 σϕ/√n, ϕ + 1.96 σϕ/√n) is a function of the random variable ϕ, we call it a random interval. The probability that the random interval contains θ0 is approximately 0.95.
In practice, we often do not know σϕ. Suppose that there exists a consistent estimator of σϕ, say Sϕ. It then follows from Slutsky's theorem that

√n(ϕ − θ0)/Sϕ →w N(0, 1).

Hence the interval (ϕ − 1.96 Sϕ/√n, ϕ + 1.96 Sϕ/√n) is a random interval with approximate probability 0.95 of covering θ0.
In general, we have the following definition.

Definition 5.3.11. Let (X1 , . . . , Xn ) be a sample from a distribution F (x, θ), θ ∈ Θ. A random
interval (ϕ1, ϕ2), where ϕ1 and ϕ2 are estimators of θ, is called a (1 − α)-confidence interval
for θ if
P(ϕ1 < θ < ϕ2 ) = 1 − α,
for some α ∈ [0, 1].

Confidence interval for the mean a

Let X1, . . . , Xn be a random sample from the distribution of a random variable X which has unknown mean a and unknown variance σ². Let X̄ and s² be the sample mean and sample variance, respectively. By the central limit theorem, the distribution of √n(X̄ − a)/s is approximately N(0; 1). Hence, an approximate (1 − α) confidence interval for a is

( x̄ − z_{α/2} s/√n, x̄ + z_{α/2} s/√n ),   (5.5)

where z_{α/2} = Φ^{−1}(1 − α/2).

1. Because α < α′ implies that z_{α/2} > z_{α′/2}, selecting a higher confidence coefficient leads to a larger error term and hence a longer confidence interval, assuming all else remains the same.

2. Choosing a larger sample size decreases the error part and hence leads to shorter confidence intervals, assuming all else stays the same.

3. Usually the parameter σ is some type of scale parameter of the underlying distribution. In these situations, assuming all else remains the same, an increase in scale (noise level) generally results in a larger error term and, hence, a longer confidence interval.
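A sketch of the approximate interval (5.5), using scipy for the normal quantile z_{α/2}; the function name is only illustrative.

import numpy as np
from scipy.stats import norm

def mean_ci_large_sample(x, alpha=0.05):
    # Approximate (1 - alpha) confidence interval (5.5) for the mean.
    x = np.asarray(x)
    n = len(x)
    xbar = x.mean()
    s = x.std(ddof=1)                    # sample standard deviation
    z = norm.ppf(1 - alpha / 2)          # z_{alpha/2}
    half_width = z * s / np.sqrt(n)
    return xbar - half_width, xbar + half_width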

Confidence interval for p

Let X1, . . . , Xn be a random sample from the Bernoulli distribution with probability of success p. Let p̂ = X̄ be the sample proportion of successes. It follows from the central limit theorem that p̂ has an approximate N(p; p(1 − p)/n) distribution. Since p̂ and p̂(1 − p̂) are consistent estimators of p and p(1 − p), respectively, an approximate (1 − α) confidence interval for p is given by

( p̂ − z_{α/2} √( p̂(1 − p̂)/n ), p̂ + z_{α/2} √( p̂(1 − p̂)/n ) ).

Confidence interval for mean of normal distribution

In general, the confidence intervals developed so far in this section are approximate. They
are based on the Central Limit Theorem and also, often require a consistent estimate of the
variance. In our next example, we develop an exact confidence interval for the mean when
sampling from a normal distribution

Theorem 5.3.12. Let X1, . . . , Xn be a random sample from a N(a, σ²) distribution. Recall that X̄ and s² are the sample mean and sample variance, respectively. The random

variable T = (X̄ − a)/(s/ n) has a t-distribution with n − 1 degrees of freedom. a
a
In statistics, the t-distribution was first derived as a posterior distribution in 1876 by Helmert and
Lüroth. The t-distribution also appeared in a more general form as Pearson Type IV distribution in
Karl Pearson’s 1895 paper.
In the English-language literature the distribution takes its name from William Sealy Gosset’s 1908
paper in Biometrika under the pseudonym “Student”.

For each α ∈ (0, 1), let t_{α/2,n−1} satisfy

α/2 = P[ T > t_{α/2,n−1} ].

Thanks to the symmetry of the t-distribution, we have

1 − α = P[ −t_{α/2,n−1} < T < t_{α/2,n−1} ] = P[ −t_{α/2,n−1} < (X̄ − a)/(s/√n) < t_{α/2,n−1} ]
      = P[ X̄ − t_{α/2,n−1} s/√n < a < X̄ + t_{α/2,n−1} s/√n ].

Thus, a (1 − α) confidence interval for a is given by

( X̄ − t_{α/2,n−1} s/√n, X̄ + t_{α/2,n−1} s/√n ).   (5.6)

Note that the only difference between this confidence interval and the large sample confi-
dence interval (5.5) is that tα/2,n−1 replaces zα/2 . This one is exact while (5.5) is approximate.
Of course, we have to assume we are sampling a normal population to get the exactness. In
practice, we often do not know if the population is normal. Which confidence interval should
we use? Generally, for the same α, the intervals based on tα/2,n−1 are larger than those based
on zα/2 . Hence, the interval (5.6) is generally more conservative than the interval (5.5). So in
practice, statisticians generally prefer the interval (5.6).
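A sketch of the exact interval (5.6): the same computation as before with z_{α/2} replaced by t_{α/2,n−1}; again the function name is only illustrative.

import numpy as np
from scipy.stats import t

def mean_ci_normal(x, alpha=0.05):
    # Exact (1 - alpha) confidence interval (5.6) for the mean of a normal sample.
    x = np.asarray(x)
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    tq = t.ppf(1 - alpha / 2, df=n - 1)   # t_{alpha/2, n-1}
    half_width = tq * s / np.sqrt(n)
    return xbar - half_width, xbar + half_width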

Confidence interval on the variance and standard deviation of a normal population

Sometimes confidence intervals on the population variance or standard deviation are needed.
When the population is modelled by a normal distribution, the tests and intervals described
in this section are applicable. The following result provides the basis of constructing these
confidence intervals.

Theorem 5.3.13. Let (X1, X2, . . . , Xn) be a random sample from a normal distribution with mean µ and variance σ², and let s² be the sample variance, i.e.,

s² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)².

Then the random variable

χ²_{n−1} = (n − 1)s²/σ²

has a χ²-distribution with n − 1 degrees of freedom.

We recall that the pdf of a χ² random variable with k degrees of freedom is

f(x) = 1/(2^{k/2} Γ(k/2)) · x^{k/2−1} e^{−x/2},  x > 0.

Theorem 5.3.13 leads to the following construction of the CI for σ 2 .

Theorem 5.3.14. If s2 is the sample variance from a random sample of n observation from a
normal distribution with unknown variance σ 2 , then a 100(1 − α)% CI on σ 2 is

(n − 1)s2 2 (n − 1)s2
≤σ ≤ 2 ,
c2α/2,n−1 c1−α/2,n−1

where c2a,n−1 satisfies P(χ2n−1 > c2a,n−1 ) = a and the random variable χ2n−1 has a chi-square dis-
tribution with n − 1 degrees of freedom.

Confidence interval for differences in means

A practical problem of interest is the comparison of two distributions; that is, comparing the distributions of two random variables, say X and Y. In this section, we compare the means of X and Y. Denote the means of X and Y by aX and aY, respectively. In particular, we shall obtain confidence intervals for the difference ∆ = aX − aY. Assume that the variances of X and Y are finite and denote them by σ²X = Var(X) and σ²Y = Var(Y). Let X1, . . . , Xn be a random sample from the distribution of X and let Y1, . . . , Ym be a random sample from the distribution of Y. Assume that the samples were gathered independently of one another. Let X̄ and Ȳ be the sample means of X and Y, respectively. Let ∆̂ = X̄ − Ȳ. Next we obtain a large sample confidence interval for ∆ based on the asymptotic distribution of ∆̂.

Proposition 5.3.15. Let N = n + m denote the total sample size. We suppose that

n/N → λX and m/N → λY, where λX + λY = 1.

Then a (1 − α) confidence interval for ∆ is

1. (if σ²X and σ²Y are known)

( (X̄ − Ȳ) − z_{α/2} √( σ²X/n + σ²Y/m ), (X̄ − Ȳ) + z_{α/2} √( σ²X/n + σ²Y/m ) );   (5.7)

2. (if σ²X and σ²Y are unknown)

( (X̄ − Ȳ) − z_{α/2} √( s²(X)/n + s²(Y)/m ), (X̄ − Ȳ) + z_{α/2} √( s²(X)/n + s²(Y)/m ) ),   (5.8)

where s²(X) and s²(Y) are the sample variances of (Xn) and (Ym), respectively.
Proof. It follows from the central limit theorem that √n(X̄ − aX) →w N(0; σ²X). Thus,

√N(X̄ − aX) →w N(0; σ²X/λX).

Likewise,

√N(Ȳ − aY) →w N(0; σ²Y/λY).

Since the samples are independent of one another, we have

√N( (X̄ − Ȳ) − (aX − aY) ) →w N(0; σ²X/λX + σ²Y/λY).

This implies (5.7). Since s²(X) and s²(Y) are consistent estimators of σ²X and σ²Y, applying Slutsky's theorem, we obtain (5.8).

Confidence interval for difference in proportions

Let X and Y be two independent random variables with Bernoulli distributions B(1, p1 )
and B(1, p2 ), respectively. Let X1 , . . . , Xn be a random sample from the distribution of X and
let Y1 , . . . , Ym be a random sample from the distribution of Y .

Proposition 5.3.16. A (1 − α) confidence interval for p1 − p2 is

( p̂1 − p̂2 − z_{α/2} √( p̂1(1 − p̂1)/n + p̂2(1 − p̂2)/m ), p̂1 − p̂2 + z_{α/2} √( p̂1(1 − p̂1)/n + p̂2(1 − p̂2)/m ) ),

where p̂1 = X̄ and p̂2 = Ȳ.

5.4 Method of finding estimation

5.4.1 Maximum likelihood estimation


The method of maximum likelihood is one of the most popular techniques for deriving estimators. Let X = (X1, . . . , Xn) be a random sample from a distribution with pdf/pmf f(x; θ). The likelihood function is defined by

L(x; θ) = ∏_{i=1}^n f(xi; θ).

Definition 5.4.1. For each sample point x, let θ̂(x) be a parameter value at which L(x; θ) at-
tains its maximum as a function of θ, with x held fixed. A maximum likelihood estimator
(MLE) of the parameter θ based on a sample X is θ̂(X).

Example 5.4.2. Let (X1, . . . , Xn) be a random sample from the distribution N(θ, 1), where θ is unknown. We have

L(x; θ) = ∏_{i=1}^n (1/√(2π)) e^{−(xi−θ)²/2}.

A simple calculation shows that the MLE of θ is θ̂ = (1/n) ∑_{i=1}^n xi. One can easily verify that θ̂ is an unbiased and consistent estimator of θ.

Example 5.4.3. Let (X1, . . . , Xn) be a random sample from the Bernoulli distribution with an unknown parameter p. The likelihood function is

L(x; p) = ∏_{i=1}^n p^{xi}(1 − p)^{1−xi}.

A simple calculation shows that the MLE of p is p̂ = (1/n) ∑_{i=1}^n xi. One can easily verify that p̂ is an unbiased and consistent estimator of p.
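When the likelihood cannot be maximized in closed form, the negative log-likelihood is typically minimized numerically. Below is a sketch for a normal sample with both mean and standard deviation unknown, using scipy.optimize.minimize on simulated data; parameterizing through log σ is only a convenience to keep σ positive.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)      # simulated sample

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                   - (x - mu) ** 2 / (2 * sigma ** 2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)      # close to the sample mean and the MLE of sigma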

Let X1 , . . . , Xn denote a random sample from the distribution with pdf f (x; θ), θ ∈ Θ. Let
θ0 denote the true value of θ. The following theorem gives a theoretical reason for maximizing
the likelihood function. It says that the maximum of L(θ) asymptotically separates the true
model at θ0 from models at θ ≠ θ0.

Theorem 5.4.4. Suppose that

(R0) f(·; θ) ≠ f(·; θ0) for all θ ≠ θ0;

(R1) all f (.; θ), θ ∈ Θ have common support for all θ.

Then

lim_{n→∞} Pθ0 [L(X; θ0) > L(X; θ)] = 1, for all θ ≠ θ0.

Proof. By taking logs, we have

Pθ0 [L(X; θ0) > L(X; θ)] = Pθ0 [ (1/n) ∑_{i=1}^n ln( f(Xi; θ)/f(Xi; θ0) ) < 0 ].

Since the function φ(x) = − ln x is strictly convex, it follows from the law of large numbers and Jensen's inequality that

(1/n) ∑_{i=1}^n ln( f(Xi; θ)/f(Xi; θ0) ) → Eθ0 [ ln( f(X1; θ)/f(X1; θ0) ) ] < ln Eθ0 [ f(X1; θ)/f(X1; θ0) ] = 0  in probability.

Note that condition f (.; θ) 6= f (.; θ0 ) for all θ 6= θ0 is needed to obtain the last strict inequality
while the common support is needed to obtain the last equality.

Theorem 5.4.4 says that asymptotically the likelihood function is maximized at the true value
θ0 . So in considering estimates of θ0 , it seems natural to consider the value of θ which maxi-
mizes the likelihood.
We close this section by showing that maximum likelihood estimators, under regularity
conditions, are consistent estimators.

Theorem 5.4.5. Suppose that the pdfs f (x, θ) satisfying (R0), (R1) and

(R2) The point θ0 is an interior point in Θ.

(R3) f (x; θ) is differentiable with respect to θ in Θ.

Then the likelihood equation,

(∂/∂θ) L(θ) = 0 ⇔ (∂/∂θ) ln L(θ) = 0,

has a solution θ̂n such that θ̂n → θ0 in probability.

5.4.2 Method of Moments


Let (X1 , . . . , Xn ) be a random sample from a distribution with density f (x; θ) where θ =
(θ1 , . . . , θk ) ∈ Θ ⊂ Rk . Method of moments estimators are found by equating the first k sample
moments to the corresponding k population moments, and solving the resulting system of
simultaneous equations. More precisely, define

µj = E[X j ] = gj (θ1 , . . . , θk ), j = 1, . . . , k.

and

mj = (1/n) ∑_{i=1}^n Xi^j.

The moments estimator (θ̂1 , . . . , θ̂k ) is obtained by solving the system of equations

mj = gj (θ1 , . . . , θk ), j = 1, . . . , k.

Example 5.4.6 (Binomial distribution). Let (X1 , . . . , Xn ) be a random sample from the Bino-
mial distribution B(k, p), that is,

Pk,p [Xi = x] = Ckx px (1 − p)k−x , x = 0, 1, . . . , k.



Here we assume that p and k are unknown parameters. Equating the first two sample moments to those of the population yields

X̄n = kp,   (1/n) ∑_{i=1}^n Xi² = kp(1 − p) + k²p²,

which is equivalent to

k̂ = X̄n² / ( X̄n − (1/n) ∑_{i=1}^n (Xi − X̄n)² ),   p̂ = X̄n / k̂.
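A short numerical sketch of these moment equations on simulated data, so that the estimates can be compared with the true values k = 10 and p = 0.3 used in the simulation.

import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(n=10, p=0.3, size=500)     # simulated Binomial(k = 10, p = 0.3) sample

xbar = x.mean()
S2 = x.var(ddof=0)                          # (1/n) sum (X_i - Xbar)^2

k_hat = xbar**2 / (xbar - S2)               # moment estimator of k
p_hat = xbar / k_hat                        # moment estimator of p
print(k_hat, p_hat)                         # roughly 10 and 0.3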

5.5 Lower bound for variance


In this section we establish a remarkable inequality called the Rao-Cramer lower bound
which gives a lower bound on the variance of any unbiased estimate. We then show that,
under regularity conditions, the variances of the maximum likelihood estimates achieve this
lower bound asymptotically.

Theorem 5.5.1 (Rao-Cramer Lower Bound). Let X1 , . . . , Xn be iid with common pdf
f (x; θ) for θ ∈ Θ. Assume that the regularity conditions (R0)-(R2) hold. Moreover,
suppose that

(R4) The pdf f (x; θ) is twice differentiable as a function of θ.


(R5) The integral ∫ f(x; θ) dx can be differentiated twice under the integral sign as a function of θ.

Let Y = u(X1, . . . , Xn) be a statistic with mean E[Y] = E[u(X1, . . . , Xn)] = k(θ). Then

DY ≥ [k′(θ)]² / (nI(θ)),

where I(θ) is called the Fisher information and is given by

I(θ) = −∫_{−∞}^{∞} ( ∂² ln f(x; θ)/∂θ² ) f(x; θ) dx = D[ ∂ ln f(X; θ)/∂θ ].

Proof. Since

k(θ) = ∫_{R^n} u(x1, . . . , xn) f(x1; θ) · · · f(xn; θ) dx1 · · · dxn,

we have

k′(θ) = ∫_{R^n} u(x1, . . . , xn) ( ∑_{i=1}^n ∂ ln f(xi; θ)/∂θ ) f(x1; θ) · · · f(xn; θ) dx1 · · · dxn.

Denote Z = ∑_{i=1}^n ∂ ln f(Xi; θ)/∂θ. It is easy to verify that E[Z] = 0 and DZ = nI(θ). Moreover, k′(θ) = E[YZ]. Hence, we have

k′(θ) = E[YZ] = E[Y]E[Z] + ρ √( nI(θ) DY ),

where ρ is the correlation coefficient between Y and Z. Since E[Z] = 0 and ρ2 ≤ 1, we get

|k′(θ)|² / ( nI(θ) DY ) ≤ 1,

which implies the desired result.

Definition 5.5.2. Let Y be an unbiased estimator of a parameter θ in the case of point estima-
tion. The statistic Y is called an efficient estimator of θ if and only if the variance of Y attains
the Rao-Cramer lower bound.

Example 5.5.3. Let X1, X2, . . . , Xn denote a random sample from an exponential distribution that has mean λ > 0. Show that X̄ is an efficient estimator of λ.

Example 5.5.4 (Poisson distribution). Let X1 , X2 , . . . , Xn denote a random sample from a Pois-
son distribution that has the mean θ > 0. Show that X̄ is an efficient estimator of θ.

In the above examples, we were able to obtain the MLEs in closed form along with their
distributions and, hence, moments. This is often not the case. Maximum likelihood esti-
mators, however, have an asymptotic normal distribution. In fact, MLEs are asymptotically
efficient.

Theorem 5.5.5. Assume X1, . . . , Xn are iid with pdf f(x; θ0) for θ0 ∈ Θ such that the regularity conditions (R0)–(R5) are satisfied. Suppose further that 0 < I(θ0) < ∞, and

(R6) The pdf f(x; θ) is three times differentiable as a function of θ. Moreover, for all θ ∈ Θ, there exist a constant c and a function M(x) such that

| ∂³ ln f(x; θ)/∂θ³ | ≤ M(x),

with Eθ0 [M(X)] < ∞, for all θ0 − c < θ < θ0 + c and all x in the support of X.

Then any consistent sequence of solutions of the mle equations satisfies

√n(θ̂n − θ0) →w N(0, I(θ0)^{−1}).

Proof. Expanding the function l′(θ) into a Taylor series of order two about θ0 and evaluating it at θ̂n, we get

l′(θ̂n) = l′(θ0) + (θ̂n − θ0) l″(θ0) + (1/2)(θ̂n − θ0)² l‴(θn*),

where θn* is between θ0 and θ̂n. But l′(θ̂n) = 0. Hence,

√n(θ̂n − θ0) = n^{−1/2} l′(θ0) / ( −n^{−1} l″(θ0) − (2n)^{−1}(θ̂n − θ0) l‴(θn*) ).

By the central limit theorem,

(1/√n) l′(θ0) = (1/√n) ∑_{i=1}^n ∂ ln f(Xi; θ0)/∂θ →w N(0, I(θ0)).

Also, by the law of large numbers,

−(1/n) l″(θ0) = −(1/n) ∑_{i=1}^n ∂² ln f(Xi; θ0)/∂θ² →P I(θ0).

Note that |θ̂n − θ0| < c0 implies that |θn* − θ0| < c0; thanks to Condition (R6), we have

| (1/n) l‴(θn*) | ≤ (1/n) ∑_{i=1}^n | ∂³ ln f(Xi; θ)/∂θ³ | ≤ (1/n) ∑_{i=1}^n M(Xi).

Since Eθ0 |M(X)| < ∞, applying the law of large numbers, we have (1/n) ∑_{i=1}^n M(Xi) →P Eθ0 [M(X)]. Moreover, since θ̂n →P θ0, for any ε > 0 there exists N > 0 so that P[ |θ̂n − θ0| < c0 ] ≥ 1 − ε/2 and

P[ | (1/n) ∑_{i=1}^n M(Xi) − Eθ0 [M(X)] | < 1 ] ≥ 1 − ε/2,

for all n ≥ N. Therefore,

P[ | (1/n) l‴(θn*) | ≤ 1 + Eθ0 [M(X)] ] ≥ 1 − ε,

hence n^{−1} l‴(θn*) is bounded in probability. This implies the desired result.

5.6 Exercises

5.6.1 Confidence interval


5.1. For a normal population with known variance σ², answer the following questions:

1. What is the confidence level for the interval x̄ − 2.14σ/√n ≤ µ ≤ x̄ + 2.14σ/√n?

2. What is the confidence level for the interval x̄ − 2.49σ/√n ≤ µ ≤ x̄ + 2.49σ/√n?

3. What is the confidence level for the interval x̄ − 1.85σ/√n ≤ µ ≤ x̄ + 1.84σ/√n?

5.2. A confidence interval estimate is desired for the gain in a circuit on a semiconductor
device. Assume that gain is normally distributed with standard deviation σ = 20.

1. Find a 95% CI for µ when n = 10 and x̄ = 1000.

2. Find a 95% CI for µ when n = 25 and x̄ = 1000.



3. Find a 99% CI for µ when n = 10 and x̄ = 1000.

4. Find a 99% CI for µ when n = 25 and x̄ = 1000.

5.3. Following are two confidence interval estimates of the mean µ of the cycles to failure of
an automotive door latch mechanism (the test was conducted at an elevated stress level to
accelerate the failure).

3124.9 ≤ µ ≤ 3215.7 3110.5 ≤ µ ≤ 3230.1.

1. What is the value of the sample mean cycles to failure?

2. The confidence level for one of these CIs is 95% and the confidence level for the other is
99%. Both CIs are calculated from the same sample data. Which is the 95% CI? Explain
why.

5.4. n = 100 random samples of water from a fresh water lake were taken and the calcium
concentration (milligrams per liter) measured. A 95% CI on the mean calcium concentration
is 0.49 ≤ µ ≤ 0.82.

1. Would a 99% CI calculated from the same sample data been longer or shorter?

2. Consider the following statement: There is a 95% chance that µ is between 0.49 and 0.82.
Is this statement correct? Explain your answer.

3. Consider the following statement: If n = 100 random samples of water from the lake
were taken and the 95% CI on µ computed, and this process was repeated 1000 times,
950 of the CIs will contain the true value of µ. Is this statement correct? Explain your
answer.

5.5. A research engineer for a tire manufacturer is investigating tire life for a new rubber com-
pound and has built 16 tires and tested them to end-of-life in a road test. The sample mean
and standard deviation are 60,139.7 and 3645.94 kilometers. Find a 95% confidence interval
on mean tire life.

5.6. An Izod impact test was performed on 20 specimens of PVC pipe. The sample mean is
X̄ = 1.25 and the sample standard deviation is S = 0.25. Find a 99% lower confidence bound
on Izod impact strength.

5.7. The compressive strength of concrete is being tested by a civil engineer. He tests 12 spec-
imens and obtains the following data.
2216 2237 2225 2301 2318 2255
2249 2204 2281 2263 2275 2295

1. Is there evidence to support the assumption that compressive strength is normally dis-
tributed? Does this data set support your point of view? Include a graphical display in
your answer.

2. Construct a 95% confidence interval on the mean strength.

5.8. A machine produces metal rods. A random sample of 15 rods is selected, and the diame-
ter is measured. The resulting date (in millimetres) are as follows
8.24 8.25 8.2 8.23 8.24
8.21 8.26 8.26 8.2 8.25
8.23 8.23 8.19 8.28 8.24

1. Check the assumption of normality for rod diameter.

2. Find a 95% CI on mean rod diameter.

5.9. A rivet is to be inserted into a hole. A random sample of n = 15 parts is selected, and
the hole diameter is measured. The sample standard deviation of the hole diameter measure-
ments is s = 0.008 millimeters. Construct a 99% CI for σ 2 .

5.10. The sugar content of the syrup in canned peaches is normally distributed with standard
deviation σ. A random sample of n = 10 cans yields a sample standard deviation of s = 4.8
milligrams. Find a 95% CI for σ.

5.11. Of 1000 randomly selected cases of lung cancer, 823 resulted in death within 10 years.

1. Construct a 95% CI on the death rate from lung cancer.

2. How large a sample would be required to be at least 95% confident that the error in
estimating the 10-year death rate from lung cancer is less than 0.03?

5.12. A random sample of 50 suspension helmets used by motorcycle riders and automobile
race-car drivers was subjected to an impact test, and on 18 of these helmets some damage
was observed.

1. Find a 95% CI on the true proportion of helmets of this type that would show damage
from this test.

2. Using the point estimate of p obtained from the preliminary sample of 50 helmets, how
many helmets must be tested to be 95% confident that the error in estimating the true
value of p is less than 0.02?

3. How large must the sample be if we wish to be at least 95% confident that the error in
estimating p is less than 0.02, regardless of the true value of p?

5.13. Consider a CI for the mean µ when σ is known,

x̄ − z_{α1} σ/√n ≤ µ ≤ x̄ + z_{α2} σ/√n,

where α1 + α2 = α. If α1 = α2 = α/2, we have the usual 100(1 − α)% CI for µ. In the above, when α1 ≠ α2, the CI is not symmetric about µ. The length of the interval is L = σ(z_{α1} + z_{α2})/√n. Prove that the length of the interval L is minimized when α1 = α2 = α/2.

5.14. Let the observed value of the mean X̄ of a random sample of size 20 from a distribution
that is N (µ, 80) be 81.2. Find a 95 percent confidence interval for µ.

5.15. Let X̄ be the mean of a random sample of size n from a distribution that is N (µ, 9). Find
n such that P[X̄ − 1 < µ < X̄ + 1] = 0.90, approximately.

5.16. Let a random sample of size 17 from the normal distribution N (µ, σ 2 ) yield x̄ = 4.7 and
s2 = 5.76. Determine a 90 percent confidence interval for µ.

5.17. Let X̄ denote the mean of a random sample of size n from a distribution that has mean
µ and variance σ 2 = 10. Find n so that the probability is approximately 0.954 that the random
interval (X̄ − 21 , X̄ + 12 ) includes µ.

5.18. Find a (1 − α) confidence interval for θ, given X1, . . . , Xn iid with pdf

1. f(x; θ) = 1 if θ − 1/2 < x < θ + 1/2;

2. f(x; θ) = 2x/θ² if 0 < x < θ, θ > 0.

5.19. Let (X1, . . . , Xn) be a random sample from a N(0, σ²X) distribution, and let (Y1, . . . , Ym) be a random sample from a N(0, σ²Y) distribution, independent of the X's. Define λ = σ²Y/σ²X. Find a (1 − α) CI for λ.

5.20. Suppose that X1 , . . . , Xn is a random sample from a N(µ, σ 2 ) population.

1. If σ 2 is known, find a minimum value for n to guarantee that a 0.95 CI for µ will have
length no more than σ/4.

2. If σ 2 is unknown, find a minimum value for n to guarantee, with probability 0.90, that a
0.95 CI for µ will have length no more than σ/4.

5.21. Let (X1 , . . . , Xn ) be iid uniform U(0; θ). Let Y be the largest order statistics. Show that
the distribution of Y /θ does not depend on θ, and find the shortest (1 − α) CI for θ.

5.6.2 Point estimator


5.22. Let X1 , X2 , X3 be a random sample of size three from a uniform (θ, 2θ) distribution,
where θ > 0.

1. Find the method of moments estimator of θ.

2. Find the MLE of θ.

5.23. Let X1 , . . . , Xn be a random sample from the pdf

f (x; θ) = θx−2 , 0 < θ < x < ∞.

1. What is a sufficient statistic for θ?

2. Find the mle of θ.

3. Find the method of moments estimator of θ.

5.24. Let X1 , . . . , Xn be iid with density

f(x; θ) = e^{θ−x} / (1 + e^{θ−x})²,  x ∈ R, θ ∈ R.

Show that the mle of θ exists and is unique.

5.25. Let X1 , . . . , Xn represent a random sample from each of the distributions having the
following pdfs or pmfs:

1. f (x; θ) = θx e−θ /x!, x = 0, 1, 2, . . . , 0 ≤ θ < ∞, zero elsewhere.

2. f(x; θ) = (1/θ) I_{0<x<θ}, θ > 0.

3. f (x; θ) = θxθ−1 I{0<x<1} , 0 < θ < ∞.


4. f(x; θ) = (1/θ) e^{−x/θ} I_{x>0}, 0 < θ < ∞.
I{x>0} , 0 < θ < ∞.

5. f (x; θ) = eθ−x I{x>θ} , −∞ < θ < ∞.

6. f (x; θ) = 21 e−|x−θ| , −∞ < x < ∞, −∞ < θ < ∞. Find the mle of θ.

In each case find the mle θ̂ of θ.

5.26. Let X1, . . . , Xn be a sample from the inverse Gaussian pdf

f(x; µ, λ) = ( λ/(2πx³) )^{1/2} exp( −λ(x − µ)²/(2µ²x) ),  x > 0.
Find the mles of µ and λ.
5.27. Suppose X1, . . . , Xn are iid with pdf f(x; θ) = (2x/θ²) I_{0<x≤θ}. Find

1. the mle θ̂ for θ;

2. the constant c so that E[cθ̂] = θ;



3. the mle for the median of the distribution.

5.28. Suppose X1, . . . , Xn are iid with pdf f(x; θ) = (1/θ) e^{−x/θ} I_{0<x<∞}. Find the mle of P[X ≥ 3].

5.29. The independent random variables X1, . . . , Xn have the common distribution

P(Xi ≤ x | α, β) = 0 if x < 0;  (x/β)^α if 0 ≤ x ≤ β;  1 if x > β,

where the parameters α and β are positive.

1. Find a two dimensional sufficient statistics for (α, β).

2. Find the mles of α and β.

3. The length (in millimeters) of cuckoos' eggs found in hedge sparrow nests can be modelled with this distribution. For the data

22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8, 23.1, 23.1, 23.5, 23.0, 23.0,

find the mles of α and β.

5.30. Suppose that the random variables Y1, . . . , Yn satisfy

Yi = βxi + εi,  i = 1, . . . , n,

where x1, . . . , xn are fixed constants, and ε1, . . . , εn are iid N(0, σ²) with σ² unknown.

1. Show that β̂n := ∑_{i=1}^n Yi / ∑_{i=1}^n xi is an unbiased estimator of β. Find the variance of β̂n.

2. Find a two-dimensional sufficient statistic for (β, σ²).

3. Find the mle β̄n of β, and show that it is an unbiased estimator of β. Compare the variances of β̄n and β̂n.

4. Find the distribution of the mle of β.

5.6.3 Lower bound for variance


5.31. Let (X1, . . . , Xn) be a random sample from a population with mean µ and variance σ².

1. Show that the estimator ∑_{i=1}^n ai Xi is an unbiased estimator of µ if ∑_{i=1}^n ai = 1.

2. Among all such unbiased estimators, find the one with minimum variance, and calculate the variance.

5.32. Given the pdf

f(x; θ) = 1/( π(1 + (x − θ)²) ),  x ∈ R, θ ∈ R,

show that the Rao–Cramér lower bound is 2/n, where n is the sample size. What is the asymptotic distribution of √n(θ̂ − θ) if θ̂ is the mle of θ?

5.33. Let X have a gamma distribution with α = 4 and β = θ > 0.

1. Find the Fisher information I(θ).

2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an
efficient estimator of θ.

3. What is the asymptotic distribution of √n(θ̂ − θ)?

5.34. Let X be N(0; θ), 0 < θ < ∞.

1. Find the Fisher information I(θ).

2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an
efficient estimator of θ.

3. What is the asymptotic distribution of √n(θ̂ − θ)?

5.35. Let (X1, . . . , Xn) be a random sample from a N(0; θ) distribution. We want to estimate the standard deviation √θ. Find the constant c so that Y = c ∑_{i=1}^n |Xi| is an unbiased estimator of √θ and determine its efficiency.

5.36. If (X1, . . . , Xn) is a random sample from a distribution with pdf

f(x; θ) = 3θ³/(x + θ)⁴ for 0 < x < ∞, 0 < θ < ∞, and f(x; θ) = 0 otherwise,

show that Y = 2X̄n is an unbiased estimator of θ and determine its efficiency.

5.37 (Beta(θ, 1) distribution). Let X1, X2, . . . , Xn denote a random sample of size n > 2 from a distribution with pdf

f(x; θ) = θ x^{θ−1} for x ∈ (0, 1), and f(x; θ) = 0 otherwise,

where the parameter space is Ω = (0, ∞).

1. Show that θ̂ = −n / ∑_{i=1}^n ln Xi is the MLE of θ.

2. Show that θ̂ is gamma distributed.



3. Show that θ̂ is asymptotic unbiased estimator of θ.

4. Is θ̂ an efficient estimator of θ?
5.38. Let X1, . . . , Xn be iid N(θ, 1). Show that the best unbiased estimator of θ² is X̄n² − 1/n. Calculate its variance and show that it is greater than the Cramér–Rao lower bound.
Chapter 6

Hypothesis Testing

6.1 Introduction
Point estimation and confidence intervals are useful statistical inference procedures. An-
other type of inference that is frequently used concerns tests of hypotheses. As in the last sec-
tion, suppose our interest centers on a random variable X which has density function f (x; θ)
where θ ∈ Θ. Suppose we think, due to theory or a preliminary experiment, that θ ∈ Θ0 or
θ ∈ Θ1 where Θ0 and Θ1 are subsets of Θ and Θ0 ∪ Θ1 = Θ. We label the hypothesis as

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 . (6.1)

We call H0 the null hypothesis and H1 the alternative hypothesis. A hypothesis of the form
θ = θ0 is called a simple hypothesis while a hypothesis of the form θ > θ0 or θ < θ0 is called a
composite hypothesis. A test of the form

H0 : θ = θ0 versus H1 : θ ≠ θ0

is called a two-sided test. A test of the form

H0 : θ ≤ θ0 versus H1 : θ > θ0 ,

or
H0 : θ ≥ θ0 versus H1 : θ < θ0
is called a one-sided test.
Often the null hypothesis represents no change or no difference from the past while the
alternative represents change or difference. The alternative is often referred to as the research
worker’s hypothesis. The decision rule to take H0 or H1 is based on a sample X1 , . . . , Xn from
the distribution of X and hence, the decision could be right or wrong. There are only two
types of statistical errors we may commit: rejecting H0 when H0 is true (called the Type I
error) and accepting H0 when H0 is wrong (called the Type II error).
Let D denote the sample space. A test of H0 versus H1 is based on a subset C of D. This set
C is called the critical region and its corresponding decision rule is:


Table 6.1: Decision Table for a Test of Hypothesis

Decision H0 is True H1 is True


Reject H0 Type I Error Correct Decision
Accept H0 Correct Decision Type II Error

• Reject H0 (Accept H1 ) if (X1 , . . . , Xn ) ∈ C;

• Retain H0 (Reject H1) if (X1, . . . , Xn) ∉ C.

Our goal is to select a critical region which minimizes the probability of making error. In gen-
eral, it is not possible to simultaneously reduce Type I and Type II errors because of a see-saw
effect: if one takes C = ∅ then H0 would be never rejected so the probability of Type I error
would be 0, but the Type II error occurs with probability 1. Type I error is usually considered
to be worse than Type II. Therefore, we will choose a critical region which, on one hand, bounds the probability of Type I error at a certain level, and on the other hand, minimizes the probability of Type II error.

Definition 6.1.1. A critical region C is called of size α if

α = max_{θ∈Θ0} Pθ [(X1, . . . , Xn) ∈ C].

α is also called the significance level of the test associated with the critical region C.

Over all critical regions of size α, we look for the one which has the lowest probability of Type II error. It also means that for θ ∈ Θ1, we want to maximize

1 − Pθ [Type II Error] = Pθ [(X1, . . . , Xn) ∈ C].

We call the probability on the right side of this equation the power of the test at θ. So our task is to find, among all critical regions of size α, the one with the highest power.
We define the power function of a critical region by

γC(θ) = Pθ [(X1, . . . , Xn) ∈ C],  θ ∈ Θ1.

Example 6.1.2. Suppose X1, . . . , Xn is a random sample from a N(µ, 1) distribution. Consider the hypotheses

H0 : µ = µ0 versus H1 : µ = µ1,

where µ0 < µ1 are specified. Let us consider a critical region C of the form C = {X̄n > k}. Since X̄n has the N(µ, 1/n) distribution, the size of the critical region is

α = Pµ0 [X̄n > k] = 1 − Φ(√n(k − µ0)).

The power function of the critical region C is

γC(µ1) = Pµ1 [X̄n > k] = 1 − Φ(√n(k − µ1)).

In particular, if µ0 = 0, µ1 = 1 and n = 100, then at the significance level 5% we would reject H0 in favor of H1 if X̄n > 0.1645, and the power of the test is 1 − Φ(−8.355) ≈ 0.9999.
H0 in favor of H1 if X̄n > 0.1965 and the power of the test is 1 − Φ(−8.135) = 0.9999.

Example 6.1.3 (Large Sample Test for the Mean). Let X1 , . . . , Xn be a random sample from the
distribution of X with mean µ and finite variance σ 2 . We want to test the hypotheses

H0 : µ = µ0 versus H1 : µ > µ0

where µ0 is specified. To illustrate, suppose µ0 is the mean level on a standardized test of


students who have been taught a course by a standard method of teaching. Suppose it is
hoped that a new method which incorporates computers will have a mean level µ > µ0 , where
µ = E[X] and X is the score of a student taught by the new method. This conjecture will be
tested by having n students (randomly selected) to be taught under this new method.
Because X̄n → µ in probability, an intuitive decision rule is given by

Reject H0 in favor of H1 if X̄n is much larger than µ0.

In general, the distribution of the sample mean cannot be obtained in closed form, so we use the central limit theorem to find the critical region. Indeed, since

(X̄n − µ)/(S/√n) →w N(0, 1),

we obtain a test with an approximate size α:

Reject H0 in favor of H1 if (X̄n − µ0)/(S/√n) ≥ zα.

The power of the test is also approximated by using the central limit theorem:

γ(µ) = P[ X̄n ≥ µ0 + zα σ/√n ] ≈ Φ( −zα − √n(µ0 − µ)/σ ).

So if we have some reasonable idea of what σ equals, we can compute the approximate power function.
Finally, note that if X has a normal distribution, then (X̄n − µ)/(S/√n) has a t distribution with n − 1 degrees of freedom. Thus we can establish a rejection rule having exact level α:

Reject H0 in favor of H1 if T = (X̄n − µ0)/(S/√n) ≥ t_{α,n−1},

where t_{α,n−1} is the upper α critical point of a t distribution with n − 1 degrees of freedom.
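A sketch of this rejection rule using scipy's t quantile; it also returns the one-sided p-value discussed below. The function name is only illustrative.

import numpy as np
from scipy.stats import t

def test_mean_greater(x, mu0, alpha=0.05):
    # Reject H0: mu = mu0 in favour of H1: mu > mu0 if T >= t_{alpha, n-1}.
    x = np.asarray(x)
    n = len(x)
    T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    critical = t.ppf(1 - alpha, df=n - 1)
    p_value = 1 - t.cdf(T, df=n - 1)       # one-sided p-value
    return T >= critical, T, p_value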

One way to report the results of a hypothesis test is to state that the null hypothesis was
or was not rejected at a specified α-value or level of significance. For example, we can say
that H0 : µ = 0 was rejected at the 0.05 level of significance. This statement of conclusions is
often inadequate because it gives the decision maker no idea about whether the computed
value of the test statistic was just barely in the rejection region or whether it was very far
into this region. Furthermore, stating the results this way imposes the predefined level of
significance on other users of the information. This approach may be unsatisfactory because
some decision makers might be uncomfortable with the risks implied by α = 0.05.
To avoid these difficulties the p-value approach has been adopted widely in practice. The
p-value is the probability that the test statistic will take on a value that is at least as extreme
as the observed value of the statistic when the null hypothesis H0 is true. Thus, a p-value
conveys much information about the weight of evidence against H0 , and so a decision maker
can draw a conclusion at any specified level of significance. We now give a formal definition
of a p-value.

Definition 6.1.4. The p-value is the smallest level of significance that would lead to rejection
of the null hypothesis H0 with the given data.

This means that if α > p-value, we would reject H0 , while if α < p-value, we would not reject
H0 .

6.2 Method of finding test

6.2.1 Likelihood Ratio Tests


Let L(x; θ) be the likelihood function of the sample (X1 , . . . , Xn ) from a distribution with
density p(x; θ).

Definition 6.2.1. The likelihood test statistic for testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 is

supθ∈Θ0 L(x; θ)
λ(x) = .
supθ∈Θ L(x; θ)

A likelihood ratio test is any test that has a rejection region of the form C = {x : λ(x) ≤ c} for
some c ∈ [0, 1].

The motivation of the likelihood ratio test comes from the fact that if θ0 is the true value of θ
then, asymptotically, L(θ0 ) is the maximum value of L(θ). Therefore, if H0 is true, λ should be
close to 1; while if H1 is true, λ should be smaller.

Example 6.2.2 (Likelihood Ratio Test for the Exponential Distribution). Suppose X1 , . . . , Xn
are iid with pdf f (x; θ) = θ−1 e−x/θ I{x>0} and θ > 0. Let’s consider the hypotheses

H0 : θ = θ0 versus H1 : θ ≠ θ0 ,

where θ0 > 0 is a specified value. The likelihood ratio test statistic simplifies to

λ(X) = (X̄n /θ0 )^n e^n e^{−nX̄n /θ0} .
The decision rule is to reject H0 if λ ≤ c. Using differential calculus, it is easy to show that λ ≤ c
iff X̄n ≤ c1 θ0 or X̄n ≥ c2 θ0 for some positive constants c1 , c2 . Note that under the null hypothesis
H0 , the statistic (2/θ0 ) Σ_{i=1}^n Xi has a χ² distribution with 2n degrees of freedom. Therefore, the
following decision rule results in a level α test:

Reject H0 if (2/θ0 ) Σ_{i=1}^n Xi ≤ χ²_{1−α/2}(2n) or (2/θ0 ) Σ_{i=1}^n Xi ≥ χ²_{α/2}(2n),

where χ²_{1−α/2}(2n) is the lower α/2 quantile of the χ² distribution with 2n degrees of freedom and
χ²_{α/2}(2n) is the upper α/2 quantile of the χ² distribution with 2n degrees of freedom.
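As an illustration (ours, not from the original notes), the following Python sketch carries out this level-α decision rule with scipy; the sample below is simulated with θ0 = 2, n = 30, so the numbers are only for demonstration.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    theta0, n, alpha = 2.0, 30, 0.05
    x = rng.exponential(scale=theta0, size=n)      # simulated sample (H0 happens to be true here)

    stat = 2 * x.sum() / theta0                    # chi2(2n)-distributed under H0
    lo = chi2.ppf(alpha / 2, 2 * n)                # lower alpha/2 quantile
    hi = chi2.ppf(1 - alpha / 2, 2 * n)            # upper alpha/2 quantile
    print(stat, (lo, hi), stat <= lo or stat >= hi)   # last entry: reject H0?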

If ϕ(X) is a sufficient statistic for θ with pdf or pmf g(t; θ), then we might consider con-
structing a likelihood ratio test based on ϕ and its likelihood function L∗ (t; θ) = g(t; θ) rather
than on the sample X and its likelihood function L(x; θ).

Theorem 6.2.3. If ϕ(X) is a sufficient statistic for θ and λ∗ (t) and λ(x) are the likelihood ratio
test statistics based on ϕ and X, respectively, then λ∗ (ϕ(x)) = λ(x) for every x in the sample
space.

Proof. From the Factorization Theorem, the pdf or pmf of X can be written as f (x; θ) =
g(ϕ(x); θ)h(x), where g(t; θ) is the pdf or pmf of ϕ(X) and h(x) does not depend on θ. Thus

λ(x) = supΘ0 L(x; θ) / supΘ L(x; θ) = supΘ0 f (x; θ) / supΘ f (x; θ) = supΘ0 g(ϕ(x); θ)h(x) / supΘ g(ϕ(x); θ)h(x)
     = supΘ0 L∗ (ϕ(x); θ) / supΘ L∗ (ϕ(x); θ) = λ∗ (ϕ(x)).

6.3 Method of evaluating test

6.3.1 Most powerful test


Now we consider a test of a simple hypothesis H0 versus a simple alternative H1 . Let f (x; θ)
denote the density of a random variable X where θ ∈ Θ = {θ0 , θ1 }. Let X = (X1 , . . . , Xn ) be a
random sample from the distribution of X.

Definition 6.3.1. A subset C of the sample space is called a best critical region of size α for
testing the simple hypothesis
H0 : θ = θ0 versus H1 : θ = θ1 ,
if Pθ0 [X ∈ C] = α and for every subset A of the sample space
Pθ0 [X ∈ A] = α implies Pθ1 [X ∈ C] ≥ Pθ1 [X ∈ A].
The following theorem of Neyman and Pearson provides a systematic method of deter-
mining a best critical region.

Theorem 6.3.2. Let (X1 , . . . , Xn ) be a sample from a distribution that has density f (x; θ).
Then the likelihood of X1 , X2 , . . . , Xn is
L(x; θ) = Π_{i=1}^n f (xi ; θ), for x = (x1 , . . . , xn ).

Let θ0 and θ1 be distinct fixed values of θ so that Θ = {θ0 , θ1 }, and let k be a positive
number. Let C be a subset of the sample space such that
(a) L(x; θ0 )/L(x; θ1 ) ≤ k for each x ∈ C;

(b) L(x; θ0 )/L(x; θ1 ) ≥ k for each x ∈ D\C, where D denotes the sample space;

(c) α = Pθ0 [X ∈ C].

Then C is a best critical region of size α for testing the simple hypothesis

H0 : θ = θ0 versus H1 : θ = θ1 .

Proof. We prove the theorem when the random variables are of the continuous type. If A is
another critical region of size α, we will show that
∫_C L(x; θ1 )dx ≥ ∫_A L(x; θ1 )dx.
Write C as the disjoint union of C ∩ A and C ∩ A^c , and A as the disjoint union of A ∩ C and
A ∩ C^c ; we have

∫_C L(x; θ1 )dx − ∫_A L(x; θ1 )dx = ∫_{C∩A^c} L(x; θ1 )dx − ∫_{A∩C^c} L(x; θ1 )dx
                                  ≥ (1/k) ∫_{C∩A^c} L(x; θ0 )dx − (1/k) ∫_{A∩C^c} L(x; θ0 )dx,

where the last inequality follows from conditions (a) and (b). Moreover, we have

∫_{C∩A^c} L(x; θ0 )dx − ∫_{A∩C^c} L(x; θ0 )dx = ∫_C L(x; θ0 )dx − ∫_A L(x; θ0 )dx = α − α = 0.

This implies the desired result.



Example 6.3.3. Let X = (X1 , . . . , Xn ) denote a random sample from the distribution N (θ, 1).
It is desired to test the simple hypothesis

H0 : θ = 0 versus H1 : θ = 1.

We have
L(0; x)/L(1; x) = [(2π)^{−n/2} exp(−(1/2) Σ_{i=1}^n xi²)] / [(2π)^{−n/2} exp(−(1/2) Σ_{i=1}^n (xi − 1)²)] = exp(−Σ_{i=1}^n xi + n/2).

If k > 0, the set of all points (x1 , . . . , xn ) such that


exp(−Σ_{i=1}^n xi + n/2) ≤ k ⇔ (1/n) Σ_{i=1}^n xi ≥ 1/2 − (ln k)/n = c

is a best critical region, where c is a constant that can be determined so that the size of the
critical region is α. Since X̄n ∼ N (0, 1/n),

Pθ0 (X̄n ≥ c) = α ⇔ c = Φ−1 (1 − α)/√n,


where Φ−1 is the inverse function of Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.
If X̄n ≥ c, the simple hypothesis H0 : θ = 0 would be rejected at the significance level α; if
X̄n < c, the hypothesis H0 would be accepted.
The probability of rejecting H0 when H0 is true is α; the probability of rejecting H0 , when
H0 is false, is the value of the power of the test at θ = 1. That is
Pθ1 [X̄n ≥ c] = ∫_c^∞ (1/√(2π/n)) exp(−(x − 1)²/(2/n)) dx.

For example, if n = 25, α = 0.05 then c = 0.329. Thus the power of this best test of size 0.05 of
H0 against H1 at θ = 1 is

∫_{0.329}^∞ (1/√(2π/25)) exp(−(x − 1)²/(2/25)) dx = 1 − Φ(−3.355) = 0.999.

6.3.2 Uniformly most powerful test


We now define a critical region which, when it exists, is a best critical region for testing a
simple hypothesis H0 against an alternative composite hypothesis H1 .

Definition 6.3.4. The critical region C is a uniformly most powerful (UMP) critical region of
size α for testing the simple hypothesis H0 against an alternative composite hypothesis H1 if
the set C is a best critical region of size α for testing H0 against each simple hypothesis in H1 .
A test defined by this critical region C is called a uniformly most powerful (UMP) test, with
significance level α, for testing the simple hypothesis H0 against the alternative composite
hypothesis H1 .

It is well-known that uniformly most powerful tests do not always exist. However, when
they do exist, the Neyman-Pearson theorem provides a technique for finding them.

Example 6.3.5. Let (X1 , X2 , . . . , Xn ) be a random sample from the distribution N (0, θ), where
the variance θ is an unknown positive number. We will show that there exists a uniformly
most powerful test with significance level α for testing

H0 : θ = θ0 versus H1 : θ > θ0 .

The joint density of X1 , . . . , Xn is


L(θ; x1 , . . . , xn ) = (2πθ)^{−n/2} exp(−(1/(2θ)) Σ_{i=1}^n xi²).

Let θ′ be a number greater than θ0 and let k denote a positive number. Let C be the set of points
where

L(θ0 ; x)/L(θ′ ; x) ≤ k ⇔ Σ_{i=1}^n xi² ≥ (2θ0 θ′/(θ′ − θ0 )) [(n/2) ln(θ′/θ0 ) − ln k] = c.

The set C = {(x1 , . . . , xn ) : Σ_{i=1}^n xi² ≥ c} is then a best critical region for our testing problem.

It remains to determine c so that this critical region has size α, i.e.,


α = Pθ0 [Σ_{i=1}^n Xi² ≥ c].

This can be done using the observation that (1/θ0 ) Σ_{i=1}^n Xi² has a χ²-distribution with n degrees
of freedom. Note that for each number θ′ > θ0 , the foregoing argument holds. It means that C
is a uniformly most powerful critical region of size α.
In conclusion, if Σ_{i=1}^n Xi² ≥ c, H0 is rejected at the significance level α and H1 is accepted;

otherwise, H0 is accepted.

Example 6.3.6. Let (X1 , . . . , Xn ) be a sample from the normal distribution N (a, 1), where a is
unknown. We will show that there is no uniformly most powerful test of the simple hypothesis

H0 : a = a0 versus H1 : a ≠ a0 .

However, if the alternative composite hypothesis is either H1 : a > a0 or H1 : a < a0 , a


uniformly most powerful test will exist in each instance.
Let a1 be a number not equal to a0 . Let k be a positive number and consider
[(2π)^{−n/2} exp(−(1/2) Σ_{i=1}^n (xi − a0 )²)] / [(2π)^{−n/2} exp(−(1/2) Σ_{i=1}^n (xi − a1 )²)] ≤ k ⇔ (a1 − a0 ) Σ_{i=1}^n xi ≥ (n/2)(a1² − a0²) − ln k.

This last inequality is equivalent to


Σ_{i=1}^n xi ≥ (n/2)(a1 + a0 ) − (ln k)/(a1 − a0 ),

provided that a1 > a0 , and it is equivalent to


Σ_{i=1}^n xi ≤ (n/2)(a1 + a0 ) − (ln k)/(a1 − a0 ),
if a1 < a0 . The first of these two expressions defines a best critical region for testing H0 : a = a0
against the hypothesis a = a1 provided that a1 > a0 , while the second expression defines a
best critical region for testing H0 : a = a0 against the hypothesis a = a1 provided that a1 < a0 .
That is, a best critical region for testing the simple hypothesis against an alternative simple
hypothesis, say a = a0 + 1, will not serve as a best critical region for testing H0 : a = a0
against the alternative simple hypothesis a = a0 − 1. By definition, then, there is no uniformly
most powerful test in the case under consideration. However, if the alternative composite
hypothesis is either H1 : a > a0 or H1 : a < a0 , a uniformly most powerful test will exist in each
instance.
Remark 5. Sufficiency is important for finding a test. Indeed, let X1 , . . . , Xn be a ran-
dom sample from a distribution that has pdf f (x, θ), θ ∈ Θ. Suppose that Y = u(X1 , . . . , Xn )
is a sufficient statistic for θ. It follows from the factorization theorem that the joint pdf of
X1 , . . . , Xn may be written
L(x1 , . . . , xn ; θ) = k1 (u(x1 , . . . , xn ); θ)k2 (x1 , . . . , xn ),
where k2 (x1 , . . . , xn ) does not depend upon θ. It implies that the ratio
L(x1 , . . . , xn ; θ′ ) / L(x1 , . . . , xn ; θ″ ) = k1 (u(x1 , . . . , xn ); θ′ ) / k1 (u(x1 , . . . , xn ); θ″ )
depends upon x1 , . . . , xn only through u(x1 , . . . , xn ). Consequently, if there is a sufficient statis-
tic Y = u(X1 , . . . , Xn ) for θ and if a best test or a uniformly most powerful test is desired, there
is no need to consider tests which are based upon any statistic other than the sufficient statis-
tic.

6.3.3 Monotone likelihood ratio


Consider the general one-sided hypotheses of the form
H0 : θ ≤ θ0 versus H1 : θ > θ0 . (6.2)
In this section we introduce general forms of uniformly most powerful tests for these hypothe-
ses when the sample has a so called monotone likelihood ratio.
Definition 6.3.7. Let X = (X1 , . . . , Xn ) be a random sample with common pdf (or pmf)
f (x; θ), θ ∈ Θ. We say that its likelihood function L(x; θ) = Π_{i=1}^n f (xi ; θ) has monotone like-
lihood ratio in the statistic y = u(x) if for θ1 < θ2 , the ratio

L(x; θ1 )/L(x; θ2 )

is a monotone function of y = u(x).

Theorem 6.3.8. Assume that L(x; θ) has a monotone decreasing likelihood ratio in the statistic
y = u(x). The following test is uniformly most powerful of level α for the hypotheses (6.2):

Reject H0 if u(X) ≥ c, (6.3)

where c is determined by α = Pθ0 [u(X) ≥ c].

In case L(x; θ) has a monotone increasing likelihood ratio in the statistic y = u(x) we can
construct a uniformly most powerful test in a similar way.

Proof. We first consider the simple null hypothesis H0′ : θ = θ0 . Let θ1 > θ0 be arbitrary but
fixed. Let C denote the most powerful critical region for θ0 versus θ1 . By the Neyman-Pearson
Theorem, C is defined by,
L(X; θ0 )/L(X; θ1 ) ≤ k, if and only if X ∈ C,

where k is determined by α = Pθ0 [X ∈ C]. However, since θ1 > θ0 ,

L(X; θ0 )/L(X; θ1 ) = g(u(X)) ≤ k ⇔ u(X) ≥ g −1 (k),

where g(u(x)) = L(x; θ0 )/L(x; θ1 ). Since α = Pθ0 [u(X) ≥ g −1 (k)], we have c = g −1 (k). Hence, the Neyman-
Pearson test is equivalent to the test defined by (6.3). Moreover, the test is uniformly most
powerful for θ0 versus θ1 > θ0 because the test does not depend on the particular θ1 > θ0 and g −1 (k) is uniquely
determined under θ0 .
Let γ(θ) denote the power function of the test (6.3). For any θ′ < θ″ , the test (6.3) is the
most powerful test for testing θ′ versus θ″ with level γ(θ′ ), so we have γ(θ″ ) > γ(θ′ ). Hence γ(θ)
is a nondecreasing function. This implies max_{θ≤θ0} γ(θ) = α.

Example 6.3.9. Let X1 , . . . , Xn be a random sample from a Bernoulli distribution with param-
eter p = θ, where 0 < θ < 1. Let θ0 < θ1 . Consider the ratio of likelihood,
L(x1 , . . . , xn ; θ0 )/L(x1 , . . . , xn ; θ1 ) = [θ0 (1 − θ1 )/(θ1 (1 − θ0 ))]^{Σ xi} [(1 − θ0 )/(1 − θ1 )]^n .

Since θ0 (1 − θ1 )/(θ1 (1 − θ0 )) < 1, the ratio is a decreasing function of y = Σ xi . Thus we have a monotone
likelihood ratio in the statistic Y = Σ Xi .
Consider the hypotheses

H0 : θ ≤ θ0 versus H1 : θ > θ0 .

By Theorem 6.3.8, the uniformly most powerful level α decision rule is given by
Reject H0 if Y = Σ_{i=1}^n Xi ≥ c,

where c is such that α = Pθ0 [Y ≥ c].
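As a numerical sketch (ours, not from the notes), the cutoff c can be found from the binomial distribution in scipy; the values θ0 = 0.3, n = 20, α = 0.05 below are purely illustrative.

    from scipy.stats import binom

    theta0, n, alpha = 0.3, 20, 0.05               # illustrative values
    # smallest integer c with P(Y >= c) <= alpha under theta0; because Y is
    # discrete, the attainable size is at most alpha
    c = int(binom.ppf(1 - alpha, n, theta0)) + 1
    size = binom.sf(c - 1, n, theta0)              # exact size P(Y >= c)
    print(c, size)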



6.4 Some well-known tests for a single sample

6.4.1 Hypothesis test on the mean of a normal distribution, variance σ² known
In this section, we will assume that a random sample X1 , X2 , . . . , Xn has been taken from
a normal N (µ, σ 2 ) population. It is known that X̄ is an unbiased point estimator of µ.

Hypothesis tests on the mean

Null hypothesis: H0 : µ = µ0
Test statistic: Z0 = (X̄ − µ0 )/(σ/√n)

Alternative hypothesis    Rejection criteria    P -value

H1 : µ ≠ µ0               |Z0 | > zα/2          2[1 − Φ(|Z0 |)]
H1 : µ > µ0               Z0 > zα               1 − Φ(Z0 )
H1 : µ < µ0               Z0 < −zα              Φ(Z0 )

Example 6.4.1. The following data give the score of 10 students in a certain exam.

75 64 75 65 72 80 71 68 78 62.

Assume that the score is normally distributed with mean µ and known variance σ 2 = 36, test
the following hypotheses at the 0.05 level of significance and find the P -value of each test.

(a) H0 : µ = 70 against H1 : µ 6= 70.

(b) H0 : µ = 68 against H1 : µ > 68.

(c) H0 : µ = 75 against H1 : µ < 75.

Solution: (a) We may solve the problem by following the six-step procedure as follows.

1. The parameter of interest is µ, the score.

2. We are going to test: H0 : µ = 70, H1 : µ 6= 70.

3. Sample size n = 10, and sample mean X̄ = (75 + 64 + 75 + 65 + 72 + 80 + 71 + 68 + 78 + 62)/10 = 71.
4. Significance level α = 0.05 so zα/2 = 1.96.

5. The test statistic is


Z0 = (X̄ − µ0 )/(σ/√n) = (71 − 70)/(6/√10) = 0.5270.

6. Since |Z0 | < zα/2 we do not reject H0 : µ = 70 in favour of H1 : µ ≠ 70 at the 0.05 level of
significance. More precisely, we conclude that there is not sufficient evidence that the mean
score differs from 70, based on a sample of 10 measurements.

The P -value of this test is 2(1 − Φ(|Z0 |)) = 2(1 − Φ(0.5270)) = 0.598.
(b)

1. The parameter of interest is µ, the score.

2. We are going to test: H0 : µ = 68, H1 : µ > 68.

3. Sample size n = 10, and sample mean X̄ = (75 + 64 + 75 + 65 + 72 + 80 + 71 + 68 + 78 + 62)/10 = 71.
4. Significance level α = 0.05 so zα = 1.645.

5. The test statistic is


Z0 = (X̄ − µ0 )/(σ/√n) = (71 − 68)/(6/√10) = 1.581.

6. Since Z0 < zα we do not reject H0 : µ = 68 in favour of H1 : µ > 68 at the 0.05 level of
significance. More precisely, we conclude that there is not sufficient evidence that the mean
score exceeds 68, based on a sample of 10 measurements.

The P -value of this test is 1 − Φ(Z0 ) = 1 − Φ(1.581) = 0.057.


(c)

1. The parameter of interest is µ, the score.

2. We are going to test: H0 : µ = 75, H1 : µ < 75.

3. Sample size n = 10, and sample mean X̄ = (75 + 64 + 75 + 65 + 72 + 80 + 71 + 68 + 78 + 62)/10 = 71.
4. Significance level α = 0.05 so zα = 1.645.

5. The test statistic is


Z0 = (X̄ − µ0 )/(σ/√n) = (71 − 75)/(6/√10) = −2.108.

6. Since Z0 < −zα we reject H0 : µ = 75 in favour of H1 : µ < 75 at the 0.05 level of


significance. More precisely, we conclude that the mean score is less than 75 based on a
sample of 10 measurements.

The P -value of this test is Φ(Z0 ) = Φ(−2.108) = 0.018.
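The computations in this example can be reproduced with a few lines of Python (ours; it assumes scipy is available and simply re-does the arithmetic above).

    import numpy as np
    from scipy.stats import norm

    x = np.array([75, 64, 75, 65, 72, 80, 71, 68, 78, 62])
    sigma, n, xbar = 6.0, len(x), x.mean()

    def z_test(mu0, alternative):
        z = (xbar - mu0) / (sigma / np.sqrt(n))
        if alternative == "two-sided":
            p = 2 * (1 - norm.cdf(abs(z)))
        elif alternative == "greater":
            p = 1 - norm.cdf(z)
        else:                                      # "less"
            p = norm.cdf(z)
        return z, p

    print(z_test(70, "two-sided"))                 # (0.527, 0.598)
    print(z_test(68, "greater"))                   # (1.581, 0.057)
    print(z_test(75, "less"))                      # (-2.108, 0.018)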



Connection between hypothesis tests and confidence intervals

There is a close relationship between the test of a hypothesis about any parameter, say θ,
and the confidence interval for θ. If [l, u] is a 100(1 − α)% confidence interval for the parameter
θ, the test of size α of the hypothesis

H0 : θ = θ0 , H1 : θ 6= θ0

will lead to rejection of H0 if and only if θ0 is not in the 100(1 − α)% confidence interval [l, u].
Although hypothesis tests and CIs are equivalent procedures insofar as decision making
or inference about µ is concerned, each provides somewhat different insights. For instance,
the confidence interval provides a range of likely values for µ at a stated confidence level,
whereas hypothesis testing is an easy framework for displaying the risk levels such as the P -
value associated with a specific decision.

Type II error and choice of sample size

In testing hypotheses, the analyst directly selects the type I error probability. However, the
probability of type II error β depends on the choice of sample size. In this section, we will
show how to calculate the probability of type II error β. We will also show how to select the
sample size to obtain a specified value of β.
In the following we will derive the formula for β of the two-sided test. The ones for one-
sided tests can be derived in a similar way and we leave it as an exercise for the reader.
Finding the probability of type II error β: Consider the two-sided hypothesis

H0 : µ = µ0 , H1 : µ 6= µ0 .

Suppose the null hypothesis is false and that the true value of the mean is µ = µ0 + δ for some
δ. The test statistic Z0 is
Z0 = (X̄ − µ0 )/(σ/√n) = (X̄ − (µ0 + δ))/(σ/√n) + δ√n/σ ∼ N (δ√n/σ, 1).
Therefore, the probability of type II error is β = Pµ0 +δ (|Z0 | ≤ zα/2 ), i.e.,

Type II error for two-sided test


β = Φ(zα/2 − δ√n/σ) − Φ(−zα/2 − δ√n/σ). (6.4)

Sample size formula There is no closed form for n from equation (6.4). However, we can
estimate n as follows.

Case 1 If δ > 0, then Φ(−zα/2 − δ√n/σ) ≈ 0, so

β ≈ Φ(zα/2 − δ√n/σ) ⇔ n ≈ (zα/2 + zβ )²σ²/δ².

Case 2 If δ < 0, then Φ(zα/2 − δ√n/σ) ≈ 1, so

β ≈ 1 − Φ(−zα/2 − δ√n/σ) ⇔ n ≈ (zα/2 + zβ )²σ²/δ².
Therefore, the sample size formula is defined by

Sample size formula for two-sided test

n ≈ (zα/2 + zβ )²σ²/δ²
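For instance, a short Python helper (ours) evaluates this formula; the values σ = 6, δ = 2, α = 0.05, β = 0.1 below are only illustrative.

    import math
    from scipy.stats import norm

    def sample_size_two_sided(sigma, delta, alpha, beta):
        # n ~ (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2, rounded up
        z_a2 = norm.ppf(1 - alpha / 2)
        z_b = norm.ppf(1 - beta)
        return math.ceil((z_a2 + z_b) ** 2 * sigma ** 2 / delta ** 2)

    print(sample_size_two_sided(sigma=6, delta=2, alpha=0.05, beta=0.10))   # about 95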

Large-sample test

We have developed the test procedure for the null hypothesis H0 : µ = µ0 assuming that
the population is normally distributed and that σ 2 is known. In many if not most practical sit-
uations σ 2 will be unknown. Furthermore, we may not be certain that the population is well
modeled by a normal distribution. In these situations if n is large (say n > 40) the sample vari-
ance s2 can be substituted for σ 2 in the test procedures with little effect. Thus, while we have
given a test for the mean of a normal distribution with known σ 2 , it can be easily converted
into a large-sample test procedure for unknown σ 2 that is valid regardless of the form of the
distribution of the population. This large-sample test relies on the central limit theorem just
as the large-sample confidence interval on µ that was presented in the previous chapter did.
Exact treatment of the case where the population is normal, σ 2 is unknown, and n is small
involves use of the t distribution and will be deferred to the next section.

6.4.2 Hypothesis test on the mean of a normal distribution, variance σ² unknown
Hypothesis test on the mean

We assume again that a random sample X1 , X2 , . . . , Xn has been taken from a normal
N (µ, σ 2 ) population. Recall that X̄ and s²(X) are the sample mean and sample variance of the
random sample X1 , X2 , . . . , Xn , respectively. It is known that

tn−1 = (X̄ − µ)/(s(X)/√n)

has a t distribution with n − 1 degrees of freedom. This fact leads to the following test on the
mean µ.

Null hypothesis: H0 : µ = µ0
Test statistic: T0 = (X̄ − µ0 )/(s(X)/√n)

Alternative hypothesis    Rejection criteria    P -value

H1 : µ ≠ µ0               |T0 | > tα/2,n−1      2P(tn−1 > |T0 |)
H1 : µ > µ0               T0 > tα,n−1           P(tn−1 > T0 )
H1 : µ < µ0               T0 < −tα,n−1          P(tn−1 < T0 )

where ta,n−1 satisfies P[tn−1 > ta,n−1 ] = a.

Because the t-table in the Appendix contains a few critical values for each t distribution, com-
putation of the exact P -value directly from the table is usually impossible. However, it is easy
to find upper and lower bounds on the P -value from this table.
Example 6.4.2. The following data give the IQ score of 10 students.
112 116 115 120 118 125 118 113 117 121.
Suppose that the IQ score is normally distributed N(µ, σ 2 ), test the following hypotheses at
the 0.05 level of significance and estimate the P -value of each test.
(a) H0 : µ = 115 against H1 : µ ≠ 115.

(b) H0 : µ = 115 against H1 : µ > 115.

(c) H0 : µ = 120 against H1 : µ < 120.


Solution (a)
1. The parameter of interest is the mean IQ score µ.

2. We are going to test H0 : µ = 115 against H1 : µ ≠ 115.

3. Sample size n = 10,


sample mean X = 117.5,
sample variance s2 (X) = 14.944.

4. Significance level α = 0.05 so tα/2,9 = 2.262.

5. The test statistic is


T0 = (X̄ − µ0 )/(s(X)/√n) = (117.5 − 115)/(√14.944/√10) = 2.04.
6. Since |T0 | < tα/2,9 we do not reject H0 : µ = 115 in favour of H1 : µ ≠ 115 at the 0.05 level
of significance. More precisely, we conclude that there is not sufficient evidence that the
average IQ score differs from 115, based on a sample of 10 measurements.
Based on the table of Student distribution, we know that the P -value of this test is 2P(t9 >
2.04) ∈ (0.05; 0.1). The actual value of the P -value is 0.072.
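The same test can be run directly with scipy.stats.ttest_1samp (this sketch is ours; it covers part (a)):

    import numpy as np
    from scipy.stats import ttest_1samp

    iq = np.array([112, 116, 115, 120, 118, 125, 118, 113, 117, 121])
    t_stat, p_value = ttest_1samp(iq, popmean=115)   # two-sided test of H0: mu = 115
    print(t_stat, p_value)                           # about 2.04 and 0.07

In recent SciPy releases the one-sided versions in parts (b) and (c) can be obtained by passing alternative='greater' or alternative='less'.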

Type II error and choice of sample size

When the true value of the mean is µ = µ0 + δ, the distribution of T0 is called the non-
central t distribution with n − 1 degrees of freedom and non-centrality parameter δ√n/σ.
Therefore, the type II error of the two-sided alternative would be

β = P(|T0′ | ≤ tα/2,n−1 )

where T0′ denotes the non-central t random variable.

6.4.3 Hypothesis test on the variance of a normal distribution


The hypothesis testing procedures

We assume that a random sample X1 , X2 , . . . , Xn has been taken from a normal N (µ, σ 2 )
population. Since (n − 1)s2 (X)/σ 2 follows the chi-square distribution with n − 1 degrees of
freedom, we obtain the following test for value of σ 2 .

Null hypothesis: H0 : σ = σ0
Test statistic: χ²0 = (n − 1)s²(X)/σ0²

Alternative hypothesis    Rejection criteria                        P -value

H1 : σ ≠ σ0               χ²0 > cα/2,n−1 or χ²0 < c1−α/2,n−1        1 − |2P(χ²n−1 > χ²0 ) − 1|
H1 : σ > σ0               χ²0 > cα,n−1                              P(χ²n−1 > χ²0 )
H1 : σ < σ0               χ²0 < c1−α,n−1                            P(χ²n−1 < χ²0 )

where ca,n−1 satisfies P[χ²n−1 > ca,n−1 ] = a.

Example 6.4.3. An automatic filling machine is used to fill bottles with liquid detergent. A
random sample of 20 bottles results in a sample variance of fill volume of s2 = 0.0153 (fluid
ounces)2 . If the variance of fill volume exceeds 0.01 (fluid ounces)2 , an unacceptable propor-
tion of bottles will be underfilled or overfilled. Is there evidence in the sample data to suggest
that the manufacturer has a problem with underfilled or overfilled bottles? Use α = 0.05, and
assume that fill volume has a normal distribution.
Solution

1. The parameter of interest is the population variance σ 2 .

2. We are going to test H0 : σ 2 = 0.01 against H1 : σ 2 > 0.01.

3. Sample size n = 20,


sample variance s2 (X) = 0.0153.

4. Significance level α = 0.05 so cα,19 = 30.14.



5. The test statistic is


χ²0 = (n − 1)s²(X)/σ0² = (19 × 0.0153)/0.01 = 29.07.

6. Since χ20 < cα,19 , we conclude that there is no strong evidence that the variance of fill
volume exceeds 0.01 (fluid ounces)2 .

Since P(χ²19 > 27.20) = 0.10 and P(χ²19 > 30.14) = 0.05, we conclude that the P -value of the test
is in the interval (0.05, 0.10). Note that the actual P -value is 0.0649.
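A quick Python check (ours, assuming scipy):

    from scipy.stats import chi2

    n, s2, sigma0_sq, alpha = 20, 0.0153, 0.01, 0.05
    chi2_0 = (n - 1) * s2 / sigma0_sq          # 29.07
    critical = chi2.ppf(1 - alpha, n - 1)      # c_{0.05,19} = 30.14
    p_value = chi2.sf(chi2_0, n - 1)           # about 0.065
    print(chi2_0, critical, p_value)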

6.4.4 Test on a population proportion


Large-Sample tests on a proportion

Let (X1 , . . . , Xn ) be a random sample observed from a random variable X with B(1, p)
distribution. Then p̂ = X̄ is a point estimator of p. By the Central Limit Theorem, when n is
large, p̂ is approximately normal with mean p and variance p(1 − p)/n. We thus obtain the
following test for the value of p.

Null hypothesis: H0 : p = p0
Test statistic: Z0 = √n(X̄ − p0 )/√(p0 (1 − p0 ))

Alternative hypothesis    Rejection criteria    P -value

H1 : p ≠ p0               |Z0 | > zα/2          2(1 − Φ(|Z0 |))
H1 : p > p0               Z0 > zα               1 − Φ(Z0 )
H1 : p < p0               Z0 < −zα              Φ(Z0 )

Example 6.4.4. A semiconductor manufacturer produces controllers used in automobile en-


gine applications. The customer requires that the process fallout or fraction defective at a
critical manufacturing step not exceed 0.05 and that the manufacturer demonstrate process
capability at this level of quality using α = 0.05. The semiconductor manufacturer takes a ran-
dom sample of 200 devices and finds that four of them are defective. Can the manufacturer
demonstrate process capability for the customer?
Solution

1. The parameter of interest is the process fraction defective p.

2. H0 : p = 0.05 against H1 : p < 0.05.


3. The sample size n = 200, and sample proportion X̄ = 4/200 = 0.02.

4. Significance level α = 0.05 so zα = 1.645.



5. The test statistic is


Z0 = √n(X̄ − p0 )/√(p0 (1 − p0 )) = √200 (0.02 − 0.05)/√(0.05 × 0.95) = −1.947.

6. Since Z0 < −zα , we reject H0 and conclude that the process fraction defective p is less
than 0.05. The P -value for this value of the test statistic is Φ(−1.947) = 0.0256, which is
less than α = 0.05. We conclude that the process is capable.
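A direct computation in Python (ours, assuming scipy):

    import numpy as np
    from scipy.stats import norm

    n, x, p0, alpha = 200, 4, 0.05, 0.05
    p_hat = x / n
    z0 = np.sqrt(n) * (p_hat - p0) / np.sqrt(p0 * (1 - p0))   # -1.947
    p_value = norm.cdf(z0)                                    # one-sided (H1: p < p0), about 0.026
    print(z0, p_value, p_value < alpha)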

Type II error and choice of sample size

Suppose that p is the true value of the population proportion. The approximate β-error is
given as follows.

• the two-sided alternative H1 : p ≠ p0

β ≈ Φ((p0 − p + zα/2 √(p0 (1 − p0 )/n))/√(p(1 − p)/n)) − Φ((p0 − p − zα/2 √(p0 (1 − p0 )/n))/√(p(1 − p)/n))

• the one-sided alternative H1 : p < p0

β ≈ 1 − Φ((p0 − p − zα √(p0 (1 − p0 )/n))/√(p(1 − p)/n))

• the one-sided alternative H1 : p > p0

β ≈ Φ((p0 − p + zα √(p0 (1 − p0 )/n))/√(p(1 − p)/n))

These equations can be solved to find the approximate sample size n that gives a test of level
α that has a specified β risk. The sample size is defined as follows.

• the two-sided alternative H1 : p ≠ p0

n = [(zα/2 √(p0 (1 − p0 )) + zβ √(p(1 − p)))/(p − p0 )]²

• the one-sided alternative

n = [(zα √(p0 (1 − p0 )) + zβ √(p(1 − p)))/(p − p0 )]²

6.5 Some well-known tests for two samples

6.5.1 Inference for a difference in means of two normal distributions, variances known

In this section we consider statistical inferences on the difference in means µ1 − µ2 of two
normal distributions, where the variances σ1² and σ2² are known. The assumptions for this
section are summarized as follows.

X11 , X12 , . . . , X1n1 is a random sample from population 1.


X21 , X22 , . . . , X2n2 is a random sample from population 2. (6.5)
The two populations represented by X1 and X2 are independent.
Both populations are normal.

The inference for µ1 − µ2 is based on the following result.

Theorem 6.5.1. Under the assumptions stated above, the quantity

Z = (X̄1 − X̄2 − (µ1 − µ2 ))/√(σ1²/n1 + σ2²/n2 ) ∼ N(0, 1).

Null hypothesis: H0 : µ1 − µ2 = ∆0
Test statistic: Z0 = (X̄1 − X̄2 − ∆0 )/√(σ1²/n1 + σ2²/n2 )

Alternative hypothesis    Rejection criteria

H1 : µ1 − µ2 ≠ ∆0         |Z0 | > zα/2
H1 : µ1 − µ2 > ∆0         Z0 > zα
H1 : µ1 − µ2 < ∆0         Z0 < −zα

When the population variances are unknown, the sample variances s21 and s22 can be substi-
tuted into the test statistic Z0 to produce a large-sample test for the difference in means. This
procedure will also work well when the populations are not necessarily normally distributed.
However, both n1 and n2 should exceed 40 for this large-sample test to be valid.

Example 6.5.2. A product developer is interested in reducing the drying time of a primer
paint. Two formulations of the paint are tested; formulation 1 is the standard chemistry, and
formulation 2 has a new drying ingredient that should reduce the drying time. From expe-
rience, it is known that the standard deviation of drying time is 8 minutes, and this inherent
variability should be unaffected by the addition of the new ingredient. Ten specimens are

painted with formulation 1, and another 10 specimens are painted with formulation 2; the 20
specimens are painted in random order. The two sample average drying times are X 1 = 121
minutes and X 2 = 112 minutes, respectively. What conclusions can the product developer
draw about the effectiveness of the new ingredient, using α = 0.05?
Solution:
1. The quantity of interest is the difference in mean drying time, µ1 − µ2 , and ∆0 = 0.

2. We are going to test: H0 : µ1 − µ2 = 0 vs H1 : µ1 > µ2 .

3. The sample sizes are n1 = n2 = 10.

4. The significance level α = 0.05 so zα = 1.645.

5. The test statistic is


Z0 = (121 − 112)/√(8²/10 + 8²/10) = 2.52.

6. Since Z0 > zα = z0.05 = 1.645, we reject H0 at the α = 0.05 level and conclude
that adding the new ingredient to the paint significantly reduces the drying time.
Alternatively, we can find the P -value for this test as
P -value = 1 − Φ(2.52) = 0.0059.
Therefore H0 : µ1 = µ2 would be rejected at any significance level α ≥ 0.0059.
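In Python (a sketch of ours, assuming scipy) the same computation reads:

    import numpy as np
    from scipy.stats import norm

    x1bar, x2bar, sigma, n1, n2 = 121.0, 112.0, 8.0, 10, 10
    z0 = (x1bar - x2bar) / np.sqrt(sigma**2 / n1 + sigma**2 / n2)   # 2.52
    p_value = 1 - norm.cdf(z0)                                      # about 0.0059
    print(z0, p_value)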

Type 2 error and choice of sample size

6.5.2 Inference for the difference in means of two normal distributions, variances unknown
Case 1: σ1² = σ2² = σ²

Suppose we have two independent normal populations with unknown means µ1 and µ2 ,
and unknown but equal variances σ 2 . Assume that assumptions (6.5) hold.
The pooled estimator of σ², denoted by Sp², is defined by

Sp² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2).
The inference for µ1 − µ2 is based on the following result.
Theorem 6.5.3. Under all the assumptions mentioned above, the quantity

T = (X̄1 − X̄2 − (µ1 − µ2 ))/(Sp √(1/n1 + 1/n2 ))

has a Student’s t distribution with n1 + n2 − 2 degrees of freedom.



Null hypothesis: H0 : µ1 − µ2 = ∆0
Test statistic: T0 = (X̄1 − X̄2 − ∆0 )/(Sp √(1/n1 + 1/n2 ))

Alternative hypothesis    Rejection criteria

H1 : µ1 − µ2 ≠ ∆0         |T0 | > tα/2,n1 +n2 −2
H1 : µ1 − µ2 > ∆0         T0 > tα,n1 +n2 −2
H1 : µ1 − µ2 < ∆0         T0 < −tα,n1 +n2 −2

Example 6.5.4. The IQ’s of 9 children in a district of a large city have empirical mean 107 and
standard deviation 10. The IQ’s of 12 children in another district have empirical mean 112
and standard deviation 9. Test the equality of means at the 0.05 level of significance.
Example 6.5.5. Two catalysts are being analyzed to determine how they affect the mean yield
of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
Since catalyst 2 is cheaper, it should be adopted, provided it does not change the process
yield. A test is run in the pilot plant and results in the data shown in the following table.
Observation Num. Catalyst 1 Catalyst 2
1 91.50 89.19
2 94.18 90.95
3 92.18 90.46
4 95.39 93.21
5 91.79 97.19
6 89.07 97.04
7 94.72 91.07
8 89.21 92.75
Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.
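One way to carry out this test (a sketch of ours, assuming scipy) is with scipy.stats.ttest_ind, which performs the pooled two-sample t-test when equal_var=True:

    import numpy as np
    from scipy.stats import ttest_ind

    cat1 = np.array([91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21])
    cat2 = np.array([89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75])

    t_stat, p_value = ttest_ind(cat1, cat2, equal_var=True)   # pooled two-sample t-test
    print(t_stat, p_value)        # reject H0 at level 0.05 only if p_value < 0.05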

Case 2: σ1² ≠ σ2²

In some situations, we cannot reasonably assume that the unknown variances σ1² and σ2²
are equal. There is no exact t-statistic available for testing H0 : µ1 − µ2 = ∆0 in this case.
However, if H0 is true, the statistic

T0∗ = (X̄1 − X̄2 − ∆0 )/√(s1²/n1 + s2²/n2 )
is distributed approximately as t with degrees of freedom given by
ν = (s1²/n1 + s2²/n2 )² / [ (s1²/n1 )²/(n1 − 1) + (s2²/n2 )²/(n2 − 1) ].

Therefore, if σ1² ≠ σ2² , the hypotheses on differences in the means of two normal distributions
are tested as in the equal variances case, except that T0∗ is used as the test statistic and n1 + n2 − 2
is replaced by ν in determining the degrees of freedom for the test.

6.5.3 Paired t-test


A special case of the two-sample t-tests of the previous section occurs when the observa-
tions on the two populations of interest are collected in pairs. Each pair of observations, say
(Xj , Yj ), is taken under homogeneous conditions, but these conditions may change from one
pair to another. For example, suppose that we are interested in comparing two different types
of tips for a hardness-testing machine. This machine presses the tip into a metal specimen
with a known force. By measuring the depth of the depression caused by the tip, the hard-
ness of the specimen can be determined. If several specimens were selected at random, half
tested with tip 1, half tested with tip 2, and the pooled or independent t-test of the previous
section was applied, the results of the test could be erroneous. The metal specimens could have been
cut from bar stock that was produced in different heats, or they might not be homogeneous
in some other way that might affect hardness. Then the observed difference between mean
hardness readings for the two tip types also includes hardness differences between speci-
mens.
A more powerful experimental procedure is to collect the data in pairs - that is, to make
two hardness readings on each specimen, one with each tip. The test procedure would then
consist of analyzing the differences between hardness readings on each specimen. If there is
no difference between tips, the mean of the differences should be zero. This test procedure is
called the paired t-test.
Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be a set of n paired observations where we assume that
the mean and variance of the population represented by X are µX and σX², and the mean and
variance of the population represented by Y are µY and σY². Define the differences between
each pair of observations as Dj = Xj − Yj , j = 1, 2, . . . , n. The Dj ’s are assumed to be nor-
mally distributed with mean µD = µX − µY and variance σD², so testing hypotheses about the
difference between µX and µY can be accomplished by performing a one-sample t-test on µD .

Null hypothesis: H0 : µD = ∆0
Test statistic: T0 = (D̄ − ∆0 )/(SD /√n)

Alternative hypothesis Rejection criteria


H1 : µD ≠ ∆0              |T0 | > tα/2,n−1
H1 : µD > ∆0 T0 > tα,n−1
H1 : µD < ∆0 T0 < −tα,n−1

Example 6.5.6. An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares
several methods for predicting the shear strength for steel plate girders. Data for two of these
methods, the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are
shown in the following table.

Karlsruhe Method 1.186 1.151 1.322 1.339 1.200 1.402 1.365 1.537 1.559
Lehigh Method 1.061 0.992 1.063 1.062 1.065 1.178 1.037 1.086 1.052
Difference Dj 0.119 0.159 0.259 0.277 0.138 0.224 0.328 0.451 0.507

Test whether there is any difference (on the average) between the two methods?
Solution:

D̄ = 0.2736, SD² = 0.018349, T0 = 6.05939, t0.025,8 = 2.306.

Since |T0 | > t0.025,8 , we conclude that there is a difference between the two methods at the 0.05 level of significance.
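With scipy (a sketch of ours), scipy.stats.ttest_rel performs this paired t-test (two-sided by default):

    import numpy as np
    from scipy.stats import ttest_rel

    karlsruhe = np.array([1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559])
    lehigh    = np.array([1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052])

    t_stat, p_value = ttest_rel(karlsruhe, lehigh)   # one-sample t-test on the paired differences
    print(t_stat, p_value)                           # t is about 6.1; p is far below 0.05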

6.5.4 Inference on the variance of two normal populations


A hypothesis-testing procedure for the equality of two variances is based on the following
result.

Theorem 6.5.7. Let X11 , X12 , . . . , X1n1 be a random sample from a normal population with
mean µ1 and variance σ12 and let X21 , X22 , . . . , X2n2 be a random sample from a second normal
population with mean µ2 and variance σ22 . Assume that both normal populations are indepen-
dent. Let s21 and s22 be the sample variances. Then the ratio

F = (s1²/σ1²)/(s2²/σ2²)

has an F distribution with n1 −1 numerator degrees of freedom and n2 −1 denominator degrees


of freedom.

This result is based on the fact that (n1 − 1)s1²/σ1² is a chi-square random variable with n1 − 1
degrees of freedom, that (n2 − 1)s2²/σ2² is a chi-square random variable with n2 − 1 degrees
of freedom, and that the two normal populations are independent. Clearly under the null
hypothesis H0 : σ1² = σ2² , the ratio F0 = s1²/s2² has an Fn1 −1,n2 −1 distribution. Let fα,n1 −1,n2 −1 be a
constant satisfying
P[F0 > fα,n1 −1,n2 −1 ] = α.
It follows from the properties of the F distribution that

f1−α,n1 −1,n2 −1 = 1/fα,n2 −1,n1 −1 .

Null hypothesis: H0 : σ1² = σ2²
Test statistic: F0 = s1²/s2²

Alternative hypothesis    Rejection criteria

H1 : σ1² ≠ σ2²            F0 > fα/2,n1 −1,n2 −1 or F0 < f1−α/2,n1 −1,n2 −1
H1 : σ1² > σ2²            F0 > fα,n1 −1,n2 −1
H1 : σ1² < σ2²            F0 < f1−α,n1 −1,n2 −1

Example 6.5.8. Oxide layers on semiconductor wafers are etched in a mixture of gases to
achieve the proper thickness. The variability in the thickness of these oxide layers is a critical
characteristic of the wafer, and low variability is desirable for subsequent processing steps.
Two different mixtures of gases are being studied to determine whether one is superior in re-
ducing the variability of the oxide thickness. Twenty wafers are etched in each gas. The sam-
ple standard deviations of oxide thickness are s1 = 1.96 angstroms and s2 = 2.13 angstroms,
respectively. Is there any evidence to indicate that either gas is preferable? Use α = 0.05.
Solution: At the α = 0.05 level of significance we need to test

H0 : σ1² = σ2² vs H1 : σ1² ≠ σ2²


Since n1 = n2 = 20, we will reject H0 if F0 = s1²/s2² > f0.025,19,19 = 2.53 or F0 < f0.975,19,19 = 1/2.53 = 0.40.
Computation: F0 = 1.96²/2.13² = 0.85. Hence we cannot reject the null hypothesis H0 at the 0.05

level of significance. Therefore, there is no strong evidence to indicate that either gas results
in a smaller variance of oxide thickness.
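A Python check (ours, assuming scipy):

    from scipy.stats import f

    s1, s2, n1, n2, alpha = 1.96, 2.13, 20, 20, 0.05
    F0 = s1**2 / s2**2                              # 0.85
    upper = f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)    # f_{0.025,19,19} = 2.53
    lower = f.ppf(alpha / 2, n1 - 1, n2 - 1)        # f_{0.975,19,19} = 0.40
    print(F0, (lower, upper), F0 > upper or F0 < lower)   # do not reject H0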

6.5.5 Inference on two population proportions


We now consider the case where there are two binomial parameters of interest, say, p1 and
p2 , and we wish to draw inferences about these proportions. We will present large-sample
hypothesis testing based on the normal approximation to the binomial.
Suppose that two independent random samples of sizes n1 and n2 are taken from two pop-
ulations, and let X1 and X2 represent the number of observations that belong to the class of
interest in samples 1 and 2, respectively. Furthermore, suppose that the normal approxima-
tion to the binomial is applied to each population, so the estimators of the population pro-
portions P̂1 = X1 /n1 and P̂2 = X2 /n2 have approximate normal distribution. Moreover, under
the null hypothesis H0 : p1 = p2 = p, the random variable

Z = (P̂1 − P̂2 )/√(p(1 − p)(1/n1 + 1/n2 ))

is distributed approximately N (0, 1).


This leads to the test procedures described below.

Null hypothesis: H0 : p1 = p2
Test statistic: Z0 = (P̂1 − P̂2 )/√(p̂(1 − p̂)(1/n1 + 1/n2 )) with p̂ = (X1 + X2 )/(n1 + n2 ).

Alternative hypothesis    Rejection criteria

H1 : p1 ≠ p2              |Z0 | > zα/2
H1 : p1 > p2              Z0 > zα
H1 : p1 < p2              Z0 < −zα

Example 6.5.9. Two different types of polishing solution are being evaluated for possible use
in a tumble-polish operation for manufacturing intraocular lenses used in the human eye fol-
lowing cataract surgery. Three hundred lenses were tumble-polished using the first polish-
ing solution, and of this number 253 had no polishing-induced defects. Another 300 lenses
were tumble-polished using the second polishing solution, and 196 lenses were satisfactory
upon completion. Is there any reason to believe that the two polishing solutions differ? Use
α = 0.01.
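A sketch (ours, assuming scipy) of how one could carry out this two-proportion z-test:

    import numpy as np
    from scipy.stats import norm

    x1, n1, x2, n2, alpha = 253, 300, 196, 300, 0.01
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)                   # pooled estimate under H0
    z0 = (p1_hat - p2_hat) / np.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - norm.cdf(abs(z0)))           # two-sided alternative
    print(z0, p_value, p_value < alpha)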

6.6 The chi-square test

6.6.1 Goodness-of-fit test


Suppose that a large population consists of items of k different types, and let pi denote the
probability that an item selected at random will be of type i(i = 1, . . . , k). Suppose that the
following hypothesis is to be tested

H0 : pi = p0i for i = 1, . . . k vs H1 : pi = p0i for at least one value of i

We shall assume that a random sample of size n is to be taken from the given population.
That is, n independent observations are to be taken, and there is probability pi that each ob-
servation will be of type i(i = 1, ..., k). On the basis of these n observations, the hypothesis is
to be tested.
For each i, we denote by Ni the number of observations in the random sample that are of type
i.

Theorem 6.6.1 (Pearson’s theorem). The following statistic


Q = Σ_{i=1}^k (Ni − np0i )²/(np0i )

has the property that if H0 is true and the sample size n → ∞, then Q converges in
distribution to the χ2 distribution with k − 1 degrees of freedom.

Chi-squared goodness-of-fit test for simple hypothesis

Suppose that we observe an i.i.d. sample X1 , . . . , Xn of random variables that take a finite
number of values B1 , . . . , Bk with unknown probabilities p1 , . . . , pk . Consider the hypotheses

H0 : pi = p0i for i = 1, . . . , k vs H1 : pi ≠ p0i for at least one value of i

If the null hypothesis H0 is true then by Pearson’s theorem,


Q = Σ_{i=1}^k (Ni − np0i )²/(np0i ) →d χ²k−1

where Ni is the number of Xj equal to Bi . On the other hand, if H1 holds, then for some index i∗ ,
pi∗ ≠ p0i∗ . We write

(Ni∗ − np0i∗ )/√(np0i∗ ) = √(pi∗ /p0i∗ ) · (Ni∗ − npi∗ )/√(npi∗ ) + √n (pi∗ − p0i∗ )/√(p0i∗ ).
The first term converges to N(0, (1 − pi∗ )pi∗ /p0i∗ ) by the central limit theorem while the second
term diverges to plus or minus infinity. It means that if H1 holds then Q → ∞. Therefore, we
will reject H0 if Q ≥ cα,k−1 where cα,k−1 is chosen such that the error of type I is equal to the
level of significance α:
α = P0 (Q > cα,k−1 ) ≈ P(χ²k−1 > cα,k−1 ).
This test is called the chi-squared goodness-of-fit test.

Null hypothesis: H0 : pi = p0i for i = 1, . . . , k

Test statistic
Q = Σ_{i=1}^k (Ni − np0i )²/(np0i )

Alternative hypothesis                        Rejection criteria    P -value

H1 : pi ≠ p0i for at least one value of i     Q ≥ cα,k−1            P(χ²k−1 > Q)

Blood type A B AB O
Number of people 2162 738 228 2876.

Table 6.2: Blood types

Example 6.6.2. A study of blood types among a sample of 6004 people gives the following
result
A previous study claimed that the proportions of people whose blood is of type A, B, AB and
O are 33.33%, 12.5%, 4.17% and 50%, respectively.
We can use the data in Table 6.2 to test the null hypothesis H0 that the probabilities (p1 , p2 , p3 , p4 )
of the four blood types equal (1/3, 1/8, 1/24, 1/2). The χ² test statistic is then

Q = (2162 − 6004 × 1/3)²/(6004 × 1/3) + (738 − 6004 × 1/8)²/(6004 × 1/8) + (228 − 6004 × 1/24)²/(6004 × 1/24) + (2876 − 6004 × 1/2)²/(6004 × 1/2) = 20.37
To test H0 at the level α0 , we would compare Q to the 1 − α0 quantile of the χ2 distribution
with three degrees of freedom. Alternatively, we can compute the P -value, which would be
the smallest α0 at which we could reject H0 . In general, the P -value is 1 − F (Q) where F is the
cumulative distribution function of the χ2 distribution with k − 1 degrees of freedom. In this
example k = 4 and Q = 20.37, so the p-value is 1.42 × 10−4 .
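The same computation with scipy.stats.chisquare (a sketch of ours):

    import numpy as np
    from scipy.stats import chisquare

    observed = np.array([2162, 738, 228, 2876])
    n = observed.sum()                               # 6004
    p0 = np.array([1/3, 1/8, 1/24, 1/2])
    stat, p_value = chisquare(observed, f_exp=n * p0)
    print(stat, p_value)                             # about 20.4 and 1.4e-4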

Goodness-of-fit for continuous distribution

Let X1 , . . . , Xn be an i.i.d. sample from unknown distribution P and consider the following
hypotheses
H0 : P = P0 vs H1 : P ≠ P0
for some particular, possibly continuous distribution P0 . To do this, we will split a set of all
possible outcomes of Xi , say X, into a finite number of intervals I1 , . . . , Ik . The null hypothesis
H0 implies that for all intervals

P(X ∈ Ij ) = P0 (X ∈ Ij ) = p0j .

Therefore, we can do a chi-squared test for

H0′ : P(X ∈ Ij ) = p0j for all j ≤ k vs H1′ : otherwise.

It is clear that H0 implies H0′ . However, the fact that H0′ holds does not guarantee that H0
holds. There are many distributions different from P0 that have the same probabilities on the
intervals I1 , . . . , Ik as P0 . On one hand, if we group into more and more intervals, our discrete
approximation of P will get closer and closer to P , so in some sense H0′ will get ’closer’ to
H0 . However, we cannot split into too many intervals either, because the χ²k−1 -distribution
approximation for the statistic Q in Pearson’s theorem is asymptotic. The rule of thumb is to group
the data in such a way that the expected count in each interval np0i is at least 5.

Example 6.6.3. Suppose that we wish to test the null hypothesis that the logarithms of the
lifetime of ball bearings are an i.i.d. sample from the normal distribution with mean ln(50) =
3.912 and variance 0.25. The observed logarithms are

2.88 3.36 3.95 3.99 4.53 4.59 3.50 3.73


4.02 4.22 4.66 4.66 3.74 3.82 4.23 4.23
4.85 4.85 5.16 3.88 3.95 4.23 4.43

In order to have the expected count in each interval be at least 5, we can use at most k = 4
intervals. We shall make these intervals each have probability 0.25 under the null hypothesis.
That is, we shall divide the intervals at the 0.25, 0.5, and 0.75 quantiles of the hypothesized
normal distribution. These quantiles are

3.912 + 0.5Φ−1 (0.25) = 3.575


3.912 + 0.5Φ−1 (0.5) = 3.912
3.912 + 0.5Φ−1 (0.75) = 4.249.

The numbers of observations in the four intervals are then 3, 4, 8 and 8. We then calcu-
late
Q = 3.609.
Our table of the χ2 distribution with three degrees of freedom indicates that 3.609 is between
the 0.6 and 0.7 quantiles, so we would not reject the null hypothesis at levels less than 0.3 and reject
the null hypothesis at levels greater than 0.4. (Actually, the P -value is 0.307.)

Goodness-of-fit for composite hypotheses

We can extend the goodness-of-fit test to deal with the case in which the null hypothesis
is that the distribution of our data belongs to a particular parametric family. The alternative
hypothesis is that the data have a distribution that is not a member of that parametric family.
There are two changes to the test procedure in going from the case of a simple null hypothesis
to the case of a composite null hypothesis. First, in the test statistic Q, the probabilities p0i are
replaced by estimated probabilities based on the parametric family. Second, the degrees of
freedom are reduced by the number of parameters.
Let us start with a discrete case when a random variable takes a finite number of values
B1 , . . . , Bk and
pi = P(X = Bi ), i = 1, . . . , k.
We would like to test a hypothesis that this distribution comes from a family of distributions
{Pθ : θ ∈ Θ}. In other words, if we denote

pj (θ) = Pθ (X = Bj ),

we want to test

H0 : pj = pj (θ) for all j ≤ k for some θ ∈ Θ vs H1 : otherwise.

The situation now is complicated since we want to test if pj = pj (θ), j ≤ k, at least for some
θ ∈ Θ which means that we may have many candidates for θ. One way to approach this
problem is as follows.

Step 1: Assuming that hypothesis H0 holds, we can find an estimator θ∗ of this unknown θ.

Step 2: Try to test if, indeed, the distribution P is equal to Pθ∗ by using the statistic

Q∗ = Σ_{i=1}^k (Ni − npi (θ∗ ))²/(npi (θ∗ ))

in the chi-squared goodness-of-fit test.

This approach looks natural; the only question is what estimate θ∗ to use and how the fact
that θ∗ also depends on the data will affect the convergence of Q∗ . It turns out that if we let θ∗
be the maximum likelihood estimate, i.e. θ that maximizes the likelihood function

φ(θ) = p1 (θ)N1 . . . pk (θ)Nk ,

then the statistic Q∗ converges in distribution to a χ²k−s−1 distribution with k − s − 1 degrees of
freedom, where s is the dimension of the parameter set Θ. Note that we must assume s ≤ k − 2
so that we have at least one degree of freedom. Very informally, by dimension we understand
the number of free parameters that describe the set

{(p1 (θ), . . . , pk (θ)) : θ ∈ Θ}.

Then we will reject H0 if Q∗ > c where the threshold c is determined from the condition

P(Q∗ > c | H0 ) = α

where α ∈ [0, 1] is the level of significance.

Example 6.6.4. Suppose that a gene has two possible alleles A1 and A2 and the combinations
of these alleles define three genotypes A1 A1 , A1 A2 and A2 A2 . We want to test a theory that

Probability to pass A to a child = θ
1
Probability to pass A2 to a child = 1 − θ

and that the probabilities of genotypes are given by

p1 (θ) = P(A1 A1 ) = θ2
p2 (θ) = P(A1 A2 ) = 2θ(1 − θ)
p3 (θ) = P(A2 A2 ) = (1 − θ)2 .

Suppose that given a random sample X1 , . . . , Xn from the population the counts of each geno-
type are N1 , N2 and N3 . To test the theory we want to test the hypothesis

H0 : pi = pi (θ), i = 1, 2, 3 vs H1 : otherwise.

First of all, the dimension of the parameter set is s = 1 since the distributions are determined
by one parameter θ. To find the MLE θ∗ we have to maximize the likelihood function

p1 (θ)N1 p2 (θ)N2 p3 (θ)N3 .

After computing the critical point by setting the derivative equal to 0, we get
2N1 + N2
θ∗ = .
2n
Therefore, under the null hypothesis H0 the statistic

Q∗ = Σ_{i=1}^3 (Ni − npi (θ∗ ))²/(npi (θ∗ ))

converges to χ2 distribution with 1 degree of freedom. Therefore, if α = 0.05 we will reject H0


if Q∗ > 3.841.
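A numerical sketch of this test (ours, with made-up counts N1 = 80, N2 = 90, N3 = 30 and scipy assumed available):

    import numpy as np
    from scipy.stats import chi2

    N = np.array([80, 90, 30])                       # hypothetical genotype counts (A1A1, A1A2, A2A2)
    n = N.sum()
    theta = (2 * N[0] + N[1]) / (2 * n)              # MLE of theta
    p = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta)**2])
    Q = np.sum((N - n * p)**2 / (n * p))             # chi-squared statistic
    print(theta, Q, Q > chi2.ppf(0.95, df=1))        # reject H0 at the 5% level?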

In the case when the distributions Pθ are continuous or, more generally, have an infinite num-
ber of values that must be grouped in order to use the chi-squared test (for example, normal or
Poisson distributions), it can be a difficult numerical problem to maximize the “grouped” like-
lihood function
Pθ (I1 )^{N1} · · · Pθ (Ik )^{Nk} .
It is tempting to use a usual non-grouped MLE θ̂ of θ instead of the above θ∗ because it is often
easier to compute, in fact, for many distributions we know explicit formulas for these MLEs.
However, if we use θ̂ in the statistic
k
X (Ni − npi (θ̂))2
Q̂ =
i=1 npi (θ̂)

then it will no longer converges to χ2r−s−1 distribution. It has been shown that typically this Q̂
will converge to a distribution “in between” χ2k−s−1 and χ2k−s 1 . Thus, a conservative decision
rule is to reject H0 whether Q̂ > c where c is chosen such that P(χ2k−1 > c) = α.

Example 6.6.5 (Testing Whether a Distribution Is Normal). Consider now a problem in which
a random sample X1 , ..., Xn is taken from some continuous distribution for which the p.d.f.
is unknown, and it is desired to test the null hypothesis H0 that this distribution is a normal
1
Chernoff, Herman; Lehmann, E. L. (1954) The use of maximum likelihood estimates in χ2 tests for goodness
of fit. Ann. Math. Statistics 25, pp. 579-586.
6.6. THE CHI-SQUARE TEST 156

distribution against the alternative hypothesis H1 that the distribution is not normal. To per-
form a χ2 test of goodness-of-fit in this problem, divide the real line into k subintervals and
count the number Ni of observations in the random sample that fall into the ith subinterval
(i = 1, ..., k).
If H0 is true, and if µ and σ 2 denote the unknown mean and variance of the normal dis-
tribution, then the parameter vector θ is the two-dimensional vector θ = (µ, σ 2 ). The probability
πi (θ), or πi (µ, σ 2 ), that an observation will fall within the ith subinterval, is the probability as-
signed to that subinterval by the normal distribution with mean µ and variance σ 2 . In other
words, if the ith subinterval is the interval from ai to bi , then

πi (µ, σ 2 ) = Φ((bi − µ)/σ) − Φ((ai − µ)/σ).
It is important to note that in order to calculate the value of the statistic Q∗ , the M.L.E.’s µ∗ and
σ 2∗ must be found by using the numbers N1 , ..., Nk of observations in the different subinter-
vals. The M.L.E.’s should not be found by using the observed values of X1 , ..., Xn themselves.
In other words, µ∗ and σ 2∗ will be the values of µ and σ 2 that maximize the likelihood function

L(µ, σ 2 ) = [π1 (µ, σ 2 )]N1 · · · [πk (µ, σ 2 )]Nk . (6.6)


Because of the complicated nature of the function πi (µ, σ 2 ) a lengthy numerical computation
would usually be required to determine the values of µ and σ 2 that maximize L(µ, σ 2 ). On the
other hand, we know that the M.L.E.’s of µ and σ 2 based on the n observed values X1 , ..., Xn
in the original sample are simply the sample mean X n and the sample variance s2n . Further-
more, if the estimators that maximize the likelihood function L(µ, σ 2 ) are used to calculate the
statistic Q∗ , then we know that when H0 is true, the distribution of Q∗ will be approximately
the χ2 distribution with k − 3 degrees of freedom. On the other hand, if the M.L.E.’s X n and s2n ,
which are based on the observed values in the original sample, are used to calculate Q̂, then
this χ2 approximation to the distribution of Q̂ will not be appropriate. Indeed, the distribution
of Q̂ is asymptotically “in between” χ2k−3 and χ2k−1 .
Return to Example 6.6.3. We are now in a position to try to test the composite null hypoth-
esis that the logarithms of ball bearing lifetimes have some normal distribution. We shall di-
vide the real line into the subintervals (−∞, 3.575], (3.575, 3.912], (3.912, 4.249], and (4.249, +∞).
The counts for the four intervals are 3, 4, 8, and 8. The M.L.E.’s based on the original data
give µ̂ = 4.150 and σ̂ 2 = 0.2722. The probabilities of the four intervals are (π1 , π2 , π3 , π4 ) =
(0.1350, 0.1888, 0.2511, 0.4251). This makes the value of Q̂ equal to 1.211.
The tail area corresponding to 1.211 needs to be computed for χ2 distributions with k − 1 =
3 and k − 3 = 1 degrees of freedom. For one degree of freedom, the p-value is 0.2711, and
for three degrees of freedom the p-value is 0.7504. So, our actual p-value lies in the interval
[0.2711, 0.7504]. Although this interval is wide, it tells us not to reject H0 at level α if α < 0.2711.

6.6.2 Tests of independence


In this section we will consider a situation when our observations are classified by two
different features and we would like to test if these features are independent. For example, we
can ask if the number of children in a family and family income are independent. Our sample
space X will consist of a × b pairs.

X = {(i, j) : i = 1, . . . , a, j = 1, . . . , b}

where the first coordinate represents the first feature that belongs to one of a categories and
the second coordinate represents the second feature that belongs to one of b categories. An
i.i.d. sample X1 , ..., Xn can be represented by a contingency table below where Nij is the num-
ber of observations in the cell (i, j).

Feature 2
Feature 1 1 2 ··· b
1 N11 N12 · · · N1b
2 N21 N22 · · · N2b
.. .. .. .. ..
. . . . .
a Na1 Na2 · · · Nab

We would like to test the independence of two features which means that

P[X = (i, j)] = P[X¹ = i] P[X² = j].

Denote θij = P[X = (i, j)], pi = P[X¹ = i], qj = P[X² = j]. Then we want to test

H0 : θij = pi qj for all (i, j) for some (p1 , . . . , pa ) and (q1 , . . . , qb )


H1 : otherwise.

We can see that this null hypothesis H0 is a special case of the composite hypotheses from
previous lecture and it can be tested using the chi-squared goodness-of-fit test. The total
number of groups is k = a × b. Since pi s and qj s should add up to one, one parameter in
each sequence, for example pa and qb , can be computed in terms of other probabilities and
we can take (p1 , ..., pa−1 ) and (q1 , ..., qb−1 ) as free parameters of the model. This means that the
dimension of the parameter set is

s = (a − 1) + (b − 1).

Therefore, if we find the maximum likelihood estimates for the parameters of this model then
the chi-squared statistic satisfies
Q = Σ_{ij} (Nij − n pi∗ qj∗ )²/(n pi∗ qj∗ ) →w χ²k−s−1 = χ²(a−1)(b−1)

To formulate the test it remains to find the maximum likelihood estimates of the parameters.
We need to maximize the likelihood function
Π_{ij} (pi qj )^{Nij} = (Π_i pi^{Ni+}) (Π_j qj^{N+j}),

where Ni+ = Σ_j Nij and N+j = Σ_i Nij . Since the pi 's and qj 's are not related to each other, max-
imizing the likelihood function above is equivalent to maximizing Π_i pi^{Ni+} and Π_j qj^{N+j} sepa-
rately. We have

ln(Π_i pi^{Ni+}) = Σ_{i=1}^{a−1} Ni+ ln pi + Na+ ln(1 − p1 − · · · − pa−1 ).

An elementary computation shows that

pi∗ = Ni+ /n, i = 1, . . . , a.

Similarly, the MLE for qj is

qj∗ = N+j /n, j = 1, . . . , b.
Therefore, the chi-square statistic Q in this case can be written as

Q = Σ_{ij} (Nij − Ni+ N+j /n)²/(Ni+ N+j /n).

We reject H0 if Q > cα,(a−1)(b−1) where the threshold cα,(a−1)(b−1) is determined from the condi-
tion
P[χ2(a−1)(b−1) > cα,(a−1)(b−1) ] = α.
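In practice this test is available as scipy.stats.chi2_contingency; the 2 × 3 table in the sketch below (ours) is made up purely for illustration.

    import numpy as np
    from scipy.stats import chi2_contingency

    # hypothetical 2 x 3 contingency table: rows = feature 1, columns = feature 2
    table = np.array([[30, 25, 45],
                      [20, 35, 45]])
    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    print(chi2_stat, p_value, dof)                   # dof = (a - 1)(b - 1) = 2
    print(expected)                                  # the fitted counts n p_i* q_j*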

6.6.3 Test of homogeneity


Suppose that the population is divided into R groups and each group (or the entire popula-
tion) is divided into C categories. We would like to test whether the distribution of categories
in each group is the same.

           Category 1   Category 2   ···   Category C     Σ
Group 1       N11          N12       ···      N1C        N1+
Group 2       N21          N22       ···      N2C        N2+
  ...         ...          ...       ···      ...        ...
Group R       NR1          NR2       ···      NRC        NR+
   Σ          N+1          N+2       ···      N+C         n

If we denote

pij = P(Category j | Group i),

so that for each group i we have

Σ_{j=1}^{C} pij = 1,

then we want to test the following hypotheses

H0 : pij = pj for all groups i ≤ R


H1 : otherwise.

If observations X1 , ..., Xn are sampled independently from the entire population, then homogeneity
over groups is the same as independence of groups and categories. Indeed, if we have homogeneity,

P(Category j | Group i) = P(Category j),

then we have

P(Category j, Group i) = P(Category j) P(Group i).
This means that to test homogeneity we can use the test of independence above. Denote

Q = Σ_{i=1}^{R} Σ_{j=1}^{C} (Nij − Ni+ N+j / n)² / (Ni+ N+j / n)  →w  χ²_{(C−1)(R−1)}.

We reject H0 at the significance level α if Q > cα,(C−1)(R−1) where the threshold cα,(C−1)(R−1) is
determined from the condition

P[χ2(C−1)(R−1) > cα,(C−1)(R−1) ] = α.
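Since the statistic coincides with the one used for independence, the chi2_independence_test sketch above applies verbatim; for example, with a hypothetical table of R = 3 groups and C = 4 categories:

    # Rows are groups, columns are categories (hypothetical counts)
    groups = [[12, 18, 20, 10],
              [15, 22, 17, 16],
              [11, 19, 23, 14]]
    Q, df, p_value, reject = chi2_independence_test(groups, alpha=0.05)
    print(Q, df, p_value, reject)             # df = (3 - 1) * (4 - 1) = 6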

Example 6.6.6. In this example, 100 people were asked whether the service provided by the
fire department in the city was satisfactory. Shortly after the survey, a large fire occurred in
the city. Suppose that the same 100 people were asked whether they thought that the service
provided by the fire department was satisfactory. The results are in the following table:

Satisfactory Unsatisfactory
Before fire 80 20
After fire 72 28

Suppose that we would like to test whether the opinions changed after the fire by using a
chi-squared test. However, the i.i.d. sample consisted of pairs of opinions of 100 people
(Xi1 , Xi2 ), i = 1, . . . , 100 where the first coordinate/feature is a person’s opinion before the
fire and it belongs to one of two categories

{“Satisfactory”, “Unsatisfactory”},



and the second coordinate/feature is a person’s opinion after the fire and it also belongs to
one of two categories
{“Satisfactory”, “Unsatisfactory”}.
So the correct contingency table corresponding to the above data and satisfying the assump-
tion of the chi-squared test would be the following:
                          After fire
Before fire       Satisfactory   Unsatisfactory
Satisfactory           70              10
Unsatisfactory          2              18
In order to use the first contingency table, we would have to poll 100 people after the fire
independently of the 100 people polled before the fire.

6.7 Exercises

6.7.1 Significance level and power function


6.1. Suppose that X has a pdf of the form f(x; θ) = θx^{θ−1} I{0<x<1} where θ ∈ {1, 2}. To test the
simple hypothesis H0 : θ = 1 against H1 : θ = 2, one uses a random sample X1 , X2 of size
n = 2 and defines the critical region to be C = {(x1 , x2 ) : x1 x2 ≥ 3/4}. Find the power function of
the test.
6.2. Suppose that X has a binomial distribution with the number of trials n = 10 and with
p ∈ { 41 , 21 }. The simple hypothesis H0 : p = 12 is rejected, and the alternative simple hypothesis
H1 : p = 14 is accepted, if the observed value of X1 , a random sample of size 1, is less than or
equal to 3. Find the significance level and the power of the test.
6.3. Let us say the life of a light bulb, say X, is normally distributed with mean θ and standard
deviation 5000. Past experience indicates that θ = 30000. The manufacturer claims that the
light bulbs made by a new process have mean θ > 30000. It is possible that θ = 35000. Check his
claim by testing H0 : θ = 30000 against H1 : θ > 30000. We shall observe n independent values
of X, say X1 , . . . , Xn , and we shall reject H0 (thus accept H1 ) if and only if x̄ ≥ c. Determine n
and c so that the power function γ(θ) of the test has the values γ(30000) = 0.01 and γ(35000) =
0.98.
6.4. Suppose that X has a Poisson distribution with mean λ. Consider the simple hypothesis
H0 : λ = 1/2 and the alternative composite hypothesis H1 : λ < 1/2. Let X1 , . . . , X12 denote a
random sample of size 12 from this distribution. One rejects H0 if and only if the observed
value of Y = X1 + . . . + X12 ≤ 2. Find γ(λ) for λ ∈ (0, 1/2] and the significance level of the test.
6.5. Let Y1 < Y2 < Y3 < Y4 be the order statistics of a random sample of size n = 4 from a
distribution with pdf f (x; θ) = 1/θ, 0 < x < θ, zero elsewhere, where θ > 0. The hypothesis
H0 : θ = 1 is rejected and H1 : θ > 1 is accepted if the observed Y4 ≥ c.

1. Find the constant c so that the significance level is α = 0.05.

2. Determine the power function of the test.

6.7.2 Null distribution


6.6. Let X1 , . . . , Xn be a random sample from a N(a0 , σ²) distribution where 0 < σ² < ∞ and
a0 is known. Show that the likelihood ratio test of H0 : σ² = σ0² versus H1 : σ² ≠ σ0² can be
based upon the statistic W = Σ_{i=1}^{n} (Xi − a0)²/σ0². Determine the null distribution of W and
give the rejection rule for a level α test.

6.7. Let X1 , . . . , Xn be a random sample from a Poisson distribution with mean λ > 0.

1. Show that the likelihood ratio test of H0 : λ = λ0 versus H1 : λ ≠ λ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .

2. For λ0 = 2 and n = 5, find the significance level of the test that rejects H0 if Y ≤ 4 or
Y ≥ 17.

6.8. Let X1 , . . . , Xn be a random sample from a Bernoulli B(1, θ) distribution, where 0 < θ < 1.

1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .

2. For n = 100 and θ0 = 1/2, find c1 so that the test that rejects H0 when Y ≤ c1 or Y ≥ c2 = 100 − c1
has approximate significance level α = 0.05.

6.9. Let X1 , . . . , Xn be a random sample from a Γ(α = 3, β = θ) distribution, where 0 < θ < ∞.

1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of 2Y /θ0 .

2. For θ0 = 3 and n = 5, find c1 and c2 so that the test that rejects H0 when Y ≤ c1 or Y ≥ c2
has significance level 0.05.

6.7.3 Best critical region


6.10. Let X1 , X2 be a random sample of size 2 from a random variable X having the pdf
f(x; θ) = (1/θ) e^{−x/θ} I{0<x<∞}. Consider the simple hypothesis H0 : θ = θ′ = 2 and the alternative
hypothesis H1 : θ = θ′′ = 4. Show that the best test of H0 against H1 may be carried out by
use of the statistic X1 + X2.

6.11. Let X1 , . . . , X10 be a random sample of size 10 from a normal distribution N(0, σ²). Find
a best critical region of size α = 0.05 for testing H0 : σ² = 1 against H1 : σ² = 2. Is this a best
critical region of size 0.05 for testing H0 : σ² = 1 against H1 : σ² = 4? Against H1 : σ² = σ1² > 1?

6.12. If X1 , . . . , Xn is a random sample from a distribution having pdf of the form f (x; θ) =
θx^{θ−1}, 0 < x < 1, zero elsewhere, show that a best critical region for testing H0 : θ = 1 against
H1 : θ = 2 is C = {(x1 , . . . , xn ) : c ≤ x1 x2 · · · xn}.

6.13. Let X1 , . . . , Xn denote a random sample from a normal distribution N (θ, 100). Show that

C = {(x1 , . . . , xn ) : x̄ ≥ c} is a best critical region for testing H0 : θ = 75 against H1 : θ = 78.
Find n and c so that
PH0 [(X1 , . . . , Xn ) ∈ C] = PH0 [X̄ ≥ c] = 0.05
and
PH1 [(X1 , . . . , Xn ) ∈ C] = PH1 [X̄ ≥ c] = 0.90,
approximately.

6.14. Let X1 , . . . , Xn be iid with pmf f(x; p) = p^x (1 − p)^{1−x}, x = 0, 1, zero elsewhere. Show that
C = {(x1 , . . . , xn ) : Σ xi ≤ c} is a best critical region for testing H0 : p = 1/2 against H1 : p = 1/3.
Use the Central Limit Theorem to find n and c so that approximately PH0 [Σ Xi ≤ c] = 0.10 and
PH1 [Σ Xi ≤ c] = 0.80.

6.15. Let X1 , . . . , X10 denote a random sample of size 10 from a Poisson distribution with
mean λ. Show that the critical region C defined by Σ_{i=1}^{10} xi ≥ 3 is a best critical region for
testing H0 : λ = 0.1 against H1 : λ = 0.5. Determine, for this test, the significance level α and
the power at λ = 0.5.

6.16. Let X have the pmf f (x; θ) = θx (1 − θ)1−x , x = 0, 1, zero elsewhere. We test the simple
hypothesis H0 : θ = 1/4 against the alternative composite hypothesis H1 : θ < 1/4 by taking a
random sample of size 10 and rejecting H0 : θ = 1/4 if and only if the observed values x1 , . . . , x10 of the
sample are such that Σ_{i=1}^{10} xi ≤ 1. Find the power function γ(θ), 0 < θ ≤ 1/4, of
this test.

6.7.4 Some tests for single sample


Tests on mean

6.17. (a) The sample mean and standard deviation from a random sample of 10 observations
from a normal population were computed as x = 23 and σ = 9. Calculate the value of the test
statistic of the test required to determine whether there is enough evidence to infer at the 5%
significance level that the population mean is greater than 20.
(b) Repeat part (a) with n = 30.
(c) Repeat part (b) with n = 40.

6.18. (a) A statistics practitioner is in the process of testing to determine whether there is
enough evidence to infer that the population mean is different from 180. She calculated the

mean and standard deviation of a sample of 200 observations as x̄ = 175 and s = 22. Cal-
culate the value of the test statistic of the test required to determine whether there is enough
evidence at the 5% significance level.
(b) Repeat part (a) with s = 45.
(c) Repeat part (a) with s = 60.

6.19. A courier service advertises that its average delivery time is less than 6 hours for local
deliveries. A random sample of times for 12 deliveries to an address across town was recorded.
These data are shown here. Is this sufficient evidence to support the courier’s advertisement,
at the 5% level of significance?

3.03, 6.33, 7.98, 4.82, 6.50, 5.22, 3.56, 6.76, 7.96, 4.54, 5.09, 6.46.

x̄ = 5.6875; s² = 2.1325; T0 = −0.7413.

6.20. Aircrew escape systems are powered by a solid propellant. The burning rate of this pro-
pellant is an important product characteristic. Specifications require that the mean burning
rate must be 50 centimeters per second. We know that the standard deviation of burning rate
is σ = 2 centimeters per second. The experimenter decides to specify a type I error proba-
bility or significance level of α = 0.05 and selects a random sample of n = 25 and obtains a
sample average burning rate of X = 51.3 centimeters per second. What conclusions should
be drawn?

6.21. The mean water temperature downstream from a power plant cooling tower discharge
pipe should be no more than 100◦ F . Past experience has indicated that the standard deviation
of temperature is 2◦ F . The water temperature is measured on nine randomly chosen days,
and the average temperature is found to be 98◦ F.
(a) Should the water temperature be judged acceptable with α = 0.05?
(b) What is the P -value for this test?
(c) What is the probability of accepting the null hypothesis at α = 0.05 if the water has a true
mean temperature of 104◦ F ?

6.22. A study reported the following body temperatures (◦F) for 25 female subjects:


97.8, 97.2, 97.4, 97.6, 97.8, 97.9, 98.0, 98.0, 98.0, 98.1, 98.2, 98.3,
98.3, 98.4, 98.4, 98.4, 98.5, 98.6, 98.6, 98.7, 98.8, 98.8, 98.9, 98.9, and 99.0.
(a) Test the hypotheses H0 : µ = 98.6 versus H1 : µ ≠ 98.6, using α = 0.05. Find the P -value.
(b) Compute the power of the test if the true mean female body temperature is as low as 98.0.
(c) What sample size would be required to detect a true mean female body temperature as low
as 98.2 if we wanted the power of the test to be at least 0.9?

6.23. Cloud seeding has been studied for many decades as a weather modification procedure.
The rainfall in acre-feet from 20 clouds that were selected at random and seeded with silver

nitrate follows:
18.0, 30.7, 19.8, 27.1, 22.3, 18.8, 31.8, 23.4, 21.2, 27.9,
31.9, 27.1, 25.0, 24.7, 26.9, 21.8, 29.2, 34.8, 26.7, 31.6.
(a) Can you support a claim that mean rainfall from seeded clouds exceeds 25 acre-feet? Use
α = 0.01.
(b) Compute the power of the test if the true mean rainfall is 27 acre-feet.
(c) What sample size would be required to detect a true mean rainfall of 27.5 acre-feet if we
wanted the power of the test to be at least 0.9?
6.24. The life in hours of a battery is known to be approximately normally distributed, with
standard deviation σ = 1.25 hours. A random sample of 10 batteries has a mean life of x = 40.5
hours.
(a) Is there evidence to support the claim that battery life exceeds 40 hours? Use α = 0.05.
(b) What is the P -value for the test in part (a)?
(c) What is the power for the test in part (a) if the true mean life is 42 hours?
(d) What sample size would be required to ensure that the probability of making type II error
does not exceed 0.10 if the true mean life is 44 hours?
(e) Explain how you could answer the question in part (a) by calculating an appropriate con-
fidence bound on life.
6.25. Medical researchers have developed a new artificial heart constructed primarily of ti-
tanium and plastic. The heart will last and operate almost indefinitely once it is implanted
in the patient’s body, but the battery pack needs to be recharged about every four hours. A
random sample of 50 battery packs is selected and subjected to a life test. The average life of
these batteries is 4.05 hours. Assume that battery life is normally distributed with standard
deviation σ = 0.2 hour.
(a) Is there evidence to support the claim that mean battery life exceeds 4 hours? Use α = 0.05.
(b) Compute the power of the test if the true mean battery life is 4.5 hours.
(c) What sample size would be required to detect a true mean battery life of 4.5 hours if we
wanted the power of the test to be at least 0.9?
(d) Explain how the question in part (a) could be answered by constructing a one-sided con-
fidence bound on the mean life.

Tests on population variance

6.26. After many years of teaching, a statistics professor computed the variance of the marks
on her final exam and found it to be σ 2 = 250. She recently made changes to the way in
which the final exam is marked and wondered whether this would result in a reduction in the
variance. A random sample of this year’s final exam marks is listed here. Can the professor
infer at the 10% significance level that the variance has decreased?
57 92 99 73 62 64 75 70 88 60.

6.27. With gasoline prices increasing, drivers are more concerned with their cars’ gasoline
consumption. For the past 5 years, a driver has tracked the gas mileage of his car and found
that the variance from fill-up to fill-up was σ 2 = 23 mpg2 . Now that his car is 5 years old, he
would like to know whether the variability of gas mileage has changed. He recorded the gas
mileage from his last eight fill-ups; these are listed here. Conduct a test at a 10% significance
level to infer whether the variability has changed.

28 25 29 25 32 36 27 24.

Tests on proportion

6.28. (a) Calculate the P -value of the test of the following hypotheses given that p̂ = 0.63 and
n = 100:
H0 : p = 0.6 vs H1 : p > 0.6.
(b) Repeat part (a) with n = 200.
(c) Repeat part (a) with n = 400.
(d) Describe the effect on P -value of increasing sample size.
6.29. Has the recent drop in airplane passengers resulted in better on-time performance?
Before the recent economic downturn, one airline bragged that 92% of its flights were on time.
A random sample of 165 flights completed this year reveals that 153 were on time. Can we
conclude at the 5% significance level that the airline’s on-time performance has improved?
6.30. In a random sample of 85 automobile engine crankshaft bearings, 10 have a surface
finish roughness that exceeds the specifications. Does this data present strong evidence that
the proportion of crankshaft bearings exhibiting excess surface roughness exceeds 0.10? State
and test the appropriate hypotheses using α = 0.05.
6.31. A study in Fortune claimed that nearly one-half of all engineers continue academic studies be-
yond the B.S. degree, ultimately receiving either an M.S. or a Ph.D. degree. Data from an
article in Engineering Horizons (Spring 1990) indicated that 117 of 484 new engineering grad-
uates were planning graduate study.
(a) Are the data from Engineering Horizons consistent with the claim reported by Fortune?
Use α = 0.05 in reaching your conclusions.
(b) Find the P -value for this test.
(c) Discuss how you could have answered the question in part (a) by constructing a two-sided
confidence interval on p.
6.32. A researcher claims that at least 10% of all football helmets have manufacturing flaws
that could potentially cause injury to the wearer. A sample of 200 helmets revealed that 16
helmets contained such defects.
(a) Does this finding support the researcher’s claim? Use α = 0.01.
(b) Find the P -value for this test.

6.7.5 Some tests for two samples


Compare two means

6.33. In random samples of 12 from each of two normal populations, we found the following
statistics: x1 = 74, s1 = 18 and x2 = 71, s2 = 16.
(a) Test with α = 0.05 to determine whether we can infer that the population means differ.
(b) Repeat part (a) increasing the standard deviation to s1 = 210 and s2 = 198.
(c) Describe what happens when the sample standard deviations get larger.
(d) Repeat part (a) with sample size 150.
(e) Discuss the effects of increasing the sample size.

6.34. Random sampling from two normal populations produced the following results

x1 = 412 s1 = 128 n1 = 150


x2 = 405 s2 = 54 n2 = 150.

(a) Can we infer at the 5% significance level that µ1 is greater than µ2?
(b) Repeat part (a) decreasing the standard deviation to s1 = 31, s2 = 16.
(c) Describe what happens when the sample standard deviations get smaller.
(d) Repeat part (a) with samples of size 20.
(e) Discuss the effects of decreasing the sample size.
(f ) Repeat part (a) changing the mean of sample 1 to x1 = 409.

6.35. Two machines are used for filling plastic bottles with a net volume of 16.0 ounces. The
fill volume can be assumed normal, with standard deviation σ1 = 0.020 and σ2 = 0.025 ounces.
A member of the quality engineering staff suspects that both machines fill to the same mean
net volume, whether or not this volume is 16.0 ounces. A random sample of 10 bottles is taken
from the output of each machine.

Machine 1 Machine 2
16.03 16.01 16.02 16.03
16.04 15.96 15.97 16.04
16.05 15.98 15.96 16.02
16.05 16.02 16.01 16.01
16.02 15.99 15.99 16.00

(a) Do you think the engineer is correct? Use α = 0.05.


(b) What is the P -value for this test?
(c) What is the power of the test in part (a) for a true difference in means of 0.04?
(d) Find a 95% confidence interval on the difference in means. Provide a practical interpreta-
tion of this interval.

(e) Assuming equal sample sizes, what sample size should be used to assure that the probabil-
ity of making type II error is 0.05 if the true difference in means is 0.04? Assume that α = 0.05.

6.36. Every month a clothing store conducts an inventory and calculates losses from theft.
The store would like to reduce these losses and is considering two methods. The first is to
hire a security guard, and the second is to install cameras. To help decide which method to
choose, the manager hired a security guard for 6 months. During the next 6-month period, the
store installed cameras. The monthly losses were recorded and are listed here. The manager
decided that because the cameras were cheaper than the guard, he would install the cameras
unless there was enough evidence to infer that the guard was better. What should the manager
do?
Security guard 355 284 401 398 477 254
Cameras 486 303 270 386 411 435

Paired t-test

6.37. Many people use scanners to read documents and store them in a Word (or some other
software) file. To help determine which brand of scanner to buy, a student conducts an exper-
iment in which eight documents are scanned by each of the two scanners he is interested in.
He records the number of errors made by each. These data are listed here. Can he infer that
brand A (the more expensive scanner) is better than brand B?

Document 1 2 3 4 5 6 7 8
BrandA 17 29 18 14 21 25 22 29
BrandB 21 38 15 19 22 30 31 37

6.38. In an effort to determine whether a new type of fertilizer is more effective than the type
currently in use, researchers took 12 two-acre plots of land scattered throughout the county.
Each plot was divided into two equal-sized subplots, one of which was treated with the cur-
rent fertilizer and the other with the new fertilizer. Wheat was planted, and the crop yields
were measured.
Plot 1 2 3 4 5 6 7 8 9 10 11 12
Current fertilizer 56 45 68 72 61 69 57 55 60 72 75 66
New fertilizer 60 49 66 73 59 67 61 60 58 75 72 68

(a) Can we conclude at the 5% significance level that the new fertilizer is more effective than
the current one?
(b) Estimate with 95% confidence the difference in mean crop yields between the two fertiliz-
ers.
(c) What is the required condition(s) for the validity of the results obtained in parts (a) and
(b)?

Compare two variances

6.39. Random samples from two normal population produced the following statistics

s21 = 350, n1 = 30, s22 = 700, n2 = 30.

(a) Can we infer at the 10% significance level that the two population variances differ?
(b) Repeat part (a) changing the sample sizes to n1 = 15 and n2 = 15.
(c) Describe what happens to the test statistics and the conclusion when the sample sizes
decrease.

6.40. A statistics professor hypothesized that not only would the means vary but also so would
the variances if the business statistics course was taught in two different ways but had the
same final exam. He organized an experiment wherein one section of the course was taught
using detailed PowerPoint slides whereas the other required students to read the book and
answer questions in class discussions. A sample of the marks was recorded and listed next.
Can we infer that the variances of the marks differ between the two sections?
Class 1 64 85 80 64 48 62 75 77 50 81 90
Class 2 73 78 66 69 79 81 74 59 83 79 84

6.41. An operations manager who supervises an assembly line has been experiencing prob-
lems with the sequencing of jobs. The problem is that bottlenecks are occurring because
of the inconsistency of sequential operations. He decides to conduct an experiment wherein
two different methods are used to complete the same task. He measures the times (in sec-
onds). The data are listed here. Can he infer that the second method is more consistent than
the first method?
Method 1 8.8 9.6 8.4 9.0 8.3 9.2 9.0 8.7 8.5 9.4
Method 2 9.2 9.4 8.9 9.6 9.7 8.4 8.8 8.9 9.0 9.7

Compare two proportions

6.42. Random samples from two binomial populations yielded the following statistics:

p̂1 = 0.45 n1 = 100 p̂2 = 0.40 n2 = 100.

(a) Calculate the P -value of a test to determine whether we can infer that the population pro-
portions differ.
(b) Repeat part (a) increasing the sample sizes to 400.
(c) Describe what happens to the p-value when the sample sizes increase.

6.43. Random samples from two binomial populations yielded the following statistics:

p̂1 = 0.60 n1 = 225 p̂2 = 0.55 n2 = 225.



(a) Calculate the P -value of a test to determine whether there is evidence to infer that the
population proportions differ.
(b) Repeat part (a) with p̂1 = 0.95 and p̂2 = 0.90.
(c) Describe the effect on the P -value of increasing the sample proportions.
(d) Repeat part (a) with p̂1 = 0.10 and p̂2 = 0.05.
(e) Describe the effect on the P -value of decreasing the sample proportions.

6.44. Many stores sell extended warranties for products they sell. These are very lucrative for
store owners. To learn more about who buys these warranties, a random sample was drawn
of a store’s customers who recently purchased a product for which an extended warranty was
available. Among other variables, each respondent reported whether he or she paid the
regular price or a sale price and whether he or she purchased an extended warranty.
Regular Price Sale Price
Sample size 229 178
Number who bought extended warranty 47 25
Can we conclude at the 10% significance level that those who paid the regular price are more
likely to buy an extended warranty?

6.45. Surveys have been widely used by politicians around the world as a way of monitoring
the opinions of the electorate. Six months ago, a survey was undertaken to determine the de-
gree of support for a national party leader. Of a sample of 1100, 56% indicated that they would
vote for this politician. This month, another survey of 800 voters revealed that 46% now sup-
port the leader.
(a) At the 5% significance level, can we infer that the national leader’s popularity has de-
creased?
(b) At the 5% significance level, can we infer that the national leader’s popularity has de-
creased by more than 5%?

6.46. A random sample of 500 adult residents of Maricopa County found that 385 were in
favour of increasing the highway speed limit to 75 mph, while another sample of 400 adult
residents of Pima County found that 267 were in favour of the increased speed limit. Do these
data indicate that there is a difference in the support for increasing the speed limit between
the residents of the two counties? Use α = 0.05. What is the P -value for this test?

6.47. Two different types of injection-molding machines are used to form plastic parts. A part
is considered defective if it has excessive shrinkage or is discolored. Two random samples,
each of size 300, are selected, and 15 defective parts are found in the sample from machine 1
while 8 defective parts are found in the sample from machine 2. Is it reasonable to conclude
that both machines produce the same fraction of defective parts, using α = 0.05? Find the
P -value for this test.

6.7.6 Chi-squared tests


Chi-squared test on distribution

6.48. A new casino game involves rolling 3 dice. The winnings are directly proportional to the
total number of sixes rolled. Suppose a gambler plays the game 100 times, with the following
observed counts:
Number of Sixes 0 1 2 3
Number of Rolls 48 35 15 3

The casino becomes suspicious of the gambler and wishes to determine whether the dice are
fair. What do they conclude?

6.49. Suppose that the distribution of the heights of men who reside in a certain large city is
the normal distribution for which the mean is 68 inches and the standard deviation is 1 inch.
Suppose also that when the heights of 500 men who reside in a certain neighbourhood of the
city were measured, the distribution in the following table was obtained. Test the hypothesis
that, with regard to height, these 500 men form a random sample from all the men who reside
in the city.

Height (in inches) <66 66-67.5 67.5-68.5 68.5-70 >70


Number of men 18 177 198 102 5

6.50. The 50 values in the following table are intended to be a random sample from the stan-
dard normal distribution.
-1.28 -1.22 -0.32 -0.80 -1.38 -1.26 2.33 -0.34 -1.14 0.64
0.41 -0.01 -0.49 0.36 1.05 0.04 0.35 2.82 0.64 0.56
-0.45 -1.66 0.49 -1.96 3.44 0.67 -1.24 0.76 -0.46 -0.11
-0.35 1.39 -0.14 -0.64 -1.67 -1.13 -0.04 0.61 -0.63 0.13
0.72 0.38 -0.85 -1.32 0.85 -0.41 -0.11 -2.04 -1.61 -1.81

a) Carry out a χ2 test of goodness-of-fit by dividing the real line into five intervals, each of
which has probability 0.2 under the standard normal distribution.
b) Carry out a χ2 test of goodness-of-fit by dividing the real line into ten intervals, each of
which has probability 0.1 under the standard normal distribution.
Chapter 7

Regression

7.1 Simple linear regression

7.1.1 Simple linear regression model


Suppose that we have a pair of variables (X, Y) such that Y is a linear function of X
plus random noise:

Y = f(X) + ε = β0 + β1 X + ε,

where the random noise ε is assumed to have normal distribution N(0, σ²). The variable X is called
a predictor variable, Y a response variable, and the function f(x) = β0 + β1 x a linear regression
function.
Suppose that we are given a sequence of pairs (X1 , Y1 ), . . . , (Xn , Yn ) that are described by
the above model:
Yi = β0 + β1 Xi + εi,

where ε1 , . . . , εn are i.i.d. N(0, σ²). We have three unknown parameters β0 , β1 and σ², and we want
to estimate them using a given sample. The points X1 , . . . , Xn can be either random or non-random,
but from the point of view of estimating the linear regression function the nature of the Xs
is in some sense irrelevant, so we will think of them as fixed and non-random and assume
that the randomness comes from the noise variables εi . For a fixed Xi , the distribution of Yi is
N(f(Xi), σ²). The likelihood function of the sequence (Y1 , . . . , Yn ) is

L(Y1 , . . . , Yn ; β0 , β1 , σ²) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (Yi − f(Xi))² ).

Let us find the m.l.e.'s β̂0 , β̂1 , σ̂² that maximize this likelihood function L. First of all, it is
clear that (β̂0 , β̂1 ) also minimizes

L∗(β0 , β1 ) := Σ_{i=1}^{n} (Yi − β0 − β1 Xi)²,


so β̂0 , β̂1 are solutions to the system

∂L∗/∂β0 = −Σ_{i=1}^{n} 2(Yi − (β0 + β1 Xi)) = 0,
∂L∗/∂β1 = −Σ_{i=1}^{n} 2(Yi − (β0 + β1 Xi)) Xi = 0.
Denoting

X̄ = (1/n) Σ_i Xi,   Ȳ = (1/n) Σ_i Yi,   \overline{X²} = (1/n) Σ_i Xi²,   \overline{XY} = (1/n) Σ_i Xi Yi,

we obtain

β̂0 = Ȳ − β̂1 X̄,

β̂1 = ( \overline{XY} − X̄ Ȳ ) / ( \overline{X²} − X̄² ),

σ̂² = (1/n) Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)².

Denote Ŷi = β̂0 + β̂1 Xi and

R² = 1 − Σ_{i=1}^{n} (Yi − Ŷi)² / Σ_{i=1}^{n} (Yi − Ȳ)².

The numerator in this ratio is the sum of squared residuals and the denominator is proportional
to the sample variance of Y, so R² is usually interpreted as the proportion of variability in the data
explained by the linear model. The higher R² is, the better our model explains the data (a short
numerical sketch of these computations is given after the list below). Next, we would like to do
statistical inference about the linear model.
1. Construct confidence intervals for parameters of the model β0 , β1 and σ 2 .
2. Construct prediction intervals for Y given any point X.
3. Test hypotheses about parameters of the model.
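The following minimal sketch (plain NumPy, with simulated data; the variable names and the chosen true parameters are ours) computes β̂0, β̂1, σ̂² and R² exactly as in the formulas above; the confidence-interval sketches in the next subsections reuse the quantities computed here.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data from Y = 1.0 + 2.0 X + eps with eps ~ N(0, 0.5^2)
    n = 50
    X = rng.uniform(0.0, 10.0, size=n)
    Y = 1.0 + 2.0 * X + rng.normal(0.0, 0.5, size=n)

    Xbar, Ybar = X.mean(), Y.mean()
    X2bar, XYbar = (X ** 2).mean(), (X * Y).mean()

    beta1_hat = (XYbar - Xbar * Ybar) / (X2bar - Xbar ** 2)
    beta0_hat = Ybar - beta1_hat * Xbar

    residuals = Y - beta0_hat - beta1_hat * X
    sigma2_hat = (residuals ** 2).mean()       # (1/n) * sum of squared residuals
    R2 = 1.0 - (residuals ** 2).sum() / ((Y - Ybar) ** 2).sum()

    print(beta0_hat, beta1_hat, sigma2_hat, R2)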
The distributions of β̂0 , β̂1 and σ̂² are given by the following result.

Proposition 7.1.1. 1. The vector (β̂0 , β̂1 ) has a normal distribution with mean (β0 , β1 )
and covariance matrix

Σ = ( σ² / (n σx²) ) ( \overline{X²}   −X̄
                        −X̄            1  ),   where σx² = \overline{X²} − X̄².

2. σ̂² is independent of (β̂0 , β̂1).

3. nσ̂²/σ² has a χ²_{n−2} distribution with n − 2 degrees of freedom.

7.1.2 Confidence interval for σ 2


It follows from Proposition 7.1.1 that nσ̂²/σ² is χ²_{n−2} distributed, so if we choose c_{1−α/2,n−2} and c_{α/2,n−2}
such that

P[χ²_{n−2} > c_{1−α/2,n−2}] = 1 − α/2,   P[χ²_{n−2} > c_{α/2,n−2}] = α/2,

then

P[ nσ̂²/c_{α/2,n−2} ≤ σ² ≤ nσ̂²/c_{1−α/2,n−2} ] = 1 − α.

Therefore the (1 − α) CI for σ² is

nσ̂²/c_{α/2,n−2} ≤ σ² ≤ nσ̂²/c_{1−α/2,n−2}.
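Continuing the regression-fit sketch above (n and sigma2_hat are taken from it; scipy.stats.chi2.ppf supplies the χ²_{n−2} quantiles), this interval can be computed as follows.

    from scipy.stats import chi2

    alpha = 0.05
    c_hi = chi2.ppf(1 - alpha / 2, n - 2)      # P[chi2_{n-2} > c_hi] = alpha/2
    c_lo = chi2.ppf(alpha / 2, n - 2)          # P[chi2_{n-2} > c_lo] = 1 - alpha/2

    ci_sigma2 = (n * sigma2_hat / c_hi, n * sigma2_hat / c_lo)
    print(ci_sigma2)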

7.1.3 Confidence interval for β1


It follows from Proposition 7.1.1 that

√( n σx² / σ² ) (β̂1 − β1) ∼ N(0, 1)   and   n σ̂² / σ² ∼ χ²_{n−2},

and β̂1 is independent of σ̂², so

[ √( n σx² / σ² ) (β̂1 − β1) ] / √( (1/(n−2)) n σ̂² / σ² )  =  √( (n − 2) σx² / σ̂² ) (β̂1 − β1)

has a t_{n−2} distribution with n − 2 degrees of freedom. Choosing tα/2,n−2 such that

P[ |t_{n−2}| < tα/2,n−2 ] = 1 − α,

we obtain the (1 − α) CI for β1 as follows:

β̂1 − tα/2,n−2 √( σ̂² / ((n − 2) σx²) )   ≤   β1   ≤   β̂1 + tα/2,n−2 √( σ̂² / ((n − 2) σx²) ).
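A corresponding sketch, again reusing the quantities from the regression-fit example above (scipy.stats.t.ppf gives the Student t quantile tα/2,n−2):

    import numpy as np
    from scipy.stats import t

    alpha = 0.05
    sigma_x2 = X2bar - Xbar ** 2               # sigma_x^2 = mean(X^2) - Xbar^2
    t_crit = t.ppf(1 - alpha / 2, n - 2)       # t_{alpha/2, n-2}

    half_width = t_crit * np.sqrt(sigma2_hat / ((n - 2) * sigma_x2))
    print(beta1_hat - half_width, beta1_hat + half_width)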

7.1.4 Confidence interval for β0


A similar argument as above yields that

[ (β̂0 − β0) / √( σ² (1/n + X̄²/(n σx²)) ) ] / √( (1/(n−2)) n σ̂² / σ² )  =  (β̂0 − β0) / √( (σ̂²/(n − 2)) (1 + X̄²/σx²) )

has a Student's t distribution with n − 2 degrees of freedom. Thus the (1 − α) CI for β0 is

β̂0 − tα/2,n−2 √( (σ̂²/(n − 2)) (1 + X̄²/σx²) )   ≤   β0   ≤   β̂0 + tα/2,n−2 √( (σ̂²/(n − 2)) (1 + X̄²/σx²) ).
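With t_crit and sigma_x2 from the previous sketch, the interval for β0 differs only in the standard-error factor:

    half_width0 = t_crit * np.sqrt(sigma2_hat / (n - 2) * (1 + Xbar ** 2 / sigma_x2))
    print(beta0_hat - half_width0, beta0_hat + half_width0)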

7.1.5 Prediction intervals


Suppose now that we have a new observation X for which Y is unknown and we want to
predict Y or find a prediction interval for Y . According to the simple regression model,

Y = β0 + β1 X + ε,

and it is natural to take Ŷ = β̂0 + β̂1 X as the prediction of Y . Let us find the distribution of
their difference Ŷ − Y .

Proposition 7.1.2. The random variable

(Ŷ − Y) / √( (σ̂²/(n − 2)) ( n + 1 + (X̄ − X)²/σx² ) )

has a Student’s t distribution with n − 2 degrees of freedom.

Choosing tα/2,n−2 such that P[ |t_{n−2}| < tα/2,n−2 ] = 1 − α, we obtain the (1 − α) prediction interval for Y as follows:

Ŷ − tα/2,n−2 √( (σ̂²/(n − 2)) ( n + 1 + (X̄ − X)²/σx² ) )   ≤   Y   ≤   Ŷ + tα/2,n−2 √( (σ̂²/(n − 2)) ( n + 1 + (X̄ − X)²/σx² ) ).
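A sketch of this interval at a hypothetical new point X = 5.0, reusing the quantities from the earlier regression sketches:

    x_new = 5.0                                 # hypothetical new predictor value
    y_pred = beta0_hat + beta1_hat * x_new
    half_width_pred = t_crit * np.sqrt(
        sigma2_hat / (n - 2) * (n + 1 + (Xbar - x_new) ** 2 / sigma_x2)
    )
    print(y_pred - half_width_pred, y_pred + half_width_pred)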
Appendices

Rz 2
e−x /2
Table of Normal distribution Φ(z) = −∞


dx


z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09


0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.5398 0.5438 0.54776 0.55172 0.55567 0.55966 0.5636 0.56749 0.57142 0.57535
0.2 0.5793 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91308 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999

Table of Student's t distribution¹


1 side 75% 80% 85% 90% 95% 97.5% 99% 99.5% 99.75% 99.9% 99.95%
2 side 50% 60% 70% 80% 90% 95% 98% 99% 99.5% 99.8% 99.9%
1 1 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6
2 0.816 1.08 1.386 1.886 2.92 4.303 6.965 9.925 14.09 22.33 31.6
3 0.765 0.978 1.25 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92
4 0.741 0.941 1.19 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.61
5 0.727 0.92 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.44 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.86 2.306 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.1 1.383 1.833 2.262 2.821 3.25 3.69 4.297 4.781
10 0.7 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.93 4.318
13 0.694 0.87 1.079 1.35 1.771 2.16 2.65 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.14
15 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16 0.69 0.865 1.071 1.337 1.746 2.12 2.583 2.921 3.252 3.686 4.015
17 0.689 0.863 1.069 1.333 1.74 2.11 2.567 2.898 3.222 3.646 3.965
18 0.688 0.862 1.067 1.33 1.734 2.101 2.552 2.878 3.197 3.61 3.922
19 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20 0.687 0.86 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.85
21 0.686 0.859 1.063 1.323 1.721 2.08 2.518 2.831 3.135 3.527 3.819
22 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23 0.685 0.858 1.06 1.319 1.714 2.069 2.5 2.807 3.104 3.485 3.767
24 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25 0.684 0.856 1.058 1.316 1.708 2.06 2.485 2.787 3.078 3.45 3.725
26 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.69
28 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659
30 0.683 0.854 1.055 1.31 1.697 2.042 2.457 2.75 3.03 3.385 3.646
40 0.681 0.851 1.05 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
50 0.679 0.849 1.047 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496
60 0.679 0.848 1.045 1.296 1.671 2 2.39 2.66 2.915 3.232 3.46
80 0.678 0.846 1.043 1.292 1.664 1.99 2.374 2.639 2.887 3.195 3.416
100 0.677 0.845 1.042 1.29 1.66 1.984 2.364 2.626 2.871 3.174 3.39
120 0.677 0.845 1.041 1.289 1.658 1.98 2.358 2.617 2.86 3.16 3.373
∞ 0.674 0.842 1.036 1.282 1.645 1.96 2.326 2.576 2.807 3.09 3.291

¹ P[T1 < 1.376] = 0.8 and P[|T1 | < 1.376] = 0.6

Table of the χ²-distribution: the entry in row n and column α is the value c such that P[χ²_n > c] = α

DF: n 0.995 0.975 0.2 0.1 0.05 0.025 0.02 0.01 0.005 0.002 0.001
1 0.00004 0.001 1.642 2.706 3.841 5.024 5.412 6.635 7.879 9.55 10.828
2 0.01 0.0506 3.219 4.605 5.991 7.378 7.824 9.21 10.597 12.429 13.816
3 0.0717 0.216 4.642 6.251 7.815 9.348 9.837 11.345 12.838 14.796 16.266
4 0.207 0.484 5.989 7.779 9.488 11.143 11.668 13.277 14.86 16.924 18.467
5 0.412 0.831 7.289 9.236 11.07 12.833 13.388 15.086 16.75 18.907 20.515
6 0.676 1.237 8.558 10.645 12.592 14.449 15.033 16.812 18.548 20.791 22.458
7 0.989 1.69 9.803 12.017 14.067 16.013 16.622 18.475 20.278 22.601 24.322
8 1.344 2.18 11.03 13.362 15.507 17.535 18.168 20.09 21.955 24.352 26.124
9 1.735 2.7 12.242 14.684 16.919 19.023 19.679 21.666 23.589 26.056 27.877
10 2.156 3.247 13.442 15.987 18.307 20.483 21.161 23.209 25.188 27.722 29.588
11 2.603 3.816 14.631 17.275 19.675 21.92 22.618 24.725 26.757 29.354 31.264
12 3.074 4.404 15.812 18.549 21.026 23.337 24.054 26.217 28.3 30.957 32.909
13 3.565 5.009 16.985 19.812 22.362 24.736 25.472 27.688 29.819 32.535 34.528
14 4.075 5.629 18.151 21.064 23.685 26.119 26.873 29.141 31.319 34.091 36.123
15 4.601 6.262 19.311 22.307 24.996 27.488 28.259 30.578 32.801 35.628 37.697
16 5.142 6.908 20.465 23.542 26.296 28.845 29.633 32 34.267 37.146 39.252
17 5.697 7.564 21.615 24.769 27.587 30.191 30.995 33.409 35.718 38.648 40.79
18 6.265 8.231 22.76 25.989 28.869 31.526 32.346 34.805 37.156 40.136 42.312
19 6.844 8.907 23.9 27.204 30.144 32.852 33.687 36.191 38.582 41.61 43.82
20 7.434 9.591 25.038 28.412 31.41 34.17 35.02 37.566 39.997 43.072 45.315
21 8.034 10.283 26.171 29.615 32.671 35.479 36.343 38.932 41.401 44.522 46.797
22 8.643 10.982 27.301 30.813 33.924 36.781 37.659 40.289 42.796 45.962 48.268
23 9.26 11.689 28.429 32.007 35.172 38.076 38.968 41.638 44.181 47.391 49.728
24 9.886 12.401 29.553 33.196 36.415 39.364 40.27 42.98 45.559 48.812 51.179
25 10.52 13.12 30.675 34.382 37.652 40.646 41.566 44.314 46.928 50.223 52.62
26 11.16 13.844 31.795 35.563 38.885 41.923 42.856 45.642 48.29 51.627 54.052
27 11.808 14.573 32.912 36.741 40.113 43.195 44.14 46.963 49.645 53.023 55.476
28 12.461 15.308 34.027 37.916 41.337 44.461 45.419 48.278 50.993 54.411 56.892
29 13.121 16.047 35.139 39.087 42.557 45.722 46.693 49.588 52.336 55.792 58.301
30 13.787 16.791 36.25 40.256 43.773 46.979 47.962 50.892 53.672 57.167 59.703
Bibliography

[1] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA:
Duxbury, 2002.

[2] Cacoullos, T. (1989) Exercises in probability, Springer-Verlag New York Inc.

[3] DeGroot, M. H., & Schervish, M. J. (2002). Probability and Statistics. 3rd ed. Boston, MA: Addison-Wesley.

[4] Hogg, R., McKean, J.W., & Craig, A.T. (2005) Introduction to Mathematical Statistics, 6th
Edition. Pearson Education International.

[5] Jacod, J., & Protter, P. (2003). Probability Essentials. Springer.

[6] Montgomery, D. C., & Runger, G. C. (2010). Applied statistics and probability for engineers.
John Wiley & Sons.

[7] Panchenko, D. (2006) Lecture note “Statistics for Applications”.


http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/readings/

[8] Rahman N. A. (1983) Theoretical exercises in probability and statistics, second edition.
Macmillan Publishing.

[9] Rice, John. Mathematical statistics and data analysis. Nelson Education, 2006.

[10] Shao, J. (2005) Mathematical Statistics: Exercises and Solutions. Springer

[11] Suhov, Y., & Kelbert, M. (2008). Probability and Statistics by Example: Volumes 1 and 2. Cambridge University Press.

[12] Shevtsova, I. (2011). On the absolute constants in the Berry-Esseen type inequalities for
identically distributed summands. arXiv preprint arXiv:1111.6554.

[13] Nguyen Duy Tien, Vu Viet Yen (2001) Probability Theory (in Vietnamese). Educational
Publishing House.

