Probability and Statistics for STEM
A Course in One Semester
E.N. Barron and J.G. Del Greco
Loyola University, Chicago
Synthesis Lectures on Mathematics and Statistics
Editor: Steven G. Krantz, Washington University, St. Louis
Copyright © 2020 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00997ED1V01Y202002MAS033
Lecture #33
Series Editor: Steven G. Krantz, Washington University, St. Louis
Series ISSN
Print 1938-1743 Electronic 1938-1751
ABSTRACT
One of the most important subjects for all engineers and scientists is probability and statistics. This book presents the basics of the essential topics in probability and statistics from a rigorous standpoint. The basics of probability underlying all statistics are presented first, followed by the essential topics in statistics: confidence intervals, hypothesis testing, and linear regression. This book is suitable for any engineer or scientist who is comfortable with calculus and is meant to be covered in a one-semester format.
KEYWORDS
probability, random variables, sample distribution, confidence intervals, prediction
intervals, hypothesis testing, linear regression
Dedicated to Christina
– E.N. Barron
For Jim
– J.G. Del Greco
Contents
Preface
Acknowledgments
1 Probability
1.1 The Basics
1.2 Conditional Probability
1.3 Appendix: Counting
1.4 Problems
2 Random Variables
2.1 Distributions
2.2 Important Discrete Distributions
2.3 Important Continuous Distributions
2.4 Expectation, Variance, Medians, Percentiles
2.4.1 Moment-Generating Functions
2.4.2 Mean and Variance of Some Important Distributions
2.5 Joint Distributions
2.5.1 Independent Random Variables
2.5.2 Covariance and Correlation
2.5.3 The General Central Limit Theorem
2.5.4 Chebychev's Inequality and the Weak Law of Large Numbers
2.6 χ²(k), Student's t- and F-Distributions
2.6.1 χ²(k) Distribution
2.6.2 Student's t-Distribution
2.6.3 F-Distribution
2.7 Problems
Index
Preface
Every student anticipating a career in science and technology will require at least a working
knowledge of probability and statistics, either for use in their own work, or to understand the
techniques, procedures, and conclusions contained in scholarly publications and technical re-
ports. Probability and statistics has always been and will continue to be a significant component
of the curricula of mathematics and engineering science majors, and these two subjects have
become increasingly important in areas that have not traditionally included them in their undergraduate courses of study, such as biology, chemistry, physics, and economics. Over the last couple
of decades, methods originating in probability and statistics have found numerous applications
in a wide spectrum of scientific disciplines, and so it is necessary to at least acquaint prospective
professionals and researchers working in these areas with the fundamentals of these important
subjects. Unfortunately, there is little time to devote to the study of probability and statistics in a
science and engineering curriculum that is typically replete with required courses. What should
be a comprehensive two-semester course in probability and statistics has to be, out of necessity,
reduced to a single-semester course. This book is an attempt to provide a text that addresses both
rigor and conciseness of the main topics for undergraduate probability and statistics.
It is intended that this book be used in a one-semester course in probability and statis-
tics for students who have completed two semesters of calculus. It is our goal that readers gain
an understanding of the reasons and assumptions used in deriving various statistical conclusions.
The presentation of the topics in the book is intended to be at an intermediate, sophomore, or
junior level. Most two-semester courses present the subject at a higher level of detail and ad-
dress a wider range of topics. On the other hand, most one-semester, lower-level courses do not
present any derivations of statistical formulas and provide only limited reasoning motivating the
results. This book is meant to bridge the gap and present a concise but mathematically rigorous
introduction to all of the essential topics in a first course in probability and statistics. If you are
looking for a book which contains all the nooks and crannies of probability or statistics, this book
is not for you. If you plan on becoming (or already are) a practicing scientist or engineer, this
book will certainly contain much of what you need to know. But, if not, it will give you the
background to know where and how to look for what you do need, and to understand what you
are doing when you apply a statistical method and reach a conclusion.
The book provides answers to most of its problems. While this book is
not meant to accompany a strictly computational course, calculations requiring a computer or
at least a calculator are inevitably necessary. Therefore, this course requires the use of a TI-83/84/85/89 or any standard statistical package like Excel. Tables of things like the standard
normal, t-distribution, etc., are not provided.
All experiments result in data. These data values are particular observations of underlying
random variables. To analyze the data correctly, the experimenter needs to be equipped with the tools that an understanding of probability and statistics provides. That is the purpose of this book.
We will present basic, essential statistics, and the underlying probability theory to understand
what the results mean.
The book begins with three foundational chapters on probability. The fundamental types
of discrete and continuous random variables and their basic properties are introduced. Moment-
generating functions are introduced at an early stage and used to calculate expected values and
variances, and also to enable a proof of the Central Limit Theorem, a cornerstone result. Much
of statistics is based on the Central Limit Theorem, and our view is that students should be
exposed to a rigorous argument for why it is true and why the normal distribution plays such a
central role. Distributions related to the normal distribution, like the χ², t, and F distributions,
are presented for use in the statistical methods developed later in the book.
Chapter 3 is the prelude to the statistical topics included in the remainder of the text.
This chapter includes the analysis of sample means and sample standard deviations as random
variables.
The study of statistics begins in earnest with a discussion of confidence intervals in Chap-
ter 4. Both one-sample and two-independent-samples confidence intervals are constructed as
well as confidence intervals for paired data. Chapter 5 contains the topics at the core of statis-
tics, particularly important for experimenters. We introduce tests of hypotheses for the major
categories of experiments. Throughout the chapter, the dual relationship between confidence
intervals and hypotheses tests is emphasized. The power of a test of hypotheses is discussed
in some detail. Goodness-of-fit tests, contingency tables, tests for independence, and one-way analysis of variance are presented.
The book ends with a basic discussion of linear regression, an extremely useful tool in
statistics. Calculus students have more than enough background to understand and appreciate
how the results are derived. The probability theory introduced in earlier chapters is sufficient to
analyze the coefficients derived in the regression.
The book has been used in the Introduction to Statistics & Probability course at our uni-
versity for several years. It has been used in both two lectures per week and three lectures per
week formats. Each semester typically involves at least two midterms (usually three) and a com-
prehensive final exam. In addition, class time includes time for group work as well as in-class
quizzes. It makes for a busy semester.
Acknowledgments
We gratefully acknowledge Susanne Filler at Morgan & Claypool.
CHAPTER 1
Probability
There are two kinds of events which occur: deterministic and stochastic (or random, as a synonym). Deterministic phenomena are such that the same inputs always give the exact same outputs. Newtonian physics deals with deterministic phenomena, but even real-world science is subject to random effects. Engineering design usually has a criterion which is to be met with
95% certainty because 100% certainty is impossible and too expensive. Think of designing an
elevator in a high-rise building. Does the engineer know with certainty how many people will
get on the elevator? What happens to the components of the elevator as they age? Do we know
with certainty when they will collapse? These are examples of random events and we need a way
to quantify them.
Statistics is based on probability which is a mathematical theory used to make sense out of
random events and phenomena. In this chapter we will cover the basic concepts and techniques
of probability we will use throughout this book.
Definition 1.1 The sample space is the set S of all possible outcomes of an experiment.
S could be a finite set (like the set of all possible five-card hands), a countably infinite set (like {0, 1, 2, 3, ...} in a count of the number of users logging on to a computer system), or a continuum (like an interval [a, b], as in selecting a random number from a to b).
Example 1.3
• If we roll a die, the sample space is S = {1, 2, 3, 4, 5, 6}. Rolling an even number is the event A = {2, 4, 6}.
• If we want to count the number of customers coming to a bakery, the sample space is S = {0, 1, 2, ...}, and the event that we get between 2 and 7 customers is A = {2, 3, 4, 5, 6, 7}.
• If we throw a dart randomly at a circular board of radius 2 feet, the sample space is the set of all possible positions of the dart, S = {(x, y) | x² + y² ≤ 4}. The event that the dart landed in the first quadrant is A = {(x, y) | x² + y² ≤ 4, x ≥ 0, y ≥ 0}.
Eventually we want to find the probability that an event will occur. We say that an event
A occurs if any outcome in the set A actually occurs when the experiment is performed.
Combinations of events:
Let A, B ∈ F be any two events. From these events we may describe the following events:
(a) A ∪ B is the event that A occurs or B occurs (or both).
(b) A ∩ B, also written as AB, is the event that A occurs and B occurs, i.e., they both occur.
(c) Aᶜ = S \ A is the event that A does not occur. This is all the outcomes in S that are not in A.
(e) A ∩ B = ∅ means the two events cannot occur together, i.e., they are mutually exclusive. We also say that A and B are disjoint. Mutually exclusive events cannot occur at the same time.
Many more such relations hold if we have three or more events. It is useful to recall the following set relationships.
• A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
Remark 1.5 Immediately from the definition we can see that P(∅) = 0. In fact, since A and Aᶜ are disjoint and A ∪ Aᶜ = S, the disjoint sum rule gives P(A) + P(Aᶜ) = 1, so that
P(Aᶜ) = 1 − P(A).
Remark 1.6 One of the most important and useful rules is the Law of Total Probability:
P(A) = P(A ∩ B) + P(A ∩ Bᶜ).
To see why this is true, we use some basic set theory, decomposing A:
A = A ∩ S = A ∩ (B ∪ Bᶜ) = (A ∩ B) ∪ (A ∩ Bᶜ),
and since the two pieces are disjoint,
P(A) = P((A ∩ B) ∪ (A ∩ Bᶜ)) = P(A ∩ B) + P(A ∩ Bᶜ).
A main use of this Law is that we may find the probability of an event A if we know what happens when A ∩ B occurs and when A ∩ Bᶜ occurs. A useful form of this is P(A ∩ Bᶜ) = P(A) − P(A ∩ B).
The next theorem gives us the sum rule when the events are not mutually exclusive.
Theorem 1.7 General Sum Rule. If A, B are any two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
The next example gives one of the most important probability functions for finite sample spaces.
Example 1.8 When the sample space is finite, say |S| = N, and all individual outcomes in S are equally likely, we may define a function
P(A) = n(A)/N, where n(A) = number of outcomes in A.
To see that this is a probability function we only have to verify the conditions of the definition. The requirement that individual outcomes be equally likely is essential. For example, suppose we roll two dice and sum the numbers on each die. We take the sample space S = {2, 3, 4, ..., 12}. If we use this sample space and we assume the outcomes are equally likely, then we would get that P(roll a 7) = 1/11, which is clearly not correct. The problem is that with this sample space, the individual outcomes are not equally likely. If we want equally likely outcomes we need to change the sample space to account for the result on each die:
S = {(1,1), (1,2), ..., (1,6), (2,1), (2,2), ..., (2,6), ..., (6,1), (6,2), ..., (6,6)}.   (1.2)
With this sample space, the event of rolling a 7 is
A = {(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)},
so that P(roll a 7) = n(A)/N = 6/36 = 1/6.
Example 1.9 Whenever the sample space can easily be written out, doing so is often the best way to find probabilities. As an example, we roll two dice and we let D1 denote the number on the first die and D2 the number on the second. Suppose we want to find P(D1 > D2). The easiest way to
solve this is to write down the sample space as we did in (1.2) and then use the fact that each outcome is equally likely. We have
{D1 > D2} = {(2,1), (3,1), (3,2), (4,1), (4,2), (4,3), (5,1), (5,2), (5,3), (5,4), (6,1), (6,2), (6,3), (6,4), (6,5)}.
This event has 15 outcomes, which means P(D1 > D2) = 15/36.
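Counting arguments like these are easy to check by brute force. A minimal sketch (Python; the variable names are ours) enumerates the 36 equally likely outcomes and verifies both P(roll a 7) = 6/36 and P(D1 > D2) = 15/36:

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice, as in (1.2).
S = list(product(range(1, 7), repeat=2))

# P(A) = n(A)/N: count the favorable outcomes and divide by |S|.
p_seven = sum(1 for d1, d2 in S if d1 + d2 == 7) / len(S)
p_first_larger = sum(1 for d1, d2 in S if d1 > d2) / len(S)

print(p_seven, p_first_larger)  # 6/36 and 15/36
```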
Definition 1.10 The conditional probability of event A, given that event B has occurred, is
P(A | B) = P(A ∩ B)/P(B), if P(B) > 0.
One of the justifications for this definition can be seen from the case when the sample space is finite (with equally likely individual outcomes). We have, if |S| = N,
n(A ∩ B)/n(B) = (n(A ∩ B)/N) / (n(B)/N) = P(A ∩ B)/P(B) = P(A | B).
The left-most side of this string is the fraction of outcomes in A ∩ B from the event B. In other words, it is the probability of A using the reduced sample space B. That is, if the outcomes in S are equally likely, P(A | B) is the proportion of outcomes in both A and B relative to the number of outcomes in B.
The introduction of conditional probability gives us the multiplication rule, which follows by rearranging the terms in the definition: P(A ∩ B) = P(A | B)P(B).
Example 1.11 In a controlled experiment to see if a drug is effective, 71 patients were given the drug (event D), while 75 were given a placebo (event Dᶜ). A patient records a response (event R) or not (event Rᶜ). The following table summarizes the results.

              Drug   Placebo   Subtotals   Probability
Response        26        13          39         0.267
No Response    45        62         107         0.733
Subtotals      71        75         146
Probability  0.486     0.514
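The reduced-sample-space reading of conditional probability can be checked directly from the table's counts. A minimal sketch (our variable names; the counts come from the table above):

```python
# Counts from the drug/placebo table.
drug_response, drug_no_response = 26, 45
placebo_response, placebo_no_response = 13, 62

n_drug = drug_response + drug_no_response            # 71 patients given the drug
n_placebo = placebo_response + placebo_no_response   # 75 given the placebo

# P(R | D) = n(R and D)/n(D): condition by restricting to the Drug column.
p_R_given_D = drug_response / n_drug
p_R_given_Dc = placebo_response / n_placebo

# ~0.366 vs. ~0.173: the drug group responds noticeably more often.
print(round(p_R_given_D, 3), round(p_R_given_Dc, 3))
```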
Proof. The Law of Total Probability combined with the multiplication rule says
P(A) = P(A ∩ B) + P(A ∩ Bᶜ) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ).
Proof. Simply write out each term and use the theorem:
P(A ∩ B) = P(A ∩ B ∩ C) + P(A ∩ B ∩ Cᶜ)
  = [P(A ∩ B ∩ C)/P(B ∩ C)] P(B ∩ C) + [P(A ∩ B ∩ Cᶜ)/P(B ∩ Cᶜ)] P(B ∩ Cᶜ)
  = P(A | B ∩ C)P(B ∩ C) + P(A | B ∩ Cᶜ)P(B ∩ Cᶜ).
Another very useful fact is that conditional probabilities are actually probabilities, and therefore all rules for probabilities apply to conditional probabilities, as long as the given information remains the same.
Corollary 1.14 Let B be an event with P(B) > 0. Then Q(A) = P(A | B), A ∈ F, is a probability function.
Proof. We have to verify that Q(·) satisfies the axioms of Definition 1.4. Clearly, Q(A) ≥ 0 for any event A, and Q(S) = P(S | B) = P(S ∩ B)/P(B) = P(B)/P(B) = 1. Finally, let A1 ∩ A2 = ∅; then
Q(A1 ∪ A2) = P((A1 ∪ A2) ∩ B)/P(B) = [P(A1 ∩ B) + P(A2 ∩ B)]/P(B) = Q(A1) + Q(A2).
Definition 1.15 Two events A, B are said to be independent if the knowledge that one of the events occurred does not affect the probability that the other event occurs. That is,
P(A ∩ B) = P(A)P(B).
Example 1.16 1. Suppose an experiment has two possible outcomes a, b, so the sample space is S = {a, b}. Suppose P(a) = p and P(b) = 1 − p. If we perform this experiment n ≥ 1 times with identical conditions from experiment to experiment, then the events of individual experiments are independent. We may calculate
P(a occurs on all n trials) = pⁿ.
In particular, the chance of getting five straight heads in five tosses of a fair coin is (1/2)⁵ = 1/32.
2. The following two-way table contains data on place of residence and political leaning. Is one's political leaning independent of place of residence? To answer this question, let U = {urban}, R = {rural}, M = {moderate}, C = {conservative}. Then P(U ∩ M) = 200/600 = 1/3, P(U) = 300/600 = 1/2, P(M) = 275/600. Since P(U ∩ M) ≠ P(U) · P(M), they are not independent.
When events are not independent we can frequently use the information about the oc-
currence of one of the events to find the probability of the other. That is the basis of conditional
probability. The next concept allows us to calculate the probability of an event if the entire sam-
ple space is split (or partitioned) into pieces and decomposing the event we are interested in into
the parts occurring in each piece. Here’s the idea.
If we have events B1, ..., Bn such that Bi ∩ Bj = ∅ for all i ≠ j and B1 ∪ ··· ∪ Bn = S, then the collection {B1, ..., Bn} is called a partition of S. In this case, the Law of Total Probability says
P(A) = P(A ∩ B1) + ··· + P(A ∩ Bn)  and  P(A) = P(A | B1)P(B1) + ··· + P(A | Bn)P(Bn)
for any event A ∈ F. We can calculate the probability of an event by using the pieces of A that intersect each Bi. It is always possible to partition S by taking any event B and the event Bᶜ. Then for any other event A,
P(A) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ).
Example 1.17 Suppose we draw the second card from the top of a well-shuffled deck. We want to know the probability that this card is an Ace.
This seems to depend on what the first card is. Let A = {2nd card is an Ace}, let B = {1st card is an Ace}, and consider the partition {B, Bᶜ}. Conditioning on what the first card is,
P(A) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ) = (3/51)(4/52) + (4/51)(48/52) = 4/52 = 1/13.
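The two-term Law of Total Probability can be carried out exactly in a short sketch (using Python's exact fractions; the variable names are ours):

```python
from fractions import Fraction

# B = {1st card is an Ace}; A = {2nd card is an Ace}.
p_B = Fraction(4, 52)
p_Bc = 1 - p_B

p_A_given_B = Fraction(3, 51)   # one Ace is already gone
p_A_given_Bc = Fraction(4, 51)  # all four Aces remain

# Law of Total Probability: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).
p_A = p_A_given_B * p_B + p_A_given_Bc * p_Bc
print(p_A)  # 1/13, the same as for the first card
```

The answer 1/13 is the same as the probability that the first card is an Ace, which matches the intuition that position in a shuffled deck should not matter.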
The next important theorem tells us how to find P(Bk | A) if we know how to find P(A | Bi) for each event Bi in the partition of S. It shows us how to find the probability that if A occurs, it was due to Bk.
Theorem 1.18 Bayes' Rule. Let {B1, ..., Bn} be a partition of S. Then for each k = 1, 2, ..., n,
P(Bk | A) = P(A | Bk)P(Bk) / [P(A | B1)P(B1) + ··· + P(A | Bn)P(Bn)].
The proof is in the statement of the theorem, using the definition of conditional probability and the Law of Total Probability.
Example 1.19 This example shows the use of both the Law of Total Probability and Bayes' Rule. Suppose there is a box with 10 coins, 9 of which are fair coins (probability of heads is 1/2), and 1 of which has heads on both sides. Suppose a coin is picked at random and it is tossed 5 times. Given that all 5 tosses result in heads, what is the probability the 6th toss will be a head?
Let A = {toss 6 is a H}, B = {1st 5 tosses are H}, and C = {coin chosen is fair}. The problem is we can't calculate P(A) or P(B) until we know what kind of coin we have. We need to condition on the type of coin. Here's what we know: P(C) = 9/10, and
P(A | C) = P(toss 6 is a H | coin chosen is fair) = 1/2,
P(A | Cᶜ) = P(toss 6 is a H | coin chosen is not fair) = 1.
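The excerpt breaks off before the final computation. Under the stated assumptions (tosses are conditionally independent given the coin type), the answer can be sketched as follows; the resulting fraction 73/82 follows from these assumptions rather than being quoted from the text:

```python
from fractions import Fraction

p_fair = Fraction(9, 10)   # P(C)
p_unfair = 1 - p_fair      # P(C^c), the two-headed coin

# Condition on the coin type (Law of Total Probability):
p_B = Fraction(1, 2)**5 * p_fair + Fraction(1) * p_unfair        # P(5 heads)
p_A_and_B = Fraction(1, 2)**6 * p_fair + Fraction(1) * p_unfair  # P(6 heads in a row)

# Definition of conditional probability: P(A|B) = P(A and B)/P(B).
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # 73/82, about 0.89
```

Seeing five straight heads makes the two-headed coin much more likely, which pushes the chance of a sixth head well above 1/2.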
Example 1.20 Tests for a medical condition are not foolproof. To see what this implies, suppose a test for a virus has sensitivity 0.95 and specificity 0.92. Letting D be the event that a person has the disease and TP the event that the person tests positive, this means
P(TP | D) = 0.95 and P(TPᶜ | Dᶜ) = 0.92.
Suppose the prevalence of the disease is 5%, which means P(D) = 0.05. The question is: if someone tests positive for the disease, what are the chances this person actually has the disease?
This is asking for P(D | TP), but what we know is P(TP | D) and P(TPᶜ | Dᶜ). This is a perfect use of Bayes' rule. We also use Corollary 1.14:
P(D | TP) = P(D ∩ TP)/P(TP) = P(TP | D)P(D)/P(TP)
  = P(TP | D)P(D) / [P(TP | D)P(D) + P(TP | Dᶜ)P(Dᶜ)]
  = P(TP | D)P(D) / [P(TP | D)P(D) + (1 − P(TPᶜ | Dᶜ))P(Dᶜ)]
  = (0.95 × 0.05) / (0.95 × 0.05 + (1 − 0.92) × 0.95) = 0.3846.
This is amazing. Only 38% of people who test positive actually have the disease.
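The same Bayes computation works for any prior, sensitivity, and specificity. A small helper (ours, not from the text) reproduces the 0.3846 figure:

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(D | TP) by Bayes' rule:
    P(TP|D)P(D) / [P(TP|D)P(D) + P(TP|D^c)P(D^c)]."""
    false_positive_rate = 1 - specificity  # P(TP | D^c)
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1 - prior))

p = posterior_given_positive(prior=0.05, sensitivity=0.95, specificity=0.92)
print(round(p, 4))  # 0.3846
```

Trying smaller priors shows why screening rare diseases produces so many false alarms: the rarer the disease, the more the false positives dominate.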
Example 1.21 Suppose there is a 1% chance of contracting a rare disease. Let D be the event you have the disease and TP the event you test positive for the disease. We know P(TP | D) = 0.98 and P(TPᶜ | Dᶜ) = 0.95. As in the previous example, we first ask: given that you test positive, what is the probability that you really have the disease? We know how to work this out:
P(D | TP) = 0.98(0.01) / [0.98(0.01) + (1 − 0.95)(0.99)] = 0.165261.
Now suppose there is an independent repetition of the test. Suppose the second test is also
positive and now you want to know the probability that you really have the disease given the
two positives.
To solve this, let TPᵢ, i = 1, 2, denote the event you test positive on test i. These events are assumed conditionally independent.¹ Therefore, again by Bayes' formula we have
P(D | TP1 ∩ TP2) = P(TP1 ∩ TP2 | D)P(D) / P(TP1 ∩ TP2)
  = P(TP1 ∩ TP2 | D)P(D) / [P(TP1 ∩ TP2 | D)P(D) + P(TP1 ∩ TP2 | Dᶜ)P(Dᶜ)]
  = P(TP1 | D)P(TP2 | D)P(D) / [P(TP1 | D)P(TP2 | D)P(D) + (1 − P(TP1ᶜ | Dᶜ))(1 − P(TP2ᶜ | Dᶜ))P(Dᶜ)]
  = 0.795099.
This says that the patient who tests positive twice now has an almost 80% chance of actually
having the disease.
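Because the tests are conditionally independent given the disease status, the same result can be obtained by updating twice: the posterior after the first positive becomes the prior for the second. A sketch (the helper name is ours):

```python
def update_on_positive(prior, sens=0.98, spec=0.95):
    # One application of Bayes' rule after a positive test result.
    num = sens * prior
    return num / (num + (1 - spec) * (1 - prior))

p1 = update_on_positive(0.01)  # after one positive: ~0.1653
p2 = update_on_positive(p1)    # after a second positive: ~0.7951
print(round(p1, 6), round(p2, 6))
```

Sequential updating agrees with the joint computation in the example precisely because of the conditional-independence assumption.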
¹ Conditional independence means independence conditioned on some event, i.e., P(A ∩ B | C) = P(A | C)P(B | C). In our case, conditional independence means P(TP1 ∩ TP2 | D) = P(TP1 | D)P(TP2 | D).

Example 1.22 Simpson's Paradox. Suppose a college has two majors, A and B. There are 2000 male applicants to the college, with half applying to each major. There are 1100 female applicants, with 100 applying to A and the rest to B. Major A admits 60% of applicants while major B admits 30%. This means that the percentage of men admitted and the percentage of women admitted to the college must be the same, right? Wrong.
In fact, a total of 900 male applicants to the college were admitted, giving 900/2000 = 0.45, or 45% of men admitted. For women the percentage is 360/1100 = 0.327, or about 33%. Aggregating the men and women masks the fact that a larger percentage of women applied to the major which has the lower acceptance rate. This is an example of Simpson's paradox. Here's another example.
Two doctors have a record of success in two types of surgery, Low Risk and High Risk.
Here’s the table summarizing the results.
Doctor A Doctor B
Low Risk 93% (81/87) 87% (234/270)
High Risk 73% (192/263) 69% (55/80)
Total 78% (273/350) 83% (289/350)
The data show that, conditioned on either low- or high-risk surgeries, Doctor A has a better success percentage. However, aggregating the high- and low-risk groups together produces the opposite conclusion. The explanation of this is arithmetic, namely, for positive numbers a, b, c, d, A, B, C, D, it is not true that A/B > a/b and C/D > c/d implies (A + C)/(B + D) > (a + c)/(b + d). In the example, 81/87 > 234/270 and 192/263 > 55/80, but (81 + 192)/(87 + 263) < (234 + 55)/(270 + 80).
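The fraction inequalities at the heart of the paradox can be verified exactly; a sketch with Python's exact fractions (variable names ours, counts from the doctors' table):

```python
from fractions import Fraction

# Success rates by risk category (successes / surgeries).
A_low, A_high = Fraction(81, 87), Fraction(192, 263)
B_low, B_high = Fraction(234, 270), Fraction(55, 80)

# Doctor A wins within each category...
print(A_low > B_low, A_high > B_high)  # True True

# ...but loses in the aggregate: fractions do not add "numerator over denominator".
A_total = Fraction(81 + 192, 87 + 263)
B_total = Fraction(234 + 55, 270 + 80)
print(A_total < B_total)  # True: Simpson's paradox
```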
Permutations:
The number of ways to arrange k objects out of n distinct objects is
n(n − 1)(n − 2) ··· (n − (k − 1)) = n!/(n − k)! = P_{n,k}.
For instance, if we have 3 distinct objects {a, b, c}, there are 6 = 3 · 2 ways to pick 2 objects out of the 3, since there are 3 ways to pick the first object and then 2 ways to pick the second. They are (a, b), (a, c), (b, a), (b, c), (c, a), (c, b).
Combinations:
The number of ways to choose k objects out of n when we don't care about the order of the objects is
C_{n,k} = n!/(k!(n − k)!), the binomial coefficient C(n, k).
For example, in the paragraph on permutations, the choices (a, b) and (b, a) are different permutations but they are the same combination and so should not be counted separately. The way to get the number of combinations is to first figure out the number of permutations, namely n!/(n − k)!, and then get rid of the number of ways to arrange the selection of k objects, namely k!. In other words,
P_{n,k} = C(n, k) · k!  ⟹  C(n, k) = n!/((n − k)! k!).
Example 1.23 Poker Hands. We will calculate the probability of obtaining some of the common 5-card poker hands to illustrate the counting principles. A standard 52-card deck has 4 suits (Hearts, Clubs, Spades, Diamonds), with each suit consisting of 13 cards labeled 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A. Five cards from the deck are chosen at random (without replacement). We now want to find the probabilities of various poker hands.
The sample space S is all possible 5-card hands where the order of the cards does not matter. These are combinations of 5 cards from the 52, and there are C(52, 5) = 2,598,960 = |S| = N possible hands, all of which are equally likely.
Probability of a Royal Flush, which is A, K, Q, J, 10, all of the same suit. Let A = {royal flush}. How many royal flushes are there? It should be obvious there are exactly 4, one for each suit. Therefore, P(A) = 4/C(52, 5) = 0.00000153908, an extremely rare event.
Probability of a Full House, which is 3 of a kind and a pair. Let A = {full house}. To get a full house we break this down into steps.
(a) Pick a type for the 3 of a kind. There are 13 types one could choose.
(b) Choose 3 out of the 4 cards of the type chosen in the first step. There are C(4, 3) ways to do that.
(c) Choose another type distinct from the first type. There are 12 ways to do that.
(d) Choose 2 cards of the type chosen in the previous step. There are C(4, 2) ways to do that.
We conclude that the number of full house hands is n(A) = 13 · C(4, 3) · 12 · C(4, 2) = 3744. Consequently, P(A) = 3744/C(52, 5) = 0.00144.
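These hand counts are easy to reproduce with math.comb; a quick check (ours):

```python
from math import comb

n_hands = comb(52, 5)                           # |S| = 2,598,960
full_house = 13 * comb(4, 3) * 12 * comb(4, 2)  # steps (a)-(d) above

print(n_hands)     # 2598960
print(full_house)  # 3744
print(round(full_house / n_hands, 5))  # 0.00144
```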
Probability of One Pair. This is a hand of the form aabcd, where b, c, d are cards with distinct face values, none of which matches a. Let A be the event we get one pair. The number of hands in A is calculated using the multiplication rule with these steps:
(a) Choose a card type.
(b) Choose 2 cards of that type.
(c) Choose 3 types from the remaining 12 types.
(d) Choose 1 card from each of these types.
The number of ways to do that is C(13, 1) · C(4, 2) · C(12, 3) · C(4, 1)³ = 1,098,240. Therefore, P(A) = 1,098,240/C(52, 5) = 0.4226. An exercise asks for the probability of getting 2 pairs.
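The count 1,098,240 (probability ≈ 0.4226) printed above is exactly the one-pair count; for comparison, three of a kind (choose the triple's type, 3 of its 4 cards, then 2 other types and one card from each) is far rarer. A sketch computing both (ours):

```python
from math import comb

n_hands = comb(52, 5)

# One pair: a pair type, 2 of its 4 suits, 3 other types, 1 card from each.
one_pair = comb(13, 1) * comb(4, 2) * comb(12, 3) * comb(4, 1)**3

# Three of a kind: a triple type, 3 of its 4 suits, 2 other types, 1 card each.
three_kind = comb(13, 1) * comb(4, 3) * comb(12, 2) * comb(4, 1)**2

print(one_pair, round(one_pair / n_hands, 4))      # 1098240 0.4226
print(three_kind, round(three_kind / n_hands, 4))  # 54912 0.0211
```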
1.4 PROBLEMS
1.1. Suppose P(A) = p, P(B) = 0.3, P(A ∪ B) = 0.6. Find p so that P(A ∩ B) = 0. Also, find p so that A and B are independent.
1.2. When P(A) = 1/3, P(B) = 1/2, P(A ∪ B) = 3/4, (a) what is P(A ∩ B)? and (b) what is P(Aᶜ ∪ B)?
1.3. Show P(ABᶜ) = P(A) − P(AB) and P(exactly one of A or B occurs) = P(A) + P(B) − 2P(A ∩ B).
1.4. 32% of Americans smoke cigarettes, 11% smoke cigars, and 7% smoke both.
(a) What percent smoke neither cigars nor cigarettes?
(b) What percent smoke cigars but not cigarettes?
1.6. Suppose n(A) is the number of times A occurs if an experiment is performed N times. Set F_N(A) = n(A)/N. Show that F_N satisfies the definition to be a probability function. This leads to the frequency definition of the probability of an event, P(A) = lim_{N→∞} F_N(A); i.e., the probability of an event is the long-term fraction of time the event occurs.
1.8. (a) Give an example to illustrate that P(A) + P(B) = 1 does not imply A ∩ B = ∅. (b) Give an example to illustrate that P(A ∪ B) = 1 does not imply A ∩ B = ∅. (c) Prove that P(A) + P(B) + P(C) = 1 if and only if P(AB) = P(AC) = P(BC) = 0.
1.9. A box contains 2 white balls and an unknown (finite) number of non-white balls. Suppose 4 balls are chosen at random without replacement, and suppose the probability of the sample containing both white balls is twice the probability of the sample containing no white balls. Find the total number of balls in the box.
1.10. Let C and D be two events for which one knows that P(C) = 0.3, P(D) = 0.4, P(C ∩ D) = 0.2. What is P(Cᶜ ∩ D)?
1.11. An experiment has only two possible outcomes, only one of which may occur. The first has probability p of occurring, the second probability p². What is p?
1.12. We repeatedly toss a coin. A head has probability p, and a tail probability 1 − p, where 0 < p < 1. What is the probability the first head occurs on the 5th toss? What is the probability it takes 5 tosses to get two heads?
1.14. Analogously to the finite sample space case with equally likely outcomes, we may define P(A) = area of A / area of S, where S ⊂ ℝ² is a fixed two-dimensional set (with equally likely outcomes) and A ⊂ S. Suppose that we have a dart board given by S = {x² + y² ≤ 9} and A is the event that a randomly thrown dart lands in the ring with inner radius 1 and outer radius 2. Find P(A).
1.15. Show that P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(C ∩ B) + P(A ∩ B ∩ C).
1.16. Show that P(A ∩ B) ≥ P(A) + P(B) − 1 for all events A, B ∈ F. Use this to find a lower bound on the probability both events occur if the probability of each event is 0.9.
1.17. A fair coin is flipped twice. We know that one of the tosses is a Head. Find the proba-
bility the other toss is a Head. (Hint: The answer is not 1/2.).
1.18. Find the probability of two pair in a 5-card poker hand.
1.19. Show that De Morgan's Laws (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ and (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ hold, and then find the probability that neither A nor B occurs, and the probability that either A does not occur or B does not but one of the two does occur. Your answer should express these in terms of P(A), P(B), and P(A ∩ B).
1.20. Show that if A and B are independent events, then so are A and Bᶜ, as well as Aᶜ and Bᶜ.
1.21. If P(A) = 1/3 and P(Bᶜ) = 1/4, is it possible that A ∩ B = ∅? Explain.
1.22. Suppose we choose one of two coins $C_1$ or $C_2$, in which the probability of getting a head with $C_1$ is $1/3$ and with $C_2$ is $2/3$. If we choose a coin at random, what is the probability we get a head when we flip it?
1.23. Suppose two cards are dealt one at a time from a well-shuffled standard deck of cards. Cards are ranked $2 < 3 < \cdots < 10 < J < Q < K < A$.
(a) Find the probability the second card beats the first card. Hint: Look at $\sum_k P(C_2 > C_1 \mid C_1 = k)\, P(C_1 = k)$.
(b) Find the probability the first card beats the second and the probability the two cards match.
1.24. A basketball team wins 60% of its games when it leads at the end of the first quarter,
and loses 90% of its games when the opposing team leads. If the team leads at the end
of the first quarter about 30% of the time, what fraction of the games does it win?
1.25. Suppose there is a box with 10 coins, 8 of which are fair coins (probability of heads is
1/2), and 2 of which have heads on both sides. Suppose a coin is picked at random and
it is tossed 5 times. Given that we got 5 straight heads, what are the chances the coin
has heads on both sides?
1.26. Is independence for three events $A, B, C$ the same as: $A, B$ are independent; $B, C$ are independent; and $A, C$ are independent? Consider the example: Perform two independent tosses of a coin. Let $A =$ heads on toss 1, $B =$ heads on toss 2, and $C =$ the two tosses are equal.
(a) Find $P(A)$, $P(B)$, $P(C)$, $P(C \mid A)$, $P(B \mid A)$, and $P(C \mid B)$. What do you conclude?
(b) Find $P(A \cap B \cap C)$ and $P(A \cap B \cap C^c)$. What do you conclude?
1.28. The events $A$, $B$, and $C$ satisfy: $P(A \mid B \cap C) = 1/4$, $P(B \mid C) = 1/3$, and $P(C) = 1/2$. Calculate $P(A^c \cap B \cap C)$.
1.29. Two independent events $A$ and $B$ are given, and $P(B \mid A \cup B) = 2/3$, $P(A \mid B) = 1/2$. What is $P(B)$?
1.30. You roll a die and a friend tosses a coin. If you roll a 6, you win. If you don’t roll a 6 and
your friend tosses a H, you lose. If you don’t roll a 6, and your friend does not toss a H,
the game repeats. Find the probability you win.
1.31. You are diagnosed with an uncommon disease. You know that there is only a 4% chance of having the disease. Let $D =$ you have the disease, and $T =$ the test says you have it. It is known that the test is imperfect: $P(T \mid D) = 0.9$ and $P(T^c \mid D^c) = 0.85$.
(a) Given that you test positive, what is the probability that you really have the disease?
(b) You obtain a second and third opinion: two more (conditionally) independent
repetitions of the test. You test positive again on both tests. Assuming conditional
independence, what is the probability that you really have the disease?
1.32. Two dice are rolled. What is the probability that at least one is a six? If the two faces
are different, what is the probability that at least one is a six?
1.33. 15% of a group are heavy smokers, 30% are light smokers, 55% are nonsmokers. In a 5-
year study it was determined that the death rates of heavy and light smokers were 5 and
3 times that of nonsmokers, respectively. What is the probability a randomly selected
person was a nonsmoker, given that he died?
1.34. A, B, and C are mutually independent and $P(A) = 0.5$, $P(B) = 0.8$, $P(C) = 0.9$. Find the probabilities (i) all three occur, (ii) exactly 2 of the 3 occur, and (iii) none occurs.
1.35. A box has 8 red and 7 blue balls. A second box has an unknown number of red and 9
blue balls. If we draw a ball from each box at random we know the probability of getting
2 balls of the same color is 151/300. How many red balls are in the second box?
1.39. There are two universities. The breakdown of males and females majoring in Math at
each university is given in the tables.
Here M = male, F = female, and O = overall. Show that this is an example of Simpson's paradox.
CHAPTER 2
Random Variables
In this chapter we study the main properties of functions whose domain is the sample space of an experiment with random outcomes. Such functions are called random variables.
2.1 DISTRIBUTIONS
The distribution of a random variable is a specification of the probability that the random variable
takes on any set of values. What is a random variable? It is just a function defined on the sample
space S of an experiment.
Definition 2.2 If $X$ is a discrete random variable with range $R(X) = \{x_1, x_2, \ldots\}$, the probability mass function (pmf) of $X$ is $p(x_i) = P(X = x_i)$, $i = 1, 2, \ldots$. We write¹ $\{X = x_i\}$ for the event $\{X = x_i\} = \{s \in S \mid X(s) = x_i\}$.
Remark 2.3 Any function $p(x_i)$ which satisfies (i) $0 \le p(x_i) \le 1$ for all $i$, and (ii) $\sum_i p(x_i) = 1$, is called a pmf. The pmf of a rv is also called its distribution.
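As a concrete illustration (a minimal Python sketch, not from the text; the fair-die pmf here is just an example), a pmf can be stored as a table of values and probabilities and checked against conditions (i) and (ii):

```python
# A pmf for a fair six-sided die, stored as a dict {value: probability}.
pmf = {x: 1/6 for x in range(1, 7)}

# Property (i): every probability lies in [0, 1].
assert all(0 <= p <= 1 for p in pmf.values())

# Property (ii): the probabilities sum to 1 (up to floating-point error).
assert abs(sum(pmf.values()) - 1) < 1e-9

# P(X in A) for an event A is the sum of the pmf over the values in A.
p_even = sum(pmf[x] for x in pmf if x % 2 == 0)
```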
Remark 2.5 Here's where this comes from. If we have a particular sequence of $n$ Bernoulli trials with $x$ successes, say $10011101\ldots1$, then $x$ 1's must be in this sequence and $n-x$ 0's must also be in there. By independence of the trials, the probability of any particular sequence of $x$ 1's and $n-x$ 0's is $p^x (1-p)^{n-x}$. How many sequences with $x$ 1's out of $n$ are there? That number is

$$\binom{n}{x} = \frac{n!}{x!(n-x)!}.$$

It should be clear that a Binomial$(n, p)$ rv $X$ is a sum of $n$ independent Bernoulli$(p)$ rvs, $X = X_1 + X_2 + \cdots + X_n$. Independent rvs will be discussed later.
Example 2.6 A bet on red for a standard roulette wheel has 18 chances out of 38 of winning. Suppose a gambler will bet \$5 on red each time for 100 plays. Let $X$ be the total amount won or lost as a result of these 100 plays. $X$ will be a discrete random variable with range $R(X) = \{0, \pm 10, \pm 20, \ldots, \pm 500\}$. In fact, if $M$ denotes the number of games won (which is also a random variable with values from 0 to 100), then our net amount won or lost is $X = 10M - 500$. The random variable $M$ is an example of a Binomial$(100, 18/38)$ rv.
The chance you win exactly 50 games is

$$P(M = 50) = \binom{100}{50} \left(\frac{18}{38}\right)^{50} \left(\frac{20}{38}\right)^{50} = 0.0693,$$

so the chance you break even, $P(X = 0)$, is also 0.0693.
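The break-even probability can be cross-checked in a couple of lines (the book itself works with calculator commands; this Python sketch uses only the standard library):

```python
from math import comb

n, p = 100, 18/38
# P(M = 50) = C(100, 50) * (18/38)^50 * (20/38)^50
p_50 = comb(n, 50) * p**50 * (1 - p)**50
# Since X = 10M - 500, P(X = 0) = P(M = 50).
```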
Remark 2.8 Later it will turn out to be very useful to also use the notation for a pdf $f(x) = P(X = x)$, but we have to be careful with this because, as we will see, the probability a continuous rv is any particular value is 0. This notation is purely to simplify statements and is intuitive as long as one keeps this in mind.
Distribution 2.9 A rv $X$ is said to have a Normal distribution with parameters $(\mu, \sigma)$, $\sigma > 0$, if

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty.$$
Remark 2.10 The line of symmetry of a $N(\mu, \sigma)$ is always at $x = \mu$, and it provides the point of maximum of the pdf. One can check this using the second derivative test and $f'(\mu) = 0$, $f''(\mu) < 0$. It is also a calculus exercise to check that $x = \mu + \sigma$ and $x = \mu - \sigma$ both provide points of inflection (where concavity changes) of the pdf.
Remark 2.11 It is not an easy task to check that $f(x)$ really is a density function. It is obviously always nonnegative, but why is the integral equal to one? That fact uses the following formula, which is verified in calculus:

$$\int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{\pi}.$$

Using this formula and a simple change of variables, one can verify that $f$ is indeed a pdf.
How do we use pdfs to compute probabilities? Let’s start with finding certain types of
probabilities.
Definition 2.12 The cumulative distribution function (cdf) of a random variable $X$ is $F_X(x) = P(X \le x)$:

$$F_X(x) = \begin{cases} \displaystyle\sum_{x_i \le x} p(x_i), & \text{if } X \text{ is discrete with pmf } p; \\[2mm] \displaystyle\int_{-\infty}^{x} f(y)\, dy, & \text{if } X \text{ is continuous with pdf } f. \end{cases}$$
(c) $\lim_{y \to x^+} F_X(y) = F_X(x)$ for all $x \in \mathbb{R}$. This says a cdf is continuous at every point from the right.
Distribution 2.16 Discrete Uniform. $P(X = x) = \frac{1}{n}$, $x = 1, 2, \ldots, n$; $F_X(x) = \frac{x}{n}$, $1 \le x \le n$. A discrete uniform rv picks one of $n$ points at random.
Distribution 2.17 Poisson$(\lambda)$. $P(X = x) = e^{-\lambda}\frac{\lambda^x}{x!}$, $x = 0, 1, 2, \ldots$. The parameter $\lambda > 0$ is given. A Poisson$(\lambda)$ rv counts the number of events that occur at the rate $\lambda$.³
The Multinomial distribution generalizes the binomial distribution to the case when there is more than just a success or failure on each trial.
Example 2.22 For the Negative Binomial, $X$ is the number of trials until we get $r$ successes. We must have at least $r$ trials to get $r$ successes, and we get $r$ successes with probability $p^r$ and $x - r$ failures with probability $(1-p)^{x-r}$. Since we stop counting when we get the $r$th success, the last trial must be a success. Therefore, in the preceding $x - 1$ trials we spread $r - 1$ successes, and there are $\binom{x-1}{r-1}$ ways to do that. That's why

$$P(X = x) = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \quad x \ge r.$$

Here is an example where the Negative Binomial arises.
Best of seven series. The baseball and NBA finals determine a winner by the two teams playing up to seven games, with the first team to win four games the champ. Suppose team A wins each game with probability $p$ and loses to team B with probability $1-p$.
(a) If $p = 0.52$, what is the probability A wins the series? For A to win the series, A can win in 4 straight, or in 5 games, or in 6 games, or in 7 games. This is negative binomial with $x = 4, 5, 6, 7$, so if $X$ is the number of games until A's 4th win,

$$P(A \text{ wins series}) = P(X = 4) + P(X = 5) + P(X = 6) + P(X = 7)$$
$$= \binom{3}{3}(.52)^4(.48)^0 + \binom{4}{3}(.52)^4(.48)^1 + \binom{5}{3}(.52)^4(.48)^2 + \binom{6}{3}(.52)^4(.48)^3 = 0.54368.$$

If $p = 0.55$, the probability A wins the series goes up to 0.60828, and if $p = 0.6$ the probability A wins is 0.7102.
(b) If $p = 0.52$ and A wins the first game, what is the probability A wins the series?
This is asking for $P(A \text{ wins series} \mid A \text{ wins game 1})$. Let $X_1$ be the number of games (out of the remaining 6) until A wins 3. Then

$$P(A \text{ wins series} \mid A \text{ wins game 1}) = \frac{P(A \text{ wins series} \cap A \text{ wins game 1})}{P(A \text{ wins game 1})}$$
$$= \frac{p P(X_1 = 3) + p P(X_1 = 4) + p P(X_1 = 5) + p P(X_1 = 6)}{p}$$
$$= P(X_1 = 3) + P(X_1 = 4) + P(X_1 = 5) + P(X_1 = 6)$$
$$= \binom{2}{2}(.52)^3(.48)^0 + \binom{3}{2}(.52)^3(.48)^1 + \binom{4}{2}(.52)^3(.48)^2 + \binom{5}{2}(.52)^3(.48)^3 = 0.6929.$$

An easy way to get this without conditional probability is to realize that once game one is over, A has to be the first to 3 wins in at most 6 trials.
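Both series probabilities can be checked numerically with the Negative Binomial pmf; a short Python sketch (illustrative, using only `math.comb`):

```python
from math import comb

def negbin_pmf(x, r, p):
    # P(X = x): the r-th success occurs on trial x.
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

p = 0.52
# (a) A wins a best-of-seven series: 4th win on game 4, 5, 6, or 7.
p_series = sum(negbin_pmf(x, 4, p) for x in range(4, 8))
# (b) Given A won game 1, A needs 3 wins in the remaining (at most 6) games.
p_series_given_win1 = sum(negbin_pmf(x, 3, p) for x in range(3, 7))
```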
Example 2.23 Multinomial distributions arise whenever each trial has two or more possible outcomes. Here's a polling example. Suppose 25 registered voters are chosen at random from a population in which we know that 55% are Democrats, 40% are Republicans, and 5% are Independents. In our sample of 25, what are the chances we get 10 Democrats, 10 Republicans, and 5 Independents?
This is multinomial with $p_1 = .55$, $p_2 = .4$, $p_3 = .05$. Then

$$P(D = 10, R = 10, I = 5) = \binom{25}{10, 10, 5} (.55)^{10} (.4)^{10} (.05)^5 = 0.000814,$$

which is really small because we are asking for exactly the numbers $(10, 10, 5)$. It is much more tedious to calculate, but we can also find cumulative probabilities like $P(D \le 15, R \le 12, I \le 20)$. Notice that in the cumulative distribution we don't require that $15 + 12 + 20$ be the number of trials.
Example 2.24 Consider a random variable $X$ with pdf

$$f_X(x) = \begin{cases} 60x^2(1-x)^3, & 0 \le x \le 1; \\ 0, & \text{otherwise.} \end{cases}$$

Suppose 20 independent samples are drawn from $X$. An outcome is the sample value falling into the range $\left[0, \frac{1}{5}\right]$ when $i = 1$, or $\left(\frac{i-1}{5}, \frac{i}{5}\right]$, $i = 2, 3, 4, 5$. What is the probability that 3 observations fall into the first range, 9 fall into the second range, 4 fall into each of the third and fourth ranges, and that there are no observations in the fifth range? To answer this question, let $p_i$ denote the probability of a sample value falling into range $i$. These probabilities are computed directly from the pdf. For example,

$$p_1 = \int_0^{0.2} 60x^2(1-x)^3\, dx = 0.098.$$

Range:        $[0, 0.2]$   $(0.2, 0.4]$   $(0.4, 0.6]$   $(0.6, 0.8]$   $(0.8, 1.0]$
Probability:    0.098         0.356          0.365          0.162          0.019

$$P(X_1 = 3, X_2 = 9, X_3 = 4, X_4 = 4, X_5 = 0) = \binom{20}{3, 9, 4, 4, 0} (0.098)^3 (0.356)^9 (0.365)^4 (0.162)^4 (0.019)^0 = 0.00205.$$
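The multinomial probability above can be reproduced with a small helper built from factorials; a hedged Python sketch (the `multinomial_pmf` helper is our own, using the cell probabilities from the table):

```python
from math import factorial

probs = [0.098, 0.356, 0.365, 0.162, 0.019]   # p_1, ..., p_5 from the table
counts = [3, 9, 4, 4, 0]                      # observed cell counts, n = 20

def multinomial_pmf(counts, probs):
    # Multinomial coefficient n! / (c_1! ... c_k!), then times prod p_i^{c_i}.
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p**c
    return prob

p_obs = multinomial_pmf(counts, probs)
```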
Example 2.25 Suppose we have 10 patients, 7 of whom have a genetic marker for lung cancer
and 3 of whom do not. We will choose 6 at random (without replacing them as we make our
selection). What are the chances we get exactly 4 patients with the marker and 2 without?
Figure 2.3: $P(X = k)$: the Hypergeometric looks like a normal if the number of trials is large enough.
$$P(X = 4) = \frac{\binom{7}{4}\binom{3}{2}}{\binom{10}{6}} = \frac{1}{2}.$$
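This hypergeometric probability is a one-liner with binomial coefficients; a quick Python check:

```python
from math import comb

# 4 of the 7 patients with the marker and 2 of the 3 without, choosing 6 of 10.
p4 = comb(7, 4) * comb(3, 2) / comb(10, 6)
```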
Next is the normal distribution which we have already discussed but we record it here
again for convenience.
Distribution 2.27 Normal$(\mu, \sigma)$. $X \sim N(\mu, \sigma)$ has density

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty.$$

It is not possible to get an explicit expression for the cdf, so we simply write

$$F_X(x) = N(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2}\, dy.$$

We shall also write $P(a < X < b) = \text{normalcdf}(a, b, \mu, \sigma)$.⁵
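Outside of a TI calculator, the same normalcdf value can be computed from the error function; a minimal stdlib sketch (the names `Phi` and `normalcdf` are ours, mirroring the standard normal cdf and the calculator command):

```python
from math import erf, sqrt, inf

def Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))

def normalcdf(a, b, mu=0.0, sigma=1.0):
    # Area under the N(mu, sigma) density from a to b (a may be -inf, b may be inf).
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)
```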
Distribution 2.28 Exponential$(\lambda)$: $X \sim \text{Exp}(\lambda)$, $\lambda > 0$, has pdf

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0; \\ 0, & x < 0. \end{cases}$$

The cdf is

$$F_X(x) = \int_0^x \lambda e^{-\lambda y}\, dy = \begin{cases} 1 - e^{-\lambda x}, & \text{if } x \ge 0; \\ 0, & \text{if } x < 0. \end{cases}$$

An exponential random variable represents processes that do not remember. For example, if $X$ represents the time between arrivals of customers to a store, a reasonable model is Exponential$(\lambda)$, where $\lambda$ represents the average rate at which customers arrive.
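The "does not remember" property can be checked directly from the cdf: $P(X > s + t \mid X > s) = P(X > t)$. A small sketch (the values of $\lambda$, $s$, $t$ are arbitrary):

```python
from math import exp

lam = 2.0

def survival(x):
    # P(X > x) = 1 - F_X(x) = e^{-lam * x} for x >= 0.
    return exp(-lam * x)

s, t = 0.7, 1.3
lhs = survival(s + t) / survival(s)   # P(X > s + t | X > s)
rhs = survival(t)                     # P(X > t)
```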
At the end of this chapter we will introduce the remaining important distributions for statistics, including the $\chi^2$, $t$, and $F$-distributions. They are built on combinations of rvs. Now we look at an important transformation of a rv in the next example.
Example 2.29 Change of scale and shift. If we have a random variable $X$ which has pdf $f$ and cdf $F_X$, we may calculate the pdf and cdf of the random variable $Y = \alpha X + \beta$, where $\alpha \ne 0$ and $\beta$ are constants. To do so, start with the cdf:

$$F_Y(y) = P(Y \le y) = \begin{cases} P\!\left(X \le \frac{y - \beta}{\alpha}\right), & \alpha > 0; \\[1mm] P\!\left(X \ge \frac{y - \beta}{\alpha}\right), & \alpha < 0; \end{cases} \;=\; \begin{cases} F_X\!\left(\frac{y - \beta}{\alpha}\right), & \alpha > 0; \\[1mm] 1 - F_X\!\left(\frac{y - \beta}{\alpha}\right), & \alpha < 0. \end{cases}$$

⁵normalcdf$(a, b, \mu, \sigma)$ is a command on a TI-8x calculator which gives the area under the normal density with parameters $\mu, \sigma$ from $a$ to $b$.
In particular, if $X \sim N(\mu, \sigma)$, we have the pdf for $Y = \alpha X + \beta$ (assuming $\alpha > 0$):

$$f_Y(y) = \frac{1}{\alpha\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y - \alpha\mu - \beta}{\alpha\sigma}\right)^2},$$

which we recognize as the pdf of a $N(\alpha\mu + \beta, \alpha\sigma)$ random variable. Thus, $Y = \alpha X + \beta \sim N(\alpha\mu + \beta, \alpha\sigma)$.
If we take $\alpha = \frac{1}{\sigma}$, $\beta = -\frac{\mu}{\sigma}$, $Y = \frac{1}{\sigma}X - \frac{\mu}{\sigma}$, then $Y \sim N(0, 1)$. We have shown that given any $X \sim N(\mu, \sigma)$, if we set

$$Y = \frac{X - \mu}{\sigma} \implies Y \sim N(0, 1).$$

The rv $X \sim N(\mu, \sigma)$ has been converted to the standard normal rv $Y \sim N(0, 1)$. Starting with $X \sim N(\mu, \sigma)$ and converting it to $Y \sim N(0, 1)$ is called standardizing $X$.
The reason that a normal distribution is so important is contained in the following special case of the Central Limit Theorem.

Theorem 2.30 Central Limit Theorem for Binomial. Let $S_n \sim \text{Binom}(n, p)$. Then

$$\lim_{n \to \infty} P\!\left(\frac{S_n - np}{\sqrt{np(1-p)}} \le x\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2}\, dz = \text{normalcdf}(-\infty, x, 0, 1), \quad \text{for all } x \in \mathbb{R}.$$

In short, $\frac{S_n - np}{\sqrt{np(1-p)}} \approx N(0, 1)$ for large $n$. Alternatively, $S_n \approx N\!\left(np, \sqrt{np(1-p)}\right)$.
Example 2.32 Suppose you are going to play roulette 25 times, betting on red each time. What is the probability you win at least 14 games?
Remember that the probability of winning is $p = 18/38$. Let $X$ be the number of games won. Then $X \sim \text{Binom}(25, 18/38)$, and what we want to find is $P(X \ge 14)$.
We may calculate this in two ways. First, using the binomial distribution,

$$P(X \ge 14) = \sum_{x=14}^{25} \binom{25}{x} (18/38)^x (20/38)^{25-x} = 1 - \text{binomcdf}(25, 18/38, 13) = 0.2531.$$

This is the exact answer. Second, we may use the Central Limit Theorem, which says

$$Z = \frac{S_{25} - 25 \cdot (18/38)}{\sqrt{25 \cdot (18/38) \cdot (20/38)}} \approx N(0, 1).$$

Consequently,

$$P(S_{25} \ge 14) = P\!\left(\frac{S_{25} - 25 \cdot (18/38)}{\sqrt{25 \cdot (18/38) \cdot (20/38)}} \ge \frac{14 - 25 \cdot (18/38)}{\sqrt{25 \cdot (18/38) \cdot (20/38)}}\right) \approx P(Z \ge 0.8644) = 0.1937.$$

This is not a great approximation. We can make it better by using the continuity correction. We have

$$P(S_{25} \ge 14) \approx \text{normalcdf}\!\left(13.5, \infty, 25 \cdot (18/38), \sqrt{25 \cdot (18/38) \cdot (20/38)}\right) = 0.2533,$$

which is considerably better.
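All three numbers (exact tail, plain CLT, and continuity-corrected CLT) can be reproduced with the standard library; a sketch using the error function for the normal cdf:

```python
from math import comb, erf, sqrt

n, p = 25, 18/38
mu, sd = n * p, sqrt(n * p * (1 - p))

def Phi(z):
    # Standard normal cdf.
    return 0.5 * (1 + erf(z / sqrt(2)))

# Exact binomial tail P(X >= 14).
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(14, 26))
# Plain CLT approximation.
clt = 1 - Phi((14 - mu) / sd)
# CLT with the continuity correction.
clt_cc = 1 - Phi((13.5 - mu) / sd)
```

The continuity-corrected value lands much closer to the exact tail, which is exactly the point of the correction.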
Figure 2.4 shows why the continuity correction gives a better estimate for a binomial.
With this definition, you can see why it is frequently useful to write $E(g(X)) = \int g(x) P(X = x)\, dx$ even when $X$ is a continuous rv. This abuses notation a lot, and you have to keep in mind that $f(x) \ne P(X = x)$, which is zero when $X$ is continuous.
From calculus we know that if we have a one-dimensional object with density $f(x)$ at each point, then $\int x f(x)\, dx = E(X)$ gives the center of gravity of the object. If $X$ is discrete,

⁶We frequently write $E[g(X)] = E(g(X)) = Eg(X)$ and drop the braces or parentheses.
the expected value is an average of the values of $X$, weighted by the probability it takes on each value. For example, if $X$ has values 1, 2, 3 with probabilities $1/8, 3/8, 1/2$, respectively, then $E[X] = 1 \cdot \frac{1}{8} + 2 \cdot \frac{3}{8} + 3 \cdot \frac{1}{2} = \frac{19}{8} = 2.375$. On the other hand, the straight average of the 3 numbers is 2. The straight average corresponds to each value having equal probability.
Now we have a definition of the expected value of any function of $X$. In particular,

$$E[X^2] = \int_{-\infty}^{\infty} x^2 f(x)\, dx.$$
We need this if we want to see how the random variable spreads its values around the mean.
Definition 2.34 The variance of a rv $X$ is $\text{Var}(X) = E(X - E[X])^2$. Written out, the first step is to find the constant $\mu = E[X]$ and then

$$\text{Var}(X) = \begin{cases} \displaystyle\sum_x (x - \mu)^2 P(X = x), & \text{if } X \text{ is discrete}; \\[2mm] \displaystyle\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx, & \text{if } X \text{ is continuous.} \end{cases}$$

The standard deviation, abbreviated SD, of $X$ is $SD(X) = \sqrt{\text{Var}(X)}$.
Another measure of the spread of a distribution is the median and the percentiles. Here's the definition.

Definition 2.35 The median $m = \text{med}(X)$ of a random variable $X$ is defined to be the real number such that $P(X \le m) = P(X \ge m) = \frac{1}{2}$. The median is also known as the 50th percentile.
Given a real number $0 < q < 1$, the $100q$th percentile of $X$ is the number $x_q$ such that $P(X \le x_q) = q$.
The interquartile range of a rv is $IQR = Q_3 - Q_1$, i.e., the 75th percentile minus the 25th percentile. $Q_1$ is the first quartile, the 25th percentile, and $Q_3$ is the third quartile, the 75th percentile. The median is also known as $Q_2$, the second quartile.
In other words, $100q\%$ of the values of $X$ are below $x_q$. Percentiles apply to any random variable and give an idea of the shape of the density. Note that percentiles do not have to be unique, i.e., there may be several $x_q$'s resulting in the same $q$.
Example 2.36 If $Z \sim N(0, 1)$ we may calculate $E[Z] = \int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx$ using the substitution $z = x^2/2$ and obtain $E[Z] = 0$. Then we calculate $E[Z^2] = \int_{-\infty}^{\infty} x^2 \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx$ using integration by parts. We get $E[Z^2] = 1$, and then $\text{Var}[Z] = E[Z^2] - (E[Z])^2 = 1$. The parameters $\mu = 0$ and $\sigma = 1$ represent the mean and SD of $Z$. In general, if $X \sim N(\mu, \sigma)$ we write $X = \sigma Z + \mu$ with $Z \sim N(0, 1)$, and we see that $E[X] = \sigma \cdot 0 + \mu = \mu$ and $\text{Var}[X] = \sigma^2 \text{Var}[Z] = \sigma^2$, so that $SD(X) = \sigma$.
Example 2.37 Suppose we know that LSAT scores follow a normal distribution with mean 155 and SD 13. You take the test and score 162. What percent of people taking the test did worse than you?
This is asking for $P(X \le 162)$ knowing $X \sim N(155, 13)$. That's easy since $P(X \le 162) = \text{normalcdf}(-\infty, 162, 155, 13) = 0.704$. In other words, 162 is the 70.4th percentile of the scores.
Suppose instead someone told you that her score was in the 82nd percentile and you want to know her actual score. To find that, we are looking to solve $P(X \le x_{.82}) = 0.82$. To find this using technology⁷ we have $x_{.82} = \text{invNorm}(.82, 155, 13) = 166.89$, so she scored about 167.
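Without a calculator, invNorm can be imitated by inverting the erf-based normal cdf numerically; a hedged sketch (the bisection solver here is our own illustrative implementation, not the calculator's algorithm):

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal cdf.
    return 0.5 * (1 + erf(z / sqrt(2)))

def invNorm(q, mu=0.0, sigma=1.0, tol=1e-10):
    # Solve Phi(z) = q by bisection on [-10, 10], then unstandardize.
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return mu + sigma * (lo + hi) / 2

score = invNorm(0.82, 155, 13)   # the 82nd percentile of N(155, 13)
```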
Now here's a proposition which says that the mean is the best estimate of a rv $X$ in the mean square sense, and the median is the best estimate in the mean absolute deviation sense.

Proposition 2.38
(1) We have the alternate formula $\text{Var}(X) = E[X^2] - (E[X])^2$.
(2) The mean of $X$, $E[X] = \mu$, is the unique constant $a$ which minimizes $E[X - a]^2$. Then $\min_a E[X - a]^2 = E[X - \mu]^2 = \text{Var}(X)$.
(3) A median $\text{med}(X)$ is a constant which provides a minimum for $E|X - a|$. In other words, $\min_a E|X - a| = E|X - \text{med}(X)|$.
The second statement says that the variance is the minimum of the mean squared distance
of the rv X to its mean. The third statement says that a median (which may not be unique)
satisfies a similar property for the absolute value of the distance.
Proof. (1) Set $\mu = E[X]$:

$$\text{Var}(X) = E(X - E[X])^2 = E\!\left[X^2 - 2\mu X + \mu^2\right] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2.$$
One reason the mgf is so useful is the following theorem. It says that if we know the mgf, we can find moments, i.e., $E(X^n)$, $n = 1, 2, \ldots$, by taking derivatives.

Theorem 2.40 If $X$ has the mgf $M(t)$, then $E[X^n] = \frac{d^n}{dt^n} M(t)\big|_{t=0}$.
Proof. The proof is easy if we assume that we can switch integrals and derivatives:

$$\frac{d^n}{dt^n} M(t) = \int_{-\infty}^{\infty} \frac{d^n}{dt^n} e^{tx} f(x)\, dx = \int_{-\infty}^{\infty} x^n e^{tx} f(x)\, dx.$$

Plug in $t = 0$ in the last integral to see $\int_{-\infty}^{\infty} x^n e^{tx}\big|_{t=0} f(x)\, dx = \int_{-\infty}^{\infty} x^n f(x)\, dx = E[X^n]$.
Example 2.41 Let's use the mgf to find the mean and variance of $X \sim \text{Binom}(n, p)$:

$$M(t) = \sum_{x=0}^{n} e^{tx} P(X = x) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1-p)^{n-x} = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1-p)^{n-x} = \left(pe^t + (1-p)\right)^n.$$

We used the Binomial Theorem from algebra⁸ in the last line. Now that we know the mgf we can find any moment by taking derivatives. Here are the first two:

$$M'(t) = n\left(pe^t + (1-p)\right)^{n-1} pe^t \implies E[X] = M'(0) = np$$

and

$$M''(t) = n(n-1)p^2 e^{2t} \left(pe^t + (1-p)\right)^{n-2} + npe^t \left(pe^t + (1-p)\right)^{n-1} \implies E[X^2] = M''(0) = n(n-1)p^2 + np.$$
(a) $X \sim \text{Unif}(a, b)$, $f(x) = \frac{1}{b-a}$, $a \le x \le b$. We get

$$M(t) = \int_a^b e^{tx} \frac{1}{b-a}\, dx = \frac{1}{b-a} \frac{1}{t} e^{tx} \Big|_a^b = \frac{e^{tb} - e^{ta}}{t(b-a)}.$$

Then

$$M'(t) = \frac{e^{at}(at - 1) + e^{bt}(1 - bt)}{(a-b)t^2} \quad \text{and} \quad \lim_{t \to 0} M'(t) = \frac{a+b}{2}.$$

⁸$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k$.

We conclude $E[X] = \frac{a+b}{2}$. While we could find $M''(0) = E[X^2]$, it is actually easier to find this directly:

$$E[X^2] = \int_a^b x^2 \frac{1}{b-a}\, dx = \frac{b^3 - a^3}{3(b-a)} \implies \text{Var}(X) = E[X^2] - (E[X])^2 = \frac{(b-a)^2}{12}.$$
(b) $X \sim \text{Exp}(\lambda)$, $f(x) = \lambda e^{-\lambda x}$, $x > 0$. We get

$$M(t) = E[e^{tX}] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\, dx = \frac{\lambda}{\lambda - t}, \quad \text{if } t < \lambda.$$

$$M'(t) = \frac{\lambda}{(\lambda - t)^2},\ M'(0) = \frac{1}{\lambda} = E[X], \quad \text{and} \quad M''(t) = \frac{2\lambda}{(\lambda - t)^3},\ M''(0) = \frac{2}{\lambda^2} = E[X^2],$$

$$\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$
(c) $X \sim N(0, 1)$, $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$, $-\infty < x < \infty$. The mgf for the standard normal distribution is

$$M(t) = \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx - \frac{x^2}{2}}\, dx = e^{t^2/2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x - t)^2}\, dx = e^{t^2/2},$$

since $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x-t)^2}\, dx = 1$. Now we can find the moments fairly simply:

$$M'(t) = e^{\frac{t^2}{2}} t \implies M'(0) = E[X] = 0 \quad \text{and} \quad M''(t) = e^{\frac{t^2}{2}} t^2 + e^{\frac{t^2}{2}} \implies M''(0) = E[X^2] = 1,$$

$$\text{Var}(X) = E[X^2] - (E[X])^2 = 1.$$
(d) $X \sim N(\mu, \sigma)$. All we have to do is convert $X$ to standard normal. Let $Z = \frac{X - \mu}{\sigma}$. We know $Z \sim N(0, 1)$, and we may use the previous part to write $M_Z(t) = e^{t^2/2}$. How do we get the mgf for $X$ from that? Well, we know $X = \sigma Z + \mu$ and so

$$M_X(t) = E[e^{tX}] = E[e^{(\sigma Z + \mu)t}] = e^{\mu t} E[e^{(\sigma t)Z}] = e^{\mu t} e^{\frac{1}{2}(\sigma t)^2} = e^{\mu t + \frac{\sigma^2 t^2}{2}}.$$

Then $M'(t) = e^{\mu t + \frac{\sigma^2 t^2}{2}} (\mu + \sigma^2 t)$, so that $M'(0) = E[X] = \mu$. Next,

$$M''(t) = e^{\mu t + \frac{\sigma^2 t^2}{2}} \left( (\mu + \sigma^2 t)^2 + \sigma^2 \right) \implies M''(0) = \mu^2 + \sigma^2.$$
Theorem 2.42
(1) If $X$ and $Y$ are two rvs such that $M_X(t) = M_Y(t)$ (for all $t$ close to 0), then $X$ and $Y$ have the same cdfs.
Example 2.43 An rv $X$ has density $f(x) = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{x}} e^{-\frac{x}{2}}$, $x > 0$, and $f(x) = 0$ if $x < 0$. First, we find the mgf of $X$:

$$M_X(t) = \frac{1}{\sqrt{2\pi}} \int_0^{\infty} e^{tx} \frac{1}{\sqrt{x}} e^{-\frac{x}{2}}\, dx$$
$$= \frac{2}{\sqrt{2\pi}} \int_0^{\infty} e^{-u^2\left(\frac{1}{2} - t\right)}\, du \quad \text{setting } u = \sqrt{x}$$
$$= \frac{2}{\sqrt{2\pi}} \frac{1}{\sqrt{1 - 2t}} \int_0^{\infty} e^{-z^2/2}\, dz \quad \text{setting } z = u\sqrt{1 - 2t},\ t < \frac{1}{2}$$
$$= \frac{1}{\sqrt{1 - 2t}}, \quad \text{for } t < \frac{1}{2}, \quad \text{since } \frac{2}{\sqrt{2\pi}} \int_0^{\infty} e^{-z^2/2}\, dz = 1.$$

Then,

$$E[X] = M_X'(0) = 1 \quad \text{and} \quad E[X^2] = M_X''(0) = 3 \implies \text{Var}(X) = 3 - 1^2 = 2.$$
Where does this density come from? To answer this, let $Z \sim N(0, 1)$ and let's find the pdf of $Y = Z^2$:

$$F_Y(y) = P(Z^2 \le y) = P(-\sqrt{y} \le Z \le \sqrt{y}) = \frac{1}{\sqrt{2\pi}} \int_{-\sqrt{y}}^{\sqrt{y}} e^{-x^2/2}\, dx,$$

$$f(y) = F_Y'(y) = \frac{1}{\sqrt{2\pi}} e^{-y/2} \frac{1}{2\sqrt{y}} + \frac{1}{\sqrt{2\pi}} e^{-y/2} \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{y}} e^{-y/2}, \quad y > 0.$$

If $y \le 0$, $f(y) = 0$. This shows that the density we started with is the density for $Z^2$. Now we calculate the mgf for $Z^2$:

$$M_{Z^2}(t) = E[e^{tZ^2}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz^2} e^{-z^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2\left(\frac{1}{2} - t\right)}\, dz = M_X(t),$$

since if we compare the last integral with the second integral in the computation of $M_X$ we see that they are the same. This means that $M_X(t) = M_{Z^2}(t)$, and part (1) of Theorem 2.42 says that $X$ and $Z^2$ must have the same distribution.
Definition 2.44 The rv $X = Z^2$ with density given by $f(x) = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{x}} e^{-\frac{x}{2}}$, $x > 0$, and $f(x) = 0$ if $x < 0$, is called $\chi^2$ with 1 degree of freedom, written as $X \sim \chi^2(1)$.
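The identification of $Z^2$ with $\chi^2(1)$ can be sanity-checked by simulation: squared standard normals should have mean $E[X] = 1$ and variance $\text{Var}(X) = 2$, as computed from the mgf above. A seeded Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import random

random.seed(7)
n = 200_000
# Draw standard normals and square them; these should be chi-square(1) samples.
samples = [random.gauss(0, 1)**2 for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean)**2 for s in samples) / n
```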
Remark 2.45 We record here the mean and variance of some important discrete distributions.
(a) $X \sim \text{Binom}(n, p)$: $E[X] = np$, $\text{Var}(X) = np(1-p)$.
(b) $X \sim \text{Geom}(p)$: $E[X] = \frac{1}{p}$, $\text{Var}(X) = \frac{1-p}{p^2}$, $M_X(t) = \frac{pe^t}{1 - (1-p)e^t}$.
(c) $X \sim \text{HyperGeom}(N, r, n)$: $E[X] = np$, $p = r/N$, $\text{Var}(X) = np(1-p)\frac{N-n}{N-1}$.
(d) $X \sim \text{NegBinom}(r, p)$: $E[X] = r\frac{1}{p}$, $\text{Var}(X) = r\frac{1-p}{p^2}$, $M_X(t) = \left(\frac{pe^t}{1 - (1-p)e^t}\right)^r$.
(e) $X \sim \text{Poisson}(\lambda)$: $E[X] = \lambda$, $\text{Var}(X) = \lambda$, $M_X(t) = e^{\lambda(e^t - 1)}$.
Definition 2.46
(1) If $X$ and $Y$ are two random variables, the joint cdf is $F_{X,Y}(x, y) = P(\{X \le x\} \cap \{Y \le y\})$. In general, we write this as $F_{X,Y}(x, y) = P(X \le x, Y \le y)$.
(2) If $X$ and $Y$ are discrete, the joint pmf of $(X, Y)$ is $p(x, y) = P(X = x, Y = y)$.
(3) A joint density function is a function $f_{X,Y}(x, y) \ge 0$ with $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = 1$. The pair of rvs $(X, Y)$ is continuous if there is a joint density function, and then

$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v)\, dv\, du.$$

(4) If we know $F_{X,Y}(x, y)$, then the joint density is $f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x\, \partial y}$.
Knowing the joint distribution of $X$ and $Y$ means we have full knowledge of $X$ and $Y$ individually. For example, if we know $F_{X,Y}(x, y)$, then

$$F_X(x) = F_{X,Y}(x, \infty) = \lim_{y \to \infty} F_{X,Y}(x, y), \qquad F_Y(y) = F_{X,Y}(\infty, y).$$

The resulting $F_X$ and $F_Y$ are called the marginal cumulative distribution functions. The marginal densities, when there is a joint density, are given by

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx.$$
Example 2.47 The function $f(x, y) = \begin{cases} 8xy, & 0 \le x < y \le 1; \\ 0, & \text{otherwise,} \end{cases}$ is given. First we verify it is a joint density. Since $f \ge 0$, all we need to check is that the double integral is one:

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = \int_0^1 \int_0^y 8xy\, dx\, dy = \int_0^1 8y \cdot \frac{1}{2} y^2\, dy = 1.$$

If $X$ and $Y$ are discrete rvs, the joint pmf is $p(x, y) = P(X = x, Y = y)$. The marginals are then given by $p_X(x) = P(X = x) = \sum_y p(x, y)$ and $p_Y(y) = P(Y = y) = \sum_x p(x, y)$.
In general, for any set $C \subset \mathbb{R} \times \mathbb{R}$, the probability that the pair $(X, Y) \in C$ is defined by

$$P((X, Y) \in C) = \begin{cases} \displaystyle\iint_C f_{X,Y}(x, y)\, dx\, dy, & \text{if } X, Y \text{ are continuous}; \\[2mm] \displaystyle\sum_{(x,y) \in C} p_{X,Y}(x, y), & \text{if } X, Y \text{ are discrete.} \end{cases} \tag{2.1}$$
Example 2.49 We calculate $E(X + Y)$ assuming we have the joint density of $(X, Y)$ given by $f_{X,Y}(x, y)$. By definition,

$$E(X + Y) = \iint (x + y) f_{X,Y}(x, y)\, dx\, dy = \iint x f_{X,Y}(x, y)\, dx\, dy + \iint y f_{X,Y}(x, y)\, dx\, dy$$
$$= \int x \left( \int f_{X,Y}(x, y)\, dy \right) dx + \int y \left( \int f_{X,Y}(x, y)\, dx \right) dy = \int x f_X(x)\, dx + \int y f_Y(y)\, dy = E(X) + E(Y).$$

Notice that the first $E$ uses the joint density $f_{X,Y}$, while the second and third $E$'s use $f_X$ and $f_Y$, respectively.
Example 2.50 Suppose $(X, Y)$ have joint density $f(x, y) = 1$, $0 \le x, y \le 1$, and $f(x, y) = 0$ otherwise. This models picking a random point $(x, y)$ in the unit square. If we want to calculate $P(X < Y)$, this uses the density:

$$P(X < Y) = \iint_{0 \le x < y \le 1} f(x, y)\, dx\, dy = \int_0^1 \int_0^y 1\, dx\, dy = \frac{y^2}{2}\Big|_0^1 = \frac{1}{2}.$$

Similarly, we may calculate

$$P\!\left(X^2 + Y^2 \le \frac{1}{4}\right) = \iint_{0 \le x^2 + y^2 \le \frac{1}{4}} 1\, dx\, dy = \text{area of the quarter disk of radius } \tfrac{1}{2} \text{ in the square} = \frac{\pi}{16}.$$

Also,

$$E\!\left[X^2 + Y^2\right] = \int_0^1 \int_0^1 (x^2 + y^2) f(x, y)\, dx\, dy = \frac{2}{3}.$$
Example 2.52 For the rvs $(X, Y)$ with joint density $f(x, y) = \begin{cases} 8xy, & 0 \le x < y \le 1; \\ 0, & \text{otherwise,} \end{cases}$ we'll find $E(X + Y)$ and $E(XY)$:

$$E(X + Y) = \int_0^1 \int_0^y 8xy(x + y)\, dx\, dy = \frac{4}{3} \quad \text{and} \quad E(XY) = \int_0^1 \int_x^1 8xy(xy)\, dy\, dx = \frac{4}{9}.$$

Note that

$$E(X) = \int_0^1 4x(1 - x^2)\, x\, dx = \frac{8}{15} \quad \text{and} \quad E(Y) = \int_0^1 4y^3 \cdot y\, dy = \frac{4}{5},$$

so that $E(X + Y) = E(X) + E(Y)$ but $E(XY) \ne E(X) \cdot E(Y)$.
You see that $E(XY) \ne E(X) \cdot E(Y)$ in general, but there are important cases when this is true. For that we need the notion of independent random variables.
Random variables $X$ and $Y$ are independent if

$$P(X \le x, Y \le y) = P(X \le x) \cdot P(Y \le y), \quad \text{for all } x \in \mathbb{R},\ y \in \mathbb{R}.$$

If $(X, Y)$ has a joint density $f_{X,Y}$, $X$ has density $f_X$, and $Y$ has density $f_Y$, then independence means that the joint density factors into the product of the individual densities: $f_{X,Y}(x, y) = f_X(x) f_Y(y)$.
One of the main consequences of independence is the following fact. It says the expected value of a product of independent rvs is the product of the expected value of each rv: $E[XY] = E[X] \cdot E[Y]$. In fact, for any functions $g, h$, we have $E[g(X)h(Y)] = E[g(X)] \cdot E[h(Y)]$.

Proof. By definition,

$$E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{X,Y}(x, y)\, dx\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_X(x) f_Y(y)\, dx\, dy = \int_{-\infty}^{\infty} x f_X(x)\, dx \int_{-\infty}^{\infty} y f_Y(y)\, dy = E[X] \cdot E[Y].$$
Independence also allows us to find an explicit expression for the cumulative distribution of the sum of two independent random variables. This is really another application of the Law of Total Probability. To see this,

$$P(X + Y \le w) = \int P(X + Y \le w, Y = y)\, dy = \int P(X \le w - y) P(Y = y)\, dy = \int F_X(w - y) f_Y(y)\, dy.$$

The first equality uses the Law of Total Probability and the second equality uses independence.
Example 2.56 Suppose $X$ and $Y$ are independent $\text{Exp}(\lambda)$ rvs. Then, for $w \ge 0$,

$$P(X + Y \le w) = \int_0^{\infty} F_X(w - y) f_Y(y)\, dy = \int_0^w \left(1 - e^{-\lambda(w - y)}\right) \lambda e^{-\lambda y}\, dy = 1 - (\lambda w + 1) e^{-\lambda w} = F_{X+Y}(w).$$

If $w < 0$, $F_{X+Y}(w) = 0$. To find the density we take the derivative with respect to $w$ to get

$$f_{X+Y}(w) = \lambda^2 w\, e^{-\lambda w}, \quad w \ge 0.$$

It turns out that this is the pdf of a so-called Gamma$(\lambda, 2)$ rv.
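The closed-form cdf $F_{X+Y}(w) = 1 - (\lambda w + 1)e^{-\lambda w}$ can be checked against a simulation of two independent exponentials; a seeded sketch (the choices $\lambda = 2$, $w = 1$ are arbitrary):

```python
import random
from math import exp

random.seed(42)
lam, w = 2.0, 1.0
n = 100_000
# Empirical P(X + Y <= w) for independent Exp(lam) rvs.
hits = sum(1 for _ in range(n)
           if random.expovariate(lam) + random.expovariate(lam) <= w)
empirical = hits / n
# Closed form derived above.
exact = 1 - (lam * w + 1) * exp(-lam * w)
```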
The correlation of $X$ and $Y$ is

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}, \quad \sigma_X^2 = \text{Var}(X),\ \sigma_Y^2 = \text{Var}(Y).$$

It looks like covariance measures how independent $X$ and $Y$ are. It is certainly true that if $X, Y$ are independent, then $\rho(X, Y) = 0$, but the reverse is false.
The sum of each row is $P(X = x)$, while the sum of each column is $P(Y = y)$. Each element of the matrix is $P(X = x, Y = y)$. If $X, Y$ are independent, the $(x, y)$ element of the matrix must be the product of the marginals, i.e., $P(X = x, Y = y) = P(X = x) P(Y = y)$. You can see from the table that this is not true, so $X, Y$ are not independent. On the other hand,

$$E[XY] = \sum_{x,y} x\, y\, P(X = x, Y = y) = 0,$$

$$E[X] = (-1)\frac{1}{4} + (+1)\frac{1}{4} = 0 \quad \text{and} \quad E[Y] = (-1)\frac{1}{4} + (+1)\frac{1}{4} = 0,$$

which means $\text{Cov}(X, Y) = 0$, and so $X, Y$ are uncorrelated.
Remark 2.60 This can be extended to $n$ rvs $X_1, \ldots, X_n$: If they are uncorrelated (which is true if they are independent), $\text{Var}(X_1 + \cdots + X_n) = \text{Var}(X_1) + \cdots + \text{Var}(X_n)$.

Since mgfs determine a distribution uniquely according to Theorem 2.42, we see that $S_n \sim N\!\left(\sum_i \mu_i, \sqrt{\sum_i \sigma_i^2}\right)$.
Example 2.62 The sum of independent $\text{Geom}(p)$ random variables is Negative Binomial. In particular, suppose $X$ is the number of Bernoulli trials until we get $r$ successes with probability $p$ of success on each trial. Then $X = X_1 + X_2 + \cdots + X_r$, where $X_i \sim \text{Geom}(p)$, $i = 1, 2, \ldots, r$, is the number of trials from one success until the next. This is true since once we have a success we simply start counting anew from the last success until we get another success. Now, we have by independence,

$$E[X] = \sum_{i=1}^{r} E[X_i] = \frac{r}{p} \quad \text{and} \quad \text{Var}[X] = \sum_{i=1}^{r} \text{Var}[X_i] = \frac{r(1-p)}{p^2}.$$

In addition, using the mgf of $\text{Geom}(p)$, namely $M_{X_i}(t) = \frac{pe^t}{1 - e^t(1-p)}$, $t < -\ln(1-p)$, we have

$$M_X(t) = \prod_{i=1}^{r} M_{X_i}(t) = \frac{e^{rt} p^r}{\left(1 - e^t(1-p)\right)^r}, \quad t < -\ln(1-p).$$
We have seen that the sum of independent normal rvs is exactly normal. The Central Limit Theorem says that even if the $X_i$'s are not normal, the sum is approximately normal if the number of rvs is large. We have already seen the special case of this for Binomials, but it is true in much more generality. The full proof is covered in more advanced courses.

Theorem 2.63 Central Limit Theorem. Let $X_1, X_2, \ldots$ be a sequence of independent rvs all having the same distribution and $E[X_1] = \mu$, $\text{Var}(X_1) = \sigma^2$. Then for any $a, b \in \mathbb{R}$,

$$\lim_{n \to \infty} P\!\left(a \le \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} \le b\right) = P(a \le Z \le b),$$

where $Z \sim N(0, 1)$.
In particular, $S_n = X_1 + \cdots + X_n \approx N(n\mu, \sigma\sqrt{n})$, and, dividing by $n$, since $E\!\left[\frac{S_n}{n}\right] = \frac{n\mu}{n} = \mu$ and $\text{Var}\!\left(\frac{S_n}{n}\right) = \frac{1}{n^2} n\sigma^2 = \frac{\sigma^2}{n}$,

$$\frac{S_n}{n} = \overline{X} \approx N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$

This is true no matter what the distributions of the individual $X_i$'s are, as long as they all have the same finite means and variances.
    Sketch of Proof of the CLT (Optional): We may assume μ = 0 (why?). Set Z_n = S_n/(σ√n).
Then, if M(t) = M_{X_i}(t) is the common mgf of the rvs X_i,

    M_{Z_n}(t) = [ M(t/(σ√n)) ]^n = exp( n ln M(t/(σ√n)) ).

If we can show that

    lim_{n→∞} M_{Z_n}(t) = e^{t²/2},

then by Theorem 2.42 we can conclude that the cdf of Z_n will converge to the cdf of the random
variable that has mgf e^{t²/2}. But that random variable is Z ~ N(0,1). That will complete the
proof. Therefore, all we need to do is to show that

    lim_{n→∞} n ln M(t/(σ√n)) = t²/2.

To see this, change variables to x = t/(σ√n), so that n = t²/(σ²x²) and

    lim_{n→∞} n ln M(t/(σ√n)) = lim_{x→0} (t²/σ²) · ln M(x)/x².

Since ln M(0) = 0 we may use L'Hopital's rule to evaluate the limit. We get

    lim_{x→0} (t²/σ²) · ln M(x)/x² = (t²/σ²) lim_{x→0} [ M'(x)/M(x) ] / (2x)
        = (t²/(2σ²)) lim_{x→0} M''(x)/( x M'(x) + M(x) )      (using L'Hopital again)
        = (t²/(2σ²)) · M''(0)/( 0·M'(0) + M(0) ) = (t²/(2σ²)) · σ²/(0·0 + 1) = t²/2,
2.5. JOINT DISTRIBUTIONS 47
since M(0) = 1, M'(0) = EX = 0, and M''(0) = EX² = σ². This completes the proof.
Example 2.64 Suppose an elevator is designed to hold 2000 pounds. The mean weight of a
person getting on the elevator is 175 pounds with standard deviation 15 pounds. How many people
can board the elevator so that the chance it is overloaded is at most 1%?
    Let W = X_1 + ... + X_n be the total weight of n people who board the elevator. We don't
know the distribution of the weights of individual people (which is probably not normal), but
we do know EX = 175 and Var(X) = 15². By the Central Limit Theorem, W ≈ N(175n, 15√n),
and we want to find n so that

    P(W > 2000) ≤ 0.01.

If we standardize W we get

    0.01 ≥ P(W > 2000) = P( (W - 175n)/(15√n) > (2000 - 175n)/(15√n) ) = P( Z > (2000 - 175n)/(15√n) ).

Using a calculator, we get P(Z > z) = 0.01 ⟹ z = invNorm(0.99) = 2.326. Therefore, it
must be true that

    (2000 - 175n)/(15√n) ≥ 2.326 ⟹ n ≤ 10.

The maximum number of people that can board the elevator and meet the criterion is 10. Without
knowing the distribution of the weight of people, there is no other way to do this problem.
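The search for the largest admissible n can be reproduced numerically (a sketch assuming SciPy is available; norm.ppf plays the role of invNorm, and norm.sf gives the upper-tail probability):

```python
import math
from scipy.stats import norm

# Largest n with P(W > 2000) <= 0.01 when W is approx N(175n, 15*sqrt(n)):
n = 1
while norm.sf(2000, loc=175 * (n + 1), scale=15 * math.sqrt(n + 1)) <= 0.01:
    n += 1
print(n, round(norm.ppf(0.99), 3))   # second number is invNorm(0.99)
```

Solving the quadratic 2000 - 175n = 2.326 · 15√n directly gives n ≈ 10.8, consistent with the integer answer 10.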
    Chebychev's inequality bounds the probability that any rv deviates from its mean:

    P( |X - μ| ≥ c ) ≤ σ²/c²,   for any constant c > 0.     (2.2)

The larger c is, the smaller the probability can be. The argument for Chebychev is simple. Assume
X has pdf f. Then

    σ² = E|X - μ|² = ∫_{|x-μ|<c} |x - μ|² f(x) dx + ∫_{|x-μ|≥c} |x - μ|² f(x) dx
       ≥ ∫_{|x-μ|≥c} |x - μ|² f(x) dx ≥ c² ∫_{|x-μ|≥c} f(x) dx = c² P( |X - μ| ≥ c ).
Chebychev is used to give us the Weak Law of Large Numbers which tells us that the mean of
a random sample should converge to the population mean as the sample size goes to infinity.
Theorem 2.65 Weak Law of Large Numbers. Let X_1, ..., X_n be a random sample, i.e., in-
dependent and all having the same distribution as the rv X, which has finite mean EX = μ and
finite variance σ² = Var(X). Then, for any constant c > 0, with X̄ = (X_1 + ... + X_n)/n,

    lim_{n→∞} P( |X̄ - μ| ≥ c ) = 0.

Proof. We know EX̄ = μ and Var(X̄) = σ²/n. By Chebychev's inequality,

    P( |X̄ - μ| ≥ c ) ≤ Var(X̄)/c² = σ²/(nc²) → 0 as n → ∞.

    The Strong Law of Large Numbers, which is beyond the scope of this book, says X̄_n → μ
in a much stronger way than the Weak Law says, so we are comfortable that the sample means
do converge to the population mean.
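A small simulation (illustrative, not from the text) makes the Weak Law concrete: the chance that X̄ misses μ by at least c shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
c = 0.1
probs = []
# Unif(0,1) population, so mu = 0.5; estimate P(|Xbar - mu| >= c):
for n in (10, 100, 1000):
    xbar = rng.uniform(0, 1, size=(5000, n)).mean(axis=1)
    probs.append(np.mean(np.abs(xbar - 0.5) >= c))
print(probs)   # probabilities shrink toward 0 as n grows
```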
Then Y ~ χ²(k). That is, a χ²(k) rv is the sum of the squares of k independent standard normal rvs.
In fact, if we look at the mgf of Y, we have, using Example 2.43 and independence,

    M_Y(t) = Π_{i=1}^{k} M_{Z_i²}(t) = ( 1/√(1-2t) )^k = ( 1/(1-2t) )^{k/2},   t < 1/2,
2.6. χ²(k), STUDENT'S t- AND F-DISTRIBUTIONS 49
which is the mgf of a χ²(k) rv, which may also be derived directly from the density. From the mgf
it is easy to see that EY = k and Var(Y) = 2k. The main properties of Y are the following.
Remark 2.66 If X ~ χ²(n), Y ~ χ²(m), and X, Y are independent, then X + Y ~ χ²(n + m).
To see why,

    M_{X+Y}(t) = E e^{t(X+Y)} = M_X(t) M_Y(t) = (1/(1-2t))^{n/2} (1/(1-2t))^{m/2} = (1/(1-2t))^{(n+m)/2}

for t < 1/2. Since distributions are uniquely determined by their mgf, and the mgf of X + Y is
the mgf of a χ²(n + m) rv, we know that X + Y ~ χ²(n + m).
Remark 2.67 The χ²(n) distribution is not symmetric. Therefore, if we want to find a, b so
that P(a < χ²(n) < b) = 1 - α for some given 0 < α < 1, we set it up so the area to the right
of b is α/2 and the area to the left of a is also α/2. Using a TI-8x calculator, the commands are
a = invchi(n, 1 - α/2) and b = invchi(n, α/2), where the second parameter is the area desired to
the right of a or b. The program to get this is based on Newton's method for solving
χ²cdf(0, x, n) = 1 - A for x, where A is the desired area to the right.
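The invchi(n, area-to-the-right) command corresponds to chi2.isf in SciPy (a sketch; SciPy is assumed, and the TI program itself is not reproduced here):

```python
from scipy.stats import chi2

n, alpha = 24, 0.05
a = chi2.isf(1 - alpha / 2, n)   # area alpha/2 to the LEFT of a
b = chi2.isf(alpha / 2, n)       # area alpha/2 to the right of b
# The middle area between a and b is then 1 - alpha:
print(round(a, 3), round(b, 3), round(chi2.cdf(b, n) - chi2.cdf(a, n), 3))
```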
    T(k) = Z / √( χ²(k)/k ),   Z ~ N(0,1),     (2.3)

where Z and the χ²(k) rv are independent. We say that T has a Student's t-distribution with k
degrees of freedom.
Remark 2.68 We will come to this later, but this will come from looking at the sample mean
divided by the sample standard deviation:

    T = (X̄ - μ)/(S/√n),   X̄ = (X_1 + ... + X_n)/n,   S = √( (1/(n-1)) Σ_{i=1}^{n} (X_i - X̄)² ).

This rv will have a t-distribution with n - 1 degrees of freedom. Here are the main properties
of the t-distribution:

    ET = 0;   Var(T) = k/(k-2),   k > 2.
2.6.3 F-DISTRIBUTION
This is also a distribution arising in hypothesis testing in statistics, as the quotient of two inde-
pendent χ² rvs, each divided by its degrees of freedom. In particular,

    F = ( χ²(k_1)/k_1 ) / ( χ²(k_2)/k_2 ).

If X ~ F(k_1, k_2), then

    EX = k_2/(k_2 - 2),  k_2 > 2,   and   Var(X) = 2 ( k_2/(k_2 - 2) )² · (k_1 + k_2 - 2)/( k_1 (k_2 - 4) ),  k_2 > 4.
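The moment formulas for the t- and F-distributions can be checked against SciPy's built-in moments (a sketch with arbitrary illustrative degrees of freedom):

```python
from scipy.stats import f, t

k1, k2 = 5, 10
mean, var = f.stats(k1, k2, moments='mv')
mean_formula = k2 / (k2 - 2)
var_formula = 2 * (k2 / (k2 - 2))**2 * (k1 + k2 - 2) / (k1 * (k2 - 4))
print(float(mean), mean_formula)       # F-distribution mean
print(float(var), var_formula)         # F-distribution variance
print(float(t.var(7)), 7 / (7 - 2))    # t-distribution: Var(T) = k/(k-2), k = 7
```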
2.7 PROBLEMS
2.1. We roll two dice and X is the difference between the larger and the smaller of the two
     numbers rolled. Find R(X), the pmf, and the cdf of X, and then find P(0 < X ≤ 3)
     and P(1 ≤ X < 3). Hint: Use the sample space.
2.2. Suppose that the distribution function of X is given by

         F(b) = 0,                0 ≤ ... b < 0
                b/4,              0 ≤ b < 1
                1/2 + (b-1)/4,    1 ≤ b < 2
                11/12,            2 ≤ b < 3
                1,                b ≥ 3.

     (a) Find P(X = i), i = 1, 2, 3.
     (b) Find P(1/2 < X < 3/2).
     (a) of e^X?
     (b) of the random variable aX + b, where a and b are nonzero constants?
     (a) f(x) = c/x, x = 1, 2, 3, ..., n.
     (b) f(x) = c/((x+2)(x+3)), x = 0, 1, 2, 3, .... Hint: Use partial fractions.
2.5. Let X be a continuous random variable with pdf

         f(x) = 3/4,  0 ≤ x ≤ 1
                1/4,  2 ≤ x ≤ 3
                0,    otherwise.
2.22. Jensen, arriving at a bus stop, just misses the bus. Suppose that he decides to walk if the
(next) bus takes longer than 5 minutes to arrive. Suppose also that the time in minutes
      between the arrivals of buses at the bus stop is a continuous random variable with a
      Unif(4, 6) distribution. Let X be the time that Jensen will wait.
2.26. The time (in hours) required to repair a machine is an exponentially distributed random
      variable with parameter λ = 1.2.
2.27. The number of years a radio functions is exponentially distributed with parameter λ =
      1/18. If Jones buys a used radio, what is the probability that it will work for an additional
      8 years?
2.28. A patient has insurance that pays $1,500 per day up to 3 days and $1,000 per day after
      3 days. For typical illnesses the number of days in the hospital, X, has the pmf p(k) =
      (7 - k)/21, k = 1, 2, ..., 7. Find the expected amount the insurance company will pay.
2.29. Let X ~ N(μ, σ). Use substitution and integration by parts to verify E(X) = μ and
      SD(X) = σ. That is, verify Example 2.36.
2.30. An investor has the option of investing in one of two stocks. If he buys Stock A he can
net $500 with probability 1/2, and lose $100 with probability 1/2. If he buys Stock B,
he can net $1,500 with probability 1/4 and lose $200 with probability 3/4.
      (a) Find the mean and SD for each investment. Which stock should he buy based on
          the coefficient of variation defined by SD/μ?
(b) What is the interpretation of the coefficient of variation?
      (c) The value of x dollars is worth g(x) = √(x + 200) to the investor. This is called a
          utility function. What is the expected utility to the investor for each stock?
2.31. Suppose an rv has density f(x) = 2x, 0 < x < 1, and 0 otherwise. Find
      (a) P(X < 1/2), P(1/4 < X ≤ 1/2), P(X < 3/4 | X > 1/2), and
      (b) E[X], SD[X], E[e^{tX}].
2.32. An rv has pdf f(x) = 5e^{-5x}, 0 ≤ x < ∞, and 0 otherwise. Find E[X], Var[X], med[X].
2.33. Find the mgf of X ~ Geometric(p). (Hint: Σ_{k=1}^{∞} a^k = a/(1-a), |a| < 1.) Use it to find
      the mean and variance of X.
2.34. Find the mgf of X ~ Poisson(λ). (Hint: Σ_{k=0}^{∞} a^k/k! = e^a.) Use it to find the mean and
      variance of X.
2.35. Suppose X has the mgf M_X(t) = 0.09e^{-2t} + 0.24e^{-t} + 0.24e^t + 0.09e^{2t} + 0.34.
2.40. Find EX and Var(X) for the rv X with the following densities.
      (a) P(X = k) = 1/n, k = 1, 2, ..., n.
      (b) f_X(x) = r x^{r-1}, 0 < x < 1, r > 0.
2.41. Let N be the number of Bernoulli trials (meaning independent and only one of two
      outcomes possible) to get r successes. N is a negative binomial rv with parame-
      ters r, p, written NegBinom(r, p), and we know P(N = k) = (k-1 choose r-1) p^r (1-p)^{k-r},
      k = r, r+1, r+2, .... If you think of the process restarting after each success is obtained,
      it is reasonable to write N = Y_1 + ... + Y_r, where the Y_i's are independent geometric
      rvs. Use this to find the mgf of N and then find EN, Var(N).
2.42. Let X be Hypergeometric(N, n, k). Let p = k/N. It can be shown that EX = np
      and Var(X) = np(1-p) · (N-n)/(N-1). This looks like the same mean and variance
      of a Binomial(n, p) with p = k/N, except for the extra term (N-n)/(N-1). This term
      is known as the correction factor. We know that Binomial(n, p) can be ap-
      proximated by Normal(np, √(np(1-p))). What is the approximate distribution of
      Hypergeometric(N, n, k)?
2.43. We throw a coin until a head turns up for the second time, where p is the probability that
      a throw results in a head and we assume that the outcome of each throw is independent
      of the previous outcomes. Let X be the number of times we have to throw the coin.
      (a) Determine P(X = 2), P(X = 3), and P(X = 4).
      (b) Show that P(X = n) = (n-1) p² (1-p)^{n-2}, for n ≥ 2.
      (c) Find EX.
2.44. Suppose P(X = 0) = 1 - P(X = 1) and E(X) = 3Var(X). Find P(X = 0).
2.45. If EX = 1 and Var(X) = 5, find E[(2 + X)²] and Var(4 + 3X).
2.46. Worldwide major mainframe sales average 3.5 per month and follow a Poisson distribution.
      Find
      (a) the probability of at least 2 sales in the next month,
      (b) the probability of at most one sale in the next month, and
      (c) the variance of the monthly number of sales.
2.47. A batch of 100 items has 6 defective and 94 good. If X is the number of defectives in a
      randomly drawn sample of 10, find P(X = 0), P(X > 2), EX, Var(X).
2.48. An insurance company sells a policy with a 1 unit deductible. Let X, the amount of
      the loss, have pmf

          f(x) = 0.9,  x = 0
                 c/x,  x = 1, 2, 3, 4, 5, 6.

      Find c and the expected total amount the insurance company has to pay out.
2.49. Find EX and E[X(X-1)] if X has pmf f(x) = (4 choose x) (1/2)^4, x = 0, 1, 2, 3, 4.
2.50. Choose a so as to minimize E[|X - a|].
CHAPTER 3
#1 #2 #3 #4 #5 …
The box contains one ticket for each individual in the population and the number on the ticket
is the item of interest to the experimenter. Think of a random sample as choosing tickets from
the population box with replacement (so the box remains the same on each draw). If we don’t
replace the tickets, the box changes after each draw and the random variables would no longer be
independent or have the same distribution as X . Each time we take a sample from the popula-
tion we are getting values X1 D x1 ; X2 D x2 ; : : : ; Xn D xn and these values .x1 ; x2 ; : : : ; xn / are
specific observed values of the random variables .X1 ; : : : ; Xn /. In general, lowercase variables
will be observed variables, while uppercase represents random variables before observation.
60 3. DISTRIBUTIONS OF SAMPLE MEAN AND SAMPLE SD
Once we have a random sample we want to summarize the values of the random variables
to obtain some information.
Definition 3.1 Given any collection of random variables X_1, X_2, ..., X_n, the sample mean is
X̄ = (X_1 + X_2 + ... + X_n)/n. The sample variance is S² = (1/(n-1)) Σ_{i=1}^{n} (X_i - X̄)², and the
sample standard deviation is S. The sample median, assuming the random variables are sorted, is

    X̃ = X_{(n+1)/2},                if n is odd;
         (X_{n/2} + X_{n/2+1})/2,    if n is even.

    Any function g(X_1, ..., X_n) of the random sample is said to be a statistic. For example,
X̄, S, and X̃ are examples of statistics.
Example 3.3 Consider the population box [0 1 2 3 4]. This is another way of saying
the population rv is discrete with R(X) = {0, 1, 2, 3, 4} and P(X = k) = 1/5, k = 0, 1, 2, 3, 4.
We will choose random samples of size 2 from this population, without replacement. The pop-
ulation mean of the numbers on the tickets is μ = E(X) = (0+1+2+3+4)/5 = 2, with population
variance σ² = E(X - 2)² = (1/5) Σ_{i=1}^{5} (x_i - 2)² = 2.
    How many samples of size 2 are there? Since we don't care about the order, we are asking
how many combinations of the 5 numbers taken 2 at a time there are, and that is (5 choose 2) = 10.
Here are the 10 possible samples of size 2:

    (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
the same as the population mean μ. This is not a coincidence. Next, the variance of the sample
mean is

    Var(X̄) = Σ_{i=1}^{7} (x̄_i - 2)² P(X̄ = x̄_i) = 0.75.

The population variance is 2, but the variance of the sample mean is 0.75. The variance of the
sample mean is always lower than the variance of the individuals in the population, as we will
show in the next theorem. Averages have lower variation.
The theorem will show us how to calculate E(X̄) and SD(X̄) directly from the population
mean, SD, and sample size.

Theorem 3.4 Let X_1, ..., X_n be a random sample from a population with mean μ and variance σ². Then

    E[X̄] = μ,   Var(X̄) = σ²/n.

If the population size, N, is finite, and the sampling is done without replacement, then

    E[X̄] = μ,   Var(X̄) = (σ²/n) · (N-n)/(N-1).

Proof. Since the X_i are independent, each with variance σ²,

    Var(X̄) = (1/n²) [ Var(X_1) + ... + Var(X_n) ] = nσ²/n² = σ²/n.

The rest of the proof, when sampling is done without replacement, is based on the Hypergeometric
distribution and is omitted.
Definition 3.5 The Standard Error (SE) of the sample mean is SE = SD(X̄) =
√( E(X̄ - E(X̄))² ). When we are sampling from an infinite population, or from a finite popula-
tion but with replacement, then SE(X̄) = σ/√n. When we are sampling from a finite population
without replacement, then

    SE(X̄) = (σ/√n) · √( (N-n)/(N-1) ).

The term √( (N-n)/(N-1) ), when we have a finite population and sampling is without replacement,
is called the finite population correction factor for the standard error of the sample mean.
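The two SE formulas differ only by the correction factor; a small helper (illustrative, not from the text) makes the comparison concrete:

```python
import math

def se_mean(sigma, n, N=None):
    """SE of the sample mean; apply the finite population correction
    factor when N is given (sampling without replacement)."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(se_mean(10, 25))           # infinite population (or with replacement)
print(se_mean(10, 25, N=100))    # the correction factor shrinks the SE
```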
Remark 3.6 Theorem 3.4 shows that EX̄ = μ. Whenever we have an estimator, in this case
X̄, of a parameter inherent to the population rv, in this case μ, and we know EX̄ = μ, we say
that the statistic X̄ is an unbiased estimator of μ.

    Sample means are calculated using X̄ = (X_1 + ... + X_n)/n. Sometimes we are interested
in the distribution of the sum S_n = X_1 + ... + X_n. Here is the result.

    E[S_n] = nμ and Var(S_n) = nσ², and thus SE(S_n) = σ√n. If we have a finite population
and we sample without replacement, E[S_n] = nμ but

    SE(S_n) = σ√n · √( (N-n)/(N-1) ).

Proof. In Proposition 2.61 we see that S_n = X_1 + ... + X_n ~ N(nμ, σ√n). Therefore, X̄ =
S_n/n ~ N(μ, σ/√n).
    Now that we know X̄ ~ N(μ, σ/√n), we may standardize X̄ to see that Z = (X̄ - μ)/(σ/√n)
is N(0,1). We can answer questions about the chances the sample mean lies in a particular
interval, P(a ≤ X̄ ≤ b). Here's an example.
Example 3.9 Suppose IQs of a population are normally distributed N(100, 10). We know
X̄ ~ N(100, 10/√n). If we take a random sample of n = 25 people, we find

(a) P(95 < X̄ < 105) = normalcdf(95, 105, 100, 10/√25) = 0.9875.

(b) P(X̄ > 120) = normalcdf(120, ∞, 100, 10/√25) = 7.7 × 10^{-24} ≈ 0.

(c) P(X̄ < 98) = normalcdf(-∞, 98, 100, 10/√25) = 0.1586.

You can see that the sample mean X̄ has much less variability than X itself since, for example,
P(95 < X < 105) = normalcdf(95, 105, 100, 10) = 0.3829. This says about 38% of individuals
have IQs between 95 and 105, but 98.75% of samples of size 25 will result in a sample mean
between 95 and 105. Individuals may vary a lot, but averages don't.
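The normalcdf computations in this example have direct SciPy analogues (a sketch assuming SciPy is available):

```python
import math
from scipy.stats import norm

se = 10 / math.sqrt(25)    # SE of the sample mean, n = 25
p_mean = norm.cdf(105, 100, se) - norm.cdf(95, 100, se)   # sample mean
p_ind = norm.cdf(105, 100, 10) - norm.cdf(95, 100, 10)    # one individual
print(round(p_mean, 4), round(p_ind, 4))
```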
Many underlying populations do not follow a normal distribution. For instance, the distri-
butions of annual incomes, or times between arrivals of customers, are not normally distributed.
Now we ask the question what happens to the sample mean when the underlying population
does not follow a normal distribution.
    Suppose we know the probability of success is p, 0 < p < 1. This also means the percentage
of 1s in the population will be 100p%. The population rv is X ~ Bernoulli(p) and we take a
random sample X_1, ..., X_n from X. Each X_i is either 0 or 1, S_n = X_1 + ... + X_n is the
total number of 1s in our sample, and X̄ = S_n/n is the fraction of 1s in the sample. It is also true
that S_n is exactly Binom(n, p). By Theorem 2.63 we also know

    S_n ≈ N( np, √(np(1-p)) )   and   X̄ ≈ N( p, √( p(1-p)/n ) ).
3.1. POPULATION DISTRIBUTION KNOWN 65
Example 3.11 Suppose a gambler bets $1 on Red in a game of roulette. The chance he wins
is p = 18/38, and the bet pays even money (which means if red comes up he is paid his $1 plus an
additional $1). He plans to play 50 times.
    First, how many games do we expect to win? That is given by ES_50 = 50 · (18/38) = 23.68,
and we will expect to lose 50 - 23.68 = 26.32 games. Since the bet pays even money and we are
betting $1 on each game, we expect to lose $26.32 on the losing games.
    How far off this number do we expect to be, i.e., what is the SE for the number of games
won and lost? The SE is SD(S_50) = √( 50 · (18/38)(1 - 18/38) ) = 3.53, so we expect to win 23.68
games, give or take about 3.53 games. Put another way, we expect to lose $26.32, give or take $3.53.
    (a) What is the chance he wins 40% of the games, i.e., 20 games?
    We are looking for P(X̄ = 0.4), which is equivalent to P(S_50 = 20). We'll do this in two
ways, to get the exact result and then using the normal approximation.
    First, since S_50 ~ Binom(50, 18/38) ≈ N(23.68, 3.53), we get the exact value

    P(S_50 = 20) = (50 choose 20) (18/38)^{20} (20/38)^{30} = binompdf(50, 18/38, 20) = 0.066.

Note that the continuity correction would use 19.5 ≤ S_50 ≤ 20.5 ⟹ 0.39 ≤ X̄ ≤ 0.41. Also,
np = 50 · 18/38 > 5 and n(1-p) = 50 · 20/38 > 5, so a normal approximation may be used.
    (b) What is the chance he wins no more than 40% of the games?
    Now we want P(X̄ ≤ 0.4), or P(S_50 ≤ 20). We have the exact answer

    P(S_50 ≤ 20) = Σ_{k=0}^{20} (50 choose k) (18/38)^k (20/38)^{50-k} = binomcdf(50, 18/38, 20) = 0.1837.

The approximate answer is either given by P(S_50 ≤ 20) ≈ normalcdf(0, 20.5, 23.68, 3.53) =
0.18383, or P(X̄ ≤ 0.4) = normalcdf(0, 0.4, .473, .0706) = .1505. If we use the continuity cor-
rection, it will be P(X̄ ≤ 0.4) = normalcdf(0, 0.41, .473, .0706) = .18610.
    (c) What is the chance the gambler comes out ahead?
    This is asking for P(S_50 ≥ 25), or P(X̄ ≥ 0.5). Using the same procedure as before, we get
the exact answer
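The exact binomial value and its continuity-corrected normal approximation can be sketched as follows (SciPy assumed; binom.sf(24, ...) is P(S_50 ≥ 25)):

```python
import math
from scipy.stats import binom, norm

p = 18 / 38
mu, se = 50 * p, math.sqrt(50 * p * (1 - p))
exact = binom.sf(24, 50, p)        # P(S50 >= 25), exact binomial tail
approx = norm.sf(24.5, mu, se)     # normal approx with continuity correction
print(round(exact, 4), round(approx, 4))
```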
    The result for Bernoulli populations, i.e., when X is either 0 or 1, extends to a more general
case when we assume the population takes on any two values. Here is the result.

Theorem 3.12 Suppose the population is X = a with probability p and X = b with probability
1 - p, and we have a random sample X_1, ..., X_n from the population. Then

(a) EX = ap + b(1-p) and Var(X) = (a-b)² p(1-p).

(b) ES_n = E(X_1 + ... + X_n) = n(ap + b(1-p)) and Var(S_n) = n(a-b)² p(1-p).
This results in EX = -0.0526, SD(X) = 0.9986. Therefore, the mean expected winnings in
50 plays is ES_50 = -2.63 with SE, SD(S_50) = 7.06. In 50 plays we expect to lose 2.63
dollars, give or take 7.06. Now we ask what are the chances of losing more than $4? This
is P(S_50 < -4) ≈ normalcdf(-∞, -4, -2.63, 7.06) = 0.423. Using the continuity correction,
P(S_50 < -4) ≈ normalcdf(-∞, -4.5, -2.63, 7.06) = 0.395.
Example 3.14 A multiple choice exam has 20 questions with 4 possible answers for each ques-
tion. A student will guess the answer for each question by choosing an answer at random. To
penalize guessing, the test is scored as +2 for each correct answer and -1 for each incorrect
answer.
(a) What is the students’s expected score and the SE for the score?
Answering a question is just like choosing at random 1 out of 4 possible tickets from the box
-1 -1 -1 2 . Since there are 20 questions, we will do this 20 times and total up the num-
bers on the tickets. The population mean is .2/ 14 C . 1/ 34 D 1
and the population SD is
q 4
.2 . 1// 14 34 D 1:299.
p
The expected score is ES20 D 20. 14 / D 5 with SE D SD.S20 / 20 1:299 D 5:809: The best
estimate for the student’s score is 5; give or take about 6.
(b) Find the approximate chance the student scores a 5 or greater.
Since S20 N. 5; 5:809/ we have P .S20 4:5/ D 0:0509.
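This example can be checked numerically (a sketch; the exact tail uses the fact that with C correct answers the score is 3C − 20, a relation not spelled out in the text):

```python
import math
from scipy.stats import binom, norm

n, p = 20, 1 / 4
mu = n * (2 * p + (-1) * (1 - p))                         # expected score, -5
se = math.sqrt(n) * (2 - (-1)) * math.sqrt(p * (1 - p))   # about 5.809
approx = norm.sf(4.5, mu, se)                             # text's approximation
exact = binom.sf(8, n, p)   # score >= 5  <=>  3C - 20 >= 5  <=>  C >= 9
print(round(approx, 4), round(exact, 4))
```

The exact tail is somewhat smaller than the normal approximation here because the score jumps in steps of 3, so the continuity cutoff 4.5 is only a rough choice.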
But we have the factor n/(n-1), which shouldn't appear, so it must come from the substitution of
X̄ for μ. Here is what we have to do:

    E( (1/(n-1)) Σ_{i=1}^{n} (X_i - X̄)² ) = (1/(n-1)) E( Σ_{i=1}^{n} ((X_i - μ) + (μ - X̄))² )
        = (1/(n-1)) E( Σ_{i=1}^{n} [ (X_i - μ)² + 2(X_i - μ)(μ - X̄) + (μ - X̄)² ] )
        = (1/(n-1)) [ nσ² + nE(X̄ - μ)² + 2E( (μ - X̄) Σ_{i=1}^{n} (X_i - μ) ) ]
        = (1/(n-1)) [ nσ² + n·(σ²/n) + 2E( (μ - X̄)(nX̄ - nμ) ) ] = (1/(n-1)) [ nσ² + σ² - 2n E(X̄ - μ)² ]
        = (1/(n-1)) [ nσ² + σ² - 2n·(σ²/n) ] = (σ²/(n-1)) (n + 1 - 2) = σ².
Theorem 3.16 If we have a random sample from a normal population X ~ N(μ, σ), then
((n-1)/σ²) S² ~ χ²(n-1). In particular, ES² = (σ²/(n-1)) E[χ²(n-1)] = σ², and

    Var(S²) = Var( (σ²/(n-1)) χ²(n-1) )
            = (σ⁴/(n-1)²) Var(χ²(n-1)) = (σ⁴/(n-1)²) · 2(n-1) = 2σ⁴/(n-1).
3.2. POPULATION VARIANCE UNKNOWN 69
    We will skip the proof of this theorem, but just note that if we replaced X̄ by μ, we would
have

    W² = (1/(n-1)) Σ_{i=1}^{n} (X_i - μ)² = (σ²/(n-1)) Σ_{i=1}^{n} ( (X_i - μ)/σ )² = (σ²/(n-1)) Σ_{i=1}^{n} Z_i²,

where Z_i ~ N(0,1), i = 1, 2, ..., n, is a set of n independent standard normal rvs. But we know
that in this case Σ_{i=1}^{n} Z_i² ~ χ²(n). Consequently, ((n-1)/σ²) W² ~ χ²(n). That looks really
close to what the theorem claims, and we would be done except for the fact that W² ≠ S² and
χ²(n) ≠ χ²(n-1). Replacing μ by X̄ accounts for the difference. We omit the details.
Example 3.17 Suppose we have a population X ~ N(μ, 10) and we choose a random sample
X_1, ..., X_25 from X. Letting S² denote the sample variance, we want to find the following.

(a) P(S² > 50). Using the theorem we have (n-1)S²/σ² ~ χ²(24), so that P(S² > 50) =
P( 24S²/100 > 24·50/100 ) = P( χ²(24) > 12 ) = χ²cdf(12, ∞, 24) = 0.9799.

(b) P(75 < S² < 125) = P( 0.24·75 < χ²(24) < 0.24·125 ) = 0.61825.

(c) ES² = σ² = 100; Var(S²) = 2σ⁴/(n-1) = 10,000/12.
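The χ²cdf computations in this example have SciPy analogues (a sketch assuming SciPy is available):

```python
from scipy.stats import chi2

df, sigma2 = 24, 100
a = chi2.sf(df * 50 / sigma2, df)                           # P(S^2 > 50)
b = chi2.cdf(df * 125 / sigma2, df) - chi2.cdf(df * 75 / sigma2, df)
print(round(a, 4), round(b, 5), 2 * sigma2**2 / df)         # last: Var(S^2)
```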
n 1
Now we know that if was known and we had a random sample from a normal population
X p
X N.; /, then = n
N.0; 1/: If is unknown, this is no longer true. What we want is to
replace by S , which is determined from the sample and not the population. That makes the
denominator also depend on the random sample. Consider the rv
X X
p p
X = n = n Z N.0; 1/
T D D r Ds Dr Dr :
S S 2 2
.n 1/ 2 .n 1/
p .n 1/= 2 S 2
n 2 n 1 n 1
.n 1/
From (2.3) we see that T T .n 1/ has a Student’s t -distribution distribution with n 1 de-
grees of freedom. We summarize this result.
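A simulation (illustrative, not from the text) confirms that the T statistic built from normal samples follows t(n-1):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
mu, sigma, n = 100, 10, 5
x = rng.normal(mu, sigma, size=(20_000, n))
T = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

emp = np.mean(T > 2.132)      # 2.132 is about the upper 0.05 point of t(4)
print(round(emp, 3), round(t.sf(2.132, n - 1), 3))
```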
Everything would be fine except for the fact that we don't know p. We have no choice but to use
the information available to us and replace p by the observed sample proportion P̂ = X̄. The
SE is then approximated by SD(P̂) ≈ √( P̂(1-P̂)/n ), and

    Z = (P̂ - p)/√( p(1-p)/n ) ~ N(0,1),   or   P̂ ≈ N( p, √( p(1-p)/n ) ).
Theorem 3.20 Let X and Y be two random variables with finite means and variances, let
X_1, X_2, ..., X_n be a random sample from X and Y_1, Y_2, ..., Y_m be an independent random
sample from Y, and let μ_X = EX, μ_Y = EY, σ_X = SD(X), and σ_Y = SD(Y).

(a) E(X̄_n) = μ_X, SD(X̄_n) = σ_X/√n, and E(Ȳ_m) = μ_Y, SD(Ȳ_m) = σ_Y/√m.

(b) E(X̄_n - Ȳ_m) = μ_X - μ_Y and SD(X̄_n - Ȳ_m) = √( σ_X²/n + σ_Y²/m ).

(c) If X ~ N(μ_X, σ_X) and Y ~ N(μ_Y, σ_Y), then X̄_n - Ȳ_m ~ N( μ_X - μ_Y, SD(X̄_n - Ȳ_m) ).

(d) For large enough n, m, X̄_n - Ȳ_m ≈ N( μ_X - μ_Y, SD(X̄_n - Ȳ_m) ).

(e) If X ~ Bernoulli(p_X) and Y ~ Bernoulli(p_Y), then, if np_X ≥ 5, n(1-p_X) ≥ 5 and
mp_Y ≥ 5, m(1-p_Y) ≥ 5,

    X̄_n - Ȳ_m ≈ N( p_X - p_Y, √( p_X(1-p_X)/n + p_Y(1-p_Y)/m ) ).

(f) Under the same conditions as (e),

    S_n - S_m ≈ N( np_X - mp_Y, √( np_X(1-p_X) + mp_Y(1-p_Y) ) ).

(g) If the sampling is done without replacement from finite populations, the correction factors
are used to adjust the SDs.

(h) If the SDs σ_X and σ_Y are unknown, replace the normal distributions with the
t-distributions in a similar way as the one-sample cases.

    The only point we need to verify is the formulas for the SD of the difference. This follows
from independence:

    Var(X̄_n - Ȳ_m) = Var(X̄_n) + Var(Ȳ_m).
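Part (b) can be wrapped in a small helper (illustrative, with numbers in the spirit of Problem 3.32 below):

```python
import math

def sd_diff(sd_x, n, sd_y, m):
    """SD of Xbar_n - Ybar_m for independent samples (Theorem 3.20(b))."""
    return math.sqrt(sd_x**2 / n + sd_y**2 / m)

# sigma_X = 200, n = 125; sigma_Y = 100, m = 125:
print(sd_diff(200, 125, 100, 125))   # -> 20.0
```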
3.3 PROBLEMS
3.1. Consider the population box [0 1 2 3 4]. We will choose random samples of size
     2 from this population, with replacement.
     (a) Find all samples of size 2. Find the distribution of X̄.
     (b) Find E(X̄) and SD(X̄) using Theorem 3.4 and directly from the first part.
3.2. Consider the population box [7 1 2 3 4]. We will choose random samples of size
     2 from this population, with replacement.
     (a) Find all samples of size 2. Find the distribution of X̄.
     (b) Find E(X̄) and SD(X̄) using Theorem 3.4 and directly from the first part.
     (c) Repeat the first two parts if the samples are drawn without replacement.
3.3. Suppose an investor has 3 types of investments: 20% is at $40, 35% is at $55, and 45%
     is at $95. If the investor randomly selects 2 of these investments for sale, what is the
     distribution of the average sale price, i.e., P(X̄ = k)? Then find E(X̄) and SD(X̄).
3.4. The distribution of X is given by

         k          4    5    6    7    8
         P(X = k)  1/9  2/9  3/9  2/9  1/9
3.5. A manufacturer has six different devices; device i D 1; : : : ; 4 has i defects and devices
5 and 6 have 0 defects. The inspector chooses at random 2 different devices to inspect.
(a) What is the expected total number of defects of the two sample devices? What is
the SE for the total?
(b) What is P .X D 3/?
3.6. The IQs of 1000 students have an average of 105.5 with an SD of 11. IQs approximately
follow a normal distribution. Suppose 150 random samples of size 25 are taken from
this population. Find
3.7. Let X_1, X_2, ..., X_50 be a random sample (so independent and identically distributed)
     with μ = 1/4 and σ = 1/3. Use the CLT to estimate P(X_1 + ... + X_50 < 10).
3.8. The mean life of a cell phone is 5 years with an SD of 1 year. Assume the lifetime follows
a normal distribution. Find
3.9. In the 1965 case of Swain v Alabama, an African-American man appealed to the U.S.
Supreme Court his conviction on a charge of rape on the basis of the fact that there
were no African-Americans on his jury. At that time in Alabama, only men over the
age of 21 were eligible to serve on a jury. Jurors were selected from a panel of 100
ostensibly randomly chosen men chosen from the county. Census data at that time
showed that 16% of the eligible members to be selected for the panel were African-
Americans. Of the 100 panel members for the Swain jury, 8 were African-American
but were not chosen for the jury because of challenges by the attorneys. Use the Central
Limit Theorem to approximate the chance that 8 or fewer African-Americans would
be on a panel selected at random from the population.
3.10. A random sample X_1, ..., X_150 is drawn from a population with mean μ = 25 and
      σ = 10. The population distribution is unknown. Let A be the sample mean of the first
      50 and B be the sample mean of the remaining 100.
      (a) What are the approximate distributions of A and B from the CLT?
      (b) Use the CLT to find P(19 ≤ A ≤ 26) and P(19 ≤ B ≤ 26).
3.11. An elevator in a hotel can carry a maximum weight of 4000 pounds. The weights of the
      customers at the hotel are normally distributed with mean 165 pounds and SD = 12
pounds. How many passengers can get on the elevator at one time so that there is at
most a 1% chance it is overloaded? If the weights are not normally distributed will your
answer be approximately correct? Explain.
3.12. A random sample of 25 circuit boards is chosen and their mean life, X̄, is calculated.
      The true distribution of the length of life is Exponential with a mean of 5 years. Use the
      CLT to approximate the probability P(|X̄ - 5| ≤ 0.5).
3.13. You will flip a fair coin a certain number of times.
      (a) Use both the exact Binomial(n, p) distribution and the normal approximation to
          find the probabilities P(X ≤ n/2), P(X = n/2) and compare the results. Use n =
          10, 20, 30, 40 and p = 0.5. Here X is the number of heads.
      (b) Find P(X = 2) when n = 4 using the Binomial and the Normal approximation.
          Note that np < 5, n(1-p) < 5.
3.14. In a multiple choice exam there are 25 questions each with 4 possible answers only one
of which is correct. If a student takes the exam and randomly guesses the answer for
      each problem, what is the exact and approximate chance the student will get at least 10
      correct?
3.15. Five hundred people will each toss a coin which is suspected to be loaded. They will
each toss the coin 120 times.
(a) If it is a fair coin how many people will we expect to get between 40 and 60%
heads?
(b) If in fact 453 people get between 40 and 60% heads, and you declare the coin to
be fair, what is the probability you are making a mistake?
3.16. A candidate in an election received 44% of the popular vote. If a poll of a random sample
of size 250 is taken find the approximate probability a majority of the sample would be
for the candidate.
3.17. Another possible bet in roulette is to bet on a group of 4 numbers; so if any one of
the numbers comes up the gambler wins. The bet pays $8 for every $1 bet. Suppose a
gambler will play 25 times betting $1 on a group each time.
      (a) P(T_7 < 2.35).
      (b) P(T_22 > 1.33).
      (c) P(-1.45 < T_12 < 2.2).
      (d) the number c so that P(T_12 > c) = 0.95.

      (a) Find EY and Var(Y).
      (b) Suppose P(Y > a) = 0.05, P(Y < b) = 0.1, P(Y < c) = 0.9, and P(Y > d) = 0.95.
          Find a, b, c, and d.
      (c) Suppose P(a < Y < b) = 0.8. Find a, b so that the two tails have the same area.
      (d) Suppose P(χ²(1) < a) = 0.23 and P(Z² < b) = 0.23. Find a, b and determine if
          a = b².
3.21. A random sample of 400 people was taken from the population of factory workers.
213 worked for factories with 100 or more employees. Find the probability of getting a
      sample proportion of 0.5325 or more, when the true proportion is known to be 0.51.
3.22. Ten measurements of the diameter of a ball resulted in a sample mean x̄ = 4.38 cm
      with sample SD s = 0.08. Given that the mean diameter should be μ = 4.30, find the
      probability P(|X̄ - 4.30| > 0.08).
3.23. The heights of a population of 300 male students at a college are normally distributed
with mean 68 inches. Suppose 80 male students are chosen at random and their sample
      average is 66 inches with a sample SD of 3 inches. What are the chances of drawing
      such a sample with x̄ ≤ 66 if the sample is drawn with or without replacement?
3.24. A real estate office takes a random sample of 35 rental units and finds a sample average
rent paid of $1,200 with a sample SD of $325. The rents do not follow the normal
curve. Find the probability the sample average would be $1,200 or more if the actual
mean population rent is $1,250.
3.25. A random sample of 1000 people is taken to estimate the percentage of Republicans in
the population. 467 people in the sample claim to be Republicans. Find the chance that
the percentage of Republicans in a sample of size 1000 will be in the range from 45%
to 48%.
3.26. Three organizations take a random sample of size 40 from a Bernoulli population with
      unknown p. The number of 1s in the first sample is 8, in the second sample it is 10, and in
      the third sample it is 13. Find the estimated population proportion p̂ for each sample
      along with the SE for each sample.
3.27. Given a random sample X_1, ..., X_n from a population with E(X) = μ and Var(X) = σ²,
      find the mean and variance of the estimator μ̂ = (X_1 + X_n)/2.
3.28. A random sample of 20 measurements will be taken of the pressure in a chamber. It is
      assumed that the measurements will be normally distributed with mean μ and variance
      σ² = 0.004. What are the chances the sample mean will be within 0.01 of the true
      mean?
3.29. A measurement process is normally distributed with known variance σ² = 0.054 but
      unknown mean. What sample size is required in order to be sure the sample mean is
      within 0.01 of the true mean with probability at least 0.95?
3.30. The mean time to failure of a cell phone's battery is μ = 2.78 hours, where failure means
      a low battery indicator will flash.
      (a) Given that 100 measurements are made with a sample SD of s = 0.26, what is the
          probability the sample mean will be within 0.05 of the true mean?
      (b) How many measurements need to be taken to ensure P(|X̄ - μ| < 0.05) ≥ 0.98?
          Assume s = 0.26 is a good approximation to σ.
3.31. Suppose we have 49 data values with sample mean x̄ = 6.25 and sample SD 6. Find the
      probability of obtaining a sample mean of 6.25 or greater if the true population mean is
      μ = 4.
3.32. Two manufacturers each produce a semiconductor component, with mean lifetimes of
      μ_X = 1400 hours and σ_X = 200 for company X, and μ_Y = 1200, σ_Y = 100 for com-
      pany Y. Suppose 125 components from each company are randomly selected and tested.
(a) Find the probability that X ’s components will have a mean lifetime at least 160
hours longer than Y ’s.
(b) Find the probability that X ’s components will have a mean lifetime at least 250
hours longer than Y ’s.
3.33. Two drug companies, A and B, have competing drugs for a disease. Each company's
      drug has a 50% chance of a cure. They will each choose 50 patients at random and
administer their drug. Find the approximate probability company A will achieve 5 or
more cures than company B:
3.34. The population mean score of students on an exam is 72 with an SD of 6. Suppose two
groups of students are independently chosen at random with 26 in one group and 32 in
the other.
(a) Find the probability the sample means will differ by more than 3.
(b) Find the probability the sample means will differ by at least 2 but no more than 4
points.
3.35. Post-election results showed the winning candidate with 53% of the vote. Find the probability that two independent random samples with 200 voters each would indicate a difference of more than 12% in their two voting proportions.
3.36. Let $\{Z_1,\dots,Z_{16}\}$ be a random sample from $N(0,1)$ and $\{X_1,\dots,X_{64}\}$ an independent random sample from $X \sim N(\mu, 1)$.
(a) Find $P(Z_1^2 > 2)$.
(b) Find $P\left(\sum_{i=1}^{16} Z_i > 2\right)$.
(c) Find $P\left(\sum_{i=1}^{16} Z_i^2 > 16\right)$.
(d) Find a value of $c$ such that $P\left(\frac{1}{15}\sum_{i=1}^{16}(Z_i - \bar{Z})^2 > c\right) = 0.05$.
(e) Find the distribution of $Y = \sum_{i=1}^{16} Z_i^2 + \sum_{i=1}^{64}(X_i - \mu)^2$.
CHAPTER 4
The percentage $100(1-\alpha)\%$ is called the confidence level of the interval.
Definition 4.4 The error $\varepsilon$ of the estimate $\theta_e$ is the quantity $\varepsilon = |\theta_e - \theta|$.
The error of the estimate is therefore the amount by which the estimate deviates from the true value of $\theta$.
1 Capital letters, $X$, denote random variables, while lowercase letters, $x$, denote the observed value of the rv $X$.
4.1. CONFIDENCE INTERVALS FOR A SINGLE SAMPLE 79
4.1.2 PIVOTAL QUANTITIES
How do we construct confidence intervals for parameters in probability distributions? Normally
we need what are called pivotal quantities, or pivots for short.
Definition 4.5 Let $X_1, X_2, \dots, X_n$ be a random sample from a random variable $X$ with pdf $f_X(x;\theta)$. A pivotal quantity for $\theta$ is a random variable $h(X_1, X_2, \dots, X_n; \theta)$ whose distribution does not depend on $\theta$, i.e., it is the same for every value of $\theta$.
The function $h$ is a random variable constructed from the random sample and the constant $\theta$. Here's an example of a pivotal quantity.
Example 4.6 Take a sample $X$ of size one from a normal distribution with unknown $\mu$ but known $\sigma$. The rv
$$Z = \frac{X - \mu}{\sigma} \sim N(0,1)$$
is a standard normal random variable whose distribution doesn't depend on the value of $\mu$. Therefore, $Z = (X - \mu)/\sigma$ is a pivotal quantity for $\mu$. These quantities will also be known as test statistics in a later context.
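The defining property of a pivot is easy to check by simulation: standardizing samples drawn with two very different values of $\mu$ produces the same $N(0,1)$ distribution either way. A minimal sketch in Python (the particular values of $\mu$ and $\sigma$ below are illustrative, not from the text):

```python
import random
import statistics

def standardized_sample(mu, sigma, n, seed):
    """Draw n values of the pivot Z = (X - mu)/sigma, with X ~ N(mu, sigma)."""
    rng = random.Random(seed)
    return [(rng.gauss(mu, sigma) - mu) / sigma for _ in range(n)]

# Two very different means; the pivot's distribution is N(0, 1) either way.
z_a = standardized_sample(mu=0.0, sigma=2.0, n=20000, seed=1)
z_b = standardized_sample(mu=50.0, sigma=2.0, n=20000, seed=2)
```

Both `z_a` and `z_b` have sample mean near 0 and sample SD near 1, regardless of the underlying $\mu$, which is exactly what makes $Z$ usable for building confidence intervals.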
$\sigma^2 = \sigma_0^2$ is known
Since $E(\bar{X}) = \mu$ and $SD(\bar{X}) = \sigma_0/\sqrt{n}$, a pivotal quantity for $\mu$ is
$$\frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} \sim N(0,1).$$
$\sigma^2$ is unknown
In this case, $\sigma$ has to be estimated from the sample. Therefore, a pivotal quantity for $\mu$ is
$$\frac{\bar{X} - \mu}{S_X/\sqrt{n}} \sim t(n-1).$$
We now turn to finding a pivotal quantity for the variance. We again consider two cases.
80 4. CONFIDENCE AND PREDICTION INTERVALS
Table 4.1: Pivotal quantities for $N(\mu, \sigma)$
$\mu = \mu_0$ is known
In this case,
$$\sum_{i=1}^{n}\left(\frac{X_i - \mu_0}{\sigma}\right)^2 \sim \chi^2(n).$$
Recall that this follows from the fact that each term $(X_i - \mu_0)/\sigma \sim N(0,1)$, and the sum of $n$ squares of independent standard normals is $\chi^2(n)$.
$\mu$ is unknown
In this case, $\mu$ has to be estimated from the sample, and the pivot becomes
$$\frac{1}{\sigma^2}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \sim \chi^2(n-1).$$
There is a loss of a degree of freedom due to estimating $\mu$. We summarize our results in Table 4.1.
Remark 4.7 A TI program to obtain critical values of the $\chi^2$ distribution is given in Remark 2.67.
where $z_{\alpha/2} = \text{invNorm}(1 - \alpha/2)$ is the $\alpha/2$ critical value of $Z$. Rearranging we get
$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma_0}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma_0}{\sqrt{n}}\right) = 1 - \alpha.$$
We are $100(1-\alpha)\%$ confident that the true value of $\mu$ is in this interval. If $\mu$ really is in this interval, then the error is
$$\varepsilon = |\bar{x} - \mu| \le z_{\alpha/2}\frac{\sigma_0}{\sqrt{n}}.$$
But $\mu$ might not be in the interval, and so we can only say that we are $100(1-\alpha)\%$ confident that the error is no more than $z_{\alpha/2}\,\sigma_0/\sqrt{n}$.
Remark 4.8 Looking at the error $\varepsilon$, we see that $\varepsilon$ can be decreased in several ways. For a fixed sample size $n$, the larger $\alpha$ is, i.e., the lower the confidence level, the smaller $\varepsilon$ will be. This is because a smaller confidence level implies a smaller value of $z_{\alpha/2}$. On the other hand, for a fixed $\alpha$, if we increase the sample size, $\varepsilon$ gets smaller. For a given confidence level, the only way to decrease the error is to increase the sample size. Notice that as $n \to \infty$, i.e., as the sample size becomes larger and larger, the error shrinks to zero.
Example 4.9 A sample $X_1, X_2, \dots, X_{106}$ of 106 healthy adults have their temperatures taken during a routine physical checkup, and it is found that the mean body temperature of the sample is $\bar{x} = 98.2^\circ$F. Previous studies suggest that $\sigma_0 = 0.62^\circ$F. Assume the body temperatures are
drawn from a normal population. Set $\alpha = 0.05$ for a confidence level of 95%. Since $z_{.025} = \text{invNorm}(.975) = 1.96$, a 95% confidence interval for $\mu$ is
$$\left(\bar{x} - z_{.025}\frac{\sigma_0}{\sqrt{106}},\; \bar{x} + z_{.025}\frac{\sigma_0}{\sqrt{106}}\right) = (98.08^\circ, 98.32^\circ).$$
We are 95% confident that the true mean healthy adult body temperature is between $98.08^\circ$F and $98.32^\circ$F. The traditional value of $98.6^\circ$F for the body temperature of a healthy adult is not in this interval! There is a 5% chance that the interval missed the true value, but this result provides some evidence that the true value may not be $98.6^\circ$F. Finally, we note that
$$\varepsilon = |\bar{x} - \mu| \le z_{.025}\frac{\sigma_0}{\sqrt{106}} = 0.12^\circ\text{F}.$$
Therefore, we are 95% confident that our estimate of $\bar{x} = 98.2^\circ$F as the mean healthy adult body temperature deviates from the true temperature by at most $0.12^\circ$F.
The smallest sample size that will do the job is given by the smallest integer greater than or equal to $(z_{\alpha/2}\,\sigma_0/d)^2$.
Example 4.10 Going back to the previous example, suppose that we take $d = 0.05$, and we want to be 99% confident that our estimate $\bar{x}$ differs from the true value of the mean healthy adult body temperature by at most $0.05^\circ$F. In this case,
$$n = \left\lceil\left(\frac{z_{\alpha/2}\,\sigma_0}{d}\right)^2\right\rceil = \left\lceil\left(\frac{z_{0.005}\cdot 0.62}{0.05}\right)^2\right\rceil = \left\lceil\left(\frac{(2.58)(0.62)}{0.05}\right)^2\right\rceil = 1{,}024.$$
We would have to take a much larger sample than the original $n = 106$ to be 99% confident that our estimate is within $0.05^\circ$F of the true value.
Now we have to work on the more realistic problem that the variance of the population is
unknown.
A Confidence Interval for the Mean, $\sigma$ Unknown
If the variance $\sigma^2$ is unknown, we have no choice but to replace $\sigma$ with its estimate $S_X$ from the sample. A pivotal quantity for $\mu$ is given by
$$T = \frac{\bar{X} - \mu}{S_X/\sqrt{n}} \sim t(n-1).$$
Repeating essentially the same derivation as before, we obtain the $100(1-\alpha)\%$ confidence interval for $\mu$ as
$$\left(\bar{x} - t(n-1, \alpha/2)\frac{s_X}{\sqrt{n}},\; \bar{x} + t(n-1, \alpha/2)\frac{s_X}{\sqrt{n}}\right).$$
Just as $\bar{x}$ is the sample mean estimate for $\mu$, $s_X$ is the sample SD estimate for $\sigma$.
Example 4.11 During a certain two-week period during the summer, the number of drivers
speeding on Lake Shore Drive in Chicago is recorded. The data values are listed below.
Drivers Speeding on LSD
10 15 11 9 12 7 10
6 15 12 8 12 15 9
Assume the population is normal, meaning that the number of drivers speeding in a two-week period is normally distributed. The variance is unknown and must be estimated. We compute $\bar{x} = 10.93$ and $s_X^2 = 10.07$ from the data values. Set $\alpha = 0.05$ for a confidence level of 95%. A 95% confidence interval for $\mu$, the true mean number of drivers speeding, is given by
$$\left(\bar{x} - t(13, 0.025)\frac{s_X}{\sqrt{14}},\; \bar{x} + t(13, 0.025)\frac{s_X}{\sqrt{14}}\right) = (9.10, 12.47).$$
Notice that the $\chi^2$ distribution is not symmetric, and so we need to consider two distinct $\chi^2$ critical values. For instance, $100\alpha/2\%$ of the area under the $\chi^2$ pdf is to the right of $\chi^2(n, \alpha/2)$, and $100\alpha/2\%$ of the area under the $\chi^2$ pdf is to the left of $\chi^2(n, 1-\alpha/2)$. Starting from
$$P\left(\chi^2(n, 1-\alpha/2) < \frac{\sum_{i=1}^{n}(X_i - \mu_0)^2}{\sigma^2} < \chi^2(n, \alpha/2)\right) = 1 - \alpha$$
and solving this inequality for $\sigma^2$, we get
$$P\left(\frac{\sum_{i=1}^{n}(X_i - \mu_0)^2}{\chi^2(n, \alpha/2)} < \sigma^2 < \frac{\sum_{i=1}^{n}(X_i - \mu_0)^2}{\chi^2(n, 1-\alpha/2)}\right) = 1 - \alpha.$$
In a similar way, if the mean $\mu$ is unknown, then it must be estimated as $\bar{X}$. This time our table of pivotal quantities gives
$$\frac{1}{\sigma^2}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \sim \chi^2(n-1).$$
We write $\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = (n-1)S_X^2$. In this case, a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is given by
$$\left(\frac{(n-1)s_X^2}{\chi^2(n-1, \alpha/2)},\; \frac{(n-1)s_X^2}{\chi^2(n-1, 1-\alpha/2)}\right).$$
We summarize all our confidence intervals for the parameters of the normal random variable in Table 4.2; for instance, the row for $\sigma^2$ with $\mu$ unknown reads
$$\left(\frac{(n-1)s_X^2}{\chi^2(n-1, \alpha/2)},\; \frac{(n-1)s_X^2}{\chi^2(n-1, 1-\alpha/2)}\right).$$
and so
$$P\left(-z_{\alpha/2}\sqrt{np(1-p)} < X - np < z_{\alpha/2}\sqrt{np(1-p)}\right) \approx 1 - \alpha.$$
Dividing through by $n$, we get
$$P\left(-z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < \bar{X} - p < z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right) \approx 1 - \alpha.$$
The problem with this interval is that the endpoints contain the unknown parameter $p$. To eliminate $p$ from the endpoints, we use the fact that $p$ can be approximated by $\bar{X}$. If we take this approach, called the bootstrap method, we obtain
$$P\left(\bar{X} - z_{\alpha/2}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}} < p < \bar{X} + z_{\alpha/2}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}\right) \approx 1 - \alpha.$$
The $100(1-\alpha)\%$ confidence interval for $p$ using this approach is
$$\left(\bar{p} - z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}},\; \bar{p} + z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\right).$$
We have replaced the random variable $\bar{X}$ with the observed sample proportion $\bar{p} = \bar{x}$. The center of the confidence interval is $\bar{p}$, which is our approximation for $p$.
The value of $n$ is a sample size for which we can be $100(1-\alpha)\%$ confident that the error $\varepsilon \le d$. For example, if $\alpha = 0.05$ and $d = 0.01$, the sample size for which $\varepsilon \le 0.01$ is
$$n \ge 1.96^2/(4(0.01)^2) = 9{,}604.$$
Remark 4.12 The estimate of the sample size $n \ge \left\lceil z_{\alpha/2}^2/(4d^2)\right\rceil$ is the conservative estimate because we have replaced the unknown $p$ with $1/2$. Another method of estimating $n$ is to run a two-stage experiment. In the first stage, an arbitrary sample size $n \ge 30$ is taken, and the estimate $\bar{p}$ is computed. Then the sample size to obtain an error bound of $d$ is calculated by
$$n \ge \left\lceil \bar{p}(1-\bar{p})\left(\frac{z_{\alpha/2}}{d}\right)^2 \right\rceil.$$
Example 4.13 A gardener is trying to grow a rare species of orchid in a greenhouse. Let X
denote the number of plants that survive under greenhouse conditions. Of the (random sample
of) 50 plants she originally potted, only 17 survived. The random variable $X \sim \text{Binom}(50, p)$ has an unknown $p$. Suppose we wish to compute a 95% confidence interval for the percentage $p$ of plants that will survive in the greenhouse. Notice that our estimate for $p$ is $\bar{p} = 0.34$.
For the endpoints of the confidence interval, we obtain
$$l = 0.34 - z_{0.025}\sqrt{\frac{0.34(1-0.34)}{50}} = 0.209 \quad\text{and}\quad u = 0.34 + z_{0.025}\sqrt{\frac{0.34(1-0.34)}{50}} = 0.471.$$
The 95% confidence interval is $(0.209, 0.471)$. This time $\varepsilon \le 0.471 - 0.34 = 0.131$. If we want $\varepsilon \le 0.09 = d$, then we need a sample of size
$$n \ge \left\lceil 0.34(1-0.34)\left(\frac{z_{.025}}{0.09}\right)^2 \right\rceil = \lceil 106.422\rceil = 107.$$
If we had not taken a sample of $n = 50$ and didn't have an estimate of $\bar{p} = 0.34$, we would need a sample of size $n \ge \left\lceil \frac{1}{4}\left(\frac{z_{.025}}{0.09}\right)^2 \right\rceil = 119$ orchids to guarantee $\varepsilon \le 0.09$.
One-sided confidence intervals can be constructed for all the parameters in this chapter.
It will suffice to give a simple example to illustrate the process of constructing such an interval.
Example 4.15 Consider Example 4.11 involving speeders on Lake Shore Drive discussed previously. Suppose we wish to construct 95% one-sided confidence intervals for the mean number of speeders. As before, if the variance $\sigma^2$ is unknown, then a pivotal quantity for $\mu$ is $\frac{\bar{X} - \mu}{S_X/\sqrt{n}} \sim t(n-1)$. To obtain an upper one-sided confidence interval, we set
$$P\left(\frac{\bar{X} - \mu}{S_X/\sqrt{n}} > -t(n-1, \alpha)\right) = 1 - \alpha,$$
or equivalently,
$$P\left(\mu < \bar{X} + t(n-1, \alpha)S_X/\sqrt{n}\right) = 1 - \alpha.$$
A $100(1-\alpha)\%$ upper confidence interval for $\mu$ is $(-\infty, \bar{X} + t(n-1, \alpha)S_X/\sqrt{n})$. An upper $100(1-\alpha)\%$ one-sided observed confidence interval is given by
$$\left(-\infty,\; \bar{x} + t(n-1, \alpha)s_X/\sqrt{n}\right).$$
For our example, if $\alpha = 0.05$, $(-\infty, 10.93 + (1.77)(3.17)/\sqrt{14}) = (-\infty, 12.43)$ is a 95% upper confidence interval for $\mu$. We would say that we are 95% confident that the mean number of speeders is no more than 12.43. If someone claimed the mean number of speeders was at least 15, our upper bound would be good evidence against that.
Similarly, a lower $100(1-\alpha)\%$ one-sided observed confidence interval is given by
$$\left(\bar{x} - t(n-1, \alpha)s_X/\sqrt{n},\; +\infty\right).$$
For our example, we obtain $(10.93 - (1.77)(3.17)/\sqrt{14}, +\infty) = (9.43, +\infty)$ as a 95% lower confidence interval for $\mu$. We say that we are 95% confident that the mean number of speeders is at least 9.43. By contrast, the two-sided 95% confidence interval for the mean number of speeders is $(9.10, 12.47)$, and we are 95% confident the mean number of speeders is in that interval.
One-sided intervals are not constructed simply by taking the upper and lower values of a two-sided interval because in the two-sided case we use $z_{\alpha/2}$, but we use $z_\alpha$ in the one-sided case.
Consequently, the random variable
$$Z = \frac{(\bar{X}_m - \bar{Y}_n) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_{0,X}^2}{m} + \dfrac{\sigma_{0,Y}^2}{n}}} \sim N(0,1)$$
is a pivotal quantity. Therefore, we have
$$P\left(-z_{\alpha/2} < \frac{\bar{X}_m - \bar{Y}_n - (\mu_X - \mu_Y)}{\sqrt{\sigma_{0,X}^2/m + \sigma_{0,Y}^2/n}} < z_{\alpha/2}\right) = 1 - \alpha.$$
For simplicity, set $D_{n,m} = \sqrt{\dfrac{\sigma_{0,X}^2}{m} + \dfrac{\sigma_{0,Y}^2}{n}}$. Then after some algebra we get
$$P\left((\bar{X}_m - \bar{Y}_n) - z_{\alpha/2}D_{n,m} < \mu_X - \mu_Y < (\bar{X}_m - \bar{Y}_n) + z_{\alpha/2}D_{n,m}\right) = 1 - \alpha.$$
The $100(1-\alpha)\%$ confidence interval for $\mu_X - \mu_Y$ in the case when both variances are known is therefore given by
$$\left((\bar{x}_m - \bar{y}_n) - z_{\alpha/2}\sqrt{\frac{\sigma_{0,X}^2}{m} + \frac{\sigma_{0,Y}^2}{n}},\; (\bar{x}_m - \bar{y}_n) + z_{\alpha/2}\sqrt{\frac{\sigma_{0,X}^2}{m} + \frac{\sigma_{0,Y}^2}{n}}\right).$$
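As a worked illustration of the known-variance interval, we can borrow the numbers from Problem 3.32 ($\sigma_X = 200$, $\sigma_Y = 100$, $m = n = 125$), treating 1400 and 1200 as the observed sample means; that treatment is an assumption made here for illustration only:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_ci(xbar, ybar, var_x, var_y, m, n, alpha=0.05):
    """CI for mu_X - mu_Y when both population variances are known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    d = sqrt(var_x / m + var_y / n)      # the quantity D_{n,m}
    diff = xbar - ybar
    return (diff - z * d, diff + z * d)

# Illustration with the lifetimes of Problem 3.32 taken as observed means.
ci = two_sample_z_ci(1400, 1200, 200**2, 100**2, 125, 125)
print(ci)  # approximately (160.8, 239.2)
```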
Since $\sigma$ is unknown, we know we have to replace it with a sample SD. The sample SDs of both samples, which may not be equal even though we are assuming the population SDs are the same, have to be taken into account. We will do that by pooling the two SDs. Define
$$S_p^2 = \frac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}$$
as the pooled sample variance. Observe that it is a weighted average of $S_X^2$ and $S_Y^2$. We will replace $\sigma$ by $S_p$, and we need to find the distribution of
$$\frac{\bar{X}_m - \bar{Y}_n - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{m} + \frac{1}{n}}}.$$
We know that
$$\frac{(m-1)S_X^2}{\sigma^2} \sim \chi^2(m-1) \quad\text{and}\quad \frac{(n-1)S_Y^2}{\sigma^2} \sim \chi^2(n-1).$$
But the sum of two independent $\chi^2$ random variables with $\nu_1$ and $\nu_2$ degrees of freedom, respectively, is again a $\chi^2$ random variable with $\nu_1 + \nu_2$ degrees of freedom. Therefore,
$$\frac{(m-1)S_X^2}{\sigma^2} + \frac{(n-1)S_Y^2}{\sigma^2} \sim \chi^2(m+n-2).$$
Again by independence,
$$T = \frac{(\bar{X}_m - \bar{Y}_n) - (\mu_X - \mu_Y)}{\sigma\sqrt{\frac{1}{m} + \frac{1}{n}}}\Bigg/\sqrt{\frac{(m-1)S_X^2 + (n-1)S_Y^2}{\sigma^2(m+n-2)}} \sim t(m+n-2)$$
since the numerator is distributed as $N(0,1)$ and the denominator is the square root of a $\chi^2$ random variable with $m+n-2$ degrees of freedom divided by $m+n-2$. That is, $T \sim t(m+n-2)$. Using algebra we see that
$$T = \frac{(\bar{X}_m - \bar{Y}_n) - (\mu_X - \mu_Y)}{\sigma\sqrt{\frac{1}{m} + \frac{1}{n}}}\Bigg/\sqrt{\frac{(m-1)S_X^2 + (n-1)S_Y^2}{\sigma^2(m+n-2)}} = \frac{(\bar{X}_m - \bar{Y}_n) - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{m} + \frac{1}{n}}},$$
and so the above expression will be our pivotal quantity for $\mu_X - \mu_Y$. Similar to the derivation in the previous section, we conclude that a $100(1-\alpha)\%$ observed confidence interval for $\mu_X - \mu_Y$ in the case when both variances are equal but unknown is given by
$$\left((\bar{x}_m - \bar{y}_n) - t(m+n-2, \alpha/2)\,s_p\sqrt{\frac{1}{m}+\frac{1}{n}},\; (\bar{x}_m - \bar{y}_n) + t(m+n-2, \alpha/2)\,s_p\sqrt{\frac{1}{m}+\frac{1}{n}}\right).$$
In summary, this is the confidence interval to use when the variances are assumed un-
known but equal, and the sample variance we use is the pooled variance because it takes both
samples into account.
4.2. CONFIDENCE INTERVALS FOR TWO SAMPLES 91
Variances of Both Populations Unknown and Unequal
The final case occurs if both $\sigma_X^2$ and $\sigma_Y^2$ are unknown and unequal. Finding a pivotal quantity for $\mu_X - \mu_Y$ with an exact distribution is currently an unsolved problem in statistics called the Behrens–Fisher problem. Accordingly, the CI in this case is an approximation and not exact.
It turns out it can be shown that
$$T = \frac{\bar{X}_m - \bar{Y}_n - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}}} \sim t(\nu)$$
does follow a $t$-distribution, but the degrees of freedom $\nu$ is given by the formula
$$\nu = \left\lfloor \frac{\left(\frac{1}{m}r + \frac{1}{n}\right)^2}{\frac{1}{m^2(m-1)}r^2 + \frac{1}{n^2(n-1)}} \right\rfloor,$$
where $r = s_X^2/s_Y^2$ is simply the ratio of the sample variances. Observe that $\nu$ depends only on the ratio of the sample variances and the sizes of the respective samples, nothing else.
Therefore, an approximate $100(1-\alpha)\%$ observed confidence interval for $\mu_X - \mu_Y$ can now be derived as
$$\left((\bar{x}_m - \bar{y}_n) - t(\nu, \alpha/2)\sqrt{\frac{s_X^2}{m} + \frac{s_Y^2}{n}},\; (\bar{x}_m - \bar{y}_n) + t(\nu, \alpha/2)\sqrt{\frac{s_X^2}{m} + \frac{s_Y^2}{n}}\right).$$
This is the CI to use with independent samples from two populations when there is no reason to expect that the variances of the two populations are equal.
Example 4.16 A certain species of beetle is located throughout the United States, but specific
characteristics of the beetle, like carapace length, tend to vary by region. An entomologist is
studying the carapace length of populations of the beetle located in the southeast and the north-
east regions of the country. The data for the two samples of beetles is assumed to be normal and
is given below. It is desired to compute a 95% confidence interval for the mean difference in
carapace length between the two populations, the variances of which are unknown and assumed
to be unequal.
Let $N_1, N_2, \dots, N_{16}$ represent the northeast sample and $S_1, S_2, \dots, S_{10}$ represent the southeast sample. We first compute the sample variances of the two samples as $s_N^2 = 0.355$ and $s_S^2 = 0.297$. The value of $\nu$ is obtained by rounding down
$$\frac{\left(\frac{1}{16}\cdot\frac{0.355}{0.297} + \frac{1}{10}\right)^2}{\frac{1}{16^2(16-1)}\left(\frac{0.355}{0.297}\right)^2 + \frac{1}{10^2(10-1)}} = 20.579,$$
so $\nu = 20$. We find $t(20, 0.025) = \text{invT}(0.975, 20) = 2.0859$. A 95% confidence interval for $\mu_N - \mu_S$ can now be derived as
$$\left((10.146 - 10.395) - t(20, 0.025)\sqrt{\tfrac{0.355}{16} + \tfrac{0.297}{10}},\; (10.146 - 10.395) + t(20, 0.025)\sqrt{\tfrac{0.355}{16} + \tfrac{0.297}{10}}\right) = (-0.724,\; 0.226).$$
Notice that the value 0 is in this confidence interval, and so it is possible that there is no difference in carapace length between the two populations of beetles.
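The Welch degrees-of-freedom formula and the beetle interval can be checked numerically; the critical value $t(20, 0.025) = 2.0859$ is taken from the text:

```python
from math import floor, sqrt

def welch_df(sx2, sy2, m, n):
    """Degrees of freedom for the unequal-variance two-sample t interval."""
    r = sx2 / sy2
    num = (r / m + 1 / n) ** 2
    den = r ** 2 / (m ** 2 * (m - 1)) + 1 / (n ** 2 * (n - 1))
    return floor(num / den)

nu = welch_df(0.355, 0.297, 16, 10)          # 20 (rounded down from 20.579)
diff = 10.146 - 10.395
half = 2.0859 * sqrt(0.355 / 16 + 0.297 / 10)
ci = (diff - half, diff + half)              # approximately (-0.724, 0.226)
```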
Error Bounds
Error bounds in all three cases discussed above are easy to derive, since the estimator $\bar{X}_m - \bar{Y}_n$ lies in the center of the confidence interval.
Since the ratio of two $\chi^2$ random variables, each divided by their respective degrees of freedom, is distributed as an $F$ random variable, we have
$$\left.\frac{(m-1)S_X^2/\sigma_X^2}{m-1}\right/\frac{(n-1)S_Y^2/\sigma_Y^2}{n-1} = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = \frac{S_X^2}{S_Y^2}\cdot\frac{\sigma_Y^2}{\sigma_X^2} \sim F(m-1, n-1).$$
It follows that
$$P\left(F(m-1, n-1, 1-\alpha/2) < \frac{S_X^2}{S_Y^2}\cdot\frac{\sigma_Y^2}{\sigma_X^2} < F(m-1, n-1, \alpha/2)\right) = 1 - \alpha.$$
Rewriting so that $\sigma_X^2/\sigma_Y^2$ is in the center of the inequality gives the probability interval
$$P\left(\frac{S_X^2}{S_Y^2}\cdot\frac{1}{F(m-1, n-1, \alpha/2)} < \frac{\sigma_X^2}{\sigma_Y^2} < \frac{S_X^2}{S_Y^2}\cdot\frac{1}{F(m-1, n-1, 1-\alpha/2)}\right) = 1 - \alpha.$$
A $100(1-\alpha)\%$ observed confidence interval for $\sigma_X^2/\sigma_Y^2$ can now be derived as
$$\left(\frac{s_X^2}{s_Y^2}\cdot\frac{1}{F(m-1, n-1, \alpha/2)},\; \frac{s_X^2}{s_Y^2}\cdot\frac{1}{F(m-1, n-1, 1-\alpha/2)}\right).$$
This confidence interval for $\sigma_X^2/\sigma_Y^2$ can be displayed in a variety of ways. Notice that if $X \sim F(m,n)$, then $1/X \sim F(n,m)$. It follows that $F(n, m, 1-\alpha) = 1/F(m, n, \alpha)$. (Prove this!) The confidence interval for $\sigma_X^2/\sigma_Y^2$ can also be expressed as
$$\left(\frac{s_X^2}{s_Y^2}F(n-1, m-1, 1-\alpha/2),\; \frac{s_X^2}{s_Y^2}F(n-1, m-1, \alpha/2)\right) = \left(\frac{s_X^2}{s_Y^2}\cdot\frac{1}{F(m-1, n-1, \alpha/2)},\; \frac{s_X^2}{s_Y^2}F(n-1, m-1, \alpha/2)\right)$$
$$= \left(\frac{s_X^2}{s_Y^2}F(n-1, m-1, 1-\alpha/2),\; \frac{s_X^2}{s_Y^2}\cdot\frac{1}{F(m-1, n-1, 1-\alpha/2)}\right).$$
Example: Continuing Example 4.16 involving beetle populations, set $\alpha = 0.05$. At the 95% confidence level, $F(15, 9, 0.025) = 3.7694$ and $F(15, 9, 0.975) = 0.3202$. Therefore, a 95% confidence interval for $\sigma_N^2/\sigma_S^2$ is given by
$$\left(\frac{0.355}{0.297}\cdot\frac{1}{3.7694},\; \frac{0.355}{0.297}\cdot\frac{1}{0.3202}\right) = (0.3171,\; 3.7329).$$
Since the interval contains the value 1, we cannot conclude the variances are different at the 95% confidence level.
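The variance-ratio interval is a direct computation once the two $F$ critical values are in hand (here taken from the text, $F(15, 9, 0.025) = 3.7694$ and $F(15, 9, 0.975) = 0.3202$):

```python
ratio = 0.355 / 0.297          # s_N^2 / s_S^2
f_hi = 3.7694                  # F(15, 9, 0.025)
f_lo = 0.3202                  # F(15, 9, 0.975)
ci = (ratio / f_hi, ratio / f_lo)
print(ci)  # approximately (0.3171, 3.7329)
```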
We summarize, in Table 4.3, all the two-sample confidence intervals obtained for the normal distribution; for instance, the row for $r = \sigma_X^2/\sigma_Y^2$ with $\sigma_X, \sigma_Y$ unknown reads
$$\left(\frac{s_X^2}{s_Y^2}\cdot\frac{1}{F(m-1, n-1, \alpha/2)},\; \frac{s_X^2}{s_Y^2}\cdot\frac{1}{F(m-1, n-1, 1-\alpha/2)}\right).$$
Therefore,
$$Z = \frac{(\bar{X}_m - \bar{Y}_n) - (p_X - p_Y)}{\sqrt{\dfrac{p_X(1-p_X)}{m} + \dfrac{p_Y(1-p_Y)}{n}}} \sim N(0,1) \text{ (approximately)}.$$
Approximating $p_X$ by $\bar{p}_X = \bar{x}_m$ and $p_Y$ by $\bar{p}_Y = \bar{y}_n$, we obtain the $100(1-\alpha)\%$ (approximate) confidence interval for $p_X - p_Y$ as
$$\left((\bar{p}_X - \bar{p}_Y) - z_{\alpha/2}\sqrt{\frac{\bar{p}_X(1-\bar{p}_X)}{m} + \frac{\bar{p}_Y(1-\bar{p}_Y)}{n}},\; (\bar{p}_X - \bar{p}_Y) + z_{\alpha/2}\sqrt{\frac{\bar{p}_X(1-\bar{p}_X)}{m} + \frac{\bar{p}_Y(1-\bar{p}_Y)}{n}}\right).$$
Example 4.17 An experiment was conducted by a social science researcher which involved as-
sessing the benefits of after-school enrichment programs in a certain low-income school district
in Cleveland, Ohio. A total of 147 three- and four-year-old children were involved in the study.
Children were randomly assigned to two groups, one group of 73 students which participated
in the after-school programs, and a control group with 74 students which did not participate
in these programs. The children were followed as adults, and a number of data items were col-
lected, one of them being their annual incomes. In the control group, 23 out of the 74 children
were earning more that $75;000 per year whereas in the group participating in the after-school
programs, 38 out of the 73 children were earning more than $75;000 per year. Let pX and pY
denote the proportion of three- and four-year-old children who do and do not participate in
after-school programs making more than $75;000 per year, respectively. A 95% confidence in-
terval for pX pY is given by
r
0:520.1 0:520/ 0:311.1 0:311/
.0:520 0:311/ ˙ .1:96/ C H) .0:053; 0:365/:
73 74
As a result of the study, we can be 95% confident that between 5.3 and 36.5 percentage points more of the children who participated in the after-school enrichment programs were earning more than \$75,000 per year than of those who did not participate.
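The interval can be reproduced with the rounded proportions from the text ($\bar{p}_X = 0.520$, $\bar{p}_Y = 0.311$, $z = 1.96$); a sketch:

```python
from math import sqrt

def two_prop_ci(px, py, m, n, z=1.96):
    """Approximate CI for p_X - p_Y from two independent samples."""
    se = sqrt(px * (1 - px) / m + py * (1 - py) / n)
    diff = px - py
    return (diff - z * se, diff + z * se)

ci = two_prop_ci(0.520, 0.311, 73, 74)
print(ci)  # approximately (0.053, 0.365)
```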
$$\left(\bar{d}_m - t(m-1, \alpha/2)\frac{s_D}{\sqrt{m}},\; \bar{d}_m + t(m-1, \alpha/2)\frac{s_D}{\sqrt{m}}\right).$$
As usual, $d_i$ is the observed difference of the $i$th pair, $d_i = x_i - y_i$.
Example 4.18 Ten middle-aged men with high blood pressure engage in a regimen of aerobic exercise on a newly introduced type of treadmill. Their blood pressures are recorded before starting the exercise program. After using the treadmill for 30 minutes each day for six months, their blood pressures are again recorded. During the six-month period, the ten men do not change their lifestyles in any other significant way. The two blood pressure readings in mmHg (before and after the period of aerobic exercise) are given in the table below.
Male $i$   BP (before) $X_i$   BP (after) $Y_i$   Difference $D_i$
1   143   144   -1
2   171   164   7
3   160   149   11
4   182   175   7
5   149   142   7
6   162   162   0
7   177   173   4
8   165   156   9
9   150   148   2
10   165   161   4
Take $\alpha = 0.05$. From the table, we compute $\bar{d}_{10} = 5$ and $s_D = 3.887$. A 95% confidence interval for $\mu_D$ is given by
$$\left(5 - t(9, 0.025)\frac{3.887}{\sqrt{10}},\; 5 + t(9, 0.025)\frac{3.887}{\sqrt{10}}\right) = (2.22,\; 7.78).$$
We are 95% confident that exercising on the treadmill can lower blood pressure roughly between 2 and 8 points.
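The paired interval follows directly from the column of differences; $t(9, 0.025) \approx 2.262$ is hard-coded from a $t$-table:

```python
from math import sqrt
from statistics import mean, stdev

diffs = [-1, 7, 11, 7, 7, 0, 4, 9, 2, 4]   # before minus after, per subject
t_crit = 2.262                              # t(9, 0.025)
half = t_crit * stdev(diffs) / sqrt(len(diffs))
ci = (mean(diffs) - half, mean(diffs) + half)
print(ci)  # approximately (2.22, 7.78)
```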
To obtain the actual prediction interval, we substitute the observed values of the sample into the functions $l(X_1, X_2, \dots, X_n)$ and $u(X_1, X_2, \dots, X_n)$ to get an interval $(l(x_1,\dots,x_n),\, u(x_1,\dots,x_n))$ of real numbers. Unlike for confidence intervals, the statement $P(l(x_1,\dots,x_n) < X_{n+1} < u(x_1,\dots,x_n))$ does have meaning, since $X_{n+1}$ is a random variable. We are not capturing a number in the interval, but a random variable.
Similar to confidence intervals, constructing prediction intervals depends on identifying pivotal quantities. In a prediction interval, a pivotal quantity may depend on $X_1,\dots,X_{n+1}$. We will obtain pivotal quantities and construct prediction intervals depending on whether $\mu$ or $\sigma$ is known or unknown.
In other words, we can predict a sample value from $N(\mu_0, \sigma_0)$ will be in $(\mu_0 - z_{\alpha/2}\sigma_0,\; \mu_0 + z_{\alpha/2}\sigma_0)$ with probability $1-\alpha$.
4.3. PREDICTION INTERVALS 99
$\mu$ is Unknown and $\sigma^2 = \sigma_0^2$ is Known
Since $\mathrm{Var}(X_{n+1} - \bar{X}) = \sigma_0^2 + \sigma_0^2/n$ by independence, we have
$$\frac{X_{n+1} - \bar{X}}{\sqrt{\sigma_0^2 + \frac{\sigma_0^2}{n}}} = \frac{X_{n+1} - \bar{X}}{\sigma_0\sqrt{1 + \frac{1}{n}}} = Z \sim N(0,1).$$
Therefore, the random variable $\dfrac{X_{n+1} - \bar{X}}{\sigma_0\sqrt{1 + \frac{1}{n}}}$ is a pivotal quantity for $X_{n+1}$. The $100(1-\alpha)\%$ prediction interval for $X_{n+1}$ is given by
$$\left(\bar{x} - z_{\alpha/2}\,\sigma_0\sqrt{1 + \frac{1}{n}},\; \bar{x} + z_{\alpha/2}\,\sigma_0\sqrt{1 + \frac{1}{n}}\right).$$
One of the uses of prediction intervals is in the detection of outliers: extreme values of the random variable, or values that come from a population whose mean is different from the one under consideration. Given a choice of $\alpha$, the sample value $X_{n+1}$ will be considered an outlier if $X_{n+1}$ is not in the $100(1-\alpha)\%$ prediction interval for $X_{n+1}$.
Example 4.21 An online furniture retailer has collected the times it takes an adult to assem-
ble a certain piece of its furniture. The data in the table below represents a random sample
X1 ; X2 ; : : : ; X36 of the number of minutes it took 36 adults to assemble a certain advertised
“easy to assemble” outdoor picnic table. Assume the data is drawn from a normal population.
Assembly Time for Adults
17 13 18 19 17 21 29 22 16 28 21 15
26 23 24 20 8 17 17 21 32 18 25 22
16 10 20 22 19 14 30 22 12 24 28 11
We will use the data in the table to construct a 95% prediction interval for the next assembly time. We compute the mean and standard deviation of the sample as $\bar{x}_{36} = 19.92$ and $s_X = 5.73$. The 95% prediction interval, using $s_X$ in place of the unknown $\sigma$ together with a $t$ critical value, is given by
$$\left(19.92 - t(35, 0.025)(5.73)\sqrt{1 + \frac{1}{36}},\; 19.92 + t(35, 0.025)(5.73)\sqrt{1 + \frac{1}{36}}\right) = (8.13,\; 31.71).$$
We can write a valid probability statement as $P(8.13 < X_{37} < 31.71) = 0.95$. Any data value falling outside the prediction interval $(8.13, 31.71)$ could be considered an outlier at the 95% level.
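Recomputing the prediction interval from the raw data is a good check ($t(35, 0.025) \approx 2.0301$ hard-coded; the endpoints agree with the text's rounded values to about a hundredth):

```python
from math import sqrt
from statistics import mean, stdev

times = [17, 13, 18, 19, 17, 21, 29, 22, 16, 28, 21, 15,
         26, 23, 24, 20, 8, 17, 17, 21, 32, 18, 25, 22,
         16, 10, 20, 22, 19, 14, 30, 22, 12, 24, 28, 11]
n = len(times)        # 36
t_crit = 2.0301       # t(35, 0.025)
# The extra "1 +" widens the interval to cover a single NEW observation.
half = t_crit * stdev(times) * sqrt(1 + 1 / n)
pi = (mean(times) - half, mean(times) + half)
```

The width of a prediction interval does not shrink to zero as $n \to \infty$, unlike a confidence interval for $\mu$, because the new observation carries its own variability.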
4.4 PROBLEMS
4.1. An urn contains only black and white marbles with unknown proportions. If a random
sample of size 100 is drawn and it contains 47 black marbles, find the following.
(a) The percentage of black marbles in the urn is estimated as ______ with a standard error of ______.
(b) The SE measures the likely size of the error due to chance in the estimate of the percentage of black marbles in the urn. (T/F)
(c) Suppose your estimate of the proportion of black marbles in the urn is $\hat{p}$. Then $\hat{p}$ is likely to be off from the true proportion of black marbles in the urn by the SE. (T/F)
(d) A 95% CI for the proportion of black marbles in the urn is ______ to ______.
(e) What is a 95% CI for the proportion of black marbles in the sample, or does that make sense? Explain.
(f) Suppose we know that 53% of the marbles in the urn are black. We take a random sample of 100 marbles and calculate the SE as 0.016. Find the chance that the proportion of black marbles in the sample is between $0.53 - 0.032$ and $0.53 + 0.032$.
4.2. Suppose a random sample of 100 male students is taken from a university with 546
male students in order to estimate the population mean height. The sample mean is $\bar{x} = 67.45$ inches with an SD of 2.93 inches.
(a) Assuming the sampling is done with replacement, find a 90% CI for the population
mean height.
(b) Assuming the sampling is done without replacement, find a 90% CI.
4.3. Suppose 50 random samples of size 10 are drawn from a population which is normally distributed with mean 40 and variance 3. A 95% confidence interval is calculated for each sample.
(a) How many of these 50 CIs do you expect to contain the true population mean $\mu = 40$?
(b) If we define the rv $X$ to be the number of intervals out of 50 which contain the true mean $\mu = 40$, what is the distribution of $X$? Find $P(X = 40)$, $P(X \le 40)$, and $P(X > 45)$.
4.4. Find a conservative estimate of the sample size needed in order to ensure the error in a
poll is less than 3%. Assume we are using 95% confidence.
4.5. Find the 99% confidence limits for the population proportion of voters who favor a
candidate. The sample size is 100, and the sample percentage of voters favoring the
candidate was 55%.
4.6. A random sample is taken from a population which is $N(\mu, 5)$. The sample size is 20 and results in the sample mean $\bar{x} = 15.2$.
(a) Find the CI for confidence levels 70%, 80%, and 90%.
(b) Repeat the problem assuming the sample size is 100.
4.7. Show that a lower $100(1-\alpha)\%$ one-sided confidence interval for the unknown mean $\mu$ with unknown variance is given by $(\bar{x} - t(n-1, \alpha)s_X/\sqrt{n},\; +\infty)$.
4.8. Suppose $X_1,\dots,X_n$ is a random sample from a continuous random variable $X$ with population median $m$. Suppose that we use the interval with random endpoints $(X_{\min}, X_{\max})$ as a CI for $m$.
(a) Find $P(X_{\min} < m < X_{\max})$, giving the confidence level for the interval $(X_{\min}, X_{\max})$. Notice that it is not 100% because this involves a random sample.
(b) Find $P(X_{\min} < m < X_{\max})$ if the sample size is $n = 8$.
4.9. A random sample of 225 flights shows that the mean number of unoccupied seats is 11.6 with SD = 4.1. Assume this is the population SD.
(a) Find a 90% CI for the population mean.
(b) The 90% CI you found means (choose ONE):
(i) The interval contains the population mean with probability 0.9.
(ii) If repeated samples are taken, 90% of the CIs contain the population mean.
(c) What minimum sample size do you need if you want the error reduced to 0.2?
4.10. Suppose you want to provide an accurate estimate of the proportion of customers preferring one brand of coffee over another. You need to construct a 95% CI for $p$ so that the error is at most 0.015. You are told that preliminary data shows $\bar{p} = 0.35$. What sample size should you choose?
4.11. Consider the following data points for fuel mileage in a particular vehicle.
42 36 38 45 41 47 33 38 37 36
40 44 35 39 38 41 44 37 37 49
Assume these data points are from a random sample. Construct a 95% CI for the mean
population mpg. Is there any reason to suspect that the data does not come from a nor-
mal population? What would a lower one-sided CI be, and how would it be interpreted?
4.12. The accuracy of speedometers is checked to see if the SD is about 2 mph. Suppose a ran-
dom sample of 35 speedometers are checked, and the sample SD is 1.2 mph. Construct
a 95% CI for the population variance.
4.13. The ACT exam for an entering class of 535 students had a mean of 24 with an SD of
3.9. Assume this class is representative of future students at this college.
(a) Find a 90% CI for the mean ACT score of all future students.
(b) Find a 90% prediction interval for the ACT score of a future student.
4.14. A random sample of size 32 is taken to assess the weight loss on a low-fat diet. The
mean weight loss is 19.3 pounds with an SD of 10.8 pounds. An independent random
sample of 32 people who are on a low calorie diet resulted in a mean weight loss of
15.1 pounds with an SD of 12.8 pounds. Construct a 95% CI for the mean difference
between a low calorie and a low fat diet. Do not pool the data.
4.15. A sample of 140 LEDs resulted in a mean lifetime of 9.7 years with an SD of 6 months.
A sample of 200 CFLs resulted in a mean lifetime of 7.8 years with an SD of 1.2 years.
Find a 95% and 99% CI for the difference in the mean lifetimes. Is there evidence that
the difference is real? Do not pool the data.
4.16. The SD of a random sample of the service times of 200 customers was found to be 3
minutes.
(a) Find a 95% CI for the SD of the service times of all such customers.
(b) How large a sample is needed in order to be 99.93% confident that the true population SD will not differ from the sample SD by more than 5%?
4.17. Suppose the sample mean number of sick days at a factory is 6.3 with a sample SD of
4.5. This is based on a sample of 25.
(a) Find a 98% CI for the population mean number of sick days.
(b) Calculate the sample size needed so that a 95% CI has an error of no more than
0.5 days.
4.18. A random sample of 150 colleges and universities resulted in a sample mean ACT score
of 20.8 with an SD of 4.2. Assuming this is representative of all future students,
(a) find a 95% CI for the mean ACT of all future students, and
(b) find a 95% prediction interval for the ACT score of a single future student.
4.19. A logic test is given to a random sample of students before and after they completed a
formal logic course. The results are given below. Construct a 95% confidence interval
for the mean difference between the before and after scores.
After 74 83 75 88 84 63 93 84 91 77
Before 73 77 70 77 74 67 95 83 84 75
4.20. Two independent groups, chosen with random assignments, A and B, consist of 100
people each of whom have a disease. An experimental drug is given to group A but not
to group B, which are termed treatment and control groups, respectively. Two simple
random samples have yielded that in the two groups, 75 and 65 people, respectively,
recover from the disease. To study the effect of the drug, build a 95% confidence interval
for the difference in proportions pA pB .
CHAPTER 5
Hypothesis Testing
This chapter is one of the cornerstones of statistics because it allows us to reach a decision based
on an experiment with random outcomes. The basic question in an experiment is whether or
not the outcome is real, or simply due to chance variation. For example, if we flip a coin 100
times and obtain 57 heads, can we conclude the coin is not fair? We know we expect 50 heads,
so is 57 too many, or is it due to chance variation? Hypothesis testing allows us to answer such
questions. This is an extremely important issue, for instance in drug trials in which we need to
know if a drug truly is efficacious, or if the result of the clinical trial is simply due to chance, i.e.,
the possibility that a subject will simply improve or get worse on their own. The purpose of this
chapter is to show how hypothesis testing is implemented.
1 Cardiovascular events associated with rofecoxib in a colorectal adenoma chemoprevention trial, R.S. Bresalier et al.,
N. Engl. J. Med., March 17, 2005; 352:1092–1102.
106 5. HYPOTHESIS TESTING
What follows is how hypothesis testing would work in this situation of comparing the
proportions of cardiovascular (CV) events in the treatment and control groups of the clinical
trial cited above. We will introduce the necessary terminology as we explain the method.
First, let's denote the true population proportion of people who will have a CV event as
p_T if they are in the treatment group, and p_C if they are in the control group. Hypothesis testing
assumes that the difference between these two quantities should be zero, i.e., that there is no
difference between the two groups. We say that this is the Null Hypothesis, and write it as
H_0 : p_T − p_C = 0. Next, since we observed p̂_T = 0.0357 and p̂_C = 0.0200, the observed difference
p̂_T − p̂_C = 0.0157 > 0 establishes the Alternative Hypothesis as H_1 : p_T − p_C > 0.
What are the chances of observing a difference of 0.0157 in the proportions, if, in fact the
difference should be 0 under the assumption of the null hypothesis? The random assignment
of subjects to control and treatment groups allows us to use probability to actually answer this
question. We will see by the methods developed in this chapter that we will be able to calculate
that the chance of observing a difference of 0.0157 (or more) is 0.00753, under the assumption
of the null hypothesis that there should be no difference. This value is called the p-value of the
test or the level of significance of the test.
Now we are ready to reach a conclusion. Under the assumption there is no difference,
we calculate that the chance of observing a difference of 0.0157 is only about 0.75%. But, this is what
actually occurred even though it is extremely unlikely. The more likely conclusion, supported
by the evidence, is that our assumption is incorrect, i.e., it is more likely H0 is wrong.2 The
conclusion is that we reject the null hypothesis in favor of the alternative hypothesis. How small
does the p-value have to be for a conclusion to reject the null? Statisticians use the guide in
Table 5.1 in Section 5.3.2 to reach a conclusion based on the p-value.
Remark 5.1 What does it mean to reject the null? It means that under the assumption that
the null is true, the data we actually observed had a very low probability of occurring. So either
we just witnessed a very low probability event, or the null is not true. We say that the null is
rejected if the p-value of the test is low enough. If the p-value is not low enough, it means that
it is plausible the null is true, i.e., there is not enough evidence against it, and therefore we do
not reject it.
What we just described is called the p-value approach to hypothesis testing, and the p-
value is essentially the probability we reach a wrong conclusion if we assume the null. Another
approach is called the critical value approach. We’ll discuss this in more detail later, but in this
example here’s how it works.
First, we specify a level of significance α, which is the largest p-value we are willing to
accept to reject the null. Say we take α = 0.01. Then, we calculate the z value that gives the
proportion α to the right of z under the standard normal curve, i.e., the 99th percentile of
the normal curve. In this case z = 2.326. We are choosing the area to the right because our
alternative hypothesis is H_1 : p_T − p_C > 0. This z-value is called the critical value for a 1%
level of significance. Now, when we carry out the experiment, any value of the test statistic
2 In 2004, on the basis of statistical evidence, Merck pulled VIOXX© off the market.
5.2. THE BASICS OF HYPOTHESIS TESTING 107
z = (observed − expected)/SE ≥ 2.326 will result in our rejecting the null, and we know the corresponding
p-value must be less than α = 0.01. The advantage of this approach is that we know
the value of the test statistic we need to reject the null before we carry out the experiment.
Example 5.2 We have a coin which we think is not fair. That means we think p ≠ 0.5, where
p is the probability of flipping a head. Suppose we toss the coin 100 times and get 60 heads.
The sample proportion of heads we obtain is p̂ = 0.6. Thus, our hypothesis test is H_0 : p = 0.5
vs. H_1 : p > 0.5 because we got 60% heads. We start off assuming it is a fair coin.
Now suppose we are given a level of significance α = 0.05. We know, using the normal
approximation, that the sample proportion P̂ ∼ N(0.5, √(0.5 · 0.5/100)). The critical value for
α = 0.05 is the z-value giving 0.05 area to the right of z, and this is z = 1.645. This is the
critical value for this test. The value of the test statistic is z = (0.6 − 0.5)/0.05 = 2.0. Consequently, since
2.0 > 1.645, we conclude that we may reject the null at the 5% level of significance.
The p-value approach gives more information since it actually tells us the chance of being
wrong assuming the null. In this case the p-value is P(P̂ ≥ 0.6) = normalcdf(0.6, ∞, 0.5, 0.05) =
0.0227, or by standardizing,

P(P̂ ≥ 0.6 | p = 0.5) = P( (P̂ − 0.5)/√(0.5 · 0.5/100) ≥ (0.6 − 0.5)/√(0.5 · 0.5/100) )
≈ P(Z ≥ 2.0) = 0.0227.
We have calculated the chance of getting 60 or more heads if it is a fair coin as only about 2%, so we have
significant evidence against the null. Again, since 0.0227 < 0.05, we reject the null and conclude
it is not a fair coin. We could be wrong, but the chance of being wrong is only about 2%. Another
important point is our choice of null H_0 : p = 0.5, which specifies exactly what p should be for
this coin if it's fair.
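The calculation in Example 5.2 is easy to check numerically. The sketch below uses only Python's standard library; the helper name `one_prop_ztest` is our own.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_ztest(x, n, p0):
    """One-sided z-test of H0: p = p0 vs. H1: p > p0.
    Returns the z statistic and the p-value P(Z >= z)."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)          # standard error under the null
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)     # area to the right of z
    return z, p_value

z, p = one_prop_ztest(60, 100, 0.5)       # 60 heads in 100 tosses
print(z, p)  # z = 2.0, p ≈ 0.0227: reject H0 at the 5% level
```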
H_0 : θ = θ_0 vs. H_1 : θ ≠ θ_0.    (5.1)

If (l(X_1, X_2, ..., X_n), u(X_1, X_2, ..., X_n)) is a 100(1 − α)% confidence interval for θ, we
reject H_0 when θ_0 falls outside the interval, i.e., if

θ_0 ≤ l(X_1, X_2, ..., X_n), (θ_0 appears to be too small),

or if

θ_0 ≥ u(X_1, X_2, ..., X_n), (θ_0 appears to be too large).

As before,

P( θ_0 ∉ (l(X_1, ..., X_n), u(X_1, ..., X_n)) | θ = θ_0 ) = α.

That is, the probability of rejecting the null hypothesis when the null is true (θ = θ_0) is α. The
form of the pivotal quantity for the CI under the null hypothesis is called the test statistic.
The value of α is called the significance level (or simply level) of the test, and this is the
largest probability we are willing to accept for making an error in rejecting a true null. Rejecting
the null hypothesis when the null is true is commonly called a Type I error. A Type II error
occurs when we do not reject the null hypothesis when it is false. The probability of making a
Type II error is denoted β. To summarize,
                              Conclusion
                    Retain H_0                        Reject H_0
Actually H_0 true   Correct                           Type I error; probability α
         H_0 false  Type II error; probability β      Correct
It turns out that the value of β is not fixed like the value of α and depends on the value of
θ taken in the alternative hypothesis H_1 : θ = θ_1 ≠ θ_0. In general, the value of β is different
for every choice of θ_1 ≠ θ_0, and so β = β(θ_1) is a function of θ_1.
Remark 5.3 A Type I error rejects a true null, while a Type II error does not reject a false null.
In general, making α small is of primary importance. Here's an analogy to explain why.
Suppose a male suspect has been arrested for murder. He is guilty or not guilty. In the U.S., the
null hypothesis is H_0 : suspect is not guilty, with alternative H_1 : suspect is guilty. A Type I
error, measured by α, says that the suspect is found guilty by a jury, but he is really not guilty. A
5.3. HYPOTHESES TESTS FOR ONE PARAMETER 109
Type II error, measured by β, says that the suspect is found not guilty by the jury, when, in fact,
he is guilty. Both of these are errors, but finding an innocent man guilty is considered the more
serious error. That’s one of the reasons we focus on keeping ˛ small.
The set of hypotheses in (5.1) is called a two-sided test. In such a test, we are interested
in deciding if θ = θ_0 or not. Sometimes we are only interested in testing H_0 : θ = θ_0
vs. H_1 : θ < θ_0 or vs. H_1 : θ > θ_0. These types of tests are called one-sided. Often one decides
the form of the alternative on the basis of the result of the experiment. For example, in our coin
tossing experiment we obtained p̂ = 0.6, so a reasonable alternative is H_1 : p > 0.5.
To construct a test of hypotheses for one-sided tests using the critical value approach, we
simply use one-sided confidence intervals instead of two-sided. Specifically, for a given level
of significance α, for the alternative hypothesis H_1 : θ < θ_0, we will reject H_0 : θ = θ_0 in favor
of H_1 if θ_0 is not in the 100(1 − α)% confidence interval (−∞, u(X_1, X_2, ..., X_n)) for θ. In
particular, we reject the null hypothesis if θ_0 ≥ u(x_1, ..., x_n), where X_1 = x_1, X_2 = x_2, ..., X_n = x_n
are the observed sample values. For the alternative hypothesis H_1 : θ > θ_0, we reject H_0 if
θ_0 ≤ l(x_1, ..., x_n). Finally, the set of real numbers S for which we reject H_0 if θ_0 ∈ S is
called the critical region of the test. To summarize, we reject the null hypothesis if
H_1 : μ < μ_0, reject H_0 if z ≤ −z_α
H_1 : μ > μ_0, reject H_0 if z ≥ z_α        (5.2)
H_1 : μ ≠ μ_0, reject H_0 if |z| ≥ z_{α/2}.
H_1 : σ² < σ_0², reject H_0 if χ² ≤ χ²(n − 1, 1 − α)
H_1 : σ² > σ_0², reject H_0 if χ² ≥ χ²(n − 1, α)        (5.5)
H_1 : σ² ≠ σ_0², reject H_0 if χ² ≤ χ²(n − 1, 1 − α/2) or χ² ≥ χ²(n − 1, α/2).

Recall that χ²(n, α) is the critical value of the χ² distribution with n degrees of freedom. The
area under the χ² pdf to the right of the critical value is α.
We present several examples illustrating these tests.
We are using the T statistic because we do not know the population SD and so must use the
sample SD. Since α = 0.05, the area below the critical value must be 0.975. The critical value
is therefore t(24, 0.025) = invT(0.975, 24) = 2.064. For the two-sided test we reject H_0 if
t ∉ (−2.064, 2.064), but our value of t = −0.116 is clearly in the interval. We conclude that we
cannot reject the null at the 0.05 level of significance and state that the result in our experiment
could be due to chance. Whenever we have such a result, it is said that we do not reject the
null or that the null is plausible. It is never phrased that we accept the alternative. If we want
to calculate the p-value we need P(t(24) ≤ −0.116) + P(t(24) ≥ 0.116) = 0.9086 > α, and so
there is a large probability we will make an error if we reject the null.
If we had decided to use the one-sided alternative H_1 : μ < 7 on the basis that our observed
sample mean is x̄ = 6.3, our p-value would turn out to be P(t(24) < −0.116) = 0.4545,
and we still could not reject the null.
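The two t p-values above can be reproduced with a short computation; this sketch uses `scipy.stats.t`, and the helper name is ours.

```python
from scipy.stats import t

def two_sided_t_pvalue(t_obs, df):
    """Two-sided p-value P(t(df) <= -|t_obs|) + P(t(df) >= |t_obs|)."""
    return 2 * t.sf(abs(t_obs), df)   # sf is the upper-tail area 1 - cdf

p_two = two_sided_t_pvalue(-0.116, 24)   # two-sided test of the mean
p_one = t.cdf(-0.116, 24)                # one-sided alternative H1: mu < 7
print(p_two, p_one)  # close to the 0.9086 and 0.4545 quoted in the text
```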
Next we present an example in which we have a two-sided test for the variance.
Example 5.5 A certain plumbing supply company produces copper pipe in various fixed
lengths and asserts that the standard deviation of the lengths is σ = 1.3 cm. The mean length
is assumed to be unknown. A contractor who is a client of the company decides to
test this claim by taking a random sample of 30 pipes and measuring their lengths. The standard
deviation of the sample turned out to be s = 1.1 cm. The contractor tests the hypotheses
H_0 : σ = 1.3 vs. H_1 : σ ≠ 1.3, and sets the level of the test to be α = 0.01.
Therefore, under an assumption of normality of the lengths, we have the test statistic

χ² = (n − 1)S_X²/σ_0² ∼ χ²(n − 1), with observed value χ² = 29(1.1)²/(1.3)² = 20.76.
Example 5.6 At a certain food processing plant, 36 measurements of the volume of tomato
paste placed in cans by filling machines were made. A production manager at the plant thinks
that the data support her belief that the mean is greater than 12 oz., causing lower company
profits. Because the sample size n = 36 > 30 is large, we assume normality. The sample mean
is x̄_36 = 12.19 oz. for the data. The standard deviation is unknown, but the sample standard
deviation is s_X = 0.11. The manager sets up the test of hypotheses H_0 : μ = 12 vs. H_1 : μ > 12.
The observed value of the test statistic is

t = (12.19 − 12)/(0.11/√36) = 10.36.

According to (5.3), she should reject H_0 : μ = 12 if t ≥ t(35, 0.05) = 1.69. The critical region
is [1.69, ∞). Since t = 10.36, the manager rejects the null. The machines that fill the cans must
be recalibrated to dispense the correct amount of tomato paste.
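The manager's test can be run from the summary statistics alone; the following is a sketch with our own variable names, using `scipy.stats.t`.

```python
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 36, 12.19, 0.11, 12.0
t_obs = (xbar - mu0) / (s / sqrt(n))   # one-sample t statistic
crit = t.ppf(0.95, n - 1)              # critical value t(35, 0.05)
p_value = t.sf(t_obs, n - 1)           # P(t(35) >= t_obs)
print(t_obs, crit)  # t ≈ 10.36 exceeds 1.69, so reject H0
```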
Example 5.7 Hypothesis test for the median. We introduce a test for the median m of a
continuous random variable X. Recall that the median is often a better measure of centrality
of a distribution than the mean. Suppose that x_1, x_2, ..., x_n is a random sample of n observed
values from X. Consider the test of hypotheses H_0 : m = m_0 vs. H_1 : m > m_0. Assume that
x_i ≠ m_0 for every i. Let S be the random variable that counts the number of values greater than
m_0 (considered successes). If H_0 is true, then by definition of the median, S ∼ Bin(n, 0.5). If
S is too large, for example if S > k for some k, H_0 should be rejected. Since the probability of
committing a Type I error must be at most α, we choose the minimum value of k so that

Σ_{i=0}^{k} (n choose i) (0.5)^n ≥ 1 − α.
To illustrate the test, consider the following situation. Non-Hodgkins Lymphoma (NHL)
is a cancer that starts in cells called lymphocytes which are part of the body’s immune system.
Twenty patients were diagnosed with the disease in 2007. The survival time (in weeks) for each
of these patients is listed in the table below.
We want to test the hypothesis H_0 : m = 520 vs. H_1 : m > 520 at the α = 0.05 level and find
the p-value of the test. From the data we observe S = 7 data points above 520. Relevant values
of Binom(20, 0.5) are shown in the table below.

k     P(S ≤ k | H_0 is true)
7     0.1316
12    0.8684
13    0.9423
14    0.9793
15    0.9941

From this table we see that the smallest k for which P(S > k | H_0 is true) ≤ α is 14. Therefore
we reject H_0 if S > 14. Since S = 7, we retain H_0. As for the p-value we have

p-value = P(S ≥ 7) = 1 − P(S < 7) = 1 − Σ_{i=0}^{6} (15 choose i) (1/2)^{15} = 0.30362.
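The binomial table and the cutoff k can be reproduced directly; the sketch below uses n = 20 as in the table, and the helper names are ours.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(S <= k) for S ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, alpha = 20, 0.05
# smallest k with P(S > k) <= alpha, i.e., with P(S <= k) >= 1 - alpha
k_star = min(k for k in range(n + 1) if 1 - binom_cdf(k, n) <= alpha)
print(binom_cdf(7, n), k_star)  # ≈ 0.1316 and k* = 14
```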
H_0 : θ = θ_0 vs. H_1 : θ > θ_0.
Table 5.1: Conclusions for p-values
Let Θ denote the test statistic under the null hypothesis, that is, when θ = θ_0, and let θ̂ be the
value of Θ when the sample data values X_1 = x_1, X_2 = x_2, ..., X_n = x_n are substituted for the
random variables. The

p-value of θ̂ is P(Θ ≥ θ̂),

which is the probability of obtaining a value of the test statistic Θ as or more extreme than the
value that was obtained from the data. For example, suppose that Θ ∼ Z when θ = θ_0. We
would reject H_0 if p = P(Z ≥ θ̂) < α, where α is the specified minimum level of significance
required to reject the null. We may choose α according to Table 5.1. On the other hand, if p ≥ α,
then we retain, or do not reject, H_0.
The p-value of our observed statistic value θ̂ determines the significance of θ̂. If p is the
p-value of θ̂, then for all test levels α such that α > p we reject H_0. For all test levels α ≤ p, we
retain H_0. Therefore, p can be viewed as the smallest test level for which the null hypothesis is
rejected.
If ˛ is not specified ahead of time, we use Table 5.1 to reach a conclusion depending on
the resulting p-value.
In a similar way, the alternative H_1 : θ < θ_0 is a one-sided test, and the p-value of the
observed test statistic is p = P(Θ ≤ θ̂). If p < α, we reject the null. The level of significance
of the test statistic is the p-value p. Finally, if the alternative hypothesis is two-sided, namely
H_1 : θ ≠ θ_0, then the p-value is given by

p = P(Θ ≥ |θ̂|) + P(Θ ≤ −|θ̂|).
Example 5.10 A manufacturer of e-cigarettes (e-cigs) claims that the variance of the nicotine
content of its cigarettes is less than 0.77 mg. The sample variance in a random sample of 25
of the company's e-cigs turned out to be 0.41 mg. A health professional would like to know if
there is enough evidence to reject the company's claim. The hypothesis test is H_0 : σ² = 0.77
vs. H_1 : σ² < 0.77. Under an assumption of normality, the test statistic is

χ² = (n − 1)S_X²/σ_0² ∼ χ²(n − 1),

with observed value χ² = 24(0.41)/0.77 = 12.779. For the one-sided alternative the p-value is
P(χ²(24) ≤ 12.779) = 0.0303 < 0.05, so the one-sided test rejects the null. Now consider instead
the two-sided alternative H_1 : σ² ≠ 0.77.
The p-value is therefore p = 2 min{0.0303, 0.9606} = 0.0606. Since p > 0.05, the result is not
statistically significant, and we do not reject the null. This example illustrates how a one-sided
alternative may lead to rejection of the null, but a two-sided alternative does not reject the null.
If we use the critical value approach with α = 0.05, we calculate the critical values from
P(χ²(24) ≤ a) = 0.025, which implies a = 12.401, and P(χ²(24) ≤ b) = 0.975, which implies
b = 39.364. The two-sided critical region is (0, 12.401] ∪ [39.364, ∞). Since our observed value
χ² = 12.779 is not in the critical region, we do not reject the null at the 5% level of significance.
There is insufficient evidence to conclude that the variance is not 0.77.
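The tail probabilities and critical values for this variance test can be sketched as follows, using `scipy.stats.chi2` (helper names ours):

```python
from scipy.stats import chi2

n, s2, sigma0_sq = 25, 0.41, 0.77
stat = (n - 1) * s2 / sigma0_sq          # observed chi-square statistic
df = n - 1
p_lower = chi2.cdf(stat, df)             # one-sided p-value for H1: sigma^2 < sigma0^2
p_two = 2 * min(p_lower, 1 - p_lower)    # two-sided p-value
a, b = chi2.ppf(0.025, df), chi2.ppf(0.975, df)   # two-sided critical values
print(stat, p_lower, p_two, (a, b))
# stat ≈ 12.779, one-sided p ≈ 0.030; critical region (0, 12.401] ∪ [39.364, ∞)
```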
(X − np)/√(np(1 − p)) = (X̄ − p)/√(p(1 − p)/n) = Z ∼ N(0, 1) (approximate).
Before deriving the confidence interval for p, the approximate 100(1 − α)% confidence interval

( x̄ − z_{α/2} √(p(1 − p)/n), x̄ + z_{α/2} √(p(1 − p)/n) )

was obtained, but it was noted that this interval cannot be computed from the sample values since
p itself appears in the endpoints of the interval. Recall that this problem is solved by replacing p
with p̂ = x̄. However, in a hypothesis test, we have a null H_0 : p = p_0 in which we assume the
given population proportion is p_0, and this is exactly what we substitute whenever a value of
p is required. Here's how it works.
Consider the two-sided test of hypotheses H_0 : p = p_0 vs. H_1 : p ≠ p_0. We take a random
sample and calculate the sample proportion p̂. Since p = p_0 under the null hypothesis,

(X̄ − p_0)/√(p_0(1 − p_0)/n) ∼ Z (approximate).

This is where we use the assumed value of p under the null. For a given significance level α, we
will reject H_0 if

x̄ = p̂ ≥ p_0 + z_{α/2} √(p_0(1 − p_0)/n)  or  x̄ = p̂ ≤ p_0 − z_{α/2} √(p_0(1 − p_0)/n).

The p-value of the test is

P( |Z| ≥ |p̂ − p_0| / √(p_0(1 − p_0)/n) ).

We reject H_0 if this p-value is less than α.
Example 5.11 One way to cheat at dice games is to use loaded dice. One possibility is to
make a 6 come up more than expected by making the 6-side weigh less than the 1-side. In
an experiment to test this possible loading, 800 rolls of a die produced 139 occurrences of a 6.
Consider the test of hypotheses

H_0 : p = 1/6 vs. H_1 : p ≠ 1/6,

where p is the probability of obtaining a 6 on a roll of the die. Computing the value of the test
statistic with p̂ = 139/800 = 0.17375, we get

z = (p̂ − p_0)/√(p_0(1 − p_0)/n) = (139/800 − 1/6)/√((1/6)(5/6)/800) = 0.53759.
5.4. HYPOTHESES TESTS FOR TWO POPULATIONS 117
The corresponding p-value is P(|Z| ≥ 0.53759) = 0.5909. We do not reject H_0, and we do not
have enough evidence that the die is loaded. How many 6s need to come up so that we can
say it is loaded? Set α = 0.05. Then we need P(|Z| ≥ z) < 0.05, which implies |z| > 1.96. Since
z = (p̂ − 0.1667)/0.01317, substitute p̂ = n/800 and solve the inequality for n. We obtain n ≥ 154 or
n ≤ 112. Therefore, if we roll at least 154 6s or at most 112 6s, we can conclude the die is loaded.
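This calculation is easy to check numerically; a sketch with our own variable names, using only the standard library:

```python
from math import sqrt
from statistics import NormalDist

p0, n_rolls, sixes = 1/6, 800, 139
se = sqrt(p0 * (1 - p0) / n_rolls)             # SE of the sample proportion under H0
z = (sixes / n_rolls - p0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
# counts of 6s that would be statistically significant at alpha = 0.05
hi = (p0 + 1.96 * se) * n_rolls
lo = (p0 - 1.96 * se) * n_rolls
print(round(z, 5), round(p_value, 4), lo, hi)
# z ≈ 0.53759, p ≈ 0.5909; reject only for roughly <= 112 or >= 154 sixes
```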
One-sided tests for a proportion are easily constructed. For the tests
Example 5.12 Suppose an organization claims that 75% of its members have IQs over 135.
Test the hypothesis H_0 : p = 0.75 vs. H_1 : p < 0.75. Suppose 10 members are chosen at random
and their IQs are measured. Exactly 5 had an IQ over 135. Under the null, we have

P( Z ≤ (0.5 − 0.75)/√(0.75 · 0.25/10) ) = P(Z ≤ −1.825) = 0.034.

Since 0.034 < 0.05, our result is statistically significant, and we may reject the null. The evidence
supports the alternative.
There is a problem with this analysis in that np = 10(0.75) = 7.5 > 5, but n(1 − p) =
10(0.25) = 2.5 < 5, so that the use of the normal approximation is debatable. Let's use the
exact binomial distribution to analyze this instead. If X is the number of members in a sample
of size 10 with IQs above 135, we have X ∼ Binom(10, 0.75) under the null. Therefore, P(X ≤
5) = binomcdf(10, 0.75, 5) = 0.078 is the exact probability of getting 5 or fewer members with
IQs above 135. Since 0.078 > 0.05, we do not have enough evidence to reject the null, and the
result is not statistically significant.
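The contrast between the normal approximation and the exact binomial p-value can be sketched like this (helper names ours):

```python
from math import comb, sqrt
from statistics import NormalDist

n, p0, observed = 10, 0.75, 5

# normal approximation: P(Z <= (p_hat - p0)/SE) under the null
z = (observed / n - p0) / sqrt(p0 * (1 - p0) / n)
p_normal = NormalDist().cdf(z)

# exact binomial: P(X <= 5) for X ~ Binom(10, 0.75)
p_exact = sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(observed + 1))

print(round(p_normal, 3), round(p_exact, 3))
# ≈ 0.034 vs. 0.078: the exact test does not reject at the 5% level
```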
The test statistic depends on the standard deviations of the populations or the samples. Here
is a summary of the results. We use the critical value approach to state the results. Computing
p-values is straightforward.
S_p² = [ (m − 1)S_X² + (n − 1)S_Y² ] / (m + n − 2).
ν = ⌊ ( (1/m) r + 1/n )² / ( r²/(m²(m − 1)) + 1/(n²(n − 1)) ) ⌋,  where r = s_X²/s_Y².
H_1 : μ_X − μ_Y < d_0, reject H_0 if t ≤ −t(ν, α)
H_1 : μ_X − μ_Y > d_0, reject H_0 if t ≥ t(ν, α)
H_1 : μ_X − μ_Y ≠ d_0, reject H_0 if |t| ≥ t(ν, α/2).
Paired Samples
If n = m and the two samples X_1, ..., X_n and Y_1, ..., Y_n are not independent, we consider
the differences D_i = X_i − Y_i and treat this as a one-sample t-test because we assume σ_D is
unknown. The test statistic is

T = (D̄_n − d_0)/(S_D/√n) ∼ t(n − 1) if H_0 is true.

If t is the observed score, then for the alternative

H_1 : μ_D < d_0, reject H_0 if t ≤ −t(n − 1, α)
H_1 : μ_D > d_0, reject H_0 if t ≥ t(n − 1, α)
H_1 : μ_D ≠ d_0, reject H_0 if |t| ≥ t(n − 1, α/2).
H_1 : r < r_0, reject H_0 if f ≤ F(m − 1, n − 1, 1 − α)
H_1 : r > r_0, reject H_0 if f ≥ F(m − 1, n − 1, α)
H_1 : r ≠ r_0, reject H_0 if f ≤ F(m − 1, n − 1, 1 − α/2) or f ≥ F(m − 1, n − 1, α/2).
Example 5.13 In this example, two popular brands of low-fat, Greek-style yogurt, Artemis
and Demeter, are compared. Let X denote the weight (in grams) of a container of the Artemis
brand yogurt, and let Y be the weight (also in grams) of a container of the Demeter brand yogurt.
Assume that X ∼ N(μ_X, σ_X) and that Y ∼ N(μ_Y, σ_Y). Nine measurements of the Artemis
yogurt and thirteen measurements of the Demeter yogurt were taken, with the following results.
To decide whether the two population variances are equal, consider H_0 : r = σ_X²/σ_Y² = 1 = r_0
vs. H_1 : r ≠ 1, with test statistic

F = (1/r_0)(S_X²/S_Y²) = S_X²/S_Y²

since r_0 = 1 in this case. From the two samples we compute s_X² = 0.367 and s_Y² = 1.014. Therefore,
f = 0.367/1.014 = 0.3619.
If we set the level of the test to be α = 0.05, then F(8, 12, 0.025) = 3.51 and
F(8, 12, 0.975) = 0.28. Since 0.28 < f < 3.51, we cannot reject H_0, and we may assume the
variances are equal.
Now we want to test if the mean weights of the two brands of yogurt are the same. Con-
sider the hypotheses
H_0 : μ_X − μ_Y = 0 = d_0 vs. H_1 : μ_X − μ_Y ≠ 0.
The pooled sample standard deviation is computed from the two samples as sp D 0:869. The
test statistic is

T = ( (X̄_m − Ȳ_n) − d_0 ) / ( S_p √(1/m + 1/n) ) = (X̄_m − Ȳ_n) / ( S_p √(1/m + 1/n) ) ∼ t(20).

Therefore,

t = (21.03 − 20.89) / ( 0.869 √(1/9 + 1/13) ) = 0.372.
For the same level α = 0.05, we reject H_0 if |t| ≥ t(20, 0.025) = 2.086. Since t = 0.372, we
retain H_0. Also, we may calculate the p-value as P(|T| ≥ 0.372) = 1 − P(|T| ≤ 0.372) = 1 −
tcdf(−0.372, 0.372, 20) = 1 − 0.2862 = 0.7138. We cannot reject the null and conclude there is
no difference between the weights of the Artemis and Demeter brand yogurts.
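Under the stated equal-variance assumption, the pooled test can be sketched as follows (variable names ours):

```python
from math import sqrt
from scipy.stats import t

m, n = 9, 13                      # sample sizes (Artemis, Demeter)
xbar, ybar = 21.03, 20.89         # sample means
sx2, sy2 = 0.367, 1.014           # sample variances
sp = sqrt(((m - 1) * sx2 + (n - 1) * sy2) / (m + n - 2))   # pooled SD
t_obs = (xbar - ybar) / (sp * sqrt(1 / m + 1 / n))
p_value = 2 * t.sf(abs(t_obs), m + n - 2)
print(round(sp, 3), round(t_obs, 3), round(p_value, 3))
# sp ≈ 0.869, t ≈ 0.372, p ≈ 0.714: retain H0
```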
(X̄_m − Ȳ_n − d_0) / √( p_X(1 − p_X)/m + p_Y(1 − p_Y)/n ) ∼ Z (approximate).
If we now approximate p_X and p_Y by X̄_m and Ȳ_n, respectively, the approximation still holds, and
we obtain the test statistic

Z = (X̄_m − Ȳ_n − d_0) / √( X̄_m(1 − X̄_m)/m + Ȳ_n(1 − Ȳ_n)/n ) ∼ N(0, 1).

We encounter the same problem as before, namely, that we do not know the value of p_0. Since
we are assuming that the population proportions are the same, it makes sense to pool the proportions
(that is, form a weighted average) as in

P̄_0 = (mX̄_m + nȲ_n)/(m + n).

Our test statistic becomes

(X̄_m − Ȳ_n) / √( P̄_0(1 − P̄_0)(1/m + 1/n) ) ∼ Z (approximate).
P-values can be easily computed for all of the above tests. As an example, for the test of hypotheses
H_0 : p_X = p_Y vs. H_1 : p_X ≠ p_Y, the p-value is computed as 2P(Z ≥ |z|), where z is the
observed value of the pooled test statistic. For example, consider two groups of sizes m = 1900
and n = 2000, with 1311 and 1440 successes, respectively, and the test

H_0 : p_B = p_G vs. H_1 : p_B ≠ p_G.

Set the level of the test at α = 0.05. The pooled estimate of the common proportion p_0 is

p̂_0 = (mp̂_B + np̂_G)/(m + n) = (1311 + 1440)/3900 = 0.705.

Therefore, the observed value of the test statistic is

z = (1311/1900 − 1440/2000) / √( 0.705(1 − 0.705)(1/1900 + 1/2000) ) = −2.053.

We reject H_0 since |z| = 2.053 ≥ z_{0.025} = 1.96. The p-value of z = −2.053 is easily calculated
to be 0.04.
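The pooled two-proportion test above can be sketched as follows (the counts are those of the example; helper names ours):

```python
from math import sqrt
from statistics import NormalDist

x1, m = 1311, 1900        # successes / size, first sample
x2, n = 1440, 2000        # successes / size, second sample
p_pool = (x1 + x2) / (m + n)                       # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / m + 1 / n))
z = (x1 / m - x2 / n) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value
print(round(p_pool, 3), round(z, 2), round(p_value, 2))
# pooled p ≈ 0.705, z ≈ -2.05, p ≈ 0.04: reject H0 at the 5% level
```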
Example 5.15 We revisit the VIOXX© clinical study introduced at the beginning of the chapter
to obtain the solution in terms of the notation and methods of hypothesis testing. Let p_T
and p_C denote the true proportions of subjects who experience cardiovascular events (CVs)
in the treatment and control groups, respectively. Recall that the test of hypotheses was
H_0 : p_T = p_C vs. H_1 : p_T > p_C. As described in this section, the test statistic is

Z = (T̄_1287 − C̄_1299) / √( T̄_1287(1 − T̄_1287)/1287 + C̄_1299(1 − C̄_1299)/1299 ),

yielding a p-value of P(Z ≥ 2.4305) = 0.00754. Pooling the proportions has no effect on the
significance of the result.
Tables summarizing all the tests for the normal parameters and the binomial proportion
are located in Section 5.8.
In order to quantify the power of a test, we need to be more specific with the alternative hypothesis.
Therefore, for each θ_1 specified in an alternative H_1 : θ = θ_1, we obtain a different
value of β. Consequently, β is really a function of the θ which we specify in the alternative. We
define for each θ_1 ≠ θ_0,

π(θ_1) = P(reject H_0 | θ = θ_1) = 1 − β(θ_1).

The power of a test is the probability of correctly rejecting a false null. For reasons that
will become clear later, we may define π(θ_0) = α.
We now consider the problem of how to compute π(θ) for different values of θ ≠ θ_0. To
illustrate the procedure in a particular case, suppose we take a random sample of size n from a
random variable X ∼ N(μ, σ_0). (σ_0 is known.) Consider the set of hypotheses

H_0 : μ = μ_0 vs. H_1 : μ ≠ μ_0.
Recall that we reject H_0 if |(x̄ − μ_0)/(σ_0/√n)| ≥ z_{α/2}. If, in fact, μ = μ_1 ≠ μ_0, the probability
of a Type II error is

β(μ_1) = P(not rejecting H_0 | μ = μ_1)    (5.6)
= P( |X̄ − μ_0|/(σ_0/√n) < z_{α/2} | μ = μ_1 )
= P( μ_0 − z_{α/2} σ_0/√n < X̄ < μ_0 + z_{α/2} σ_0/√n | μ = μ_1 )
= P( (μ_0 − μ_1)/(σ_0/√n) − z_{α/2} < (X̄ − μ_1)/(σ_0/√n) < (μ_0 − μ_1)/(σ_0/√n) + z_{α/2} | μ = μ_1 )
= P( (μ_0 − μ_1)/(σ_0/√n) − z_{α/2} < Z < (μ_0 − μ_1)/(σ_0/√n) + z_{α/2} )  since (X̄ − μ_1)/(σ_0/√n) ∼ Z
= P( |Z − (μ_0 − μ_1)/(σ_0/√n)| < z_{α/2} ).

We needed to standardize X̄ using μ_1 and not μ_0 because μ_1 is assumed to be the correct
value of the mean. Once we have β(μ_1), the power of the test at μ = μ_1 is then π(μ_1) =
1 − β(μ_1).
Remark 5.17 Notice that

lim_{μ_1 → μ_0} π(μ_1) = 1 − P(|Z| < z_{α/2}) = 1 − (1 − α) = α,

and so it makes sense to define π(μ_0) = α. Also, note that lim_{μ_1 → ∞} π(μ_1) = 1 and
lim_{μ_1 → −∞} π(μ_1) = 1.
Keep in mind that the designer of an experiment wants both α and β to be small. We want
the power of a test to be close to 1 because the power quantifies the ability of a test to detect a
false null.
Example 5.18 A data set of n = 106 observations of the body temperature of healthy adults
was compiled. The standard deviation is known to be σ_0 = 0.62°F. (In this example, degrees are
Fahrenheit.) Consider the set of hypotheses

H_0 : μ = 98.6° vs. H_1 : μ ≠ 98.6°.

Table 5.2 lists selected values of the power function π(μ) for the above test of hypotheses, computed
using the formula derived above with the level set at α = 0.05. For example, if the alternative
is H_1 : μ_1 = 98.75°, the power is π(98.75°) = 0.702, and so there is about a 70% chance
of rejecting a false null if the true mean is 98.75°.
A graph of the power function π(μ) is given below (Fig. 5.1).
5.5. POWER OF TESTS OF HYPOTHESES 125
Table 5.2: Values of π(μ_1), μ_1 ≠ 98.6°

μ_1     98.35 98.40 98.45 98.50 98.55 98.60 98.65 98.70 98.75 98.80 98.85
π(μ_1)  0.986 0.913 0.702 0.382 0.132 0.050 0.132 0.382 0.702 0.913 0.986
[Figure 5.1: Graph of the power function π(μ) for n = 106; the points (98.35°, 0.986) and (98.75°, 0.702) are marked.]

Figure 5.2: Power curves for a two-sided test and different sample sizes (n = 10, 20, 50, 106, 150).
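The entries of Table 5.2 (and the curves in the figures) follow from the formula π(μ_1) = 1 − β(μ_1) derived above; a sketch, with the function name ours:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power(mu1, mu0=98.6, sigma0=0.62, n=106, alpha=0.05):
    """Power of the two-sided z-test at the alternative mu = mu1."""
    z_half = Z.inv_cdf(1 - alpha / 2)
    c = (mu0 - mu1) / (sigma0 / sqrt(n))
    beta = Z.cdf(c + z_half) - Z.cdf(c - z_half)   # P(|Z - c| < z_{alpha/2})
    return 1 - beta

print(round(power(98.75), 3), round(power(98.35), 3))  # 0.702 and 0.986
```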
Suppose we specify the desired power of a test. Can we find the sample size needed to
achieve this? Here's an example of the method for a two-sided test if we specify β. We have from
(5.6)

1 − β = P( |Z − (μ_0 − μ_1)/(σ_0/√n)| ≥ z_{α/2} )
= P( Z ≤ −z_{α/2} + (μ_0 − μ_1)/(σ_0/√n) ) + P( Z ≥ z_{α/2} + (μ_0 − μ_1)/(σ_0/√n) )
= P( Z ≤ −z_{α/2} − |μ_0 − μ_1|/(σ_0/√n) ) + P( Z ≥ z_{α/2} − |μ_0 − μ_1|/(σ_0/√n) ).

For statistically significant levels α, the first term in the sum above is close to 0, and so it makes
sense to equate the power 1 − β with the second term and solve for n, giving us a sample size
that is conservative in the sense that the sum of the two terms above will only be slightly larger
than 1 − β. Doing this, we get

1 − β = P( Z ≥ z_{α/2} − |μ_0 − μ_1|/(σ_0/√n) )
⟹ z_β = −z_{α/2} + |μ_0 − μ_1|/(σ_0/√n)
⟹ n = ⌈ ( σ_0(z_β + z_{α/2})/(μ_0 − μ_1) )² ⌉.
If we take N = ⌈ ( σ_0(z_β + z_{α/2})/(μ_0 − μ_1) )² ⌉, then for all n ≥ N, the power at the alternative
μ_1 ≠ μ_0 will be at least 1 − β, approaching 1 as n → ∞.
Example 5.20 Continuing with the body temperature example once more, for a test at level
α = 0.05, suppose we specify that if |μ − 98.6°| ≥ 0.15°, then we want π(μ) ≥ 0.90. By symmetry
and monotonicity of the power function, we need only require that π(98.75°) = 0.90. Therefore,
with z_{α/2} = z_{0.025} = 1.96, z_β = z_{0.1} = 1.28, μ_0 = 98.6°, μ_1 = 98.75°, and σ_0 = 0.62°, we
have a sample size

n = ⌈ ( 0.62(1.28 + 1.96)/0.15 )² ⌉ = 180.

A sample of at least 180 healthy adults must be tested to ensure that the power of the test at all
alternatives such that |μ − 98.6°| ≥ 0.15° will be at least 0.90.
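The sample-size formula n = ⌈(σ_0(z_β + z_{α/2})/(μ_0 − μ_1))²⌉ is easy to evaluate; a sketch with our own function name:

```python
from math import ceil
from statistics import NormalDist

def sample_size(mu0, mu1, sigma0, alpha=0.05, beta=0.10):
    """Smallest n so the two-sided z-test has power >= 1 - beta at mu1."""
    Z = NormalDist()
    z_half = Z.inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    z_beta = Z.inv_cdf(1 - beta)        # z_beta
    return ceil((sigma0 * (z_beta + z_half) / (mu0 - mu1)) ** 2)

print(sample_size(98.6, 98.75, 0.62))  # 180
```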
π(μ_1) = 1 − β(μ_1)
= 1 − P( (X̄ − μ_0)/(σ_0/√n) < z_α | μ = μ_1 )
= 1 − P( X̄ < μ_0 + z_α σ_0/√n | μ = μ_1 )
= 1 − P( (X̄ − μ_1)/(σ_0/√n) < (μ_0 − μ_1)/(σ_0/√n) + z_α | μ = μ_1 )
= 1 − P( Z < (μ_0 − μ_1)/(σ_0/√n) + z_α )  since (X̄ − μ_1)/(σ_0/√n) ∼ Z
= 1 − P( Z < z_α + (μ_0 − μ_1)/(σ_0/√n) ).
Example 5.22 The diagram in Figure 5.3 displays power curves for one-sided tests for healthy
adult body temperature of the form H_0 : μ = 98.6° vs. H_1 : μ > 98.6°, for various choices of
sample size. Notice that for μ < 98.6°, the power of the test diminishes rapidly.

[Figure 5.3: Power curves for the one-sided test and sample sizes n = 10, 20, 50, 106, 150.]
(X_1 − np_1)/√(np_1(1 − p_1)) ∼ N(0, 1)  and  (X_2 − np_2)/√(np_2(1 − p_2)) ∼ N(0, 1).
However, these two distributions are not independent. (Why?) We also know from Chapter 2
that Z ∼ N(0, 1) implies Z² ∼ χ²(1). Now we calculate
D_1 = (X_1 − np_1)²/(np_1(1 − p_1)) ∼ χ²(1) (approximately):

D_1 = (1 − p_1)(X_1 − np_1)²/(np_1(1 − p_1)) + p_1(X_1 − np_1)²/(np_1(1 − p_1))
= (X_1 − np_1)²/(np_1) + (n − X_2 − np_1)²/(np_2)
= (X_1 − np_1)²/(np_1) + (n(1 − p_1) − X_2)²/(np_2)
= (X_1 − np_1)²/(np_1) + (np_2 − X_2)²/(np_2)
= (X_1 − np_1)²/(np_1) + (X_2 − np_2)²/(np_2).
Therefore, the sum of the two related distributions is approximately distributed as χ²(1). If
they were independent, D_1 would be approximately distributed as χ²(2). This result can be
generalized in a natural way. Specifically, if (X_1, X_2, ..., X_k) ∼ Multin(n, p_1, p_2, ..., p_k), then
it can be shown that as n → ∞,

D_{k−1} = Σ_{i=1}^{k} (X_i − np_i)²/(np_i) → χ²(k − 1),

although the proof is beyond the scope of this text. A common rule of thumb is that n should
be large enough so that np_i ≥ 5 for each i to guarantee that the approximation is acceptable.
The statistic D_{k−1} is called Pearson's D statistic. The subscript k − 1 is one less than the number
of terms in the sum.
We now discuss how the statistic D_{k−1} might be used in testing hypotheses. Consider
an experiment with a sample space S, and suppose S is partitioned into disjoint events, i.e.,
S = ∪_{i=1}^{k} A_i, where A_1, A_2, ..., A_k are mutually disjoint subsets such that P(A_i) = p_i. Clearly,
p_1 + p_2 + ... + p_k = 1. If the experiment is repeated independently n times, and X_i is defined
to be the number of times that A_i occurs, then (X_1, X_2, ..., X_k) ∼ Multin(n, p_1, p_2, ..., p_k).
We want to know if the experimental results for the proportion of time event A_i occurs match
some prescribed proportion p_{0,i} for each i = 1, 2, ..., k. The test of hypotheses for given probabilities
p_{0,1}, ..., p_{0,k} is

H_0 : p_i = p_{0,i} for all i = 1, ..., k  vs.  H_1 : p_i ≠ p_{0,i} for at least one i.
This is now set up for the χ² goodness-of-fit test. If H_0 is true, then (X_1, X_2, ..., X_k) ∼
Multin(n, p_{0,1}, p_{0,2}, ..., p_{0,k}), and for large enough n, D_{k−1} is distributed approximately as

D_{k−1} = Σ_{i=1}^{k} (X_i − np_{0,i})²/(np_{0,i}) ∼ χ²(k − 1).

Since each X_i represents the observed frequency of the observations in A_i and np_{0,i} is the
expected frequency of the observations in A_i, if the null hypothesis is true, we would expect the
value of D_{k−1} to be small. We should reject the null hypothesis if the value of D_{k−1} appears
to be too large. To make sure that we commit a Type I error at most 100α% of the time, we
need

P( D_{k−1} ≥ χ²(k − 1, α) ) = P( Σ_{i=1}^{k} (X_i − np_{0,i})²/(np_{0,i}) ≥ χ²(k − 1, α) ) = α.
The observed and expected frequencies are listed in the table below. The value of D_3 is
d_3 = 3.4118. Set α = 0.05. Since d_3 < χ²(3, 0.05) = 7.815, the null hypothesis is not rejected, and there is
not enough evidence to claim that the die is not fair. In addition, the p-value of the test is
P(D_3 ≥ 3.4118) = 0.3324 > 0.05.
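Given an observed value of Pearson's statistic, the critical value and p-value come from the χ² distribution; a sketch with `scipy.stats.chi2` (variable names ours):

```python
from scipy.stats import chi2

d3, df, alpha = 3.4118, 3, 0.05        # observed statistic and its degrees of freedom
critical = chi2.ppf(1 - alpha, df)     # chi^2(3, 0.05)
p_value = chi2.sf(d3, df)              # P(D3 >= 3.4118)
print(round(critical, 3), round(p_value, 4))  # 7.815 and 0.3324
```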
Example 5.24 The printer's proofs of a new 260-page probability and statistics book contain
typographical errors (or not) on each page. The number of pages on which i errors occurred is
given in the table below.

Num of Errors 0  1  2  3  4 5 6
Num of Pages 77 90 55 30 5 3 0

Suppose we conjecture that these 260 values resulted from sampling a Poisson random variable
X with λ = 2. The random variable X ∼ Poisson(λ) if P(X = i) = e^{−λ} λ^i / i!, i = 0, 1, 2, .... The
hypothesis test is

H_0 : X ∼ Poisson(2) vs. H_1 : X is not Poisson(2).
To apply our method we must first compute the probabilities P .X D i/, i D 0; : : : ; 5, and also
P .X 6/ since the Poisson random variable takes on every nonnegative integer value. The
results are listed below.
132 5. HYPOTHESIS TESTING
i    P(X = i) = e⁻² 2ⁱ/i!       Expected Frequency 260·P(X = i)
0    0.13534                     35.188
1    0.27067                     70.374
2    0.27067                     70.374
3    0.18045                     46.917
4    0.09022                     23.457
5    0.03609                      9.383
≥6   1 − P(X ≤ 5) = 0.01656       4.306
Since the expected frequency of at least 6 errors is less than 5, we must combine the entries for i = 5 and i ≥ 6. Our revised table is displayed below. The value of D₅ is

d₅ = (77 − 35.188)²/35.188 + (90 − 70.374)²/70.374 + (55 − 70.374)²/70.374
   + (30 − 46.917)²/46.917 + (5 − 23.457)²/23.457 + (3 − 13.689)²/13.689 = 87.484.
Let α = 0.01. Since d₅ ≥ χ²(5, 0.01) = 15.086, we reject H₀. Another way to see this is by calculating the p-value P(D₅ ≥ 87.484) = 2.26 × 10⁻¹⁷ < 0.01. It is extremely unlikely that the data is generated by a Poisson random variable with λ = 2.
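A sketch of this computation in Python (SciPy assumed available), with the cells for i = 5 and i ≥ 6 combined exactly as in the text:

```python
# Chi-square test from Example 5.24: do the page-error counts come from
# a Poisson(2) distribution?  The last cell combines i = 5 and i >= 6
# so that every expected frequency is at least 5.
import numpy as np
from scipy import stats

pages = 260
observed = np.array([77, 90, 55, 30, 5, 3])          # last cell: i >= 5 (3 + 0 pages)

lam = 2.0
probs = np.append(stats.poisson.pmf(np.arange(5), lam),  # P(X = 0), ..., P(X = 4)
                  stats.poisson.sf(4, lam))              # P(X >= 5)
expected = pages * probs

d, p_value = stats.chisquare(observed, f_exp=expected)
# d is approximately 87.48 on 5 degrees of freedom, so H0 is rejected
```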
Example 5.25 Continuing with the previous example, you may question why we took λ = 2. Actually it was just a guess. A better way is to estimate the value of λ directly from the data instead of attempting to guess it. Since the expectation of a Poisson random variable X with parameter λ is E(X) = λ, we can estimate λ from the sample mean as

λ̂ = x̄ = (77·0 + 90·1 + 55·2 + 30·3 + 5·4 + 3·5)/260 = 1.25.

The estimated expected frequencies with this value of λ are listed below.
5.6. MORE TESTS OF HYPOTHESES 133
i    P(X = i) = e^{−1.25} 1.25ⁱ/i!   Expected Frequency 260·P(X = i)
0    0.2865                           74.490
1    0.3581                           93.106
2    0.2238                           58.188
3    0.0933                           24.258
4    0.0291                            7.566
5    0.0073                            1.898
≥6   1 − P(X ≤ 5) = 0.0019             0.494
We must combine the entries for i = 4, i = 5, and i ≥ 6, replacing the bottom three rows with i ≥ 4, P(X ≥ 4) = 0.0383, and expected frequency 9.958.
Since one of the parameters, λ, had to be estimated using the sample values, it turns out that under the null hypothesis, the D statistic loses a degree of freedom. That is, D₄ ~ χ²(3). The value of D₄ is now computed as d₄ = 2.107.
Take α = 0.01. Since d₄ < χ²(3, 0.01) = 11.345 or, calculating P(D₄ ≥ 2.107) = 0.55049 > α, we do not reject H₀. It is plausible that the population is described by a Poisson random variable with λ = 1.25.
Remark 5.26 The previous example addresses an important issue. If any of the parameters of the proposed distribution must be estimated, then the D statistic under H₀ loses degrees of freedom. In particular, if the proposed distribution has r parameters that must be estimated from the data, then D_{k−1} ~ χ²(k − 1 − r) under the null hypothesis. The proof is fairly involved and therefore is omitted.
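SciPy's `chisquare` exposes exactly this degrees-of-freedom reduction through its `ddof` argument. The sketch below redoes Example 5.25 with the estimated λ = 1.25 and one lost degree of freedom:

```python
# When r parameters are estimated from the data, the reference distribution
# is chi^2(k - 1 - r).  scipy.stats.chisquare handles this via `ddof`.
# The counts re-use Example 5.25 (lambda estimated, so r = 1), with the
# bottom cells combined into i >= 4.
import numpy as np
from scipy import stats

observed = np.array([77, 90, 55, 30, 8])     # last cell: i >= 4 (5 + 3 + 0 pages)

lam = 1.25                                   # estimated from the data
probs = np.append(stats.poisson.pmf(np.arange(4), lam),  # P(X = 0..3)
                  stats.poisson.sf(3, lam))               # P(X >= 4)
expected = observed.sum() * probs

# ddof=1 makes the p-value use chi^2(k - 1 - 1) = chi^2(3)
d, p_value = stats.chisquare(observed, f_exp=expected, ddof=1)
```

The statistic comes out near the text's d₄ = 2.107 with p-value near 0.55.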
H₀: the data are generated from Unif[0, 1] vs. H₁: the data are not generated from Unif[0, 1].
The adjective “contingent” is often used to describe the situation when an event can occur only
when some other event occurs first. For example, earning a high salary in the current economy is
contingent upon finding a high-tech job. In this sense, a contingency table’s function is to reveal
some type of dependency or relatedness between the traits.
We know that

(X₁₁, X₁₂, …, X_{1c}, …, X_{r1}, …, X_{rc}) ~ Multin(n, p₁₁, p₁₂, …, p_{1c}, …, p_{r1}, …, p_{rc}),

and for large enough n, D_{rc−1} is approximately distributed as

D_{rc−1} = ∑_{i=1}^{r} ∑_{j=1}^{c} (X_{ij} − n p_{ij})² / (n p_{ij}) ~ χ²(rc − 1).
This result can be used in a test of hypotheses for the independence of the traits (random variables) A and B. Clearly, traits A and B are independent if P(Aᵢ | Bⱼ) = P(Aᵢ) for every i and j. We know that independence of events, in this case traits, is equivalent to P(Aᵢ ∩ Bⱼ) = P(Aᵢ)P(Bⱼ) for every i and j. If traits A and B are independent, the null should hold. Again, the null is formulated to assume independence because otherwise we have no way to account for any level of dependence.
Let the row probabilities and column probabilities be denoted, respectively, by p_{i·} = P(Aᵢ) and p_{·j} = P(Bⱼ). The set of hypotheses above can now be rewritten more compactly as

H₀: p_{ij} = p_{i·} p_{·j} for all i and j vs. H₁: p_{ij} ≠ p_{i·} p_{·j} for at least one pair (i, j).
In other words, the null states that the probability of each cell should be the product of the corresponding row and column probabilities. Under the null hypothesis, D_{rc−1} is approximately distributed as

D_{rc−1} = ∑_{i=1}^{r} ∑_{j=1}^{c} (X_{ij} − n p_{i·} p_{·j})² / (n p_{i·} p_{·j}) ~ χ²(rc − 1).
The problem is that we do not know any of the probabilities involved here. Our only course of action is to estimate them from the sample values. Defining

X_{i·} = ∑_{j=1}^{c} X_{ij} = row sum and X_{·j} = ∑_{i=1}^{r} X_{ij} = column sum,

we estimate p̂_{i·} = X_{i·}/n and p̂_{·j} = X_{·j}/n and use the statistic with the estimated expected frequencies n p̂_{i·} p̂_{·j}. Because r − 1 + c − 1 parameters are estimated, under H₀ the statistic is approximately χ²((r − 1)(c − 1)).
If traits A and B are really independent, we would expect the value of D_{rc−1} to be small. As before, we will reject H₀ at level α if the value of D_{rc−1} is too large, specifically, if the value is at least χ²((r − 1)(c − 1), α). If we use the p-value approach, we would compute P(χ²((r − 1)(c − 1)) ≥ d), where d is the observed value of the statistic. Finally, as a rule of thumb, the estimated expected frequencies should be at least 5 for each i and j. If not, rows and columns should be collapsed to achieve this requirement.
Example 5.28 Data is collected to determine if political party affiliation is related to whether
or not a person opposes, supports, or is indifferent to water use restrictions in a certain South-
western American city that is experiencing a severe drought. A total of 500 adults who belonged
to one of the two major political parties were contacted in the survey. We would like to know
if a person’s party affiliation and his or her opinion about water restrictions are related. The
hypotheses to be tested are
H0 W party affiliation and water use restriction opinion are independent vs.
H1 W party affiliation and water use restriction opinion are dependent.
The results of the survey are presented in the following contingency table.
                                         Response
Categories                  Approves (A)   Opposes (O)   Indifferent (I)   Row Totals
Party        Democrat (D)   138 (115.14)   64 (84.36)    83 (85.5)         285
Affiliation  Republican (R)  64 (86.86)    84 (63.64)    67 (64.5)         215
Column Totals               202            148           150               500

(Estimated expected frequencies are shown in parentheses.)
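The computation this example calls for can be sketched with SciPy's `chi2_contingency`, which also reports the expected frequencies shown in the table:

```python
# Sketch of the independence test for Example 5.28.  The expected
# frequencies scipy reports match those in the contingency table above.
import numpy as np
from scipy import stats

table = np.array([[138, 64, 83],     # Democrat
                  [ 64, 84, 67]])    # Republican

d, p_value, df, expected = stats.chi2_contingency(table, correction=False)
# df = (r - 1)(c - 1) = 2; a very small p-value indicates that party
# affiliation and water-use opinion are related
```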
Remark 5.29 A note of caution is in order here. The statistic D_{rc−1} is discrete, and we are using a continuous distribution, namely the χ² distribution, to approximate it. If D_{rc−1} is approximated by χ²(1) (for example, for a 2 × 2 contingency table) or when at least one of the estimated expected frequencies is less than 5, a continuity correction has been suggested to improve the approximation, just as we do in using a normal distribution to approximate a binomial distribution. The suggestion is that the D statistic be corrected as

D_{rc−1} = ∑_{i=1}^{r} ∑_{j=1}^{c} (|X_{ij} − n p̂_{i·} p̂_{·j}| − 0.5)² / (n p̂_{i·} p̂_{·j}).
This correction has a tendency however to over-correct and may lead to larger Type II errors.
Example 5.30 A sample of 185 prisoners who experienced trials in a certain criminal jurisdiction was taken, and the results are presented in the 2 × 2 contingency table below.

                         Verdict
Categories            Acquitted (A)   Convicted (C)   Row Totals
Offender   Female (F)   39 (39.495)     5 (4.505)       44
Gender     Male (M)    127 (126.45)    14 (14.55)      141
Column Totals          166             19              185

The estimated probabilities are

p̂_F = 44/185 = 0.238,  p̂_M = 141/185 = 0.762,
p̂_A = 166/185 = 0.897, p̂_C = 19/185 = 0.103.
Since one of the estimated expected frequencies is less than 5, and we are working with a 2 × 2 contingency table, we will use the continuity correction. We compute

d₃ = (|39 − 39.495| − 0.5)²/39.495 + (|5 − 4.505| − 0.5)²/4.505
   + (|127 − 126.450| − 0.5)²/126.450 + (|14 − 14.550| − 0.5)²/14.550 = 0.000198.

If α = 0.05, then χ²(1, 0.05) = 3.841. We fail to reject H₀. The gender of the offender and whether the offender is convicted or acquitted appear not to be related.
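The same test can be sketched in SciPy, which applies Yates' continuity correction automatically for 2 × 2 tables. Note one assumption: newer SciPy releases clip the 0.5 adjustment so it never overshoots the expected count, so the reported statistic here is essentially zero rather than the hand value 0.000198; either way H₀ is retained.

```python
# The continuity-corrected 2x2 test from Example 5.30.  correction=True
# (the default for 2x2 tables) applies Yates' adjustment, clipped so the
# adjusted observed count never passes the expected count.
import numpy as np
from scipy import stats

table = np.array([[ 39,  5],     # Female: acquitted, convicted
                  [127, 14]])    # Male:   acquitted, convicted

d, p_value, df, expected = stats.chi2_contingency(table, correction=True)
# Every |observed - expected| is below 0.5 here, so the corrected
# statistic is essentially 0 and independence is not rejected
```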
                    Approval/Disapproval
Categories     Approves (A)   Disapproves (D)   Row Totals
       IN       65 (73.2)      35 (26.8)         100
       IL       71 (73.2)      29 (26.8)         100
State  MI       78 (73.2)      22 (26.8)         100
       WI       82 (73.2)      18 (26.8)         100
       OH       70 (73.2)      30 (26.8)         100
Column Totals  366            134               500

Set α = 0.01. The p-value is P(D₉ ≥ 9.3182) = 0.05362 > α. We retain the null. The distribution of parents who approve and disapprove of having armed guards in the schools appears to be the same across the five Midwestern states.
Remark 5.33 The TI calculator can perform the χ² test. Enter the observed values in a list, say L1, and the expected values in a second list L2. Press STAT → TESTS → χ² GOF-Test. The calculator will return the value of the χ² statistic as well as the p-value. In addition, it will return the vector of each term's contribution to the statistic so that it can be determined which terms contribute the most to the total. For a test of independence/homogeneity, the observed values and expected values can be entered into matrices A and B. The calculator's χ²-Test will return the value of the statistic as well as the p-value.
5.6.3 ANALYSIS OF VARIANCE
Earlier in the chapter, we presented a method to test the set of hypotheses H₀: μ_X = μ_Y vs. H₁: μ_X ≠ μ_Y based on the Student's t random variable. In this section, we will describe a procedure that generalizes the two-sample test to k ≥ 2 samples. Specifically, there are k treatments resulting in outcomes Xⱼ ~ N(μⱼ, σ), and suppose that X_{1j}, X_{2j}, …, X_{n_j j} is a random sample of size nⱼ from Xⱼ, j = 1, 2, …, k. We will assume that the random samples are independent of one another, and the variances are the same for all the random variables Xⱼ, j = 1, 2, …, k, with common value σ². The hypothesis test that determines if the treatments result in the same means or if there is at least one difference is

H₀: μ₁ = μ₂ = ⋯ = μ_k vs. H₁: μⱼ ≠ μⱼ′ for some j ≠ j′.

The test for this set of hypotheses will be based on the F random variable, as opposed to the Student's t when there are only two treatments. The tests will be equivalent in the case k = 2. The development which follows has traditionally been referred to as the analysis of variance (or ANOVA for short), and is part of the statistical theory of the design of experiments. In the next chapter we will also present an ANOVA in connection with linear regression.
We now introduce some specialized notation that is traditionally used in ANOVA. Let n = n₁ + n₂ + ⋯ + n_k, the total number of random variables across the k random samples. In addition, let

μ̄ = (1/n) ∑_{j=1}^{k} nⱼ μⱼ = ∑_{j=1}^{k} (nⱼ/n) μⱼ,

which is clearly a weighted average of all the means μⱼ, j = 1, 2, …, k, since ∑_{j=1}^{k} nⱼ/n = 1. If we assume the null hypothesis, then μ̄ = μ₁ = ⋯ = μ_k, and μ̄ is the common mean. Finally, let

X̄ⱼ = (1/nⱼ) ∑_{i=1}^{n_j} X_{ij} and X̄ = (1/n) ∑_{j=1}^{k} ∑_{i=1}^{n_j} X_{ij}.

The quantity X̄ⱼ represents the sample mean of the j-th sample, and X̄ is the mean across all the samples.
By considering the identity X_{ij} − X̄ = (X_{ij} − X̄ⱼ) + (X̄ⱼ − X̄), squaring both sides, and then taking the sum over i and j, we arrive at the following.

A Fundamental Identity

∑_{j=1}^{k} ∑_{i=1}^{n_j} (X_{ij} − X̄)² = ∑_{j=1}^{k} ∑_{i=1}^{n_j} (X_{ij} − X̄ⱼ)² + ∑_{j=1}^{k} nⱼ (X̄ⱼ − X̄)²
          SST                      SSE (within treatments)         SSTR (between treatments)

In the above equation, SST stands for total sum-of-squares and measures the total variation, SSE stands for error sum-of-squares and represents the sum of the variations within each sample, and finally, SSTR stands for treatment sum-of-squares and represents the variation across samples. A considerable amount of algebra is required to verify the identity and is omitted.
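The identity is easy to check numerically; the samples below are made up for the illustration.

```python
# Numerical check of the fundamental identity SST = SSE + SSTR,
# using the definitions above on arbitrary made-up samples.
import numpy as np

samples = [np.array([5.1, 4.8, 5.6, 5.0]),
           np.array([6.2, 5.9, 6.4]),
           np.array([5.5, 5.7, 5.2, 6.0, 5.8])]

all_obs = np.concatenate(samples)
grand_mean = all_obs.mean()                       # X-bar

sst = ((all_obs - grand_mean) ** 2).sum()                          # total
sse = sum(((s - s.mean()) ** 2).sum() for s in samples)            # within
sstr = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples) # between

assert abs(sst - (sse + sstr)) < 1e-9
```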
For each j, (1/σ²) ∑_{i=1}^{n_j} (X_{ij} − X̄ⱼ)² ~ χ²(nⱼ − 1), and so by independence,

(1/σ²) ∑_{j=1}^{k} ∑_{i=1}^{n_j} (X_{ij} − X̄ⱼ)² = (1/σ²) SSE ~ χ²(n − k)

since

(1/σ²) SSE = (1/σ²) ∑_{j=1}^{k} ( ∑_{i=1}^{n_j} (X_{ij} − X̄ⱼ)² ) ~ χ²(n₁ − 1) + ⋯ + χ²(n_k − 1) = χ²(n − k).

Notice that (1/σ²) SSE ~ χ²(n − k) is true whether or not H₀ is true. So we have the distributions of SST and SSE. To complete the picture, we need to know something about how SSTR is distributed. The way to determine this is to use the theorem from Chapter 2 that if two random variables have the same moment generating function, then they have the same distribution. Since (1/σ²) SST ~ χ²(n − 1) under H₀, its mgf is

M_{(1/σ²)SST}(t) = (1 − 2t)^{−(n−1)/2}.
Also, since (1/σ²) SSE ~ χ²(n − k), we have M_{(1/σ²)SSE}(t) = (1 − 2t)^{−(n−k)/2}. By independence, and since (1/σ²) SST = (1/σ²) SSE + (1/σ²) SSTR,

M_{(1/σ²)SST}(t) = M_{(1/σ²)SSE}(t) · M_{(1/σ²)SSTR}(t).

Therefore,

M_{(1/σ²)SSTR}(t) = M_{(1/σ²)SST}(t) / M_{(1/σ²)SSE}(t) = (1 − 2t)^{−(n−1)/2} / (1 − 2t)^{−(n−k)/2} = (1 − 2t)^{−(k−1)/2},

which is the mgf of a χ² random variable with k − 1 degrees of freedom. Therefore, under H₀,

(1/σ²) ∑_{j=1}^{k} nⱼ (X̄ⱼ − X̄)² = (1/σ²) SSTR ~ χ²(k − 1).
Now consider the expected values of SSE and SSTR. If H₀ is true, it is not hard to show that

E(SSE) = (n − k)σ² and E(SSTR) = (k − 1)σ².

In summary, assuming H₀, we have

(1/σ²) SSE ~ χ²(n − k),  (1/σ²) SSTR ~ χ²(k − 1),
E(SSE) = (n − k)σ²,  and  E(SSTR) = (k − 1)σ².

Therefore, since the ratio of χ² rvs, each divided by its respective degrees of freedom, has an F-distribution as shown in Section 2.6.3, we have

[(1/σ²) SSTR / (k − 1)] / [(1/σ²) SSE / (n − k)] = [SSTR/(k − 1)] / [SSE/(n − k)] ~ F(k − 1, n − k).   (5.7)

If H₀ is true, since E(SSE) = (n − k)σ² and E(SSTR) = (k − 1)σ², we would expect the ratio to be close to 1. However, if H₀ is not true, then it can be shown (details omitted) that

E(SSTR) = (k − 1)σ² + ∑_{j=1}^{k} nⱼ(μⱼ − μ̄)² > (k − 1)σ²

since the μⱼ's are not all the same. The denominator in (5.7) should be about σ² since E(SSE) = (n − k)σ² even when H₀ is not true. The numerator, however, should be greater than σ², and so the ratio should be greater than 1. Therefore,

F = [SSTR/(k − 1)] / [SSE/(n − k)] ~ F(k − 1, n − k)
is our test statistic, and we will reject H₀ when its observed value, f, is too large. To prevent committing a Type I error more than 100α% of the time, we will reject H₀ when f ≥ F(k − 1, n − k, α).
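The whole procedure can be sketched on made-up samples: compute SSTR and SSE from the definitions, form f, and compare with the critical value. SciPy's `f_oneway` computes the same statistic directly.

```python
# ANOVA F statistic built from SSTR and SSE, compared against the
# critical value F(k-1, n-k, alpha).  The data are illustrative.
import numpy as np
from scipy import stats

samples = [np.array([4.1, 4.5, 4.3, 4.8]),
           np.array([5.2, 5.6, 5.1]),
           np.array([4.9, 5.0, 4.6, 5.3])]
k = len(samples)
n = sum(len(s) for s in samples)
grand_mean = np.concatenate(samples).mean()

sse = sum(((s - s.mean()) ** 2).sum() for s in samples)
sstr = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)

f = (sstr / (k - 1)) / (sse / (n - k))
alpha = 0.05
critical = stats.f.isf(alpha, k - 1, n - k)   # F(k-1, n-k, alpha)
reject = f >= critical

# scipy.stats.f_oneway computes the same statistic directly
f_scipy, p = stats.f_oneway(*samples)
assert abs(f - f_scipy) < 1e-8
```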
Remark 5.34 The calculations for ANOVA are simplified using the following formulas:

SST = ∑_{j=1}^{k} ∑_{i=1}^{n_j} X_{ij}² − (1/n) ( ∑_{j=1}^{k} ∑_{i=1}^{n_j} X_{ij} )²

SSTR = ∑_{j=1}^{k} Sⱼ²/nⱼ − (1/n) ( ∑_{j=1}^{k} ∑_{i=1}^{n_j} X_{ij} )²,  where  Sⱼ = ∑_{i=1}^{n_j} X_{ij}.
To organize the calculations, ANOVA tables are used. Table entries are obtained by substituting data values for the X_{ij} in the formulas in the above remark.

Source     df      SS     MS                    f             p-value
Treatment  k − 1   SSTR   MSTR = SSTR/(k − 1)   f = MSTR/MSE  P(F(k − 1, n − k) ≥ f)
Error      n − k   SSE    MSE = SSE/(n − k)     *             *
Total      n − 1   *      *                     *             *

The table is filled in with numerical values from left to right until a p-value is calculated.
Example 5.35 Sweetcorn, as opposed to field corn, which is fed to livestock, is for human consumption. Three samples from three different types of sweetcorn, Sweetness, Allure, and Montauk, were compared to determine if the mean heights of the plants are the same. The following table gives the heights (in feet) of 17 samples of Sweetness, 12 samples of Allure, and 15 samples of Montauk. Assume normal distributions for each treatment. Each treatment also has a variance of 0.64 (squared feet).

Sweetness: 5.48, 6.06, 5.21, 6.14, 5.08, 4.99, 4.14, 5.88, 6.56, 5.49, 4.81, 6.81, 6.70, 6.44, 6.99, 6.66, 6.37
Allure: 7.51, 6.73, 6.54, 5.85, 6.66, 7.28, 5.29, 6.83, 5.17, 6.77, 7.45, 7.25
Montauk: 5.83, 5.42, 5.80, 6.92, 6.27, 5.41, 5.44, 6.65, 6.54, 5.60, 6.78, 6.05, 7.19, 6.08, 6.16
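As a sketch (assuming SciPy is available), the example's data can be fed to `f_oneway`, which performs exactly the F-test developed in this section, with F(k − 1, n − k) = F(2, 41) here. Running it gives f close to 3.5 with a p-value near 0.04, so the mean heights differ at the 5% level but not at the 1% level.

```python
# One-way ANOVA on the sweetcorn heights of Example 5.35.
from scipy import stats

sweetness = [5.48, 6.06, 5.21, 6.14, 5.08, 4.99, 4.14, 5.88, 6.56,
             5.49, 4.81, 6.81, 6.70, 6.44, 6.99, 6.66, 6.37]
allure    = [7.51, 6.73, 6.54, 5.85, 6.66, 7.28,
             5.29, 6.83, 5.17, 6.77, 7.45, 7.25]
montauk   = [5.83, 5.42, 5.80, 6.92, 6.27, 5.41, 5.44, 6.65,
             6.54, 5.60, 6.78, 6.05, 7.19, 6.08, 6.16]

f, p_value = stats.f_oneway(sweetness, allure, montauk)   # F(2, 41) test
```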
Remark 5.36 To calculate the value F(d1, d2, α) for given α, d1, and d2 using a TI calculator, enter the following program as INVF into your calculator. Press PRGM, then INVF, and you will be prompted for the area to the right of c, i.e., α as in P(F ≥ c) = α. Then enter the degrees of freedom in the order D1, D2, and press ENTER. The result is the value of c that gives you α.
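In Python the same value comes directly from the F distribution's inverse survival function; the degrees of freedom below are arbitrary.

```python
# INVF on the TI finds c with P(F >= c) = alpha.  In SciPy this is the
# inverse survival function of the F distribution.
from scipy import stats

alpha, d1, d2 = 0.05, 3, 20
c = stats.f.isf(alpha, d1, d2)        # c such that P(F(d1, d2) >= c) = alpha
assert abs(stats.f.sf(c, d1, d2) - alpha) < 1e-12
```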
where S_p² is the pooled variance. Or we could use the ANOVA F-test just introduced. Which of these methods is better, and in what sense? To answer this question, we will apply the ANOVA procedure to the two-sample problem. In this situation, the F statistic becomes

F = MSTR/MSE = [SSTR/(k − 1)] / [SSE/(n − k)] = SSTR / [SSE/(n₁ + n₂ − 2)]

since k = 2 and n = n₁ + n₂. Computing SSTR, we get

SSTR = n₁( X̄₁ − (n₁X̄₁ + n₂X̄₂)/(n₁ + n₂) )² + n₂( X̄₂ − (n₁X̄₁ + n₂X̄₂)/(n₁ + n₂) )²
     = n₁( n₂(X̄₁ − X̄₂)/(n₁ + n₂) )² + n₂( n₁(X̄₁ − X̄₂)/(n₁ + n₂) )²
     = [(n₁n₂² + n₂n₁²)/(n₁ + n₂)²] (X̄₁ − X̄₂)²
     = [n₁n₂/(n₁ + n₂)] (X̄₁ − X̄₂)² = (X̄₁ − X̄₂)² / (1/n₁ + 1/n₂).
Computing SSE, we obtain

SSE = ∑_{j=1}^{2} ∑_{i=1}^{n_j} (X_{ij} − X̄ⱼ)² = (n₁ − 1)S²_{X₁} + (n₂ − 1)S²_{X₂},

and so

SSE/(n₁ + n₂ − 2) = [(n₁ − 1)S²_{X₁} + (n₂ − 1)S²_{X₂}]/(n₁ + n₂ − 2),

which is just the pooled variance S_p². Therefore,

SSTR / [SSE/(n₁ + n₂ − 2)] = (X̄₁ − X̄₂)² / [S_p²(1/n₁ + 1/n₂)] ~ F(1, n₁ + n₂ − 2).
We reject H₀: μ₁ = μ₂ at level α if

f = (x̄₁ − x̄₂)² / [s_p²(1/n₁ + 1/n₂)] ≥ F(1, n₁ + n₂ − 2, α).

Since F(1, n₁ + n₂ − 2, α) = (t(n₁ + n₂ − 2, α/2))², this is the same as

(x̄₁ − x̄₂)² / [s_p²(1/n₁ + 1/n₂)] ≥ (t(n₁ + n₂ − 2, α/2))²  ⟺  |t| ≥ t(n₁ + n₂ − 2, α/2),

where t is the value of the statistic

T = (X̄₁ − X̄₂) / ( S_p √(1/n₁ + 1/n₂) ).

This is exactly the condition under which we reject H₀ using the T statistic for a two-sample t-test. The two methods are equivalent in that one of the methods will reject H₀ if and only if the other method does.
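This equivalence is easy to confirm numerically on made-up samples: the F statistic equals the square of the pooled t statistic, and the two p-values coincide.

```python
# Numerical check that the ANOVA F-test and the pooled two-sample t-test
# agree when k = 2: f = t^2 and the p-values coincide.
import numpy as np
from scipy import stats

x = np.array([5.1, 4.9, 5.4, 5.0, 5.2])
y = np.array([5.6, 5.8, 5.5, 5.9])

t, p_t = stats.ttest_ind(x, y, equal_var=True)   # pooled-variance t-test
f, p_f = stats.f_oneway(x, y)                    # ANOVA with k = 2

assert abs(f - t ** 2) < 1e-8
assert abs(p_t - p_f) < 1e-8
```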
5.7 PROBLEMS
5.1. An industrial drill bit has a lifetime (measured in years) that is a normal random variable X ~ N(μ, 2). A random sample of the lifetimes of 100 bits resulted in a sample mean of x̄ = 1.3.
(a) Perform a test of hypotheses H₀: μ = 1.5 vs. H₁: μ ≠ 1.5 with α = 0.05.
(b) Compute the probability of a Type II error when μ = 2. That is, compute β(2).
(c) Find a general expression for β(μ).
5.2. Suppose a test of hypothesis is conducted by an experimenter for the mean of a normal random variable X ~ N(μ, 3.2). A random sample of size 16 is taken from X. If H₀: μ = 42.9 is tested against H₁: μ ≠ 42.9, and the experimenter rejects H₀ if x̄ is in the region (−∞, 41.164] ∪ [44.636, ∞), what is the level of the test α?
5.3. Compute the p-values associated with each of the following sample means x̄ computed from a random sample from a normal random variable X. Then decide if the null hypothesis should be rejected if α = 0.05.
(a) H₀: μ = 120 vs. H₁: μ < 120, n = 25, σ = 18, x̄ = 114.2.
(b) H₀: μ = 14.2 vs. H₁: μ > 14.2, n = 9, σ = 4.1, x̄ = 15.8.
(c) H₀: μ = 30 vs. H₁: μ ≠ 30, n = 16, σ = 6, x̄ = 26.8.
5.4. An engineer at a prominent manufacturer of high-performance alkaline batteries is attempting to increase the lifetime of its best-selling AA battery. The company's current battery functions for 100.3 hours before it has to be recharged. The engineer randomly selects 15 of the improved batteries and discovers that the mean operating time is x̄ = 105.6 hours. The sample standard deviation is s = 6.25.
(a) Perform a test of hypotheses H₀: μ = 100.3 vs. H₁: μ > 100.3 with α = 0.01.
(b) Compute the power function value π(103).
(c) Compute the general form of the power function π(μ).
5.5. Suppose a random sample of size 25 is taken from a normal random variable X ~ N(μ, σ) with σ unknown. For the following tests and test levels, determine the critical regions if t represents the value of the test statistic T = (X̄ − μ₀)/(s_X/5):
(a) H₀: μ = μ₀ vs. H₁: μ > μ₀, α = 0.01.
(b) H₀: μ = μ₀ vs. H₁: μ < μ₀, α = 0.02.
(c) H₀: μ = μ₀ vs. H₁: μ ≠ μ₀, α = 0.05.
5.6. Compute the p-values associated with each of the following sample means x̄ computed from a random sample from a normal random variable X ~ N(μ, σ), where σ is unknown. Then decide if the null hypothesis should be rejected if α = 0.05.
(a) H₀: μ = 90.5 vs. H₁: μ < 90.5, s = 9.5, n = 17, x̄ = 85.2.
(b) H₀: μ = 20.2 vs. H₁: μ > 20.2, s = 6.3, n = 9, x̄ = 21.8.
(c) H₀: μ = 35 vs. H₁: μ ≠ 35, s = 11.7, n = 20, x̄ = 31.9.
5.7. Consider the calculation of a Type II error for a test for the variance of a random variable X ~ N(μ, σ). Derive the general forms of β(σ₁²) for the following tests.
(a) H₀: σ² = σ₀² vs. H₁: σ² < σ₀² has β(σ₁²) = P( χ²(n − 1) > (σ₀²/σ₁²) χ²(n − 1, 1 − α) ).
(b) H₀: σ² = σ₀² vs. H₁: σ² > σ₀² has β(σ₁²) = P( χ²(n − 1) < (σ₀²/σ₁²) χ²(n − 1, α) ).
25:23 25:50 26:18 25:44 26:04 26:01 25:30 24:49 25:21 25:68
25:18 25:01 26:09 24:49 24:54 25:12 25:84 24:22 25:14 25:67
226; 226; 227; 226; 225; 228; 225; 226; 229; 227
(a) Perform a test of hypotheses H₀: σ² = 2.25 vs. H₁: σ² < 2.25 with α = 0.05.
(b) Calculate the Type II error β(2).
5.13. Cars intending on making a left turn at a certain intersection are observed. Out of 600
cars in the study, 157 of them pulled into the wrong lane.
(a) What is the range on the number of times x that a 6 would have to be rolled to
reject H0 at the ˛ D 0:05 level?
(b) What is the range on the number of times x that a 6 would have to be rolled to
retain H0 at the ˛ D 0:01 level?
5.15. (Small Sample Size) A cosmetics firm claims that its new topical cream reduces the
appearance of undereye bags in 60% of men and women. A consumer group thinks the
percentage is too high and conducts a test of hypothesis H0 W p0 D 0:6 vs. H1 W p0 <
0:6. Out of 8 men and women in a random sample, only 3 saw a significant reduction
in the appearance of their undereye bags.
(a) Perform a test of hypothesis H0 W p0 D 0:6 vs. H1 W p0 < 0:6 with ˛ D 0:05 by
computing the p-value of the test.
(b) What is the critical region of the test H0 W p0 D 0:6 vs. H1 W p0 ¤ 0:6 if ˛ D 0:1?
5.16. The director of a certain university tutoring center wants to know if there is a difference
between the mean lengths of time (in hours) male and female freshman students study
over a 30-day period. The study involved a random sample of 34 female and 29 male
students. Sample means and standard deviations are listed in the table below.
The variances of the two populations are unknown but assumed equal.
(a) Perform a test of hypothesis H₀: μ_X = μ_Y vs. H₁: μ_X > μ_Y with α = 0.05.
(b) Compute the p-value of the test in (a).
5.19. An entomologist is making a study of two species of lady bugs (Coccinellidae). She is
interested in whether there is a difference between the number of spots on the carapace
of the two species. She takes a random sample of 20 insects from each species and counts
the number of spots. The results are presented in the table below.
Species 1 (X)   Species 2 (Y)
m = 20          n = 20
x̄ = 3.8        ȳ = 3.6
s_X = 1.2       s_Y = 1.3
5.25. A new diet requires that certain food items be weighed before being consumed. Over
the course of a week, a person on the diet weighs ten food items (in ounces). Just to make
sure of the weight, she weighs the items on two different scales. The weights indicated
on the scales are close to one another, but are not exactly the same. The results of the
weighings are given below.
5.26. The table on the left lists the percentages in the population of each blood type in India.
The table on the right is the distribution of 1,150 blood types in a small northern Indian
town.
Blood Type Percentage of Population Blood Type Number of Residents
O+ 27:85 O+ 334
A+ 20:8 A+ 207
B+ 38:14 B+ 448
AB+ 8:93 AB+ 92
O- 1:43 O- 23
A- 0:57 A- 12
B- 1:79 B- 23
AB- 0:49 AB- 11
Does the town’s distribution of blood types conform to the national percentages for
India? Test at the ˛ D 0:01 level.
5.27. A six-sided die is tossed independently 180 times. The following frequencies were ob-
served.
Side       1   2   3   4   5   6
Frequency  #   30  30  30  30  60 − #
For what values of # would the null hypothesis that the die is fair be rejected at the
˛ D 0:05 level?
5.28. The distribution of colors in the candy M&Ms has varied over the years. A statistics
student conducted a study of the color distribution of M&Ms made in a factory in
Tennessee. After the study, she settles on the following distribution of the colors blue,
orange, green, yellow, red, and brown at the Tennessee plant.
She wanted to see if the same distribution held at another plant located in New Jersey.
A sample of 1,000 M&Ms were inspected for color at the New Jersey plant. The results
are in the table above right. Is the New Jersey plant’s distribution of colors the same as
the Tennessee plant’s distribution? Test at the ˛ D 0:05 level.
5.29. At a certain fishing resort off the Southeastern coast of Florida, a record is kept of the
number of sailfish caught daily over a 60-day period by the guests staying at the resort.
The results are in the table below.
Sailfish Caught 0 1 2 3 4 5 6
No. Days 8 14 14 17 3 3 1
An ecologist is concerned about declining fish populations in the area. He proposes that
the data follows a Poisson distribution with a Poisson rate λ = 2. Does it appear that
the data follows such a distribution? Test at the ˛ D 0:05 level.
5.30. A homeowner is interested in attracting hummingbirds to her backyard by installing a
bird feeder customized for hummingbirds. Over a period of 576 days, the homeowner
observes the number of hummingbirds visiting the feeder during a certain half-hour
period during the afternoon.
Hummingbird Visits 0 1 2 3 4 5
No. Days 229 211 93 35 7 1
(a) Is it possible the data in the table follows a Poisson distribution? Test at the α = 0.05 level. What value of λ should be used?
(b) Apply the test with λ = 0.8.
5.31. A traffic control officer is tracking speeders on Lake Shore Drive in Chicago. In par-
ticular, he is recording (from a specific vantage point) the time intervals (interarrival
times) between drivers that are speeding. A sample of 100 times (in seconds) are listed
in the table below.
Interval         [0, 20)  [20, 40)  [40, 60)  [60, 90)  [90, 120)  [120, 180)  [180, ∞)
No. in Interval    41       19        16        13         9           2           0
The officer hypothesizes that the interarrival time data follows an exponential distribu-
tion with mean 40 seconds. Test his hypothesis at the ˛ D 0:05 level.
5.32. A fisherman, who also happens to be a statistician, is counting the number of casts he has
to make in a certain small lake in southern Illinois near his home before his lure is taken
by a smallmouth bass. The data below represents the number of casts until achieving
50 strikes while fishing during a recent vacation. The fisherman hypothesizes that the
number of casts before a strike follows a geometric distribution. Test his hypothesis at
the ˛ D 0:01 level.
Casts to Strike 1 2 3 4 5 6 7 8 9
Frequency 4 13 10 7 5 4 3 3 1
5.33. A criminologist is studying the occurrence of serious injury due to criminal violence for
certain professions. A random sample of 490 causes of injury in the chosen professions
is taken. The results are displayed in the table. Does it appear that serious injury due to
criminal violence and choice of profession are independent? Test at the ˛ D 0:01 level.
Profession
Police Cashier Taxi Driver Security Row
(P) (C) (T) Guard (S) Totals
Criminal Violence (V) 82 107 70 59 318
Other Causes (O) 92 9 29 42 172
Column Totals 174 116 99 101 490
5.34. For a receiver in a wireless device, like a cell phone, for example, two important charac-
teristics are its selectivity and sensitivity. Selectivity refers to a wireless receiver’s capa-
bility to detect and decode a desired signal in the presence of other unwanted interfering
signals. Sensitivity refers to the smallest possible signal power level at the input which
assures proper functioning of a wireless receiver. A random sample of 170 radio receivers
produced the following results.
Sensitivity
Low (LN) Average (AN) High (HN) Row Totals
Low (LS) 6 12 12 30
Selectivity Average (AS) 33 61 18 112
High (HS) 13 15 0 28
Column Totals 52 88 30 170
Does it appear from the data that selectivity and sensitivity are dependent traits of a
receiver? Test at the ˛ D 0:01 level.
5.35. Consider again the small northern Indian town with 1,150 residents in problem 5.26
that has the following distribution of blood types.
Is gender and the genre of television program independent? Test the hypothesis at the
˛ D 0:05 level.
5.37. A ketogenic diet is a type of low-carbohydrate diet whose aim is to metabolize fats into
ketone bodies (water-soluble molecules acetoacetate, beta-hydroxybutyrate, and acetone
produced by the liver) rather than into glucose as the body’s main source of energy. A di-
etician wishes to study whether the proportion of adult men on ketogenic diets changes
with age. She samples 100 men currently on diets in each of five age groups: Group I:
20–25, Group II: 26–30, Group III: 31–35, Group IV: 36–40, and Group V: 41–45.
Her results are displayed in the table below. Let pi , i 2 fI; II; III; I V; V g denote the
proportion of men on a ketogenic diet in group i . Test whether the proportions are the
same across the age groups. Conduct the test at the ˛ D 0:05 level.
Group
(I) (II) (III) (IV) (V) Row Totals
Diet Ketogenic (K) 26 22 25 20 19 112
Nonketogenic (N) 74 78 75 80 81 388
Column Totals 100 100 100 100 100 500
5.38. Suppose the result of an experiment can be classified as having one of three mutually
exclusive A traits, A1 , A2 , and A3 , and also as having one of four mutually exclusive
B traits, B1 , B2 , B3 , and B4 . The experiment is independently repeated 300 times with
the following results.
B Trait
B1 B2 B3 B4 Row Totals
        A1   25 − 5#   25 − #   25 + #   25 + 5#   100
A Trait A2   25        25       25       25        100
        A3   25 + 5#   25 + #   25 − #   25 − 5#   100
Column Totals 75 75 75 75 300
What is the smallest integer value of # for which the null hypothesis that the traits are
independent is rejected? Test at the ˛ D 0:05 level. (Note that 0 # 5.)
5.39. A random sample of size n D 10 of top speeds of Indianapolis 500 drivers over the past
30 years is taken. The data is displayed in the following table.
5.40. Random samples of size 6 are taken from three normal random variables A, B, and C
having equal variances. The results are displayed in the table below.
5.41. Alice, John, and Bob are three truck assembly plant workers in Dearborn, MI. The times (in minutes) each requires to mount the windshield on a particular model of truck the plant produces are recorded on five randomly chosen occasions for each worker.
5.42. A new drug, AdolLoft, was developed to treat depression in adolescents. A research
study was established to assess the clinical efficacy of the drug. Patients suffering from
depression were randomly assigned to one of three groups: a placebo group (P), a low-
dose group (L), and a normal dose group (N). After six weeks, the subjects completed
the Beck Depression Inventory (BDI-II, 1996) assessment which is composed of ques-
tions relating to symptoms of depression such as hopelessness and irritability, feelings
of guilt or of being punished, and physical symptoms such as tiredness and weight loss.
The results of the study on the three groups of five subjects is given below.
Assuming equal variances, test if μ_{5Y} = μ_{10Y} = μ_{15Y} = μ_{20Y} at the α = 0.01 level by carrying out the following steps.
(a) Compute X̄_{5Y}, X̄_{10Y}, X̄_{15Y}, X̄_{20Y}, and X̄.
(b) Compute SSTR, SSE, and SST.
(c) Compute MSTR and MSE.
(d) Compute f.
(e) Compute the p-value of f.
(f) Display the ANOVA table.
(g) Test if μ_{5Y} = μ_{10Y} = μ_{15Y} = μ_{20Y} at the α = 0.01 level.
5.8 SUMMARY TABLES

One-sided tests (continued):

σ unknown, test statistic T = (X̄ − μ₀)/(S_X/√n):
  H₀: μ = μ₀ vs. H₁: μ < μ₀; reject H₀ if t ≤ −t(n − 1, α)

μ = μ₀ known, test statistic X² = ∑_{i=1}^{n} ((Xᵢ − μ₀)/σ₀)²:
  H₀: σ² = σ₀² vs. H₁: σ² > σ₀²; reject H₀ if χ² ≥ χ²(n, α)
  H₀: σ² = σ₀² vs. H₁: σ² < σ₀²; reject H₀ if χ² ≤ χ²(n, 1 − α)

μ unknown, test statistic X² = ∑_{i=1}^{n} (Xᵢ − X̄)²/σ₀² = (n − 1)S_X²/σ₀²:
  H₀: σ² = σ₀² vs. H₁: σ² > σ₀²; reject H₀ if χ² ≥ χ²(n − 1, α)
  H₀: σ² = σ₀² vs. H₁: σ² < σ₀²; reject H₀ if χ² ≤ χ²(n − 1, 1 − α)

Two-sided tests:

σ = σ₀ known, test statistic Z = (X̄ − μ₀)/(σ₀/√n):
  H₀: μ = μ₀ vs. H₁: μ ≠ μ₀; reject H₀ if |z| ≥ z_{α/2}

σ unknown, test statistic T = (X̄ − μ₀)/(S_X/√n):
  H₀: μ = μ₀ vs. H₁: μ ≠ μ₀; reject H₀ if |t| ≥ t(n − 1, α/2)

μ = μ₀ known, test statistic X² = ∑_{i=1}^{n} ((Xᵢ − μ₀)/σ₀)²:
  H₀: σ² = σ₀² vs. H₁: σ² ≠ σ₀²; reject H₀ if χ² ≤ χ²(n, 1 − α/2) or χ² ≥ χ²(n, α/2)

μ unknown, test statistic X² = ∑_{i=1}^{n} (Xᵢ − X̄)²/σ₀² = (n − 1)S_X²/σ₀²:
  H₀: σ² = σ₀² vs. H₁: σ² ≠ σ₀²; reject H₀ if χ² ≤ χ²(n − 1, 1 − α/2) or χ² ≥ χ²(n − 1, α/2)

Two-sample tests, one-sided (continued):

r = σ_X²/σ_Y², test statistic F = (S_X²/S_Y²)(1/r₀):
  H₀: r = r₀ vs. H₁: r < r₀; reject H₀ if f ≤ F(m − 1, n − 1, 1 − α)

σ_D unknown (paired samples), test statistic T = (D̄ − d₀)/(S_D/√m):
  H₀: μ_D = d₀ vs. H₁: μ_D > d₀; reject H₀ if t ≥ t(m − 1, α)
  H₀: μ_D = d₀ vs. H₁: μ_D < d₀; reject H₀ if t ≤ −t(m − 1, α)

Two-sample tests, two-sided (μ_d = μ_X − μ_Y):

Equal unknown variances, test statistic T = (X̄_m − Ȳ_n − d₀)/(S_p √(1/m + 1/n)), where
S_p² = [(m − 1)S_X² + (n − 1)S_Y²]/(m + n − 2) (pooled variance):
  H₀: μ_d = d₀ vs. H₁: μ_d ≠ d₀; reject H₀ if |t| ≥ t(m + n − 2, α/2)

σ_X², σ_Y² unknown, σ_X² ≠ σ_Y², test statistic T = (X̄_m − Ȳ_n − d₀)/√(S_X²/m + S_Y²/n):
  H₀: μ_d = d₀ vs. H₁: μ_d ≠ d₀; reject H₀ if |t| ≥ t(ν, α/2), where
  ν = (S_X²/m + S_Y²/n)² / [ S_X⁴/(m²(m − 1)) + S_Y⁴/(n²(n − 1)) ]

σ_X, σ_Y unknown, r = σ_X²/σ_Y², test statistic F = (S_X²/S_Y²)(1/r₀):
  H₀: r = r₀ vs. H₁: r ≠ r₀; reject H₀ if f ≥ F(m − 1, n − 1, α/2) or f ≤ F(m − 1, n − 1, 1 − α/2)

σ_D unknown (paired samples), test statistic T = (D̄ − d₀)/(S_D/√m):
  H₀: μ_D = d₀ vs. H₁: μ_D ≠ d₀; reject H₀ if t ≥ t(m − 1, α/2) or t ≤ −t(m − 1, α/2)
CHAPTER 6
Linear Regression
6.1 INTRODUCTION AND SCATTER PLOTS
In this chapter we discuss one of the most important topics in statistics. It provides us with a way
to determine an algebraic relationship between two variables x and the random variable Y which
depends on x and ultimately to use the relationship to predict one variable from knowledge of
the other. For instance, is there a relationship between a student’s GPA in high school and the
GPA in college? Is there a connection between a father’s intelligence and a daughter’s? Can
we predict the price of a stock from the S&P 500 index? These are all matters regression can
consider.
We have a data set of pairs of points $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$. The first step is to create a
scatterplot of the points. For example, we have the plots in Figures 6.1 and 6.2.
Figure 6.1: Scatterplot of $x$-$y$ data.
Figure 6.2: Scatterplot of $x$-$y$ data with line fit to the data.
It certainly looks like there is some kind of relationship between x and y and it looks like it
could be linear. By eye we can draw a line that looks like it would be a pretty good fit to the data.
The questions we will answer in this chapter are as follows.
• How do we measure how well a line approximates the data?
• How do we find the line that approximates the data as well as possible?
• How do we use the line to make predictions, and how do we quantify the errors?
First we turn to the problem of finding the best fit line to the data.
170 6. LINEAR REGRESSION
6.2 INTRODUCTION TO REGRESSION
If we have no information about an rv $Y$, the best estimate for $Y$ is $EY$. The reason for this is
the following fact:
$$\min_a E(Y - a)^2 = E(Y - EY)^2 = \mathrm{Var}(Y).$$
In other words, $EY$ minimizes the mean square distance of the values of $Y$ to any number $a$.
We have seen this to be true earlier, but here is a short recap.
First consider the real-valued function $f(a) = E(Y - a)^2$. To minimize this function, take
a derivative and set it to zero:
$$f'(a) = -2E(Y - a) = -2(EY - a) = 0 \implies a = EY.$$
Since $f''(a) = 2 > 0$, this $a = EY$ provides the minimum. We conclude that if we have no
information about $Y$'s distribution and we have to guess something about $Y$, then $a = EY$ is
the best guess.
Now suppose we know that there is another rv $X$ which is related to $Y$, and we assume
that the relationship is linear. Think of $X$ as the independent variable and $Y$ as the dependent
variable, i.e., $X$ is the input and $Y$ is the response. We would like to describe precisely
the linear relationship between $X$ and $Y$. To do so, we will find constants $a, b$ to minimize the
following function giving the mean squared distance of $Y$ to $a + bX$:
$$f(a, b) = E(Y - a - bX)^2.$$
To minimize this function (which depends on two variables) take partial derivatives and set them to
zero:
$$f_a = -2E(Y - a - bX) = 0, \quad \text{and} \quad f_b = -2E[X(Y - a - bX)] = 0.$$
Solving these two simultaneous equations for $a, b$ we get
$$b = \frac{E(XY) - EX\,EY}{E(X^2) - (EX)^2} = \frac{E(XY) - EX\,EY}{\mathrm{Var}(X)} \quad \text{and} \quad a = EY - b\,EX. \tag{6.1}$$
Now we rewrite these solutions using the covariance. Recall that the covariance of $X, Y$ is given
by
$$\mathrm{Cov}(X, Y) = E(X - EX)(Y - EY) = E(XY) - EX\,EY,$$
and we can rewrite the slope parameter as $b = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X^2}$. We have
$$b = \frac{\mathrm{Cov}(X, Y)}{\sigma_X^2} \quad \text{and} \quad a = EY - b\,EX.$$
We can also rewrite the slope $b$ using the correlation coefficient $\rho(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$. We have
$$b = \frac{\mathrm{Cov}(X, Y)}{\sigma_X^2} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y}\cdot\frac{\sigma_Y}{\sigma_X} = \rho(X, Y)\,\frac{\sigma_Y}{\sigma_X}.$$
This gives us the result that $b = \rho\,\dfrac{\sigma_Y}{\sigma_X}$. We summarize these results.
Proposition 6.1 The minimum of $f(a, b) = E(Y - a - bX)^2$ over all possible constants $a, b$
is provided by $b = \rho(X, Y)\,\dfrac{\sigma_Y}{\sigma_X}$, $a = EY - b\,EX$. The minimum value of $f$ is then given by $(1 - \rho^2)\sigma_Y^2$.
Proof. We have already shown the first part, and all we have to do is find the value of the function
$f$ at the point that provides the minimum. We first plug in the value of $a = EY - b\,EX$ and
then rearrange:
$$\begin{aligned} E(Y - a - bX)^2 &= E[(Y - EY) + b(EX - X)]^2 \\ &= E\left[(Y - EY)^2 + 2b(EX - X)(Y - EY) + b^2(EX - X)^2\right] \\ &= \mathrm{Var}(Y) + b^2\,\mathrm{Var}(X) - 2b\,\mathrm{Cov}(X, Y) \\ &= \sigma_Y^2 + \rho^2\frac{\sigma_Y^2}{\sigma_X^2}\,\sigma_X^2 - 2\,\rho\frac{\sigma_Y}{\sigma_X}\,\rho\,\sigma_X\sigma_Y \qquad \text{using } b = \rho\frac{\sigma_Y}{\sigma_X}, \\ &= \sigma_Y^2 + \rho^2\sigma_Y^2 - 2\rho^2\sigma_Y^2 = (1 - \rho^2)\sigma_Y^2. \end{aligned}$$
Remark 6.2 The regression line, or least squares fit line, we have derived is written as
$$Y = a + bX = EY - b\,EX + bX = EY + b(X - EX) \implies Y - EY = \rho\,\frac{\sigma_Y}{\sigma_X}(X - EX).$$
This shows that the regression line always passes through the point $(EX, EY)$ and has
slope $\rho\,\dfrac{\sigma_Y}{\sigma_X}$. Consequently, a one standard deviation increase in $X$ from the mean results in a
$\rho\,\sigma_Y$ unit increase from the mean (or decrease if $\rho < 0$) in $Y$.
Example 6.3 Suppose we know that for a rv $Y$, $\sigma_Y^2 = 10$ and the correlation coefficient between
$X, Y$ is $\rho = 0.5$. If we ignore $X$ and try to guess $Y$, the best estimate is $EY$, which will give us
an error of $E(Y - EY)^2 = \mathrm{Var}(Y) = 10$. If we use the information on the correlation between
$X, Y$ we have instead
$$E[Y - (a + bX)]^2 = (1 - \rho^2)\sigma_Y^2 = (1 - 0.25)10 = 7.5,$$
and the error is cut by 25%.
What we have shown is that the best approximation to the rv $Y$ by a linear function
$a + bX$ of a rv $X$ is given by the rv $W = a + bX$ with the constants $a, b$ given by $b = \rho\,\sigma_Y/\sigma_X$
and $a = EY - b\,EX$. It is not true that $Y = a + bX$; rather, $W = a + bX$ is a rv with mean
square distance $E(Y - W)^2 = (1 - \rho^2)\sigma_Y^2$, and this is the smallest possible such distance for any
possible constants $a, b$.
Remark 6.4 Since $E(Y - a - bX)^2 = (1 - \rho^2)\sigma_Y^2 \ge 0$, it must be true that $|\rho| \le 1$. Also, if
$|\rho| = 1$, then the only possibility is that $Y = a + bX$, so that $Y$ is exactly a linear function of $X$.
Now we see that $\rho$ is a quantitative measure of how well a line approximates $Y$. The closer
$\rho$ is to $\pm 1$, the better the approximation. When $|\rho| = 1$ we say that $Y$ is perfectly linearly
correlated with $X$. When $\rho = 0$ the error of approximation by a linear function is the largest
possible error. When $\rho = 0$ we say that $Y$ is uncorrelated with $X$.
As a general rule, if $|\rho| \le 0.5$ the correlation is weak, and when $|\rho| \ge 0.8$ we say the linear
correlation is strong.
If we knew $\sigma$, then we are saying that for each observed value of $x$, we have $Y \sim N(a + bx, \sigma)$.
Thus, for each fixed observed value $x$, the mean of $Y$ is $a + bx$ and the $Y$-values are normally
distributed with SD given by $\sigma$.
Now suppose the data pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, are observations from $(x_i, Y_i)$, where
$Y_1, Y_2, \ldots, Y_n$ is a random sample from the model $Y_i = a + b\,x_i + \varepsilon_i$, with independent errors $\varepsilon_i \sim N(0, \sigma)$.
Example 6.5 Suppose the relationship between interest rates and the value of real estate
is given by a simple linear regression model with the regression line $Y = 137.5 - 12.5x + \varepsilon$,
where $x$ is the interest rate and $Y$ is the value of the real estate. We suppose the noise term
$\varepsilon \sim N(0, 10)$. Then for any fixed interest rate $x_0$, the real estate will have the distribution
$N(-12.5x_0 + 137.5, 10)$. For instance, if $x_0 = 8$, $EY = 37.5$, and then
$$P(Y > 45) = \text{normalcdf}(45, \infty, 37.5, 10) = 0.2266.$$
The mean value of real estate when interest rates are 8% is 37.5, and there is about a 22% chance
the real estate value will be above 45.
Notice that because the slope of the regression line is negative, higher interest rates will
give lower values of real estate. Suppose $Y_1 = -12.5x_1 + 137.5 + \varepsilon_1$ is an observation when
$x_1 = 10$ and $Y_2 = -12.5x_2 + 137.5 + \varepsilon_2$ is an observation when $x_2 = 9$. By properties of the
normal distribution, $Y_1 - Y_2 \sim N(-12.5(x_1 - x_2), 14.14)$. What are the chances real estate values
will be higher with rates at 10% rather than at 9%? That is, what is $P(Y_1 > Y_2)$? Here's the
answer:
$$P(Y_1 - Y_2 > 0) = \text{normalcdf}(0, \infty, -12.5, 14.14) = 0.1883.$$
There is an 18.83% chance that values will be higher at 10% interest than at 9%. Next, suppose
the value of real estate at 9% is actually 35. What percentile is this? That's easy because we are
assuming $Y \sim N(25, 10)$, so that $P(Y < 35 \mid x = 9) = \text{normalcdf}(-\infty, 35, 25, 10) = 0.841$, and
so 35 is the 84th percentile of real estate values when interest rates are 9%. This means that 84%
of real estate values are below 35 when interest rates are 9%.
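These calculator computations can be reproduced with Python's standard library; `statistics.NormalDist` plays the role of normalcdf. A sketch, with the means, SDs, and cutoffs copied from the example:

```python
from statistics import NormalDist

# Value of real estate at interest rate x0 = 8: Y ~ N(137.5 - 12.5*8, 10) = N(37.5, 10)
y_at_8 = NormalDist(mu=37.5, sigma=10)
p_above_45 = 1 - y_at_8.cdf(45)          # P(Y > 45), about 22%

# Difference of observations at 10% and 9%: Y1 - Y2 ~ N(-12.5, sqrt(10^2 + 10^2))
diff = NormalDist(mu=-12.5, sigma=(10**2 + 10**2) ** 0.5)
p_higher_at_10 = 1 - diff.cdf(0)         # P(Y1 - Y2 > 0), about 18.8%

# Percentile of a value of 35 when rates are 9%: Y ~ N(25, 10)
y_at_9 = NormalDist(mu=25, sigma=10)
pct_35 = y_at_9.cdf(35)                  # about 0.841, the 84th percentile

print(round(p_above_45, 4), round(p_higher_at_10, 4), round(pct_35, 4))
```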
The term regression implies a return to a less developed state. The question naturally
arises as to why this term is applied to linear regression analysis. Informally, Galton, the first
developer of regression analysis, noticed that tall fathers had tall sons, but not quite so tall as
the father. He termed this as regression to the mean, implying that the height of the sons is
regressing more toward what the mean height should be for men of a certain age. It was also
noticed that students who did very well on an exam would not do quite so well on a retake of
the exam. They were regressing to the mean.
The mathematical explanation for this is straightforward from the model $Y = a + bx + \varepsilon$,
$\varepsilon \sim N(0, \sigma)$. Suppose we fix the input $x$ and take a measurement $Y$. This measurement itself
follows a normal distribution with mean $a + bx$ and SD $\sigma$. Suppose the measurement is $Y =
y_1$ and this measurement is 1.5 standard units above average. That would put it at the 93.32
percentile. If that’s a test score it’s a good score. On another measurement of Y D y2 for the
same x; 93.32% of the observations are below y1 : This means there is a really good chance the
second observation will be below $y_1$. There is some chance that $y_2 > y_1$ (about 7%), but it's unlikely
compared to the chance y2 < y1 : That is what regression to the mean refers to. The regression
fallacy is attributing the change in scores to something important rather than just to the chance
variability around the mean.
$$y - \bar y = r\,\frac{s_Y}{s_X}(x - \bar x), \qquad \bar x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \bar y = \frac{1}{n}\sum_{i=1}^n y_i,$$
$$r = \frac{\dfrac{1}{n-1}\displaystyle\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{s_X\,s_Y} = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar x}{s_X}\right)\left(\frac{y_i - \bar y}{s_Y}\right).$$
This says that the sample correlation coefficient is calculated by converting each pair .xi ; yi / to
standard units and then (almost) averaging the products of the standard units.
If we wish to write the regression equation in slope-intercept form, we have
$$y = \hat a + \hat b\,x \quad \text{where} \quad \hat a = \bar y - \hat b\,\bar x, \qquad \hat b = r\,\frac{s_Y}{s_X}.$$
This is the regression equation derived from the probabilistic model $Y = a + bx + \varepsilon$. If we
ignore the probability aspects, we can derive the same equations as follows. The proof is a calculus
exercise.
Proposition 6.6 Given data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, set $f(a, b) = \displaystyle\sum_{i=1}^n (y_i - a - bx_i)^2$.
Then the minimum of $f$ over all $a \in \mathbb{R}$, $b \in \mathbb{R}$ is achieved at $\hat b = r\,\dfrac{s_Y}{s_X}$, $\hat a = \bar y - \hat b\,\bar x$, and the
minimum is $f(\hat a, \hat b) = (n-1)(1 - r^2)s_Y^2$.
Example 6.7 Suppose we choose 11 families randomly and we let $x_i$ = height of brother,
$y_i$ = height of sister. We have the summary statistics $\bar x = 69$, $\bar y = 64$, $s_X = 2.72$, $s_Y = 2.569$.
The correlation coefficient is given by $r = 0.558$. The equation of the regression line is therefore
$$y - 64 = 0.558\,\frac{2.569}{2.72}(x - 69) = 0.527(x - 69).$$
If a brother's height is actually $69 + 2.72$, then the sister's mean height will be
$64 + 0.558(2.569) = 65.43$. The minimum squared error is $f(27.637, 0.527) = 10(1 -
0.558^2)2.569^2 = 45.448$. Any other choice of $a, b$ will result in a larger error.
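Proposition 6.6 needs only the summary statistics, so the line in this example can be checked in a few lines of Python (a sketch; the numbers are those given above):

```python
# Regression line from summary statistics alone (Example 6.7 values).
x_bar, y_bar = 69.0, 64.0
s_x, s_y = 2.72, 2.569
r = 0.558
n = 11

b_hat = r * s_y / s_x                 # slope, about 0.527
a_hat = y_bar - b_hat * x_bar         # intercept, about 27.64

# Minimum squared error from Proposition 6.6: (n-1)(1 - r^2) s_y^2
min_sse = (n - 1) * (1 - r**2) * s_y**2   # about 45.45

print(round(b_hat, 3), round(a_hat, 3), round(min_sse, 3))
```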
Example 6.8 Suppose we know that a student scored in the 82nd percentile of the SAT exam
in high school, and we know that the correlation between high school SAT scores and first-year
college GPA is $\rho = 0.9$. Assuming that high school SAT scores and GPA scores are normally
distributed, what will this student's predicted percentile GPA be?
This problem seems not to provide enough information. Shouldn't we know the means
and SDs of the SAT and GPA scores? Actually we do have enough information to solve it.
First, rewrite the regression equation as
$$\frac{y - \bar y}{s_Y} = r\,\frac{x - \bar x}{s_X},$$
where $x$ = SAT and $y$ = GPA. Knowing that the student scored in the 82nd percentile of SAT
scores, which is assumed normally distributed, tells us that this student's SAT score in standard
units is $\text{invNorm}(0.82) = 0.9154 = \dfrac{x - \bar x}{s_X}$. Therefore, the student's GPA score in standard
units is $\dfrac{y - \bar y}{s_Y} = 0.9(0.9154) = 0.8239$. Therefore, the student's predicted percentile for GPA is
$\text{normalcdf}(-\infty, 0.8239) = 0.795$, or the 79.5th percentile.
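The percentile-to-percentile prediction can be sketched with `statistics.NormalDist`, which provides invNorm (`inv_cdf`) and normalcdf (`cdf`):

```python
from statistics import NormalDist

std_normal = NormalDist()          # standard normal, mean 0 and sd 1

rho = 0.9
z_sat = std_normal.inv_cdf(0.82)   # SAT score in standard units, about 0.9154
z_gpa = rho * z_sat                # predicted GPA in standard units, about 0.824
pct = std_normal.cdf(z_gpa)        # predicted GPA percentile, about 0.795

print(round(z_sat, 4), round(z_gpa, 4), round(pct, 3))
```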
6.2.3 ERRORS OF THE REGRESSION
Next, we need to go further into the use of the regression line to predict y values for given x
values and to determine how well the line approximates the data. We will present an ANOVA
for linear regression to decompose the variation in the dependent variable.
We use the following notation. SST is the total variation in the $Y$-values; it is decomposed in the
next proposition into the variation of the $Y$-values from the fitted line (SSE) and the variation of
the fitted values from the mean of $Y$ (SSR).
When we have data points $(x_i, y_i)$, SSE is the variation of the residuals, $\sum (y_i - a - bx_i)^2$, and
SSR is the variation due to the regression, $\sum (\bar y - a - bx_i)^2$. Using this decomposition we can
get some important consequences.
Proposition 6.10 We have (a) $\text{SSE} = (1 - \rho^2)\,\text{SST}$, and (b) $\rho^2 = \dfrac{\text{SSR}}{\text{SST}} = 1 - \dfrac{\text{SSE}}{\text{SST}}$.
Figure 6.3: For an input $x_0$ the fitted point is $y = ax_0 + b$ and the response is distributed $N(ax_0 + b, \sigma)$; the residual is the vertical distance from a data point $(x_1, y_1)$ to its fitted point on the line.
Using (a),
$$\text{SST} = \text{SSE} + \text{SSR} = \text{SST}(1 - \rho^2) + \text{SSR} \implies \rho^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{SST} - \text{SSE}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.$$
Now suppose we are given a line $y = a + bx$ and we have data points $\{(x_i, y_i)\}$. We will
set $\hat b = r\,\dfrac{s_Y}{s_X}$ and $\hat a = \bar y - \hat b\,\bar x$ as the coefficients of the best fit line when we have the data.
Definition 6.11 The fitted values are $\hat y_i = \hat a + \hat b\,x_i$. These are the points on the regression
line for the associated $x_i$. Residuals are $\varepsilon_i = y_i - \hat y_i$, the difference between the observed data
values and the fitted values.
It is always true for the regression line $y = \hat a + \hat b\,x$ that $\sum \varepsilon_i = 0$, since $\hat a = \bar y - \hat b\,\bar x$ (Figure 6.3). Therefore basing errors on the sum of the residuals won't work.
We have the observed quantities $y_i$, the calculated quantities $\hat y_i = \hat y_i(x_i)$, and the residuals
$\varepsilon_i = y_i - \hat y_i$, which is the amount by which the observed value differs from the fitted value.
The residuals are labeled $\varepsilon_i$ because $\varepsilon_i = y_i - \hat a - \hat b\,x_i$.
Another way to think of the $\varepsilon_i$'s is as observations from the normal distribution giving
the errors at each $x_i$. We have chosen the line so that
$$\sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n \left(y_i - \hat a - \hat b\,x_i\right)^2$$
is the minimum possible.
In the preceding definitions of SST, SSE, and SSR, when we have data points $(x_i, y_i)$ we
replace $Y$ by $y_i$, $X$ by $x_i$, and $\rho$ by $r$. We have:
(a) Error sum of squares, deviation of $y_i$ from $\hat y_i$: $\text{SSE} = \displaystyle\sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n \varepsilon_i^2$.
(b) Regression sum of squares, deviation of $\hat y_i$ from $\bar y$: $\text{SSR} = \displaystyle\sum_{i=1}^n (\hat y_i - \bar y)^2$. This is the amount
of total variation in the $y$-values that is explained by the linear model.
(c) Total sum of squares, deviation of data values $y_i$ from $\bar y$: $\text{SST} = \displaystyle\sum_{i=1}^n (y_i - \bar y)^2$. In other
words, $\dfrac{\text{SST}}{n-1}$ is the sample variance of the $y$-values.
The algebraic relationships between the quantities SST, SSR, SSE become
$$\text{SST} = \text{SSR} + \text{SSE}, \qquad r^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad \text{SSE} = \left(1 - r^2\right)\text{SST}.$$
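The decomposition SST = SSR + SSE and the identity $r^2 = \text{SSR}/\text{SST}$ hold exactly for any least-squares fit; a minimal numerical check on a hypothetical data set:

```python
import math

# Hypothetical small data set (illustration only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar
fitted = [a_hat + b_hat * x for x in xs]

sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)
sst = sum((y - y_bar) ** 2 for y in ys)
r = s_xy / math.sqrt(s_xx * sst)

assert abs(sst - (ssr + sse)) < 1e-9        # SST = SSR + SSE
assert abs(r**2 - ssr / sst) < 1e-9         # r^2 = SSR/SST
assert abs(sse - (1 - r**2) * sst) < 1e-9   # SSE = (1 - r^2) SST
print("decomposition verified")
```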
Remark 6.12 For computation, the following formulas can simplify the work:
$$S_{xx} = \sum x_i^2 - n\bar x^2, \qquad S_{yy} = \sum y_i^2 - n\bar y^2, \qquad S_{xy} = \sum x_i y_i - n\bar x\,\bar y.$$
The Estimate of $\sigma^2$
Recall that we assumed $Y = a + bx + \varepsilon$, where $\varepsilon \sim N(0, \sigma)$. We have for fixed $x$ that the mean of
the data given $x$ is estimated by $\hat a + \hat b\,x$, and $\mathrm{Var}(Y) = \sigma^2$. The variance $\sigma^2$ measures the spread of
the data around the mean $\hat a + \hat b\,x$. The estimate of $\sigma^2$ is given by
$$s^2 = \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2} = \frac{\text{SSE}}{n-2}.$$
The sample value $s$ is called the standard error of the estimate and represents the deviation of
the $y$ data values from the corresponding fitted values of the regression line. We will see later that
$S^2 = \dfrac{1}{n-2}\displaystyle\sum_{i=1}^n \left(Y_i - \hat a - \hat b\,x_i\right)^2$ is an unbiased estimator of $\sigma^2$, and $S^2$ will be associated with
a $\chi^2(n-2)$ random variable. The degrees of freedom will be $n-2$ because of the involvement
of the two unknown parameters $\hat a, \hat b$.
The value of $s$ measures how far above or below the data points are from the regression
line. Since we are assuming a linear model with noise which is normally distributed, we can say
that roughly 68% of the data points will lie within the band created by two lines parallel to the
regression line, one at vertical distance $s$ above the line and one at distance $s$ below. If instead the
lines are at distance $2s$, the band will contain roughly 95% of all the data points. Any data point
lying outside the $2s$ band is considered an outlier.
Remark 6.13 We have shown in Proposition 6.6 that given data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$,
if we set
$$f(a, b) = \sum_{i=1}^n (y_i - a - bx_i)^2,$$
then
$$f\big(\hat a, \hat b\big) = (n-1)\left(1 - r^2\right)s_Y^2 = \sum_{i=1}^n \left(y_i - \hat a - \hat b\,x_i\right)^2 = \sum_{i=1}^n \varepsilon_i^2.$$
Example 6.14 The table contains data for the pairs (height of father, height of son) as well as the
predicted (mean) height of the son for each given height of the father. The resulting difference of
the observed height of the son from the prediction, i.e., the residual, is also listed. The residuals
should not exhibit a consistent pattern but be both positive and negative. Otherwise, a line may
be a bad fit to the data.
X , Father 65 63 67 64 68 62
Y , Son 68 66 68 65 69 66
yO , predicted 66.79 65.84 67.74 66.31 68.22 65.36
", residuals 1.21 0.16 0.26 -1.31 0.78 0.64
X , Father 70 66 68 67 69 71
Y , Son 68 65 71 67 68 70
yO , predicted 69.17 67.27 68.22 67.74 68.69 69.65
", residuals -1.17 -2.27 2.72 -0.74 -0.69 0.35
The regression line is $y = 35.82480 + 0.476377\,x$ and the sample correlation coefficient
is $r = 0.70265$. Also, $s_X = 2.7743$, $\bar x = 66.67$, $s_Y = 1.8809$, $\bar y = 67.583$. The coefficient of
determination is $r^2 = 0.4937$, which means 49% of the variation in the $y$-values is explained by
the regression. The standard error of the estimate of $\sigma$ is $s = 1.40366$, which may be calculated
using $s = 1.8809\sqrt{11/10}\sqrt{1 - 0.4937}$.
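The summary values quoted in this example can be reproduced from the table with the computational formulas of Remark 6.12 (a sketch in Python):

```python
import math

fathers = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
sons    = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(fathers)

x_bar = sum(fathers) / n
y_bar = sum(sons) / n
s_xx = sum(x * x for x in fathers) - n * x_bar**2
s_yy = sum(y * y for y in sons) - n * y_bar**2
s_xy = sum(x * y for x, y in zip(fathers, sons)) - n * x_bar * y_bar

b_hat = s_xy / s_xx                # slope, about 0.4764
a_hat = y_bar - b_hat * x_bar      # intercept, about 35.825
r = s_xy / math.sqrt(s_xx * s_yy)  # correlation, about 0.7027
sse = (1 - r**2) * s_yy
s = math.sqrt(sse / (n - 2))       # standard error of the estimate, about 1.404

print(round(b_hat, 4), round(a_hat, 3), round(r, 4), round(s, 3))
```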
Proof. We will only show the result for $\hat b$. We have, from the fact that $\displaystyle\sum_{i=1}^n (x_i - \bar x) = n\bar x - n\bar x
= 0$,
$$\hat b = \frac{S_{xY}}{S_{xx}} = \frac{\sum_{i=1}^n (x_i - \bar x)(Y_i - \bar Y)}{S_{xx}} = \sum_{i=1}^n \frac{x_i - \bar x}{S_{xx}}\,Y_i.$$
6.3 THE DISTRIBUTIONS OF $\hat a$ AND $\hat b$
Remember that $Y_i$ here is random while $x_i$ is deterministic. This computation shows that $\hat b$ is a
linear combination of independent normally distributed random variables, and therefore $\hat b$ also
has a normal distribution. Now to calculate the mean and SD of $\hat b$ we compute
$$E[\hat b] = \sum_{i=1}^n \frac{x_i - \bar x}{S_{xx}}\,E[Y_i] = \sum_{i=1}^n \frac{x_i - \bar x}{S_{xx}}(a + bx_i) = b,$$
where we use the facts $\sum \dfrac{x_i - \bar x}{S_{xx}} = 0$ and $\sum (x_i - \bar x)x_i = \sum (x_i - \bar x)^2 = S_{xx}$. To find the SD of
$\hat b$, we have from the independence of the random variables $Y_i$,
$$\mathrm{Var}[\hat b] = \sum_{i=1}^n \left(\frac{x_i - \bar x}{S_{xx}}\right)^2 \mathrm{Var}(Y_i) = \frac{\sigma^2}{S_{xx}^2}\,S_{xx} = \frac{\sigma^2}{S_{xx}}.$$
Since $\varepsilon_i \sim N(0, \sigma)$, $i = 1, 2, \ldots, n$, are independent and normally distributed, the sum of
squares of normals has a $\chi^2$ distribution. The numerator of $S^2$ seems to be $\chi^2(n)$. However,
there are two parameters $\hat a$ and $\hat b$ in this expression, so the degrees of freedom actually turns out
to be $n-2$, not $n$. That is,
$$\frac{(n-2)S^2}{\sigma^2} = \frac{\text{SSE}}{\sigma^2} \sim \chi^2(n-2).$$
This means that if we replace $\sigma$ by $s$ in the distributions of $\hat a$ and $\hat b$, we have the following.
Theorem 6.16
$$\frac{\hat a - a}{S\sqrt{\dfrac{\sum_{i=1}^n x_i^2}{n\,S_{xx}}}} \sim t(n-2) \qquad \text{and} \qquad \frac{\hat b - b}{S/\sqrt{S_{xx}}} \sim t(n-2).$$
Proof. We have the standardized random variables
$$\frac{\hat a - a}{\mathrm{SD}(\hat a)} = \frac{\hat a - a}{\sigma\sqrt{\dfrac{\sum_{i=1}^n x_i^2}{n\,S_{xx}}}} \sim N(0, 1) \qquad \text{and} \qquad \frac{\hat b - b}{\mathrm{SD}(\hat b)} = \frac{\hat b - b}{\sigma/\sqrt{S_{xx}}} \sim N(0, 1).$$
Therefore, since $S = \sigma\sqrt{\chi^2(n-2)/(n-2)}$,
$$\frac{\hat a - a}{S\sqrt{\dfrac{\sum_{i=1}^n x_i^2}{n\,S_{xx}}}} = \frac{N(0, 1)}{\sqrt{\dfrac{\chi^2(n-2)}{n-2}}} = t(n-2).$$
As usual, when we replace by the sample standard deviation, the normal distribution gets
changed to a t -distribution.
Example 6.17 Consider the data from Example 6.14 concerning heights of fathers and sons.
We calculate a 95% CI for $\hat a$ and $\hat b$. We have $\hat a = 35.8248$, $\hat b = 0.4764$. Using Theorem 6.16
6.4. CONFIDENCE INTERVALS FOR SLOPE AND INTERCEPT AND HYPOTHESIS TESTS 183
we calculate, using $\mathrm{SE}(\hat a) = s\sqrt{\dfrac{\sum x_i^2}{n\,S_{xx}}}$, that $s = \sqrt{\text{SSE}/(n-2)} = \sqrt{19.702/10} = 1.4036$. Since
$\sum x_i^2 = 53418$, $\bar x = 66.66$, we have $S_{xx} = 53418 - 12(66.66)^2 = 84.667$ and then $\mathrm{SE}(\hat b) =
1.4036/\sqrt{84.667} = 0.1525$. For 95% confidence, we have $t(10, 0.025) = \text{invT}(0.975, 10) =
2.228$, and then the CI for the slope is $0.4764 \pm 2.228 \times 0.1525$, i.e., $(0.136, 0.816)$. The 95%
CI for the intercept is $35.8248 \pm 2.228 \times 1.4036\sqrt{\dfrac{53418}{12 \times 84.67}} = 35.8248 \pm 22.675$.
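The interval arithmetic in this example is easy to script. The standard library has no $t$ quantile function, so the critical value $t(10, 0.025) = 2.228$ is typed in from the example (an assumption of this sketch):

```python
import math

# Quantities from Examples 6.14 and 6.17
n = 12
b_hat, a_hat = 0.4764, 35.8248
s = 1.4036
s_xx = 84.667
sum_x2 = 53418
t_crit = 2.228          # t(10, 0.025), taken from the example

se_b = s / math.sqrt(s_xx)                 # about 0.1525
se_a = s * math.sqrt(sum_x2 / (n * s_xx))  # about 10.18

ci_slope = (b_hat - t_crit * se_b, b_hat + t_crit * se_b)
ci_intercept = (a_hat - t_crit * se_a, a_hat + t_crit * se_a)
print([round(v, 3) for v in ci_slope], [round(v, 2) for v in ci_intercept])
```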
• We can use the test for the slope to determine if there is a linear relationship between the
$x$ and $Y$ variables. This is based on the fact that $b = \rho\,\sigma_Y/\sigma_X$, so that $\rho = 0$ if and only
if $b = 0$. Therefore, to test for a linear relationship the null is $H_0\colon b = 0$, i.e., there is no
linear relationship. The alternative hypothesis is that there is a linear relationship, $H_1\colon b \ne 0$.
The test statistic is $t_0 = \dfrac{\hat b - 0}{\mathrm{SE}(\hat b)}$.
The null is rejected at level $\alpha$ if $|t_0| > t(n-2, \alpha/2)$.
Figure 6.4: World record 100 meter times, 1964–2009.
Figure 6.5: Residuals.
In all cases we may calculate the $p$-values. For instance, if $H_1\colon b < b_0$ in the test for the slope
$H_0\colon b = b_0$, the $p$-value is $P(t(n-2) \le t_0)$.
Example 6.18 Figure 6.4 (Figure 6.5 is a plot of the residuals) is a plot of the world record 100
meter times (for men) measured from 1963 until 2009. There are 24 data points, with the last
data point (46, 9.58) corresponding to (2009, 9.58). This point beat the previous world record
(2008, 9.69) by 0.11 seconds!
The regression line is given by $y = 10.0608 - 0.00743113x$ with a correlation coefficient
of $r = -0.91455$. It is easy to calculate $\text{SSR} = 0.256318$, $\text{SSE} = 0.050139$, $\text{SST} = 0.306457$, and
$\mathrm{SE}(\hat a) = 0.0227721$, $\mathrm{SE}(\hat b) = 0.000700655$. The standard error of the estimate of the regression
is $s = \sqrt{\text{SSE}/22} = 0.047735$. Also, $S_{xx} = 4641.625$. The 95% CI for the slope is $-0.0074313 \pm
t(22, 0.025)\dfrac{0.047735}{\sqrt{4641.625}}$, which gives $(-0.00888, -0.005978)$. If we test the hypothesis $H_0\colon b = 0$,
the test statistic is $t = \dfrac{-0.00743113}{0.047735/\sqrt{4641.625}} = -10.606$. Since $t(22, 0.005) = 2.819$, we will reject
the null if $\alpha = 0.01$. Observe also that if we project the linear model into the future, in the year
3316 the world record time would be zero.
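A sketch of the slope test computation, with the summary values copied from the example (the $t$ critical value itself would still come from a table):

```python
import math

# Summary values from Example 6.18
b_hat = -0.00743113
sse = 0.050139
s_xx = 4641.625
n = 24

s = math.sqrt(sse / (n - 2))        # standard error of the estimate, about 0.0477
se_b = s / math.sqrt(s_xx)          # about 0.0007007
t0 = b_hat / se_b                   # test statistic, about -10.6

print(round(s, 5), round(se_b, 7), round(t0, 2))
```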
Notice that $\dfrac{\hat b}{\mathrm{SE}(\hat b)} \sim t(n-2)$, i.e., it has a $t$-distribution with $n-2$ degrees of freedom. This
is the test statistic for the hypothesis $H_0\colon b = 0$. The square of a $t$-distribution with $k$ degrees
of freedom is an $F(1, k)$ distributed random variable. Therefore, the last column says that
MSR/MSE has an $F$-distribution with degrees of freedom $(1, n-2)$. This gives us the value of
the test statistic squared in the hypothesis test $H_0\colon b = 0$ against $H_1\colon b \ne 0$, and we may reject
$H_0$ if $F > F(1, n-2, \alpha)$ at level $\alpha$.
Example 6.19 A delivery company records the following delivery times depending on miles
driven.
Distance 2 2 2 5 5 5 10 10 10 15 15 15
Time 10.2 14.6 18.2 20.1 22.4 30.6 30.8 35.4 50.6 60.1 68.4 72.1
The regression line becomes $y = 4.54677 + 3.94728\,x$. The statistics for the regression
line are summarized in the table.
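A quick check of the delivery-time regression line, computed from the table above (a sketch using the formulas of Remark 6.12):

```python
# Least-squares fit of the delivery-time data from the table.
dist = [2, 2, 2, 5, 5, 5, 10, 10, 10, 15, 15, 15]
time = [10.2, 14.6, 18.2, 20.1, 22.4, 30.6, 30.8, 35.4, 50.6, 60.1, 68.4, 72.1]
n = len(dist)

x_bar = sum(dist) / n
y_bar = sum(time) / n
s_xx = sum(x * x for x in dist) - n * x_bar**2
s_xy = sum(x * y for x, y in zip(dist, time)) - n * x_bar * y_bar

b_hat = s_xy / s_xx               # slope, about 3.9473
a_hat = y_bar - b_hat * x_bar     # intercept, about 4.5468
print(round(a_hat, 5), round(b_hat, 5))
```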
Remark 6.20 The random variable $T = \dfrac{\bar X - \mu_0}{S/\sqrt{n}}$, when we have a random sample $X_1, \ldots, X_n$
from a normal population, has a Student's $t$-distribution. Here, in the regression setting,
$\dfrac{(n-2)S^2}{\sigma^2} \sim \chi^2(n-2)$. So now the $T$ variable may be rewritten as
$$T = \frac{\bar X - \mu_0}{S/\sqrt{n}} = \frac{(\bar X - \mu_0)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}} = \frac{Z}{\sqrt{S^2/\sigma^2}},$$
and then
$$T^2 = \frac{Z^2/1}{S^2/\sigma^2} = \frac{\chi^2(1)}{\chi^2(n-2)/(n-2)} = F(1, n-2),$$
since $Z^2 \sim \chi^2(1)$. Therefore, $T^2(n-2, \alpha/2) = F(1, n-2, \alpha)$. We have to use $\alpha/2$ for the
$t$-distribution because of the fact that $T$ is squared, i.e., $P(F(1, n-2, \alpha) > f) = P(|T(n-2, \alpha/2)| > t) = \alpha$. That's why the two-sided test has $p$-value $P(F(1, n-2) > f)$.
1 In general, predictions of y for a given x using a regression line are only valid in the range of the data of the x values.
We denote the predicted value of the rv $Y$ by $y_p$ and the estimate of $E(Y_p)$ by $\hat\mu_p$. In the
absence of other information, the best estimates of both will be given by $y_p = \hat\mu_p = \hat a + \hat b\,x_p$;
the difference is that the errors of these estimates are not the same. In particular, we will have a
confidence interval for $\hat\mu_p$, but a prediction interval (abbreviated PI) for $y_p$.
We will not derive these results but simply note that the only difference between them
is the additional 1 inside the square root. This makes the PI for a response wider than the CI
for the mean response to reflect the additional uncertainty in predicting a single response rather
than the mean response.
If we let $x_p$ vary over the range of values for which linearity holds, we obtain confidence
curves and prediction curves which band the regression line. The bands are narrowest at the
point $(\bar x, \bar y)$ on the regression line. Extrapolating the curves results in ever-widening
bands, and beyond the range of the observed data, linearity and the bands may not
make sense. Making predictions beyond the range of the data is a bad idea in general.
Example 6.22 The following table exhibits the data for 15 students giving the time to complete
a test x and the resulting score y:
Figure 6.6: Panels: Data; Data Minus Outlier; Confidence and Prediction Bands; Residuals.
index     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
time (x)  59 49 61 52 61 52 48 53 68 57 49 70 62 52 10
score (y) 50 95 73 59 98 84 78 65 79 84 46 90 60 57 15
The data have summary statistics $\bar x = 57.3571$, $\bar y = 72.7143$, $s_X = 7.17482$, $s_Y = 16.7398$,
and $r = 0.2046$.
Figure 6.6 has a scatterplot of the data. As soon as we see the plot, we see that point 15
is an outlier and is either a mistake in recording or the student gave up and quit. We need to
remove this point, and we will consider it dropped. Figure 6.6 shows the data points with the
outlier removed and the fitted regression line. This line has equation
$$y = 45.34 + 0.4773x.$$
A plot of the data, the fitted line, and the mean confidence bands (95% confidence) and
single prediction bands show that the prediction bands are much wider than the confidence
bands reflecting the uncertainty in prediction. It is also clear that the use of these bands should
be restricted to the range of the data.
The equations of the bands here are given by
$$y = 0.477x + 45.337 \pm 2.179\sqrt{0.435x^2 - 49.86x + 1450.66} \tag{6.3}$$
for the 95% confidence bands; the 95% prediction bands (6.4) have the same form with a larger
expression under the square root. This means that for a given input $x_p$ = test time, the predicted
test score would be $45.34 + 0.477x_p$. The 95% CI for this mean predicted test score is given in
(6.3) and the 95% PI for this particular $x_p$ is given in (6.4), with $x = x_p$. For example, suppose a
student takes 50 minutes to complete the test. According to our linear model, we would predict
a mean score of 69.2. The 95% CI for this mean score is $[54.70, 83.70]$. On the other hand, the
95% PI for the particular score associated with a test time of 50 is $[29.31, 109.1]$, which is a very
wide interval and doesn't even make sense if the maximum score is 100.
If we test the hypothesis $H_0\colon b = 0$ against the alternative $H_1\colon b \ne 0$, we get the result
that the $p$-value is $p = 0.4829$, which means that the null hypothesis is plausible, i.e., it may not
be reasonable to predict test score from test time.
We have the ANOVA table
Source DF SS MS F-statistic p-Value
Regression 1 152.469 152.469 0.524191 0.482936
Residuals 12 3490.39 290.866
Total 13 3642.86
Finally, we plot the residuals in order to determine if there is a pattern which would invalidate
the model. The summary statistics for this regression tell us the correlation coefficient is
$r = 0.2046$, so the coefficient of determination is $r^2 = 0.0419$, and only about 4% of the variation
in the scores is explained by the linear regression line. The 95% CIs for the slope and intercept
are $(-0.959108, 1.91375)$ and $(-37.6491, 128.322)$, respectively.
Remark 6.23 One important point of the previous example and linear regression in general
is the identification and elimination of outliers. Because we are using a linear model, a single
outlier can drastically affect the slope, invalidating the results. To identify possible outliers use
the following method.
Using all the data points, calculate the regression line and the estimate $s$ of $\sigma$. Suppose the
regression line becomes $y = a + bx$. Now consider the two lines $y = a + bx \pm 2s$. In other
words, we shift the regression line up and down by twice the SD of the residuals. Any data point
lying outside this band around the regression line is considered an outlier. Remove any such
points and recalculate. We use twice the SD to account for approximately 95% of reasonable
data values.
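The screening procedure of this remark can be sketched as follows, on hypothetical data with one planted outlier (the data and the outlier position are made up for illustration):

```python
import math

# Hypothetical data: points on y = 2x + 1 with one gross outlier at index 5
xs = list(range(12))
ys = [2 * x + 1 for x in xs]
ys[5] += 30.0                     # planted outlier

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = s_xy / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
s = math.sqrt(sum(e * e for e in residuals) / (n - 2))

# Flag any point outside the band y = a + bx +/- 2s
outliers = [i for i, e in enumerate(residuals) if abs(e) > 2 * s]
print(outliers)   # prints [5]
```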
To analyze $R$ we would be faced with the extremely difficult job of determining its distribution.
The only case when this is not too hard is when we want to consider the hypothesis test $H_0\colon
\rho = 0$, because we know that this hypothesis is equivalent to testing $H_0\colon b = 0$, i.e., that the slope of
the regression line is zero. This is due to the formula $b = \rho\,\dfrac{\sigma_Y}{\sigma_X}$.
We will work with the sample values $(x_i, y_i)$, $\bar x$, $\bar y$, $s_x$, $s_y$, and $r$. Now we have seen that
$$r^2 = \hat b^2\,\frac{S_{xx}}{S_{yy}} = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}},$$
so that
$$r = \hat b\,\frac{s_x}{s_y} = \hat b\,\frac{\sqrt{\frac{1}{n-1}S_{xx}}}{\sqrt{\frac{1}{n-1}S_{yy}}} = \hat b\sqrt{\frac{S_{xx}}{S_{yy}}} = \hat b\sqrt{\frac{S_{xx}}{\text{SST}}}, \qquad \text{and} \qquad 1 - r^2 = \frac{\text{SSE}}{\text{SST}} = \frac{(n-2)s^2}{\text{SST}}.$$
We also know that the test statistic for $H_0\colon b = 0$ is $t = \dfrac{\hat b - 0}{\mathrm{SE}(\hat b)}$ and is distributed as $t(n-2)$.
Since $\mathrm{SE}(\hat b) = \dfrac{s}{\sqrt{S_{xx}}}$, we have
$$t = \frac{\hat b}{\mathrm{SE}(\hat b)} = \hat b\,\frac{\sqrt{S_{xx}}}{s} = \hat b\sqrt{\frac{S_{xx}}{\text{SST}}}\sqrt{\frac{(n-2)\,\text{SST}}{(n-2)s^2}} = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}.$$
Figure 6.7: Scatterplot of height vs. self esteem with the fitted regression line; residual plot.
p
R n 2
More precisely, the rv T D p t.n 2/:
1 R2
Example 6.24 Does a person’s self esteem depend on their height? (See Figure 6.7.) The fol-
lowing table gives data from 20 people.
Person Height Self Esteem Person Height Self Esteem
1 68 4.1 11 68 3.5
2 71 4.6 12 67 3.2
3 62 3.8 13 63 3.7
4 75 4.4 14 62 3.3
5 58 3.2 15 60 3.4
6 60 3.1 16 63 4.0
7 67 3.8 17 65 4.1
8 68 4.1 18 67 3.8
9 71 4.3 19 63 3.4
10 69 3.7 20 61 3.6
The sample correlation coefficient is $r = 0.730636$. The fitted regression equation is
$y = -0.866269 + 0.07066x$. The 95% CIs for the slope and intercept are $[0.0379, 0.10336]$ and
$[-3.00936, 1.27682]$, respectively. The calculations for the slope and intercept are shown in the
next table.
To test the hypothesis $H_0\colon \rho = 0$ against $H_1\colon \rho \ne 0$, we see that the $t$ statistic for the
correlation (and also the slope) is
$$t(18) = \frac{0.73\sqrt{20-2}}{\sqrt{1 - 0.73^2}} = 4.54009,$$
which results in a $p$-value² of 0.00025, i.e., $P(|t(18)| \ge 4.54009) = 0.00025$. Thus, we have
high statistical significance and plenty of evidence that the correlation (and the slope) is not
zero. Incidentally, the residual plot shows that there may be an outlier at person 12.
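The $t$ statistic for the correlation is a one-liner (values from this example; the unrounded $r = 0.730636$ is used):

```python
import math

r = 0.730636
n = 20

t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # about 4.54
print(round(t, 4))
```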
Example 6.25 Suppose we calculate $r = 0.32$ from a sample of size $n = 18$. First we perform
the test $H_0\colon \rho = 0$, $H_1\colon \rho > 0$. The statistic is
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{0.32\sqrt{16}}{\sqrt{1 - 0.32^2}} = 1.35.$$
For 16 degrees of freedom we have $P(t(16) > 1.35) = 0.0979$, so we do not reject the null.
Next, we find the sample size necessary in order to conclude that $r = 0.32$ differs significantly
from 0 at the $\alpha = 0.05$ level. In order to reject $H_0\colon \rho = 0$ with a two-sided alternative,
we would need $t(n-2, 0.025) \le t = \dfrac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$. The sample size needed is the solution of
the equation
$$t(n-2, 0.025) = \frac{0.32\sqrt{n-2}}{\sqrt{1 - 0.32^2}}$$
for $n$. In general, this cannot be solved exactly. By trial and error, we have for $n =
38$, $t(36, 0.025) = 2.02809$ and $0.33776\sqrt{38-2} = 2.02656$. Also, for $n = 39$, $t(37, 0.025) =
2.026 < 0.33776\sqrt{37} = 2.0541$. Consequently, the first $n$ which works is $n = 39$.
Testing $H_0\colon \rho = \rho_0$ Against $H_1\colon \rho \ne \rho_0$
It is possible to show that
$$Z = \frac12\ln\frac{1+R}{1-R} \qquad \text{with inverse} \qquad R = \frac{e^{2Z}-1}{e^{2Z}+1},$$
2 The p-value is obtained using a TI-83 as 2tcdf.4:54009; 999; 18/ D 0:0002:
has an approximate normal distribution with mean $\dfrac12\ln\dfrac{1+\rho}{1-\rho}$ and SD $= \dfrac{1}{\sqrt{n-3}}$. This means
we can base a hypothesis test of $H_0\colon \rho = \rho_0$ on the test statistic
$$z = \frac{\dfrac12\ln\dfrac{1+r}{1-r} - \dfrac12\ln\dfrac{1+\rho_0}{1-\rho_0}}{\sqrt{\dfrac{1}{n-3}}},$$
which comes from the usual formula $z = \dfrac{\text{observed} - \text{expected}}{\text{SE}}$. Then we proceed to reach a conclusion
based on our choice of alternative:
$H_1\colon \rho < \rho_0$: reject the null if $z \le -z_\alpha$;
$H_1\colon \rho > \rho_0$: reject the null if $z \ge z_\alpha$;
$H_1\colon \rho \ne \rho_0$: reject the null if $|z| \ge z_{\alpha/2}$.
Using the statistic $z$, we may also construct a $100(1-\alpha)\%$ CI for $\rho$ by using
$$\frac12\ln\frac{1+r}{1-r} \pm z_{\alpha/2}\,\frac{1}{\sqrt{n-3}}$$
and then transforming back to $r$. Here's an example.
and then transforming back to r: Here’s an example.
Example 6.26 A sample of size $n = 20$ resulted in a sample correlation coefficient of $r =
0.626$. The 95% CI for $\frac12\ln\frac{1+r}{1-r} = 0.7348$ is $0.7348 \pm 1.96\frac{1}{\sqrt{17}} = 0.7348 \pm 0.4754$. Using
the inverse transformation, we get the 95% CI for $\rho$ given by
$$\left(\frac{e^{2(0.2594)}-1}{e^{2(0.2594)}+1},\ \frac{e^{2(1.2102)}-1}{e^{2(1.2102)}+1}\right) = (0.2537, 0.8367).$$
Suppose a second random sample of size $n = 15$ results in a sample correlation coefficient of
$r = 0.405$. Based on this, we would like to test $H_0\colon \rho = 0.626$ against $H_1\colon \rho \ne 0.626$. Since
our sample correlation coefficient $0.405 \in (0.2537, 0.8367)$, we cannot reject the null, and this is
not sufficient evidence to conclude the correlation coefficient is not 0.626.
Suppose another random sample led to a correlation coefficient (sample size n D 24) of
r D 0:75: We now want to test H0 W D 0:626; H1 W > :626. Here we calculate
1 1 C :75 1 1 C :626
ln ln
z D 2 1 :75r 2 1 :626 D 1:0913:
1
24 3
Since P .Z > 1:0913/ D 0:1375 we still cannot reject the null. If we had specified ˛ D 0:05; then
z0:05 D 1:645 and since 1:645 > 1:0913; we cannot reject the null at the 5% level of significance.
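The Fisher transform and its inverse are short enough to compute directly. A sketch (not from the book) that reproduces the 95% CI of Example 6.26:

```python
# Hypothetical sketch: 95% CI for rho via the Fisher z-transform,
# with n = 20 and r = 0.626 as in Example 6.26.
import math

n, r = 20, 0.626
z = 0.5 * math.log((1 + r) / (1 - r))   # transformed r, about 0.7348
hw = 1.96 / math.sqrt(n - 3)            # z_{0.025} / sqrt(n - 3), about 0.4754
inv = lambda w: (math.exp(2 * w) - 1) / (math.exp(2 * w) + 1)
lo, hi = inv(z - hw), inv(z + hw)
print(round(lo, 3), round(hi, 3))  # 0.254 0.837
```

The endpoints agree with the book's $(0.2537, 0.8367)$ up to rounding.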
Example 6.27 This example shows that we can compare the difference of correlations for two independent random samples. All we need to note is that the SE for the difference will be given by
\[ SE = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}. \]
Suppose we took two independent random samples of sizes $n_1 = 28$, $n_2 = 35$ and calculate the correlation coefficient of each sample to be $r_1 = 0.5$, $r_2 = 0.3$. We want to test $H_0\colon \rho_1 = \rho_2$, $H_1\colon \rho_1 \ne \rho_2$. The test statistic is
\[ z = \frac{\dfrac{1}{2}\ln\dfrac{1 + 0.5}{1 - 0.5} - \dfrac{1}{2}\ln\dfrac{1 + 0.3}{1 - 0.3}}{\sqrt{\dfrac{1}{28 - 3} + \dfrac{1}{35 - 3}}} = 0.8985. \]
Since $P(Z > 0.8985) = 0.184$, the p-value is $2(0.184) = 0.368$ and we cannot reject the null.
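The two-sample statistic can be sketched the same way (again, not from the book):

```python
# Hypothetical sketch: comparing two sample correlations, with
# r1 = 0.5 (n1 = 28) and r2 = 0.3 (n2 = 35) as in Example 6.27.
import math

def fisher(r):
    return 0.5 * math.log((1 + r) / (1 - r))

n1, r1, n2, r2 = 28, 0.5, 35, 0.3
se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
z = (fisher(r1) - fisher(r2)) / se
print(round(z, 2))  # 0.9, matching the book's 0.8985 up to rounding
```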
6.5 PROBLEMS
6.1. Show that the equations for the minimum of $f(a, b)$ in Proposition 6.6 are
\[ \frac{\partial f}{\partial a} = -2n\left(\bar{y} - a - b\bar{x}\right) = 0, \qquad \frac{\partial f}{\partial b} = -2\sum_{i=1}^{n}\left(x_i y_i - a x_i - b x_i^2\right) = 0. \]
Solve the first equation for $a$ and substitute into the second to find the formulas for $\hat{a}$ and $\hat{b}$ in Proposition 6.6.
6.2. We have seen that if we consider $x$ as the independent and $y$ as the dependent variable, the regression line is $y - \bar{y} = r\frac{s_y}{s_x}(x - \bar{x})$. What is the regression line if we assume instead that $y$ is the independent variable and $x$ is the dependent variable? Derive the equation by minimizing $f(a, b) = \sum_{i=1}^{n}(x_i - a - b y_i)^2$. Find the value of $f$ at the optimal $\hat{a}$, $\hat{b}$.
6.3. If the regression line with dependent $Y$ is $Y = a + bX$ and the line with dependent $X$ is $X = c + dY$, derive that $bd = \rho^2$. Then, given the two lines $Y = a + 0.476X$, $X = c + 1.036Y$, find $\rho$.
6.4. If $Y_i = 1.1 + 2.5x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, 1.7)$, is a regression model with independent errors, find
(a) the distribution of $Y_1 - Y_2$ when $Y_1$ corresponds to $x_1 = 3$ and $Y_2$ corresponds to $x_2 = 4$;
(b) $P(Y_1 > Y_2)$.
6.5. Given the data in the table, find the equation of the regression line in the form $(y - \bar{y}) = r\frac{s_y}{s_x}(x - \bar{x})$ and also with $x$ as the dependent variable. Find the minimum value of $f(a, b) = \sum_{i=1}^{n}(x_i - a - b y_i)^2$ with dependent variable $x$ and with dependent variable $y$.
6.6. Math and verbal SAT scores at a university have the following summary statistics:
\[ \overline{MSAT} = 570,\ SD_{MSAT} = 110 \quad\text{and}\quad \overline{VSAT} = 610,\ SD_{VSAT} = 120. \]
Suppose the correlation coefficient is $r = 0.73$.
(a) If a student scores 690 on the MSAT, what is the prediction for the VSAT score?
(b) If a student scores 700 on the VSAT, what is the prediction for the MSAT score?
(c) If a student scores in the 77th percentile for the MSAT, what percentile will she
score in the VSAT?
(d) What is the standard error of the estimate of the regression?
6.7. We have the following summary statistics for a linear regression model relating the
heights of sisters and brothers:
\[ \bar{B} = 68,\ SD_B = 2.4, \quad \bar{S} = 62,\ SD_S = 2.2, \quad n = 32. \]
The correlation coefficient is $r = 0.26$.
(a) What percentage of sisters were over 68 inches?
(b) Of the women who had brothers who were 72 inches tall, what percentage were
over 68 inches tall?
6.8. Suppose the correlation between the educational levels of brothers and sisters in a city is 0.8. Both brothers and sisters averaged 12 years of school with an SD of 2 years.
(a) What is the predicted educational level of a woman whose brother has completed
18 school years?
(b) What is the predicted educational level of a brother whose sister has completed 16
years of school?
6.9. In a large biology class the correlation between midterm grades and final grades is about
0.5 for almost every semester. Suppose a student’s percentile score on the midterm is
(a) 4%  (b) 75%  (c) 55%  (d) unknown
Predict the student’s final grade percentile in each case.
6.10. In many real applications it is known that the $y$-intercept must be zero. Derive the least squares line through the origin that minimizes $f(b) = \sum_{i=1}^{n}(y_i - b x_i)^2$ for given data points $\{(x_i, y_i)\}_{i=1}^{n}$.
6.11. Find the equations, but do not solve, for the best quadratic approximation to the data points $\{(x_i, y_i)\}_{i=1}^{n}$. That is, find the equations for $a, b, c$ which minimize $f(a, b, c) = \sum_{i=1}^{n}(y_i - a - b x_i - c x_i^2)^2$. Now find the best quadratic approximation to the data
\[ (-3, 7.5),\ (-2, 3),\ (-1, 0.5),\ (0, 1),\ (1, 3),\ (2, 6),\ (3, 14). \]
6.12. This problem shows how to get the estimates of $a$, $b$, and $\sigma$ by maximizing a function called the likelihood function. Recall that we assume $\varepsilon = Y - a - bx \sim N(0, \sigma)$. Define, for the data points $\vec{x} = (x_1, \ldots, x_n)$ and $\vec{y} = (y_1, \ldots, y_n)$,
\[ L\left(a, b, \sigma;\ \vec{x}, \vec{y}\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i - a - b x_i)^2}{2\sigma^2}\right). \]
(b) What is the connection of the first two equations with Proposition 6.6?
(c) The third equation gives the estimator for $\sigma^2$, $s^2 = \frac{1}{n}\sum_i (y_i - a - b x_i)^2$. Assuming
\[ \frac{1}{\sigma^2} S^2 = \frac{1}{\sigma^2}\sum_i (Y_i - a - b x_i)^2 \sim \chi^2(n-2), \]
find $E(S^2/\sigma^2)$ and then find an unbiased estimator of $\sigma^2$ using $s^2 = \frac{1}{n}\sum_i (y_i - a - b x_i)^2$.
6.13. In the following table we have predicted values from a regression line model and the actual data values. Calculate the residuals and the SE estimate of the regression, $s$.

$y_i$:      55  64  48  49  58
$\hat{y}_i$: 62  61  49  51  50
6.14. The table contains Consumer Price Index data for 2008–2018 for gasoline and food.
Year 08 09 10 11 12 13 14 15 16 17 18
Gas 34.5 -40.4 51.3 13.4 9.7 -1.5 0.1 -35.4 -7.3 20.3 8.5
Food 4.8 5.2 -0.2 1.8 4.2 1.6 1.1 3.1 0.9 -0.1 1.6
Find the correlation coefficient and the regression equation. Find the standard error of the estimate of the regression.
6.15. In a study to determine the relationship between income and IQ, we have the following
summary statistics:
(a) Find the regression equation for predicting income from IQ.
(b) Find the regression equation for predicting IQ from income.
(c) If the subjects in the data are followed for a year and everyone’s income goes up by
10%, find the new regression equation for predicting income from IQ.
6.16. In a sample of 10 Home Runs in Major League Baseball, the following summary statis-
tics were computed:
We are also given that the correlation between SpeedOffBat and Distance is 0.098, the correlation between Apex and Distance is -0.058, and the correlation between SpeedOffBat and Apex is 0.3977.
(a) Find the regression lines for predicting distance from SpeedOffBat and from Apex.
(b) Find the ANOVA tables for each case.
(c) Test the hypotheses that the correlation coefficients in each case are zero.
6.17. The duration an ulcer lasts, by stage of the ulcer, is given in the table.
Stage(x) 4 3 5 4 4 3 3 4 6 3
Days(y) 18 6 20 15 16 15 10 18 26 15
Stage(x) 3 4 3 2 3 2 2 3 5 6
Days(y) 8 16 17 6 7 7 8 11 21 24
Find the ANOVA table and test the hypothesis that the slope of the regression line is
zero.
6.18. Consider the data:
x -1 0 2 -2 5 6 8 11 12 -3
y -5 -4 2 -7 6 9 13 21 20 -9
Find the regression line and test the hypothesis that the slope of the line is zero.
6.19. The following table gives the speed of a car, $x$, and the stopping distance, $y$, when the brakes are applied with full force.
Fit a regression line to this data and plot the line and the data points on the same plot.
Also plot the residuals. Test the hypothesis that the data is uncorrelated.
6.20. In the hypothesis testing chapter we considered testing $H_0\colon \mu_1 = \mu_2$ from two independent random samples. We can set this up as a linear regression problem using the following steps.
• The sample sizes are $n_1$ and $n_2$. The $y$ data values are the observations from the two samples, labeled $y_1, \ldots, y_{n_1}, y_{n_1+1}, \ldots, y_{n_1+n_2}$. The $x$-data values are defined by
\[ x_i = \begin{cases} 1 & \text{if } y_i \text{ comes from sample 1,} \\ 0 & \text{if } y_i \text{ comes from sample 2,} \end{cases} \qquad i = 1, 2, \ldots, n_1 + n_2. \]
• Calculate the regression line for the data values $\{(x_i, y_i)\}_{i=1}^{n_1 + n_2}$.

For the data

x:  -6  -2  2  2  4
y:  12   8  6  2  2

calculate $S_{xx}$, $S_{xy}$, $S_{yy}$, $\hat{b}$, $\hat{a}$, and $s^2$.
APPENDIX A
Answers to Problems
A.1 ANSWERS TO CHAPTER 1 PROBLEMS
1.1 $p = 0.3$ and $p = 3/7$.

1.2 (a) $P(A \cap B) = 1/12$. (b) $P(A^c \cup B) = 9/12$.

1.4 (a) Let $C$ = cigarettes, $Cg$ = cigars. Then $P(C^c \cap Cg^c) = 0.64$.
(b) $P(Cg \cap C^c) = 0.04$.

In the table, 0 indicates an outcome is not in the event and 1 indicates it is. Notice that all possibilities are covered in this table and that the columns $(A \cap B)^c$ and $A^c \cup B^c$ are identical. Therefore, the two events must be the same.

1.35 Let $x$ be the number of red balls in the second box; $x = 11$.

1.37 (a) $P(7H) = \binom{10}{7} 0.4^7\, 0.6^3 \cdot \frac{1}{2} + \binom{10}{7} 0.7^7\, 0.3^3 \cdot \frac{1}{2}$.
(b) Let $A$ = first toss is H; then $P(7H \mid A) = \binom{9}{6} 0.4^6\, 0.6^3 \cdot \frac{4}{11} + \binom{9}{6} 0.7^6\, 0.3^3 \cdot \frac{7}{11}$.

1.38 $P(A \mid B) = \sum_i P(A \cap E_i \mid B) = \sum_i \frac{P(A \cap B \cap E_i)}{P(B \cap E_i)} \cdot \frac{P(B \cap E_i)}{P(B)}$.

1.39 If we calculate the proportion of male math majors at each university, we get 0.2 at university 1 and 0.3 at university 2. For females it is $0.15 < 0.2$ at university 1 and $0.25 < 0.3$ at university 2. These inequalities are reversed in the amalgamated table.
1.40 Let $R$ = recover, $D$ = drug. We have $P(R \mid M \cap D) = 0.27 < P(R \mid M \cap D^c) = 0.33$ and $P(R \mid F \cap D) = 0.642 < P(R \mid F \cap D^c) = 0.66$, so the recovery probability is lower for both males and females on the drug. However, $P(R \mid D) = 0.538 > P(R \mid D^c) = 0.44$, so that, amalgamated, the drug is better.
A.2 ANSWERS TO CHAPTER 2 PROBLEMS

2.7 (a) $c = 1$.
(b)
\[ F_X(x) = \begin{cases} 0, & x \le -3 \\ \frac{1}{2}(x + 3)^2, & -3 < x \le -2 \\ \frac{1}{2}, & -2 \le x \le 2 \\ 1 - \frac{1}{2}(3 - x)^2, & 2 < x \le 3 \\ 1, & x > 3. \end{cases} \]

2.10 0.364577.

2.11 $P(X \le 70) \approx \text{normalcdf}\bigl(0, 70.5, 75, \sqrt{100(0.75)(0.25)}\bigr) = 0.149348$ using the normal approximation. The exact value using the binomial is $P(X \le 70) = \text{binomcdf}(100, 0.75, 70) = 0.149541$.

\[ P(X_1 = 1, X_2 = 2, X_3 = 0, X_4 = 2, X_5 = 0, X_6 = 0) = \binom{5}{1, 2, 0, 2, 0, 0}(0.662)^1(0.052)^2(0.213)^0(0.018)^2(0.009)^0(0.046)^0 = 0.0000174. \]

2.14 $\dfrac{(r - k)(n - k)}{(k + 1)(N - r - n + k + 1)}$.

2.15 (a) $X \sim \text{Binom}(8, 0.1) \Longrightarrow P(X = 2) = 0.1488$; $Y \sim \text{Poisson}(0.8) \Longrightarrow P(Y = 2) = 0.143785$.
(b) $X \sim \text{Binom}(10, 0.95) \Longrightarrow P(X = 9) = 0.315125$; $Y \sim \text{Poisson}(9.5) \Longrightarrow P(Y = 9) = 0.130003$.

2.16 (a) $P(X = 1) = \text{binompdf}(50, 0.01, 1) = 0.305559$.
(b) $P(X \ge 1) = 0.394994$.
(c) $P(X \ge 2) = 0.0894353$.

2.19 $P(1/2 < X \le 3/4) = 5/16$. The pdf of $X$ is $f(x) = 2x$, $0 < x < 1$, and 0 otherwise.

2.20 (a) The largest area possible is $\frac{1}{2}$: $\{(x, y) \mid x \in [2, 3],\ y \in [1, 3/2]\}$.
(b) $F(a) = P(A \le a) = P(h \le 2a) = 2a - 1$, $\frac{1}{2} \le a \le 1$.
(c) $f(a) = \begin{cases} 2, & \frac{1}{2} \le a \le 1 \\ 0, & \text{otherwise.} \end{cases}$

2.48 $c = \frac{2}{49}$. With $Y$ = total amount the insurance company has to pay out, $EY = \frac{71}{490}$.

2.49 $EX = 2$; $E[X(X - 1)] = 3$.

2.51 $\sigma^2 = 22.58845$.

2.52 $f_X(x) = \begin{cases} x + \frac{1}{2}, & 0 \le x \le 1 \\ 0, & \text{otherwise,} \end{cases} \qquad f_Y(y) = \begin{cases} y + \frac{1}{2}, & 0 \le y \le 1 \\ 0, & \text{otherwise.} \end{cases}$

2.56 $P(X = 1) = 0.26695$.
A.3 ANSWERS TO CHAPTER 3 PROBLEMS
3.1 (a) 25 samples of size 2.

3.5 (a) The mean number of defects in a sample of 2 is $2 \cdot \frac{3}{5}$. The SE for the total number of defects in a sample of size 2 is $\sqrt{2}\,(1.491)(0.8944) = 1.886$.
(b) $\frac{1}{15}$.

3.6 (a) $E(S_{25}) = 2637.5$; $\text{Var(Sum)} = 3025$; $\text{SD(Sum)} = 55$. We ignore the correction factor since $\sqrt{(1000 - 25)/999} = 0.9879$.
(b) $E(\bar{X}) = 105.5$; $SD(\bar{X}) = \sigma/\sqrt{25} = 11/5 = 2.2$.
(c) $P(98 \le \bar{X} \le 104) = 0.24735$. Therefore, if $N$ = number of sample means in this range, $N \sim \text{Binomial}(150, 0.24735)$ and $E(N) = 37.102$.
(d) $P(\bar{X} < 97.5) = 0.0000138$, so the expected number of sample means less than 97.5 will be approximately 0.

3.7 $P(X_1 + \cdots + X_{50} < 10) \approx 0.1444$.

3.8 (a) $P(4.4 < \bar{X} < 5.2) = 0.6898$.
(b) 85th percentile $= 5.345$.
(c) $P(\bar{X} > 7) = 0.022$.

3.9 $P(X \le 8) \approx \text{normcdf}(0, 8.5, 16, 3.66) = 0.0202$.

3.10 (a) $A \sim N(25, 10/\sqrt{50})$, $B \sim N(25, 10/\sqrt{100})$.
(b) $P(19 \le A \le 26) = 0.7602$ and $P(19 \le B \le 26) = 0.8413$.

3.11 $n = 25$.
3.12 $P(|\bar{X} - 5| \le 0.5) \approx \text{normalcdf}(4.5, 5.5, 5, 1) = 0.3829$.

3.13 (a) $P(X \le n/2) \approx \text{normalcdf}\bigl(0, np + 0.5, np, \sqrt{np(1 - p)}\bigr)$;
$P(X = n/2) \approx \text{normalcdf}\bigl(np - 0.5, np + 0.5, np, \sqrt{np(1 - p)}\bigr)$.
Here's the table of exact and approximate values of $P(X \le n/2)$ and $P(X = n/2)$ for various numbers of tosses $n$.

3.20 (a) $EY = 8$ and $\text{Var}(Y) = 16$.
(b) $P(Y > 15.507) = 0.05$; $P(Y < 3.489) = 0.1$; $P(Y < 13.361) = 0.9$; $P(Y > 2.733) = 0.95$.
(c) $P(3.489 < Y < 13.361) = 0.8$.
(d) $P(\chi^2(1) < 0.0855) = 0.23$, and $P(Z^2 < b) = P(-\sqrt{b} < Z < \sqrt{b}) = 0.23 \Longrightarrow \sqrt{b} = 0.29237$. Then $b = 0.0855$.
3.21 $P(\hat{P} \ge 0.5325) \approx \text{normalcdf}(0.5325, 1, 0.51, 0.02499) = 0.1839$.

3.22 $P(|t_9| > 3.1622) = 0.00575$.

3.23 With replacement: $P(t_{79} \ge 5.96) = 3.34 \times 10^{-8}$; i.e., virtually no chance. Without replacement: $P(t_{79} \ge 6.95) \approx 0$.

3.24 $P(X \ge 1200) = 0.315427$.

3.25 $P(0.45 \le \hat{P} \le 0.48) = 0.654612$.

3.26 Sample 1: $\hat{p} = 0.2$, $SE = 0.0632$. Sample 2: $\hat{p} = 0.25$, $SE = 0.0684$. Sample 3: $\hat{p} = 0.325$, $SE = 0.0740$.
3.27 $E\hat{\theta} = \theta$ and $\operatorname{Var}(\hat{\theta}) = \frac{1}{4}\,2\sigma^2 = \frac{\sigma^2}{2}$.
3.28 $P(|\bar{X}| \le 0.01) = 0.52050$.

3.29 $n \ge 2075$.

3.30 (a) $\text{tcdf}(-1.923, 1.923, 99) = 0.94265$. (b) $n \ge 152$.

3.31 $P(t_{48} \ge 2.25/(6/7)) = 0.00579$.

3.32 (a) $P(\bar{X}_{125} - \bar{Y}_{125} \le 160) = 0.9772$. (b) $P(\bar{X}_{125} - \bar{Y}_{125} \ge 250) = 0.0062$.

3.35 0.071349.
A.4 ANSWERS TO CHAPTER 4 PROBLEMS

4.7 The pivot is $T = \frac{\bar{X} - \mu}{s_X/\sqrt{n}}$, and we start with $P(-t(n-1, \alpha/2) \le T \le t(n-1, \alpha/2)) = 1 - \alpha$. Solve the inequality for $\mu$ to get the result.
4.8 (a) $P(X_{\min} \le m \le X_{\max}) = 1 - P(X_{\min} > m) - P(X_{\max} < m)$. Now $P(X_{\min} > m) = P(X_1 > m, \ldots, X_n > m) = P(X > m)^n = (1/2)^n$, and $P(X_{\max} < m) = P(X_1 < m, \ldots, X_n < m) = P(X < m)^n = (1/2)^n$, so $P(X_{\min} \le m \le X_{\max}) = 1 - 2(1/2)^n = 1 - (1/2)^{n-1}$.
(b) By the first part this is $1 - (1/2)^7 = 0.9921875$.

4.10 $n \ge 884$.
4.11 The 95% CI is $(37.874, 41.826)$. The histogram of the data is skewed right because there are some cars with high mpg. A lower 95% CI is $\left(39.85 - 1.729\,\frac{4.221187}{\sqrt{20}},\ \infty\right) = (38.218, \infty)$. We are 95% confident that the mean mpg is at least 38.218 mpg.
4.12 $\left(\dfrac{(n-1)s_X^2}{\chi^2(n-1, \alpha/2)},\ \dfrac{(n-1)s_X^2}{\chi^2(n-1, 1-\alpha/2)}\right) = \left(\dfrac{34 \cdot 4}{51.9659},\ \dfrac{34 \cdot 4}{19.80625}\right) = (2.6171, 6.8665)$.
4.14 The mean difference is 4.2 pounds; the CI is $(-1.721, 10.121)$ with $df = 60.292$. There is not enough evidence to conclude the difference is real.

4.15 The 95% CI is $(1.7134, 2.0866)$ with $df = 284.865$. The 99% CI is $(1.654, 2.146)$. Since 0 is not in either CI, there is evidence that the difference is real.
4.16 The 99% CI for $\sigma$ is $s \pm 3\frac{s}{\sqrt{2n}}$. The percentage error in the SD is $\frac{3s/\sqrt{2n}}{s} = 300\frac{1}{\sqrt{2n}}\%$. If we want this to be no more than 5%, we need $300\frac{1}{\sqrt{2n}} \le 5 \Longrightarrow n \ge 1800$.
4.17 (a) $(4.057, 8.543)$. (b) $n \ge 346$.
A.5 ANSWERS TO CHAPTER 5 PROBLEMS

5.3 (a) The p-value is 0.054. Retain $H_0$. (b) The p-value is 0.121. Retain $H_0$. (c) The p-value is 0.032. Reject $H_0$.

5.5 (a) Reject $H_0$ if $t \ge t(24, 0.01) = 2.492$. (b) Reject $H_0$ if $t \le -t(24, 0.02) = -2.172$. (c) Reject $H_0$ if $t \le -t(24, 0.025) = -2.064$ or $t \ge t(24, 0.025) = 2.064$.
5.7 (a) We have $\alpha = P\left((n-1)S^2/\sigma_0^2 \le \chi^2(n-1, 1-\alpha)\right)$. Therefore, by the definition of Type II error,
\[ \beta(\sigma_1^2) = P\left((n-1)S^2/\sigma_1^2 > (\sigma_0^2/\sigma_1^2)\,\chi^2(n-1, 1-\alpha)\right). \]

Reject $H_0$.
(c) p-value $= 2P(t(55) \ge 3.5395) = 2(0.000412) = 0.000824$.
5.17 (a) Reject $H_0$ if $f = \frac{s_X^2}{s_Y^2} = \frac{6.8}{7.1} = 0.95775 \ge F(10, 9, 0.1) = 2.4163$. Retain $H_0$.
(b) Reject $H_0$ if $f = \frac{s_X^2}{s_Y^2} = \frac{6.8}{7.1} = 0.95775 \le F(10, 9, 0.95) = 0.3311$. Retain $H_0$.
(c) Reject $H_0$ if $f = \frac{s_X^2}{s_Y^2} = \frac{6.8}{7.1} = 0.95775 \ge F(10, 9, 0.005) = 6.4172$ or $f = 0.95775 \le F(10, 9, 0.995) = 0.16757$. Retain $H_0$.
5.18 (a) Since the variances are assumed equal, we compute the pooled variance as $s_p^2 = 10.186$. Reject $H_0$ if
\[ t = \frac{19.1 - 16.3}{\sqrt{10.186}\sqrt{\dfrac{1}{24} + \dfrac{1}{18}}} = 2.8137 \ge t(40, 0.05) = 1.684. \]
Reject $H_0$.
(b) p-value $= P(t(40) \ge 2.8137) = 0.00378$.
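The arithmetic in 5.18(a) can be verified in a few lines; this is a sketch (not from the book), with the summary numbers taken from the answer above:

```python
# Hypothetical sketch: pooled two-sample t statistic of answer 5.18(a).
import math

xbar, ybar = 19.1, 16.3      # sample means
sp2, n, m = 10.186, 24, 18   # pooled variance and sample sizes
t = (xbar - ybar) / (math.sqrt(sp2) * math.sqrt(1 / n + 1 / m))
print(round(t, 3))  # about 2.814
```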
5.19 (a) To test equality of means, the degrees of freedom of the $t$-distribution is given by $\nu = 37$. Reject $H_0$ if
\[ |t| = \left|\frac{3.8 - 3.6}{\sqrt{\dfrac{1.2^2}{20} + \dfrac{1.3^2}{20}}}\right| = 0.50556 \ge t(37, 0.025) = 2.0262. \]
Retain $H_0$.
(b) p-value $= 2P(t(37) \ge 0.50556) = 2(0.30808) = 0.61616$.
5.20 Reject $H_0$ if
\[ |t| = \left|\frac{\bar{x} - \bar{y}}{s_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}\right| = \frac{|\bar{x} - \bar{y}|}{\sqrt{3581.6}\sqrt{\dfrac{1}{9} + \dfrac{1}{14}}} = \frac{|\bar{x} - \bar{y}|}{25.5692} \ge t(21, 0.025) = 2.0796 \]
\[ \Longrightarrow |\bar{x} - \bar{y}| \ge 2.0796(25.5692) = 53.1737. \]
Retain $H_0$.
(b) p-value $= 2P(Z \ge 0.6614) = 2(0.2541) = 0.5082$.
(c) $\beta(-0.05) = 0.45743$.
5.22 (a) Let $X$ and $Y$ denote the sample of at bats last and this season, respectively. Consider the hypothesis $H_0\colon p_X = p_Y$ vs. $H_1\colon p_X > p_Y$ with $\alpha = 0.05$. The pooled proportion is calculated as $p' = \frac{300(0.276) + 235(0.220)}{535} = 0.2514$. Reject $H_0$ if
\[ z = \frac{0.276 - 0.220}{\sqrt{0.2514(1 - 0.2514)}\sqrt{\dfrac{1}{300} + \dfrac{1}{235}}} = 1.4818 \ge z_{0.05} = 1.645. \]
Retain $H_0$.
(b) p-value $= 0.069$.
5.23 (a) The pooled proportion is calculated as $p' = \frac{550(0.61) + 690(0.53)}{1240} = 0.56548$. Reject $H_0$ if
\[ z = \frac{0.61 - 0.53}{\sqrt{0.56548(1 - 0.56548)}\sqrt{\dfrac{1}{550} + \dfrac{1}{690}}} = 2.8234 \ge z_{0.01} = 2.326. \]
Reject $H_0$.
(b) p-value $= P(Z \ge 2.8234) = 0.002376$.
(c) We show how to do this in general for a two-proportion, one-sided test. In our case $H_1\colon p_X - p_Y > 0$. Suppose we have $p_X - p_Y = \vartheta > 0$. Then with
\[ SE = \sqrt{\frac{\bar{X}(1 - \bar{X})}{n_X} + \frac{\bar{Y}(1 - \bar{Y})}{n_Y}}, \]
\[ \beta(\vartheta) = P\left(\bar{X} - \bar{Y} < z_\alpha\sqrt{\bar{P}_0\left(1 - \bar{P}_0\right)\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)}\right) \]
\[ = P\left(\frac{\bar{X} - \bar{Y} - \vartheta}{SE} < \frac{z_\alpha\sqrt{\bar{P}_0\left(1 - \bar{P}_0\right)\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} - \vartheta}{SE}\right) \]
\[ \approx P\left(Z < \frac{z_\alpha\sqrt{\bar{P}_0\left(1 - \bar{P}_0\right)\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} - \vartheta}{SE}\right) \]
\[ \approx P\left(Z < \frac{z_\alpha\sqrt{\bar{p}_0(1 - \bar{p}_0)\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} - \vartheta}{\sqrt{\frac{\bar{p}_X(1 - \bar{p}_X)}{n_X} + \frac{\bar{p}_Y(1 - \bar{p}_Y)}{n_Y}}}\right) \]
(when sample values are substituted for the random variables).

Alternatively, find the critical region for the given $\alpha$ first using $x = \text{invNorm}\left(1 - \alpha, 0, \sqrt{p'(1 - p')(1/n_X + 1/n_Y)}\right)$. Then find the area to the left of $x$ under the normal curve using $\text{normalcdf}(-\infty, x, \vartheta, SE)$.

Given $p_X - p_Y = \vartheta = 0.1$, we have $SE = 0.02817$ and $x = \text{invNorm}(0.99, 0, 0.02833) = 0.06591$. Then $\beta(0.1) = \text{normalcdf}(-\infty, 0.06591, 0.1, 0.02817) = 0.11311$.
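The z statistic of 5.23(a) can be checked directly; a sketch (not from the book):

```python
# Hypothetical sketch: two-proportion z statistic of answer 5.23(a),
# with p1 = 0.61 (n1 = 550) and p2 = 0.53 (n2 = 690).
import math

p1, n1, p2, n2 = 0.61, 550, 0.53, 690
pooled = (n1 * p1 + n2 * p2) / (n1 + n2)   # pooled proportion
se = math.sqrt(pooled * (1 - pooled)) * math.sqrt(1 / n1 + 1 / n2)
z = (p1 - p2) / se
print(round(pooled, 5), round(z, 4))  # 0.56548 and about 2.8234
```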
5.24 (a) $\bar{d} = 0.5$, $s_D^2 = 7.8333$. Reject $H_0$ if
\[ t = \frac{0.5 - 0.0}{\sqrt{7.8333}/\sqrt{10}} = 0.56493 \ge t(9, 0.05) = 1.8331. \]
Retain $H_0$.
(b) p-value $= 0.29296$.
5.25 (a) $\bar{d} = 0.02$, $s_D^2 = 0.0008222$. Reject $H_0$ if
\[ t = \frac{0.02 - 0.0}{\sqrt{0.0008222}/\sqrt{10}} = 2.2057 \ge t(9, 0.025) = 2.2622 \quad\text{or}\quad 2.2057 \le -2.2622. \]
Retain $H_0$.
(b) p-value $= 2(0.027414) = 0.054828$.
5.26 Test $H_0\colon p_{OC} = 0.2785$, $p_{AC} = 0.208$, $\ldots$, $p_{AB} = 0.0049$ vs. $H_1$: the proportions are not the same. Expected frequencies:

1: $60\,e^{-2}\,2^1/1! = 16.24$
2: $60\,e^{-2}\,2^2/2! = 16.24$
3: $60\,e^{-2}\,2^3/3! = 10.827$
$\ge 4$: $60\left(1 - \sum_{k=0}^{3} e^{-2}2^k/k!\right) = 8.5726$

1: $576\,e^{-0.922}\,0.922^1/1! = 211.219$
2: $576\,e^{-0.922}\,0.922^2/2! = 97.372$
3: $576\,e^{-0.922}\,0.922^3/3! = 29.926$
$\ge 4$: $576\left(1 - \sum_{k=0}^{3} e^{-0.922}0.922^k/k!\right) = 8.394$

Value of $D_4 \sim \chi^2(3)$: $d_4 = 1.075$. Reject $H_0$ if $d_4 = 1.075 \ge \chi^2(3, 0.05) = 7.8147$. Retain $H_0$.
(b) Now suppose $\lambda = 0.8$. Expected frequencies:

1: $576\,e^{-0.8}\,0.8^1/1! = 207.051$
2: $576\,e^{-0.8}\,0.8^2/2! = 82.820$
3: $576\,e^{-0.8}\,0.8^3/3! = 22.085$
$\ge 4$: $576\left(1 - \sum_{k=0}^{3} e^{-0.8}0.8^k/k!\right) = 5.230$
5.31 $H_0$: the data follow an exponential distribution with mean 40 seconds vs. $H_1$: the data do not follow an exponential distribution with mean 40 seconds. Probabilities of an interarrival time falling in each of the intervals are listed in the following tables.

Number of casts until a strike: expected frequency
1: $(50)(0.26178) = 13.089$
2: $(50)(1 - 0.26178)(0.26178) = 9.6626$
3: $(50)(1 - 0.26178)^2(0.26178) = 7.1331$
4: $(50)(1 - 0.26178)^3(0.26178) = 5.2658$
$\ge 5$: $(50)\sum_{i=4}^{\infty}(1 - 0.26178)^i(0.26178) = 14.8495$

We combine the cells for $i \ge 5$ strikes so that all the expected frequencies are at least 5. Value of $D_4 \sim \chi^2(3)$: $d_4 = 9.277$. Reject $H_0$ if $d_4 = 9.277 \ge \chi^2(3, 0.01) = 11.345$. Retain $H_0$. However, if $\alpha = 0.05$, the null hypothesis can be rejected since $d_4 = 9.277 \ge \chi^2(3, 0.05) = 7.8147$. At that level, it is likely that the data do not follow a geometric distribution.
5.33 Test $H_0$: injury due to criminal violence and choice of profession are independent vs. $H_1$: injury due to criminal violence and choice of profession are not independent.
Estimated row and column probabilities:
\[ \hat{p}_V = \frac{318}{490} = 0.64898, \quad \hat{p}_O = \frac{172}{490} = 0.35102, \quad \hat{p}_P = \frac{174}{490} = 0.3551, \]
\[ \hat{p}_C = \frac{116}{490} = 0.23673, \quad \hat{p}_T = \frac{99}{490} = 0.20204, \quad \hat{p}_S = \frac{101}{490} = 0.20612. \]
Estimated frequencies:
\[ 490\hat{p}_V\hat{p}_P = 112.922, \quad 490\hat{p}_V\hat{p}_C = 75.280, \quad 490\hat{p}_V\hat{p}_T = 64.249, \quad 490\hat{p}_V\hat{p}_S = 65.546, \]
\[ 490\hat{p}_O\hat{p}_P = 61.077, \quad 490\hat{p}_O\hat{p}_C = 40.718, \quad 490\hat{p}_O\hat{p}_T = 34.751, \quad 490\hat{p}_O\hat{p}_S = 35.453. \]
Value of $D_7 \sim \chi^2(3)$: $d_7 = 65.526$. Reject $H_0$ if $d_7 = 65.526 \ge \chi^2(3, 0.01) = 11.345$. Reject $H_0$.
5.34 Test $H_0$: selectivity and sensitivity are independent vs. $H_1$: selectivity and sensitivity are not independent.
Estimated row and column probabilities:
\[ \hat{p}_{LS} = \frac{30}{170} = 0.17647, \quad \hat{p}_{AS} = \frac{112}{170} = 0.65882, \quad \hat{p}_{HS} = \frac{28}{170} = 0.16471, \]
\[ \hat{p}_{LN} = \frac{52}{170} = 0.30588, \quad \hat{p}_{AN} = \frac{88}{170} = 0.51765, \quad \hat{p}_{HN} = \frac{30}{170} = 0.17647. \]
Estimated frequencies (note that all estimated frequencies are at least 5 except the frequency of high selectivity and high sensitivity; however, it is acceptably close to 5):
\[ 170\hat{p}_{LS}\hat{p}_{LN} = 9.176, \quad 170\hat{p}_{LS}\hat{p}_{AN} = 15.529, \quad 170\hat{p}_{LS}\hat{p}_{HN} = 5.294, \]
\[ 170\hat{p}_{AS}\hat{p}_{LN} = 34.258, \quad 170\hat{p}_{AS}\hat{p}_{AN} = 57.976, \quad 170\hat{p}_{AS}\hat{p}_{HN} = 19.765, \]
\[ 170\hat{p}_{HS}\hat{p}_{LN} = 8.565, \quad 170\hat{p}_{HS}\hat{p}_{AN} = 14.495, \quad 170\hat{p}_{HS}\hat{p}_{HN} = 4.9411. \]
             O    A    B   AB  Totals
Positive Rh  344  207  448  92  1091
Negative Rh   23   12   23  11    69
Totals       367  219  471 103  1150
5.45 $H_0\colon \mu_{5Y} = \mu_{10Y} = \mu_{15Y} = \mu_{20Y}$ vs. $H_1$: the means are different.
(a) $\bar{x} = 50.35$; $\bar{x}_{5Y} = 58.4$, $\bar{x}_{10Y} = 57.4$, $\bar{x}_{15Y} = 43.6$, $\bar{x}_{20Y} = 42.0$.
(b) $SSE = 201.6$, $SSTR = 1149.0$, $SST = SSE + SSTR = 1350.6$.
(c) $MSTR = \frac{SSTR}{3} = \frac{1149.0}{3} = 383.0$, $MSE = \frac{SSE}{16} = \frac{201.6}{16} = 12.6$.
(d) $f = 30.397$.
(e) p-value $= P(F(3, 16) \ge 30.397) = 7.6621 \times 10^{-7}$.
(f)

Source     DF  SS      MSS    F-statistic   p-Value
Treatment   3  1149.0  383.0  f = 30.397    7.6621e-7
Error      16  201.6   12.6
Total      19  1350.6

(g) Reject $H_0$ at the $\alpha = 0.01$ level. Marksmanship skill differs according to how many years served.
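The ANOVA table in (f) can be reassembled from its sums of squares; a sketch (not from the book):

```python
# Hypothetical sketch: one-way ANOVA F statistic of answer 5.45 from the
# sums of squares, with k = 4 groups and n = 20 observations.
sstr, sse = 1149.0, 201.6
df_tr, df_err = 4 - 1, 20 - 4
mstr, mse = sstr / df_tr, sse / df_err
f = mstr / mse
print(mstr, mse, round(f, 3))  # 383.0 12.6 30.397
```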
A.6 ANSWERS TO CHAPTER 6 PROBLEMS
6.2 $(x - \bar{x}) = r\frac{s_X}{s_Y}(y - \bar{y})$; $\hat{b} = r\frac{s_X}{s_Y}$, $\hat{a} = \bar{x} - \hat{b}\bar{y}$. Finally, $f(\hat{a}, \hat{b}) = (n-1)(1 - r^2)s_X^2$.

6.3 We have $b = \rho\frac{\sigma_Y}{\sigma_X}$, $d = \rho\frac{\sigma_X}{\sigma_Y} \Longrightarrow bd = \rho^2$. Then using the two lines given, $\rho = \sqrt{0.476(1.036)} = 0.7022$.
6.4 (a) $Y_1 = 8.6 + \varepsilon_1$, $Y_2 = 11.1 + \varepsilon_2 \Longrightarrow Y_1 - Y_2 \sim N(-2.5,\ 1.7\sqrt{2})$.
(b) $P(Y_1 > Y_2) = \text{normalcdf}(0, \infty, -2.5, 1.7\sqrt{2}) = 0.1492$.
6.5 Since $\bar{x} = 4.45$, $s_x = 2.167$, $\bar{y} = 5.375$, $s_y = 1.2104$, $r = 0.7907$, we have
\[ (y - 5.375) = 0.7907\,\frac{1.2104}{2.167}(x - 4.45) = 0.4416(x - 4.45) \]
and
\[ (x - 4.45) = 0.7907\,\frac{2.167}{1.2104}(y - 5.375) = 1.4156(y - 5.375). \]
The minimum of $f$ with dependent variable $x$ is $7(1 - 0.7907^2)2.167^2 = 12.32$, and with dependent variable $y$ it is $7(1 - 0.7907^2)1.2104^2 = 3.844$.
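Both lines in 6.5 follow from the summary statistics alone; a sketch (not from the book):

```python
# Hypothetical sketch: the two regression slopes of answer 6.5 from the
# summary statistics.
xbar, sx, ybar, sy, r = 4.45, 2.167, 5.375, 1.2104, 0.7907

slope_yx = r * sy / sx   # y dependent: y - ybar = slope_yx (x - xbar)
slope_xy = r * sx / sy   # x dependent: x - xbar = slope_xy (y - ybar)
print(round(slope_yx, 4), round(slope_xy, 4))  # about 0.4417 1.4156
```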
6.6 (a) We have $VSAT - 610 = 0.73\,\frac{120}{110}(MSAT - 570)$. Therefore, if $MSAT = 690$, $VSAT = 705.56$.
(b) $MSAT = 570 + 0.73(110/120)(700 - 610) = 630.23$.
(c) The 77th percentile corresponds to $(MSAT - 570)/110 = 0.738846$. By (a), the predicted VSAT z-score is $(VSAT - 610)/120 = 0.73(0.738846) = 0.53936$, so she will score in about the $\text{normalcdf}(-\infty, 0.53936) = 0.705$, i.e., the 70th, percentile.
(d) $S_e = \sqrt{1 - r^2}\,SD_{VSAT} = 82.0136$.
6.7 (a) We have the model $S \sim N(a + bB, \sigma)$ and $s_e = \sqrt{31/30}\,SD_S\sqrt{1 - 0.26^2} = 2.159$. So, $\text{normalcdf}(68, \infty, 62, 2.159) = 0.00273$.
(b) The regression equation is $S = 45.7933 + 0.2383\,B$. If $B = 72$, then $S = 62.9509$, and then $S \sim N(62.9509, s_e)$, so $\text{normalcdf}(68, \infty, 62.9509, 2.159) = 0.00967$.
6.8 (a) 16.8. (b) 15.2.
6.9 (a) 19% (b) 63.20% (c) 52.50% (d) 50%
6.10 Consider $f(b) = E(Y - bX)^2$. We have $f'(b) = -2E[(Y - bX)X] = 0 \Longrightarrow b = E(XY)/E(X^2)$. Also $f''(b) = 2EX^2 > 0$. Using observations we have $\hat{b} = \dfrac{\sum x_i y_i}{\sum x_i^2}$.
6.11 In matrix form,
\[ \begin{pmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{pmatrix}. \]
For the given data, the equation of the least squares quadratic is $y = 0.57 + x + 1.107x^2$.
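The 3×3 normal equations of 6.11 are easy to solve numerically; a sketch (not from the book, assuming NumPy is available):

```python
# Hypothetical sketch: solve the normal equations of answer 6.11 for the
# least squares quadratic through the seven given data points.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = np.array([7.5, 3, 0.5, 1, 3, 6, 14], dtype=float)

A = np.vander(x, 3, increasing=True)       # columns 1, x, x^2
beta = np.linalg.solve(A.T @ A, A.T @ y)   # coefficients (a, b, c)
print(np.round(beta, 3))  # approximately [0.571 1. 1.107]
```

The exact solution is $a = 4/7$, $b = 1$, $c = 31/28$, which rounds to the book's $y = 0.57 + x + 1.107x^2$.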
6.12 $E(S^2/\sigma^2) = n - 2$, so an unbiased estimator for $\sigma^2$ using the $s^2$ of the problem is $\frac{n}{n-2}\,s^2$.
6.14 Let $x$ denote the gas CPI and $y$ the food CPI. The data give $\bar{x} = 4.836$, $\bar{y} = 2.182$, $s_X = 26.95$, $s_Y = 1.88$. The regression equation is $\text{food CPI} = 2.323 - 0.0293\,\text{gas CPI}$. The correlation coefficient is $r = -0.42$.
6.17 Here is the ANOVA table:

Source of Variation  DF  SS      MS            F-statistic        p-Value
Regression            1  570.04  MSR = 570.04  F(1, 18) = 77.05   0.0
Residuals            18  133.16  MSE = 7.4
Total                19  703.2

The last column gives the p-value for the observed $F$ value of 77.05. The null is rejected and there is strong evidence that the slope of the regression line is not zero. Since $r\frac{s_y}{s_x}$ is the slope, we can use this to conclude that there is strong evidence the correlation is not zero.
6.18 We get the regression line and correlation coefficient $y = -3.10091 + 2.02656x$, $r = 0.99641$. We want to test $H_0\colon \beta = 0$, $H_1\colon \beta \ne 0$. We get the test statistic $t = 33.2917$. With $n - 2 = 8$ degrees of freedom, the p-value is essentially 0.
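The fit and slope test in 6.18 can be reproduced from the data of the problem; a sketch (not from the book):

```python
# Hypothetical sketch: regression line and slope test of answer 6.18.
import math

x = [-1, 0, 2, -2, 5, 6, 8, 11, 12, -3]
y = [-5, -4, 2, -7, 6, 9, 13, 21, 20, -9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx                              # slope
a = ybar - b * xbar                        # intercept
r = sxy / math.sqrt(sxx * syy)             # correlation
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(b, 5), round(a, 5), round(r, 5), round(t, 2))
```

This prints the slope, intercept, correlation, and t statistic, matching the book's $2.02656$, $-3.10091$, $0.99641$, and $33.29$ up to rounding.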
6.24 We have $\hat{y}(750) = 14.95$ and the 95% PI $14.95 \pm 1.492$.

6.25 We have $S_{yy} = 2243266 - \frac{5712^2}{17} = 324034$ and
Authors’ Biographies
EMMANUEL BARRON
Professor Barron received his B.S. (1970) in Mathematics from the University of Illinois at
Chicago and his M.S. (1972) and Ph.D. (1974) from Northwestern University in Mathematics
specializing in partial differential equations and differential games. After receiving his Ph.D.,
Dr. Barron was an Assistant Professor at Georgia Tech, and then became a Member of Tech-
nical Staff at Bell Laboratories. In 1980 he joined the Department of Mathematical Sciences
at Loyola University Chicago, where he is a Professor of Mathematics and Statistics. Professor
Barron has published over 70 research papers, and he has also authored the book Game Theory:
An Introduction, 2nd Edition in 2013. Professor Barron has received continuous research funding from the National Science Foundation and the Air Force Office of Scientific Research.
Dr. Barron has taught Probability and Statistics to undergraduates and graduate students since
1974.
Index