Transition To MATH503
David Caraballo
1 Notational assumptions
Throughout these notes, unless otherwise stated, $X$ will be a random variable on a probability space $(\Omega, \mathcal{F}, P)$, having a second moment (so that its variance is defined and finite).
As usual, we will let $\mu = E(X)$ and $\sigma^2 = VAR(X)$, so that $\sigma = STDDEV(X)$.
We will consider $X$ to be a population random variable, in that we will use it to model a population of interest.
2 Population parameters
Definition 1 By a population parameter (or simply a parameter) we mean a random variable $h(X)$ which is a function of $X$ (so that it is computed by using our population random variable $X$).
Example 2 The population mean $\mu = E(X)$, the population variance $\sigma^2 = VAR(X)$, and the population standard deviation $\sigma = STDDEV(X)$ are population parameters. They are defined and finite throughout this section since we assumed above that $X$ has a second moment.
Example 3 For each positive integer $k$ for which $X^k$ has finite expectation, the $k$th moment of $X$, $E\left(X^k\right)$, is a population parameter.
Example 4 For each positive integer $k$ for which $X^k$ has finite expectation, the $k$th central moment of $X$, $E\left((X - \mu)^k\right)$, is a population parameter.
3 Random samples

Definition 5 By a random sample of size $n$ from a population having the distribution of $X$ we mean a collection $X_1, X_2, \ldots, X_n$ of independent random variables, each having the same distribution as $X$.

Terminology Many authors use the terminology that $X_1, X_2, \ldots, X_n$ are observations of a random sample of size $n$ taken from a population having the distribution of $X$.
Definition 6 Whenever $X_1, X_2, \ldots, X_n$ is a random sample from a population having the distribution of $X$, we define the sample sum (or sample total) $T_n$, the sample mean $\overline{X}_n$, and (if $n > 1$) the sample variance $S_n^2$ as follows:
$$T_n = X_1 + \cdots + X_n,$$
$$\overline{X}_n = \frac{1}{n}\, T_n = \frac{X_1 + \cdots + X_n}{n},$$
$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \overline{X}_n \right)^2.$$
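As a quick illustration, these three statistics are straightforward to compute from a list of observed values (a minimal sketch in Python; the function name is our own, not standard notation):

    def sample_statistics(xs):
        # sample sum T_n, sample mean Xbar_n, and (if n > 1) sample variance S_n^2
        n = len(xs)
        t_n = sum(xs)
        xbar_n = t_n / n
        s2_n = sum((x - xbar_n) ** 2 for x in xs) / (n - 1) if n > 1 else None
        return t_n, xbar_n, s2_n

    print(sample_statistics([1, 5, 6]))   # (12, 4.0, 7.0)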
4 Sample statistics

Definition 7 By a sample statistic (or simply a statistic) we mean a random variable $s(X_1, X_2, \ldots, X_n)$ which is a function of the variables in a random sample (so that it is computed by using one or more elements of the random sample).
Formally, each statistic $s(X_1, X_2, \ldots, X_n)$ is a random variable on $(\Omega, \mathcal{F}, P)$ which is itself a function of the random variables $X_1, X_2, \ldots, X_n$ on $(\Omega, \mathcal{F}, P)$. As such, it assigns a numerical value, $s(X_1, X_2, \ldots, X_n)(\omega)$, to each $\omega \in \Omega$.
Example 8 For each positive integer $n$, the sample mean $\overline{X}_n$ is a sample statistic.

Example 9 For each integer $n > 1$, $S_n^2$ is a sample statistic.

Example 10 Each of $\min(X_1, X_2, \ldots, X_n)$ and $\max(X_1, X_2, \ldots, X_n)$ is a sample statistic.
Example 11 For each integer $k$ such that $1 \le k \le n$, the $k$th order statistic $X_{(k)}$ is defined as follows:
$$X_{(k)}(\omega) = \text{the } k\text{th smallest number among } \{X_1(\omega), X_2(\omega), \ldots, X_n(\omega)\}.$$
For each $k$ as above, $X_{(k)}$ is a sample statistic. Using this notation, we can rewrite the minimum and maximum as follows:
$$\min(X_1, X_2, \ldots, X_n) = X_{(1)}, \qquad \max(X_1, X_2, \ldots, X_n) = X_{(n)}.$$
Example 12 The sample range, $X_{(n)} - X_{(1)}$, is a sample statistic.
Example 13 There are many other sample statistics, such as the lower quartile $Q_1$, the upper quartile $Q_3$, the interquartile range (IQR) $Q_3 - Q_1$, the median, percentiles, and so on.
5 Unbiased estimators
Definition 14 A sample statistic $s(X_1, X_2, \ldots, X_n)$ is called an unbiased estimator for a population parameter $\theta$ provided
$$E\left(s(X_1, X_2, \ldots, X_n)\right) = \theta.$$
Unbiased estimators are useful because they correctly estimate a given population parameter “on average.” Generally
speaking, among unbiased estimators for a given population parameter, it is desirable to have as small a variance as possible,
so as to increase the probability that the estimator will take a value close to its mean (which, since the estimator is unbiased,
is the true value of the population parameter of interest).
Theorem 15 For each positive integer $n$, the sample mean $\overline{X}_n$ is an unbiased estimator for the population mean $\mu$. I.e.,
$$E\left(\overline{X}_n\right) = \mu \quad \text{for each positive integer } n.$$
Proof: By linearity of expectation,
$$E\left(\overline{X}_n\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}(n\mu) = \mu.$$
q.e.d.
Theorem 16 The sample variance $S_n^2$ is an unbiased estimator for the population variance $\sigma^2$. I.e.,
$$E\left(S_n^2\right) = \sigma^2 \quad \text{for each integer } n > 1.$$
Remark 17 The $n - 1$ term in the denominator of the formula for $S_n^2$ is there precisely so that $S_n^2$ will be an unbiased estimator for $\sigma^2$.
Proof: Suppose that $n$ is an integer greater than one. We will begin by establishing three formulas, after which the proof will be easy to complete.
$$E\left(X_i^2\right) = \sigma^2 + \mu^2 \quad \text{for each } i = 1, 2, \ldots, n, \tag{1}$$
$$E\left(X_i \overline{X}_n\right) = \mu^2 + \sigma^2/n \quad \text{for each } i = 1, 2, \ldots, n, \tag{2}$$
$$E\left(\overline{X}_n^2\right) = \mu^2 + \sigma^2/n. \tag{3}$$
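For completeness, here is one standard verification of (1), (2), and (3), using only that the $X_i$ are independent, each with mean $\mu$ and variance $\sigma^2$. Formula (1) is just $E\left(X_i^2\right) = VAR(X_i) + \left(E(X_i)\right)^2$. For (2), independence gives $E(X_i X_j) = E(X_i)\,E(X_j) = \mu^2$ whenever $j \ne i$, so
$$E\left(X_i \overline{X}_n\right) = \frac{1}{n}\sum_{j=1}^{n} E(X_i X_j) = \frac{1}{n}\left[\left(\sigma^2 + \mu^2\right) + (n-1)\mu^2\right] = \mu^2 + \sigma^2/n.$$
For (3), since $VAR\left(\overline{X}_n\right) = \sigma^2/n$ and $E\left(\overline{X}_n\right) = \mu$, we get $E\left(\overline{X}_n^2\right) = \sigma^2/n + \mu^2$.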
We can now complete the proof.
$$\begin{aligned}
E\left(S_n^2\right) &= E\left(\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \overline{X}_n\right)^2\right) \\
&= \frac{1}{n-1}\,E\left(\sum_{i=1}^{n}\left(X_i - \overline{X}_n\right)^2\right) \\
&= \frac{1}{n-1}\sum_{i=1}^{n} E\left(\left(X_i - \overline{X}_n\right)^2\right) \\
&= \frac{1}{n-1}\sum_{i=1}^{n} E\left(X_i^2 - 2X_i\overline{X}_n + \overline{X}_n^2\right) \\
&= \frac{1}{n-1}\sum_{i=1}^{n}\left[E\left(X_i^2\right) - 2\,E\left(X_i\overline{X}_n\right) + E\left(\overline{X}_n^2\right)\right].
\end{aligned}$$
We now use all three of the formulas (1), (2), and (3) to rewrite the expression in brackets. We get
$$\begin{aligned}
E\left(S_n^2\right) &= \frac{1}{n-1}\sum_{i=1}^{n}\left[\left(\sigma^2 + \mu^2\right) - 2\left(\mu^2 + \sigma^2/n\right) + \left(\mu^2 + \sigma^2/n\right)\right] \\
&= \frac{1}{n-1}\sum_{i=1}^{n}\frac{n-1}{n}\,\sigma^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\sigma^2 = \sigma^2.
\end{aligned}$$
q.e.d.
Example 18 To make the preceding results more concrete, it is helpful to consider a specific example. Suppose that $X$ is a random variable which takes each of the values 1, 5, and 6 with probability $1/3$. $X$ here is modeling random selection from the set $\{1, 5, 6\}$. $X$ is a simple random variable, so it has moments and central moments of all orders. We calculate
$$\mu = E(X) = \frac{1}{3}(1) + \frac{1}{3}(5) + \frac{1}{3}(6) = 4,$$
$$\sigma^2 = \frac{1}{3}(1-4)^2 + \frac{1}{3}(5-4)^2 + \frac{1}{3}(6-4)^2 = \frac{14}{3}.$$
Consider random sampling with replacement using sample size $n = 2$. There are $3 \cdot 3 = 9$ possible (distinct) samples of size 2. For each one, we will compute the sample mean and the sample variance. We will then average the sample means and the sample variances (to find $E\left(\overline{X}_2\right)$ and $E\left(S_2^2\right)$). We will see that these averages are precisely $\mu$ and $\sigma^2$. The nine equally likely samples, with their sample means and sample variances, are:
$$\begin{array}{c|ccccccccc}
(x_1, x_2) & (1,1) & (1,5) & (1,6) & (5,1) & (5,5) & (5,6) & (6,1) & (6,5) & (6,6) \\ \hline
\overline{X}_2 & 1 & 3 & 3.5 & 3 & 5 & 5.5 & 3.5 & 5.5 & 6 \\
S_2^2 & 0 & 8 & 12.5 & 8 & 0 & 0.5 & 12.5 & 0.5 & 0
\end{array}$$
Observe that the $\overline{X}_2(\omega)$ values add up to 36 and hence average to 4, which is $\mu$, while the $S_2^2$ values add up to 42 and hence average to $14/3$, which is $\sigma^2$.
In the preceding example, note the importance of sampling with replacement (which is what is needed to have $X_1$ and $X_2$ be independent and identically distributed). If we had considered only the samples of size two without replacement, the $S_2^2$ values would still add up to 42 but would average instead to 7 (there are just 6 distinct pairs without replacement), which is not the value of $\sigma^2$.
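The bookkeeping in this example is easy to check by brute force; the short script below (our own sketch, not part of the notes) enumerates both cases:

    from itertools import product
    from fractions import Fraction

    values = [1, 5, 6]

    def s2(x1, x2):
        # sample variance for n = 2 (denominator n - 1 = 1)
        xbar = Fraction(x1 + x2, 2)
        return (x1 - xbar) ** 2 + (x2 - xbar) ** 2

    with_repl = list(product(values, repeat=2))                # 9 samples
    without_repl = [(a, b) for (a, b) in with_repl if a != b]  # 6 samples

    print(sum(Fraction(a + b, 2) for a, b in with_repl) / 9)   # 4 (= mu)
    print(sum(s2(a, b) for a, b in with_repl) / 9)             # 14/3 (= sigma^2)
    print(sum(s2(a, b) for a, b in without_repl) / 6)          # 7 (biased)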
6 Consistent estimators and the Weak and Strong Laws of Large Numbers
Definition 19 A sample statistic $s(X_1, X_2, \ldots, X_n)$ is called a consistent estimator for a population parameter $\theta$ provided, for every $\epsilon > 0$,
$$\lim_{n \to \infty} P\left(\left\{\left|s(X_1, X_2, \ldots, X_n) - \theta\right| < \epsilon\right\}\right) = 1.$$
In other words, with probability which converges to 1 as $n \to \infty$, a consistent estimator $s(X_1, X_2, \ldots, X_n)$ with sample size $n$ is as close as we wish (within $\epsilon$, for any $\epsilon$ as small as we wish) to the parameter that it is estimating.
Whereas bias is about the behavior of $s(X_1, X_2, \ldots, X_n)$ for each $n$ for which it is defined, consistency is about the long-term behavior (as $n \to \infty$) of the infinite sequence
$$s(X_1),\; s(X_1, X_2),\; s(X_1, X_2, X_3),\; \ldots$$
of estimators.
Unbiased estimators satisfy a given useful property for each fixed $n$ for which the estimators are defined, but consistent estimators satisfy a different, arguably more useful approximation property in the limit as $n \to \infty$.
With an unbiased estimator, our estimator is correct “on average,” but it might still be incorrect with very high probability for each $n$, no matter how large.
By contrast, consistent estimators do not need to be correct “on average” for each $n$ but do need to be approximately correct (within $\epsilon$, where $\epsilon$ can be taken as small as we wish) with probability that, for each $\epsilon > 0$, converges to 1 as $n \to \infty$. Provided we have the ability to take large random samples, in many respects consistency is a more useful property than unbiasedness.
We have already seen that $\overline{X}_n$ and $S_n^2$ are unbiased estimators for $\mu$ and $\sigma^2$, respectively. It turns out that they are consistent estimators as well.
Theorem 20 (Weak Law of Large Numbers) The sample mean $\overline{X}_n$ is a consistent estimator for the population mean $\mu$. I.e., for every $\epsilon > 0$ we have
$$\lim_{n \to \infty} P\left(\left\{\left|\overline{X}_n - \mu\right| < \epsilon\right\}\right) = 1.$$
The Weak Law of Large Numbers is often stated more succinctly as “$\overline{X}_n \to \mu$ in probability as $n \to \infty$.”
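A small simulation makes the Weak Law visible (an illustrative sketch of our own; the population is the one from Example 18, and the choices of $\epsilon$ and $n$ are arbitrary):

    import random

    random.seed(0)
    mu, eps, trials = 4.0, 0.5, 10_000   # population mean of {1, 5, 6}; tolerance

    for n in (10, 100, 1000):
        hits = 0
        for _ in range(trials):
            sample = [random.choice((1, 5, 6)) for _ in range(n)]
            if abs(sum(sample) / n - mu) < eps:
                hits += 1
        print(n, hits / trials)   # estimates P(|Xbar_n - mu| < eps); climbs toward 1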
Theorem 21 The sample variance $S_n^2$ is a consistent estimator for the population variance $\sigma^2$. I.e., for every $\epsilon > 0$ we have
$$\lim_{n \to \infty} P\left(\left\{\left|S_n^2 - \sigma^2\right| < \epsilon\right\}\right) = 1.$$
Warning “Convergence in probability” (defined above) and “convergence with probability 1” mean entirely different things. The latter is another name for “strong convergence.”
The Weak Law of Large Numbers has the following improvement, which is called the Strong Law of Large Numbers since the form of convergence involved is called strong convergence, almost sure convergence, or convergence with probability 1 (which, as noted above, is distinct from convergence in probability). In general, strong convergence implies convergence in probability, but not vice versa.

Theorem 22 (Strong Law of Large Numbers) The sample mean $\overline{X}_n$ converges to the population mean $\mu$ almost surely. I.e.,
$$P\left(\left\{\omega \in \Omega : \lim_{n \to \infty} \overline{X}_n(\omega) = \mu\right\}\right) = 1.$$
The Strong Law of Large Numbers is often stated as “$\overline{X}_n \to \mu$ almost surely as $n \to \infty$.” The term “almost surely” is often abbreviated as “a.s.,” and many authors choose to put this abbreviation above the arrow itself rather than to the right of it.
Some authors restate the Strong Law of Large Numbers by saying that $\overline{X}_n$ is a strongly consistent estimator for $\mu$, the term “strongly” referring to the fact that the convergence is strong convergence rather than convergence in probability (which is implied by strong convergence).
7 Sampling from normal populations
When our random sample is taken from a normal population – formally, when our population random variable $X$ is normally distributed – there is much more that we can say about $\overline{X}_n$ and $S_n^2$, even when $n$ is small.
Theorem 23 Suppose that $X \sim N\left(\mu, \sigma^2\right)$. Then:
(1) For each positive integer $n$, $\overline{X}_n$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$. I.e.,
$$\overline{X}_n \sim N\left(\mu, \left(\sigma/\sqrt{n}\right)^2\right) \quad \text{for each positive integer } n.$$
(2) For each integer $n > 1$, $(n-1)S_n^2/\sigma^2$ has the chi-squared distribution with $n - 1$ degrees of freedom.
(3) For each integer $n > 1$, the random variables $\overline{X}_n$ and $S_n^2$ are independent.
It is important to note that none of these conclusions may be assumed unless we suppose that X is normal (counterexamples
show that each of the conclusions can be false when X is not normal).
Conclusion (1) is particularly impressive in that it represents a major improvement over the Central Limit Theorem’s conclusions in this case. When $X$ is normal, each $\overline{X}_n$ is exactly normally distributed no matter how small $n$ is (even if $n = 1$ or $n = 2$).
By contrast, the Central Limit Theorem ensures that $F_{W_n}(a) \to \Phi(a)$ for each real $a$ as $n \to \infty$, where $W_n$ is the standardization of $\overline{X}_n$ ($W_n = \left(\overline{X}_n - \mu\right)/\left(\sigma/\sqrt{n}\right)$) and $\Phi$ is the cdf of a $N\left(0, 1^2\right)$ random variable. Consequently, the Central Limit Theorem allows us to deduce that $\overline{X}_n$ is approximately normally distributed for large enough values of $n$ (which is a weaker conclusion than $\overline{X}_n$ being exactly normally distributed for each $n$).
Of course, when $n = 1$ we have $\overline{X}_1 = X_1$. Since $X_1$ and $X$ are identically distributed, we have $X_1 \sim N\left(\mu, \sigma^2\right)$, and hence $\overline{X}_1 \sim N\left(\mu, \sigma^2\right)$, so that conclusion (1) holds trivially when $n = 1$.
8 Confidence intervals for the mean

Recall that the standard deviation of $\overline{X}_n$, called the standard error of $\overline{X}_n$, is
$$SE\left(\overline{X}_n\right) = \frac{\sigma}{\sqrt{n}}.$$
Suppose that $Y \sim N\left(\mu, \left(\sigma/\sqrt{n}\right)^2\right)$, let
$$E = 1.96\,\frac{\sigma}{\sqrt{n}},$$
and let
$$J = (\mu - E,\; \mu + E).$$
Then
$$P\left(\left\{\overline{X}_n \in J\right\}\right) \text{ is approximately equal to } P(\{Y \in J\}),$$
which equals
$$P(\{Y \in (\mu - E,\; \mu + E)\})$$
$$= P(\{-1.96 < Z < 1.96\}),$$
where $Z \sim N\left(0, 1^2\right)$, since
$$\text{the } z\text{-score of } \mu - 1.96\,\frac{\sigma}{\sqrt{n}} \text{ is } \frac{\left(\mu - 1.96\,\sigma/\sqrt{n}\right) - \mu}{\sigma/\sqrt{n}} = -1.96,$$
and
$$\text{the } z\text{-score of } \mu + 1.96\,\frac{\sigma}{\sqrt{n}} \text{ is } \frac{\left(\mu + 1.96\,\sigma/\sqrt{n}\right) - \mu}{\sigma/\sqrt{n}} = 1.96.$$
We then calculate
$$P(\{-1.96 < Z < 1.96\}) = \frac{1}{\sqrt{2\pi}}\int_{-1.96}^{1.96} e^{-(1/2)z^2}\,dz = 0.9500042097\ldots,$$
so that $P\left(\left\{\overline{X}_n \in J\right\}\right)$ is approximately 0.95. Now define the (random) interval
$$I_{95} = \left(\overline{X}_n - E,\; \overline{X}_n + E\right).$$
Observe that
$$\begin{aligned}
\left\{\overline{X}_n \in J\right\} &= \left\{\omega \in \Omega : \overline{X}_n(\omega) \in J\right\} \\
&= \left\{\omega \in \Omega : \overline{X}_n(\omega) \text{ is within } E \text{ units of } \mu\right\} \\
&= \left\{\omega \in \Omega : \mu \text{ is within } E \text{ units of } \overline{X}_n(\omega)\right\} \\
&= \left\{\omega \in \Omega : \mu \in I_{95}\right\} \\
&= \left\{I_{95} \ni \mu\right\}.
\end{aligned}$$
I wrote it this way (with $I_{95}$ on the left, where the random variables normally go, even though writing it as $\{\mu \in I_{95}\}$ would also have been correct / equivalent) for an important reason: to emphasize the fact that the random variables are part of $I_{95}$ (which depends on $\overline{X}_n$, a random variable), NOT a part of $\mu$, which is a parameter and which is a fixed constant. The reason that this distinction is important will become apparent below (see the last example of this section).
Because the events $\left\{\overline{X}_n \in J\right\}$ and $\{I_{95} \ni \mu\}$ are equal, their probabilities are equal. Thus, $P(\{I_{95} \ni \mu\})$ is approximately 0.95. This means that about 95% of all samples of size $n$ (the percentage converging to $95.00042097\ldots\%$ as $n \to \infty$) will be such that the interval $I_{95}$ (defined using the value of $\overline{X}_n$) contains $\mu$.
Each sample of size $n$ results in an $\overline{X}_n$, which in turn results in an interval $I_{95}$, which may or may not contain $\mu$. About 95% of the samples of size $n$ (the percentage converging to $95.00042097\ldots\%$ as $n \to \infty$) will result in an $I_{95}$ which contains $\mu$ (which means that our $\overline{X}_n$ is within $E$ units of the true mean $\mu$), while about 5% of the samples of size $n$ (the percentage converging to $(100 - 95.00042097\ldots)\%$ as $n \to \infty$) will result in an $I_{95}$ which does not contain $\mu$ (which means that our $\overline{X}_n$ is not within $E$ units of the true mean $\mu$).
For the moment (this is not standard terminology), let us call our random sample “good” if the corresponding $I_{95}$ contains $\mu$, and let us call it “bad” if the corresponding $I_{95}$ does not contain $\mu$. Our $I_{95}$ has been constructed in such a way as to ensure that about 95% of our samples of size $n$ (the percentage converging to $95.00042097\ldots\%$ as $n \to \infty$) will be “good.”
Since we typically have just one random sample of size $n$, assuming all samples of size $n$ were a priori equally likely to be selected (which requires great care in our choice of sampling methodology), there is about a 95% chance that our sample will be a good one, and there is about a 5% chance that our sample will be a bad one.
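The “good sample” percentage is easy to see in simulation (an illustrative sketch of our own; the population parameters and sample size below are arbitrary choices, and we assume a normal population so that $\overline{X}_n$ is exactly normal):

    import random

    random.seed(1)
    mu, sigma, n, trials = 100.0, 7.0, 30, 20_000
    E = 1.96 * sigma / n ** 0.5          # margin of error at the 95% level

    good = 0
    for _ in range(trials):
        xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - E < mu < xbar + E:     # did this sample's I_95 capture mu?
            good += 1
    print(good / trials)                 # close to 0.9500042097...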
We call $I_{95}$ a 95% confidence interval for $\mu$ (even though the percentage, for any given finite $n$, even a large one, may be unequal to 95%), and we call $E$ the margin of error at the 95% confidence level. We say that we may be 95% confident that $\mu$ is in the interval $I_{95}$.
Terminology Regrettably, many authors use the less precise term maximum error instead of “margin of error,” which I prefer greatly. Why is the term “maximum error” somewhat inaccurate? About 5% of the samples of size $n$ will be such that the interval $I_{95}$ does not contain $\mu$, and for each such sample the error $\left|\overline{X}_n - \mu\right|$ is greater than $E$ (potentially much, much greater than $E$), and so it makes little sense and is perhaps quite misleading to call $E$ the maximum error.
Example 24 A random sample of size 80 is selected from a population of size 5000 having unknown mean $\mu$ but having known variance $\sigma^2 = 49$. The sample mean $\overline{X}_{80}$ is computed and equals 104. Find a 95% confidence interval for $\mu$. Find the margin of error.
Solution: We have $n/N = 80/5000 = 0.016 \le 0.05$, and we have $n \ge 30$. Our sample is a random sample. We may therefore use the Central Limit Theorem to deduce that $\overline{X}_{80}$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n = 49/80$.
Our margin of error at the 95% confidence level is
$$E = 1.96\,\frac{\sigma}{\sqrt{n}} = 1.96\,\frac{7}{\sqrt{80}} = 1.533942632\ldots,$$
and so our 95% confidence interval is
$$\begin{aligned}
I_{95} &= \left(\overline{X}_n - E,\; \overline{X}_n + E\right) \\
&= \left(104 - 1.96\,\frac{7}{\sqrt{80}},\; 104 + 1.96\,\frac{7}{\sqrt{80}}\right) \quad \text{(exactly)} \\
&= (102.466057367\ldots,\; 105.533942632\ldots) \quad \text{(approximately)}.
\end{aligned}$$
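For convenience, the arithmetic in Example 24 can be checked with a few lines (a sketch reproducing the computation above):

    import math

    sigma, n, xbar = 7.0, 80, 104.0
    E = 1.96 * sigma / math.sqrt(n)       # margin of error
    print(E)                              # 1.533942632...
    print((xbar - E, xbar + E))           # (102.466057..., 105.533942...)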
Example 25 Suppose our population is as in the preceding example, and suppose that our sample size is 80, as in that example. Suppose that our first sample was as indicated in that example as well, but now suppose we select another random sample of size 80 and get $\overline{X}_n' = 90$ (which is entirely possible). (The prime symbol here is meant simply to distinguish results from this sample from those of our first sample, for which $\overline{X}_n = 104$; notice that $E$ itself depends on $\sigma$ and on $n$ but not on our sample.) This would give the 95% confidence interval
$$\begin{aligned}
I_{95}' &= \left(\overline{X}_n' - E,\; \overline{X}_n' + E\right) \\
&= \left(90 - 1.96\,\frac{7}{\sqrt{80}},\; 90 + 1.96\,\frac{7}{\sqrt{80}}\right) \quad \text{(exactly)} \\
&= (88.466057367\ldots,\; 91.533942632\ldots) \quad \text{(approximately)}.
\end{aligned}$$
Observe that these confidence intervals do not overlap. That can very easily occur. We will now see exactly why it is so important to avoid writing something like
$$P(\{\mu \in I_{95}\}) = 0.95$$
in general (which, unfortunately, many people do). Let us do that for our two samples (with $I_{95}$ for the first sample and with $I_{95}'$ for the second sample) to see what happens. We get
$$P(\{\mu \in (102.466057367\ldots,\; 105.533942632\ldots)\}) = 0.95,$$
$$P(\{\mu \in (88.466057367\ldots,\; 91.533942632\ldots)\}) = 0.95.$$
I.e.,
$$P(\{\mu \in I_{95}\}) = 0.95,$$
$$P(\{\mu \in I_{95}'\}) = 0.95.$$
Since $I_{95}$ and $I_{95}'$ are disjoint, our partition formulas would then give
$$P(\{\mu \in I_{95} \cup I_{95}'\}) = 0.95 + 0.95 = 1.9 \; (!),$$
which makes no sense at all. Where is the flaw? The flaw is that $\mu$ is not the random variable we have been studying (and $\mu$ does not even depend on our sample in any way); $\overline{X}_{80}$ is our random variable, and $I_{95}$ and $I_{95}'$ are our intervals depending on it. What is true is that the probability is about 0.95 that our random sample of size 80 is one of the ones which yields a sample mean within $E = 1.533942632\ldots$ units of $\mu$; equivalently, the probability is about 0.95 that our random sample of size 80 is one of the ones which yields a 95% confidence interval (with margin of error $E = 1.533942632\ldots$) which contains $\mu$.
9 Sample size considerations for the Central Limit Theorem
We saw above that larger sample sizes $n$ result (for fixed $\sigma$) in smaller margins of error $E$ when estimating an unknown $\mu$ from a population with a known finite positive $\sigma^2$. Thus, larger sample sizes result in narrower, hence better, confidence intervals, for the same confidence level, 0.95.
Is it always better to use larger samples? For sampling with replacement, or for theoretical problems where one is choosing independent and identically distributed random variables, the answer is yes. However, for practical problems where one is sampling from a finite population having size $N$, the answer is no. The issue is that, on the one hand, we want $n$ to be large ($n \ge 30$ minimally, and ideally $n$ would be even larger so that $E = 1.96\,\sigma/\sqrt{n}$ will be very small, the smaller the better). On the other hand, we need $n/N$ to be small as well, and selecting larger samples will make this ratio larger, which makes our assumptions that the $X_i$'s are independent and identically distributed less reasonable, which makes using the Central Limit Theorem increasingly invalid. The $\overline{X}_n$'s might be quite far from normally distributed when the hypotheses of the Central Limit Theorem, such as the $n/N \le 0.05$ condition, are not satisfied, and thus our confidence interval and margin of error – both computed under assumptions of normality – will be unreliable.
So, in general, we would like $n$ to be as large as possible, consistent with the requirement that it be small enough relative to $N$. What if $N < 600$? In this case, it is mathematically impossible to choose $n \ge 30$ such that $n/N \le 0.05$, for doing so would require
$$N \ge \frac{n}{0.05} \ge \frac{30}{0.05} = 600.$$
In such a situation, you could consider using a larger population (to increase $N$). You could also consider trying to find the exact probability distribution of $\overline{X}_n$, instead of using the Central Limit Theorem to approximate it. There are various other methods as well.
10 Sample proportions

Suppose now that $X \sim B(1, p)$, a Bernoulli random variable with success probability $p$, so that $X$ takes the value 1 (a “success”) with probability $p$ and the value 0 (a “failure”) with probability $1 - p$. Then
$$\mu = E(X) = np = (1)(p) = p,$$
$$\sigma^2 = np(1-p) = (1)(p)(1-p) = p(1-p).$$
(Alternatively, we can observe that $X$ is simple and so it has moments and central moments of all orders, and we can calculate its mean and variance directly.)
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with the distribution of $X$. These represent observations from a random sample of size $n$ (chosen with replacement).
Let
$$T_n = X_1 + X_2 + \cdots + X_n$$
for some (arbitrary) positive integer $n$. Since the $X_i$'s are independent $B(1, p)$ random variables, it follows that $T_n \sim B(n, p)$, and hence
$$E(T_n) = np,$$
$$VAR(T_n) = np(1-p).$$
Each $X_i$ is 1 or 0 depending on whether the $i$th element of the sample is a success or a failure, respectively, and thus $T_n$ denotes the number of successes in the sample.
It follows that
$$\overline{X}_n = \frac{1}{n}\, T_n$$
is the proportion of successes in the sample. It is customary to use the notation $\widehat{p}_n$ for this sample proportion. Thus,
$$\widehat{p}_n = \overline{X}_n = \frac{1}{n}\, T_n.$$
We now calculate
$$E\left(\widehat{p}_n\right) = E\left(\frac{1}{n}\, T_n\right) = \frac{1}{n}\, E(T_n) = \frac{1}{n}(np) = p,$$
$$VAR\left(\widehat{p}_n\right) = VAR\left(\frac{1}{n}\, T_n\right) = \frac{1}{n^2}\, VAR(T_n) = \frac{1}{n^2}\, np(1-p) = \frac{p(1-p)}{n}.$$
Because $\widehat{p}_n = \overline{X}_n$, which is a mean of independent and identically distributed random variables, $\widehat{p}_n$ will be approximately normally distributed for large enough values of $n$. Commonly used “cutoff” conditions for $n$ being “large enough” are that $n$ should satisfy
$$np \ge 10 \text{ and } n(1-p) \ge 10 \quad \text{if } p \text{ is known},$$
$$n\widehat{p}_n \ge 10 \text{ and } n\left(1-\widehat{p}_n\right) \ge 10 \quad \text{if } p \text{ is unknown}.$$
Provided our sampling is done without replacement (as is the case in practice most of the time), we must also have $n/N \le 0.05$.
Thus, for large enough values of $n$, provided the $X_i$'s are independent and identically distributed (as with random sampling with replacement) or provided random sampling without replacement is used and also $n/N \le 0.05$, $\widehat{p}_n$ will be approximately normally distributed with mean $p$ and with variance $p(1-p)/n$.
Because
$$E\left(\widehat{p}_n\right) = p \quad \text{for each positive integer } n,$$
$\widehat{p}_n$ is an unbiased estimator for $p$.
Once again, for large values of $n$ the quantity $p(1-p)/n$ will be small (converging to 0 as $n \to \infty$), and so “most” samples of size $n$ will yield a $\widehat{p}_n$ close to $p$, ensuring that the probability that $\widehat{p}_n$ will be very close to (its mean) $p$ will be very high.
We may use $\widehat{p}_n$ to estimate an unknown population proportion $p$ by following precisely the same procedure given above for means.
Here, the standard deviation of $\widehat{p}_n$ depends on $p$, which is unknown, so we need to approximate it by $\widehat{p}_n$. Thus, we define our margin of error as follows:
$$E = 1.96\sqrt{\frac{\widehat{p}_n\left(1-\widehat{p}_n\right)}{n}}.$$
Our 95% confidence interval for the unknown $p$ is then
$$I_{95} = \left(\widehat{p} - E,\; \widehat{p} + E\right).$$
As with means, about 95% of the samples of size $n$ will result in a $\widehat{p}$ within $E$ units of $p$ (equivalently, will result in an $I_{95}$ which contains $p$).
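As a sanity check, one can simulate this coverage claim (our own sketch; the true $p$ and the sample size below are arbitrary choices satisfying the cutoff conditions above):

    import math
    import random

    random.seed(2)
    p, n, trials = 0.3, 400, 20_000      # np = 120 >= 10, n(1 - p) = 280 >= 10

    covered = 0
    for _ in range(trials):
        successes = sum(random.random() < p for _ in range(n))
        p_hat = successes / n
        E = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - E < p < p_hat + E:
            covered += 1
    print(covered / trials)              # close to 0.95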
It is common to want to know how large a sample size is needed to ensure that the margin of error will be no more than, say, 3%. Our margin of error formula above depends on $\widehat{p}_n$, which will not be known until the sample is obtained (which requires us to find $n$ first!).
Fortunately, calculus can help. The function
$$f(x) = x(1-x)$$
attains its absolute maximum value, $1/4$, when $x = 1/2$. This is easy to demonstrate with differential calculus (or even with algebra, after completing the square).
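For instance, completing the square exhibits the maximum directly:
$$x(1-x) = \frac{1}{4} - \left(x - \frac{1}{2}\right)^2 \le \frac{1}{4}, \quad \text{with equality exactly when } x = \frac{1}{2}.$$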
Therefore, it is always the case that
$$E = 1.96\sqrt{\frac{\widehat{p}_n\left(1-\widehat{p}_n\right)}{n}} \le 1.96\sqrt{\frac{1/4}{n}}.$$
Therefore, if we would like to ensure that, no matter what, $E$ will not exceed 0.03, we simply choose $n$ large enough to ensure that
$$1.96\sqrt{\frac{1/4}{n}} \le 0.03.$$
Since both quantities are positive, this inequality is true if and only if
$$(1.96)^2\,\frac{1/4}{n} \le (0.03)^2,$$
which is true if and only if
$$n \ge \frac{(1.96)^2(1/4)}{(0.03)^2} = 1067.1\ldots$$
Since $n$ must be an integer, we require $n \ge 1068$ (since $n = 1067$ does not satisfy the requisite inequality).
Choosing $n \ge 1068$ will ensure that
$$E = 1.96\sqrt{\frac{\widehat{p}_n\left(1-\widehat{p}_n\right)}{n}} \le 1.96\sqrt{\frac{1/4}{n}} \le 0.03,$$
no matter what $\widehat{p}_n$ turns out to be later.
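This worst-case sample-size computation is easily scripted (a sketch; the helper name is ours, and 1.96 is the 95% z-value used throughout these notes):

    import math

    def min_sample_size(target_margin, z=1.96):
        # smallest n with z * sqrt((1/4) / n) <= target_margin
        return math.ceil(z ** 2 * 0.25 / target_margin ** 2)

    print(min_sample_size(0.03))   # 1068
    print(min_sample_size(0.02))   # 2401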
Example 26 An election between two candidates is going to be held in a city in which there are 120,000 eligible voters. A random sample of 284 eligible voters is obtained, and it is found that 150 of the respondents favor the first candidate, Amy, while 134 of the respondents favor the second candidate, Jack. Find a 95% confidence interval for the proportion $p$ of eligible voters who favor Amy. What is the margin of error? Why is a smaller margin of error desirable? How large a sample size would be required to ensure that the margin of error does not exceed 2% no matter what?
Solution: We first check the hypotheses. We have a random sample without replacement, so we need to ensure that $n/N \le 0.05$. Here, $N$ is the number of eligible voters (the population of interest for this study), which is 120,000, which is certainly at least 20 times as large as our sample size, so $n$ is not too large.
Next, we compute $\widehat{p} = 150/284 = 0.528169014\ldots$, and we check that
$$n\widehat{p} = 284\cdot\frac{150}{284} = 150 \ge 10,$$
$$n\left(1-\widehat{p}\right) = 284\cdot\frac{134}{284} = 134 \ge 10,$$
so $n$ is large enough.
We may now use the procedure we carefully derived above. We set
$$E = 1.96\sqrt{\frac{\widehat{p}_n\left(1-\widehat{p}_n\right)}{n}} = 1.96\sqrt{\frac{(150/284)\left(1 - (150/284)\right)}{284}} = 0.058059940\ldots,$$
which yields the 95% confidence interval
$$\begin{aligned}
I_{95} &= \left(\widehat{p} - E,\; \widehat{p} + E\right) \\
&= (0.470109073\ldots,\; 0.586228955\ldots) \quad \text{(approximately)}.
\end{aligned}$$
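The arithmetic in this solution can be reproduced directly (a sketch of the computation above):

    import math

    n, favor_amy = 284, 150
    p_hat = favor_amy / n                             # 0.528169014...
    assert n * p_hat >= 10 and n * (1 - p_hat) >= 10  # the cutoff conditions

    E = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    print(E)                                          # 0.058059940...
    print((p_hat - E, p_hat + E))                     # (0.470109..., 0.586228...)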
This confidence interval is a bit problematic since part of it is below 0.5 and part is above 0.5; yet, perhaps more than anything, we would like to know the probability that $p$ will be greater than 0.5 (in which case Amy wins) or that $p$ will be less than 0.5 (in which case Jack wins). Our margin of error is too large to help us make this determination with high probability. Since our $\widehat{p}$ is about 0.528, something like a 2% margin of error would be more desirable.
Since we do not know what the $\widehat{p}$ will really be once we select a larger sample size, in order to estimate the smallest sample size required to ensure that the margin of error $E$ will certainly not exceed 0.02, we use the inequality (derived above)
$$E = 1.96\sqrt{\frac{\widehat{p}_n\left(1-\widehat{p}_n\right)}{n}} \le 1.96\sqrt{\frac{1/4}{n}},$$
and we select $n$ to be large enough so as to ensure that the right side of this inequality does not exceed 0.02, which will also ensure that $E$ does not exceed 0.02.
Since both quantities are positive, this inequality is true if and only if
$$(1.96)^2\,\frac{1/4}{n} \le (0.02)^2,$$
which is true if and only if
$$n \ge \frac{(1.96)^2(1/4)}{(0.02)^2} = 2401 \quad \text{(exactly)}.$$
Since this is exact, we do not need to round up. Choosing $n \ge 2401$ will ensure that
$$E = 1.96\sqrt{\frac{\widehat{p}_n\left(1-\widehat{p}_n\right)}{n}} \le 1.96\sqrt{\frac{1/4}{n}} \le 0.02,$$
no matter what $\widehat{p}_n$ turns out to be later. It is also still small enough that it satisfies the condition $n/N \le 0.05$.
If our new sample of size 2401 (or larger) yields a value of $\widehat{p}$ of around 0.52 or higher, then we'll be quite confident that Amy will win, since the entire 95% confidence interval will be above 0.5. However, it could very well happen that our new sample will yield a result like $\widehat{p} = 0.51$, and then our margin of error (our $E$, using $\widehat{p} = 0.51$, will be $0.019995999\ldots$, which is just barely under 2%) would be too large to allow us to predict the winner with any great degree of certainty.