12 Unknown Proportions
Module 4: Decisions with Data
The z-test
How can we make evidence-based decisions? Is an observed result due to
chance or something else? How can we test whether a population has a certain
proportion?
The t-test
How can we test whether an unknown population has a certain mean?
The 𝜒²-test
How can we compare the frequencies of categories?
Today’s outline
A review of the Central Limit Theorem (CLT)
Central Limit Theorem
· Let $X_1, \ldots, X_n$ be $n$ random draws with replacement from a box and let:
- $S = X_1 + \cdots + X_n = \sum_{i=1}^{n} X_i$ be the sample sum
- $\bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{S}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample average
· For large 𝑛 , the box of all possible sample sums has mean 𝐸(𝑆) = 𝑛𝜇 , SD
𝑆𝐸(𝑆) = 𝜎√𝑛 and is approximately normal
· For large 𝑛 , the box of all possible sample means has mean 𝐸(𝑋¯ ) = 𝜇, SD
𝑆𝐸(𝑋¯ ) = 𝜎/√𝑛 and is approximately normal
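· For example (a quick sketch, not from the slides, using a hypothetical box with tickets 1, 2, 3, 4):
# Simulate the sample mean of n = 100 draws with replacement, repeated 10000 times
box = c(1, 2, 3, 4)
n = 100
xbars = replicate(10000, mean(sample(box, size = n, replace = TRUE)))
mean(xbars)                                 # close to E(Xbar) = mu = 2.5
sd(xbars)                                   # close to SE(Xbar) = sigma/sqrt(n)
sqrt(mean((box - mean(box))^2)) / sqrt(n)   # sigma/sqrt(n) = 1.118/10 = 0.1118
hist(xbars)                                 # roughly normal shape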
The 0-1 box
Important special case: 0-1 boxes
· An important example is where the box only contains 0s and 1s.
· Let 𝑝 denote the proportion of 1s in the box, and 𝑁 the number of tickets. Then there are:
- (1 − 𝑝)𝑁 0s, and
- 𝑝𝑁 1s:
[Box diagram: 0 ⋯ 0 ((1 − 𝑝)𝑁 of these)   1 ⋯ 1 (𝑝𝑁 of these)]
𝜇 and 𝜎 only depend on 𝑝
· We can calculate the mean and SD of the box in terms of $p$:
- the mean of the box is $\mu = \frac{pN}{N} = p$;
- the mean square of the box is also $p$, and so the SD of the box is $\sigma = \sqrt{p - p^2} = \sqrt{p(1-p)}$.
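· A quick numerical check (a sketch, not from the slides), using a hypothetical box with 𝑁 = 1000 tickets and 𝑝 = 0.25:
box = rep(c(0, 1), times = c(750, 250))    # 750 zeros and 250 ones, so p = 0.25
mean(box)                                  # mu = p = 0.25
sqrt(mean((box - mean(box))^2))            # population SD of the box
sqrt(0.25 * 0.75)                          # sqrt(p * (1 - p)) = 0.4330..., agrees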
Prediction intervals
Introduction to prediction intervals
· A 100 ⋅ 𝛾% (two-sided) prediction interval for the sample sum 𝑆 is an interval
[𝑎, 𝑏] such that there is a 100 ⋅ 𝛾% chance that 𝑆 lands in it:
𝑃(𝑎 ≤ 𝑆 ≤ 𝑏) = 𝛾
· A 100 ⋅ 𝛾% (two-sided) prediction interval for the sample average 𝑋¯ is an interval
[𝑎, 𝑏] such that there is a 100 ⋅ 𝛾% chance that 𝑋¯ lands in it:
𝑃(𝑎 ≤ 𝑋¯ ≤ 𝑏) = 𝛾
· How can we find 𝑎 and 𝑏?
Derivation with 𝑋¯
· Note the following:
qnorm(0.025)
## [1] -1.959964
· This means that 2.5% of the area under the standard normal curve is to the left of -1.96 and
2.5% is to the right of 1.96.
· In other words, 95% of the area under the standard normal curve is between -1.96 and 1.96.
· 𝑋¯ is approximately normal with mean 𝐸(𝑋¯ ) and SD equal to 𝑆𝐸(𝑋¯ ) .
· Equivalently, $\frac{\bar{X} - E(\bar{X})}{SE(\bar{X})}$ is approximately standard normal, with mean 0 and SD 1.
Derivation with 𝑋¯
· Finally note that
$$P(a \le \bar{X} \le b) = P\left( \frac{a - E(\bar{X})}{SE(\bar{X})} \le \frac{\bar{X} - E(\bar{X})}{SE(\bar{X})} \le \frac{b - E(\bar{X})}{SE(\bar{X})} \right)$$
· So if I choose $a$ such that $\frac{a - E(\bar{X})}{SE(\bar{X})} = -1.96$ and $b$ such that $\frac{b - E(\bar{X})}{SE(\bar{X})} = 1.96$, then
the quantity $\frac{\bar{X} - E(\bar{X})}{SE(\bar{X})}$ will land between -1.96 and 1.96 with 95% probability.
· Rearranging these two equations we get:
$$a = E(\bar{X}) - 1.96 \cdot SE(\bar{X}), \qquad b = E(\bar{X}) + 1.96 \cdot SE(\bar{X})$$
Putting this together
· So an (approximate) 95% prediction interval for the sample average $\bar{X}$ is:
$$\left[\, E(\bar{X}) - 1.96 \cdot SE(\bar{X}),\; E(\bar{X}) + 1.96 \cdot SE(\bar{X}) \,\right]$$
· For the 0-1 box, $E(\bar{X}) = p$ and $SE(\bar{X}) = \sqrt{\frac{p(1-p)}{n}}$.
· So an (approximate) 95% prediction interval for the sample average $\bar{X}$ from the 0-1 box is:
$$\left[\, p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96 \cdot \sqrt{\frac{p(1-p)}{n}} \,\right]$$
· For other values of 100 ⋅ 𝛾% like 90% or 99%, the “1.96” needs to be adjusted
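· As an illustration, here is a small helper function (hypothetical, not part of the slides) that computes this interval for a given 𝑝, 𝑛 and confidence level:
# Hypothetical helper: approximate prediction interval for the sample
# proportion from a 0-1 box with known p, using the normal approximation
pred.interval = function(p, n, conf.level = 0.95) {
  z = qnorm(1 - (1 - conf.level) / 2)   # 1.96 for 95%, 2.576 for 99%
  se = sqrt(p * (1 - p) / n)
  c(lower = p - z * se, upper = p + z * se)
}
pred.interval(p = 0.4, n = 49)          # roughly (0.26, 0.54); see Example 1 below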
Example 1
· Suppose we draw 𝑛 = 49 times randomly from a box with 𝑝 = 0.4 .
· What is the 95% prediction interval for 𝑋¯ ?
· Solution:
- the expected value is 𝐸(𝑋¯ ) = 𝜇 = 𝑝 = 0.4 .
- the standard error is
$$SE(\bar{X}) = \sigma/\sqrt{n} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{1}{49} \times \frac{2}{5} \times \frac{3}{5}} = \frac{\sqrt{6}}{35} \approx 0.07.$$
- the distribution of $\bar{X}$ has a (roughly) normal shape.
- Hence substituting into our prediction interval gives us $0.4 \pm 1.96 \times 0.07$, i.e. approximately $(0.26, 0.54)$.
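· A quick R check of this arithmetic (a sketch, not from the slides):
se = sqrt(0.4 * 0.6 / 49)        # sqrt(6)/35, about 0.07
0.4 + c(-1, 1) * 1.96 * se       # about (0.26, 0.54)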
Visualisation
· The histogram of 𝑋¯ is approximated by a normal curve centred at 0.4.
· 𝑋¯ lands in the interval (0.26, 0.54) with probability 95%.
What if 𝑝 = 0.2 instead of 0.4?
· It is interesting to see how this changes if the proportion in the box is 0.2 instead of
0.4.
· We then get
- $E(\bar{X}) = \mu = p = 0.2$
- $SE(\bar{X}) = \sqrt{\frac{0.2 \times 0.8}{49}} = \frac{0.4}{7} \approx 0.057$
· The resulting 95% prediction interval is $0.2 \pm 1.96 \times 0.057$, or $(0.09, 0.31)$.
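· The same quick check in R for 𝑝 = 0.2 (sketch):
0.2 + c(-1, 1) * 1.96 * sqrt(0.2 * 0.8 / 49)   # about (0.09, 0.31)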
Interval now a bit narrower
· Note that the interval is now a little narrower, i.e. 0.22 units wide (compared with
0.28 when 𝑝 = 0.4).
Simulation for 𝑝 = 0.4
# Simulate 1000 samples of size 49 from a 0-1 box with p = 0.4 and count how
# often the sample proportion falls outside the 95% prediction interval
too.big = logical(1000)
too.small = logical(1000)
for (i in 1:1000) {
  samp = sample(c(0, 1), prob = c(0.6, 0.4), replace = TRUE, size = 49)
  prop = mean(samp)
  too.big[i] = prop > (0.4 + 1.96 * 0.07)
  too.small[i] = prop < (0.4 - 1.96 * 0.07)
}
num.too.small = sum(too.small)
num.too.big = sum(too.big)
num.just.right = 1000 - num.too.small - num.too.big
cbind(num.too.small, num.just.right, num.too.big)
Simulation for 𝑝 = 0.2
# Same simulation, now for a box with p = 0.2
too.big = logical(1000)
too.small = logical(1000)
for (i in 1:1000) {
  samp = sample(c(0, 1), prob = c(0.8, 0.2), replace = TRUE, size = 49)
  prop = mean(samp)
  too.big[i] = prop > (0.2 + 1.96 * 0.057)
  too.small[i] = prop < (0.2 - 1.96 * 0.057)
}
num.too.small = sum(too.small)
num.too.big = sum(too.big)
num.just.right = 1000 - num.too.small - num.too.big
cbind(num.too.small, num.just.right, num.too.big)
Size of prediction intervals
· The variability in the sample proportion gets smaller as the 𝑝 in the box gets
further from 0.5.
· This is precisely reflected in $SE(\bar{X}) = \sqrt{\frac{p(1-p)}{n}}$.
p = 0:1000/1000
plot(p, p * (1 - p), type = "l")
Prediction interval for 𝑆
· A 95% prediction interval for the sample sum $S$ is: $\left[\, np - 1.96\sqrt{n\,p(1-p)},\; np + 1.96\sqrt{n\,p(1-p)} \,\right]$
· Again, the value “1.96” needs to be adjusted for different values of 100 ⋅ 𝛾%.
Confidence intervals
Interval of values consistent with each 𝑝
· The previous section showed us how the sample mean/proportion 𝑋¯ behaves for
a known box proportion 𝑝 .
· We saw that each value 𝑝 has associated with it an interval of values consistent
with that 𝑝 , characterized as a 95% “prediction interval” for the sample proportion.
- the interval is centred at 𝑝
- its width depends on 𝑛 and 𝑝
- the interval is wider the closer 𝑝 is to 0.5.
Turning things around
· What if the “population” proportion 𝑝 is unknown?
· Suppose we draw 𝑛 = 49 times from a box with unknown 𝑝 and observe the sample proportion 𝑥¯ = 14/49 = 2/7.
How about both 𝑝 = 0.2 and 𝑝 = 0.4 ?
· We replicate our graph from before, showing intervals of values consistent with
both 𝑝 = 0.2 and 𝑝 = 0.4 , when 𝑛 = 49 .
· The vertical green line shows our observed value 𝑥¯ = 2/7.
Furthest values
· Clearly, there exist "upper" and "lower" values of 𝑝 for which the observation is
just on the edge.
· These values form the endpoints of a 95% (two-sided) confidence interval for
the unknown 𝑝 (this is called a Wilson’s confidence interval).
· In other words, we want to find all the 𝑝's that make our observation 𝑥¯ land in the interval
$$\left[\, p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96 \cdot \sqrt{\frac{p(1-p)}{n}} \,\right]$$
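· One way to find these two values numerically (a sketch, not the method used by binom.confint() on the next slide) is a simple grid search over 𝑝:
# Grid search: keep the p's whose 95% prediction interval contains xbar = 14/49
xbar = 14 / 49
n = 49
ps = seq(0.0001, 0.9999, by = 0.0001)
inside = abs(xbar - ps) <= 1.96 * sqrt(ps * (1 - ps) / n)
range(ps[inside])   # approximately 0.178 and 0.424 (up to grid resolution)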
The R binom package
· The R package binom computes these endpoints using the binom.confint()
function.
library(binom)
binom.confint(x = 14, n = 49, method = "wilson")  # note: the argument 'x' is the sample sum (count of 1s)
Sanity check
· This shows us the "extreme" values of 𝑝 for which 𝑥¯ = 2/7 ≈ 0.285 still falls in the
95% prediction interval: 𝑝 = 0.178 and 𝑝 = 0.424.
· We can check this to be sure:
- For $p = 0.178$ the 95% prediction interval is:
$$0.178 \pm 1.96 \cdot \sqrt{\frac{0.178 \times 0.822}{49}} = (0.071, 0.285)$$
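· The same check in R (sketch):
0.178 + c(-1, 1) * 1.96 * sqrt(0.178 * 0.822 / 49)   # about (0.071, 0.285)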
Interpreting the confidence interval
· Suppose we construct a 95% confidence interval from a box with a proportion 𝑝 of
1 s.
· We know there is a 95% chance that 𝑋¯ will fall in the prediction interval. The
confidence interval will include that 𝑝 if and only if that happens!
· Equivalently, there is a 95% chance that 𝑝 will fall in the confidence interval.
· It is not 𝑝 that is random here; it is the confidence interval, since it depends on the
observed value of 𝑋¯.
· I use will here because this is different from 𝑝 falling into the confidence interval
computed using the observed 𝑥¯.
· The latter is a deterministic statement that is either true or false.
Demonstration
· Let us see how (Wilson’s) confidence interval works when repeatedly sampling
from a box with a known 𝑝 .
library(binom)  # for binom.confint()

# Simulate 1000 samples of size 50 from a 0-1 box with p = 0.3 and count how
# often the 95% Wilson interval misses the true p
p = 0.3
n = 50
over.est = logical(1000)
under.est = logical(1000)
for (i in 1:1000) {
  samp = sample(c(0, 1), prob = c(1 - p, p), replace = TRUE, size = n)
  s = sum(samp)
  w = binom.confint(s, n, method = "wilson")
  over.est[i] = w$lower > p
  under.est[i] = w$upper < p
}
num.over.est = sum(over.est)
num.under.est = sum(under.est)
num.covering = 1000 - num.over.est - num.under.est
cbind(num.under.est, num.covering, num.over.est)
We see that close to 95% of the time, the interval covers the “true” value of 𝑝 = 0.3 .
Properties of the (Wilson) confidence interval
· Under repeated sampling from a 0-1 box, the 95% Wilson confidence interval
covers the “true” proportion 𝑝 in (approx.) 95% of samples.
· This is a long-run property of the procedure.
· For a single data set, you don’t know if it has covered the true value or not.
- You just know that the procedure you have used is 95% reliable in the long
run.
· Note that the interval is not (in general) symmetric about the observed sample
proportion 𝑥¯ .
- The midpoint of the interval is somewhere between 𝑥¯ and 0.5.
Different confidence levels
· We can change the confidence level by replacing 1.96 with another value.
qnorm(0.995)
## [1] 2.575829
(which leaves 0.5% in the upper tail of the standard normal curve, as needed for a 99% interval).
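· For example (a sketch), the multipliers for two-sided 90%, 95% and 99% intervals:
qnorm(c(0.95, 0.975, 0.995))   # about 1.645, 1.960, 2.576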
Changing the confidence level using binom.confint()
· Using binom.confint() we simply set the conf.level= argument to the
desired level:
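· For example, a 99% interval for the earlier data (14 ones in 49 draws):
library(binom)
binom.confint(x = 14, n = 49, conf.level = 0.99, method = "wilson")   # endpoints roughly 0.153 and 0.469 (next slide)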
Sanity check
· This shows us the "extreme" values of 𝑝 for which 𝑥¯ = 2/7 ≈ 0.285 still falls in the
99% prediction interval: 𝑝 = 0.153 and 𝑝 = 0.469.
· We can check this to be sure:
- For $p = 0.153$ the 99% prediction interval is:
$$0.153 \pm 2.576 \cdot \sqrt{\frac{0.153 \times 0.847}{49}} = (0.021, 0.285)$$
Example
· The file march2023.csv has daily weather observations from the Canterbury
Racecourse weather station for March 2023.
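· The summary below was presumably produced along these lines (a sketch, assuming the file is in the working directory and is read into a data frame called mar.2023, the name used later):
mar.2023 = read.csv("march2023.csv")
summary(mar.2023)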
## X Date Minimum.temperature..degC.
## Mode:logical Length:31 Min. :11.60
## NA's:31 Class :character 1st Qu.:16.20
## Mode :character Median :17.20
## Mean :16.89
## 3rd Qu.:18.20
## Max. :21.60
## Maximum.temperature..degC. Rainfall..mm. Evaporation..mm. Sunshine..hours.
## Min. :22.70 Min. : 0.000 Mode:logical Mode:logical
## 1st Qu.:25.00 1st Qu.: 0.000 NA's:31 NA's:31
## Median :26.40 Median : 0.000
## Mean :27.77 Mean : 2.058
## 3rd Qu.:29.50 3rd Qu.: 1.700
## Max. :38.10 Max. :31.400
## Direction.of.maximum.wind.gust. Speed.of.maximum.wind.gust..km.h.
## Length:31 Min. :24.00
## Class :character 1st Qu.:31.00
## Mode :character Median :37.00
## Mean :38.23
## 3rd Qu.:46.00
## Max. :57.00
## Time.of.maximum.wind.gust X9am.Temperature..degC. X9am.relative.humidity....
## Length:31 Min. :17.20 Min. : 42.00
Rainfall
mar.2023$Rain
## [1] 0.0 0.4 0.0 3.2 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 2.0 31.4
## [16] 0.0 0.0 0.0 0.0 0.0 2.6 0.0 0.2 4.0 0.0 1.4 8.8 0.4 3.8 0.4
## [31] 0.0
x = sum(mar.2023$Rain > 0)   # number of days with rainfall > 0, which is 14
binom.confint(x = 14, n = 31, method = "wilson")
· The data is thus consistent with the “true” 𝑝 being anywhere in the range
(0.29, 0.62).
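· As a cross-check (a sketch, not from the slides): base R's prop.test() with correct = FALSE computes the same Wilson score interval:
prop.test(x = 14, n = 31, correct = FALSE)$conf.int   # roughly (0.29, 0.62)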