
Unknown Proportions

Decisions with Data | Inference for means

© University of Sydney MATH1062/1005


05 October 2024
Course Overview

[Course diagram: Population → Sample]
1 Exploring Data · 2 Modelling Data · 3 Sampling Data · 4 Decisions with Data

2/37

Module 4: Decisions with Data

The z-test
How can we make evidence-based decisions? Is an observed result due to
chance or something else? How can we test whether a population has a certain
proportion?

The t-test
How can we test whether an unknown population has a certain mean?

The two-sample test


How can we test whether two populations have the same mean?

The 𝜒²-test
How can we compare the observed frequencies of categories?
3/37

Today’s outline

A review of the Central Limit Theorem (CLT)

The 0-1 box


Prediction intervals for the 0-1 box

Confidence intervals for the 0-1 box

4/37
A review of the Central Limit
Theorem (CLT)
Central Limit Theorem
· Let 𝑋₁, …, 𝑋ₙ be 𝑛 random draws with replacement from a box and let:
- 𝑆 = 𝑋₁ + ⋯ + 𝑋ₙ = ∑ᵢ₌₁ⁿ 𝑋ᵢ be the sample sum
- 𝑋¯ = (𝑋₁ + ⋯ + 𝑋ₙ)/𝑛 = 𝑆/𝑛 = (1/𝑛) ∑ᵢ₌₁ⁿ 𝑋ᵢ be the sample average

· If we know the mean 𝜇 and SD 𝜎 of the box, then:

- 𝐸(𝑆) = 𝑛𝜇 and 𝑆𝐸(𝑆) = 𝜎√𝑛
- 𝐸(𝑋¯) = 𝜇 and 𝑆𝐸(𝑋¯) = 𝜎/√𝑛

· For large 𝑛, the box of all possible sample sums has mean 𝐸(𝑆) = 𝑛𝜇, SD
𝑆𝐸(𝑆) = 𝜎√𝑛 and is approximately normal
· For large 𝑛, the box of all possible sample means has mean 𝐸(𝑋¯) = 𝜇, SD
𝑆𝐸(𝑋¯) = 𝜎/√𝑛 and is approximately normal
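
· As a quick sanity check, we can simulate many sample averages in R and compare their mean and SD with 𝜇 and 𝜎/√𝑛 (a minimal sketch; the box c(1, 2, 3, 4, 5) is just an arbitrary illustration):

box = c(1, 2, 3, 4, 5)                # an arbitrary box of tickets
mu = mean(box)                        # mean of the box
sigma = sqrt(mean(box^2) - mu^2)      # SD of the box
n = 100
xbars = replicate(10000, mean(sample(box, size = n, replace = TRUE)))
c(mean(xbars), mu)                    # these two should be close
c(sd(xbars), sigma / sqrt(n))         # and so should these
hist(xbars)                           # roughly normal in shape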

6/37
The 0-1 box
Important special case: 0-1 boxes
· An important example is where the box only contains 0s and 1s.

· Let 𝑝 denote the proportion of 1s in the box, and 𝑁 the number of tickets. Then
there are:
- (1 − 𝑝)𝑁 0s and
- 𝑝𝑁 1s:

0 ⋯ 0   1 ⋯ 1
(1−𝑝)𝑁 of these   𝑝𝑁 of these

8/37
𝜇 and 𝜎 only depend on 𝑝
· We can calculate the mean and SD of the box in terms of 𝑝:
- the mean of the box is 𝜇 = 𝑝𝑁/𝑁 = 𝑝;
- the mean square of the box is also 𝑝, and so the SD of the box is

𝜎 = √(mean sq. − (mean)²) = √(𝑝 − 𝑝²) = √(𝑝(1 − 𝑝))

· Therefore 𝐸(𝑆), 𝐸(𝑋¯), 𝑆𝐸(𝑆), 𝑆𝐸(𝑋¯) only depend on 𝑝 and 𝑛.
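
· A quick numerical check of both formulas (a small sketch; the choices 𝑝 = 0.3 and 𝑁 = 10 are arbitrary):

box = c(rep(0, 7), rep(1, 3))      # N = 10 tickets with p = 0.3
mean(box)                          # mean of the box: equals p
sqrt(mean(box^2) - mean(box)^2)    # SD of the box
sqrt(0.3 * (1 - 0.3))              # agrees with sqrt(p * (1 - p))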

9/37
Prediction intervals
Introduction to prediction intervals
· A 100 ⋅ 𝛾% (two-sided) prediction interval for the sample sum 𝑆 is an interval
[𝑎, 𝑏] in which 𝑆 lands with probability 100 ⋅ 𝛾%:
𝑃(𝑎 ≤ 𝑆 ≤ 𝑏) = 𝛾
· A 100 ⋅ 𝛾% (two-sided) prediction interval for the sample average 𝑋¯ is an interval
[𝑎, 𝑏] in which 𝑋¯ lands with probability 100 ⋅ 𝛾%:
𝑃(𝑎 ≤ 𝑋¯ ≤ 𝑏) = 𝛾
· How can we find 𝑎 and 𝑏?

11/37
Derivation with 𝑋¯
· Note the following:

qnorm(0.025)

## [1] -1.959964

· This means that 2.5% under the normal curve is to the left of -1.96 and 2.5%
under the normal curve is to the right of 1.96.
· In other words 95% of the area under the normal curve is between -1.96 and 1.96.
· 𝑋¯ is approximately normal with mean 𝐸(𝑋¯) and SD equal to 𝑆𝐸(𝑋¯).
· Equivalently, (𝑋¯ − 𝐸(𝑋¯))/𝑆𝐸(𝑋¯) is approximately standard normal, with mean 0 and SD 1.

12/37
Derivation with 𝑋¯
· Finally note that

𝑃(𝑎 ≤ 𝑋¯ ≤ 𝑏) = 𝑃( (𝑎 − 𝐸(𝑋¯))/𝑆𝐸(𝑋¯) ≤ (𝑋¯ − 𝐸(𝑋¯))/𝑆𝐸(𝑋¯) ≤ (𝑏 − 𝐸(𝑋¯))/𝑆𝐸(𝑋¯) )

· So if I choose 𝑎 such that (𝑎 − 𝐸(𝑋¯))/𝑆𝐸(𝑋¯) = −1.96 and 𝑏 such that
(𝑏 − 𝐸(𝑋¯))/𝑆𝐸(𝑋¯) = 1.96, then the quantity (𝑋¯ − 𝐸(𝑋¯))/𝑆𝐸(𝑋¯) will land
between −1.96 and 1.96 with 95% probability.
· Rearranging these two equations we get:

𝑎 = 𝐸(𝑋¯) − 1.96 ⋅ 𝑆𝐸(𝑋¯),  𝑏 = 𝐸(𝑋¯) + 1.96 ⋅ 𝑆𝐸(𝑋¯)


· In other words, 𝑋¯ will be between these two values 95% of the time.

13/37
Putting this together
· So an (approximate) 95% prediction interval for the sample average 𝑋¯ is:

[𝐸(𝑋¯) − 1.96 ⋅ 𝑆𝐸(𝑋¯), 𝐸(𝑋¯) + 1.96 ⋅ 𝑆𝐸(𝑋¯)]

· But we know how to compute 𝐸(𝑋¯) and 𝑆𝐸(𝑋¯) for the 0-1 box, so more
precisely:
- 𝐸(𝑋¯) = 𝑝
- 𝑆𝐸(𝑋¯) = √(𝑝(1 − 𝑝)/𝑛)

· So an (approximate) 95% prediction interval for the sample average 𝑋¯ in the 0-1
box is:

[𝑝 − 1.96 ⋅ √(𝑝(1 − 𝑝)/𝑛), 𝑝 + 1.96 ⋅ √(𝑝(1 − 𝑝)/𝑛)]

· For other values of 100 ⋅ 𝛾%, like 90% or 99%, the “1.96” needs to be adjusted
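
· In R we can wrap this interval in a small helper function (a sketch of our own, not from any package; the name pred.interval is ours). Its conf argument handles other confidence levels via qnorm():

pred.interval = function(p, n, conf = 0.95) {
    z = qnorm(1 - (1 - conf) / 2)    # 1.96 when conf = 0.95
    se = sqrt(p * (1 - p) / n)       # SE of the sample proportion
    c(lower = p - z * se, upper = p + z * se)
}
pred.interval(p = 0.4, n = 49)       # reproduces Example 1 on the next slide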
14/37
Example 1
· Suppose we draw 𝑛 = 49 times randomly from a box with 𝑝 = 0.4 .
· What is the 95% prediction interval for 𝑋¯ ?
· Solution:
- the expected value is 𝐸(𝑋¯) = 𝜇 = 𝑝 = 0.4.
- the standard error is 𝑆𝐸(𝑋¯) = 𝜎/√𝑛 = √(𝑝(1 − 𝑝)/𝑛) = √((1/49) × (2/5) × (3/5)) = √6/35 ≈ 0.07.
- the distribution of 𝑋¯ has a (roughly) normal shape.
- Hence substituting into our prediction interval gives us:

[0.4 − 1.96 × 0.07, 0.4 + 1.96 × 0.07]

or 0.4 ± 1.96 × 0.07 or (0.26, 0.54).
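
· The same interval computed directly in R (a quick check, using qnorm(0.975) in place of 1.96):

0.4 + c(-1, 1) * qnorm(0.975) * sqrt(0.4 * 0.6 / 49)    # approximately (0.26, 0.54)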

15/37
Visualisation
· The histogram of 𝑋¯ is approximated by the following normal curve

· 𝑋¯ lands in the interval (0.26, 0.54) (the blue interval) with probability 95%.

16/37
What if 𝑝 = 0.2 instead of 0.4?
· It is interesting to see how this changes if the proportion in the box is 0.2 instead of
0.4.
· We then get
- 𝐸(𝑋¯) = 𝜇 = 𝑝 = 0.2
- 𝑆𝐸(𝑋¯) = 𝜎/√𝑛 = √(𝑝(1 − 𝑝)/𝑛) = √((1/49) × (1/5) × (4/5)) = 2/35 ≈ 0.057

· So the box of all possible 𝑋¯ values has


- mean 0.2;
- SD 0.057 ;
- a normal shape.

· The resulting 95% prediction interval is 0.2 ± 1.96 × 0.057 or (0.09, 0.31).

17/37
Interval now a bit narrower

· Note that the interval is now a little narrower, i.e. 0.22 units wide (compared with
0.28 when 𝑝 = 0.4).

18/37
Simulation for 𝑝 = 0.4
too.big = 0
too.small = 0
for (i in 1:1000) {
    # draw 49 tickets from a 0-1 box with p = 0.4 and compute the sample proportion
    samp = sample(c(0, 1), prob = c(0.6, 0.4), replace = TRUE, size = 49)
    prop = mean(samp)
    # record whether the proportion falls outside the 95% prediction interval
    too.big[i] = prop > (0.4 + 1.96 * 0.07)
    too.small[i] = prop < (0.4 - 1.96 * 0.07)
}
num.too.small = sum(too.small)
num.too.big = sum(too.big)
num.just.right = 1000 - num.too.small - num.too.big
cbind(num.too.small, num.just.right, num.too.big)

## num.too.small num.just.right num.too.big


## [1,] 13 967 20

19/37
Simulation for 𝑝 = 0.2
too.big = 0
too.small = 0
for (i in 1:1000) {
    # draw 49 tickets from a 0-1 box with p = 0.2 and compute the sample proportion
    samp = sample(c(0, 1), prob = c(0.8, 0.2), replace = TRUE, size = 49)
    prop = mean(samp)
    # record whether the proportion falls outside the 95% prediction interval
    too.big[i] = prop > (0.2 + 1.96 * 0.057)
    too.small[i] = prop < (0.2 - 1.96 * 0.057)
}
num.too.small = sum(too.small)
num.too.big = sum(too.big)
num.just.right = 1000 - num.too.small - num.too.big
cbind(num.too.small, num.just.right, num.too.big)

## num.too.small num.just.right num.too.big


## [1,] 26 948 26

20/37
Size of prediction intervals
· The variability in the sample proportion gets smaller as the 𝑝 in the box gets
further from 0.5.
· This is precisely reflected in 𝑆𝐸(𝑋¯) = √(𝑝(1 − 𝑝)/𝑛).
· The function 𝑝 ↦ 𝑝(1 − 𝑝) = 𝑝 − 𝑝² is a quadratic function of 𝑝:

p = 0:1000/1000
plot(p, p * (1 - p), type = "l")

21/37
Prediction interval for 𝑆
· A 95% prediction interval for the sample average 𝑋¯ is:

[𝐸(𝑋¯) − 1.96 ⋅ 𝑆𝐸(𝑋¯), 𝐸(𝑋¯) + 1.96 ⋅ 𝑆𝐸(𝑋¯)]

· Following the same steps, we get that a 95% prediction interval for the sample
sum 𝑆 is:

[𝐸(𝑆) − 1.96 ⋅ 𝑆𝐸(𝑆), 𝐸(𝑆) + 1.96 ⋅ 𝑆𝐸(𝑆)]

or more precisely

[𝑛𝑝 − 1.96 ⋅ √(𝑛𝑝(1 − 𝑝)), 𝑛𝑝 + 1.96 ⋅ √(𝑛𝑝(1 − 𝑝))]

· Again, the value “1.96” needs to be adjusted for different values of 100 ⋅ 𝛾%.
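
· Continuing the earlier example (𝑝 = 0.4, 𝑛 = 49), the 95% prediction interval for 𝑆 can be computed in R (a quick sketch):

n = 49
p = 0.4
n * p + c(-1, 1) * qnorm(0.975) * sqrt(n * p * (1 - p))    # roughly (12.9, 26.3)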

22/37
Confidence intervals
Interval of values consistent with each 𝑝
· The previous section showed us how the sample mean/proportion 𝑋¯ behaves for
a known box proportion 𝑝 .
· We saw that each value 𝑝 has associated with it an interval of values consistent
with that 𝑝 , characterized as a 95% “prediction interval” for the sample proportion.
- the interval is centred at 𝑝
- its width depends on 𝑛 and 𝑝
- the interval is wider the closer 𝑝 is to 0.5.

24/37
Turning things around
· What if the “population” proportion 𝑝 is unknown?

· Suppose

- we have a sample of size 𝑛 = 49 from a box with unknown 𝑝 ,


- the observed sample sum is 𝑠 = 14, so that
- the observed sample proportion is 𝑥¯ = 𝑠/𝑛 = 14/49 = 2/7 ≈ 0.2857.
· We might ask the following question:

Which values of 𝑝 is this observation consistent with (using the 95%
prediction intervals)?

25/37
How about both 𝑝 = 0.2 and 𝑝 = 0.4 ?
· We replicate our graph from before, showing intervals of values consistent with
both 𝑝 = 0.2 and 𝑝 = 0.4 , when 𝑛 = 49 .
· The vertical green line below shows our observed value 𝑥¯ = 2/7.

· Note that 𝑥¯ = 2/7 is consistent with both 𝑝 = 0.2 and 𝑝 = 0.4.
· What other values of 𝑝 is the observed value 2/7 consistent with (in this sense)?

26/37
Furthest values
· Clearly, there exist “upper” and “lower” values of 𝑝 for which the observation is
just on the edge.
· These values form the endpoints of a 95% (two-sided) confidence interval for
the unknown 𝑝 (this is called Wilson’s confidence interval).
· In other words, we want to find all the 𝑝’s that make our observation 𝑥¯ land in the
interval [𝑝 − 1.96 ⋅ √(𝑝(1 − 𝑝)/𝑛), 𝑝 + 1.96 ⋅ √(𝑝(1 − 𝑝)/𝑛)].

· How can we find these endpoints?

· Three methods:
1. Theoretically, by solving a quadratic equation (we shall not spend time on this, it
is too complicated)
2. Using a numerical method to find the endpoints (this requires a little work on
our side; see the sketch below)
3. Using an R package (all the work is taken care of for you).
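
· As an illustration of method 2, here is a minimal sketch using uniroot(): the endpoints are the two values of 𝑝 at which (𝑥¯ − 𝑝)² = 1.96² ⋅ 𝑝(1 − 𝑝)/𝑛, i.e. where the observation sits exactly on the edge of the prediction interval.

xbar = 14 / 49
n = 49
z = qnorm(0.975)
f = function(p) (xbar - p)^2 - z^2 * p * (1 - p) / n    # zero exactly at the endpoints
lower = uniroot(f, c(1e-04, xbar))$root                 # search below the observation
upper = uniroot(f, c(xbar, 1 - 1e-04))$root             # search above the observation
c(lower, upper)    # should match binom.confint() on the next slide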

27/37
The R binom package
· The R package binom computes these endpoints using the binom.confint()
function.

· In our case, we compute the endpoints as follows:

require(binom) # this makes sure the binom package is loaded

## Loading required package: binom

binom.confint(x = 14, n = 49, method = "wilson") # note here the argument 'x' is the sample sum or count

## method x n mean lower upper


## 1 wilson 14 49 0.2857143 0.1784959 0.4240888

28/37
Sanity check
· This shows us the “extreme” values of 𝑝 for which 𝑥¯ = 2/7 ≈ 0.285 still falls in the
95% prediction interval: 𝑝 = 0.178 and 𝑝 = 0.424.
· We can check this to be sure:
- For 𝑝 = 0.178 the 95% prediction interval is:
0.178 ± 1.96 ⋅ √(0.178 × 0.822/49) = (0.071, 0.285)

- For 𝑝 = 0.424 the 95% prediction interval is:
0.424 ± 1.96 ⋅ √(0.424 × 0.576/49) = (0.285, 0.562)
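
· The same check in R (a quick sketch):

0.178 + c(-1, 1) * 1.96 * sqrt(0.178 * 0.822 / 49)    # approximately (0.071, 0.285)
0.424 + c(-1, 1) * 1.96 * sqrt(0.424 * 0.576 / 49)    # approximately (0.285, 0.562)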

29/37
Interpreting the confidence interval
· Suppose we construct a 95% confidence interval from a box with a proportion 𝑝 of
1s.
· We know there is a 95% chance that 𝑋¯ will fall in the prediction interval, and the
confidence interval will include 𝑝 if and only if that happens!
· Equivalently, there is a 95% chance that 𝑝 will fall in the confidence interval.
· It is not 𝑝 that is random here; it is the confidence interval, since it depends on the
observed value of 𝑋¯.
· I say will here, because this is different from 𝑝 falling into the confidence interval
computed using the observed 𝑥¯!
· The latter is a deterministic statement, which is either true or false.

30/37
Demonstration
· Let us see how (Wilson’s) confidence interval works when repeatedly sampling
from a box with a known 𝑝 .

p = 0.3
n = 50
over.est = 0
under.est = 0
for (i in 1:1000) {
    # draw a sample, compute the Wilson interval, and record whether it misses p
    samp = sample(c(0, 1), prob = c(1 - p, p), replace = TRUE, size = n)
    s = sum(samp)
    w = binom.confint(s, n, method = "wilson")
    over.est[i] = w$lower > p     # the whole interval lies above p
    under.est[i] = w$upper < p    # the whole interval lies below p
}
num.over.est = sum(over.est)
num.under.est = sum(under.est)
num.covering = 1000 - num.over.est - num.under.est
cbind(num.under.est, num.covering, num.over.est)

## num.under.est num.covering num.over.est


## [1,] 21 948 31

We see that close to 95% of the time, the interval covers the “true” value of 𝑝 = 0.3 .

31/37
Properties of the (Wilson) confidence interval
· Under repeated sampling from a 0-1 box, the 95% Wilson confidence interval
covers the “true” proportion 𝑝 in (approx.) 95% of samples.
· This is a long-run property of the procedure.
· For a single data set, you don’t know if it has covered the true value or not.
- You just know that the procedure you have used is 95% reliable in the long
run.
· Note that the interval is not (in general) symmetric about the observed sample
proportion 𝑥¯ .
- The midpoint of the interval is somewhere between 𝑥¯ and 0.5.
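
· For instance, with 𝑥¯ = 14/49 ≈ 0.2857 the 95% Wilson interval was (0.178, 0.424), whose midpoint (0.178 + 0.424)/2 ≈ 0.301 lies a little above 𝑥¯, towards 0.5.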

32/37
Different confidence levels
· We can change the confidence level by replacing 1.96 with another value.

· E.g., for 99% we should replace 1.96 with

qnorm(0.995)

## [1] 2.575829

(which gives 0.5% in the upper tail under the standard normal curve).
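
· In general, for a confidence level of 100 ⋅ 𝛾% the multiplier is qnorm(1 − (1 − 𝛾)/2); a quick sketch:

conf = c(0.90, 0.95, 0.99)
qnorm(1 - (1 - conf) / 2)    # approximately 1.64, 1.96, 2.58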

33/37
Changing confidence level using
binom.confint()
· Using binom.confint() we simply set the conf.level= argument to the
desired level:

binom.confint(14, 49, conf.level = 0.99, method = "wilson")

## method x n mean lower upper


## 1 wilson 14 49 0.2857143 0.1531828 0.4693562

34/37
Sanity check
· This shows us the “extreme” values of 𝑝 for which 𝑥¯ = 2/7 ≈ 0.285 still falls in the
99% prediction interval: 𝑝 = 0.153 and 𝑝 = 0.469.
· We can check this to be sure:
- For 𝑝 = 0.153 the 99% prediction interval is:
0.153 ± 2.576 ⋅ √(0.153 × 0.847/49) = (0.021, 0.285)

- For 𝑝 = 0.469 the 99% prediction interval is:
0.469 ± 2.576 ⋅ √(0.469 × 0.531/49) = (0.285, 0.653)

35/37
Example
· The file march2023.csv has daily weather observations from the Canterbury
Racecourse weather station for March 2023.

mar.2023 = read.csv("march2023.csv", skip = 5)


summary(mar.2023)

## X Date Minimum.temperature..degC.
## Mode:logical Length:31 Min. :11.60
## NA's:31 Class :character 1st Qu.:16.20
## Mode :character Median :17.20
## Mean :16.89
## 3rd Qu.:18.20
## Max. :21.60
## Maximum.temperature..degC. Rainfall..mm. Evaporation..mm. Sunshine..hours.
## Min. :22.70 Min. : 0.000 Mode:logical Mode:logical
## 1st Qu.:25.00 1st Qu.: 0.000 NA's:31 NA's:31
## Median :26.40 Median : 0.000
## Mean :27.77 Mean : 2.058
## 3rd Qu.:29.50 3rd Qu.: 1.700
## Max. :38.10 Max. :31.400
## Direction.of.maximum.wind.gust. Speed.of.maximum.wind.gust..km.h.
## Length:31 Min. :24.00
## Class :character 1st Qu.:31.00
## Mode :character Median :37.00
## Mean :38.23
## 3rd Qu.:46.00
## Max. :57.00
## Time.of.maximum.wind.gust X9am.Temperature..degC. X9am.relative.humidity....
## Length:31 Min. :17.20 Min. : 42.00

36/37
Rainfall
mar.2023$Rain

## [1] 0.0 0.4 0.0 3.2 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 2.0 31.4
## [16] 0.0 0.0 0.0 0.0 0.0 2.6 0.0 0.2 4.0 0.0 1.4 8.8 0.4 3.8 0.4
## [31] 0.0

· What proportion of days in March have rain?


· Suppose we can model the presence or absence of rain as being like a random
sample from a 0-1 box with an unknown proportion 𝑝 of 1s.
· What is a 95% Wilson confidence interval for 𝑝 ?

x = sum(mar.2023$Rain > 0)    # number of days in March with rain (14 here)
binom.confint(x, 31, method = "wilson")

## method x n mean lower upper


## 1 wilson 14 31 0.4516129 0.2916174 0.6222783

· The data is thus consistent with the “true” 𝑝 being anywhere in the range
(0.29, 0.62).
37/37
