Unit 2
2. Foundations of Inference
Goals
In this unit, we introduce a mathematical formalization of statistical
modeling in order to make principled sense of the trinity of statistical
inference. We will make sense of the following statements:
1. Estimation: an estimator produces only one number for the unknown parameter.
2. Confidence intervals: "error bars" around that number, whose size depends on the confidence level.
3. Hypothesis testing: a yes/no answer about the unknown parameter.
The rationale behind statistical modeling
- Let $X_1, \dots, X_n$ be $n$ independent copies of $X$.
- The goal of statistics is to learn the distribution of $X$.
- If $X \in \{0, 1\}$, easy! It's $\mathrm{Ber}(p)$ and we only have to
  learn the parameter $p = \mathbb{P}(X = 1)$.
- If $X \in \{1, \dots, 7\}$, we have to learn the whole table

      x            1    2    3    4    5    6    7
      IP(X = x)    p1   p2   p3   p4   p5   p6   1 - (p1 + ... + p6)

  where the last entry is determined since the probabilities sum to 1.
  That's 6 parameters to learn.
- Or we could assume that $X - 1 \sim \mathrm{Poiss}(\lambda)$. That's 1
  parameter to learn!
Statistical model
Formal definition
A statistical model is a pair $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$, where:
- $E$ is called the sample space;
- $(\mathbb{P}_\theta)_{\theta \in \Theta}$ is a family of probability measures on $E$;
- $\Theta$ is the parameter set.
Examples:
2. If $X_1, \dots, X_n \stackrel{iid}{\sim} \mathrm{Poiss}(\lambda)$ for some unknown
   $\lambda > 0$: $\big(\mathbb{N}, (\mathrm{Poiss}(\lambda))_{\lambda > 0}\big)$.
3. If $X_1, \dots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$ for some unknown
   $\mu \in \mathbb{R}$ and $\sigma^2 > 0$:
   $\big(\mathbb{R}, (N(\mu, \sigma^2))_{(\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)}\big)$.
4. If $X_1, \dots, X_n \stackrel{iid}{\sim} N_d(\mu, I_d)$ for some unknown
   $\mu \in \mathbb{R}^d$: $\big(\mathbb{R}^d, (N_d(\mu, I_d))_{\mu \in \mathbb{R}^d}\big)$.
Examples of nonparametric models
E.g., $E = \mathbb{R}$ and $\Theta$ is the set of all probability densities on
$\mathbb{R}$ that increase on $(-\infty, a)$ and then decrease on $(a, \infty)$
for some $a > 0$ (unimodal densities).
Further examples
Sometimes we do not have simple notation to write $(\mathbb{P}_\theta)_{\theta \in \Theta}$,
e.g., $(\mathrm{Ber}(p))_{p \in (0,1)}$, and we have to be more explicit:
1. Linear regression model: $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathbb{R}^d \times \mathbb{R}$
   are i.i.d. from the linear regression model $Y_i = X_i^\top \beta + \varepsilon_i$,
   $\varepsilon_i \stackrel{iid}{\sim} N(0, 1)$, for an unknown $\beta \in \mathbb{R}^d$,
   with $X_i \sim N_d(0, I_d)$ independent of $\varepsilon_i$. Here
   $E = \mathbb{R}^d \times \mathbb{R}$ and $\Theta = \mathbb{R}^d$.
Examples
2. If $X_i = \mathbb{1}\{Y_i \ge 0\}$ (indicator function), where
   $Y_1, \dots, Y_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$, for some unknown
   $\mu \in \mathbb{R}$ and $\sigma^2 > 0$, are unobserved: $\mu$ and $\sigma^2$
   are not identifiable (but $\theta = \mu/\sigma$ is).
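To see why, note that with $\Phi$ the standard Gaussian CDF,
$$\mathbb{P}(X_i = 1) = \mathbb{P}(Y_i \ge 0)
  = \mathbb{P}\Big(\frac{Y_i - \mu}{\sigma} \ge -\frac{\mu}{\sigma}\Big)
  = \Phi(\mu/\sigma),$$
so the distribution of the observed $X_i$ depends on $(\mu, \sigma)$ only
through the ratio $\mu/\sigma$: two pairs with the same ratio are
indistinguishable from the data.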
Exercises
a) Which of the following is a statistical model?
   1. $\big(\{1\}, (\mathrm{Ber}(p))_{p \in (0,1)}\big)$
   2. $\big(\{0, 1\}, (\mathrm{Ber}(p))_{p \in (0.2, 0.4)}\big)$
   3. Both 1 and 2
   4. None of the above
b) Let $X_1, \dots, X_n \stackrel{iid}{\sim} U([0, a])$ for some unknown $a > 0$. Which
   one of the following is the associated statistical model?
   1. $\big([0, a], (U([0, a]))_{a > 0}\big)$
   2. $\big(\mathbb{R}_+, (U([0, a]))_{a > 0}\big)$
   3. $\big(\mathbb{R}, (U([0, a]))_{a > 0}\big)$
   4. None of the above
Exercises
c) Let $X_i = Y_i^2$, where $Y_1, \dots, Y_n \stackrel{iid}{\sim} U([0, a])$, for some
   unknown $a$, are unobserved. Is $a$ identifiable?
   1. Yes
   2. No
d) Let $X_i = \mathbb{1}\{Y_i \le a/2\}$, where $Y_1, \dots, Y_n \stackrel{iid}{\sim} U([0, a])$,
   for some unknown $a$, are unobserved. Is $a$ identifiable?
   1. Yes
   2. No
Estimation
Parameter estimation
Definitions
- Statistic: any measurable function of the sample, e.g.,
  $\bar X_n$, $\max_i X_i$, $X_1 + \log(1 + |X_n|)$, the sample variance, etc.
- Estimator of $\theta$: any statistic whose expression does not depend on $\theta$.
- An estimator $\hat\theta_n$ of $\theta$ is weakly (resp. strongly) consistent if
  $$\hat\theta_n \xrightarrow[n \to \infty]{\mathbb{P}\ (\text{resp. a.s.})} \theta \qquad (\text{w.r.t. } \mathbb{P}_\theta).$$
- Bias of an estimator: $\mathrm{bias}(\hat\theta_n) = \mathbb{E}[\hat\theta_n] - \theta$.
  If $\mathrm{bias}(\hat\theta_n) = 0$, we say that $\hat\theta_n$ is unbiased.
- Example: assume that $X_1, \dots, X_n \stackrel{iid}{\sim} \mathrm{Ber}(p)$ and consider
  the following estimators:
  - $\hat p_n = X_1$: $\mathrm{bias}(\hat p_n) = \mathbb{E}[X_1] - p = 0$.
  - $\hat p_n = \dfrac{X_1 + X_2}{2}$: $\mathrm{bias}(\hat p_n) = 0$.
  - $\hat p_n = \mathbb{1}\{X_1 = 1, X_2 = 2\}$: since $X_2 \in \{0, 1\}$, this estimator
    is identically $0$, so $\mathrm{bias}(\hat p_n) = -p$.
Variance of an estimator
For the same three estimators:
- $\hat p_n = X_1$: $\mathrm{Var}(\hat p_n) = p(1 - p)$.
- $\hat p_n = \dfrac{X_1 + X_2}{2}$: $\mathrm{Var}(\hat p_n) = \dfrac{p(1 - p)}{2}$.
- $\hat p_n = \mathbb{1}\{X_1 = 1, X_2 = 2\}$: $\mathrm{Var}(\hat p_n) = 0$; low variance
  alone does not make a good estimator.
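As a numerical check, here is a minimal Python sketch (numpy only, with the
arbitrary choice $p = 0.3$) that approximates the bias and variance of the
three estimators by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3          # arbitrary illustrative value of the unknown parameter
n, reps = 10, 100_000

# reps independent samples of size n from Ber(p)
X = rng.binomial(1, p, size=(reps, n))

estimators = {
    "X1":            X[:, 0].astype(float),
    "(X1 + X2)/2":   (X[:, 0] + X[:, 1]) / 2,
    "1{X1=1,X2=2}":  ((X[:, 0] == 1) & (X[:, 1] == 2)).astype(float),
}

for name, est in estimators.items():
    print(f"{name:14s} bias ~ {est.mean() - p:+.4f}   variance ~ {est.var():.4f}")
```

The output matches the formulas above: biases near $0$, $0$ and $-0.3$, and
variances near $0.21$, $0.105$ and $0$.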
Quadratic risk
- Quadratic risk of $\hat\theta_n$: $R(\hat\theta_n) = \mathbb{E}\big[(\hat\theta_n - \theta)^2\big]$.
- Low quadratic risk means that both bias and variance are small:
  $$\text{quadratic risk} = \mathrm{bias}^2 + \text{variance}.$$
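The decomposition follows by adding and subtracting $\mathbb{E}[\hat\theta_n]$:
$$\mathbb{E}\big[(\hat\theta_n - \theta)^2\big]
= \mathbb{E}\big[(\hat\theta_n - \mathbb{E}[\hat\theta_n])^2\big]
+ \big(\mathbb{E}[\hat\theta_n] - \theta\big)^2
= \mathrm{Var}(\hat\theta_n) + \mathrm{bias}(\hat\theta_n)^2,$$
where the cross term vanishes because $\mathbb{E}\big[\hat\theta_n - \mathbb{E}[\hat\theta_n]\big] = 0$.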
Exercises
b) Is $\bar X_n - \frac{1}{2}$ an unbiased estimator for $a$?
c) Find the variance of $\bar X_n - \frac{1}{2}$.
d) Find the quadratic risk of $\bar X_n - \frac{1}{2}$.
Confidence intervals
Definition: a confidence interval of level $1 - \alpha$ is a random interval $I$,
depending on the sample but not on $\theta$, such that
$$\mathbb{P}_\theta[I \ni \theta] \ge 1 - \alpha, \qquad \forall \theta \in \Theta.$$
(The notation $I \ni \theta$ means that $I$ contains $\theta$; it emphasizes the
randomness of $I$, but we can equivalently write $\theta \in I$.)
A confidence interval for the kiss example
- Recall that we observe $R_1, \dots, R_n \stackrel{iid}{\sim} \mathrm{Ber}(p)$ for some
  unknown $p \in (0, 1)$.
- The CLT yields (with $q_{\alpha/2}$ the $(1 - \alpha/2)$-quantile of $N(0,1)$)
  $$\lim_{n \to \infty} \mathbb{P}\left(\left[\bar R_n - \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}},\;
  \bar R_n + \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}}\right] \ni p\right) = 1 - \alpha,$$
  but this interval still depends on the unknown $p$.
- Solution 1 (conservative bound): $p(1-p) \le 1/4$ for all $p \in (0,1)$, so we
  can replace $\sqrt{p(1-p)}$ by $1/2$. Indeed, the resulting interval
  $I_{\mathrm{conserv}}$ satisfies
  $$\lim_{n \to \infty} \mathbb{P}(I_{\mathrm{conserv}} \ni p) \ge 1 - \alpha.$$
Solution 2: Solving the (quadratic) equation for p
- We have the system of two inequalities in $p$:
  $$\bar R_n - \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}} \le p \le \bar R_n + \frac{q_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}}.$$
- Each is a quadratic inequality in $p$ of the form
  $$(p - \bar R_n)^2 \le \frac{q_{\alpha/2}^2\, p(1-p)}{n}.$$
  Expanding, we need to find the roots $p_1 < p_2$ of
  $$\left(1 + \frac{q_{\alpha/2}^2}{n}\right) p^2 - \left(2\bar R_n + \frac{q_{\alpha/2}^2}{n}\right) p + \bar R_n^2 = 0.$$
- This leads to a new confidence interval $I_{\mathrm{solve}} = [p_1, p_2]$ such that
  $$\lim_{n \to \infty} \mathbb{P}(I_{\mathrm{solve}} \ni p) = 1 - \alpha$$
  (it's complicated to write in a generic way, so let us wait until we have
  values for $n$, $\alpha$ and $\bar R_n$ to plug in).
Solution 3: Plug-in
- Recall that by the LLN, $\hat p = \bar R_n \xrightarrow[n \to \infty]{\mathbb{P},\ \text{a.s.}} p$.
- So by Slutsky, we also have
  $$\sqrt{n}\, \frac{\bar R_n - p}{\sqrt{\hat p(1 - \hat p)}} \xrightarrow[n \to \infty]{(d)} N(0, 1).$$
- This yields
  $$I_{\text{plug-in}} = \left[\hat p - \frac{q_{\alpha/2}\sqrt{\hat p(1-\hat p)}}{\sqrt{n}},\;
  \hat p + \frac{q_{\alpha/2}\sqrt{\hat p(1-\hat p)}}{\sqrt{n}}\right]$$
  such that
  $$\lim_{n \to \infty} \mathbb{P}(I_{\text{plug-in}} \ni p) = 1 - \alpha.$$
95% asymptotic CI for the kiss example
Recall that in the kiss example we had $n = 124$ and $\bar R_n = 0.645$.
Assume $\alpha = 5\%$, so $q_{\alpha/2} = 1.96$.
For $I_{\mathrm{solve}}$, we have to find the roots of
$$1.031\, p^2 - 1.321\, p + 0.416 = 0 \;\Rightarrow\; p_1 \approx 0.56,\ p_2 \approx 0.72,$$
so $I_{\mathrm{solve}} \approx [0.56, 0.72]$. Note that rounding the coefficients
too aggressively shifts the roots noticeably here.
(See R. Newcombe (1998), "Two-Sided Confidence Intervals for the Single
Proportion: Comparison of Seven Methods".)
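As a check, a short Python sketch (numpy only) that computes all three
intervals from these numbers; the conservative and plug-in values are not
printed on the slide, so they are computed here for comparison:

```python
import numpy as np

n, p_hat, q = 124, 0.645, 1.96   # kiss example: n, R_bar_n, q_{alpha/2}

# Solution 1: conservative bound, sqrt(p(1-p)) <= 1/2
half = q / (2 * np.sqrt(n))
print("I_conserv :", (round(p_hat - half, 3), round(p_hat + half, 3)))   # (0.557, 0.733)

# Solution 2: roots of (1 + q^2/n) p^2 - (2 p_hat + q^2/n) p + p_hat^2 = 0
a, b, c = 1 + q**2 / n, -(2 * p_hat + q**2 / n), p_hat**2
print("I_solve   :", tuple(np.round(np.sort(np.roots([a, b, c])), 3)))   # (0.558, 0.724)

# Solution 3: plug-in, replace p(1-p) by p_hat(1-p_hat)
half = q * np.sqrt(p_hat * (1 - p_hat) / n)
print("I_plugin  :", (round(p_hat - half, 3), round(p_hat + half, 3)))   # (0.561, 0.729)
```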
Exercises
d) If $[0.34, 0.57]$ is a 95% confidence interval for an unknown
   proportion $p$, then the probability that $p$ is in this interval is
   1. 0.025
   2. 0.05
   3. 0.95
   4. None of the above
Statistical problem
Discussion of the modeling assumptions
Estimator
- We model the observations $T_1, \dots, T_n$ as i.i.d. $\mathrm{Exp}(\lambda)$ for
  some unknown $\lambda > 0$.
- Density of $T_1$:
  $$f(t) = \lambda e^{-\lambda t}, \qquad \forall t \ge 0.$$
- $\mathbb{E}[T_1] = \dfrac{1}{\lambda}$.
- Hence, a natural estimate of $\dfrac{1}{\lambda}$ is
  $$\bar T_n := \frac{1}{n} \sum_{i=1}^{n} T_i.$$
- A natural estimator of $\lambda$ is
  $$\hat\lambda := \frac{1}{\bar T_n}.$$
First properties
- By the LLN,
  $$\bar T_n \xrightarrow[n \to \infty]{\text{a.s.}/\mathbb{P}} \frac{1}{\lambda}.$$
- Hence,
  $$\hat\lambda \xrightarrow[n \to \infty]{\text{a.s.}/\mathbb{P}} \lambda.$$
- By the CLT,
  $$\sqrt{n}\left(\bar T_n - \frac{1}{\lambda}\right) \xrightarrow[n \to \infty]{(d)} N\!\left(0, \frac{1}{\lambda^2}\right).$$
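A quick Monte Carlo sanity check of this CLT scaling, as a Python sketch with
the arbitrary choice $\lambda = 2$:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 200, 20_000

# reps sample means of n i.i.d. Exp(lambda) variables (scale = 1/lambda)
T_bar = rng.exponential(1 / lam, size=(reps, n)).mean(axis=1)

# sqrt(n) * (T_bar - 1/lambda) should be approximately N(0, 1/lambda^2)
Z = np.sqrt(n) * (T_bar - 1 / lam)
print("empirical std ~", round(Z.std(), 3), "  target 1/lambda =", 1 / lam)
```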
The Delta method
Let $(Z_n)$ be a sequence of random variables such that
$$\sqrt{n}(Z_n - \theta) \xrightarrow[n \to \infty]{(d)} N(0, \sigma^2)$$
for some $\theta \in \mathbb{R}$ and $\sigma^2 > 0$. Then, for any function $g$ that is
continuously differentiable at $\theta$,
$$\sqrt{n}\big(g(Z_n) - g(\theta)\big) \xrightarrow[n \to \infty]{(d)} N\big(0,\; g'(\theta)^2 \sigma^2\big).$$
Consequence of the Delta method
- Applying the Delta method with $g(x) = 1/x$:
  $$\sqrt{n}\big(\hat\lambda - \lambda\big) \xrightarrow[n \to \infty]{(d)} N(0, \lambda^2).$$
- Hence, for $\alpha \in (0, 1)$,
  $$\lim_{n \to \infty} \mathbb{P}\left(|\hat\lambda - \lambda| \le \frac{q_{\alpha/2}\, \lambda}{\sqrt{n}}\right) = 1 - \alpha.$$
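The asymptotic variance $\lambda^2$ comes out of the Delta-method computation,
with $Z_n = \bar T_n$, $\theta = 1/\lambda$, $\sigma^2 = 1/\lambda^2$ and $g(x) = 1/x$:
$$g'(x) = -\frac{1}{x^2}, \qquad
g'\!\Big(\frac{1}{\lambda}\Big)^2 \cdot \frac{1}{\lambda^2}
= \lambda^4 \cdot \frac{1}{\lambda^2} = \lambda^2.$$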
Three solutions
1. The conservative bound: here we have no a priori way to bound $\lambda$, so
   this approach does not apply.
2. We can solve for $\lambda$:
   $$|\hat\lambda - \lambda| \le \frac{q_{\alpha/2}\, \lambda}{\sqrt{n}}
   \iff \lambda\left(1 - \frac{q_{\alpha/2}}{\sqrt{n}}\right) \le \hat\lambda \le \lambda\left(1 + \frac{q_{\alpha/2}}{\sqrt{n}}\right)
   \iff \hat\lambda\left(1 + \frac{q_{\alpha/2}}{\sqrt{n}}\right)^{-1} \le \lambda \le \hat\lambda\left(1 - \frac{q_{\alpha/2}}{\sqrt{n}}\right)^{-1}.$$
   It yields
   $$I_{\mathrm{solve}} = \left[\hat\lambda\left(1 + \frac{q_{\alpha/2}}{\sqrt{n}}\right)^{-1},\;
   \hat\lambda\left(1 - \frac{q_{\alpha/2}}{\sqrt{n}}\right)^{-1}\right].$$
3. Plug-in ($\lambda \approx \hat\lambda$ in the width) yields
   $$I_{\text{plug-in}} = \left[\hat\lambda\left(1 - \frac{q_{\alpha/2}}{\sqrt{n}}\right),\;
   \hat\lambda\left(1 + \frac{q_{\alpha/2}}{\sqrt{n}}\right)\right].$$
95% asymptotic CI for the T example
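A minimal Python sketch of these two intervals, assuming the hypothetical
values $n = 100$ and $\hat\lambda = 0.2$ (illustrative only, not the course's data):

```python
import math

n, lam_hat, q = 100, 0.2, 1.96   # hypothetical n and lambda_hat; q_{alpha/2} for 95%
r = q / math.sqrt(n)             # relative half-width q_{alpha/2} / sqrt(n)

I_solve  = (lam_hat / (1 + r), lam_hat / (1 - r))
I_plugin = (lam_hat * (1 - r), lam_hat * (1 + r))

print("I_solve  = (%.4f, %.4f)" % I_solve)    # -> (0.1672, 0.2488)
print("I_plugin = (%.4f, %.4f)" % I_plugin)   # -> (0.1608, 0.2392)
```

For large $n$ the two intervals nearly coincide, since $(1 \pm r)^{-1} \approx 1 \mp r$.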
Meaning of a confidence interval
Take $I_{\text{plug-in}} = [0.12, 0.20]$, for example. What is the meaning of
"$I_{\text{plug-in}}$ is a confidence interval of asymptotic level 95%"?
[Figure: repeated-sampling illustration of confidence intervals; credit: openintro.org.]
(The frequentist approach is often contrasted with the Bayesian approach.)
Hypothesis testing
How to board a plane?
What is the fastest method to board a plane: R2F or WilMA?
The data
                     R2F    WilMA
  Average (mins)     24.2   15.9
  Std. dev. (mins)    2.1    1.3
  Sample size          72     56
These summary numbers are the sufficient statistics.
Model and Assumptions
- Let $X$ (resp. $Y$) denote the boarding time of a random
  JetBlue (resp. United) flight.
- We assume that $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$.
- Let $n$ and $m$ denote the JetBlue and United sample sizes, respectively.
- We have $X_1, \dots, X_n$ independent copies of $X$ and $Y_1, \dots, Y_m$
  independent copies of $Y$.
- We further assume that the two samples are independent.
We want to answer the question:
  Is $\mu_1 = \mu_2$, or is $\mu_1 > \mu_2$?
Better heuristic:
  "If $\bar X_n$ exceeds $\bar Y_m$ by a sufficiently large margin,
  then $\mu_1 > \mu_2$."
To make this intuition precise, we need to take the size of the random
fluctuations of $\bar X_n$ and $\bar Y_m$ into account, as in the sketch below.
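A minimal Python sketch, assuming a Welch-type two-sample z-statistic (not
spelled out on this slide) computed from the summary table:

```python
import math

# Summary statistics from the boarding-time table
mean_r2f, sd_r2f, n_r2f = 24.2, 2.1, 72   # R2F
mean_wil, sd_wil, n_wil = 15.9, 1.3, 56   # WilMA

# Standard error of the difference of sample means (independent samples)
se = math.sqrt(sd_r2f**2 / n_r2f + sd_wil**2 / n_wil)

# Welch-type z-statistic for H0: mu1 = mu2 vs H1: mu1 > mu2
z = (mean_r2f - mean_wil) / se
print(f"z = {z:.1f}")   # ~27.4: far beyond any reasonable threshold,
                        # so the data strongly support mu1 > mu2
```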
Waiting time in the ER
- The average waiting time in the Emergency Room (ER) in the
  US is 30 minutes, according to the CDC.
- Some patients claim that the new Princeton-Plainsboro
  hospital has a longer waiting time. Is it true?
- Here, we collect only one sample: $X_1, \dots, X_n$ (waiting times in
  minutes for $n$ random patients) with unknown expected value
  $\mathbb{E}[X_1] = \mu$.
- We want to know whether $\mu > 30$.
Heuristic
Heuristic:
  "If $\bar X_n + (\text{a margin for the random fluctuations}) < 30$,
  then conclude that $\mu \le 30$ (the claim of a longer waiting time is
  not supported)."
Example 1
According to a survey conducted in 2017 on 4,971 randomly
sampled Americans, 32% report getting at least some of their news
on Youtube. Can we conclude that at most a third of all
Americans get at least some of their news on Youtube?
- $n = 4{,}971$, $X_1, \dots, X_n \stackrel{iid}{\sim} \mathrm{Ber}(p)$;
- $\bar X_n = 0.32$;
- If it were true that $p = 0.33$: by the CLT,
  $$\sqrt{n}\, \frac{\bar X_n - 0.33}{\sqrt{0.33(1 - 0.33)}} \approx N(0, 1).$$
- Here,
  $$\sqrt{n}\, \frac{\bar X_n - 0.33}{\sqrt{0.33(1 - 0.33)}} \approx -1.50.$$
- Conclusion: $-1.50$ is a plausible realization of a standard Gaussian, so
  the data do not contradict $p = 0.33$ (formal test below).
The Standard Gaussian distribution
[Figure: standard Gaussian density; about 68% of the mass lies within one
standard deviation of the mean, 95% within two, and 99.7% within three.]
Example 2
A coin is tossed 30 times, and Heads are obtained 13
times. Can we conclude that the coin is significantly unfair?
- $n = 30$, $X_1, \dots, X_n \stackrel{iid}{\sim} \mathrm{Ber}(p)$, $\bar X_n = 13/30 \approx 0.43$;
- If the coin were fair ($p = 0.5$): by the CLT,
  $$\sqrt{n}\, \frac{\bar X_n - 0.5}{\sqrt{0.5(1 - 0.5)}} \approx N(0, 1).$$
- Our data give
  $$\sqrt{n}\, \frac{\bar X_n - 0.5}{\sqrt{0.5(1 - 0.5)}} \approx -0.73.$$
- The number $-0.73$ is a perfectly plausible realization of a random
  variable $Z \sim N(0, 1)$.
- Conclusion: the data do not provide evidence that the coin is unfair.
Statistical formulation
- Consider a sample $X_1, \dots, X_n$ of i.i.d. random variables and a
  statistical model $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$.
- We split the parameter set into disjoint pieces $\Theta = \Theta_0 \sqcup \Theta_1$
  and test the hypotheses $H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$.
- A test is a statistic $\psi \in \{0, 1\}$, typically of the form
  $$\psi = \mathbb{1}\{T_n > C\}, \quad \text{for some } C > 0,$$
  where $T_n$ is a test statistic: $\psi = 0$ means $H_0$ is not rejected,
  $\psi = 1$ means $H_0$ is rejected.
Errors
- Rejection region of a test $\psi$:
  $$R_\psi = \{x \in E^n : \psi(x) = 1\}.$$
- Type 1 error of a test $\psi$ (rejecting $H_0$ when it is actually true):
  $$\alpha_\psi : \Theta_0 \to \mathbb{R}, \qquad \theta \mapsto \mathbb{P}_\theta[\psi = 1].$$
- Type 2 error of a test $\psi$ (not rejecting $H_0$ although $H_1$ is actually true):
  $$\beta_\psi : \Theta_1 \to \mathbb{R}, \qquad \theta \mapsto \mathbb{P}_\theta[\psi = 0].$$
- Power of a test $\psi$:
  $$\pi_\psi = \inf_{\theta \in \Theta_1} \big(1 - \beta_\psi(\theta)\big).$$
Level, test statistic and rejection region
- A test $\psi$ has level $\alpha$ if
  $$\alpha_\psi(\theta) \le \alpha, \qquad \forall \theta \in \Theta_0.$$
- A test $\psi$ has asymptotic level $\alpha$ if
  $$\lim_{n \to \infty} \alpha_\psi(\theta) \le \alpha, \qquad \forall \theta \in \Theta_0.$$
One-sided vs two-sided tests
- If $H_1 : \theta \ne \theta_0$: two-sided test.
- If $H_1 : \theta > \theta_0$ or $H_1 : \theta < \theta_0$: one-sided test.
Examples:
- Boarding method: $H_1 : \mu_1 > \mu_2$, one-sided.
- Waiting time in the ER: $H_1 : \mu > 30$, one-sided.
- The kiss example: $H_1 : p \ne 1/2$, two-sided.
- Fair coin: $H_1 : p \ne 1/2$, two-sided.
Bernoulli experiment
- We want to test:
  $$H_0 : p = 1/2 \quad \text{vs.} \quad H_1 : p \ne 1/2.$$
- Let $T_n = \sqrt{n}\, \dfrac{\hat p_n - 0.5}{\sqrt{0.5(1 - 0.5)}}$, where $\hat p_n$ is the MLE.
- The test of asymptotic level $\alpha$ rejects when $|T_n|$ is large:
  $$\psi_\alpha = \mathbb{1}\{|T_n| > q_{\alpha/2}\}.$$
Examples
For $\alpha = 5\%$: $q_{\alpha/2} = 1.96$ and $q_\alpha = 1.645$.
Fair coin
Since $|T_n| \approx 0.73 < 1.96$, $H_0$ is not rejected at the asymptotic
level 5% by the test $\psi_{5\%}$.
News on Youtube
$H_0 : p \ge 0.33$ vs. $H_1 : p < 0.33$. This is a one-sided test.
We reject if
$$\sqrt{n}\, \frac{\hat p_n - p}{\sqrt{p(1 - p)}} < -c.$$
But what value for $p \in \Theta_0 = [0.33, 1)$ should we choose?
The type 1 error is the function $p \mapsto \mathbb{P}_p[\psi = 1]$. To control the
level we need to find the $p$ that maximizes it over $\Theta_0$
$\rightarrow$ no need for computations, it's clearly $p = 0.33$.
Since $T_n \approx -1.50 > -q_\alpha = -1.645$, $H_0$ is not rejected at the
asymptotic level 5% by the test $\psi_{5\%}$.
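A small Python sketch reproducing both test decisions from the numbers above:

```python
import math

def z_stat(p_hat, p0, n):
    """Standardized statistic sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0))."""
    return math.sqrt(n) * (p_hat - p0) / math.sqrt(p0 * (1 - p0))

# Fair coin: two-sided test, H0: p = 0.5, reject if |T_n| > 1.96
T_coin = z_stat(13 / 30, 0.5, 30)
print(f"coin:    T_n = {T_coin:+.2f}, reject H0: {abs(T_coin) > 1.96}")

# News on Youtube: one-sided test, H0: p >= 0.33, reject if T_n < -1.645
T_news = z_stat(0.32, 0.33, 4971)
print(f"youtube: T_n = {T_news:+.2f}, reject H0: {T_news < -1.645}")
```

Both decisions come out False (do not reject), matching the slide.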
p-value
Definition
The (asymptotic) p-value of a test $\psi_\alpha$ is the smallest (asymptotic)
level $\alpha$ at which $\psi_\alpha$ rejects $H_0$. It is random: it depends on
the sample.
Golden rule
The smaller the p-value, the more confidently one can reject $H_0$.
(Golden rule from the textbook OpenIntro Statistics.)
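For the two running examples, the p-values can be read off the standard
Gaussian CDF $\Phi$:
$$\text{coin (two-sided)}: \; 2\big(1 - \Phi(0.73)\big) \approx 0.47, \qquad
\text{news (one-sided)}: \; \Phi(-1.50) \approx 0.067.$$
Neither is below $0.05$, matching the two "not rejected" conclusions above.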
Exercise: kiss
Exercise: Machine learning predicts breast cancer
Recap
- A statistical model is a pair of the form $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$,
  where $E$ is the sample space and $(\mathbb{P}_\theta)_{\theta \in \Theta}$ is a family of
  candidate probability distributions.
- A model can be well specified and identifiable.
- The trinity of statistical inference: estimation, confidence
  intervals and testing.
- Estimator: one value, whose performance can be measured by
  consistency, asymptotic normality, bias, variance and
  quadratic risk.
- Confidence intervals provide "error bars" around estimators.
  Their size depends on the confidence level.
- Hypothesis testing: we want a yes/no answer about an
  unknown parameter. Tests are characterized by hypotheses,
  level, power, test statistic and rejection region. Under the null
  hypothesis, the value of the unknown parameter becomes
  known (no need for plug-in).