
Statistics for Applications
Prof. Philippe Rigollet
SES # TOPICS

1-2 Introduction to Statistics (PDF)

3 Parametric Inference (PDF)

4-5 Maximum Likelihood Estimation (PDF)

6 The Method of Moments (PDF)

7-10 Parametric Hypothesis Testing (PDF)

11-12 Testing Goodness of Fit (PDF)

13-16 Regression (PDF - 1.2MB)

17-18 Bayesian Statistics (PDF)

19-20 Principal Component Analysis (PDF)

21-24 Generalized Linear Models (PDF)

https://ocw.mit.edu/courses/18-650-statistics-for-applications-fall-2016/

https://www.youtube.com/playlist?list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0
18.650
Statistics for Applications

Chapter 1: Introduction

1/43
Goals

Goals:
▶ To give you a solid introduction to the mathematical theory
behind statistical methods;
▶ To provide theoretical guarantees for the statistical methods
that you may use for certain applications.
At the end of this class, you will be able to
1. From a real-life situation, formulate a statistical problem in
mathematical terms
2. Select appropriate statistical methods for your problem
3. Understand the implications and limitations of various
methods

2/43
Instructors

▶ Instructor: Philippe Rigollet


Associate Prof. of Applied Mathematics; IDSS; MIT Center
for Statistics and Data Science.

▶ Teaching Assistant: Victor-Emmanuel Brunel


Instructor in Applied Mathematics; IDSS; MIT Center for
Statistics and Data Science.

3/43
Logistics

▶ Lectures: Tuesdays & Thursdays, 1:00-2:30pm
▶ Optional Recitation: TBD.
▶ Homework: weekly. Total 11, 10 best kept (30%).
▶ Midterm: Nov. 8, in class, 1 hour and 20 minutes (30%).
  Closed books, closed notes. One cheat sheet allowed.
▶ Final: TBD, 2 hours (40%). Open books, open notes.

4/43
Miscellaneous

▶ Prerequisites: Probability (18.600 or 6.041), Calculus 2,
  notions of linear algebra (matrices, vectors, multiplication,
  orthogonality, …)
▶ Reading: There is no required textbook
▶ Slides are posted on course website
https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides

▶ Videolectures: Each lecture is recorded and posted online.


Attendance is still recommended.

5/43
Why statistics?

6/43
Not only in the press

Hydrology: Netherlands, 10th century, building dams and dykes.
  Should be high enough for most floods; should not be too expensive (high).

Insurance: Given your driving record, car information, coverage:
  what is a fair premium?

Clinical trials: A drug is tested on 100 patients; 56 were cured and
  44 showed no improvement. Is the drug effective?

8/43
Randomness

What is common to all these examples?

RANDOMNESS

Associated questions:
▶ Notion of average (“fair premium”, …)
▶ Quantifying chance (“most of the floods”, …)
▶ Significance, variability, …

9/43
Probability
▶ Probability studies randomness (hence the prerequisite)
▶ Sometimes, the physical process is completely known: dice,
cards, roulette, fair coins, …

Examples

Rolling 1 die:
▶ Alice gets $1 if # of dots ≤ 3
▶ Bob gets $2 if # of dots ≥ 5
Who do you want to be: Alice or Bob?

Rolling 2 dice:
▶ Choose a number between 2 and 12
▶ Win $100 if you chose the sum of the 2 dice
Which number do you choose?

Well-known random process from physics: 1/6 chance for each side,
dice are independent. We can deduce the probability of outcomes,
and expected $ amounts. This is probability.

10/43
Statistics and modeling

▶ How about more complicated processes? We need to estimate
  parameters from data. This is statistics.
▶ Sometimes real randomness (random student, biased coin,
  measurement error, …)
▶ Sometimes a deterministic but too complex phenomenon:
  statistical modeling

    Complicated process “=” Simple process + random noise

▶ (Good) modeling consists in choosing a (plausible) simple
  process and noise distribution.

11/43
Statistics vs. probability

Probability: Previous studies showed that the drug was 80%
  effective. Then we can anticipate that, for a study on
  100 patients, on average 80 will be cured, and at least
  65 will be cured with 99.99% probability.

Statistics: Observe that 78/100 patients were cured. We (will
  be able to) conclude that we are 95% confident that,
  for other studies, the drug will be effective on between
  69.88% and 86.11% of patients.

13/43
18.650

What this course is about

▶ Understand the mathematics behind statistical methods
▶ Justify quantitative statements given modeling assumptions
▶ Describe interesting mathematics arising in statistics
▶ Provide a math toolbox to extend to other models.

What this course is not about

▶ Statistical thinking/modeling (applied stats, e.g. IDS.012)
▶ Implementation (computational stats, e.g. IDS.012)
▶ Laundry list of methods (boring stats, e.g. AP stats)

14/43
Let’s do some statistics

15/43
Heuristics (1)

“A neonatal right-side preference makes a surprising
romantic reappearance later in life.”

▶ Let p denote the proportion of couples that turn their head to
  the right when kissing.
▶ Let us design a statistical experiment and analyze its outcome.
▶ Observe n kissing couples and collect the value of each
  outcome (say 1 for RIGHT and 0 for LEFT);
▶ Estimate p with the proportion p̂ of RIGHT.
▶ Study: “Human behaviour: Adult persistence of head-turning
  asymmetry” (Nature, 2003): n = 124, 80 to the right, so

    p̂ = 80/124 = 64.5%

17/43
Heuristics (2)

Back to the data:

▶ 64.5% is much larger than 50%, so there seems to be a
  preference for turning right.
▶ What if our data was RIGHT, RIGHT, LEFT (n = 3)? That’s
  66.7% to the right. Even better?
▶ Intuitively, we need a large enough sample size n to make a
  call. How large?

We need mathematical modeling to understand
the accuracy of this procedure.

18/43
Heuristics (3)

Formally, this procedure consists of doing the following:

▶ For i = 1, . . . , n, define Ri = 1 if the ith couple turns
  RIGHT, Ri = 0 otherwise.
▶ The estimator of p is the sample average

    p̂ = R̄n = (1/n) ∑_{i=1}^n Ri .

What is the accuracy of this estimator?

In order to answer this question, we propose a statistical model
that describes/approximates well the experiment.

19/43
Heuristics (4)

Coming up with a model consists of making assumptions on the
observations Ri , i = 1, . . . , n, in order to draw statistical
conclusions. Here are the assumptions we make:

1. Each Ri is a random variable.

2. Each of the r.v. Ri is Bernoulli with parameter p.

3. R1 , . . . , Rn are mutually independent.

20/43
Heuristics (5)
Let us discuss these assumptions.

1. Randomness is a way of modeling lack of information; with
   perfect information about the conditions of kissing (including
   what goes on in the kissers’ minds), physics or sociology would
   allow us to predict the outcome.

2. Hence, the Ri ’s are necessarily Bernoulli r.v. since
   Ri ∈ {0, 1}. They could still have a different parameter
   Ri ∼ Ber(pi ) for each couple, but we don’t have enough
   information in the data to estimate the pi ’s accurately. So we
   simply assume that our observations come from the same
   process: pi = p for all i.

3. Independence is reasonable (people were observed at different
   locations and different times).

21/43
Two important tools: LLN & CLT

Let X, X1 , X2 , . . . , Xn be i.i.d. r.v., µ = IE[X] and σ² = V[X].

▶ Laws of large numbers (weak and strong):

    X̄n := (1/n) ∑_{i=1}^n Xi → µ  (in IP, a.s.) as n → ∞.

▶ Central limit theorem:

    √n (X̄n − µ)/σ → N (0, 1) in distribution, as n → ∞.

  (Equivalently, √n (X̄n − µ) → N (0, σ²) in distribution.)
22/43
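A quick numerical illustration of both tools (a sketch in Python; the Bernoulli parameter p = 0.3 and the sample size are arbitrary choices, not from the slides):

```python
import math
import random

random.seed(0)

# Simulate X_i ~ Ber(0.3): mu = p, sigma^2 = p(1 - p)
p = 0.3
n = 10_000
xs = [1 if random.random() < p else 0 for _ in range(n)]

# LLN: the sample average X_bar_n should be close to mu = p
x_bar = sum(xs) / n
print(x_bar)  # close to 0.3

# CLT: sqrt(n) (X_bar_n - mu) / sigma is approximately N(0, 1),
# so it should land within a few units of 0
sigma = math.sqrt(p * (1 - p))
z = math.sqrt(n) * (x_bar - p) / sigma
print(z)
```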
Consequences (1)

▶ The LLN’s tell us that

    R̄n → p  (in IP, a.s.) as n → ∞.

▶ Hence, when the size n of the experiment becomes large, R̄n
  is a good (say “consistent”) estimator of p.

▶ The CLT refines this by quantifying how good this estimate is.
23/43
Consequences (2)

Φ(x): cdf of N (0, 1);
Φn (x): cdf of √n (R̄n − p)/√(p(1 − p)).

CLT: Φn (x) ≈ Φ(x) when n becomes large. Hence, for all x > 0,

    IP[ |R̄n − p| ≥ x ] ≈ 2 ( 1 − Φ( x√n / √(p(1 − p)) ) ).

24/43
Consequences (3)

Consequences:

▶ Approximation of how R̄n concentrates around p;

▶ For a fixed α ∈ (0, 1), if qα/2 is the (1 − α/2)-quantile of
  N (0, 1), then with probability ≈ 1 − α (if n is large enough!),

    R̄n ∈ [ p − qα/2 √(p(1 − p))/√n , p + qα/2 √(p(1 − p))/√n ].
25/43
Consequences (4)
▶ Note that no matter the (unknown) value of p, p(1 − p) ≤ 1/4.

▶ Hence, roughly with probability at least 1 − α,

    R̄n ∈ [ p − qα/2 /(2√n) , p + qα/2 /(2√n) ].

▶ In other words, when n becomes large, the interval
  [ R̄n − qα/2 /(2√n) , R̄n + qα/2 /(2√n) ] contains p with
  probability ≥ 1 − α.

▶ This interval is called an asymptotic confidence interval for p.

▶ In the kiss example, we get

    0.645 ± 1.96/(2√124) = [0.56, 0.73].

  In the extreme (n = 3) case we would have [0.10, 1.23], but the
  CLT is not valid! Actually, we can make exact computations!
26/43
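The kiss-example interval can be reproduced in a few lines (using the slide's numbers n = 124, 80 RIGHT, and q_{α/2} = 1.96):

```python
import math

# Asymptotic 95% confidence interval for p using the conservative
# bound p(1 - p) <= 1/4: R_bar_n +/- q_{alpha/2} / (2 sqrt(n))
n = 124
p_hat = 80 / n          # = 0.645...
q = 1.96                # (1 - alpha/2)-quantile of N(0, 1), alpha = 0.05

half_width = q / (2 * math.sqrt(n))
ci = (p_hat - half_width, p_hat + half_width)
print(ci)  # approximately (0.56, 0.73), as on the slide
```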
Another useful tool: Hoeffding’s inequality
What if n is not so large?

Hoeffding’s inequality (i.i.d. case)
Let n be a positive integer and X, X1 , . . . , Xn be i.i.d. r.v. such
that X ∈ [a, b] a.s. (a < b are given numbers). Let µ = IE[X].
Then, for all ε > 0,

    IP[ |X̄n − µ| ≥ ε ] ≤ 2 exp( −2nε² / (b − a)² ).

Consequence:
▶ For α ∈ (0, 1), with probability ≥ 1 − α,

    R̄n − √(log(2/α)/(2n)) ≤ p ≤ R̄n + √(log(2/α)/(2n)).

▶ This holds even for small sample sizes n.

27/43
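A small comparison of the Hoeffding half-width with the CLT-based conservative half-width for the kiss data (a sketch; the choice α = 0.05 mirrors the slides):

```python
import math

# Hoeffding half-width sqrt(log(2/alpha) / (2n)) vs the CLT-based
# conservative half-width q_{alpha/2} / (2 sqrt(n)), for n = 124
n = 124
alpha = 0.05
q = 1.96

hoeffding = math.sqrt(math.log(2 / alpha) / (2 * n))
clt = q / (2 * math.sqrt(n))
print(hoeffding, clt)  # Hoeffding is wider: the price of being non-asymptotic
```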
Review of different types of convergence (1)

Let (Tn )n≥1 be a sequence of r.v. and T a r.v. (T may be
deterministic).

▶ Almost sure (a.s.) convergence:

    Tn → T a.s.  iff  IP[{ω : Tn (ω) → T (ω) as n → ∞}] = 1.

▶ Convergence in probability:

    Tn → T in IP  iff  IP[ |Tn − T | ≥ ε ] → 0 as n → ∞, ∀ε > 0.

28/43
Review of different types of convergence (2)

▶ Convergence in Lᵖ (p ≥ 1):

    Tn → T in Lᵖ  iff  IE[ |Tn − T |ᵖ ] → 0 as n → ∞.

▶ Convergence in distribution:

    Tn → T in distribution  iff  IP[Tn ≤ x] → IP[T ≤ x] as n → ∞,

  for all x ∈ IR at which the cdf of T is continuous.

Remark
These definitions extend to random vectors (i.e., random variables
in IRᵈ for some d ≥ 2).

29/43
Review of different types of convergence (3)

Important characterizations of convergence in distribution

The following propositions are equivalent:

(i) Tn → T in distribution;

(ii) IE[f (Tn )] → IE[f (T )], for all continuous and bounded
functions f ;

(iii) IE[e^{ixTn}] → IE[e^{ixT}], for all x ∈ IR.

30/43
Review of different types of convergence (4)

Important properties
▶ If (Tn )n≥1 converges a.s., then it also converges in probability,
  and the two limits are equal a.s.

▶ If (Tn )n≥1 converges in Lᵖ, then it also converges in Lᵍ for all
  q ≤ p and in probability, and the limits are equal a.s.

▶ If (Tn )n≥1 converges in probability, then it also converges in
  distribution.

▶ If f is a continuous function:

    Tn → T (a.s. / in IP / in distribution)
        ⇒ f (Tn ) → f (T ) (a.s. / in IP / in distribution).

31/43
Review of different types of convergence (6)

Limits and operations

One can add, multiply, ... limits almost surely and in probability. If
Un → U and Vn → V (a.s. or in IP), then:

▶ Un + Vn → U + V (a.s. / in IP),
▶ Un Vn → U V (a.s. / in IP),
▶ If in addition V ≠ 0 a.s., then Un /Vn → U/V (a.s. / in IP).

⚠ In general, these rules do not apply to convergence in
distribution unless the pair (Un , Vn ) converges in distribution to
(U, V ).

33/43
Another example (1)

▶ You observe the times between arrivals of the T at Kendall:
  T1 , . . . , Tn .

▶ You assume that these times are:
  ▶ Mutually independent
  ▶ Exponential random variables with common parameter λ > 0.

▶ You want to estimate the value of λ, based on the observed
  arrival times.

34/43
Another example (2)

Discussion of the assumptions:

▶ Mutual independence of T1 , . . . , Tn : plausible but not
  completely justified (often the case with independence).

▶ T1 , . . . , Tn are exponential r.v.: lack of memory of the
  exponential distribution:

    IP[T1 > t + s | T1 > t] = IP[T1 > s],  ∀s, t ≥ 0.

  Also, Ti > 0 almost surely!

▶ The exponential distributions of T1 , . . . , Tn have the same
  parameter: on average, all the same inter-arrival time. True
  only for a limited period (rush hour ≠ 11pm).

35/43
Another example (3)

▶ Density of T1 :

    f (t) = λ e^{−λt} ,  ∀t ≥ 0.

▶ IE[T1 ] = 1/λ.

▶ Hence, a natural estimate of 1/λ is

    T̄n := (1/n) ∑_{i=1}^n Ti .

▶ A natural estimator of λ is

    λ̂ := 1/T̄n .

36/43
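A simulation sketch of this estimator (the true λ = 2 and the sample size are arbitrary illustrative choices, not data from the slides):

```python
import random

random.seed(1)

# Simulate n inter-arrival times T_i ~ Exp(lambda) and form the
# natural estimator lambda_hat = 1 / T_bar_n
lam = 2.0
n = 100_000
ts = [random.expovariate(lam) for _ in range(n)]

t_bar = sum(ts) / n          # LLN: close to 1 / lambda = 0.5
lam_hat = 1 / t_bar          # close to lambda = 2
print(t_bar, lam_hat)
```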
Another example (4)

▶ By the LLN’s,

    T̄n → 1/λ  (a.s. / in IP) as n → ∞.

▶ Hence,

    λ̂ → λ  (a.s. / in IP) as n → ∞.

▶ By the CLT,

    √n (T̄n − 1/λ) → N (0, λ⁻²) in distribution.

▶ How does the CLT transfer to λ̂? How to find an asymptotic
  confidence interval for λ?

37/43
The Delta method

Let (Zn )n≥1 be a sequence of r.v. that satisfies

    √n (Zn − θ) → N (0, σ²) in distribution,

for some θ ∈ IR and σ² > 0 (the sequence (Zn )n≥1 is said to be
asymptotically normal around θ).

Let g : IR → IR be continuously differentiable at the point θ.
Then,
▶ (g(Zn ))n≥1 is also asymptotically normal;
▶ More precisely,

    √n (g(Zn ) − g(θ)) → N (0, g′(θ)² σ²) in distribution.

38/43
Consequence of the Delta method (1)

▶ √n (λ̂ − λ) → N (0, λ²) in distribution.

▶ Hence, for α ∈ (0, 1) and when n is large enough,

    |λ̂ − λ| ≤ qα/2 λ/√n.

▶ Can [ λ̂ − qα/2 λ/√n , λ̂ + qα/2 λ/√n ] be used as an asymptotic
  confidence interval for λ?

▶ No! It depends on λ...

39/43
Consequence of the Delta method (2)

Two ways to overcome this issue:

▶ In this case, we can solve for λ:

    |λ̂ − λ| ≤ qα/2 λ/√n
        ⟺ λ̂ ∈ [ λ (1 − qα/2 /√n) , λ (1 + qα/2 /√n) ]
        ⟺ λ ∈ [ λ̂ (1 + qα/2 /√n)⁻¹ , λ̂ (1 − qα/2 /√n)⁻¹ ].

  Hence, [ λ̂ (1 + qα/2 /√n)⁻¹ , λ̂ (1 − qα/2 /√n)⁻¹ ] is an asymptotic
  confidence interval for λ.

▶ A systematic way: Slutsky’s theorem.

40/43
Slutsky’s theorem
Slutsky’s theorem
Let (Xn ), (Yn ) be two sequences of r.v., such that:

(i) Xn → X in distribution;
(ii) Yn → c in probability,

where X is a r.v. and c is a given real number. Then,

    (Xn , Yn ) → (X, c) in distribution.

In particular,

    Xn + Yn → X + c in distribution,
    Xn Yn → cX in distribution,
    ...
41/43
Consequence of Slutsky’s theorem (1)
▶ Thanks to the Delta method, we know that

    √n (λ̂ − λ)/λ → N (0, 1) in distribution.

▶ By the weak LLN,

    λ̂ → λ in probability.

▶ Hence, by Slutsky’s theorem,

    √n (λ̂ − λ)/λ̂ → N (0, 1) in distribution.

▶ Another asymptotic confidence interval for λ is

    [ λ̂ − qα/2 λ̂/√n , λ̂ + qα/2 λ̂/√n ].

42/43
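Both intervals side by side, on illustrative numbers (n = 400 and λ̂ = 2.0 are assumptions for this sketch, not data from the slides):

```python
import math

# Two asymptotic 95% CIs for lambda from an exponential sample,
# on hypothetical values n = 400, lambda_hat = 2.0
n = 400
lam_hat = 2.0
q = 1.96

# (a) Solving |lam_hat - lam| <= q * lam / sqrt(n) for lam
ci_solve = (lam_hat / (1 + q / math.sqrt(n)),
            lam_hat / (1 - q / math.sqrt(n)))

# (b) Slutsky: plug lam_hat into the standard deviation
ci_slutsky = (lam_hat - q * lam_hat / math.sqrt(n),
              lam_hat + q * lam_hat / math.sqrt(n))

print(ci_solve)
print(ci_slutsky)  # the two intervals agree up to lower-order terms
```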
Consequence of Slutsky’s theorem (2)

Remark:

▶ In the first example (kisses), we used a problem-dependent
  trick: p(1 − p) ≤ 1/4.

▶ We could have used Slutsky’s theorem and obtained the
  asymptotic confidence interval

    [ R̄n − qα/2 √(R̄n (1 − R̄n ))/√n , R̄n + qα/2 √(R̄n (1 − R̄n ))/√n ].

43/43
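For the kiss data, this plug-in interval can be computed directly (a sketch; it comes out slightly narrower than the conservative interval [0.56, 0.73]):

```python
import math

# Slutsky plug-in 95% CI for p in the kiss example:
# R_bar_n +/- q * sqrt(R_bar_n (1 - R_bar_n) / n)
n = 124
r_bar = 80 / n
q = 1.96

half_width = q * math.sqrt(r_bar * (1 - r_bar) / n)
ci = (r_bar - half_width, r_bar + half_width)
print(ci)
```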
MIT OpenCourseWare
https://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications


Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
18.650
Statistics for Applications

Chapter 2: Parametric Inference

1/11
The rationale behind statistical modeling
◮ Let X1 , . . . , Xn be n independent copies of X.
◮ The goal of statistics is to learn the distribution of X.
◮ If X ∈ {0, 1}, easy! It’s Ber(p) and we only have to learn the
parameter p of the Bernoulli distribution.
◮ Can be more complicated. For example, here is a (partial)
dataset with number of siblings (including self) that were
collected from college students a few years back: 2, 3, 2, 4, 1,
3, 1, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 3, 2, 1, 3, 1, 2, 3, . . .
◮ We could make no assumption and try to learn the pmf:

    x           1    2    3    4    5    6    ≥ 7
    IP(X = x)   p1   p2   p3   p4   p5   p6   ∑_{i≥7} pi

That’s 7 parameters to learn.


◮ Or we could assume that X ∼ Poiss(λ). That’s 1 parameter
to learn!
2/11
Statistical model (1)

Formal definition

Let the observed outcome of a statistical experiment be a sample


X1 , . . . , Xn of n i.i.d. random variables in some measurable space
E (usually E ⊆ IR) and denote by IP their common distribution. A
statistical model associated to that statistical experiment is a pair

(E, (IPθ )θ∈Θ ) ,

where:

◮ E is sample space;

◮ (IPθ )θ∈Θ is a family of probability measures on E;

◮ Θ is any set, called parameter set.

3/11
Statistical model (2)

◮ Usually, we will assume that the statistical model is well


specified, i.e., defined such that IP = IPθ , for some θ ∈ Θ.

◮ This particular θ is called the true parameter, and is unknown:
  the aim of the statistical experiment is to estimate θ, or to
  check its properties when they have a special meaning
  (θ > 2?, θ = 1/2?, . . . )

◮ For now, we will always assume that Θ ⊆ IRd for some d ≥ 1:


The model is called parametric.

4/11
Statistical model (3)
Examples
1. For n Bernoulli trials:

    ( {0, 1}, (Ber(p))p∈(0,1) ).

2. If X1 , . . . , Xn are i.i.d. ∼ Exp(λ), for some unknown λ > 0:

    ( IR∗₊ , (Exp(λ))λ>0 ).

3. If X1 , . . . , Xn are i.i.d. ∼ Poiss(λ), for some unknown λ > 0:

    ( IN, (Poiss(λ))λ>0 ).

4. If X1 , . . . , Xn are i.i.d. ∼ N (µ, σ 2 ), for some unknown µ ∈ IR
   and σ 2 > 0:

    ( IR, (N (µ, σ 2 ))(µ,σ2 )∈IR×IR∗₊ ).

5/11
Identification

The parameter θ is called identified iff the map θ ∈ Θ ↦ IPθ is
injective, i.e.,

    θ ≠ θ ′ ⇒ IPθ ≠ IPθ′ .

Examples

1. In all four previous examples, the parameter was identified.

2. If Xi = 1I{Yi ≥ 0}, where Y1 , . . . , Yn are i.i.d. ∼ N (µ, σ 2 ), for
   some unknown µ ∈ IR and σ 2 > 0, and are unobserved: µ and σ 2
   are not identified (but θ = µ/σ is).

6/11
Parameter estimation (1)
Idea: Given an observed sample X1 , . . . , Xn and a statistical
model (E, (IPθ )θ∈Θ ), one wants to estimate the parameter θ.

Definitions
◮ Statistic: Any measurable1 function of the sample, e.g.,
X̄n , max Xi , X1 + log(1 + |Xn |), sample variance, etc...
i
◮ Estimator of θ: Any statistic whose expression does not
depend on θ.
◮ An estimator θ̂n of θ is weakly (resp. strongly) consistent iff

    θ̂n → θ in IP (resp. a.s.), w.r.t. IPθ , as n → ∞.

¹ Rule of thumb: if you can compute it exactly once given the data, it is
measurable. You may have some issues with things that are implicitly defined,
such as sup or inf, but not in this class.

7/11
Parameter estimation (2)

◮ Bias of an estimator θ̂n of θ:

    IE[θ̂n ] − θ.

◮ Risk (or quadratic risk) of an estimator θ̂n :

    IE[ |θ̂n − θ|² ].

Remark: If Θ ⊆ IR,

    “Quadratic risk = bias² + variance”.

8/11
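The decomposition can be checked by Monte Carlo on a deliberately biased estimator (the shrinkage estimator (S + 1)/(n + 2) is a hypothetical example chosen for illustration, not from the slides):

```python
import random
import statistics

random.seed(2)

# Monte Carlo check of "quadratic risk = bias^2 + variance" for a
# biased shrinkage estimator of a Bernoulli parameter p
p, n, reps = 0.3, 20, 20_000
estimates = []
for _ in range(reps):
    s = sum(1 if random.random() < p else 0 for _ in range(n))
    estimates.append((s + 1) / (n + 2))   # shrinks toward 1/2

risk = statistics.fmean((e - p) ** 2 for e in estimates)  # IE|est - p|^2
bias = statistics.fmean(estimates) - p
var = statistics.pvariance(estimates)
print(risk, bias ** 2 + var)  # the two numbers coincide
```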
Confidence intervals (1)
Let (E, (IPθ )θ∈Θ ) be a statistical model based on observations
X1 , . . . , Xn , and assume Θ ⊆ IR.

Definition
Let α ∈ (0, 1).
◮ Confidence interval (C.I.) of level 1 − α for θ: Any random
(i.e., depending on X1 , . . . , Xn ) interval I whose boundaries
do not depend on θ and such that:

IPθ [I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
◮ C.I. of asymptotic level 1 − α for θ: Any random interval I
whose boundaries do not depend on θ and such that:

lim IPθ [I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
n→∞

9/11
Confidence intervals (2)
iid
Example: Let X1 , . . . , Xn ∼ Ber(p), for some unknown
p ∈ (0, 1).

◮ LLN: The sample average X̄n is a strongly consistent


estimator of p.

◮ Let qα/2 be the (1 − α/2)-quantile of N (0, 1) and

    I = [ X̄n − qα/2 √(p(1 − p))/√n , X̄n + qα/2 √(p(1 − p))/√n ].

◮ CLT: lim_{n→∞} IPp [I ∋ p] = 1 − α, ∀p ∈ (0, 1).

◮ Problem: I depends on p !

10/11
Confidence intervals (3)

Two solutions:

◮ Replace p(1 − p) with 1/4 in I (since p(1 − p) ≤ 1/4).

◮ Replace p with X̄n in I and use Slutsky’s theorem.

11/11
18.650
Statistics for Applications

Chapter 3: Maximum Likelihood Estimation

1/23
Total variation distance (1)
Let (E, (IPθ )θ∈Θ ) be a statistical model associated with a sample
of i.i.d. r.v. X1 , . . . , Xn . Assume that there exists θ ∗ ∈ Θ such
that X1 ∼ IPθ∗ : θ ∗ is the true parameter.

Statistician’s goal: given X1 , . . . , Xn , find an estimator
θ̂ = θ̂(X1 , . . . , Xn ) such that IPθ̂ is close to IPθ∗ for the true
parameter θ ∗ .

This means: |IPθ̂ (A) − IPθ∗ (A)| is small for all A ⊂ E.
Definition
The total variation distance between two probability measures IPθ
and IPθ′ is defined by

    TV(IPθ , IPθ′ ) = max_{A⊂E} | IPθ (A) − IPθ′ (A) | .

2/23
Total variation distance (2)

Assume that E is discrete (i.e., finite or countable). This includes


Bernoulli, Binomial, Poisson, . . .

Therefore X has a PMF (probability mass function):
IPθ (X = x) = pθ (x) for all x ∈ E, with

    pθ (x) ≥ 0,   ∑_{x∈E} pθ (x) = 1 .

The total variation distance between IPθ and IPθ′ is a simple
function of the PMFs pθ and pθ′ :

    TV(IPθ , IPθ′ ) = (1/2) ∑_{x∈E} | pθ (x) − pθ′ (x) | .

3/23
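A minimal sketch of this formula on a small discrete space (the Binomial(2, p) example is an illustrative choice):

```python
# Total variation between two discrete distributions via the PMF formula
# TV = (1/2) * sum_x |p(x) - q(x)|, illustrated on Binomial(2, .) PMFs
def binom2_pmf(p):
    # PMF of Binomial(n=2, p) on {0, 1, 2}
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

def tv(pmf1, pmf2):
    return 0.5 * sum(abs(a - b) for a, b in zip(pmf1, pmf2))

d = tv(binom2_pmf(0.5), binom2_pmf(0.7))
print(d)  # a number in [0, 1]; symmetric and zero iff the PMFs agree
```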
Total variation distance (3)

Assume that E is continuous. This includes Gaussian, Exponential,


...
Assume that X has a density: IPθ (X ∈ A) = ∫_A fθ (x) dx for all
A ⊂ E, with

    fθ (x) ≥ 0,   ∫_E fθ (x) dx = 1 .

The total variation distance between IPθ and IPθ′ is a simple
function of the densities fθ and fθ′ :

    TV(IPθ , IPθ′ ) = (1/2) ∫_E | fθ (x) − fθ′ (x) | dx .

4/23
Total variation distance (4)

Properties of Total variation:

◮ TV(IPθ , IPθ′ ) = TV(IPθ′ , IPθ ) (symmetric)


◮ TV(IPθ , IPθ′ ) ≥ 0
◮ If TV(IPθ , IPθ′ ) = 0 then IPθ = IPθ′ (definite)
◮ TV(IPθ , IPθ′ ) ≤ TV(IPθ , IPθ′′ ) + TV(IPθ′′ , IPθ′ ) (triangle
inequality)

These imply that the total variation is a distance between


probability distributions.

5/23
Total variation distance (5)
An estimation strategy: Build an estimator TV̂(IPθ , IPθ∗ ) for all
θ ∈ Θ. Then find θ̂ that minimizes the function θ ↦ TV̂(IPθ , IPθ∗ ).

Problem: Unclear how to build TV̂(IPθ , IPθ∗ )!
6/23
Kullback-Leibler (KL) divergence (1)

There are many distances between probability measures to replace


total variation. Let us choose one that is more convenient.

Definition
The Kullback-Leibler (KL) divergence between two probability
measures IPθ and IPθ′ is defined by




 L ( p (x) )

 θ

 pθ (x) log if E is discrete
pθ′ (x)
KL(IPθ , IPθ ) =
′ x∈E

 l

 ( f (x) )

 θ

 f θ (x) log dx if E is continuous
E θf ′ (x)

7/23
Kullback-Leibler (KL) divergence (2)

Properties of KL-divergence:

◮ KL(IPθ , IPθ′ ) ≠ KL(IPθ′ , IPθ ) in general
◮ KL(IPθ , IPθ′ ) ≥ 0
◮ If KL(IPθ , IPθ′ ) = 0 then IPθ = IPθ′ (definite)
◮ KL(IPθ , IPθ′ ) ≤ KL(IPθ , IPθ′′ ) + KL(IPθ′′ , IPθ′ ) does not hold
  in general (no triangle inequality)

Not a distance. This is called a divergence.

Asymmetry is the key to our ability to estimate it!

8/23
Kullback-Leibler (KL) divergence (3)
    KL(IPθ∗ , IPθ ) = IEθ∗ [ log( pθ∗ (X) / pθ (X) ) ]
                    = IEθ∗ [log pθ∗ (X)] − IEθ∗ [log pθ (X)]

So the function θ ↦ KL(IPθ∗ , IPθ ) is of the form:

    “constant” − IEθ∗ [log pθ (X)]

Can be estimated: IEθ∗ [h(X)] ≈ (1/n) ∑_{i=1}^n h(Xi )   (by the LLN)

    KL̂(IPθ∗ , IPθ ) = “constant” − (1/n) ∑_{i=1}^n log pθ (Xi )
9/23
Kullback-Leibler (KL) divergence (4)
    KL̂(IPθ∗ , IPθ ) = “constant” − (1/n) ∑_{i=1}^n log pθ (Xi )

    min_{θ∈Θ} KL̂(IPθ∗ , IPθ ) ⇔ min_{θ∈Θ} − (1/n) ∑_{i=1}^n log pθ (Xi )
                              ⇔ max_{θ∈Θ} (1/n) ∑_{i=1}^n log pθ (Xi )
                              ⇔ max_{θ∈Θ} ∑_{i=1}^n log pθ (Xi )
                              ⇔ max_{θ∈Θ} ∏_{i=1}^n pθ (Xi )

This is the maximum likelihood principle.


10/23
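A sketch of the principle in action: maximizing the Bernoulli log-likelihood over a grid recovers the sample average (the toy sample below is made up for illustration):

```python
import math

# Maximize the Bernoulli log-likelihood sum_i log p_theta(X_i)
# over a grid of p values; the maximizer is the sample average
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # toy sample, mean 0.7

def log_lik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

grid = [k / 1000 for k in range(1, 1000)]   # p in (0, 1)
p_best = max(grid, key=log_lik)
print(p_best)  # 0.7, the sample average of xs
```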
Interlude: maximizing/minimizing functions (1)

Note that
min −h(θ) ⇔ max h(θ)
θ∈Θ θ∈Θ

In this class, we focus on maximization.

Maximization of arbitrary functions can be difficult:

Example: θ ↦ ∏_{i=1}^n (θ − Xi )

11/23
Interlude: maximizing/minimizing functions (2)
Definition
A twice differentiable function h : Θ ⊂ IR → IR is said to
be concave if its second derivative satisfies

    h′′ (θ) ≤ 0 ,  ∀θ ∈ Θ.

It is said to be strictly concave if the inequality is strict: h′′ (θ) < 0.

Moreover, h is said to be (strictly) convex if −h is (strictly)
concave, i.e. h′′ (θ) ≥ 0 (h′′ (θ) > 0).

Examples:
◮ Θ = IR, h(θ) = −θ²,
◮ Θ = (0, ∞), h(θ) = √θ,
◮ Θ = (0, ∞), h(θ) = log θ,
◮ Θ = [0, π], h(θ) = sin(θ),
◮ Θ = IR, h(θ) = 2θ − 3.
12/23
Interlude: maximizing/minimizing functions (3)
More generally, for a multivariate function h : Θ ⊂ IRᵈ → IR,
d ≥ 2, define the

◮ gradient vector: ∇h(θ) = ( ∂h/∂θ1 (θ), . . . , ∂h/∂θd (θ) )ᵀ ∈ IRᵈ

◮ Hessian matrix: ∇²h(θ) ∈ IRᵈˣᵈ , with entries

    (∇²h(θ))i,j = ∂²h/∂θi ∂θj (θ),  1 ≤ i, j ≤ d.

h is concave ⇔ xᵀ ∇²h(θ) x ≤ 0, ∀x ∈ IRᵈ , θ ∈ Θ.

h is strictly concave ⇔ xᵀ ∇²h(θ) x < 0, ∀x ∈ IRᵈ , θ ∈ Θ.

Examples:
◮ Θ = IR², h(θ) = −θ1² − 2θ2² or h(θ) = −(θ1 − θ2 )²,
◮ Θ = (0, ∞)², h(θ) = log(θ1 + θ2 ).
13/23
Interlude: maximizing/minimizing functions (4)

Strictly concave functions are easy to maximize: if they have a
maximum, then it is unique. It is the unique solution to

    h′ (θ) = 0 ,

or, in the multivariate case,

    ∇h(θ) = 0 ∈ IRᵈ .

There are many algorithms to find it numerically: this is the theory
of “convex optimization”. In this class, we will often have a
closed-form formula for the maximum.

14/23
Likelihood, Discrete case (1)

Let (E, (IPθ )θ∈Θ ) be a statistical model associated with a sample
of i.i.d. r.v. X1 , . . . , Xn . Assume that E is discrete (i.e., finite or
countable).

Definition

The likelihood of the model is the map Ln (or just L) defined as:

    Ln : Eⁿ × Θ → IR
    (x1 , . . . , xn , θ) ↦ IPθ [X1 = x1 , . . . , Xn = xn ].

15/23
Likelihood, Discrete case (2)
iid
Example 1 (Bernoulli trials): If X1 , . . . , Xn ∼ Ber(p) for some
p ∈ (0, 1):

◮ E = {0, 1};
◮ Θ = (0, 1);
◮ ∀(x1 , . . . , xn ) ∈ {0, 1}ⁿ , ∀p ∈ (0, 1),

    L(x1 , . . . , xn , p) = ∏_{i=1}^n IPp [Xi = xi ]
                          = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi}
                          = p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi} .

16/23
Likelihood, Discrete case (3)
Example 2 (Poisson model):
iid
If X1 , . . . , Xn ∼ Poiss(λ) for some λ > 0:

◮ E = IN;
◮ Θ = (0, ∞);
◮ ∀(x1 , . . . , xn ) ∈ INⁿ , ∀λ > 0,

    L(x1 , . . . , xn , λ) = ∏_{i=1}^n IPλ [Xi = xi ]
                          = ∏_{i=1}^n e^{−λ} λ^{xi} / xi !
                          = e^{−nλ} λ^{∑_{i=1}^n xi} / (x1 ! · · · xn !) .

17/23
Likelihood, Continuous case (1)

Let (E, (IPθ )θ∈Θ ) be a statistical model associated with a sample
of i.i.d. r.v. X1 , . . . , Xn . Assume that all the IPθ have density fθ .

Definition

The likelihood of the model is the map L defined as:

    L : Eⁿ × Θ → IR
    (x1 , . . . , xn , θ) ↦ ∏_{i=1}^n fθ (xi ).

18/23
Likelihood, Continuous case (2)

Example 1 (Gaussian model): If X1 , . . . , Xn are i.i.d. ∼ N (µ, σ 2 ),
for some µ ∈ IR, σ 2 > 0:

◮ E = IR;
◮ Θ = IR × (0, ∞);
◮ ∀(x1 , . . . , xn ) ∈ IRⁿ , ∀(µ, σ 2 ) ∈ IR × (0, ∞),

    L(x1 , . . . , xn , µ, σ 2 ) = (σ√(2π))⁻ⁿ exp( −(1/(2σ²)) ∑_{i=1}^n (xi − µ)² ).

19/23
Maximum likelihood estimator (1)

Let X1 , . . . , Xn be an i.i.d. sample associated with a statistical
model (E, (IPθ )θ∈Θ ) and let L be the corresponding likelihood.

Definition
The maximum likelihood estimator of θ is defined as:

    θ̂n^MLE = argmax_{θ∈Θ} L(X1 , . . . , Xn , θ),

provided it exists.

Remark (log-likelihood): In practice, we use the fact that

    θ̂n^MLE = argmax_{θ∈Θ} log L(X1 , . . . , Xn , θ).

20/23
Maximum likelihood estimator (2)

Examples

◮ Bernoulli trials: p̂n^MLE = X̄n .

◮ Poisson model: λ̂n^MLE = X̄n .

◮ Gaussian model: (µ̂n , σ̂n² ) = (X̄n , Ŝn ).

21/23
Maximum likelihood estimator (3)

Definition: Fisher information

Define the log-likelihood for one observation as:

    ℓ(θ) = log L1 (X, θ),  θ ∈ Θ ⊂ IRᵈ .

Assume that ℓ is a.s. twice differentiable. Under some regularity
conditions, the Fisher information of the statistical model is
defined as:

    I(θ) = IE[∇ℓ(θ)∇ℓ(θ)ᵀ ] − IE[∇ℓ(θ)] IE[∇ℓ(θ)]ᵀ = −IE[∇²ℓ(θ)].

If Θ ⊂ IR, we get:

    I(θ) = var[ℓ′ (θ)] = −IE[ℓ′′ (θ)].

22/23
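For the one-dimensional Bernoulli model, both expressions can be evaluated exactly and agree (a sketch; p = 0.3 is an arbitrary choice):

```python
# Both expressions for the Fisher information, evaluated for the
# Bernoulli model: l(p) = X log p + (1 - X) log(1 - p)
p = 0.3

# var[l'(p)]: l'(p) = X/p - (1-X)/(1-p) takes value 1/p w.p. p
# and -1/(1-p) w.p. 1-p, so its mean is 0
mean_score = p * (1 / p) + (1 - p) * (-1 / (1 - p))
var_score = p * (1 / p) ** 2 + (1 - p) * (1 / (1 - p)) ** 2

# -IE[l''(p)]: l''(p) = -X/p^2 - (1-X)/(1-p)^2
neg_mean_second = p / p ** 2 + (1 - p) / (1 - p) ** 2

print(var_score, neg_mean_second)  # both equal 1 / (p (1 - p))
```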
Maximum likelihood estimator (4)

Theorem

Let θ ∗ ∈ Θ (the true parameter). Assume the following:


1. The model is identified.
2. For all θ ∈ Θ, the support of IPθ does not depend on θ;
3. θ ∗ is not on the boundary of Θ;
4. I(θ) is invertible in a neighborhood of θ ∗ ;
5. A few more technical conditions.

Then, θ̂n^MLE satisfies:

◮ θ̂n^MLE → θ ∗ in IP, w.r.t. IPθ∗ ;

◮ √n (θ̂n^MLE − θ ∗ ) → N (0, I(θ ∗ )⁻¹) in distribution, w.r.t. IPθ∗ .

23/23
18.650
Statistics for Applications

Chapter 4: The Method of Moments

1/14
Weierstrass Approximation Theorem (WAT)

Theorem
Let f be a continuous function on the interval [a, b]. Then, for any
ε > 0, there exist d ≥ 0 and a0 , a1 , . . . , ad ∈ IR such that

      max_{x∈[a,b]} | f (x) − Σ_{k=0}^d ak x^k | < ε .

In words: “continuous functions can be arbitrarily well approximated
by polynomials”.
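A quick numerical illustration of the theorem (not part of the slides; assumes numpy; the target f(x) = |x| and the grid are arbitrary choices): least-squares polynomial fits of growing degree drive the sup-norm error down.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 401)
f = np.abs(x)  # continuous but not differentiable at 0

def sup_error(deg):
    """Sup-norm error on the grid of a degree-`deg` least-squares polynomial fit."""
    coeffs = np.polyfit(x, f, deg)
    return np.max(np.abs(f - np.polyval(coeffs, x)))

err2, err10 = sup_error(2), sup_error(10)  # error shrinks as the degree grows
```

Least squares is not the minimax fit of the theorem, but it exhibits the same qualitative behavior.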

2/14
Statistical application of the WAT (1)
◮ Let X1 , . . . , Xn be an i.i.d. sample associated with an
  (identified) statistical model (E, {IPθ }θ∈Θ ). Write θ∗ for the
  true parameter.

◮ Assume that for all θ, the distribution IPθ has a density fθ .

◮ If we find θ such that

      ∫ h(x)fθ∗ (x)dx = ∫ h(x)fθ (x)dx

  for all (bounded continuous) functions h, then θ = θ∗ .

◮ Replace expectations by averages: find an estimator θ̂ such that

      (1/n) Σ_{i=1}^n h(Xi ) = ∫ h(x)f_θ̂ (x)dx

  for all (bounded continuous) functions h. There is an infinity
  of such functions: not doable!
3/14
Statistical application of the WAT (2)
◮ By the WAT, it is enough to consider polynomials:

      (1/n) Σ_{i=1}^n Σ_{k=0}^d ak Xi^k = Σ_{k=0}^d ak ∫ x^k f_θ̂ (x)dx ,   ∀a0 , . . . , ad ∈ IR

  Still an infinity of equations!

◮ In turn, enough to consider

      (1/n) Σ_{i=1}^n Xi^k = ∫ x^k f_θ̂ (x)dx ,   ∀k = 1, . . . , d

  (only d equations)

◮ The quantity mk (θ) := ∫ x^k fθ (x)dx is the kth moment of
  IPθ . It can also be written as

      mk (θ) = IEθ [X^k ] .
4/14
Gaussian quadrature (1)
◮ The Weierstrass approximation theorem has limitations:
1. works only for continuous functions (not really a problem!)
2. works only on intervals [a, b]
3. Does not tell us what d (# of moments) should be
◮ What if E is discrete: no PDF but PMF p(·)?
◮ Assume that E = {x1 , x2 , . . . , xr } is finite with r possible
values. The PMF has r − 1 parameters:

      p(x1 ), . . . , p(xr−1 )

  because the last one, p(xr ) = 1 − Σ_{j=1}^{r−1} p(xj ), is given by the
  first r − 1.
◮ Hopefully, we do not need much more than d = r − 1
moments to recover the PMF p(·).

5/14
Gaussian quadrature (2)
◮ Note that for any k = 1, . . . , r − 1,

      mk = IE[X^k ] = Σ_{j=1}^r p(xj ) xj^k

  and

      Σ_{j=1}^r p(xj ) = 1

  This is a system of linear equations with unknowns
  p(x1 ), . . . , p(xr ).

◮ We can write it in a compact form:

      ⎛ x1^1       x2^1       · · ·  xr^1       ⎞   ⎛ p(x1 )   ⎞   ⎛ m1   ⎞
      ⎜ x1^2       x2^2       · · ·  xr^2       ⎟   ⎜ p(x2 )   ⎟   ⎜ m2   ⎟
      ⎜  ...        ...              ...        ⎟ · ⎜  ...     ⎟ = ⎜ ...  ⎟
      ⎜ x1^{r−1}   x2^{r−1}   · · ·  xr^{r−1}   ⎟   ⎜ p(xr−1 ) ⎟   ⎜ mr−1 ⎟
      ⎝ 1          1          · · ·  1          ⎠   ⎝ p(xr )   ⎠   ⎝ 1    ⎠
6/14
Gaussian quadrature (3)
◮ Check if the matrix is invertible: Vandermonde determinant

      ⎛ x1^1       x2^1       · · ·  xr^1       ⎞
      ⎜ x1^2       x2^2       · · ·  xr^2       ⎟
  det ⎜  ...        ...              ...        ⎟ = ± Π_{1≤j<k≤r} (xj − xk ) ≠ 0
      ⎜ x1^{r−1}   x2^{r−1}   · · ·  xr^{r−1}   ⎟
      ⎝ 1          1          · · ·  1          ⎠

  since the xj are distinct.

◮ So given m1 , . . . , mr−1 , there is a unique PMF that has these
  moments. It is given by

      ⎛ p(x1 )   ⎞   ⎛ x1^1       x2^1       · · ·  xr^1       ⎞⁻¹ ⎛ m1   ⎞
      ⎜ p(x2 )   ⎟   ⎜ x1^2       x2^2       · · ·  xr^2       ⎟   ⎜ m2   ⎟
      ⎜  ...     ⎟ = ⎜  ...        ...              ...        ⎟   ⎜ ...  ⎟
      ⎜ p(xr−1 ) ⎟   ⎜ x1^{r−1}   x2^{r−1}   · · ·  xr^{r−1}   ⎟   ⎜ mr−1 ⎟
      ⎝ p(xr )   ⎠   ⎝ 1          1          · · ·  1          ⎠   ⎝ 1    ⎠
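A small numerical check of this recovery (not from the slides; assumes numpy; the support {1, 2, 3} and PMF (0.2, 0.5, 0.3) are made up): solving the linear system above from m1, m2 and the normalization equation returns the PMF.

```python
import numpy as np

support = np.array([1.0, 2.0, 3.0])   # x_1, x_2, x_3 (r = 3)
p_true = np.array([0.2, 0.5, 0.3])

# Moments m_1, m_2, plus the normalization equation
m = np.array([support @ p_true, (support**2) @ p_true, 1.0])

# Rows: first powers, second powers, ones (the matrix from the slide)
A = np.vstack([support, support**2, np.ones(3)])

p_recovered = np.linalg.solve(A, m)  # unique solution since the x_j are distinct
```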
7/14
Conclusion from WAT and Gaussian quadrature

◮ Moments contain important information to recover the PDF


or the PMF

◮ If we can estimate these moments accurately, we may be able


to recover the distribution

◮ In a parametric setting, where knowing the distribution IPθ
  amounts to knowing θ, it is often the case that even fewer
  moments are needed to recover θ. This is on a case-by-case
  basis.

◮ Rule of thumb: if θ ∈ Θ ⊂ IR^d , we need d moments.

8/14
Method of moments (1)

Let X1 , . . . , Xn be an i.i.d. sample associated with a statistical
model (E, (IPθ )θ∈Θ ). Assume that Θ ⊆ IR^d , for some d ≥ 1.

◮ Population moments: Let mk (θ) = IEθ [X1^k ], 1 ≤ k ≤ d.

◮ Empirical moments: Let m̂k = (1/n) Σ_{i=1}^n Xi^k , 1 ≤ k ≤ d.

◮ Let
      ψ : Θ ⊂ IR^d → IR^d
          θ ↦ (m1 (θ), . . . , md (θ)) .

9/14
Method of moments (2)

Assume ψ is one to one:

θ = ψ −1 (m1 (θ), . . . , md (θ)).

Definition

Moments estimator of θ:

θˆnM M = ψ −1 (m̂1 , . . . , m̂d ),

provided it exists.
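As a concrete instance (a sketch, not from the slides; assumes numpy; the parameters are made up): in the Gaussian model ψ(µ, σ²) = (µ, µ² + σ²), so ψ⁻¹(m1, m2) = (m1, m2 − m1²), and plugging in the empirical moments gives the moments estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)  # true (mu, sigma^2) = (2, 2.25)

m1_hat = np.mean(x)       # empirical first moment
m2_hat = np.mean(x**2)    # empirical second moment

mu_hat = m1_hat                    # psi^{-1} applied to (m1_hat, m2_hat)
sigma2_hat = m2_hat - m1_hat**2
```

With n = 100 000 observations, both estimates land close to the true values.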

10/14
Method of moments (3)

Analysis of θ̂n^MM

◮ Let M (θ) = (m1 (θ), . . . , md (θ)) ;

◮ Let M̂ = (m̂1 , . . . , m̂d ).

◮ Let Σ(θ) = Vθ (X, X², . . . , X^d ) be the covariance matrix of
  the random vector (X, X², . . . , X^d ), where X ∼ IPθ .

◮ Assume ψ⁻¹ is continuously differentiable at M (θ). Write
  ∇ψ⁻¹ |_{M (θ)} for the d × d gradient matrix at this point.

11/14
Method of moments (4)

◮ LLN: θ̂n^MM is weakly/strongly consistent.

◮ CLT:

      √n (M̂ − M (θ)) −→ N (0, Σ(θ)) in distribution as n → ∞ (w.r.t. IPθ ).

Hence, by the Delta method (see next slide):

Theorem

      √n (θ̂n^MM − θ) −→ N (0, Γ(θ)) in distribution as n → ∞ (w.r.t. IPθ ),

where Γ(θ) = [∇ψ⁻¹ |_{M (θ)} ]⊤ Σ(θ) [∇ψ⁻¹ |_{M (θ)} ] .

12/14
Multivariate Delta method
Let (Tn )n≥1 be a sequence of random vectors in IR^p (p ≥ 1) that
satisfies

      √n (Tn − θ) −→ N (0, Σ) in distribution as n → ∞,

for some θ ∈ IR^p and some symmetric positive semidefinite matrix
Σ ∈ IR^{p×p} .

Let g : IR^p → IR^k (k ≥ 1) be continuously differentiable at θ.
Then,

      √n (g(Tn ) − g(θ)) −→ N (0, ∇g(θ)⊤ Σ ∇g(θ)) in distribution as n → ∞,

where ∇g(θ) = (∂gj /∂θi )_{1≤i≤p, 1≤j≤k} ∈ IR^{p×k} .

13/14
MLE vs. Moment estimator

◮ Comparison of the quadratic risks: In general, the MLE is


more accurate.
◮ Computational issues: Sometimes, the MLE is intractable.
◮ If likelihood is concave, we can use optimization algorithms
(Interior point method, gradient descent, etc.)
◮ If likelihood is not concave: only heuristics. Local maxima.
(Expectation-Maximization, etc.)

14/14
18.650
Statistics for Applications

Chapter 5: Parametric hypothesis testing

1/37
Cherry Blossom run (1)

◮ The Credit Union Cherry Blossom Run is a 10-mile race that
  takes place every year in D.C.
◮ In 2009 there were 14974 participants
◮ Average running time was 103.5 minutes.

Were runners faster in 2012?

To answer this question, select n runners from the 2012 race at


random and denote by X1 , . . . , Xn their running time.

2/37
Cherry Blossom run (2)

We can see from past data that the running time has a Gaussian
distribution.

The variance was 373.

3/37
Cherry Blossom run (3)

◮ We are given i.i.d. r.v. X1 , . . . , Xn and we want to know if
  X1 ∼ N (103.5, 373)

◮ This is a hypothesis testing problem.

◮ There are many ways this could be false:
   1. IE[X1 ] ≠ 103.5
   2. var[X1 ] ≠ 373
   3. X1 may not even be Gaussian.

◮ We are interested in a very specific question: is
  IE[X1 ] < 103.5 ?

4/37
Cherry Blossom run (4)

◮ We make the following assumptions:


1. var[X1 ] = 373 (variance is the same between 2009 and 2012)
2. X1 is Gaussian.
◮ The only thing that we did not fix is IE[X1 ] = µ.
◮ Now we want to test (only): “Is µ = 103.5 or is µ < 103.5”?
◮ By making modeling assumptions, we have reduced the
number of ways the hypothesis X1 ∼ N (103.5, 373) may be
rejected.
◮ The only way it can be rejected is if X1 ∼ N (µ, 373) for some
µ < 103.5.
◮ We compare an expected value to a fixed reference number
(103.5).

5/37
Cherry Blossom run (5)

Simple heuristic:

      “If X̄n < 103.5, then µ < 103.5”

This could go wrong if I randomly pick only fast runners in my


sample X1 , . . . , Xn .

Better heuristic:

      “If X̄n < 103.5 − (something that → 0 as n → ∞), then µ < 103.5”

To make this intuition more precise, we need to take the size of the
random fluctuations of X̄n into account!

6/37
Clinical trials (1)

◮ Pharmaceutical companies use hypothesis testing to test if a


new drug is efficient.
◮ To do so, they administer a drug to a group of patients (test
group) and a placebo to another group (control group).
◮ Assume that the drug is a cough syrup.
◮ Let µcontrol denote the expected number of expectorations per
hour after a patient has used the placebo.
◮ Let µdrug denote the expected number of expectorations per
hour after a patient has used the syrup.
◮ We want to know if µdrug < µcontrol
◮ We compare two expected values. No reference number.

7/37
Clinical trials (2)

◮ Let X1 , . . . , Xndrug denote ndrug i.i.d r.v. with distribution


Poiss(µdrug )
◮ Let Y1 , . . . , Yncontrol denote ncontrol i.i.d r.v. with distribution
Poiss(µcontrol )
◮ We want to test if µdrug < µcontrol .
Heuristic:

      “If X̄drug < X̄control − (something that → 0 as ndrug , ncontrol → ∞),
      then conclude that µdrug < µcontrol ”

8/37
Heuristics (1)
Example 1: A coin is tossed 80 times, and Heads are obtained 54
times. Can we conclude that the coin is significantly unfair ?

◮ n = 80, X1 , . . . , Xn i.i.d. ∼ Ber(p);
◮ X̄n = 54/80 ≈ .68
◮ If it was true that p = .5: by CLT + Slutsky’s theorem,

      √n (X̄n − .5)/√(.5(1 − .5)) ≈ N (0, 1).

◮ Here, √n (X̄n − .5)/√(.5(1 − .5)) ≈ 3.22
◮ Conclusion: It seems quite reasonable to reject the
hypothesis p = .5.

9/37
Heuristics (2)
Example 2: A coin is tossed 30 times, and Heads are obtained 13
times. Can we conclude that the coin is significantly unfair ?

◮ n = 30, X1 , . . . , Xn i.i.d. ∼ Ber(p);
◮ X̄n = 13/30 ≈ .43
◮ If it was true that p = .5: by CLT + Slutsky’s theorem,

      √n (X̄n − .5)/√(.5(1 − .5)) ≈ N (0, 1).

◮ Our data gives √n (X̄n − .5)/√(.5(1 − .5)) ≈ −.77
◮ The number −.77 is a plausible realization of a random variable
  Z ∼ N (0, 1).
◮ Conclusion: our data does not suggest that the coin is unfair.
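Both statistics can be recomputed directly (a sketch, not part of the slides); note the slides round X̄n to two decimals before plugging in, which is why they report ≈ 3.22 and ≈ −.77, while the exact values are ≈ 3.13 and ≈ −0.73; the conclusions are identical:

```python
import math

def coin_stat(heads, n, p0=0.5):
    """CLT/Slutsky test statistic: sqrt(n) * (Xbar - p0) / sqrt(p0 * (1 - p0))."""
    xbar = heads / n
    return math.sqrt(n) * (xbar - p0) / math.sqrt(p0 * (1 - p0))

T1 = coin_stat(54, 80)  # Example 1: ~3.13 exactly (slides round Xbar to .68 -> ~3.22)
T2 = coin_stat(13, 30)  # Example 2: ~-0.73 exactly (slides round Xbar to .43 -> ~-.77)

# At asymptotic level 5%, reject H0: p = 1/2 when |T| > 1.96
reject1 = abs(T1) > 1.96  # True: the coin looks unfair
reject2 = abs(T2) > 1.96  # False: no evidence against fairness
```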
10/37
Statistical formulation (1)
◮ Consider a sample X1 , . . . , Xn of i.i.d. random variables and a
statistical model (E, (IPθ )θ∈Θ ).

◮ Let Θ0 and Θ1 be disjoint subsets of Θ.



◮ Consider the two hypotheses:

      H0 : θ ∈ Θ0   vs.   H1 : θ ∈ Θ1

◮ H0 is the null hypothesis, H1 is the alternative hypothesis.

◮ If we believe that the true θ is either in Θ0 or in Θ1 , we may


want to test H0 against H1 .

◮ We want to decide whether to reject H0 (look for evidence


against H0 in the data).
11/37
Statistical formulation (2)

◮ H0 and H1 do not play a symmetric role: the data is only
  used to try to disprove H0 .
◮ In particular, lack of evidence does not mean that H0 is true
  (“innocent until proven guilty”).

◮ A test is a statistic ψ ∈ {0, 1} such that:
   ◮ If ψ = 0, H0 is not rejected;
   ◮ If ψ = 1, H0 is rejected.

◮ Coin example: H0 : p = 1/2 vs. H1 : p ≠ 1/2.

◮ ψ = 1I{ √n |X̄n − .5|/√(.5(1 − .5)) > C }, for some C > 0.

◮ How to choose the threshold C ?


12/37
Statistical formulation (3)
◮ Rejection region of a test ψ:

      Rψ = {x ∈ E^n : ψ(x) = 1}.

◮ Type 1 error of a test ψ (rejecting H0 when it is actually
  true):
      αψ : Θ0 → IR
           θ ↦ IPθ [ψ = 1].

◮ Type 2 error of a test ψ (not rejecting H0 although H1 is
  actually true):
      βψ : Θ1 → IR
           θ ↦ IPθ [ψ = 0].

◮ Power of a test ψ:

      πψ = inf_{θ∈Θ1} (1 − βψ (θ)) .
13/37
Statistical formulation (4)

◮ A test ψ has level α if

αψ (θ) ≤ α, ∀θ ∈ Θ0 .

◮ A test ψ has asymptotic level α if

      lim_{n→∞} αψ (θ) ≤ α ,   ∀θ ∈ Θ0 .

◮ In general, a test has the form

ψ = 1I{Tn > c},

for some statistic Tn and threshold c ∈ IR.

◮ Tn is called the test statistic. The rejection region is


Rψ = {Tn > c}.

14/37
Example (1)

◮ Let X1 , . . . , Xn i.i.d. ∼ Ber(p), for some unknown p ∈ (0, 1).
◮ We want to test:

      H0 : p = 1/2   vs.   H1 : p ≠ 1/2

  with asymptotic level α ∈ (0, 1).

◮ Let Tn = √n (p̂n − 0.5)/√(.5(1 − .5)) , where p̂n is the MLE.

◮ If H0 is true, then by CLT and Slutsky’s theorem,

      IP[|Tn | > qα/2 ] −→ α as n → ∞.

◮ Let ψα = 1I{|Tn | > qα/2 }.

15/37
Example (2)

Coming back to the two previous coin examples: For α = 5%,


qα/2 = 1.96, so:

◮ In Example 1, H0 is rejected at the asymptotic level 5% by


the test ψ5% ;

◮ In Example 2, H0 is not rejected at the asymptotic level 5%


by the test ψ5% .

Question: In Example 1, for what level α would ψα not reject H0


? And in Example 2, at which level α would ψα reject H0 ?

16/37
p-value
Definition

The (asymptotic) p-value of a test ψα is the smallest (asymptotic)


level α at which ψα rejects H0 . It is random, it depends on the
sample.

Golden rule

p-value ≤ α ⇔ H0 is rejected by ψα , at the (asymptotic) level α.

The smaller the p-value, the more confidently one can reject
H0 .

◮ Example 1: p-value = IP[|Z| > 3.22] ≪ .01.

◮ Example 2: p-value = IP[|Z| > .77] ≈ .44.
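These p-values can be reproduced from the standard normal cdf, written here via the error function (a quick check, not part of the slides; stdlib only):

```python
import math

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided_pvalue(t):
    """p-value = IP[|Z| > |t|] for Z ~ N(0, 1)."""
    return 2.0 * (1.0 - phi(abs(t)))

pv1 = two_sided_pvalue(3.22)   # Example 1: well below .01 -> reject H0 confidently
pv2 = two_sided_pvalue(-0.77)  # Example 2: about .44 -> no evidence against H0
```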
17/37
Neyman-Pearson’s paradigm

Idea: For given hypotheses, among all tests of level/asymptotic


level α, is it possible to find one that has maximal power ?

Example: The trivial test ψ = 0 that never rejects H0 has a


perfect level (α = 0) but poor power (πψ = 0).

Neyman-Pearson’s theory provides (the most) powerful tests


with given level. In 18.650, we only study several cases.

18/37
The χ2 distributions
Definition
For a positive integer d, the χ2 (pronounced “Kai-squared”)
distribution with d degrees of freedom is the law of the random
iid
variable Z12 + Z22 + . . . + Zd2 , where Z1 , . . . , Zd ∼ N (0, 1).

Examples:
◮ If Z ∼ Nd (0, Id ), then ‖Z‖₂² ∼ χ²_d .
◮ Recall that the sample variance is given by

      Sn = (1/n) Σ_{i=1}^n (Xi − X̄n )² = (1/n) Σ_{i=1}^n Xi² − (X̄n )²

◮ Cochran’s theorem implies that for X1 , . . . , Xn i.i.d. ∼ N (µ, σ²), if
  Sn is the sample variance, then

      nSn /σ² ∼ χ²_{n−1} .

◮ χ²₂ = Exp(1/2).
19/37
Student’s T distributions

Definition
For a positive integer d, the Student’s T distribution with d
degrees of freedom (denoted by td ) is the law of the random
variable Z/√(V /d), where Z ∼ N (0, 1), V ∼ χ²_d and Z ⊥⊥ V (Z is
independent of V ).

Example:
◮ Cochran’s theorem implies that for X1 , . . . , Xn i.i.d. ∼ N (µ, σ²), if
  Sn is the sample variance, then

      √(n − 1) (X̄n − µ)/√Sn ∼ t_{n−1} .

20/37
Wald’s test (1)

◮ Consider an i.i.d. sample X1 , . . . , Xn with statistical model


(E, (IPθ )θ∈Θ ), where Θ ⊆ IRd (d ≥ 1) and let θ0 ∈ Θ be fixed
and given.
◮ Consider the following hypotheses:

      H0 : θ = θ0   vs.   H1 : θ ≠ θ0 .

◮ Let θ̂^MLE be the MLE. Assume the MLE technical conditions
  are satisfied.

◮ If H0 is true, then

      √n I(θ̂^MLE )^{1/2} (θ̂n^MLE − θ0 ) −→ Nd (0, Id ) in distribution w.r.t. IPθ0 .

21/37
Wald’s test (2)

◮ Hence,

      Tn := n (θ̂n^MLE − θ0 )⊤ I(θ̂^MLE ) (θ̂n^MLE − θ0 ) −→ χ²_d in distribution w.r.t. IPθ0 .

◮ Wald’s test with asymptotic level α ∈ (0, 1):

ψ = 1I{Tn > qα },

where qα is the (1 − α)-quantile of χ2d (see tables).

◮ Remark: Wald’s test is also valid if H1 has the form “θ > θ0 ”


or “θ < θ0 ” or “θ = θ1 ”...

22/37
Likelihood ratio test (1)

◮ Consider an i.i.d. sample X1 , . . . , Xn with statistical model


(E, (IPθ )θ∈Θ ), where Θ ⊆ IRd (d ≥ 1).

◮ Suppose the null hypothesis has the form

      H0 : (θr+1 , . . . , θd ) = (θr+1^(0) , . . . , θd^(0) ),

  for some fixed and given numbers θr+1^(0) , . . . , θd^(0) .

◮ Let

      θ̂n = argmax_{θ∈Θ} ℓn (θ)   (MLE)

  and

      θ̂n^c = argmax_{θ∈Θ0} ℓn (θ)   (“constrained MLE”)

23/37
Likelihood ratio test (2)

◮ Test statistic:

      Tn = 2 (ℓn (θ̂n ) − ℓn (θ̂n^c )) .

◮ Theorem
  Assume H0 is true and the MLE technical conditions are satisfied.
  Then,

      Tn −→ χ²_{d−r} in distribution w.r.t. IPθ , as n → ∞.

◮ Likelihood ratio test with asymptotic level α ∈ (0, 1):

ψ = 1I{Tn > qα },

where qα is the (1 − α)-quantile of χ2d−r (see tables).

24/37
Testing implicit hypotheses (1)

◮ Let X1 , . . . , Xn be i.i.d. random variables and let θ ∈ IRd be


a parameter associated with the distribution of X1 (e.g. a
moment, the parameter of a statistical model, etc...)

◮ Let g : IRd → IRk be continuously differentiable (with k < d).

◮ Consider the following hypotheses:

      H0 : g(θ) = 0
      H1 : g(θ) ≠ 0.

◮ E.g. g(θ) = (θ1 , θ2 ) (k = 2), or g(θ) = θ1 − θ2 (k = 1), or...

25/37
Testing implicit hypotheses (2)

◮ Suppose an asymptotically normal estimator θ̂n is available:

      √n (θ̂n − θ) −→ Nd (0, Σ(θ)) in distribution as n → ∞.

◮ Delta method:

      √n (g(θ̂n ) − g(θ)) −→ Nk (0, Γ(θ)) in distribution as n → ∞,

  where Γ(θ) = ∇g(θ)⊤ Σ(θ) ∇g(θ) ∈ IR^{k×k} .

◮ Assume Σ(θ) is invertible and ∇g(θ) has rank k. So, Γ(θ) is
  invertible and

      √n Γ(θ)^{−1/2} (g(θ̂n ) − g(θ)) −→ Nk (0, Ik ) in distribution as n → ∞.

26/37
Testing implicit hypotheses (3)

◮ Then, by Slutsky’s theorem, if Γ(θ) is continuous in θ,

      √n Γ(θ̂n )^{−1/2} (g(θ̂n ) − g(θ)) −→ Nk (0, Ik ) in distribution as n → ∞.

◮ Hence, if H0 is true, i.e., g(θ) = 0,

      Tn := n g(θ̂n )⊤ Γ^{−1}(θ̂n ) g(θ̂n ) −→ χ²_k in distribution as n → ∞.

◮ Test with asymptotic level α:

      ψ = 1I{Tn > qα },

  where qα is the (1 − α)-quantile of χ²_k (see tables).

27/37
The multinomial case: χ2 test (1)

Let E = {a1 , . . . , aK } be a finite space and (IPp )p∈ΔK be the


family of all probability distributions on E:

 
◮ ΔK = { p = (p1 , . . . , pK ) ∈ (0, 1)^K : Σ_{j=1}^K pj = 1 } .

◮ For p ∈ ΔK and X ∼ IPp ,

IPp [X = aj ] = pj , j = 1, . . . , K.

28/37
The multinomial case: χ2 test (2)

iid
◮ Let X1 , . . . , Xn ∼ IPp , for some unknown p ∈ ΔK , and let
p0 ∈ ΔK be fixed.

◮ We want to test:

      H0 : p = p0   vs.   H1 : p ≠ p0

  with asymptotic level α ∈ (0, 1).

◮ Example: If p0 = (1/K, 1/K, . . . , 1/K), we are testing


whether IPp is the uniform distribution on E.

29/37
The multinomial case: χ2 test (3)

◮ Likelihood of the model:

      Ln (X1 , . . . , Xn , p) = p1^{N1} p2^{N2} · · · pK^{NK} ,

  where Nj = #{i = 1, . . . , n : Xi = aj }.

◮ Let p̂ be the MLE:

      p̂j = Nj /n ,   j = 1, . . . , K.

  p̂ maximizes log Ln (X1 , . . . , Xn , p) under the constraint
  Σ_{j=1}^K pj = 1.

30/37
The multinomial case: χ2 test (4)

◮ If H0 is true, then √n (p̂ − p0 ) is asymptotically normal, and
  the following holds.

Theorem

      Tn := n Σ_{j=1}^K (p̂j − pj^0 )² / pj^0 −→ χ²_{K−1} in distribution as n → ∞.

◮ χ2 test with asymptotic level α: ψα = 1I{Tn > qα },


where qα is the (1 − α)-quantile of χ2K−1 .
◮ Asymptotic p-value of this test: p-value = IP[Z > Tn | Tn ],
  where Z ∼ χ²_{K−1} and Z ⊥⊥ Tn .
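A worked sketch of the χ² test (not from the slides; the counts are made up): testing whether a die is fair from 120 rolls. The constant 11.07 is the 95% quantile of χ²₅ from standard tables.

```python
counts = [18, 22, 21, 19, 25, 15]   # observed N_j over K = 6 faces
n = sum(counts)                     # 120 rolls
p0 = [1.0 / 6.0] * 6                # H0: uniform distribution

# Tn = n * sum_j (p_hat_j - p0_j)^2 / p0_j
Tn = n * sum((Nj / n - p) ** 2 / p for Nj, p in zip(counts, p0))

q95_chi2_5 = 11.07                  # 95% quantile of chi^2 with K - 1 = 5 df
reject = Tn > q95_chi2_5            # False: no evidence against fairness
```

Here Tn = 3.0, well below the 5% critical value, so H0 is not rejected.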

31/37
The Gaussian case: Student’s test (1)

iid
◮ Let X1 , . . . , Xn ∼ N (µ, σ 2 ), for some unknown
µ ∈ IR, σ 2 > 0 and let µ0 ∈ IR be fixed, given.

◮ We want to test:

      H0 : µ = µ0   vs.   H1 : µ ≠ µ0

  with asymptotic level α ∈ (0, 1).

◮ If σ² is known: Let Tn = √n (X̄n − µ0 )/σ. Then, Tn ∼ N (0, 1)
  and
      ψα = 1I{|Tn | > qα/2 }
  is a test with (non asymptotic) level α.

32/37
The Gaussian case: Student’s test (2)

If σ 2 is unknown:

◮ Let T̃n = √(n − 1) (X̄n − µ0 )/√Sn , where Sn is the sample variance.

◮ Cochran’s theorem:
   ◮ X̄n ⊥⊥ Sn ;
   ◮ nSn /σ² ∼ χ²_{n−1} .

◮ Hence, T̃n ∼ t_{n−1} : Student’s distribution with n − 1 degrees
  of freedom.

33/37
The Gaussian case: Student’s test (3)
◮ Student’s test with (non asymptotic) level α ∈ (0, 1):

      ψα = 1I{|T̃n | > qα/2 },

  where qα/2 is the (1 − α/2)-quantile of t_{n−1} .

◮ If H1 is µ > µ0 , Student’s test with level α ∈ (0, 1) is:

      ψα′ = 1I{T̃n > qα },

  where qα is the (1 − α)-quantile of t_{n−1} .

◮ Advantage of Student’s test:


◮ Non asymptotic
◮ Can be run on small samples

◮ Drawback of Student’s test: It relies on the assumption that


the sample is Gaussian.
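Written with the biased variance Sn = (1/n) Σ (Xi − X̄n)², the statistic T̃n is algebraically identical to the more common form √n (X̄n − µ0)/s with the unbiased s² = (1/(n−1)) Σ (Xi − X̄n)². A quick check on made-up data (not from the slides; assumes numpy):

```python
import numpy as np

x = np.array([101.2, 98.7, 103.4, 99.9, 100.8, 102.1, 97.5, 100.3])
mu0 = 100.0
n = len(x)

Sn = np.mean((x - x.mean()) ** 2)                # biased sample variance (divided by n)
T_tilde = np.sqrt(n - 1) * (x.mean() - mu0) / np.sqrt(Sn)

s = x.std(ddof=1)                                # unbiased standard deviation
t_classic = np.sqrt(n) * (x.mean() - mu0) / s    # textbook one-sample t-statistic
```

Both expressions evaluate to the same number, so either convention can be used with the t_{n−1} quantiles.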
34/37
Two-sample test: large sample case (1)
◮ Consider two samples: X1 , . . . , Xn and Y1 , . . . , Ym , of
independent random variables such that

      IE[X1 ] = · · · = IE[Xn ] = µX   and   IE[Y1 ] = · · · = IE[Ym ] = µY

◮ Assume that the variances are known, so assume (without
  loss of generality) that

      var(X1 ) = · · · = var(Xn ) = var(Y1 ) = · · · = var(Ym ) = 1

◮ We want to test:

      H0 : µX = µY   vs.   H1 : µX ≠ µY

  with asymptotic level α ∈ (0, 1).


35/37
Two-sample test: large sample case (2)
From the CLT:

      √n (X̄n − µX ) −→ N (0, 1) in distribution as n → ∞

and

      √m (Ȳm − µY ) −→ N (0, 1) in distribution as m → ∞,

so, if n/m → γ,

      √n (Ȳm − µY ) −→ N (0, γ) in distribution.

Moreover, the two samples are independent, so

      √n (X̄n − Ȳm ) − √n (µX − µY ) −→ N (0, 1 + γ) in distribution.

Under H0 : µX = µY :

      √n (X̄n − Ȳm )/√(1 + n/m) −→ N (0, 1) in distribution.

Test: ψα = 1I{ √n |X̄n − Ȳm |/√(1 + n/m) > qα/2 }
36/37
Two-sample T-test
◮ If the variances are unknown but we know that
  Xi ∼ N (µX , σX² ), Yi ∼ N (µY , σY² ):

◮ Then
      X̄n − Ȳm ∼ N ( µX − µY , σX²/n + σY²/m )

◮ Under H0 :
      (X̄n − Ȳm )/√(σX²/n + σY²/m) ∼ N (0, 1)

◮ For unknown variances:

      (X̄n − Ȳm )/√(SX²/n + SY²/m) ∼ tN ,

  where N is given by the Welch–Satterthwaite formula

      N = (SX²/n + SY²/m)² / ( SX⁴/(n²(n − 1)) + SY⁴/(m²(m − 1)) )
37/37
Statistics for Applications

Chapter 6: Testing goodness of fit

1/25
Goodness of fit tests

Let X be a r.v. Given i.i.d copies of X we want to answer the


following types of questions:
◮ Does X have distribution N (0, 1)? (Cf. Student’s T
distribution)
◮ Does X have distribution U ([0, 1])? (Cf. p-value under H0 )
◮ Does X have PMF given by p1 = 0.3, p2 = 0.5, p3 = 0.2 ?

These are all goodness of fit tests: we want to know if the


hypothesized distribution is a good fit for the data.

Key characteristic of GoF tests: no parametric modeling.

2/25
Cdf and empirical cdf (1)
Let X1 , . . . , Xn be i.i.d. real random variables. Recall the cdf of
X1 is defined as:

F (t) = IP[X1 ≤ t], ∀t ∈ IR.

It completely characterizes the distribution of X1 .

Definition
The empirical cdf of the sample X1 , . . . , Xn is defined as:
      Fn (t) = (1/n) Σ_{i=1}^n 1{Xi ≤ t} = #{i = 1, . . . , n : Xi ≤ t} / n ,   ∀t ∈ IR.
n

3/25
Cdf and empirical cdf (2)

By the LLN, for all t ∈ IR,

      Fn (t) −→ F (t) a.s. as n → ∞.

Glivenko-Cantelli Theorem (Fundamental theorem of statistics)

      sup_{t∈IR} |Fn (t) − F (t)| −→ 0 a.s. as n → ∞.

4/25
Cdf and empirical cdf (3)

By the CLT, for all t ∈ IR,

      √n (Fn (t) − F (t)) −→ N (0, F (t)(1 − F (t))) in distribution as n → ∞.

Donsker’s Theorem
If F is continuous, then
      √n sup_{t∈IR} |Fn (t) − F (t)| −→ sup_{0≤t≤1} |B(t)| in distribution as n → ∞,

where B is a Brownian bridge on [0, 1].

5/25
Kolmogorov-Smirnov test (1)

◮ Let X1 , . . . , Xn be i.i.d. real random variables with unknown


cdf F and let F 0 be a continuous cdf.

◮ Consider the two hypotheses:

      H0 : F = F0   v.s.   H1 : F ≠ F0 .

◮ Let Fn be the empirical cdf of the sample X1 , . . . , Xn .

◮ If F = F0 , then Fn (t) ≈ F0 (t), for all t ∈ IR.

6/25
Kolmogorov-Smirnov test (2)


◮ Let Tn = √n sup_{t∈IR} |Fn (t) − F0 (t)| .

(d)
◮ By Donsker’s theorem, if H0 is true, then Tn −−−→ Z,
n→∞
where Z has a known distribution (supremum of a Brownian
bridge).

◮ KS test with asymptotic level α:

δαKS = 1{Tn > qα },

where qα is the (1 − α)-quantile of Z (obtained in tables).

◮ p-value of KS test: IP[Z > Tn |Tn ].

7/25
Kolmogorov-Smirnov test (3)

Remarks:
◮ In practice, how to compute Tn ?

◮ F 0 is non decreasing, Fn is piecewise constant, with jumps at


ti = Xi , i = 1, . . . , n.

◮ Let X(1) ≤ X(2) ≤ . . . ≤ X(n) be the reordered sample.

◮ The expression for Tn reduces to the following practical
  formula:

      Tn = √n max_{i=1,...,n} max{ |i/n − F0 (X(i) )| , |(i − 1)/n − F0 (X(i) )| } .
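The practical formula agrees with a brute-force evaluation of the supremum on a fine grid, here with F0 the cdf of U([0, 1]) (a sketch, not from the slides; assumes numpy; the sample is made up):

```python
import numpy as np

x = np.array([0.12, 0.27, 0.31, 0.45, 0.52, 0.61, 0.78, 0.85, 0.91, 0.97])
n = len(x)
xs = np.sort(x)
F0 = lambda t: np.clip(t, 0.0, 1.0)  # cdf of U([0, 1])

# Practical formula: max over the jump points of the empirical cdf
i = np.arange(1, n + 1)
D = np.max(np.maximum(np.abs(i / n - F0(xs)), np.abs((i - 1) / n - F0(xs))))
Tn = np.sqrt(n) * D

# Brute-force approximation of sup_t |F_n(t) - F0(t)| on a fine grid
ts = np.linspace(0.0, 1.0, 200_001)
Fn = np.searchsorted(xs, ts, side="right") / n
D_grid = np.max(np.abs(Fn - F0(ts)))
```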

8/25
Kolmogorov-Smirnov test (4)

◮ Tn is called a pivotal statistic: If H0 is true, the distribution


of Tn does not depend on the distribution of the Xi ’s and it is
easy to reproduce it in simulations.

◮ Indeed, let Ui = F 0 (Xi ), i = 1, . . . , n and let Gn be the


empirical cdf of U1 , . . . , Un .

◮ If H0 is true, then U1 , . . . , Un are i.i.d. ∼ U ([0, 1])

  and Tn = √n sup_{0≤x≤1} |Gn (x) − x| .

9/25
Kolmogorov-Smirnov test (5)

◮ For some large integer M :


◮ Simulate M i.i.d. copies Tn1 , . . . , TnM of Tn ;

◮ Estimate the (1 − α)-quantile qα^(n) of Tn by taking the sample
  (1 − α)-quantile q̂α^(n,M) of Tn^1 , . . . , Tn^M .

◮ Test with approximate level α:

δα = 1{Tn > q̂α(n,M ) }.

◮ Approximate p-value of this test:

      p-value ≈ #{j = 1, . . . , M : Tn^j > Tn } / M .
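The simulation scheme above takes a few lines (a sketch, not from the slides; assumes numpy). For n = 100 the simulated 95% quantile of Tn lands near the asymptotic Kolmogorov value 1.358:

```python
import numpy as np

rng = np.random.default_rng(42)
n, M, alpha = 100, 2000, 0.05

def ks_stat_uniform(u):
    """sqrt(n) * sup |G_n(x) - x| for a sample u of U([0, 1]) variables."""
    us = np.sort(u)
    i = np.arange(1, len(us) + 1)
    D = np.max(np.maximum(np.abs(i / n - us), np.abs((i - 1) / n - us)))
    return np.sqrt(n) * D

sims = np.array([ks_stat_uniform(rng.uniform(size=n)) for _ in range(M)])
q_hat = np.quantile(sims, 1 - alpha)  # approximate (1 - alpha)-quantile of Tn
```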

10/25
Kolmogorov-Smirnov test (6)
These quantiles are often precomputed in a table.

11/25
Other goodness of fit tests

We want to measure the distance between two functions: Fn (t)


and F (t). There are other ways, leading to other tests:
◮ Kolmogorov-Smirnov:

      d(Fn , F ) = sup_{t∈IR} |Fn (t) − F (t)|

◮ Cramér-Von Mises:

      d²(Fn , F ) = ∫_IR [Fn (t) − F (t)]² dt

◮ Anderson-Darling:

      d²(Fn , F ) = ∫_IR [Fn (t) − F (t)]² / ( F (t)(1 − F (t)) ) dt

12/25
Composite goodness of fit tests

What if I want to test: ”Does X have Gaussian distribution?” but


I don’t know the parameters?
Simple idea: plug-in

      sup_{t∈IR} |Fn (t) − Φ_{µ̂,σ̂²} (t)|

where
      µ̂ = X̄n ,   σ̂² = Sn²

and Φ_{µ̂,σ̂²} (t) is the cdf of N (µ̂, σ̂²).

In this case Donsker’s theorem is no longer valid. This is a


common and serious mistake!

13/25
Kolmogorov-Lilliefors test (1)

Instead, we compute the quantiles for the test statistic:

      sup_{t∈IR} |Fn (t) − Φ_{µ̂,σ̂²} (t)|

They do not depend on unknown parameters!

This is the Kolmogorov-Lilliefors test.

14/25
Kolmogorov-Lilliefors test (2)
These quantiles are often precomputed in a table.

15/25
Quantile-Quantile (QQ) plots (1)
◮ Provide a visual way to perform GoF tests
◮ Not a formal test, but a quick and easy check to see if a
  distribution is plausible.
◮ Main idea: we want to check visually if the plot of Fn is close
to that of F or equivalently if the plot of Fn−1 is close to that
of F −1 .
◮ More convenient to check if the points

      (F⁻¹(1/n), Fn⁻¹(1/n)), (F⁻¹(2/n), Fn⁻¹(2/n)), . . . , (F⁻¹((n−1)/n), Fn⁻¹((n−1)/n))

  are near the line y = x.
◮ Fn is not technically invertible but we define

      Fn⁻¹(i/n) = X(i) ,

  the ith smallest observation.


16/25
χ2 goodness-of-fit test, finite case (1)

◮ Let X1 , . . . , Xn be i.i.d. random variables on some finite


space E = {a1 , . . . , aK }, with some probability measure IP.

◮ Let (IPθ )θ∈Θ be a parametric family of probability


distributions on E.

◮ Example: On E = {0, 1, . . . , K}, consider the family of binomial
  distributions (Bin(K, p))p∈(0,1) .

◮ For j = 1, . . . , K and θ ∈ Θ, set

pj (θ) = IPθ [Y = aj ], where Y ∼ IPθ

and
pj = IP[X1 = aj ].

19/25
χ2 goodness-of-fit test, finite case (2)

◮ Consider the two hypotheses:

      H0 : IP ∈ (IPθ )θ∈Θ   v.s.   H1 : IP ∉ (IPθ )θ∈Θ .

◮ Testing H0 means testing whether the statistical model
  (E, (IPθ )θ∈Θ ) fits the data (e.g., whether the data are indeed
  from a binomial distribution).

◮ H0 is equivalent to:

pj = pj (θ), ∀j = 1, . . . , K, for some θ ∈ Θ.

20/25
χ2 goodness-of-fit test, finite case (3)

◮ Let θ̂ be the MLE of θ when assuming H0 is true.

◮ Let

      p̂j = (1/n) Σ_{i=1}^n 1{Xi = aj } = #{i : Xi = aj } / n ,   j = 1, . . . , K.

◮ Idea: If H0 is true, then pj = pj (θ), so both p̂j and pj (θ̂) are
  good estimators of pj . Hence, p̂j ≈ pj (θ̂), ∀j = 1, . . . , K.

◮ Define the test statistic:

      Tn = n Σ_{j=1}^K (p̂j − pj (θ̂))² / pj (θ̂) .

21/25
χ2 goodness-of-fit test, finite case (4)

◮ Under some technical assumptions, if H0 is true, then


(d)
Tn −−−→ χ2K−d−1,
n→∞

where d is the size of the parameter θ (Θ ⊆ IRd and


d < K − 1).

◮ Test with asymptotic level α ∈ (0, 1):

δα = 1{Tn > qα },

where qα is the (1 − α)-quantile of χ2K−d−1 .

◮ p-value: IP[Z > Tn |Tn ], where Z ∼ χ2K−d−1 and Z ⊥⊥ Tn .

22/25
χ2 goodness-of-fit test, infinite case (1)

◮ If E is infinite (e.g. E = IN, E = IR, ...):

◮ Partition E into K disjoint bins:

E = A1 ∪ . . . ∪ AK .
◮ Define, for θ ∈ Θ and j = 1, . . . , K:
   ◮ pj (θ) = IPθ [Y ∈ Aj ], for Y ∼ IPθ ,
   ◮ pj = IP[X1 ∈ Aj ],
   ◮ p̂j = (1/n) Σ_{i=1}^n 1{Xi ∈ Aj } = #{i : Xi ∈ Aj } / n ,
   ◮ θ̂: same as in the previous case.

23/25
χ2 goodness-of-fit test, infinite case (2)
◮ As previously, let

      Tn = n Σ_{j=1}^K (p̂j − pj (θ̂))² / pj (θ̂) .

◮ Under some technical assumptions, if H0 is true, then


(d)
Tn −−−→ χ2K−d−1,
n→∞

where d is the size of the parameter θ (Θ ⊆ IRd and


d < K − 1).

◮ Test with asymptotic level α ∈ (0, 1):

δα = 1{Tn > qα },

where qα is the (1 − α)-quantile of χ2K−d−1 .

24/25
χ2 goodness-of-fit test, infinite case (3)
◮ Practical issues:
◮ Choice of K ?
◮ Choice of the bins A1 , . . . , AK ?
◮ Computation of pj (θ) ?

◮ Example 1: Let E = IN and H0 : IP ∈ (Poiss(λ))λ>0 .

◮ If one expects λ to be no larger than some λmax , one can


choose A1 = {0}, A2 = {1}, . . . , AK−1 = {K − 2}, AK =
{K − 1, K, K + 1, . . .}, with K large enough such that
pK (λmax ) ≈ 0.

25/25
Statistics for Applications

Chapter 7: Regression

1/43
Heuristics of the linear regression (1)
Consider a cloud of i.i.d. random points (Xi , Yi ), i = 1, . . . , n :

2/43
Heuristics of the linear regression (2)

◮ Idea: Fit the best line to the data.

◮ Approximation: Yi ≈ a + bXi , i = 1, . . . , n, for some
  (unknown) a, b ∈ IR.

◮ Find â, b̂ that approach a and b.

◮ More generally: Yi ∈ IR, Xi ∈ IR^d ,

      Yi ≈ a + Xi⊤ b ,   a ∈ IR, b ∈ IR^d .

◮ Goal: Write a rigorous model and estimate a and b.

3/43
Heuristics of the linear regression (3)

Examples:
◮ Economics: Demand and price,

      Di ≈ a + b pi ,   i = 1, . . . , n.

◮ Ideal gas law: P V = nRT ,

      log Pi ≈ a + b log Vi + c log Ti ,   i = 1, . . . , n.

4/43
Linear regression of a r.v. Y on a r.v. X (1)

◮ Let X and Y be two real r.v. (not necessarily independent)
  with two moments and such that Var(X) ≠ 0.

◮ The theoretical linear regression of Y on X is the best
  approximation in quadratic mean of Y by a linear function of
  X, i.e., the r.v. a + bX, where a and b are the two real
  numbers minimizing IE[(Y − a − bX)²].

◮ By some simple algebra:

      b = cov(X, Y )/Var(X) ,

      a = IE[Y ] − b IE[X] = IE[Y ] − (cov(X, Y )/Var(X)) IE[X] .

5/43
Linear regression of a r.v. Y on a r.v. X (2)

◮ If ε = Y − (a + bX), then

      Y = a + bX + ε ,

  with IE[ε] = 0 and cov(X, ε) = 0.

◮ Conversely: Assume that Y = a + bX + ε for some a, b ∈ IR
  and some centered r.v. ε that satisfies cov(X, ε) = 0.

◮ E.g., if X ⊥⊥ ε or if IE[ε|X] = 0, then cov(X, ε) = 0.

◮ Then, a + bX is the theoretical linear regression of Y on X.

6/43
Linear regression of a r.v. Y on a r.v. X (3)
A sample of n i.i.d. random pairs (X1 , Y1 ), . . . , (Xn , Yn ) with the
same distribution as (X, Y ) is available.

We want to estimate a and b.

7/43
Linear regression of a r.v. Y on a r.v. X (4)

Definition
The least squared error (LSE) estimator of (a, b) is the minimizer
of the sum of squared errors:

      Σ_{i=1}^n (Yi − a − bXi )² .

(â, b̂) is given by

      b̂ = ( (1/n) Σ_{i=1}^n Xi Yi − X̄ Ȳ ) / ( (1/n) Σ_{i=1}^n Xi² − (X̄)² ) ,

      â = Ȳ − b̂ X̄ .

12/43
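A quick check (not part of the slides; assumes numpy; the data are made up) that the closed-form (â, b̂) matches numpy's least-squares line fit:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2])

# Closed-form LSE from the slide
b_hat = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean() ** 2)
a_hat = y.mean() - b_hat * x.mean()

slope, intercept = np.polyfit(x, y, 1)  # same fit via numpy's polynomial LS
```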
Linear regression of a r.v. Y on a r.v. X (5)

13/43
Multivariate case (1)

      Yi = Xi⊤ β + εi ,   i = 1, . . . , n.

◮ Vector of explanatory variables or covariates: Xi ∈ IR^p (wlog,
  assume its first coordinate is 1).

◮ Dependent variable: Yi .

◮ β = (a, b⊤ )⊤ ; β1 (= a) is called the intercept.

◮ {εi }i=1,...,n : noise terms satisfying cov(Xi , εi ) = 0.

Definition
The least squared error (LSE) estimator of β is the minimizer of
the sum of squared errors:

      β̂ = argmin_{t∈IR^p} Σ_{i=1}^n (Yi − Xi⊤ t)²
14/43
Multivariate case (2)

LSE in matrix form

◮ Let Y = (Y1 , . . . , Yn )⊤ ∈ IR^n .

◮ Let X be the n × p matrix whose rows are X1⊤ , . . . , Xn⊤ (X is
  called the design matrix).

◮ Let ε = (ε1 , . . . , εn )⊤ ∈ IR^n (unobserved noise), so that

      Y = Xβ + ε .

◮ The LSE β̂ satisfies:

      β̂ = argmin_{t∈IR^p} ‖Y − Xt‖₂² .

15/43
Multivariate case (3)

◮ Assume that rank(X) = p.

◮ Analytic computation of the LSE:

      β̂ = (X⊤ X)⁻¹ X⊤ Y.

Geometric interpretation of the LSE

◮ Xβ̂ is the orthogonal projection of Y onto the subspace
  spanned by the columns of X:

      Xβ̂ = P Y,

  where P = X(X⊤ X)⁻¹ X⊤ .
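The closed form β̂ = (X⊤X)⁻¹X⊤Y can be checked against numpy's least-squares solver on simulated data (a sketch, not part of the slides; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = 1
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]  # same LS problem
```

In practice `lstsq` (or a QR factorization) is preferred numerically, but with rank(X) = p both agree.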

16/43
Linear regression with deterministic design and Gaussian
noise (1)

Assumptions:

◮ The design matrix X is deterministic and rank(X) = p.

◮ The model is homoscedastic: ε1 , . . . , εn are i.i.d.

◮ The noise vector ε is Gaussian:

      ε ∼ Nn (0, σ² In ),

  for some known or unknown σ² > 0.

17/43
Linear regression with deterministic design and Gaussian
noise (2)
◮ LSE = MLE:   β̂ ∼ Np (β, σ² (X⊤ X)⁻¹ ).

◮ Quadratic risk of β̂:   IE[‖β̂ − β‖₂²] = σ² tr((X⊤ X)⁻¹ ).

◮ Prediction error:   IE[‖Y − Xβ̂‖₂²] = σ² (n − p).

◮ Unbiased estimator of σ²:   σ̂² = ‖Y − Xβ̂‖₂² / (n − p).

Theorem

◮ (n − p) σ̂²/σ² ∼ χ²_{n−p} .

◮ β̂ ⊥⊥ σ̂² .
18/43
Significance tests (1)
Test whether the j-th explanatory variable is significant in the
linear regression (1  j  p).

H0 : βj = 0 v.s. H1 : βj = 0.

If γj is the j-th diagonal coefficient of (X⊤X)⁻¹ (γj > 0):

(β̂j − βj) / √(σ̂² γj) ∼ t_{n−p}.

Let Tn(j) = β̂j / √(σ̂² γj).

Test with non asymptotic level α ∈ (0, 1):

δα(j) = 1{ |Tn(j)| > q_{α/2}(t_{n−p}) },

where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.


19/43
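The test above can be sketched in a few lines; a hedged simulation (synthetic data; SciPy is assumed available for the Student quantile):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 0.0, 1.0])   # the second coefficient is truly zero
Y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)  # unbiased variance estimate
gamma = np.diag(XtX_inv)                                 # the gamma_j of the slide

T = beta_hat / np.sqrt(sigma2_hat * gamma)  # T_n^{(j)}, ~ t_{n-p} under H0
alpha = 0.05
q = stats.t.ppf(1 - alpha / 2, df=n - p)    # (1 - alpha/2)-quantile of t_{n-p}
reject = np.abs(T) > q
print(reject)
```

With a true coefficient of 0 the test rejects with probability about α; the large coefficients are rejected essentially always.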
Significance tests (2)

Test whether a group of explanatory variables is significant in


the linear regression.

H0 : βj = 0, ∀j ∈ S v.s. H1 : ∃j ∈ S, βj ≠ 0, where
S ⊆ {1, . . . , p}.

Bonferroni's test: δα^B = max_{j∈S} δ^{(j)}_{α/k} , where k = |S|.

δα^B has non asymptotic level at most α.

20/43
More tests (1)

Let G be a k × p matrix with rank(G) = k (k ≤ p) and λ ∈ IRk.


Consider the hypotheses:

H0 : Gβ = λ v.s. H1 : Gβ ≠ λ.

The setup of the previous slide is a particular case.

If H0 is true, then:

Gβ̂ − λ ∼ Nk(0, σ² G(X⊤X)⁻¹G⊤),

and

σ⁻² (Gβ̂ − λ)⊤ (G(X⊤X)⁻¹G⊤)⁻¹ (Gβ̂ − λ) ∼ χ²_k.

21/43
More tests (2)
Let Sn = (Gβ̂ − λ)⊤ (G(X⊤X)⁻¹G⊤)⁻¹ (Gβ̂ − λ) / (σ̂² k).

If H0 is true, then Sn ∼ F_{k,n−p}.


Test with non asymptotic level α ∈ (0, 1):

δα = 1{Sn > qα(F_{k,n−p})},

where qα(F_{k,n−p}) is the (1 − α)-quantile of F_{k,n−p}.

Definition
The Fisher distribution with p and q degrees of freedom, denoted
by F_{p,q}, is the distribution of (U/p)/(V/q), where:

U ∼ χ²_p , V ∼ χ²_q ,

U ⊥⊥ V.
22/43
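A hedged sketch of this F-test on synthetic data (SciPy assumed for the Fisher quantile; the design and hypothesis are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, k = 80, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.0, 0.0, 2.0])
Y = X @ beta + rng.normal(size=n)

# H0: G beta = lambda, here "beta_2 = beta_3 = 0" (true in this simulation).
G = np.zeros((k, p)); G[0, 1] = 1.0; G[1, 2] = 1.0
lam = np.zeros(k)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

d = G @ beta_hat - lam
Sn = d @ np.linalg.solve(G @ XtX_inv @ G.T, d) / (sigma2_hat * k)
q = stats.f.ppf(1 - 0.05, k, n - p)  # (1 - alpha)-quantile of F_{k, n-p}
print(Sn > q)  # rejects with probability about 5% when H0 is true
```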
Concluding remarks

Linear regression exhibits correlations, NOT causality

Normality of the noise: One can use goodness of fit tests to
test whether the residuals ε̂i = Yi − Xi⊤β̂ are Gaussian.

Deterministic design: If X is not deterministic, all the above


can be understood conditionally on X, if the noise is assumed
to be Gaussian, conditionally on X.

23/43
Linear regression and lack of identifiability (1)
Consider the following model:

Y = Xβ + ε,

with:
1. Y ∈ IRn (dependent variables), X ∈ IRn×p (deterministic
design);
2. β ∈ IRp , unknown;
3. ε ∼ Nn(0, σ² In).

Previously, we assumed that X had rank p, so we could invert


X⊤X.

What if X is not of rank p ? E.g., if p > n ?

β would no longer be identified: estimation of β is vain


(unless we add more structure).
24/43
Linear regression and lack of identifiability (2)

What about prediction ? Xβ is still identified.

Ŷ: orthogonal projection of Y onto the linear span of the


columns of X.

Ŷ = Xβ̂ = X(X⊤X)†X⊤Y, where A† stands for the
(Moore-Penrose) pseudo inverse of a matrix A.

Similarly as before, if k = rank(X):

‖Ŷ − Y‖₂² / σ² ∼ χ²_{n−k},

‖Ŷ − Y‖₂² ⊥⊥ Ŷ.

25/43
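The pseudo-inverse prediction can be sketched with NumPy (synthetic data with p > n; `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 30   # p > n: X cannot have rank p
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Y_hat = X (X^T X)^+ X^T Y: orthogonal projection of Y onto col(X).
Y_hat = X @ np.linalg.pinv(X.T @ X) @ X.T @ Y

# For a generic Gaussian design with p > n, rank(X) = n, so the column
# span of X is all of IR^n and the projection is the identity.
k = np.linalg.matrix_rank(X)
print(k, np.allclose(Y_hat, Y))
```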
Linear regression and lack of identifiability (3)

In particular:

IE[‖Ŷ − Y‖₂²] = (n − k) σ².

Unbiased estimator of the variance:


σ̂² = ‖Ŷ − Y‖₂² / (n − k).

26/43
Linear regression in high dimension (1)
Consider again the following model:

Y = Xβ + ",

with:
1. Y ∈ IRn (dependent variables), X ∈ IRn×p (deterministic
design);
2. β ∈ IRp , unknown: to be estimated;
3. ε ∼ Nn(0, σ² In).
For each i, Xi ∈ IRp is the vector of covariates of the i-th
individual.

If p is too large (p > n), there are too many parameters to be
estimated (the model overfits), although some covariates may
be irrelevant.

Solution: Reduction of the dimension.


27/43
Linear regression in high dimension (2)

Idea: Assume that only a few coordinates of β are nonzero


(but we do not know which ones).

Based on the sample, select a subset of covariates and


estimate the corresponding coordinates of β.

For S ⊆ {1, . . . , p}, let

β̂S ∈ argmin_{t∈IR^S} ‖Y − XS t‖₂²,

where XS is the submatrix of X obtained by keeping only the


covariates indexed in S.

28/43
Linear regression in high dimension (3)

Select a subset S that minimizes the prediction error


penalized by the complexity (or size) of the model:
‖Y − XS β̂S‖₂² + λ|S|,

where λ > 0 is a tuning parameter.

If λ = 2σ̂², this is Mallows' Cp or the AIC criterion.

If λ = σ̂² log n, this is the BIC criterion.

29/43
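For small p, the penalized criterion can be minimized by brute force over all subsets; a hedged sketch (synthetic data; the pilot noise-variance estimate and the AIC-style penalty λ = 2σ̂² are illustrative choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 5
X = rng.normal(size=(n, p))
beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0])  # only coordinates 0 and 3 nonzero
Y = X @ beta + rng.normal(size=n)

sigma2_hat = 1.0       # assumed pilot estimate of the noise variance
lam = 2 * sigma2_hat   # Mallows Cp / AIC; use sigma2_hat * np.log(n) for BIC

def rss(S):
    """Residual sum of squares of the LSE restricted to the covariates in S."""
    if not S:
        return float(Y @ Y)
    XS = X[:, list(S)]
    b, *_ = np.linalg.lstsq(XS, Y, rcond=None)
    return float(np.sum((Y - XS @ b) ** 2))

# Minimize rss(S) + lam * |S| over all 2^p subsets.
best = min((S for r in range(p + 1)
            for S in itertools.combinations(range(p), r)),
           key=lambda S: rss(S) + lam * len(S))
print(best)  # should contain the true support {0, 3}
```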
Linear regression in high dimension (4)
Each of these criteria is equivalent to finding b ∈ IRp that
minimizes:
‖Y − Xb‖₂² + λ‖b‖₀,
where ‖b‖₀ is the number of nonzero coefficients of b.

This is a computationally hard problem: nonconvex and it
requires computing 2^p estimators (all the β̂S , for
S ⊆ {1, . . . , p}).

Lasso estimator:

replace ‖b‖₀ = ∑_{j=1}^p 1I{bj ≠ 0} with ‖b‖₁ = ∑_{j=1}^p |bj|,

and the problem becomes convex:

β̂^L ∈ argmin_{b∈IRp} ‖Y − Xb‖₂² + λ‖b‖₁,

where λ > 0 is a tuning parameter.
30/43
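One standard way to solve this convex problem is proximal gradient descent (ISTA), where the ℓ1 penalty yields a soft-thresholding step; a sketch on synthetic data (the value λ = 5 and all sizes are illustrative, not prescribed by the slides):

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=2000):
    """Minimize ||Y - Xb||_2^2 + lam * ||b||_1 by proximal gradient (ISTA):
    a gradient step on the smooth part, then soft-thresholding, which is
    the proximal operator of the l1 penalty."""
    n, p = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ b - Y)
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(5)
n, p = 50, 100                              # high-dimensional: p > n
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]   # sparse truth
X = rng.normal(size=(n, p))
Y = X @ beta + 0.1 * rng.normal(size=n)

b_hat = lasso_ista(X, Y, lam=5.0)
print(np.count_nonzero(b_hat))  # many coordinates are exactly zero
```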
Linear regression in high dimension (5)

How to choose λ ?

This is a difficult question (see grad course 18.657:
"High-dimensional statistics" in Spring 2017).

A good choice of λ will lead to an estimator β̂ that is very
close to β and will allow the recovery, with high probability, of
the subset S* of all j ∈ {1, . . . , p} for which βj ≠ 0.

31/43
Linear regression in high dimension (6)

32/43
Nonparametric regression (1)

In the linear setup, we assumed that Yi = Xi⊤β + εi , where the
Xi are deterministic.

This has to be understood as working conditionally on the


design.

This amounts to assuming that IE[Yi |Xi ] is a linear function of
Xi , which is not true in general.

Let f (x) = IE[Yi |Xi = x], x ∈ IRp : How can we estimate the
function f ?

33/43
Nonparametric regression (2)

Let p = 1 in the sequel.


One can make a parametric assumption on f .

E.g., f (x) = a + bx, f (x) = a + bx + cx2 , f (x) = ea+bx , ...

The problem reduces to the estimation of a finite number of


parameters.

LSE, MLE, all the previous theory for the linear case could be
adapted.

What if we do not make any such parametric assumption on f ?

34/43
Nonparametric regression (3)

Assume f is smooth enough: f can be well approximated by a


piecewise constant function.

Idea: Local averages.

For x ∈ IR: f (t) ≈ f (x) for t close to x.

For all i such that Xi is close enough to x,

Yi ≈ f (x) + εi .

Estimate f (x) by the average of all Yi ’s for which Xi is close


enough to x.

35/43
Nonparametric regression (4)

Let h > 0: the window’s size (or bandwidth).

Let Ix = {i = 1, . . . , n : |Xi − x| < h}.

Let fˆn,h (x) be the average of {Yi : i 2 Ix }.


f̂n,h(x) = (1/|Ix|) ∑_{i∈Ix} Yi  if Ix ≠ ∅,  and f̂n,h(x) = 0 otherwise.

36/43
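The local-average estimator above can be sketched directly (synthetic sample; the regression function f(x) = x(1 − x) is the one used in the slides' examples, the noise level is illustrative):

```python
import numpy as np

def f_hat(x, X, Y, h):
    """Local-average estimator: mean of the Y_i with |X_i - x| < h (0 if none)."""
    mask = np.abs(X - x) < h
    return Y[mask].mean() if mask.any() else 0.0

rng = np.random.default_rng(6)
n = 100
X = rng.uniform(size=n)

def f(x):
    return x * (1 - x)  # true regression function

Y = f(X) + 0.05 * rng.normal(size=n)

print(round(f_hat(0.5, X, Y, h=0.2), 3))  # close to f(0.5) = 0.25
```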
Nonparametric regression (5)
(Figure: scatter plot of the sample (Xi , Yi).)
37/43
Nonparametric regression (6)
(Figure: the same scatter plot; for x = 0.6 and h = 0.1, the local average is f̂(x) ≈ 0.27.)

38/43
Nonparametric regression (7)

How to choose h ?

If h → 0: overfitting the data;

If h → ∞: underfitting, f̂n,h(x) = Ȳn .

39/43
Nonparametric regression (8)
Example:
n = 100, f (x) = x(1 − x),
h = .005.
(Figure: the sample and the estimator f̂n,h ; with h = .005 the estimator overfits the data.)
40/43
Nonparametric regression (9)
Example:
n = 100, f (x) = x(1 − x),
h = 1.
(Figure: with h = 1, f̂n,h is constant, equal to Ȳn : underfitting.)
41/43
Nonparametric regression (10)
Example:
n = 100, f (x) = x(1 − x),
h = .2.
(Figure: with h = .2, f̂n,h follows the regression function f well.)
42/43
Nonparametric regression (11)

Choice of h ?

If the smoothness of f is known (i.e., the quality of the local
approximation of f by piecewise constant functions): there is
a good choice of h depending on that smoothness.

If the smoothness of f is unknown: Other techniques, e.g.


cross validation.

43/43
MIT OpenCourseWare
https://ocw.mit.edu

Statistics for Applications
Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
Statistics for Applications

Chapter 8: Bayesian Statistics

1/17
The Bayesian approach (1)

◮ So far, we have studied the frequentist approach of statistics.

◮ The frequentist approach:


◮ Observe data
◮ These data were generated randomly (by Nature, by
measurements, by designing a survey, etc...)
◮ We made assumptions on the generating process (e.g., i.i.d.,
Gaussian data, smooth density, linear regression function,
etc...)
◮ The generating process was associated to some object of
interest (e.g., a parameter, a density, etc...)
◮ This object was unknown but fixed and we wanted to find it:
we either estimated it or tested a hypothesis about this object,
etc...

2/17
The Bayesian approach (2)

◮ Now, we still observe data, assumed to be randomly generated


by some process. Under some assumptions (e.g., parametric
distribution), this process is associated with some fixed object.

◮ We have a prior belief about it.

◮ Using the data, we want to update that belief and transform


it into a posterior belief.

3/17
The Bayesian approach (3)
Example
◮ Let p be the proportion of women in the population.

◮ Sample n people randomly with replacement in the population


and denote by X1 , . . . , Xn their gender (1 for woman, 0
otherwise).

◮ In the frequentist approach, we estimated p (using the MLE),


we constructed some confidence interval for p, we did
hypothesis testing (e.g., H0 : p = .5 v.s. H1 : p ≠ .5).

◮ Before analyzing the data, we may believe that p is likely to


be close to 1/2.

◮ The Bayesian approach is a tool to:


1. include mathematically our prior belief in statistical procedures.
2. update our prior belief using the data.
4/17
The Bayesian approach (4)
Example (continued)

◮ Our prior belief about p can be quantified:

◮ E.g., we are 90% sure that p is between .4 and .6, 95% that it
is between .3 and .8, etc...

◮ Hence, we can model our prior belief using a distribution for


p, as if p was random.

◮ In reality, the true parameter is not random ! However, the


Bayesian approach is a way of modeling our belief about the
parameter by doing as if it was random.

◮ E.g., p ∼ B(a, a) (Beta distribution) for some a > 0.

◮ This distribution is called the prior distribution.


5/17
The Bayesian approach (5)
Example (continued)

◮ In our statistical experiment, X1 , . . . , Xn are assumed to be


i.i.d. Bernoulli r.v. with parameter p conditionally on p.

◮ After observing the available sample X1 , . . . , Xn , we can


update our belief about p by taking its distribution
conditionally on the data.

◮ The distribution of p conditionally on the data is called the


posterior distribution.

◮ Here, the posterior distribution is


B( a + ∑_{i=1}^n Xi , a + n − ∑_{i=1}^n Xi ).

6/17
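The Beta-Bernoulli update above can be sketched numerically (synthetic coin flips; the values of a, p and n are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
a = 2.0          # prior Beta(a, a), symmetric around 1/2
p_true = 0.55
n = 200
X = rng.binomial(1, p_true, size=n)

# Posterior: Beta(a + sum X_i, a + n - sum X_i)
s = int(X.sum())
post_a, post_b = a + s, a + n - s
post_mean = post_a / (post_a + post_b)   # = (a + s) / (2a + n)
print(round(post_mean, 3))
```

The prior pulls the posterior mean slightly toward 1/2; its influence vanishes as n grows.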
The Bayes rule and the posterior distribution (1)

◮ Consider a probability distribution on a parameter space Θ


with some pdf π(·): the prior distribution.

◮ Let X1 , . . . , Xn be a sample of n random variables.

◮ Denote by pn (·|θ) the joint pdf of X1 , . . . , Xn conditionally


on θ, where θ ∼ π.

◮ Usually, one assumes that X1 , . . . , Xn are i.i.d. conditionally


on θ.

◮ The conditional distribution of θ given X1 , . . . , Xn is called


the posterior distribution. Denote by π(·|X1 , . . . , Xn ) its pdf.

7/17
The Bayes rule and the posterior distribution (2)

◮ Bayes’ formula states that:

π(θ|X1 , . . . , Xn ) ∝ π(θ)pn (X1 , . . . , Xn |θ), ∀θ ∈ Θ.

◮ The constant does not depend on θ:

π(θ|X1 , . . . , Xn) = π(θ) pn(X1 , . . . , Xn |θ) / ∫_Θ pn(X1 , . . . , Xn |t) dπ(t), ∀θ ∈ Θ.

8/17
The Bayes rule and the posterior distribution (3)
In the previous example:

◮ π(p) ∝ pa−1 (1 − p)a−1 , p ∈ (0, 1).

◮ Given p, X1 , . . . , Xn are i.i.d. Ber(p), so

pn(X1 , . . . , Xn |p) = p^{∑_{i=1}^n Xi} (1 − p)^{n − ∑_{i=1}^n Xi}.

◮ Hence,

π(p|X1 , . . . , Xn) ∝ p^{a−1+∑_{i=1}^n Xi} (1 − p)^{a−1+n−∑_{i=1}^n Xi}.

◮ The posterior distribution is

B( a + ∑_{i=1}^n Xi , a + n − ∑_{i=1}^n Xi ).

9/17
Non informative priors (1)

◮ Idea: In case of ignorance, or of lack of prior information, one


may want to use a prior that is as little informative as
possible.

◮ Good candidate: π(θ) ∝ 1, i.e., constant pdf on Θ.

◮ If Θ is bounded, this is the uniform prior on Θ.

◮ If Θ is unbounded, this does not define a proper pdf on Θ !

◮ An improper prior on Θ is a measurable, nonnegative function


π(·) defined on Θ that is not integrable.

◮ In general, one can still define a posterior distribution using an


improper prior, using Bayes’ formula.

10/17
Non informative priors (2)
Examples:

◮ If p ∼ U(0, 1) and, given p, X1 , . . . , Xn are i.i.d. Ber(p):

π(p|X1 , . . . , Xn) ∝ p^{∑_{i=1}^n Xi} (1 − p)^{n−∑_{i=1}^n Xi},

i.e., the posterior distribution is

B( 1 + ∑_{i=1}^n Xi , 1 + n − ∑_{i=1}^n Xi ).
◮ If π(θ) = 1, ∀θ ∈ IR and, given θ, X1 , . . . , Xn are i.i.d. N(θ, 1):

π(θ|X1 , . . . , Xn) ∝ exp( −(1/2) ∑_{i=1}^n (Xi − θ)² ),

i.e., the posterior distribution is

N( X̄n , 1/n ).
11/17
Non informative priors (3)

◮ Jeffreys prior:

πJ(θ) ∝ √( det I(θ) ),

where I(θ) is the Fisher information matrix of the statistical
model associated with X1 , . . . , Xn in the frequentist approach
(provided it exists).

◮ In the previous examples:


1
◮ Ex. 1: πJ (p) ∝ √ , p ∈ (0, 1): the prior is B(1/2, 1/2).
p(1−p)

◮ Ex. 2: πJ (θ) ∝ 1, θ ∈ IR is an improper prior.

12/17
Non informative priors (4)

◮ Jeffreys prior satisfies a reparametrization invariance principle:


If η is a reparametrization of θ (i.e., η = φ(θ) for some
one-to-one map φ), then the pdf π̃(·) of η satisfies:
π̃(η) ∝ √( det Ĩ(η) ),

where Ĩ(η) is the Fisher information of the statistical model
parametrized by η instead of θ.

13/17
Bayesian confidence regions

◮ For α ∈ (0, 1), a Bayesian confidence region with level α is a


random subset R of the parameter space Θ, which depends
on the sample X1 , . . . , Xn , such that:

IP[θ ∈ R|X1 , . . . , Xn ] = 1 − α.

◮ Note that R depends on the prior π(·).

◮ "Bayesian confidence region" and "confidence interval" are
two distinct notions.

14/17
Bayesian estimation (1)
◮ The Bayesian framework can also be used to estimate the true
underlying parameter (hence, in a frequentist approach).

◮ In this case, the prior distribution does not reflect a prior


belief: It is just an artificial tool used in order to define a new
class of estimators.

◮ Back to the frequentist approach: The sample


X1 , . . . , Xn is associated with a statistical model
(E, (IPθ )θ∈Θ ).

◮ Define a distribution (that can be improper) with pdf π on


the parameter space Θ.

◮ Compute the posterior pdf π(·|X1 , . . . , Xn ) associated with π,


seen as a prior distribution.
15/17
Bayesian estimation (2)

◮ Bayes estimator:

θ̂(π) = ∫_Θ θ dπ(θ|X1 , . . . , Xn) :

This is the posterior mean.

◮ The Bayesian estimator depends on the choice of the prior


distribution π (hence the superscript π).

16/17
Bayesian estimation (3)
◮ In the previous examples:
◮ Ex. 1 with prior B(a, a) (a > 0):

p̂(π) = ( a + ∑_{i=1}^n Xi ) / (2a + n) = ( a/n + X̄n ) / ( 2a/n + 1 ).

In particular, for a = 1/2 (Jeffreys prior),

p̂(πJ) = ( 1/(2n) + X̄n ) / ( 1/n + 1 ).

◮ Ex. 2: θ̂(πJ ) = X̄n .


◮ In each of these examples, the Bayes estimator is consistent
and asymptotically normal.

◮ In general, the asymptotic properties of the Bayes estimator


do not depend on the choice of the prior.
17/17
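A quick simulation consistent with the consistency remark, comparing the Jeffreys-prior Bayes estimator p̂(πJ) with the MLE X̄n as n grows (sample sizes and p are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
p_true = 0.3
for n in (10, 100, 10000):
    X = rng.binomial(1, p_true, size=n)
    xbar = X.mean()                                    # MLE
    p_jeffreys = (1 / (2 * n) + xbar) / (1 / n + 1)    # posterior mean, Beta(1/2,1/2) prior
    print(n, round(xbar, 4), round(p_jeffreys, 4))
```

The two estimators differ by at most 1/(2n), so their asymptotic behavior is the same.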
Statistics for Applications

Chapter 9: Principal Component Analysis (PCA)

1/16
Multivariate statistics and review of linear algebra (1)

◮ Let X be a d-dimensional random vector and X1 , . . . , Xn be
n independent copies of X.

◮ Write Xi = (Xi1 , . . . , Xid)⊤, i = 1, . . . , n.

◮ Denote by X the random n × d matrix whose rows are
X1⊤ , . . . , Xn⊤.
2/16
Multivariate statistics and review of linear algebra (2)

◮ Assume that E[‖X‖₂²] < ∞.

◮ Mean of X:
E[X] = (E[X^1], . . . , E[X^d])⊤.

◮ Covariance matrix of X: the matrix Σ = (σj,k)j,k=1,...,d , where
σj,k = cov(X^j, X^k).

◮ It is easy to see that

Σ = E[XX⊤] − E[X]E[X]⊤ = E[ (X − E[X])(X − E[X])⊤ ].

3/16
Multivariate statistics and review of linear algebra (3)

◮ Empirical mean of X1 , . . . , Xn :

X̄ = (1/n) ∑_{i=1}^n Xi = (X̄^1, . . . , X̄^d)⊤.

◮ Empirical covariance of X1 , . . . , Xn : the matrix
S = (sj,k)j,k=1,...,d , where sj,k is the empirical covariance of
the Xij , Xik , i = 1, . . . , n.

◮ It is easy to see that

S = (1/n) ∑_{i=1}^n Xi Xi⊤ − X̄X̄⊤ = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)⊤.

4/16
Multivariate statistics and review of linear algebra (4)

◮ Note that X̄ = (1/n) X⊤1, where 1 = (1, . . . , 1)⊤ ∈ IRⁿ.

◮ Note also that

S = (1/n) X⊤X − (1/n²) X⊤11⊤X = (1/n) X⊤HX,

where H = In − (1/n) 11⊤.

◮ H is an orthogonal projector: H² = H, H⊤ = H. (on what
subspace ?)

◮ If u ∈ IR^d,
◮ u⊤Σu is the variance of u⊤X;
◮ u⊤Su is the sample variance of u⊤X1 , . . . , u⊤Xn .

5/16
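The equivalent expressions for S above can be checked to agree numerically; a NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 500, 3
X = rng.normal(size=(n, d)) @ np.array([[1.0, 0.5, 0.0],
                                        [0.0, 1.0, 0.3],
                                        [0.0, 0.0, 1.0]])

xbar = X.mean(axis=0)

# (1/n) sum_i X_i X_i^T - xbar xbar^T
S1 = X.T @ X / n - np.outer(xbar, xbar)

# centered form: (1/n) sum_i (X_i - xbar)(X_i - xbar)^T
Xc = X - xbar
S2 = Xc.T @ Xc / n

# projector form: (1/n) X^T H X with H = I_n - (1/n) 1 1^T
H = np.eye(n) - np.ones((n, n)) / n
S3 = X.T @ H @ X / n

print(np.allclose(S1, S2), np.allclose(S1, S3))  # True True
```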
Multivariate statistics and review of linear algebra (5)

◮ In particular, u⊤Su measures how spread (i.e., diverse) the
points are in direction u.

◮ If u⊤Su = 0, then all Xi 's are in an affine subspace
orthogonal to u.

◮ If u⊤Σu = 0, then X is almost surely in an affine subspace
orthogonal to u.

◮ If u⊤Su is large with ‖u‖₂ = 1, then the direction of u
explains well the spread (i.e., diversity) of the sample.

6/16
Multivariate statistics and review of linear algebra (6)
◮ In particular, Σ and S are symmetric, positive semi-definite.

◮ Any real symmetric matrix A ∈ IR^{d×d} has the decomposition

A = PDP⊤,

where:
◮ P is a d × d orthogonal matrix, i.e., PP⊤ = P⊤P = Id ;
◮ D is diagonal.

◮ The diagonal elements of D are the eigenvalues of A and the
columns of P are the corresponding eigenvectors of A.

◮ A is positive semi-definite iff all its eigenvalues are
nonnegative.
7/16
Principal Component Analysis: Heuristics (1)

◮ The sample X1 , . . . , Xn makes a cloud of points in IR^d.

◮ In practice, d is large. If d > 3, it becomes impossible to
represent the cloud on a picture.

◮ Question: Is it possible to project the cloud onto a linear
subspace of dimension d′ < d by keeping as much information
as possible ?

◮ Answer: PCA does this by keeping as much covariance
structure as possible by keeping orthogonal directions that
discriminate well the points of the cloud.

8/16
Principal Component Analysis: Heuristics (2)
◮ Idea: Write S = PDP⊤, where
◮ P = (v1 , . . . , vd) is an orthogonal matrix, i.e.,
‖vj‖₂ = 1, vj⊤vk = 0, ∀j ≠ k;
◮ D = Diag(λ1 , . . . , λd), with λ1 ≥ λ2 ≥ . . . ≥ λd ≥ 0.

◮ Note that D is the empirical covariance matrix of the
P⊤Xi 's, i = 1, . . . , n.

◮ In particular, λ1 is the empirical variance of the v1⊤Xi 's; λ2 is
the empirical variance of the v2⊤Xi 's, etc...
9/16
Principal Component Analysis: Heuristics (3)

◮ So, each λj measures the spread of the cloud in the direction
vj .

◮ In particular, v1 is the direction of maximal spread.

◮ Indeed, v1 maximizes the empirical covariance of
a⊤X1 , . . . , a⊤Xn over a ∈ IR^d such that ‖a‖₂ = 1.

◮ Proof: For any unit vector a, show that

a⊤Sa = (P⊤a)⊤ D (P⊤a) ≤ λ1,

with equality if a = v1.

10/16
Principal Component Analysis: Main principle
◮ Idea of the PCA: Find the collection of orthogonal directions
in which the cloud is most spread out.

Theorem

v1 ∈ argmax_{‖u‖=1} u⊤Su,
v2 ∈ argmax_{‖u‖=1, u⊥v1} u⊤Su,
· · ·
vd ∈ argmax_{‖u‖=1, u⊥vj, j=1,...,d−1} u⊤Su.

◮ Hence, the k orthogonal directions in which the cloud is the
most spread out correspond exactly to the eigenvectors
associated with the k largest eigenvalues of S.
11/16
Principal Component Analysis: Algorithm (1)

1. Input: X1 , . . . , Xn : cloud of n points in dimension d.

2. Step 1: Compute the empirical covariance matrix.

3. Step 2: Compute the decomposition S = P DP T , where


D = Diag(λ1 , . . . , λd ), with λ1 ≥ λ2 ≥ . . . ≥ λd and
P = (v1 , . . . , vd ) is an orthogonal matrix.

4. Step 3: Choose k < d and set Pk = (v1 , . . . , vk ) ∈ Rd×k .

5. Output: Y1 , . . . , Yn , where

Yi = PkT Xi ∈ Rk , i = 1, . . . , n.

Question: How to choose k ?

12/16
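The algorithm above can be sketched with NumPy's symmetric eigendecomposition (`np.linalg.eigh`); the cloud below is synthetic, and the points are centered first (a common preprocessing step, assumed here):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k leading eigenvectors of the
    empirical covariance matrix S (principal components)."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    eigvals, eigvecs = np.linalg.eigh(S)   # returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # sort descending
    lam, P = eigvals[order], eigvecs[:, order]
    return Xc @ P[:, :k], lam

rng = np.random.default_rng(10)
# A cloud stretched along one axis: the first PC captures most of the variance.
X = rng.normal(size=(300, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
scores, lam = pca(X, k=1)
print(lam[0] > lam[1], round(lam[0] / lam.sum(), 2))
```

The ratio λ1/(λ1 + λ2) is the "variance explained" criterion of the next slide.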
Principal Component Analysis: Algorithm (2)
Question: How to choose k ?
◮ Experimental rule: Take k where there is an inflection point in
the sequence λ1 , . . . , λd (scree plot).

◮ Define a criterion: Take k such that

(λ1 + . . . + λk) / (λ1 + . . . + λd) ≥ 1 − α,

for some α ∈ (0, 1) that determines the approximation error
that the practitioner wants to achieve.

◮ Remark: λ1 + . . . + λk is called the variance explained by the
PCA and λ1 + . . . + λd = Tr(S) is the total variance.

◮ Data visualization: Take k = 2 or 3.

13/16
Example: Expression of 500,000 genes among 1400
Europeans

Reprinted by permission from


Macmillan Publishers Ltd: Nature.
Source: John Novembre, et al. "Genes
mirror geography within Europe."
Nature 456 (2008): 98-101. © 2008.
14/16
Principal Component Analysis - Beyond practice (1)
◮ PCA is an algorithm that reduces the dimension of a cloud of
points and keeps its covariance structure as much as possible.

◮ In practice this algorithm is used for clouds of points that are
not necessarily random.

◮ In statistics, PCA can be used for estimation.

◮ If X1 , . . . , Xn are i.i.d. random vectors in IR^d, how to
estimate their population covariance matrix Σ ?

◮ If n ≫ d, then the empirical covariance matrix S is a
consistent estimator.

◮ In many applications, n ≪ d (e.g., gene expression). Solution:
sparse PCA.
15/16
Principal Component Analysis - Beyond practice (2)
◮ It may be known beforehand that Σ has (almost) low rank.

◮ Then, run PCA on S: Write S ≈ S′, where

S′ = P Diag(λ1 , . . . , λk , 0, . . . , 0) P⊤.

◮ S′ will be a better estimator of Σ under the low-rank
assumption.

◮ A theoretical analysis would lead to an optimal choice of the
tuning parameter k.
16/16
Statistics for Applications

Chapter 10: Generalized Linear Models (GLMs)

1/52
Linear model

A linear model assumes

Y |X ∼ N (µ(X), σ 2 I),

and

IE(Y |X) = µ(X) = X⊤β.

2/52
Components of a linear model

The two components (that we are going to relax) are


1. Random component: the response variable Y |X is continuous
and normally distributed with mean µ = µ(X) = IE(Y |X).

2. Link: between the random and covariates


X = (X (1) , X (2) , · · · , X (p) )⊤ : µ(X) = X ⊤ β.

3/52
Generalization

A generalized linear model (GLM) generalizes normal linear


regression models in the following directions.
1. Random component:

Y ∼ some exponential family distribution

2. Link: between the random and covariates:

g µ(X) = X ⊤ β
� �

where g called link function and µ = IE(Y |X).

4/52
Example 1: Disease Occuring Rate

In the early stages of a disease epidemic, the rate at which new


cases occur can often increase exponentially through time. Hence,
if µi is the expected number of new cases on day ti , a model of the
form
µi = γ exp(δti )
seems appropriate.
◮ Such a model can be turned into GLM form, by using a log
link so that

log(µi ) = log(γ) + δti = β0 + β1 ti .

◮ Since this is a count, the Poisson distribution (with expected


value µi ) is probably a reasonable distribution to try.

5/52
Example 2: Prey Capture Rate(1)

The rate of capture of preys, yi , by a hunting animal, tends to


increase with increasing density of prey, xi , but to eventually level
off, when the predator is catching as much as it can cope with.
A suitable model for this situation might be
αxi
µi = ,
h + xi
where α represents the maximum capture rate, and h represents
the prey density at which the capture rate is half the maximum
rate.

6/52
Example 2: Prey Capture Rate (2)

(Figure: capture rate µ as a function of prey density x, increasing and leveling off at the maximum rate α.)
7/52
Example 2: Prey Capture Rate (3)

◮ Obviously this model is non-linear in its parameters, but, by


using a reciprocal link, the right-hand side can be made linear
in the parameters,

g(µi) = 1/µi = 1/α + (h/α)(1/xi) = β0 + β1 (1/xi).
◮ The standard deviation of capture rate might be
approximately proportional to the mean rate, suggesting the
use of a Gamma distribution for the response.

8/52
Example 3: Kyphosis Data

The Kyphosis data consist of measurements on 81 children


following corrective spinal surgery. The binary response variable,
Kyphosis, indicates the presence or absence of a postoperative
deformity. The three covariates are Age of the child in months,
Number of the vertebrae involved in the operation, and the Start
of the range of the vertebrae involved.
◮ The response variable is binary so there is no choice: Y |X is
Bernoulli with expected value µ(X) ∈ (0, 1).
◮ We cannot write
µ(X) = X ⊤ β
because the right-hand side ranges through IR.
◮ We need an invertible function f such that f (X ⊤ β) ∈ (0, 1)

9/52
GLM: motivation

◮ clearly, normal LM is not appropriate for these examples;


◮ need a more general regression framework to account for
various types of response data
◮ Exponential family distributions
◮ develop methods for model fitting and inferences in this
framework
◮ Maximum Likelihood estimation.

10/52
Exponential Family

A family of distribution {Pθ : θ ∈ Θ}, Θ ⊂ IRk is said to be a


k-parameter exponential family on IRq , if there exist real valued
functions:
◮ η1 , η2 , · · · , ηk and B of θ,

◮ T1 , T2 , · · · , Tk , and h of x ∈ IRq such that the density


function (pmf or pdf) of Pθ can be written as

pθ(x) = exp[ ∑_{i=1}^k ηi(θ) Ti(x) − B(θ) ] h(x)

11/52
Normal distribution example
◮ Consider X ∼ N (µ, σ 2 ), θ = (µ, σ 2 ). The density is
pθ(x) = exp( (µ/σ²) x − (1/(2σ²)) x² − µ²/(2σ²) ) · 1/(σ√(2π)),

which forms a two-parameter exponential family with

η1 = µ/σ² , η2 = −1/(2σ²) , T1(x) = x, T2(x) = x²,
B(θ) = µ²/(2σ²) + log(σ√(2π)) , h(x) = 1.

◮ When σ 2 is known, it becomes a one-parameter exponential
family on IR:
η = µ/σ² , T(x) = x, B(θ) = µ²/(2σ²) , h(x) = e^{−x²/(2σ²)} / (σ√(2π)).
12/52
Examples of discrete distributions

The following distributions form discrete exponential families of


distributions with pmf

◮ Bernoulli(p): p^x (1 − p)^{1−x} , x ∈ {0, 1};

◮ Poisson(λ): (λ^x / x!) e^{−λ} , x = 0, 1, . . . .

13/52
Examples of Continuous distributions
The following distributions form continuous exponential families of
distributions with pdf:
◮ Gamma(a, b): (1/(Γ(a) b^a)) x^{a−1} e^{−x/b} ;
◮ above: a: shape parameter, b: scale parameter;
◮ reparametrize with µ = ab, the mean parameter:
(1/Γ(a)) (a/µ)^a x^{a−1} e^{−ax/µ} .

◮ Inverse Gamma(α, β): (β^α / Γ(α)) x^{−α−1} e^{−β/x} .

◮ Inverse Gaussian(µ, σ²): √( σ²/(2πx³) ) e^{−σ²(x−µ)²/(2µ²x)} .
Others: Chi-square, Beta, Binomial, Negative binomial
distributions.
14/52
Components of GLM

1. Random component:

Y ∼ some exponential family distribution

2. Link: between the random and covariates:

g µ(X) = X ⊤ β
� �

where g called link function and µ(X) = IE(Y |X).

15/52
One-parameter canonical exponential family

◮ Canonical exponential family for k = 1, y ∈ IR


fθ(y) = exp( (yθ − b(θ))/φ + c(y, φ) )

for some known functions b(·) and c(·, ·) .

◮ If φ is known, this is a one-parameter exponential family with


θ being the canonical parameter .
◮ If φ is unknown, this may/may not be a two-parameter
exponential family. φ is called dispersion parameter.
◮ In this class, we always assume that φ is known.

16/52
Normal distribution example

◮ Consider the following Normal density function with known


variance σ 2 ,
1 (y−µ)2
fθ (y) = √ e− 2σ2
σ 2π
yµ − 12 µ2 1 y2
� ( )�
2
= exp − + log(2πσ ) ,
σ2 2 σ2

θ2
◮ Therefore θ = µ, φ = σ 2 , , b(θ) = 2 , and

1 y2
c(y, φ) = − ( + log(2πφ)).
2 φ

17/52
Other distributions

Table 1: Exponential Family

              Normal                       Poisson           Bernoulli
Notation      N(µ, σ²)                     P(µ)              B(p)
Range of y    (−∞, ∞)                      {0, 1, 2, . . .}  {0, 1}
φ             σ²                           1                 1
b(θ)          θ²/2                         e^θ               log(1 + e^θ)
c(y, φ)       −(1/2)(y²/φ + log(2πφ))      −log y!           0

18/52
Likelihood

Let ℓ(θ) = log fθ (Y ) denote the log-likelihood function.


The mean IE(Y ) and the variance var(Y ) can be derived from the
following identities
◮ First identity:

IE( ∂ℓ/∂θ ) = 0,

◮ Second identity:

IE( ∂²ℓ/∂θ² ) + IE[ (∂ℓ/∂θ)² ] = 0.

Obtained from ∫ fθ(y) dy ≡ 1.

19/52
Expected value

Note that
ℓ(θ) = (Y θ − b(θ))/φ + c(Y ; φ).

Therefore
∂ℓ/∂θ = (Y − b′(θ))/φ.

It yields
0 = IE( ∂ℓ/∂θ ) = (IE(Y) − b′(θ))/φ,

which leads to
IE(Y) = µ = b′(θ).

20/52
Variance

On the other hand, we have

∂²ℓ/∂θ² + (∂ℓ/∂θ)² = −b″(θ)/φ + ( (Y − b′(θ))/φ )²,

and from the previous result,

(Y − b′(θ))/φ = (Y − IE(Y))/φ.

Together with the second identity, this yields

0 = −b″(θ)/φ + var(Y)/φ²,

which leads to

var(Y) = V(Y) = b″(θ) φ.

21/52
Example: Poisson distribution

Example: Consider a Poisson likelihood,

f(y) = (µ^y / y!) e^{−µ} = e^{y log µ − µ − log(y!)}.

Thus,
θ = log µ, φ = 1, b(θ) = e^θ , c(y, φ) = −log(y!),

and therefore
IE(Y) = b′(θ) = e^θ = µ, var(Y) = b″(θ) = e^θ = µ.

22/52
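A quick numerical check of these identities for the Poisson family (simulation; the values of µ and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
mu = 4.0
theta = np.log(mu)   # canonical parameter
# b(theta) = e^theta, so b'(theta) = b''(theta) = e^theta = mu:
# the theory predicts E(Y) = mu and var(Y) = mu (phi = 1).
Y = rng.poisson(mu, size=200000)
print(round(Y.mean(), 2), round(Y.var(), 2))  # both close to mu = 4
```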
Link function

◮ β is the parameter of interest, and needs to appear somehow


in the likelihood function to use maximum likelihood.
◮ A link function g relates the linear predictor X ⊤ β to the mean
parameter µ,
X ⊤ β = g(µ).
◮ g is required to be monotone increasing and differentiable

µ = g−1 (X ⊤ β).

23/52
Examples of link functions

◮ For LM, g(·) = identity.


◮ Poisson data. Suppose Y |X ∼ Poisson(µ(X)).
◮ µ(X) > 0;
◮ log(µ(X)) = X ⊤ β;
◮ In general, a link function for the count data should map
(0, +∞) to IR.
◮ The log link is a natural one.
◮ Bernoulli/Binomial data.
◮ 0 < µ < 1;
◮ g should map (0, 1) to IR:
◮ 3 choices:
1. logit: log( µ(X)/(1 − µ(X)) ) = X⊤β;
2. probit: Φ⁻¹(µ(X)) = X⊤β, where Φ(·) is the normal cdf;
3. complementary log-log: log(−log(1 − µ(X))) = X⊤β.


◮ The logit link is the natural choice.

24/52
Examples of link functions for Bernoulli response (1)
(Figure: two increasing S-shaped functions from IR to (0, 1).)

◮ in blue: f1(x) = e^x / (1 + e^x)
◮ in red: f2(x) = Φ(x) (Gaussian CDF)
25/52
Examples of link functions for Bernoulli response (2)
(Figure: the corresponding inverse maps from (0, 1) to IR.)

◮ in blue: g1(x) = f1⁻¹(x) = log( x/(1 − x) ) (logit link)
◮ in red: g2(x) = f2⁻¹(x) = Φ⁻¹(x) (probit link)

26/52
Canonical Link

◮ The function g that links the mean µ to the canonical


parameter θ is called Canonical Link:

g(µ) = θ

◮ Since µ = b′ (θ), the canonical link is given by

g(µ) = (b′ )−1 (µ) .

◮ If φ > 0, the canonical link function is strictly increasing.


Why?

27/52
Example: the Bernoulli distribution

◮ We can check that

b(θ) = log(1 + eθ )

◮ Hence we solve

b′(θ) = exp(θ)/(1 + exp(θ)) = µ ⇔ θ = log( µ/(1 − µ) ).
◮ The canonical link for the Bernoulli distribution is the logit
link.

28/52
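A small sketch checking that the logit and b′ are indeed inverse to each other (NumPy assumed; the grid of µ values is arbitrary):

```python
import numpy as np

def logit(mu):
    """Canonical link for the Bernoulli family: g(mu) = log(mu / (1 - mu))."""
    return np.log(mu / (1 - mu))

def sigmoid(theta):
    """b'(theta) = e^theta / (1 + e^theta): the inverse of the logit."""
    return 1 / (1 + np.exp(-theta))

mu = np.linspace(0.01, 0.99, 50)
print(np.allclose(sigmoid(logit(mu)), mu))  # True: g = (b')^{-1}
```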
Other examples

            b(θ)             g(µ)
Normal      θ²/2             µ
Poisson     exp(θ)           log µ
Bernoulli   log(1 + e^θ)     log( µ/(1 − µ) )
Gamma       −log(−θ)         −1/µ

29/52
Model and notation

◮ Let (Xi , Yi ) ∈ IRp × IR, i = 1, . . . , n be independent random


pairs such that the conditional distribution of Yi given
Xi = xi has density in the canonical exponential family:
fθi(yi) = exp( (yi θi − b(θi))/φ + c(yi , φ) ).

◮ Y = (Y1 , . . . , Yn )⊤ , X = (X1⊤ , . . . , Xn⊤ )⊤


◮ Here the mean µi is related to the canonical parameter θi via

µi = b′ (θi )

◮ and µi depends linearly on the covariates through a link


function g:
g(µi ) = Xi⊤ β .

30/52
Back to β

◮ Given a link function g, note the following relationship


between β and θ:

θi = (b′ )−1 (µi )


= (b′ )−1 (g −1 (Xi⊤ β)) ≡ h(Xi⊤ β),

where h is defined as

h = (b′ )−1 ◦ g −1 = (g ◦ b′ )−1 .

◮ Remark: if g is the canonical link function, h is identity.

31/52
Log-likelihood

◮ The log-likelihood is given by


ℓn(β; Y, X) = ∑ᵢ (Yi θi − b(θi))/φ
            = ∑ᵢ (Yi h(Xi⊤β) − b(h(Xi⊤β)))/φ

up to a constant term.
◮ Note that when we use the canonical link function, we obtain
the simpler expression

ℓn(β, φ; Y, X) = ∑ᵢ (Yi Xi⊤β − b(Xi⊤β))/φ.

32/52
Strict concavity

◮ The log-likelihood ℓ(θ) is strictly concave when the canonical
  link function is used and φ > 0. Why?
◮ As a consequence the maximum likelihood estimator is unique.
◮ On the other hand, if another parameterization is used, the
likelihood function may not be strictly concave leading to
several local maxima.

33/52
Optimization Methods

Given a function f (x) defined on X ⊂ IRm , find x∗ such that


f (x∗ ) ≥ f (x) for all x ∈ X .

We will describe the following three methods,


◮ Newton-Raphson Method
◮ Fisher-scoring Method
◮ Iteratively Re-weighted Least Squares.

34/52
Gradient and Hessian
◮ Suppose f : IRm → IR has two continuous derivatives.
◮ Define the Gradient of f at point x0 , ∇f = ∇f (x0 ), as

(∇f ) = (∂f /∂x1 , . . . , ∂f /∂xm )⊤ .

◮ Define the Hessian (matrix) of f at point x0 , Hf = Hf (x0 ), as

    (Hf )ij = ∂ 2 f /(∂xi ∂xj ) .
◮ For smooth functions, the Hessian is symmetric. If f is strictly
concave, then Hf (x) is negative definite.
◮ The continuous function:

x ↦ Hf (x)

is called Hessian map.


35/52
Quadratic approximation

◮ Suppose f has a continuous Hessian map at x0 . Then we can


approximate f quadratically in a neighborhood of x0 using
    f (x) ≈ f (x0 ) + ∇f (x0 )⊤ (x − x0 ) + (1/2)(x − x0 )⊤ Hf (x0 )(x − x0 ).
◮ This leads to the following approximation to the gradient:

∇f (x) ≈ ∇f (x0 ) + Hf (x0 )(x − x0 ).

◮ If x∗ is a maximum, we have

    ∇f (x∗ ) = 0

◮ Plugging x∗ into the gradient approximation and solving gives

    x∗ ≈ x0 − Hf (x0 )−1 ∇f (x0 ).

36/52
Newton-Raphson method

◮ The Newton-Raphson method for multidimensional


optimization uses such approximations sequentially
◮ We can define a sequence of iterations starting at an arbitrary
value x0 , and update using the rule,

x(k+1) = x(k) − Hf (x(k) )−1 ∇f (x(k) ).

◮ The Newton-Raphson algorithm is globally convergent at


quadratic rate whenever f is concave and has two continuous
derivatives.

37/52
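The update rule above can be sketched in a few lines (a generic implementation under the stated assumptions, with names of my choosing; it is checked on a concave quadratic whose maximizer is known in closed form and is reached in one Newton step).

```python
import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-10, max_iter=100):
    """Maximize f by iterating x <- x - Hf(x)^{-1} grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # Hf(x)^{-1} grad f(x)
        x = x - step
        if np.linalg.norm(step) < tol:            # stop once the step is tiny
            break
    return x

# Concave quadratic f(x) = -0.5 x'Ax + b'x, maximized at x* = A^{-1} b.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x_star = newton_raphson(lambda x: b - A @ x, lambda x: -A, np.zeros(2))
```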
Fisher-scoring method (1)

◮ Newton-Raphson works for a deterministic case, which does


not have to involve random data.
◮ Sometimes, calculation of the Hessian matrix is quite
complicated (we will see an example)
◮ Goal: use directly the fact that we are minimizing the KL
  divergence

    KL “=” − IE[ log-likelihood ]
◮ Idea: replace the Hessian with its expected value. Recall that

IEθ (Hℓn (θ)) = −I(θ)

is the Fisher Information

38/52
Fisher-scoring method (2)

◮ The Fisher Information matrix is positive definite, and can


serve as a stand-in for the Hessian in the Newton-Raphson
algorithm, giving the update:

θ (k+1) = θ (k) + I(θ (k) )−1 ∇ℓn (θ (k) ).

This is the Fisher-scoring algorithm.


◮ It has essentially the same convergence properties as
  Newton-Raphson, but it is often easier to compute I than Hℓn .

39/52
Example: Logistic Regression (1)

◮ Suppose Yi ∼ Bernoulli(pi ), i = 1, . . . , n, are independent


0/1 indicator responses, and Xi is a p × 1 vector of predictors
for individual i.
◮ The log-likelihood is as follows:

    ℓn (θ|Y, X) = Σ_{i=1}^{n} ( Yi θi − log(1 + exp(θi )) ) .

◮ Under the canonical link,

    θi = log( pi /(1 − pi ) ) = Xi⊤ β.

40/52
Example: Logistic Regression (2)
◮ Thus, we have

    ℓn (β|Y, X) = Σ_{i=1}^{n} ( Yi Xi⊤ β − log(1 + exp(Xi⊤ β)) ) .

◮ The gradient is

    ∇ℓn (β) = Σ_{i=1}^{n} ( Yi − exp(Xi⊤ β)/(1 + exp(Xi⊤ β)) ) Xi .

◮ The Hessian is

    Hℓn (β) = − Σ_{i=1}^{n} ( exp(Xi⊤ β)/(1 + exp(Xi⊤ β))² ) Xi Xi⊤ .

◮ As a result, the updating rule is

    β (k+1) = β (k) − Hℓn (β (k) )−1 ∇ℓn (β (k) ).


41/52
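The gradient, Hessian, and updating rule above translate directly into code. A sketch on simulated data (function and variable names are mine, not the slides'): the update is repeated a fixed number of times, after which the score at the fitted coefficients should be numerically zero.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Newton-Raphson for logistic regression with the canonical link.
    X: (n, p) design matrix, y: (n,) 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # exp(x'b)/(1 + exp(x'b))
        grad = X.T @ (y - p)                         # gradient from the slide
        H = -(X * (p * (1.0 - p))[:, None]).T @ X    # Hessian from the slide
        beta = beta - np.linalg.solve(H, grad)       # Newton update
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_true = np.array([-0.5, 1.5])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=500) < p_true).astype(float)
beta_hat = fit_logistic_newton(X, y)
```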
Example: Logistic Regression (3)

◮ The score function is a linear combination of the Xi , and the


Hessian or Information matrix is a linear combination of
Xi Xi⊤ . This is typical in exponential family regression models
(i.e. GLM).
◮ The Hessian is negative definite, so there is a unique local
maximizer, which is also the global maximizer.
◮ Finally, note that Yi does not appear in Hℓn (β), which yields

    Hℓn (β) = IE[ Hℓn (β) ] = −I(β)

42/52
Iteratively Re-weighted Least Squares

◮ IRLS is an algorithm for fitting GLM obtained by


Newton-Raphson/Fisher-scoring.
◮ Suppose Yi |Xi has a distribution from an exponential family
with the following log-likelihood function,
    ℓ = Σ_{i=1}^{n} (Yi θi − b(θi ))/φ + c(Yi , φ).

◮ Observe that

    µi = b′ (θi ),   Xi⊤ β = g(µi ),   dµi /dθi = b′′ (θi ) ≡ Vi ,

    θi = (b′ )−1 ◦ g −1 (Xi⊤ β) := h(Xi⊤ β)

43/52
Chain rule

◮ According to the chain rule, we have

    ∂ℓn /∂βj = Σ_{i=1}^{n} (∂ℓi /∂θi )(∂θi /∂βj )
             = Σ_{i} ((Yi − µi )/φ) h′ (Xi⊤ β) Xij
             = Σ_{i} (Ỹi − µ̃i ) Wi Xij      ( Wi ≡ h′ (Xi⊤ β)/(g ′ (µi )φ) ).

◮ Where Ỹ = (g ′ (µ1 )Y1 , . . . , g ′ (µn )Yn )⊤ and
  µ̃ = (g ′ (µ1 )µ1 , . . . , g ′ (µn )µn )⊤

44/52
Gradient

◮ Define
W = diag{W1 , . . . , Wn },
◮ Then, the gradient is

∇ℓn (β) = X⊤ W (Ỹ − µ̃)

45/52
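As a sanity check on the gradient formula (my own sketch, assuming the logit link with φ = 1, so that h′ = 1 and Wi = µi (1 − µi )): X⊤ W (Ỹ − µ̃) should reduce exactly to the plain Bernoulli score X⊤ (Y − µ) from the logistic-regression slide.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
beta = np.array([0.3, -0.7, 1.1])
mu = 1.0 / (1.0 + np.exp(-X @ beta))     # mu_i = g^{-1}(x_i' beta), logit link
y = (rng.uniform(size=50) < mu).astype(float)

g_prime = 1.0 / (mu * (1.0 - mu))        # g'(mu) for the logit link
W = np.diag(1.0 / g_prime)               # W_i = h'/(g'(mu_i) phi), h' = 1, phi = 1
y_tilde = g_prime * y                    # componentwise rescaled response
mu_tilde = g_prime * mu
grad_glm = X.T @ W @ (y_tilde - mu_tilde)
grad_plain = X.T @ (y - mu)              # logistic score from the earlier slide
```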
Hessian
◮ For the Hessian, we have

    ∂ 2 ℓ/(∂βj ∂βk ) = Σ_{i} ((Yi − µi )/φ) h′′ (Xi⊤ β) Xij Xik
                       − (1/φ) Σ_{i} h′ (Xi⊤ β) Xij (∂µi /∂βk )

◮ Note that

    ∂µi /∂βk = ∂b′ (θi )/∂βk = ∂b′ (h(Xi⊤ β))/∂βk = b′′ (θi ) h′ (Xi⊤ β) Xik

  Since IE[Yi ] = µi , the first term vanishes in expectation, which yields

    IE(Hℓn (β)) = −(1/φ) Σ_{i} b′′ (θi ) [h′ (Xi⊤ β)]2 Xi Xi⊤

46/52
Fisher information
◮ Note that g −1 (·) = b′ ◦ h(·) yields

    b′′ ◦ h(·) · h′ (·) = 1/( g ′ ◦ g −1 (·) )

  Recalling that θi = h(Xi⊤ β) and µi = g −1 (Xi⊤ β), we obtain

    b′′ (θi ) h′ (Xi⊤ β) = 1/g ′ (µi )

◮ As a result

    IE(Hℓn (β)) = − Σ_{i} ( h′ (Xi⊤ β)/(g ′ (µi )φ) ) Xi Xi⊤

◮ Therefore,

    I(β) = −IE(Hℓn (β)) = X⊤ W X   where   W = diag( h′ (Xi⊤ β)/(g ′ (µi )φ) )
47/52
Fisher-scoring updates

◮ According to Fisher-scoring, we can update an initial estimate


β (k) to β (k+1) using

β (k+1) = β (k) + I(β (k) )−1 ∇ℓn (β (k) ) ,

◮ which is equivalent to

β (k+1) = β (k) + (X⊤ W X)−1 X⊤ W (Ỹ − µ̃)


= (X⊤ W X)−1 X⊤ W (Ỹ − µ̃ + Xβ (k) )

48/52
Weighted least squares (1)
Let us open a parenthesis to talk about Weighted Least Squares.
◮ Assume the linear model Y = Xβ + ε, where ε ∼ Nn (0, W −1 )
  and W −1 is an n × n diagonal matrix. When the variances differ
  across observations, the regression is said to be heteroskedastic.
◮ The maximum likelihood estimator is given by the solution to

min(Y − Xβ)⊤ W (Y − Xβ)


β

This is a Weighted Least Squares problem


◮ The solution is given by

    β̂ = (X⊤ W X)−1 X⊤ W Y

◮ Routinely implemented in statistical software.

49/52
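A minimal WLS sketch on simulated heteroskedastic data (all names are illustrative): we weight each observation by its inverse variance and solve the normal equations (X⊤ W X)β = X⊤ W Y directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
sigma = rng.uniform(0.5, 2.0, size=n)     # observation-specific noise levels
Y = X @ beta + sigma * rng.normal(size=n)

W = np.diag(1.0 / sigma**2)               # inverse-variance weights
# Minimizer of (Y - Xb)' W (Y - Xb): solve (X'WX) b = X'WY
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
```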
Weighted least squares (2)

Back to our problem.


◮ Recall that

    β (k+1) = (X⊤ W X)−1 X⊤ W (Ỹ − µ̃ + Xβ (k) )

◮ This reminds us of Weighted Least Squares with


1. W = W (β (k) ) being the weight matrix,
2. Ỹ − µ̃ + Xβ (k) being the response.
So we can obtain β (k+1) using any system for WLS.

50/52
IRLS procedure (1)
Iteratively Reweighted Least Squares is an iterative procedure to
compute the MLE in GLMs using weighted least squares.
We show how to go from β (k) to β (k+1)
1. Fix β (k) and µi(k) = g −1 (Xi⊤ β (k) );
2. Calculate the adjusted dependent responses

    Zi(k) = Xi⊤ β (k) + g ′ (µi(k) )(Yi − µi(k) );

3. Compute the weights W (k) = W (β (k) ):

    W (k) = diag( h′ (Xi⊤ β (k) )/(g ′ (µi(k) )φ) )

4. Regress Z(k) on the design matrix X with weight W (k) to
   derive a new estimate β (k+1) ;
We can repeat this procedure until convergence.
51/52
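The four steps above can be sketched for the logistic case (assuming the canonical logit link with φ = 1, so h′ = 1, Wi = µi (1 − µi ), and g ′ (µ) = 1/(µ(1 − µ)); function names are my own). Each pass is one weighted least-squares solve on the adjusted response.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """IRLS for logistic regression: repeat weighted LS on adjusted responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # step 1: mu = g^{-1}(X beta)
        w = mu * (1.0 - mu)                    # step 3: W_i = 1/g'(mu_i), h' = 1
        z = X @ beta + (y - mu) / w            # step 2: adjusted response Z
        XtW = X.T * w                          # step 4: solve (X'WX) b = X'Wz
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(400), rng.normal(size=400)])
beta_true = np.array([0.2, -1.0])
y = (rng.uniform(size=400) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_irls = irls_logistic(X, y)
```

Since the logit link is canonical, this is exactly Fisher scoring, which here coincides with Newton-Raphson, so the fixed point is the MLE.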
IRLS procedure (2)

◮ For this procedure, we only need to know X, Y, the link
  function g(·) and the variance function V (µ) = b′′ (θ).
◮ A possible starting value is to let µ(0) = Y.
◮ If the canonical link is used, then Fisher scoring is the same as
  Newton-Raphson:

    IE(Hℓn ) = Hℓn .

  There is no random component (Y) in the Hessian matrix.

52/52
MIT OpenCourseWare
http://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications


Fall 2016

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
