
Statistics for Applications
Prof. Philippe Rigollet
SES # TOPICS

1-2 Introduction to Statistics (PDF)

3 Parametric Inference (PDF)

4-5 Maximum Likelihood Estimation (PDF)

6 The Method of Moments (PDF)

7-10 Parametric Hypothesis Testing (PDF)

11-12 Testing Goodness of Fit (PDF)

13-16 Regression (PDF - 1.2MB)

17-18 Bayesian Statistics (PDF)

19-20 Principal Component Analysis (PDF)

21-24 Generalized Linear Models (PDF)

https://ocw.mit.edu/courses/18-650-statistics-for-applications-fall-2016/

https://www.youtube.com/playlist?list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0
18.650
Statistics for Applications

Chapter 1: Introduction

1/43
Goals

Goals:
▶ To give you a solid introduction to the mathematical theory
behind statistical methods;
▶ To provide theoretical guarantees for the statistical methods
that you may use for certain applications.
At the end of this class, you will be able to
1. From a real-life situation, formulate a statistical problem in
mathematical terms
2. Select appropriate statistical methods for your problem
3. Understand the implications and limitations of various
methods

2/43
Instructors

▶ Instructor: Philippe Rigollet


Associate Prof. of Applied Mathematics; IDSS; MIT Center
for Statistics and Data Science.

▶ Teaching Assistant: Victor-Emmanuel Brunel


Instructor in Applied Mathematics; IDSS; MIT Center for
Statistics and Data Science.

3/43
Logistics

▶ Lectures: Tuesdays & Thursdays, 1:00-2:30pm
▶ Optional Recitation: TBD.
▶ Homework: weekly. Total 11, 10 best kept (30%).
▶ Midterm: Nov. 8, in class, 1 hour and 20 minutes (30%).
  Closed books, closed notes. One cheat sheet allowed.
▶ Final: TBD, 2 hours (40%). Open books, open notes.

4/43
Miscellaneous

▶ Prerequisites: Probability (18.600 or 6.041), Calculus 2,
  notions of linear algebra (matrices, vectors, multiplication,
  orthogonality, …)
▶ Reading: There is no required textbook
▶ Slides are posted on course website
https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides

▶ Videolectures: Each lecture is recorded and posted online.


Attendance is still recommended.

5/43
Why statistics?

6/43
Not only in the press

Hydrology: Netherlands, 10th century, building dams and dykes.
  Should be high enough for most floods; should not be too expensive (high).

Insurance: Given your driving record, car information, coverage:
  what is a fair premium?

Clinical trials: A drug is tested on 100 patients; 56 were cured and
  44 showed no improvement. Is the drug effective?

8/43
Randomness

What is common to all these examples?

RANDOMNESS

Associated questions:
▶ Notion of average (“fair premium”, …)
▶ Quantifying chance (“most of the floods”, …)
▶ Significance, variability, …

9/43
Probability
▶ Probability studies randomness (hence the prerequisite)
▶ Sometimes, the physical process is completely known: dice,
cards, roulette, fair coins, …

Examples

Rolling 1 die:
▶ Alice gets $1 if # of dots ≤ 3
▶ Bob gets $2 if # of dots ≥ 5
Who do you want to be: Alice or Bob?

Rolling 2 dice:
▶ Choose a number between 2 and 12
▶ Win $100 if you chose the sum of the 2 dice
Which number do you choose?

Well-known random process from physics: 1/6 chance for each side,
dice are independent. We can deduce the probability of outcomes,
and expected $ amounts. This is probability.

10/43
Statistics and modeling

▶ How about more complicated processes? We need to estimate
  parameters from data. This is statistics.
▶ Sometimes real randomness (random student, biased coin,
  measurement error, …)
▶ Sometimes a deterministic but too complex phenomenon:
  statistical modeling

    Complicated process “=” Simple process + random noise

▶ (Good) modeling consists in choosing a (plausible) simple
  process and noise distribution.

11/43
Statistics vs. probability

Probability: Previous studies showed that the drug was 80%
  effective. Then we can anticipate that, for a study on
  100 patients, on average 80 will be cured, and at least
  65 will be cured with 99.99% probability.

Statistics: Observe that 78/100 patients were cured. We (will
  be able to) conclude that we are 95% confident that,
  for other studies, the drug will be effective on between
  69.88% and 86.11% of patients.

13/43
18.650

What this course is about

▶ Understand the mathematics behind statistical methods
▶ Justify quantitative statements given modeling assumptions
▶ Describe interesting mathematics arising in statistics
▶ Provide a math toolbox to extend to other models.

What this course is not about

▶ Statistical thinking/modeling (applied stats, e.g. IDS.012)
▶ Implementation (computational stats, e.g. IDS.012)
▶ Laundry list of methods (boring stats, e.g. AP stats)

14/43
Let’s do some statistics

15/43
Heuristics (1)

“A neonatal right-side preference makes a surprising
romantic reappearance later in life.”

▶ Let p denote the proportion of couples that turn their head to
  the right when kissing.
▶ Let us design a statistical experiment and analyze its outcome.
▶ Observe n kissing couples and collect the value of each
  outcome (say 1 for RIGHT and 0 for LEFT);
▶ Estimate p with the proportion p̂ of RIGHT.
▶ Study: “Human behaviour: Adult persistence of head-turning
  asymmetry” (Nature, 2003): n = 124, 80 to the right, so

    p̂ = 80/124 = 64.5%

17/43
Heuristics (2)

Back to the data:

▶ 64.5% is much larger than 50%, so there seems to be a
  preference for turning right.
▶ What if our data was RIGHT, RIGHT, LEFT (n = 3)? That’s
  66.7% to the right. Even better?
▶ Intuitively, we need a large enough sample size n to make a
  call. How large?

We need mathematical modeling to understand
the accuracy of this procedure.

18/43
Heuristics (3)

Formally, this procedure consists of doing the following:

▶ For i = 1, . . . , n, define Ri = 1 if the ith couple turns
  RIGHT, Ri = 0 otherwise.
▶ The estimator of p is the sample average

    p̂ = R̄n = (1/n) ∑_{i=1}^n Ri .

What is the accuracy of this estimator?

In order to answer this question, we propose a statistical model
that describes/approximates well the experiment.

19/43
Heuristics (4)

Coming up with a model consists of making assumptions on the
observations Ri , i = 1, . . . , n, in order to draw statistical
conclusions. Here are the assumptions we make:

1. Each Ri is a random variable.

2. Each of the r.v. Ri is Bernoulli with parameter p.

3. R1 , . . . , Rn are mutually independent.

20/43
Heuristics (5)
Let us discuss these assumptions.

1. Randomness is a way of modeling lack of information; with
   perfect information about the conditions of kissing (including
   what goes on in the kissers’ minds), physics or sociology would
   allow us to predict the outcome.

2. Hence, the Ri ’s are necessarily Bernoulli r.v. since
   Ri ∈ {0, 1}. They could still have a different parameter
   Ri ∼ Ber(pi ) for each couple, but we don’t have enough
   information in the data to estimate the pi ’s accurately. So we
   simply assume that our observations come from the same
   process: pi = p for all i.

3. Independence is reasonable (people were observed at different
   locations and different times).

21/43
Two important tools: LLN & CLT

Let X, X1 , X2 , . . . , Xn be i.i.d. r.v., µ = IE[X] and σ² = V[X].

▶ Laws of large numbers (weak and strong):

    X̄n := (1/n) ∑_{i=1}^n Xi → µ  (in IP, a.s.) as n → ∞.

▶ Central limit theorem:

    √n (X̄n − µ)/σ → N (0, 1) in distribution, as n → ∞.

  (Equivalently, √n (X̄n − µ) → N (0, σ²) in distribution.)
22/43
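A quick numerical illustration of both tools (a sketch in Python; the Bernoulli parameter p = 0.3 and the sample size are arbitrary choices, not from the slides):

```python
import math
import random

random.seed(0)

# Simulate X_i ~ Ber(0.3): mu = p, sigma^2 = p(1 - p)
p = 0.3
n = 10_000
xs = [1 if random.random() < p else 0 for _ in range(n)]

# LLN: the sample average X_bar_n should be close to mu = p
x_bar = sum(xs) / n
print(x_bar)  # close to 0.3

# CLT: sqrt(n) (X_bar_n - mu) / sigma is approximately N(0, 1),
# so it should land within a few units of 0
sigma = math.sqrt(p * (1 - p))
z = math.sqrt(n) * (x_bar - p) / sigma
print(z)
```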
Consequences (1)

▶ The LLN’s tell us that

    R̄n → p  (in IP, a.s.) as n → ∞.

▶ Hence, when the size n of the experiment becomes large, R̄n
  is a good (say “consistent”) estimator of p.

▶ The CLT refines this by quantifying how good this estimate is.
23/43
Consequences (2)

Φ(x): cdf of N (0, 1);
Φn (x): cdf of √n (R̄n − p)/√(p(1 − p)).

CLT: Φn (x) ≈ Φ(x) when n becomes large. Hence, for all x > 0,

    IP[ |R̄n − p| ≥ x ] ≈ 2 ( 1 − Φ( x√n / √(p(1 − p)) ) ).

24/43
Consequences (3)

Consequences:

▶ Approximation of how R̄n concentrates around p;

▶ For a fixed α ∈ (0, 1), if qα/2 is the (1 − α/2)-quantile of
  N (0, 1), then with probability ≈ 1 − α (if n is large enough!),

    R̄n ∈ [ p − qα/2 √(p(1 − p))/√n , p + qα/2 √(p(1 − p))/√n ].
25/43
Consequences (4)
▶ Note that no matter the (unknown) value of p, p(1 − p) ≤ 1/4.

▶ Hence, roughly with probability at least 1 − α,

    R̄n ∈ [ p − qα/2 /(2√n) , p + qα/2 /(2√n) ].

▶ In other words, when n becomes large, the interval
  [ R̄n − qα/2 /(2√n) , R̄n + qα/2 /(2√n) ] contains p with
  probability ≥ 1 − α.

▶ This interval is called an asymptotic confidence interval for p.

▶ In the kiss example, we get

    0.645 ± 1.96/(2√124) = [0.56, 0.73].

  In the extreme (n = 3) case we would have [0.10, 1.23], but the
  CLT is not valid! Actually, we can make exact computations!
26/43
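The kiss-example interval can be reproduced in a few lines (using the slide's numbers n = 124, 80 RIGHT, and q_{α/2} = 1.96):

```python
import math

# Asymptotic 95% confidence interval for p using the conservative
# bound p(1 - p) <= 1/4: R_bar_n +/- q_{alpha/2} / (2 sqrt(n))
n = 124
p_hat = 80 / n          # = 0.645...
q = 1.96                # (1 - alpha/2)-quantile of N(0, 1), alpha = 0.05

half_width = q / (2 * math.sqrt(n))
ci = (p_hat - half_width, p_hat + half_width)
print(ci)  # approximately (0.56, 0.73), as on the slide
```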
Another useful tool: Hoeffding’s inequality
What if n is not so large?

Hoeffding’s inequality (i.i.d. case)
Let n be a positive integer and X, X1 , . . . , Xn be i.i.d. r.v. such
that X ∈ [a, b] a.s. (a < b are given numbers). Let µ = IE[X].
Then, for all ε > 0,

    IP[ |X̄n − µ| ≥ ε ] ≤ 2 exp( −2nε² / (b − a)² ).

Consequence:
▶ For α ∈ (0, 1), with probability ≥ 1 − α,

    R̄n − √(log(2/α)/(2n)) ≤ p ≤ R̄n + √(log(2/α)/(2n)).

▶ This holds even for small sample sizes n.

27/43
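A small comparison of the Hoeffding half-width with the CLT-based conservative half-width for the kiss data (a sketch; the choice α = 0.05 mirrors the slides):

```python
import math

# Hoeffding half-width sqrt(log(2/alpha) / (2n)) vs the CLT-based
# conservative half-width q_{alpha/2} / (2 sqrt(n)), for n = 124
n = 124
alpha = 0.05
q = 1.96

hoeffding = math.sqrt(math.log(2 / alpha) / (2 * n))
clt = q / (2 * math.sqrt(n))
print(hoeffding, clt)  # Hoeffding is wider: the price of being non-asymptotic
```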
Review of different types of convergence (1)

Let (Tn )n≥1 be a sequence of r.v. and T a r.v. (T may be
deterministic).

▶ Almost sure (a.s.) convergence:

    Tn → T a.s.  iff  IP[{ω : Tn (ω) → T (ω) as n → ∞}] = 1.

▶ Convergence in probability:

    Tn → T in IP  iff  IP[ |Tn − T | ≥ ε ] → 0 as n → ∞, ∀ε > 0.

28/43
Review of different types of convergence (2)

▶ Convergence in Lᵖ (p ≥ 1):

    Tn → T in Lᵖ  iff  IE[ |Tn − T |ᵖ ] → 0 as n → ∞.

▶ Convergence in distribution:

    Tn → T in distribution  iff  IP[Tn ≤ x] → IP[T ≤ x] as n → ∞,

  for all x ∈ IR at which the cdf of T is continuous.

Remark
These definitions extend to random vectors (i.e., random variables
in IRᵈ for some d ≥ 2).

29/43
Review of different types of convergence (3)

Important characterizations of convergence in distribution

The following propositions are equivalent:

(i) Tn → T in distribution;

(ii) IE[f (Tn )] → IE[f (T )], for all continuous and bounded
functions f ;

(iii) IE[e^{ixTn}] → IE[e^{ixT}], for all x ∈ IR.

30/43
Review of different types of convergence (4)

Important properties
▶ If (Tn )n≥1 converges a.s., then it also converges in probability,
  and the two limits are equal a.s.

▶ If (Tn )n≥1 converges in Lᵖ, then it also converges in Lᵍ for all
  q ≤ p and in probability, and the limits are equal a.s.

▶ If (Tn )n≥1 converges in probability, then it also converges in
  distribution.

▶ If f is a continuous function:

    Tn → T (a.s. / in IP / in distribution)
        ⇒ f (Tn ) → f (T ) (a.s. / in IP / in distribution).

31/43
Review of different types of convergence (6)

Limits and operations

One can add, multiply, ... limits almost surely and in probability. If
Un → U and Vn → V (a.s. or in IP), then:

▶ Un + Vn → U + V (a.s. / in IP),
▶ Un Vn → U V (a.s. / in IP),
▶ If in addition V ≠ 0 a.s., then Un /Vn → U/V (a.s. / in IP).

⚠ In general, these rules do not apply to convergence in
distribution unless the pair (Un , Vn ) converges in distribution to
(U, V ).

33/43
Another example (1)

▶ You observe the times between arrivals of the T at Kendall:
  T1 , . . . , Tn .

▶ You assume that these times are:
  ▶ Mutually independent
  ▶ Exponential random variables with common parameter λ > 0.

▶ You want to estimate the value of λ, based on the observed
  arrival times.

34/43
Another example (2)

Discussion of the assumptions:

▶ Mutual independence of T1 , . . . , Tn : plausible but not
  completely justified (often the case with independence).

▶ T1 , . . . , Tn are exponential r.v.: lack of memory of the
  exponential distribution:

    IP[T1 > t + s | T1 > t] = IP[T1 > s],  ∀s, t ≥ 0.

  Also, Ti > 0 almost surely!

▶ The exponential distributions of T1 , . . . , Tn have the same
  parameter: on average, all the same inter-arrival time. True
  only for a limited period (rush hour ≠ 11pm).

35/43
Another example (3)

▶ Density of T1 :

    f (t) = λ e^{−λt} ,  ∀t ≥ 0.

▶ IE[T1 ] = 1/λ.

▶ Hence, a natural estimate of 1/λ is

    T̄n := (1/n) ∑_{i=1}^n Ti .

▶ A natural estimator of λ is

    λ̂ := 1/T̄n .

36/43
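A simulation sketch of this estimator (the true λ = 2 and the sample size are arbitrary illustrative choices, not data from the slides):

```python
import random

random.seed(1)

# Simulate n inter-arrival times T_i ~ Exp(lambda) and form the
# natural estimator lambda_hat = 1 / T_bar_n
lam = 2.0
n = 100_000
ts = [random.expovariate(lam) for _ in range(n)]

t_bar = sum(ts) / n          # LLN: close to 1 / lambda = 0.5
lam_hat = 1 / t_bar          # close to lambda = 2
print(t_bar, lam_hat)
```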
Another example (4)

▶ By the LLN’s,

    T̄n → 1/λ  (a.s. / in IP) as n → ∞.

▶ Hence,

    λ̂ → λ  (a.s. / in IP) as n → ∞.

▶ By the CLT,

    √n (T̄n − 1/λ) → N (0, λ⁻²) in distribution.

▶ How does the CLT transfer to λ̂? How to find an asymptotic
  confidence interval for λ?

37/43
The Delta method

Let (Zn )n≥1 be a sequence of r.v. that satisfies

    √n (Zn − θ) → N (0, σ²) in distribution,

for some θ ∈ IR and σ² > 0 (the sequence (Zn )n≥1 is said to be
asymptotically normal around θ).

Let g : IR → IR be continuously differentiable at the point θ.
Then,
▶ (g(Zn ))n≥1 is also asymptotically normal;
▶ More precisely,

    √n (g(Zn ) − g(θ)) → N (0, g′(θ)² σ²) in distribution.

38/43
Consequence of the Delta method (1)

▶ √n (λ̂ − λ) → N (0, λ²) in distribution.

▶ Hence, for α ∈ (0, 1) and when n is large enough,

    |λ̂ − λ| ≤ qα/2 λ/√n.

▶ Can [ λ̂ − qα/2 λ/√n , λ̂ + qα/2 λ/√n ] be used as an asymptotic
  confidence interval for λ?

▶ No! It depends on λ...

39/43
Consequence of the Delta method (2)

Two ways to overcome this issue:

▶ In this case, we can solve for λ:

    |λ̂ − λ| ≤ qα/2 λ/√n
        ⟺ λ̂ ∈ [ λ (1 − qα/2 /√n) , λ (1 + qα/2 /√n) ]
        ⟺ λ ∈ [ λ̂ (1 + qα/2 /√n)⁻¹ , λ̂ (1 − qα/2 /√n)⁻¹ ].

  Hence, [ λ̂ (1 + qα/2 /√n)⁻¹ , λ̂ (1 − qα/2 /√n)⁻¹ ] is an asymptotic
  confidence interval for λ.

▶ A systematic way: Slutsky’s theorem.

40/43
Slutsky’s theorem
Slutsky’s theorem
Let (Xn ), (Yn ) be two sequences of r.v., such that:

(i) Xn → X in distribution;
(ii) Yn → c in probability,

where X is a r.v. and c is a given real number. Then,

    (Xn , Yn ) → (X, c) in distribution.

In particular,

    Xn + Yn → X + c in distribution,
    Xn Yn → cX in distribution,
    ...
41/43
Consequence of Slutsky’s theorem (1)
▶ Thanks to the Delta method, we know that

    √n (λ̂ − λ)/λ → N (0, 1) in distribution.

▶ By the weak LLN,

    λ̂ → λ in probability.

▶ Hence, by Slutsky’s theorem,

    √n (λ̂ − λ)/λ̂ → N (0, 1) in distribution.

▶ Another asymptotic confidence interval for λ is

    [ λ̂ − qα/2 λ̂/√n , λ̂ + qα/2 λ̂/√n ].

42/43
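Both intervals side by side, on illustrative numbers (n = 400 and λ̂ = 2.0 are assumptions for this sketch, not data from the slides):

```python
import math

# Two asymptotic 95% CIs for lambda from an exponential sample,
# on hypothetical values n = 400, lambda_hat = 2.0
n = 400
lam_hat = 2.0
q = 1.96

# (a) Solving |lam_hat - lam| <= q * lam / sqrt(n) for lam
ci_solve = (lam_hat / (1 + q / math.sqrt(n)),
            lam_hat / (1 - q / math.sqrt(n)))

# (b) Slutsky: plug lam_hat into the standard deviation
ci_slutsky = (lam_hat - q * lam_hat / math.sqrt(n),
              lam_hat + q * lam_hat / math.sqrt(n))

print(ci_solve)
print(ci_slutsky)  # the two intervals agree up to lower-order terms
```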
Consequence of Slutsky’s theorem (2)

Remark:

▶ In the first example (kisses), we used a problem-dependent
  trick: p(1 − p) ≤ 1/4.

▶ We could have used Slutsky’s theorem and obtained the
  asymptotic confidence interval

    [ R̄n − qα/2 √(R̄n (1 − R̄n ))/√n , R̄n + qα/2 √(R̄n (1 − R̄n ))/√n ].

43/43
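For the kiss data, this plug-in interval can be computed directly (a sketch; it comes out slightly narrower than the conservative interval [0.56, 0.73]):

```python
import math

# Slutsky plug-in 95% CI for p in the kiss example:
# R_bar_n +/- q * sqrt(R_bar_n (1 - R_bar_n) / n)
n = 124
r_bar = 80 / n
q = 1.96

half_width = q * math.sqrt(r_bar * (1 - r_bar) / n)
ci = (r_bar - half_width, r_bar + half_width)
print(ci)
```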
MIT OpenCourseWare
https://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications


Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
18.650
Statistics for Applications

Chapter 2: Parametric Inference

1/11
The rationale behind statistical modeling
◮ Let X1 , . . . , Xn be n independent copies of X.
◮ The goal of statistics is to learn the distribution of X.
◮ If X ∈ {0, 1}, easy! It’s Ber(p) and we only have to learn the
parameter p of the Bernoulli distribution.
◮ Can be more complicated. For example, here is a (partial)
dataset with number of siblings (including self) that were
collected from college students a few years back: 2, 3, 2, 4, 1,
3, 1, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 3, 2, 1, 3, 1, 2, 3, . . .
◮ We could make no assumption and try to learn the pmf:

    x           1    2    3    4    5    6    ≥ 7
    IP(X = x)   p1   p2   p3   p4   p5   p6   ∑_{i≥7} pi

That’s 7 parameters to learn.


◮ Or we could assume that X ∼ Poiss(λ). That’s 1 parameter
to learn!
2/11
Statistical model (1)

Formal definition

Let the observed outcome of a statistical experiment be a sample


X1 , . . . , Xn of n i.i.d. random variables in some measurable space
E (usually E ⊆ IR) and denote by IP their common distribution. A
statistical model associated to that statistical experiment is a pair

(E, (IPθ )θ∈Θ ) ,

where:

◮ E is sample space;

◮ (IPθ )θ∈Θ is a family of probability measures on E;

◮ Θ is any set, called parameter set.

3/11
Statistical model (2)

◮ Usually, we will assume that the statistical model is well


specified, i.e., defined such that IP = IPθ , for some θ ∈ Θ.

◮ This particular θ is called the true parameter, and is unknown:
  the aim of the statistical experiment is to estimate θ, or to
  check its properties when they have a special meaning
  (θ > 2?, θ = 1/2?, . . . )

◮ For now, we will always assume that Θ ⊆ IRd for some d ≥ 1:


The model is called parametric.

4/11
Statistical model (3)
Examples
1. For n Bernoulli trials:

    ( {0, 1}, (Ber(p))p∈(0,1) ).

2. If X1 , . . . , Xn are i.i.d. ∼ Exp(λ), for some unknown λ > 0:

    ( IR∗₊ , (Exp(λ))λ>0 ).

3. If X1 , . . . , Xn are i.i.d. ∼ Poiss(λ), for some unknown λ > 0:

    ( IN, (Poiss(λ))λ>0 ).

4. If X1 , . . . , Xn are i.i.d. ∼ N (µ, σ 2 ), for some unknown µ ∈ IR
   and σ 2 > 0:

    ( IR, (N (µ, σ 2 ))(µ,σ2 )∈IR×IR∗₊ ).

5/11
Identification

The parameter θ is called identified iff the map θ ∈ Θ ↦ IPθ is
injective, i.e.,

    θ ≠ θ ′ ⇒ IPθ ≠ IPθ′ .

Examples

1. In all four previous examples, the parameter was identified.

2. If Xi = 1I{Yi ≥ 0}, where Y1 , . . . , Yn are i.i.d. ∼ N (µ, σ 2 ), for
   some unknown µ ∈ IR and σ 2 > 0, and are unobserved: µ and σ 2
   are not identified (but θ = µ/σ is).

6/11
Parameter estimation (1)
Idea: Given an observed sample X1 , . . . , Xn and a statistical
model (E, (IPθ )θ∈Θ ), one wants to estimate the parameter θ.

Definitions
◮ Statistic: Any measurable1 function of the sample, e.g.,
X̄n , max Xi , X1 + log(1 + |Xn |), sample variance, etc...
i
◮ Estimator of θ: Any statistic whose expression does not
depend on θ.
◮ An estimator θ̂n of θ is weakly (resp. strongly) consistent iff

    θ̂n → θ in IP (resp. a.s.), w.r.t. IPθ , as n → ∞.

¹ Rule of thumb: if you can compute it exactly once given the data, it is
measurable. You may have some issues with things that are implicitly defined,
such as sup or inf, but not in this class.

7/11
Parameter estimation (2)

◮ Bias of an estimator θ̂n of θ:

    IE[θ̂n ] − θ.

◮ Risk (or quadratic risk) of an estimator θ̂n :

    IE[ |θ̂n − θ|² ].

Remark: If Θ ⊆ IR,

    “Quadratic risk = bias² + variance”.

8/11
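The decomposition can be checked by Monte Carlo on a deliberately biased estimator (the shrinkage estimator (S + 1)/(n + 2) is a hypothetical example chosen for illustration, not from the slides):

```python
import random
import statistics

random.seed(2)

# Monte Carlo check of "quadratic risk = bias^2 + variance" for a
# biased shrinkage estimator of a Bernoulli parameter p
p, n, reps = 0.3, 20, 20_000
estimates = []
for _ in range(reps):
    s = sum(1 if random.random() < p else 0 for _ in range(n))
    estimates.append((s + 1) / (n + 2))   # shrinks toward 1/2

risk = statistics.fmean((e - p) ** 2 for e in estimates)  # IE|est - p|^2
bias = statistics.fmean(estimates) - p
var = statistics.pvariance(estimates)
print(risk, bias ** 2 + var)  # the two numbers coincide
```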
Confidence intervals (1)
Let (E, (IPθ )θ∈Θ ) be a statistical model based on observations
X1 , . . . , Xn , and assume Θ ⊆ IR.

Definition
Let α ∈ (0, 1).
◮ Confidence interval (C.I.) of level 1 − α for θ: Any random
(i.e., depending on X1 , . . . , Xn ) interval I whose boundaries
do not depend on θ and such that:

IPθ [I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
◮ C.I. of asymptotic level 1 − α for θ: Any random interval I
whose boundaries do not depend on θ and such that:

lim IPθ [I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
n→∞

9/11
Confidence intervals (2)
iid
Example: Let X1 , . . . , Xn ∼ Ber(p), for some unknown
p ∈ (0, 1).

◮ LLN: The sample average X̄n is a strongly consistent


estimator of p.

◮ Let qα/2 be the (1 − α/2)-quantile of N (0, 1) and

    I = [ X̄n − qα/2 √(p(1 − p))/√n , X̄n + qα/2 √(p(1 − p))/√n ].

◮ CLT: lim_{n→∞} IPp [I ∋ p] = 1 − α, ∀p ∈ (0, 1).

◮ Problem: I depends on p !

10/11
Confidence intervals (3)

Two solutions:

◮ Replace p(1 − p) with 1/4 in I (since p(1 − p) ≤ 1/4).

◮ Replace p with X̄n in I and use Slutsky’s theorem.

11/11
18.650
Statistics for Applications

Chapter 3: Maximum Likelihood Estimation

1/23
Total variation distance (1)
Let (E, (IPθ )θ∈Θ ) be a statistical model associated with a sample
of i.i.d. r.v. X1 , . . . , Xn . Assume that there exists θ ∗ ∈ Θ such
that X1 ∼ IPθ∗ : θ ∗ is the true parameter.

Statistician’s goal: given X1 , . . . , Xn , find an estimator
θ̂ = θ̂(X1 , . . . , Xn ) such that IPθ̂ is close to IPθ∗ for the true
parameter θ ∗ .

This means: |IPθ̂ (A) − IPθ∗ (A)| is small for all A ⊂ E.
Definition
The total variation distance between two probability measures IPθ
and IPθ′ is defined by

    TV(IPθ , IPθ′ ) = max_{A⊂E} | IPθ (A) − IPθ′ (A) | .

2/23
Total variation distance (2)

Assume that E is discrete (i.e., finite or countable). This includes


Bernoulli, Binomial, Poisson, . . .

Therefore X has a PMF (probability mass function):
IPθ (X = x) = pθ (x) for all x ∈ E, with

    pθ (x) ≥ 0,   ∑_{x∈E} pθ (x) = 1 .

The total variation distance between IPθ and IPθ′ is a simple
function of the PMFs pθ and pθ′ :

    TV(IPθ , IPθ′ ) = (1/2) ∑_{x∈E} | pθ (x) − pθ′ (x) | .

3/23
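A minimal sketch of this formula on a small discrete space (the Binomial(2, p) example is an illustrative choice):

```python
# Total variation between two discrete distributions via the PMF formula
# TV = (1/2) * sum_x |p(x) - q(x)|, illustrated on Binomial(2, .) PMFs
def binom2_pmf(p):
    # PMF of Binomial(n=2, p) on {0, 1, 2}
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

def tv(pmf1, pmf2):
    return 0.5 * sum(abs(a - b) for a, b in zip(pmf1, pmf2))

d = tv(binom2_pmf(0.5), binom2_pmf(0.7))
print(d)  # a number in [0, 1]; symmetric and zero iff the PMFs agree
```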
Total variation distance (3)

Assume that E is continuous. This includes Gaussian, Exponential,


...
Assume that X has a density: IPθ (X ∈ A) = ∫_A fθ (x) dx for all
A ⊂ E, with

    fθ (x) ≥ 0,   ∫_E fθ (x) dx = 1 .

The total variation distance between IPθ and IPθ′ is a simple
function of the densities fθ and fθ′ :

    TV(IPθ , IPθ′ ) = (1/2) ∫_E | fθ (x) − fθ′ (x) | dx .

4/23
Total variation distance (4)

Properties of Total variation:

◮ TV(IPθ , IPθ′ ) = TV(IPθ′ , IPθ ) (symmetric)


◮ TV(IPθ , IPθ′ ) ≥ 0
◮ If TV(IPθ , IPθ′ ) = 0 then IPθ = IPθ′ (definite)
◮ TV(IPθ , IPθ′ ) ≤ TV(IPθ , IPθ′′ ) + TV(IPθ′′ , IPθ′ ) (triangle
inequality)

These imply that the total variation is a distance between


probability distributions.

5/23
Total variation distance (5)
An estimation strategy: Build an estimator TV̂(IPθ , IPθ∗ ) for all
θ ∈ Θ. Then find θ̂ that minimizes the function θ ↦ TV̂(IPθ , IPθ∗ ).

Problem: Unclear how to build TV̂(IPθ , IPθ∗ )!
6/23
Kullback-Leibler (KL) divergence (1)

There are many distances between probability measures to replace


total variation. Let us choose one that is more convenient.

Definition
The Kullback-Leibler (KL) divergence between two probability
measures IPθ and IPθ′ is defined by




 L ( p (x) )

 θ

 pθ (x) log if E is discrete
pθ′ (x)
KL(IPθ , IPθ ) =
′ x∈E

 l

 ( f (x) )

 θ

 f θ (x) log dx if E is continuous
E θf ′ (x)

7/23
Kullback-Leibler (KL) divergence (2)

Properties of KL-divergence:

◮ KL(IPθ , IPθ′ ) ≠ KL(IPθ′ , IPθ ) in general
◮ KL(IPθ , IPθ′ ) ≥ 0
◮ If KL(IPθ , IPθ′ ) = 0 then IPθ = IPθ′ (definite)
◮ KL(IPθ , IPθ′ ) ≤ KL(IPθ , IPθ′′ ) + KL(IPθ′′ , IPθ′ ) does not hold
  in general (no triangle inequality)

Not a distance. This is called a divergence.

Asymmetry is the key to our ability to estimate it!

8/23
Kullback-Leibler (KL) divergence (3)
    KL(IPθ∗ , IPθ ) = IEθ∗ [ log( pθ∗ (X) / pθ (X) ) ]
                    = IEθ∗ [log pθ∗ (X)] − IEθ∗ [log pθ (X)]

So the function θ ↦ KL(IPθ∗ , IPθ ) is of the form:

    “constant” − IEθ∗ [log pθ (X)]

Can be estimated: IEθ∗ [h(X)] ≈ (1/n) ∑_{i=1}^n h(Xi )   (by the LLN)

    KL̂(IPθ∗ , IPθ ) = “constant” − (1/n) ∑_{i=1}^n log pθ (Xi )
9/23
Kullback-Leibler (KL) divergence (4)
    KL̂(IPθ∗ , IPθ ) = “constant” − (1/n) ∑_{i=1}^n log pθ (Xi )

    min_{θ∈Θ} KL̂(IPθ∗ , IPθ ) ⇔ min_{θ∈Θ} − (1/n) ∑_{i=1}^n log pθ (Xi )
                              ⇔ max_{θ∈Θ} (1/n) ∑_{i=1}^n log pθ (Xi )
                              ⇔ max_{θ∈Θ} ∑_{i=1}^n log pθ (Xi )
                              ⇔ max_{θ∈Θ} ∏_{i=1}^n pθ (Xi )

This is the maximum likelihood principle.


10/23
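A sketch of the principle in action: maximizing the Bernoulli log-likelihood over a grid recovers the sample average (the toy sample below is made up for illustration):

```python
import math

# Maximize the Bernoulli log-likelihood sum_i log p_theta(X_i)
# over a grid of p values; the maximizer is the sample average
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # toy sample, mean 0.7

def log_lik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

grid = [k / 1000 for k in range(1, 1000)]   # p in (0, 1)
p_best = max(grid, key=log_lik)
print(p_best)  # 0.7, the sample average of xs
```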
Interlude: maximizing/minimizing functions (1)

Note that
min −h(θ) ⇔ max h(θ)
θ∈Θ θ∈Θ

In this class, we focus on maximization.

Maximization of arbitrary functions can be difficult:

Example: θ ↦ ∏_{i=1}^n (θ − Xi )

11/23
Interlude: maximizing/minimizing functions (2)
Definition
A twice differentiable function h : Θ ⊂ IR → IR is said to
be concave if its second derivative satisfies

    h′′ (θ) ≤ 0 ,  ∀θ ∈ Θ.

It is said to be strictly concave if the inequality is strict: h′′ (θ) < 0.

Moreover, h is said to be (strictly) convex if −h is (strictly)
concave, i.e. h′′ (θ) ≥ 0 (h′′ (θ) > 0).

Examples:
◮ Θ = IR, h(θ) = −θ²,
◮ Θ = (0, ∞), h(θ) = √θ,
◮ Θ = (0, ∞), h(θ) = log θ,
◮ Θ = [0, π], h(θ) = sin(θ),
◮ Θ = IR, h(θ) = 2θ − 3.
12/23
Interlude: maximizing/minimizing functions (3)
More generally, for a multivariate function h : Θ ⊂ IRᵈ → IR,
d ≥ 2, define the

◮ gradient vector: ∇h(θ) = ( ∂h/∂θ1 (θ), . . . , ∂h/∂θd (θ) )ᵀ ∈ IRᵈ

◮ Hessian matrix: ∇²h(θ) ∈ IRᵈˣᵈ , with entries

    (∇²h(θ))i,j = ∂²h/∂θi ∂θj (θ),  1 ≤ i, j ≤ d.

h is concave ⇔ xᵀ ∇²h(θ) x ≤ 0, ∀x ∈ IRᵈ , θ ∈ Θ.

h is strictly concave ⇔ xᵀ ∇²h(θ) x < 0, ∀x ∈ IRᵈ , θ ∈ Θ.

Examples:
◮ Θ = IR², h(θ) = −θ1² − 2θ2² or h(θ) = −(θ1 − θ2 )²,
◮ Θ = (0, ∞)², h(θ) = log(θ1 + θ2 ).
13/23
Interlude: maximizing/minimizing functions (4)

Strictly concave functions are easy to maximize: if they have a
maximum, then it is unique. It is the unique solution to

    h′ (θ) = 0 ,

or, in the multivariate case,

    ∇h(θ) = 0 ∈ IRᵈ .

There are many algorithms to find it numerically: this is the theory
of “convex optimization”. In this class, we will often have a
closed-form formula for the maximum.

14/23
Likelihood, Discrete case (1)

Let (E, (IPθ )θ∈Θ ) be a statistical model associated with a sample
of i.i.d. r.v. X1 , . . . , Xn . Assume that E is discrete (i.e., finite or
countable).

Definition

The likelihood of the model is the map Ln (or just L) defined as:

    Ln : Eⁿ × Θ → IR
    (x1 , . . . , xn , θ) ↦ IPθ [X1 = x1 , . . . , Xn = xn ].

15/23
Likelihood, Discrete case (2)
iid
Example 1 (Bernoulli trials): If X1 , . . . , Xn ∼ Ber(p) for some
p ∈ (0, 1):

◮ E = {0, 1};
◮ Θ = (0, 1);
◮ ∀(x1 , . . . , xn ) ∈ {0, 1}ⁿ , ∀p ∈ (0, 1),

    L(x1 , . . . , xn , p) = ∏_{i=1}^n IPp [Xi = xi ]
                          = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi}
                          = p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi} .

16/23
Likelihood, Discrete case (3)
Example 2 (Poisson model):
iid
If X1 , . . . , Xn ∼ Poiss(λ) for some λ > 0:

◮ E = IN;
◮ Θ = (0, ∞);
◮ ∀(x1 , . . . , xn ) ∈ INⁿ , ∀λ > 0,

    L(x1 , . . . , xn , λ) = ∏_{i=1}^n IPλ [Xi = xi ]
                          = ∏_{i=1}^n e^{−λ} λ^{xi} / xi !
                          = e^{−nλ} λ^{∑_{i=1}^n xi} / (x1 ! · · · xn !) .

17/23
Likelihood, Continuous case (1)

Let (E, (IPθ )θ∈Θ ) be a statistical model associated with a sample
of i.i.d. r.v. X1 , . . . , Xn . Assume that all the IPθ have density fθ .

Definition

The likelihood of the model is the map L defined as:

    L : Eⁿ × Θ → IR
    (x1 , . . . , xn , θ) ↦ ∏_{i=1}^n fθ (xi ).

18/23
Likelihood, Continuous case (2)

Example 1 (Gaussian model): If X1 , . . . , Xn are i.i.d. ∼ N (µ, σ 2 ),
for some µ ∈ IR, σ 2 > 0:

◮ E = IR;
◮ Θ = IR × (0, ∞);
◮ ∀(x1 , . . . , xn ) ∈ IRⁿ , ∀(µ, σ 2 ) ∈ IR × (0, ∞),

    L(x1 , . . . , xn , µ, σ 2 ) = (σ√(2π))⁻ⁿ exp( −(1/(2σ²)) ∑_{i=1}^n (xi − µ)² ).

19/23
Maximum likelihood estimator (1)

Let X1 , . . . , Xn be an i.i.d. sample associated with a statistical
model (E, (IPθ )θ∈Θ ) and let L be the corresponding likelihood.

Definition
The maximum likelihood estimator of θ is defined as:

    θ̂n^MLE = argmax_{θ∈Θ} L(X1 , . . . , Xn , θ),

provided it exists.

Remark (log-likelihood): In practice, we use the fact that

    θ̂n^MLE = argmax_{θ∈Θ} log L(X1 , . . . , Xn , θ).

20/23
Maximum likelihood estimator (2)

Examples

◮ Bernoulli trials: p̂n^MLE = X̄n .

◮ Poisson model: λ̂n^MLE = X̄n .

◮ Gaussian model: (µ̂n , σ̂n² ) = (X̄n , Ŝn ).

21/23
Maximum likelihood estimator (3)

Definition: Fisher information

Define the log-likelihood for one observation as:

    ℓ(θ) = log L1 (X, θ),  θ ∈ Θ ⊂ IRᵈ .

Assume that ℓ is a.s. twice differentiable. Under some regularity
conditions, the Fisher information of the statistical model is
defined as:

    I(θ) = IE[∇ℓ(θ)∇ℓ(θ)ᵀ ] − IE[∇ℓ(θ)] IE[∇ℓ(θ)]ᵀ = −IE[∇²ℓ(θ)].

If Θ ⊂ IR, we get:

    I(θ) = var[ℓ′ (θ)] = −IE[ℓ′′ (θ)].

22/23
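For the one-dimensional Bernoulli model, both expressions can be evaluated exactly and agree (a sketch; p = 0.3 is an arbitrary choice):

```python
# Both expressions for the Fisher information, evaluated for the
# Bernoulli model: l(p) = X log p + (1 - X) log(1 - p)
p = 0.3

# var[l'(p)]: l'(p) = X/p - (1-X)/(1-p) takes value 1/p w.p. p
# and -1/(1-p) w.p. 1-p, so its mean is 0
mean_score = p * (1 / p) + (1 - p) * (-1 / (1 - p))
var_score = p * (1 / p) ** 2 + (1 - p) * (1 / (1 - p)) ** 2

# -IE[l''(p)]: l''(p) = -X/p^2 - (1-X)/(1-p)^2
neg_mean_second = p / p ** 2 + (1 - p) / (1 - p) ** 2

print(var_score, neg_mean_second)  # both equal 1 / (p (1 - p))
```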
Maximum likelihood estimator (4)

Theorem

Let θ ∗ ∈ Θ (the true parameter). Assume the following:


1. The model is identified.
2. For all θ ∈ Θ, the support of IPθ does not depend on θ;
3. θ ∗ is not on the boundary of Θ;
4. I(θ) is invertible in a neighborhood of θ ∗ ;
5. A few more technical conditions.

Then, θ̂n^MLE satisfies:

◮ θ̂n^MLE → θ ∗ in IP, w.r.t. IPθ∗ ;

◮ √n (θ̂n^MLE − θ ∗ ) → N (0, I(θ ∗ )⁻¹) in distribution, w.r.t. IPθ∗ .

23/23
18.650
Statistics for Applications

Chapter 4: The Method of Moments

1/14
Weierstrass Approximation Theorem (WAT)

Theorem
Let f be a continuous function on the interval [a, b]. Then, for any
ε > 0, there exist d ≥ 0 and a0 , a1 , . . . , ad ∈ IR such that

      max_{x∈[a,b]} | f (x) − Σ_{k=0}^d ak x^k | < ε .

In words: “continuous functions can be arbitrarily well approximated
by polynomials”.
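A quick numerical illustration of the theorem (not part of the slides; assumes numpy; the target f(x) = |x| and the grid are arbitrary choices): least-squares polynomial fits of growing degree drive the sup-norm error down.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 401)
f = np.abs(x)  # continuous but not differentiable at 0

def sup_error(deg):
    """Sup-norm error on the grid of a degree-`deg` least-squares polynomial fit."""
    coeffs = np.polyfit(x, f, deg)
    return np.max(np.abs(f - np.polyval(coeffs, x)))

err2, err10 = sup_error(2), sup_error(10)  # error shrinks as the degree grows
```

Least squares is not the minimax fit of the theorem, but it exhibits the same qualitative behavior.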

2/14
Statistical application of the WAT (1)
◮ Let X1 , . . . , Xn be an i.i.d. sample associated with an
  (identified) statistical model (E, {IPθ }θ∈Θ ). Write θ∗ for the
  true parameter.

◮ Assume that for all θ, the distribution IPθ has a density fθ .

◮ If we find θ such that

      ∫ h(x)fθ∗ (x)dx = ∫ h(x)fθ (x)dx

  for all (bounded continuous) functions h, then θ = θ∗ .

◮ Replace expectations by averages: find an estimator θ̂ such that

      (1/n) Σ_{i=1}^n h(Xi ) = ∫ h(x)f_θ̂ (x)dx

  for all (bounded continuous) functions h. There is an infinity
  of such functions: not doable!
3/14
Statistical application of the WAT (2)
◮ By the WAT, it is enough to consider polynomials:

      (1/n) Σ_{i=1}^n Σ_{k=0}^d ak Xi^k = Σ_{k=0}^d ak ∫ x^k f_θ̂ (x)dx ,   ∀a0 , . . . , ad ∈ IR

  Still an infinity of equations!

◮ In turn, enough to consider

      (1/n) Σ_{i=1}^n Xi^k = ∫ x^k f_θ̂ (x)dx ,   ∀k = 1, . . . , d

  (only d equations)

◮ The quantity mk (θ) := ∫ x^k fθ (x)dx is the kth moment of
  IPθ . It can also be written as

      mk (θ) = IEθ [X^k ] .
4/14
Gaussian quadrature (1)
◮ The Weierstrass approximation theorem has limitations:
1. works only for continuous functions (not really a problem!)
2. works only on intervals [a, b]
3. Does not tell us what d (# of moments) should be
◮ What if E is discrete: no PDF but PMF p(·)?
◮ Assume that E = {x1 , x2 , . . . , xr } is finite with r possible
values. The PMF has r − 1 parameters:

      p(x1 ), . . . , p(xr−1 )

  because the last one, p(xr ) = 1 − Σ_{j=1}^{r−1} p(xj ), is given by the
  first r − 1.
◮ Hopefully, we do not need much more than d = r − 1
moments to recover the PMF p(·).

5/14
Gaussian quadrature (2)
◮ Note that for any k = 1, . . . , r − 1,

      mk = IE[X^k ] = Σ_{j=1}^r p(xj ) xj^k

  and

      Σ_{j=1}^r p(xj ) = 1

  This is a system of linear equations with unknowns
  p(x1 ), . . . , p(xr ).

◮ We can write it in a compact form:

      ⎛ x1^1       x2^1       · · ·  xr^1       ⎞   ⎛ p(x1 )   ⎞   ⎛ m1   ⎞
      ⎜ x1^2       x2^2       · · ·  xr^2       ⎟   ⎜ p(x2 )   ⎟   ⎜ m2   ⎟
      ⎜  ...        ...              ...        ⎟ · ⎜  ...     ⎟ = ⎜ ...  ⎟
      ⎜ x1^{r−1}   x2^{r−1}   · · ·  xr^{r−1}   ⎟   ⎜ p(xr−1 ) ⎟   ⎜ mr−1 ⎟
      ⎝ 1          1          · · ·  1          ⎠   ⎝ p(xr )   ⎠   ⎝ 1    ⎠
6/14
Gaussian quadrature (3)
◮ Check if the matrix is invertible: Vandermonde determinant

      ⎛ x1^1       x2^1       · · ·  xr^1       ⎞
      ⎜ x1^2       x2^2       · · ·  xr^2       ⎟
  det ⎜  ...        ...              ...        ⎟ = ± Π_{1≤j<k≤r} (xj − xk ) ≠ 0
      ⎜ x1^{r−1}   x2^{r−1}   · · ·  xr^{r−1}   ⎟
      ⎝ 1          1          · · ·  1          ⎠

  since the xj are distinct.

◮ So given m1 , . . . , mr−1 , there is a unique PMF that has these
  moments. It is given by

      ⎛ p(x1 )   ⎞   ⎛ x1^1       x2^1       · · ·  xr^1       ⎞⁻¹ ⎛ m1   ⎞
      ⎜ p(x2 )   ⎟   ⎜ x1^2       x2^2       · · ·  xr^2       ⎟   ⎜ m2   ⎟
      ⎜  ...     ⎟ = ⎜  ...        ...              ...        ⎟   ⎜ ...  ⎟
      ⎜ p(xr−1 ) ⎟   ⎜ x1^{r−1}   x2^{r−1}   · · ·  xr^{r−1}   ⎟   ⎜ mr−1 ⎟
      ⎝ p(xr )   ⎠   ⎝ 1          1          · · ·  1          ⎠   ⎝ 1    ⎠
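A small numerical check of this recovery (not from the slides; assumes numpy; the support {1, 2, 3} and PMF (0.2, 0.5, 0.3) are made up): solving the linear system above from m1, m2 and the normalization equation returns the PMF.

```python
import numpy as np

support = np.array([1.0, 2.0, 3.0])   # x_1, x_2, x_3 (r = 3)
p_true = np.array([0.2, 0.5, 0.3])

# Moments m_1, m_2, plus the normalization equation
m = np.array([support @ p_true, (support**2) @ p_true, 1.0])

# Rows: first powers, second powers, ones (the matrix from the slide)
A = np.vstack([support, support**2, np.ones(3)])

p_recovered = np.linalg.solve(A, m)  # unique solution since the x_j are distinct
```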
7/14
Conclusion from WAT and Gaussian quadrature

◮ Moments contain important information to recover the PDF


or the PMF

◮ If we can estimate these moments accurately, we may be able


to recover the distribution

◮ In a parametric setting, where knowing the distribution IPθ
  amounts to knowing θ, it is often the case that even fewer
  moments are needed to recover θ. This is on a case-by-case
  basis.

◮ Rule of thumb: if θ ∈ Θ ⊂ IR^d , we need d moments.

8/14
Method of moments (1)

Let X1 , . . . , Xn be an i.i.d. sample associated with a statistical
model (E, (IPθ )θ∈Θ ). Assume that Θ ⊆ IR^d , for some d ≥ 1.

◮ Population moments: Let mk (θ) = IEθ [X1^k ], 1 ≤ k ≤ d.

◮ Empirical moments: Let m̂k = (1/n) Σ_{i=1}^n Xi^k , 1 ≤ k ≤ d.

◮ Let
      ψ : Θ ⊂ IR^d → IR^d
          θ ↦ (m1 (θ), . . . , md (θ)) .

9/14
Method of moments (2)

Assume ψ is one to one:

θ = ψ −1 (m1 (θ), . . . , md (θ)).

Definition

Moments estimator of θ:

θˆnM M = ψ −1 (m̂1 , . . . , m̂d ),

provided it exists.
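As a concrete instance (a sketch, not from the slides; assumes numpy; the parameters are made up): in the Gaussian model ψ(µ, σ²) = (µ, µ² + σ²), so ψ⁻¹(m1, m2) = (m1, m2 − m1²), and plugging in the empirical moments gives the moments estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)  # true (mu, sigma^2) = (2, 2.25)

m1_hat = np.mean(x)       # empirical first moment
m2_hat = np.mean(x**2)    # empirical second moment

mu_hat = m1_hat                    # psi^{-1} applied to (m1_hat, m2_hat)
sigma2_hat = m2_hat - m1_hat**2
```

With n = 100 000 observations, both estimates land close to the true values.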

10/14
Method of moments (3)

Analysis of θ̂n^MM

◮ Let M (θ) = (m1 (θ), . . . , md (θ)) ;

◮ Let M̂ = (m̂1 , . . . , m̂d ).

◮ Let Σ(θ) = Vθ (X, X², . . . , X^d ) be the covariance matrix of
  the random vector (X, X², . . . , X^d ), where X ∼ IPθ .

◮ Assume ψ⁻¹ is continuously differentiable at M (θ). Write
  ∇ψ⁻¹ |_{M (θ)} for the d × d gradient matrix at this point.

11/14
Method of moments (4)

◮ LLN: θ̂n^MM is weakly/strongly consistent.

◮ CLT:

      √n (M̂ − M (θ)) −→ N (0, Σ(θ)) in distribution as n → ∞ (w.r.t. IPθ ).

Hence, by the Delta method (see next slide):

Theorem

      √n (θ̂n^MM − θ) −→ N (0, Γ(θ)) in distribution as n → ∞ (w.r.t. IPθ ),

where Γ(θ) = [∇ψ⁻¹ |_{M (θ)} ]⊤ Σ(θ) [∇ψ⁻¹ |_{M (θ)} ] .

12/14
Multivariate Delta method
Let (Tn )n≥1 be a sequence of random vectors in IR^p (p ≥ 1) that
satisfies

      √n (Tn − θ) −→ N (0, Σ) in distribution as n → ∞,

for some θ ∈ IR^p and some symmetric positive semidefinite matrix
Σ ∈ IR^{p×p} .

Let g : IR^p → IR^k (k ≥ 1) be continuously differentiable at θ.
Then,

      √n (g(Tn ) − g(θ)) −→ N (0, ∇g(θ)⊤ Σ ∇g(θ)) in distribution as n → ∞,

where ∇g(θ) = (∂gj /∂θi )_{1≤i≤p, 1≤j≤k} ∈ IR^{p×k} .

13/14
MLE vs. Moment estimator

◮ Comparison of the quadratic risks: In general, the MLE is


more accurate.
◮ Computational issues: Sometimes, the MLE is intractable.
◮ If likelihood is concave, we can use optimization algorithms
(Interior point method, gradient descent, etc.)
◮ If likelihood is not concave: only heuristics. Local maxima.
(Expectation-Maximization, etc.)

14/14
18.650
Statistics for Applications

Chapter 5: Parametric hypothesis testing

1/37
Cherry Blossom run (1)

◮ The Credit Union Cherry Blossom Run is a 10-mile race that
  takes place every year in D.C.
◮ In 2009 there were 14974 participants
◮ Average running time was 103.5 minutes.

Were runners faster in 2012?

To answer this question, select n runners from the 2012 race at


random and denote by X1 , . . . , Xn their running time.

2/37
Cherry Blossom run (2)

We can see from past data that the running time has a Gaussian
distribution.

The variance was 373.

3/37
Cherry Blossom run (3)

◮ We are given i.i.d. r.v. X1 , . . . , Xn and we want to know if
  X1 ∼ N (103.5, 373)

◮ This is a hypothesis testing problem.

◮ There are many ways this could be false:
   1. IE[X1 ] ≠ 103.5
   2. var[X1 ] ≠ 373
   3. X1 may not even be Gaussian.

◮ We are interested in a very specific question: is
  IE[X1 ] < 103.5 ?

4/37
Cherry Blossom run (4)

◮ We make the following assumptions:


1. var[X1 ] = 373 (variance is the same between 2009 and 2012)
2. X1 is Gaussian.
◮ The only thing that we did not fix is IE[X1 ] = µ.
◮ Now we want to test (only): “Is µ = 103.5 or is µ < 103.5”?
◮ By making modeling assumptions, we have reduced the
number of ways the hypothesis X1 ∼ N (103.5, 373) may be
rejected.
◮ The only way it can be rejected is if X1 ∼ N (µ, 373) for some
µ < 103.5.
◮ We compare an expected value to a fixed reference number
(103.5).

5/37
Cherry Blossom run (5)

Simple heuristic:

      “If X̄n < 103.5, then µ < 103.5”

This could go wrong if I randomly pick only fast runners in my


sample X1 , . . . , Xn .

Better heuristic:

      “If X̄n < 103.5 − (something that → 0 as n → ∞), then µ < 103.5”

To make this intuition more precise, we need to take the size of the
random fluctuations of X̄n into account!

6/37
Clinical trials (1)

◮ Pharmaceutical companies use hypothesis testing to test if a


new drug is efficient.
◮ To do so, they administer a drug to a group of patients (test
group) and a placebo to another group (control group).
◮ Assume that the drug is a cough syrup.
◮ Let µcontrol denote the expected number of expectorations per
hour after a patient has used the placebo.
◮ Let µdrug denote the expected number of expectorations per
hour after a patient has used the syrup.
◮ We want to know if µdrug < µcontrol
◮ We compare two expected values. No reference number.

7/37
Clinical trials (2)

◮ Let X1 , . . . , Xndrug denote ndrug i.i.d r.v. with distribution


Poiss(µdrug )
◮ Let Y1 , . . . , Yncontrol denote ncontrol i.i.d r.v. with distribution
Poiss(µcontrol )
◮ We want to test if µdrug < µcontrol .
Heuristic:

      “If X̄drug < X̄control − (something that → 0 as ndrug , ncontrol → ∞),
      then conclude that µdrug < µcontrol ”

8/37
Heuristics (1)
Example 1: A coin is tossed 80 times, and Heads are obtained 54
times. Can we conclude that the coin is significantly unfair ?

◮ n = 80, X1 , . . . , Xn i.i.d. ∼ Ber(p);
◮ X̄n = 54/80 ≈ .68
◮ If it was true that p = .5: by CLT + Slutsky’s theorem,

      √n (X̄n − .5)/√(.5(1 − .5)) ≈ N (0, 1).

◮ Here, √n (X̄n − .5)/√(.5(1 − .5)) ≈ 3.22
◮ Conclusion: It seems quite reasonable to reject the
hypothesis p = .5.

9/37
Heuristics (2)
Example 2: A coin is tossed 30 times, and Heads are obtained 13
times. Can we conclude that the coin is significantly unfair ?

◮ n = 30, X1 , . . . , Xn i.i.d. ∼ Ber(p);
◮ X̄n = 13/30 ≈ .43
◮ If it was true that p = .5: by CLT + Slutsky’s theorem,

      √n (X̄n − .5)/√(.5(1 − .5)) ≈ N (0, 1).

◮ Our data gives √n (X̄n − .5)/√(.5(1 − .5)) ≈ −.77
◮ The number −.77 is a plausible realization of a random variable
  Z ∼ N (0, 1).
◮ Conclusion: our data does not suggest that the coin is unfair.
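Both statistics can be recomputed directly (a sketch, not part of the slides); note the slides round X̄n to two decimals before plugging in, which is why they report ≈ 3.22 and ≈ −.77, while the exact values are ≈ 3.13 and ≈ −0.73; the conclusions are identical:

```python
import math

def coin_stat(heads, n, p0=0.5):
    """CLT/Slutsky test statistic: sqrt(n) * (Xbar - p0) / sqrt(p0 * (1 - p0))."""
    xbar = heads / n
    return math.sqrt(n) * (xbar - p0) / math.sqrt(p0 * (1 - p0))

T1 = coin_stat(54, 80)  # Example 1: ~3.13 exactly (slides round Xbar to .68 -> ~3.22)
T2 = coin_stat(13, 30)  # Example 2: ~-0.73 exactly (slides round Xbar to .43 -> ~-.77)

# At asymptotic level 5%, reject H0: p = 1/2 when |T| > 1.96
reject1 = abs(T1) > 1.96  # True: the coin looks unfair
reject2 = abs(T2) > 1.96  # False: no evidence against fairness
```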
10/37
Statistical formulation (1)
◮ Consider a sample X1 , . . . , Xn of i.i.d. random variables and a
statistical model (E, (IPθ )θ∈Θ ).

◮ Let Θ0 and Θ1 be disjoint subsets of Θ.



◮ Consider the two hypotheses:

      H0 : θ ∈ Θ0   vs.   H1 : θ ∈ Θ1

◮ H0 is the null hypothesis, H1 is the alternative hypothesis.

◮ If we believe that the true θ is either in Θ0 or in Θ1 , we may


want to test H0 against H1 .

◮ We want to decide whether to reject H0 (look for evidence


against H0 in the data).
11/37
Statistical formulation (2)

◮ H0 and H1 do not play a symmetric role: the data is only
  used to try to disprove H0 .
◮ In particular, lack of evidence does not mean that H0 is true
  (“innocent until proven guilty”).

◮ A test is a statistic ψ ∈ {0, 1} such that:
   ◮ If ψ = 0, H0 is not rejected;
   ◮ If ψ = 1, H0 is rejected.

◮ Coin example: H0 : p = 1/2 vs. H1 : p ≠ 1/2.

◮ ψ = 1I{ √n |X̄n − .5|/√(.5(1 − .5)) > C }, for some C > 0.

◮ How to choose the threshold C ?


12/37
Statistical formulation (3)
◮ Rejection region of a test ψ:

      Rψ = {x ∈ E^n : ψ(x) = 1}.

◮ Type 1 error of a test ψ (rejecting H0 when it is actually
  true):
      αψ : Θ0 → IR
           θ ↦ IPθ [ψ = 1].

◮ Type 2 error of a test ψ (not rejecting H0 although H1 is
  actually true):
      βψ : Θ1 → IR
           θ ↦ IPθ [ψ = 0].

◮ Power of a test ψ:

      πψ = inf_{θ∈Θ1} (1 − βψ (θ)) .
13/37
Statistical formulation (4)

◮ A test ψ has level α if

αψ (θ) ≤ α, ∀θ ∈ Θ0 .

◮ A test ψ has asymptotic level α if

      lim_{n→∞} αψ (θ) ≤ α ,   ∀θ ∈ Θ0 .

◮ In general, a test has the form

ψ = 1I{Tn > c},

for some statistic Tn and threshold c ∈ IR.

◮ Tn is called the test statistic. The rejection region is


Rψ = {Tn > c}.

14/37
Example (1)

◮ Let X1 , . . . , Xn i.i.d. ∼ Ber(p), for some unknown p ∈ (0, 1).
◮ We want to test:

      H0 : p = 1/2   vs.   H1 : p ≠ 1/2

  with asymptotic level α ∈ (0, 1).

◮ Let Tn = √n (p̂n − 0.5)/√(.5(1 − .5)) , where p̂n is the MLE.

◮ If H0 is true, then by CLT and Slutsky’s theorem,

      IP[|Tn | > qα/2 ] −→ α as n → ∞.

◮ Let ψα = 1I{|Tn | > qα/2 }.

15/37
Example (2)

Coming back to the two previous coin examples: For α = 5%,


qα/2 = 1.96, so:

◮ In Example 1, H0 is rejected at the asymptotic level 5% by


the test ψ5% ;

◮ In Example 2, H0 is not rejected at the asymptotic level 5%


by the test ψ5% .

Question: In Example 1, for what level α would ψα not reject H0


? And in Example 2, at which level α would ψα reject H0 ?

16/37
p-value
Definition

The (asymptotic) p-value of a test ψα is the smallest (asymptotic)


level α at which ψα rejects H0 . It is random, it depends on the
sample.

Golden rule

p-value ≤ α ⇔ H0 is rejected by ψα , at the (asymptotic) level α.

The smaller the p-value, the more confidently one can reject
H0 .

◮ Example 1: p-value = IP[|Z| > 3.22] ≪ .01.

◮ Example 2: p-value = IP[|Z| > .77] ≈ .44.
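These p-values can be reproduced from the standard normal cdf, written here via the error function (a quick check, not part of the slides; stdlib only):

```python
import math

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided_pvalue(t):
    """p-value = IP[|Z| > |t|] for Z ~ N(0, 1)."""
    return 2.0 * (1.0 - phi(abs(t)))

pv1 = two_sided_pvalue(3.22)   # Example 1: well below .01 -> reject H0 confidently
pv2 = two_sided_pvalue(-0.77)  # Example 2: about .44 -> no evidence against H0
```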
17/37
Neyman-Pearson’s paradigm

Idea: For given hypotheses, among all tests of level/asymptotic


level α, is it possible to find one that has maximal power ?

Example: The trivial test ψ = 0 that never rejects H0 has a


perfect level (α = 0) but poor power (πψ = 0).

Neyman-Pearson’s theory provides (the most) powerful tests


with given level. In 18.650, we only study several cases.

18/37
The χ2 distributions
Definition
For a positive integer d, the χ2 (pronounced “Kai-squared”)
distribution with d degrees of freedom is the law of the random
iid
variable Z12 + Z22 + . . . + Zd2 , where Z1 , . . . , Zd ∼ N (0, 1).

Examples:
◮ If Z ∼ Nd (0, Id ), then ‖Z‖₂² ∼ χ²_d .
◮ Recall that the sample variance is given by

      Sn = (1/n) Σ_{i=1}^n (Xi − X̄n )² = (1/n) Σ_{i=1}^n Xi² − (X̄n )²

◮ Cochran’s theorem implies that for X1 , . . . , Xn i.i.d. ∼ N (µ, σ²), if
  Sn is the sample variance, then

      nSn /σ² ∼ χ²_{n−1} .

◮ χ²₂ = Exp(1/2).
19/37
Student’s T distributions

Definition
For a positive integer d, the Student’s T distribution with d
degrees of freedom (denoted by td ) is the law of the random
variable Z/√(V /d), where Z ∼ N (0, 1), V ∼ χ²_d and Z ⊥⊥ V (Z is
independent of V ).

Example:
◮ Cochran’s theorem implies that for X1 , . . . , Xn i.i.d. ∼ N (µ, σ²), if
  Sn is the sample variance, then

      √(n − 1) (X̄n − µ)/√Sn ∼ t_{n−1} .

20/37
Wald’s test (1)

◮ Consider an i.i.d. sample X1 , . . . , Xn with statistical model


(E, (IPθ )θ∈Θ ), where Θ ⊆ IRd (d ≥ 1) and let θ0 ∈ Θ be fixed
and given.
◮ Consider the following hypotheses:

      H0 : θ = θ0   vs.   H1 : θ ≠ θ0 .

◮ Let θ̂^MLE be the MLE. Assume the MLE technical conditions
  are satisfied.

◮ If H0 is true, then

      √n I(θ̂^MLE )^{1/2} (θ̂n^MLE − θ0 ) −→ Nd (0, Id ) in distribution w.r.t. IPθ0 .

21/37
Wald’s test (2)

◮ Hence,

      Tn := n (θ̂n^MLE − θ0 )⊤ I(θ̂^MLE ) (θ̂n^MLE − θ0 ) −→ χ²_d in distribution w.r.t. IPθ0 .

◮ Wald’s test with asymptotic level α ∈ (0, 1):

ψ = 1I{Tn > qα },

where qα is the (1 − α)-quantile of χ2d (see tables).

◮ Remark: Wald’s test is also valid if H1 has the form “θ > θ0 ”


or “θ < θ0 ” or “θ = θ1 ”...

22/37
Likelihood ratio test (1)

◮ Consider an i.i.d. sample X1 , . . . , Xn with statistical model


(E, (IPθ )θ∈Θ ), where Θ ⊆ IRd (d ≥ 1).

◮ Suppose the null hypothesis has the form

      H0 : (θr+1 , . . . , θd ) = (θr+1^(0) , . . . , θd^(0) ),

  for some fixed and given numbers θr+1^(0) , . . . , θd^(0) .

◮ Let

      θ̂n = argmax_{θ∈Θ} ℓn (θ)   (MLE)

  and

      θ̂n^c = argmax_{θ∈Θ0} ℓn (θ)   (“constrained MLE”)

23/37
Likelihood ratio test (2)

◮ Test statistic:

      Tn = 2 (ℓn (θ̂n ) − ℓn (θ̂n^c )) .

◮ Theorem
  Assume H0 is true and the MLE technical conditions are satisfied.
  Then,

      Tn −→ χ²_{d−r} in distribution w.r.t. IPθ , as n → ∞.

◮ Likelihood ratio test with asymptotic level α ∈ (0, 1):

ψ = 1I{Tn > qα },

where qα is the (1 − α)-quantile of χ2d−r (see tables).

24/37
Testing implicit hypotheses (1)

◮ Let X1 , . . . , Xn be i.i.d. random variables and let θ ∈ IRd be


a parameter associated with the distribution of X1 (e.g. a
moment, the parameter of a statistical model, etc...)

◮ Let g : IRd → IRk be continuously differentiable (with k < d).

◮ Consider the following hypotheses:

      H0 : g(θ) = 0
      H1 : g(θ) ≠ 0.

◮ E.g. g(θ) = (θ1 , θ2 ) (k = 2), or g(θ) = θ1 − θ2 (k = 1), or...

25/37
Testing implicit hypotheses (2)

◮ Suppose an asymptotically normal estimator θ̂n is available:

      √n (θ̂n − θ) −→ Nd (0, Σ(θ)) in distribution as n → ∞.

◮ Delta method:

      √n (g(θ̂n ) − g(θ)) −→ Nk (0, Γ(θ)) in distribution as n → ∞,

  where Γ(θ) = ∇g(θ)⊤ Σ(θ) ∇g(θ) ∈ IR^{k×k} .

◮ Assume Σ(θ) is invertible and ∇g(θ) has rank k. So, Γ(θ) is
  invertible and

      √n Γ(θ)^{−1/2} (g(θ̂n ) − g(θ)) −→ Nk (0, Ik ) in distribution as n → ∞.

26/37
Testing implicit hypotheses (3)

◮ Then, by Slutsky’s theorem, if Γ(θ) is continuous in θ,

      √n Γ(θ̂n )^{−1/2} (g(θ̂n ) − g(θ)) −→ Nk (0, Ik ) in distribution as n → ∞.

◮ Hence, if H0 is true, i.e., g(θ) = 0,

      Tn := n g(θ̂n )⊤ Γ^{−1}(θ̂n ) g(θ̂n ) −→ χ²_k in distribution as n → ∞.

◮ Test with asymptotic level α:

      ψ = 1I{Tn > qα },

  where qα is the (1 − α)-quantile of χ²_k (see tables).

27/37
The multinomial case: χ2 test (1)

Let E = {a1 , . . . , aK } be a finite space and (IPp )p∈ΔK be the


family of all probability distributions on E:

 
◮ ΔK = { p = (p1 , . . . , pK ) ∈ (0, 1)^K : Σ_{j=1}^K pj = 1 } .

◮ For p ∈ ΔK and X ∼ IPp ,

IPp [X = aj ] = pj , j = 1, . . . , K.

28/37
The multinomial case: χ2 test (2)

iid
◮ Let X1 , . . . , Xn ∼ IPp , for some unknown p ∈ ΔK , and let
p0 ∈ ΔK be fixed.

◮ We want to test:

      H0 : p = p0   vs.   H1 : p ≠ p0

  with asymptotic level α ∈ (0, 1).

◮ Example: If p0 = (1/K, 1/K, . . . , 1/K), we are testing


whether IPp is the uniform distribution on E.

29/37
The multinomial case: χ2 test (3)

◮ Likelihood of the model:

      Ln (X1 , . . . , Xn , p) = p1^{N1} p2^{N2} · · · pK^{NK} ,

  where Nj = #{i = 1, . . . , n : Xi = aj }.

◮ Let p̂ be the MLE:

      p̂j = Nj /n ,   j = 1, . . . , K.

  p̂ maximizes log Ln (X1 , . . . , Xn , p) under the constraint
  Σ_{j=1}^K pj = 1.

30/37
The multinomial case: χ2 test (4)

◮ If H0 is true, then √n (p̂ − p0 ) is asymptotically normal, and
  the following holds.

Theorem

      Tn := n Σ_{j=1}^K (p̂j − pj^0 )² / pj^0 −→ χ²_{K−1} in distribution as n → ∞.

◮ χ2 test with asymptotic level α: ψα = 1I{Tn > qα },


where qα is the (1 − α)-quantile of χ2K−1 .
◮ Asymptotic p-value of this test: p-value = IP[Z > Tn | Tn ],
  where Z ∼ χ²_{K−1} and Z ⊥⊥ Tn .
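A worked sketch of the χ² test (not from the slides; the counts are made up): testing whether a die is fair from 120 rolls. The constant 11.07 is the 95% quantile of χ²₅ from standard tables.

```python
counts = [18, 22, 21, 19, 25, 15]   # observed N_j over K = 6 faces
n = sum(counts)                     # 120 rolls
p0 = [1.0 / 6.0] * 6                # H0: uniform distribution

# Tn = n * sum_j (p_hat_j - p0_j)^2 / p0_j
Tn = n * sum((Nj / n - p) ** 2 / p for Nj, p in zip(counts, p0))

q95_chi2_5 = 11.07                  # 95% quantile of chi^2 with K - 1 = 5 df
reject = Tn > q95_chi2_5            # False: no evidence against fairness
```

Here Tn = 3.0, well below the 5% critical value, so H0 is not rejected.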

31/37
The Gaussian case: Student’s test (1)

iid
◮ Let X1 , . . . , Xn ∼ N (µ, σ 2 ), for some unknown
µ ∈ IR, σ 2 > 0 and let µ0 ∈ IR be fixed, given.

◮ We want to test:

      H0 : µ = µ0   vs.   H1 : µ ≠ µ0

  with asymptotic level α ∈ (0, 1).

◮ If σ² is known: Let Tn = √n (X̄n − µ0 )/σ. Then, Tn ∼ N (0, 1)
  and
      ψα = 1I{|Tn | > qα/2 }
  is a test with (non asymptotic) level α.

32/37
The Gaussian case: Student’s test (2)

If σ 2 is unknown:

◮ Let T̃n = √(n − 1) (X̄n − µ0 )/√Sn , where Sn is the sample variance.

◮ Cochran’s theorem:
   ◮ X̄n ⊥⊥ Sn ;
   ◮ nSn /σ² ∼ χ²_{n−1} .

◮ Hence, T̃n ∼ t_{n−1} : Student’s distribution with n − 1 degrees
  of freedom.

33/37
The Gaussian case: Student’s test (3)
◮ Student’s test with (non asymptotic) level α ∈ (0, 1):

      ψα = 1I{|T̃n | > qα/2 },

  where qα/2 is the (1 − α/2)-quantile of t_{n−1} .

◮ If H1 is µ > µ0 , Student’s test with level α ∈ (0, 1) is:

      ψα′ = 1I{T̃n > qα },

  where qα is the (1 − α)-quantile of t_{n−1} .

◮ Advantage of Student’s test:


◮ Non asymptotic
◮ Can be run on small samples

◮ Drawback of Student’s test: It relies on the assumption that


the sample is Gaussian.
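Written with the biased variance Sn = (1/n) Σ (Xi − X̄n)², the statistic T̃n is algebraically identical to the more common form √n (X̄n − µ0)/s with the unbiased s² = (1/(n−1)) Σ (Xi − X̄n)². A quick check on made-up data (not from the slides; assumes numpy):

```python
import numpy as np

x = np.array([101.2, 98.7, 103.4, 99.9, 100.8, 102.1, 97.5, 100.3])
mu0 = 100.0
n = len(x)

Sn = np.mean((x - x.mean()) ** 2)                # biased sample variance (divided by n)
T_tilde = np.sqrt(n - 1) * (x.mean() - mu0) / np.sqrt(Sn)

s = x.std(ddof=1)                                # unbiased standard deviation
t_classic = np.sqrt(n) * (x.mean() - mu0) / s    # textbook one-sample t-statistic
```

Both expressions evaluate to the same number, so either convention can be used with the t_{n−1} quantiles.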
34/37
Two-sample test: large sample case (1)
◮ Consider two samples: X1 , . . . , Xn and Y1 , . . . , Ym , of
independent random variables such that

      IE[X1 ] = · · · = IE[Xn ] = µX   and   IE[Y1 ] = · · · = IE[Ym ] = µY

◮ Assume that the variances are known, so assume (without
  loss of generality) that

      var(X1 ) = · · · = var(Xn ) = var(Y1 ) = · · · = var(Ym ) = 1

◮ We want to test:

      H0 : µX = µY   vs.   H1 : µX ≠ µY

  with asymptotic level α ∈ (0, 1).


35/37
Two-sample test: large sample case (2)
From the CLT:

      √n (X̄n − µX ) −→ N (0, 1) in distribution as n → ∞

and

      √m (Ȳm − µY ) −→ N (0, 1) in distribution as m → ∞,

so, if n/m → γ,

      √n (Ȳm − µY ) −→ N (0, γ) in distribution.

Moreover, the two samples are independent, so

      √n (X̄n − Ȳm ) − √n (µX − µY ) −→ N (0, 1 + γ) in distribution.

Under H0 : µX = µY :

      √n (X̄n − Ȳm )/√(1 + n/m) −→ N (0, 1) in distribution.

Test: ψα = 1I{ √n |X̄n − Ȳm |/√(1 + n/m) > qα/2 }
36/37
Two-sample T-test
◮ If the variances are unknown but we know that
  Xi ∼ N (µX , σX² ), Yi ∼ N (µY , σY² ):

◮ Then
      X̄n − Ȳm ∼ N ( µX − µY , σX²/n + σY²/m )

◮ Under H0 :
      (X̄n − Ȳm )/√(σX²/n + σY²/m) ∼ N (0, 1)

◮ For unknown variances:

      (X̄n − Ȳm )/√(SX²/n + SY²/m) ∼ tN ,

  where N is given by the Welch–Satterthwaite formula

      N = (SX²/n + SY²/m)² / ( SX⁴/(n²(n − 1)) + SY⁴/(m²(m − 1)) )
37/37
Statistics for Applications

Chapter 6: Testing goodness of fit

1/25
Goodness of fit tests

Let X be a r.v. Given i.i.d copies of X we want to answer the


following types of questions:
◮ Does X have distribution N (0, 1)? (Cf. Student’s T
distribution)
◮ Does X have distribution U ([0, 1])? (Cf. p-value under H0 )
◮ Does X have PMF given by p1 = 0.3, p2 = 0.5, p3 = 0.2 ?

These are all goodness of fit tests: we want to know if the


hypothesized distribution is a good fit for the data.

Key characteristic of GoF tests: no parametric modeling.

2/25
Cdf and empirical cdf (1)
Let X1 , . . . , Xn be i.i.d. real random variables. Recall the cdf of
X1 is defined as:

F (t) = IP[X1 ≤ t], ∀t ∈ IR.

It completely characterizes the distribution of X1 .

Definition
The empirical cdf of the sample X1 , . . . , Xn is defined as:
      Fn (t) = (1/n) Σ_{i=1}^n 1{Xi ≤ t} = #{i = 1, . . . , n : Xi ≤ t} / n ,   ∀t ∈ IR.
n

3/25
Cdf and empirical cdf (2)

By the LLN, for all t ∈ IR,

      Fn (t) −→ F (t) a.s. as n → ∞.

Glivenko-Cantelli Theorem (Fundamental theorem of statistics)

      sup_{t∈IR} |Fn (t) − F (t)| −→ 0 a.s. as n → ∞.

4/25
Cdf and empirical cdf (3)

By the CLT, for all t ∈ IR,

      √n (Fn (t) − F (t)) −→ N (0, F (t)(1 − F (t))) in distribution as n → ∞.

Donsker’s Theorem
If F is continuous, then
      √n sup_{t∈IR} |Fn (t) − F (t)| −→ sup_{0≤t≤1} |B(t)| in distribution as n → ∞,

where B is a Brownian bridge on [0, 1].

5/25
Kolmogorov-Smirnov test (1)

◮ Let X1 , . . . , Xn be i.i.d. real random variables with unknown


cdf F and let F 0 be a continuous cdf.

◮ Consider the two hypotheses:

      H0 : F = F0   v.s.   H1 : F ≠ F0 .

◮ Let Fn be the empirical cdf of the sample X1 , . . . , Xn .

◮ If F = F0 , then Fn (t) ≈ F0 (t), for all t ∈ IR.

6/25
Kolmogorov-Smirnov test (2)


◮ Let Tn = √n sup_{t∈IR} |Fn (t) − F0 (t)| .

(d)
◮ By Donsker’s theorem, if H0 is true, then Tn −−−→ Z,
n→∞
where Z has a known distribution (supremum of a Brownian
bridge).

◮ KS test with asymptotic level α:

δαKS = 1{Tn > qα },

where qα is the (1 − α)-quantile of Z (obtained in tables).

◮ p-value of KS test: IP[Z > Tn |Tn ].

7/25
Kolmogorov-Smirnov test (3)

Remarks:
◮ In practice, how to compute Tn ?

◮ F 0 is non decreasing, Fn is piecewise constant, with jumps at


ti = Xi , i = 1, . . . , n.

◮ Let X(1) ≤ X(2) ≤ . . . ≤ X(n) be the reordered sample.

◮ The expression for Tn reduces to the following practical
  formula:

      Tn = √n max_{i=1,...,n} max{ |i/n − F0 (X(i) )| , |(i − 1)/n − F0 (X(i) )| } .
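The practical formula agrees with a brute-force evaluation of the supremum on a fine grid, here with F0 the cdf of U([0, 1]) (a sketch, not from the slides; assumes numpy; the sample is made up):

```python
import numpy as np

x = np.array([0.12, 0.27, 0.31, 0.45, 0.52, 0.61, 0.78, 0.85, 0.91, 0.97])
n = len(x)
xs = np.sort(x)
F0 = lambda t: np.clip(t, 0.0, 1.0)  # cdf of U([0, 1])

# Practical formula: max over the jump points of the empirical cdf
i = np.arange(1, n + 1)
D = np.max(np.maximum(np.abs(i / n - F0(xs)), np.abs((i - 1) / n - F0(xs))))
Tn = np.sqrt(n) * D

# Brute-force approximation of sup_t |F_n(t) - F0(t)| on a fine grid
ts = np.linspace(0.0, 1.0, 200_001)
Fn = np.searchsorted(xs, ts, side="right") / n
D_grid = np.max(np.abs(Fn - F0(ts)))
```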

8/25
Kolmogorov-Smirnov test (4)

◮ Tn is called a pivotal statistic: If H0 is true, the distribution


of Tn does not depend on the distribution of the Xi ’s and it is
easy to reproduce it in simulations.

◮ Indeed, let Ui = F 0 (Xi ), i = 1, . . . , n and let Gn be the


empirical cdf of U1 , . . . , Un .

◮ If H0 is true, then U1 , . . . , Un are i.i.d. ∼ U ([0, 1])

  and Tn = √n sup_{0≤x≤1} |Gn (x) − x| .

9/25
Kolmogorov-Smirnov test (5)

◮ For some large integer M :


◮ Simulate M i.i.d. copies Tn1 , . . . , TnM of Tn ;

◮ Estimate the (1 − α)-quantile qα^(n) of Tn by taking the sample
  (1 − α)-quantile q̂α^(n,M) of Tn^1 , . . . , Tn^M .

◮ Test with approximate level α:

δα = 1{Tn > q̂α(n,M ) }.

◮ Approximate p-value of this test:

      p-value ≈ #{j = 1, . . . , M : Tn^j > Tn } / M .
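The simulation scheme above takes a few lines (a sketch, not from the slides; assumes numpy). For n = 100 the simulated 95% quantile of Tn lands near the asymptotic Kolmogorov value 1.358:

```python
import numpy as np

rng = np.random.default_rng(42)
n, M, alpha = 100, 2000, 0.05

def ks_stat_uniform(u):
    """sqrt(n) * sup |G_n(x) - x| for a sample u of U([0, 1]) variables."""
    us = np.sort(u)
    i = np.arange(1, len(us) + 1)
    D = np.max(np.maximum(np.abs(i / n - us), np.abs((i - 1) / n - us)))
    return np.sqrt(n) * D

sims = np.array([ks_stat_uniform(rng.uniform(size=n)) for _ in range(M)])
q_hat = np.quantile(sims, 1 - alpha)  # approximate (1 - alpha)-quantile of Tn
```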

10/25
Kolmogorov-Smirnov test (6)
These quantiles are often precomputed in a table.

11/25
Other goodness of fit tests

We want to measure the distance between two functions: Fn (t)


and F (t). There are other ways, leading to other tests:
◮ Kolmogorov-Smirnov:

      d(Fn , F ) = sup_{t∈IR} |Fn (t) − F (t)|

◮ Cramér-Von Mises:

      d²(Fn , F ) = ∫_IR [Fn (t) − F (t)]² dt

◮ Anderson-Darling:

      d²(Fn , F ) = ∫_IR [Fn (t) − F (t)]² / ( F (t)(1 − F (t)) ) dt

12/25
Composite goodness of fit tests

What if I want to test: ”Does X have Gaussian distribution?” but


I don’t know the parameters?
Simple idea: plug-in

      sup_{t∈IR} |Fn (t) − Φ_{µ̂,σ̂²} (t)|

where
      µ̂ = X̄n ,   σ̂² = Sn²

and Φ_{µ̂,σ̂²} (t) is the cdf of N (µ̂, σ̂²).

In this case Donsker’s theorem is no longer valid. This is a


common and serious mistake!

13/25
Kolmogorov-Lilliefors test (1)

Instead, we compute the quantiles for the test statistic:

      sup_{t∈IR} |Fn (t) − Φ_{µ̂,σ̂²} (t)|

They do not depend on unknown parameters!

This is the Kolmogorov-Lilliefors test.

14/25
Kolmogorov-Lilliefors test (2)
These quantiles are often precomputed in a table.

15/25
Quantile-Quantile (QQ) plots (1)
◮ Provide a visual way to perform GoF tests
◮ Not a formal test, but a quick and easy check to see if a
  distribution is plausible.
◮ Main idea: we want to check visually if the plot of Fn is close
to that of F or equivalently if the plot of Fn−1 is close to that
of F −1 .
◮ More convenient to check if the points

      (F⁻¹(1/n), Fn⁻¹(1/n)), (F⁻¹(2/n), Fn⁻¹(2/n)), . . . , (F⁻¹((n−1)/n), Fn⁻¹((n−1)/n))

  are near the line y = x.
◮ Fn is not technically invertible but we define

      Fn⁻¹(i/n) = X(i) ,

  the ith smallest observation.


16/25
χ2 goodness-of-fit test, finite case (1)

◮ Let X1 , . . . , Xn be i.i.d. random variables on some finite


space E = {a1 , . . . , aK }, with some probability measure IP.

◮ Let (IPθ )θ∈Θ be a parametric family of probability


distributions on E.

◮ Example: On E = {0, 1, . . . , K}, consider the family of binomial
  distributions (Bin(K, p))p∈(0,1) .

◮ For j = 1, . . . , K and θ ∈ Θ, set

pj (θ) = IPθ [Y = aj ], where Y ∼ IPθ

and
pj = IP[X1 = aj ].

19/25
χ2 goodness-of-fit test, finite case (2)

◮ Consider the two hypotheses:

      H0 : IP ∈ (IPθ )θ∈Θ   v.s.   H1 : IP ∉ (IPθ )θ∈Θ .

◮ Testing H0 means testing whether the statistical model
  (E, (IPθ )θ∈Θ ) fits the data (e.g., whether the data are indeed
  from a binomial distribution).

◮ H0 is equivalent to:

pj = pj (θ), ∀j = 1, . . . , K, for some θ ∈ Θ.

20/25
χ2 goodness-of-fit test, finite case (3)

◮ Let θ̂ be the MLE of θ when assuming H0 is true.

◮ Let

      p̂j = (1/n) Σ_{i=1}^n 1{Xi = aj } = #{i : Xi = aj } / n ,   j = 1, . . . , K.

◮ Idea: If H0 is true, then pj = pj (θ), so both p̂j and pj (θ̂) are
  good estimators of pj . Hence, p̂j ≈ pj (θ̂), ∀j = 1, . . . , K.

◮ Define the test statistic:

      Tn = n Σ_{j=1}^K (p̂j − pj (θ̂))² / pj (θ̂) .

21/25
χ2 goodness-of-fit test, finite case (4)

◮ Under some technical assumptions, if H0 is true, then


(d)
Tn −−−→ χ2K−d−1,
n→∞

where d is the size of the parameter θ (Θ ⊆ IRd and


d < K − 1).

◮ Test with asymptotic level α ∈ (0, 1):

δα = 1{Tn > qα },

where qα is the (1 − α)-quantile of χ2K−d−1 .

◮ p-value: IP[Z > Tn |Tn ], where Z ∼ χ2K−d−1 and Z ⊥⊥ Tn .

22/25
χ2 goodness-of-fit test, infinite case (1)

◮ If E is infinite (e.g. E = IN, E = IR, ...):

◮ Partition E into K disjoint bins:

E = A1 ∪ . . . ∪ AK .
◮ Define, for θ ∈ Θ and j = 1, . . . , K:
   ◮ pj (θ) = IPθ [Y ∈ Aj ], for Y ∼ IPθ ,
   ◮ pj = IP[X1 ∈ Aj ],
   ◮ p̂j = (1/n) Σ_{i=1}^n 1{Xi ∈ Aj } = #{i : Xi ∈ Aj } / n ,
   ◮ θ̂: same as in the previous case.

23/25
χ2 goodness-of-fit test, infinite case (2)
◮ As previously, let

      Tn = n Σ_{j=1}^K (p̂j − pj (θ̂))² / pj (θ̂) .

◮ Under some technical assumptions, if H0 is true, then


(d)
Tn −−−→ χ2K−d−1,
n→∞

where d is the size of the parameter θ (Θ ⊆ IRd and


d < K − 1).

◮ Test with asymptotic level α ∈ (0, 1):

δα = 1{Tn > qα },

where qα is the (1 − α)-quantile of χ2K−d−1 .

24/25
χ2 goodness-of-fit test, infinite case (3)
◮ Practical issues:
◮ Choice of K ?
◮ Choice of the bins A1 , . . . , AK ?
◮ Computation of pj (θ) ?

◮ Example 1: Let E = IN and H0 : IP ∈ (Poiss(λ))λ>0 .

◮ If one expects λ to be no larger than some λmax , one can


choose A1 = {0}, A2 = {1}, . . . , AK−1 = {K − 2}, AK =
{K − 1, K, K + 1, . . .}, with K large enough such that
pK (λmax ) ≈ 0.

25/25
Statistics for Applications

Chapter 7: Regression

1/43
Heuristics of the linear regression (1)
Consider a cloud of i.i.d. random points (Xi , Yi ), i = 1, . . . , n :

2/43
Heuristics of the linear regression (2)

◮ Idea: Fit the best line to the data.

◮ Approximation: Yi ≈ a + bXi , i = 1, . . . , n, for some
  (unknown) a, b ∈ IR.

◮ Find â, b̂ that approach a and b.

◮ More generally: Yi ∈ IR, Xi ∈ IR^d ,

      Yi ≈ a + Xi⊤ b ,   a ∈ IR, b ∈ IR^d .

◮ Goal: Write a rigorous model and estimate a and b.

3/43
Heuristics of the linear regression (3)

Examples:
◮ Economics: Demand and price,

      Di ≈ a + b pi ,   i = 1, . . . , n.

◮ Ideal gas law: P V = nRT ,

      log Pi ≈ a + b log Vi + c log Ti ,   i = 1, . . . , n.

4/43
Linear regression of a r.v. Y on a r.v. X (1)

◮ Let X and Y be two real r.v. (not necessarily independent)
  with two moments and such that Var(X) ≠ 0.

◮ The theoretical linear regression of Y on X is the best
  approximation in quadratic mean of Y by a linear function of
  X, i.e., the r.v. a + bX, where a and b are the two real
  numbers minimizing IE[(Y − a − bX)²].

◮ By some simple algebra:

      b = cov(X, Y )/Var(X) ,

      a = IE[Y ] − b IE[X] = IE[Y ] − (cov(X, Y )/Var(X)) IE[X] .

5/43
Linear regression of a r.v. Y on a r.v. X (2)

◮ If ε = Y − (a + bX), then

      Y = a + bX + ε ,

  with IE[ε] = 0 and cov(X, ε) = 0.

◮ Conversely: Assume that Y = a + bX + ε for some a, b ∈ IR
  and some centered r.v. ε that satisfies cov(X, ε) = 0.

◮ E.g., if X ⊥⊥ ε or if IE[ε|X] = 0, then cov(X, ε) = 0.

◮ Then, a + bX is the theoretical linear regression of Y on X.

6/43
Linear regression of a r.v. Y on a r.v. X (3)
A sample of n i.i.d. random pairs (X1 , Y1 ), . . . , (Xn , Yn ) with the
same distribution as (X, Y ) is available.

We want to estimate a and b.

7/43
Linear regression of a r.v. Y on a r.v. X (4)

Definition
The least squared error (LSE) estimator of (a, b) is the minimizer
of the sum of squared errors:

      Σ_{i=1}^n (Yi − a − bXi )² .

(â, b̂) is given by

      b̂ = ( (1/n) Σ_{i=1}^n Xi Yi − X̄ Ȳ ) / ( (1/n) Σ_{i=1}^n Xi² − (X̄)² ) ,

      â = Ȳ − b̂ X̄ .

12/43
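A quick check (not part of the slides; assumes numpy; the data are made up) that the closed-form (â, b̂) matches numpy's least-squares line fit:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2])

# Closed-form LSE from the slide
b_hat = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean() ** 2)
a_hat = y.mean() - b_hat * x.mean()

slope, intercept = np.polyfit(x, y, 1)  # same fit via numpy's polynomial LS
```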
Linear regression of a r.v. Y on a r.v. X (5)

13/43
Multivariate case (1)

      Yi = Xi⊤ β + εi ,   i = 1, . . . , n.

◮ Vector of explanatory variables or covariates: Xi ∈ IR^p (wlog,
  assume its first coordinate is 1).

◮ Dependent variable: Yi .

◮ β = (a, b⊤ )⊤ ; β1 (= a) is called the intercept.

◮ {εi }i=1,...,n : noise terms satisfying cov(Xi , εi ) = 0.

Definition
The least squared error (LSE) estimator of β is the minimizer of
the sum of squared errors:

      β̂ = argmin_{t∈IR^p} Σ_{i=1}^n (Yi − Xi⊤ t)²
14/43
Multivariate case (2)

LSE in matrix form

◮ Let Y = (Y1 , . . . , Yn )⊤ ∈ IR^n .

◮ Let X be the n × p matrix whose rows are X1⊤ , . . . , Xn⊤ (X is
  called the design matrix).

◮ Let ε = (ε1 , . . . , εn )⊤ ∈ IR^n (unobserved noise), so that

      Y = Xβ + ε .

◮ The LSE β̂ satisfies:

      β̂ = argmin_{t∈IR^p} ‖Y − Xt‖₂² .

15/43
Multivariate case (3)

◮ Assume that rank(X) = p.

◮ Analytic computation of the LSE:

      β̂ = (X⊤ X)⁻¹ X⊤ Y.

Geometric interpretation of the LSE

◮ Xβ̂ is the orthogonal projection of Y onto the subspace
  spanned by the columns of X:

      Xβ̂ = P Y,

  where P = X(X⊤ X)⁻¹ X⊤ .
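The closed form β̂ = (X⊤X)⁻¹X⊤Y can be checked against numpy's least-squares solver on simulated data (a sketch, not part of the slides; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = 1
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]  # same LS problem
```

In practice `lstsq` (or a QR factorization) is preferred numerically, but with rank(X) = p both agree.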

16/43
Linear regression with deterministic design and Gaussian
noise (1)

Assumptions:

◮ The design matrix X is deterministic and rank(X) = p.

◮ The model is homoscedastic: ε1 , . . . , εn are i.i.d.

◮ The noise vector ε is Gaussian:

      ε ∼ Nn (0, σ² In ),

  for some known or unknown σ² > 0.

17/43
Linear regression with deterministic design and Gaussian
noise (2)
◮ LSE = MLE:   β̂ ∼ Np (β, σ² (X⊤ X)⁻¹ ).

◮ Quadratic risk of β̂:   IE[‖β̂ − β‖₂²] = σ² tr((X⊤ X)⁻¹ ).

◮ Prediction error:   IE[‖Y − Xβ̂‖₂²] = σ² (n − p).

◮ Unbiased estimator of σ²:   σ̂² = ‖Y − Xβ̂‖₂² / (n − p).

Theorem

◮ (n − p) σ̂²/σ² ∼ χ²_{n−p} .

◮ β̂ ⊥⊥ σ̂² .
18/43
Significance tests (1)
Test whether the j-th explanatory variable is significant in the
linear regression (1  j  p).

H0 : βj = 0 v.s. H1 : βj = 0.

If γj is the j-th diagonal coefficient of (X⊤X)⁻¹ (γj > 0):

(β̂j − βj) / √(σ̂² γj) ∼ t_{n−p}.

Let Tn(j) = β̂j / √(σ̂² γj).

Test with non asymptotic level α ∈ (0, 1):

δα(j) = 1{ |Tn(j)| > q_{α/2}(t_{n−p}) },

where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.


19/43
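The test above can be sketched in a few lines; a hedged simulation (synthetic data; SciPy is assumed available for the Student quantile):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 0.0, 1.0])   # the second coefficient is truly zero
Y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)  # unbiased variance estimate
gamma = np.diag(XtX_inv)                                 # the gamma_j of the slide

T = beta_hat / np.sqrt(sigma2_hat * gamma)  # T_n^{(j)}, ~ t_{n-p} under H0
alpha = 0.05
q = stats.t.ppf(1 - alpha / 2, df=n - p)    # (1 - alpha/2)-quantile of t_{n-p}
reject = np.abs(T) > q
print(reject)
```

With a true coefficient of 0 the test rejects with probability about α; the large coefficients are rejected essentially always.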
Significance tests (2)

Test whether a group of explanatory variables is significant in


the linear regression.

H0 : βj = 0, ∀j ∈ S v.s. H1 : ∃j ∈ S, βj ≠ 0, where
S ⊆ {1, . . . , p}.

Bonferroni's test: δα^B = max_{j∈S} δ^{(j)}_{α/k} , where k = |S|.

δα^B has non asymptotic level at most α.

20/43
More tests (1)

Let G be a k × p matrix with rank(G) = k (k ≤ p) and λ ∈ IRk.


Consider the hypotheses:

H0 : Gβ = λ v.s. H1 : Gβ ≠ λ.

The setup of the previous slide is a particular case.

If H0 is true, then:

Gβ̂ − λ ∼ Nk(0, σ² G(X⊤X)⁻¹G⊤),

and

σ⁻² (Gβ̂ − λ)⊤ (G(X⊤X)⁻¹G⊤)⁻¹ (Gβ̂ − λ) ∼ χ²_k.

21/43
More tests (2)
Let Sn = (Gβ̂ − λ)⊤ (G(X⊤X)⁻¹G⊤)⁻¹ (Gβ̂ − λ) / (σ̂² k).

If H0 is true, then Sn ∼ F_{k,n−p}.


Test with non asymptotic level α ∈ (0, 1):

δα = 1{Sn > qα(F_{k,n−p})},

where qα(F_{k,n−p}) is the (1 − α)-quantile of F_{k,n−p}.

Definition
The Fisher distribution with p and q degrees of freedom, denoted
by F_{p,q}, is the distribution of (U/p)/(V/q), where:

U ∼ χ²_p , V ∼ χ²_q ,

U ⊥⊥ V.
22/43
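A hedged sketch of this F-test on synthetic data (SciPy assumed for the Fisher quantile; the design and hypothesis are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, k = 80, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.0, 0.0, 2.0])
Y = X @ beta + rng.normal(size=n)

# H0: G beta = lambda, here "beta_2 = beta_3 = 0" (true in this simulation).
G = np.zeros((k, p)); G[0, 1] = 1.0; G[1, 2] = 1.0
lam = np.zeros(k)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

d = G @ beta_hat - lam
Sn = d @ np.linalg.solve(G @ XtX_inv @ G.T, d) / (sigma2_hat * k)
q = stats.f.ppf(1 - 0.05, k, n - p)  # (1 - alpha)-quantile of F_{k, n-p}
print(Sn > q)  # rejects with probability about 5% when H0 is true
```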
Concluding remarks

Linear regression exhibits correlations, NOT causality

Normality of the noise: One can use goodness of fit tests to
test whether the residuals ε̂i = Yi − Xi⊤β̂ are Gaussian.

Deterministic design: If X is not deterministic, all the above


can be understood conditionally on X, if the noise is assumed
to be Gaussian, conditionally on X.

23/43
Linear regression and lack of identifiability (1)
Consider the following model:

Y = Xβ + ε,

with:
1. Y ∈ IRn (dependent variables), X ∈ IRn×p (deterministic
design);
2. β ∈ IRp , unknown;
3. ε ∼ Nn(0, σ² In).

Previously, we assumed that X had rank p, so we could invert


X⊤X.

What if X is not of rank p ? E.g., if p > n ?

β would no longer be identified: estimation of β is vain


(unless we add more structure).
24/43
Linear regression and lack of identifiability (2)

What about prediction ? Xβ is still identified.

Ŷ: orthogonal projection of Y onto the linear span of the


columns of X.

Ŷ = Xβ̂ = X(X⊤X)†X⊤Y, where A† stands for the
(Moore-Penrose) pseudo inverse of a matrix A.

Similarly as before, if k = rank(X):

‖Ŷ − Y‖₂² / σ² ∼ χ²_{n−k},

‖Ŷ − Y‖₂² ⊥⊥ Ŷ.

25/43
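The pseudo-inverse prediction can be sketched with NumPy (synthetic data with p > n; `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 30   # p > n: X cannot have rank p
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Y_hat = X (X^T X)^+ X^T Y: orthogonal projection of Y onto col(X).
Y_hat = X @ np.linalg.pinv(X.T @ X) @ X.T @ Y

# For a generic Gaussian design with p > n, rank(X) = n, so the column
# span of X is all of IR^n and the projection is the identity.
k = np.linalg.matrix_rank(X)
print(k, np.allclose(Y_hat, Y))
```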
Linear regression and lack of identifiability (3)

In particular:

IE[‖Ŷ − Y‖₂²] = (n − k) σ².

Unbiased estimator of the variance:


σ̂² = ‖Ŷ − Y‖₂² / (n − k).

26/43
Linear regression in high dimension (1)
Consider again the following model:

Y = Xβ + ",

with:
1. Y ∈ IRn (dependent variables), X ∈ IRn×p (deterministic
design);
2. β ∈ IRp , unknown: to be estimated;
3. ε ∼ Nn(0, σ² In).
For each i, Xi ∈ IRp is the vector of covariates of the i-th
individual.

If p is too large (p > n), there are too many parameters to be
estimated (the model overfits), although some covariates may
be irrelevant.

Solution: Reduction of the dimension.


27/43
Linear regression in high dimension (2)

Idea: Assume that only a few coordinates of β are nonzero


(but we do not know which ones).

Based on the sample, select a subset of covariates and


estimate the corresponding coordinates of β.

For S ⊆ {1, . . . , p}, let

β̂S ∈ argmin_{t∈IR^S} ‖Y − XS t‖₂²,

where XS is the submatrix of X obtained by keeping only the


covariates indexed in S.

28/43
Linear regression in high dimension (3)

Select a subset S that minimizes the prediction error


penalized by the complexity (or size) of the model:
‖Y − XS β̂S‖₂² + λ|S|,

where λ > 0 is a tuning parameter.

If λ = 2σ̂², this is Mallows' Cp or the AIC criterion.

If λ = σ̂² log n, this is the BIC criterion.

29/43
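For small p, the penalized criterion can be minimized by brute force over all subsets; a hedged sketch (synthetic data; the pilot noise-variance estimate and the AIC-style penalty λ = 2σ̂² are illustrative choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 5
X = rng.normal(size=(n, p))
beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0])  # only coordinates 0 and 3 nonzero
Y = X @ beta + rng.normal(size=n)

sigma2_hat = 1.0       # assumed pilot estimate of the noise variance
lam = 2 * sigma2_hat   # Mallows Cp / AIC; use sigma2_hat * np.log(n) for BIC

def rss(S):
    """Residual sum of squares of the LSE restricted to the covariates in S."""
    if not S:
        return float(Y @ Y)
    XS = X[:, list(S)]
    b, *_ = np.linalg.lstsq(XS, Y, rcond=None)
    return float(np.sum((Y - XS @ b) ** 2))

# Minimize rss(S) + lam * |S| over all 2^p subsets.
best = min((S for r in range(p + 1)
            for S in itertools.combinations(range(p), r)),
           key=lambda S: rss(S) + lam * len(S))
print(best)  # should contain the true support {0, 3}
```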
Linear regression in high dimension (4)
Each of these criteria is equivalent to finding b ∈ IRp that
minimizes:
‖Y − Xb‖₂² + λ‖b‖₀,
where ‖b‖₀ is the number of nonzero coefficients of b.

This is a computationally hard problem: nonconvex and it
requires computing 2^p estimators (all the β̂S , for
S ⊆ {1, . . . , p}).

Lasso estimator:

replace ‖b‖₀ = ∑_{j=1}^p 1I{bj ≠ 0} with ‖b‖₁ = ∑_{j=1}^p |bj|,

and the problem becomes convex:

β̂^L ∈ argmin_{b∈IRp} ‖Y − Xb‖₂² + λ‖b‖₁,

where λ > 0 is a tuning parameter.
30/43
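One standard way to solve this convex problem is proximal gradient descent (ISTA), where the ℓ1 penalty yields a soft-thresholding step; a sketch on synthetic data (the value λ = 5 and all sizes are illustrative, not prescribed by the slides):

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=2000):
    """Minimize ||Y - Xb||_2^2 + lam * ||b||_1 by proximal gradient (ISTA):
    a gradient step on the smooth part, then soft-thresholding, which is
    the proximal operator of the l1 penalty."""
    n, p = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ b - Y)
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(5)
n, p = 50, 100                              # high-dimensional: p > n
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]   # sparse truth
X = rng.normal(size=(n, p))
Y = X @ beta + 0.1 * rng.normal(size=n)

b_hat = lasso_ista(X, Y, lam=5.0)
print(np.count_nonzero(b_hat))  # many coordinates are exactly zero
```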
Linear regression in high dimension (5)

How to choose λ ?

This is a difficult question (see grad course 18.657:
"High-dimensional statistics" in Spring 2017).

A good choice of λ will lead to an estimator β̂ that is very
close to β and will allow the recovery, with high probability, of
the subset S* of all j ∈ {1, . . . , p} for which βj ≠ 0.

31/43
Linear regression in high dimension (6)

32/43
Nonparametric regression (1)

In the linear setup, we assumed that Yi = Xi⊤β + εi , where the
Xi are deterministic.

This has to be understood as working conditionally on the


design.

This amounts to assuming that IE[Yi |Xi ] is a linear function of
Xi , which is not true in general.

Let f (x) = IE[Yi |Xi = x], x ∈ IRp : How can we estimate the
function f ?

33/43
Nonparametric regression (2)

Let p = 1 in the sequel.


One can make a parametric assumption on f .

E.g., f (x) = a + bx, f (x) = a + bx + cx2 , f (x) = ea+bx , ...

The problem reduces to the estimation of a finite number of


parameters.

LSE, MLE, all the previous theory for the linear case could be
adapted.

What if we do not make any such parametric assumption on f ?

34/43
Nonparametric regression (3)

Assume f is smooth enough: f can be well approximated by a


piecewise constant function.

Idea: Local averages.

For x ∈ IR: f (t) ≈ f (x) for t close to x.

For all i such that Xi is close enough to x,

Yi ≈ f (x) + εi .

Estimate f (x) by the average of all Yi ’s for which Xi is close


enough to x.

35/43
Nonparametric regression (4)

Let h > 0: the window’s size (or bandwidth).

Let Ix = {i = 1, . . . , n : |Xi − x| < h}.

Let fˆn,h (x) be the average of {Yi : i 2 Ix }.


f̂n,h(x) = (1/|Ix|) ∑_{i∈Ix} Yi  if Ix ≠ ∅,  and f̂n,h(x) = 0 otherwise.

36/43
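The local-average estimator above can be sketched directly (synthetic sample; the regression function f(x) = x(1 − x) is the one used in the slides' examples, the noise level is illustrative):

```python
import numpy as np

def f_hat(x, X, Y, h):
    """Local-average estimator: mean of the Y_i with |X_i - x| < h (0 if none)."""
    mask = np.abs(X - x) < h
    return Y[mask].mean() if mask.any() else 0.0

rng = np.random.default_rng(6)
n = 100
X = rng.uniform(size=n)

def f(x):
    return x * (1 - x)  # true regression function

Y = f(X) + 0.05 * rng.normal(size=n)

print(round(f_hat(0.5, X, Y, h=0.2), 3))  # close to f(0.5) = 0.25
```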
Nonparametric regression (5)
(Figure: scatter plot of the sample (Xi , Yi).)
37/43
Nonparametric regression (6)
(Figure: the same scatter plot; for x = 0.6 and h = 0.1, the local average is f̂(x) ≈ 0.27.)

38/43
Nonparametric regression (7)

How to choose h ?

If h → 0: overfitting the data;

If h → ∞: underfitting, f̂n,h(x) = Ȳn .

39/43
Nonparametric regression (8)
Example:
n = 100, f (x) = x(1 − x),
h = .005.
(Figure: the sample and the estimator f̂n,h ; with h = .005 the estimator overfits the data.)
40/43
Nonparametric regression (9)
Example:
n = 100, f (x) = x(1 − x),
h = 1.
(Figure: with h = 1, f̂n,h is constant, equal to Ȳn : underfitting.)
41/43
Nonparametric regression (10)
Example:
n = 100, f (x) = x(1 − x),
h = .2.
(Figure: with h = .2, f̂n,h follows the regression function f well.)
42/43
Nonparametric regression (11)

Choice of h ?

If the smoothness of f is known (i.e., the quality of the local
approximation of f by piecewise constant functions): there is
a good choice of h depending on that smoothness.

If the smoothness of f is unknown: Other techniques, e.g.


cross validation.

43/43
MIT OpenCourseWare
https://ocw.mit.edu

Statistics for Applications
Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
Statistics for Applications

Chapter 8: Bayesian Statistics

1/17
The Bayesian approach (1)

◮ So far, we have studied the frequentist approach of statistics.

◮ The frequentist approach:


◮ Observe data
◮ These data were generated randomly (by Nature, by
measurements, by designing a survey, etc...)
◮ We made assumptions on the generating process (e.g., i.i.d.,
Gaussian data, smooth density, linear regression function,
etc...)
◮ The generating process was associated to some object of
interest (e.g., a parameter, a density, etc...)
◮ This object was unknown but fixed and we wanted to find it:
we either estimated it or tested a hypothesis about this object,
etc...

2/17
The Bayesian approach (2)

◮ Now, we still observe data, assumed to be randomly generated


by some process. Under some assumptions (e.g., parametric
distribution), this process is associated with some fixed object.

◮ We have a prior belief about it.

◮ Using the data, we want to update that belief and transform


it into a posterior belief.

3/17
The Bayesian approach (3)
Example
◮ Let p be the proportion of women in the population.

◮ Sample n people randomly with replacement in the population


and denote by X1 , . . . , Xn their gender (1 for woman, 0
otherwise).

◮ In the frequentist approach, we estimated p (using the MLE),


we constructed some confidence interval for p, we did
hypothesis testing (e.g., H0 : p = .5 v.s. H1 : p ≠ .5).

◮ Before analyzing the data, we may believe that p is likely to


be close to 1/2.

◮ The Bayesian approach is a tool to:


1. include mathematically our prior belief in statistical procedures.
2. update our prior belief using the data.
4/17
The Bayesian approach (4)
Example (continued)

◮ Our prior belief about p can be quantified:

◮ E.g., we are 90% sure that p is between .4 and .6, 95% that it
is between .3 and .8, etc...

◮ Hence, we can model our prior belief using a distribution for


p, as if p was random.

◮ In reality, the true parameter is not random ! However, the


Bayesian approach is a way of modeling our belief about the
parameter by doing as if it was random.

◮ E.g., p ∼ B(a, a) (Beta distribution) for some a > 0.

◮ This distribution is called the prior distribution.


5/17
The Bayesian approach (5)
Example (continued)

◮ In our statistical experiment, X1 , . . . , Xn are assumed to be


i.i.d. Bernoulli r.v. with parameter p conditionally on p.

◮ After observing the available sample X1 , . . . , Xn , we can


update our belief about p by taking its distribution
conditionally on the data.

◮ The distribution of p conditionally on the data is called the


posterior distribution.

◮ Here, the posterior distribution is


B( a + ∑_{i=1}^n Xi , a + n − ∑_{i=1}^n Xi ).

6/17
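The Beta-Bernoulli update above can be sketched numerically (synthetic coin flips; the values of a, p and n are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
a = 2.0          # prior Beta(a, a), symmetric around 1/2
p_true = 0.55
n = 200
X = rng.binomial(1, p_true, size=n)

# Posterior: Beta(a + sum X_i, a + n - sum X_i)
s = int(X.sum())
post_a, post_b = a + s, a + n - s
post_mean = post_a / (post_a + post_b)   # = (a + s) / (2a + n)
print(round(post_mean, 3))
```

The prior pulls the posterior mean slightly toward 1/2; its influence vanishes as n grows.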
The Bayes rule and the posterior distribution (1)

◮ Consider a probability distribution on a parameter space Θ


with some pdf π(·): the prior distribution.

◮ Let X1 , . . . , Xn be a sample of n random variables.

◮ Denote by pn (·|θ) the joint pdf of X1 , . . . , Xn conditionally


on θ, where θ ∼ π.

◮ Usually, one assumes that X1 , . . . , Xn are i.i.d. conditionally


on θ.

◮ The conditional distribution of θ given X1 , . . . , Xn is called


the posterior distribution. Denote by π(·|X1 , . . . , Xn ) its pdf.

7/17
The Bayes rule and the posterior distribution (2)

◮ Bayes’ formula states that:

π(θ|X1 , . . . , Xn ) ∝ π(θ)pn (X1 , . . . , Xn |θ), ∀θ ∈ Θ.

◮ The constant does not depend on θ:

π(θ|X1 , . . . , Xn) = π(θ) pn(X1 , . . . , Xn |θ) / ∫_Θ pn(X1 , . . . , Xn |t) dπ(t), ∀θ ∈ Θ.

8/17
The Bayes rule and the posterior distribution (3)
In the previous example:

◮ π(p) ∝ pa−1 (1 − p)a−1 , p ∈ (0, 1).

◮ Given p, X1 , . . . , Xn are i.i.d. Ber(p), so

pn(X1 , . . . , Xn |p) = p^{∑_{i=1}^n Xi} (1 − p)^{n − ∑_{i=1}^n Xi}.

◮ Hence,

π(p|X1 , . . . , Xn) ∝ p^{a−1+∑_{i=1}^n Xi} (1 − p)^{a−1+n−∑_{i=1}^n Xi}.

◮ The posterior distribution is

B( a + ∑_{i=1}^n Xi , a + n − ∑_{i=1}^n Xi ).

9/17
Non informative priors (1)

◮ Idea: In case of ignorance, or of lack of prior information, one


may want to use a prior that is as little informative as
possible.

◮ Good candidate: π(θ) ∝ 1, i.e., constant pdf on Θ.

◮ If Θ is bounded, this is the uniform prior on Θ.

◮ If Θ is unbounded, this does not define a proper pdf on Θ !

◮ An improper prior on Θ is a measurable, nonnegative function


π(·) defined on Θ that is not integrable.

◮ In general, one can still define a posterior distribution using an


improper prior, using Bayes’ formula.

10/17
Non informative priors (2)
Examples:

◮ If p ∼ U(0, 1) and, given p, X1 , . . . , Xn are i.i.d. Ber(p):

π(p|X1 , . . . , Xn) ∝ p^{∑_{i=1}^n Xi} (1 − p)^{n−∑_{i=1}^n Xi},

i.e., the posterior distribution is

B( 1 + ∑_{i=1}^n Xi , 1 + n − ∑_{i=1}^n Xi ).
◮ If π(θ) = 1, ∀θ ∈ IR and, given θ, X1 , . . . , Xn are i.i.d. N(θ, 1):

π(θ|X1 , . . . , Xn) ∝ exp( −(1/2) ∑_{i=1}^n (Xi − θ)² ),

i.e., the posterior distribution is

N( X̄n , 1/n ).
11/17
Non informative priors (3)

◮ Jeffreys prior:

πJ(θ) ∝ √( det I(θ) ),

where I(θ) is the Fisher information matrix of the statistical
model associated with X1 , . . . , Xn in the frequentist approach
(provided it exists).

◮ In the previous examples:


1
◮ Ex. 1: πJ (p) ∝ √ , p ∈ (0, 1): the prior is B(1/2, 1/2).
p(1−p)

◮ Ex. 2: πJ (θ) ∝ 1, θ ∈ IR is an improper prior.

12/17
Non informative priors (4)

◮ Jeffreys prior satisfies a reparametrization invariance principle:


If η is a reparametrization of θ (i.e., η = φ(θ) for some
one-to-one map φ), then the pdf π̃(·) of η satisfies:
π̃(η) ∝ √( det Ĩ(η) ),

where Ĩ(η) is the Fisher information of the statistical model
parametrized by η instead of θ.

13/17
Bayesian confidence regions

◮ For α ∈ (0, 1), a Bayesian confidence region with level α is a


random subset R of the parameter space Θ, which depends
on the sample X1 , . . . , Xn , such that:

IP[θ ∈ R|X1 , . . . , Xn ] = 1 − α.

◮ Note that R depends on the prior π(·).

◮ "Bayesian confidence region" and "confidence interval" are
two distinct notions.

14/17
Bayesian estimation (1)
◮ The Bayesian framework can also be used to estimate the true
underlying parameter (hence, in a frequentist approach).

◮ In this case, the prior distribution does not reflect a prior


belief: It is just an artificial tool used in order to define a new
class of estimators.

◮ Back to the frequentist approach: The sample


X1 , . . . , Xn is associated with a statistical model
(E, (IPθ )θ∈Θ ).

◮ Define a distribution (that can be improper) with pdf π on


the parameter space Θ.

◮ Compute the posterior pdf π(·|X1 , . . . , Xn ) associated with π,


seen as a prior distribution.
15/17
Bayesian estimation (2)

◮ Bayes estimator:

θ̂(π) = ∫_Θ θ dπ(θ|X1 , . . . , Xn) :

This is the posterior mean.

◮ The Bayesian estimator depends on the choice of the prior


distribution π (hence the superscript π).

16/17
Bayesian estimation (3)
◮ In the previous examples:
◮ Ex. 1 with prior B(a, a) (a > 0):

p̂(π) = ( a + ∑_{i=1}^n Xi ) / (2a + n) = ( a/n + X̄n ) / ( 2a/n + 1 ).

In particular, for a = 1/2 (Jeffreys prior),

p̂(πJ) = ( 1/(2n) + X̄n ) / ( 1/n + 1 ).

◮ Ex. 2: θ̂(πJ ) = X̄n .


◮ In each of these examples, the Bayes estimator is consistent
and asymptotically normal.

◮ In general, the asymptotic properties of the Bayes estimator


do not depend on the choice of the prior.
17/17
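A quick simulation consistent with the consistency remark, comparing the Jeffreys-prior Bayes estimator p̂(πJ) with the MLE X̄n as n grows (sample sizes and p are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
p_true = 0.3
for n in (10, 100, 10000):
    X = rng.binomial(1, p_true, size=n)
    xbar = X.mean()                                    # MLE
    p_jeffreys = (1 / (2 * n) + xbar) / (1 / n + 1)    # posterior mean, Beta(1/2,1/2) prior
    print(n, round(xbar, 4), round(p_jeffreys, 4))
```

The two estimators differ by at most 1/(2n), so their asymptotic behavior is the same.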
Statistics for Applications

Chapter 9: Principal Component Analysis (PCA)

1/16
Multivariate statistics and review of linear algebra (1)

◮ Let X be a d-dimensional random vector and X1 , . . . , Xn be
n independent copies of X.

◮ Write Xi = (Xi1 , . . . , Xid)⊤, i = 1, . . . , n.

◮ Denote by X the random n × d matrix whose rows are
X1⊤ , . . . , Xn⊤.
2/16
Multivariate statistics and review of linear algebra (2)

◮ Assume that E[‖X‖₂²] < ∞.

◮ Mean of X:
E[X] = (E[X^1], . . . , E[X^d])⊤.

◮ Covariance matrix of X: the matrix Σ = (σj,k)j,k=1,...,d , where
σj,k = cov(X^j, X^k).

◮ It is easy to see that

Σ = E[XX⊤] − E[X]E[X]⊤ = E[ (X − E[X])(X − E[X])⊤ ].

3/16
Multivariate statistics and review of linear algebra (3)

◮ Empirical mean of X1 , . . . , Xn :

X̄ = (1/n) ∑_{i=1}^n Xi = (X̄^1, . . . , X̄^d)⊤.

◮ Empirical covariance of X1 , . . . , Xn : the matrix
S = (sj,k)j,k=1,...,d , where sj,k is the empirical covariance of
the Xij , Xik , i = 1, . . . , n.

◮ It is easy to see that

S = (1/n) ∑_{i=1}^n Xi Xi⊤ − X̄X̄⊤ = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)⊤.

4/16
Multivariate statistics and review of linear algebra (4)

◮ Note that X̄ = (1/n) X⊤1, where 1 = (1, . . . , 1)⊤ ∈ IRⁿ.

◮ Note also that

S = (1/n) X⊤X − (1/n²) X⊤11⊤X = (1/n) X⊤HX,

where H = In − (1/n) 11⊤.

◮ H is an orthogonal projector: H² = H, H⊤ = H. (on what
subspace ?)

◮ If u ∈ IR^d,
◮ u⊤Σu is the variance of u⊤X;
◮ u⊤Su is the sample variance of u⊤X1 , . . . , u⊤Xn .

5/16
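The equivalent expressions for S above can be checked to agree numerically; a NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 500, 3
X = rng.normal(size=(n, d)) @ np.array([[1.0, 0.5, 0.0],
                                        [0.0, 1.0, 0.3],
                                        [0.0, 0.0, 1.0]])

xbar = X.mean(axis=0)

# (1/n) sum_i X_i X_i^T - xbar xbar^T
S1 = X.T @ X / n - np.outer(xbar, xbar)

# centered form: (1/n) sum_i (X_i - xbar)(X_i - xbar)^T
Xc = X - xbar
S2 = Xc.T @ Xc / n

# projector form: (1/n) X^T H X with H = I_n - (1/n) 1 1^T
H = np.eye(n) - np.ones((n, n)) / n
S3 = X.T @ H @ X / n

print(np.allclose(S1, S2), np.allclose(S1, S3))  # True True
```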
Multivariate statistics and review of linear algebra (5)

◮ In particular, u⊤Su measures how spread (i.e., diverse) the
points are in direction u.

◮ If u⊤Su = 0, then all Xi 's are in an affine subspace
orthogonal to u.

◮ If u⊤Σu = 0, then X is almost surely in an affine subspace
orthogonal to u.

◮ If u⊤Su is large with ‖u‖₂ = 1, then the direction of u
explains well the spread (i.e., diversity) of the sample.

6/16
Multivariate statistics and review of linear algebra (6)
◮ In particular, Σ and S are symmetric, positive semi-definite.

◮ Any real symmetric matrix A ∈ IR^{d×d} has the decomposition

A = PDP⊤,

where:
◮ P is a d × d orthogonal matrix, i.e., PP⊤ = P⊤P = Id ;
◮ D is diagonal.

◮ The diagonal elements of D are the eigenvalues of A and the
columns of P are the corresponding eigenvectors of A.

◮ A is positive semi-definite iff all its eigenvalues are
nonnegative.
7/16
Principal Component Analysis: Heuristics (1)

◮ The sample X1 , . . . , Xn makes a cloud of points in IR^d.

◮ In practice, d is large. If d > 3, it becomes impossible to
represent the cloud on a picture.

◮ Question: Is it possible to project the cloud onto a linear
subspace of dimension d′ < d by keeping as much information
as possible ?

◮ Answer: PCA does this by keeping as much covariance
structure as possible by keeping orthogonal directions that
discriminate well the points of the cloud.

8/16
Principal Component Analysis: Heuristics (2)
◮ Idea: Write S = PDP⊤, where
◮ P = (v1 , . . . , vd) is an orthogonal matrix, i.e.,
‖vj‖₂ = 1, vj⊤vk = 0, ∀j ≠ k;
◮ D = Diag(λ1 , . . . , λd), with λ1 ≥ λ2 ≥ . . . ≥ λd ≥ 0.

◮ Note that D is the empirical covariance matrix of the
P⊤Xi 's, i = 1, . . . , n.

◮ In particular, λ1 is the empirical variance of the v1⊤Xi 's; λ2 is
the empirical variance of the v2⊤Xi 's, etc...
9/16
Principal Component Analysis: Heuristics (3)

◮ So, each λj measures the spread of the cloud in the direction
vj .

◮ In particular, v1 is the direction of maximal spread.

◮ Indeed, v1 maximizes the empirical covariance of
a⊤X1 , . . . , a⊤Xn over a ∈ IR^d such that ‖a‖₂ = 1.

◮ Proof: For any unit vector a, show that

a⊤Sa = (P⊤a)⊤ D (P⊤a) ≤ λ1,

with equality if a = v1.

10/16
Principal Component Analysis: Main principle
◮ Idea of the PCA: Find the collection of orthogonal directions
in which the cloud is most spread out.

Theorem

v1 ∈ argmax_{‖u‖=1} u⊤Su,
v2 ∈ argmax_{‖u‖=1, u⊥v1} u⊤Su,
· · ·
vd ∈ argmax_{‖u‖=1, u⊥vj, j=1,...,d−1} u⊤Su.

◮ Hence, the k orthogonal directions in which the cloud is the
most spread out correspond exactly to the eigenvectors
associated with the k largest eigenvalues of S.
11/16
Principal Component Analysis: Algorithm (1)

1. Input: X1 , . . . , Xn : cloud of n points in dimension d.

2. Step 1: Compute the empirical covariance matrix.

3. Step 2: Compute the decomposition S = P DP T , where


D = Diag(λ1 , . . . , λd ), with λ1 ≥ λ2 ≥ . . . ≥ λd and
P = (v1 , . . . , vd ) is an orthogonal matrix.

4. Step 3: Choose k < d and set Pk = (v1 , . . . , vk ) ∈ Rd×k .

5. Output: Y1 , . . . , Yn , where

Yi = PkT Xi ∈ Rk , i = 1, . . . , n.

Question: How to choose k ?

12/16
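The algorithm above can be sketched with NumPy's symmetric eigendecomposition (`np.linalg.eigh`); the cloud below is synthetic, and the points are centered first (a common preprocessing step, assumed here):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k leading eigenvectors of the
    empirical covariance matrix S (principal components)."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    eigvals, eigvecs = np.linalg.eigh(S)   # returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # sort descending
    lam, P = eigvals[order], eigvecs[:, order]
    return Xc @ P[:, :k], lam

rng = np.random.default_rng(10)
# A cloud stretched along one axis: the first PC captures most of the variance.
X = rng.normal(size=(300, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
scores, lam = pca(X, k=1)
print(lam[0] > lam[1], round(lam[0] / lam.sum(), 2))
```

The ratio λ1/(λ1 + λ2) is the "variance explained" criterion of the next slide.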
Principal Component Analysis: Algorithm (2)
Question: How to choose k ?
◮ Experimental rule: Take k where there is an inflection point in
the sequence λ1 , . . . , λd (scree plot).

◮ Define a criterion: Take k such that

(λ1 + . . . + λk) / (λ1 + . . . + λd) ≥ 1 − α,

for some α ∈ (0, 1) that determines the approximation error
that the practitioner wants to achieve.

◮ Remark: λ1 + . . . + λk is called the variance explained by the
PCA and λ1 + . . . + λd = Tr(S) is the total variance.

◮ Data visualization: Take k = 2 or 3.

13/16
Example: Expression of 500,000 genes among 1400
Europeans

Reprinted by permission from


Macmillan Publishers Ltd: Nature.
Source: John Novembre, et al. "Genes
mirror geography within Europe."
Nature 456 (2008): 98-101. © 2008.
14/16
Principal Component Analysis - Beyond practice (1)
◮ PCA is an algorithm that reduces the dimension of a cloud of
points and keeps its covariance structure as much as possible.

◮ In practice this algorithm is used for clouds of points that are
not necessarily random.

◮ In statistics, PCA can be used for estimation.

◮ If X1 , . . . , Xn are i.i.d. random vectors in IR^d, how to
estimate their population covariance matrix Σ ?

◮ If n ≫ d, then the empirical covariance matrix S is a
consistent estimator.

◮ In many applications, n ≪ d (e.g., gene expression). Solution:
sparse PCA.
15/16
Principal Component Analysis - Beyond practice (2)
◮ It may be known beforehand that Σ has (almost) low rank.

◮ Then, run PCA on S: Write S ≈ S′, where

S′ = P Diag(λ1 , . . . , λk , 0, . . . , 0) P⊤.

◮ S′ will be a better estimator of Σ under the low-rank
assumption.

◮ A theoretical analysis would lead to an optimal choice of the
tuning parameter k.
16/16
Statistics for Applications

Chapter 10: Generalized Linear Models (GLMs)

1/52
Linear model

A linear model assumes

Y |X ∼ N (µ(X), σ 2 I),

and

IE(Y |X) = µ(X) = X⊤β.

2/52
Components of a linear model

The two components (that we are going to relax) are


1. Random component: the response variable Y |X is continuous
and normally distributed with mean µ = µ(X) = IE(Y |X).

2. Link: between the random and covariates


X = (X (1) , X (2) , · · · , X (p) )⊤ : µ(X) = X ⊤ β.

3/52
Generalization

A generalized linear model (GLM) generalizes normal linear


regression models in the following directions.
1. Random component:

Y ∼ some exponential family distribution

2. Link: between the random and covariates:

g µ(X) = X ⊤ β
� �

where g called link function and µ = IE(Y |X).

4/52
Example 1: Disease Occuring Rate

In the early stages of a disease epidemic, the rate at which new


cases occur can often increase exponentially through time. Hence,
if µi is the expected number of new cases on day ti , a model of the
form
µi = γ exp(δti )
seems appropriate.
◮ Such a model can be turned into GLM form, by using a log
link so that

log(µi ) = log(γ) + δti = β0 + β1 ti .

◮ Since this is a count, the Poisson distribution (with expected


value µi ) is probably a reasonable distribution to try.

5/52
Example 2: Prey Capture Rate(1)

The rate of capture of preys, yi , by a hunting animal, tends to


increase with increasing density of prey, xi , but to eventually level
off, when the predator is catching as much as it can cope with.
A suitable model for this situation might be
αxi
µi = ,
h + xi
where α represents the maximum capture rate, and h represents
the prey density at which the capture rate is half the maximum
rate.

6/52
Example 2: Prey Capture Rate (2)

(Figure: capture rate µ as a function of prey density x, increasing and leveling off at the maximum rate α.)
7/52
Example 2: Prey Capture Rate (3)

◮ Obviously this model is non-linear in its parameters, but, by


using a reciprocal link, the right-hand side can be made linear
in the parameters,

g(µi) = 1/µi = 1/α + (h/α)(1/xi) = β0 + β1 (1/xi).
◮ The standard deviation of capture rate might be
approximately proportional to the mean rate, suggesting the
use of a Gamma distribution for the response.

8/52
Example 3: Kyphosis Data

The Kyphosis data consist of measurements on 81 children


following corrective spinal surgery. The binary response variable,
Kyphosis, indicates the presence or absence of a postoperative
deformity. The three covariates are Age of the child in months,
Number of the vertebrae involved in the operation, and the Start
of the range of the vertebrae involved.
◮ The response variable is binary so there is no choice: Y |X is
Bernoulli with expected value µ(X) ∈ (0, 1).
◮ We cannot write
µ(X) = X ⊤ β
because the right-hand side ranges through IR.
◮ We need an invertible function f such that f (X ⊤ β) ∈ (0, 1)

9/52
GLM: motivation

◮ clearly, normal LM is not appropriate for these examples;


◮ need a more general regression framework to account for
various types of response data
◮ Exponential family distributions
◮ develop methods for model fitting and inferences in this
framework
◮ Maximum Likelihood estimation.

10/52
Exponential Family

A family of distribution {Pθ : θ ∈ Θ}, Θ ⊂ IRk is said to be a


k-parameter exponential family on IRq , if there exist real valued
functions:
◮ η1 , η2 , · · · , ηk and B of θ,

◮ T1 , T2 , · · · , Tk , and h of x ∈ IRq such that the density


function (pmf or pdf) of Pθ can be written as

pθ(x) = exp[ ∑_{i=1}^k ηi(θ) Ti(x) − B(θ) ] h(x)

11/52
Normal distribution example
◮ Consider X ∼ N (µ, σ 2 ), θ = (µ, σ 2 ). The density is
pθ(x) = exp( (µ/σ²) x − (1/(2σ²)) x² − µ²/(2σ²) ) · 1/(σ√(2π)),

which forms a two-parameter exponential family with

η1 = µ/σ² , η2 = −1/(2σ²) , T1(x) = x, T2(x) = x²,
B(θ) = µ²/(2σ²) + log(σ√(2π)) , h(x) = 1.

◮ When σ 2 is known, it becomes a one-parameter exponential
family on IR:
η = µ/σ² , T(x) = x, B(θ) = µ²/(2σ²) , h(x) = e^{−x²/(2σ²)} / (σ√(2π)).
12/52
Examples of discrete distributions

The following distributions form discrete exponential families of


distributions with pmf

◮ Bernoulli(p): p^x (1 − p)^{1−x} , x ∈ {0, 1};

◮ Poisson(λ): (λ^x / x!) e^{−λ} , x = 0, 1, . . . .

13/52
Examples of Continuous distributions
The following distributions form continuous exponential families of
distributions with pdf:
◮ Gamma(a, b): (1/(Γ(a) b^a)) x^{a−1} e^{−x/b} ;
◮ above: a: shape parameter, b: scale parameter;
◮ reparametrize with µ = ab, the mean parameter:
(1/Γ(a)) (a/µ)^a x^{a−1} e^{−ax/µ} .

◮ Inverse Gamma(α, β): (β^α / Γ(α)) x^{−α−1} e^{−β/x} .

◮ Inverse Gaussian(µ, σ²): √( σ²/(2πx³) ) e^{−σ²(x−µ)²/(2µ²x)} .
Others: Chi-square, Beta, Binomial, Negative binomial
distributions.
14/52
Components of GLM

1. Random component:

Y ∼ some exponential family distribution

2. Link: between the random and covariates:

g µ(X) = X ⊤ β
� �

where g called link function and µ(X) = IE(Y |X).

15/52
One-parameter canonical exponential family

◮ Canonical exponential family for k = 1, y ∈ IR


fθ(y) = exp( (yθ − b(θ))/φ + c(y, φ) )

for some known functions b(·) and c(·, ·) .

◮ If φ is known, this is a one-parameter exponential family with


θ being the canonical parameter .
◮ If φ is unknown, this may/may not be a two-parameter
exponential family. φ is called dispersion parameter.
◮ In this class, we always assume that φ is known.

16/52
Normal distribution example

◮ Consider the following Normal density function with known


variance σ 2 ,
1 (y−µ)2
fθ (y) = √ e− 2σ2
σ 2π
yµ − 12 µ2 1 y2
� ( )�
2
= exp − + log(2πσ ) ,
σ2 2 σ2

θ2
◮ Therefore θ = µ, φ = σ 2 , , b(θ) = 2 , and

1 y2
c(y, φ) = − ( + log(2πφ)).
2 φ

17/52
Other distributions

Table 1: Exponential Family

              Normal                       Poisson           Bernoulli
Notation      N(µ, σ²)                     P(µ)              B(p)
Range of y    (−∞, ∞)                      {0, 1, 2, . . .}  {0, 1}
φ             σ²                           1                 1
b(θ)          θ²/2                         e^θ               log(1 + e^θ)
c(y, φ)       −(1/2)(y²/φ + log(2πφ))      −log y!           0

18/52
Likelihood

Let ℓ(θ) = log fθ (Y ) denote the log-likelihood function.


The mean IE(Y ) and the variance var(Y ) can be derived from the
following identities
◮ First identity:

IE( ∂ℓ/∂θ ) = 0,

◮ Second identity:

IE( ∂²ℓ/∂θ² ) + IE[ (∂ℓ/∂θ)² ] = 0.

Obtained from ∫ fθ(y) dy ≡ 1.

19/52
Expected value

Note that
ℓ(θ) = (Y θ − b(θ))/φ + c(Y ; φ).

Therefore
∂ℓ/∂θ = (Y − b′(θ))/φ.

It yields
0 = IE( ∂ℓ/∂θ ) = (IE(Y) − b′(θ))/φ,

which leads to
IE(Y) = µ = b′(θ).

20/52
Variance

On the other hand, we have

∂²ℓ/∂θ² + (∂ℓ/∂θ)² = −b″(θ)/φ + ( (Y − b′(θ))/φ )²,

and from the previous result,

(Y − b′(θ))/φ = (Y − IE(Y))/φ.

Together with the second identity, this yields

0 = −b″(θ)/φ + var(Y)/φ²,

which leads to

var(Y) = V(Y) = b″(θ) φ.

21/52
Example: Poisson distribution

Example: Consider a Poisson likelihood,

f(y) = (µ^y / y!) e^{−µ} = e^{y log µ − µ − log(y!)}.

Thus,
θ = log µ, φ = 1, b(θ) = e^θ , c(y, φ) = −log(y!),

and therefore
IE(Y) = b′(θ) = e^θ = µ, var(Y) = b″(θ) = e^θ = µ.

22/52
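A quick numerical check of these identities for the Poisson family (simulation; the values of µ and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
mu = 4.0
theta = np.log(mu)   # canonical parameter
# b(theta) = e^theta, so b'(theta) = b''(theta) = e^theta = mu:
# the theory predicts E(Y) = mu and var(Y) = mu (phi = 1).
Y = rng.poisson(mu, size=200000)
print(round(Y.mean(), 2), round(Y.var(), 2))  # both close to mu = 4
```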
Link function

◮ β is the parameter of interest, and needs to appear somehow


in the likelihood function to use maximum likelihood.
◮ A link function g relates the linear predictor X ⊤ β to the mean
parameter µ,
X ⊤ β = g(µ).
◮ g is required to be monotone increasing and differentiable

µ = g−1 (X ⊤ β).

23/52
Examples of link functions

◮ For LM, g(·) = identity.


◮ Poisson data. Suppose Y |X ∼ Poisson(µ(X)).
◮ µ(X) > 0;
◮ log(µ(X)) = X ⊤ β;
◮ In general, a link function for the count data should map
(0, +∞) to IR.
◮ The log link is a natural one.
◮ Bernoulli/Binomial data.
◮ 0 < µ < 1;
◮ g should map (0, 1) to IR:
◮ 3 choices:
1. logit: log( µ(X)/(1 − µ(X)) ) = X⊤β;
2. probit: Φ⁻¹(µ(X)) = X⊤β, where Φ(·) is the normal cdf;
3. complementary log-log: log(−log(1 − µ(X))) = X⊤β.


◮ The logit link is the natural choice.

24/52
Examples of link functions for Bernoulli response (1)
(Figure: two increasing S-shaped functions from IR to (0, 1).)

◮ in blue: f1(x) = e^x / (1 + e^x)
◮ in red: f2(x) = Φ(x) (Gaussian CDF)
25/52
Examples of link functions for Bernoulli response (2)
(Figure: the corresponding inverse maps from (0, 1) to IR.)

◮ in blue: g1(x) = f1⁻¹(x) = log( x/(1 − x) ) (logit link)
◮ in red: g2(x) = f2⁻¹(x) = Φ⁻¹(x) (probit link)

26/52
Canonical Link

◮ The function g that links the mean µ to the canonical


parameter θ is called Canonical Link:

g(µ) = θ

◮ Since µ = b′ (θ), the canonical link is given by

g(µ) = (b′ )−1 (µ) .

◮ If φ > 0, the canonical link function is strictly increasing.


Why?

27/52
Example: the Bernoulli distribution

◮ We can check that

b(θ) = log(1 + eθ )

◮ Hence we solve

b′(θ) = exp(θ)/(1 + exp(θ)) = µ ⇔ θ = log( µ/(1 − µ) ).
◮ The canonical link for the Bernoulli distribution is the logit
link.

28/52
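A small sketch checking that the logit and b′ are indeed inverse to each other (NumPy assumed; the grid of µ values is arbitrary):

```python
import numpy as np

def logit(mu):
    """Canonical link for the Bernoulli family: g(mu) = log(mu / (1 - mu))."""
    return np.log(mu / (1 - mu))

def sigmoid(theta):
    """b'(theta) = e^theta / (1 + e^theta): the inverse of the logit."""
    return 1 / (1 + np.exp(-theta))

mu = np.linspace(0.01, 0.99, 50)
print(np.allclose(sigmoid(logit(mu)), mu))  # True: g = (b')^{-1}
```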
Other examples

            b(θ)             g(µ)
Normal      θ²/2             µ
Poisson     exp(θ)           log µ
Bernoulli   log(1 + e^θ)     log( µ/(1 − µ) )
Gamma       −log(−θ)         −1/µ

29/52
Model and notation

◮ Let (Xi , Yi ) ∈ IRp × IR, i = 1, . . . , n be independent random


pairs such that the conditional distribution of Yi given
Xi = xi has density in the canonical exponential family:
fθi(yi) = exp( (yi θi − b(θi))/φ + c(yi , φ) ).

◮ Y = (Y1 , . . . , Yn )⊤ , X = (X1⊤ , . . . , Xn⊤ )⊤


◮ Here the mean µi is related to the canonical parameter θi via

µi = b′ (θi )

◮ and µi depends linearly on the covariates through a link


function g:
g(µi ) = Xi⊤ β .

30/52
Back to β

◮ Given a link function g, note the following relationship


between β and θ:

θi = (b′ )−1 (µi )


= (b′ )−1 (g −1 (Xi⊤ β)) ≡ h(Xi⊤ β),

where h is defined as

h = (b′ )−1 ◦ g −1 = (g ◦ b′ )−1 .

◮ Remark: if g is the canonical link function, h is identity.

31/52
Log-likelihood

◮ The log-likelihood is given by


ℓn(β; Y, X) = ∑ᵢ (Yi θi − b(θi))/φ
            = ∑ᵢ (Yi h(Xi⊤β) − b(h(Xi⊤β)))/φ

up to a constant term.
◮ Note that when we use the canonical link function, we obtain
the simpler expression

ℓn(β, φ; Y, X) = ∑ᵢ (Yi Xi⊤β − b(Xi⊤β))/φ.

32/52
Strict concavity

◮ The log-likelihood ℓ(θ) is strictly concave when the canonical
  link function is used and φ > 0. Why?
◮ As a consequence the maximum likelihood estimator is unique.
◮ On the other hand, if another parameterization is used, the
likelihood function may not be strictly concave leading to
several local maxima.

33/52
Optimization Methods

Given a function f (x) defined on X ⊂ IRm , find x∗ such that


f (x∗ ) ≥ f (x) for all x ∈ X .

We will describe the following three methods,


◮ Newton-Raphson Method
◮ Fisher-scoring Method
◮ Iteratively Re-weighted Least Squares.

34/52
Gradient and Hessian
◮ Suppose f : IRm → IR has two continuous derivatives.
◮ Define the Gradient of f at point x0 , ∇f = ∇f (x0 ), as

(∇f ) = (∂f /∂x1 , . . . , ∂f /∂xm )⊤ .

◮ Define the Hessian (matrix) of f at point x0 , Hf = Hf (x0 ), as

    (Hf )ij = ∂ 2 f /(∂xi ∂xj ) .
◮ For smooth functions, the Hessian is symmetric. If f is strictly
concave, then Hf (x) is negative definite.
◮ The continuous function:

x ↦ Hf (x)

is called Hessian map.


35/52
Quadratic approximation

◮ Suppose f has a continuous Hessian map at x0 . Then we can


approximate f quadratically in a neighborhood of x0 using
    f (x) ≈ f (x0 ) + ∇f (x0 )⊤ (x − x0 ) + (1/2)(x − x0 )⊤ Hf (x0 )(x − x0 ).
◮ This leads to the following approximation to the gradient:

∇f (x) ≈ ∇f (x0 ) + Hf (x0 )(x − x0 ).

◮ If x∗ is a maximum, we have

    ∇f (x∗ ) = 0

◮ Plugging x∗ into the gradient approximation and solving gives

    x∗ ≈ x0 − Hf (x0 )−1 ∇f (x0 ).

36/52
Newton-Raphson method

◮ The Newton-Raphson method for multidimensional


optimization uses such approximations sequentially
◮ We can define a sequence of iterations starting at an arbitrary
value x0 , and update using the rule,

x(k+1) = x(k) − Hf (x(k) )−1 ∇f (x(k) ).

◮ The Newton-Raphson algorithm is globally convergent at


quadratic rate whenever f is concave and has two continuous
derivatives.

37/52
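The update rule above can be sketched in a few lines (a generic implementation under the stated assumptions, with names of my choosing; it is checked on a concave quadratic whose maximizer is known in closed form and is reached in one Newton step).

```python
import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-10, max_iter=100):
    """Maximize f by iterating x <- x - Hf(x)^{-1} grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # Hf(x)^{-1} grad f(x)
        x = x - step
        if np.linalg.norm(step) < tol:            # stop once the step is tiny
            break
    return x

# Concave quadratic f(x) = -0.5 x'Ax + b'x, maximized at x* = A^{-1} b.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x_star = newton_raphson(lambda x: b - A @ x, lambda x: -A, np.zeros(2))
```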
Fisher-scoring method (1)

◮ Newton-Raphson works for a deterministic case, which does


not have to involve random data.
◮ Sometimes, calculation of the Hessian matrix is quite
complicated (we will see an example)
◮ Goal: use directly the fact that we are minimizing the KL
  divergence

    KL “=” − IE[ log-likelihood ]
◮ Idea: replace the Hessian with its expected value. Recall that

IEθ (Hℓn (θ)) = −I(θ)

is the Fisher Information

38/52
Fisher-scoring method (2)

◮ The Fisher Information matrix is positive definite, and can


serve as a stand-in for the Hessian in the Newton-Raphson
algorithm, giving the update:

θ (k+1) = θ (k) + I(θ (k) )−1 ∇ℓn (θ (k) ).

This is the Fisher-scoring algorithm.


◮ It has essentially the same convergence properties as
  Newton-Raphson, but it is often easier to compute I than Hℓn .

39/52
Example: Logistic Regression (1)

◮ Suppose Yi ∼ Bernoulli(pi ), i = 1, . . . , n, are independent


0/1 indicator responses, and Xi is a p × 1 vector of predictors
for individual i.
◮ The log-likelihood is as follows:

    ℓn (θ|Y, X) = Σ_{i=1}^{n} ( Yi θi − log(1 + exp(θi )) ) .

◮ Under the canonical link,

    θi = log( pi /(1 − pi ) ) = Xi⊤ β.

40/52
Example: Logistic Regression (2)
◮ Thus, we have

    ℓn (β|Y, X) = Σ_{i=1}^{n} ( Yi Xi⊤ β − log(1 + exp(Xi⊤ β)) ) .

◮ The gradient is

    ∇ℓn (β) = Σ_{i=1}^{n} ( Yi − exp(Xi⊤ β)/(1 + exp(Xi⊤ β)) ) Xi .

◮ The Hessian is

    Hℓn (β) = − Σ_{i=1}^{n} ( exp(Xi⊤ β)/(1 + exp(Xi⊤ β))² ) Xi Xi⊤ .

◮ As a result, the updating rule is

    β (k+1) = β (k) − Hℓn (β (k) )−1 ∇ℓn (β (k) ).


41/52
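The gradient, Hessian, and updating rule above translate directly into code. A sketch on simulated data (function and variable names are mine, not the slides'): the update is repeated a fixed number of times, after which the score at the fitted coefficients should be numerically zero.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Newton-Raphson for logistic regression with the canonical link.
    X: (n, p) design matrix, y: (n,) 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # exp(x'b)/(1 + exp(x'b))
        grad = X.T @ (y - p)                         # gradient from the slide
        H = -(X * (p * (1.0 - p))[:, None]).T @ X    # Hessian from the slide
        beta = beta - np.linalg.solve(H, grad)       # Newton update
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_true = np.array([-0.5, 1.5])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=500) < p_true).astype(float)
beta_hat = fit_logistic_newton(X, y)
```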
Example: Logistic Regression (3)

◮ The score function is a linear combination of the Xi , and the


Hessian or Information matrix is a linear combination of
Xi Xi⊤ . This is typical in exponential family regression models
(i.e. GLM).
◮ The Hessian is negative definite, so there is a unique local
maximizer, which is also the global maximizer.
◮ Finally, note that Yi does not appear in Hℓn (β), which yields

    Hℓn (β) = IE[ Hℓn (β) ] = −I(β)

42/52
Iteratively Re-weighted Least Squares

◮ IRLS is an algorithm for fitting GLM obtained by


Newton-Raphson/Fisher-scoring.
◮ Suppose Yi |Xi has a distribution from an exponential family
with the following log-likelihood function,
    ℓ = Σ_{i=1}^{n} (Yi θi − b(θi ))/φ + c(Yi , φ).

◮ Observe that

    µi = b′ (θi ),   Xi⊤ β = g(µi ),   dµi /dθi = b′′ (θi ) ≡ Vi ,

    θi = (b′ )−1 ◦ g −1 (Xi⊤ β) := h(Xi⊤ β)

43/52
Chain rule

◮ According to the chain rule, we have

    ∂ℓn /∂βj = Σ_{i=1}^{n} (∂ℓi /∂θi )(∂θi /∂βj )
             = Σ_{i} ((Yi − µi )/φ) h′ (Xi⊤ β) Xij
             = Σ_{i} (Ỹi − µ̃i ) Wi Xij      ( Wi ≡ h′ (Xi⊤ β)/(g ′ (µi )φ) ).

◮ Where Ỹ = (g ′ (µ1 )Y1 , . . . , g ′ (µn )Yn )⊤ and
  µ̃ = (g ′ (µ1 )µ1 , . . . , g ′ (µn )µn )⊤

44/52
Gradient

◮ Define
W = diag{W1 , . . . , Wn },
◮ Then, the gradient is

∇ℓn (β) = X⊤ W (Ỹ − µ̃)

45/52
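As a sanity check on the gradient formula (my own sketch, assuming the logit link with φ = 1, so that h′ = 1 and Wi = µi (1 − µi )): X⊤ W (Ỹ − µ̃) should reduce exactly to the plain Bernoulli score X⊤ (Y − µ) from the logistic-regression slide.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
beta = np.array([0.3, -0.7, 1.1])
mu = 1.0 / (1.0 + np.exp(-X @ beta))     # mu_i = g^{-1}(x_i' beta), logit link
y = (rng.uniform(size=50) < mu).astype(float)

g_prime = 1.0 / (mu * (1.0 - mu))        # g'(mu) for the logit link
W = np.diag(1.0 / g_prime)               # W_i = h'/(g'(mu_i) phi), h' = 1, phi = 1
y_tilde = g_prime * y                    # componentwise rescaled response
mu_tilde = g_prime * mu
grad_glm = X.T @ W @ (y_tilde - mu_tilde)
grad_plain = X.T @ (y - mu)              # logistic score from the earlier slide
```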
Hessian
◮ For the Hessian, we have

    ∂ 2 ℓ/(∂βj ∂βk ) = Σ_{i} ((Yi − µi )/φ) h′′ (Xi⊤ β) Xij Xik
                       − (1/φ) Σ_{i} h′ (Xi⊤ β) Xij (∂µi /∂βk )

◮ Note that

    ∂µi /∂βk = ∂b′ (θi )/∂βk = ∂b′ (h(Xi⊤ β))/∂βk = b′′ (θi ) h′ (Xi⊤ β) Xik

  Since IE[Yi ] = µi , the first term vanishes in expectation, which yields

    IE(Hℓn (β)) = −(1/φ) Σ_{i} b′′ (θi ) [h′ (Xi⊤ β)]2 Xi Xi⊤

46/52
Fisher information
◮ Note that g −1 (·) = b′ ◦ h(·) yields

    b′′ ◦ h(·) · h′ (·) = 1/( g ′ ◦ g −1 (·) )

  Recalling that θi = h(Xi⊤ β) and µi = g −1 (Xi⊤ β), we obtain

    b′′ (θi ) h′ (Xi⊤ β) = 1/g ′ (µi )

◮ As a result

    IE(Hℓn (β)) = − Σ_{i} ( h′ (Xi⊤ β)/(g ′ (µi )φ) ) Xi Xi⊤

◮ Therefore,

    I(β) = −IE(Hℓn (β)) = X⊤ W X   where   W = diag( h′ (Xi⊤ β)/(g ′ (µi )φ) )
47/52
Fisher-scoring updates

◮ According to Fisher-scoring, we can update an initial estimate


β (k) to β (k+1) using

β (k+1) = β (k) + I(β (k) )−1 ∇ℓn (β (k) ) ,

◮ which is equivalent to

β (k+1) = β (k) + (X⊤ W X)−1 X⊤ W (Ỹ − µ̃)


= (X⊤ W X)−1 X⊤ W (Ỹ − µ̃ + Xβ (k) )

48/52
Weighted least squares (1)
Let us open a parenthesis to talk about Weighted Least Squares.
◮ Assume the linear model Y = Xβ + ε, where ε ∼ Nn (0, W −1 )
  and W −1 is an n × n diagonal matrix. When the variances differ
  across observations, the regression is said to be heteroskedastic.
◮ The maximum likelihood estimator is given by the solution to

min(Y − Xβ)⊤ W (Y − Xβ)


β

This is a Weighted Least Squares problem


◮ The solution is given by

    β̂ = (X⊤ W X)−1 X⊤ W Y

◮ Routinely implemented in statistical software.

49/52
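A minimal WLS sketch on simulated heteroskedastic data (all names are illustrative): we weight each observation by its inverse variance and solve the normal equations (X⊤ W X)β = X⊤ W Y directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
sigma = rng.uniform(0.5, 2.0, size=n)     # observation-specific noise levels
Y = X @ beta + sigma * rng.normal(size=n)

W = np.diag(1.0 / sigma**2)               # inverse-variance weights
# Minimizer of (Y - Xb)' W (Y - Xb): solve (X'WX) b = X'WY
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
```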
Weighted least squares (2)

Back to our problem.


◮ Recall that

    β (k+1) = (X⊤ W X)−1 X⊤ W (Ỹ − µ̃ + Xβ (k) )

◮ This reminds us of Weighted Least Squares with


1. W = W (β (k) ) being the weight matrix,
2. Ỹ − µ̃ + Xβ (k) being the response.
So we can obtain β (k+1) using any system for WLS.

50/52
IRLS procedure (1)
Iteratively Reweighted Least Squares is an iterative procedure to
compute the MLE in GLMs using weighted least squares.
We show how to go from β (k) to β (k+1)
1. Fix β (k) and µi(k) = g −1 (Xi⊤ β (k) );
2. Calculate the adjusted dependent responses

    Zi(k) = Xi⊤ β (k) + g ′ (µi(k) )(Yi − µi(k) );

3. Compute the weights W (k) = W (β (k) ):

    W (k) = diag( h′ (Xi⊤ β (k) )/(g ′ (µi(k) )φ) )

4. Regress Z(k) on the design matrix X with weight W (k) to
   derive a new estimate β (k+1) ;
We can repeat this procedure until convergence.
51/52
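The four steps above can be sketched for the logistic case (assuming the canonical logit link with φ = 1, so h′ = 1, Wi = µi (1 − µi ), and g ′ (µ) = 1/(µ(1 − µ)); function names are my own). Each pass is one weighted least-squares solve on the adjusted response.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """IRLS for logistic regression: repeat weighted LS on adjusted responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # step 1: mu = g^{-1}(X beta)
        w = mu * (1.0 - mu)                    # step 3: W_i = 1/g'(mu_i), h' = 1
        z = X @ beta + (y - mu) / w            # step 2: adjusted response Z
        XtW = X.T * w                          # step 4: solve (X'WX) b = X'Wz
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(400), rng.normal(size=400)])
beta_true = np.array([0.2, -1.0])
y = (rng.uniform(size=400) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_irls = irls_logistic(X, y)
```

Since the logit link is canonical, this is exactly Fisher scoring, which here coincides with Newton-Raphson, so the fixed point is the MLE.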
IRLS procedure (2)

◮ For this procedure, we only need to know X, Y, the link
  function g(·) and the variance function V (µ) = b′′ (θ).
◮ A possible starting value is to let µ(0) = Y.
◮ If the canonical link is used, then Fisher scoring is the same as
  Newton-Raphson:

    IE(Hℓn ) = Hℓn .

  There is no random component (Y) in the Hessian matrix.

52/52
MIT OpenCourseWare
http://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications


Fall 2016

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
