MIT - Statistics For Applications
Prof. Philippe Rigollet
https://ocw.mit.edu/courses/18-650-statistics-for-applications-fall-2016/
https://www.youtube.com/playlist?list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0
18.650
Statistics for Applications
Chapter 1: Introduction
1/43
Goals
Goals:
▶ To give you a solid introduction to the mathematical theory
behind statistical methods;
▶ To provide theoretical guarantees for the statistical methods
that you may use for certain applications.
At the end of this class, you will be able to
1. From a real-life situation, formulate a statistical problem in
mathematical terms
2. Select appropriate statistical methods for your problem
3. Understand the implications and limitations of various
methods
2/43
Instructors
3/43
Logistics
4/43
Miscellaneous
5/43
Why statistics?
6/43
Not only in the press
8/43
Randomness
RANDOMNESS
Associated questions:
▶ Notion of average (“fair premium”, …)
▶ Quantifying chance (“most of the floods”, …)
▶ Significance, variability, …
9/43
Probability
▶ Probability studies randomness (hence the prerequisite)
▶ Sometimes, the physical process is completely known: dice,
cards, roulette, fair coins, …
Examples
Rolling 1 die:
▶ Alice gets $1 if # of dots 3
▶ Bob gets $2 if # of dots 2
Who do you want to be: Alice or Bob?
Rolling 2 dice:
▶ Choose a number between 2 and 12
▶ Win $100 if you chose the sum of the 2 dice
11/43
Statistics vs. probability
13/43
18.650
14/43
Let’s do some statistics
15/43
Heuristics (1)
17/43
Heuristics (2)
18/43
Heuristics (3)
p̂ = R̄n = (1/n) Σ_{i=1}^n Ri.
19/43
Heuristics (4)
20/43
Heuristics (5)
Let us discuss these assumptions.
21/43
Two important tools: LLN & CLT
LLN: X̄n := (1/n) Σ_{i=1}^n Xi −→ µ almost surely and in probability, as n → ∞.
CLT: √n (X̄n − µ)/σ −→ N(0, 1) in distribution, as n → ∞.
(Equivalently, √n (X̄n − µ) −→ N(0, σ²) in distribution.)
22/43
Consequences (1)
▶ Hence, when the size n of the experiment becomes large, R̄n is a good (say “consistent”) estimator of p.
▶ The CLT refines this by quantifying how good this estimate is.
23/43
Consequences (2)
Φn(x): cdf of √n (R̄n − p)/√(p(1 − p)).
CLT: Φn(x) ≈ Φ(x) when n becomes large. Hence, for all x > 0,
IP[ |R̄n − p| ≥ x ] ≈ 2 ( 1 − Φ( x√n/√(p(1 − p)) ) ).
24/43
Consequences (3)
Consequences:
▶ Approximation of how R̄n concentrates around p;
25/43
Consequences (4)
▶ Note that no matter the (unknown) value of p, p(1 − p) ≤ 1/4.
▶ Hence, roughly with probability at least 1 − α,
R̄n ∈ [ p − qα/2/(2√n) , p + qα/2/(2√n) ].
▶ In other words, when n becomes large, the interval
[ R̄n − qα/2/(2√n) , R̄n + qα/2/(2√n) ] contains p with probability ≥ 1 − α.
▶ This interval is called an asymptotic confidence interval for p.
▶ In the kiss example, we get
0.645 ± 1.96/(2√124) = [0.56, 0.73].
In the extreme case n = 3 we would have [0.10, 1.23], but the CLT is not valid! Actually, we can make exact computations!
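For concreteness, here is a minimal Python sketch of the interval above for the kiss example (n = 124, R̄n = 0.645); it is our own illustration, assuming numpy and scipy are available, and is not part of the original slides.
```python
import numpy as np
from scipy.stats import norm

n, r_bar, alpha = 124, 0.645, 0.05
q = norm.ppf(1 - alpha / 2)            # q_{alpha/2} ≈ 1.96

# Conservative asymptotic CI using p(1 - p) <= 1/4
half_width = q / (2 * np.sqrt(n))
print(r_bar - half_width, r_bar + half_width)   # ≈ (0.56, 0.73)
```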
26/43
Another useful tool: Hoeffding’s inequality
What if n is not so large ?
Hoeffding’s inequality (i.i.d. case)
Let n be a positive integer and X, X1, . . . , Xn be i.i.d. r.v. such that X ∈ [a, b] a.s. (a < b are given numbers). Let µ = IE[X]. Then, for all ε > 0,
IP[ |X̄n − µ| ≥ ε ] ≤ 2 e^{−2nε²/(b−a)²}.
Consequence:
▶ For α ∈ (0, 1), with probability ≥ 1 − α,
R̄n − √(log(2/α)/(2n)) ≤ p ≤ R̄n + √(log(2/α)/(2n)).
▶ This holds even for small sample sizes n.
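A minimal sketch (again our own illustration, assuming numpy) comparing the Hoeffding interval R̄n ± √(log(2/α)/(2n)) with the CLT-based interval for the same kiss example:
```python
import numpy as np

n, r_bar, alpha = 124, 0.645, 0.05

# Hoeffding: valid for every n, since the R_i are bounded in [0, 1]
h = np.sqrt(np.log(2 / alpha) / (2 * n))
print(r_bar - h, r_bar + h)        # wider interval, but non-asymptotic

# CLT + p(1 - p) <= 1/4: asymptotic only
c = 1.96 / (2 * np.sqrt(n))
print(r_bar - c, r_bar + c)
```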
27/43
Review of different types of convergence (1)
▶ Convergence in probability:
Tn −→ T in probability iff IP[ |Tn − T| ≥ ε ] −→ 0 as n → ∞, ∀ε > 0.
28/43
Review of different types of convergence (2)
▶ Convergence in Lp (p ≥ 1):
Tn −→ T in Lp iff IE[ |Tn − T|^p ] −→ 0 as n → ∞.
▶ Convergence in distribution:
Tn −→ T in distribution iff IP[Tn ≤ x] −→ IP[T ≤ x] as n → ∞,
for all x ∈ IR at which the cdf of T is continuous.
Remark
These definitions extend to random vectors (i.e., random variables in IR^d for some d ≥ 2).
29/43
Review of different types of convergence (3)
(i) Tn −→ T in distribution;
(ii) IE[f(Tn)] −→ IE[f(T)] as n → ∞, for all continuous and bounded functions f;
(iii) IE[e^{ixTn}] −→ IE[e^{ixT}] as n → ∞, for all x ∈ IR.
30/43
Review of different types of convergence (4)
Important properties
▶ If (Tn)n≥1 converges a.s., then it also converges in probability, and the two limits are equal a.s.
▶ If f is a continuous function:
Tn −→ T a.s./in IP/in distribution  ⟹  f(Tn) −→ f(T) a.s./in IP/in distribution.
31/43
Review of different types of convergence (6)
One can add, multiply, ... limits almost surely and in probability. If Un −→ U and Vn −→ V a.s./in IP, then:
▶ Un + Vn −→ U + V a.s./in IP,
▶ Un Vn −→ U V a.s./in IP,
▶ If in addition, V ≠ 0 a.s., then Un/Vn −→ U/V a.s./in IP.
33/43
Another example (1)
34/43
Another example (2)
35/43
Another example (3)
▶ Density of T1:
f(t) = λ e^{−λt}, ∀t ≥ 0.
▶ IE[T1] = 1/λ.
▶ Hence, a natural estimate of 1/λ is
T̄n := (1/n) Σ_{i=1}^n Ti.
▶ A natural estimator of λ is
λ̂ := 1/T̄n.
36/43
Another example (4)
▶ By the LLN’s,
a.s./IP 1
T̄n −−−−≥
n-≥ ,
▶ Hence,
ˆ−a.s./IP
, −−−≥ ,.
n-≥
▶ By the CLT,
( )
∈ 1 (d)
n T̄n − −−−≥ N (0, ,−2 ).
, n-≥
37/43
The Delta method
38/43
Consequence of the Delta method (1)
▶ √n (λ̂ − λ) −→ N(0, λ²) in distribution, as n → ∞.
▶ Hence, with probability ≈ 1 − α,
|λ̂ − λ| ≤ qα/2 λ/√n.
▶ Can [ λ̂ − qα/2 λ/√n , λ̂ + qα/2 λ/√n ] be used as an asymptotic confidence interval for λ?
▶ No! It depends on λ...
39/43
Consequence of the Delta method (2)
40/43
Slutsky’s theorem
Slutsky’s theorem
Let (Xn ), (Yn ) be two sequences of r.v., such that:
(i) Xn −→ X in distribution;
(ii) Yn −→ c in probability,
where X is a r.v. and c is a given real number. Then,
(Xn, Yn) −→ (X, c) in distribution.
In particular,
Xn + Yn −→ X + c in distribution,
Xn Yn −→ cX in distribution,
...
41/43
Consequence of Slutsky’s theorem (1)
▶ Thanks to the Delta method, we know that
√n (λ̂ − λ)/λ −→ N(0, 1) in distribution, as n → ∞.
▶ By Slutsky’s theorem, we can also replace λ by λ̂ in the denominator:
√n (λ̂ − λ)/λ̂ −→ N(0, 1) in distribution, as n → ∞.
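A minimal sketch (our illustration, assuming numpy and scipy; the simulated λ is an arbitrary choice) of the resulting plug-in confidence interval λ̂ ± qα/2 λ̂/√n:
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
lam_true, n, alpha = 2.0, 500, 0.05           # arbitrary simulation settings
T = rng.exponential(scale=1 / lam_true, size=n)

lam_hat = 1 / T.mean()                        # hat(lambda) = 1 / bar(T)_n
q = norm.ppf(1 - alpha / 2)
half = q * lam_hat / np.sqrt(n)               # Slutsky: lambda replaced by its estimate
print(lam_hat - half, lam_hat + half)         # contains 2.0 about 95% of the time
```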
42/43
Consequence of Slutsky’s theorem (2)
Remark:
43/43
18.650
Statistics for Applications
1/11
The rationale behind statistical modeling
◮ Let X1 , . . . , Xn be n independent copies of X.
◮ The goal of statistics is to learn the distribution of X.
◮ If X ∈ {0, 1}, easy! It’s Ber(p) and we only have to learn the
parameter p of the Bernoulli distribution.
◮ Can be more complicated. For example, here is a (partial)
dataset with number of siblings (including self) that were
collected from college students a few years back: 2, 3, 2, 4, 1,
3, 1, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 3, 2, 1, 3, 1, 2, 3, . . .
◮ We could make no assumption and try to learn the pmf:
x           1    2    3    4    5    6    ≥ 7
IP(X = x)   p1   p2   p3   p4   p5   p6   Σ_{i≥7} pi
Formal definition
where:
◮ E is sample space;
3/11
Statistical model (2)
4/11
Statistical model (3)
Examples
1. For n Bernoulli trials:
( {0, 1}, (Ber(p))p∈(0,1) ).
2. If X1, . . . , Xn ∼iid Exp(λ), for some unknown λ > 0:
( IR*₊, (Exp(λ))λ>0 ).
3. If X1, . . . , Xn ∼iid Poiss(λ), for some unknown λ > 0:
( IN, (Poiss(λ))λ>0 ).
4. If X1, . . . , Xn ∼iid N(µ, σ²), for some unknown µ ∈ IR and σ² > 0:
( IR, (N(µ, σ²))(µ,σ²)∈IR×IR*₊ ).
5/11
Identification
Examples
iid
2. If Xi = 1IYi ≥0 , where Y1 , . . . , Yn ∼ N (µ, σ 2 ), for some
unknown µ ∈ IR and σ 2 > 0, are unobserved: µ and σ 2 are
not identified (but θ = µ/σ is).
6/11
Parameter estimation (1)
Idea: Given an observed sample X1 , . . . , Xn and a statistical
model (E, (IPθ )θ∈Θ ), one wants to estimate the parameter θ.
Definitions
◮ Statistic: Any measurable1 function of the sample, e.g.,
X̄n , max Xi , X1 + log(1 + |Xn |), sample variance, etc...
i
◮ Estimator of θ: Any statistic whose expression does not
depend on θ.
◮ An estimator θ̂n of θ is weakly (resp. strongly ) consistent iff
IP (resp. a.s.)
θ̂n −−−−−−−−−→ θ (w.r.t. IPθ ).
n→∞
¹ Rule of thumb: if you can compute it exactly once given data, it is measurable. You may have some issues with things that are implicitly defined such as sup or inf, but not in this class.
7/11
Parameter estimation (2)
Remark: If Θ ⊆ IR,
8/11
Confidence intervals (1)
Let (E, (IPθ )θ∈Θ ) be a statistical model based on observations
X1 , . . . , Xn , and assume Θ ⊆ IR.
Definition
Let α ∈ (0, 1).
◮ Confidence interval (C.I.) of level 1 − α for θ: Any random
(i.e., depending on X1 , . . . , Xn ) interval I whose boundaries
do not depend on θ and such that:
IPθ [I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
◮ C.I. of asymptotic level 1 − α for θ: Any random interval I
whose boundaries do not depend on θ and such that:
lim IPθ [I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
n→∞
9/11
Confidence intervals (2)
iid
Example: Let X1 , . . . , Xn ∼ Ber(p), for some unknown
p ∈ (0, 1).
◮ Let qα/2 be the (1 − α/2)-quantile of N(0, 1) and
I = [ X̄n − qα/2 √(p(1 − p))/√n , X̄n + qα/2 √(p(1 − p))/√n ].
◮ Problem: I depends on p !
10/11
Confidence intervals (3)
Two solutions:
11/11
18.650
Statistics for Applications
1/23
Total variation distance (1)
Let (E, (IPθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, . . . , Xn. Assume that there exists θ* ∈ Θ such that X1 ∼ IPθ*: θ* is the true parameter.
2/23
Total variation distance (2)
3/23
Total variation distance (3)
4/23
Total variation distance (4)
5/23
Total variation distance (5)
An estimation strategy: Build an estimator T̂V(IPθ, IPθ*) for all θ ∈ Θ. Then find θ̂ that minimizes the function θ ↦ T̂V(IPθ, IPθ*).
Problem: Unclear how to build T̂V(IPθ, IPθ*)!
6/23
Kullback-Leibler (KL) divergence (1)
Definition
The Kullback-Leibler (KL) divergence between two probability
measures IPθ and IPθ′ is defined by
KL(IPθ, IPθ′) = Σ_{x∈E} pθ(x) log( pθ(x)/pθ′(x) )        if E is discrete
KL(IPθ, IPθ′) = ∫_E fθ(x) log( fθ(x)/fθ′(x) ) dx          if E is continuous
7/23
Kullback-Leibler (KL) divergence (2)
Properties of KL-divergence:
Not a distance.
8/23
Kullback-Leibler (KL) divergence (3)
KL(IPθ*, IPθ) = IEθ*[ log( pθ*(X)/pθ(X) ) ]
             = IEθ*[ log pθ*(X) ] − IEθ*[ log pθ(X) ]
K̂L(IPθ*, IPθ) = “constant” − (1/n) Σ_{i=1}^n log pθ(Xi)
9/23
Kullback-Leibler (KL) divergence (4)
K̂L(IPθ*, IPθ) = “constant” − (1/n) Σ_{i=1}^n log pθ(Xi)

min_{θ∈Θ} K̂L(IPθ*, IPθ)  ⇔  min_{θ∈Θ} − (1/n) Σ_{i=1}^n log pθ(Xi)
                          ⇔  max_{θ∈Θ} (1/n) Σ_{i=1}^n log pθ(Xi)
                          ⇔  max_{θ∈Θ} Σ_{i=1}^n log pθ(Xi)
                          ⇔  max_{θ∈Θ} Π_{i=1}^n pθ(Xi)
Interlude: maximizing/minimizing functions (1)
Note that
min_{θ∈Θ} −h(θ)  ⇔  max_{θ∈Θ} h(θ)
Example: θ ↦ Σ_{i=1}^n (θ − Xi)
11/23
Interlude: maximizing/minimizing functions (2)
Definition
A twice differentiable function h : Θ ⊂ IR → IR is said to be concave if its second derivative satisfies h″(θ) ≤ 0 for all θ ∈ Θ.
A maximizer θ* of a concave function h can be found by solving
h′(θ) = 0,
or, in the multivariate case (Θ ⊂ IR^d),
∇h(θ) = 0 ∈ IR^d.
14/23
Likelihood, Discrete case (1)
Let (E, (IPθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, . . . , Xn. Assume that E is discrete (i.e., finite or countable).
Definition
The likelihood of the model is the map Ln (or just L) defined as:
Ln : E^n × Θ → IR
(x1, . . . , xn, θ) ↦ IPθ[X1 = x1, . . . , Xn = xn].
15/23
Likelihood, Discrete case (2)
iid
Example 1 (Bernoulli trials): If X1 , . . . , Xn ∼ Ber(p) for some
p ∈ (0, 1):
◮ E = {0, 1};
◮ Θ = (0, 1);
◮ ∀(x1 , . . . , xn ) ∈ {0, 1}n , ∀p ∈ (0, 1),
L(x1, . . . , xn, p) = Π_{i=1}^n IPp[Xi = xi]
                    = Π_{i=1}^n p^{xi} (1 − p)^{1−xi}
                    = p^{Σ_{i=1}^n xi} (1 − p)^{n − Σ_{i=1}^n xi}.
16/23
Likelihood, Discrete case (3)
Example 2 (Poisson model):
iid
If X1 , . . . , Xn ∼ Poiss(λ) for some λ > 0:
◮ E = IN;
◮ Θ = (0, ∞);
◮ ∀(x1 , . . . , xn ) ∈ INn , ∀λ > 0,
L(x1, . . . , xn, λ) = Π_{i=1}^n IPλ[Xi = xi]
                    = Π_{i=1}^n e^{−λ} λ^{xi}/xi!
                    = e^{−nλ} λ^{Σ_{i=1}^n xi} / (x1! · · · xn!).
17/23
Likelihood, Continuous case (1)
Let (E, (IPθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, . . . , Xn. Assume that all the IPθ have density fθ.
Definition
L : E^n × Θ → IR
(x1, . . . , xn, θ) ↦ Π_{i=1}^n fθ(xi).
18/23
Likelihood, Continuous case (2)
iid
Example 1 (Gaussian model): If X1 , . . . , Xn ∼ N (µ, σ 2 ), for
some µ ∈ IR, σ 2 > 0:
◮ E = IR;
◮ Θ = IR × (0, ∞)
◮ ∀(x1 , . . . , xn ) ∈ IRn , ∀(µ, σ 2 ) ∈ IR × (0, ∞),
L(x1, . . . , xn, µ, σ²) = 1/(σ√(2π))^n · exp( − (1/(2σ²)) Σ_{i=1}^n (xi − µ)² ).
19/23
Maximum likelihood estimator (1)
Definition
The maximum likelihood estimator of θ is defined as
θ̂n^{MLE} ∈ argmax_{θ∈Θ} L(X1, . . . , Xn, θ),
provided it exists.
20/23
Maximum likelihood estimator (2)
Examples
◮ Poisson model: λ̂n^{MLE} = X̄n.
◮ Gaussian model: (µ̂n, σ̂n²) = (X̄n, Ŝn).
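A minimal numerical check (our illustration, assuming numpy and scipy) that maximizing the Poisson log-likelihood over λ returns the sample mean X̄n:
```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=200)            # simulated data; lambda = 3 is arbitrary

neg_log_lik = lambda lam: -poisson.logpmf(x, lam).sum()
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
print(res.x, x.mean())                        # the two numbers agree (up to tolerance)
```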
21/23
Maximum likelihood estimator (3)
If Θ ⊂ IR, we get:
I(θ) = var[ℓ′(θ)] = −IE[ℓ″(θ)]
22/23
Maximum likelihood estimator (4)
Theorem
23/23
18.650
Statistics for Applications
1/14
Weierstrass Approximation Theorem (WAT)
Theorem
Let f be a continuous function on the interval [a, b]. Then, for any ε > 0, there exist a0, a1, . . . , ad ∈ IR such that
max_{x∈[a,b]} | f(x) − Σ_{k=0}^d ak x^k | < ε.
2/14
Statistical application of the WAT (1)
◮ Let X1, . . . , Xn be an i.i.d. sample associated with an (identified) statistical model (E, {IPθ}θ∈Θ). Write θ* for the true parameter.
◮ Assume that for all θ, the distribution IPθ has a density fθ.
◮ If we find θ such that
(only d + 1 equations)
◮ The quantity mk(θ) := ∫ x^k fθ(x) dx is the kth moment of IPθ. It can also be written as
mk(θ) = IEθ[X^k].
4/14
Gaussian quadrature (1)
◮ The Weierstrass approximation theorem has limitations:
1. works only for continuous functions (not really a problem!)
2. works only on intervals [a, b]
3. Does not tell us what d (# of moments) should be
◮ What if E is discrete: no PDF but a PMF p(·)?
◮ Assume that E = {x1, x2, . . . , xr} is finite with r possible values. The PMF has r − 1 parameters:
p(x1), . . . , p(x_{r−1}),
because the last one, p(xr) = 1 − Σ_{j=1}^{r−1} p(xj), is given by the first r − 1.
◮ Hopefully, we do not need much more than d = r − 1
moments to recover the PMF p(·).
5/14
Gaussian quadrature (2)
◮ Note that for any k = 1, . . . , r − 1,
mk = IE[X^k] = Σ_{j=1}^r p(xj) xj^k
and
Σ_{j=1}^r p(xj) = 1.
8/14
Method of moments (1)
Let X1, . . . , Xn be an i.i.d. sample associated with a statistical model (E, (IPθ)θ∈Θ). Assume that Θ ⊆ IR^d, for some d ≥ 1.
◮ Empirical moments: Let m̂k = (1/n) Σ_{i=1}^n Xi^k, 1 ≤ k ≤ d.
◮ Let
ψ : Θ ⊂ IR^d → IR^d
θ ↦ (m1(θ), . . . , md(θ)).
9/14
Method of moments (2)
Definition
Moments estimator of θ:
θ̂n^{MM} = ψ^{−1}(m̂1, . . . , m̂d),
provided it exists.
10/14
Method of moments (3)
Analysis of θ̂n^{MM}
◮ Assume ψ^{−1} is continuously differentiable at M(θ). Write ∇ψ^{−1}|_{M(θ)} for the d × d gradient matrix at this point.
11/14
Method of moments (4)
Theorem
√n (θ̂n^{MM} − θ) −→ N(0, Γ(θ)) in distribution as n → ∞ (w.r.t. IPθ),
where Γ(θ) = [ ∇ψ^{−1}|_{M(θ)} ]^⊤ Σ(θ) [ ∇ψ^{−1}|_{M(θ)} ].
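A minimal sketch (our illustration, assuming numpy) of the moments estimator in the Gaussian model, where ψ(µ, σ²) = (µ, µ² + σ²) is inverted explicitly:
```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=2.0, size=1000)   # arbitrary simulation settings

m1_hat = x.mean()            # empirical first moment
m2_hat = (x ** 2).mean()     # empirical second moment

# psi(mu, sigma^2) = (mu, mu^2 + sigma^2); invert it:
mu_hat = m1_hat
sigma2_hat = m2_hat - m1_hat ** 2
print(mu_hat, sigma2_hat)    # close to 1.5 and 4.0
```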
12/14
Multivariate Delta method
Let (Tn)n≥1 be a sequence of random vectors in IR^p (p ≥ 1) that satisfies
√n (Tn − θ) −→ N(0, Σ) in distribution as n → ∞.
Then, for a map g that is continuously differentiable at θ,
√n ( g(Tn) − g(θ) ) −→ N(0, ∇g(θ)^⊤ Σ ∇g(θ)) in distribution as n → ∞,
where ∇g(θ) = ( ∂gj/∂θi )_{1≤i≤d, 1≤j≤k} is the d × k matrix of partial derivatives.
13/14
MLE vs. Moment estimator
14/14
18.650
Statistics for Applications
1/37
Cherry Blossom run (1)
2/37
Cherry Blossom run (2)
We can see from past data that the running time has a Gaussian distribution.
3/37
Cherry Blossom run (3)
4/37
Cherry Blossom run (4)
5/37
Cherry Blossom run (5)
Simple heuristic:
Better heuristic:
“If X̄n < 103.5 − (something that −→ 0 as n → ∞), then µ < 103.5”
To make this intuition more precise, we need to take the size of the random fluctuations of X̄n into account!
6/37
Clinical trials (1)
7/37
Clinical trials (2)
“If X̄drug < X̄control − (something that −→ 0 as ndrug → ∞ and ncontrol → ∞), then conclude that µdrug < µcontrol”
8/37
Heuristics (1)
Example 1: A coin is tossed 80 times, and Heads are obtained 54
times. Can we conclude that the coin is significantly unfair ?
◮ n = 80, X1, . . . , Xn ∼iid Ber(p);
◮ X̄n = 54/80 = .68
◮ If it was true that p = .5: By CLT + Slutsky’s theorem,
√n (X̄n − .5)/√(.5(1 − .5)) ≈ N(0, 1).
◮ Here, √n (X̄n − .5)/√(.5(1 − .5)) ≈ 3.22.
◮ Conclusion: It seems quite reasonable to reject the
hypothesis p = .5.
9/37
Heuristics (2)
Example 2: A coin is tossed 30 times, and Heads are obtained 13
times. Can we conclude that the coin is significantly unfair ?
◮ n = 30, X1, . . . , Xn ∼iid Ber(p);
◮ X̄n = 13/30 ≈ .43
◮ If it was true that p = .5: By CLT + Slutsky’s theorem,
√n (X̄n − .5)/√(.5(1 − .5)) ≈ N(0, 1).
◮ Our data gives √n (X̄n − .5)/√(.5(1 − .5)) ≈ −.77.
◮ The number .77 is a plausible realization of a random variable
Z ∼ N (0, 1).
◮ Conclusion: our data does not suggest that the coin is unfair.
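A minimal sketch (our illustration, assuming numpy and scipy) reproducing the two test statistics above:
```python
import numpy as np
from scipy.stats import norm

def z_stat(heads, n, p0=0.5):
    """sqrt(n) * (X_bar - p0) / sqrt(p0 * (1 - p0))"""
    x_bar = heads / n
    return np.sqrt(n) * (x_bar - p0) / np.sqrt(p0 * (1 - p0))

print(z_stat(54, 80))   # ≈ 3.13 (the slides round X̄n to .68 and report 3.22): reject p = .5
print(z_stat(13, 30))   # ≈ -0.73: a plausible realization of N(0, 1)
print(2 * (1 - norm.cdf(abs(z_stat(54, 80)))))   # two-sided p-value for example 1
```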
10/37
Statistical formulation (1)
◮ Consider a sample X1 , . . . , Xn of i.i.d. random variables and a
statistical model (E, (IPθ )θ∈Θ ).
◮ ψ = 1I{ √n |X̄n − .5|/√(.5(1 − .5)) > C }, for some C > 0.
◮ Rejection region of a test ψ: Rψ = {x ∈ E^n : ψ(x) = 1}.
◮ Type II error of a test ψ: βψ : Θ1 → IR, θ ↦ IPθ[ψ = 0].
◮ Power of a test ψ: πψ = inf_{θ∈Θ1} (1 − βψ(θ)).
13/37
Statistical formulation (4)
αψ (θ) ≤ α, ∀θ ∈ Θ0 .
lim_{n→∞} αψ(θ) ≤ α, ∀θ ∈ Θ0.
14/37
Example (1)
iid
◮ Let X1 , . . . , Xn ∼ Ber(p), for some unknown p ∈ (0, 1).
◮ We want to test:
H0: p = 1/2 vs. H1: p ≠ 1/2
◮ Let Tn = √n |p̂n − 0.5|/√(.5(1 − .5)), where p̂n is the MLE.
15/37
Example (2)
16/37
p-value
Definition
Golden rule
The smaller the p-value, the more confidently one can reject
H0 .
18/37
The χ2 distributions
Definition
For a positive integer d, the χ2 (pronounced “Kai-squared”)
distribution with d degrees of freedom is the law of the random
iid
variable Z12 + Z22 + . . . + Zd2 , where Z1 , . . . , Zd ∼ N (0, 1).
Examples:
◮ If Z ∼ Nd(0, Id), then ‖Z‖²₂ ∼ χ²_d.
◮ Recall that the sample variance is given by
Sn = (1/n) Σ_{i=1}^n (Xi − X̄n)² = (1/n) Σ_{i=1}^n Xi² − (X̄n)².
◮ Cochran’s theorem implies that for X1, . . . , Xn ∼iid N(µ, σ²), if Sn is the sample variance, then
nSn/σ² ∼ χ²_{n−1}.
◮ χ²₂ = Exp(1/2).
19/37
Student’s T distributions
Definition
For a positive integer d, the Student’s T distribution with d
degrees of freedom (denoted by td ) is the law of the random
variable Z/√(V/d), where Z ∼ N(0, 1), V ∼ χ²_d and Z ⊥⊥ V (Z is
independent of V ).
Example:
◮ Cochran’s theorem implies that for X1, . . . , Xn ∼iid N(µ, σ²), if Sn is the sample variance, then
√(n − 1) (X̄n − µ)/√Sn ∼ t_{n−1}.
20/37
Wald’s test (1)
◮ If H0 is true, then
√n I(θ̂^{MLE})^{1/2} (θ̂n^{MLE} − θ0) −→ Nd(0, Id) in distribution as n → ∞, w.r.t. IPθ0.
21/37
Wald’s test (2)
◮ Hence,
Tn := n (θ̂n^{MLE} − θ0)^⊤ I(θ̂^{MLE}) (θ̂n^{MLE} − θ0) −→ χ²_d in distribution as n → ∞, w.r.t. IPθ0.
ψ = 1I{Tn > qα },
22/37
Likelihood ratio test (1)
for some fixed and given numbers θ_{r+1}^{(0)}, . . . , θ_d^{(0)}.
◮ Let
θ̂n = argmax ℓn (θ) (MLE)
θ∈Θ
and
θ̂nc = argmax ℓn (θ) (“constrained MLE”)
θ∈Θ0
23/37
Likelihood ratio test (2)
◮ Test statistic:
Tn = 2 ( ℓn(θ̂n) − ℓn(θ̂n^c) ).
◮ Theorem
Assume H0 is true and the MLE technical conditions are satisfied.
Then,
Tn −→ χ²_{d−r} in distribution as n → ∞, w.r.t. IPθ.
ψ = 1I{Tn > qα },
24/37
Testing implicit hypotheses (1)
25/37
Testing implicit hypotheses (2)
◮ Delta method:
√n ( g(θ̂n) − g(θ) ) −→ Nk(0, Γ(θ)) in distribution as n → ∞,
26/37
Testing implicit hypotheses (3)
ψ = 1I{Tn > qα },
27/37
The multinomial case: χ2 test (1)
◮ ΔK = { p = (p1, . . . , pK) ∈ (0, 1)^K : Σ_{j=1}^K pj = 1 }.
◮ IPp[X = aj] = pj, j = 1, . . . , K.
28/37
The multinomial case: χ2 test (2)
iid
◮ Let X1 , . . . , Xn ∼ IPp , for some unknown p ∈ ΔK , and let
p0 ∈ ΔK be fixed.
◮ We want to test:
H0: p = p0 vs. H1: p ≠ p0
29/37
The multinomial case: χ2 test (3)
Ln(X1, . . . , Xn, p) = p1^{N1} p2^{N2} · · · pK^{NK},
where Nj = #{i = 1, . . . , n : Xi = aj}.
30/37
The multinomial case: χ2 test (4)
◮ If H0 is true, then √n (p̂ − p0) is asymptotically normal, and the following holds.
Theorem
Tn := n Σ_{j=1}^K (p̂j − pj⁰)² / pj⁰ −→ χ²_{K−1} in distribution as n → ∞.
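A minimal sketch (our illustration, assuming numpy and scipy; the counts and the null p⁰ are made up) of the statistic Tn and its asymptotic p-value:
```python
import numpy as np
from scipy.stats import chi2

counts = np.array([22, 31, 27, 20])          # made-up observed counts N_j
p0 = np.array([0.25, 0.25, 0.25, 0.25])      # null hypothesis p^0
n = counts.sum()
p_hat = counts / n

T_n = n * np.sum((p_hat - p0) ** 2 / p0)
K = len(p0)
p_value = chi2.sf(T_n, df=K - 1)             # chi^2_{K-1} under H_0
print(T_n, p_value)
```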
31/37
The Gaussian case: Student’s test (1)
iid
◮ Let X1 , . . . , Xn ∼ N (µ, σ 2 ), for some unknown
µ ∈ IR, σ 2 > 0 and let µ0 ∈ IR be fixed, given.
◮ We want to test:
H0: µ = µ0 vs. H1: µ ≠ µ0
◮ If σ² is known: Let Tn = √n (X̄n − µ0)/σ. Then, Tn ∼ N(0, 1) and
ψα = 1I{|Tn | > qα/2 }
is a test with (non asymptotic) level α.
32/37
The Gaussian case: Student’s test (2)
If σ 2 is unknown:
◮ Let T̃n = √(n − 1) (X̄n − µ0)/√Sn, where Sn is the sample variance.
◮ Cochran’s theorem:
◮ X̄n ⊥⊥ Sn;
◮ nSn/σ² ∼ χ²_{n−1}.
33/37
The Gaussian case: Student’s test (3)
◮ Student’s test with (non asymptotic) level α ∈ (0, 1):
ψα = 1I{ |T̃n| > qα/2 },
ψ′α = 1I{ T̃n > qα },
◮ Two-sample test: we observe X1, . . . , Xn and Y1, . . . , Ym with
IE[X1] = · · · = IE[Xn] = µX, and
IE[Y1] = · · · = IE[Ym] = µY
◮ We want to test:
H0: µX = µY vs. H1: µX ≠ µY
Under H0: µX = µY:
√n (X̄n − Ȳm)/√(1 + m/n) −→ N(0, 1) in distribution, as n, m → ∞ with m/n → γ.
Test: ψα = 1I{ √n |X̄n − Ȳm|/√(1 + m/n) > qα/2 }
36/37
Two-sample T-test
◮ If the variances are unknown but we know that Xi ∼ N(µX, σX²), Yi ∼ N(µY, σY²).
◮ Then
X̄n − Ȳm ∼ N( µX − µY , σX²/n + σY²/m )
◮ Under H0:
(X̄n − Ȳm)/√(σX²/n + σY²/m) ∼ N(0, 1)
◮ In practice σX² and σY² are replaced by the sample variances SX² and SY², and the resulting statistic is approximately distributed as t_N (Welch–Satterthwaite approximation), where
N = (SX²/n + SY²/m)² / ( SX⁴/(n²(n − 1)) + SY⁴/(m²(m − 1)) ).
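A minimal sketch (our illustration, assuming numpy and scipy; the data are simulated) of the two-sample statistic and the degrees of freedom N above:
```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=40)            # arbitrary simulated samples
y = rng.normal(0.3, 1.5, size=55)
n, m = len(x), len(y)
sx2, sy2 = x.var(ddof=1), y.var(ddof=1)      # sample variances S_X^2, S_Y^2

T = (x.mean() - y.mean()) / np.sqrt(sx2 / n + sy2 / m)
N = (sx2 / n + sy2 / m) ** 2 / (sx2**2 / (n**2 * (n - 1)) + sy2**2 / (m**2 * (m - 1)))
p_value = 2 * t.sf(abs(T), df=N)             # two-sided p-value against t_N
print(T, N, p_value)
```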
37/37
Statistics for Applications
1/25
Goodness of fit tests
2/25
Cdf and empirical cdf (1)
Let X1 , . . . , Xn be i.i.d. real random variables. Recall the cdf of
X1 is defined as:
Definition
The empirical cdf of the sample X1 , . . . , Xn is defined as:
Fn(t) = (1/n) Σ_{i=1}^n 1{Xi ≤ t} = #{i = 1, . . . , n : Xi ≤ t}/n, ∀t ∈ IR.
3/25
Cdf and empirical cdf (2)
4/25
Cdf and empirical cdf (3)
Donsker’s Theorem
If F is continuous, then
√n sup_{t∈IR} |Fn(t) − F(t)| −→ sup_{0≤t≤1} |B(t)| in distribution as n → ∞,
where B is a Brownian bridge on [0, 1].
5/25
Kolmogorov-Smirnov test (1)
H0: F = F⁰ v.s. H1: F ≠ F⁰.
6/25
Kolmogorov-Smirnov test (2)
◮ Let Tn = sup_{t∈IR} √n |Fn(t) − F⁰(t)|.
(d)
◮ By Donsker’s theorem, if H0 is true, then Tn −−−→ Z,
n→∞
where Z has a known distribution (supremum of a Brownian
bridge).
7/25
Kolmogorov-Smirnov test (3)
Remarks:
◮ In practice, how to compute Tn ?
Tn = √n max_{i=1,...,n} max{ |(i − 1)/n − F⁰(X_(i))| , |i/n − F⁰(X_(i))| }.
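A minimal sketch (our illustration, assuming numpy and scipy) of this formula for Tn, here testing H0: F = N(0, 1) on simulated data:
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = np.sort(rng.normal(size=100))            # order statistics X_(1) <= ... <= X_(n)
n = len(x)
F0 = norm.cdf(x)                             # F^0(X_(i)) under the null N(0, 1)

i = np.arange(1, n + 1)
Tn = np.sqrt(n) * np.maximum(np.abs(i / n - F0), np.abs((i - 1) / n - F0)).max()
print(Tn)
```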
8/25
Kolmogorov-Smirnov test (4)
◮ If H0 is true, then U1, . . . , Un ∼iid U([0, 1])
and Tn = sup_{0≤x≤1} √n |Gn(x) − x|.
9/25
Kolmogorov-Smirnov test (5)
◮ Estimate the (1 − α)-quantile qα^(n) of Tn by taking the sample (1 − α)-quantile q̂α^(n,M) of Tn¹, . . . , Tn^M.
10/25
Kolmogorov-Smirnov test (6)
These quantiles are often precomputed in a table.
11/25
Other goodness of fit tests
◮ Cramér-Von Mises:
d²(Fn, F) = ∫_IR [Fn(t) − F(t)]² dt
◮ Anderson-Darling:
12/25
Composite goodness of fit tests
where
µ̂ = X̄n, σ̂² = Sn²,
and Φ_{µ̂,σ̂²}(t) is the cdf of N(µ̂, σ̂²).
13/25
Kolmogorov-Lilliefors test (1)
14/25
Kolmogorov-Lilliefors test (2)
These quantiles are often precomputed in a table.
15/25
Quantile-Quantile (QQ) plots (1)
◮ Provide a visual way to perform GoF tests
◮ Not a formal test, but a quick and easy check to see if a distribution is plausible.
◮ Main idea: we want to check visually if the plot of Fn is close
to that of F or equivalently if the plot of Fn−1 is close to that
of F −1 .
◮ More convenient to check if the points
( F^{−1}(1/n), Fn^{−1}(1/n) ), ( F^{−1}(2/n), Fn^{−1}(2/n) ), . . . , ( F^{−1}((n−1)/n), Fn^{−1}((n−1)/n) )
are near the line y = x.
◮ Fn is not technically invertible but we define Fn^{−1}(i/n) = X_(i), the ith order statistic of the sample.
and
pj = IP[X1 = aj ].
19/25
χ2 goodness-of-fit test, finite case (2)
◮ H0 is equivalent to:
20/25
χ2 goodness-of-fit test, finite case (3)
◮ Let
p̂j = (1/n) Σ_{i=1}^n 1{Xi = aj} = #{i : Xi = aj}/n, j = 1, . . . , K.
◮ Define the test statistic:
Tn = n Σ_{j=1}^K ( p̂j − pj(θ̂) )² / pj(θ̂).
21/25
χ2 goodness-of-fit test, finite case (4)
δα = 1{Tn > qα },
22/25
χ2 goodness-of-fit test, infinite case (1)
E = A1 ∪ . . . ∪ AK .
◮ Define, for θ ∈ Θ and j = 1, . . . , K:
◮ pj (θ) = IPθ [Y ∈ Aj ], for Y ∼ IPθ ,
◮ pj = IP[X1 ∈ Aj ],
◮ p̂j = (1/n) Σ_{i=1}^n 1{Xi ∈ Aj} = #{i : Xi ∈ Aj}/n,
23/25
χ2 goodness-of-fit test, infinite case (2)
◮ As previously, let Tn = n Σ_{j=1}^K ( p̂j − pj(θ̂) )² / pj(θ̂).
δα = 1{Tn > qα },
24/25
χ2 goodness-of-fit test, infinite case (3)
◮ Practical issues:
◮ Choice of K ?
◮ Choice of the bins A1 , . . . , AK ?
◮ Computation of pj (θ) ?
25/25
Statistics for Applications
Chapter 7: Regression
1/43
Heuristics of the linear regression (1)
Consider a cloud of i.i.d. random points (Xi , Yi ), i = 1, . . . , n :
2/43
Heuristics of the linear regression (2)
3/43
Heuristics of the linear regression (3)
Examples:
Economics: Demand and price,
Di ≈ a + b pi, i = 1, . . . , n.
4/43
Linear regression of a r.v. Y on a r.v. X (1)
5/43
Linear regression of a r.v. Y on a r.v. X (2)
Y = a + bX + ",
6/43
Linear regression of a r.v. Y on a r.v. X (3)
A sample of n i.i.d. random pairs (X1, Y1), . . . , (Xn, Yn) with the same distribution as (X, Y) is available.
11/43
Linear regression of a r.v. Y on a r.v. X (4)
Definition
The least squared error (LSE) estimator of (a, b) is the minimizer
of the sum of squared errors:
Σ_{i=1}^n (Yi − a − bXi)².
â = Ȳ − b̂ X̄.
12/43
Linear regression of a r.v. Y on a r.v. X (5)
13/43
Multivariate case (1)
Yi = Xi β + "i , i = 1, . . . , n.
Dependent variable: Yi .
Y = Xβ + ".
β̂ = argmin kY − Xtk22 .
t2IRp
15/43
Multivariate case (3)
β̂ = (X^⊤X)^{−1} X^⊤Y.
Xβ̂ = P Y,
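A minimal numpy sketch (our illustration on simulated data) of the LSE β̂ = (X^⊤X)^{−1}X^⊤Y; in practice one solves the normal equations rather than inverting X^⊤X:
```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design with intercept
beta = np.array([1.0, 2.0, -0.5])                                # arbitrary true beta
Y = X @ beta + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # LSE: (X'X)^{-1} X'Y
Y_hat = X @ beta_hat                           # = P Y, the orthogonal projection of Y
print(beta_hat)
```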
16/43
Linear regression with deterministic design and Gaussian
noise (1)
Assumptions:
" ⇠ Nn (0, σ 2 In ),
17/43
Linear regression with deterministic design and Gaussian
noise (2)
LSE = MLE: β̂ ∼ Np( β, σ² (X^⊤X)^{−1} ).
Quadratic risk of β̂: IE[ ‖β̂ − β‖²₂ ] = σ² tr( (X^⊤X)^{−1} ).
Prediction error: IE[ ‖Y − Xβ̂‖²₂ ] = σ² (n − p).
Unbiased estimator of σ²: σ̂² = ‖Y − Xβ̂‖²₂ / (n − p).
Theorem
(n − p) σ̂²/σ² ∼ χ²_{n−p}.
β̂ ⊥⊥ σ̂².
18/43
Significance tests (1)
Test whether the j-th explanatory variable is significant in the linear regression (1 ≤ j ≤ p).
H0: βj = 0 v.s. H1: βj ≠ 0.
(β̂j − βj)/√(σ̂² γj) ∼ t_{n−p}.
Let Tn^(j) = β̂j/√(σ̂² γj).
H0: βj = 0, ∀j ∈ S v.s. H1: ∃j ∈ S, βj ≠ 0, where S ⊆ {1, . . . , p}.
Bonferroni’s test: δα^B = max_{j∈S} δ_{α/k}^(j), where k = |S|.
20/43
More tests (1)
H0: Gβ = λ v.s. H1: Gβ ≠ λ.
If H0 is true, then:
and
σ^{−2} (Gβ̂ − λ)^⊤ ( G (X^⊤X)^{−1} G^⊤ )^{−1} (Gβ̂ − λ) ∼ χ²_k.
21/43
More tests (2)
Let Sn = (1/σ̂²) · (Gβ̂ − λ)^⊤ ( G (X^⊤X)^{−1} G^⊤ )^{−1} (Gβ̂ − λ) / k.
Definition
The Fisher distribution with p and q degrees of freedom, denoted by Fp,q, is the distribution of (U/p)/(V/q), where:
U ∼ χ²_p, V ∼ χ²_q,
U ⊥⊥ V.
22/43
Concluding remarks
23/43
Linear regression and lack of identifiability (1)
Consider the following model:
Y = Xβ + ",
with:
1. Y 2 IRn (dependent variables), X 2 IRn⇥p (deterministic
design) ;
2. β 2 IRp , unknown;
3. " ⇠ Nn (0, σ 2 In ).
kŶ − Yk22
2
⇠ χ2n k,
σ
25/43
Linear regression and lack of identifiability (3)
In particular:
26/43
Linear regression in high dimension (1)
Consider again the following model:
Y = Xβ + ",
with:
1. Y 2 IRn (dependent variables), X 2 IRn⇥p (deterministic
design) ;
2. β 2 IRp , unknown: to be estimated;
3. " ⇠ Nn (0, σ 2 In ).
For each i, Xi 2 IRp is the vector of covariates of the i-th
individual.
β̂ S 2 argmin kY − XS tk2 ,
t2IRS
28/43
Linear regression in high dimension (3)
29/43
Linear regression in high dimension (4)
Each of these criteria is equivalent to finding b ∈ IR^p that minimizes:
‖Y − Xb‖²₂ + λ‖b‖₀,
where ‖b‖₀ is the number of nonzero coefficients of b.
Lasso estimator:
replace ‖b‖₀ = Σ_{j=1}^p 1I{bj ≠ 0} with ‖b‖₁ = Σ_{j=1}^p |bj|
How to choose λ ?
31/43
Linear regression in high dimension (6)
32/43
Nonparametric regression (1)
33/43
Nonparametric regression (2)
LSE, MLE, all the previous theory for the linear case could be
adapted.
34/43
Nonparametric regression (3)
Yi ≈ f(x) + εi.
35/43
Nonparametric regression (4)
36/43
Nonparametric regression (5)
[Scatter plot of the sample: Y (roughly between −0.2 and 0.5) plotted against X.]
37/43
Nonparametric regression (6)
[Scatter plot illustrating local averaging: at the point x = 0.6 with window width h = 0.1, the estimate is f̂(x) = 0.27.]
38/43
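A minimal sketch (our illustration, assuming numpy) of a simple local-averaging estimator of this kind, f̂_{n,h}(x) = average of the Yi with |Xi − x| ≤ h, run on the example of the next slides (n = 100, f(x) = x(1 − x)); the estimator used in the slides may differ in its details:
```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = rng.uniform(size=n)
f = lambda x: x * (1 - x)
Y = f(X) + rng.normal(scale=0.05, size=n)      # noise level is an arbitrary choice

def f_hat(x, h):
    """Average the Y_i whose X_i fall in the window [x - h, x + h]."""
    mask = np.abs(X - x) <= h
    return Y[mask].mean() if mask.any() else np.nan

for h in (0.005, 0.2, 1.0):                    # the three bandwidths used on the slides
    grid = np.linspace(0.05, 0.95, 10)
    print(h, np.round([f_hat(x, h) for x in grid], 2))
```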
Nonparametric regression (7)
How to choose h ?
39/43
Nonparametric regression (8)
Example:
n = 100, f (x) = x(1 − x),
h = .005.
[Plot of the sample and the estimator f̂_{n,h} for h = .005.]
40/43
Nonparametric regression (9)
Example:
n = 100, f (x) = x(1 − x),
h = 1.
[Plot of the sample and the estimator f̂_{n,h} for h = 1.]
41/43
Nonparametric regression (10)
Example:
n = 100, f (x) = x(1 − x),
h = .2.
[Plot of the sample and the estimator f̂_{n,h} for h = .2.]
42/43
Nonparametric regression (11)
Choice of h ?
43/43
Statistics for Applications
1/17
The Bayesian approach (1)
2/17
The Bayesian approach (2)
3/17
The Bayesian approach (3)
Example
◮ Let p be the proportion of women in the population.
◮ E.g., we are 90% sure that p is between .4 and .6, 95% that it
is between .3 and .8, etc...
6/17
The Bayes rule and the posterior distribution (1)
7/17
The Bayes rule and the posterior distribution (2)
8/17
The Bayes rule and the posterior distribution (3)
In the previous example:
◮ Given p, X1, . . . , Xn ∼iid Ber(p), so
pn(X1, . . . , Xn|θ) = p^{Σ_{i=1}^n Xi} (1 − p)^{n − Σ_{i=1}^n Xi}.
◮ Hence,
π(θ|X1, . . . , Xn) ∝ p^{a−1+Σ_{i=1}^n Xi} (1 − p)^{a−1+n−Σ_{i=1}^n Xi}.
9/17
Non informative priors (1)
10/17
Non informative priors (2)
Examples:
◮ If p ∼ U(0, 1) and given p, X1, . . . , Xn ∼iid Ber(p):
π(p|X1, . . . , Xn) ∝ p^{Σ_{i=1}^n Xi} (1 − p)^{n − Σ_{i=1}^n Xi},
i.e., the posterior distribution is
B( 1 + Σ_{i=1}^n Xi , 1 + n − Σ_{i=1}^n Xi ).
◮ If π(θ) = 1, ∀θ ∈ IR and given θ, X1, . . . , Xn ∼iid N(θ, 1):
π(θ|X1, . . . , Xn) ∝ exp( −(1/2) Σ_{i=1}^n (Xi − θ)² ),
i.e., the posterior distribution is
N( X̄n , 1/n ).
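A minimal sketch (our illustration, assuming numpy and scipy; the simulated value of p is arbitrary) of the conjugate update above, where the posterior is B(1 + ΣXi, 1 + n − ΣXi):
```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.6, size=50)             # Bernoulli(0.6) data; 0.6 is arbitrary
s, n = x.sum(), len(x)

posterior = beta(1 + s, 1 + n - s)            # B(1 + sum X_i, 1 + n - sum X_i)
print(posterior.mean())                       # posterior mean of p
print(posterior.interval(0.95))               # a 95% Bayesian credible interval
```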
11/17
Non informative priors (3)
◮ Jeffreys prior: πJ(θ) ∝ √(det I(θ)),
where I(θ) is the Fisher information matrix of the statistical
model associated with X1 , . . . , Xn in the frequentist approach
(provided it exists).
12/17
Non informative priors (4)
13/17
Bayesian confidence regions
IP[θ ∈ R|X1 , . . . , Xn ] = 1 − α.
14/17
Bayesian estimation (1)
◮ The Bayesian framework can also be used to estimate the true
underlying parameter (hence, in a frequentist approach).
◮ Bayes estimator:
θ̂^(π) = ∫_Θ θ dπ(θ|X1, . . . , Xn):
16/17
Bayesian estimation (3)
◮ In the previous examples:
◮ Ex. 1 with prior B(a, a) (a > 0):
p̂^(π) = ( a + Σ_{i=1}^n Xi )/( 2a + n ) = ( a/n + X̄n )/( 2a/n + 1 ).
Statistics for Applications
1/16
Multivariate statistics and review of linear algebra (1)
X = ⎛ · · · X1^⊤ · · · ⎞
    ⎜        ⋮        ⎟
    ⎝ · · · Xn^⊤ · · · ⎠ ,
the n × d matrix whose i-th row is Xi^⊤.
2/16
Multivariate statistics and review of linear algebra (2)
◮ Mean of X:
E[X] = ( E[X¹], . . . , E[X^d] )^⊤.
σ_{j,k} = cov(X^j, X^k).
3/16
Multivariate statistics and review of linear algebra (3)
◮ Empirical mean of X1, . . . , Xn:
X̄ = (1/n) Σ_{i=1}^n Xi = ( X̄¹, . . . , X̄^d )^⊤.
4/16
Multivariate statistics and review of linear algebra (4)
◮ If u ∈ R^d,
◮ u^⊤Σu is the variance of u^⊤X;
◮ u^⊤Su is the sample variance of u^⊤X1, . . . , u^⊤Xn.
5/16
Multivariate statistics and review of linear algebra (5)
6/16
Multivariate statistics and review of linear algebra (6)
◮ In particular, Σ and S are symmetric, positive semi-definite.
A = P D P^⊤,
where:
◮ P is a d × d orthogonal matrix, i.e., P P^⊤ = P^⊤P = Id;
◮ D is diagonal.
8/16
Principal Component Analysis: Heuristics (2)
◮ Idea: Write S = P D P^⊤, where
◮ P = (v1, . . . , vd) is an orthogonal matrix, i.e., ‖vj‖₂ = 1, vj^⊤vk = 0, ∀j ≠ k.
◮ D = diag(λ1, . . . , λd), with λ1 ≥ . . . ≥ λd ≥ 0.
with equality if a = v1.
10/16
Principal Component Analysis: Main principle
� Idea of the PCA: Find the collection of orthogonal directions
in which the cloud is much spread out.
Theorem
v1 ∈ argmax_{‖u‖=1} u^⊤Su,
v2 ∈ argmax_{‖u‖=1, u⊥v1} u^⊤Su,
· · ·
vd ∈ argmax_{‖u‖=1, u⊥vj, j=1,...,d−1} u^⊤Su.
5. Output: Y1, . . . , Yn, where
Yi = Pk^⊤ Xi ∈ R^k, i = 1, . . . , n.
12/16
Principal Component Analysis: Algorithm (2)
Question: How to choose k ?
� Experimental rule: Take k where there is an inflection point in
the sequence λ1 , . . . , λd (scree plot).
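A minimal numpy sketch (our illustration on synthetic data) of this procedure: diagonalize the sample covariance S, sort the eigenvalues, and project onto the top k eigenvectors:
```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k = 200, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated synthetic data

Xc = X - X.mean(axis=0)                 # center the cloud
S = Xc.T @ Xc / n                       # sample covariance matrix
lam, P = np.linalg.eigh(S)              # S = P D P^T (eigh returns ascending eigenvalues)
order = np.argsort(lam)[::-1]           # sort lambda_1 >= ... >= lambda_d
lam, P = lam[order], P[:, order]

P_k = P[:, :k]                          # top k principal directions v_1, ..., v_k
Y = Xc @ P_k                            # Y_i = P_k^T X_i, the projected points
print(lam)                              # scree-plot values: look for an inflection point
```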
13/16
Example: Expression of 500,000 genes among 1400
Europeans
Statistics for Applications
1/52
Linear model
Y |X ∼ N (µ(X), σ 2 I),
And
IE(Y |X) = µ(X) = X ⊤ β,
2/52
Components of a linear model
3/52
Generalization
g( µ(X) ) = X^⊤β
4/52
Example 1: Disease Occuring Rate
5/52
Example 2: Prey Capture Rate(1)
6/52
Example 2: Prey Capture Rate (2)
[Plot of the prey capture rate example (vertical axis from 0.0 to 0.6).]
7/52
Example 2: Prey Capture Rate (3)
g(µi) = 1/µi = 1/α + (h/α)(1/xi) = β0 + β1 · (1/xi).
◮ The standard deviation of capture rate might be
approximately proportional to the mean rate, suggesting the
use of a Gamma distribution for the response.
8/52
Example 3: Kyphosis Data
9/52
GLM: motivation
10/52
Exponential Family
pθ(x) = exp[ Σ_{i=1}^k ηi(θ) Ti(x) − B(θ) ] h(x)
11/52
Normal distribution example
◮ Consider X ∼ N (µ, σ 2 ), θ = (µ, σ 2 ). The density is
pθ(x) = exp( (µ/σ²) x − x²/(2σ²) − µ²/(2σ²) ) · 1/(σ√(2π)),
◮ Poisson(λ): λ^x e^{−λ}/x!, x = 0, 1, . . . .
13/52
Examples of Continuous distributions
The following distributions form continuous exponential families of
distributions with pdf:
◮ Gamma(a, b):
1/( Γ(a) b^a ) · x^{a−1} e^{−x/b};
◮ above: a: shape parameter, b: scale parameter
◮ reparametrize: µ = ab: mean parameter
1/Γ(a) · (a/µ)^a · x^{a−1} e^{−ax/µ}.
◮ Inverse Gamma(α, β): β^α/Γ(α) · x^{−α−1} e^{−β/x}.
◮ Inverse Gaussian(µ, σ²): √( σ²/(2πx³) ) · e^{−σ²(x−µ)²/(2µ²x)}.
Others: Chi-square, Beta, Binomial, Negative binomial
distributions.
14/52
Components of GLM
1. Random component:
g( µ(X) ) = X^⊤β
15/52
One-parameter canonical exponential family
16/52
Normal distribution example
◮ Therefore θ = µ, φ = σ², b(θ) = θ²/2, and
c(y, φ) = −(1/2) ( y²/φ + log(2πφ) ).
2 φ
17/52
Other distributions
18/52
Likelihood
IE( ∂²ℓ/∂θ² ) + IE( (∂ℓ/∂θ)² ) = 0.
Obtained from ∫ fθ(y) dy ≡ 1.
19/52
Expected value
Note that
ℓ(θ) = ( Yθ − b(θ) )/φ + c(Y; φ).
Therefore
∂ℓ/∂θ = ( Y − b′(θ) )/φ.
It yields
0 = IE( ∂ℓ/∂θ ) = ( IE(Y) − b′(θ) )/φ,
which leads to
IE(Y) = µ = b′(θ).
20/52
Variance
∂ℓ/∂θ = ( Y − b′(θ) )/φ = ( Y − IE(Y) )/φ.
Together with the second identity, this yields
var(Y) = b″(θ) φ.
21/52
Example: Poisson distribution
22/52
Link function
µ = g−1 (X ⊤ β).
23/52
Examples of link functions
24/52
Examples of link functions for Bernoulli response (1)
[Plot of the two candidate inverse links on [−5, 5]:]
◮ in blue: f1(x) = e^x/(1 + e^x)
◮ in red: f2(x) = Φ(x) (Gaussian CDF)
25/52
Examples of link functions for Bernoulli response (2)
[Plot of the corresponding link functions on (0, 1):]
◮ in blue: g1(x) = f1^{−1}(x) = log( x/(1 − x) ) (logit link)
◮ in red: g2(x) = f2^{−1}(x) = Φ^{−1}(x) (probit link)
26/52
Canonical Link
g(µ) = θ
27/52
Example: the Bernoulli distribution
b(θ) = log(1 + eθ )
◮ Hence we solve
b′(θ) = exp(θ)/(1 + exp(θ)) = µ  ⇔  θ = log( µ/(1 − µ) )
◮ The canonical link for the Bernoulli distribution is the logit
link.
28/52
Other examples
            b(θ)              g(µ)
Normal      θ²/2              µ
Poisson     exp(θ)            log µ
Bernoulli   log(1 + e^θ)      log( µ/(1 − µ) )
Gamma       − log(−θ)         −1/µ
29/52
Model and notation
µi = b′ (θi )
30/52
Back to β
where h is defined as
31/52
Log-likelihood
up to a constant term.
◮ Note that when we use the canonical link function, we obtain
the simpler expression
ℓn(β, φ; Y, X) = Σ_i ( Yi Xi^⊤β − b(Xi^⊤β) ) / φ
32/52
Strict concavity
33/52
Optimization Methods
34/52
Gradient and Hessian
◮ Suppose f : IRm → IR has two continuous derivatives.
◮ Define the Gradient of f at point x0 , ∇f = ∇f (x0 ), as
x ↦ Hf(x)
◮ If x* is a maximum, we have
∇f (x∗ ) = 0
36/52
Newton-Raphson method
37/52
Fisher-scoring method (1)
38/52
Fisher-scoring method (2)
39/52
Example: Logistic Regression (1)
40/52
Example: Logistic Regression (2)
◮ Thus, we have
ℓn(β|Y, X) = Σ_{i=1}^n ( Yi Xi^⊤β − log( 1 + e^{Xi^⊤β} ) ).
◮ The gradient is
∇ℓn(β) = Σ_{i=1}^n ( Yi − e^{Xi^⊤β}/(1 + e^{Xi^⊤β}) ) Xi.
◮ The Hessian is
Hℓn(β) = − Σ_{i=1}^n e^{Xi^⊤β}/( 1 + e^{Xi^⊤β} )² · Xi Xi^⊤.
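A minimal numpy sketch (our illustration on simulated data) of the Newton–Raphson iteration β ← β − Hℓn(β)^{−1} ∇ℓn(β) using exactly this gradient and Hessian:
```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, 2.0])                  # arbitrary true coefficients
prob = 1 / (1 + np.exp(-X @ beta_true))
Y = rng.binomial(1, prob)

beta = np.zeros(p)
for _ in range(25):                                     # Newton-Raphson iterations
    mu = 1 / (1 + np.exp(-X @ beta))                    # e^{x'b} / (1 + e^{x'b})
    grad = X.T @ (Y - mu)                               # gradient of l_n
    H = -(X.T * (mu * (1 - mu))) @ X                    # Hessian of l_n
    beta = beta - np.linalg.solve(H, grad)              # beta <- beta - H^{-1} grad
print(beta)                                             # close to beta_true
```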
42/52
Iteratively Re-weighted Least Squares
◮ Observe that
µi = b′(θi),  Xi^⊤β = g(µi),  dµi/dθi = b″(θi) ≡ Vi.
43/52
Chain rule
44/52
Gradient
◮ Define
W = diag{W1 , . . . , Wn },
◮ Then, the gradient is
45/52
Hessian
◮ For the Hessian, we have
∂²ℓ/∂βj∂βk = Σ_i ( (Yi − µi)/φ ) h″(Xi^⊤β) Xij Xik − (1/φ) Σ_i h′(Xi^⊤β) Xij ( ∂µi/∂βk )
◮ Note that
∂µi/∂βk = ∂b′(θi)/∂βk = ∂b′( h(Xi^⊤β) )/∂βk = b″(θi) h′(Xi^⊤β) Xik
It yields
IE( Hℓn(β) ) = − (1/φ) Σ_i b″(θi) [ h′(Xi^⊤β) ]² Xi Xi^⊤
46/52
Fisher information
◮ Note that g−1 (·) = b′ ◦ h(·) yields
b″ ∘ h(·) · h′(·) = 1/( g′ ∘ g^{−1}(·) )
◮ As a result
IE( Hℓn(β) ) = − Σ_i h′(Xi^⊤β)/( g′(µi) φ ) · Xi Xi^⊤
◮ Therefore,
I(β) = −IE( Hℓn(β) ) = X^⊤ W X  where  W = diag( h′(Xi^⊤β)/( g′(µi) φ ) )
47/52
Fisher-scoring updates
◮ which is equivalent to
48/52
Weighted least squares (1)
Let us open a parenthesis to talk about Weighted Least Squares.
◮ Assume the linear model Y = Xβ + ε, where ε ∼ Nn(0, W^{−1}) and W^{−1} is an n × n diagonal matrix. When the variances are different, the regression is said to be heteroskedastic.
◮ The maximum likelihood estimator is given by the solution to the weighted least squares problem, minimizing (Y − Xβ)^⊤ W (Y − Xβ), i.e., β̂ = (X^⊤WX)^{−1} X^⊤W Y.
49/52
Weighted least squares (2)
50/52
IRLS procedure (1)
Iteratively Reweighted Least Squares is an iterative procedure to compute the MLE in GLMs using weighted least squares.
We show how to go from β^(k) to β^(k+1):
1. Fix β^(k) and µi^(k) = g^{−1}( Xi^⊤β^(k) );
2. Calculate the adjusted dependent responses
Zi^(k) = Xi^⊤β^(k) + g′(µi^(k)) ( Yi − µi^(k) );
52/52
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.