
Probability and Statistics Cookbook

© Matthias Vallentin, 2015
vallentin@icir.org
26th July, 2015

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [2, 3]. If you find errors or have suggestions for further topics, I would appreciate if you send me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of Parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

For each distribution we list the notation, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).¹

Uniform, Unif{a, …, b}
  F_X(x) = 0 for x < a; (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b; 1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s}) / (s(b − a))

Bernoulli, Bern(p)
  F_X(x) = (1 − p)^{1−x} for x ∈ {0, 1}
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p    V[X] = p(1 − p)
  M_X(s) = 1 − p + pe^s

Binomial, Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)
  M_X(s) = (1 − p + pe^s)^n

Multinomial, Mult(n, p)
  f_X(x) = n!/(x_1! ⋯ x_k!) · p_1^{x_1} ⋯ p_k^{x_k},  Σ_{i=1}^k x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i)
  M_X(s) = (Σ_i p_i e^{s_i})^n

Hypergeometric, Hyp(N, m, n)
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m) / (N²(N − 1))

Negative Binomial, NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²
  M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric, Geo(p)
  F_X(x) = 1 − (1 − p)^x, x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1}, x ∈ N⁺
  E[X] = 1/p    V[X] = (1 − p)/p²
  M_X(s) = pe^s / (1 − (1 − p)e^s)

Poisson, Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!
  f_X(x) = λ^x e^{−λ} / x!
  E[X] = λ    V[X] = λ
  M_X(s) = e^{λ(e^s − 1)}

¹ We use the notation Γ(s, x) and Γ(x) to refer to the Gamma functions (see 22.1), and use B(x, y) and I_x(a, b) to refer to the Beta functions (see 22.2).

[Figure: PMF and CDF plots of the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

1.2 Continuous Distributions

Uniform, Unif(a, b)
  F_X(x) = 0 for x < a; (x − a)/(b − a) for a < x < b; 1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12
  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal, N(μ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))
  E[X] = μ    V[X] = σ²
  M_X(s) = exp(μs + σ²s²/2)

Log-Normal, ln N(μ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − μ)/√(2σ²)]
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − μ)²/(2σ²))
  E[X] = e^{μ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2μ + σ²}

Multivariate Normal, MVN(μ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)ᵀ Σ⁻¹(x − μ))
  E[X] = μ    V[X] = Σ
  M_X(s) = exp(μᵀs + (1/2) sᵀΣs)

Student's t, Student(ν)
  F_X(x) = I_x(ν/2, ν/2)
  f_X(x) = Γ((ν + 1)/2)/(√(νπ) Γ(ν/2)) · (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 for ν > 1    V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2

Chi-square, χ²_k
  F_X(x) = γ(k/2, x/2) / Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k    V[X] = 2k
  M_X(s) = (1 − 2s)^{−k/2} for s < 1/2

F, F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √((d₁x)^{d₁} d₂^{d₂} / (d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) for d₂ > 2    V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) for d₂ > 4

Exponential, Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²
  M_X(s) = 1/(1 − βs) for s < 1/β

Gamma, Gamma(α, β)
  F_X(x) = γ(α, x/β)/Γ(α)
  f_X(x) = (1/(Γ(α)β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²
  M_X(s) = (1/(1 − βs))^α for s < 1/β

Inverse Gamma, InvGamma(α, β)
  F_X(x) = Γ(α, β/x)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) for α > 1    V[X] = β²/((α − 1)²(α − 2)) for α > 2

Dirichlet, Dir(α)
  f_X(x) = Γ(Σ_{i=1}^k α_i)/(∏_{i=1}^k Γ(α_i)) · ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_j α_j    V[X_i] = E[X_i](1 − E[X_i])/(Σ_j α_j + 1)

Beta, Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = Γ(α + β)/(Γ(α)Γ(β)) · x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull, Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − E[X]²
  M_X(s) = Σ_{n=0}^∞ s^n λ^n Γ(1 + n/k)/n!

Pareto, Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m
  f_X(x) = α x_m^α / x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) for α > 1    V[X] = x_m²α/((α − 1)²(α − 2)) for α > 2
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) for s < 0

[Figure: PDF and CDF plots of the continuous Uniform, Normal, Log-Normal, Student's t, Chi-square, F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for various parameter settings.]

2 Probability Theory

Definitions

Sample space Ω
Outcome (point or element) ω ∈ Ω
Event A ⊆ Ω
σ-algebra 𝒜
  1. ∅ ∈ 𝒜
  2. A₁, A₂, … ∈ 𝒜 ⟹ ⋃_{i=1}^∞ A_i ∈ 𝒜
  3. A ∈ 𝒜 ⟹ ¬A ∈ 𝒜
Probability distribution P
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⨆_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
Probability space (Ω, 𝒜, P)

Properties

P[∅] = 0
B = Ω ∩ B = (A ⊔ ¬A) ∩ B = (A ∩ B) ⊔ (¬A ∩ B)
P[¬A] = 1 − P[A]
P[B] = P[A ∩ B] + P[¬A ∩ B]
P[Ω] = 1, P[∅] = 0
DeMorgan: ¬(⋃_n A_n) = ⋂_n ¬A_n and ¬(⋂_n A_n) = ⋃_n ¬A_n
P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]
P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ≤ P[A] + P[B]
P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities

A₁ ⊂ A₂ ⊂ … ⟹ lim_{n→∞} P[A_n] = P[A], where A = ⋃_{i=1}^∞ A_i
A₁ ⊃ A₂ ⊃ … ⟹ lim_{n→∞} P[A_n] = P[A], where A = ⋂_{i=1}^∞ A_i

Independence

A ⫫ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability

P[A | B] = P[A ∩ B] / P[B],  P[B] > 0

Law of Total Probability

P[B] = Σ_{i=1}^n P[B | A_i] P[A_i],  where Ω = ⨆_{i=1}^n A_i

Bayes' Theorem

P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j],  where Ω = ⨆_{i=1}^n A_i

Inclusion-Exclusion Principle

|⋃_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁<⋯<i_r≤n} |⋂_{j=1}^r A_{i_j}|

3 Random Variables

Random Variable (RV)

X : Ω → ℝ

Probability Mass Function (PMF)

f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)

P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)

F_X : ℝ → [0, 1],  F_X(x) = P[X ≤ x]
  1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density

P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy
f_{Y|X}(y | x) = f(x, y) / f_X(x)

Independence of X and Y
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
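
As a quick illustration of the conditional-probability definitions above, the following Python sketch (my addition, not part of the original cookbook) computes a posterior probability with Bayes' theorem and checks it against a Monte Carlo estimate; the prevalence and test error rates are made-up numbers.

```python
import numpy as np

# Bayes' theorem: P[D | +] = P[+ | D] P[D] / (P[+ | D] P[D] + P[+ | not D] P[not D])
p_d, sens, spec = 0.01, 0.99, 0.95            # hypothetical prevalence, sensitivity, specificity
p_pos = sens * p_d + (1 - spec) * (1 - p_d)   # law of total probability
print("analytic P[D | +]  =", sens * p_d / p_pos)

# Monte Carlo check: simulate (disease, test) pairs and condition on a positive test
rng = np.random.default_rng(0)
d = rng.random(1_000_000) < p_d
pos = np.where(d, rng.random(d.size) < sens, rng.random(d.size) < 1 - spec)
print("simulated P[D | +] =", d[pos].mean())
```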

3.1 Transformations

Transformation function: Z = φ(X)

Discrete

f_Z(z) = P[φ(X) = z] = P[{x : φ(x) = z}] = P[X ∈ φ⁻¹(z)] = Σ_{x ∈ φ⁻¹(z)} f(x)

Continuous

F_Z(z) = P[φ(X) ≤ z] = ∫_{A_z} f(x) dx,  with A_z = {x : φ(x) ≤ z}

Special case if φ strictly monotone

f_Z(z) = f_X(φ⁻¹(z)) |dφ⁻¹(z)/dz| = f_X(x) |dx/dz| = f_X(x) / |J|

The Rule of the Lazy Statistician

E[Z] = ∫ φ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution

Z := X + Y:   f_Z(z) = ∫ f_{X,Y}(x, z − x) dx   (= ∫_0^z f_{X,Y}(x, z − x) dx if X, Y ≥ 0)
Z := |X − Y|: f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
Z := X/Y:     f_Z(z) = ∫ |x| f_{X,Y}(x, xz) dx   (= ∫ x f_X(x) f_Y(xz) dx if X ⫫ Y)

4 Expectation

Definition and properties

E[X] = μ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X discrete, ∫ x f_X(x) dx if X continuous
P[X = c] = 1 ⟹ E[X] = c
E[cX] = c E[X]
E[X + Y] = E[X] + E[Y]
E[XY] = ∫∫ xy f_{X,Y}(x, y) dx dy
In general, E[φ(X)] ≠ φ(E[X])  (cf. Jensen inequality)
P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y]
P[X = Y] = 1 ⟹ E[X] = E[Y]
E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X integer-valued)

Sample mean

X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional expectation

E[Y | X = x] = ∫ y f(y | x) dy
E[X] = E[E[X | Y]]
E[φ(X, Y) | X = x] = ∫ φ(x, y) f_{Y|X}(y | x) dy
E[φ(Y, Z) | X = x] = ∫∫ φ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
E[Y + Z | X] = E[Y | X] + E[Z | X]
E[φ(X)Y | X] = φ(X) E[Y | X]
E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Definition and properties

V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]  if the X_i are independent

Standard deviation

sd[X] = √V[X] = σ_X

Covariance

Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
Cov[X, a] = 0
Cov[X, X] = V[X]
Cov[X, Y] = Cov[Y, X]
Cov[aX, bY] = ab Cov[X, Y]
Cov[X + a, Y + b] = Cov[X, Y]
Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation

ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence

X ⫫ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

Conditional variance

V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
V[Y] = E[V[Y | X]] + V[E[Y | X]]
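
A minimal Monte Carlo check of the conditional-variance decomposition above, using a hierarchy I picked purely for illustration (Gamma mixing of a Poisson):

```python
import numpy as np

# Check V[Y] = E[V[Y|X]] + V[E[Y|X]] for X ~ Gamma(shape=3, scale=2), Y | X ~ Poisson(X),
# so E[Y|X] = V[Y|X] = X.
rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=1_000_000)
y = rng.poisson(x)

lhs = y.var()                 # V[Y]
rhs = x.mean() + x.var()      # E[V[Y|X]] + V[E[Y|X]] = E[X] + V[X]
print(lhs, rhs)               # both should be close to 3*2 + 3*2**2 = 18
```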

6 Inequalities

Cauchy-Schwarz: E[XY]² ≤ E[X²] E[Y²]
Markov: P[φ(X) ≥ t] ≤ E[φ(X)] / t
Chebyshev: P[|X − E[X]| ≥ t] ≤ V[X] / t²
Chernoff: P[X ≥ (1 + δ)μ] ≤ (e^δ / (1 + δ)^{1+δ})^μ,  δ > −1
Hoeffding: X₁, …, X_n independent, P[X_i ∈ [a_i, b_i]] = 1, 1 ≤ i ≤ n
  P[X̄ − E[X̄] ≥ t] ≤ e^{−2nt²},  t > 0
  P[|X̄ − E[X̄]| ≥ t] ≤ 2 exp(−2n²t² / Σ_{i=1}^n (b_i − a_i)²),  t > 0
Jensen: E[φ(X)] ≥ φ(E[X]) for convex φ

7 Distribution Relationships

Binomial
  X_i ∼ Bern(p) ⟹ Σ_{i=1}^n X_i ∼ Bin(n, p)
  X ∼ Bin(n, p), Y ∼ Bin(m, p) ⟹ X + Y ∼ Bin(n + m, p)
  lim_{n→∞} Bin(n, p) = Po(np)  (n large, p small)
  lim_{n→∞} Bin(n, p) = N(np, np(1 − p))  (n large, p far from 0 and 1)

Negative Binomial
  X ∼ NBin(1, p) = Geo(p)
  X ∼ NBin(r, p) = Σ_{i=1}^r Geo(p)
  X_i ∼ NBin(r_i, p) ⟹ Σ X_i ∼ NBin(Σ r_i, p)
  X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
  X_i ∼ Po(λ_i), X_i ⫫ X_j ⟹ Σ_{i=1}^n X_i ∼ Po(Σ_{i=1}^n λ_i)
  X_i ∼ Po(λ_i), X_i ⫫ X_j ⟹ X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i / Σ_{j=1}^n λ_j)

Exponential
  X_i ∼ Exp(β), X_i ⫫ X_j ⟹ Σ_{i=1}^n X_i ∼ Gamma(n, β)
  Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
  X ∼ N(μ, σ²) ⟹ (X − μ)/σ ∼ N(0, 1)
  X ∼ N(μ, σ²), Z = aX + b ⟹ Z ∼ N(aμ + b, a²σ²)
  X_i ∼ N(μ_i, σ_i²), X_i ⫫ X_j ⟹ Σ_i X_i ∼ N(Σ_i μ_i, Σ_i σ_i²)
  P[a < X ≤ b] = Φ((b − μ)/σ) − Φ((a − μ)/σ)
  Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
  Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

Gamma
  Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
  X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
  Gamma(α, β) ∼ Σ_{i=1}^α Exp(β) for integer α
  X_i ∼ Gamma(α_i, β), X_i ⫫ X_j ⟹ Σ_i X_i ∼ Gamma(Σ_i α_i, β)

Beta
  f(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
  Beta(1, 1) ∼ Unif(0, 1)
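
A small empirical check of one of the relationships above (my own sketch; n and β are arbitrary illustration values):

```python
import numpy as np

# A sum of n iid Exp(beta) variables behaves like Gamma(n, beta).
rng = np.random.default_rng(2)
n, beta = 5, 2.0
sums = rng.exponential(scale=beta, size=(200_000, n)).sum(axis=1)

print("sample mean/var:", sums.mean(), sums.var())
print("Gamma(n, beta) :", n * beta, n * beta**2)   # E = n*beta, V = n*beta^2
```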

8 Probability and Moment Generating Functions

G_X(t) = E[t^X],  |t| < 1
M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i / i!] = Σ_{i=0}^∞ E[X^i] t^i / i!

Properties
  P[X = 0] = G_X(0)
  P[X = 1] = G′_X(0)
  P[X = i] = G_X^{(i)}(0) / i!
  E[X] = G′_X(1⁻)
  E[X^k] = M_X^{(k)}(0)
  E[X!/(X − k)!] = G_X^{(k)}(1⁻)
  V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
  G_X(t) = G_Y(t) ⟹ X and Y have the same distribution

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z ∼ N(0, 1) with X ⫫ Z, and Y = ρX + √(1 − ρ²) Z.

Joint density

f(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))

Conditionals

(Y | X = x) ∼ N(ρx, 1 − ρ²)  and  (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence: X ⫫ Y ⟺ ρ = 0

9.2 Bivariate Normal

Let X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²) with correlation ρ.

f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp(−z/(2(1 − ρ²)))
where z = ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y)

Conditional mean and variance

E[X | Y] = E[X] + ρ (σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ⁻¹)

Σ = ( V[X₁]         ⋯  Cov[X₁, X_k] )
    ( ⋮              ⋱  ⋮            )
    ( Cov[X_k, X₁]   ⋯  V[X_k]       )

If X ∼ N(μ, Σ),

f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)ᵀ Σ⁻¹(x − μ))

Properties
  Z ∼ N(0, 1) and X = μ + Σ^{1/2} Z ⟹ X ∼ N(μ, Σ)
  X ∼ N(μ, Σ) ⟹ Σ^{−1/2}(X − μ) ∼ N(0, 1)
  X ∼ N(μ, Σ) ⟹ AX ∼ N(Aμ, AΣAᵀ)
  X ∼ N(μ, Σ), a a vector of length k ⟹ aᵀX ∼ N(aᵀμ, aᵀΣa)

10 Convergence

Let {X₁, X₂, …} be a sequence of rvs and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of convergence

1. In distribution (weakly, in law): X_n →ᴰ X
   lim_{n→∞} F_n(t) = F(t) at all t where F is continuous
2. In probability: X_n →ᴾ X
   (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →ᵃˢ X
   P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →ᵠᵐ X
   lim_{n→∞} E[(X_n − X)²] = 0

Relationships

X_n →ᵠᵐ X ⟹ X_n →ᴾ X ⟹ X_n →ᴰ X
X_n →ᵃˢ X ⟹ X_n →ᴾ X
X_n →ᴰ X and (∃c ∈ ℝ) P[X = c] = 1 ⟹ X_n →ᴾ X
X_n →ᴾ X and Y_n →ᴾ Y ⟹ X_n + Y_n →ᴾ X + Y
X_n →ᵠᵐ X and Y_n →ᵠᵐ Y ⟹ X_n + Y_n →ᵠᵐ X + Y
X_n →ᴾ X and Y_n →ᴾ Y ⟹ X_n Y_n →ᴾ XY
X_n →ᴾ X ⟹ φ(X_n) →ᴾ φ(X)
X_n →ᴰ X ⟹ φ(X_n) →ᴰ φ(X)
X_n →ᵠᵐ b ⟺ lim_{n→∞} E[X_n] = b and lim_{n→∞} V[X_n] = 0
X₁, …, X_n iid, E[X] = μ, V[X] < ∞ ⟹ X̄_n →ᵠᵐ μ

Slutzky's Theorem

X_n →ᴰ X and Y_n →ᴾ c ⟹ X_n + Y_n →ᴰ X + c
X_n →ᴰ X and Y_n →ᴾ c ⟹ X_n Y_n →ᴰ cX
In general, X_n →ᴰ X and Y_n →ᴰ Y does not imply X_n + Y_n →ᴰ X + Y

10.1 Law of Large Numbers (LLN)

Let {X₁, …, X_n} be a sequence of iid rvs with E[X₁] = μ.

Weak (WLLN): X̄_n →ᴾ μ as n → ∞
Strong (SLLN): X̄_n →ᵃˢ μ as n → ∞

10.2 Central Limit Theorem (CLT)

Let {X₁, …, X_n} be a sequence of iid rvs with E[X₁] = μ and V[X₁] = σ².

Z_n := (X̄_n − μ)/√(V[X̄_n]) = √n (X̄_n − μ)/σ →ᴰ Z,  where Z ∼ N(0, 1)

lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ ℝ

CLT notations
  Z_n ≈ N(0, 1)
  X̄_n ≈ N(μ, σ²/n)
  X̄_n − μ ≈ N(0, σ²/n)
  √n (X̄_n − μ) ≈ N(0, σ²)
  √n (X̄_n − μ)/σ ≈ N(0, 1)

Continuity correction

P[X̄_n ≤ x] ≈ Φ((x + 1/2 − μ)/(σ/√n))
P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − μ)/(σ/√n))

Delta method

Y_n ≈ N(μ, σ²/n) ⟹ φ(Y_n) ≈ N(φ(μ), (φ′(μ))² σ²/n)
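
A short simulation of the CLT (my own sketch; the sample size and number of replications are arbitrary):

```python
import numpy as np

# Standardized means of skewed iid Exponential(1) samples look approximately N(0, 1).
rng = np.random.default_rng(3)
n, reps = 50, 100_000
x = rng.exponential(scale=1.0, size=(reps, n))        # E[X] = 1, V[X] = 1
z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0         # Z_n = sqrt(n)(Xbar - mu)/sigma

# Compare a few empirical quantiles with the N(0, 1) ones (about -1.64, 0, 1.64)
print(np.quantile(z, [0.05, 0.5, 0.95]))
```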

11 Statistical Inference

Let X₁, …, X_n ∼ F (iid) if not otherwise noted.

11.1 Point Estimation

Point estimator θ̂_n of θ is a rv: θ̂_n = g(X₁, …, X_n)

bias(θ̂_n) = E[θ̂_n] − θ
Consistency: θ̂_n →ᴾ θ
Sampling distribution: F(θ̂_n)
Standard error: se(θ̂_n) = √V[θ̂_n]
Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
lim_{n→∞} bias(θ̂_n) = 0 and lim_{n→∞} se(θ̂_n) = 0 ⟹ θ̂_n is consistent
Asymptotic normality: (θ̂_n − θ)/se →ᴰ N(0, 1)
Slutzky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.

11.2 Normal-Based Confidence Interval

Suppose θ̂_n ≈ N(θ, sê²). Let z_{α/2} = Φ⁻¹(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then

C_n = θ̂_n ± z_{α/2} sê

11.3 Empirical Distribution

Empirical Distribution Function (ECDF)

F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),  where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x

Properties (for any fixed x)
  E[F̂_n(x)] = F(x)
  V[F̂_n(x)] = F(x)(1 − F(x))/n
  mse = F(x)(1 − F(x))/n → 0
  F̂_n(x) →ᴾ F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X₁, …, X_n ∼ F)

P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F

L(x) = max{F̂_n(x) − ε_n, 0}
U(x) = min{F̂_n(x) + ε_n, 1}
ε_n = √((1/(2n)) log(2/α))
P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α

11.4 Statistical Functionals

Statistical functional: T(F)
Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
Linear functional: T(F) = ∫ φ(x) dF_X(x)
Plug-in estimator for linear functional: T(F̂_n) = ∫ φ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n φ(X_i)
Often: T(F̂_n) ≈ N(T(F), sê²) ⟹ T(F̂_n) ± z_{α/2} sê
pth quantile: F⁻¹(p) = inf{x : F(x) ≥ p}
μ̂ = X̄_n
σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²
κ̂ = (1/n) Σ_{i=1}^n (X_i − μ̂)³ / σ̂³  (skewness)
ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / (√(Σ_{i=1}^n (X_i − X̄_n)²) √(Σ_{i=1}^n (Y_i − Ȳ_n)²))
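
A sketch of the ECDF with the DKW-based confidence band from 11.3 (my addition; the sample and the evaluation points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
alpha = 0.05
eps = np.sqrt(np.log(2 / alpha) / (2 * x.size))       # epsilon_n

def ecdf(points, sample):
    """F_hat_n(t) = (1/n) * #{X_i <= t} for each t in points."""
    return np.searchsorted(np.sort(sample), points, side="right") / sample.size

eval_points = np.linspace(-3, 3, 7)
f_hat = ecdf(eval_points, x)
lower = np.clip(f_hat - eps, 0, 1)
upper = np.clip(f_hat + eps, 0, 1)
print(np.c_[eval_points, lower, f_hat, upper])
```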

12 Parametric Inference

Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊆ ℝᵏ and parameter θ = (θ₁, …, θ_k).

12.1 Method of Moments

jth moment

α_j(θ) = E[Xʲ] = ∫ xʲ dF_X(x)

jth sample moment

α̂_j = (1/n) Σ_{i=1}^n X_iʲ

Method of moments estimator (MoM): θ̂_n solves

α₁(θ̂_n) = α̂₁
α₂(θ̂_n) = α̂₂
  ⋮
α_k(θ̂_n) = α̂_k

Properties of the MoM estimator
  θ̂_n exists with probability tending to 1
  Consistency: θ̂_n →ᴾ θ
  Asymptotic normality: √n (θ̂_n − θ) →ᴰ N(0, Σ),
    where Σ = g E[Y Yᵀ] gᵀ, Y = (X, X², …, Xᵏ)ᵀ, g = (g₁, …, g_k) and g_j = ∂α_j⁻¹(θ)/∂θ

12.2 Maximum Likelihood

Likelihood: L_n : Θ → [0, ∞)

L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood

ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum likelihood estimator (mle)

L_n(θ̂_n) = sup_θ L_n(θ)

Score function

s(X; θ) = ∂ log f(X; θ)/∂θ

Fisher information

I(θ) = V_θ[s(X; θ)]
I_n(θ) = n I(θ)

Fisher information (exponential family)

I(θ) = −E_θ[∂s(X; θ)/∂θ]

Observed Fisher information

I_n^obs(θ) = −Σ_{i=1}^n ∂²/∂θ² log f(X_i; θ)

Properties of the mle
  Consistency: θ̂_n →ᴾ θ
  Equivariance: θ̂_n is the mle of θ ⟹ φ(θ̂_n) is the mle of φ(θ)
  Asymptotic normality:
    1. se ≈ √(1/I_n(θ)) and (θ̂_n − θ)/se →ᴰ N(0, 1)
    2. sê ≈ √(1/I_n(θ̂_n)) and (θ̂_n − θ)/sê →ᴰ N(0, 1)
  Asymptotic optimality (or efficiency), i.e., smallest variance for large samples.
    If θ̃_n is any other estimator, the asymptotic relative efficiency is
    are(θ̃_n, θ̂_n) = V[θ̂_n] / V[θ̃_n] ≤ 1
  Approximately the Bayes estimator

12.2.1 Delta Method

If τ = φ(θ), where φ is differentiable and φ′(θ) ≠ 0:

(τ̂_n − τ)/sê(τ̂_n) →ᴰ N(0, 1)

where τ̂_n = φ(θ̂_n) is the mle of τ and

sê(τ̂_n) = |φ′(θ̂_n)| sê(θ̂_n)
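
A small worked MLE example (my own sketch, using the Exp(β) scale parameterization from Section 1.2; the true β and sample size are arbitrary):

```python
import numpy as np

# For Exp(beta), the mle is beta_hat = Xbar_n, the Fisher information is I(beta) = 1/beta^2,
# so se_hat = beta_hat / sqrt(n) = sqrt(1 / I_n(beta_hat)).
rng = np.random.default_rng(5)
beta_true, n = 2.0, 500
x = rng.exponential(scale=beta_true, size=n)

beta_hat = x.mean()                    # maximizes l(beta) = -n log(beta) - sum(x)/beta
se_hat = beta_hat / np.sqrt(n)
z = 1.96                               # approximately z_{alpha/2} for alpha = 0.05
print("beta_hat =", beta_hat)
print("95% CI   =", (beta_hat - z * se_hat, beta_hat + z * se_hat))
```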

12.3 Multiparameter Models

Let θ = (θ₁, …, θ_k) and θ̂ = (θ̂₁, …, θ̂_k) be the mle.

H_jj = ∂²ℓ_n/∂θ_j²    H_jk = ∂²ℓ_n/∂θ_j∂θ_k

Fisher information matrix

I_n(θ) = − ( E_θ[H₁₁]  ⋯  E_θ[H₁k] )
           ( ⋮          ⋱  ⋮        )
           ( E_θ[H_k1]  ⋯  E_θ[H_kk] )

Under appropriate regularity conditions

(θ̂ − θ) ≈ N(0, J_n),  with J_n(θ) = I_n⁻¹(θ)

Further, if θ̂_j is the jth component of θ̂, then

(θ̂_j − θ_j)/sê_j →ᴰ N(0, 1),  where sê_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k)

12.3.1 Multiparameter Delta Method

Let τ = φ(θ₁, …, θ_k) and let the gradient of φ be

∇φ = (∂φ/∂θ₁, …, ∂φ/∂θ_k)ᵀ

Suppose ∇φ evaluated at θ̂ is not 0 and τ̂ = φ(θ̂). Then

(τ̂ − τ)/sê(τ̂) →ᴰ N(0, 1)

where sê(τ̂) = √((∇̂φ)ᵀ Ĵ_n (∇̂φ)), Ĵ_n = J_n(θ̂), and ∇̂φ is ∇φ evaluated at θ = θ̂.

12.4 Parametric Bootstrap

Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the mle or the method of moments estimator.

13 Hypothesis Testing

H₀ : θ ∈ Θ₀  versus  H₁ : θ ∈ Θ₁

Definitions

Null hypothesis H₀
Alternative hypothesis H₁
Simple hypothesis: θ = θ₀
Composite hypothesis: θ > θ₀ or θ < θ₀
Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
Critical value c
Test statistic T
Rejection region R = {x : T(x) > c}
Power function β(θ) = P[X ∈ R]
Power of a test: 1 − P[Type II error] = inf_{θ ∈ Θ₁} β(θ)
Test size: α = P[Type I error] = sup_{θ ∈ Θ₀} β(θ)

             Retain H₀             Reject H₀
H₀ true      correct               Type I error (α)
H₁ true      Type II error (β)     correct (power)

p-value

p-value = sup_{θ ∈ Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
p-value = sup_{θ ∈ Θ₀} P_θ[T(X*) ≥ T(X)] = inf{α : T(X) ∈ R_α},  since T(X*) ∼ F_θ

p-value       evidence
< 0.01        very strong evidence against H₀
0.01 – 0.05   strong evidence against H₀
0.05 – 0.1    weak evidence against H₀
> 0.1         little or no evidence against H₀

Wald test

Two-sided test: reject H₀ when |W| > z_{α/2}, where W = (θ̂ − θ₀)/sê
P[|W| > z_{α/2}] → α
p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)

T(X) = sup_{θ ∈ Θ} L_n(θ) / sup_{θ ∈ Θ₀} L_n(θ) = L_n(θ̂_n) / L_n(θ̂_{n,0})

λ(X) = 2 log T(X) →ᴰ χ²_{r−q}  (r = dim Θ, q = dim Θ₀),
where χ²_k ∼ Σ_{i=1}^k Z_i² and Z₁, …, Z_k ∼ N(0, 1) iid

p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT

mle: p̂_n = (X₁/n, …, X_k/n)
T(X) = L_n(p̂_n)/L_n(p₀) = ∏_{j=1}^k (p̂_j/p₀_j)^{X_j}
λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p₀_j) →ᴰ χ²_{k−1}
The approximate size-α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson Chi-square Test

T = Σ_{j=1}^k (X_j − E[X_j])² / E[X_j],  where E[X_j] = np₀_j under H₀
T →ᴰ χ²_{k−1}
p-value = P[χ²_{k−1} > T(x)]
Converges to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence testing

I rows, J columns, X a multinomial sample of size n = I·J
mles unconstrained: p̂_ij = X_ij/n
mles under H₀: p̂⁰_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(nX_ij/(X_i· X_·j))
Pearson chi-square: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
Both LRT and Pearson →ᴰ χ²_ν, where ν = (I − 1)(J − 1)

14 Exponential Family

Scalar parameter

f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x)g(θ) exp{η(θ)T(x)}

Vector parameter

f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x)g(θ) exp{η(θ)·T(x)}

Natural form

f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x)g(η) exp{η·T(x)} = h(x)g(η) exp{ηᵀT(x)}

15 Bayesian Inference

Bayes' Theorem

f(θ | x) = f(x | θ)f(θ)/f(xⁿ) = f(x | θ)f(θ) / ∫ f(x | θ)f(θ) dθ ∝ L_n(θ)f(θ)

Definitions

Xⁿ = (X₁, …, X_n), xⁿ = (x₁, …, x_n)
Prior density f(θ)
Likelihood f(xⁿ | θ): joint density of the data
  In particular, Xⁿ iid ⟹ f(xⁿ | θ) = ∏_{i=1}^n f(x_i | θ) = L_n(θ)
Posterior density f(θ | xⁿ)
Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ)f(θ) dθ
Kernel: part of a density that depends on θ
Posterior mean θ̄_n = ∫ θ f(θ | xⁿ) dθ = ∫ θ L_n(θ)f(θ) dθ / ∫ L_n(θ)f(θ) dθ

15.1 Credible Intervals

Posterior interval

P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α

Equal-tail credible interval

∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2

Highest posterior density (HPD) region R_n
  1. P[θ ∈ R_n] = 1 − α
  2. R_n = {θ : f(θ | xⁿ) > k} for some k
R_n unimodal ⟹ R_n is an interval
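
A numerical posterior and equal-tail credible interval (my own sketch; the flat prior, synthetic data, and grid size are illustrative, and no conjugacy is used):

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.random(40) < 0.3                              # synthetic Bernoulli(0.3) observations
k, n = data.sum(), data.size

grid = np.linspace(1e-6, 1 - 1e-6, 10_001)
log_post = k * np.log(grid) + (n - k) * np.log1p(-grid)  # flat prior f(p) proportional to 1
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)                             # normalize f(p | x^n)

cdf = np.cumsum(post) * (grid[1] - grid[0])
lo, hi = grid[np.searchsorted(cdf, 0.025)], grid[np.searchsorted(cdf, 0.975)]
print("posterior mean:", np.trapz(grid * post, grid))
print("95% equal-tail credible interval:", (lo, hi))
```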

15.2 Function of Parameters

Let τ = φ(θ) and A = {θ : φ(θ) ≤ τ}.

Posterior CDF for τ

H(τ | xⁿ) = P[φ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ

Posterior density

h(τ | xⁿ) = H′(τ | xⁿ)

Bayesian delta method

τ | Xⁿ ≈ N(φ(θ̂), sê |φ′(θ̂)|)

15.3 Priors

Choice
  Subjective bayesianism.
  Objective bayesianism.
  Robust bayesianism.

Types
  Flat: f(θ) ∝ constant
  Proper: ∫ f(θ) dθ = 1
  Improper: ∫ f(θ) dθ = ∞
  Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ),  f(θ) ∝ √det(I(θ))
  Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

15.3.1 Conjugate Priors

Continuous likelihood (subscript c denotes a constant)

Likelihood          Conjugate prior                       Posterior hyperparameters
Unif(0, θ)          Pareto(x_m, k)                        max{x_(n), x_m},  k + n
Exp(λ)              Gamma(α, β)                           α + n,  β + Σ_{i=1}^n x_i
N(μ, σ_c²)          N(μ₀, σ₀²)                            (μ₀/σ₀² + Σ_{i=1}^n x_i/σ_c²)/(1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)⁻¹
N(μ_c, σ²)          Scaled Inverse Chi-square(ν, σ₀²)     ν + n,  (νσ₀² + Σ_{i=1}^n (x_i − μ_c)²)/(ν + n)
N(μ, σ²)            Normal-scaled Inverse Gamma(λ, ν, α, β)   (νλ + n x̄)/(ν + n),  ν + n,  α + n/2,  β + (1/2)Σ_{i=1}^n (x_i − x̄)² + nν(x̄ − λ)²/(2(n + ν))
MVN(μ, Σ_c)         MVN(μ₀, Σ₀)                           (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹μ₀ + nΣ_c⁻¹x̄),  (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹
MVN(μ_c, Σ)         InverseWishart(κ, Ψ)                  κ + n,  Ψ + Σ_{i=1}^n (x_i − μ_c)(x_i − μ_c)ᵀ
Pareto(x_mc, k)     Gamma(α, β)                           α + n,  β + Σ_{i=1}^n log(x_i/x_mc)
Pareto(x_m, k_c)    Pareto(x₀, k₀)                        x₀,  k₀ − nk_c  where k₀ > nk_c
Gamma(α_c, β)       Gamma(α₀, β₀)                         α₀ + nα_c,  β₀ + Σ_{i=1}^n x_i

Discrete likelihood

Likelihood          Conjugate prior     Posterior hyperparameters
Bern(p)             Beta(α, β)          α + Σ_{i=1}^n x_i,  β + n − Σ_{i=1}^n x_i
Bin(p)              Beta(α, β)          α + Σ_{i=1}^n x_i,  β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i
NBin(p)             Beta(α, β)          α + rn,  β + Σ_{i=1}^n x_i
Po(λ)               Gamma(α, β)         α + Σ_{i=1}^n x_i,  β + n
Multinomial(p)      Dir(α)              α + Σ_{i=1}^n x^(i)
Geo(p)              Beta(α, β)          α + n,  β + Σ_{i=1}^n x_i

15.4 Bayesian Testing

If H₀ : θ ∈ Θ₀:

Prior probability P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ

Let H₀, …, H_{K−1} be K hypotheses. Suppose θ ∼ f(θ | H_k):

P[H_k | xⁿ] = f(xⁿ | H_k)P[H_k] / Σ_{k=1}^K f(xⁿ | H_k)P[H_k]

Marginal likelihood

f(xⁿ | H_i) = ∫ f(xⁿ | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)

P[H_i | xⁿ]/P[H_j | xⁿ] = (f(xⁿ | H_i)/f(xⁿ | H_j)) × (P[H_i]/P[H_j]) = Bayes factor BF_ij × prior odds

Bayes factor

p* = (BF₁₀ p/(1 − p)) / (1 + BF₁₀ p/(1 − p)),  where p = P[H₁] and p* = P[H₁ | xⁿ]

log₁₀ BF₁₀    BF₁₀        evidence
0 – 0.5       1 – 1.5     Weak
0.5 – 1       1.5 – 10    Moderate
1 – 2         10 – 100    Strong
> 2           > 100       Decisive
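
A one-line conjugate update for the Bern(p)/Beta pair from the table in 15.3.1 (my own sketch; prior hyperparameters and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = (rng.random(30) < 0.7).astype(int)     # synthetic Bernoulli data

alpha0, beta0 = 2.0, 2.0                   # Beta(alpha, beta) prior
alpha_n = alpha0 + x.sum()                 # alpha + sum x_i
beta_n = beta0 + x.size - x.sum()          # beta + n - sum x_i
print(f"posterior: Beta({alpha_n}, {beta_n})")
print("posterior mean:", alpha_n / (alpha_n + beta_n))
```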

16 Sampling Methods

16.1 Inverse Transform Sampling

Setup
  U ∼ Unif(0, 1)
  X ∼ F
  F⁻¹(u) = inf{x : F(x) ≥ u}

Algorithm
  1. Generate u ∼ Unif(0, 1)
  2. Compute x = F⁻¹(u)
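
A minimal inverse-transform example (my own sketch; β is an arbitrary value):

```python
import numpy as np

# For Exp(beta), F(x) = 1 - exp(-x/beta), so F^{-1}(u) = -beta * log(1 - u).
rng = np.random.default_rng(8)
beta = 2.0
u = rng.random(100_000)          # step 1: u ~ Unif(0, 1)
x = -beta * np.log1p(-u)         # step 2: x = F^{-1}(u)
print(x.mean(), x.var())         # should be close to beta and beta**2
```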

16.2 The Bootstrap

Let T_n = g(X₁, …, X_n) be a statistic.

1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
   (a) Repeat the following B times to get T*_{n,1}, …, T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
       i. Sample uniformly X*₁, …, X*_n ∼ F̂_n.
       ii. Compute T*_n = g(X*₁, …, X*_n).
   (b) Then
       v_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²
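
A direct implementation of the two-step bootstrap recipe above (my own sketch; the statistic, sample, and B are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.lognormal(size=100)
B = 2_000

t_star = np.array([
    np.median(rng.choice(x, size=x.size, replace=True))   # resample from F_hat_n, recompute T_n
    for _ in range(B)
])
v_boot = t_star.var()
print("bootstrap se of the median:", np.sqrt(v_boot))
```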

16.2.1 Bootstrap Confidence Intervals

Normal-based interval

T_n ± z_{α/2} sê_boot

Pivotal interval
  1. Location parameter θ = T(F)
  2. Pivot R_n = θ̂_n − θ
  3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
  4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
     Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
  5. θ*_β = β sample quantile of (θ̂*_{n,1}, …, θ̂*_{n,B})
  6. r*_β = β sample quantile of (R*_{n,1}, …, R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
  7. Approximate 1 − α confidence interval C_n = (â, b̂), where
     â = θ̂_n − Ĥ⁻¹(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
     b̂ = θ̂_n − Ĥ⁻¹(α/2)     = θ̂_n − r*_{α/2}     = 2θ̂_n − θ*_{α/2}

Percentile interval

C_n = (θ*_{α/2}, θ*_{1−α/2})

16.3 Rejection Sampling

Setup
  We can easily sample from g(θ)
  We want to sample from h(θ), but it is difficult
  We know h(θ) up to a proportional constant: h(θ) = k(θ) / ∫ k(θ) dθ
  Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ)

Algorithm
  1. Draw θ_cand ∼ g(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ k(θ_cand) / (M g(θ_cand))
  4. Repeat until B values of θ_cand have been accepted

Example
  We can easily sample from the prior g(θ) = f(θ)
  Target is the posterior h(θ) ∝ k(θ) = f(xⁿ | θ)f(θ)
  Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂_n) = L_n(θ̂_n) ≡ M
  Algorithm
    1. Draw θ_cand ∼ f(θ)
    2. Generate u ∼ Unif(0, 1)
    3. Accept θ_cand if u ≤ L_n(θ_cand) / L_n(θ̂_n)

16.4 Importance Sampling

Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
  1. Sample from the prior θ₁, …, θ_B ∼ f(θ) iid
  2. w_i = L_n(θ_i) / Σ_{i=1}^B L_n(θ_i),  i = 1, …, B
  3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θ_i) w_i
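
A concrete instance of the rejection-sampling algorithm from 16.3 (my own sketch; the unnormalized target and envelope constant are chosen purely for illustration):

```python
import numpy as np

# Target known up to a constant: k(t) = t^2 * (1 - t) on [0, 1] (the Beta(3, 2) kernel).
# Proposal g = Unif(0, 1); envelope M = 0.15 >= max k, since max k = 4/27 ~ 0.148.
rng = np.random.default_rng(10)

def k(t):
    return t**2 * (1 - t)

M, accepted = 0.15, []
while len(accepted) < 10_000:
    cand = rng.random()                  # 1. draw from the proposal
    u = rng.random()                     # 2. u ~ Unif(0, 1)
    if u <= k(cand) / M:                 # 3. accept with probability k(cand) / (M g(cand))
        accepted.append(cand)

samples = np.array(accepted)
print(samples.mean())                    # Beta(3, 2) has mean 0.6
```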

17 Decision Theory

Definitions

Unknown quantity affecting our decision: θ ∈ Θ
Decision rule: synonymous with an estimator θ̂
Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
Loss function L: consequences of taking action a when the true state is θ, or discrepancy between θ and θ̂.  L : Θ × A → [−k, ∞).

Loss functions
  Squared error loss: L(θ, a) = (θ − a)²
  Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0, K₂(a − θ) if a − θ ≥ 0
  Absolute error loss: L(θ, a) = |θ − a|  (linear loss with K₁ = K₂)
  Lᵖ loss: L(θ, a) = |θ − a|ᵖ
  Zero-one loss: L(θ, a) = 0 if a = θ, 1 if a ≠ θ

17.1 Risk

Posterior risk

r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk

R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk

r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility

θ̂′ dominates θ̂ if
  ∀θ : R(θ, θ̂′) ≤ R(θ, θ̂)
  ∃θ : R(θ, θ̂′) < R(θ, θ̂)
θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)

r(f, θ̂) = inf_{θ̃} r(f, θ̃)
θ̂(x) = inf r(θ̂ | x) for all x ⟹ r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems
  Squared error loss: posterior mean
  Absolute error loss: posterior median
  Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk

R̄(θ̂) = sup_θ R(θ, θ̂)    R̄(a) = sup_θ R(θ, a)

Minimax rule

sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)

θ̂ = Bayes rule and ∃c : R(θ, θ̂) = c ⟹ θ̂ is minimax
Least favorable prior: θ̂_f = Bayes rule and R(θ, θ̂_f) ≤ r(f, θ̂_f) for all θ ⟹ θ̂_f is minimax

18 Linear Regression

Definitions
  Response variable Y
  Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model

Y_i = β₀ + β₁X_i + ε_i,  E[ε_i | X_i] = 0, V[ε_i | X_i] = σ²

Fitted line

r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values

Ŷ_i = r̂(X_i)

Residuals

ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual sums of squares (rss)

rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²

Least square estimates

β̂ᵀ = (β̂₀, β̂₁)ᵀ minimizes rss:

β̂₀ = Ȳ_n − β̂₁X̄_n
β̂₁ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / Σ_{i=1}^n (X_i − X̄_n)² = (Σ_{i=1}^n X_iY_i − nX̄Ȳ) / (Σ_{i=1}^n X_i² − nX̄²)

E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) ( (1/n)Σ_{i=1}^n X_i²   −X̄_n ; −X̄_n   1 )
sê(β̂₀) = (σ̂/(s_X√n)) √((1/n)Σ_{i=1}^n X_i²)
sê(β̂₁) = σ̂/(s_X√n)

where s_X² = (1/n) Σ_{i=1}^n (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² (unbiased estimate).
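
A least-squares fit computing exactly the estimates and standard errors just given (my own sketch; the synthetic data use arbitrary true values β₀ = 1, β₁ = 2, σ = 0.5):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
x = rng.uniform(0, 5, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
sigma2_hat = resid @ resid / (n - 2)                    # unbiased estimate of sigma^2
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
print("b0, b1:", b0, b1)
print("se(b1):", se_b1)
```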

Further properties:

Consistency: β̂₀ →ᴾ β₀ and β̂₁ →ᴾ β₁

Asymptotic normality:

(β̂₀ − β₀)/sê(β̂₀) →ᴰ N(0, 1)  and  (β̂₁ − β₁)/sê(β̂₁) →ᴰ N(0, 1)

Approximate 1 − α confidence intervals for β₀ and β₁:

β̂₀ ± z_{α/2} sê(β̂₀)  and  β̂₁ ± z_{α/2} sê(β̂₁)

Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2}, where W = β̂₁/sê(β̂₁).

R²

R² = Σ_{i=1}^n (Ŷ_i − Ȳ)² / Σ_{i=1}^n (Y_i − Ȳ)² = 1 − Σ_{i=1}^n ε̂_i² / Σ_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss

Likelihood

L = ∏_{i=1}^n f(X_i, Y_i) = ∏_{i=1}^n f_X(X_i) × ∏_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂
L₁ = ∏_{i=1}^n f_X(X_i)
L₂ = ∏_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp(−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²)

Under the assumption of Normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE:

σ̂²_mle = (1/n) Σ_{i=1}^n ε̂_i²

18.2 Prediction

Observe X = x_* of the covariate and want to predict the outcome Y_*.

Ŷ_* = β̂₀ + β̂₁x_*
V[Ŷ_*] = V[β̂₀] + x_*² V[β̂₁] + 2x_* Cov[β̂₀, β̂₁]

Prediction interval

ξ̂_n² = σ̂² (Σ_{i=1}^n (X_i − x_*)² / (n Σ_i (X_i − X̄)²) + 1)
Ŷ_* ± z_{α/2} ξ̂_n

18.3 Multiple Regression

Y = Xβ + ε,  where

X = ( X₁₁ ⋯ X₁k ; ⋮ ⋱ ⋮ ; X_n1 ⋯ X_nk ),  β = (β₁, …, β_k)ᵀ,  ε = (ε₁, …, ε_n)ᵀ

Likelihood

L(μ, Σ) = (2πσ²)^{−n/2} exp(−rss/(2σ²))

rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^N (Y_i − x_iᵀβ)²

If the (k × k) matrix XᵀX is invertible,

β̂ = (XᵀX)⁻¹XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)⁻¹
β̂ ≈ N(β, σ²(XᵀX)⁻¹)

Estimate regression function

r̂(x) = Σ_{j=1}^k β̂_j x_j

Unbiased estimate for σ²

σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y

mle: σ̂²_mle = ((n − k)/n) σ̂²

1 − α confidence interval

β̂_j ± z_{α/2} sê(β̂_j)
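
The matrix form of the estimator above, solved via the normal equations (my own sketch; the design matrix and coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(12)
n, k = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # intercept + 2 covariates
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X'X)^{-1} X'Y via the normal equations
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)              # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)    # estimate of V[beta_hat | X]
print(beta_hat)
print(np.sqrt(np.diag(cov_beta)))                 # standard errors
```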

18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊆ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
  Underfitting: too few covariates yields high bias
  Overfitting: too many covariates yields high variance

Procedure
  1. Assign a score to each model
  2. Search through all models to find the one with the highest score

Hypothesis testing

H₀ : β_j = 0 vs. H₁ : β_j ≠ 0,  ∀j ∈ J

Mean squared prediction error (mspe)

mspe = E[(Ŷ(S) − Y*)²]

Prediction risk

R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y*_i)²]

Training error

R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²

R²

R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss = 1 − Σ_{i=1}^n (Ŷ_i(S) − Y_i)² / Σ_{i=1}^n (Y_i − Ȳ)²

The training error is a downward-biased estimate of the prediction risk:

E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²

R̄²(S) = 1 − ((n − 1)/(n − k)) · rss/tss

Mallows' C_p statistic

R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)

AIC(S) = ℓ_n(β̂_S, σ̂²_S) − k

Bayesian Information Criterion (BIC)

BIC(S) = ℓ_n(β̂_S, σ̂²_S) − (k/2) log n

Validation and training

R̂_V(S) = Σ_{i=1}^m (Ŷ*_i(S) − Y*_i)²,  m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation

R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})² = Σ_{i=1}^n ((Y_i − Ŷ_i(S)) / (1 − U_ii(S)))²

where U(S) = X_S(X_Sᵀ X_S)⁻¹X_Sᵀ is the hat matrix.

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where f(x) satisfies P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ise)

L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk

R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂_n(x)] − f(x)
v(x) = V[f̂_n(x)]

19.1.1 Histograms

Definitions
  Number of bins m
  Binwidth h = 1/m
  Bin B_j has ν_j observations
  Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator

f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)

E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6 / ∫ (f′(u))² du)^{1/3}
R*(f̂_n, f) ≈ (C/n^{2/3}) (∫ (f′(u))² du)^{1/3},  C = (3/4)^{2/3}

Cross-validation estimate of E[J(h)]

Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²

19.1.2 Kernel Density Estimator (KDE)

Kernel K
  K(x) ≥ 0
  ∫ K(x) dx = 1
  ∫ xK(x) dx = 0
  ∫ x²K(x) dx ≡ σ²_K > 0

KDE

f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)

R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c₁^{−2/5} c₂^{1/5} c₃^{−1/5} n^{−1/5},  where c₁ = σ²_K, c₂ = ∫ K²(x) dx, c₃ = ∫ (f″(x))² dx
R*(f, f̂_n) = (c₄/n^{4/5}) (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5} ≡ C(K)/n^{4/5},  c₄ = (5/4)(σ²_K)^{2/5}

Epanechnikov Kernel

K(x) = (3/(4√5))(1 − x²/5) for |x| < √5, 0 otherwise

Cross-validation estimate of E[J(h)]

Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)
        ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)

where K*(x) = K^{(2)}(x) − 2K(x) and K^{(2)}(x) = ∫ K(x − y)K(y) dy.

19.2 Non-parametric Regression

Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), …, (x_n, Y_n) related by

Y_i = r(x_i) + ε_i,  E[ε_i] = 0,  V[ε_i] = σ²

k-nearest Neighbor Estimator

r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i,  where N_k(x) = {k values of x₁, …, x_n closest to x}
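
A minimal Gaussian KDE following 19.1.2 (my own sketch; the normal-reference bandwidth h ≈ 1.06 σ̂ n^{−1/5} is a common default, not something prescribed above):

```python
import numpy as np

rng = np.random.default_rng(13)
data = rng.normal(size=400)

def kde(x, sample, h):
    """f_hat_n(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a Gaussian kernel K."""
    u = (x[:, None] - sample[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.mean(axis=1) / h

h = 1.06 * data.std() * data.size ** (-1 / 5)
grid = np.linspace(-3, 3, 5)
print(np.c_[grid, kde(grid, data, h)])     # compare against the N(0, 1) density
```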

Nadaraya-Watson Kernel Estimator

r̂(x) = Σ_{i=1}^n w_i(x) Y_i,  w_i(x) = K((x − x_i)/h) / Σ_{j=1}^n K((x − x_j)/h)

R(r̂_n, r) ≈ (h⁴/4) (∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ dx/f(x)
h* ≈ c₁ n^{−1/5},  R*(r̂_n, r) ≈ c₂ n^{−4/5}

Cross-validation estimate of E[J(h)]

Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n ((Y_i − r̂(x_i)) / (1 − K(0)/Σ_{j=1}^n K((x − x_j)/h)))²

19.3 Smoothing Using Orthogonal Functions

Approximation

r(x) = Σ_{j=1}^∞ β_j φ_j(x) ≈ Σ_{j=1}^J β_j φ_j(x)

Multivariate regression

Y = Φβ + η,  where η_i = ε_i and

Φ = ( φ₀(x₁) ⋯ φ_J(x₁) ; ⋮ ⋱ ⋮ ; φ₀(x_n) ⋯ φ_J(x_n) )

Least squares estimator

β̂ = (ΦᵀΦ)⁻¹ΦᵀY ≈ (1/n)ΦᵀY  (for equally spaced observations only)

Cross-validation estimate of E[J(h)]

R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i)β̂_{j,(−i)})²

20 Stochastic Processes

Stochastic Process

{X_t : t ∈ T},  T = {0, 1, …} = Z (discrete) or [0, ∞) (continuous)

Notations: X_t, X(t)
State space 𝒳
Index set T

20.1 Markov Chains

Markov chain

P[X_n = x | X₀, …, X_{n−1}] = P[X_n = x | X_{n−1}],  ∀n ∈ T, x ∈ 𝒳

Transition probabilities

p_ij ≡ P[X_{n+1} = j | X_n = i]
p_ij(n) ≡ P[X_{m+n} = j | X_m = i]  (n-step)
Transition matrix P (n-step: P_n), with (i, j) element p_ij
  p_ij ≥ 0 and Σ_j p_ij = 1

Chapman-Kolmogorov

p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
P_{m+n} = P_m P_n
P_n = P × ⋯ × P = Pⁿ

Marginal probability

μ_n = (μ_n(1), …, μ_n(N)),  where μ_n(i) = P[X_n = i]
μ₀ = initial distribution
μ_n = μ₀Pⁿ
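
A short Markov-chain computation using the n-step relation P_n = Pⁿ and μ_n = μ₀Pⁿ (my own sketch; the 3-state transition matrix is made up):

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
mu0 = np.array([1.0, 0.0, 0.0])       # start in state 0

Pn = np.linalg.matrix_power(P, 50)    # n-step transition probabilities
print(mu0 @ Pn)                       # marginal distribution after 50 steps

# For large n the rows of P^n approach the stationary distribution pi with pi P = pi.
print(Pn[0])
```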

20.2 Poisson Processes

Poisson process

{X_t : t ∈ [0, ∞)} = number of events up to and including time t
X₀ = 0
Independent increments: ∀t₀ < ⋯ < t_n : X_{t₁} − X_{t₀} ⫫ ⋯ ⫫ X_{t_n} − X_{t_{n−1}}

Intensity function λ(t)
  P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  P[X_{t+h} − X_t = 2] = o(h)
  X_{s+t} − X_s ∼ Po(m(s + t) − m(s)),  where m(t) = ∫₀ᵗ λ(s) ds

Homogeneous Poisson process

λ(t) ≡ λ ⟹ X_t ∼ Po(λt),  λ > 0

Waiting times

W_t := time at which X_t occurs
W_t ∼ Gamma(t, 1/λ)

Interarrival times

S_t = W_{t+1} − W_t
S_t ∼ Exp(1/λ)

21 Time Series

Mean function

μ_{xt} = E[x_t] = ∫_{−∞}^{∞} x f_t(x) dx

Autocovariance function

γ_x(s, t) = E[(x_s − μ_s)(x_t − μ_t)] = E[x_s x_t] − μ_s μ_t
γ_x(t, t) = E[(x_t − μ_t)²] = V[x_t]

Autocorrelation function (ACF)

ρ(s, t) = Cov[x_s, x_t] / √(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s)γ(t, t))

Cross-covariance function (CCV)

γ_xy(s, t) = E[(x_s − μ_{xs})(y_t − μ_{yt})]

Cross-correlation function (CCF)

ρ_xy(s, t) = γ_xy(s, t) / √(γ_x(s, s) γ_y(t, t))

Backshift operator

Bᵏ(x_t) = x_{t−k}

Difference operator

∇ᵈ = (1 − B)ᵈ

White noise

w_t ∼ wn(0, σ_w²)
Gaussian: w_t ∼ N(0, σ_w²) iid
E[w_t] = 0 and V[w_t] = σ_w² for all t ∈ T
γ_w(s, t) = 0 for s ≠ t, s, t ∈ T

Random walk

Drift δ
x_t = δt + Σ_{j=1}^t w_j
E[x_t] = δt

Symmetric moving average

m_t = Σ_{j=−k}^k a_j x_{t−j},  where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^k a_j = 1
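
A homogeneous-Poisson-process simulation based on the exponential interarrival times from 20.2 (my own sketch; the rate and horizon are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(14)
lam, T = 3.0, 100.0

# Interarrival times are Exp(1/lam), so the count in [0, T] is roughly Po(lam * T).
arrivals = np.cumsum(rng.exponential(scale=1 / lam, size=int(lam * T * 2)))
count = np.searchsorted(arrivals, T)      # X_T = number of events up to time T
print(count, "events; expected about", lam * T)
```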

21.1 Stationary Time Series

Strictly stationary

P[x_{t₁} ≤ c₁, …, x_{t_k} ≤ c_k] = P[x_{t₁+h} ≤ c₁, …, x_{t_k+h} ≤ c_k],  ∀k ∈ N, t_k, c_k, h ∈ Z

Weakly stationary

E[x_t²] < ∞  ∀t ∈ Z
E[x_t] = m  ∀t ∈ Z
γ_x(s, t) = γ_x(s + r, t + r)  ∀r, s, t ∈ Z

Autocovariance function

γ(h) = E[(x_{t+h} − μ)(x_t − μ)]
γ(0) = E[(x_t − μ)²]
γ(0) ≥ 0,  γ(0) ≥ |γ(h)|,  γ(h) = γ(−h)  ∀h ∈ Z

Autocorrelation function (ACF)

ρ_x(h) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = Cov[x_{t+h}, x_t]/√(V[x_{t+h}] V[x_t]) = γ(h)/γ(0)

Jointly stationary time series

γ_xy(h) = E[(x_{t+h} − μ_x)(y_t − μ_y)]
ρ_xy(h) = γ_xy(h)/√(γ_x(0) γ_y(0))

Linear process

x_t = μ + Σ_{j=−∞}^∞ ψ_j w_{t−j},  where Σ_{j=−∞}^∞ |ψ_j| < ∞
γ(h) = σ_w² Σ_{j=−∞}^∞ ψ_{j+h} ψ_j

21.2 Estimation of Correlation

Sample mean

x̄ = (1/n) Σ_{t=1}^n x_t

Sample variance

V[x̄] = (1/n) Σ_{h=−n}^n (1 − |h|/n) γ_x(h)

Sample autocovariance function

γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function

ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function

γ̂_xy(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function

ρ̂_xy(h) = γ̂_xy(h)/√(γ̂_x(0) γ̂_y(0))

Properties

σ_{ρ̂_x(h)} = 1/√n if x_t is white noise
σ_{ρ̂_xy(h)} = 1/√n if x_t or y_t is white noise

21.3 Non-Stationary Time Series

Classical decomposition model

x_t = μ_t + s_t + w_t
  μ_t = trend
  s_t = seasonal component
  w_t = random noise term
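
A sample-ACF computation following 21.2, applied to a simulated AR(1) series whose theoretical ACF is ρ(h) = φʰ (my own sketch; φ and n are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(15)
phi, n = 0.6, 5_000
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

def acf(series, h):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0), as defined above."""
    xbar = series.mean()
    g0 = np.mean((series - xbar) ** 2)
    gh = np.mean((series[h:] - xbar) * (series[:len(series) - h] - xbar))
    return gh / g0

print([round(acf(x, h), 3) for h in range(1, 5)])   # compare with phi**h
```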

21.3.1 Detrending

Least squares
  1. Choose a trend model, e.g., μ_t = β₀ + β₁t + β₂t²
  2. Minimize rss to obtain the trend estimate μ̂_t = β̂₀ + β̂₁t + β̂₂t²
  3. The residuals serve as the noise estimate ŵ_t

Moving average

The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1):

v_t = (1/(2k + 1)) Σ_{i=−k}^k x_{t−i}

If (1/(2k + 1)) Σ_{i=−k}^k w_{t−i} ≈ 0, a linear trend function μ_t = β₀ + β₁t passes without distortion.

Differencing

μ_t = β₀ + β₁t ⟹ ∇x_t = β₁ + ∇w_t  (stationary)

21.4 ARIMA Models

Autoregressive polynomial

φ(z) = 1 − φ₁z − ⋯ − φ_p z^p,  z ∈ ℂ and φ_p ≠ 0

Autoregressive operator

φ(B) = 1 − φ₁B − ⋯ − φ_p B^p

Autoregressive model of order p, AR(p)

x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t  ⟺  φ(B)x_t = w_t

AR(1)

x_t = φᵏ x_{t−k} + Σ_{j=0}^{k−1} φʲ w_{t−j}  →(k→∞, |φ|<1)  Σ_{j=0}^∞ φʲ w_{t−j}  (linear process)
E[x_t] = Σ_{j=0}^∞ φʲ E[w_{t−j}] = 0
γ(h) = Cov[x_{t+h}, x_t] = σ_w² φʰ/(1 − φ²)
ρ(h) = γ(h)/γ(0) = φʰ
ρ(h) = φρ(h − 1),  h = 1, 2, …

Moving average polynomial

θ(z) = 1 + θ₁z + ⋯ + θ_q z^q,  z ∈ ℂ and θ_q ≠ 0

Moving average operator

θ(B) = 1 + θ₁B + ⋯ + θ_q B^q

MA(q) (moving average model of order q)

x_t = w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q}  ⟺  x_t = θ(B)w_t

E[x_t] = Σ_{j=0}^q θ_j E[w_{t−j}] = 0
γ(h) = Cov[x_{t+h}, x_t] = σ_w² Σ_{j=0}^{q−h} θ_j θ_{j+h} for 0 ≤ h ≤ q, and 0 for h > q

MA(1)

x_t = w_t + θw_{t−1}
γ(h) = (1 + θ²)σ_w² for h = 0,  θσ_w² for h = 1,  0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1,  0 for h > 1

ARMA(p, q)

x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q}  ⟺  φ(B)x_t = θ(B)w_t

Partial autocorrelation function (PACF)

x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x₁}
φ_hh = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}) for h ≥ 2
E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

ARIMA(p, d, q)

∇ᵈx_t = (1 − B)ᵈ x_t is ARMA(p, q)
φ(B)(1 − B)ᵈ x_t = θ(B)w_t

Exponentially Weighted Moving Average (EWMA)

x_t = x_{t−1} + w_t − λw_{t−1}
x_t = Σ_{j=1}^∞ (1 − λ)λ^{j−1} x_{t−j} + w_t,  when |λ| < 1
x̃_{n+1} = (1 − λ)x_n + λx̃_n

Seasonal ARIMA

Denoted by ARIMA(p, d, q) × (P, D, Q)_s

Φ_P(B^s)φ(B)∇_s^D ∇ᵈ x_t = δ + Θ_Q(B^s)θ(B)w_t
21.4.1

Causality and Invertibility

j=0

j < such that

wtj = (B)wt

j=0

ARMA (p, q) is invertible {j } :


(B)xt =

(Uk1 cos(2k t) + Uk2 sin(2k t))

k=1

ARMA (p, q) is causal (future-independent) {j } :


xt =

q
X

Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rvs with variances k2


Pq
(h) = k=1 k2 cos(2k h)
  Pq
(0) = E x2t = k=1 k2
Spectral representation of a periodic process

j=0

j < such that

(h) = 2 cos(20 h)
2 2i0 h 2 2i0 h
e
+
e
2
2
Z 1/2
=
e2ih dF ()

Xtj = wt

j=0

Properties

1/2

ARMA (p, q) causal roots of (z) lie outside the unit circle

(z)
(z) =
j z =
(z)
j=0
j

Spectral distribution function

0
F () = 2 /2

|z| 1

< 0
< 0
0

ARMA (p, q) invertible roots of (z) lie outside the unit circle

(z)
j z j =
(z) =
(z)
j=0

F () = F (1/2) = 0
F () = F (1/2) = (0)

|z| 1

Spectral density

Behavior of the ACF and PACF for causal and invertible ARMA models
ACF
PACF

21.5

AR (p)
tails off
cuts off after lag p

MA (q)
cuts off after lag q
tails off q

ARMA (p, q)
tails off
tails off

Spectral Analysis

Periodic process
xt = A cos(2t + )
= U1 cos(2t) + U2 sin(2t)

Frequency index (cycles per unit time), period 1/


Amplitude A
Phase
U1 = A cos and U2 = A sin often normally distributed rvs

(h)e2ih

|(h)| < = (h) =

R 1/2

f () =

h=

Needs

h=

1
1

2
2

1/2

e2ih f () d

h = 0, 1, . . .

f () 0
f () = f ()
f () = f (1 )
R 1/2
(0) = V [xt ] = 1/2 f () d
2
White noise: fw () = w
ARMA (p, q) , (B)xt = (B)wt :

|(e2i )|2
|(e2i )|2
Pp
Pq
where (z) = 1 k=1 k z k and (z) = 1 + k=1 k z k
2
fx () = w

28

I0 (a, b) = 0
I1 (a, b) = 1
Ix (a, b) = 1 I1x (b, a)

Discrete Fourier Transform (DFT)


d(j ) = n1/2

n
X

xt e2ij t

22.3

i=1

Fourier/Fundamental frequencies

Finite
j = j/n

Inverse DFT
xt = n1/2

n1
X

d(j )e2ij t

j=0

I(j/n) = |d(j/n)|
Scaled Periodogram
4
I(j/n)
n
!2
n
2X
=
xt cos(2tj/n +
n t=1

P (j/n) =

22.1

n
2X
xt sin(2tj/n
n t=1

!2

22 Math

22.1 Gamma Function

Ordinary: Γ(s) = ∫₀^∞ t^{s−1} e^{−t} dt
Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1} e^{−t} dt
Lower incomplete: γ(s, x) = ∫₀^x t^{s−1} e^{−t} dt
Γ(α + 1) = αΓ(α),  α > 1
Γ(n) = (n − 1)!,  n ∈ N
Γ(1/2) = √π

22.2 Beta Function

Ordinary: B(x, y) = B(y, x) = ∫₀¹ t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
Incomplete: B(x; a, b) = ∫₀^x t^{a−1}(1 − t)^{b−1} dt
Regularized incomplete:
  I_x(a, b) = B(x; a, b)/B(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}  (a, b ∈ N)
  I₀(a, b) = 0,  I₁(a, b) = 1
  I_x(a, b) = 1 − I_{1−x}(b, a)

22.3 Series

Finite

Σ_{k=1}^n k = n(n + 1)/2
Σ_{k=1}^n (2k − 1) = n²
Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
Σ_{k=1}^n k³ = (n(n + 1)/2)²
Σ_{k=0}^n cᵏ = (c^{n+1} − 1)/(c − 1),  c ≠ 1

Binomial

Σ_{k=0}^n C(n, k) = 2ⁿ
Σ_{k=0}^n C(r + k, k) = C(r + n + 1, n)
Σ_{k=0}^n C(k, m) = C(n + 1, m + 1)
Vandermonde's Identity: Σ_{k=0}^r C(m, k) C(n, r − k) = C(m + n, r)
Binomial Theorem: Σ_{k=0}^n C(n, k) a^{n−k} b^k = (a + b)ⁿ

Infinite

Σ_{k=0}^∞ pᵏ = 1/(1 − p)  and  Σ_{k=1}^∞ pᵏ = p/(1 − p),  |p| < 1
Σ_{k=0}^∞ kp^{k−1} = (d/dp) Σ_{k=0}^∞ pᵏ = (d/dp)(1/(1 − p)) = 1/(1 − p)²,  |p| < 1
Σ_{k=0}^∞ C(r + k − 1, k) xᵏ = (1 − x)^{−r},  r ∈ N⁺
Σ_{k=0}^∞ C(α, k) pᵏ = (1 + p)^α,  |p| < 1, α ∈ ℂ

22.4 Combinatorics

Sampling (k out of n)

                    ordered                                     unordered
w/o replacement     n_(k) = ∏_{i=0}^{k−1} (n − i) = n!/(n − k)! C(n, k) = n_(k)/k! = n!/(k!(n − k)!)
w/ replacement      nᵏ                                          C(n − 1 + k, k) = C(n − 1 + k, n − 1)

Stirling numbers, 2nd kind

{n k} = k {n−1 k} + {n−1 k−1},  1 ≤ k ≤ n
{n 0} = 1 if n = 0, 0 otherwise

Partitions

P_{n+k,k} = Σ_{i=1}^n P_{n,i};  P_{n,k} = 0 for k > n;  P_{n,0} = 0 for n ≥ 1, P_{0,0} = 1

Balls and Urns (f : B → U with |B| = n, |U| = m; D = distinguishable, ¬D = indistinguishable)

                  f arbitrary           f injective            f surjective        f bijective
B: D,  U: D       mⁿ                    m_(n) if m ≥ n, else 0 m! {n m}            n! if m = n, else 0
B: ¬D, U: D       C(m + n − 1, n)       C(m, n)                C(n − 1, m − 1)     1 if m = n, else 0
B: D,  U: ¬D      Σ_{k=1}^m {n k}       1 if m ≥ n, else 0     {n m}               1 if m = n, else 0
B: ¬D, U: ¬D      Σ_{k=1}^m P_{n,k}     1 if m ≥ n, else 0     P_{n,m}             1 if m = n, else 0

References
[1] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[2] A. Steger. Diskrete Strukturen, Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[3] A. Steger. Diskrete Strukturen, Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.


[Figure: Univariate distribution relationships chart, courtesy Leemis and McQueston [1].]
