
LARGE SAMPLE THEORY

GIORGIO PICCI
University of Padova, Italy

University of Guangzhou, November 2018

1
SCOPE OF THESE NOTES

To discuss the main ideas and complement the textbook

ELEMENTS OF LARGE SAMPLE THEORY


by
E.L. Lehmann

You should all have an electronic version of this book. Please take it to
class. It does not make sense for me to provide notes repeating this ma-
terial. Another useful reference is

A COURSE IN LARGE SAMPLE THEORY


by
T.S. Ferguson

2
Outline of the Course

1. Some Probability Background

2. Review of convergence concepts

3. The CLT for i.i.d random sequences

4. Applications

5. Law of Large Numbers and Ergodicity

6. Maximum Likelihood: Asymptotic properties

7. Time series

3
QUICK REVIEW OF PROBABILITY

◦ Probability space: {Ω, A, P} the elementary event ω ∈ Ω chosen by


“nature”. A contains all subsets of Ω (Events) of which you can com-
pute the probability.

P : A → [ 0, 1 ] , countably additive set function

◦ Random variables are functions x : Ω → R.


The Probability distribution function of x:
F(x) := P{ω | x(ω) ≤ x} ; x∈R
right-continuous non-decreasing monotonic function.

◦ Expectation of a random variable


E x := ∫_Ω x(ω) P(dω) = ∫_R x dF(x)

4
AN ELEMENTARY EXAMPLE
Assume we are tossing a coin and let p := probability to observe “TAIL”,
event which will be denoted by the symbol T and 1 − p := probability that
HEAD will show instead; event which is denoted by the symbol C. Naturally,
p is unknown. We want to obtain information on the value of p by tossing
the coin N consecutive times, assuming that each toss does not influence
the outcome of the other tosses.

Let Ω = {all possible outcomes of N consecutive tosses}.

The set Ω contains all sequences made of N symbols T and C in any


possible order say one possible ω being

T T C C C T T C C T T T T...C T N symbols
Let A be the family of all subsets A of Ω. These are called Events. Exam-
ple
A = {ω | in ω there are an even number of T’s}

5
Assume each toss does not influence the outcome of the other tosses.
This defines a class of probability measures which describes each toss as
being independent of the others. In formulas, this means that a class of
probability measures
P := {P p ; 0 < p < 1}
is defined on on {Ω, A } for each elementary event ω ∈ Ω by the Bernoulli
distribution

P_p({ω}) = p^{T(ω)} (1 − p)^{N − T(ω)} ,   0 < p < 1 ,


where T(ω) is the number of symbols T in the sequence ω. Clearly the probability measure P_p is defined as soon as one assigns a value to p in the interval (0, 1).

In this case the family P is parametric; i.e.

P := {P p ; 0 < p < 1 } .
Estimating P is just selecting a plausible value of p based on the observa-
tion of the outcomes of N successive coin tosses.
6
Alternatively, one may want to validate some a priori belief on p for example
that p = 1/2 (that is, T and C are equiprobable). In this case one deals
with an hypothesis testing problem: on the basis of some observation ω̄
decide whether P_p belongs to the class

P_0 := { P_{1/2} } ,

or P_p belongs to the complementary family

P_1 := { P_p ; p ≠ 1/2 } .

A Bernoulli random variable x_k ; k = 1, 2, . . . , N takes value 1 when at the k-th toss T is observed and 0 otherwise. So x_k(ω) only depends on the k-th symbol ω_k of the sequence.

A Binomial random variable s_N(ω) is the gambler's fortune after N tosses:

s_N(ω) := Σ_{k=1}^N x_k(ω)

7
The expected value of x_k is

E x_k = Σ_{ω_k} x(ω_k) P_p({ω_k})

Since ω_k can be T with probability p and C with probability 1 − p, it follows that E x_k = p for all k.

The probability distribution of a Binomial random variable, assuming the tosses are independent, is

P_p{ω ; s_N(ω) ≤ n} = Σ_{k=0}^n (N choose k) p^k (1 − p)^{N−k}

This is known as the Binomial distribution and is denoted B_{p,N}(n).

8
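A minimal numerical sketch of the coin-tossing model of the last few slides (added here for illustration; it is not part of the original notes). It uses Python/NumPy; the values of p, N and the number of repetitions are arbitrary choices.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
p, N, n_rep = 0.3, 50, 100_000           # arbitrary illustrative values

# simulate n_rep independent experiments of N Bernoulli tosses each
tosses = rng.random((n_rep, N)) < p      # True = "T" with probability p
s_N = tosses.sum(axis=1)                 # the Binomial random variable s_N

# empirical frequencies of {s_N = n} versus the Binomial pmf C(N,n) p^n (1-p)^(N-n)
for n in range(0, 6):
    emp = np.mean(s_N == n)
    pmf = comb(N, n) * p**n * (1 - p)**(N - n)
    print(f"n={n}:  empirical {emp:.4f}   binomial pmf {pmf:.4f}")

# the relative frequency s_N/N concentrates around p (law of large numbers)
print("mean of s_N/N:", (s_N / N).mean(), "  true p:", p)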
NOTATIONS

◦ Random variables will be denoted by lower case bold symbols such as x, y, ... etc. Lehmann's notation of using upper case symbols X, Y, ... is bad: upper case symbols are standard for MATRICES, such as covariances or loading matrices in linear models. Later we shall need to introduce multivariate statistics and Lehmann's notation would produce confusion.

◦ The sample size is denoted by N: lower case n is often used for dimen-
sion of vectors (either random or non-random) or degrees of freedom. So
in general n is fixed while N → ∞.

◦ Names: Pdf instead of cdf; x ∼ F means that the random variable x has
Pdf F. In discrete probability spaces F(x) is a staircase function.
Continuous variables admit a probability density function p(x) := dF(x)/dx (pdf).

9
CONVERGENCE OF RANDOM VARIABLES

1. Almost sure convergence: ordinary convergence of functions, x_N(ω) → x(ω) for all ω ∈ Ω except perhaps a subset of ω's of probability zero. Written x_N → x a.s.

2. Quadratic mean convergence: E |x_N − x|² → 0. This is written x_N → x in q.m.

3. Convergence in probability: P{ω | |x_N(ω) − x(ω)| > ε} → 0 for all ε > 0.
This is written x_N →^P x, or P − lim x_N = x.

IMPLICATIONS:
1. ⇒ 3. 2. ⇒ 3.
10
CHEBYSHEV INEQUALITY

Suppose x and y have finite second moment; then for all ε > 0

P{ |x − y| ≥ ε } ≤ (1/ε²) E[(x − y)²]

Same proof as Lemma 2.1.1 in Lehmann's book (which takes y = c, a constant); in fact just call z := x − y.

Theorem 1 (Theorem 2.1.1). If x_N → x in q.m. then x_N →^P x.

Proof: in the inequality above just let x ≡ x_N and y ≡ x.

Examples on p. 49 in Lehmann’s book.

11
THE BINOMIAL DISTRIBUTION
A Binomial random variable s_N is the total number n of Tails (T) in N independent Bernoulli tosses with probability p. Its distribution is

B_{p,N}(n) = Σ_{k=0}^n (N choose k) p^k (1 − p)^{N−k}

Since s_N is the sum of N independent Bernoulli random variables x_k ; k = 1, 2, . . . , N, each taking value 1 when T is observed and 0 otherwise, whose mean is E x_k = p, we have

E s_N = E{x_1} + E{x_2} + . . . + E{x_N} = p + · · · + p  (N times) = N p

The variance is computed by summing the variances of each x_k, which can be computed to be p(1 − p). Hence the variance of s_N is N p(1 − p) and so

var(s_N / N) = (1/N²) N p(1 − p) = p(1 − p)/N .

Then s_N / N →^P p. (Lehmann Example (2.1.1))
12
WEAK LAW OF LARGE NUMBERS
Let {x_1, x_2, . . . , x_N} be a sequence of random variables (a sample of size N).

Definition 1. The sample mean of the sequence is the random variable

µ̂_N := (1/N) Σ_{k=1}^N x_k

(also denoted x̄ or x̄_N). The sample variance of the sequence is the random variable

σ̂²_N := (1/N) Σ_{k=1}^N (x_k − µ̂_N)²

Theorem 2. If the random variables are independent identically distributed (i.i.d.) then

µ̂_N →^P µ = E x_k ,

that is, the sample mean is a consistent estimator of the mean.
13
PROOF

See p. 49. By Chebyshev inequality

P{ |µ̂_N − µ| ≥ ε } ≤ (1/ε²) E[(µ̂_N − µ)²]        (µ̂_N ≡ x̄_N if you like)

The quantity E[(µ̂_N − µ)²] is the variance of the sample mean:

var( (1/N) Σ_{k=1}^N x_k ) = (1/N²) var( Σ_{k=1}^N x_k ) = (1/N²) N σ² ,

where σ² = var(x_k). Then

P{ |µ̂_N − µ| ≥ ε } ≤ σ²/(N ε²) → 0

for all ε > 0 as N → ∞. Hence µ̂_N →^P µ = E x_k.

14
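A small Monte Carlo sketch of Theorem 2 and of the Chebyshev bound used in the proof (an illustrative addition, not part of the original slides; the uniform distribution and the values of N, ε are arbitrary choices).

import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 0.5, 1/12          # mean and variance of U[0,1]
eps, n_rep = 0.05, 20_000

for N in (10, 100, 1000, 10000):
    x = rng.random((n_rep, N))                   # i.i.d. U[0,1] samples
    mu_hat = x.mean(axis=1)                      # sample means
    freq = np.mean(np.abs(mu_hat - mu) >= eps)   # empirical P{|mu_hat - mu| >= eps}
    bound = sigma2 / (N * eps**2)                # Chebyshev bound sigma^2/(N eps^2)
    print(f"N={N:6d}  P(|mean-mu|>=eps) ≈ {freq:.4f}   Chebyshev bound {bound:.4f}")

The empirical probability decays to zero at least as fast as the Chebyshev bound, as the proof of Theorem 2 requires.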
CONSISTENCY OF ESTIMATORS

An estimator of a parameter θ (or of a function g(θ ) of the parameter) is a


function φ (y) of the sample y := {y1 , y2 . . . , yN } which does not depend on
θ . The estimator is (uniformly) unbiased if

E θ φ (y) = θ ; (or g(θ ) ) ∀θ


Definition 2. Assume the data are generated by a “true model” corre-
sponding to a “true parameter value” θ0. Let P0 ≡ Pθ0 be the corresponding
“true” probability law of the data. The estimator sequence φN ; N = 1, 2, . . .
is consistent in probability (or weakly consistent) if

P0 − lim φN = θ0
That is, the sequence of random variables {φN } converges in probability
P0 to a nonrandom constant equal to the true parameter θ0.

15
CONSISTENCY AND CHEBYSHEV INEQUALITY
Chebyshev inequality holds also for random vectors (see below):

P_θ{ ‖φ_N − θ‖ ≥ ε } ≤ (1/ε²) E_θ[(φ_N − θ)ᵀ(φ_N − θ)] = (1/ε²) E_θ ‖φ_N − θ‖² ,

where φ_N := φ_N(y_1, . . . , y_N) and ‖ · ‖ is the Euclidean norm, or the absolute value when φ_N is scalar. Proof: easy generalization of the proof of Lemma 2.1.1 in Lehmann's book.

If φ_N is unbiased, E_θ φ_N = θ and the last member is the scalar variance σ²_N(θ) of φ_N(y_1, . . . , y_N) divided by ε². If

lim_{N→∞} σ²_N(θ) = 0 ;   ∀θ ∈ Θ ,

then φ_N(y_1, . . . , y_N) is (weakly) consistent. (Remember that we do not know the true value θ_0.)

Proposition 1. If φ_N(y_1, . . . , y_N) is an asymptotically unbiased estimator and if its scalar variance σ²_N(θ) tends to zero as N → ∞ for all θ ∈ Θ, then φ_N(y_1, . . . , y_N) is consistent.
16
APPLICATIONS p. 55

Suppose {x_1, x_2, . . . , x_N} is a sequence of independent random variables with the same mean µ but different variances σ_k² = var(x_k) ; k = 1, 2, . . . , N. Then E[(µ̂_N − µ)²] is still the variance of the sample mean:

var( (1/N) Σ_{k=1}^N x_k ) = (1/N²) var( Σ_{k=1}^N x_k ) = (1/N²) Σ_{k=1}^N σ_k²

If

(1/N²) Σ_{k=1}^N σ_k² → 0

the sample mean µ̂_N is still a consistent estimator. The variances cannot grow as fast as N² (or faster), otherwise this normalized sum does not tend to zero and may even diverge to ∞. If σ_k² ≤ σ² for all k, the normalized sum is bounded by σ²/N and the variance of the sample mean tends to zero.

17
SIMPLE LINEAR REGRESSION p. 58-59

Suppose you measure data pairs {x_k, y_k ; k = 1, 2, . . . , N} where the x_k are known exactly but the y_k are uncertain because they are affected by errors.

You would like to describe these data approximately by a straight line, say y = α + β x. What is the best straight line approximating the measured data?

Suppose you model the measurement process by a statistical model

y_k = α + β x_k + e_k ,   k = 1, 2, . . . , N

where the errors e_k are zero-mean independent random variables with variances σ_k². In the given experimental condition ω you have observed the values y_k(ω) = y_k ; k = 1, 2, . . . , N corresponding to errors e_k(ω) (which you do not know).

Definition 3. The least squares estimator of the parameter (α, β) is the solution of the minimization problem

min_{(α, β)} Σ_{k=1}^N [ y_k − (α + β x_k) ]²

18
SOLUTION OF THE LINEAR REGRESSION PROBLEM 1

The minimizers are (formulas 2.2.14 and 2.2.13 in the book)

α̂_N = ȳ_N − β̂_N x̄_N

β̂_N = Σ_k (x_k − x̄_N) y_k / Σ_k (x_k − x̄_N)²

You may imagine that these are sample values of random variables (before collecting the data),

α̂_N = ȳ_N − β̂_N x̄_N ,   β̂_N = Σ_k (x_k − x̄_N) y_k / Σ_k (x_k − x̄_N)² ,

now regarded as functions of the random sample {y_k}.
Question: are these consistent estimators of the parameters (α, β )?
The answer depends on how you describe the errors.

19
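A minimal numerical sketch of formulas 2.2.13-2.2.14 (an illustrative addition to the notes; the "true" values of α, β, the noise level and N are arbitrary choices).

import numpy as np

rng = np.random.default_rng(2)
alpha, beta, N = 1.0, 2.0, 200            # arbitrary "true" values for the demo
x = np.linspace(0.0, 10.0, N)             # known regressors x_k
e = rng.normal(0.0, 0.5, size=N)          # zero-mean independent errors e_k
y = alpha + beta * x + e                  # observed data y_k = alpha + beta x_k + e_k

x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * y) / np.sum((x - x_bar) ** 2)   # formula 2.2.13
alpha_hat = y_bar - beta_hat * x_bar                            # formula 2.2.14
print("alpha_hat =", alpha_hat, "  beta_hat =", beta_hat)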
SOLUTION OF THE LINEAR REGRESSION PROBLEM 2

The estimators are unbiased:

E α̂_N = E(ȳ_N − β̂_N x̄_N)

E β̂_N = Σ_k (x_k − x̄_N) E y_k / Σ_k (x_k − x̄_N)²

First note that

Σ_k (x_k − x̄_N) E y_k = β Σ_k (x_k − x̄_N) x_k

since Σ_k (x_k − x̄_N) = 0 and hence Σ_k (x_k − x̄_N) x̄_N = 0 as well, so that

E β̂_N = β Σ_k (x_k − x̄_N)² / Σ_k (x_k − x̄_N)² = β .

Then one can show that α̂_N is also unbiased.

20
PROOF OF UNBIASEDNESS

Since the errors are zero-mean, E(ȳ_N) = α + β x̄_N and hence

E α̂_N = α + E(β − β̂_N) x̄_N

On the other hand E y_k = α + β x_k, and so

E(β̂_N − β) = Σ_k (x_k − x̄_N)(α + β x_k) / Σ_k (x_k − x̄_N)² − β = β Σ_k (x_k − x̄_N) x_k / Σ_k (x_k − x̄_N)² − β = 0

since Σ_k (x_k − x̄_N) α = 0 and likewise Σ_k (x_k − x̄_N) x̄_N = 0.

21
CONSISTENCY OF THE LEAST SQUARES ESTIMATORS

First look at

β̂_N = Σ_k (x_k − x̄_N) y_k / Σ_k (x_k − x̄_N)² := Σ_k w_k y_k

and since the y_k are independent (as the e_k are)

var(β̂_N) = Σ_k w_k² var(y_k) = Σ_{k=1}^N w_k² σ_k²

where σ_k² = var(e_k). For convergence in probability of β̂_N to β we need

lim_{N→∞} Σ_{k=1}^N w_k² σ_k² = 0

which, in case σ_k² = σ² independent of k, means

Σ_{k=1}^N w_k² = Σ_k (x_k − x̄_N)² / [ Σ_k (x_k − x̄_N)² ]² = 1 / Σ_k (x_k − x̄_N)² → 0 .

22
This is the same as

Σ_{k=1}^N (x_k − x̄_N)² → ∞   as N → ∞,

which means that the points x_k should not remain too close to their sample mean. Since

α̂_N = α + (β − β̂_N) x̄_N + (1/N) Σ_{k=1}^N e_k

and, under this condition, both the last two terms converge to zero in probability, it is easy to see that the estimator α̂_N is also consistent in probability.

23
CONVERGENCE IN DISTRIBUTION
Definition 4. A sequence of Pdf's {F_N} (possibly multivariate) converges in law to a Pdf F, notation F_N →^L F, if the functions {F_N(x)} converge to a Pdf F(x) at all points x where F is continuous. (The superscripts L and P on arrows denote convergence in law and in probability, respectively.)

One also talks about convergence in distribution (or in law) of random variables: a sequence {x_N} (maybe vector valued) converges in distribution, x_N →^L x, if the Pdf's of {x_N} converge in law to the Pdf of x. This is a weaker notion than convergence of random variables as defined above.

WARNING: To talk about convergence of random variables, {x_N} and x must be defined on the same probability space (the same random experiment). Otherwise F_N → F does not necessarily mean that {x_N(ω)} with x_N ∼ F_N converges to a limit random variable x(ω′), ω′ ∈ Ω′, in any reasonable sense. The elementary events ω′ must lie in the same space Ω.
24
Theorem 3. Convergence in probability implies convergence in distribu-
tion. (Theorem 2.3.5)

Convergence in distribution is weaker than (implied by) convergence in probability, except when the limit is a constant (nonrandom) variable.

A degenerate Pdf is

F(x) := 1(x − c) = { 1 if x ≥ c ;  0 if x < c }

this is the Pdf of a constant (nonrandom) variable x(ω) = c for all ω ∈ Ω.

Theorem 4. Convergence in law to a degenerate Pdf (that is, convergence in distribution to a constant) implies, and is hence equivalent to, convergence in probability to the same constant:

x_N →^L c ⇔ x_N →^P c

whenever c is a (nonrandom) constant.

This is the setting of Lehmann’s book.


25
Example 1. Let {x_N} be an i.i.d. sequence, so that each x_N has the same distribution, say F_N = F_1 = F (a continuous function), for all N. Then obviously F_N → F_1, but {x_N} cannot converge in probability to x_1 since x_1 is independent of all {x_N ; N > 1}. In fact,

P{ω | |x_N(ω) − x_1(ω)| > ε} = 1 − E{F(x_1 + ε) − F(x_1 − ε)}

which does not depend on N and hence cannot converge to zero.
Theorem 5 (Weak Convergence). The sequence of random variables {x_N} converges in distribution to x if and only if

E f(x_N) → E f(x) ;   that is   ∫ f(x) dF_N(x) → ∫ f(x) dF(x)

for all bounded continuous real valued functions f. In fact, it is enough to consider all real valued functions f which are bounded and continuous on a set of probability one for the Pdf of x.

26
CHARACTERISTIC FUNCTIONS
φ_x(it) := ∫ e^{itx} dF(x) = E e^{itx}

NOTE: The imaginary argument of the exponential here is essential to guarantee boundedness, as |e^{itx}| = 1. Therefore convergence in distribution implies pointwise convergence of the characteristic functions:

φ_{x_N}(it) := E e^{it x_N} → φ_x(it) := E e^{it x} ,   for all t ∈ R .


Actually this result can be inverted
Theorem 6 (Lévy-Helly-Bray). Convergence of the characteristic functions is necessary and sufficient for (and hence equivalent to) convergence in distribution.

This is a very useful fact. Used for example in the proof of the CLT.

27
CONVERGENCE OF MOMENTS

The moments of a Pdf are derivatives of the characteristic function computed at t = 0:

φ^{(k)}(it) = i^k ∫ x^k e^{itx} dF(x)   ⇒   φ^{(k)}(0) = i^k ∫ x^k dF(x) = i^k µ_k

Convergence φ_N(it) → φ(it) does not necessarily imply convergence of the derivatives at t = 0. In general, convergence in law does not imply convergence of the moments: means, variances, etc., of a sequence {x_N} →^L x do not necessarily converge to the mean, variance, etc., of the limit.

Theorem 7 (Billingsley p. 32). Let x_N →^L x and

sup_N E x²_N < ∞        (1)

then all existing moments of x_N converge to the respective moments of the limit distribution.
28
CONTINUOUS MAPPING THEOREMS
The Continuous Mapping Theorem states that for every continuous function f(·), if x_N →^P x, then also f(x_N) →^P f(x). Does it hold also for convergence in distribution?

We show that for scalar random variables this is true also for convergence in distribution. If F_N →^L F, then for every bounded continuous function g composed with another arbitrary continuous function f we must have

∫_R g(f(y)) dF_N(y) → ∫_R g(f(y)) dF(y)

By a change of variable (suppose f is invertible)

∫_R g(x) dF_N(f⁻¹(x)) → ∫_R g(x) dF(f⁻¹(x))

where F_N(f⁻¹(x)) and F(f⁻¹(x)) are the Pdf's of f(x_N) and f(x).

NB: For vector functions this proof does not necessarily work. See Slutsky's Theorem below. See also Billingsley's book pp. 29-30.
29
MULTIVARIATE STATISTICS

In many applications one has to deal with multiple measurements taken simultaneously. Also, there may be many unknown parameters θ ≡ {θ_1, θ_2, . . . , θ_p}. An estimator must then have the same dimension p as the parameter. We need to work with multivariate random variables.

It is convenient to introduce vector notation. An n-dimensional random vector is denoted

x := [x_1, . . . , x_n]ᵀ ,   xᵀ = [x_1 . . . x_n]

The mean E x := µ is a vector in Rⁿ. The Covariance (or simply Variance) of the vector x is the n × n matrix

Σ := Var{x} := E (x − µ)(x − µ)ᵀ = [ σ_{1,1} . . . σ_{1,n} ; . . . ; σ_{n,1} . . . σ_{n,n} ]

Usual convention: σ_{k,k} ≡ σ_k², the variance of the k-th component x_k.
30
BASICS ON LINEAR ALGEBRA
Refer to Lehmann Chapter 5 (with some notations changed) p.277.

A covariance matrix is always symmetric, Σ = Σᵀ, and positive semidefinite, that is, the quadratic form xᵀΣx ≥ 0. Excluding pathological cases one has in fact strict positivity except when x = 0.

The scalar variance of a random vector y is the trace of the covariance matrix. Notation:

var(y) := E{yᵀy} = Tr E{yyᵀ} = Tr Var(y) = Σ_{k=1}^n σ_k²   (for y of zero mean; in general subtract the mean first)

Gaussian random vectors have a probability density function depending only on the mean vector µ and the Covariance matrix Σ:

p(x_1, . . . , x_n) = [ (2π)ⁿ det Σ ]^{−1/2} exp{ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) }

Uniquely determined by the parameters µ, Σ.


31
EIGENVALUES AND EIGENVECTORS
Along some directions a square matrix A ∈ Rn×n acts like a multiplication
by a scalar
Av = λ v , v ∈ Rn
the scalar factor λ is called the eigenvalue associated to the eigenvector
v. Eigenvectors are actually directions in space and are usually normalized
to unit norm. In general eigenvalues (and eigenvectors) are complex as
they must be roots of the characteristic polynomial equation

χA(λ ) := det(A − λ I) = 0
which is of degree n in λ and hence has n (not necessarily distinct) complex
roots {λ1, . . . , λn}. This set is called the spectrum of A and is denoted σ (A).
The multiplicity of λk as a root of the characteristic polynomial is called the
algebraic multiplicity.

When eigenvectors are linearly independent they form a basis in which the
matrix A looks like multiplication by a diagonal matrix whose elements are
32
the eigenvalues. Unfortunately this happens only for special classes of
matrices.

SYMMETRIC MATRICES
Theorem 8. Let A = A> ∈ Rn×n. Then
1. The eigenvalues of A are real and the eigenvectors can be chosen to be
a real orthonormal basis.
2. A is diagonalizable by an orthogonal transformation (∃T s.t. T >T = I
and T >AT is diagonal).
3. A positive (semi-) definite matrix can always be taken symmetric. Its
eigenvalues are real and positive (nonnegative).
MULTIVARIATE CONVERGENCE

Let x_N ∼ F_N and x ∼ F be n-dimensional random vectors; then x_N →^L x means that

F_N(x_1, x_2, . . . , x_n) → F(x_1, x_2, . . . , x_n)

at all points x = [x_1, x_2, . . . , x_n]ᵀ ∈ Rⁿ where F is continuous. This implies that all marginals converge, that is

F_N(x_1) → F(x_1) ; . . . ; F_N(x_n) → F(x_n)

which can be written

x_{1,N} →^L x_1 ; . . . ; x_{n,N} →^L x_n

But the converse implication is not necessarily true! Convergence of the marginals F_N(x_k) ; k = 1, 2, . . . , n does not imply convergence of the joint distributions at all points x = [x_1, x_2, . . . , x_n]ᵀ ∈ Rⁿ.


33
Example 2. Let x_N and y_N be two scalar sequences converging separately in distribution to the random variables x and y. Then x_N →^L x and y_N →^L y does NOT NECESSARILY imply that

[x_N, y_N]ᵀ →^L [x, y]ᵀ    (in general not true)

unless one of the two limits is a degenerate r.v. (a constant) c; see Slutsky's Theorem.
SLUTSKY THEOREM
Theorem 9 (Slutsky). Let x_N →^L x be scalar random variables. Then:

1. For every continuous function f, f(x_N) →^L f(x).

2. If {y_N} is a sequence of random variables such that (x_N − y_N) → 0 in probability, then y_N also converges in law to x (that is, y_N →^L x).

3. Let z_N = [x_N, y_N]ᵀ and let the sequence {y_N ; N = 1, 2, . . .} converge in probability (or in law) to a constant c. Then, if f(z) := f(x, y) is a continuous function of the two arguments, f(x_N, y_N) →^L f(x, c).

All statements are also valid for random vectors.


34
SLUTSKY THEOREM CONT’D

In statement 3 one cannot relax the assumption that {y_N ; N = 1, 2, . . .} converges to a constant c to convergence to a non-degenerate random variable.

Example 3 (Ferguson p. 40). Let x_N = x for all N, where x ∼ U[0, 1], and let y_N = x for N odd and y_N = 1 − x for N even. Then y_N →^L U[0, 1], but the (joint) distribution of the vector z_N = [x_N, y_N]ᵀ cannot converge in distribution, as

F_{z_N}(x, y) = P{x ≤ x, x ≤ y} = min{x, y}   for N odd
F_{z_N}(x, y) = P{x ≤ x, x ≥ 1 − y} = max{0, x + y − 1}   for N even

where x, y ∈ [0, 1].

35
FERGUSON EXAMPLE CONT’D

We show that y_N ∼ U[0, 1] for all N. Hence F_{y_N} → U[0, 1] as N → ∞.

This is obviously true by definition for N odd. For N even, y_N = 1 − x, hence

P{y_N ≤ y} = P{1 − x ≤ y} = P{x ≥ 1 − y} = 1 − P{x ≤ 1 − y} = 1 − (1 − y) = y

So also for N even y_N ∼ U[0, 1].

Similarly you get

F_{z_N}(x, y) = P{x ≤ x, x ≤ y} = min{x, y}   for N odd
F_{z_N}(x, y) = P{x ≤ x, x ≥ 1 − y} = max{0, x + y − 1}   for N even

where x, y ∈ [0, 1]. Draw a picture of this function on the square [0, 1] × [0, 1] for N odd and N even and check that the two surfaces differ. So F_{z_N}(x, y) cannot converge.

36
APPLICATIONS OF SLUTSKY THEOREM
(Lehmann p. 70)
Corollary 1 (Theorem 2.3.3). If x_N →^L x and two random variables a_N, b_N converge in probability to constants (a, b) (the same as also converging in distribution!), then

z_N := a_N + b_N x_N →^L a + b x

(In Lehmann x_N ≡ Y_n.)

Proof. Follows from statement 3 of Slutsky's Theorem: just let y_N = [a_N, b_N]ᵀ, c := [a, b]ᵀ ∈ R², and f(x, y) = a + b x.

Corollary 2 (Corollary 2.3.1). If x_N →^L x and r_N →^L 0 then

x_N + r_N →^L x .

37
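A numerical sketch of Corollary 1 (added for illustration; the constants a = 2, b = 3, the Gaussian choices and the rate 1/√N are arbitrary).

import numpy as np

rng = np.random.default_rng(3)
n_rep = 100_000

for N in (10, 100, 10_000):
    x_N = rng.normal(0.0, 1.0, n_rep)                        # x_N ~ N(0,1), so x_N ->L N(0,1)
    a_N = 2.0 + rng.normal(0.0, 1.0, n_rep) / np.sqrt(N)     # a_N ->P a = 2
    b_N = 3.0 + rng.normal(0.0, 1.0, n_rep) / np.sqrt(N)     # b_N ->P b = 3
    z_N = a_N + b_N * x_N                                    # should converge in law to a + b x ~ N(2, 9)
    print(f"N={N:6d}   mean {z_N.mean():.3f} (→ 2)   std {z_N.std():.3f} (→ 3)")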
ABOUT SLUTSKY THEOREM
(Lehmann p. 70)

Warning: The Pdf of the sum of two random variables is not the sum of
the two Pdf’s!!

You may want to compute the Pdf (or, more easily, the pdf) of x + c given that of x.

Also, the Pdf of a random variable equal to zero a.s. (x = 0) is not the identically zero function, which in fact is not even a Pdf!

P
Two sequences of random variables {xN }, {yN } such that (xN − yN ) → 0
in probability (equiv. in distribution, since 0 is a constant), are said to be
asymptotically equivalent.

38
THE CENTRAL LIMIT THEOREM (CLT)
The first version of the Central Limit Theorem is due to De Moivre and
Laplace for binomial random variables and later to Gauss for sum of con-
tinuous i.i.d. variables.

What do we mean by a limit distribution? Let {x_k} be i.i.d. random variables of mean µ and variance σ². By the law of large numbers the sample mean x̄_N = (1/N) Σ_{t=1}^N x_t converges to the mean µ = E x_k in probability (in fact almost surely) as N → ∞. The variance of x̄_N must then tend to zero [CHECK THIS!] and the limit distribution must obviously be degenerate:

x̄_N →^L µ

This is not very interesting. Let us note that var x̄_N tends to zero as 1/N. In fact

var(x̄_N) = (1/N²) Σ_{k=1}^N var(x_k) = (1/N) σ² .

39
THE CLASSICAL CLT

Hence as N → ∞, the variance of the random variable √N [x̄_N − µ] has a finite limit (the variance of each x_k). Then its limit distribution cannot be degenerate. The key is to discover the convergence rate.

Theorem 10. Assume {x_k} are i.i.d. random variables of mean µ and finite variance σ². Then y_N := √N (x̄_N − µ) converges in distribution to a Gaussian of mean zero and variance σ². Letting x̃_k := x_k − µ, this is the same as

√N (1/N) Σ_{k=1}^N x̃_k →^L N(0, σ²)

In short, √N times the sample mean of the x̃_k converges in law to N(0, σ²).

NB: The theorem is no longer true if you do not subtract the mean:

√N x̄_N = √N (1/N) Σ_{k=1}^N x_k = √N (1/N) Σ_{k=1}^N x̃_k + √N µ

since √N µ does not converge in distribution.
40
ABOUT THE CONVERGENCE RATE
(Lehmann last paragraph of p.70 gives only an intuitive argument)

When x̄_N →^L c, then (under suitable assumptions) the variance σ²_N of x̄_N must tend to zero. (Show that the variance of a degenerate Pdf is zero!) The main question is: does there exist a deterministic function k(N) → ∞ such that

k(N) (x̄_N − c) →^L F

where F(x) is a (limit) non-degenerate Pdf?

Example: if x̄_N is the sample mean of N i.i.d. random variables, c = µ and k(N) = √N, then F ≡ N(0, σ²). This is the CLT.

41
PROOF OF THE CLT 1
Need to review properties of the characteristic function, see Lehmann
p. 581.
Proposition 2. The characteristic function of N(0, 1) is the function

φ_o(it) = e^{(it)²/2} = e^{−t²/2} .

Proposition 3. Let x̄ = Σ_{k=1}^N x_k be the sum of independent random variables (not necessarily having the same distribution). The characteristic function of x̄ is

φ_x̄(it) = φ_{x_1}(it) φ_{x_2}(it) · · · φ_{x_N}(it) .

The proof follows from the properties of the exponential function and independence:

φ_x̄(it) = E e^{it x̄} = E e^{it Σ_{k=1}^N x_k} = Π_{k=1}^N E e^{it x_k} = Π_{k=1}^N ∫ e^{it x_k} dF_k(x_k) = φ_{x_1}(it) φ_{x_2}(it) · · · φ_{x_N}(it) .
42
If the variables are i.i.d. then φ_x̄(it) = φ_{x_1}(it)^N.

Another simple fact used in the proof of the CLT: if ȳ = α x̄ where α ≠ 0, then

φ_ȳ(it) = E e^{it ȳ} = E e^{i(αt) x̄} = φ_x̄(iαt) .
PROOF OF THE CLT 2

Let y_k := x_k − µ; then y_1, y_2, . . . , y_N are i.i.d. of mean zero and variance σ². We shall show that

z_N := √N ȳ_N →^L N(0, σ²) .

In fact, from

φ_{z_N}(it) = Π_{k=1}^N E exp{ (it/√N) y_k }

take the Taylor expansion about t = 0,

exp{ (it/√N) y } = 1 + (it/√N) y + ½ ((it)²/N) y² + o[ (|t|/√N)² y² ] ,

substitute y = y_k and take expectations. Since E y_k = 0 and E y_k² = σ², you get

φ_{z_N}(it) = [ 1 + (1/N)(−t²/2) σ² + o(|t|²/N) ]^N

43
PROOF OF THE CLT 3

Then pass to the limit as N → +∞. As in Lehmann p. 41, Prob. 4.8,

lim_{N→∞} [ 1 − (1/N)(t²/2) σ² ]^N = e^{−t²σ²/2}

which is the characteristic function of N(0, σ²). By the Lévy-Helly-Bray Theorem:

√N ȳ_N →^L N(0, σ²) .
NB: this result is independent of the distribution of the xk ’s !!

It could be anything provided the mean and variance are finite, e.g. Bino-
mial, U[a, b] etc. See Lehmann pp. 73-74.

44
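A quick Monte Carlo check of Theorem 10 with a deliberately non-Gaussian starting distribution (added here for illustration; the exponential distribution and the values of N and the number of repetitions are arbitrary choices).

import numpy as np

rng = np.random.default_rng(4)
N, n_rep = 500, 50_000
mu, sigma = 1.0, 1.0                       # mean and std of the Exp(1) distribution

x = rng.exponential(scale=1.0, size=(n_rep, N))
y_N = np.sqrt(N) * (x.mean(axis=1) - mu)   # sqrt(N)(x_bar - mu)

print("mean of y_N :", y_N.mean(), "  (should be ≈ 0)")
print("var  of y_N :", y_N.var(), "   (should be ≈ sigma^2 =", sigma**2, ")")
# a Gaussian tail check: P{y_N > 1.645 sigma} should be close to 0.05
print("P{y_N > 1.645 sigma} ≈", np.mean(y_N > 1.645 * sigma))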
Example 4 (Lehmann pp. 75-76). Let y := {y_1, y_2, . . . , y_N} be an i.i.d. sequence where each y_k has mean µ and variance σ². Find the asymptotic (limit) distribution of the random variable

ϕ_N(y) := √N [ȳ_N − µ] / √(σ̂²_N(y))

where σ̂²_N(y) is the sample variance

σ̂²_N(y) = (1/N) Σ_{k=1}^N (y_k − ȳ_N)²

Solution: We know that

√N [ȳ_N − µ] →^L N(0, σ²)

and, on the other hand, as N → ∞,

σ̂²_N(y) →^P σ²

(in fact also almost surely). By statement 3 of Slutsky's Theorem

ϕ_N(y) →^L N(0, 1) .
45
If it is known that µ = 0, then ϕ_N(y) is the so-called Student's t statistic.

Exercise (2.4.4): Assume τ² = var y_k² is finite. Prove that under the i.i.d. assumption

σ̂²_N(y) →^P σ² .

Use the CLT to prove that if µ = E y_k is known, the asymptotic distribution of the sample second order moment is Gaussian; in fact,

√N [ (1/N) Σ_{k=1}^N (y_k − µ)² − σ² ] →^L N(0, τ²)

But for finding the asymptotic distribution of the sample variance σ̂²_N(y) we need a more sophisticated tool. See Lehmann pp. 75-76.

46
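A numerical sketch of Example 4 and of the first part of Exercise 2.4.4 (an illustrative addition; the Gaussian sampling distribution and the values of µ, σ, N are arbitrary choices, any finite-variance distribution would do).

import numpy as np

rng = np.random.default_rng(5)
N, n_rep = 400, 30_000
mu, sigma = 2.0, 3.0

y = rng.normal(mu, sigma, size=(n_rep, N))
y_bar = y.mean(axis=1)
sigma2_hat = ((y - y_bar[:, None]) ** 2).mean(axis=1)    # sample variance (1/N convention)

phi_N = np.sqrt(N) * (y_bar - mu) / np.sqrt(sigma2_hat)  # studentized mean of Example 4
print("mean of sigma2_hat ≈", sigma2_hat.mean(), "  (→ sigma^2 =", sigma**2, ")")
print("mean(phi_N) ≈", phi_N.mean(), "  var(phi_N) ≈", phi_N.var(), "  (→ 0 and 1)")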
SOLUTION
See Lehmann p. 51-52

σ̂²_N(y) = (1/N) Σ_{k=1}^N (y_k − ȳ_N)² = (1/N) Σ_{k=1}^N { (y_k − µ)² + 2(y_k − µ)(µ − ȳ_N) + (µ − ȳ_N)² }

        = (1/N) Σ_{k=1}^N (y_k − µ)² − (ȳ_N − µ)² := s²_N(y) − (ȳ_N − µ)²

Now (ȳ_N − µ)² →^P 0 since ȳ_N − µ →^P 0; and since E s²_N(y) = σ²,

var{ s²_N(y) } = E[ (1/N) Σ_{k=1}^N (y_k − µ)² − σ² ]² = (1/N) [ E(y_k − µ)⁴ − 2σ⁴ + σ⁴ ]

which tends to zero for N → ∞. Therefore s²_N(y) →^P σ² and so does σ̂²_N(y).

47
THE χ 2 DISTRIBUTION
One says that a scalar random variable y has a χ²(n) distribution if its pdf is supported on the nonnegative real line and has the following form:

P(x ≤ y < x + dx) = 1 / (2^{n/2} Γ(n/2)) · x^{n/2 − 1} e^{−x/2} dx ,   x ≥ 0.        (2)

In this expression n is a natural number called the number of degrees of freedom of the distribution. One sees that the χ² is a special case of the Gamma distribution. Its characteristic function is

φ(it) := E e^{it y} = (1 − 2it)^{−n/2} .        (3)

From this one can derive formulas for the moments of the distribution. The first few central moments are

µ_1 = n ,  µ_2 = 2n ,  µ_3 = 8n ,  µ_4 = 48n + 12n² ,  etc.        (4)
48
Lemma 1. For large n a χ²(n) random variable tends in distribution to a Gaussian variable with pdf N(n, 2n).

Proof. Let y ∼ χ²(n); introduce the standardized random variable

z_n := (y − n) / √(2n) ,

which for all n has mean zero and unit variance. Of course z_n is no longer a χ² (as this could happen only for Gaussian random variables!). We shall show that the limit in distribution, L − lim_{n→∞} z_n, is a standard N(0, 1) density. Recall that

Proposition 4. Let x_n be a sequence of random variables with characteristic functions φ_n(it); then

x_n →^L x if and only if φ_n(it) → φ(it) ,   ∀t .        (5)

Proof: the characteristic function φ_n(it) of z_n can be written as

φ_n(it) = E e^{it (y − n)/√(2n)} = e^{−it n/√(2n)} ( 1 − 2it/√(2n) )^{−n/2}
        = [ e^{it √(2/n)} ( 1 − it √(2/n) ) ]^{−n/2}
        = [ 1 + t²/n + ψ(n)/n ]^{−n/2} ,

where lim_{n→∞} ψ(n) = 0. By a well known formula in Analysis the limit lim_{n→∞} φ_n(it) is equal to

φ(it) = lim_{n→∞} ( 1 + t²/n )^{−n/2} = e^{−t²/2} ,

which is the characteristic function of a standard Gaussian distribution.

The χ² distribution plays a role in many questions of statistical inference, especially entering in the pdf of estimators.

Proposition 5. The sum of N independent random variables y_i ∼ χ²(n_i) is distributed as χ²(n) where

n = Σ_{i=1}^N n_i ,        (6)

that is, when summing independent χ²'s, the degrees of freedom add up.

Proof. Recall that the pdf of the sum Σ_{i=1}^N y_i of independent random variables is just the N-fold convolution of the respective pdf's, so that the characteristic functions φ_i(t) of the y_i's get multiplied together. It is then clear that, multiplying functions like (3), the exponents at the denominators must add up.

The following is a partial converse of this statement.

Proposition 6. Let y = y_1 + y_2 be the sum of two independent random variables. Assume that y ∼ χ²(n) and y_2 ∼ χ²(n_2) where n > n_2. Then y_1 ∼ χ²(n − n_2).

Proof. By independence the characteristic function of y is φ = φ_1 φ_2, so that

φ_1 = φ / φ_2

and by substituting the respective expressions (3) one sees that the statement must be true.

Proposition 7. The pdf of the random variable

N s̄²_N / σ² := (1/σ²) Σ_{k=1}^N (y_k − µ)² ,

where the y_k ∼ N(µ, σ²) are i.i.d., is χ²(N).

Proof. Just note that, with y ∼ N(µ, σ²), the pdf of z := (y − µ)²/σ² is χ²(1) and then use Proposition 5.

Note also that

Proposition 8. The pdf of z = x² with x ∼ N(0, 1) is χ²(1).

Proof. Using the well-known rules for the pdf of a function of a random variable, say z = f(x) with f(x) = x², one obtains

p_z(z) = [ 1 / |df/dx| ]_{x = f⁻¹(z)} [ p_x(√z) + p_x(−√z) ] 1(z)
       = (1 / |2√z|) (1/√(2π)) [ e^{−z/2} + e^{−z/2} ] 1(z) = (1/√(2πz)) e^{−z/2} ;   z ≥ 0 ,

which is indeed χ²(1).
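A small numerical check of Propositions 5 and 8 (an illustrative addition; the number of summands n and the sample size are arbitrary choices).

import numpy as np

rng = np.random.default_rng(6)
n_rep, n = 200_000, 5

z = rng.normal(size=(n_rep, n))     # i.i.d. N(0,1) variables
q = (z ** 2).sum(axis=1)            # sum of n squares: should be chi^2(n)

print("mean of q :", q.mean(), "  (chi^2(n) has mean n =", n, ")")
print("var  of q :", q.var(), "   (chi^2(n) has variance 2n =", 2 * n, ")")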
Proposition 9. Let y_k ∼ N(µ, σ²), k = 1, . . . , N, be i.i.d. Then the pdf of the normalized sample variance

N σ̂²_N / σ² = (1/σ²) Σ_{k=1}^N (y_k − ȳ_N)²

is χ²(N − 1).

Proof. This follows from the following remarkable result:

Lemma 2. Under the above hypotheses, the statistics ȳ_N and σ̂²_N are independent.

Proof. We just need to show that ȳ_N and y_k − ȳ_N are uncorrelated for all k's. By Gaussianity, this will imply independence.

Define x_k = y_k − µ and x̄_N := ȳ_N − µ, so that y_k − ȳ_N = x_k − x̄_N and E ȳ_N(y_k − ȳ_N) = E x̄_N(x_k − x̄_N) = E x̄_N x_k − E x̄²_N. Independence of the variables y_k implies

E x̄_N x_k = E[ x_k (1/N) Σ_{i=1}^N x_i ] = (1/N) E(x_k²) = σ²/N

so that, comparing with E(x̄_N)² = σ²/N, one gets the conclusion.

By the identity

Σ_{k=1}^N (y_k − µ)² = Σ_{k=1}^N (y_k − ȳ_N)² + N (ȳ_N − µ)²        (7)

one has

Σ_{k=1}^N (y_k − µ)²/σ² = Σ_{k=1}^N (y_k − ȳ_N)²/σ² + N (ȳ_N − µ)²/σ²

where the two random variables in the right member are independent. We know from Proposition 7 that N s̄²_N/σ² ∼ χ²(N) and that (ȳ_N − µ)²/(σ²/N) ∼ χ²(1) (which also follows from Proposition 7 with N = 1). By Proposition 6 the pdf of the first summand in the right member must be χ²(N − 1).

So far we have been discussing the case of scalar variables. Suppose y is an m-dimensional random vector. We are interested in finding out when the pdf of quadratic forms like yᵀQy with Q = Qᵀ is χ². The most obvious situation in which this happens is the following.

Proposition 10. Let y ∼ N(µ, Σ) with µ ∈ Rᵐ and Σ ∈ Rᵐˣᵐ positive definite; then

(y − µ)ᵀ Σ⁻¹ (y − µ) ∼ χ²(m) .        (8)

Proof. One just needs to standardize y by setting z := Σ^{−1/2}(y − µ), so that z = [z_1, . . . , z_m]ᵀ is N(0, I); in particular z_1, . . . , z_m are i.i.d. and N(0, 1). It follows that

(y − µ)ᵀ Σ⁻¹ (y − µ) = zᵀz = Σ_{i=1}^m z_i²

and the last member is χ²(m) by Proposition 5.

A less obvious characterization which is used frequently is the following.

Proposition 11. Let z ∼ N(0, I_m) and Q ∈ Rᵐˣᵐ. Then the quadratic form zᵀQz is χ² distributed if and only if Q is idempotent, i.e. Q = Q². In this case the number of degrees of freedom is equal to r = rank Q.

Proof. The proof is based on diagonalization of Q. Indeed, since Q is symmetric (and can always be assumed to be such) and idempotent, it is really an orthogonal projection in Rᵐ. Its non-zero eigenvalues are all equal to 1 and there are exactly r = rank Q of them. The spectral decomposition of Q can therefore be written

Q = U [ I_r  0 ;  0  0 ] Uᵀ ,   UUᵀ = UᵀU = I_m

that is

Q = U_1 U_1ᵀ ,

where U_1 is the m × r matrix formed by the first r (orthonormal) columns of U. Hence

zᵀQz = z_1ᵀ z_1

where the r-dimensional random vector z_1 := U_1ᵀ z is distributed as N(0, I_r). Proposition 5 then yields the conclusion.
THE STUDENT DISTRIBUTION

Let y ∼ N(0, 1) and x ∼ χ²(n) be independent. Then the ratio

t := y / √(x/n)        (9)

has the pdf

p_n(t) = 1 / ( √n B(1/2, n/2) ) · ( 1 + t²/n )^{−(n+1)/2} ,   t ∈ R        (10)

called a Student distribution with n degrees of freedom, which we shall denote by the symbol S(n). In (10), B is the Euler Beta function:

B(p, q) := ∫_0^1 x^{p−1} (1 − x)^{q−1} dx = Γ(p) Γ(q) / Γ(p + q)

where the function Γ is the well-known generalization of the factorial. When n is an integer greater than 1, Γ(n) = (n − 1)!.

49
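A numerical sketch of definition (9) (an illustrative addition; n = 8 degrees of freedom is an arbitrary choice, and the theoretical values quoted in the comments are the standard Student-t moment formulas).

import numpy as np

rng = np.random.default_rng(7)
n_rep, n = 300_000, 8                        # n = degrees of freedom

y = rng.normal(size=n_rep)                   # y ~ N(0,1)
x = rng.chisquare(df=n, size=n_rep)          # x ~ chi^2(n), independent of y
t = y / np.sqrt(x / n)                       # the ratio of definition (9)

print("sample variance of t :", t.var(), "   theoretical n/(n-2) =", n / (n - 2))
print("sample E t^4 :", (t ** 4).mean(),
      "   theoretical 3 n^2/((n-2)(n-4)) =", 3 * n**2 / ((n - 2) * (n - 4)))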
The Student pdf has a curious history which is reported in all textbooks of classical Statistics. For n = 1 it reduces to the Cauchy distribution:

S(1) ≡ 1 / ( π (1 + t²) ) .

It can be shown that S(n) has finite moments only up to order n − 1, given by the formulas

µ_r = 0   when r is odd and r < n ,

µ_{2r} = n^r Γ(r + 1/2) Γ(n/2 − r) / ( Γ(1/2) Γ(n/2) )   when 2r < n .

It is also not hard to show that for n → ∞ the distribution S(n) converges to N(0, 1). (Example 4)
Example 5. It is not difficult to check that the Cauchy distribution centered at y = θ,

C(y, θ) = (1/π) · 1 / (1 + (y − θ)²) ;   θ ∈ R ,

has characteristic function

φ(it) = (1/π) ∫_{−∞}^{+∞} e^{ity} / (1 + (y − θ)²) dy = e^{itθ − |t|}

Suppose that x_1, x_2, . . . , x_N is an i.i.d. sample from the Cauchy distribution. Its sample mean x̄_N has the characteristic function

φ_{x̄_N}(it) = E e^{i(t/N) Σ x_k} = [ e^{i(t/N)θ − |t|/N} ]^N = e^{itθ − |t|} = φ(it)

which is invariant, so that φ_{x̄_N}(it) coincides with the initial characteristic function for every N. In distribution, x̄_N → x_k (k arbitrary).

This implies that x̄_N as an estimator of θ is not consistent (in fact E_θ |x_k| = ∞!). Moreover, the characteristic function of √N (x̄_N − θ) is e^{−√N |t|}, which tends to 0 for every t ≠ 0 and hence cannot converge to a characteristic function. No CLT for the Cauchy distribution!
50
CRAMÈR’S THEOREM
Ferguson p. 45, see also Theorem 2.5.2 in Lehmann
Theorem 11. Let g : Rⁿ → Rᵐ have continuous partial derivatives and consider the m × n Jacobian matrix

G(x) := [ ∂g_i/∂x_j ]_{i=1,...,m ; j=1,...,n}

If y_1, y_2, . . . is a sequence of n-dimensional random vectors (not necessarily i.i.d.) such that √N (y_N − µ) →^L y, then

√N ( g(y_N) − g(µ) ) →^L G(µ) y .

In particular, if √N (y_N − µ) →^L N(0, Σ) where Σ is a covariance matrix, then

√N ( g(y_N) − g(µ) ) →^L N(0, G(µ) Σ G(µ)ᵀ) .

This allows one to compute the asymptotic distribution of many functions of the sequence y_1, y_2, . . ..
51
APPLICATIONS OF CRAMÈR THEOREM
Many examples in Lehmann pp 86-90

Example 6. Here n = 1. Suppose that x̄_N = (1/N) Σ_{k=1}^N x_k, where the x_k are not necessarily i.i.d., have common mean µ and variance σ², and √N (x̄_N − µ) →^L N(0, σ²). What is the asymptotic distribution of x̄²_N?

Solution: Let g(x) = x²; then g′(x) = dg/dx = 2x. Hence

√N (x̄²_N − µ²) →^L N(0, 4µ²σ²) .

Warning: If µ = 0 this just says that √N x̄²_N →^L 0, which is a degenerate distribution. This means that 1/√N is not the correct convergence rate. In fact, by Proposition 8 and Slutsky's Theorem,

N x̄²_N / σ² →^L χ²(1) .

52
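A numerical check of Example 6 (an illustrative addition; the Gaussian sampling distribution and the values of µ, σ, N are arbitrary choices).

import numpy as np

rng = np.random.default_rng(8)
N, n_rep = 1000, 40_000
mu, sigma = 2.0, 1.5

x = rng.normal(mu, sigma, size=(n_rep, N))
x_bar = x.mean(axis=1)
w = np.sqrt(N) * (x_bar ** 2 - mu ** 2)      # sqrt(N)(x_bar^2 - mu^2)

print("var of sqrt(N)(x_bar^2 - mu^2):", w.var(),
      "   delta-method value 4 mu^2 sigma^2 =", 4 * mu**2 * sigma**2)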
Example 7 (Lehmann pp. 75-76). Show that the asymptotic distribution of the sample variance σ̂²_N(x) of an i.i.d. sample having finite fourth order moment µ_4 is

√N (σ̂²_N(x) − σ²) →^L N(0, µ_4 − σ⁴) .

Solution: Using the identity (7), the sample variance can be written

σ̂²_N(x) = (1/N) Σ_{k=1}^N (x_k − x̄_N)² = (1/N) Σ_{k=1}^N (x_k − µ)² − (x̄_N − µ)² := m_2(x) − m_1(x)² ,

where m_1(x) := x̄_N − µ and m_2(x) := (1/N) Σ_{k=1}^N (x_k − µ)². To find the asymptotic distribution of σ̂²_N(x) we use Cramèr's Theorem. Let us define the function g(m_1, m_2) = −m_1² + m_2. Then

σ̂²_N(x) = −m_1(x)² + m_2(x) := g(m_1(x), m_2(x)) = g( [m_1(x), m_2(x)]ᵀ )

Note that the two components are correlated. One cannot "add" the two asymptotic distributions; we need to use the joint distribution of the random vector.
53
Since the sample x is i.i.d. one has

√N m_1(x) →^L N(0, σ²) .

Similarly, since also the (x_k − µ)² ; k = 1, 2, . . . are i.i.d., one has

√N ( m_2(x) − σ² ) = √N [ (1/N) Σ_{k=1}^N (x_k − µ)² − σ² ] →^L N(0, µ_4 − σ⁴)

because var (x_k − µ)² = E(x_k − µ)⁴ − 2σ⁴ + σ⁴ = µ_4 − σ⁴, where µ_4 is the central fourth order moment.

By the CLT for random vectors,

√N ( [m_1(x), m_2(x)]ᵀ − [0, σ²]ᵀ ) →^L N(0, Σ)

where, writing x̃_k := x_k − µ, the random vector [m_1(x), m_2(x)]ᵀ has variance matrix

Σ = [ var x̃_k   Cov(x̃_k, x̃_k²) ;  Cov(x̃_k², x̃_k)   var x̃_k² ] = [ σ²   µ_3 ;  µ_3   µ_4 − σ⁴ ]
Cov (x2k , xk ) var x2k µ3 µ4 − σ 4

54
We want the asymptotic distribution of g(m_1(x), m_2(x)). We do not need to compute the third order moment µ_3, since the vector of derivatives of g with respect to the two variables m_1, m_2 is g′(m_1, m_2) = [−2m_1, 1], so that g′(0, σ²) = [0, 1] and hence

g′(0, σ²) Σ g′(0, σ²)ᵀ = var x̃_k² = E x̃_k⁴ − (E x̃_k²)² = µ_4 − σ⁴ .

In conclusion:

√N [ σ̂²_N(x) − g(0, σ²) ] = √N [ σ̂²_N(x) − σ² ] →^L N(0, µ_4 − σ⁴) .

If we know that the pdf of x_k is Gaussian, µ_4 = 3σ⁴ and the limit distribution is N(0, 2σ⁴).

55
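A Monte Carlo check of Example 7 in the Gaussian case, where µ_4 − σ⁴ = 2σ⁴ (an illustrative addition; the values of σ and N are arbitrary choices).

import numpy as np

rng = np.random.default_rng(9)
N, n_rep = 2000, 30_000
sigma = 2.0

x = rng.normal(0.0, sigma, size=(n_rep, N))   # Gaussian case: mu4 = 3 sigma^4
s2 = x.var(axis=1)                            # sample variance with 1/N normalization
w = np.sqrt(N) * (s2 - sigma ** 2)

print("var of sqrt(N)(s2 - sigma^2):", w.var(),
      "   theoretical mu4 - sigma^4 = 2 sigma^4 =", 2 * sigma**4)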
GENERALIZATIONS OF THE CLT
Theorem 12 (Ferguson Problem 5, p.34, Lehmann Theorem 2.7.4 p.102).
Let {x_1, x_2, . . . , x_N} be a sequence of i.i.d. random variables with mean µ and variance σ², and let w(N, k) ; k = 1, 2, . . . , N be a sequence of real numbers. Define

y_N := Σ_{k=1}^N w(N, k) x_k ,   σ²_N := var(y_N) = σ² Σ_{k=1}^N w(N, k)² .

Then:

( y_N − E y_N ) / √(σ²_N) →^L N(0, 1)

if the following Lindeberg condition holds: for N → +∞,

max_{k ≤ N}  w(N, k)² / Σ_{k=1}^N w(N, k)²  → 0 .

56
Note: you can restate the theorem in terms of the independent random variables z_k = w(N, k) x_k, which have mean µ_k = w(N, k) µ and variance σ_k² = w(N, k)² σ²; then

Σ_{k=1}^N (z_k − µ_k) / √(σ²_N) →^L N(0, 1) .

Lehmann's condition (2.7.3) is more complicated to verify.

57
APPLICATION TO LINEAR REGRESSION
See Lehmann p. 101 and 104

Recall the regression model

y_k = α + β x_k + e_k ,   k = 1, 2, . . . , N

where the errors e_k are zero-mean independent random variables with variances σ_k². The least squares estimates are

α̂_N = ȳ_N − β̂_N x̄_N

β̂_N = Σ_{k=1}^N [ (x_k − x̄_N) / Σ_{j=1}^N (x_j − x̄_N)² ] y_k := Σ_{k=1}^N w_k y_k

Note that the weights w_k actually depend on N. We shall rewrite them as w(N, k). The expression is similar to that in Theorem 12 (see the note below the statement). The variance σ²_N in this case is

σ²_N = var(β̂_N) = Σ_{k=1}^N w²(N, k) var(y_k) = Σ_{k=1}^N w²(N, k) σ_k²
58
where σ_k² = var(e_k). For convergence in probability of β̂_N to β we need

lim_{N→∞} Σ_{k=1}^N w²(N, k) σ_k² = 0

In case σ_k² = σ², independent of k, we have

w²(N, k) / Σ_{k=1}^N w²(N, k) = [ (x_k − x̄_N)² / ( Σ_{k=1}^N (x_k − x̄_N)² )² ] · Σ_{k=1}^N (x_k − x̄_N)² = (x_k − x̄_N)² / Σ_{k=1}^N (x_k − x̄_N)²

so the Lindeberg condition is equivalent to

max_{k ≤ N} (x_k − x̄_N)² / Σ_{k=1}^N (x_k − x̄_N)² → 0 .

Hence, under this condition we also have

( β̂_N − β ) / √(σ²_N) →^L N(0, 1) .
THE ASYMPTOTIC DISTRIBUTION OF α̂_N
See Lehmann Prob. 7.11 p. 104

Since α̂_N = α + x̄_N (β − β̂_N) + (1/N) Σ_{k=1}^N e_k, we have

(α̂_N − α) = (β − β̂_N) x̄_N + ē_N ,

that is,

√N [ 1  x̄_N ] ( [α̂_N, β̂_N]ᵀ − [α, β]ᵀ ) = √N ē_N

so that the random variable √N z_N in the first member has mean zero and variance τ²_N := (1/N) σ²_N, where σ²_N = Σ_{k=1}^N σ_k² → ∞. Therefore, under this condition,

√N z_N / τ_N →^L N(0, 1) .

Need a careful analysis of the vector least squares problem.

59
THE CRAMÈR - RAO BOUND
Ferguson Chap 19

Let x be an r-dimensional random vector with x ∼ {F_θ ; θ ∈ Θ} (x could in particular be a random sample (y_1, . . . , y_N), but the Cramèr-Rao inequality does not require independence). Assume the following:

[A.1)] F_θ admits a density p(·, θ) which is twice differentiable w.r.t. θ.

[A.2)] For every statistic φ with E_θ φ < ∞,

(∂/∂θ_i) ∫_{R^r} φ(x) p(x, θ) dx = ∫_{R^r} φ(x) (∂p(x, θ)/∂θ_i) dx   for i = 1, . . . , p ;  ∀ θ ∈ Θ.

In particular,

(∂/∂θ_i) ∫_{R^r} p(x, θ) dx = ∫_{R^r} (∂p(x, θ)/∂θ_i) dx.

[A.3)] (∂²/∂θ_i ∂θ_j) ∫_{R^r} p(x, θ) dx = ∫_{R^r} (∂²p(x, θ)/∂θ_i ∂θ_j) dx ;   ∀ i, j = 1, . . . , p ;  ∀ θ ∈ Θ.

60
Definition 5. The Fisher Information Matrix I(θ) of the parametric family of densities {p_θ} is defined as

I(θ) := E_θ [ (∂ log p(x, θ)/∂θ_i) · (∂ log p(x, θ)/∂θ_j) ]_{i,j=1,...,p}        (11)

I(θ) can also be written as

I(θ) = −E_θ [ ∂² log p(x, θ)/∂θ_i ∂θ_j ]_{i,j=1,...,p} .        (12)

That (12) and (11) are equivalent follows by differentiating the identity ∫ p(x, θ) dx = 1 (constant with respect to θ) termwise with respect to θ, getting

∫_{R^r} (∂p(x, θ)/∂θ_i) dx = 0 ,   ∫_{R^r} (∂²p(x, θ)/∂θ_i ∂θ_j) dx = 0 ,   i, j = 1, . . . , p .

Equation (12) then follows from

− ∂² log p/∂θ_i ∂θ_j = (∂ log p/∂θ_i)(∂ log p/∂θ_j) − (1/p) ∂²p/∂θ_i ∂θ_j ,
61
To understand the meaning of I(θ) we shall bring in the p-dimensional random vector of the random sensitivities of p(·, θ) with respect to the parameter θ,

z_θ := [ ∂ log p(x, θ)/∂θ_i ]_{i=1,...,p} = [ (∂p(x, θ)/∂θ_i) / p(x, θ) ]_{i=1,...,p}

by which

I(θ) = E_θ z_θ z_θᵀ ,        (13)

where the matrix is (at least) positive semidefinite since it is a variance. In fact, it easily follows from [A.2)] that E_θ ∂ log p/∂θ_i = 0 for all i's and so

E_θ z_θ = 0 .

62
Theorem 13 (The Cramèr-Rao Inequality). Let g be a differentiable function from Θ to R^q and φ be an unbiased estimator of g(θ). Let V(θ) be the variance matrix of φ and G(θ) the Jacobian matrix of g,

G(θ) = [ ∂g_i(θ)/∂θ_j ]_{i=1,...,q ; j=1,...,p} .        (14)

Then, if the Fisher matrix I(θ) is invertible, one has

V(θ) − G(θ) I⁻¹(θ) G(θ)ᵀ ≥ 0 ,        (15)

where ≥ 0 means that the matrix on the left is positive semidefinite.

Proof: The proof is based on the classical formula for the error variance of the linear Bayesian estimator φ̂(x) := E_θ[ φ(x) | z_θ ] of the vector φ(x) given z_θ, that is

Var_θ{φ(x) − φ̂(x)} = Var_θ{φ(x)} − Cov_θ{φ(x), z_θ} Var_θ{z_θ}⁻¹ Cov_θ{φ(x), z_θ}ᵀ .        (16)

See for example [?, p. 27].
63
Since φ(x) is an unbiased estimator of g(θ), i.e.

∫_{R^r} φ(x) p(x, θ) dx = g(θ) ,   ∀θ ∈ Θ ,

by applying property [A.2)] one gets

E_θ [ φ(x) z_θ^j ] = ∫_{R^r} φ(x) · (∂p(x, θ)/∂θ_j) · (1/p(x, θ)) · p(x, θ) dx = ∂g(θ)/∂θ_j ,   j = 1, . . . , p ,

and hence ∂g(θ)/∂θ_j is the j-th column of the covariance matrix of φ and z_θ,

E_θ φ(x) z_θᵀ = E_θ φ(x) [z_θ^1, . . . , z_θ^p] ,

that is,

E_θ φ z_θᵀ = G(θ) .        (17)

The inequality follows since the variance of the random vector φ(x) − G(θ) I(θ)⁻¹ z_θ must be (at least) positive semidefinite.

64
Remarks

When φ is an unbiased estimator of θ (that is, if g is the identity map), one has G(θ) = I (the p × p identity) and (15) becomes

V(θ) − I(θ)⁻¹ ≥ 0 .        (18)

Since the scalar variance var_θ(φ) = Σ_{i=1}^p E_θ(φ_i − θ_i)² is the trace of V(θ) and

Tr V(θ) − Tr I⁻¹(θ) = Tr[ V(θ) − I⁻¹(θ) ] ≥ 0

(the trace is the sum of the eigenvalues, and the eigenvalues of a positive semidefinite matrix are all non-negative), it follows that the scalar variance of any unbiased estimator of the parameter θ cannot be less than the positive number Tr[I(θ)⁻¹], that is

var_θ(φ) ≥ Tr[ I(θ)⁻¹ ] ,   ∀θ .        (19)

This lower bound only depends on the probabilistic model class {p(·, θ) ; θ ∈ Θ} and is independent of which estimation criterion is used to construct φ.
65
Remark:

One should however be aware of the fact that the Cramèr-Rao bound is just one possible bound for the variance, which is not necessarily the tightest possible one. There are in fact unbiased estimators whose variance is strictly larger than Tr[I(θ)⁻¹] but which nevertheless have minimum variance.

66
Example 8. Let y ∼ N(θ, σ²) be a scalar random variable with a known variance σ². Since

log p(y, θ) = C − (1/2) (y − θ)²/σ² ,

(d/dθ) log p(y, θ) = (y − θ)/σ² ,

we have

i(θ) = E_θ [ (y − θ)/σ² ]² = (1/σ⁴) · σ² = 1/σ² .

Hence the variance of any unbiased estimator of θ based on a sample of size one cannot be smaller than the variance of y. Assume now we have a random sample of size N from the same Gaussian distribution. Now we have a random vector x = (y_1, . . . , y_N) of dimension r = N and

p(y_1, . . . , y_N | θ) = Π_{t=1}^N p(y_t, θ)

67
and hence

log p(y_1, . . . , y_N | θ) = N × Const − (1/2) Σ_{t=1}^N (y_t − θ)²/σ² ,

d log p / dθ = Σ_{t=1}^N (y_t − θ)/σ² .

Since the random variables y_1, . . . , y_N are independent, it follows that

I(θ) = E_θ [ d log p(y, θ)/dθ ]² = (1/σ⁴) · N σ² = N/σ² .

Let us consider the sample mean ȳ_N = (1/N) Σ_{t=1}^N y_t, which has distribution
N (θ , σ 2/N). Since ȳN is an unbiased estimator of θ with variance σ 2/N,
exactly equal to the inverse of the Fisher information, the sample mean
is the best possible estimator of θ (of course if the sample distribution is
Gaussian). One says that an unbiased estimator whose variance is ex-
actly equal to the inverse of the Fisher information matrix, V (θ ) = I(θ )−1 is
efficient. ♦
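A tiny numerical sketch of Example 8 (added for illustration; the values of θ, σ, N are arbitrary choices).

import numpy as np

rng = np.random.default_rng(10)
theta, sigma, N, n_rep = 1.0, 2.0, 50, 100_000

y = rng.normal(theta, sigma, size=(n_rep, N))
y_bar = y.mean(axis=1)                    # the sample mean, an unbiased estimator of theta

crb = sigma ** 2 / N                      # inverse Fisher information I(theta)^{-1} = sigma^2/N
print("empirical var of the sample mean:", y_bar.var(), "   Cramer-Rao bound:", crb)

The two numbers essentially coincide, which is the efficiency property stated above.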
Example 9. Let y ∼ N(µ, θ²), where µ is known and (y_1, . . . , y_N) is a random sample from N(µ, θ²). Consider the unbiased estimator

σ̃²_N = N σ̂²_N / (N − 1) = 1/(N − 1) Σ_{k=1}^N (y_k − ȳ_N)² .

We know that N σ̂²_N / θ² has a chi square distribution with N − 1 degrees of freedom, which has expectation N − 1 and variance 2(N − 1).

The variance of σ̃²_N is then 2θ⁴/(N − 1). The Cramèr-Rao bound in this case is 2θ⁴/N and hence the variance of σ̃²_N is strictly greater than I(θ)⁻¹. One can however show [?] that any unbiased estimator of θ² cannot have a smaller variance than that of σ̃²_N. From this example it follows that I(θ)⁻¹ is not the best possible lower bound. ♦

Note that a biased estimator can have smaller variance than the C-R limit. From Proposition 9 it follows that var σ̂²_N = 2θ⁴ (N − 1)/N².
68
Exercises

1-1 Let I(θ ) be the Fisher matrix relative to an arbitrary density p(y, θ ).
Show that for a random sample of size N one has IN (θ ) = N I(θ ).

1-2 Show, without using the χ 2 distribution, that the Cramèr-Rao bound
for a random sample from N (µ, θ 2) of size N is 2θ 4/N.

1-3 Show that the Cramèr-Rao bound for N(θ_1, θ_2²) (two dimensional parameter θ) is

I(θ)⁻¹ = [ θ_2²/N   0 ;  0   2θ_2⁴/N ] .

69
THE KULLBACK-LEIBLER DISTANCE
See Ferguson p112

In this section we shall define a measure of deviation of two random vari-


ables x1 ∼ p(·, θ1) and x2 ∼ p(·, θ2) described by the same parametric family
of distributions.

We shall use this measure to quantify, in rather precise terms, the ability of observations extracted from the model to discriminate between different values of the parameter θ.

Definition 6. Let f and p be probability densities such that p(x) = 0 ⇒ f(x) = 0. The Kullback-Leibler (pseudo-)distance between f and p is

K(f, p) := ∫_{R^r} [ log f(x) − log p(x) ] f(x) dx = ∫_{R^r} log( f(x)/p(x) ) f(x) dx = E_f log( f/p ) ;        (20)

It is immediate that K( f , p) = 0 if and only if f = p.

70
From Jensen's inequality,

∫ log g(x) dµ ≤ log{ ∫ g(x) dµ } ,

which holds for g(x) > 0 when µ is a probability measure, one gets

−K(f, p) = ∫_{R^r} log( p/f ) f dx ≤ log{ ∫_{R^r} (p/f) f dx } = log{1} = 0

so that K(f, p) ≥ 0.

For this reason K( f , p) can be interpreted as a measure of deviation of the


probability density p from a “reference” density f .
Note in fact that K(f, p) is not symmetric, i.e. K(p, f) ≠ K(f, p), and it does not satisfy the triangle inequality. In Information Theory K(f, p) is called divergence and is denoted by the symbol D(f ‖ p) (here p is the approximation of f). The Wikipedia article on the Kullback-Leibler divergence provides a rather complete overview and a bibliography.

71
Let us assume that the family p(·, θ) satisfies the same regularity assumptions listed for the Cramèr-Rao bound and let f ≡ p(·, θ_0) and p ≡ p(·, θ), with θ_0, θ ∈ Θ. Denoting K(p(·, θ_0), p(·, θ)) by K(θ_0, θ) and letting θ = θ_0 + ∆θ, one has

K(θ_0, θ) = K(θ_0, θ_0) + [∂K/∂θ]_{θ_0} ∆θ + ½ ∆θᵀ [∂²K/∂θ_i ∂θ_j]_{θ_0} ∆θ + o(‖∆θ‖²) .

Since K(θ_0, θ_0) = 0 and

∂K/∂θ_i = − ∫_{R^r} (∂ log p(x, θ)/∂θ_i) p(x, θ_0) dx ,

it follows that

[∂K/∂θ_i]_{θ_0} = − ∫_{R^r} [∂p(x, θ)/∂θ_i]_{θ_0} dx = 0

for all i = 1, . . . , p.

In the same way one can verify that

[∂²K/∂θ_i ∂θ_j]_{θ_0} = − ∫_{R^r} [∂² log p(x, θ)/∂θ_i ∂θ_j]_{θ_0} p(x, θ_0) dx = −E_{θ_0} [∂² log p(x, θ)/∂θ_i ∂θ_j]_{θ_0}
72
and hence the first member of this equality is the (i, j)-th element of the Fisher matrix I(θ_0). Hence, for small variations of the parameter θ, it holds that

K(θ_0, θ) ≅ ½ ∆θᵀ I(θ_0) ∆θ ;        (21)

which says that, for small deviations ∆θ of the parameter from the reference value θ_0, the Kullback-Leibler distance between p(·, θ) and p(·, θ_0) is a quadratic form whose weighting matrix is the Fisher matrix I(θ_0). In the next section we will see a remarkable consequence of this fact.

Exercise Compute the Kullback-Leibler distance between the two Gaus-


sian densities, f ≡ N (µ , σ02) and p ≡ N (µ , σ 2). Check what happens if
you invert the order of the two densities.
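A numerical sketch related to this exercise (an illustrative addition; the variances σ_0, σ are arbitrary choices, and the closed-form value used for comparison is the standard formula for the KL divergence between two Gaussians with equal means).

import numpy as np

rng = np.random.default_rng(11)
mu, s0, s1 = 0.0, 1.0, 2.0            # f = N(mu, s0^2), p = N(mu, s1^2)

def log_gauss(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m) ** 2 / (2 * s**2)

x = rng.normal(mu, s0, size=1_000_000)                          # sample from f
K_fp = np.mean(log_gauss(x, mu, s0) - log_gauss(x, mu, s1))     # Monte Carlo E_f log(f/p)
closed = np.log(s1 / s0) + s0**2 / (2 * s1**2) - 0.5            # standard Gaussian KL formula
print("Monte Carlo K(f,p):", K_fp, "   closed form:", closed)

# asymmetry: estimate K(p,f) by sampling from p instead
xp = rng.normal(mu, s1, size=1_000_000)
K_pf = np.mean(log_gauss(xp, mu, s1) - log_gauss(xp, mu, s0))
print("K(p,f):", K_pf, "  (different from K(f,p))")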
IDENTIFIABILITY
See Ferguson p112

The observations may be structurally incapable of providing enough infor-


mation to uniquely locate the value of the parameter θ which has generated
them. A rather trivial example :
Let θ be a two-dimensional parameter [θ_1, θ_2]ᵀ ranging over Θ = R², and let F_θ depend on (θ_1, θ_2) only through their product θ_1 θ_2; for example F_θ ∼ N(θ_1 θ_2, σ²).

For any fixed value θ̄ = (θ̄_1, θ̄_2)ᵀ, the parameters θ̂ = [α θ̄_1, (1/α) θ̄_2]ᵀ, α ≠ 0, define the same PDF; that is, F_θ̄(x) = F_θ̂(x), ∀x. Hence a sample observation extracted from this family, irrespective of its size N, will never be able to distinguish between θ̄ and θ̂.

Definition 7. Two parameter values θ′ and θ″ in Θ are said to be indistinguishable if F_{θ′}(x) = F_{θ″}(x), ∀x ∈ R^r. Notation: θ′ ≃ θ″.

The family of PDF's {F_θ ; θ ∈ Θ} (sometimes one says, improperly, that the parameter θ ∈ Θ is identifiable) is globally identifiable if θ′ ≃ θ″, or, equivalently, F_{θ′} = F_{θ″}, implies that θ′ = θ″, for all θ′, θ″ in Θ.
73
For many applications global identifiability is too restrictive. A weaker con-
dition is the local notion.
Definition 8. The family of PDF’s {Fθ ; θ ∈ Θ} is locally identifiable about
θ0 if there exists an open neighborhood of θ0 which does not contain pa-
rameter values θ which are indistinguishable from θ0 (of course, except θ0
itself).
This concept is often overlooked. There is a remarkable relation between
(local) identifiability and nonsingularity of the Fisher matrix.
Theorem 14 (Rothenberg). Let the parametric model {pθ ; θ ∈ Θ} satisfy
the assumptions A.1, A.2, A.3. Then θ0 is locally identifiable if and only if
I(θ0) is non-singular.

Proof. The proof is based on the properties of the Kullback-Leibler (pseudo-)metric, which guarantee that K(θ_0, θ) = 0 ⇔ p(·, θ_0) = p(·, θ). For small deviations ∆θ of the parameter θ about the reference value θ_0, the Kullback-Leibler distance between the two densities p(·, θ) and p(·, θ_0) is the quadratic form ½ ∆θᵀ I(θ_0) ∆θ. It follows that in any small enough neighborhood of θ_0 one can have parameter values θ ≠ θ_0 for which p(·, θ) = p(·, θ_0) if and only if I(θ_0) is singular.
74
In the previous trivial example one has

I(θ) = E_θ [ (x − θ_1θ_2)² θ_2²/σ⁴   (x − θ_1θ_2)² θ_1θ_2/σ⁴ ;  (x − θ_1θ_2)² θ_1θ_2/σ⁴   (x − θ_1θ_2)² θ_1²/σ⁴ ] = (1/σ²) [ θ_2²   θ_1θ_2 ;  θ_1θ_2   θ_1² ] .

One sees that det I(θ) = 0, ∀θ ∈ R², and hence the model is never locally identifiable about any parameter value θ. In fact, the model is globally unidentifiable, as all indistinguishability classes contain infinitely many parameter values.

75
ERGODICITY AND THE STRONG LAW OF
LARGE NUMBERS
Lehmann p. 62

Let us pretend that we have an infinite sequence of observations indexed


by time, extending from t = −∞ to the infinite future t = +∞. This is called
a stochastic process denoted

y = {y(t)} , t ∈Z
the symbol Z (Zahlen, German for numbers) stands for the integer numbers.
Definition 9. A stochastic process {y(t)} is stationary (in the strict sense)
if all Pdf’s relative to y(t1), y(t2), . . . y(tn) say Fn(x1, . . . , xn,t1, . . . ,tn) are invari-
ant for temporal translation, that is for every n it must hold that,

Fn(x1, . . . , xn,t1 + ∆, . . . ,tn + ∆) = Fn(x1, . . . , xn,t1, . . . ,tn) ,


(same function of x1, . . . , xn, t1, . . . ,tn), whatever the time shift ∆ ∈ Z.

76
Consequences

• The Pdf F(x,t) of any variable y(t) does not depend on t; that is the
random variables y(t), t ∈ Z, are identically distributed;

• The second order joint Pdf F_2(x_1, x_2, t_1, t_2) of the variables y(t_1), y(t_2) only depends on τ = t_1 − t_2 and not on the date. In particular, µ(t) := E y(t) is a constant equal to µ ∈ Rᵐ and the Covariance

Σ(t_1, t_2) := E [y(t_1) − µ(t_1)] [y(t_2) − µ(t_2)]ᵀ

depends only on τ = t_1 − t_2.

77
THE ERGODIC THEOREM

Let f (y) denote a statistic, function of any number of random variables of


the process, which does not depend on time. Denote by fk (y) the same
function in which all time indices of these variables are shifted by k units.
Theorem 15 (Birkhoff Ergodic Theorem). Let {y(t)} be a strictly stationary process. The limit

z̄ := lim_{T→∞} (1/T) Σ_{k=1}^T f_k(y)        (22)

exists with probability one for all functions f such that E |f(y)| < ∞.

The limit can either be random or constant. If it is random it must be a “very


special” random variable. These are called invariant random variables. We
shall not investigate them.

If the limit is a constant then the process is called Ergodic.


78
Note now that

E{ (1/T) Σ_{k=1}^T f_k(y) } = (1/T) Σ_{k=1}^T E f_k(y) = E f(y)

since z(k) = f_k(y) is itself a strictly stationary process. Taking expectations in both members of (22), for T → ∞ one finds

E z̄ = E f(y) .

Corollary 3. If {y(t)} is ergodic,

lim_{T→∞} (1/T) Σ_{k=1}^T f_k(y) = E f(y)        (23)

with probability one, whatever f(y) may be, provided it has finite expectation.

Proof: In fact z̄ must be a constant and hence coincides with its own expectation, z̄ = E z̄ = E f(y). □

79
THE STRONG LAW OF LARGE NUMBERS
Theorem 16 (Kolmogorov). Every i.i.d. process having finite expectation is
ergodic.

The following is an important consequence.


Corollary 4. Let y be an ergodic process and z(t) := f_t(y) the sequence of translates of an arbitrary function of the process having finite expectation. Then {z(t)} is stationary and ergodic. In particular, the translates of every time-invariant function of an i.i.d. process form an ergodic process.

For example, if e is i.i.d. of finite variance and Σ_{k=−∞}^{+∞} |c_k| < ∞, the translated random variables

z(t) := Σ_{k=−∞}^{+∞} c_k e(t + k) = Σ_{k=−∞}^{+∞} c_{−k} e(t − k)

form an ergodic process. In fact, by the Cauchy-Schwarz inequality,

( E | Σ_{k=−N}^{+N} c_k e(t + k) | )² ≤ E | Σ_{k=−N}^{+N} c_k e(t + k) |² = Σ_{k=−N}^{+N} |c_k|² σ_e² .
80
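A numerical sketch of the example above (an illustrative addition; the geometric coefficients c_k, truncated to a finite number of terms, and the length T are arbitrary choices).

import numpy as np

rng = np.random.default_rng(12)
T = 200_000
c = 0.8 ** np.arange(0, 20)                  # absolutely summable coefficients c_k (truncated)

e = rng.normal(size=T + len(c))              # i.i.d. noise
z = np.convolve(e, c, mode="valid")[:T]      # a finite moving average of the i.i.d. noise with coefficients c

# time average of f(z) = z^2 versus its expectation E z(t)^2 = sum_k c_k^2
print("time average of z^2:", np.mean(z ** 2), "   E z^2 = sum c_k^2 =", np.sum(c ** 2))

By the Ergodic Theorem the time average converges with probability one to the ensemble expectation, which is what the printout shows.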
Proposition 12. An ergodic process cannot admit limit for t → ±∞ unless
it reduces to a deterministic sequence (with probability 1).

In fact such a limit should be a constant random variable.


STRONG CONSISTENCY OF ML

Assume x1, x2, . . . , xN is an i.i.d. sample from a family of pdf’s { pθ (x) ; θ ∈


Θ} with Θ ⊂ R p. After performing an experiment you observe a sequence
of sample values x := ( x1, x2, . . . , xN ). The likelihood function of θ corre-
sponding to these sample values is the function
L_N(θ) = L_N(θ | x) = Π_{k=1}^{N} pθ(x_k) .
The function lN (θ ) := log LN (θ ) is called the log-likelihood. A maximum
likelihood estimate (MLE) of θ is any function θ̂N (x) such that

L_N(θ̂_N(x)) = sup_{θ∈Θ} L_N(θ | x)
or equivalently lN (θ̂N (x)) = supθ ∈Θ lN (θ | x). This supremum (which by def-
inition always exists) may be +∞ for all x and the (MLE) as a function of x
may not exist. It certainly exists if Θ is a compact set and LN (θ ) is continu-
ous (minimum requirement: upper semicontinuous).
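For concreteness, here is a minimal numerical sketch of maximum likelihood (assuming, purely for illustration, an i.i.d. sample from the exponential density pθ(x) = θ e^{−θx}; neither this model nor the use of scipy is prescribed by the text):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=500)   # i.i.d. sample, true rate 2

def neg_log_likelihood(theta):
    # l_N(theta) = N log(theta) - theta * sum(x); return its negative
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs. closed-form MLE N / sum(x)
```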
81
Suppose the sample is generated by an unknown true value θ0 of the pa-
rameter and assume the model is locally identifiable about θ0 (in practice
need to check this condition for all θ ) then the KL distance
K(θ0, θ) := ∫_R log( pθ0(x_k) / pθ(x_k) ) pθ0(x_k) dx_k = E_θ0 log( pθ0(x_k) / pθ(x_k) )
is positive (independent of k) and can be zero only if θ = θ0.
Denote the log-likelihood by lN (θ ) = lN (θ | x) then
l_N(θ) − l_N(θ0) = Σ_{k=1}^{N} ( log pθ(x_k) − log pθ0(x_k) ) = Σ_{k=1}^{N} log( pθ(x_k) / pθ0(x_k) )
By the law of large numbers, for every θ ∈ Θ,
lim_{N→∞} (1/N) Σ_{k=1}^{N} log( pθ(x_k) / pθ0(x_k) ) = E_θ0 log( pθ(x_k) / pθ0(x_k) ) = −K(θ0, θ) < 0 ,
with probability 1.
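A small simulation sketch of this law-of-large-numbers argument (assuming, for illustration only, unit-variance Gaussian densities pθ = N(θ, 1), for which K(θ0, θ) = (θ − θ0)²/2):

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, theta, N = 0.0, 1.0, 100_000
x = rng.normal(loc=theta0, scale=1.0, size=N)

# log p_theta(x) - log p_theta0(x) for unit-variance Gaussian densities
log_ratio = -0.5 * (x - theta) ** 2 + 0.5 * (x - theta0) ** 2

print(log_ratio.mean())               # close to -K(theta0, theta)
print(-0.5 * (theta - theta0) ** 2)   # = -0.5 here
```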

82
Therefore for N → ∞, (1/N)( l_N(θ) − l_N(θ0) ) → −K(θ0, θ) < 0 almost surely;
this means that for N large, lN (θ ) < lN (θ0) for all θ ≠ θ0 so, taking expo-
nentials, the likelihood of θ0 will be larger than that of any other θ ∈ Θ. In
fact we have asymptotic exponential decay:
L_N(θ) / L_N(θ0) = O( e^{−N K(θ0, θ)} ) ;    ∀ θ ∈ Θ
which means that the ratio will certainly be < 1 for N large. When Θ is a
finite set this implies that any maximum, θ̂N , of LN (θ ) must converge to θ0.
Theorem 17. If Θ is compact, pθ (x) is continuous in θ for all x and there is
K(x) such that
log( pθ(x) / pθ0(x) ) ≤ K(x) ;    E_θ0 K(x_k) < ∞    (24)
then any maximizing θ̂N (x) converges almost surely to θ0 as N → ∞.

83
ASYMPTOTIC NORMALITY OF MLE
Theorem 18 (Cramèr). Assume again that x1, x2, . . . , xN is an i.i.d. sample
from a family of pdf’s { pθ (x) ; θ ∈ Θ} with Θ ⊂ R p now an open subset, sat-
isfying the three conditions [A.1)][A.2)][A.3)] of the Cramèr-Rao Theorem.
Let θ0 be the true value of the parameter and assume that the information
matrix I(θ ) is non singular at θ = θ0. In other words

I(θ0) > 0
(which implies local identifiability). Assume further that condition (24) is
satisfied. Then there exists a strongly consistent sequence θ̂θ N of roots of
the likelihood equation
∂ l_N(θ) / ∂ θ_i = 0 ;    i = 1, 2, . . . , p
such that
√N ( θ̂_N − θ0 ) →^L N( 0, I(θ0)^{−1} )
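A Monte Carlo sketch of this statement (again with an illustrative Exp(θ0) model, where the MLE is 1/x̄ and I(θ0) = 1/θ0²; this modelling choice is mine, not the text's):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, N, reps = 2.0, 2_000, 5_000

# MLE of the exponential rate is 1 / sample mean, computed for many replications
xbar = rng.exponential(scale=1.0 / theta0, size=(reps, N)).mean(axis=1)
z = np.sqrt(N) * (1.0 / xbar - theta0)

print(z.var())    # close to I(theta0)^{-1} = theta0^2 = 4
print(z.mean())   # close to 0
```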

84
ASYMPTOTIC NORMALITY OF MLE 2
Lehmann's book, pp. 459-476, treats only the scalar case of a single parameter.

Somewhat confusing. We shall see a more general CLT statement later.

85
ASYMPTOTICS OF STATISTICAL TESTS
See Lehmann p. 133-4, and 137
Example 10. Consider an i.i.d sample with finite fourth order moment µ4.
We want to test the hypothesis

H0 : σ 2 = σ02 ,
against the alternative σ 2 > σ02 based on N (→ ∞) independent observa-
tions. We reject the hypothesis H0 if σ̂²_N is large. The asymptotic distribution

√N ( σ̂²_N(x) − σ² ) →^L N( 0, µ4 − σ⁴ )

so for large N, σ̂²_N has approximately mean σ² and variance (µ4 − σ⁴)/N; in the case of near normality µ4 − σ⁴ = 3σ⁴ − σ⁴ = 2σ⁴ and the rejection region becomes

√N ( σ̂²_N − σ0² ) ≥ u_α √2 σ0² ,

which does not depend on the mean µ of the parent distribution.
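A sketch of this test written as a function (the 5% one-sided critical value u_α ≈ 1.645 and the synthetic data are illustrative assumptions):

```python
import numpy as np

def reject_H0(x, sigma0_sq, u_alpha=1.645):
    # One-sided asymptotic test of H0: sigma^2 = sigma0^2 vs sigma^2 > sigma0^2,
    # using the near-normal approximation Var(sqrt(N) * sigmahat^2) ~ 2 sigma0^4.
    N = len(x)
    sigma_hat_sq = np.var(x)                        # ML-type estimate of sigma^2
    stat = np.sqrt(N) * (sigma_hat_sq - sigma0_sq)
    return stat >= u_alpha * np.sqrt(2.0) * sigma0_sq

rng = np.random.default_rng(4)
x = rng.normal(loc=3.0, scale=1.3, size=2_000)      # true sigma^2 = 1.69
print(reject_H0(x, sigma0_sq=1.0))                  # typically True
print(reject_H0(x, sigma0_sq=2.0))                  # typically False
```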

86
TIME SERIES
Suppose you have a sequence of scalar regression data

yN := {y(t) ; t = 1, 2, . . . , N} , uN := {u(t) ; t = 1, 2, . . . , N}
where you measure the u(t)’s exactly but the y(t)’s are random due to er-
rors of various kinds. We shall imagine that they are extracted from two
stochastic processes {y(t), u(t)} which are jointly stationary, zero-mean,
and have finite second order joint moments.

There is serial correlation among successive sample values so that y(t)


is correlated with its past values y(t − 1), y(t − 2), y(t − 3), . . .. We don’t
care much about u since it is going to be observed exactly. The simplest
generalization of regression models to describe serial correlation is
y(t) + Σ_{k=1}^{n} a_k y(t − k) = Σ_{k=1}^{m} b_k u(t − k) + e(t) ,    (ARX)
where e := {e(t), t ∈ Z} is a process of random errors.
87
In econometric applications the variable u is an external forcing term called
exogenous variable. Sometimes you want to describe how y changes in
time as a consequence of time-varying exogenous variables. If there is
no u then the model is called (purely) Auto-Regressive and is denoted by
the acronym AR.
y(t) + Σ_{k=1}^{n} a_k y(t − k) = e(t) ,    (AR)
There are also more complicated models: ARMA, ARMAX, GARCH etc..

88
We shall assume that e is an i.i.d. process. The model ARX depends on
p := n + m unknown parameters written as a column vector:
θ := ( a_1 . . . a_n  b_1 . . . b_m )^T
and can be written in regression form as

y(t) = ϕ (t)> θ + e(t) .


where

ϕ(t)^T = ( −y(t − 1) . . . −y(t − n)  u(t − 1) . . . u(t − m) )
(so ϕ (t) is a column vector). The function of the past data

ŷθ (t | t − 1) = ϕ (t)> θ
is the (one step ahead) predictor function associated to the model. Note
that the predictor function is a linear function of θ and of the previous n + m
past samples of the joint process.
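A simulation sketch of this setup (the orders n = 2, m = 1, the coefficient values and the white inputs are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400
a = np.array([-1.2, 0.5])      # a_1, a_2 (roots of z^2 - 1.2 z + 0.5 lie inside the unit circle)
b = np.array([0.8])            # b_1
u = rng.normal(size=N)         # exogenous input
e = 0.1 * rng.normal(size=N)   # i.i.d. disturbance

y = np.zeros(N)
for t in range(N):
    # ARX recursion: y(t) = -sum_k a_k y(t-k) + sum_k b_k u(t-k) + e(t)
    y[t] = -sum(a[k] * y[t - 1 - k] for k in range(2) if t - 1 - k >= 0) \
           + (b[0] * u[t - 1] if t >= 1 else 0.0) + e[t]

# Regressor phi(t)^T = [-y(t-1), -y(t-2), u(t-1)] and predictor phi(t)^T theta
t = 10
phi_t = np.array([-y[t - 1], -y[t - 2], u[t - 1]])
theta = np.concatenate([a, b])
print(phi_t @ theta, y[t] - e[t])   # the two coincide
```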

89
PEM IDENTIFICATION OF TIME SERIES

We want to estimate the parameter θ from observed data (yN , uN ).

Often we do not know the probability distribution of the error process. We


may assume it is Gaussian but we shall see that this assumption is not so
useful. We shall try to get by with just assuming that e is an i.i.d. process.

We shall use the Prediction Error Minimization (PEM) approach. This


is a variant of Empirical Risk Minimization (ERM) in Machine Learning.

Given the model (ARX) one does as follows:

90
1. For a generic value of θ, construct a predictor of the next output y(t)
based on data up to time t − 1. For each θ the predictor is a determin-
istic function of the past data, denoted ŷθ(t | t − 1). For analysis pur-
poses we may also consider ŷθ(t | t − 1) as a function of θ and of the
past random observed data; it is then itself a random variable.

2. Form the empirical prediction errors incurred by using θ :

εθ (t) := y(t) − ŷθ (t | t − 1) ; t = 1, 2, . . . , N


these are numbers but may also be interpreted as sample values of
random variables, written ε_θ(t).

3. Minimize with respect to θ the empirical average prediction error


V_N(θ) := (1/N) Σ_{t=1}^{N} ε_θ(t)²

91
More generally one may introduce a discount factor for past errors: a positive
sequence β(N, t),

V_N(θ) := (1/N) Σ_{t=1}^{N} β(N, t) ε_θ(t)² ,    β(N, t) > 0 .

The function β gives small weight to errors incurred at the beginning (small t).
One designs the weighting function so that β(N, t) → 1 as N → ∞.
The parameter estimate

θ̂_N := Arg min_θ V_N(θ)
becomes a function of the data (yN , uN ). For the simple ARX model it can
be computed explicitly.
Define as an estimate of λ 2 = var{e(t)}, the residual quadratic error,

λ̂N2 := VN (θ̂N )
where VN is defined above.
92
ARX MODEL ESTIMATION
Assume we have data {y(t), u(t) ; t = t0, t0 + 1, . . . , 0, 1, 2, . . . , N} to be de-
scribed by an (ARX) model using the PEM method. Write a N-vector model
for all data as
y = Φ_N θ + e

where y and e are column vectors of dimension N with components y(t) and e(t)
for t = 1, 2, . . . , N, and Φ_N is an N × p matrix of past data:

Φ_N := [ ϕ(1)^T ; · · · ; ϕ(N)^T ]  (one row ϕ(t)^T per observation),
Assuming the initial time t0 is far enough, we can fill in ΦN with data from
time say t = 1. The N-dimensional vector of predictors and prediction errors
are ŷθ = ΦN θ , ε θ = y − ΦN θ . Then VN (θ ) is the squared Euclidean
norm of ε θ ,
V_N(θ) = (1/N) Σ_{t=1}^{N} [ y(t) − ϕ(t)^T θ ]² = (1/N) ‖ y − Φ_N θ ‖² .
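A self-contained sketch of the resulting least-squares computation (the ARX data are simulated with scipy.signal.lfilter using the same illustrative coefficients as in the earlier sketch; np.linalg.lstsq plays the role of the minimizer of V_N):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
N = 400
a, b = [-1.2, 0.5], [0.8]        # true ARX coefficients (illustrative)
u = rng.normal(size=N)
e = 0.1 * rng.normal(size=N)
# y(t) - 1.2 y(t-1) + 0.5 y(t-2) = 0.8 u(t-1) + e(t), simulated by superposition
y = lfilter([0.0, b[0]], [1.0] + a, u) + lfilter([1.0], [1.0] + a, e)

# Build Phi_N with rows phi(t)^T = [-y(t-1), -y(t-2), u(t-1)]
ts = np.arange(2, N)
Phi = np.array([[-y[t - 1], -y[t - 2], u[t - 1]] for t in ts])

theta_hat, *_ = np.linalg.lstsq(Phi, y[ts], rcond=None)
print(theta_hat)                 # close to [a_1, a_2, b_1] = [-1.2, 0.5, 0.8]
```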

93
DETERMINISTIC VECTOR LEAST SQUARES
Hence the estimation leads to a vector Least Squares Regression Prob-
lem. More generally to weighted Least squares: if the t-th measurement
is more reliable, weight the t-th prediction error by a larger weight q_t². If
the error variances are approximately known, the optimal choice is to take
q_t² = 1 / var(e(t)), the inverse of the error variance.

Linear Regression: Given a deterministic linear model class { ŷθ = Xθ , θ ∈


R p } where X is a known N × p matrix. Choose the unknown p-dimensional
parameter θ in such a way to describe in the best way a vector of observed
data y ∈ RN .
Let Q = diag{q21, . . . , q2N } be a matrix of weights. Problem is to minimize with
respect to θ the quadratic form
VQ(θ ) = [y − Xθ ]>Q [y − Xθ ] := ky − Xθ k2Q .
(we forget the factor 1/N here) where Q = Q^T is a positive definite weight matrix. The minimiza-
tion can be done by elementary calculus. However it is more instructive to
do this by geometric means.
94
GEOMETRY OF VECTOR LEAST SQUARES
Make RN into an inner product space by introducing the inner product
⟨x, y⟩_Q = x^T Q y = Σ_{k=1}^{N} x_k q_k² y_k

and let the corresponding norm be denoted by ‖x‖²_Q := x^T Q x. Note that for
Q = I the norm ‖x‖²_Q is just Σ_k x_k².

Let X be the linear subspace of RN linearly spanned by the columns of the


 
matrix X = ( x_1 . . . x_p ), that is

X := { Σ_{k=1}^{p} θ_k x_k ,   θ_1, . . . , θ_p arbitrary real numbers }
Then the minimization of ky − Xθ k2Q is just the minimum distance problem
of finding the vector ŷ ∈ X of shortest distance from the data vector y
according to the distance defined by the norm k · kQ.
95
THE GEOMETRY OF LEAST SQUARES

[Figure: the data vector y, the subspace X ⊂ R^N spanned by the columns of X, and its element X θ̂ closest to y.]
X θ̂ is the Orthogonal Projection of y ∈ RN onto X.

96
THE ORTHOGONALITY PRINCIPLE

Want to find the vector ŷ ∈ X of shortest distance from the data vector y.
Theorem 19. The minimizer of VQ(θ ) = ky − Xθ k2Q must make the error
y − Xθ orthogonal (according to the scalar product hx, yiQ) to the sub-
space X, or, equivalently, to the columns of X, that is

X >Q(y − Xθ ) = 0 ,
Equivalently the optimal θ must solve the Normal Equations

X >Q X θ = X >Q y .

The normal equations of Least-Squares for Q = IN :

X> X θ = X> y.
To solve these we need invertibility of X^T X.

97
Let us now assume that
rank X = p ≤ N . (25)
This is an identifiability condition of the model class. Each model corre-
sponds 1 : 1 to a unique value of the parameter. Under this condition the
Normal Equations have a unique solution, which we denote θ̂(y), given by

θ̂(y) = [ X^T Q X ]^{−1} X^T Q y ,    (26)


which is a linear function of the observations y. For short we shall denote
θ̂θ (y) = Ay. Then X θ̂θ (y) := XAy is the orthogonal projection of y onto the
subspace X = span (X). In other words the matrix P ∈ RN×N , defined as

P = XA ,
is the orthogonal projector, with respect to the inner product h·, ·iQ, from
RN onto X. In fact P is idempotent (P = P2), since

XA · XA = X · I · A = XA
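A numerical sketch of the weighted least-squares solution (26) and of the projector P (the random X, Q and y below are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 50, 3
X = rng.normal(size=(N, p))                  # full column rank with probability 1
Q = np.diag(rng.uniform(0.5, 2.0, size=N))   # positive definite diagonal weights
y = rng.normal(size=N)

A = np.linalg.solve(X.T @ Q @ X, X.T @ Q)    # A = [X^T Q X]^{-1} X^T Q
theta_hat = A @ y                            # formula (26)
P = X @ A                                    # projector onto span(X)

print(np.allclose(P @ P, P))                 # idempotent: P^2 = P
print(np.allclose(X.T @ Q @ (y - X @ theta_hat), np.zeros(p)))   # normal equations hold
```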

98
STATISTICAL ANALYSIS

Assume that the error e in the model

y = Xθ + e . (27)
is a zero-mean random vector with known variance R = E ee^T, positive
definite, and that the rank condition (25) holds.
Theorem 20. The estimator θ̂(y) = Ay is unbiased and, for the weight choice Q = R^{−1}, has variance matrix

Var θ̂(y) = [ X^T R^{−1} X ]^{−1} .

If {e_k, k = 1, 2, . . . , N} are i.i.d., that is R = σ² I, then

Var θ̂(y) = σ² [ X^T X ]^{−1} .

100
In the ARX case just assume Q = IN (N × N identity matrix) then the PEM
estimator of θ is
θ̂_N = [ Φ_N^T Φ_N ]^{−1} Φ_N^T y

which can also be rewritten

θ̂_N = [ Σ_{t=1}^{N} ϕ(t) ϕ(t)^T ]^{−1} Σ_{t=1}^{N} ϕ(t) y(t) ,
where we assume that the inverse exists for suitably large N.

• What are the statistical properties of this estimator?


• Note that θ̂_N is a nonlinear function of the observed data. We don't know if
it is unbiased; even if y and u were Gaussian the Pdf of θ̂N is impossible to
compute. Can only try to see what happens for N → ∞.

101
CONSISTENCY OF THE PEM ESTIMATOR

(Preview)

Theorem 21. Assume there is a true model describing the data having
orders n, m as in the candidate ARX model and true parameter θ0. Assume
also that the true model is causal that is

E θ0 ϕ (t) e(t) = 0 ; ∀t ∈ Z ; (INNOV)


and that E_θ0 ϕ(t) ϕ(t)^T > 0; then

lim_{N→∞} θ̂_N = θ0
with probability one.

102
STRONG CONSISTENCY OF THE PEM
ESTIMATOR
Rewrite θ̂_N as

θ̂_N = [ (1/N) Σ_{t=1}^{N} ϕ(t) ϕ(t)^T ]^{−1} (1/N) Σ_{t=1}^{N} ϕ(t) y(t) ;    (EST)
substitute y(t) = ϕ(t)>θ0 + e(t) (true model) and define the sample covari-
ance matrix of ϕ (t)
Σ̂_N := (1/N) Σ_{t=1}^{N} ϕ(t) ϕ(t)^T ∈ R^{p×p} .
For notation simplicity we do the case of no exogenous input (u ≡ 0).
Lemma 3. If {e} is an i.i.d. process then {y} is ergodic and Σ̂_N converges
almost surely for N → ∞ to the positive semidefinite covariance matrix

Σ := E_θ0 { ( y(t − 1) · · · y(t − n) )^T ( y(t − 1) · · · y(t − n) ) } .

103
PROOF OF STRONG CONSISTENCY OF THE
PEM ESTIMATOR
Proof. The ergodicity follows from Corollary 4 but we shall need to under-
stand why y(t) admits such a representation. We shall do that in the next
slides

Now just go to the limit in formula (EST)

lim_{N→∞} [ (1/N) Σ_{t=1}^{N} ϕ(t) ϕ(t)^T ]^{−1} (1/N) Σ_{t=1}^{N} ϕ(t) ( ϕ(t)^T θ0 + e(t) )

to get, by ergodicity and in virtue of the two main assumptions,

lim_{N→∞} θ̂_N = Σ^{−1} Σ θ0 = θ0 ,

where Σ := E_θ0 ϕ(t) ϕ(t)^T > 0. For ARMAX models we need to assume that
the input process u is also ergodic and uncorrelated with the noise input.
Of course the reason why the two main assumptions should hold needs to
be investigated.
104
SOLVING DIFFERENCE EQUATIONS
Linear difference equations (with constant coefficients) arise as determin-
istic mathematical models of many physical or economic systems. They
may be written as
y(t) + Σ_{k=1}^{n} a_k y(t − k) = f(t) ,    t ∈ Z    (28)

or, equivalently, as

y(t + n) + Σ_{k=1}^{n} a_k y(t + n − k) = g(t) ,    t ∈ Z    (29)
where f (t) or g(t) = f (t + n) are exogenous signals. To find a solution first
look at the homogeneous case where f (t) = 0. Try with a simple exponen-
tial y(t) = λ^t, which leads to

λ^{t+n} + Σ_{k=1}^{n} a_k λ^{t+n−k} = 0
105
since we want λ ≠ 0 we can collect λ^t and end up with an algebraic equation
of degree n

λ^n + Σ_{k=1}^{n} a_k λ^{n−k} = 0
which is the characteristic equation of the system. It has n complex so-
lutions λk ; k = 1, 2, . . . , n not necessarily distinct. Hence any solution must
be a linear combination of these n exponentials. Assuming all roots are
distinct, the general solution turns out to be
y(t) = Σ_{k=1}^{n} c_k λ_k^t

where all coefficients c_k can be determined from the initial conditions, say
y(0), y(1), . . . , y(n−1). Note that in general the λ_k may be complex numbers.
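A sketch of this computation with numpy (the coefficient values are arbitrary illustrations):

```python
import numpy as np

a = [-1.2, 0.5]                  # a_1, ..., a_n of the difference equation
char_poly = [1.0] + list(a)      # lambda^n + a_1 lambda^{n-1} + ... + a_n
roots = np.roots(char_poly)

print(roots)                     # characteristic roots lambda_k
print(np.abs(roots) < 1)         # all True -> stable, summable impulse response
```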
CONVOLUTION AND DIFFERENCE EQUATIONS

Suppose now that you want to solve


y(t + n) + Σ_{k=1}^{n} a_k y(t + n − k) = δ(t) ;    t ∈ Z    (DE)
where the input sequence δ (t) is equal to 1 for t=0 and zero otherwise. This
function is called the elementary or unit impulse function. You can work out
an equivalent homogeneous equation to (DE) by solving for successive n
initial conditions the system of equations obtained by writing (DE) at times
t = 0, 1, . . . , n − 1.
Let’s denote the solution by h(t); this is called the impulse response of the
system.

106
Once we have h(t) we can solve the equation for an arbitrary input f (t).
Note that the system represented by the DE operates a linear transfor-
mation on the input f (t). Since any input function can be expressed as
a (possibly infinite) linear combinations of impulse functions located at all
times t = k,
f(t) = Σ_{k=−∞}^{+∞} f(k) δ(t − k) ,    t ∈ Z
because of linearity the response of the system can be written as a sum
of infinitely many impulse responses to the δ (t − k)’s each located at times
t = k and weighted by amplitude f (k). This leads to the convolution rep-
resentation
y(t) = Σ_{k=−∞}^{+∞} f(k) h(t − k)    ⇔    y(t) = Σ_{k=−∞}^{+∞} h(k) f(t − k)
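A sketch of the impulse response and the convolution representation using scipy.signal.lfilter (the same illustrative coefficients as before):

```python
import numpy as np
from scipy.signal import lfilter

a = [1.0, -1.2, 0.5]          # y(t) - 1.2 y(t-1) + 0.5 y(t-2) = f(t)
b = [1.0]

delta = np.zeros(50); delta[0] = 1.0
h = lfilter(b, a, delta)      # impulse response h(0), h(1), ..., h(49)

f = np.random.default_rng(7).normal(size=200)   # an arbitrary input
y_filter = lfilter(b, a, f)                     # solve the difference equation
y_conv = np.convolve(f, h)[:200]                # truncated convolution sum
print(np.max(np.abs(y_filter - y_conv)))        # tiny (h decays geometrically)
```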

107
Let’s go back to the AR model. Now the input is an i.i.d. process {e(t)}.
We can still write the solution as a convolution sum
y(t) = Σ_{k=−∞}^{+∞} h(k) e(t − k)
which is of the same form of the representation (CONV) on p. 78. We need
however to check under what circumstances the convergence condition
Σ_{k=−∞}^{+∞} |h(k)| < ∞ is satisfied.
Lemma 4. If and only if all roots of the characteristic equation have modu-
lus strictly less than 1; i.e |λk | < 1 ; k = 1, 2, . . . , n, one has :
1. Causality i.e. h(t) = 0 for t < 0
2. Stability, i.e. Σ_{k=0}^{+∞} |h(k)| < ∞ .

108
Clearly when |λ_k| < 1 then lim_{t→+∞} λ_k^t = 0 (true also for multiple roots) and the
impulse response is summable.

If some |λ_k| > 1 then lim_{t→+∞} |λ_k^t| = ∞: a summable impulse response cannot
be causal, and a causal one is not summable.

If λ_k is complex with |λ_k| = 1, say λ_k = e^{iω}, then, since the characteristic
polynomial is real, such roots always come in complex conjugate pairs and
contribute terms of the form

a_k e^{iωt} + ā_k e^{−iωt} = 2 Re(a_k) cos ωt − 2 Im(a_k) sin ωt ,

an oscillatory behaviour. In this case the process y is not ergodic.

109
CAUSALITY

When |λk | < 1 ; k = 1, 2, . . . , n the process y is ergodic but more is true. Since
h(t) = 0 for t < 0, one can write
y(t) = Σ_{k=0}^{+∞} h(k) e(t − k) = Σ_{k=−∞}^{t} h(t − k) e(k)
so y(t) depends only on the past history of {e(t)}. In general infinite,
since h(t) is non zero for all t ≥ 0. Write
y(t − i) = Σ_{k=−∞}^{t−i} h(t − i − k) e(k) ;    a linear function of past e's at times ≤ t − i .
Therefore, since the e(t) are uncorrelated,

E e(t) y(t − i) = 0 ,    i = 1, 2, . . . , n .
If there is an input process this needs to be assumed.

110
CAUSALITY AND THE PREDICTION ERROR

For a causal AR model


y(t) = − Σ_{k=1}^{n} a_k y(t − k) + e(t)

the two terms on the right side are uncorrelated.
The first term is the conditional expectation of y(t) given its (infinite)
past history up to time t − 1. This is the (optimal) one-step-ahead predictor
of y(t) given its own past. The i.i.d. process e(t) is then the (random)
prediction error.

This is true also if there is an external input assumed independent of e,
provided all roots of the characteristic equation of the model are in absolute
value less than one. The i.i.d. process e then has the interpretation of
one-step-ahead prediction error of y(t) given the joint past of y and u up to
time t − 1.

111
CONDITIONAL EXPECTATION
Denote by L2(y), the vector space of statistics of the process y which have
finite second order moment; that is f (y(s) ; s ∈ Z) with E f (y)2 < ∞.

Let x be a random variable of the same experiment. We want to find the


statistic in L2(y) which approximates x in some optimal way; that is such
that the expected square error
E [ x − f (y) ]2 , f (y) ∈ L2(y)
is minimal. The solution of this problem is the conditional expectation
denoted E (x | y), of x given y. We have the following (stochastic) orthogo-
nality principle
Theorem 22. The conditional expectation is the unique random variable in
L2(y) satisfying the orthogonality condition:

x − E (x | y) ⊥ L2(y) (30)
where the orthogonality is with respect to the covariance inner product
hx , z i := cov {x , z}.
112
MARTINGALE-DIFFERENCE PROCESSES

For AR models with i.i.d. noise input, the (one-step-ahead) prediction error
turns out to coincide with e(t) itself and is (statistically) independent of the
optimal predictor. This was defined as the conditional expectation of y(t)
given the past history of the process up to time t − 1:
ŷ(t | t − 1) = E [ y(t) | yt−1 ] .

Denote the prediction error by

ỹ(t) := y(t) − ŷ(t | t − 1)


By the orthogonality property of conditional expectation ỹ(t) is uncorre-
lated with, but not necessarily independent of, all statistics of the past, that is

E ỹ(t) f (yt−1) = 0 , ∀ f ∈ L2(yt−1)


This is the martingale difference property which is weaker and general-
izes the i.i.d. property.
113
Since E {ỹ(t) | yt−1} = 0 ,

E ỹ(t) = E { E [ ỹ(t) | yt−1 ]} = 0


Often the conditional variance E {ỹ(t)ỹ(t)> | yt−1} does not depend on the
conditioning variables, that is

E {ỹ(t) ỹ(t)^T | yt−1} = E {ỹ(t) ỹ(t)^T} = Var (ỹ(t)) ;


we say that ỹ has a non random conditional variance.

For ARX and ARMAX models the predictor also depends on the past of
some exogenous variable u. The definition is generalized as conditional
expectation given the joint past histories (yt−1, ut−1) and

ỹ(t) = y(t) − E {y(t) | yt−1, ut−1} t ∈Z


In this case one replaces L2(yt−1) with L2(yt−1 ut−1).

114
GENERAL D-MARTINGALES
Definition 10. Let {zt ; t ∈ Z} be a stationary vector process and consider
the sequence of subspaces L2(z t ), which is non-decreasing i.e. L2(z t ) ⊂
L2(zt+1). A stochastic process {x(t) ; t ∈ Z}is a martingale difference, or
briefly, a d-martingale with respect to the family {L2(z t )}, if,

◦ For all t, x(t) ∈ L2(z t ) ; i.e. x(t) is itself a statistic of the past history of z at
time t, having finite variance.

◦ x(t + 1) is uncorrelated with all random variables in the space L2(z t ) that
is E {x(t + 1) f (z t )} = 0 for all f (z t ) ∈ L2(z t ) which is equivalent to:

E {x(t + 1) | z t } = 0 t ∈ Z. (DMART)

Note that by the very notion of a projection,

E {x(t) | z s} = E {E [x(t) | zt−1] | z s} = 0 ∀s < t .


In particular a d-martingale has mean zero.
115
Example:
Let ỹ(t) be the random prediction error defined in a previous slide and con-
sider the process

x(t) := ϕ (t) ỹ(t), ϕ (t) ∈ L2(yt−1, ut−1)


where ϕ (t) = ϕ (yt−1, ut−1) is a function of the past history of y, u up to time
t − 1. This is also a d-martingale; in fact

E {x(t) | L2(yt−1, ut−1)} = ϕ (t) E {ỹ(t) | L2(yt−1, ut−1)} = 0 t ∈ Z.


A martingale is the discrete integral of a d-martingale,
m(t) = m(0) + Σ_{s=1}^{t} x(s)    (31)

It is a non-stationary process, a generalization of the well-known random walk
process.

116
The following lemma generalizes the sum-of-variances property, known to
hold for sums of i.i.d. processes, to d-martingales.
Lemma 5. For every d-martingale x one has
Var { Σ_{t=1}^{N} x(t) } = Σ_{t=1}^{N} Var {x(t)}    (32)

When the d-martingale x is stationary the second member is just N σ_x². This
property holds unchanged in the vector case.

Proof.
E { Σ_{t=1}^{N} x(t) }² = E { x(1)² + x(2)² + . . . + x(N)² } + 2 Σ_{t>s} E { x(t) x(s) }
                       = Σ_{t=1}^{N} Var {x(t)} + 2 Σ_{t>s} E x(t) x(s)
But the last term is zero since for t > s, x(s) ∈ L2(z s), and by the d-martingale
property
E x(t)x(s) = E {E [ x(t)x(s) | z s]} = E { x(s) E [ x(t) | z s] } = 0
117
THE CLT FOR D-MARTINGALES
Theorem 23 (Levy, Doob, Billingsley, Ibragimov). Let {x(t)} be a stationary
(vector) d-martingale having constant conditional variance, equal to Σx =
E x(t)x(t)>. One has
√T x̄_T →^L N(0, Σ_x)    (33)

that is the modified sample mean √T x̄_T converges in distribution to a mul-
tivariate Gaussian distribution of mean zero and variance Σx.

Proof (for the scalar case): We shall use the conditional characteristic func-
tion, substituting the variable it with iλ . By stationarity the variances of the
random variables {x(t)} are uniformly bounded and φx(t)(iλ | zt−1) admits
a second derivative at λ = 0 equal to the (conditional) variance, σ 2, of x(t).
One can write
E [ e^{iλ x(t)} | zt−1 ] = E [ 1 + iλ x(t) − (λ²/2) x(t)² + η(λ, x(t)) | zt−1 ] = 1 − σ²λ²/2 + o(λ²)

where o(λ²) is a random variable in L2(zt−1) which tends to zero as λ → 0
faster than λ². Call φ_T(λ) the characteristic function of the sum
118
x̄(T) := Σ_{t=1}^{T} x(t). Then
φ_T(λ) = E { E [ e^{iλ x(T)} | zT−1 ] e^{iλ x̄(T−1)} }
       = ( 1 − σ²λ²/2 ) E { e^{iλ x̄(T−1)} } + E { o(λ²) e^{iλ x̄(T−1)} }
       = ( 1 − σ²λ²/2 ) φ_{T−1}(λ) + ō(λ²)
where ō(λ 2) is the expected value of a variable in L2(zT −1) having the
same absolute value of o(λ 2) hence tending to zero faster than λ 2. Solving
the difference equation one finds
φ_T(λ) = ( 1 − σ²λ²/2 )^T + ō_T(λ²)
where ōT (λ 2) is still infinitesimal of higher order than λ 2 for λ → 0.
Now, the characteristic function of s(T) := (1/√T) Σ_{t=1}^{T} x(t) is the same
function φ_T computed at λ/√T, so that

φ_T( λ/√T ) = ( 1 − σ²λ²/(2T) )^T + ō_T( λ²/T )
where the second term tends to zero for T → ∞, for whatever fixed value of
λ, while the limit of the first term is the well known function exp{ −σ²λ²/2 }.
Hence the characteristic function of s(T ) converges pointwise to that of the
Gaussian N(0, σ 2). 2

119
THE CLT FOR THE PEM ESTIMATOR
We shall consider simultaneously models of AR, ARX, ARMA, ARMAX
type depending smoothly on a p-dimensional parameter θ . Assume at
least local identifiability about the true value θ0 of the true model which
has generated the data. Assume also that the true model generates data
which are stationary and ergodic and that the PEM estimator is consistent.

Each candidate model defines a probability density which belongs to a


parametric class {p(y | θ )} each member being uniquely defined by just
selecting a p-dimensional parameter value θ . So all models in the class
have the same order or complexity. In particular the true model belongs to
this class and is uniquely described by assigning the true parameter θ0.

We assume that the one-step-ahead predictor function ŷθ (t | t − 1) is sta-


tionary and has a known expression. Our goal is to study the asymptotic
distribution of minimizer(s) of some generalized average square prediction
error criterion which we shall denote VN (θ ).
120
Recall that each candidate minimizer must solve the gradient equation,
∂V_N(θ)/∂θ := V_N(θ)′ = 0 .    (a p-vector function)
To approximate V_N(θ̂_N)′ use Taylor's formula about the point θ = θ0 trun-
cated to the first order. There is always some point θ̄ such that, exactly,

V_N(θ̂_N)′ = V_N(θ0)′ + V_N″(θ̄) (θ̂_N − θ0) = 0    (34)

where V_N″(θ̄) is the second derivative (Hessian) matrix computed at some
unknown θ̄ which must belong to the p-dimensional interval having extremes
θ0 and θ̂_N, that is, each component θ̄^k lies between θ0^k and θ̂_N^k, k = 1, 2, . . . , p.
Assuming the Hessian matrix is invertible, from (34) one gets

θ̂_N − θ0 = − [ (1/2) V_N″(θ̄) ]^{−1} (1/2) V_N(θ0)′    (35)

where the factor 1/2 is introduced for convenience.
121
Let’s compute the gradient and the Hessian matrix using the expression
V_N(θ) = (1/N) Σ_{t=1}^{N} ε_θ(t)². Define:

ψ_θ(t) := ∂ε_θ(t)/∂θ = − ∂ŷ_θ(t | t − 1)/∂θ

to get

(1/2) V_N′(θ) = (1/N) Σ_{t=1}^{N} ψ_θ(t) ε_θ(t)    (36)

(1/2) V_N″(θ) = (1/N) Σ_{t=1}^{N} [ ψ_θ(t) ψ_θ(t)^T + ε_θ(t) ∂²ε_θ(t)/∂θ_i ∂θ_j ]    (37)
The asymptotic behaviour of the second derivative is discussed in the fol-
lowing lemma,
Lemma 6. One has
lim_{N→∞} (1/2) V_N″(θ̄) = E_θ0 { ψ_θ0(t) ψ_θ0(t)^T }    (38)
with probability one.
122
Proof : By consistency, θ̂θ N → θ0 and hence also θ̄θ → θ0 (with probability
one). Since the time average in (37) converges to the expectation, we
have,
(1/2) V_N″(θ̄) → E_θ0 { ψ_θ0(t) ψ_θ0(t)^T + ε_θ0(t) [ ∂²ε_θ(t)/∂θ_i ∂θ_j ]_{θ=θ0} }
almost surely. Since the true model belongs to the model class, ε θ0 (t) =
e0(t) where e0(t) is the true random prediction error (a d-martingale). Fi-
nally, both the gradient ψ_θ(t) and the second derivative of ŷθ (t | t − 1)
are functions (often linear functions) only of the past data (yt−1, ut−1); hence all
entries in the second derivative matrix in the second member are uncorre-
lated with e0(t) and the expectation of the last term is zero. 2

Note now that the last factor in (35) has the expression

(1/2) V_N′(θ0) = (1/N) Σ_{t=1}^{N} ψ_θ0(t) e0(t)    (39)

123
AN INTERMEDIATE CLT
Theorem 24. Assume the prediction error e0 is a stationary d-martingale
with respect to the flow of past data (yt , ut ) having finite variance. Then
also the process {ψ_θ0(t) e0(t)} is a d-martingale and

√N (1/2) V_N′(θ0) →^L N(0, Q) .    (40)

If the conditional variance of e0(t) does not depend on the data (yt−1, ut−1),
that is if

E_0{ e0(t)² | yt−1, ut−1 } = E_0{ e0(t)² } = σ0² ,    (41)

the asymptotic variance Q is given by the formula

Q = σ0² E_0{ ψ_θ0(t) ψ_θ0(t)^T } .    (42)

124
Proof : The first statement follows from the previous observation that
{ψ_θ0(t) e0(t)} is also a d-martingale with respect to the past data flow {yt , ut }.

The second is a corollary of the CLT for d-martingales (Theorem 23). The expression
for the variance matrix Q follows from the property (41), which implies

Var {ψ_θ0(t) e0(t)} = E_0{ E_0[ e0(t)² | yt−1, ut−1 ] ψ_θ0(t) ψ_θ0(t)^T } = E_0{ e0(t)² } E_0{ ψ_θ0(t) ψ_θ0(t)^T } .
2

Obviously the theorem includes the case where e0 is an i.i.d. process but
applies to a much wider variety of cases.

125
THE ASYMPTOTIC DISTRIBUTION OF THE
PEM ESTIMATOR

The following theorem is a fundamental result in time series analysis. It ac-


tually includes asymptotic normality of the Maximum Likelihood estimate,
which is asymptotically equivalent to PEM.
Theorem 25. Under the same assumptions of Theorem 24 the PEM esti-
mator has a Gaussian asymptotic distribution as described by,
√N ( θ̂_N − θ0 ) →^L N(0, P) ,    (43)

where the asymptotic variance matrix P is given by the expression

P = σ0² [ E_0{ ψ_θ0(t) ψ_θ0(t)^T } ]^{−1}    (44)

and the inverse of the matrix between square brackets exists.
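A Monte Carlo sketch of (43)-(44) for a scalar AR(1) model y(t) + a0 y(t−1) = e(t) (the value a0 = −0.7, the noise level and the sample sizes are illustrative choices; for this model ψ_θ0(t) = y(t−1), so P = σ0² / E y(t−1)² = 1 − a0²):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(8)
a0, sigma0, N, reps = -0.7, 1.0, 5_000, 2_000    # AR(1): y(t) + a0 y(t-1) = e(t)

z = np.empty(reps)
for r in range(reps):
    e = sigma0 * rng.normal(size=N)
    y = lfilter([1.0], [1.0, a0], e)             # simulate the AR(1) recursion
    phi, yy = -y[:-1], y[1:]                     # regressor phi(t) = -y(t-1)
    a_hat = (phi @ yy) / (phi @ phi)             # PEM / least-squares estimate
    z[r] = np.sqrt(N) * (a_hat - a0)

print(z.var())    # close to P = 1 - a0^2 = 0.51
print(z.mean())   # close to 0
```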

126
Proof: Follows from (35). For N → ∞
 −1
N N
1 1
ψ θ̄θ (t)>
X X
θ̂θ N − θ0 ' −  ψ θ̄θ (t)ψ ψ θ0 (t)εε θ0 (t)
N N
k=1 k=1
and use the third statement of Slutsky. The expression for P follows from
P = [ E_θ0{ ψ_θ0(t) ψ_θ0(t)^T } ]^{−1} Q [ E_θ0{ ψ_θ0(t) ψ_θ0(t)^T } ]^{−1}

where Q is the asymptotic variance of the limit (40). The invertibility of P is


equivalent to that of the Fisher matrix. 2

127
ASYMPTOTIC VARIANCE
Recall that the variance of a consistent estimator must tend to zero as
N → ∞. The concept of asymptotic variance of a consistent estimator must
therefore be defined properly. Here is one possible definition.
Definition 11. Let {φN (y) ; N = 1, 2, . . .} be a consistent sequence of esti-
mators of the parameter θ and d(N) a function of N which is increasing to
+∞ with N and strictly positive. One says that φN(y) has asymptotic variance
Σ if

√d(N) [ φN(y) − θ0 ] →^L D(0, Σ)

where D(0, Σ) is a pdf having variance Σ, possibly depending on θ0,
which is finite and strictly positive definite.

Hence for N large the variance of φN(y) can be approximated by (1/d(N)) Σ. In


most asymptotically normal examples discussed above d(N) can be taken
equal to N.
The condition of strict positivity Σ > 0 is essential since it excludes the pos-
sibility of linear combinations of the components of φN (y) whose variance
tends to zero, which just means that the order of infinitesimal of the vari-
ance of these combinations will be different from O(1/d(N)).
128
ASYMPTOTIC EFFICIENCY
Discussion in Lehmann p.510

One may compare consistent estimators based on the asymptotic vari-


ance, saying that estimator 1 is more efficient than estimator 2 when Σ1 ≤
Σ2. Unfortunately however in general the asymptotic variance is a function
of the true parameter which is unknown so it may be that for different values
of θ0 the two estimators compare in the opposite way or do not compare at
all.

For maximum likelihood under the usual identifiability condition, the asymp-
totic variance is I(θ0)−1 which in force of the Cramèr-Rao bound is the best
possible (for a fixed θ0). One usually says that maximum likelihood is an
asymptotically efficient estimator but this sentence must of course be
interpreted with a grain of salt.

129
