Slides Large Sample
GIORGIO PICCI
University of Padova, Italy
1
SCOPE OF THESE NOTES
You should all have an electronic version of this book. Please take it to class. It does not make sense for me to provide notes repeating this material. Another useful reference is
2
Outline of the Course
4. Applications
7. Time series
3
QUICK REVIEW OF PROBABILITY
4
AN ELEMENTARY EXAMPLE
Assume we are tossing a coin and let p := probability of observing “TAIL”, an event denoted by the symbol T, and 1 − p := probability that HEAD will show instead, an event denoted by the symbol C. Naturally, p is unknown. We want to obtain information on the value of p by tossing the coin N consecutive times, assuming that each toss does not influence the outcome of the other tosses.
T T C C C T T C C T T T T . . . C T   (N symbols)
Let A be the family of all subsets A of Ω. These are called Events. Example:
A = {ω | in ω there are an even number of T's}
5
Assume each toss does not influence the outcome of the other tosses.
This defines a class of probability measures which describes each toss as being independent of the others. In formulas, this means that a class of probability measures
P := {P_p ; 0 < p < 1}
is defined on {Ω, A}, where each P_p assigns to an elementary event ω ∈ Ω the Bernoulli probability
P_p({ω}) = p^{n(ω)} (1 − p)^{N−n(ω)} ,
n(ω) being the number of T's in ω.
Estimating P is just selecting a plausible value of p based on the observa-
tion of the outcomes of N successive coin tosses.
6
Alternatively, one may want to validate some a priori belief on p for example
that p = 1/2 (that is, T and C are equiprobable). In this case one deals
with an hypothesis testing problem: on the basis of some observation ω̄
decide whether P_p belongs to the class
P_0 := {P_{1/2}} ,
or P_p belongs to the complementary family
P_1 := {P_p ; p ≠ 1/2} .
7
The expected value of x_k is
E x_k = Σ_{ω_k} x(ω_k) P_p({ω_k})
Since ω_k can be T with probability p and C with probability 1 − p, it follows that E x_k = p for all k.
The probability distribution of a Binomial random variable, assuming the tosses are independent, is
P_p{ω ; s_N(ω) ≤ n} = Σ_{k=0}^{n} \binom{N}{k} p^k (1 − p)^{N−k}
This is known as the Binomial distribution and is denoted B_{p,N}(n).
8
NOTATIONS
◦ The sample size is denoted by N: lower case n is often used for dimen-
sion of vectors (either random or non-random) or degrees of freedom. So
in general n is fixed while N → ∞.
◦ Names: Pdf instead of cdf; x ∼ F means that the random variable x has Pdf F. In discrete probability spaces F(x) is a staircase function. Continuous variables admit a probability density function (pdf) p(x) := dF(x)/dx.
9
CONVERGENCE OF RANDOM VARIABLES
3. Convergence in probability:
P{ω | |x_N(ω) − x(ω)| > ε} → 0 for all ε > 0.
This is written x_N →^P x or P-lim x_N = x.
IMPLICATIONS:
1. ⇒ 3. 2. ⇒ 3.
10
CHEBYSHEV INEQUALITY
Suppose x and y have finite second moment; then for all ε > 0 and constant c
P{ |x − y| ≥ ε } ≤ (1/ε²) E[(x − y)²]
Same proof as Lemma 2.1.1 in Lehmann's book with y = c: just set z := x − y.
Theorem 1 (Theorem 2.1.1). If x_N → x in quadratic mean (q.m.) then x_N → x in probability.
11
THE BINOMIAL DISTRIBUTION
A Binomial random variable s_N is the total number n of Tails (T) in N independent Bernoulli tosses with probability p. Its distribution is
B_{p,N}(n) = Σ_{k=0}^{n} \binom{N}{k} p^k (1 − p)^{N−k}
Since s_N is the sum of N independent Bernoulli random variables x_k ; k = 1, 2, . . . , N, each taking value 1 when T is observed and 0 otherwise, whose mean is E x_k = p, we have
E s_N = E{x_1} + E{x_2} + . . . + E{x_N} = p + · · · + p  (N times)  = N p
The variance is computed by summing the variances of each x_k, which can be computed to be p(1 − p). Hence the variance of s_N is N p(1 − p) and so
var(s_N/N) = (1/N²) N p(1 − p) = p(1 − p)/N .
Then
s_N/N →^P p .   (Lehmann Example (2.1.1))
12
WEAK LAW OF LARGE NUMBERS
Let {x_1, x_2, . . . , x_N} be a sequence of random variables (a sample of size N).
Definition 1. The sample mean of the sequence is the random variable
μ̂_N := (1/N) Σ_{k=1}^{N} x_k
(also denoted x̄ or x̄_N). The sample variance of the sequence is the random variable
σ̂²_N := (1/N) Σ_{k=1}^{N} (x_k − μ̂_N)²
Theorem 2. If the random variables are independent identically distributed (i.i.d.) then
μ̂_N →^P μ = E x_k
that is, the sample mean is a consistent estimator of the mean.
13
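A small Python simulation (my own illustration, not part of the slides; the value p = 0.3 and the sample sizes are arbitrary) shows the sample mean s_N/N settling down on p as N grows:

import numpy as np

# Weak law of large numbers for Bernoulli(p) tosses:
# the sample mean s_N / N should approach p as N grows.
rng = np.random.default_rng(0)
p = 0.3                              # hypothetical "true" probability of TAIL
for N in [10, 100, 1000, 10000, 100000]:
    x = rng.binomial(1, p, size=N)   # x_k = 1 if TAIL on toss k
    print(N, x.mean())               # sample mean mu_hat_N, converging to p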
PROOF
14
CONSISTENCY OF ESTIMATORS
P0 − lim φN = θ0
That is, the sequence of random variables {φN } converges in probability
P0 to a nonrandom constant equal to the true parameter θ0.
15
CONSISTENCY AND CHEBYSHEV INEQUALITY
Chebyshev inequality holds also for random vectors (see below):
P_θ{ ‖φ_N − θ‖ ≥ ε } ≤ (1/ε²) E_θ[(φ_N − θ)ᵀ(φ_N − θ)] = (1/ε²) E_θ ‖φ_N − θ‖² ,
where φ_N := φ_N(y_1, . . . , y_N) and ‖ · ‖ is the Euclidean norm (the absolute value when φ_N is scalar). Proof: easy generalization of the proof of Lemma 2.1.1 in Lehmann's book.
If φ_N is asymptotically unbiased and
lim_{N→∞} σ²_N(θ) = 0 ;   ∀θ ∈ Θ ,
then φ_N(y_1, . . . , y_N) is (weakly) consistent. (Remember that we do not know the true value θ_0.)
Proposition 1. If φN (y1, . . . , yN ) is an asymptotically unbiased estimator
and if its scalar variance σN2 (θ ) tends to zero as N → ∞ for all θ ∈ Θ, then
φN (y1, . . . , yN ) is consistent.
16
APPLICATIONS p. 55
17
SIMPLE LINEAR REGRESSION p. 58-59
You would like to describe approximately these data by a straight line say
y = α + β x. What is the best straight line approximating the measured
data?
Suppose you model the measurement process by a statistical model
yk = α + β xk + ek , k = 1, 2, . . . , N
where the errors ek are zero-mean independent random variables with
variances σ_k². In the given experimental condition ω you have observed the values y_k(ω) = y_k ; k = 1, 2, . . . , N, corresponding to errors e_k(ω) (which you do not know).
Definition 3. The least squares estimator of the parameter (α, β) is the solution of the minimization problem
min_{(α, β)} Σ_{k=1}^{N} [ y_k − (α + β x_k) ]²
18
SOLUTION OF THE LINEAR REGRESSION PROBLEM 1
α̂_N = ȳ_N − β̂_N x̄_N
β̂_N = Σ_k (x_k − x̄_N) y_k / Σ_k (x_k − x̄_N)²
Question: are these consistent estimators of the parameters (α, β )?
The answer depends on how you describe the errors.
19
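The two formulas above translate directly into a few lines of Python; the sketch below (mine, with made-up values of α, β, σ and equally spaced regressors, used only for illustration) computes α̂_N and β̂_N and can be used to check unbiasedness and consistency numerically.

import numpy as np

rng = np.random.default_rng(1)
N = 500
alpha, beta, sigma = 1.0, 2.0, 0.5     # assumed "true" values, illustration only
x = np.linspace(0.0, 10.0, N)          # deterministic regressors
y = alpha + beta * x + rng.normal(0.0, sigma, size=N)

x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * y) / np.sum((x - x_bar) ** 2)
alpha_hat = y_bar - beta_hat * x_bar
print(alpha_hat, beta_hat)             # close to (alpha, beta) for large N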
SOLUTION OF THE LINEAR REGRESSION PROBLEM 2
E α̂_N = E(ȳ_N − β̂_N x̄_N)
E β̂_N = Σ_k (x_k − x̄_N) E y_k / Σ_k (x_k − x̄_N)²
20
PROOF OF UNBIASEDNESS
E α̂_N = α + E[(β − β̂_N) x̄_N]
On the other hand E y_k = α + β x_k, and so
E(β̂_N − β) = Σ_k (x_k − x̄_N)(α + β x_k) / Σ_k (x_k − x̄_N)² − β = β Σ_k (x_k − x̄_N) x_k / Σ_k (x_k − x̄_N)² − β = 0
since Σ_k (x_k − x̄_N) α = 0 and likewise Σ_k (x_k − x̄_N) x̄_N = 0.
21
CONSISTENCY OF THE LEAST SQUARES ESTIMATORS
First look at
β̂_N = Σ_k (x_k − x̄_N) y_k / Σ_k (x_k − x̄_N)² := Σ_k w_k y_k
and since the y_k are independent (as the e_k are)
var(β̂_N) = Σ_k w_k² var(y_k) = Σ_{k=1}^{N} w_k² σ_k²
where σ_k² = var(e_k). For convergence in probability of β̂_N to β we need
lim_{N→∞} Σ_{k=1}^{N} w_k² σ_k² = 0
which, in case σ_k² = σ² independent of k, becomes
Σ_{k=1}^{N} w_k² = Σ_k (x_k − x̄_N)² / [ Σ_k (x_k − x̄_N)² ]² = 1 / Σ_k (x_k − x̄_N)² → 0 .
22
This is the same as
Σ_{k=1}^{+∞} (x_k − x̄_N)² = ∞
which means that the points x_k should not remain too close to their sample mean. Since
α̂_N = α + (β − β̂_N) x̄_N + (1/N) Σ_{k=1}^{N} e_k
and, under this condition, both the last two terms converge to zero in probability, it is easy to see that the estimator α̂_N is also consistent in probability.
23
CONVERGENCE IN DISTRIBUTION
Definition 4. A sequence of Pdf's {F_N} (maybe multivariable) converges in law to a Pdf F, notation F_N →^L F, if the functions {F_N(x)} converge to F(x) at all points x where F is continuous.
One also talks about convergence in distribution (or in law) of random variables: a sequence {x_N} (maybe vector valued) converges in distribution, x_N →^L x, if the Pdf's of {x_N} converge in law to the Pdf of x.
This is a weaker notion than convergence of random variables as defined above.
Equivalently, x_N →^L x if and only if E f(x_N) → E f(x) for all bounded continuous real valued functions f; in fact for all real valued functions f which are bounded and continuous on a set of probability one for the Pdf of x.
26
CHARACTERISTIC FUNCTIONS
φ_x(it) := ∫ e^{itx} dF(x) = E e^{itx}
This is a very useful fact. Used for example in the proof of the CLT.
27
CONVERGENCE OF MOMENTS
We show that for scalar random variables this is true also for convergence in distribution. If F_N →^L F, then for every continuous function g composed with another arbitrary continuous function f we must have
∫_R g(f(y)) dF_N(y) → ∫_R g(f(y)) dF(y)
By a change of variable (suppose f is invertible)
∫_R g(x) dF_N(f^{−1}(x)) → ∫_R g(x) dF(f^{−1}(x))
where F_N(f^{−1}(x)) and F(f^{−1}(x)) are the Pdf's of f(x_N) and f(x).
NB: For vector functions this proof does not necessarily work. See Slut-
sky’s Theorem below. See also Billingsley book pp. 29-30.
29
MULTIVARIATE STATISTICS
χA(λ ) := det(A − λ I) = 0
which is of degree n in λ and hence has n (not necessarily distinct) complex
roots {λ1, . . . , λn}. This set is called the spectrum of A and is denoted σ (A).
The multiplicity of λk as a root of the characteristic polynomial is called the
algebraic multiplicity.
When the eigenvectors are linearly independent they form a basis in which the matrix A acts as multiplication by a diagonal matrix whose elements are the eigenvalues. Unfortunately this happens only for special classes of matrices.
32
SYMMETRIC MATRICES
Theorem 8. Let A = Aᵀ ∈ R^{n×n}. Then
1. The eigenvalues of A are real and the eigenvectors can be chosen to be a real orthonormal basis.
2. A is diagonalizable by an orthogonal transformation (∃T s.t. TᵀT = I and TᵀAT is diagonal).
3. A positive (semi-) definite matrix can always be taken symmetric. Its
eigenvalues are real and positive (nonnegative).
MULTIVARIATE CONVERGENCE
Let x_N ∼ F_N and x ∼ F be n-dimensional random vectors; then x_N →^L x means that
F_N(x_1, x_2, . . . , x_n) → F(x_1, x_2, . . . , x_n)
at all points x = [x_1 x_2 . . . x_n]ᵀ ∈ R^n where F is continuous. This implies
33
Example 2. Let x_N and y_N be two scalar sequences converging separately in distribution to the random variables x and y. Then x_N →^L x and y_N →^L y does NOT NECESSARILY imply that
[x_N, y_N]ᵀ →^L [x, y]ᵀ ,   (in general not true)
unless one of the two limits is a degenerate r.v. (a constant) c; see Slutsky Theorem.
SLUTSKY THEOREM
Theorem 9 (Slutsky). Let x_N →^L x be scalar random variables. Then:
1. For every continuous function f, f(x_N) →^L f(x).
3. Let z_N = [x_N, y_N]ᵀ and let the sequence {y_N ; N = 1, 2, . . .} converge in probability (or in law) to a constant c. Then if f(z) := f(x, y) is a continuous function of the two arguments, f(x_N, y_N) →^L f(x, c).
35
FERGUSON EXAMPLE CONT’D
36
APPLICATIONS OF SLUTSKY THEOREM
(Lehmann p. 70)
Corollary 1 (Theorem 2.3.3). If x_N →^L x and two random variables a_N, b_N converge in probability to constants (a, b) (same as also converging in distribution!), then
z_N := a_N + b_N x_N →^L a + b x
(In Lehmann x_N ≡ Y_n).
Proof. Follows from statement (3) of Slutsky Theorem: just let y_N = [a_N, b_N]ᵀ, c := [a, b]ᵀ ∈ R² and f(x, y) = a + b x.
Corollary 2 (Corollary 2.3.1). If x_N →^L x and r_N →^L 0 then
x_N + r_N →^L x .
37
ABOUT SLUTSKY THEOREM
(Lehmann p. 70)
Warning: The Pdf of the sum of two random variables is not the sum of
the two Pdf’s!!
You may want to compute the Pdf (easier the pdf) of x + c given that of x.
Also the Pdf of a random variable equal to zero a.s. (x = 0) is not the zero
Pdf! which in fact is not even a Pdf function!
Two sequences of random variables {x_N}, {y_N} such that (x_N − y_N) → 0 in probability (equivalently in distribution, since 0 is a constant) are said to be asymptotically equivalent.
38
THE CENTRAL LIMIT THEOREM (CLT)
The first version of the Central Limit Theorem is due to De Moivre and Laplace for binomial random variables, and later to Gauss for sums of continuous i.i.d. variables.
39
THE CLASSICAL CLT
Hence as N → ∞, the variance of the random variable √N [x̄_N − µ] has a finite limit (the variance of each x_k). Then its limit distribution cannot be degenerate. The key is to discover the convergence rate.
Theorem 10. Assume {x_k} are i.i.d. random variables of mean µ and finite variance σ². Then y_N := √N (x̄_N − µ) converges in distribution to a Gaussian of mean zero and variance σ². Letting x̃_k := x_k − µ this is the same as
√N (1/N) Σ_{k=1}^{N} x̃_k →^L N(0, σ²)
May be written in short also as √N x̃̄_N →^L N(0, σ²) .
NB The theorem is no longer true if you do not subtract the mean:
√N x̄_N = √N (1/N) Σ_{k=1}^{N} x_k = √N (1/N) Σ_{k=1}^{N} x̃_k + √N µ
since √N µ does not converge in distribution.
40
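A minimal Monte Carlo illustration of the theorem (my own sketch; the Bernoulli summands and the sizes N, M are arbitrary choices): the standardized sums are approximately N(0, σ²) even though the x_k are far from Gaussian.

import numpy as np

rng = np.random.default_rng(2)
N, M = 1000, 20000                 # sample size and number of Monte Carlo replications
mu, sigma2 = 0.5, 0.25             # mean and variance of a Bernoulli(1/2) variable
x = rng.binomial(1, 0.5, size=(M, N))          # decidedly non-Gaussian summands
y = np.sqrt(N) * (x.mean(axis=1) - mu)         # y_N = sqrt(N) (x_bar_N - mu)
print(y.mean(), y.var())           # approximately 0 and sigma2 = 0.25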
ABOUT THE CONVERGENCE RATE
(Lehmann last paragraph of p.70 gives only an intuitive argument)
When x̄_N →^L c then (under suitable assumptions) the variance σ²_N of x̄_N must tend to zero. (Show that the variance of a degenerate Pdf is zero!)
The main question is: does there exist a deterministic function k(N) → ∞ such that
k(N) (x̄_N − c) →^L F
where F(x) is a (limit) non-degenerate Pdf?
Example: if x̄_N is the sample mean of N i.i.d. random variables, c = µ and k(N) = √N, then F ≡ N(0, σ²). This is the CLT.
41
PROOF OF THE CLT 1
Need to review properties of the characteristic function, see Lehmann
p. 581.
Proposition 2. The characteristic function of N(0, 1) is the function
φ_o(it) = e^{(it)²/2} = e^{−t²/2} .
Proposition 3. Let x̄ = Σ_{k=1}^{N} x_k be the sum of independent random variables (not necessarily having the same distribution). The characteristic function of x̄ is
φ_x̄(it) = φ_{x_1}(it) φ_{x_2}(it) · · · φ_{x_N}(it) .
Indeed
φ_x̄(it) = E e^{it x̄} = Π_{k=1}^{N} E e^{it x_k} = Π_{k=1}^{N} ∫ e^{it x_k} dF_k(x_k) = φ_{x_1}(it) φ_{x_2}(it) · · · φ_{x_N}(it) .
42
If the variables are i.i.d. then φ_x̄(it) = φ_{x_1}(it)^N.
If ȳ = α x̄ where α ≠ 0, then
φ_ȳ(it) = E e^{it ȳ} = E e^{i(tα) x̄} = φ_x̄(i tα)
PROOF OF THE CLT 2
Let y_k := x_k − µ; then y_1, y_2, . . . , y_N are i.i.d. of mean zero and variance σ². We shall show that
z_N := √N ȳ_N →^L N(0, σ²) .
In fact from
φ_{z_N}(it) = Π_{k=1}^{N} E exp{ (it/√N) y_k }
take the Taylor expansion about t = 0
exp{ (it/√N) y } = 1 + (it/√N) y + (1/2) ((it)²/N) y² + o[ (|t|/√N)² y² ]
substitute y = y_k and take expectations. Since E y_k = 0 and E y_k² = σ², you get
φ_{z_N}(it) = [ 1 + (1/N)(−t²/2) σ² + o(|t|²/N) ]^N
43
PROOF OF THE CLT 3
The common distribution of the x_k could be anything, provided the mean and variance are finite, e.g. Binomial, U[a, b], etc. See Lehmann pp. 73-74.
44
Example 4 (Lehmann p. 75-76). Let y := {y_1, y_2, . . . , y_N} be an i.i.d. sequence where each y_k has mean µ and variance σ². Find the asymptotic (limit) distribution of the random variable
ϕ_N(y) := √N [ȳ_N − µ] / √(σ̂²_N(y))
Exercise (2.4.4): Assume τ² = var y_k² is finite. Prove that under the i.i.d. assumption
σ̂²_N(y) →^P σ² .
46
SOLUTION
See Lehmann p. 51-52
σ̂²_N(y) = (1/N) Σ_{k=1}^{N} (y_k − ȳ_N)² = (1/N) Σ_{k=1}^{N} { (y_k − µ)² + 2(y_k − µ)(µ − ȳ_N) + (µ − ȳ_N)² }
= (1/N) Σ_{k=1}^{N} (y_k − µ)² − (ȳ_N − µ)² := s²_N(y) − (ȳ_N − µ)²
Now (ȳ_N − µ)² →^P 0 since ȳ_N − µ →^P 0; and since E s²_N(y) = σ²,
var{ s²_N(y) } = E[ (1/N) Σ_{k=1}^{N} (y_k − µ)² − σ² ]² = (1/N) [ τ² − 2σ⁴ + σ⁴ ]
which tends to zero for N → ∞. Therefore s²_N(y) →^P σ² and so does σ̂²_N(y).
47
THE χ 2 DISTRIBUTION
One says that a scalar random variable y has a χ 2(n) distribution if its pdf
is supported on the nonnegative real line and has the following form:
P(x ≤ y < x + dx) = ( 1 / (2^{n/2} Γ(n/2)) ) x^{n/2 − 1} e^{−x/2} dx ,   x ≥ 0. (2)
In this expression n is a natural number called the number of degrees of freedom of the distribution. One sees that the χ² is a special case of the Gamma distribution. Its characteristic function is
φ(it) = (1 − 2it)^{−n/2} (3)
and its moments are
µ_1 = n ,  µ_2 = 2n ,  µ_3 = 8n ,  µ_4 = 48n + 12n² ,  etc. (4)
48
Lemma 1. For large n a χ²(n) random variable tends in distribution to a Gaussian variable with pdf N(n, 2n).
Proof. Recall that the pdf of the sum Σ_{i=1}^{N} y_i of i.i.d. random variables is just the N-fold convolution of the respective pdf's, so that the characteristic functions φ_i(t) of the y_i's get multiplied together. It is then clear that multiplying functions like (3) the exponents at the denominators must add up.
Proof. Just note that, with y ∼ N(µ, σ²), the pdf of z := (y − µ)²/σ² is χ²(1) and then use Proposition 5.
= (1/|2√z|) (1/√(2π)) [ e^{−z/2} + e^{−z/2} ] 1(z ≥ 0) = (1/√(2πz)) e^{−z/2} ;   z ≥ 0 ,
which is indeed χ²(1).
Proposition 9. Let yk ∼ N (µ, σ 2), k = 1, . . . , N, i.i.d. Then the pdf of the
normalized sample variance:
N σ̂²_N / σ² = (1/σ²) Σ_{k=1}^{N} (y_k − ȳ_N)² ,
is χ 2(N − 1).
Proof. We just need to show that ȳN and yk − ȳN are uncorrelated for all
k’s. By Gaussianity, this will imply independence.
Define xk = yk − µ and x̄N := ȳN − µ, so that yk − ȳN = xk − x̄N and E ȳN (yk −
ȳN ) = E x̄N (xk − x̄N ) = E x̄N xk − E x̄2N . Independence of the variables yk im-
plies
E x̄_N x_k = E( (1/N) Σ_{i=1}^{N} x_k x_i ) = (1/N) E(x_k²) = σ²/N
so that, comparing with E(x̄_N)² = σ²/N, one gets the conclusion.
By the identity
N
X N
X
(yk − µ)2 = (yk − ȳN )2 + N(ȳN − µ)2 (7)
1 1
one has
N N
X (yk − µ)2 X (yk − ȳN )2 (ȳN − µ)2
= +N
σ2 σ2 σ2
1 1
where the two random variables in the right member are independent. We
know from Proposition 7 that N s2N /σ 2 ∼ χ 2(N) and that (ȳN − µ)2/(σ 2/N) ∼
χ 2(1) (which also follows from Proposition 7 with N = 1). By Proposition 6
the pdf of first summand in the second member must be χ 2(N − 1).
49
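Proposition 9 is easy to check by simulation. The Python sketch below (not in the original slides; µ, σ and N are arbitrary) compares the empirical mean and variance of N σ̂²_N/σ² with those of a χ²(N − 1) variable, namely N − 1 and 2(N − 1).

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, N, M = 2.0, 1.5, 10, 50000          # illustrative values
y = rng.normal(mu, sigma, size=(M, N))
stat = N * y.var(axis=1) / sigma**2            # N * sigma_hat_N^2 / sigma^2 (1/N variance)
print(stat.mean(), stat.var())                 # approx N-1 = 9 and 2(N-1) = 18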
The Student pdf has a curious history which is reported in all textbooks of
classical Statistics. For n = 1 it reduces to the Cauchy distribution :
S(1) ≡ 1 / ( π(1 + t²) ) .
It can be shown that S (n) has finite moments only up to order n − 1, given
by the formulas
which is invariant, so that φ_{x̄_N}(it) converges to the same initial characteristic function: in distribution, x̄_N →^L x_k (k arbitrary).
Solution: Let g(x) = x²; then g′(x) = dg/dx = 2x. Hence
√N (x̄²_N − µ²) →^L N(0, 4µ²σ²) .
Warning: If µ = 0 this just says that √N x̄²_N →^L 0, which is a degenerate distribution. This means that 1/√N is not the correct convergence rate. In fact by Proposition 8 and Slutsky Theorem
N x̄²_N / σ² →^L χ²(1) .
52
Example 7 (Lehmann pp. 75-76). Show that the asymptotic distribution of the sample variance σ̂²_N(x) of an i.i.d. sample having finite fourth order moment µ_4 is
√N (σ̂²_N(x) − σ²) →^L N(0, µ_4 − σ⁴) .
Solution: Using the identity (7) the sample variance can be written
σ̂²_N(x) = (1/N) Σ_{k=1}^{N} (x_k − x̄_N)² = (1/N) Σ_{k=1}^{N} (x_k − µ)² − (x̄_N − µ)² := m_2(x) − m_1(x)² .
To find the asymptotic distribution of σ̂²_N(x) we use Cramèr Theorem. Let us define the function g(m_1, m_2) = −m_1² + m_2. Then
σ̂²_N(x) = −m_1(x)² + m_2(x) := g(m_1(x), m_2(x)) = g( [m_1(x), m_2(x)]ᵀ )
Note that the two components are correlated. One cannot “add” the two asymptotic distributions. We need to use the joint distribution of the random vector.
53
Since the sample x is i.i.d. one has
√N m_1(x) →^L N(0, σ²) .
Similarly, since the (x_k − µ)² ; k = 1, 2, . . . are also i.i.d., one has
√N ( m_2(x) − σ² ) := √N ( (1/N) Σ_{k=1}^{N} (x_k − µ)² − σ² ) →^L N(0, µ_4 − σ⁴)
because var (x_k − µ)² = E(x_k − µ)⁴ − 2σ⁴ + σ⁴ = µ_4 − σ⁴, where µ_4 is the fourth order central moment.
By the CLT for random vectors
√N ( [m_1(x), m_2(x)]ᵀ − [0, σ²]ᵀ ) →^L N(0, Σ)
where the asymptotic variance matrix of the random vector [m_1(x), m_2(x)]ᵀ is
Σ = [ var x_k   Cov(x_k, x_k²) ;  Cov(x_k², x_k)   var x_k² ] = [ σ²   µ_3 ;  µ_3   µ_4 − σ⁴ ]
54
We want the asymptotic distribution of g(m_1(x), m_2(x)). We don't need to compute the third order moment µ_3, since the derivative of g with respect to the two variables m_1, m_2 is g′(m_1, m_2) = [−2m_1, 1], so that g′(0, σ²) = [0, 1] and hence
g′(0, σ²) Σ [g′(0, σ²)]ᵀ = var (x_k − µ)² = E(x_k − µ)⁴ − σ⁴ = µ_4 − σ⁴ .
In conclusion:
√N [ σ̂²_N(x) − g(0, σ²) ] = √N [ σ̂²_N(x) − σ² ] →^L N(0, µ_4 − σ⁴) .
55
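The conclusion can be verified numerically. In the following Python sketch (mine; the exponential sample is an arbitrary choice made so that µ_4 ≠ 3σ⁴) the empirical variance of √N (σ̂²_N − σ²) is compared with µ_4 − σ⁴.

import numpy as np

rng = np.random.default_rng(4)
N, M = 2000, 20000
x = rng.exponential(1.0, size=(M, N))      # sigma^2 = 1, fourth central moment mu4 = 9
s2 = x.var(axis=1)                         # sample variance (1/N convention)
z = np.sqrt(N) * (s2 - 1.0)                # sqrt(N) (sigma_hat^2 - sigma^2)
print(z.var())                             # approximately mu4 - sigma^4 = 9 - 1 = 8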
GENERALIZATIONS OF THE CLT
Theorem 12 (Ferguson Problem 5, p.34, Lehmann Theorem 2.7.4 p.102).
Let {x1, x2, . . . xN } be a sequence of i.i.d random variables with mean µ and
variance σ 2 and let w(N, k) ; k = 1, 2, . . . , N be a sequence of real numbers.
Define
y_N := Σ_{k=1}^{N} w(N, k) x_k ,   σ²_N := var(y_N) = σ² Σ_{k=1}^{N} w(N, k)² .
Then:
(y_N − E y_N) / √(σ²_N) →^L N(0, 1)
if the following Lindeberg condition holds: for N → +∞
max_{k≤N} w(N, k)² / Σ_{k=1}^{N} w(N, k)² → 0 .
56
Note: you can restate the theorem assuming independent random variables defined as z_k = w(N, k) x_k, which have arbitrary mean µ_k = w(N, k) µ and variance σ_k² = w(N, k)² σ²; then
y_N := Σ_{k=1}^{N} (z_k − µ_k) / √(σ²_N) →^L N(0, 1) .
Lehmann condition (2.7.3) is more complicated to verify
57
APPLICATION TO LINEAR REGRESSION
See Lehmann p. 101 and 104
Since α̂_N = α + x̄_N (β − β̂_N) + (1/N) Σ_{k=1}^{N} e_k we have
(α̂_N − α) = (β − β̂_N) x̄_N + ē_N
that is
√N [ 1  x̄_N ] ( [α̂_N, β̂_N]ᵀ − [α, β]ᵀ ) = √N ē_N
so that the random variable in the first member, say √N z_N, has mean zero and variance τ²_N := (1/N) σ²_N, where σ²_N = Σ_{k=1}^{N} σ_k² → ∞. Therefore, under this condition,
√N z_N / τ_N →^L N(0, 1) .
Need a careful analysis of the vector least squares problem.
59
THE CRAMÈR - RAO BOUND
Ferguson Chap 19
[A.3)]  ∂²/∂θ_i∂θ_j ∫_{R^r} p(x, θ) dx = ∫_{R^r} ∂²p(x, θ)/∂θ_i∂θ_j dx ;   ∀ i, j = 1, . . . , p ;   ∀ θ ∈ Θ.
60
Definition 5. The Fisher Information Matrix I(θ ), of the parametric family
of densities {pθ } is defined as
I(θ) := E_θ [ (∂ log p(x, θ)/∂θ_i) · (∂ log p(x, θ)/∂θ_j) ]_{i,j=1,...,p} (11)
I(θ) can also be written as
I(θ) = −E_θ [ ∂² log p(x, θ)/∂θ_i∂θ_j ]_{i,j=1,...,p} . (12)
That (12) and (11) are equivalent follows by differentiating the identity ∫ p(x, θ) dx = 1 (constant with respect to θ) termwise with respect to θ, getting
∫_{R^r} ∂p(x, θ)/∂θ_i dx = 0 ,   ∫_{R^r} ∂²p(x, θ)/∂θ_i∂θ_j dx = 0 ,   i, j = 1, . . . , p .
Equation (12) then follows from
−∂² log p/∂θ_i∂θ_j = (∂ log p/∂θ_i)(∂ log p/∂θ_j) − (1/p) ∂²p/∂θ_i∂θ_j ,
61
To understand the meaning of I(θ ) we shall bring in the p-dimensional
random vector of the random sensitivities of p(·, θ ) with respect to the
parameter θ ,
z_θ := [ ∂ log p(x, θ)/∂θ_i ]_{i=1,...,p} = [ ∂p(x, θ)/∂θ_i ]_{i=1,...,p} / p(x, θ)
by which
I(θ) = E_θ z_θ z_θᵀ , (13)
where the matrix is (at least) positive semidefinite since it is a variance. In fact, it easily follows from [A.2)] that E_θ ∂ log p/∂θ_i = 0 for all i's and so
E_θ z_θ = 0 .
62
Theorem 13 (The Cramèr-Rao Inequality). Let g be a differentiable function
from Θ to Rq and φ be an unbiased estimator of g(θ ). Let V (θ ) be the
variance matrix of φ and G(θ ) the Jacobian matrix of g,
G(θ) = [ ∂g_i(θ)/∂θ_j ]_{i=1,...,q ; j=1,...,p} . (14)
Then, if the Fisher matrix I(θ) is invertible, one has
V(θ) ≥ G(θ) I(θ)⁻¹ G(θ)ᵀ . (15)
Proof: The proof is based on the classical formula for the error variance
of the linear Bayesian estimator φ̂ (x) := E θ [ φ (x) | zθ ] of the vector φ (x),
given zθ , that is
Var_θ{φ(x) − φ̂(x)} = Var_θ{φ(x)} − Cov_θ{φ(x), z_θ} Var_θ{z_θ}⁻¹ Cov_θ{φ(x), z_θ}ᵀ . (16)
See for example [?, p. 27].
63
Since φ (x) is an unbiased estimator of g(θ ); i.e.
∫_{R^r} φ(x) p(x, θ) dx = g(θ) ,   ∀θ ∈ Θ ,
by applying property A.3) one gets
E_θ φ(x) z_θ^j = ∫_{R^r} φ(x) · (∂p(x, θ)/∂θ_j) · (1/p(x, θ)) · p(x, θ) dx = ∂g(θ)/∂θ_j ,   j = 1, . . . , p ,
and hence ∂g(θ)/∂θ_j is the j-th column of the covariance matrix of φ and z_θ,
E_θ φ(x) z_θᵀ = E_θ φ(x) [ z_θ^1, . . . , z_θ^p ] ,
that is,
E_θ φ z_θᵀ = G(θ) . (17)
The inequality follows since the variance of the random vector φ(x) − G(θ) I(θ)⁻¹ z_θ must be (at least) positive semidefinite.
64
Remarks
(the trace is the sum of the eigenvalues and the eigenvalues of a positive semidefinite matrix are all non-negative) it follows that the scalar variance of any unbiased estimator of the parameter θ cannot be less than the positive number Tr I(θ)⁻¹, that is
var_θ(φ) ≥ Tr[ I(θ)⁻¹ ] ,   ∀θ . (19)
This lower bound only depends on the probabilistic model class {p(·, θ ) ; θ ∈
Θ} and is independent of which estimation criterion is used to construct φ .
65
Remark:
One should however be aware of the fact that the Cramèr-Rao bound is just one possible bound for the variance, which is not necessarily the tightest possible bound. There are in fact unbiased estimators whose variance is strictly larger than Tr[I(θ)⁻¹] but which nevertheless have minimum variance.
66
Example 8. Let y ∼ N (θ , σ 2) be a scalar random variable with a known
variance σ 2. Since
log p(y, θ) = C − (1/2) (y − θ)²/σ² ,
(d/dθ) log p(y, θ) = (y − θ)/σ²
we have
i(θ) = E_θ [ (y − θ)/σ² ]² = (1/σ⁴) · σ² = 1/σ² .
Hence the variance of any unbiased estimator of θ based on a sample of size one cannot be smaller than the variance of y. Assume now we have a random sample of size N from the same Gaussian distribution. Now we have a random vector x = (y_1, . . . , y_N) of dimension r = N and
p(y_1, . . . , y_N | θ) = Π_{t=1}^{N} p(y_t, θ)
67
and hence
log p(y_1, . . . , y_N | θ) = N × Const − (1/2) Σ_{t=1}^{N} (y_t − θ)²/σ² ,
d log p / dθ = Σ_{t=1}^{N} (y_t − θ)/σ² .
Since the random variables y_1, . . . , y_N are independent, it follows that
I(θ) = E_θ [ d log p(y, θ)/dθ ]² = (1/σ⁴) · N σ² = N/σ² .
Let us consider the sample mean ȳ_N = (1/N) Σ_{t=1}^{N} y_t, which has distribution N(θ, σ²/N). Since ȳ_N is an unbiased estimator of θ with variance σ²/N, exactly equal to the inverse of the Fisher information, the sample mean is the best possible estimator of θ (of course if the sample distribution is Gaussian). One says that an unbiased estimator whose variance is exactly equal to the inverse of the Fisher information matrix, V(θ) = I(θ)⁻¹, is efficient. ♦
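A quick numerical check of the example in Python (my own sketch; θ, σ, N and the number of replications are arbitrary): the Monte Carlo variance of ȳ_N matches the Cramèr-Rao bound I(θ)⁻¹ = σ²/N.

import numpy as np

rng = np.random.default_rng(5)
theta, sigma, N, M = 1.0, 2.0, 50, 100000
y = rng.normal(theta, sigma, size=(M, N))
ybar = y.mean(axis=1)                      # sample mean of each replication
print(ybar.var(), sigma**2 / N)            # both approximately 0.08 = sigma^2/N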
Example 9. Let y ∼ N(µ, θ²), where µ is known and (y_1, . . . , y_N) is a random sample from N(µ, θ²). Consider the unbiased estimator
σ̃²_N = (N/(N − 1)) σ̂²_N = (1/(N − 1)) Σ_{k=1}^{N} (y_k − ȳ_N)² ;
We know that N σ̂²_N / θ² has a chi square distribution with N − 1 degrees of freedom, which has expectation N − 1 and variance 2(N − 1).
The variance of σ̃²_N is then 2θ⁴/(N − 1). The Cramèr-Rao bound in this case is 2θ⁴/N and hence the variance of σ̃²_N is strictly greater than I(θ)⁻¹. One can however show [?] that any unbiased estimator of θ² cannot have a smaller variance than that of σ̃²_N.
From this example it follows that I(θ)⁻¹ is not the best possible lower bound. ♦
Note that a biased estimator can have smaller variance than the C-R limit. From Proposition 9 it follows that var σ̂²_N = 2θ⁴ (N − 1)/N² .
68
Exercises
1-1 Let I(θ ) be the Fisher matrix relative to an arbitrary density p(y, θ ).
Show that for a random sample of size N one has IN (θ ) = N I(θ ).
1-2 Show, without using the χ 2 distribution, that the Cramèr-Rao bound
for a random sample from N (µ, θ 2) of size N is 2θ 4/N.
1-3 Show that the Cramèr-Rao bound for N(θ_1, θ_2²) (two dimensional parameter θ) is
I(θ)⁻¹ = [ θ_2²/N   0 ;  0   2θ_2⁴/N ] .
69
THE KULLBACK-LEIBLER DISTANCE
See Ferguson p112
We shall use this measure to quantify in rather precise terms, the ability of
observations extracted from the model, to discriminate between different
values of the parameter θ .
Definition 6. Let f and p be probability densities such that p(x) = 0 ⇒ f(x) = 0. The Kullback-Leibler (pseudo-)distance between f and p is
K(f, p) := ∫_{R^r} [ log f − log p ] f(x) dx = ∫_{R^r} log(f/p) f(x) dx = E_f log(f/p) ; (20)
70
From the Jensen inequality
∫ log g(x) dµ ≤ log{ ∫ g(x) dµ }
which holds for g(x) > 0 and a probability measure µ (here dµ = f dx), one gets
−K(f, p) = ∫_{R^r} log(p/f) f dx ≤ log{ ∫_{R^r} (p/f) f dx } = log{1} = 0
so that K(f, p) ≥ 0.
71
Let us assume that the family p(·, θ ) satisfies the same regularity assump-
tions listed for the Cramèr-Rao bound and let f ≡ p(·, θ0) and p ≡ p(·, θ ),
θ0, θ ∈ Θ. Denoting K(p(·, θ0), p(·, θ )) by K(θ0, θ ) and letting θ = θ0 + ∆θ ,
one has
K(θ_0, θ) = K(θ_0, θ_0) + (∂K/∂θ |_{θ_0})ᵀ ∆θ + (1/2) ∆θᵀ [ ∂²K/∂θ_i∂θ_j |_{θ_0} ] ∆θ + o(‖∆θ‖²) .
Since K(θ_0, θ_0) = 0 and
∂K/∂θ_i = −∫_{R^r} (∂ log p(x, θ)/∂θ_i) p(x, θ_0) dx ,
it follows that
∂K/∂θ_i |_{θ_0} = −∫_{R^r} ∂p(x, θ)/∂θ_i |_{θ_0} dx = 0
for all i = 1, . . . , p. Moreover the Hessian of K at θ_0 is, by (12), the Fisher matrix I(θ_0), so that
K(θ_0, θ) ≅ (1/2) ∆θᵀ I(θ_0) ∆θ ; (21)
which says that, for small deviations ∆θ of the parameter from the reference
value θ0, the Kullback-Leibler distance between p(·, θ ) and p(·, θ0) is a
quadratic form whose weighting matrix is the Fisher matrix I(θ0). In
the next section we will see a remarkable consequence of this fact.
75
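As a sanity check of (21) (this worked example is mine, not in the original slides), take the scalar Gaussian family p(·, θ) = N(θ, σ²) with σ² known, as in Example 8. A direct computation gives
K(θ_0, θ_0 + ∆θ) = E_{θ_0} log [ p(x, θ_0) / p(x, θ_0 + ∆θ) ] = E_{θ_0} [ ( (x − θ_0 − ∆θ)² − (x − θ_0)² ) / (2σ²) ] = ∆θ² / (2σ²) ,
which coincides exactly (not only to second order) with (1/2) ∆θ² I(θ_0), since I(θ_0) = 1/σ² by Example 8.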
ERGODICITY AND THE STRONG LAW OF
LARGE NUMBERS
Lehmann p. 62
y = {y(t)} ,   t ∈ Z ;
the symbol Z (Zahlen in German) stands for the integer numbers.
Definition 9. A stochastic process {y(t)} is stationary (in the strict sense) if all Pdf's relative to y(t_1), y(t_2), . . . , y(t_n), say F_n(x_1, . . . , x_n, t_1, . . . , t_n), are invariant under time translation, that is, for every n it must hold that,
76
Consequences
• The Pdf F(x, t) of any variable y(t) does not depend on t; that is, the random variables y(t), t ∈ Z, are identically distributed;
• The second order joint Pdf F_2(x_1, x_2, t_1, t_2) of the variables y(t_1), y(t_2) only depends on τ = t_1 − t_2 and not on the date.
In particular, µ(t) := E y(t) is a constant equal to µ ∈ R^m and the covariance
Σ(t_1, t_2) := E [y(t_1) − µ(t_1)] [y(t_2) − µ(t_2)]ᵀ
depends only on τ = t_1 − t_2.
77
THE ERGODIC THEOREM
E z̄ = E f (y) .
Corollary 3. If {y(t)} is ergodic
lim_{T→∞} (1/T) Σ_{k=1}^{T} f_k(y) = E f(y) (23)
with probability one, whatever may be f(y) having finite expectation.
Proof: In fact z̄ must be a constant and hence coincides with its own expectation: z̄ = E z̄ = E f(y). □
79
THE STRONG LAW OF LARGE NUMBERS
Theorem 16 (Kolmogorov). Every i.i.d. process having finite expectation is
ergodic.
82
Therefore for N → ∞,  (1/N) ( l_N(θ) − l_N(θ_0) ) → −K(θ_0, θ) < 0  almost surely (a.s.);
this means that for N large, l_N(θ) < l_N(θ_0) for all θ ≠ θ_0, so, taking exponentials, the likelihood of θ_0 will be larger than that of any other θ ∈ Θ. In fact we have asymptotic exponential decay:
L_N(θ) / L_N(θ_0) = O( e^{−N K(θ_0, θ)} ) ;   ∀θ ∈ Θ
which means that the ratio will certainly be < 1 for N large. When Θ is a finite set this implies that any maximum, θ̂_N, of L_N(θ) must converge to θ_0.
Theorem 17. If Θ is compact, p_θ(x) is continuous in θ for all x and there is K(x) such that
log ( p_θ(x) / p_{θ_0}(x) ) ≤ K(x) ;   E_{θ_0} K(x_k) < ∞ (24)
then any maximizing θ̂N (x) converges almost surely to θ0 as N → ∞.
83
ASYMPTOTIC NORMALITY OF MLE
Theorem 18 (Cramèr). Assume again that x1, x2, . . . , xN is an i.i.d. sample
from a family of pdf’s { pθ (x) ; θ ∈ Θ} with Θ ⊂ R p now an open subset, sat-
isfying the three conditions [A.1)][A.2)][A.3)] of the Cramèr-Rao Theorem.
Let θ0 be the true value of the parameter and assume that the information
matrix I(θ ) is non singular at θ = θ0. In other words
I(θ0) > 0
(which implies local identifiability). Assume further that condition (24) is
satisfied. Then there exists a strongly consistent sequence θ̂_N of roots of the likelihood equation
∂l_N(θ)/∂θ_i = 0 ;   i = 1, 2, . . . , p
such that
√N (θ̂_N − θ_0) →^L N(0, I(θ_0)⁻¹)
84
ASYMPTOTIC NORMALITY OF MLE 2
Lehmann book pp.459-476 does only the scalar case of 1 parameter
85
ASYMPTOTICS OF STATISTICAL TESTS
See Lehmann p. 133-4, and 137
Example 10. Consider an i.i.d. sample with finite fourth order moment µ_4. We want to test the hypothesis
H_0 : σ² = σ_0² ,
against the alternative σ² > σ_0², based on N (→ ∞) independent observations. We reject the hypothesis H_0 if σ̂²_N is large. From the asymptotic distribution
√N (σ̂²_N(x) − σ²) →^L N(0, µ_4 − σ⁴) ,
σ̂²_N is approximately N(σ², (µ_4 − σ⁴)/N); in case of near normality µ_4 − σ⁴ = 3σ⁴ − σ⁴ = 2σ⁴, and the rejection region becomes
√N ( σ̂²_N − σ_0² ) ≥ u_α √2 σ_0²
86
TIME SERIES
Suppose you have a sequence of scalar regression data
y^N := {y(t) ; t = 1, 2, . . . , N} ,   u^N := {u(t) ; t = 1, 2, . . . , N}
where you measure the u(t)'s exactly but the y(t)'s are random due to errors of various kinds. We shall imagine that they are extracted from two stochastic processes {y(t), u(t)} which are jointly stationary, zero-mean and have finite second order joint moments.
88
We shall assume that e is an i.i.d. process. The ARX model depends on p := n + m unknown parameters, written as a column vector
θ := [ a_1 . . . a_n  b_1 . . . b_m ]ᵀ
and can be written in regression form, y(t) = ϕ(t)ᵀ θ + e(t), where
ŷ_θ(t | t − 1) = ϕ(t)ᵀ θ
is the (one step ahead) predictor function associated to the model. Note that the predictor function is a linear function of θ and of the previous n + m past samples of the joint process.
89
PEM IDENTIFICATION OF TIME SERIES
90
1. For a generic value of θ, construct a predictor, based on data up to time t − 1, of the next output y(t). For each θ the predictor is a deterministic function of the past data, denoted ŷ_θ(t | t − 1). For analysis purposes we may consider ŷ_θ(t | t − 1) also as a function of θ and of the past random observed data.
91
More generally one may introduce a discount factor for past errors: a positive sequence β(N, t),
V_N(θ) := (1/N) Σ_{t=1}^{N} β(N, t) ε_θ(t)² ,   β(N, t) > 0
For small t, the function β gives small weight to errors incurred at the beginning. One designs the weighting function so that β(N, t) → 1 for N → ∞.
The parameter estimate is θ̂_N := arg min_θ V_N(θ), and the corresponding noise variance estimate is
λ̂²_N := V_N(θ̂_N)
where V_N is defined above.
92
ARX MODEL ESTIMATION
Assume we have data {y(t), u(t) ; t = t_0, t_0 + 1, . . . , 0, 1, 2, . . . , N} to be described by an (ARX) model using the PEM method. Write an N-vector model for all data as
y = Φ_N θ + e ,   column vectors of dimension N
where y and e have components y(t) and e(t) for t = 1, 2, . . . , N and Φ_N is an N × p matrix of past data:
Φ_N := [ ϕ(1)ᵀ ; . . . ; ϕ(N)ᵀ ] ,
Assuming the initial time t_0 is far enough back, we can fill in Φ_N with data from time, say, t = 1. The N-dimensional vectors of predictors and prediction errors are ŷ_θ = Φ_N θ, ε_θ = y − Φ_N θ. Then V_N(θ) is 1/N times the squared Euclidean norm of ε_θ,
V_N(θ) = (1/N) Σ_{t=1}^{N} [ y(t) − ϕ(t)ᵀθ ]² = (1/N) ‖y − Φ_N θ‖² .
93
DETERMINISTIC VECTOR LEAST SQUARES
Hence the estimation leads to a vector Least Squares Regression Problem, and more generally to weighted Least Squares: if the t-th measurement is more reliable, weight the t-th prediction error by a larger weight q_t². If the error variances are approximately known, the optimal choice is to take q_t² = 1/var(e(t)), the inverse of the error variance.
[Figure: orthogonal projection of the data vector y ∈ R^N onto the subspace X ⊂ R^N spanned by the columns of X; the projection is X θ̂.]
96
THE ORTHOGONALITY PRINCIPLE
Want to find the vector ŷ ∈ X of shortest distance from the data vector y.
Theorem 19. The minimizer of V_Q(θ) = ‖y − Xθ‖²_Q must make the error y − Xθ orthogonal (according to the scalar product ⟨x, y⟩_Q) to the subspace X, or, equivalently, to the columns of X, that is
Xᵀ Q (y − Xθ) = 0 .
Equivalently the optimal θ must solve the Normal Equations
Xᵀ Q X θ = Xᵀ Q y ;
for Q = I these reduce to Xᵀ X θ = Xᵀ y. To solve them we need invertibility of Xᵀ Q X.
97
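In code, the weighted Normal Equations amount to one call to a linear solver. A minimal Python sketch (mine; X, Q and y are randomly generated just to fix dimensions, and X is assumed to have full column rank):

import numpy as np

rng = np.random.default_rng(6)
N, p = 100, 3
X = rng.normal(size=(N, p))                    # full-rank regressor matrix (assumed)
Q = np.diag(rng.uniform(0.5, 2.0, size=N))     # positive definite weight matrix
y = rng.normal(size=N)

theta = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)   # solves X'Q X theta = X'Q y
residual = y - X @ theta
print(X.T @ Q @ residual)                      # numerically zero: orthogonality principle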
Let us now assume that
rank X = p ≤ N . (25)
This is an identifiability condition of the model class: each model corresponds 1 : 1 to a unique value of the parameter. Under this condition the Normal Equations have a unique solution, which we denote θ̂(y), given by θ̂(y) = A y with A := (XᵀQX)⁻¹XᵀQ. The matrix
P = XA ,
is the orthogonal projector, with respect to the inner product ⟨·, ·⟩_Q, from R^N onto X. In fact P is idempotent (P = P²), since
XA · XA = X (AX) A = X · I · A = XA
98
STATISTICAL ANALYSIS
y = Xθ + e . (27)
Assume that e is a zero-mean random vector with known variance R = E eeᵀ, positive definite, and that the rank condition (25) holds.
Theorem 20. The estimator θ̂(y) = Ay is unbiased and has variance matrix
Var θ̂(y) = [ Xᵀ R⁻¹ X ]⁻¹
100
In the ARX case just assume Q = I_N (the N × N identity matrix); then the PEM estimator of θ is
θ̂_N = [ Φ_Nᵀ Φ_N ]⁻¹ Φ_Nᵀ y
101
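A compact Python sketch of the whole procedure for a pure AR model with no input (my own illustration; the second-order "true" coefficients are arbitrary, chosen with roots inside the unit circle): the estimate θ̂_N approaches the true coefficients as N grows.

import numpy as np

rng = np.random.default_rng(7)
N = 5000
a1, a2, sigma_e = 0.5, -0.3, 1.0            # assumed true AR(2) coefficients
e = rng.normal(0.0, sigma_e, size=N)
y = np.zeros(N)
for t in range(2, N):                       # y(t) = a1 y(t-1) + a2 y(t-2) + e(t)
    y[t] = a1 * y[t-1] + a2 * y[t-2] + e[t]

Phi = np.column_stack([y[1:N-1], y[0:N-2]])         # rows phi(t)^T = [y(t-1), y(t-2)]
Y = y[2:N]
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y) # [Phi' Phi]^{-1} Phi' y
print(theta_hat)                                    # approximately [0.5, -0.3]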
CONSISTENCY OF THE PEM ESTIMATOR
(Preview)
Theorem 21. Assume there is a true model describing the data, having orders n, m as in the candidate ARX model and true parameter θ_0. Assume also that the true model is causal. Then
lim_{N→∞} θ̂_N = θ_0
with probability one.
102
STRONG CONSISTENCY OF THE PEM
ESTIMATOR
Rewrite θ̂_N as
θ̂_N = [ (1/N) Σ_{t=1}^{N} ϕ(t)ϕ(t)ᵀ ]⁻¹ (1/N) Σ_{t=1}^{N} ϕ(t) y(t) ; (EST)
substitute y(t) = ϕ(t)ᵀθ_0 + e(t) (true model) and define the sample covariance matrix of ϕ(t)
Σ̂_N := (1/N) Σ_{t=1}^{N} ϕ(t)ϕ(t)ᵀ ∈ R^{p×p}
For notational simplicity we do the case of no exogenous input (u ≡ 0).
Lemma 3. If {e} is an i.i.d. process then {y} is ergodic and Σ̂_N converges almost surely for N → ∞ to the positive semidefinite covariance matrix
Σ := E_{θ_0} { [ y(t − 1), . . . , y(t − n) ]ᵀ [ y(t − 1), . . . , y(t − n) ] } ;
103
PROOF OF STRONG CONSISTENCY OF THE
PEM ESTIMATOR
Proof. The ergodicity follows from Corollary 4 but we shall need to under-
stand why y(t) admits such a representation. We shall do that in the next
slides
106
Once we have h(t) we can solve the equation for an arbitrary input f (t).
Note that the system represented by the DE operates a linear transformation on the input f(t). Since any input function can be expressed as a (possibly infinite) linear combination of impulse functions located at all times t = k,
f(t) = Σ_{k=−∞}^{+∞} f(k) δ(t − k) ,   t ∈ Z
k=−∞
because of linearity the response of the system can be written as a sum
of infinitely many impulse responses to the δ (t − k)’s each located at times
t = k and weighted by amplitude f (k). This leads to the convolution rep-
resentation
y(t) = Σ_{k=−∞}^{+∞} f(k) h(t − k)   ⇔   y(t) = Σ_{k=−∞}^{+∞} h(k) f(t − k)
107
Let’s go back to the AR model. Now the input is an i.i.d. process {e(t)}.
We can still write the solution as a convolution sum
y(t) = Σ_{k=−∞}^{+∞} h(k) e(t − k)
which is of the same form as the representation (CONV) on p. 78. We need however to check under what circumstances the convergence condition Σ_{k=−∞}^{+∞} |h(k)| < ∞ is satisfied.
Lemma 4. If and only if all roots of the characteristic equation have modulus strictly less than 1, i.e. |λ_k| < 1 ; k = 1, 2, . . . , n, one has:
1. Causality, i.e. h(t) = 0 for t < 0
2. Stability, i.e. Σ_{k=0}^{+∞} |h(k)| < ∞ .
108
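A small numerical illustration of the lemma in Python (mine; the AR(2) coefficients are arbitrary): for a characteristic equation with roots inside the unit circle the causal impulse response decays and is absolutely summable.

import numpy as np

a1, a2 = 0.5, -0.3                 # difference equation y(t) = a1 y(t-1) + a2 y(t-2) + f(t)
roots = np.roots([1.0, -a1, -a2])  # roots of the characteristic equation
print(np.abs(roots))               # both moduli < 1: causal and stable

T = 200
h = np.zeros(T)
h[0] = 1.0                         # impulse response: response to f(t) = delta(t)
for t in range(1, T):
    h[t] = a1 * h[t-1] + (a2 * h[t-2] if t >= 2 else 0.0)
print(np.sum(np.abs(h)))           # finite: sum_k |h(k)| converges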
Clearly when |λ_k| < 1 then lim_{t→+∞} λ_k^t = 0 (true also for multiple roots) and the impulse response is summable. If |λ_k| > 1 then lim_{t→+∞} λ_k^t = ∞ and the summable impulse response is not causal; it may be chosen causal, but then it is not summable.
109
CAUSALITY
When |λ_k| < 1 ; k = 1, 2, . . . , n, the process y is ergodic, but more is true. Since h(t) = 0 for t < 0, one can write
y(t) = Σ_{k=0}^{+∞} h(k) e(t − k) = Σ_{k=−∞}^{t} h(t − k) e(k)
so y(t) depends only on the past history of {e(t)}, in general infinite, since h(t) is non zero for all t ≥ 0. Write
y(t − i) = Σ_{k=−∞}^{t−i} h(t − i − k) e(k) ;   a linear function of past e's at times ≤ t − i
Therefore, since the e(t) are uncorrelated,
E e(t) y(t − i) = 0 ,   i = 1, 2, . . . , n
If there is an input process this needs to be assumed.
110
CAUSALITY AND THE PREDICTION ERROR
111
CONDITIONAL EXPECTATION
Denote by L2(y) the vector space of statistics of the process y which have finite second order moment, that is, f(y(s) ; s ∈ Z) with E f(y)² < ∞.
x − E(x | y) ⊥ L2(y) (30)
where the orthogonality is with respect to the covariance inner product ⟨x, z⟩ := cov{x, z}.
112
MARTINGALE-DIFFERENCE PROCESSES
For AR models with i.i.d. noise input, the (one-step-ahead) prediction error
turns out to coincide with e(t) itself and is (statistically) independent of the
optimal predictor. This was defined as the conditional expectation of y(t)
given the past history of the process up to time t − 1:
ŷ(t | t − 1) = E[ y(t) | y^{t−1} ] .
For ARX and ARMAX models the predictor also depends on the past of
some exogenous variable u. The definition is generalized as conditional
expectation given the joint past histories (yt−1, ut−1) and
114
GENERAL D-MARTINGALES
Definition 10. Let {z(t) ; t ∈ Z} be a stationary vector process and consider the sequence of subspaces L2(z^t), which is non-decreasing, i.e. L2(z^t) ⊂ L2(z^{t+1}). A stochastic process {x(t) ; t ∈ Z} is a martingale difference, or briefly a d-martingale, with respect to the family {L2(z^t)} if:
◦ For all t, x(t) ∈ L2(z^t); i.e. x(t) is itself a statistic of the past history of z at time t, having finite variance.
◦ x(t + 1) is uncorrelated with all random variables in the space L2(z^t), that is E{x(t + 1) f(z^t)} = 0 for all f(z^t) ∈ L2(z^t), which is equivalent to:
E{ x(t + 1) | z^t } = 0   t ∈ Z. (DMART)
116
The following lemma generalizes the sum-of-variances property, known to
hold for sums of i.i.d. processes, to d-martingales.
Lemma 5. For every d-martingale x one has
Var{ Σ_{t=1}^{N} x(t) } = Σ_{t=1}^{N} Var{x(t)} (32)
When the d-martingale x is stationary the second member is just N σ_x². This property holds unchanged in the vector case.
Proof.
E{ Σ_{t=1}^{N} x(t) }² = E{ x(1)² + x(2)² + . . . + x(N)² } + E{ 2 Σ_{t>s} x(t)x(s) } = Σ_{t=1}^{N} Var{x(t)} + 2 Σ_{t>s} E x(t)x(s)
But the last term is zero since for t > s, x(s) ∈ L2(z^s), and by the d-martingale property
E x(t)x(s) = E{ E[ x(t)x(s) | z^s ] } = E{ x(s) E[ x(t) | z^s ] } = 0
117
THE CLT FOR D-MARTINGALES
Theorem 23 (Levy, Doob, Billingsley, Ibragimov). Let {x(t)} be a stationary (vector) d-martingale having constant conditional variance, equal to Σ_x = E x(t)x(t)ᵀ. One has
√T x̄_T →^L N(0, Σ_x) (33)
that is, the modified sample mean √T x̄_T converges in distribution to a multivariate Gaussian distribution of mean zero and variance Σ_x.
Proof (for the scalar case): We shall use the conditional characteristic function, substituting the variable it with iλ. By stationarity the variances of the random variables {x(t)} are uniformly bounded and φ_{x(t)}(iλ | z^{t−1}) admits a second derivative at λ = 0 equal to the (conditional) variance, σ², of x(t). One can write
E[ e^{iλ x(t)} | z^{t−1} ] = E[ 1 + iλ x(t) − (λ²/2) x(t)² + η(λ, x(t)) | z^{t−1} ] = 1 − σ²λ²/2 + o(λ²)
where o(λ²) is a random variable in L2(z^{t−1}) which tends to zero as λ → 0 faster than λ². Call φ_T(λ) the conditional characteristic function of the sum
118
x̄(T) := Σ_{t=1}^{T} x(t). Then
φ_T(λ) = E{ E[ e^{iλ x(T)} | z^{T−1} ] e^{iλ x̄(T−1)} } =
= ( 1 − σ²λ²/2 ) E{ e^{iλ x̄(T−1)} } + E{ o(λ²) e^{iλ x̄(T−1)} } =
= ( 1 − σ²λ²/2 ) φ_{T−1}(λ) + ō(λ²)
where ō(λ²) is the expected value of a variable in L2(z^{T−1}) having the same absolute value as o(λ²), hence tending to zero faster than λ². Solving the difference equation one finds
φ_T(λ) = ( 1 − σ²λ²/2 )^T + ō_T(λ²)
where ō_T(λ²) is still infinitesimal of higher order than λ² for λ → 0.
Now, the characteristic function of s(T) := (1/√T) Σ_{t=1}^{T} x(t) is the same function φ_T computed at λ/√T, so that
φ_T( λ/√T ) = ( 1 − σ²λ²/(2T) )^T + ō_T( λ²/T )
where the second term tends to zero for T → ∞, for whatever fixed value of λ, while the limit of the first term is the well known function exp{ −σ²λ²/2 }. Hence the characteristic function of s(T) converges pointwise to that of the Gaussian N(0, σ²). □
119
THE CLT FOR THE PEM ESTIMATOR
We shall consider simultaneously models of AR, ARX, ARMA, ARMAX
type depending smoothly on a p-dimensional parameter θ . Assume at
least local identifiability about the true value θ0 of the true model which
has generated the data. Assume also that the true model generates data
which are stationary and ergodic and that the PEM estimator is consistent.
ψ_θ(t) := ∂ε_θ(t)/∂θ = −∂ŷ_θ(t | t − 1)/∂θ
to get
(1/2) V_N′(θ) = (1/N) Σ_{t=1}^{N} ψ_θ(t) ε_θ(t) (36)
(1/2) V_N″(θ) = (1/N) Σ_{t=1}^{N} { ψ_θ(t) ψ_θ(t)ᵀ + ε_θ(t) [ ∂²ε_θ(t)/∂θ_i∂θ_j ] } (37)
The asymptotic behaviour of the second derivative is discussed in the following lemma.
Lemma 6. One has
lim_{N→∞} (1/2) V_N″(θ̄) = E_{θ_0} { ψ_{θ_0}(t) ψ_{θ_0}(t)ᵀ } (38)
with probability one.
122
Proof: By consistency, θ̂_N → θ_0 and hence also θ̄ → θ_0 (with probability one). Since the time average in (37) converges to the expectation, we have
(1/2) V_N″(θ̄) → E_{θ_0} { ψ_{θ_0}(t) ψ_{θ_0}(t)ᵀ + ε_{θ_0}(t) [ ∂²ε_θ(t)/∂θ_i∂θ_j ]_{θ=θ_0} }
almost surely. Since the true model belongs to the model class, ε_{θ_0}(t) = e_0(t), where e_0(t) is the true random prediction error (a d-martingale). Finally, since both the gradient ψ_θ(t) and the second derivative of ŷ_θ(t | t − 1) are functions (often linear functions) only of the past data (y^{t−1}, u^{t−1}), all entries of the second-derivative matrix in the second member are uncorrelated with e_0(t) and hence the expectation of the last term is zero. □
Note now that the last term in (35) has the expression
(1/2) V_N′(θ_0) = (1/N) Σ_{t=1}^{N} ψ_{θ_0}(t) e_0(t) (39)
123
AN INTERMEDIATE CLT
Theorem 24. Assume the prediction error e_0 is a stationary d-martingale with respect to the flow of past data (y^t, u^t), having finite variance. Then the process {ψ_{θ_0}(t) e_0(t)} is also a d-martingale and
√N (1/2) V_N′(θ_0) →^L N(0, Q) (40)
If the conditional variance of e_0(t) does not depend on the data (y^{t−1}, u^{t−1}), that is if
E_0{ e_0(t)² | y^{t−1}, u^{t−1} } = E_0{ e_0(t)² } = σ_0² , (41)
the asymptotic variance Q is given by the formula
Q = σ_0² E_0{ ψ_{θ_0}(t) ψ_{θ_0}(t)ᵀ } .
124
Proof: The first statement follows from the previous observation that {ψ_{θ_0}(t) e_0(t)} is also a d-martingale with respect to the past data flow {y^t, u^t}. The second is a corollary of the CLT for d-martingales, Theorem 23. The expression for the variance matrix Q follows from property (41), which implies
Var{ ψ_{θ_0}(t) e_0(t) } = E_0{ E_0[ e_0(t)² | y^{t−1}, u^{t−1} ] ψ_{θ_0}(t) ψ_{θ_0}(t)ᵀ } = E_0{ e_0(t)² } E_0{ ψ_{θ_0}(t) ψ_{θ_0}(t)ᵀ } . □
Obviously the theorem includes the case where e0 is an i.i.d. process but
applies to a much wider variety of cases.
125
THE ASYMPTOTIC DISTRIBUTION OF THE
PEM ESTIMATOR
126
Proof: Follows from (35). For N → ∞
θ̂_N − θ_0 ≃ −[ (1/N) Σ_{t=1}^{N} ψ_{θ̄}(t) ψ_{θ̄}(t)ᵀ ]⁻¹ (1/N) Σ_{t=1}^{N} ψ_{θ_0}(t) ε_{θ_0}(t)
and use the third statement of Slutsky. The expression for P follows from
P = [ E_{θ_0}{ ψ_{θ_0}(t) ψ_{θ_0}(t)ᵀ } ]⁻¹ Q [ E_{θ_0}{ ψ_{θ_0}(t) ψ_{θ_0}(t)ᵀ } ]⁻¹
127
ASYMPTOTIC VARIANCE
Recall that the variance of a consistent estimator must tend to zero as
N → ∞. The concept of asymptotic variance of a consistent estimator must
therefore be defined properly. Here is one possible definition.
Definition 11. Let {φ_N(y) ; N = 1, 2, . . .} be a consistent sequence of estimators of the parameter θ and d(N) a strictly positive function of N which increases to +∞ with N. One says that φ_N(y) has asymptotic variance Σ if
√(d(N)) [ φ_N(y) − θ_0 ] →^L D(0, Σ)
where D(0, Σ) is a Pdf having variance Σ, possibly depending on θ_0, which is finite and strictly positive definite.
For maximum likelihood, under the usual identifiability condition, the asymptotic variance is I(θ_0)⁻¹, which in view of the Cramèr-Rao bound is the best possible (for a fixed θ_0). One usually says that maximum likelihood is an asymptotically efficient estimator, but this sentence must of course be interpreted with a grain of salt.
129