
2.160 System Identification, Estimation, and Learning


Lecture Notes No. 18
April 26, 2006

13 Asymptotic Distribution of Parameter Estimates

13.1 Overview

If convergence is guaranteed, then θˆN → θ * .


But, how quickly does the estimate θˆN approach the limit θ*? How many data points are needed? → Asymptotic Variance Analysis

[Figure: distribution of θˆN around θ* versus iteration/data number. The variance is large for small N and small for large N; how quickly does the variance reduce?]

The main points to be obtained in this chapter: the variance analysis will reveal that


a) the estimate converges to θ* at a rate proportional to $1/\sqrt{N}$;
b) the distribution converges to a Gaussian distribution, N(0, Q);
c) Cov θˆN depends on the parameter sensitivity of the predictor, $\partial \hat{y} / \partial \theta$.

The identified model parameter θˆN together with cov θˆN thus carries a "quality tag": a confidence interval.

13.2 Central Limit Theorems.

The mathematical tool needed for asymptotic variance analysis is the central limit theorem. The following is a quick review of the theory.

Consider two independent random variables, X and Y, with PDFs $f_X(x)$ and $f_Y(y)$. Define another random variable Z as the sum of X and Y:

$$ Z = X + Y $$

Let us obtain the PDF of Z. Integrating the joint density over the strip $z \le x + y \le z + \Delta z$ of the x-y plane,

$$ \mathrm{Prob}(z \le Z \le z + \Delta z) = \iint_{\Delta XY} f_X(x)\, f_Y(y)\, dx\, dy = \left[ \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx \right] \Delta z = f_Z(z)\, \Delta z $$

That is, $f_Z$ is the convolution of $f_X$ and $f_Y$.

Example

$f_X(x)$ and $f_Y(y)$ have the same uniform distribution: height 1/2 on the interval [-1, 1]. Combining (convolving) the two distributions, we obtain the distribution of Z = X + Y: a triangular PDF of height 1/2 supported on [-2, 2].

[Figure: the two uniform PDFs on [-1, 1] and the triangular PDF $f_{Z = X + Y}(z)$ on [-2, 2].]

Further, consider W = X + Y + V, where $f_V(v)$ has the same rectangular PDF as X and Y. The resulting PDF $f_W(w)$, supported on [-3, 3], is getting close to a Gaussian distribution.

[Figure: the bell-shaped PDF $f_W(w)$ on [-3, 3].]
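A quick numerical illustration of this tendency (a minimal Python sketch, not part of the original notes): histogram sums of independent uniform random variables and watch the shape approach a bell curve as more terms are added.

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal sketch (assumed illustration, not from the notes): the empirical density of a
# sum of independent uniform random variables on [-1, 1] approaches a Gaussian shape.
rng = np.random.default_rng(0)
n_samples = 200_000

for n_terms in (1, 2, 3, 10):
    s = rng.uniform(-1.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)
    plt.hist(s, bins=100, density=True, histtype="step", label=f"sum of {n_terms} terms")

plt.legend()
plt.xlabel("value of the sum")
plt.ylabel("empirical density")
plt.show()
```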

In general, the PDF of the random variable $\sum_{i=1}^{N} X_i$ approaches a Gaussian distribution, regardless of the PDF of each $X_i$, as N gets larger. More rigorously, the following central limit theorem has been proven.

A Central Limit Theorem of Independent Random Variables

Let $X_t$, $t = 0, 1, \ldots$, be a d-dimensional random variable with

Mean: $m = E(X_t)$
Covariance: $Q = E\left[ (X_t - m)(X_t - m)^T \right]$ (1)

Consider the normalized sum of $X_t - m$ given by

$$ Y_N = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m) $$ (2)

Then, as N tends to infinity, the distribution of $Y_N$ converges to the Gaussian distribution given by the PDF

$$ f_Y(y) = \frac{1}{(2\pi)^{d/2} \sqrt{\det Q}} \exp\left\{ -\frac{1}{2} y^T Q^{-1} y \right\} $$ (3)

where $y = \lim_{N \to \infty} \frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m)$.

13.3 Distribution of Estimate θˆN

Applying the central limit theorem, we can obtain the distribution of the estimate θˆN as N tends to infinity. Let θˆN be an estimate based on the prediction error method (PEM):

$$ \hat{\theta}_N = \arg\min_{\theta \in D_M} V_N(\theta, Z^N) $$ (4)

$$ V_N(\theta, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \frac{1}{2} \varepsilon^2(t, \theta) $$ (5)

For simplicity, we first assume that the predictor $\hat{y}(t \mid \theta)$ is given by a linear regression:

$$ \hat{y}(t \mid \theta) = \varphi^T(t)\, \theta $$ (6)

and that the parameter vector of the true system, θ0, is contained in the model set: θ0 ∈ DM.

The actual data are generated by

$$ y(t) = \varphi^T(t)\, \theta_0 + e_0(t) $$ (7)

where

$$ E[e_0(t)\, e_0(s)] = \begin{cases} \lambda_0 & t = s \\ 0 & t \ne s \end{cases} $$

Since θˆN minimizes $V_N(\theta, Z^N)$,

$$ V_N'(\hat{\theta}_N, Z^N) = \frac{d}{d\theta} V_N(\theta, Z^N) \Big|_{\theta = \hat{\theta}_N} = 0, \qquad V_N' \in \mathbb{R}^{d \times 1} $$ (8)

Using the mean value theorem, $V_N'$ can be expressed as

$$ V_N'(\hat{\theta}_N, Z^N) = V_N'(\theta_0, Z^N) + V_N''(\xi_N, Z^N)\, (\hat{\theta}_N - \theta_0) $$ (9)

where $\xi_N$ is a parameter vector somewhere between $\theta_0$ and $\hat{\theta}_N$.

Assuming that $V_N''(\xi_N, Z^N) = \frac{d}{d\theta} V_N'$ is non-singular and using (8) in (9),

$$ \hat{\theta}_N - \theta_0 = -\left[ V_N''(\xi_N, Z^N) \right]^{-1} V_N'(\theta_0, Z^N) $$ (10)

To obtain the distribution of θˆN − θ 0 , let us first examine V ' N (θ0 , Z N ) as N tends to
infinity.

$$ V_N'(\theta_0, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \varepsilon(t, \theta_0)\, \frac{d\varepsilon}{d\theta} \Big|_{\theta = \theta_0} $$ (11)

Recall $\varepsilon(t, \theta) = y(t) - \hat{y}(t \mid \theta)$ and (6):

$$ \frac{d\varepsilon}{d\theta} \Big|_{\theta_0} = -\frac{d}{d\theta} \hat{y}(t \mid \theta) \Big|_{\theta_0} = -\varphi^T(t), $$ (12)

and

$$ \varepsilon(t, \theta_0) = \varphi^T(t)\, \theta_0 + e_0(t) - \varphi^T(t)\, \theta_0 = e_0(t) $$

Therefore, (11) reduces to

$$ -V_N'(\theta_0, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\, e_0(t) $$ (13)

Let us treat $\varphi(t)\, e_0(t) \equiv X_t$ as a random variable. Its mean is zero, since

$$ m = E[\varphi(t)\, e_0(t)] = E[\varphi(t)]\, E[e_0(t)] = 0 $$ (14)

The covariance is

$$ \mathrm{cov}(X_t X_s^T) = E\left[ (X_t - m)(X_s - m)^T \right] = E\left[ \varphi(t)\, e_0(t)\, e_0(s)\, \varphi^T(s) \right] = 0 \quad \text{for } t \ne s $$ (15)

$$ \mathrm{cov}(X_t X_t^T) = E\left[ e_0^2(t) \right] E\left[ \varphi(t)\, \varphi^T(t) \right] = \lambda_0 R $$ (16)

Note that X1, X2, …, XN are independent, since the e0(t) are independent.

Consider

$$ Y_N = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} \varphi(t)\, e_0(t) $$

and apply the central limit theorem. The distribution of $Y_N$, i.e. $-\sqrt{N}\, V_N'(\theta_0, Z^N)$, converges to a Gaussian distribution as N tends to infinity:

$$ Y_N = -\sqrt{N}\, V_N'(\theta_0, Z^N) \sim N(0, \lambda_0 R) $$ (17)

Next, compute $V_N''(\xi_N, Z^N)$:

$$ V_N''(\xi_N, Z^N) = \frac{d}{d\theta} V_N'(\theta, Z^N) \Big|_{\theta = \xi_N} = \frac{d}{d\theta} \frac{1}{N} \sum_{t=1}^{N} \varepsilon(t, \theta)\, \frac{d\varepsilon}{d\theta} \Big|_{\theta = \xi_N} = \frac{1}{N} \sum_{t=1}^{N} \left\{ \frac{d\varepsilon}{d\theta} \left( \frac{d\varepsilon}{d\theta} \right)^T + \varepsilon(t, \theta)\, \frac{d^2 \varepsilon}{d\theta^2} \right\} \Big|_{\theta = \xi_N} = \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\, \varphi^T(t) $$ (18)

(For the linear regression, $d^2\varepsilon / d\theta^2 = 0$.) Therefore, under the ergodicity assumption,

$$ V_N''(\xi_N, Z^N) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\, \varphi^T(t) = R $$ (19)

From (10), (17), and (19), the distribution of $\sqrt{N}\, (\hat{\theta}_N - \theta_0)$ converges to the Gaussian distribution given by

$$ \sqrt{N}\, (\hat{\theta}_N - \theta_0) \sim N(0, Q) \quad \text{as } N \to \infty $$ (20)

where

$$ Q = R^{-1} (\lambda_0 R)\, R^{-1} = \lambda_0 R^{-1} $$ (21)

Note that when a coordinate transformation y = Ax is performed, the covariance matrix C associated with a multivariate Gaussian distribution is transformed to $A C A^T$. This is used in (21), with $A = R^{-1}$ and $C = \lambda_0 R$.
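As a sanity check on (20) and (21), here is a minimal Monte Carlo sketch in Python (an assumed illustration, not part of the original notes): it simulates the linear regression (6), (7) with white Gaussian regressors, so that R = I, and compares the sample covariance of √N(θˆN − θ0) with Q = λ0 R⁻¹.

```python
import numpy as np

# Minimal Monte Carlo sketch (assumed setup, not from the notes): for
# y(t) = phi(t)^T theta0 + e0(t), check that sqrt(N)*(theta_hat_N - theta0)
# has covariance close to Q = lam0 * R^{-1} from Eq. (21).
rng = np.random.default_rng(0)
theta0 = np.array([1.0, -0.5])
lam0, N, n_runs = 0.2, 2000, 500

scaled_errors = []
for _ in range(n_runs):
    Phi = rng.normal(size=(N, 2))                       # regressors phi(t); here R = E[phi phi^T] = I
    y = Phi @ theta0 + rng.normal(scale=np.sqrt(lam0), size=N)
    theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]  # PEM reduces to least squares here
    scaled_errors.append(np.sqrt(N) * (theta_hat - theta0))

scaled_errors = np.array(scaled_errors)
print(np.cov(scaled_errors.T))   # empirical covariance of sqrt(N)*(theta_hat - theta0)
print(lam0 * np.eye(2))          # theoretical Q = lam0 * R^{-1}
```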

[Figure: the distribution of $\sqrt{N}\,(\hat{\theta}_N - \theta_0)$ approaches N(0, Q) for large N; correspondingly, the distribution of $(\hat{\theta}_N - \theta_0)$ itself becomes narrower as N grows.]

Remarks

1) Eq. (20) shows that the standard deviation of θˆN − θ0 decreases at the rate of $1/\sqrt{N}$ for large N. See the figure above. Note that $\mathrm{cov}\, \hat{\theta}_N = \frac{1}{N} Q$.

2) The above result is for a very restrictive case. A similar result can be obtained for general cases under mild assumptions.
   • The true system (7) does not have to be assumed. Instead, $\theta^* = \arg\min V(\theta)$ must be contained in DM.
   • The linear regression (6) can be extended to a general predictor whose model parameter θ is determined by the prediction error method (4), (5).

The extended result for the distribution of the estimate is summarized in the following theorem, i.e. Theorem 9.1 of Ljung's textbook.

Theorem 1  Consider the estimate θˆN determined by (4) and (5). Assume that the model structure is linear and uniformly stable and that the data set $Z^\infty$ satisfies the quasi-stationarity and ergodicity requirements. Assume also that θˆN converges with probability 1 to a unique parameter vector θ* contained in DM:

$$ \hat{\theta}_N \to \theta^* \in D_M \quad \text{w.p. } 1 \text{ as } N \to \infty $$ (22)

and that

$$ V_N''(\theta^*) > 0 \quad \text{(positive definite)} $$ (23)

and that the sample average

$$ V_N'(\theta^*) = \frac{1}{N} \sum_{t=1}^{N} \left( \frac{d}{d\theta} \hat{y}(t \mid \theta) \right) \varepsilon(t, \theta) \Big|_{\theta = \theta^*} $$ (24)

converges with probability 1, as N tends to infinity, to the ensemble mean

$$ m_t = E\left[ \left( \frac{d}{d\theta} \hat{y}(t \mid \theta) \right) \varepsilon(t, \theta) \Big|_{\theta = \theta^*} \right] $$ (25)

Then, the distribution of $\sqrt{N}\, (\hat{\theta}_N - \theta^*)$ converges to the Gaussian distribution given by

$$ \sqrt{N}\, (\hat{\theta}_N - \theta^*) \sim N(0, P_\theta) $$ (26)

where $P_\theta$ is given by

$$ P_\theta = \left[ V_N''(\theta^*) \right]^{-1} Q \left[ V_N''(\theta^*) \right]^{-1} $$ (27)

$$ Q = \lim_{N \to \infty} N \cdot E\left[ V_N'(\theta^*) \left( V_N'(\theta^*) \right)^T \right] $$ (28)
The proof is quite complicated, since the random variables $\left( \frac{d}{d\theta} \hat{y}(t \mid \theta) \right) \varepsilon(t, \theta) \big|_{\theta^*}$ are not independent. Therefore, the standard central limit theorem is not applicable.

Appendix 9A, on p. 309 of Ljung's textbook, outlines the proof. Since the model structure is assumed to be stable uniformly in θ, $X_t$ and $X_s$ become independent as t and s grow far apart. Because of this property, the sum $\frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m_t)$ still converges to a Gaussian distribution.

13.4 Expression for the Asymptotic Variance.

As stated formally in Theorem 1, the distribution of $\sqrt{N}\, (\hat{\theta}_N - \theta^*)$ converges to a Gaussian distribution for a broad class of system identification problems. This implies that the covariance of θˆN asymptotically behaves as

$$ \mathrm{Cov}\, \hat{\theta}_N \sim \frac{1}{N} P_\theta $$ (29)
N

This is called the asymptotic covariance matrix.

The asymptotic variance depends not only on

(a) the number of samples (data set size) N, but also on

(b) the parameter sensitivity of the predictor,

$$ \psi(t, \theta^*) = \frac{d}{d\theta} \hat{y}(t \mid \theta) \Big|_{\theta^*} = -\frac{d}{d\theta} \varepsilon(t, \theta) \Big|_{\theta^*}, \ \text{and} $$ (30)

(c) the noise variance λ0.

Let us compute the covariance once again for the general case. From (5) and (30),


( )
N N

∑ ε (t ,θ )ψ (t ,θ )
d 1 1

VN θ , Z N =
N

t =1
ε (t ,θ )

=−
N t =1
(31)

Unlike the linear regression, the sensitivity ψ (t , θ ) is a function of θ ,

⎛ dε dψ ⎞
d2

(
V θ,Z N = −
2 N
1
N
) ∑ ⎜⎝ dθ ψ + ε ⎟
dθ ⎠
(32)
1 N
⎛ d2 ⎞
= ∑ ⎜⎜ψ (t , θ )ψ T (t , θ ) − ε (t , θ ) 2 yˆ (t θ ) ⎟⎟
t =1 ⎝ dθ
N ⎠

When the true system is contained in the model structure, θ0 ∈ DM, and θ0 is unique,

$$ \varepsilon(t, \theta_0) = e_0(t) $$ (33)

From (28), (31), and (33),

$$ Q = \lim_{N \to \infty} \frac{N}{N^2} \sum_{t=1}^{N} \sum_{s=1}^{N} E\left[ e_0(t)\, \psi(t, \theta_0)\, \psi^T(s, \theta_0)\, e_0(s) \right] = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \lambda_0\, E\left[ \psi(t, \theta_0)\, \psi^T(t, \theta_0) \right] = \lambda_0\, E\left[ \psi(t, \theta_0)\, \psi^T(t, \theta_0) \right] $$ (34)

Also, from (32),

$$ V''(\theta_0) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} E\left[ \psi(t, \theta_0)\, \psi^T(t, \theta_0) - \varepsilon(t, \theta_0)\, \frac{d^2}{d\theta^2} \hat{y}(t \mid \theta) \Big|_{\theta_0} \right] = E\left[ \psi(t, \theta_0)\, \psi^T(t, \theta_0) \right] - E\left[ e_0(t)\, \frac{d^2}{d\theta^2} \hat{y}(t \mid \theta) \Big|_{\theta_0} \right] $$ (35)
Note that $\frac{d^2}{d\theta^2} \hat{y}$ depends on $Z^{t-1}$, not on $Z^t$. Since $e_0(t)$ and $\frac{d^2}{d\theta^2} \hat{y}$ are therefore independent, the second term vanishes. Substituting (34) and (35) into (29),

$$ \mathrm{Cov}\, \hat{\theta}_N \sim \frac{1}{N} P_\theta = \frac{\lambda_0}{N} \left[ E\left( \psi(t, \theta_0)\, \psi^T(t, \theta_0) \right) \right]^{-1} $$ (36)

The asymptotic variance is therefore a) inversely proportional to the number of samples, b) proportional to the noise variance, and c) inversely related to the parameter sensitivity. The more a parameter affects the prediction, the smaller the variance becomes.
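As a concrete illustration (a worked example added here, not in the original notes), consider the scalar predictor ŷ(t|θ) = θ u(t−1) with a white input of variance σu² and noise variance λ0. Then ψ(t, θ0) = u(t−1), and (36) gives

$$ \mathrm{Cov}\, \hat{\theta}_N \sim \frac{\lambda_0}{N} \left[ E\, u^2(t-1) \right]^{-1} = \frac{\lambda_0}{N \sigma_u^2}, $$

so doubling the data length or the input power halves the variance, while doubling the noise variance doubles it.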

Since θ0 is not known, the asymptotic variance cannot be determined. In practice, however, an empirical estimate, like the following formula, works well for large N:
$$ \hat{P}_N = \hat{\lambda}_N \left[ \frac{1}{N} \sum_{t=1}^{N} \psi(t, \hat{\theta}_N)\, \psi^T(t, \hat{\theta}_N) \right]^{-1} $$ (37)

$$ \hat{\lambda}_N = \frac{1}{N} \sum_{t=1}^{N} \varepsilon^2(t, \hat{\theta}_N) $$ (38)


If one computes $\hat{P}_N$ during an experiment, it indicates how many data samples are needed to assure the desired model accuracy.
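In the linear-regression case the sensitivity ψ(t, θˆN) is simply the regressor φ(t), so (37) and (38) can be evaluated directly from the data. The Python sketch below (an assumed illustration, not part of the original notes; the data and dimensions are made up) returns θˆN, λ̂N, and the estimated covariance P̂N/N.

```python
import numpy as np

# Minimal sketch (assumed linear-regression setup, not from the notes) of Eqs. (37)-(38):
# the sensitivity psi(t, theta_hat) equals the regressor phi(t), so P_N is computable
# directly from the residuals and the regressors.
def empirical_asymptotic_cov(Phi, y):
    """Return theta_hat, lam_hat (Eq. 38), and the estimated covariance P_N / N (Eqs. 37, 29)."""
    N = len(y)
    theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
    eps = y - Phi @ theta_hat                    # prediction errors eps(t, theta_hat)
    lam_hat = np.mean(eps ** 2)                  # Eq. (38)
    R_hat = (Phi.T @ Phi) / N                    # (1/N) * sum of psi psi^T
    P_N = lam_hat * np.linalg.inv(R_hat)         # Eq. (37)
    return theta_hat, lam_hat, P_N / N           # Cov(theta_hat) ~ P_N / N, Eq. (29)

# Usage with synthetic data:
rng = np.random.default_rng(1)
Phi = rng.normal(size=(5000, 2))
y = Phi @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=5000)
theta_hat, lam_hat, cov_hat = empirical_asymptotic_cov(Phi, y)
print(theta_hat, lam_hat)
print(np.sqrt(np.diag(cov_hat)))  # standard errors, e.g. for +/- 1.96 sigma confidence intervals
```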

13.5 Frequency-Domain Expressions for the Asymptotic Variance.

The asymptotic variance has a different expression in the frequency domain, which we will find useful for variance analysis and experiment design.

Let the transfer function G(q, θ) and the noise model H(q, θ) be consolidated into a 1×2 matrix:

$$ T(q, \theta) = [\, G(q, \theta),\ H(q, \theta) \,] $$ (39)

The gradient of T, that is, the sensitivity of T to θ, is

$$ T'(q, \theta) = \frac{d}{d\theta} T(q, \theta) = [\, G'(q, \theta),\ H'(q, \theta) \,] $$ (40)

For the predictor, we have already defined W(q, θ) and Z(t) such that

$$ \hat{y}(t \mid \theta) = W_u(q)\, u(t) + W_y(q)\, y(t) = \left[\, W_u \ \ W_y \,\right] \begin{bmatrix} u \\ y \end{bmatrix} = W Z(t) $$

Therefore the predictor sensitivity ψ(t, θ) is given by

$$ \psi(t, \theta) = \frac{d}{d\theta} \hat{y}(t \mid \theta) = \left[\, W_u' \ \ W_y' \,\right] Z(t) $$ (41)

$W_u'$ and $W_y'$ are computed as

$$ W_u': \quad \frac{d}{d\theta} W_u(z, \theta) = \frac{d}{d\theta} \left[ H^{-1}(z, \theta)\, G(z, \theta) \right] = \frac{H G' - H' G}{H^2(z, \theta)} $$ (42)

$$ W_y': \quad \frac{d}{d\theta} W_y(z, \theta) = \frac{d}{d\theta} \left[ 1 - H^{-1}(z, \theta) \right] = \frac{H'(z, \theta)}{H^2(z, \theta)} $$

Substituting these back into ψ(t, θ),

$$ \psi(t, \theta) = \frac{1}{H^2(q, \theta)} \left[\, H G' - H' G,\ \ H' \,\right] Z(t) = \frac{1}{H^2(q, \theta)} \left[\, G',\ H' \,\right] \begin{bmatrix} H & 0 \\ -G & 1 \end{bmatrix} \begin{bmatrix} u(t) \\ y(t) \end{bmatrix} = \frac{1}{H(q, \theta)}\, T'(q, \theta) \begin{bmatrix} u(t) \\ -H^{-1} G u + H^{-1} y \end{bmatrix} $$ (43)

At θ = θ0 (the true system), note that ε(t, θ0) = e0(t) and

$$ -H^{-1}(q, \theta_0)\, G(q, \theta_0)\, u(t) + H^{-1}(q, \theta_0)\, y(t) = e_0(t) $$

$$ \therefore \quad \psi(t, \theta_0) = H^{-1}(q, \theta_0)\, T'(q, \theta_0)\, x_0(t) $$ (44)

where $x_0(t) = [\, u(t) \ \ e_0(t) \,]^T$.

Let $\Phi_{x_0}(\omega)$ be the spectrum matrix of $x_0(t)$:

$$ \Phi_{x_0}(\omega) = \begin{bmatrix} \Phi_u(\omega) & \Phi_{u e_0}(\omega) \\ \Phi_{u e_0}(-\omega) & \Phi_{e_0}(\omega) \end{bmatrix}, \qquad \Phi_{e_0}(\omega) = \lambda_0, \quad \Phi_{u e_0}(\omega) = 0 \ \text{for open loop} $$ (45)

Using the familiar formula $R_s(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_s(\omega)\, d\omega$,

$$ E\left[ \psi(t, \theta_0)\, \psi^T(t, \theta_0) \right] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| H(e^{i\omega}, \theta_0) \right|^{-2} T'(e^{i\omega}, \theta_0)\, \Phi_{x_0}(\omega)\, T'^{\,T}(e^{-i\omega}, \theta_0)\, d\omega $$ (46)

For the noise spectrum,

$$ \Phi_v(\omega) = \lambda_0 \left| H(e^{i\omega}, \theta_0) \right|^2 $$ (47)

Using these in (36),

$$ \mathrm{Cov}\, \hat{\theta}_N \sim \frac{1}{N} \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{\Phi_v(\omega)}\, T'(e^{i\omega}, \theta_0)\, \Phi_{x_0}(\omega)\, T'^{\,T}(e^{-i\omega}, \theta_0)\, d\omega \right]^{-1} $$

This is the asymptotic variance in the frequency domain.
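To make this expression concrete, the following Python sketch (an assumed example, not part of the original notes) evaluates the frequency-domain integral numerically for an FIR model G(q, θ) = θ1 q⁻¹ + θ2 q⁻² with H(q, θ) = 1, driven in open loop by an AR(1) input, and checks that it matches the time-domain formula (36).

```python
import numpy as np

# Minimal sketch (assumed example, not from the notes): frequency-domain evaluation of
# Cov(theta_hat_N) for G(q,theta) = th1*q^-1 + th2*q^-2, H = 1, with AR(1) input
# u(t) = a*u(t-1) + w(t), Var(w) = sigma_w2, in open loop.
a, sigma_w2, lam0, N = 0.5, 1.0, 0.1, 1000

w = np.linspace(-np.pi, np.pi, 20_000, endpoint=False)      # frequency grid
Phi_u = sigma_w2 / np.abs(1.0 - a * np.exp(-1j * w)) ** 2   # input spectrum
Phi_v = lam0 * np.ones_like(w)                              # noise spectrum, |H|^2 = 1

# T'(e^{iw}) has rows [dG/dth1, dH/dth1] = [e^{-iw}, 0] and [dG/dth2, dH/dth2] = [e^{-2iw}, 0];
# only the G-part contributes since H' = 0 and Phi_{u e0} = 0 (open loop).
E = np.stack([np.exp(-1j * w), np.exp(-2j * w)])            # shape (2, n_freq)

# Integrand of the bracketed term: (1/Phi_v) * T' * Phi_x0 * T'^H, reduced to the u-part.
integrand = (Phi_u / Phi_v) * E[:, None, :] * np.conj(E)[None, :, :]   # shape (2, 2, n_freq)
bracket = integrand.mean(axis=-1).real                      # (1/2pi) * integral over [-pi, pi)
cov_freq = np.linalg.inv(bracket) / N

# Time-domain counterpart, Eq. (36): psi(t) = [u(t-1), u(t-2)]^T, so E[psi psi^T] is the
# input autocovariance matrix with r_u(0) = sigma_w2/(1-a^2) and r_u(1) = a*r_u(0).
r0 = sigma_w2 / (1.0 - a ** 2)
R = np.array([[r0, a * r0], [a * r0, r0]])
cov_time = (lam0 / N) * np.linalg.inv(R)

print(cov_freq)
print(cov_time)   # the two matrices should agree closely
```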

