Lecture Notes No. 18: 2.160 System Identification, Estimation, and Learning
13.1 Overview
[Figure: The distribution of θ̂N about the true parameter θ* versus the iteration/data number N. The variance is large for small N and small for large N. How quickly does the variance reduce?]

The identified model parameter θ̂N comes with cov θ̂N as a "quality tag": a confidence interval on the estimate.
The mathematical tool needed for asymptotic variance analysis is the Central Limit Theorem. The following is a quick review of the theory.
Consider two independent random variables, X and Y, with PDFs f_X(x) and f_Y(y). Define another random variable Z as the sum of X and Y:

$$Z = X + Y$$

Let us obtain the PDF of Z.
$$\Pr(z \le Z \le z + \Delta z) = \iint_{\Delta_{XY}} f_X(x)\, f_Y(y)\, dx\, dy = \left[ \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx \right] \Delta z = f_Z(z)\, \Delta z$$

[Figure: The region Δ_XY in the (x, y) plane: the strip between the lines z = x + y and z + Δz = x + y.]

The PDF of Z is therefore the convolution of f_X and f_Y.
Example: Let X and Y each be uniformly distributed on [−1, 1], so that f_X(x) = f_Y(y) = 1/2 on that interval. Convolving the two rectangular PDFs gives a triangular PDF f_Z(z) supported on [−2, 2]. Adding a third such variable gives a random variable W whose PDF f_W(w) is piecewise quadratic and bell-shaped, supported on [−3, 3] with breakpoints at ±1.

[Figure: The rectangular PDFs f_X and f_Y on [−1, 1], the triangular PDF of Z on [−2, 2], and the bell-shaped PDF f_W(w) on [−3, 3].]

The resulting PDF quickly gets close to a Gaussian distribution.
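This convergence can be checked numerically by convolving the uniform density with itself and comparing the result against the Gaussian of matching variance. The following Python sketch is not part of the original notes; the grid spacing and the number of convolutions are arbitrary choices.

```python
# Convolve uniform PDFs on [-1, 1] and compare with the Gaussian limit.
import numpy as np

dx = 0.001
x = np.arange(-1, 1 + dx, dx)
f = np.full_like(x, 0.5)             # f_X(x) = 1/2 on [-1, 1]

pdf = f
for n in range(2, 5):                # PDF of the sum of n uniform variables
    pdf = np.convolve(pdf, f) * dx   # Riemann approximation of the convolution
    support = np.linspace(-n, n, len(pdf))
    var = n / 3.0                    # each uniform variable has variance 1/3
    gauss = np.exp(-support**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    print(f"n = {n}: max |f_sum - gaussian| = {np.max(np.abs(pdf - gauss)):.4f}")
```

The printed gap shrinks as n grows, mirroring the figure above.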
In general, the PDF of the sum of random variables, $\sum_{i=1}^{N} X_i$, approaches a Gaussian distribution, regardless of the PDF of each $X_i$, as N gets larger. More rigorously, the following Central Limit Theorem has been proven. Let the $X_t$ have mean m, and define

$$Y_N = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m) \tag{2}$$

As N tends to infinity, the distribution of $Y_N$ converges to the Gaussian distribution

$$f_Y(y) = \frac{1}{(2\pi)^{d/2} \sqrt{\det Q}} \exp\left\{ -\frac{1}{2} y^T Q^{-1} y \right\} \tag{3}$$

where $y = \lim_{N \to \infty} \frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m)$ and Q is the covariance of the limit distribution.
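A quick Monte Carlo check of (2) and (3) in the scalar case: the sketch below (not from the notes; the exponential distribution is an arbitrary non-Gaussian choice) verifies that $Y_N$ has variance Q = var($X_t$) and nearly zero skewness for large N.

```python
# Monte Carlo check of the Central Limit Theorem for exponential X_t.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 5000
m, Q = 1.0, 1.0                                   # exponential(1): mean 1, variance 1
X = rng.exponential(scale=1.0, size=(trials, N))
Y = (X - m).sum(axis=1) / np.sqrt(N)              # Y_N, eq. (2)
skew = ((Y - Y.mean())**3).mean() / Y.std()**3
print(f"var(Y_N) = {Y.var():.3f} (theory: Q = {Q})")
print(f"skewness = {skew:.3f} (Gaussian: 0)")
```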
Applying the Central Limit Theorem, we can obtain the distribution of the estimate θ̂N as N tends to infinity.
Let θ̂N be an estimate based on the prediction error method (PEM):

$$\hat{\theta}_N = \arg\min_{\theta \in D_M} V_N(\theta, Z^N) \tag{4}$$

$$V_N(\theta, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \frac{1}{2} \varepsilon^2(t, \theta) \tag{5}$$

For simplicity, we first assume that the predictor $\hat{y}(t|\theta)$ is given by a linear regression:

$$\hat{y}(t|\theta) = \varphi^T(t)\, \theta \tag{6}$$
and that the parameter vector of the true system, θ0, is contained in the model set, θ0 ∈ DM. The observed output is then

$$y(t) = \varphi^T(t)\, \theta_0 + e_0(t) \tag{7}$$

where $e_0(t)$ is white noise with variance λ0:

$$E[e_0(t) e_0(s)] = \begin{cases} \lambda_0 & t = s \\ 0 & t \neq s \end{cases}$$
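As a concrete setting for the derivation below, data satisfying (7) might be generated as in this brief sketch (the regressor distribution, dimensions, and noise level are arbitrary assumptions, not part of the notes):

```python
# Generate data from the true system y(t) = phi^T(t) theta0 + e0(t).
import numpy as np

rng = np.random.default_rng(0)
N, lam0 = 200, 0.5
theta0 = np.array([1.0, -2.0])                  # true parameter vector
Phi = rng.normal(size=(N, 2))                   # rows are the regressors phi(t)
e0 = rng.normal(scale=np.sqrt(lam0), size=N)    # white noise, variance lambda0
y = Phi @ theta0 + e0
```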
Since θ̂N minimizes $V_N$, the gradient vanishes at the estimate:

$$V'_N(\hat{\theta}_N, Z^N) = \frac{d}{d\theta} V_N(\theta, Z^N) \Big|_{\theta = \hat{\theta}_N} = 0, \qquad V'_N \in R^{d \times 1} \tag{8}$$

Expanding $V'_N$ about the true parameter θ0, the mean value theorem gives, for some ξN between θ̂N and θ0,

$$V'_N(\hat{\theta}_N, Z^N) = V'_N(\theta_0, Z^N) + V''_N(\xi_N, Z^N)(\hat{\theta}_N - \theta_0) \tag{9}$$

Assuming that $V''_N(\xi_N, Z^N) = \frac{d}{d\theta} V'_N$ is non-singular and using (8) in (9),

$$\hat{\theta}_N - \theta_0 = -\left[ V''_N(\xi_N, Z^N) \right]^{-1} V'_N(\theta_0, Z^N) \tag{10}$$
To obtain the distribution of θ̂N − θ0, let us first examine $V'_N(\theta_0, Z^N)$ as N tends to infinity:

$$V'_N(\theta_0, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \varepsilon(t, \theta_0)\, \frac{d\varepsilon}{d\theta} \Big|_{\theta = \theta_0} \tag{11}$$

Recall $\varepsilon(t, \theta) = y(t) - \hat{y}(t|\theta)$ and (6):

$$\frac{d\varepsilon}{d\theta} \Big|_{\theta_0} = -\frac{d}{d\theta} \hat{y}(t|\theta) \Big|_{\theta_0} = -\varphi^T(t) \tag{12}$$

and

$$\varepsilon(t, \theta_0) = \varphi^T(t)\theta_0 + e_0(t) - \varphi^T(t)\theta_0 = e_0(t)$$
Therefore

$$-V'_N(\theta_0, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \varphi(t)\, e_0(t) \tag{13}$$
The covariance of this sum follows from the whiteness of $e_0(t)$: since $e_0(t)$ is zero-mean and independent of $\varphi(s)$ for all $s \le t$, the cross terms vanish and

$$E\left[ \left( \frac{1}{\sqrt{N}} \sum_{t=1}^{N} \varphi(t) e_0(t) \right) \left( \frac{1}{\sqrt{N}} \sum_{s=1}^{N} \varphi(s) e_0(s) \right)^T \right] = \frac{\lambda_0}{N} \sum_{t=1}^{N} E[\varphi(t) \varphi^T(t)]$$

Consider

$$Y_N = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} \varphi(t)\, e_0(t)$$

and apply the Central Limit Theorem. The distribution of $Y_N$, i.e. $-\sqrt{N}\, V'_N(\theta_0, Z^N)$, converges to a Gaussian distribution as N tends to infinity.
Next, examine the second derivative evaluated at ξN:

$$V''_N(\xi_N, Z^N) = \frac{d}{d\theta} V'_N(\theta, Z^N) \Big|_{\theta = \xi_N} = \frac{d}{d\theta} \left[ \frac{1}{N} \sum_{t=1}^{N} \varepsilon(t, \theta) \frac{d\varepsilon}{d\theta} \right]_{\theta = \xi_N} \tag{18}$$

$$= \frac{1}{N} \sum_{t=1}^{N} \left\{ \frac{d\varepsilon}{d\theta} \left( \frac{d\varepsilon}{d\theta} \right)^T + \varepsilon(t, \theta) \frac{d^2\varepsilon}{d\theta^2} \right\} \Big|_{\theta = \xi_N} = \frac{1}{N} \sum_{t=1}^{N} \varphi(t) \varphi^T(t)$$

since the second derivative of ε vanishes for the linear regression (6). Hence

$$\lim_{N \to \infty} V''_N(\xi_N, Z^N) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \varphi(t) \varphi^T(t) = R \tag{19}$$
From (10), (17) and (19), the distribution of $\sqrt{N}(\hat{\theta}_N - \theta_0)$ converges to the Gaussian distribution given by

$$\sqrt{N}(\hat{\theta}_N - \theta_0) \to N(0, Q), \qquad Q = \lambda_0 R^{-1} \tag{20}$$

[Figure: For large N, $\sqrt{N}(\hat{\theta}_N - \theta_0)$ is distributed as N(0, Q), while the distribution of $(\hat{\theta}_N - \theta_0)$ itself concentrates around zero.]
Remarks

1) Eq. (20) shows that the standard deviation of θ̂N − θ0 decreases at the rate of $1/\sqrt{N}$ for large N. See the figure above. Note that $\mathrm{cov}\, \hat{\theta}_N = \frac{1}{N} Q$.

2) The above result is for a very restrictive case. A similar result can be obtained for general cases under mild assumptions.
• The true system (7) does not have to be assumed. Instead, $\theta^* = \arg\min \bar{V}(\theta)$ must be contained in DM.

• The linear regression (6) can be extended to a general predictor, where the model parameter θ is determined by the prediction error method (4), (5).
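The $1/\sqrt{N}$ rate in Remark 1 is easy to verify numerically. The following Monte Carlo sketch (a constructed example continuing the earlier simulation, with R = I by choice of regressors) checks that the empirical standard deviation of θ̂N matches $\sqrt{\lambda_0 / N}$:

```python
# Monte Carlo check: std(theta_hat_N - theta0) ~ sqrt(lambda0 / N).
import numpy as np

rng = np.random.default_rng(0)
theta0, lam0 = np.array([1.0, -2.0]), 0.5
for N in (100, 400, 1600):
    est = []
    for _ in range(2000):
        Phi = rng.normal(size=(N, 2))           # E[phi phi^T] = I, so Q = lam0 * I
        y = Phi @ theta0 + rng.normal(scale=np.sqrt(lam0), size=N)
        est.append(np.linalg.lstsq(Phi, y, rcond=None)[0])
    emp = np.asarray(est).std(axis=0)
    print(f"N = {N:4d}: empirical std = {emp}, theory = {np.sqrt(lam0 / N):.4f}")
```

Quadrupling N halves the standard deviation, as (20) predicts.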
The extended result on the distribution of the estimate is summarized in the following theorem, i.e. Theorem 9.1 of Ljung's textbook.
Theorem 1: Consider the estimate θ̂N determined by (4) and (5). Assume that the model structure is linear and uniformly stable, and that the data set $Z^\infty$ satisfies the quasi-stationarity and ergodicity requirements. Assume also that θ̂N converges with probability 1 to a unique parameter vector θ* contained in DM, that

$$V''_N(\theta^*) > 0 \; ; \; \text{positive definite} \tag{23}$$

and that

$$\bar{V}'(\theta^*) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \left( \frac{d}{d\theta} \hat{y}(t|\theta) \right) \varepsilon(t, \theta) \Big|_{\theta^*} \quad \text{converges with probability 1} \tag{24}$$

Then the distribution of $\sqrt{N}(\hat{\theta}_N - \theta^*)$ converges to the Gaussian distribution $N(0, P_\theta)$, where $P_\theta$ is given by

$$P_\theta = \left[ V''_N(\theta^*) \right]^{-1} Q \left[ V''_N(\theta^*) \right]^{-1} \tag{27}$$

$$Q = \lim_{N \to \infty} N \cdot E\left[ \left( V'_N(\theta^*) \right) \left( V'_N(\theta^*) \right)^T \right] \tag{28}$$
The proof is quite complicated, since the random variables $\left( \frac{d}{d\theta} \hat{y}(t|\theta) \right) \varepsilon(t, \theta) \big|_{\theta^*}$ are not independent across time; therefore, the standard Central Limit Theorem is not applicable. Appendix 9A, at p. 309 of Ljung's textbook, gives an outline of the proof. Since the model structure is assumed to be stable uniformly in θ, $X_t$ and $X_s$ become nearly independent as t and s grow far apart. Because of this property, the sum $\frac{1}{\sqrt{N}} \sum_{t=1}^{N} (X_t - m_t)$ still converges to a Gaussian distribution.
13.4 Expression for the Asymptotic Variance

Define the gradient of the predictor,

$$\psi(t, \theta^*) = \frac{d}{d\theta} \hat{y}(t|\theta) \Big|_{\theta^*} = -\frac{d}{d\theta} \varepsilon(t, \theta) \Big|_{\theta^*} \tag{30}$$

Let us compute the covariance once again for the general case. From (5) and (30),

$$\frac{d}{d\theta} V_N(\theta, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \varepsilon(t, \theta) \frac{d\varepsilon}{d\theta} = -\frac{1}{N} \sum_{t=1}^{N} \varepsilon(t, \theta)\, \psi(t, \theta) \tag{31}$$
$$\frac{d^2}{d\theta^2} V_N(\theta, Z^N) = -\frac{1}{N} \sum_{t=1}^{N} \left( \frac{d\varepsilon}{d\theta} \psi + \varepsilon \frac{d\psi}{d\theta} \right) = \frac{1}{N} \sum_{t=1}^{N} \left( \psi(t, \theta) \psi^T(t, \theta) - \varepsilon(t, \theta) \frac{d^2}{d\theta^2} \hat{y}(t|\theta) \right) \tag{32}$$
When the true system is contained in the model structure, θ0 ∈ DM, and θ0 is unique,

$$\varepsilon(t, \theta_0) = e_0(t) \tag{33}$$
$$Q = \lim_{N \to \infty} \frac{N}{N^2} \sum_{t=1}^{N} \sum_{s=1}^{N} E\left[ e_0(t)\, \psi(t, \theta_0)\, \psi^T(s, \theta_0)\, e_0(s) \right] = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \lambda_0 E\left[ \psi(t, \theta_0) \psi^T(t, \theta_0) \right] = \lambda_0 E\left[ \psi(t, \theta_0) \psi^T(t, \theta_0) \right] \tag{34}$$

$$\bar{V}''(\theta_0) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} E\left[ \psi(t, \theta_0) \psi^T(t, \theta_0) - \varepsilon(t, \theta_0) \frac{d^2}{d\theta^2} \hat{y}(t|\theta) \Big|_{\theta_0} \right] = E\left[ \psi(t, \theta_0) \psi^T(t, \theta_0) \right] - E\left[ e_0(t) \frac{d^2}{d\theta^2} \hat{y} \Big|_{\theta_0} \right] \tag{35}$$
In the second term, $\frac{d^2}{d\theta^2} \hat{y} \big|_{\theta_0}$ depends on $Z^{t-1}$, not on $Z^t$. Since $e_0(t)$ and $\frac{d^2}{d\theta^2} \hat{y}$ are therefore independent, the second term vanishes. Substituting (34) and (35) into (29),

$$\mathrm{Cov}\, \hat{\theta}_N \sim \frac{1}{N} P_\theta = \frac{\lambda_0}{N} \left[ E\left( \psi(t, \theta_0) \psi^T(t, \theta_0) \right) \right]^{-1} \tag{36}$$
The noise variance λ0 is unknown in practice and can be estimated from the prediction errors as

$$\hat{\lambda}_N = \frac{1}{N} \sum_{t=1}^{N} \varepsilon^2(t, \hat{\theta}_N) \tag{38}$$
If one computes $P_N$ during an experiment, one can judge whether enough data samples have been collected to assure the desired model accuracy.
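The following sketch (continuing the earlier constructed regression example; the data-generating choices are assumptions, not part of the notes) computes the noise-variance estimate (38) and the resulting covariance "quality tag" (36) from a single data set:

```python
# Estimate lambda_hat (eq. 38) and the covariance tag P_N (eq. 36) from data.
import numpy as np

rng = np.random.default_rng(1)
N, theta0 = 500, np.array([2.0, -0.5])
Phi = rng.normal(size=(N, 2))                        # regressors phi(t)
y = Phi @ theta0 + rng.normal(scale=1.0, size=N)     # true system, lambda0 = 1

theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]   # PEM = least squares here
eps = y - Phi @ theta_hat                            # prediction errors
lam_hat = (eps**2).mean()                            # eq. (38)
P_N = (lam_hat / N) * np.linalg.inv(Phi.T @ Phi / N) # eq. (36), E[psi psi^T]
                                                     # replaced by its sample mean
print("theta_hat        :", theta_hat)
print("std 'quality tag':", np.sqrt(np.diag(P_N)))
```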
The asymptotic variance has a different expression in the frequency domain, which we will find useful for variance analysis and experiment design.
Let the transfer function G(q, θ) and the noise model H(q, θ) be consolidated into a 1×2 matrix:

$$T(q, \theta) = [G(q, \theta),\; H(q, \theta)] \tag{39}$$
Recall that the one-step-ahead predictor can be written as $\hat{y}(t|\theta) = W_u(q, \theta)\, u(t) + W_y(q, \theta)\, y(t)$ with $W_u = H^{-1} G$ and $W_y = 1 - H^{-1}$. Differentiating with respect to θ,

$$W'_y: \quad \frac{d}{d\theta} W_y(z, \theta) = \frac{d}{d\theta} \left[ 1 - H^{-1}(z, \theta) \right] = \frac{H'(z, \theta)}{H^2(z, \theta)}, \qquad W'_u: \quad \frac{d}{d\theta} W_u(z, \theta) = \frac{H G' - H' G}{H^2}$$

so that

$$\psi(t, \theta) = \frac{1}{H^2(q, \theta)} \left[ H G' - H' G,\; H' \right] Z(t) = \frac{1}{H^2(q, \theta)} [G',\; H'] \begin{bmatrix} H & 0 \\ -G & 1 \end{bmatrix} \begin{bmatrix} u(t) \\ y(t) \end{bmatrix} \tag{43}$$

$$= \frac{1}{H(q, \theta)}\, T'(q, \theta) \begin{bmatrix} u(t) \\ -H^{-1} G u + H^{-1} y \end{bmatrix}$$

At θ = θ0 the second entry equals $H^{-1}(y - Gu) = e_0(t)$, so

$$\psi(t, \theta_0) = \frac{1}{H(q, \theta_0)}\, T'(q, \theta_0)\, x_0(t)$$

where $x_0(t) = [u(t) \;\; e_0(t)]^T$.
The spectrum of $x_0(t)$ is

$$\Phi_{x_0}(\omega) = \begin{bmatrix} \Phi_u(\omega) & \Phi_{u e_0}(\omega) \\ \Phi_{u e_0}(-\omega) & \Phi_{e_0}(\omega) \end{bmatrix}, \qquad \Phi_{e_0}(\omega) = \lambda_0, \quad \Phi_{u e_0}(\omega) = 0 \text{ for open-loop experiments} \tag{45}$$

Recall the familiar formula

$$R_s(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_s(\omega)\, d\omega$$
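As a quick sanity check of this formula, the sketch below (an AR(1) example chosen arbitrarily; not from the notes) compares the variance $R_s(0)$ known in the time domain with the frequency-domain integral of the spectrum:

```python
# Check R_s(0) = (1/2pi) * integral of Phi_s(omega) for an AR(1) process.
import numpy as np

a, lam = 0.8, 1.0                                  # s(t) = a s(t-1) + e(t), var(e) = lam
w = np.linspace(-np.pi, np.pi, 200001)
Phi_s = lam / np.abs(1 - a * np.exp(-1j * w))**2   # spectrum of the AR(1) process
R0 = Phi_s.mean()                                  # = (1/2pi) * integral over [-pi, pi]
print(f"frequency domain: {R0:.4f}")
print(f"time domain     : {lam / (1 - a**2):.4f}") # known AR(1) variance
```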
Applying this formula to $\psi(t, \theta_0) = H^{-1} T' x_0$ gives

$$E\left[ \psi(t, \theta_0) \psi^T(t, \theta_0) \right] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| H(e^{i\omega}, \theta_0) \right|^{-2} T'(e^{i\omega}, \theta_0)\, \Phi_{x_0}(\omega)\, T'^T(e^{-i\omega}, \theta_0)\, d\omega \tag{46}$$
Substituting (46) into (36) gives the frequency-domain expression for the asymptotic covariance:

$$\mathrm{Cov}\, \hat{\theta}_N \sim \frac{\lambda_0}{N} \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| H(e^{i\omega}, \theta_0) \right|^{-2} T'(e^{i\omega}, \theta_0)\, \Phi_{x_0}(\omega)\, T'^T(e^{-i\omega}, \theta_0)\, d\omega \right]^{-1}$$