Limit Theorems, OLS, and HAC
14.384 Time Series Analysis, Fall 2007
Lecture 2
Professor Anna Mikusheva
Paul Schrimpf, scribe
September 11, 2007; revised September 9, 2013

Limit Theorems
Recall the CLT from the last lecture:
$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}x_i - \mu\right)\Big/\sigma \Rightarrow N(0,1).$$
We stated these while assuming independence. In time series we usually don't have independence. Let us explore where independence may have been used.
First, let's start with the simplest proof of the LLN:
$$E\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i - \mu\right)^2\right] = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) \qquad (1)$$
$$= \frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^{n}x_i\right) \qquad (2)$$
$$= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i) \qquad (3)$$
$$= \frac{n\sigma^2}{n^2} \to 0 \qquad (4)$$
Independence was used in step (3), where the variance of the sum became the sum of the variances. Without independence, for a weakly stationary series with autocovariances $\gamma_k$,
$$\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) = \frac{1}{n}\left(2\sum_{k=1}^{n-1}\gamma_k\left(1-\frac{k}{n}\right) + \gamma_0\right).$$
If $\sum_{j=-\infty}^{\infty}|\gamma_j| < \infty$, then
$$\lim_{n\to\infty}\frac{1}{n}\left(2\sum_{k=1}^{n-1}\gamma_k\left(1-\frac{k}{n}\right) + \gamma_0\right) = 0.$$
Thus, we have:
Lemma 1. If $x_t$ is a weakly stationary time series (with mean $\mu$) with absolutely summable autocovariances, then a law of large numbers holds (in probability and in $L^2$).
Remark 2. Stationarity is not enough. Let $z \sim N(0, \sigma^2)$. Suppose $x_t = z$ for all $t$. Then $\operatorname{cov}(x_t, x_s) = \sigma^2$ for all $t, s$, so we do not have absolute summability, and clearly we do not have a LLN for $\{x_t\}$, since the average $\frac{1}{n}\sum_{i=1}^{n}x_i$ equals $z$, which is random.
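As a small numerical illustration of Remark 2 (a sketch, assuming numpy is available; $\sigma$ and the sample sizes below are arbitrary choices): the sample mean of $x_t = z$ has variance $\sigma^2$ no matter how large $n$ gets.

```python
import numpy as np

# Remark 2: x_t = z for all t, with z ~ N(0, sigma^2).
# The sample mean equals z exactly, so its variance stays sigma^2 for every n -- no LLN.
rng = np.random.default_rng(0)
sigma = 2.0

for n in (10, 1_000, 100_000):
    means = []
    for _ in range(1_000):            # Monte Carlo replications
        z = rng.normal(0.0, sigma)
        x = np.full(n, z)             # the whole path is the single draw z
        means.append(x.mean())        # equals z exactly
    print(n, np.var(means))           # stays near sigma^2 = 4 as n grows
```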
Remark 3. For an MA process $x_t = c(L)e_t$, absolute summability of the coefficients, $\sum_{j=0}^{\infty}|c_j| < \infty$, implies $\sum_k |\gamma_k| < \infty$.
The proof is easy. Last time we showed that
$$\gamma_k = \sigma^2\sum_{j=0}^{\infty}c_j c_{j+k},$$
so
$$\sum_{k=0}^{\infty}|\gamma_k| = \sigma^2\sum_{k=0}^{\infty}\Big|\sum_{j=0}^{\infty}c_j c_{j+k}\Big| \le \sigma^2\sum_{k=0}^{\infty}\sum_{j=0}^{\infty}|c_j||c_{j+k}| \le \sigma^2\sum_{l=0}^{\infty}\sum_{j=0}^{\infty}|c_j||c_l| = \sigma^2\left(\sum_{j=0}^{\infty}|c_j|\right)^2 < \infty.$$
From the new proof of the LLN one can guess that the variance in a central limit theorem should also change. Remember that we wish to normalize the sum in such a way that the limiting variance is 1.
$$\operatorname{Var}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}x_i\right) = \gamma_0 + 2\sum_{k=1}^{n-1}\left(1-\frac{k}{n}\right)\gamma_k \longrightarrow \gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k = J$$
Theorem 4. Let $y_t = \mu + \sum_{j=0}^{\infty}c_j e_{t-j}$, where $e_t$ is independent white noise and $\sum_{j=0}^{\infty}|c_j| < \infty$. Then
$$\sqrt{T}\left(\frac{1}{T}\sum_{t=1}^{T}y_t - \mu\right) \Rightarrow N(0, J).$$
Let $I_t$ be the information available at time $t$, i.e. $I_t$ is the sigma-algebra generated by $\{y_j\}_{j=-\infty}^{t}$. Let $\xi_{t,k} = E[y_t|I_{t-k}] - E[y_t|I_{t-k-1}]$ be the revision of the forecast of $y_t$ as the new information arrives at time $t-k$.
Definition 5. A strictly stationary process $\{y_t\}$ is ergodic if for any $t, k, l$ and any bounded functions $g$ and $h$,
$$\lim_{n\to\infty}\operatorname{cov}\big(g(y_t, \ldots, y_{t+k}),\, h(y_{t+k+n}, \ldots, y_{t+k+n+l})\big) = 0.$$
Theorem 6 (Gordin's CLT). Assume that we have a strictly stationary and ergodic series $\{y_t\}$ with $Ey_t^2 < \infty$ satisfying:
1. $\sum_{j}\left(E\,\xi_{t,j}^2\right)^{1/2} < \infty$
2. $E[y_t|I_{t-j}] \to 0$ in $L^2$ as $j \to \infty$
Then
$$\frac{1}{\sqrt{T}}\sum_{t=1}^{T}y_t \Rightarrow N(0, J),$$
where $J = \gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k$ is the long-run variance.
Remark 7. Notice that $y_t = \sum_{j=0}^{\infty}\xi_{t,j}$. Condition 1 is intended to make the dependence between distant observations decrease to 0. Condition 1 can be checked (see the example below); I am not sure how ergodicity can be easily checked. Condition 2 is aimed at the correct centering; in particular, it implies that $E[y_t] = 0$.
Example 8. AR(1): $y_t = \rho y_{t-1} + e_t$ with $|\rho| < 1$.
We can check condition 1. We have $\xi_{t,k} = E[y_t|I_{t-k}] - E[y_t|I_{t-k-1}] = \rho^k e_{t-k}$ and $E\,\xi_{t,k}^2 = \rho^{2k}\sigma^2$, so condition 1 is satisfied. More generally, if the MA representation has absolutely summable coefficients, then condition 1 will hold. One can notice that $E[y_t|I_{t-k}] = \rho^k y_{t-k}$, so condition 2 holds as well. Now let's calculate the long-run variance:
$$\gamma_k = \frac{\sigma^2\rho^k}{1-\rho^2},$$
$$J = \gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k = \frac{\sigma^2}{1-\rho^2}\left(1 + 2\sum_{k=1}^{\infty}\rho^k\right) = \frac{\sigma^2}{(1-\rho)^2}.$$
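As a quick check of Example 8 (a minimal simulation sketch, assuming numpy; $\rho$, $\sigma$, $T$, and the number of replications are arbitrary choices): the variance of $\sqrt{T}\,\bar y_T$ should be close to $J = \sigma^2/(1-\rho)^2$, not to $\gamma_0$.

```python
import numpy as np

# Example 8 check: for y_t = rho*y_{t-1} + e_t, Var(sqrt(T)*ybar) should approach
# J = sigma^2/(1-rho)^2 rather than gamma_0 = sigma^2/(1-rho^2).
rng = np.random.default_rng(1)
rho, sigma, T, reps = 0.5, 1.0, 1_000, 2_000

stats = []
for _ in range(reps):
    e = rng.normal(0.0, sigma, T)
    y = np.zeros(T)
    for t in range(1, T):              # start from y_0 = 0; negligible for large T
        y[t] = rho * y[t - 1] + e[t]
    stats.append(np.sqrt(T) * y.mean())

print("simulated Var(sqrt(T)*ybar):", np.var(stats))
print("J = sigma^2/(1-rho)^2:      ", sigma**2 / (1 - rho) ** 2)   # 4.0
print("gamma_0 = sigma^2/(1-rho^2):", sigma**2 / (1 - rho**2))     # about 1.33
```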
Remark 9.
$$J = \gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k = \sum_{k=-\infty}^{\infty}\gamma_k = \gamma(1),$$
where $\gamma(\cdot)$ is the autocovariance generating function from last lecture evaluated at 1. Recall:
$$\gamma(z) = \sum_{i=-\infty}^{\infty}\gamma_i z^i.$$
In particular, for an ARMA process $a(L)y_t = b(L)e_t$,
$$\gamma(z) = \frac{b(z)b(z^{-1})}{a(z)a(z^{-1})}\,\sigma^2, \qquad \text{so} \qquad J = \left(\frac{b(1)}{a(1)}\right)^2\sigma^2.$$
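As a sanity check of Remark 9 (a minimal sketch; the ARMA(1,1) coefficients and the truncation lag below are arbitrary illustrations), one can compare $J = (b(1)/a(1))^2\sigma^2$ with the sum of autocovariances computed from the MA($\infty$) representation $\gamma_k = \sigma^2\sum_j c_j c_{j+k}$.

```python
import numpy as np

# Remark 9 check: for (1 - a1*L) y_t = (1 + b1*L) e_t,
# J = (b(1)/a(1))^2 * sigma^2 should equal sum_k gamma_k.
a1, b1, sigma = 0.6, 0.3, 1.5

J_formula = ((1 + b1) / (1 - a1)) ** 2 * sigma**2

# MA(infinity) coefficients c_j of y_t = c(L) e_t, truncated at a large lag.
K = 500
c = np.zeros(K)
c[0] = 1.0
c[1] = a1 + b1
for j in range(2, K):
    c[j] = a1 * c[j - 1]                      # c_j = a1^(j-1) * (a1 + b1) for j >= 1

# gamma_k = sigma^2 * sum_j c_j c_{j+k};  J = gamma_0 + 2 * sum_{k>=1} gamma_k
gammas = [sigma**2 * np.dot(c[: K - k], c[k:]) for k in range(K)]
J_sum = gammas[0] + 2 * sum(gammas[1:])

print(J_formula, J_sum)                        # agree up to truncation error
```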
Remark 10. If $\{y_t\}$ is a vector, then let $\Gamma_k = \operatorname{cov}(y_t, y_{t+k})$ and $J = \sum_{k=-\infty}^{\infty}\Gamma_k$. The only thing that is different from the scalar case is that $\Gamma_k \ne \Gamma_{-k}$; instead, $\Gamma_{-k} = \Gamma_k'$. All the formulas above also hold, except in matrix notation. For example, for a VARMA $A(L)y_t = B(L)e_t$ with $\operatorname{var}(e_t) = \Sigma$,
$$J = A(1)^{-1}B(1)\,\Sigma\,B(1)'\left(A(1)^{-1}\right)'.$$
Remark 11. If $y_t$ is a martingale difference sequence, $E[y_t|I_{t-1}] = 0$, then there is no serial correlation and $J = \gamma_0 = \sigma^2$.
OLS
Suppose $y_t = x_t'\beta + u_t$. In cross-sections, $x_t$ is independent of $u_s$ for $s \ne t$ due to the iid assumption, so the exclusion restriction is formulated as $E(u_t|x_t) = 0$. In time series, however, we have to describe the dependence between the error terms and all regressors.
Definition 12. $x_t$ is weakly exogenous if $E(u_t|x_t, x_{t-1}, \ldots) = 0$.
Definition 13. $x_t$ is strictly exogenous if $E(u_t|\{x_t\}_{t=-\infty}^{\infty}) = 0$.
Usually, strict exogeneity is too strong an assumption; it is difficult to find a good empirical example for it. Weak exogeneity is much more practical (and we will mainly assume it).
OLS estimator: $\hat\beta = (X'X)^{-1}X'y$.
What is the asymptotic distribution?
$$\sqrt{T}(\hat\beta - \beta) = \left(\frac{1}{T}X'X\right)^{-1}\left(\frac{1}{\sqrt{T}}X'u\right) = \left(\frac{1}{T}\sum_t x_t x_t'\right)^{-1}\left(\frac{1}{\sqrt{T}}\sum_t x_t u_t\right)$$
Appropriate assumptions give us a LLN for $\frac{1}{T}\sum_t x_t x_t' \to^p M$. Assume also that Gordin's conditions hold for $z_t = x_t u_t$; then
$$\sqrt{T}(\hat\beta - \beta) \Rightarrow N(0, M^{-1}JM^{-1}).$$
The only thing that is different from the usual case is $J$. Here $J = \sum_j \Gamma_j$ (where $\Gamma_j$ are the autocovariances of $z_t = x_t u_t$) is called the long-run variance. A non-trivial long-run variance usually arises from potentially auto-dependent error terms $u_t$: the errors contain everything that is not in the regression, which is arguably autocorrelated. It may also arise from $x_t$ being autocorrelated and from conditional heteroskedasticity of the error terms. We need to figure out how to estimate $J$. This is called HAC (heteroskedasticity and autocorrelation consistent) standard errors.
Remark 14. A side note on GLS. If one believes in strict exogeneity, then the estimation can be done more efficiently by using GLS. However, GLS is generally invalid if only weak exogeneity holds.
The logic here is the following. In many settings the error terms $u_t$ are arguably autocorrelated, so one may think that estimation is not fully efficient (as the Gauss-Markov theorem assumes that observations are uncorrelated) and could be improved. Assume for a moment that
$$y_t = x_t'\beta + u_t$$
and
$$u_t = \rho u_{t-1} + e_t.$$
Assume also for a moment that $\rho$ is known and the $e_t$ are serially uncorrelated (white noise). You may think of transforming the system of observations and replacing the $t$-th equation with the quasi-differenced one:
$$y_t - \rho y_{t-1} = (x_t - \rho x_{t-1})'\beta + e_t,$$
or $\tilde y_t = \tilde x_t'\beta + e_t$, where $\tilde y_t = y_t - \rho y_{t-1}$ and $\tilde x_t = x_t - \rho x_{t-1}$. The new system seems to be better, since the errors are not autocorrelated and have the same variance (with the exception of the first one). If we have strict exogeneity, then OLS on the new system (with the first equation rescaled to have the same variance) is efficient (BLUE). What we described is efficient GLS in this case. The problem, though, is that
$$E[e_t|x_t, x_{t-1}, \ldots] = E[u_t|x_t, x_{t-1}, \ldots] - \rho E[u_{t-1}|x_t, x_{t-1}, \ldots].$$
However, if $u_t$ satisfies only the weak but not the strict exogeneity assumption, then the new error may not satisfy the exogeneity condition, and OLS in the transformed system will be biased. So, unless you believe in strict exogeneity (which is extremely rare), you should not use GLS.
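A minimal sketch of the quasi-differencing transformation just described (assuming numpy and a known $\rho$; the $\sqrt{1-\rho^2}$ rescaling of the first observation is the variance correction mentioned above):

```python
import numpy as np

def quasi_difference(y, x, rho):
    """Transform (y, x) so the new errors e_t = u_t - rho*u_{t-1} are serially uncorrelated.

    The first observation is scaled by sqrt(1 - rho^2) so its error has the same variance.
    x can be a (T,) or (T, k) array with time along the first axis."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    y_tilde = np.empty_like(y)
    x_tilde = np.empty_like(x)
    scale = np.sqrt(1.0 - rho**2)
    y_tilde[0] = scale * y[0]
    x_tilde[0] = scale * x[0]
    y_tilde[1:] = y[1:] - rho * y[:-1]
    x_tilde[1:] = x[1:] - rho * x[:-1]
    return y_tilde, x_tilde

# OLS on (y_tilde, x_tilde) is the (known-rho) GLS estimator: valid and efficient under
# strict exogeneity, but generally biased when only weak exogeneity holds.
```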
HAC
Assume we have a series $\{z_t\}$ satisfying the assumptions of the CLT, and we want to estimate $J = \sum_k \gamma_k$. There are two main ways: parametric and non-parametric.
Parametric
Assume $z_t$ is AR(p):
$$z_t = a_1 z_{t-1} + \ldots + a_p z_{t-p} + e_t,$$
then $J = \frac{\sigma^2}{a(1)^2}$, where $a(L) = 1 - a_1 L - \ldots - a_p L^p$. We can proceed in the following way: run an OLS regression of $z_t$ on $z_{t-1}, \ldots, z_{t-p}$, get $\hat a_1, \ldots, \hat a_p$ and $\hat\sigma^2$, then use $\hat a(L) = 1 - \hat a_1 L - \ldots - \hat a_p L^p$ to construct $\hat J$:
$$\hat J = \frac{\hat\sigma^2}{\hat a(1)^2}.$$
Two important practical questions:
- What $p$ should we use? Model selection criteria, e.g. BIC (Bayesian information criterion).
- What if $z_t$ is not AR(p)?
The second question is still open. Den Haan and Levin (1997) showed that if $z_t$ is AR(p), then the parametric estimator converges faster than the kernel estimator described below.
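A minimal sketch of this parametric procedure (assuming numpy, a scalar series $z$ with mean zero, and a lag order $p$ supplied by the user, e.g. chosen by BIC):

```python
import numpy as np

def parametric_long_run_variance(z, p):
    """Parametric AR(p) estimate of J = sigma^2 / a(1)^2 for a scalar, mean-zero series z."""
    z = np.asarray(z, dtype=float)
    T = len(z)
    # Regress z_t on (z_{t-1}, ..., z_{t-p}) for t = p, ..., T-1 (no intercept).
    Y = z[p:]
    X = np.column_stack([z[p - j : T - j] for j in range(1, p + 1)])
    a_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ a_hat
    sigma2_hat = resid @ resid / len(Y)
    a_of_1 = 1.0 - a_hat.sum()          # a(1) = 1 - a_1 - ... - a_p
    return sigma2_hat / a_of_1**2
```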
Non-parametric
A naive approach
$J$ is the sum of all autocovariances. We can estimate $T-1$ of these, but not all. What if we just use the ones we can estimate, i.e.
$$\hat J = \sum_{k=-(T-1)}^{T-1}\hat\gamma_k, \qquad \hat\gamma_k = \frac{1}{T}\sum_{j=1}^{T-|k|}z_j z_{j+|k|}?$$
$$\hat J = \sum_{k=-(T-1)}^{T-1}\frac{1}{T}\sum_{j=1}^{T-|k|}z_j z_{j+|k|} = \frac{1}{T}\left(\sum_{t=1}^{T}z_t\right)^2 = \left(\frac{1}{\sqrt{T}}\sum_{t=1}^{T}z_t\right)^2 \Rightarrow N(0, J)^2,$$
so $\hat J$ is not consistent; it converges to a distribution instead of a point. The problem is that we are summing too many imprecisely estimated covariances, so the noise does not die out. For example, to estimate $\gamma_{T-1}$ we use only one observation; how good can that be?
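A small simulation sketch of this inconsistency (assuming numpy; the series below is iid $N(0,1)$, so the true $J$ is 1): the naive estimator coincides with $\frac{1}{T}(\sum_t z_t)^2$ exactly and keeps fluctuating as $T$ grows.

```python
import numpy as np

# Naive estimator: sum of all T-1 sample autocovariances.  It collapses to
# (1/T)*(sum_t z_t)^2, whose limit is the square of a normal, not the constant J.
rng = np.random.default_rng(2)

def naive_J(z):
    T = len(z)
    total = 0.0
    for k in range(-(T - 1), T):
        kk = abs(k)
        total += np.dot(z[: T - kk], z[kk:]) / T      # gamma_hat_k (and its mirror image)
    return total

for T in (200, 2_000, 10_000):
    z = rng.normal(size=T)                            # iid, so the true J = 1
    print(T, naive_J(z), (z.sum() ** 2) / T)          # the two columns coincide; neither settles at 1
```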
Truncated sum of sample covariances
What if we don't use all the covariances? Take only the first $S_T$ of them, possibly with weights:
$$\hat J_2 = \sum_{k=-S_T}^{S_T}\hat\gamma_k, \qquad \text{or more generally} \qquad \hat J_2 = \sum_{j=-S_T}^{S_T}k_T(j)\hat\gamma_j,$$
where $k_T(\cdot)$ is a kernel (weight) function and $S_T$ is a truncation (bandwidth) parameter.
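A minimal sketch of the kernel estimator with Bartlett weights $k_T(j) = 1 - |j|/(S_T+1)$, i.e. the Newey-West choice (assuming numpy; the bandwidth rule $S_T \approx T^{1/3}$ in the usage example is only an illustration, not a recommendation):

```python
import numpy as np

def kernel_long_run_variance(z, S_T):
    """Kernel (Newey-West / Bartlett) estimate of J = sum_k gamma_k for a scalar series z.

    Bartlett weights k_T(j) = 1 - |j|/(S_T + 1) also guarantee a non-negative estimate."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean()                                   # demean before computing autocovariances
    T = len(z)
    J_hat = np.dot(z, z) / T                           # gamma_hat_0
    for j in range(1, S_T + 1):
        gamma_j = np.dot(z[: T - j], z[j:]) / T        # sample autocovariance at lag j
        weight = 1.0 - j / (S_T + 1)
        J_hat += 2.0 * weight * gamma_j
    return J_hat

# Usage: AR(1) series with rho = 0.5, so the true long-run variance is 1/(1-rho)^2 = 4.
rng = np.random.default_rng(3)
rho, T = 0.5, 5_000
e = rng.normal(size=T)
z = np.zeros(T)
for t in range(1, T):
    z[t] = rho * z[t - 1] + e[t]
print(kernel_long_run_variance(z, S_T=int(T ** (1 / 3))))   # should be near 4
```

In an OLS application one would apply this to $z_t = x_t \hat u_t$ (component by component, or with matrix outer products in the vector case) and plug $\hat J$ into the sandwich $M^{-1}\hat J M^{-1}$.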
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.