A Concise Course
Francis X. Diebold
University of Pennsylvania
Edition 2015
Version 2015.03.22
Time Series Econometrics
A Concise Course
Francis X. Diebold
Copyright © 2013 onward, by Francis X. Diebold.
All rights reserved.
To Marc Nerlove,
who taught me time series,
Acknowledgments xx
Preface xxiv
Chapter 1. Introduction 1
Chapter 5. Markovian Structure, Linear Gaussian State Space, and Optimal (Kalman) Filtering 52
Chapter 7. Simulation for Economic Theory, Econometric Theory, Estimation, Inference, and Optimization 96
Chapter 8. Bayesian Time Series Posterior Analysis by Markov Chain Monte Carlo 119
Chapter 11. Non-Linear Non-Gaussian State Space and Optimal Filtering 179
Appendices 185
Acknowledgments xx
Preface xxiv
Chapter 1. Introduction 1
1.1 Economic Time Series and Their Analysis 1
1.2 A Practical Toolkit 1
1.2.1 Software (and a Tiny bit of Hardware) 1
1.2.2 Data 2
1.2.3 Markup 3
1.2.4 Version Control 3
1.3 Exercises, Problems and Complements 3
1.4 Notes 4
Chapter 5. Markovian Structure, Linear Gaussian State Space, and Optimal (Kalman) Filtering 52
5.1 Markovian Structure 52
5.1.1 The Homogeneous Discrete-State Discrete-Time Markov Process 52
5.1.2 Multi-Step Transitions: Chapman-Kolmogorov 52
5.1.3 Lots of Definitions (and a Key Theorem) 53
5.1.4 A Simple Two-State Example 54
5.1.5 Constructing Markov Processes with Useful Steady-State Distributions 55
5.1.6 Variations and Extensions: Regime-Switching and More 56
5.1.7 Continuous-State Markov Processes 57
5.2 State Space Representations 58
5.2.1 The Basic Framework 58
5.2.2 ARMA Models 60
5.2.3 Linear Regression with Time-Varying Parameters and More 65
5.2.4 Dynamic Factor Models and Cointegration 67
5.2.5 Unobserved-Components Models 68
5.3 The Kalman Filter and Smoother 69
5.3.1 Statement(s) of the Kalman Filter 70
5.3.2 Derivation of the Kalman Filter 71
5.3.3 Calculating P0 74
5.3.4 Predicting yt 74
5.3.5 Steady State and the Innovations Representation 75
5.3.6 Kalman Smoothing 77
5.4 Exercises, Problems and Complements 77
5.5 Notes 83
Chapter 7. Simulation for Economic Theory, Econometric Theory, Estimation, Inference, and Optimization 96
7.1 Generating U(0,1) Deviates 96
7.2 The Basics: c.d.f. Inversion, Box-Mueller, Simple Accept-Reject 98
7.2.1 Inverse c.d.f. 98
7.2.2 Box-Muller 99
7.2.3 Simple Accept-Reject 99
7.3 Simulating Exact and Approximate Realizations of Time Series Processes 101
7.4 more 101
7.5 Economic Theory by Simulation: “Calibration” 101
7.6 Econometric Theory by Simulation: Monte Carlo and Variance Reduction 101
7.6.1 Experimental Design 102
7.6.2 Simulation 103
7.6.3 Variance Reduction: Importance Sampling, Antithetics, Control Variates
and Common Random Numbers 104
7.6.4 Response Surfaces 109
7.7 Estimation by Simulation: GMM, SMM and Indirect Inference 110
7.7.1 GMM 110
7.7.2 Simulated Method of Moments (SMM) 110
7.7.3 Indirect Inference 111
7.8 Inference by Simulation: Bootstrap 112
7.8.1 i.i.d. Environments 112
7.8.2 Time-Series Environments 114
7.9 Optimization by Simulation 116
7.9.1 Local 116
7.9.2 Global 116
7.9.3 Is a Local Optimum Global? 117
7.10 Interval and Density Forecasting by Simulation 118
7.11 Exercises, Problems and Complements 118
7.12 Notes 118
Chapter 8. Bayesian Time Series Posterior Analysis by Markov Chain Monte Carlo 119
8.1 Bayesian Basics 119
8.2 Comparative Aspects of Bayesian and Frequentist Paradigms 119
8.3 Markov Chain Monte Carlo 121
8.3.1 Metropolis-Hastings Independence Chain 121
Chapter 11. Non-Linear Non-Gaussian State Space and Optimal Filtering 179
11.1 Varieties of Non-Linear Non-Gaussian Models 179
11.2 Markov Chains to the Rescue (Again): The Particle Filter 179
11.3 Particle Filtering for Estimation: Doucet’s Theorem 179
11.4 Key Application I: Stochastic Volatility (Revisited) 179
11.5 Key Application II: Credit-Risk and the Default Option 179
11.6 Key Application III: Dynamic Stochastic General Equilibrium (DSGE) Macroeconomic Models 179
11.7 A Partial “Solution”: The Extended Kalman Filter 179
Appendices 185
Francis X. Diebold is Paul F. and Warren S. Miller Professor of Economics, and Professor
of Finance and Statistics, at the University of Pennsylvania and its Wharton School, as well
as Faculty Research Associate at the National Bureau of Economic Research in Cambridge,
Mass., and past President of the Society for Financial Econometrics. He has published
widely in econometrics, forecasting, finance and macroeconomics, and he has served on the
editorial boards of numerous scholarly journals. He is an elected Fellow of the Econometric Society,
the American Statistical Association, and the International Institute of Forecasters, and
the recipient of Sloan, Guggenheim, and Humboldt fellowships. Diebold lectures actively,
worldwide, and has received several prizes for outstanding teaching. He has held visiting
appointments in Economics and Finance at Princeton University, Cambridge University,
the University of Chicago, the London School of Economics, Johns Hopkins University, and
New York University. His research and teaching are firmly rooted in applications; he has
served as an economist under Paul Volcker and Alan Greenspan at the Board of Governors
of the Federal Reserve System in Washington DC, an Executive Director at Morgan Stanley
Investment Management, Co-Director of the Wharton Financial Institutions Center, and
Chairman of the Federal Reserve System’s Model Validation Council. All his degrees are
from the University of Pennsylvania; he received his B.S. from the Wharton School in 1981
and his economics Ph.D. in 1986. He is married with three children and lives in suburban
Philadelphia.
About the Cover
The colorful graphic is by Peter Mills and was obtained from Wikimedia Commons. As
noted there, it represents “the basins of attraction of the Gaspard-Rice scattering system
projected onto a double impact parameter” (whatever that means). I used it mainly because
I like it, but also because it’s vaguely reminiscent of a trending time series.
• Hyperlinks to internal items (table of contents, index, footnotes, etc.) appear in red.
• Hyperlinks to bibliographic references appear in green.
All media (images, audio, video, ...) were either produced by me (computer graphics using
R, original audio/video, etc.) or obtained from the public domain repository at Wikimedia
Commons.
Time Series Econometrics (TSE ) provides a modern and concise Ph.D.-level course in
econometric time series. It can be covered realistically in one semester; indeed I have used
the material successfully for many years with first-year Ph.D. students at the University of
Pennsylvania.
The elephant in the room is of course Hamilton’s Time Series Analysis, so let me address
it immediately. TSE complements it in three key ways. First, TSE offers a concise yet
precise overview – from the classic early framework of Wold, Wiener, and Kolmogorov,
straight through to cutting-edge Bayesian MCMC analysis of non-linear non-Gaussian state
space models with the particle filter – and Hamilton’s book can be used for more extensive
background reading for those topics that overlap.
Second and crucially, however, many of the topics do not overlap, as TSE treats a variety
of more recently-emphasized topics. It stresses Markovian structure throughout, from linear
state space, to MCMC, to optimization, to non-linear state space and particle filtering. Bayes
features prominently, as do simulation, continuous time, realized volatility, nonparametrics,
global optimization, and more.
Finally, TSE is in touch with modern computing environments. It uses R throughout,
which in this author’s opinion is the clear environment of choice for the foreseeable future.
Related, TSE is generally e-aware, with numerous hyperlinks to internal items, bibliographic
references, the internet (web pages, video, etc.), databases, etc.
Francis X. Diebold
Philadelphia
Introduction
Any series of observations ordered along a single dimension, such as time, may be thought
of as a time series. The emphasis in time series analysis is the study of dependence among
the observations at different points in time.1
Many economic and financial variables, such as prices, sales, stocks, GDP and its com-
ponents, stock returns, interest rates and foreign exchange rates, are observed over time;
in addition to being interested in the interrelationships among such variables, we are also
concerned with relationships among the current and past values of one or more of them,
that is, relationships over time.
At its broadest level, time series analysis provides the language of stochastic dynamics. Hence it's the language of even pure dynamic economic theory, quite apart from empirical analysis. It is, however, a great workhorse of empirical analysis, in "pre-theory" mode (non-structurally "getting the facts straight" before theorizing, always a good idea), in "post-theory" mode (structural estimation and inference), and in forecasting (whether non-structural or structural).
Empirically, the analysis of economic time series is central to a wide range of applica-
tions, including business cycle measurement, financial risk management, policy analysis,
and forecasting. Special features of interest in economic time series include trends and non-
stationarity, seasonality, cycles and persistence, predictability (or lack thereof), structural
change, and nonlinearities such as volatility fluctuations and regime switching.
1 Indeed what distinguishes time series analysis from general multivariate analysis is precisely the temporal dependence among the observations.
EViews, however, can sometimes be something of a "black box." Hence you'll also want
to have available slightly lower-level (“mid-level”) environments in which you can quickly
program, evaluate and apply new tools and techniques. R is one very powerful and popular
such environment, with special strengths in modern statistical methods and graphical data
analysis.2 R is available for free as part of a massive and highly-successful open-source
project. RStudio provides a fine R working environment, and, like R, it’s free. A good R
tutorial, first given on Coursera and then moved to YouTube, is here. R-bloggers is a massive
blog with all sorts of information about all things R.
If you need real speed, such as for large simulations, you will likely need a low-level
environment like Fortran or C++. And in the limit (and on the hardware side), if you
need blazing-fast parallel computing for massive simulations etc., graphics cards (graphical
processing units, or GPU’s) provide stunning gains, as documented for example in Aldrich
et al. (2011). Actually the real limit is quantum computing, but we’re not there yet.
For a compendium of econometric and statistical software, see the software links site,
maintained by Marius Ooms at the Econometrics Journal.
1.2.2 Data
Here we mention just a few key “must-know” sites. Resources for Economists, maintained
by the American Economic Association, is a fine portal to almost anything of interest to
economists. It contains hundreds of links to data sources, journals, professional organiza-
tions, and so on. FRED (Federal Reserve Economic Data) is a tremendously convenient
source for economic data. The National Bureau of Economic Research site has data on U.S. business cycles, and the Real-Time Data Research Center at the Federal Reserve Bank of Philadelphia provides real-time vintage data.
1.2.3 Markup
Markup languages effectively provide typesetting or “word processing.” HTML is the most
well-known example. Research papers and books are typically written in LaTeX. MiKTeX is a good and popular flavor of LaTeX, and TeXworks is a good editor designed for LaTeX.
knitr is an R package, but it's worth mentioning separately, as it powerfully integrates R and LaTeX. (You can access everything in RStudio.)
Another markup language worth mentioning is Sphinx, which runs under Python. The
Stachurski-Sargent e-book Quantitative Economics, which features Python prominently,
is written in Sphinx.
Git and GitHub are useful for open/collaborative development and version control. For
my sorts of small-group projects I find that Dropbox or equivalent keeps me adequately
synchronized, but for serious large-scale development, use of git or equivalent appears crucial.
Consider the following point/counterpoint items. In each case, which do you think
would be more useful for analysis of economic time series? Why?
• Continuous / discrete
• linear / nonlinear
• deterministic / stochastic
• univariate / multivariate
• time domain / frequency domain
• conditional mean / conditional variance
• trend / seasonal / cycle / noise
• ordered in time / ordered in space
• stock / flow
• stationary / nonstationary
• aggregate / disaggregate
• Gaussian / non-Gaussian
1.4 NOTES
• The study of time series of, for example, astronomical observations predates recorded
history. Early writers on economic subjects occasionally made explicit reference to
astronomy as the source of their ideas. For example, Cournot stressed that, as in as-
tronomy, it is necessary to recognize secular variation that is independent of periodic
variation. Similarly, Jevons made clear his approach to the study of short-term fluc-
tuations used the methods of astronomy and meteorology. During the 19th century
interest in, and analysis of, social and economic time series evolved into a new field of
study independent of developments in astronomy and meteorology. Time-series anal-
ysis then flourished. Nerlove et al. (1979) provides a brief history of the field’s early
development.
• For references old and new, see the “library” of useful books in Appendix A.
Chapter Two
Strict Stationarity

Weak Stationarity

\[ E y_t = \mu, \;\; \forall t \]

Autocovariance Function

(a) symmetric: $\gamma(\tau) = \gamma(-\tau), \;\forall \tau$

Autocovariance generating function:
\[ g(z) = \sum_{\tau=-\infty}^{\infty} \gamma(\tau)\, z^{\tau} \]

Autocorrelation Function
\[ \rho(\tau) = \frac{\gamma(\tau)}{\gamma(0)} \]

White Noise
Independent (strong) white noise: $\eta_t \overset{iid}{\sim} (0, \sigma^2)$
Gaussian white noise: $\eta_t \overset{iid}{\sim} N(0, \sigma^2)$
In all cases $E(\eta_t) = 0$, $\mathrm{var}(\eta_t) = \sigma^2$, and
\[ \gamma(\tau) = \begin{cases}\sigma^2, & \tau = 0\\ 0, & \tau \geq 1\end{cases} \qquad \rho(\tau) = \begin{cases}1, & \tau = 0\\ 0, & \tau \geq 1\end{cases} \]

Conditional Moment Structure of Strong White Noise
\[ E(\eta_t \,|\, \Omega_{t-1}) = 0 \]
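For concreteness, a minimal R sketch (not from the text; sample size and seed are illustrative) simulating Gaussian white noise and inspecting its sample autocorrelations, which should lie near zero at all displacements $\tau \geq 1$:

# Gaussian white noise and its sample ACF (illustrative)
set.seed(1)
eta <- rnorm(500, mean = 0, sd = 1)    # eta_t ~ iid N(0, sigma^2), sigma = 1
acf(eta, lag.max = 20, main = "Sample ACF of Gaussian white noise")
# population rho(tau) = 0 for tau >= 1; sample values fall mostly within +/- 1.96/sqrt(T)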
The Wold Representation (general linear process):
\[ y_t = B(L)\varepsilon_t = \sum_{i=0}^{\infty} b_i \varepsilon_{t-i}, \qquad \varepsilon_t \sim WN(0, \sigma^2), \]
where $b_0 = 1$ and $\sum_{i=0}^{\infty} b_i^2 < \infty$.

Unconditional Moment Structure
\[ E(y_t) = E\Big(\sum_{i=0}^{\infty} b_i \varepsilon_{t-i}\Big) = \sum_{i=0}^{\infty} b_i\, E\varepsilon_{t-i} = \sum_{i=0}^{\infty} b_i \cdot 0 = 0 \]
\[ \mathrm{var}(y_t) = \mathrm{var}\Big(\sum_{i=0}^{\infty} b_i \varepsilon_{t-i}\Big) = \sum_{i=0}^{\infty} b_i^2\, \mathrm{var}(\varepsilon_{t-i}) = \sigma^2 \sum_{i=0}^{\infty} b_i^2 \]

Conditional Moment Structure (strong white noise innovations)
\[ E(y_t\,|\,\Omega_{t-1}) = 0 + b_1\varepsilon_{t-1} + b_2\varepsilon_{t-2} + \ldots = \sum_{i=1}^{\infty} b_i \varepsilon_{t-i} \]
Autocovariance Structure
\[ \gamma(\tau) = E\left[\Big(\sum_{i=-\infty}^{\infty} b_i \varepsilon_{t-i}\Big)\Big(\sum_{h=-\infty}^{\infty} b_h \varepsilon_{t-\tau-h}\Big)\right] = \sigma^2 \sum_{i=-\infty}^{\infty} b_i b_{i-\tau} \]
(where $b_i \equiv 0$ if $i < 0$)

Approximating the Wold representation: ARMA models (obvious truncation). Unconditional moment structure, conditional moment structure, autocovariance functions, stationarity and invertibility conditions.
2.5.1 Extraction
2.5.2 Prediction
Write the process as
\[ y_t = \varepsilon_t + b_1 \varepsilon_{t-1} + \ldots \]
Prediction Error
\[ e_{T+h,T} = y_{T+h} - y_{T+h,T} = \sum_{i=0}^{h-1} b_i \varepsilon_{T+h-i} \]
\[ E(e_{T+h,T}) = 0, \qquad \mathrm{var}(e_{T+h,T}) = \sigma^2 \sum_{i=0}^{h-1} b_i^2 \]
History: $\{y_t\}_{t=1}^{T}$.
For the AR(1) case, the chain rule of forecasting gives, immediately,
\[ y_{T+1,T} = \phi y_T, \quad y_{T+2,T} = \phi y_{T+1,T} = \phi^2 y_T, \quad \ldots, \quad y_{T+h,T} = \phi y_{T+h-1,T} = \phi^h y_T. \]
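A small R sketch of the AR(1) chain rule above (the simulated series, sample size, and $\phi$ value are illustrative assumptions, not from the text):

# AR(1) chain-rule forecasts: y_{T+h,T} = phi^h * y_T (illustrative)
set.seed(1)
phi <- 0.9
y <- arima.sim(model = list(ar = phi), n = 200)      # simulate an AR(1)
T <- length(y); h <- 1:12
fcst_chain <- phi^h * y[T]                           # chain-rule forecasts with known phi
fit <- arima(y, order = c(1, 0, 0), include.mean = FALSE)
fcst_ml <- predict(fit, n.ahead = 12)$pred           # same idea with estimated phi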
2.6 MULTIVARIATE
Covariance stationarity:
\[ E(y_{1t}) = \mu_1 \;\forall t, \qquad E(y_{2t}) = \mu_2 \;\forall t \]
Cross-covariance function:
\[ \Gamma_{y_1 y_2}(t, \tau) = E\begin{pmatrix} y_{1t}-\mu_1 \\ y_{2t}-\mu_2\end{pmatrix}\big(y_{1,t-\tau}-\mu_1,\; y_{2,t-\tau}-\mu_2\big) = \begin{pmatrix}\gamma_{11}(\tau) & \gamma_{12}(\tau)\\ \gamma_{21}(\tau) & \gamma_{22}(\tau)\end{pmatrix}, \qquad \tau = 0, 1, 2, \ldots \]
Covariance-generating function:
\[ G_{y_1 y_2}(z) = \sum_{\tau=-\infty}^{\infty}\Gamma_{y_1 y_2}(\tau)\, z^{\tau} \]
Cross Correlations
\[ R_{y_1 y_2}(\tau) = D^{-1}\,\Gamma_{y_1 y_2}(\tau)\, D^{-1}, \qquad \tau = 0, 1, 2, \ldots, \qquad D = \begin{pmatrix}\sigma_1 & 0\\ 0 & \sigma_2\end{pmatrix} \]
Multivariate Wold representation:
\[ \begin{pmatrix}y_{1t}\\ y_{2t}\end{pmatrix} = \begin{pmatrix}B_{11}(L) & B_{12}(L)\\ B_{21}(L) & B_{22}(L)\end{pmatrix}\begin{pmatrix}\varepsilon_{1t}\\ \varepsilon_{2t}\end{pmatrix} \]
or
\[ y_t = B(L)\varepsilon_t = (I + B_1 L + B_2 L^2 + \ldots)\varepsilon_t, \qquad E(\varepsilon_t \varepsilon_s') = \begin{cases}\Sigma & \text{if } t=s\\ 0 & \text{otherwise,}\end{cases} \qquad \sum_{i=0}^{\infty} \|B_i\|^2 < \infty. \]
Autocovariance Structure
\[ \Gamma_{y_1 y_2}(\tau) = \sum_{i=-\infty}^{\infty} B_i\, \Sigma\, B_{i-\tau}' \]
(where $B_i \equiv 0$ if $i < 0$)
\[ G_y(z) = B(z)\, \Sigma\, B'(z^{-1}) \]
The VAR in standard form:
\[ \underset{(N\times N)}{\Phi(L)}\;\underset{(N\times 1)}{y_t} = \underset{(N\times 1)}{\varepsilon_t}, \qquad \varepsilon_t \sim (0, \Sigma), \]
with Wold (moving-average) representation
\[ y_t = (I + \Theta_1 L + \Theta_2 L^2 + \ldots)\,\varepsilon_t, \qquad \varepsilon_t \sim (0, \Sigma). \]
Writing the autoregressive lag-operator polynomial explicitly, and orthogonalizing the shocks via $\varepsilon_t = P v_t$ (e.g., $P$ a Cholesky factor of $\Sigma$),
\[ (I - \Phi_1 L - \ldots - \Phi_p L^p)\, y_t = \varepsilon_t \]
\[ (I - \Phi_1 L - \ldots - \Phi_p L^p)\, y_t = P v_t \]
\[ y_t = (I + \Theta_1 L + \Theta_2 L^2 + \ldots)\, P v_t = (P + \Theta_1 P L + \Theta_2 P L^2 + \ldots)\, v_t. \]
Sample autocovariances:
\[ \hat{\gamma}(\tau) = \frac{1}{T}\sum_{t=1}^{T-|\tau|} x_t x_{t+|\tau|}, \qquad \tau = 0, \pm 1, \ldots, \pm(T-1) \]
\[ \gamma^*(\tau) = \frac{1}{T-|\tau|}\sum_{t=1}^{T-|\tau|} x_t x_{t+|\tau|}, \qquad \tau = 0, \pm 1, \ldots, \pm(T-1) \]
Asymptotic distribution of the sample autocorrelations:
\[ \sqrt{T}(\hat{\rho} - \rho) \overset{d}{\rightarrow} N(0, \Sigma) \]
Model selection. Bad:
\[ MSE = \frac{\sum_{t=1}^{T} e_t^2}{T}, \qquad R^2 = 1 - \frac{\sum_{t=1}^{T} e_t^2}{\sum_{t=1}^{T}(y_t - \bar{y})^2} = 1 - \frac{MSE}{\frac{1}{T}\sum_{t=1}^{T}(y_t - \bar{y})^2} \]
Still bad:
\[ s^2 = \frac{\sum_{t=1}^{T} e_t^2}{T-k} = \left(\frac{T}{T-k}\right)\frac{\sum_{t=1}^{T} e_t^2}{T} \]
\[ \bar{R}^2 = 1 - \frac{\sum_{t=1}^{T} e_t^2 /(T-k)}{\sum_{t=1}^{T}(y_t - \bar{y})^2/(T-1)} = 1 - \frac{s^2}{\sum_{t=1}^{T}(y_t - \bar{y})^2/(T-1)} \]
Good:
\[ SIC = T^{\left(\frac{k}{T}\right)}\left(\frac{\sum_{t=1}^{T} e_t^2}{T}\right) \]
More generally,
\[ SIC = \frac{-2\ln L}{T} + \frac{K \ln T}{T} \]
2.7.2.2 Diagnostics
Box-Pierce:
\[ Q_{BP} = T\sum_{\tau=1}^{m}\hat{\rho}^2(\tau) \sim \chi^2(m) \]
Ljung-Box:
\[ Q_{LB} = T(T+2)\sum_{\tau=1}^{m}\left(\frac{1}{T-\tau}\right)\hat{\rho}^2(\tau) \]
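In R, both statistics above are available via Box.test(); the residual series, lag choice, and degrees-of-freedom adjustment below are illustrative assumptions:

# Box-Pierce and Ljung-Box diagnostics on fitted AR(1) residuals (illustrative)
set.seed(1)
y <- arima.sim(model = list(ar = 0.8), n = 300)
e <- residuals(arima(y, order = c(1, 0, 0)))
Box.test(e, lag = 20, type = "Box-Pierce")               # Q_BP
Box.test(e, lag = 20, type = "Ljung-Box", fitdf = 1)     # Q_LB, df adjusted for one estimated AR coefficient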
1. Ergodicity.
We shall say (loosely speaking) that a time series is ergodic if consistent inference
regarding its stochastic structure can be made on the basis of one realization. While
ergodicity is a deep mathematical property of the distribution function characteriz-
ing the time series in question, its meaning for a stationary time series is essentially
independence of observations far enough apart in time.
Ergodicity refers to consistent moment estimability based only on a single realization, as opposed to stationarity, which is concerned with the time-constancy of the probability structure of a stochastic process. It is therefore nonsensical to pose questions regarding the ergodicity of nonstationary processes. We stress that ergodicity cannot be "checked," even with a (doubly) infinite sample path. The intuition is simple: regardless of whether or not a time series is ergodic, sample moments converge to a random variable. If the series is ergodic, that random variable is in fact a (degenerate) constant. It is immediately clear, then, that even with an infinitely large sample one cannot tell whether sample moments converge to a constant (fixed in repeated realizations) or just to one particular realization of a random variable (which will change from realization to realization). To check ergodicity, one must have available an entire ensemble, which is never the case in practice.
Due to the impossibility of empirically checking ergodicity in observed time series,
attention has focused on the study of specific parameterizations for which ergodic-
ity can be theoretically established. For example, the important LRCSSP, discussed
below, is always ergodic. More generally, we seek sufficient conditions under which
laws of large numbers (LLN) can be shown to hold. For a time series of independent,
identically distributed random variables, Kolmogorov’s LLN holds. For dependent,
identically (unconditionally) distributed time series, sufficient conditions for the LLN
are well known. Much recent research examines conditions sufficient for the LLN in
more general situations, such as dependent time series with heterogeneous innovations.
The resulting theories of mixing, martingale difference, and near−epoch dependent se-
quences are discussed in White (1984), Gallant and White (198*), and White (199*),
among many others.
3. Predicting AR processes.
Show the following.
(a) If $y_t$ is a covariance stationary AR(1) process, i.e., $y_t = \alpha y_{t-1} + \varepsilon_t$ with $|\alpha| < 1$, then $y_{t+h,t} = \alpha^h y_t$.
(b) If $y_t$ is AR(2),
where $\hat{y}_{t-j} = y_{t-j}$, for $j = 0, 1, \ldots$, at time $t$. Thus for pure autoregressions, the MMSE prediction is a linear combination of only the $p$ most recently observed values.
4. Predicting MA processes.
If $y_t$ is MA(1),
\[ y_t = \varepsilon_t - \beta\varepsilon_{t-1}, \]
and if $y_t$ is ARMA(1,1),
\[ y_t - \alpha y_{t-1} = \varepsilon_t - \beta\varepsilon_{t-1}. \]
6. Prediction-error dynamics.
Consider the general linear process with strong white noise innovations. Show that both the conditional (with respect to the information set $\Omega_t = \{\varepsilon_t, \varepsilon_{t-1}, \ldots\}$) and unconditional moments of the Wiener-Kolmogorov $h$-step-ahead prediction error are identical.
(a) Obtain the usual quarterly expenditure-side U.S. GDPE from FRB St. Louis,
1960.1-present.
(b) Leaving out the 12 most recent quarters of data, perform a full correlogram
analysis for GDPE logarithmic growth.
(c) Again leaving out the 12 most recent quarters of data, specify, estimate and de-
fend appropriate AR(p) and ARM A(p, q) models for GDPE logarithmic growth.
(d) Using your preferred AR(p) and ARM A(p, q) models for GDPE logarithmic
growth, generate a 12-quarter-ahead linear least-squares path forecast for the
“hold-out” sample. How do your AR(p) and ARM A(p, q) forecasts compare to
the realized values? Which appears more accurate?
(e) Obtain ADNSS GDPplus logarithmic growth from FRB Philadelphia, read about
it, and repeat everything above.
(f) Contrast the results for GDPE logarithmic growth and GDPplus logarithmic
growth.
(a) Obtain monthly U.S. housing starts and completions data from FRED at FRB
St. Louis, seasonally-adjusted, 1960.1-present. Your two series should be of equal
length.
(b) Using only observations {1, ..., T −4}, perform a full correlogram analysis of starts
and completions. Discuss in detail.
(c) Using only observations {1, ..., T − 4}, specify and estimate appropriate univari-
ate ARM A(p, q) models for starts and completions, as well as an appropriate
V AR(p). Discuss in detail.
(d) Characterize the Granger-causal structure of your estimated V AR(p). Discuss in
detail.
(e) Characterize the impulse-response structure of your estimated V AR(p) using all
possible Cholesky orderings. Discuss in detail.
(f) Using your preferred ARM A(p, q) models and V AR(p) model, specified and es-
timated using only observations {1, ..., T − 4}, generate linear least-squares path
forecasts for the four quarters of “hold out data,” {T − 3, T − 2, T − 1, T }. How
do your forecasts compare to the realized values? Discuss in detail.
under the usual assumptions. Suppose further that $B_{11}(L) = B_{21}(L) = 0$ and $\varepsilon_{1t} = \varepsilon_{2t} = \varepsilon_t$ (with variance $\sigma^2$). Discuss the nature of this system. Why might it be useful in economics?
2.9 NOTES
By 1930 Slutzky and Yule had shown that rich dynamics could be obtained by tak-
ing weighted averages of random shocks. Wold’s celebrated 1937 decomposition established
the converse, decomposing covariance stationary series into weighted averages of random
shocks, and paved the way for subsequent path-breaking work by Wiener, Kolmogorov,
Kalman and others. The beautiful 1963 treatment by Wold’s student Whittle (1963), up-
dated and reprinted as Whittle (1983) with a masterful introduction by Tom Sargent, re-
mains widely-read. Much of macroeconomics is built on the Slutzky-Yule-Wold-Wiener-
Kolmogorov foundation. For a fascinating overview of parts of the history in its relation to macroeconomics, see Davies and Mahon (2009), at http://www.minneapolisfed.org/publications_papers/pub_display.cfm?id=4348.
Chapter Three

Density estimation. Suppose $f$ is smooth in $[x_0 - h, x_0 + h]$. Then
\[ f(x_0) \approx \frac{1}{2h}\int_{x_0-h}^{x_0+h} f(u)\,du = \frac{1}{2h}\, P(x \in [x_0-h, x_0+h]), \]
suggesting the estimator
\[ \hat{f}_h(x_0) = \frac{1}{2h}\,\frac{\#\{x_i \in [x_0-h, x_0+h]\}}{N} = \frac{1}{Nh}\sum_{i=1}^{N}\frac{1}{2}\, I\!\left(\left|\frac{x_0 - x_i}{h}\right| \leq 1\right), \]
the "Rosenblatt estimator," with bandwidth $h$.
Standard conditions on the kernel:
\[ \int K(u)\,du = 1, \qquad K(u) = K(-u) \]
Common Kernel Choices
Standard normal: $K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}$
\[ \hat{f}_h(x_0) = \frac{1}{Nh}\sum_{i=1}^{N} K\!\left(\frac{x_0 - x_i}{h}\right) \]
"Rosenblatt-Parzen estimator"
(So $h \to 0 \implies$ bias $\to 0$; $Nh \to \infty \implies$ var $\to 0$.) Thus,
\[ \left.\begin{array}{l} h \to 0 \\ Nh \to \infty \end{array}\right\} \implies \hat{f}_h(x_0) \overset{p}{\rightarrow} f(x_0), \qquad \sqrt{Nh}\left(\hat{f}_h(x_0) - f(x_0)\right) \overset{d}{\rightarrow} D. \]
Optimal bandwidth:
\[ h^* = \gamma^* N^{-1/5} \]
Corresponding Optimal Convergence Rate
Recall $\sqrt{Nh}\,(\hat{f}_h(x_0) - f(x_0)) \overset{d}{\rightarrow} D$; with $h^* \propto N^{-1/5}$,
\[ \sqrt{N^{4/5}}\left(\hat{f}_h(x_0) - f(x_0)\right) \overset{d}{\rightarrow} D. \]
So use (Silverman's rule of thumb):
\[ h^* = 1.06\,\sigma N^{-1/5}. \]
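A small R sketch of the Rosenblatt-Parzen estimator with a Gaussian kernel and the $1.06\,\sigma N^{-1/5}$ rule (the simulated data and evaluation point are illustrative, not the text's):

# Gaussian-kernel density estimate at a point x0 (illustrative)
set.seed(1)
x <- rnorm(1000)
N <- length(x)
h <- 1.06 * sd(x) * N^(-1/5)                          # rule-of-thumb bandwidth
fhat <- function(x0) mean(dnorm((x0 - x) / h)) / h    # (1/Nh) sum_i K((x0 - x_i)/h)
fhat(0)                                               # compare with dnorm(0) = 0.3989
# built-in equivalent: density(x, bw = h, kernel = "gaussian")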
3.2 MULTIVARIATE
In kernel-weight form the univariate estimator is
\[ \hat{f}_h(x_0) = \frac{1}{N}\sum_{i=1}^{N} K_h(x_0 - x_i), \qquad K_h(\cdot) = \frac{1}{h}K\!\left(\frac{\cdot}{h}\right). \]
The general multivariate estimator with bandwidth matrix $H$ is
\[ \hat{f}_H(x_0) = \frac{1}{N}\sum_{i=1}^{N} K_H(x_0 - x_i), \]
and with a product kernel and a common bandwidth $h$ in each of the $d$ dimensions,
\[ \hat{f}_h(x_0) = \frac{1}{Nh^d}\sum_{i=1}^{N} K\!\left(\frac{x_0 - x_i}{h}\right). \]
Consistency:
\[ \left.\begin{array}{l} h \to 0 \\ Nh^d \to \infty \end{array}\right\} \implies \hat{f}_h(x_0) \overset{p}{\rightarrow} f(x_0), \qquad \sqrt{Nh^d}\left(\hat{f}_h(x_0) - f(x_0)\right) \overset{d}{\rightarrow} D. \]
Optimal bandwidth and rate:
\[ h^* \propto N^{-\frac{1}{d+4}}, \qquad \sqrt{N^{1-\frac{d}{d+4}}}\left(\hat{f}_h(x_0) - f(x_0)\right) \overset{d}{\rightarrow} D \]
(the "curse of dimensionality").
Silverman's Rule
\[ \hat{h}^* = \left(\frac{4}{d+2}\right)^{\frac{1}{d+4}}\hat{\sigma}\, N^{-\frac{1}{d+4}}, \qquad \hat{\sigma}^2 = \frac{1}{d}\sum_{i=1}^{d}\hat{\sigma}_i^2. \]
Local derivative:
\[ \beta(x) = \frac{\partial M(x)}{\partial x_j} = \lim_{h\to 0}\frac{M(x + \frac{h}{2}) - M(x - \frac{h}{2})}{h} \]
Disturbance density: $f(u)$, $u = y - M(x)$.
Conditional Variance
\[ \mathrm{var}(y|x) = V(x) = \int y^2\,\frac{f(y,x)}{f(x)}\, dy - M(x)^2 \]
Hazard Function
\[ \lambda(t) = \frac{f(t)}{1 - F(t)} \]
Using multivariate kernel density estimates and manipulating gives the "Nadaraya-Watson" estimator:
\[ \hat{M}_h(x_0) = \sum_{i=1}^{N}\left[\frac{K\!\left(\frac{x_0 - x_i}{h}\right)}{\sum_{i=1}^{N} K\!\left(\frac{x_0 - x_i}{h}\right)}\right] y_i \]
As $h \to 0$ and $Nh^d \to \infty$,
\[ \sqrt{Nh^d}\left(\hat{M}_h(x_0) - M(x_0)\right) \overset{d}{\rightarrow} N(0, V). \]
Nearest-neighbor regression (locally constant, uniform weighting):
\[ \hat{M}_k(x_0) = \frac{1}{k}\sum_{i \in n(x_0)} y_i. \]
As $k \to \infty$ and $\frac{k}{N} \to 0$,
\[ \hat{M}_k(x_0) \overset{p}{\rightarrow} M(x_0), \qquad \sqrt{k}\left(\hat{M}_k(x_0) - M(x_0)\right) \overset{d}{\rightarrow} D. \]
This is equivalent to Nadaraya-Watson kernel regression with the uniform kernel $K(u) = \frac{1}{2}I(|u| \leq 1)$ and $h = R(k)$, the distance from $x_0$ to the $k$th nearest neighbor, so it amounts to a variable bandwidth!
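A minimal R sketch of the Nadaraya-Watson estimator at a point, under an assumed data-generating process (all values illustrative); ksmooth() is a built-in alternative, though it scales its bandwidth argument differently:

# Nadaraya-Watson regression estimate at x0 (illustrative DGP)
set.seed(1)
x <- runif(500, -2, 2)
y <- sin(2 * x) + rnorm(500, sd = 0.3)
h <- 0.2
nw <- function(x0) {
  w <- dnorm((x0 - x) / h)        # kernel weights
  sum(w * y) / sum(w)             # kernel-weighted average of y
}
nw(1)                             # compare with the true value sin(2) = 0.909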
Locally weighted regression. Consider
\[ y_t = g(x_t) + \varepsilon_t. \]
Computation of $\hat{g}(x^*)$: choose $0 < \xi \leq 1$, set $k_T = \mathrm{int}(\xi\cdot T)$, and weight observation $t$ by
\[ v_t(x_t, x^*, x^*_{k_T}) = C\!\left(\frac{\lambda(x_t, x^*)}{\lambda(x^*, x^*_{k_T})}\right), \qquad C(u) = \begin{cases}(1-u^3)^3 & \text{for } u < 1\\ 0 & \text{otherwise.}\end{cases} \]
Series (sieve) estimation:
\[ M(x_0) = \sum_{j=0}^{\infty}\beta_j\phi_j(x_0) \]
(the $\phi_j$ are orthogonal basis functions),
\[ \hat{M}_J(x_0) = \sum_{j=0}^{J}\hat{\beta}_j\phi_j(x_0), \]
and as $J \to \infty$, $\frac{J}{N} \to 0$, $\hat{M}_J(x_0) \overset{p}{\rightarrow} M(x_0)$.
Neural networks:
\[ h_{jt} = \Psi\!\left(\gamma_{j0} + \sum_{i=1}^{R}\gamma_{ij} x_{it}\right), \qquad j = 1, \ldots, S \quad (\text{neuron } j), \]
where, e.g., $\Psi(\cdot)$ can be logistic (regression) or 0-1 (classification).
Universal approximator: as $S \to \infty$, $\frac{S}{N} \to 0$, $\hat{O}(x_0) \overset{p}{\rightarrow} O(x_0)$.
Updating the numerator and denominator separately as the sample grows from $N-1$ to $N$, we get the recursive Nadaraya-Watson estimator:
\[ \hat{M}_N(x_0) = \frac{(N-1)h^d\,\hat{f}_{N-1}(x_0)\,\hat{M}_{N-1}(x_0) + y_N\, K\!\left(\frac{x_0 - x_N}{h}\right)}{(N-1)h^d\,\hat{f}_{N-1}(x_0) + K\!\left(\frac{x_0 - x_N}{h}\right)}. \]
Back substitution:
\[ O_t = g(x_t, x_{t-1}, \ldots, x_1; \theta) \]
3.8 NOTES
Chapter Four

Spectral Analysis

Recall the general linear process,
\[ y_t = B(L)\varepsilon_t = \sum_{i=0}^{\infty} b_i \varepsilon_{t-i}, \]
whose autocovariance generating function is
\[ g(z) = \sigma^2 B(z) B(z^{-1}). \]
Spectrum
Evaluate $g(z)$ on the unit circle, $z = e^{-i\omega}$:
\[ g(e^{-i\omega}) = \sum_{\tau=-\infty}^{\infty}\gamma(\tau)\, e^{-i\omega\tau}, \quad -\pi < \omega < \pi \]
\[ = \sigma^2 B(e^{i\omega}) B(e^{-i\omega}) = \sigma^2\,|B(e^{i\omega})|^2. \]
Spectrum
Trigonometric form:
\[ g(\omega) = \sum_{\tau=-\infty}^{\infty}\gamma(\tau) e^{-i\omega\tau} = \gamma(0) + \sum_{\tau=1}^{\infty}\gamma(\tau)\left(e^{i\omega\tau} + e^{-i\omega\tau}\right) = \gamma(0) + 2\sum_{\tau=1}^{\infty}\gamma(\tau)\cos(\omega\tau) \]
Spectral density function:
\[ f(\omega) = \frac{1}{2\pi}\, g(\omega) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty}\gamma(\tau) e^{-i\omega\tau}, \quad -\pi < \omega < \pi \]
\[ = \frac{1}{2\pi}\gamma(0) + \frac{1}{\pi}\sum_{\tau=1}^{\infty}\gamma(\tau)\cos(\omega\tau) \]
\[ = \frac{\sigma^2}{2\pi}\, B(e^{i\omega}) B(e^{-i\omega}) = \frac{\sigma^2}{2\pi}\,|B(e^{i\omega})|^2 \]
Properties of the spectral density:
1. symmetric around $\omega = 0$
2. real-valued
3. $2\pi$-periodic
4. nonnegative
Inversion: from
\[ g(\omega) = \sum_{\tau=-\infty}^{\infty}\gamma(\tau) e^{-i\omega\tau} \]
we recover the autocovariances as
\[ \gamma(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} g(\omega) e^{i\omega\tau}\, d\omega = \int_{-\pi}^{\pi} f(\omega) e^{i\omega\tau}\, d\omega. \]
Hence
\[ \gamma(0) = \int_{-\pi}^{\pi} f(\omega)\, d\omega. \]
The variance of the sample mean satisfies
\[ \sqrt{T}(\bar{x} - \mu) \sim \left(0,\ \sum_{\tau=-(T-1)}^{T-1}\left(1 - \frac{|\tau|}{T}\right)\gamma(\tau)\right), \]
and asymptotically
\[ \sqrt{T}(\bar{x} - \mu) \overset{d}{\rightarrow} N\big(0,\ g_x(0)\big). \]
White noise spectral density:
\[ y_t = \varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2) \]
\[ f(\omega) = \frac{\sigma^2}{2\pi}\, B(e^{i\omega}) B(e^{-i\omega}) = \frac{\sigma^2}{2\pi} \]
AR(1) Spectral Density
\[ y_t = \phi y_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2) \]
\[ f(\omega) = \frac{\sigma^2}{2\pi}\, B(e^{i\omega}) B(e^{-i\omega}) = \frac{\sigma^2}{2\pi}\,\frac{1}{(1 - \phi e^{i\omega})(1 - \phi e^{-i\omega})} = \frac{\sigma^2}{2\pi}\,\frac{1}{1 - 2\phi\cos(\omega) + \phi^2} \]
How does shape depend on $\phi$? Where are the peaks?
ARMA(1,1) Spectral Density
\[ (1 - \phi L) y_t = (1 - \theta L)\varepsilon_t \]
\[ f(\omega) = \frac{\sigma^2}{2\pi}\,\frac{1 - 2\theta\cos(\omega) + \theta^2}{1 - 2\phi\cos(\omega) + \phi^2} \]
"Rational spectral density"
Internal peaks? What will it take?
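To see how the AR(1) spectral shape depends on $\phi$, a quick R sketch (the $\phi$ values plotted are illustrative):

# AR(1) spectral density f(w) = sigma^2 / (2*pi*(1 - 2*phi*cos(w) + phi^2))
f_ar1 <- function(w, phi, sigma2 = 1) sigma2 / (2 * pi * (1 - 2 * phi * cos(w) + phi^2))
w <- seq(0, pi, length.out = 500)
plot(w, f_ar1(w, phi = 0.9), type = "l", xlab = "frequency", ylab = "f(w)")
lines(w, f_ar1(w, phi = -0.9), lty = 2)
# phi > 0 concentrates spectral mass at low frequencies; phi < 0 at high frequencies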
4.4 MULTIVARIATE
Cross-spectral density:
\[ f_{yx}(\omega) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty}\Gamma_{yx}(\tau)\, e^{-i\omega\tau}, \quad -\pi < \omega < \pi \]
(complex-valued).
Co-Spectrum and Quadrature Spectrum
\[ C_{yx}(\omega) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty}\Gamma_{yx}(\tau)\cos(\omega\tau), \qquad Q_{yx}(\omega) = \frac{-1}{2\pi}\sum_{\tau=-\infty}^{\infty}\Gamma_{yx}(\tau)\sin(\omega\tau) \]
Cross Spectrum
\[ f_{yx}(\omega) = ga_{yx}(\omega)\exp\big(i\, ph_{yx}(\omega)\big) \quad \text{(generic cross spectrum)} \]
\[ ga_{yx}(\omega) = \left[C^2_{yx}(\omega) + Q^2_{yx}(\omega)\right]^{1/2} \ \text{(gain)}, \qquad ph_{yx}(\omega) = \arctan\!\left(\frac{Q_{yx}(\omega)}{C_{yx}(\omega)}\right) \ \text{(phase)} \]
(Phase shift in time units is $\frac{ph(\omega)}{\omega}$.)
\[ coh_{yx}(\omega) = \frac{|f_{yx}(\omega)|^2}{f_{xx}(\omega) f_{yy}(\omega)} \quad \text{(coherence)} \]
Squared correlation decomposed by frequency.
Useful Spectral Results for Filter Design and Analysis. If $y_t = B(L)x_t$, the cross spectrum identifies the filter:
\[ B(e^{-i\omega}) = \frac{f_{yx}(\omega)}{f_{xx}(\omega)}. \]
A simple bivariate example:
\[ y_t = .5 x_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim WN(0,1); \qquad x_t = .9 x_{t-1} + \eta_t, \quad \eta_t \sim WN(0,1). \]
Correlation Structure
Autocorrelation and cross-correlation functions are straightforward:
\[ \rho_x(\tau) = .9^{|\tau|}, \qquad \rho_y(\tau) \propto .9^{|\tau|}. \]
Spectral Density of x
\[ x_t = \frac{1}{1 - .9L}\,\eta_t \implies f_{xx}(\omega) = \frac{1}{2\pi}\,\frac{1}{(1 - .9 e^{-i\omega})(1 - .9 e^{i\omega})} = \frac{1}{2\pi}\,\frac{1}{1 - 2(.9)\cos(\omega) + (.9)^2} = \frac{1}{11.37 - 11.30\cos(\omega)} \]
Shape?
Spectral Density of y
\[ y_t = 0.5 L x_t + \varepsilon_t \implies f_{yy}(\omega) = |0.5 e^{-i\omega}|^2 f_{xx}(\omega) + \frac{1}{2\pi} = 0.25 f_{xx}(\omega) + \frac{1}{2\pi} = \frac{0.25}{11.37 - 11.30\cos(\omega)} + \frac{1}{2\pi} \]
Shape?
Cross Spectrum
\[ B(L) = .5L, \qquad B(e^{-i\omega}) = 0.5 e^{-i\omega} \]
\[ f_{yx}(\omega) = B(e^{-i\omega}) f_{xx}(\omega) = 0.5 e^{-i\omega} f_{xx}(\omega) = \big(0.5 f_{xx}(\omega)\big)\, e^{-i\omega} \]
\[ g_{yx}(\omega) = 0.5 f_{xx}(\omega) = \frac{0.5}{11.37 - 11.30\cos(\omega)}, \qquad Ph_{yx}(\omega) = -\omega \]
(In time units, $Ph_{yx}(\omega)/\omega = -1$, so $y$ leads $x$ by $-1$; that is, $x$ leads $y$ by one period.)
Coherence
\[ Coh_{yx}(\omega) = \frac{|f_{yx}(\omega)|^2}{f_{xx}(\omega) f_{yy}(\omega)} = \frac{.25 f^2_{xx}(\omega)}{f_{xx}(\omega) f_{yy}(\omega)} = \frac{.25 f_{xx}(\omega)}{f_{yy}(\omega)} = \frac{.25\,\frac{1}{2\pi}\,\frac{1}{1 - 2(.9)\cos\omega + .9^2}}{.25\,\frac{1}{2\pi}\,\frac{1}{1 - 2(.9)\cos\omega + .9^2} + \frac{1}{2\pi}} = \frac{1}{8.24 - 7.20\cos(\omega)} \]
Shape?
Filter analysis example: first differencing,
\[ y_t = x_t - x_{t-1} \implies B(e^{-i\omega}) = 1 - e^{-i\omega}. \]
Kuznets' filters: the first filter is a five-period moving average,
\[ B_1(e^{-i\omega}) = \frac{1}{5}\sum_{j=-2}^{2} e^{-i\omega j} = \frac{\sin(5\omega/2)}{5\sin(\omega/2)}, \]
and the second is a long difference,
\[ z_t = y_{t+5} - y_{t-5} \implies |B_2(e^{-i\omega})| = |2\sin(5\omega)|, \]
so the combined gain is
\[ |B_1(e^{-i\omega})\, B_2(e^{-i\omega})| = \left|\frac{\sin(5\omega/2)}{5\sin(\omega/2)}\right|\,|2\sin(5\omega)|. \]
Kuznets' Filters, Continued
Filter Design: A Bandpass Filter
Canonical problem: find $B(L)$ s.t.
\[ f_y(\omega) = \begin{cases} f_x(\omega) & \text{on } [a,b] \cup [-b,-a] \\ 0 & \text{otherwise,}\end{cases} \]
where
\[ y_t = B(L) x_t = \sum_{j=-\infty}^{\infty} b_j x_{t-j}. \]
• How "best" to make this filter feasible in practice? What does that mean? Simple truncation?
• Phase shift?
4.6.1 Univariate
Periodogram:
\[ I(\omega) = \frac{2}{T}\left|\sum_{t=1}^{T} y_t e^{-i\omega t}\right|^2 = \left(\sqrt{\frac{2}{T}}\sum_{t=1}^{T} y_t e^{-i\omega t}\right)\left(\sqrt{\frac{2}{T}}\sum_{t=1}^{T} y_t e^{i\omega t}\right), \qquad -\pi \leq \omega \leq \pi. \]
Usually we examine the frequencies $\omega_j = \frac{2\pi j}{T}$, $j = 0, 1, 2, \ldots, \frac{T}{2}$.
Sample Spectral Density
\[ \hat{f}(\omega) = \frac{1}{2\pi}\sum_{\tau=-(T-1)}^{T-1}\hat{\gamma}(\tau) e^{-i\omega\tau} = \frac{1}{2\pi T}\left|\sum_{t=1}^{T} y_t e^{-i\omega t}\right|^2 = \left(\frac{1}{\sqrt{2\pi T}}\sum_{t=1}^{T} y_t e^{-i\omega t}\right)\left(\frac{1}{\sqrt{2\pi T}}\sum_{t=1}^{T} y_t e^{i\omega t}\right) = \frac{1}{4\pi}\, I(\omega). \]
Properties of the Sample Spectral Density
(Throughout we use $\omega_j = \frac{2\pi j}{T}$, $j = 0, 1, \ldots, \frac{T}{2}$.)
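A brief R sketch of the raw periodogram at the usual frequencies (white-noise data assumed for illustration):

# Raw periodogram of Gaussian white noise (no smoothing, no taper)
set.seed(1)
y <- rnorm(256)
sp <- spec.pgram(y, taper = 0, detrend = FALSE, plot = TRUE)
# sp$freq is in cycles per observation (j/T); multiply by 2*pi for radians
# the erratic appearance at any T reflects the estimator's inconsistency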
Its variance does not vanish as $T \to \infty$; hence the sample spectral density is inconsistent. Writing
\[ \hat{f}(\omega) = \frac{1}{2\pi}\sum_{\tau=-(T-1)}^{T-1}\hat{\gamma}(\tau) e^{-i\omega\tau} = \frac{1}{2\pi}\hat{\gamma}(0) + \frac{1}{\pi}\sum_{\tau=1}^{T-1}\hat{\gamma}(\tau)\cos(\omega\tau), \]
a consistent (lag-window) estimator downweights the long-lag sample autocovariances:
\[ f^*(\omega) = \frac{1}{2\pi}\sum_{\tau=-(T-1)}^{T-1}\lambda(\tau)\hat{\gamma}(\tau) e^{-i\omega\tau}. \]
4.6.2 Multivariate
Different lag windows may be used for different elements of $F_{yx}(\omega)$. Or do model-based...
2. HAC Estimation
4. Sample spectrum.
Generate samples of Gaussian white noise of sizes 32, 64, 128, 256, 512, 1024 and
2056, and for each compute and graph the sample spectral density function at the
usual frequencies. What do your graphs illustrate?
Solution: Assume normality, and then take draws from the process by using a normal
random number generator in conjunction with the Cholesky factorization of the data
covariance matrix. This procedure can be used to estimate the sampling distribution
of the autocorrelations, taken one at a time. One will surely want to downweight
the long-lag autocorrelations before doing the Cholesky factorization, and let this
downweighting adapt to sample size. Assessing sampling uncertainty for the entire
autocorrelation function (e.g., finding a 95% confidence “tunnel”) appears harder,
due to the correlation between sample autocorrelations, but can perhaps be done
numerically. It appears very difficult to dispense with the normality assumption.
9. Sample coherence.
If a sample coherence is computed directly from the sample spectral density matrix (without smoothing), it will be 1, by definition. Thus, it is important that the sample spectrum and cross-spectrum be smoothed prior to construction of a coherence estimator.
Solution:
\[ coh(\omega) = \frac{|f_{yx}(\omega)|^2}{f_x(\omega)\, f_y(\omega)}. \]
In unsmoothed sample spectral density analogs,
\[ \widehat{coh}(\omega) = \frac{\left[\sum y_t e^{-i\omega t}\sum x_t e^{+i\omega t}\right]\left[\sum y_t e^{+i\omega t}\sum x_t e^{-i\omega t}\right]}{\left[\sum x_t e^{-i\omega t}\sum x_t e^{+i\omega t}\right]\left[\sum y_t e^{-i\omega t}\sum y_t e^{+i\omega t}\right]} \equiv 1. \]
10. De-meaning.
Consider two forms of a covariance stationary time series: "raw" and de-meaned. Contrast their sample spectral density functions at ordinates $2\pi j/T$, $j = 0, 1, \ldots, T/2$. What do you conclude? Now contrast their sample spectral density functions at ordinates that are not multiples of $2\pi j/T$. Discuss.
Solution: Within the set $2\pi j/T$, $j = 0, 1, \ldots, T/2$, only the sample spectral density at frequency 0 is affected by de-meaning. However, de-meaning does affect the sample spectral density function at all frequencies in $[0, \pi]$ outside the set $2\pi j/T$, $j = 0, 1, \ldots, T/2$. See Priestley (1980, p. 417). This result is important for the properties of time- versus frequency-domain estimators of fractionally-integrated models. Note in particular that
\[ I(\omega_j) \propto \frac{1}{T}\left|\sum y_t e^{i\omega_j t}\right|^2, \]
so that
\[ I(0) \propto \frac{1}{T}\left|\sum y_t\right|^2 \propto T\bar{y}^2, \]
which approaches infinity with sample size so long as the mean is nonzero. Thus it makes little sense to use $I(0)$ in estimation, regardless of whether the data have been demeaned.
the variance of the sample mean of such a time series. If you are very ambitious, you
might want to explore in a Monte Carlo experiment the sampling properties of your
estimator of the standard error vs. the standard estimator of the standard error, for
various population models (e.g., AR(1) for various values of ρ) and sample sizes. If
you are not feeling so ambitious, at least conjecture upon the outcome of such an
experiment.
15. Coherence.
a. Write out the formula for the coherence between two time series x and y.
b. What is the coherence between the filtered series, $(1 - b_1 L)x_t$ and $(1 - b_2 L)y_t$? (Assume that $b_1 \neq b_2$.)
c. What happens if b1 = b2 ? Discuss.
Solution:
(a) $G^2(\omega) = |1 - e^{-i\omega}|^2$ is monotonically increasing on $[0, \pi]$. This is an example of a "high pass" filter.
(b) $G^2(\omega) = |1 + e^{-i\omega}|^2$ is monotonically decreasing on $[0, \pi]$. This is an example of a "low pass" filter.
(c) $G^2(\omega) = |1 - .5 e^{-i12\omega}|^{-2}$ has peaks at the fundamental seasonal frequency and its harmonics, as expected. Note that it corresponds to a seasonal autoregression.
(d) $G^2(\omega) = |1 - .5 e^{-i12\omega}|^{2}$ has troughs at the fundamental seasonal frequency and its harmonics, as expected, because it is the inverse of the seasonal filter in (c) above. Thus, the seasonal process associated with the filter in (c) above would be appropriately "seasonally adjusted" by the present filter, which is its inverse.
18. Filtering
(a) Consider the linear filter $B(L) = 1 + \theta L$. Suppose that $y_t = B(L)x_t$, where $x_t \sim WN(0, \sigma^2)$. Compute $f_y(\omega)$.
(b) Given that the spectral density of white noise is $\sigma^2/2\pi$, discuss how the filtering theorem may be used to determine the spectrum of any LRCSSP by viewing it as a linear filter of white noise.
Solution:
(a)
\[ f_y(\omega) = |1 + \theta e^{-i\omega}|^2 f_x(\omega) = \frac{\sigma^2}{2\pi}(1 + \theta e^{-i\omega})(1 + \theta e^{i\omega}) = \frac{\sigma^2}{2\pi}(1 + \theta^2 + 2\theta\cos\omega), \]
which is immediately recognized as the sdf of an MA(1) process.
(b) All of the LRCSSP's that we have studied are obtained by applying linear filters to white noise. Thus, the filtering theorem gives their sdf's as
\[ f(\omega) = \frac{\sigma^2}{2\pi}\,|B(e^{-i\omega})|^2 = \frac{\sigma^2}{2\pi}\, B(e^{-i\omega}) B(e^{i\omega}) = \frac{\sigma^2}{2\pi}\, B(z) B(z^{-1}), \]
evaluated on $|z| = 1$, which matches our earlier result.
Solution: The series must be deterministic, because one could design a filter such that the filtered series has zero spectrum everywhere.
20. Period.
The period is $P = 2\pi/\omega$, expressed in time units per cycle; its inverse, $1/P$, is expressed in cycles per unit time. In engineering, time is often measured in seconds, and $1/P$ is then measured in Hz.
(b) Compute and plot the spectral density $f(\omega)$, for various values of $\phi$. Does it have any internal peaks on $(0, \pi)$? Discuss.
(c) The lowest-frequency internal peak occurs at the so-called fundamental seasonal frequency. What is it? What is the corresponding period?
(d) The higher-frequency spectral peaks occur at the harmonics of the fundamental seasonal frequency. What are they? What are the corresponding periods?
Solution:
(a) Monthly, because of the 12-period lag.
(b)
\[ f(\omega) = \frac{\sigma^2}{2\pi}\,\frac{1}{1 + \phi^2 - 2\phi\cos(12\omega)} \]
The sdf has peaks at $\omega = 0, \pi/6, 2\pi/6, \ldots, 5\pi/6$, and $\pi$.
(c) The fundamental frequency is $\pi/6$, which corresponds to a period of 12 months.
(d) The harmonic frequencies are $2\pi/6, \ldots, 5\pi/6$, and $\pi$, corresponding to periods of 6 months, 4 months, 3 months, 12/5 months and 2 months, respectively.
(d) The higher-frequency spectral peaks occur at the harmonics of the fundamental seasonal frequency. What are they? What are the corresponding periods?
Solution:
(a) Quarterly, because of the 4-period lag.
(b)
\[ f(\omega) = \frac{\sigma^2}{2\pi}\,\frac{1}{1 + \phi^2 - 2\phi\cos(4\omega)} \]
The sdf has peaks at $\omega = 0, \pi/2$, and $\pi$.
(c) The fundamental frequency is $\pi/2$, which corresponds to a period of 4 quarters.
(d) The only harmonic is $\pi$, corresponding to a period of 2 quarters.
Let
\[ \bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t. \]
Then
\[ \mathrm{var}(\bar{x}) = \frac{1}{T^2}\sum_{s=1}^{T}\sum_{t=1}^{T}\gamma(t-s) = \frac{1}{T}\sum_{\tau=-(T-1)}^{T-1}\left(1 - \frac{|\tau|}{T}\right)\gamma(\tau), \]
Solution:
a. By the law of iterated expectations, we have
\[ \gamma(0) = \frac{0.2}{1 - 0.8} = 1 \]
\[ \gamma(\tau) = E(x_t x_{t-\tau}) = E[E(x_t x_{t-\tau}|x_{t-1}, x_{t-2}, \cdots)] = E[x_{t-\tau}\, E(x_t|x_{t-1}, x_{t-2}, \cdots)] = E(x_{t-\tau}\cdot 0) = 0 \]
for $\tau = 1, 2, \ldots$. Therefore
\[ f(\omega) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty}\gamma(\tau) e^{-i\omega\tau} = \frac{1}{2\pi}\gamma(0) = \frac{1}{2\pi}. \]
\[ E(x_t^4) = E[E(x_t^4|x_{t-1}, x_{t-2}, \cdots)] = E[3(0.2 + 0.8 x_{t-1}^2)^2] = 3[0.04 + 0.32\, E(x_{t-1}^2) + 0.64\, E(x_{t-1}^4)]. \]
Because $x_t^2$ obeys
\[ \gamma_{x^2}(\tau) = 0.8\,\gamma_{x^2}(\tau - 1). \]
\[ f(\omega) = \frac{\sigma^2}{2\pi}\big[(1 + 0.8 e^{12 i\omega})(1 + 0.8 e^{-12 i\omega})\big] = \frac{\sigma^2}{2\pi}(1 + 1.6\cos 12\omega + 0.64) \]
But asymptotically,
\[ \Sigma = P' D P, \]
so
\[ \Sigma^{-1} = P' D^{-1} P. \]
Thus asymptotically
4.8 NOTES
Harmonic analysis is one of the earliest methods of analyzing time series thought to exhibit
some form of periodicity. In this type of analysis, the time series, or some simple trans-
formation of it, is assumed to be the result of the superposition of sine and cosine waves
of different frequencies. However, since summing a finite number of such strictly periodic
functions always results in a perfectly periodic series, which is seldom observed in practice,
one usually allows for an additive stochastic component, sometimes called “noise.” Thus, an
observer must confront the problem of searching for “hidden periodicities” in the data, that
is, the unknown frequencies and amplitudes of sinusoidal fluctuations hidden amidst noise.
An early method for this purpose is periodogram analysis, initially used to analyse sunspot
data, and later to analyse economic time series.
Spectral analysis is a modernized version of periodogram analysis modified to take account
of the stochastic nature of the entire time series, not just the noise component. If it is assumed
that economic time series are fully stochastic, it follows that the older periodogram technique
is inappropriate and that considerable difficulties in the interpretation of the periodograms
of economic series may be encountered.
These notes draw in part on Diebold, Kilian and Nerlove, New Palgrave, ***.
Chapter Five

Markovian Structure, Linear Gaussian State Space, and Optimal (Kalman) Filtering

Consider a discrete-state, discrete-time stochastic process $\{X_t\}$, $t = 0, 1, 2, \ldots$, with one-step transition probabilities $p_{ij}$ (the probability of moving from state $i$ at time $t$ to state $j$ at time $t+1$) collected in the transition matrix
\[ P \equiv \begin{pmatrix} p_{11} & p_{12} & \cdots \\ p_{21} & p_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \qquad p_{ij} \geq 0, \quad \sum_{j=1}^{\infty} p_{ij} = 1. \]
Let $P^{(m)} \equiv \big[p_{ij}^{(m)}\big]$ denote the matrix of $m$-step transition probabilities.
Chapman-Kolmogorov theorem: $P^{(m+n)} = P^{(m)} P^{(n)}$.
Corollary: $P^{(m)} = P^m$.
Two states i and j communicate (or are in the same class) if each is accessible from the
other. We write i ↔ j.
A Markov process is irreducible if there exists only one class (i.e., all states communicate).
State $i$ has period $d$ if $p_{ii}^{(n)} = 0$ for all $n$ such that $n/d \notin \mathbb{Z}$, and $d$ is the greatest integer with that property. (That is, a return to state $i$ can only occur in multiples of $d$ steps.) A state with period 1 is called an aperiodic state.
A Markov process all of whose states are aperiodic is called an aperiodic Markov process.
Still more definitions....
The first-transition probability is the probability that, starting in $i$, the first transition to $j$ occurs after $n$ transitions:
\[ f_{ij}^{(n)} = \mathrm{Prob}(X_n = j,\ X_k \neq j,\ k = 1, \ldots, (n-1)\,|\,X_0 = i). \]
Denote the eventual transition probability from $i$ to $j$ by $f_{ij}\ (= \sum_{n=1}^{\infty} f_{ij}^{(n)})$.
\[ \pi P = \pi \]
(the defining equation of a stationary distribution $\pi$).
Then either:
(1) All states are transient or all states are null recurrent:
\[ p_{ij}^{(n)} \to 0 \text{ as } n \to \infty,\ \forall i, j; \]
no stationary distribution exists.
or
(2) All states are positive recurrent:
\[ p_{ij}^{(n)} \to \pi_j \text{ as } n \to \infty,\ \forall i, j, \]
and $\{\pi_j,\ j = 1, 2, 3, \ldots\}$ is the unique stationary distribution; $\pi$ is any row of $\lim_{n\to\infty} P^n$.
We will verify many of our claims, and we will calculate the steady-state distribution. The transition matrix is
\[ P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \]
with valid transition probabilities:
\[ p_{ij} \geq 0\ \forall i, j, \qquad \sum_{j=1}^{2} p_{1j} = 1, \quad \sum_{j=1}^{2} p_{2j} = 1. \]
Clearly, $1 \leftrightarrow 2$, so $P$ is irreducible.
5.1.4.4 Periodicity
State 1: $d(1) = 2$; State 2: $d(2) = 2$.
First-transition probabilities:
\[ f_{12}^{(1)} = 1, \quad f_{12}^{(n)} = 0\ \forall n > 1 \implies f_{12} = 1 \]
\[ f_{21}^{(1)} = 1, \quad f_{21}^{(n)} = 0\ \forall n > 1 \implies f_{21} = 1 \]
5.1.4.6 Recurrence
Moreover,
\[ \mu_{11} = \sum_{n=1}^{\infty} n f_{11}^{(n)} = 2 < \infty \quad (\text{and similarly } \mu_{22} = 2 < \infty). \]
The stationary distribution is $(.5,\ .5)$:
\[ (.5,\ .5)\begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix} = (.5,\ .5). \]
Note that in this example we can not get the stationary probabilities by taking $\lim_{n\to\infty} P^n$. Why?
In section 5.1.4 we considered an example of the form, “for a given Markov process, character-
ize its properties.” Interestingly, many important tools arise from the reverse consideration,
“For a given set of properties, find a Markov process with those properties.”
Regime switching: the latent regime $s_t$ follows a Markov process, either with a time-invariant transition matrix, $s_t \sim P$, or, in extended versions, with a time-varying transition matrix, $s_t \sim P_t$.
We call a process semi-Markov if its transitions are governed by $P$ but the state durations (times between transitions) are themselves random variables. The process is not Markov, because conditioning not only on the current state but also on the time-to-date in the state may be useful for predicting the future, but there is an embedded Markov process.
Key result: The stationary distribution depends only on P and the expected state dura-
tions. Other aspects of the duration distribution are irrelevant.
Theorem: If $\{X_t\}$ is a stationary Markov process with transition probabilities $p_{ij}$ and stationary probabilities $\pi_i$, then the reversed process is also Markov with transition probabilities
\[ p^*_{ij} = \frac{\pi_j}{\pi_i}\, p_{ji}. \]
In general, $p^*_{ij} \neq p_{ij}$. In the special situation $p^*_{ij} = p_{ij}$ (so that $\pi_i p_{ij} = \pi_j p_{ji}$), we say that the process is time-reversible.
5.2 STATE SPACE REPRESENTATIONS
The linear Gaussian state space system is
\[ \alpha_t = T\alpha_{t-1} + R\eta_t, \qquad y_t = Z\alpha_t + \varepsilon_t, \qquad \eta_t \sim N, \ \varepsilon_t \sim N. \]
More generally (non-linear, non-Gaussian),
\[ \alpha_t = Q(\alpha_{t-1}, \eta_t), \qquad y_t = G(\alpha_t, \varepsilon_t), \qquad \eta_t \sim D_\eta, \ \varepsilon_t \sim D_\varepsilon. \]
Still Markovian!
Transition Equation
\[ \underset{(m\times 1)}{\alpha_t} = \underset{(m\times m)}{T}\ \underset{(m\times 1)}{\alpha_{t-1}} + \underset{(m\times g)}{R}\ \underset{(g\times 1)}{\eta_t}, \qquad t = 1, 2, \ldots, T \]
Measurement Equation
\[ \underset{(1\times 1)}{y_t} = \underset{(1\times m)}{Z}\ \underset{(m\times 1)}{\alpha_t} + \underset{(1\times L)}{\Gamma}\ \underset{(L\times 1)}{w_t} + \underset{(1\times 1)}{\varepsilon_t}, \qquad t = 1, 2, \ldots, T \]
(Important) Details
\[ \begin{pmatrix}\eta_t\\ \varepsilon_t\end{pmatrix} \sim WN\Big(0,\ \mathrm{diag}(\underset{g\times g}{Q},\ \underset{1\times 1}{h})\Big), \qquad E(\alpha_0\eta_t') = 0_{m\times g}, \qquad E(\alpha_0\varepsilon_t) = 0_{m\times 1}. \]
The state space representation is not unique. For any nonsingular $m\times m$ matrix $B$,
\[ \alpha_t = T B^{-1} B\,\alpha_{t-1} + R\eta_t, \qquad y_t = Z B^{-1} B\,\alpha_t + \Gamma w_t + \varepsilon_t, \]
so that
\[ (B\alpha_t) = (B T B^{-1})(B\alpha_{t-1}) + (BR)\eta_t, \qquad y_t = (Z B^{-1})(B\alpha_t) + \Gamma w_t + \varepsilon_t; \]
$B\alpha_t$ is an equally valid state vector.
AR(1) in state space form:
\[ y_t = \phi y_{t-1} + \eta_t, \quad \eta_t \sim WN(0, \sigma_\eta^2) \]
\[ \alpha_t = \phi\alpha_{t-1} + \eta_t, \qquad y_t = \alpha_t \]
\[ (T = \phi,\ R = 1,\ Z = 1,\ \Gamma = 0,\ Q = \sigma_\eta^2,\ h = 0) \]
MA(1)
\[ y_t = \Theta(L)\varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2), \qquad \Theta(L) = 1 + \theta_1 L, \]
i.e.,
\[ y_t = \eta_t + \theta\eta_{t-1}, \quad \eta_t \sim WN(0, \sigma_\eta^2). \]
State space form:
\[ \begin{pmatrix}\alpha_{1t}\\ \alpha_{2t}\end{pmatrix} = \begin{pmatrix}0 & 1\\ 0 & 0\end{pmatrix}\begin{pmatrix}\alpha_{1,t-1}\\ \alpha_{2,t-1}\end{pmatrix} + \begin{pmatrix}1\\ \theta\end{pmatrix}\eta_t, \qquad y_t = (1,\ 0)\,\alpha_t = \alpha_{1t}, \]
where
\[ \alpha_t = \begin{pmatrix}y_t\\ \theta\eta_t\end{pmatrix}. \]
MA(q)
\[ y_t = \Theta(L)\varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2), \qquad \Theta(L) = 1 + \theta_1 L + \ldots + \theta_q L^q, \]
with $\eta_t \sim WN(0, \sigma_\eta^2)$. State space form (state of dimension $q+1$):
\[ \begin{pmatrix}\alpha_{1t}\\ \alpha_{2t}\\ \vdots\\ \alpha_{q+1,t}\end{pmatrix} = \begin{pmatrix}0 & & I_q\\ \vdots & & \\ 0 & & 0'\end{pmatrix}\begin{pmatrix}\alpha_{1,t-1}\\ \alpha_{2,t-1}\\ \vdots\\ \alpha_{q+1,t-1}\end{pmatrix} + \begin{pmatrix}1\\ \theta_1\\ \vdots\\ \theta_q\end{pmatrix}\eta_t, \]
where
\[ \alpha_t \equiv \begin{pmatrix}\theta_q\eta_{t-q} + \ldots + \theta_1\eta_{t-1} + \eta_t\\ \vdots\\ \theta_q\eta_{t-1} + \theta_{q-1}\eta_t\\ \theta_q\eta_t\end{pmatrix} = \begin{pmatrix}y_t\\ \vdots\\ \theta_q\eta_{t-1} + \theta_{q-1}\eta_t\\ \theta_q\eta_t\end{pmatrix}. \]
AR(p)
\[ \Phi(L) y_t = \varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2), \qquad \Phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p, \]
with $\eta_t \sim WN(0, \sigma_\eta^2)$. State space form:
\[ \alpha_t = \begin{pmatrix}\alpha_{1t}\\ \alpha_{2t}\\ \vdots\\ \alpha_{pt}\end{pmatrix} = \begin{pmatrix}\phi_1 & & \\ \phi_2 & I_{p-1} & \\ \vdots & & \\ \phi_p & 0' & \end{pmatrix}\begin{pmatrix}\alpha_{1,t-1}\\ \alpha_{2,t-1}\\ \vdots\\ \alpha_{p,t-1}\end{pmatrix} + \begin{pmatrix}1\\ 0\\ \vdots\\ 0\end{pmatrix}\eta_t \]
ARMA(p,q)
\[ \Phi(L) y_t = \Theta(L)\varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2), \]
\[ \Phi(L) = 1 - \phi_1 L - \ldots - \phi_p L^p, \qquad \Theta(L) = 1 + \theta_1 L + \ldots + \theta_q L^q, \]
with $\eta_t \sim WN(0, \sigma_\eta^2)$, $m = \max(p, q+1)$, and measurement equation
\[ y_t = (1, 0, \ldots, 0)\,\alpha_t. \]
ARMA(p,q) in State Space Form. Recursive substitution from the bottom up yields:
\[ \begin{pmatrix}\alpha_{1t}\\ \vdots\\ \alpha_{m-1,t}\\ \alpha_{mt}\end{pmatrix} = \begin{pmatrix}\phi_1\alpha_{1,t-1} + \ldots + \eta_t + \theta_1\eta_{t-1} + \ldots + \theta_q\eta_{t-q}\\ \vdots\\ \phi_{m-1}\alpha_{1,t-1} + \alpha_{m,t-1} + \theta_{m-2}\eta_t\\ \phi_m\alpha_{1,t-1} + \theta_{m-1}\eta_t\end{pmatrix} = \begin{pmatrix}y_t\\ \vdots\\ \phi_{m-1}y_{t-1} + \phi_m y_{t-2} + \theta_{m-1}\eta_{t-1} + \theta_{m-2}\eta_t\\ \phi_m y_{t-1} + \theta_{m-1}\eta_t\end{pmatrix}, \]
with grand transition equation
\[ \underset{(m\times 1)}{\alpha_t} = \underset{(m\times m)}{T}\,\alpha_{t-1} + \underset{(m\times g)}{R}\,\underset{(g\times 1)}{\eta_t}. \]
In the multivariate ($N$-variable) case the measurement equation becomes
\[ \underset{(N\times 1)}{y_t} = \underset{(N\times m)}{Z}\,\underset{(m\times 1)}{\alpha_t} + \underset{(N\times L)}{\Gamma}\,\underset{(L\times 1)}{W_t} + \underset{(N\times 1)}{\varepsilon_t}, \qquad \begin{pmatrix}\eta_t\\ \varepsilon_t\end{pmatrix} \sim WN\Big(0,\ \mathrm{diag}(\underset{g\times g}{Q},\ \underset{N\times N}{H})\Big). \]
$N$-Variable VAR(p), with $\eta_t \sim WN(0, \Sigma)$:
\[ \underset{(N\times 1)}{y_t} = (I_N, 0_N, \ldots, 0_N)\,\underset{(Np\times 1)}{\alpha_t}. \]
Multivariate ARMA(p,q)
\[ \eta_t \sim WN(0, \Sigma) \]
Multivariate ARMA(p,q), with $m = \max(p, q+1)$:
\[ \underset{(Nm\times 1)}{\alpha_t} = \begin{pmatrix}\Phi_1 & & \\ \Phi_2 & I_{N(m-1)} & \\ \vdots & & \\ \Phi_m & 0_{N\times N(m-1)} & \end{pmatrix}\alpha_{t-1} + \begin{pmatrix}I\\ \Theta_1\\ \vdots\\ \Theta_{m-1}\end{pmatrix}\eta_t \]
Linear regression with fixed coefficients,
\[ y_t = \beta' x_t + \varepsilon_t, \]
in state space form: transition $\alpha_t = \alpha_{t-1}$ and measurement
\[ y_t = x_t'\alpha_t + \varepsilon_t, \]
with $(T = I,\ R = 0,\ Z_t = x_t',\ \gamma = 0,\ H = \sigma_\varepsilon^2)$. Note the time-varying system matrix $Z_t$.
Linear Regression with ARMA(p,q) Disturbances
\[ y_t = \beta x_t + u_t, \]
where $u_t$ follows an ARMA(p,q) process with state vector governed by ($m = \max(p, q+1)$)
\[ \alpha_t = \begin{pmatrix}\phi_1 & & \\ \phi_2 & I_{m-1} & \\ \vdots & & \\ \phi_m & 0' & \end{pmatrix}\alpha_{t-1} + \begin{pmatrix}1\\ \theta_1\\ \vdots\\ \theta_{m-1}\end{pmatrix}\eta_t. \]
Linear regression with time-varying coefficients: transition
\[ \alpha_t = \phi\,\alpha_{t-1} + \eta_t, \]
and measurement
\[ y_t = x_t'\alpha_t + \varepsilon_t. \]
Structural VAR with orthogonal shocks $\eta_t \sim WN(0, I)$. Reduced form:
\[ y_t = \phi_0^{-1}\phi_1 y_{t-1} + \ldots + \phi_0^{-1}\phi_p y_{t-p} + \phi_0^{-1} P\eta_t, \]
with state space form
\[ \begin{pmatrix}\alpha_{1t}\\ \alpha_{2t}\\ \vdots\\ \alpha_{pt}\end{pmatrix} = \begin{pmatrix}\phi_0^{-1}\phi_1 & & \\ \phi_0^{-1}\phi_2 & I & \\ \vdots & & \\ \phi_0^{-1}\phi_p & 0' & \end{pmatrix}\begin{pmatrix}\alpha_{1,t-1}\\ \alpha_{2,t-1}\\ \vdots\\ \alpha_{p,t-1}\end{pmatrix} + \begin{pmatrix}\phi_0^{-1}P\\ 0\\ \vdots\\ 0\end{pmatrix}\begin{pmatrix}\eta_{1t}\\ \vdots\\ \eta_{Nt}\end{pmatrix}, \qquad \underset{(N\times 1)}{y_t} = (I_N, 0_N, \ldots, 0_N)\,\underset{(Np\times 1)}{\alpha_t}. \]
Dynamic factor model with a single AR(1) factor:
\[ \begin{pmatrix}y_{1t}\\ \vdots\\ y_{Nt}\end{pmatrix} = \begin{pmatrix}\mu_1\\ \vdots\\ \mu_N\end{pmatrix} + \begin{pmatrix}\lambda_1\\ \vdots\\ \lambda_N\end{pmatrix} F_t + \begin{pmatrix}\varepsilon_{1t}\\ \vdots\\ \varepsilon_{Nt}\end{pmatrix}, \qquad F_t = \phi F_{t-1} + \eta_t. \]
More generally, the factor may follow an ARMA(p,q) process,
\[ \Phi(L)\, F_t = \Theta(L)\, \eta_t. \]
Dynamic Factor Model – Single ARMA(p,q) Factor. The state vector for $F$ is the state vector for the system:
\[ \alpha_t = \begin{pmatrix}\phi_1 & & \\ \phi_2 & I_{m-1} & \\ \vdots & & \\ \phi_m & 0' & \end{pmatrix}\alpha_{t-1} + \begin{pmatrix}1\\ \theta_1\\ \vdots\\ \theta_{m-1}\end{pmatrix}\eta_t. \]
Dynamic Factor Model – Single ARMA(p,q) factor. The system measurement equation is then:
\[ \begin{pmatrix}y_{1t}\\ \vdots\\ y_{Nt}\end{pmatrix} = \begin{pmatrix}\mu_1\\ \vdots\\ \mu_N\end{pmatrix} + \begin{pmatrix}\lambda_1\\ \vdots\\ \lambda_N\end{pmatrix}(1, 0, \ldots, 0)\,\alpha_t + \begin{pmatrix}\varepsilon_{1t}\\ \vdots\\ \varepsilon_{Nt}\end{pmatrix} = \begin{pmatrix}\mu_1\\ \vdots\\ \mu_N\end{pmatrix} + \begin{pmatrix}\lambda_1 & 0 & \ldots & 0\\ \vdots & & & \\ \lambda_N & 0 & \ldots & 0\end{pmatrix}\alpha_t + \begin{pmatrix}\varepsilon_{1t}\\ \vdots\\ \varepsilon_{Nt}\end{pmatrix}. \]
Cointegration (common trend) example:
\[ \begin{pmatrix}y_{1t}\\ y_{2t}\end{pmatrix} = \begin{pmatrix}\lambda_1\\ \lambda_2\end{pmatrix}\alpha_t + \begin{pmatrix}\varepsilon_{1t}\\ \varepsilon_{2t}\end{pmatrix}, \qquad \alpha_t = \alpha_{t-1} + \eta_t. \]
"Common trend" $\alpha_t$. Note that
\[ \frac{y_{1t}}{\lambda_1} - \frac{y_{2t}}{\lambda_2} = \frac{\varepsilon_{1t}}{\lambda_1} - \frac{\varepsilon_{2t}}{\lambda_2}, \]
that is, I(1) − I(1) = I(0): "CI(1,0)."
Signal plus noise (AR(1) signal):
\[ x_t = \phi x_{t-1} + \eta_t, \qquad y_t = x_t + \varepsilon_t, \qquad \begin{pmatrix}\varepsilon_t\\ \eta_t\end{pmatrix} \sim WN\left(0,\ \begin{pmatrix}\sigma_\varepsilon^2 & 0\\ 0 & \sigma_\eta^2\end{pmatrix}\right). \]
Cycle + Seasonal + Noise
\[ y_t = c_t + s_t + \varepsilon_t, \qquad c_t = \phi c_{t-1} + \eta_{ct}, \qquad s_t = \gamma s_{t-4} + \eta_{st}. \]
The seasonal transition block is
\[ \alpha_{st} = \begin{pmatrix}0 & & I_3\\ \vdots & & \\ \gamma & & 0'\end{pmatrix}\alpha_{s,t-1} + \begin{pmatrix}1\\ 0\\ 0\\ 0\end{pmatrix}\eta_{st}. \]
Cycle + Seasonal + Noise. Stacking transition equations gives the grand transition equation:
\[ \begin{pmatrix}\alpha_{st}\\ \alpha_{ct}\end{pmatrix} = \begin{pmatrix}0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0\\ \gamma & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & \phi\end{pmatrix}\begin{pmatrix}\alpha_{s,t-1}\\ \alpha_{c,t-1}\end{pmatrix} + \begin{pmatrix}1 & 0\\ 0 & 0\\ 0 & 0\\ 0 & 0\\ 0 & 1\end{pmatrix}\begin{pmatrix}\eta_{st}\\ \eta_{ct}\end{pmatrix}. \]
Finally, the measurement equation is:
\[ y_t = (1, 0, 0, 0, 1)\begin{pmatrix}\alpha_{st}\\ \alpha_{ct}\end{pmatrix} + \varepsilon_t. \]
The grand state space system is then
\[ \underset{(m\times 1)}{\alpha_t} = \underset{(m\times m)}{T}\,\alpha_{t-1} + \underset{(m\times g)}{R}\,\underset{(g\times 1)}{\eta_t}, \qquad \underset{(N\times 1)}{y_t} = \underset{(N\times m)}{Z}\,\alpha_t + \underset{(N\times L)}{\gamma}\,\underset{(L\times 1)}{W_t} + \underset{(N\times 1)}{\varepsilon_t}, \]
\[ \begin{pmatrix}\eta_t\\ \varepsilon_t\end{pmatrix} \sim WN\Big(0,\ \mathrm{diag}(\underset{g\times g}{Q},\ \underset{N\times N}{H})\Big). \]
5.3 THE KALMAN FILTER AND SMOOTHER
Initialize:
\[ a_0 = E(\alpha_0), \qquad P_0 = E\left[(\alpha_0 - a_0)(\alpha_0 - a_0)'\right]. \]
Prediction step:
\[ a_{t/t-1} = T a_{t-1}, \qquad P_{t/t-1} = T P_{t-1} T' + R Q R' \]
(where $F_t = Z P_{t/t-1} Z' + H$), $t = 1, \ldots, T$.
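For concreteness, a minimal scalar-state R sketch of the filter recursions (a simplified special case with no $W_t$; the AR(1)-plus-noise example and all parameter values are illustrative assumptions, not the book's code):

# Scalar Kalman filter for alpha_t = T*alpha_{t-1} + eta_t, y_t = Z*alpha_t + eps_t
kalman_filter <- function(y, Tm, Z, Q, H, a0, P0) {
  n <- length(y); a <- a0; P <- P0
  v <- Ft <- att <- numeric(n)
  for (t in 1:n) {
    a_pred <- Tm * a                 # a_{t/t-1}
    P_pred <- Tm * P * Tm + Q        # P_{t/t-1}
    v[t]  <- y[t] - Z * a_pred       # innovation
    Ft[t] <- Z * P_pred * Z + H      # innovation variance
    K <- P_pred * Z / Ft[t]          # Kalman gain
    a <- a_pred + K * v[t]           # updated state mean
    P <- (1 - K * Z) * P_pred        # updated state variance
    att[t] <- a
  }
  list(att = att, v = v, Ft = Ft)
}
# Example: AR(1) signal plus noise, x_t = .8 x_{t-1} + eta_t, y_t = x_t + eps_t
set.seed(1)
x <- filter(rnorm(200), filter = 0.8, method = "recursive")
y <- x + rnorm(200, sd = 0.5)
kf <- kalman_filter(y, Tm = 0.8, Z = 1, Q = 1, H = 0.25, a0 = 0, P0 = 1 / (1 - 0.8^2))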
State-Space in Density Form (Assuming Normality)
\[ \alpha_t|\alpha_{t-1} \sim N(T\alpha_{t-1},\ RQR'), \qquad y_t|\alpha_t \sim N(Z\alpha_t,\ H). \]
Useful multivariate-normal result: if
\[ \begin{pmatrix}x\\ y\end{pmatrix} \sim N(\mu, \Sigma), \qquad \mu = (\mu_x, \mu_y)', \quad \Sigma = \begin{pmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{pmatrix}, \]
then $x|y \sim N(\mu_{x|y},\ \Sigma_{x|y})$.
Derivation of the Kalman filter. At time 0,
\[ a_0 = E_0(\alpha_0) = E(\alpha_0). \]
At time 1, $\alpha_1 = T\alpha_0 + R\eta_1$, so
\[ a_{1/0} = T a_0, \]
\[ P_{1/0} = E_0\left[(\alpha_1 - T a_0)(\alpha_1 - T a_0)'\right] = E_0\left[(T(\alpha_0 - a_0) + R\eta_1)(T(\alpha_0 - a_0) + R\eta_1)'\right] = T P_0 T' + RQR'. \]
We then move to the distribution of $\alpha_1|(\Omega_0\cup y_1)$, i.e., $\alpha_1|\Omega_1$, using
\[ E_0(\alpha_1) = a_{1/0}, \qquad \mathrm{var}_0(\alpha_1) = E_0\left[(\alpha_1 - a_{1/0})(\alpha_1 - a_{1/0})'\right] = P_{1/0}, \]
\[ \mathrm{var}_0(y_1) = E_0\left[(y_1 - Z a_{1/0} - \gamma W_1)(y_1 - Z a_{1/0} - \gamma W_1)'\right] = E_0\left[(Z(\alpha_1 - a_{1/0}) + \varepsilon_1)(Z(\alpha_1 - a_{1/0}) + \varepsilon_1)'\right], \]
so that
\[ \begin{pmatrix}\alpha_1\\ y_1\end{pmatrix}\Big|\Omega_0 \sim N\left(\begin{pmatrix}a_{1/0}\\ Z a_{1/0} + \gamma W_1\end{pmatrix},\ \begin{pmatrix}P_{1/0} & P_{1/0}Z'\\ Z P_{1/0} & Z P_{1/0}Z' + H\end{pmatrix}\right), \]
where $F_1 = Z P_{1/0} Z' + H$.
5.3.3 Calculating P0
Treatment of the initial covariance matrix: $P_0 = \Gamma(0)$ (covariance stationary case: all eigenvalues of $T$ inside $|z| = 1$). From
\[ \alpha_t = T\alpha_{t-1} + R\eta_t \implies P_0 = T P_0 T' + RQR', \]
so that
\[ \mathrm{vec}(P_0) = (T\otimes T)\,\mathrm{vec}(P_0) + \mathrm{vec}(RQR') \implies \mathrm{vec}(P_0) = \left[I - (T\otimes T)\right]^{-1}\mathrm{vec}(RQR'). \]
5.3.4 Predicting yt
Point prediction: $y_{t/t-1} = Z a_{t/t-1} + \gamma W_t$.
Prediction error:
\[ v_t = y_t - (Z a_{t/t-1} + \gamma W_t). \]
Density Prediction of yt
\[ y_t|\Omega_{t-1} \sim N(y_{t/t-1}, F_t), \quad \text{or equivalently} \quad v_t|\Omega_{t-1} \sim N(0, F_t), \]
where $F_t = Z P_{t/t-1} Z' + H$.
Updating (the Kalman gain):
\[ K_t = P_{t/t-1} Z' F_t^{-1}, \qquad a_{t+1/t} = T a_{t/t-1} + T K_t v_t. \]
Recall the two-shock state space representation
\[ \alpha_t = T\alpha_{t-1} + R\eta_t, \qquad y_t = Z\alpha_t + \varepsilon_t, \qquad E(\eta_t\eta_t') = Q, \quad E(\varepsilon_t\varepsilon_t') = H. \]
(Nothing new.) The filter delivers an equivalent one-shock representation,
\[ a_{t+1|t} = T a_{t|t-1} + T K_t v_t, \qquad y_t = Z a_{t|t-1} + v_t \quad \text{(measurement)}. \]
Note that the one-shock state space representation has time-varying system matrices:
• Covariance matrix of $v_t$ is $F_t$
5.3.5 Steady State and the Innovations Representation
In steady state,
\[ y_t = Z a_{t|t-1} + v_t, \]
where
\[ \bar{K} = \bar{P} Z'\bar{F}^{-1}, \qquad E(v_t v_t') = \bar{F} = Z\bar{P}Z' + H. \]
– Prediction $y_{t+1/t}$ is now the projection of $y_{t+1}$ on the infinite past, and the one-step prediction errors $v_t$ are now the Wold-Wiener-Kolmogorov innovations.
Remarks on the Steady State
2. Because the recursions for $P_{t|t-1}$ and $K_t$ don't depend on the data, but only on $P_0$, we can calculate arbitrarily close approximations to $\bar{P}$ and $\bar{K}$ by letting the filter run.
5.3.6 Kalman Smoothing
Then:
\[ a_{t,T} = a_t + J_t\left(a_{t+1,T} - a_{t+1,t}\right), \]
where
\[ J_t = P_t\, T'\, P_{t+1,t}^{-1}. \]
(e) The expected number of returns to a recurrent state is infinite, and the expected number of returns to a transient state is finite. That is,
\[ \text{State } j \text{ is recurrent} \iff \sum_{n=1}^{\infty} P_{jj}^{n} = \infty, \qquad \text{State } j \text{ is transient} \iff \sum_{n=1}^{\infty} P_{jj}^{n} < \infty. \]
Solution:
\[ P_{ij} \geq 0\ \forall i,j, \qquad \sum_{j=1}^{2} P_{1j} = 1, \quad \sum_{j=1}^{2} P_{2j} = 1. \]
\[ P^3 = \begin{pmatrix}.9 & .1\\ .3 & .7\end{pmatrix}\begin{pmatrix}.9 & .1\\ .3 & .7\end{pmatrix}\begin{pmatrix}.9 & .1\\ .3 & .7\end{pmatrix} = \begin{pmatrix}.804 & .196\\ .588 & .412\end{pmatrix} \]
Periodicity: State 1: $d(1) = 1$; State 2: $d(2) = 1$.
First-transition probabilities:
\[ f_{12}^{(1)} = .1, \quad f_{12}^{(2)} = .9\times .1 = .09, \quad f_{12}^{(3)} = .9^2\times .1 = .081, \quad f_{12}^{(4)} = .9^3\times .1 = .0729, \quad \cdots \]
\[ f_{21}^{(1)} = .3, \quad f_{21}^{(2)} = .7\times .3 = .21, \quad f_{21}^{(3)} = .7^2\times .3 = .147, \quad f_{21}^{(4)} = .7^3\times .3 = .1029, \quad \cdots \]
Eventual:
\[ f_{12} = \sum_{n=1}^{\infty} f_{12}^{(n)} = \frac{.1}{1 - .9} = 1, \qquad f_{21} = \sum_{n=1}^{\infty} f_{21}^{(n)} = \frac{.3}{1 - .7} = 1. \]
(g) Recurrence:
Because $f_{12} = f_{21} = 1$, both states 1 and 2 are recurrent. In addition,
\[ \mu_{11} = \sum_{n=1}^{\infty} n f_{11}^{(n)} < \infty, \qquad \mu_{22} = \sum_{n=1}^{\infty} n f_{22}^{(n)} < \infty. \]
States 1 and 2 are therefore positive recurrent and (given their aperiodicity established earlier) ergodic.
(h) Stationary distribution
We can iterate on the $P$ matrix to see that:
\[ \lim_{n\to\infty} P^n = \begin{pmatrix}.75 & .25\\ .75 & .25\end{pmatrix}. \]
Hence $\pi_1 = 0.75$ and $\pi_2 = 0.25$.
Alternatively, in the two-state case, we can solve analytically for the stationary probabilities as follows. From
\[ (\pi_1\ \ \pi_2)\begin{pmatrix}P_{11} & P_{12}\\ P_{21} & P_{22}\end{pmatrix} = (\pi_1\ \ \pi_2) \]
we have $\pi_1 P_{11} + (1 - \pi_1)P_{21} = \pi_1$, so
\[ \pi_1 = \frac{P_{21}}{1 - P_{11} + P_{21}}, \qquad \pi_2 = \frac{1 - P_{11}}{1 - P_{11} + P_{21}}. \]
Thus,
\[ \lim_{n\to\infty} P^n = \frac{1}{1 - P_{11} + P_{21}}\begin{pmatrix}P_{21} & 1 - P_{11}\\ P_{21} & 1 - P_{11}\end{pmatrix}. \]
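A quick R verification of the two-state calculations above (the transition matrix is the one from the example):

# Stationary distribution of P = [.9 .1; .3 .7]
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)
# (i) brute force: iterate P^n
Pn <- diag(2); for (i in 1:200) Pn <- Pn %*% P
Pn                                      # rows converge to (.75, .25)
# (ii) solve pi %*% P = pi: left eigenvector of P with eigenvalue 1
e <- eigen(t(P))
pi_hat <- Re(e$vectors[, 1]); pi_hat <- pi_hat / sum(pi_hat)
pi_hat                                  # 0.75 0.25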
The ARMA(1,1) process
\[ y_t = \phi y_{t-1} + \eta_t + \theta\eta_{t-1}, \qquad \eta_t \sim WN(0, \sigma_\eta^2), \]
has state space representation
\[ \begin{pmatrix}\alpha_{1t}\\ \alpha_{2t}\end{pmatrix} = \begin{pmatrix}\phi & 1\\ 0 & 0\end{pmatrix}\begin{pmatrix}\alpha_{1,t-1}\\ \alpha_{2,t-1}\end{pmatrix} + \begin{pmatrix}1\\ \theta\end{pmatrix}\eta_t, \qquad y_t = (1,\ 0)\,\alpha_t = \alpha_{1t}, \]
where
\[ \alpha_t = \begin{pmatrix}\phi y_{t-1} + \theta\eta_{t-1} + \eta_t\\ \theta\eta_t\end{pmatrix} = \begin{pmatrix}y_t\\ \theta\eta_t\end{pmatrix}. \]
7. Identification in UCM’s.
Discuss the identifying assumption that UC innovations are orthogonal at all leads and
lags. What convenient mathematical properties does it entail for the observed sum of
the unobserved components? In what ways is it restrictive?
Solution: Orthogonality of component innovations implies that the spectrum of the
observed time series is simply the sum of the component spectra. Moreover, the or-
thogonality facilitates identification. The assumption is rather restrictive, however, in
that it entails no interaction between cyclical and secular economic fluctuations.
Consider the unobserved-components model
\[ x_t = y_t + u_t, \qquad y_t = \alpha y_{t-1} + v_t. \]
Show that $x_t$ has the ARMA(1,1) reduced form
\[ x_t = \alpha x_{t-1} + \varepsilon_t - \beta\varepsilon_{t-1}, \]
and provide expressions for $\sigma_\varepsilon^2$ and $\beta$ in terms of the underlying parameters $\alpha$, $\sigma_v^2$ and $\sigma_u^2$.
Solution:
Box and Jenkins (1976) and Nerlove et al. (1979) show the ARMA result and give the formula for $\beta$. That leaves $\sigma_\varepsilon^2$. We will compute $\mathrm{var}(x)$ first from the UCM and then from the ARMA(1,1) reduced form, and equate them.
From the UCM:
\[ \mathrm{var}(x) = \frac{\sigma_v^2}{1 - \alpha^2} + \sigma_u^2 \]
From the reduced form:
\[ \mathrm{var}(x) = \frac{(1 + \beta^2 - 2\alpha\beta)}{1 - \alpha^2}\,\sigma_\varepsilon^2 \]
Equating yields
\[ \sigma_\varepsilon^2 = \frac{\sigma_v^2 + \sigma_u^2(1 - \alpha^2)}{1 + \beta^2 - 2\alpha\beta}. \]
extract the seasonal and subtract; both methods yield the same answer;
ii) $\hat{y}_s$, the estimated seasonal, is less variable than $y_s$, the true seasonal, and $\hat{y}_n$, the estimated nonseasonal, is less variable than $y_n$, the true nonseasonal. It is paradoxical that, by (ii), both estimates are less variable than their true counterparts, yet, by (i), they still add up to the same observed series as their true counterparts. The paradox is explained by the fact that, unlike their true counterparts, the estimates $\hat{y}_s$ and $\hat{y}_n$ are correlated (so the variance of their sum can be more than the sum of their variances).
5.5 NOTES
Chapter Six

Maximum Likelihood Estimation and Inference

Gaussian likelihood in levels form:
\[ y \sim N(\mu, \Sigma(\theta)). \]
Example: AR(1),
\[ (y_t - \mu) = \phi(y_{t-1} - \mu) + \varepsilon_t, \qquad \Sigma_{ij}(\phi) = \frac{\sigma^2}{1 - \phi^2}\,\phi^{|i-j|}. \]
\[ L(y;\theta) = (2\pi)^{-T/2}\,|\Sigma(\theta)|^{-1/2}\exp\left(-\frac{1}{2}(y - \mu)'\Sigma^{-1}(\theta)(y - \mu)\right) \]
\[ \ln L(y;\theta) = \text{const} - \frac{1}{2}\ln|\Sigma(\theta)| - \frac{1}{2}(y - \mu)'\Sigma^{-1}(\theta)(y - \mu) \]
The $T\times T$ matrix $\Sigma(\theta)$ can be very hard to calculate (we need analytic formulas for the autocovariances) and invert (numerical instabilities and inaccuracies; slow even if possible).
Prediction-error decomposition and the Kalman filter:
Schweppe's prediction-error likelihood decomposition is
\[ L(y_1, \ldots, y_T;\theta) = \prod_{t=1}^{T} L_t(y_t|y_{t-1}, \ldots, y_1;\theta), \]
or:
\[ \ln L(y_1, \ldots, y_T;\theta) = \sum_{t=1}^{T}\ln L_t(y_t|y_{t-1}, \ldots, y_1;\theta), \]
the "prediction-error decomposition." In the univariate Gaussian case, the Schweppe decomposition is
\[ \ln L = -\frac{T}{2}\ln 2\pi - \frac{1}{2}\sum_{t=1}^{T}\ln\sigma_t^2 - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - \mu_t)^2}{\sigma_t^2} = -\frac{T}{2}\ln 2\pi - \frac{1}{2}\sum_{t=1}^{T}\ln F_t - \frac{1}{2}\sum_{t=1}^{T}\frac{v_t^2}{F_t}, \]
and in the multivariate Gaussian case
\[ \ln L = -\frac{NT}{2}\ln 2\pi - \frac{1}{2}\sum_{t=1}^{T}\ln|F_t| - \frac{1}{2}\sum_{t=1}^{T} v_t' F_t^{-1} v_t. \]
Numerical maximization of the log likelihood proceeds iteratively:
1. Specify $\theta^{(0)}$.
4. Compute $\theta^{(m+1)}$.
Convergence
Convergence Criteria: $\|s^{(m)}\|$ "small"; $\|\theta^{(m)} - \theta^{(m-1)}\|$ "small."
Convergence Rates: the largest $p$ such that
\[ \lim_{m\to\infty}\frac{\|\theta^{(m+1)} - \hat{\theta}\|}{\|\theta^{(m)} - \hat{\theta}\|^p} = O(1). \]
Method of Steepest Descent: use $D^{(m)} = I$, $t^{(m)} = 1$, $\forall m$.
Newton-type methods use the inverse Hessian,
\[ D^{(m)} = H^{-1(m)} = \begin{pmatrix}\frac{\partial^2\ln L}{\partial\theta_1^2}\big|_{\theta^{(m)}} & \cdots & \frac{\partial^2\ln L}{\partial\theta_1\partial\theta_k}\big|_{\theta^{(m)}}\\ \vdots & & \vdots\\ \frac{\partial^2\ln L}{\partial\theta_k\partial\theta_1}\big|_{\theta^{(m)}} & \cdots & \frac{\partial^2\ln L}{\partial\theta_k^2}\big|_{\theta^{(m)}}\end{pmatrix}^{-1}. \]
An interesting duality...
Line search: first determine direction, then step.
Trust region: first determine step, then direction
– Approximate the function locally in a trust region containing all admissible steps, and then determine direction.
Classic example: Levenberg-Marquardt.
Related R packages:
trust (trust region optimization)
minpack.lm (R interface to Levenberg-Marquardt in MINPACK)
The Kalman smoother. Initialize at
\[ a_{T,T} = a_T, \qquad P_{T,T} = P_T. \]
Smooth:
\[ a_{t,T} = a_t + J_t\left(a_{t+1,T} - a_{t+1,t}\right), \qquad P_{t,T} = P_t + J_t\left(P_{t+1,T} - P_{t+1,t}\right)J_t', \]
where $J_t = P_t\, T'\, P_{t+1,t}^{-1}$. For the lagged smoothed covariances, initialize at $t = T$ and then iterate:
\[ P_{(t-1,t-2),T} = P_{t-1} J_{t-2}' + J_{t-1}\left(P_{(t,t-1),T} - T P_{t-1}\right) J_{t-2}'. \]
The EM algorithm treats the state space model
\[ \alpha_t = T\alpha_{t-1} + \eta_t, \qquad y_t = Z\alpha_t + \varepsilon_t \]
as a missing-data problem and iterates to maximize $\ln L(y;\theta)$:
1. E Step: construct $\ln L^{(m)}(y;\theta) \approx E_\alpha\big[\ln L\big(y, \{\alpha_t\}_{t=0}^{T};\theta\big)\big]$.
2. M Step: $\theta^{(m+1)} = \arg\max_\theta\, \ln L^{(m)}(y;\theta)$.
Operationally:
1. E Step: approximate a "complete data" situation by replacing $\{\alpha_t\}_{t=0}^{T}$ with $a_{t,T}$ from the Kalman smoother.
2. M Step: estimate parameters by running regressions: $a_{t,T} \to a_{t-1,T}$ and $y_t \to a_{t,T}$.
3. If the convergence criterion is not met, go to 1.
Complete-Data Likelihood:
\[ f_\theta\left(y, \alpha_0, \{\alpha_t\}_{t=1}^{T}\right) = f_{a_0,P_0}(\alpha_0)\prod_{t=1}^{T} f_{T,Q}(\alpha_t|\alpha_{t-1})\prod_{t=1}^{T} f_{Z,H}(y_t|\alpha_t) \]
6.3.2.2 E Step
Construct $\ln L^{(m)}(y;\theta) \approx E_\alpha\big[\ln L\big(y, \{\alpha_t\}_{t=0}^{T};\theta\big)\big]$:
\[ E_\alpha\left[\ln L(y, \{\alpha_t\}_{t=0}^{T};\theta)\right] = \text{const} - \frac{1}{2}\ln|P_0| - \frac{1}{2}E_\alpha\left[(\alpha_0 - a_0)' P_0^{-1}(\alpha_0 - a_0)\right] \]
\[ - \frac{T}{2}\ln|Q| - \frac{1}{2}\sum_{t=1}^{T} E_\alpha\left[(\alpha_t - T\alpha_{t-1})' Q^{-1}(\alpha_t - T\alpha_{t-1})\right] \]
\[ - \frac{T}{2}\ln|H| - \frac{1}{2}\sum_{t=1}^{T} E_\alpha\left[(y_t - Z\alpha_t)' H^{-1}(y_t - Z\alpha_t)\right]. \]
The resulting M-step estimators are
\[ \hat{T} = \left(\sum_{t=1}^{T} E_\alpha\left[\alpha_t\alpha_{t-1}'\right]\right)\left(\sum_{t=1}^{T} E_\alpha\left[\alpha_{t-1}\alpha_{t-1}'\right]\right)^{-1}, \qquad \hat{Q} = \frac{1}{T}\sum_{t=1}^{T} E_\alpha\left[\hat{\eta}_t\hat{\eta}_t'\right], \]
\[ \hat{Z} = \left(\sum_{t=1}^{T} y_t\, E_\alpha[\alpha_t]'\right)\left(\sum_{t=1}^{T} E_\alpha\left[\alpha_t\alpha_t'\right]\right)^{-1}, \qquad \hat{H} = \frac{1}{T}\sum_{t=1}^{T} E_\alpha\left[\hat{\varepsilon}_t\hat{\varepsilon}_t'\right], \]
where
\[ E_\alpha[\alpha_t] = a_{t|T}, \qquad E_\alpha\left[\alpha_t\alpha_{t-1}'\right] = a_{t|T}\, a_{t-1|T}' + P_{(t,t-1)|T}. \]
Simply replacing $\alpha_t$ with $a_{t,T}$ won't work because $E[\alpha_t\alpha_t'|\Omega_T] \neq a_{t,T}\,a_{t,T}'$. Instead we have
\[ E[\alpha_t\alpha_t'|\Omega_T] = E[\alpha_t|\Omega_T]\, E[\alpha_t|\Omega_T]' + \mathrm{var}(\alpha_t|\Omega_T) = a_{t,T}\, a_{t,T}' + P_{t,T}. \]
Asymptotic distribution theory. The log likelihood is
\[ \ln L(\theta) = \sum_{t=1}^{T}\ln L_t(\theta), \]
and the expected (Hessian-based) information is
\[ I_{EX,H}(\theta_0) = -E\left[\frac{\partial^2\ln L(\theta)}{\partial\theta\,\partial\theta'}\right]_{\theta_0} = -E H(\theta_0) = \sum_{t=1}^{T}\left(-E\left[\frac{\partial^2\ln L_t(\theta)}{\partial\theta\,\partial\theta'}\right]_{\theta_0}\right) = \sum_{t=1}^{T}\big(-E H_t(\theta_0)\big). \]
Under correct specification,
\[ \sqrt{T}\left(\hat{\theta}_{ML} - \theta_0\right) \overset{d}{\rightarrow} N\big(0, V_{EX}(\theta_0)\big), \quad (6.1) \]
where
\[ V_{EX}(\theta_0) = V_{EX,H}(\theta_0) = \mathrm{plim}_{T\to\infty}\left(\frac{I_{EX,H}(\theta_0)}{T}\right)^{-1} = V_{EX,s}(\theta_0) = \mathrm{plim}_{T\to\infty}\left(\frac{I_{EX,s}(\theta_0)}{T}\right)^{-1}. \]
Estimators of the asymptotic variance:
\[ \hat{V}_{EX,H}(\theta_0) = \left(\frac{I_{EX,H}(\hat{\theta}_{ML})}{T}\right)^{-1}, \qquad \hat{V}_{EX,s}(\theta_0) = \left(\frac{I_{EX,s}(\hat{\theta}_{ML})}{T}\right)^{-1}, \]
\[ \hat{V}_{OB,H}(\theta_0) = \left(\frac{I_{OB,H}(\hat{\theta}_{ML})}{T}\right)^{-1}, \qquad \hat{V}_{OB,s}(\theta_0) = \left(\frac{I_{OB,s}(\hat{\theta}_{ML})}{T}\right)^{-1}. \]
Under correct specification, $\mathrm{plim}_{T\to\infty}\hat{V}_{EX,H}(\theta_0) = \mathrm{plim}_{T\to\infty}\hat{V}_{EX,s}(\theta_0) = V_{EX}(\theta_0)$ and $\mathrm{plim}_{T\to\infty}\hat{V}_{OB,H}(\theta_0) = \mathrm{plim}_{T\to\infty}\hat{V}_{OB,s}(\theta_0) = V_{EX}(\theta_0)$.
Under possible distributional misspecification (but still assuming correct conditional mean and variance function specifications),
\[ \sqrt{T}\left(\hat{\theta}_{ML} - \theta_0\right) \overset{d}{\rightarrow} N\big(0, V^m_{EX}(\theta_0)\big), \quad (6.2) \]
where
\[ V^m_{EX}(\theta_0) = V_{EX,H}(\theta_0)^{-1}\, V_{EX,s}(\theta_0)\, V_{EX,H}(\theta_0)^{-1}, \]
\[ \hat{V}^m_{EX}(\theta_0) = \left(\frac{I_{EX,H}(\hat{\theta}_{ML})}{T}\right)^{-1}\left(\frac{I_{EX,s}(\hat{\theta}_{ML})}{T}\right)\left(\frac{I_{EX,H}(\hat{\theta}_{ML})}{T}\right)^{-1}, \qquad \hat{V}^m_{OB}(\theta_0) = \left(\frac{I_{OB,H}(\hat{\theta}_{ML})}{T}\right)^{-1}\left(\frac{I_{OB,s}(\hat{\theta}_{ML})}{T}\right)\left(\frac{I_{OB,H}(\hat{\theta}_{ML})}{T}\right)^{-1}: \]
the "sandwich estimator." Under general misspecification,
\[ \sqrt{T}\left(\hat{\theta}_{ML} - \theta_*\right) \overset{d}{\rightarrow} N\big(0, V^m_{EX}(\theta_*)\big), \quad (6.3) \]
where
\[ V^m_{EX}(\theta_*) = V_{EX,H}(\theta_*)^{-1}\, V_{EX,s}(\theta_*)\, V_{EX,H}(\theta_*)^{-1}, \]
\[ \hat{V}^m_{EX}(\theta_*) = \left(\frac{I_{EX,H}(\hat{\theta}_{ML})}{T}\right)^{-1}\left(\frac{I_{EX,s}(\hat{\theta}_{ML})}{T}\right)\left(\frac{I_{EX,H}(\hat{\theta}_{ML})}{T}\right)^{-1}, \qquad \hat{V}^m_{OB}(\theta_*) = \left(\frac{I_{OB,H}(\hat{\theta}_{ML})}{T}\right)^{-1}\left(\frac{I_{OB,s}(\hat{\theta}_{ML})}{T}\right)\left(\frac{I_{OB,H}(\hat{\theta}_{ML})}{T}\right)^{-1}. \]
Spectral (Whittle) likelihood. Asymptotically, the sample spectral density ordinates satisfy
\[ \hat{f}(\omega_j) = \frac{f(\omega_j;\theta)\, x_j}{2}, \qquad x_j \sim \chi^2_2, \]
where $f(\omega_j;\theta)$ is the spectral density and the $\chi^2_2$ random variables are independent across frequencies
\[ \omega_j = \frac{2\pi j}{T}, \qquad j = 0, 1, \ldots, \frac{T}{2}. \]
The MGF of any one of the $x_j$'s is
\[ M_x(t) = \frac{1}{1 - 2t}. \]
Let $y_j = \hat{f}(\omega_j) = \frac{f(\omega_j;\theta)\, x_j}{2}$. Then
\[ M_y(t) = M_x\!\left(\frac{f(\omega_j;\theta)}{2}\, t\right) = \frac{1}{1 - f(\omega_j;\theta)\, t}, \]
so that
\[ g\big(\hat{f}(\omega_j);\theta\big) = \frac{1}{f(\omega_j;\theta)}\, e^{-\frac{\hat{f}(\omega_j)}{f(\omega_j;\theta)}}, \]
i.e., the $\hat{f}(\omega_j)$ are asymptotically exponentially distributed. Hence
\[ \ln L(\hat{f};\theta) = -\sum_{j=0}^{T/2}\ln f(\omega_j;\theta) - \sum_{j=0}^{T/2}\frac{\hat{f}(\omega_j)}{f(\omega_j;\theta)}, \]
and in the multivariate case
\[ \ln L(\hat{f};\theta) = -\sum_{j=0}^{T/2}\ln|F(\omega_j;\theta)| - \sum_{j=0}^{T/2}\mathrm{trace}\left[F^{-1}(\omega_j;\theta)\,\hat{F}(\omega_j)\right]. \]
3. Method of scoring.
Slight variation on Newton: use $\left[E(H^{(m)})\right]^{-1}$ rather than $H^{-1(m)}$.
4. Constrained optimization.
6.6 NOTES
Chapter Seven

Simulation for Economic Theory, Econometric Theory, Estimation, Inference, and Optimization

7.1 GENERATING U(0,1) DEVIATES
Desiderata for a random number generator:
1. Statistically independent
2. Reproducible
3. Non-repeating
4. Quickly-generated
Example:
Figure 7.1: Ripley's "Horror" Plots of pairs of $(U_{i+1}, U_i)$ for Various Congruential Generators Modulo 2048 (from Ripley, 1987)

A congruential generator example produces the sequence
\[ x_0 = 1,\ x_1 = 3,\ x_2 = 9,\ x_3 = 11,\ x_4 = 1,\ x_5 = 3,\ \ldots \]
Remarks
1. $x_t \in [0, m-1]$, $\forall t$. So take $x^*_t = \frac{x_t}{m}$, $\forall t$.
3. The maximum period, $m$, can be attained using the mixed congruential generator if:
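A minimal R sketch of a mixed congruential generator (the constants a, c, m below are illustrative choices, not the text's):

# x_{t+1} = (a*x_t + c) mod m, then u_t = x_t / m in [0, 1)
lcg <- function(n, seed = 1, a = 69069, c = 1, m = 2^32) {
  x <- numeric(n); s <- seed
  for (i in 1:n) { s <- (a * s + c) %% m; x[i] <- s / m }
  x
}
u <- lcg(1000)
plot(u[-1000], u[-1], pch = ".", xlab = "U_i", ylab = "U_{i+1}")  # check for lattice ("horror plot") structure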
Figure 7.2: Transforming from U(0,1) to f (from Davidson and MacKinnon, 1993)
7.2.2 Box-Muller
Let $x_1$ and $x_2$ be independent U(0,1) deviates, and consider the transformation
\[ y_1 = \sqrt{-2\ln x_1}\,\cos(2\pi x_2), \qquad y_2 = \sqrt{-2\ln x_1}\,\sin(2\pi x_2), \]
with inverse transformation
\[ x_1 = e^{-\frac{1}{2}(y_1^2 + y_2^2)}, \qquad x_2 = \frac{1}{2\pi}\arctan\!\left(\frac{y_2}{y_1}\right). \]
The change-of-variables formula gives
\[ f(y_1, y_2) = f(x_1, x_2)\left|\begin{array}{cc}\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2}\\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2}\end{array}\right| = \left(\frac{1}{\sqrt{2\pi}}\, e^{-y_1^2/2}\right)\left(\frac{1}{\sqrt{2\pi}}\, e^{-y_2^2/2}\right). \]
The bivariate density is the product of two $N(0,1)$ densities, so we have generated two independent $N(0,1)$ deviates.
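A direct R implementation of the Box-Muller transformation sketched above (sample size illustrative):

# Two independent N(0,1) deviates from two independent U(0,1) deviates
box_muller <- function(n) {
  m  <- ceiling(n / 2)
  u1 <- runif(m); u2 <- runif(m)
  r  <- sqrt(-2 * log(u1))                     # radius
  z  <- c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))
  z[1:n]
}
set.seed(1)
z <- box_muller(1e5)
c(mean(z), var(z))                             # approximately 0 and 1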
Generating Deviates Derived from N(0,1)
Accept-Reject
(Naive but Revealing Example)
We want to sample $x \sim f(x)$, where $f$ has support $[\alpha, \beta]$ and is bounded by $h$. Draw:
\[ \nu_1 \sim U(\alpha, \beta), \qquad \nu_2 \sim U(0, h), \]
and take $x = \nu_1$ if $\nu_2 \leq f(\nu_1)$; otherwise draw again.
The general accept-reject algorithm, with blanket density $g(x)$ and constant $M$ such that $f(x) \leq M g(x)$:
1. Draw $x_0 \sim g(x)$.
2. Take $x = x_0$ w.p. $\frac{f(x_0)}{g(x_0)\, M}$; else go to 1.
(Allows for "blanket" functions $g(\cdot)$ more efficient than the uniform.)
Note that accept-reject requires that we be able to evaluate $f(x)$ and $g(x)$ for any $x$.
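A naive accept-reject sketch in R with a uniform blanket over a bounded target; the Beta(2,2) target $f(x) = 6x(1-x)$ on $[0,1]$, with height bound $h = 1.5$, is an illustrative choice, not the text's:

# Naive accept-reject: target f(x) = 6x(1-x) on [0,1], bounded by h = 1.5
draw_ar <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    v1 <- runif(n, 0, 1)                         # candidate from U(alpha, beta) = U(0, 1)
    v2 <- runif(n, 0, 1.5)                       # uniform vertical coordinate on (0, h)
    out <- c(out, v1[v2 <= 6 * v1 * (1 - v1)])   # accept if the point lies under the density
  }
  out[1:n]
}
set.seed(1)
x <- draw_ar(1e4)
hist(x, breaks = 40, freq = FALSE); curve(6 * x * (1 - x), add = TRUE)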
Mixtures
On any draw $i$,
\[ x \sim f_i(x), \quad \text{w.p. } p_i, \]
where
\[ 0 \leq p_i \leq 1\ \forall i, \qquad \sum_{i=1}^{N} p_i = 1. \]
For example, all of the $f_i$ could be uniform, but with different location and scale.
3. Parametric II: Approximate realization via arbitrary startup value with early realiza-
tion discarded
4. Parametric III: Exact realization via drawing startup values from unconditional den-
sity
7.4 MORE
Slice Sampling
Copulas and Sampling From a General Joint Density
Monte Carlo
Key: Solve deterministic problems by simulating stochastic analogs, with the analytical
unknowns reformulated as parameters to be estimated.
Many important discoveries made by Monte Carlo.
Also, numerous mistakes avoided by Monte Carlo!
The pieces:
(I) Experimental Design
(II) Simulation (including variance reduction techniques)
(III) Analysis: Response surfaces (which also reduce variance)
• Objective, e.g., MSE of an estimator:
π = g(θ, T)
• Number of Monte Carlo replications N, e.g. when estimating a rejection probability α:
Normal approximation: α̂ ∼ᵃ N( α, α(1 − α)/N )
P( α ∈ [ α̂ ± 1.96 √( α(1 − α)/N ) ] ) = .95
Strategy 1 (use a prior guess α0): choose N so that the interval width is .01,
2 × 1.96 √( α0(1 − α0)/N ) = .01
If α0 = .05, N = 7299.
Strategy 2 (use α = 1/2 = argmaxα [α(1 − α)]; conservative):
2 × 1.96 √( (1/2)(1/2)/N ) = .01  ⇒  N = 38416
Strategy 3 (use α = α̂; the obvious strategy)
7.6.2 Simulation
(II) Simulation
Running example: Monte Carlo integration
Definite integral: θ = ∫₀¹ m(x) dx
Key insight:
θ = ∫₀¹ m(x) dx = E(m(x)),   x ∼ U(0, 1)
Notation:
θ = E[m(x)],   σ² = var(m(x))
Direct Simulation:
Arbitrary Function, Uniform Density
Generate N U(0,1) deviates xi, i = 1, ..., N
Form the N deviates mi = m(xi), i = 1, ..., N
θ̂ = (1/N) Σ_{i=1}^N mi
√N (θ̂ − θ) →d N(0, σ²)
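A small Python illustration of direct simulation under a uniform density; the integrand m(x) = exp(sin(x)) is an arbitrary illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
m = lambda x: np.exp(np.sin(x))        # illustrative integrand on [0,1]

N = 100_000
x = rng.uniform(size=N)                # x_i ~ U(0,1)
mi = m(x)
theta_hat = mi.mean()                  # (1/N) sum of m(x_i)
se = mi.std(ddof=1) / np.sqrt(N)       # Monte Carlo standard error
print(theta_hat, se)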
Arbitrary function, general density:
θ = E(m(x)) = ∫ m(x) f(x) dx
√N (θ̂ − θ) →d N(0, σ²)
Mean of a general density:
θ = E(x) = ∫ x f(x) dx
√N (θ̂ − θ) →d N(0, σ²)
With draws taken from an importance density g rather than f:
√N (θ̂* − θ) →d N(0, σ*²)
f(y) = ∫ f(y|x) f(x) dx

f̂(y) = (1/N) Σ_{i=1}^N f(y|xi),   xi ∼ f(x)
So importance sampling replaces a simple average of f (y|xi ) based on initial draws from
f (x) with a weighted average of f (y|xi ) based on initial draws from g(x), where the weights
wi reflect the relative heights of f (xi ) and g(xi ).
Indirect Simulation
“Variance-Reduction Techniques”
(“Swindles”)
Importance Sampling to Achieve Variance Reduction
Again we use:
θ = ∫ x [ f(x)/g(x) ] g(x) dx,
and again we arrive at
√N (θ̂* − θ) →d N(0, σ*²)
Key: pick g(x) s.t. x f(x)/g(x) has small variance
Importance Sampling Example
Let x ∼ N (0, 1), and estimate the mean of I(x > 1.96):
θ = E(I(x > 1.96)) = P(x > 1.96) = ∫ I(x > 1.96) φ(x) dx,
with m(x) = I(x > 1.96) and f(x) = φ(x).

θ̂ = Σ_{i=1}^N I(xi > 1.96) / N   (with variance σ²)

Importance density: g(x) = N(1.96, 1)

P(x > 1.96) = ∫ [ I(x > 1.96) φ(x) / g(x) ] g(x) dx

σ*² / σ² ≈ 0.06
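A Python sketch of this example, using the N(1.96, 1) importance density given above; the printed variance ratio should be close to the 0.06 reported:

import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
N = 100_000
phi = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)   # standard normal density

x = rng.standard_normal(N)                 # direct: x ~ N(0,1)
direct = (x > 1.96).astype(float)

y = rng.normal(1.96, 1.0, size=N)          # importance: y ~ N(1.96, 1)
w = phi(y) / phi(y - 1.96)                 # weight f(y)/g(y)
imp = (y > 1.96) * w

truth = 0.5 * erfc(1.96 / sqrt(2.0))
print(direct.mean(), imp.mean(), truth)            # all near 0.025
print(imp.var(ddof=1) / direct.var(ddof=1))        # roughly 0.06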
Antithetic Variates
We average negatively correlated unbiased estimators of θ (unbiasedness maintained, variance reduced).
The key: if x is symmetrically distributed about µ, then µ + (x − µ) and µ − (x − µ) are equally likely.
e.g., if x ∼ U(0, 1), so too is (1 − x)
e.g., if x ∼ N(0, v), so too is −x
Consider for example the case of zero-mean symmetric f (x)
θ = ∫ m(x) f(x) dx

Direct: θ̂ = (1/N) Σ_{i=1}^N mi   (θ̂ is based on xi, i = 1, ..., N)

Antithetic: θ̂* = (1/2) θ̂(x) + (1/2) θ̂(−x)
(θ̂(x) is based on xi, i = 1, ..., N/2, and θ̂(−x) is based on −xi, i = 1, ..., N/2)
Antithetic Variates, Cont’d
More concisely,
θ̂* = (2/N) Σ_{i=1}^{N/2} ki(xi)
where
ki = (1/2) m(xi) + (1/2) m(−xi)
√N (θ̂* − θ) →d N(0, σ*²)
σ*² = (1/4) var(m(x)) + (1/4) var(m(−x)) + (1/2) cov(m(x), m(−x)),
where cov(m(x), m(−x)) < 0 for m monotone increasing.
Often σ*² ≪ σ²
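A Python sketch of antithetic variates for a zero-mean symmetric density; the integrand m(x) = eˣ is an illustrative monotone-increasing choice, so the covariance term above is negative:

import numpy as np

rng = np.random.default_rng(0)
m = lambda x: np.exp(x)
N = 100_000

x = rng.standard_normal(N)                  # direct: N draws
var_direct = m(x).var(ddof=1) / N           # variance of the direct estimator

xa = rng.standard_normal(N // 2)            # antithetic: N/2 draws, each reused as -x
k = 0.5 * (m(xa) + m(-xa))                  # k_i = (m(x_i) + m(-x_i))/2
var_anti = k.var(ddof=1) / (N // 2)         # variance of the antithetic estimator

print(m(x).mean(), k.mean())                # both estimate E[e^x] = e^{1/2}
print(var_direct / var_anti)                # > 1: variance reduced for the same N evaluations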
θ = ∫ m(x) f(x) dx = ∫ g(x) f(x) dx + ∫ [m(x) − g(x)] f(x) dx
Control function g(x) simple enough to integrate analytically and flexible enough to absorb
most of the variation in m(x).
We just find the mean of m(x)−g(x), where g(x) has known mean and is highly correlated
with m(x).
Control Variates
θ̂ = ∫ g(x) dx + (1/N) Σ_{i=1}^N [ m(xi) − g(xi) ]
√N (θ̂ − θ) →d N(0, σ*²)
Related method (conditioning): Find the mean of E(z|w) rather than the mean of z. The
two are of course the same (the mean conditional mean is the unconditional mean), but
var(E[z|w]) ≤ var(z).
Control Variate Example
θ = ∫₀¹ eˣ dx

θ̂direct = (1/N) Σ_{i=1}^N e^{xi}

θ̂CV = 1.85 + (1/N) Σ_{i=1}^N [ e^{xi} − (1 + 1.7 xi) ]

var(θ̂direct) / var(θ̂CV) ≈ 78
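A Python sketch of the eˣ example above, using the control g(x) = 1 + 1.7x with known mean 1.85 (the exact variance ratio obtained will depend on the draws):

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.uniform(size=N)

direct = np.exp(x)                              # direct Monte Carlo terms
cv = 1.85 + (np.exp(x) - (1.0 + 1.7 * x))       # control-variate terms

print(direct.mean(), cv.mean(), np.exp(1) - 1)  # both estimate e - 1
print(direct.var(ddof=1) / cv.var(ddof=1))      # large variance reduction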
Common Random Numbers
We have discussed estimation of a single integral:
∫₀¹ f1(x) dx
But interest often centers on the difference (or ratio) of two integrals:
∫₀¹ f1(x) dx − ∫₀¹ f2(x) dx
The key: Evaluate each integral using the same random numbers.
Common Random Numbers in Estimator Comparisons
Two estimators θ̂, θ̃ ; true parameter θ0
Compare MSEs: E(θ̂ − θ0)², E(θ̃ − θ0)²
Expected difference: E[ (θ̂ − θ0)² − (θ̃ − θ0)² ]
Estimate:
(1/N) Σ_{i=1}^N [ (θ̂i − θ0)² − (θ̃i − θ0)² ]
Variance of estimate:
(1/N) var( (θ̂ − θ0)² ) + (1/N) var( (θ̃ − θ0)² ) − (2/N) cov( (θ̂ − θ0)², (θ̃ − θ0)² )
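A Python sketch of common random numbers in an estimator comparison; the mean-versus-median MSE comparison for a Gaussian sample is an illustrative choice, not from the text:

import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 20_000
theta0 = 0.0                                      # true mean

x = rng.standard_normal((N, T)) + theta0          # the SAME samples feed both estimators
mean_err2 = (x.mean(axis=1) - theta0) ** 2        # squared error of the sample mean
med_err2 = (np.median(x, axis=1) - theta0) ** 2   # squared error of the sample median

diff = mean_err2 - med_err2
print(diff.mean())                                # estimated MSE difference
print(diff.std(ddof=1) / np.sqrt(N))              # its MC standard error, shrunk by the positive covariance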
Extensions...
α̂ = α + ε = g(T) + ε
ε ∼ N( 0, g(T)(1 − g(T))/N )
Note the heteroskedasticity: the variance of ε changes with T.
Example: Assessing Finite-Sample Test Size
Enforce analytically known structure on α̂.
Common approach:
α̂ = α0 + T^{−1/2} ( c0 + Σ_{i=1}^p ci T^{−i/2} ) + ε
α0 is the nominal size, which obtains as T → ∞. The second term is the vanishing size distortion.
Response surface regression:
(α̂ − α0) → T^{−1/2}, T^{−1}, T^{−3/2}, ...
7.7.1 GMM
The GMM estimator minimizes the quadratic form d(θ)′ W d(θ) for a positive definite weight matrix W, where

d(θ) = ( m1(θ) − m̂1, m2(θ) − m̂2, ..., mr(θ) − m̂r )′.

The mi(θ) are model moments and the m̂i are data moments.
MM: k = r and the mi(θ) calculated analytically
GMM: k < r and the mi(θ) calculated analytically
• Model moments for GMM may also be unavailable (i.e., analytically intractable)
• MLE efficiency lost may be a small price for SMM tractability gained.
• Under correct specification any consistent estimator (e.g., MLE or GMM/SMM) takes
you to the right place asymptotically, and MLE has the extra benefit of efficiency.
• Under misspecification, consistency becomes an issue, quite apart from the secondary
issue of efficiency. Best DGP approximation for one purpose may be very different
from best for another.
• In contrast, pseudo-MLE ties your hands. Gaussian pseudo-MLE, for example, is con-
sistent for the KLIC-optimal approximation (1-step-ahead mean-squared prediction
error).
• The bottom line: under misspecification MLE may not be consistent for what you
want, whereas by construction GMM is consistent for what you want (once you decide
what you want).
Indirect inference minimizes an analogous quadratic form d(θ)′ W d(θ), where

d(θ) = ( β̂1(θ) − β̂1, β̂2(θ) − β̂2, ..., β̂d(θ) − β̂d )′.

The β̂i(θ) are estimated parameters of an auxiliary model fit to data simulated from the model at θ, and the β̂i are the corresponding estimates from the observed data.
x̄T = (1/T) Σ_{t=1}^T xt,   σ²(x) = E(x − µ)²

uα solves P( (x̄T − µ) / (σ/√T) ≤ uα ) = α

I = [ x̄T − û(1+α)/2 σ̂(x)/√T ,  x̄T − û(1−α)/2 σ̂(x)/√T ]

σ̂²(x) = (1/(T − 1)) Σ_{t=1}^T (xt − x̄T)²

Root c.d.f.:  H(z) = P( (x̄T − µ) / (σ/√T) ≤ z )
1. Draw {xt⁽ʲ⁾}_{t=1}^T with replacement from {xt}_{t=1}^T

2. Compute ( x̄T⁽ʲ⁾ − x̄T ) / ( σ̂(x)/√T )

3. Repeat many times and build up the sampling distribution of ( x̄T⁽ʲ⁾ − x̄T ) / ( σ̂(x)/√T ), which is an approximation to the distribution of ( x̄T − µ ) / ( σ/√T )

Ĥ(z) = P( ( x̄T⁽ʲ⁾ − x̄T ) / ( σ̂(x)/√T ) ≤ z )

Î = [ x̄T − û(1+α)/2 σ̂(x)/√T ,  x̄T − û(1−α)/2 σ̂(x)/√T ]

where P( ( x̄T⁽ʲ⁾ − x̄T ) / ( σ̂(x)/√T ) ≤ ûα ) = Ĥ(ûα) = α
“Percentile-t” Bootstrap
S = ( x̄T − µ ) / ( σ̂(x)/√T )

H(z) = P( ( x̄T − µ ) / ( σ̂(x)/√T ) ≤ z )

Ĥ(z) = P( ( x̄T⁽ʲ⁾ − x̄T ) / ( σ̂(x⁽ʲ⁾)/√T ) ≤ z )

Î = [ x̄T − û(1+α)/2 σ̂(x)/√T ,  x̄T − û(1−α)/2 σ̂(x)/√T ]

where P( ( x̄T⁽ʲ⁾ − x̄T ) / ( σ̂(x⁽ʲ⁾)/√T ) ≤ ûα ) = α
Bootstrap-world root:
S* →d D*   (as T, N → ∞)
3. Monte Carlo indicates that bootstrap often does very well in finite samples (not un-
related to 2, but does not require 2)
2. Draw {xt⁽ʲ⁾}_{t=1}^T with replacement from {xt}_{t=1}^T
Issues:
1. Inappropriate standardization of S for dynamic data. So replace σ̂²(x) with 2π f̂x*(0), where f̂x*(0) is a consistent estimator of the spectral density of x at frequency 0 (i.e., use a long-run variance estimator).
2. Inappropriate to draw {xt⁽ʲ⁾}_{t=1}^T with replacement for dynamic data. What to do?

xt = c + φ xt−1 + εt,  εt ∼ iid

1. Regress xt → (c, xt−1) to get ĉ and φ̂, and save the residuals, {et}_{t=1}^T
2. Draw {εt⁽ʲ⁾}_{t=1}^T with replacement from {et}_{t=1}^T
3. Draw x0⁽ʲ⁾ from {xt}_{t=1}^T
4. Generate xt⁽ʲ⁾ = ĉ + φ̂ xt−1⁽ʲ⁾ + εt⁽ʲ⁾, t = 1, ..., T
5. Regress xt⁽ʲ⁾ → (c, xt−1⁽ʲ⁾) to get ĉ⁽ʲ⁾ and φ̂⁽ʲ⁾, associated t-statistics, etc.
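A Python sketch of steps 1-5 above, applied to a simulated AR(1) with illustrative parameters (c = 0.5, φ = 0.8):

import numpy as np

def ar1_bootstrap(x, n_boot=999, seed=0):
    """Residual-based AR(1) bootstrap: fit by OLS, resample residuals, rebuild, re-estimate."""
    rng = np.random.default_rng(seed)
    T = len(x)
    X = np.column_stack([np.ones(T - 1), x[:-1]])
    c_hat, phi_hat = np.linalg.lstsq(X, x[1:], rcond=None)[0]
    e = x[1:] - X @ np.array([c_hat, phi_hat])        # saved residuals

    phis = np.empty(n_boot)
    for j in range(n_boot):
        eps = rng.choice(e, size=T, replace=True)     # resample residuals
        xb = np.empty(T)
        xb[0] = rng.choice(x)                         # startup value drawn from the data
        for t in range(1, T):
            xb[t] = c_hat + phi_hat * xb[t - 1] + eps[t]
        Xb = np.column_stack([np.ones(T - 1), xb[:-1]])
        phis[j] = np.linalg.lstsq(Xb, xb[1:], rcond=None)[0][1]
    return phi_hat, phis

rng = np.random.default_rng(1)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.5 + 0.8 * x[t - 1] + rng.standard_normal()
phi_hat, phis = ar1_bootstrap(x)
print(phi_hat, np.percentile(phis, [2.5, 97.5]))      # bootstrap distribution of phi-hat(j)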
at+1/t = T at/t−1 + T Kt vt
yt = Z at/t−1 + vt

1. Estimate the system parameters θ. (We will soon see how to do this.)
2. At the estimated parameter values θ̂, run the Kalman filter to get the corresponding 1-step-ahead prediction errors v̂t ∼ (0, F̂t) and standardize them to ût = Ω̂t⁻¹ᐟ² v̂t ∼ (0, I), where Ω̂t Ω̂t′ = F̂t.
7.9.1 Local
Using MCMC for MLE (and Other Extremum Estimators)
Chernozhukov and Hong show how to compute extremum estimators as the mean of a pseudo-posterior distribution, which can be simulated by MCMC and estimated at the parametric rate 1/√N, in contrast to the much slower nonparametric rates achievable (by any method) by the standard posterior-mode extremum estimator.
7.9.2 Global
Summary of Local Optimization:
Simulated Annealing
(Illustrated Here for a Discrete Parameter Space)
Framework:
1. A set Θ, and a real-valued function lnL (satisfying regularity conditions) defined on Θ. Let Θ∗ ⊂ Θ
be the set of global maxima of lnL
2. ∀θ(m) ∈ Θ, a set N (θ(m) ) ⊂ Θ − θ(m) , the set of neighbors of θ(m)
3. A nonincreasing function, T (m) : N → (0, ∞) (“the cooling schedule”), where T (m) is the “temper-
ature” at iteration m
4. An initial guess, θ(0) ∈ Θ
θ ∈ Θ ⊂ Rk
lnL(θ) is continuous
lnL(θ∗ ) is the unique finite global max of lnL(θ), θ ∈ Θ
H(θ∗ ) exists and is nonsingular
lnL(θ̂) is a local max
Develop statistical inference for θ∗
Draw {θi}_{i=1}^N uniformly from Θ and form {lnL(θi)}_{i=1}^N
1. Convex relaxation.
Our approaches to global optimization involved attacking a nasty objective function with methods
involving clever randomization. Alternatively, one can approximate the nasty objective with a friendly
(convex) objective, which hopefully has the same global optimum. This is called “convex relaxation,”
and when the two optima coincide we say that the relaxation is “tight.”
7.12 NOTES
Chapter Eight
Bayesian Time Series Posterior Analysis by Markov Chain Monte Carlo
Overarching Paradigm (T → ∞)
√T (θ̂ − θ) ∼ N(0, Σ)
Shared by classical and Bayesian, but interpretations differ.
Classical: θ̂ random, θ fixed
Bayesian: θ̂ fixed, θ random
Classical: Characterize the distribution of the random data (θ̂) conditional on fixed "true" θ. Focus on the likelihood max (θ̂ML) and the likelihood curvature in an ε-neighborhood of the max.
Bayesian: Characterize the distribution of the random θ conditional on fixed "true" data (θ̂). Examine the entire likelihood.
Bayesian Computational Mechanics
Data y ≡ {y1 , . . . , yT }
Bayes’ Theorem:
f(θ/y) = f(y/θ) f(θ) / f(y)
or
f(θ/y) = c f(y/θ) f(θ),
where c⁻¹ = ∫ f(y/θ) f(θ) dθ
√T (θ̂ML − θ) ∼ N(0, Σ)
(Enough said.)
Bayesian Paradigm (T → ∞)
(Note that as T → ∞, p(θ/y) ≈ L(θ/y),
so the likelihood below can be viewed as the posterior.)
Model comparison:
p(Mi|y) / p(Mj|y) = [ p(y|Mi) / p(y|Mj) ] × [ p(Mi) / p(Mj) ]
(posterior odds = Bayes factor × prior odds)
P(y) = P(y1, ..., yT) = ∏_{t=1}^T P(yt | y_{1:t−1})

⇒ ln P(y) = Σ_{t=1}^T ln P(yt | y_{1:t−1}) = Σ_{t=1}^T ln ∫ P(yt | θ, y_{1:t−1}) P(θ | y_{1:t−1}) dθ
Bayesian model averaging:
Weight by posterior model probabilities:
As T → ∞, the distinction between model averaging and selection vanishes, as one π goes to 0 and the
other goes to 1.
If one of the models is true, then both model selection and model averaging are consistent for the true
model. Otherwise they’re consistent for the X-optimal approximation to the truth. Does X = KLIC?
Metropolis-Hastings
We want to draw S values of θ from p(θ). Initialize chain at θ(0) and burn it in.
1. Draw θ∗ from proposal density q(θ; θ(s−1) )
2. Calculate the acceptance probability α(θ(s−1) , θ∗ )
3. Set
θ⁽ˢ⁾ = θ*  w.p. α(θ⁽ˢ⁻¹⁾, θ*)   ("accept")
θ⁽ˢ⁾ = θ⁽ˢ⁻¹⁾  w.p. 1 − α(θ⁽ˢ⁻¹⁾, θ*)   ("reject")

For a symmetric random-walk proposal,
θ* = θ⁽ˢ⁻¹⁾ + ε,
the acceptance probability reduces to:
α(θ⁽ˢ⁻¹⁾, θ*) = min[ p(θ = θ*) / p(θ = θ⁽ˢ⁻¹⁾), 1 ]
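A Python sketch of random-walk Metropolis with a symmetric Gaussian proposal; the N(3, 1) target and the step size are illustrative choices:

import numpy as np

def rw_metropolis(log_p, theta0, S=20_000, step=0.5, seed=0):
    """Random-walk Metropolis: with a symmetric proposal the acceptance probability
    reduces to min{p(theta*)/p(theta^(s-1)), 1}."""
    rng = np.random.default_rng(seed)
    draws = np.empty(S)
    theta, lp = theta0, log_p(theta0)
    for s in range(S):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # accept / reject
            theta, lp = prop, lp_prop
        draws[s] = theta
    return draws

log_p = lambda th: -0.5 * (th - 3.0) ** 2            # target: N(3,1), up to a constant
draws = rw_metropolis(log_p, theta0=0.0)
burned = draws[5_000:]                               # discard burn-in
print(burned.mean(), burned.std())                   # roughly 3 and 1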
8.3.3 More
Burn-in, Sampling, and Dependence
“total simulation” = “burn-in” + “sampling”
Questions:
How to assess convergence to steady state?
In the Markov chain case, why not do something like the following. Whenever time t is a multiple of m,
use a distribution-free non-parametric (randomization) test for equality of distributions to test whether the
unknown distribution f1 of xt , ..., xt−(m/2) equals the unknown distribution f2 of xt−(m/2)+1 , ..., xt−m . If,
for example, we pick m = 20, 000, then whenever time t is a multiple of 20,000 we would test equality of the
distributions of xt , ..., xt−10000 and xt−10001 , ..., xt−20000 . We declare arrival at the steady state when the
null is not rejected. Or something like that.
Of course the Markov chain is serially correlated, but who cares, as we’re only trying to assess equality of
unconditional distributions. That is, randomizations of xt , ..., xt−(m/2) and of xt−(m/2)+1 , ..., xt−m destroy
the serial correlation, but so what?
How to handle dependence in the sampled chain?
Better to run one long chain or many shorter parallel chains?
Gibbs iteration j = 1:
a. Draw z11 from f (z1 |z20 , ..., zk0 )
b. Draw z21 from f (z2 |z11 , z30 , ..., zk0 )
c. Draw z31 from f (z3 |z11 , z21 , z40 , ..., zk0 )
...
k. Draw zk¹ from f(zk | z1¹, ..., z_{k−1}¹)
Repeat j = 2, 3, ....
Again, limj→∞ f (z j ) = f (z)
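A Python sketch of a Gibbs sampler for an illustrative target whose conditionals are known in closed form (a bivariate normal with correlation ρ); this is a toy stand-in for the general z1, ..., zk scheme above:

import numpy as np

def gibbs_bivariate_normal(rho, S=20_000, seed=0):
    """Gibbs sampler for a bivariate normal with zero means, unit variances, correlation rho:
    each full conditional z_i | z_j is N(rho*z_j, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0                       # arbitrary initialization z^(0)
    out = np.empty((S, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for j in range(S):
        z1 = rng.normal(rho * z2, sd)       # draw z1 | z2
        z2 = rng.normal(rho * z1, sd)       # draw z2 | z1 (using the updated z1)
        out[j] = (z1, z2)
    return out

z = gibbs_bivariate_normal(rho=0.8)[2_000:]   # discard burn-in
print(np.corrcoef(z.T)[0, 1])                 # close to 0.8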
Metropolis Within Gibbs
Gibbs breaks a big draw into lots of little (conditional) steps. If you’re lucky, those little steps are simple.
If/when a Gibbs step is difficult, i.e., it’s not clear how to sample from the relevant conditional, it can be
done by Metropolis.
(”Metropolis within Gibbs”)
Metropolis is more general but also more tedious, so only use it when you must.
Composition
We may want (x1 , y1 ), ..., (xN , yN ) ∼ iid from f (x, y)
Or we may want y1 , ..., yN ∼ iid from f (y)
They may be hard to sample from directly.
But sometimes it’s easy to:
Draw x∗ ∼ f (x)
Draw y ∗ ∼ f (y|x∗ )
Then:
(x1 , y1 ), ..., (xN , yN ) ∼ iid f (x, y)
(y1 , ..., yN ) ∼ iid f (y)
T σ̂²ML / σ² ∼ χ²_{T−K}
Bayesian Inference for β/σ²
Prior:
β/σ² ∼ N(β0, Σ0)
g(β/σ²) ∝ exp( −(1/2)(β − β0)′ Σ0⁻¹ (β − β0) )
Likelihood:
L(β/σ², y) ∝ exp( −(1/(2σ²)) (y − Xβ)′(y − Xβ) )
Posterior:
p(β/σ², y) ∝ exp( −(1/2)(β − β0)′ Σ0⁻¹ (β − β0) − (1/(2σ²)) (y − Xβ)′(y − Xβ) )
This is the kernel of a normal distribution (*Problem*):
β/σ², y ∼ N(β1, Σ1)
where
β1 = ( Σ0⁻¹ + σ⁻² X′X )⁻¹ ( Σ0⁻¹ β0 + σ⁻² X′X β̂ML )
Σ1 = ( Σ0⁻¹ + σ⁻² X′X )⁻¹
(Independent of β, but write σ²/β for completeness.)

L(σ²/β, y) ∝ (σ²)^{−T/2} exp( −(1/(2σ²)) (y − Xβ)′(y − Xβ) )

(*Problem*: In contrast to L(β/σ², y) earlier, we don't absorb the (σ²)^{−T/2} term into the constant of proportionality. Why?)

Hence (*Problem*):
p(1/σ² | β, y) ∝ (1/σ²)^{(v1/2)−1} exp( −δ1 / (2σ²) )
or 1/σ² | β, y ∼ Γ(v1/2, δ1/2)
v1 = v0 + T
δ1 = δ0 + (y − Xβ)′(y − Xβ)
Bayesian Pros Thus Far
1. Feels sensible to focus on p(θ/y). Classical relative frequency in repeated samples replaced with
subjective degree of belief conditional on the single sample actually obtained
2. Exact finite-sample full-density inference
Bayesian Cons Thus Far
1. From where does the prior come? How to elicit prior distributions?
2. How to do an “objective” analysis?
(e.g. what is an “uninformative” prior? Uniform?)
(Note, however, that priors can be desirable and helpful. See, for example, the cartoon at http://fxdiebold.blogspot.com/2014/04/more-from-xkcdcom.html)
3. We still don’t have the marginal posteriors that we really want: p(β, σ 2 /y), p(β/y).
– Problematic in any event!
αt = T αt−1 + Rηt
yt = Zαt + εt
(ηt′, εt′)′ ∼ iid N( 0, diag(Q, H) )
Let α̃T = (α01 , . . . , α0T )0 , θ = (T 0 , R0 , Z 0 , Q0 , H 0 )0
The key: Treat α̃T as a parameter, along with system matrices θ
Recall the State-Space Model in Density Form
αt | αt−1 ∼ N(T αt−1, RQR′)
yt | αt ∼ N(Z αt, H)
Recall the Kalman Filter in Density Form
Initialize at a0 , P0
State prediction:
αt |ỹt−1 ∼ N (at/t−1 , Pt/t−1 )
at/t−1 = T at−1
Pt/t−1 = T Pt−1 T 0 + RQR0
State update:
αt |ỹt ∼ N (at , Pt )
at = at/t−1 + Kt (yt − Zat/t−1 )
Pt = Pt/t−1 − Kt ZPt/t−1
Data prediction:
yt |ỹt−1 ∼ N (Zat/t−1 , Ft )
where ỹt = (y10 , ..., yt0 )0
Carter-Kohn Multi-move Gibbs Sampler
Let ỹT = (y1′, ..., yT′)′
0. Initialize θ(0)
Gibbs sampler at generic iteration j:
j1. Draw from the posterior α̃T⁽ʲ⁾ / θ⁽ʲ⁻¹⁾, ỹT   ("hard")
j2. Draw from the posterior θ⁽ʲ⁾ / α̃T⁽ʲ⁾, ỹT   ("easy")
Iterate to convergence, and then estimate posterior moments of interest
Just two Gibbs draws: (1) the α̃T parameter, (2) the θ parameter

Multimove Gibbs Sampler, Step 2 (θ⁽ʲ⁾ | α̃T⁽ʲ⁾, ỹT)   ("easy")
Conditional upon the draw α̃T⁽ʲ⁾, sampling θ⁽ʲ⁾ becomes a multivariate regression problem.
We have already seen how to do univariate regression. We can easily extend to multivariate regression.
The Gibbs sampler continues to work.
Multivariate Regression
Y = X B + E,
where Y is T × n, X is T × k, B is k × n, and E is T × n, with
εt ∼ iid N(0, Σ)
vec(B)|Σ ∼ N (B0 , Σ0 )
Inverse Wishart refresher (multivariate inverse gamma):
X ∼ W⁻¹(n, V) ⟺ X⁻¹ ∼ W(n, V)
where
W(X; n, V) ∝ |X|^{(n−p−1)/2} exp( −(1/2) tr(X V⁻¹) )
Σ|B inverse Wishart prior:
p(Σ⁻¹ | vec(B)) ∝ |Σ⁻¹|^{(n−p−1)/2} exp( −(1/2) tr(Σ⁻¹ V⁻¹) )
2
Bayesian Inference for B|Σ
Prior:
p(vec(B)|Σ) ∝ exp( −(1/2) vec(B − B0)′ V0⁻¹ vec(B − B0) )
Likelihood:
p(Y, X|B, Σ) ∝ exp( −(1/2) Σ_{t=1}^T (Yt − B′Xt)′ Σ⁻¹ (Yt − B′Xt) )
∝ exp( −(1/2) tr[ Σ⁻¹ (Y − XB)′(Y − XB) ] )
∝ exp( −(1/2) vec(B − B̂)′ (Σ⁻¹ ⊗ X′X) vec(B − B̂) )
Posterior:
p(vec(B)|Σ, Y) ∝ exp( −(1/2) [ vec(B − B̂)′ (Σ⁻¹ ⊗ X′X) vec(B − B̂) + vec(B − B0)′ V0⁻¹ vec(B − B0) ] )
This is the kernel of a multivariate normal distribution:
vec(B)|Σ, Y ∼ N(B1, V1)
vec(B1) = V1 [ (Σ⁻¹ ⊗ X′X) vec(B̂) + V0⁻¹ B0 ],   V1 = [ Σ⁻¹ ⊗ X′X + V0⁻¹ ]⁻¹
and B̂ = (X 0 X)−1 (X 0 Y )
Bayesian Inference for Σ|B
Prior:
p(Σ⁻¹ | vec(B)) ∝ |Σ⁻¹|^{(n−p−1)/2} exp( −(1/2) tr(Σ⁻¹ V⁻¹) )
Likelihood:
p(Y, X|B, Σ) ∝ |Σ|^{−T/2} exp( −(1/2) tr[ Σ⁻¹ (Y − XB)′(Y − XB) ] )
Posterior:
p(Σ⁻¹ | vec(B), Y) ∝ |Σ⁻¹|^{(T+n−p−1)/2} exp( −(1/2) tr[ Σ⁻¹ ( (Y − XB)′(Y − XB) + V⁻¹ ) ] )
in terms of β0 , Σ0 , δ0 , x, and y.
Moreover, the key marginal posterior
p(β/y) = ∫₀^∞ p(β, σ²/y) dσ²
is multivariate t.
8.8 NOTES
Chapter Nine
Random Walks
Random walk:
yt = yt−1 + εt
εt ∼ W N (0, σ 2 )
Random walk with drift:
yt = δ + yt−1 + εt
εt ∼ W N (0, σ 2 )
Properties of the Random Walk
yt = y0 + Σ_{i=1}^t εi
(shocks perfectly persistent)
E(yt ) = y0
var(yt ) = tσ 2
lim var(yt ) = ∞
t→∞
Properties of the Random Walk with Drift
yt = tδ + y0 + Σ_{i=1}^t εi
(shocks again perfectly persistent)
E(yt ) = y0 + tδ
var(yt ) = tσ 2
lim var(yt ) = ∞
t→∞
The Random Walk as a Building Block
Generalization of random walk: ARIM A(p, 1, q)
Beveridge-Nelson decomposition:
yt ∼ ARIM A(p, 1, q) ⇒ yt = xt + zt
xt = random walk
zt = covariance stationary
– So shocks to ARIM A(p, 1, q) are persistent, but not perfectly so.
Forecasting a Random Walk with Drift
xt = b + xt−1 + εt
εt ∼ W N (0, σ 2 )
Optimal forecast:
xT +h,T = bh + xT
Forecast does not revert to trend
Forecasting a Linear Trend + Stationary AR(1)
xt = a + bt + yt
yt = φyt−1 + εt
εt ∼ W N (0, σ 2 )
Optimal forecast:
xT +h,T = a + b(T + h) + φh yT
Forecast reverts to trend
Some Language...
“Random walk with drift” vs. “stat. AR(1) around linear trend”
“unit root” vs. “stationary root”
“Difference stationary” vs. “trend stationary”
“Stochastic trend” vs. “deterministic trend”
“I(1)” vs. “I(0)”
Stochastic Trend vs. Deterministic Trend
yt = yt−1 + εt
d
T (φ̂LS − 1) → DF
Superconsistent
Biased in finite samples (E(φ̂) < φ, ∀ φ ∈ (0, 1])
"Hurwicz bias," "Dickey-Fuller bias," "Nelson-Kang spurious periodicity"
Bigger as T falls, as φ → 1, and as intercept and trend are included
Non-Gaussian (skewed left)
DF tabulated by Monte Carlo
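A Python sketch of tabulating the Dickey-Fuller distribution of T(φ̂ − 1) by Monte Carlo under the driftless random-walk null; the sample size and replication count are illustrative:

import numpy as np

def df_distribution(T=250, n_rep=20_000, seed=0):
    """Simulate random walks under the null, re-estimate phi by OLS, collect T*(phi_hat - 1)."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_rep)
    for r in range(n_rep):
        y = np.cumsum(rng.standard_normal(T))        # y_t = y_{t-1} + eps_t, y_0 = 0
        ylag, ycur = y[:-1], y[1:]
        phi_hat = (ylag @ ycur) / (ylag @ ylag)       # no-intercept OLS
        stats[r] = T * (phi_hat - 1.0)
    return stats

stats = df_distribution()
print(np.percentile(stats, [1, 5, 10]))    # left-tail critical values; skewed left, non-Gaussian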
Studentized Version
τ̂ = (φ̂ − 1) / ( s √( 1 / Σ_{t=2}^T y²_{t−1} ) )
(yt − µ) = φ(yt−1 − µ) + εt
yt = α + φyt−1 + εt
where α = µ(1 − φ)
Random walk null vs. mean-reverting alternative
Studentized statistic τ̂µ
Deterministic Trend Under the Alternative
3. Construct yt
• τ̂ : yt = φyt−1 + et
• τ̂µ : yt = c + φyt−1 + et
• τ̂τ : yt = c + βt + φyt−1 + et
yt + Σ_{j=1}^p φj yt−j = εt

yt = ρ1 yt−1 + Σ_{j=2}^p ρj (yt−j+1 − yt−j) + εt
Studentized statistic τ̂
Allowing for Nonzero Mean Under the Alternative
(yt − µ) + Σ_{j=1}^p φj (yt−j − µ) = εt

yt = α + ρ1 yt−1 + Σ_{j=2}^p ρj (yt−j+1 − yt−j) + εt

(yt − a − bt) + Σ_{j=1}^p φj (yt−j − a − b(t − j)) = εt

yt = k1 + k2 t + ρ1 yt−1 + Σ_{j=2}^p ρj (yt−j+1 − yt−j) + εt

k1 = a (1 + Σ_{i=1}^p φi) − b Σ_{i=1}^p i φi

k2 = b (1 + Σ_{i=1}^p φi)

Under the null hypothesis, k1 = −b Σ_{i=1}^p i φi and k2 = 0
Studentized statistic τ̂τ
yt = ρ1 yt−1 + Σ_{j=2}^{k−1} ρj (yt−j+1 − yt−j) + εt

yt = α + ρ1 yt−1 + Σ_{j=2}^{k−1} ρj (yt−j+1 − yt−j) + εt

yt = k1 + k2 t + ρ1 yt−1 + Σ_{j=2}^{k−1} ρj (yt−j+1 − yt−j) + εt

(yt − yt−1) = (ρ1 − 1) yt−1 + Σ_{j=2}^{k−1} ρj (yt−j+1 − yt−j) + εt
yt = β xt + εt

t / √T →d RV   (t diverges)

β̂ / √T →d RV   (β̂ diverges)
Cointegration
Consider an N -dimensional variable x:
x ∼ CI (d, b) if
1. xi ∼ I(d), i = 1, . . . , N
2. ∃ one or more linear combinations zt = α′xt s.t. zt ∼ I(d − b), b > 0
Leading Case
x ∼ CI(1, 1) if
(1) xi ∼ I(1), i = 1, . . . , N
(2) ∃ 1 or more linear combinations
zt = α0 xt s.t. zt ∼ I(0)
Example
xt = xt−1 + vt , vt ∼ W N
yt = xt−1 + εt , εt ∼ W N, εt ⊥ vt−τ , ∀t, τ
⇒ (yt − xt ) = εt − vt = I(0)
Cointegration and “Attractor Sets”
xt is N -dimensional but does not wander randomly in RN
α0 xt is attracted to an (N − R)-dimensional subspace of RN
N : space dimension
R: number of cointegrating relationships
Attractor dimension = N − R
(“number of underlying unit roots”)
(“number of common trends”)
Example
3-dimensional V AR(p), all variables I(1)
R = 0 ⇔ no cointegration ⇔ x wanders throughout R3
R = 1 ⇔ 1 cointegrating vector ⇔ x attracted to a 2-Dim hyperplane in R3 given by α0 x = 0
R = 2 ⇔ 2 cointegrating vectors ⇔ x attracted to a 1-Dim hyperplane (line) in R3 given by intersection
of two 2-Dim hyperplanes, α01 x = 0 and α02 x = 0
R = 3 ⇔ 3 cointegrating vectors ⇔ x attracted to a 0-Dim hyperplane (point) in R3 given by the
intersection of three 2-Dim hyperplanes, α01 x = 0 , α02 x = 0 and α03 x = 0
(Covariance stationary around E(x))
Cointegration Motivation: Dynamic Factor Structure
Factor structure with I(1) factors
(N − R) I(1) factors driving N variables
e.g., single-factor model:
(y1t, ..., yN t)′ = (1, ..., 1)′ ft + (ε1t, ..., εN t)′
ft = ft−1 + ηt
R = (N − 1) cointegrating combs: (y2t − y1t ), ..., (yN t − y1t )
(N − R) = N − (N − 1) = 1 common trend
Cointegration Motivation: Optimal Forecasting
I(1) variables always co-integrated with their optimal forecasts
Example:
xt = xt−1 + εt
xt+h|t = xt
⇒ xt+h − xt+h|t = Σ_{i=1}^h εt+i
∆xt = −Π xt−1 + Σ_{i=1}^{p−1} Bi ∆xt−i + ut
Integration/Cointegration Status
• Rank(Π) = 0
0 cointegrating vectors, N underlying unit roots
(all variables appropriately specified in differences)
• Rank(Π) = N
N cointegrating vectors, 0 unit roots
(all variables appropriately specified in levels)
• Rank(Π) = R (0 < R < N )
R cointegrating vectors, N − R unit roots
New and important intermediate case
(not possible in univariate)
xt ∼ V ECM ⇔ xt ∼ CI(1, 1)
V ECM ⇐ Cointegration
We can always write
∆xt = Σ_{i=1}^{p−1} Bi ∆xt−i − Π xt−1 + ut
But under cointegration, rank(Π) = R < N, so
Π = γ α′,
where Π is N × N, γ is N × R, and α′ is R × N
⇒ ∆xt = Σ_{i=1}^{p−1} Bi ∆xt−i − γ α′ xt−1 + ut
       = Σ_{i=1}^{p−1} Bi ∆xt−i − γ zt−1 + ut
V ECM ⇒ Cointegration
∆xt = Σ_{i=1}^{p−1} Bi ∆xt−i − γ α′ xt−1 + ut
Premultiply by α′:
α′ ∆xt = α′ Σ_{i=1}^{p−1} Bi ∆xt−i − (α′γ) α′ xt−1 + α′ ut,
where α′γ is full rank.
So equation balance requires that α′xt−1 be stationary.
Stationary-Nonstationary Decomposition
M xt = ( α′ xt ; δ xt ),
where M (N × N) stacks α′ (R × N, the cointegrating combinations) on top of δ ((N − R) × N, the common trends), and the rows of δ are orthogonal to the columns of γ.
Intuition: transforming the system by δ yields
δ ∆xt = δ Σ_{i=1}^{p−1} Bi ∆xt−i − δγ α′ xt−1 + δ ut,
and δγ = 0 by orthogonality.
Π = ( 0 ; 1 ) ( −1  1 ) = γ α′

M = ( −1  1 ; 1  0 ) = ( α′ ; δ ),  with δ = (1, 0) orthogonal to γ = (0, 1)′

M ( x1t ; x2t ) = ( x2t − x1t ; x1t ) = ( u2t − u1t ; x1t )
Frequency domain, ω → 0
• I(1): f(ω) ∝ ω⁻²
ln f*(ω) = β0 + β1 ln ω + εt,   where β1 = −2d
GPH estimator of d: regress ln f*(ω) → const, ln ω
So take d̂ = −(1/2) β̂1. ("GPH estimator")
1. Applied modeling.
2. Aggregation.
Granger (1980) shows that aggregation of a very large number of stationary ARMA time series results, under regularity conditions (generalized in Robinson, 1991), in a fractionally-integrated process. Thus, aggregation of short-memory processes results in a long-memory process. Discuss this result in light of theorems on aggregation of ARMA processes. In particular, recall that aggregation of ARMA processes results in new ARMA processes, generally of higher order than the components.
9.9 NOTES
Chapter Ten
Volatility Dynamics
10.2 GARCH
Prologue: Reading
Much of what follows draws heavily upon:
• Andersen, T.G., Bollerslev, T., Christoffersen, P.F. and Diebold, F.X. (2012), "Financial Risk Measurement for Financial Risk Management," in G. Constantinides, M. Harris and R. Stulz (eds.), Handbook of the Economics of Finance, Elsevier.
• Andersen, T.G., Bollerslev, T. and Diebold, F.X. (2010), ”Parametric and Nonpara-
metric Volatility Measurement,” in L.P. Hansen and Y. Ait-Sahalia (eds.), Handbook
of Financial Econometrics. Amsterdam: North-Holland, 67-138.
• Andersen, T.G., Bollerslev, T., Christoffersen, P.F., and Diebold, F.X. (2006), ”Volatil-
ity and Correlation Forecasting,” in G. Elliott, C.W.J. Granger, and A. Timmermann
(eds.), Handbook of Economic Forecasting. Amsterdam: North-Holland, 778-878.
Prologue
• Aggregation level
– Portfolio-level (aggregated, univariate): Risk measurement
– Asset-level (disaggregated, multivariate): Risk management
• Risk management
• Portfolio allocation
• Asset pricing
• Hedging
• Trading
Risk Management
r ∼ (µ, Σ)
Portfolio returns:
rp = λ0 r ∼ (λ0 µ, λ0 Σλ)
min_w w′Σw  s.t.  w′µ = µp
Importantly, w∗ = f (Σ)
If Σ varies, we have wt∗ = f (Σt )
Asset Pricing I: Sharpe Ratios
Standard Sharpe:
E(rit − rf t) / σ
Conditional Sharpe:
E(rit − rf t) / σt
Asset Pricing II: CAPM
Standard CAPM:
(rit − rf t) = α + β (rmt − rf t)
β = cov( (rit − rf t), (rmt − rf t) ) / var( rmt − rf t )
Conditional CAPM:
βt = covt( (rit − rf t), (rmt − rf t) ) / vart( rmt − rf t )
Asset Pricing III: Derivatives
Black-Scholes:
PC = BS(σ, ...)
∆Ht = δ ∆St + ut
δ = cov(∆Ht, ∆St) / var(∆St)
• Dynamic hedging:
∆Ht = δt ∆St + ut
Trading
Some Warm-Up
Unconditional Volatility Measures
Variance: σ 2 = E(rt − µ)2 (or standard deviation: σ)
Mean Absolute Deviation: M AD = E|rt − µ|
Interquartile Range: IQR = 75% − 25%
[Figure: true probability (percent) plotted by day number, days 1-1000.]
p = PT( rT+1 ≤ −VaRᵖ_{T+1|T} ) = ∫_{−∞}^{−VaRᵖ_{T+1|T}} fT(rT+1) drT+1

ESᵖ_{T+1|T} = p⁻¹ ∫₀ᵖ VaRᵞ_{T+1|T} dγ
σt² = λ σ²_{t−1} + (1 − λ) r²_{t−1}

σt² = Σ_{j=0}^∞ ϕj r²_{t−1−j},   ϕj = (1 − λ) λʲ
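A Python sketch of the RiskMetrics-style recursion above; λ = 0.94 and the simulated placeholder returns are illustrative (replace them with actual daily returns):

import numpy as np

def ewma_variance(r, lam=0.94, sigma2_0=None):
    """Exponentially smoothed variance: sigma2_t = lam*sigma2_{t-1} + (1-lam)*r_{t-1}^2."""
    sigma2 = np.empty(len(r))
    sigma2[0] = np.var(r) if sigma2_0 is None else sigma2_0
    for t in range(1, len(r)):
        sigma2[t] = lam * sigma2[t - 1] + (1.0 - lam) * r[t - 1] ** 2
    return sigma2

rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(1000)      # placeholder return series
sig2 = ewma_variance(r)
print(np.sqrt(sig2[-1]))                   # latest smoothed volatility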
Rigorous Modeling I
Conditional Univariate Volatility Dynamics from "Daily" Data
Conditional Return Distributions
f (rt ) vs. f (rt |Ωt−1 )
Key 1: E(rt |Ωt−1 )
Are returns conditional mean independent? Arguably yes.
Returns are (arguably) approximately serially uncorrelated, and (arguably) approximately
free of additional non-linear conditional mean dependence.
Conditional Return Distributions, Continued Key 2: var(rt |Ωt−1 ) = E((rt − µ)2 |Ωt−1 )
Are returns conditional variance independent? No way!
Squared returns serially correlated, often with very slow decay.
The Standard Model
(Linearly Indeterministic Process with iid Innovations)
yt = Σ_{i=0}^∞ bi εt−i

ε ∼ iid(0, σε²),   Σ_{i=0}^∞ bi² < ∞,   b0 = 1

E(yt+k | Ωt) = Σ_{i=0}^∞ bk+i εt−i
rt |Ωt−1 ∼ N (0, ht )
ht = ω + α r²_{t−1}
E(rt) = 0
E( (rt − E(rt))² ) = ω / (1 − α)
E(rt | Ωt−1) = 0
E( [rt − E(rt|Ωt−1)]² | Ωt−1 ) = ω + α r²_{t−1}
GARCH(1,1) Process
“Generalized ARCH”
rt | Ωt−1 ∼ N (0, ht )
ht = ω + α r²_{t−1} + β ht−1
E(rt) = 0
E( (rt − E(rt))² ) = ω / (1 − α − β)
E(rt | Ωt−1) = 0
E( [rt − E(rt|Ωt−1)]² | Ωt−1 ) = ω + α r²_{t−1} + β ht−1
ln L(θ; rp+1, ..., rT) ≈ −((T − p)/2) ln(2π) − (1/2) Σ_{t=p+1}^T ln ht(θ) − (1/2) Σ_{t=p+1}^T r²t / ht(θ)
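A Python sketch of the Gaussian GARCH(1,1) log likelihood above, evaluated on a simulated path; the presample initialization at the sample variance and the parameter values are illustrative choices:

import numpy as np

def garch11_loglik(params, r):
    """Gaussian GARCH(1,1) log likelihood, conditional on a presample variance."""
    omega, alpha, beta = params
    T = len(r)
    h = np.empty(T)
    h[0] = r.var()                                   # simple presample initialization
    for t in range(1, T):
        h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
    return -0.5 * np.sum(np.log(2 * np.pi) + np.log(h) + r ** 2 / h)

rng = np.random.default_rng(0)
omega, alpha, beta = 0.05, 0.10, 0.85
T = 2000
h, r = np.empty(T), np.empty(T)
h[0] = omega / (1 - alpha - beta)                    # unconditional variance
r[0] = np.sqrt(h[0]) * rng.standard_normal()
for t in range(1, T):
    h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
    r[t] = np.sqrt(h[t]) * rng.standard_normal()
print(garch11_loglik((omega, alpha, beta), r))       # maximize over (omega, alpha, beta) to estimate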
where wj = γ (1 − γ)ʲ
But in GARCH(1,1) we have:
ht = ω + α r²_{t−1} + β ht−1
ht = ω/(1 − β) + α Σ_{j=1}^∞ βʲ⁻¹ r²_{t−j}
Variance Targeting
Sample unconditional variance:
σ̂² = (1/T) Σ_{t=1}^T rt²
ω = (1 − α − β) σ̂²
rt² = ω + (α + β) r²_{t−1} − β νt−1 + νt,
where νt = rt² − ht.
Variations on the GARCH Theme
Regression with GARCH Disturbances
yt = x0t β + εt
εt |Ωt−1 ∼ N (0, ht )
ht = ω + α r²_{t−1} + β ht−1 + γ′ zt
γ is a parameter vector;
z is a set of positive exogenous variables.
Asymmetric Response and the Leverage Effect I: TARCH
Standard GARCH: ht = ω + α r²_{t−1} + β ht−1
TARCH: ht = ω + α r²_{t−1} + γ r²_{t−1} Dt−1 + β ht−1
Dt = 1 if rt < 0; 0 otherwise
Positive return (good news): α effect on volatility
Negative return (bad news): α + γ effect on volatility
Asymmetric Response II: E-GARCH
ln(ht) = ω + α | rt−1 / h^{1/2}_{t−1} | + γ ( rt−1 / h^{1/2}_{t−1} ) + β ln(ht−1)
ht−1 ht−1
zt ∼ iid td / std(td)
yt = x0t β + εt
εt |Ωt−1 ∼ N (0, ht )
yt = x0t β + γht + εt
εt |Ωt−1 ∼ N (0, ht )
A GARCH(1,1) Example
After Exploring Lots of Possible Extensions...
Rigorous Modeling II
Conditional Univariate Volatility Dynamics from High-Frequency Data
Figure 10.10: Conditional Standard Deviation, History and Forecast, Daily NYSE Returns.
[EViews output: Dependent variable R; Method: ML-ARCH (Marquardt), Student's t distribution; Sample (adjusted): 2-3461, 3460 included observations; convergence after 19 iterations; presample variance: backcast (parameter = 0.7); variance equation: GARCH = C(4) + C(5)*RESID(-1)^2 + C(6)*RESID(-1)^2*(RESID(-1)<0) + C(7)*GARCH(-1).]
Figure 10.12: S&P500 Daily Returns and Volatilities (Percent). The top panel shows daily S&P500
returns, and the bottom panel shows daily S&P500 realized volatility. We compute realized volatility as the
square root of AvgRV , where AvgRV is the average of five daily RVs each computed from 5-minute squared
returns on a 1-minute grid of S&P500 futures prices.
RVt(∆) ≡ Σ_{j=1}^{N(∆)} ( pt−1+j∆ − pt−1+(j−1)∆ )²

RVt(∆) → IVt = ∫_{t−1}^t σ²(τ) dτ
Microstructure Noise
– State space signal extraction
– AvgRV
– Realized kernel
– Many others
RV is Persistent
RV is Reasonably Approximated as Log-Normal
RV is Long-Memory
Exact and Approximate Long Memory
Exact long memory:
(1 − L)ᵈ RVt = β0 + νt
Figure 10.13: S&P500: QQ Plots for Realized Volatility and Log Realized Volatility. The top
panel plots the quantiles of daily realized volatility against the corresponding normal quantiles. The bottom
panel plots the quantiles of the natural logarithm of daily realized volatility against the corresponding normal
quantiles. We compute realized volatility as the square root of AvgRV , where AvgRV is the average of five
daily RVs each computed from 5-minute squared returns on a 1-minute grid of S&P500 futures prices.
Even better:
RV-VaRᵖ_{T+1|T} = RV̂_{T+1|T} Φ⁻¹ₚ ,
GARCH-RV
σt² = ω + β σ²_{t−1} + γ RVt−1
– “Realized GARCH”
– “HEAVY”
Separating Jumps
Figure 10.14: S&P500: Sample Autocorrelations of Daily Realized Variance and Daily Re-
turn. The top panel shows realized variance autocorrelations, and the bottom panel shows return autocor-
relations, for displacements from 1 through 250 days. Horizontal lines denote 95% Bartlett bands. Realized
variance is AvgRV , the average of five daily RVs each computed from 5-minute squared returns on a 1-minute
grid of S&P500 futures prices.
where
JVt = Σ_{j=1}^{Jt} J²_{t,j}

• Truncation:
TVt(∆) = Σ_{j=1}^{N(∆)} ∆p²_{t−1+j∆} I( |∆pt−1+j∆| < T )

• Bi-Power Variation:
BPVt(∆) = (π/2) ( N(∆) / (N(∆) − 1) ) Σ_{j=1}^{N(∆)−1} |∆pt−1+j∆| |∆pt−1+(j+1)∆|

• Minimum:
MinRVt(∆) = ( π/(π − 2) ) ( N(∆) / (N(∆) − 1) ) Σ_{j=1}^{N(∆)−1} min( |∆pt−1+j∆|, |∆pt−1+(j+1)∆| )²
• Portfolio risk change under a certain scenario involving price movements of a set of assets or asset classes?
• How do optimal portfolio shares change if the covariance matrix moves in a certain
way?
Similarly, what about almost any other question in asset pricing, hedging, trading? Almost
all involve correlation.
Basic Framework and Issues I
N × 1 return vector Rt
N × N covariance matrix Ωt
• N(N + 1)/2 distinct elements
rt = σt zt
zt ∼ i.i.d.(0, 1)
Multivariate:
Rt = Ωt^{1/2} Zt
Zt ∼ i.i.d.(0, I)
where Ωt^{1/2} is a "square root" (e.g., the Cholesky factor) of Ωt
Ad Hoc Exponential Smoothing (RM)
Ωt = λ Ωt−1 + (1 − λ) Rt−1 R′t−1
• Assumes that the dynamics of all the variances and covariances are driven by a single
scalar parameter λ (identical smoothness)
• But covariance matrix forecasts inherit the implausible scaling properties of the uni-
variate RM forecasts and will in general be suboptimal
Multivariate GARCH(1,1)
vech(Ωt) = vech(C) + B vech(Ωt−1) + A vech(Rt−1 R′t−1)

vech(Ωt) = vech(C) + (Iβ) vech(Ωt−1) + (Iα) vech(Rt−1 R′t−1)

– Mirrors RM, but with the important difference that the Ωt forecasts now revert to Ω = (1 − α − β)⁻¹ C
– Fewer parameters than the diagonal specification, but still O(N²) (because of C)
Encouraging Parsimony: Covariance Targeting
Recall variance targeting:
σ̂² = (1/T) Σ_{t=1}^T rt²,   σ² = ω/(1 − α − β)   ⟹   take ω = (1 − α − β) σ̂²
DECO
• Time-varying correlations assumed identical across all pairs of assets, which implies:
Γt = (1 − ρt ) I + ρt J ,
ρt = ωρ + αρ ut + βρ ρt−1
Figure 10.15: Time-Varying International Equity Correlations. The figure shows the estimated
equicorrelations from a DECO model for the aggregate equity index returns for 16 different developed
markets from 1973 through 2009.
• Updating rule is naturally given by the average conditional correlation of the stan-
dardized returns,
ut = ( 2 Σ_{i=1}^N Σ_{j>i} ei,t ej,t ) / ( N Σ_{i=1}^N e²i,t )
DECO Example
Factor Structure
Rt = λFt + νt
where
Ft = ΩFt^{1/2} Zt
Zt ∼ i.i.d.(0, I)
νt ∼ i.i.d.(0, Ων)
⟹ Ωt = λ ΩFt λ′ + Ωνt
Rt = λft + νt
where
ft = σf t zt
zt ∼ i.i.d.(0, 1)
νt ∼ i.i.d.(0, σν²)
⟹ Ωt = σ²f t λ λ′ + Ων
σ²it = σ²f t λ²i + σ²νi
σij t = σ²f t λi λj
Rigorous Modeling IV
Conditional Asset-Level (Multivariate) Volatility Dynamics from High-Frequency Data
Realized Covariance
RCovt(∆) ≡ Σ_{j=1}^{N(∆)} Rt−1+j∆,∆ R′t−1+j∆,∆

RCovt(∆) → ICovt = ∫_{t−1}^t Ω(τ) dτ
Figure 10.16: QQ Plot of S&P500 Returns. We show quantiles of daily S&P500 returns from January
2, 1990 to December 31, 2010, against the corresponding quantiles from a standard normal distribution.
Rigorous Modeling V
Distributions
Modeling Entire Return Distributions:
Returns are not Unconditionally Gaussian
Modeling Entire Return Distributions:
Returns are Often not Conditionally Gaussian
Modeling Entire Return Distributions: Issues
• Gaussian conditional VaR is somewhat better but left tail remains bad
• Gaussian conditional expected shortfall, which integrates over the left tail, would be terrible
• So we want more accurate assessments of things like VaRᵖ_{T+1|T} than those obtained under Gaussian assumptions
– Doing so for all values of p ∈ [0, 1] requires estimating the entire conditional return distribution
– More generally, best-practice risk measurement is about tracking the entire conditional return distribution
rT +1 = σT +1/T εT +1
εT +1 ∼ iid(0, 1)
Multiply εT +1 draws by σT +1/T (fixed across draws, from a GARCH model) to build up
the conditional density of rT +1 .
rT +1 = σT +1 εT +1
εT +1 ∼ iid(0, 1)
Multiply εT +1 draws from N (0, 1) by σT +1 draws (from a simulated RV model fit to log
realized standard deviation) to build up the conditional density of rT +1 .
rT +1 = σT +1/T εT +1
εT +1 ∼ iidN (0, 1)
But in the conditionally non-Gaussian case there is potential loss of generality in writing:
rT +1 = σT +1/T εT +1
εT +1 ∼ iid(0, 1),
because there may be time variation in conditional moments other than σT +1/T , and using
εT +1 ∼ iid(0, 1) assumes that away
Multivariate Return Distributions
– If reliable realized covariances are available, one could do a multivariate analog of the
earlier lognormal/normal mixture model. But the literature thus far has focused primarily
on conditional distributions for “daily” data.
Return version:
Zt = Ωt^{−1/2} Rt,   Zt ∼ i.i.d.,   Et−1(Zt) = 0,   Vart−1(Zt) = I
Standardized-return version:
et = Dt⁻¹ Rt,   where Ωt = Dt Γt Dt,
and Dt denotes the diagonal matrix of conditional standard deviations for each of the assets, while Γt refers to the potentially time-varying conditional correlation matrix.
Leading Examples
Multivariate normal:
Multivariate t:
f(et) = C(d, Γt) ( 1 + e′t Γt⁻¹ et / (d − 2) )^{−(d+N)/2}

Multivariate asymmetric t:
f(et) = [ C(d, Γ̇t) K_{(d+N)/2}( √( (d + (et − µ̇)′ Γ̇t⁻¹ (et − µ̇)) ξ′ Γ̇t⁻¹ ξ ) ) exp( (et − µ̇)′ Γ̇t⁻¹ ξ ) ]
       / [ ( 1 + (et − µ̇)′ Γ̇t⁻¹ (et − µ̇) / d )^{(d+N)/2} ( √( (d + (et − µ̇)′ Γ̇t⁻¹ (et − µ̇)) ξ′ Γ̇t⁻¹ ξ ) )^{−(d+N)/2} ]

– More flexible than the symmetric t but requires estimation of N asymmetry parameters simultaneously with the other parameters, which is challenging in high dimensions.
Copula methods sometimes provide a simpler two-step approach.
Copula Methods
Sklar’s Theorem:
f(e) = ∂ᴺ G(F1(e1), ..., FN(eN)) / ∂e1 ... ∂eN = g(u) × ∏_{i=1}^N fi(ei)

⟹ log L = Σ_{t=1}^T log g(ut) + Σ_{t=1}^T Σ_{i=1}^N log fi(ei,t)
Standard Copulas
Normal:
g(ut; Γt*) = |Γt*|^{−1/2} exp( −(1/2) Φ⁻¹(ut)′ (Γt*⁻¹ − I) Φ⁻¹(ut) )
where Φ⁻¹(ut) refers to the N × 1 vector of standard inverse univariate normals, and the correlation matrix Γt* pertains to the N × 1 vector et* with typical element e*i,t = Φ⁻¹(ui,t).
– Often does not allow for sufficient dependence between tail events.
– t copula
– Asymmetric t copula
Asymmetric Tail Correlations
Multivariate Distribution Simulation (General Case)
Simulate using:
Rt = Ω̂t^{1/2} Zt
Zt ∼ i.i.d.(0, I)
Figure 10.19: Average Threshold Correlations for Sixteen Developed Equity Markets. The
solid line shows the average empirical threshold correlation for GARCH residuals across sixteen developed
equity markets. The dashed line shows the threshold correlations implied by a multivariate standard normal
distribution with constant correlation. The line with square markers shows the threshold correlations from a
DECO model estimated on the GARCH residuals from the 16 equity markets. The figure is based on weekly
returns from 1973 to 2009.
Ft = Ω̂F,t^{1/2} ZF,t
Rt = λ̂ Ft + νt
Rigorous Modeling VI
Risk, Return and Macroeconomic Fundamentals
We Want to Understand the Financial / Real Connections
Statistical vs. “scientific” models
Returns ↔ Fundamentals
r↔f
Disconnect?
“excess volatility,” “disconnect,” “conundrum,” ...
µr , σr , σf , µf
Links are complex:
µr ↔ σr ↔ σf ↔ µf
Volatilities as intermediaries?
For Example...
Table 10.1: Stock Return Volatility During Recessions. Aggregate stock-return volatility is quar-
terly realized standard deviation based on daily return data. Firm-level stock-return volatility is the cross-
sectional inter-quartile range of quarterly returns.
Table 10.2: Real Growth Volatility During Recessions. Aggregate real-growth volatility is quarterly
conditional standard deviation. Firm-level real-growth volatility is the cross-sectional inter-quartile range of
quarterly real sales growth.
µf ↔ σr
Return Volatility is Higher in Recessions
Schwert’s (1989) “failure”: Very hard to link market risk to expected fundamentals (lever-
age, corporate profitability, etc.).
Actually a great success:
Key observation of robustly higher return volatility in recessions!
– Earlier: Officer (1973)
– Later: Hamilton and Lin (1996), Bloom et al. (2009)
Extends to business cycle effects in credit spreads via the Merton model
µf ↔ σr , Continued
Bloom et al. (2009) Results
µf ↔ σf
Fundamental Volatility is Higher in Recessions
More Bloom, Floetotto and Jaimovich (2009) Results
σf ↔ σr
Return Vol is Positively Related to Fundamental Vol
Follows immediately from relationships already documented
Moreover, direct explorations provide direct evidence:
– Engle et al. (2006) time series
[Figure: Real Stock Return Volatility and Real PCE Growth Volatility, 1983-2002.]
Rt = β0 + β1 Xt + β2 σt + εt
σt² = ω + α r²_{t−1} + β σ²_{t−1}
• Reliable risk measurement requires conditional models that allow for time-varying
volatility.
• Risk measurement may be done using univariate volatility models. Many important
recent developments.
• Other tasks require multivariate models. Many important recent developments, espe-
cially for N large. Factor structure is often useful.
****************************
Models for non-negative variables (from Minchul)
Introduction Motivation: Why do we need dynamic models for positive values?
Alternative model
– Harvey (2013)
– Creal, Koopman, and Lucas (2013)
Autoregressive Gamma Processes
Autoregressive Gamma Processes (ARG): Definition
Yt follows the autoregressive gamma process if
Measurement:
Yt | Zt ∼ Gamma(δ + Zt, c)
Transition:
Zt | Yt−1 ∼ Poisson(β Yt−1)
Conditional moments:
E[Yt | Yt−1] = cδ + ρ Yt−1
V(Yt | Yt−1) = c²δ + 2cρ Yt−1
where ρ = βc > 0.
When δ < 1,
• The stationary ARG process features marginal over-dispersion.
• The process may feature either conditional under- or over-dispersion, depending on the value of Yt−1 .
Remark: the ACD (autoregressive conditional duration) model assumes path-independent over-dispersion.
Continuous time limit of ARG(1) The stationary ARG process is a discretized version of the CIR process.
dYt = a(b − Yt) dt + σ √Yt dWt
where
a = −log ρ
b = cδ / (1 − ρ)
σ² = c (−2 log ρ) / (1 − ρ)
The autocorrelation function features hyperbolic decay when the distribution π assigns sufficiently large
probabilities to values close to one.
• Yt : Interquote durations of the Dayton Mining stock traded on the Toronto Stock Exchange in October
1998.
• Estimation based on QMLE
Measurement:
yt ∼ p(yt | ht, xt; θ)
Transition:
ht ∼ Gamma(δ + zt, c)
zt ∼ Poisson(ρ ht−1)
Recent extension: ARG-zero processes
Monfort, Pegoraro, Renne and Roussellet (2014) extend the ARG process to take account of zero-lower-bound spells.
Original ARG:
Yt | Zt ∼ Gamma(δ + Zt, c)
Zt | Yt−1 ∼ Poisson(β Yt−1)
ARG-zero:
Yt | Zt ∼ Gamma(Zt, c)
Zt | Yt−1 ∼ Poisson(α + β Yt−1)
Two modifications
• δ = 0: as δ → 0, Gamma(δ, c) converges to a Dirac delta function at zero.
• α is related to the probability of escaping from the zero lower bound.
Characterization: the probability density for ARG-zero is
p(Yt | Yt−1; α, β, c) = Σ_{z=1}^∞ g(Yt, Yt−1, α, β, c, z) 1{Yt>0} + exp(−α − β Yt−1) 1{Yt=0}
Conditional moments:
E[Yt | Yt−1] = αc + ρ Yt−1
V(Yt | Yt−1) = 2c²α + 2cρ Yt−1
where ρ = βc.
Figure: ARG-zero
Autoregressive conditional duration model (ACD)
Yt follows the autoregressive conditional duration model if
yt = µt et,   E[et] = 1
µt = w + α µt−1 + β yt−1
• Because of its multiplicative form, it is classified as a multiplicative error model (MEM).
• Conditional moments:
E[yt | y1:t−1] = µt
V(yt | y1:t−1) = k0 µt²
V(yt | y1:t−1) / E[yt | y1:t−1]² = k0
Dynamic conditional score (DCS) model
The dynamic conditional score model (or generalized autoregressive score model) is a general class of observation-driven models. It is a convenient and general modelling strategy; I will describe it within the MEM class of models.
DCS Example: ACD 1
Recall
yt = µt et,   E[et] = 1
µt = w + α µt−1 + β yt−1
Instead, we apply the DCS principle: "Give me a conditional likelihood and time-varying parameters, and I will give you a law of motion."
yt = µt et,   et ∼ Gamma(κ, 1/κ)
DCS Example: ACD 2
yt = µt et,   et ∼ Gamma(κ, 1/κ)
µt = w + α µt−1 + β st−1,
where st is the scaled score of the conditional likelihood with respect to µt. With the Gamma specification the scaled score is (proportional to) yt, so the recursion becomes
µt = w + α µt−1 + β yt−1,
which is ACD. However, the law of motion will be different for different choices of distribution – Generalized Gamma, Log-Logistic, Burr, Pareto, and many others.
10.6 NOTES
Chapter Eleven
Non-Linear Non-Gaussian State Space and Optimal Filtering
Linear / Gaussian
αt = T αt−1 + Rηt
yt = Zαt + εt
Linear / Non-Gaussian
αt = T αt−1 + Rηt
yt = Zαt + εt
ηt ∼ Dη , εt ∼ Dε
Non-Linear / Gaussian
αt = Q(αt−1 , ηt )
yt = G(αt , εt )
Non-Linear / Gaussian II
(Linear / Gaussian with time-varying system matrices)
αt = Tt αt−1 + Rt ηt
yt = Zt αt + εt
ηt ∼ N η , εt ∼ N ε
“Conditionally Gaussian”
White’s theorem
Non-Linear / Non-Gaussian
αt = Q(αt−1 , ηt )
yt = G(αt , εt )
ηt ∼ D η , εt ∼ Dε
αt = Q(αt−1 ) + ηt
yt = G(αt ) + εt
ηt ∼ D η , εt ∼ Dε
αt = Qt (αt−1 , ηt )
yt = Gt (αt , εt )
ηt ∼ Dtη , εt ∼ Dtε
St = max(Vt − D, 0)
– Multivariate
Stochastic Volatility Model (Nonlinear/Gaussian Form)
ht = ω + β ht−1 + ηt   (transition)
rt = e^{ht/2} εt   (measurement)
ηt ∼ N(0, σ²η),   εt ∼ N(0, 1)
Stochastic Volatility Model (Linear/Non-Gaussian Form)
ht = ω + βht−1 + ηt (transition)
2ln|rt | = ht + 2ln|εt | (measurement)
or
ht = ω + βht−1 + ηt
yt = ht + ut
ηt ∼ N (0, ση2 ),
ut ∼ Du
– A “signal plus (non-Gaussian) noise”
components model for volatility
Realized and Integrated Volatility
IVt = φIVt−1 + ηt
RVt = IVt + εt
ε represents the fact that RV is based on less than an infinite sampling frequency.
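A Python sketch that simulates the stochastic volatility model above and forms the linear/non-Gaussian measurement 2 ln|rt|; the parameter values are illustrative:

import numpy as np

def simulate_sv(T=1000, omega=-0.2, beta=0.95, sigma_eta=0.2, seed=0):
    """Simulate h_t = omega + beta*h_{t-1} + eta_t and r_t = exp(h_t/2)*eps_t."""
    rng = np.random.default_rng(seed)
    h = np.empty(T)
    r = np.empty(T)
    h[0] = omega / (1.0 - beta)                      # start at the unconditional mean of h
    for t in range(T):
        if t > 0:
            h[t] = omega + beta * h[t - 1] + sigma_eta * rng.standard_normal()
        r[t] = np.exp(h[t] / 2.0) * rng.standard_normal()
    return r, h

r, h = simulate_sv()
y = 2.0 * np.log(np.abs(r))                          # measurement: y_t = h_t + 2*ln|eps_t|
print(np.corrcoef(y, h)[0, 1])                       # y is a noisy signal of h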
Microstructure Noise Model
**Hasbrouck
(Non-linear / non-Gaussian)
A Distributional Statement of the Kalman Filter ****
Multivariate Stochastic Volatility with Factor Structure
***
Approaches to the General Filtering Problem
Kitagawa (1987): numerical integration (linear / non-Gaussian). More recently: Monte Carlo integration.
Extended Kalman Filter (Non-Linear / Gaussian)
αt = Q(αt−1 , ηt )
yt = G(αt , εt )
ηt ∼ N, εt ∼ N
Take first-order Taylor expansions of:
Q around at−1
G around at,t−1
Use Kalman filter on the approximated system
Unscented Kalman Filter (Non-Linear / Gaussian)
Bayes Analysis of SSMs: Carlin-Polson-Stoffer (1992, JASA)
"Single-move" Gibbs sampler
(Many parts of the Gibbs iteration: the parameter vector, and then each observation of the state vector, period-by-period)
The multi-move Gibbs sampler can handle non-Gaussian models (via mixtures of normals), but not nonlinear ones.
The single-move sampler can handle nonlinear and non-Gaussian models.
A three-regime threshold model, for example, has two thresholds and a d-period delay regulating the switches:

yt = c⁽ᵘ⁾ + φ⁽ᵘ⁾ yt−1 + εt⁽ᵘ⁾,   θ⁽ᵘ⁾ < yt−d
yt = c⁽ᵐ⁾ + φ⁽ᵐ⁾ yt−1 + εt⁽ᵐ⁾,   θ⁽ˡ⁾ < yt−d < θ⁽ᵘ⁾
yt = c⁽ˡ⁾ + φ⁽ˡ⁾ yt−1 + εt⁽ˡ⁾,   yt−d < θ⁽ˡ⁾

The superscripts indicate "upper," "middle," and "lower" regimes, and the regime operative at any time t depends on the observable past history of y – in particular, on the value of yt−d.
—————–
Latent Markovian Regimes
Although observable threshold models are of interest, models with latent states as opposed to observed
states may be more appropriate in many business, economic and financial contexts. In such a setup, time-
series dynamics are governed by a finite-dimensional parameter vector that switches (potentially each period)
depending upon which of two unobservable states is realized, with state transitions governed by a first-
order Markov process. To make matters concrete, let's take a simple example. Let {st}_{t=1}^T be the (latent) sample path of a two-state first-order process, taking just the two values 0 or 1, with transition probability matrix given by

M = ( p00      1 − p00 )
    ( 1 − p11  p11     ).

The ij-th element of M gives the probability of moving from state i (at time t − 1) to state j (at time t). Note that there are only two free parameters, the staying probabilities, p00 and p11. Let {yt}_{t=1}^T be the sample path of an observed time series that depends on {st}_{t=1}^T such that the density of yt conditional upon st is

f(yt | st; θ) = ( 1 / (√(2π) σ) ) exp( −(yt − µ_{st})² / (2σ²) ).
Thus, yt is Gaussian white noise with a potentially switching mean. The two means around which yt moves
are of particular interest and may, for example, correspond to episodes of differing growth rates (“booms”
and “recessions”, “bull” and “bear” markets, etc.).
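A Python sketch that simulates this latent-regime process; the staying probabilities and regime means are illustrative choices:

import numpy as np

def simulate_markov_switching(T=500, p00=0.95, p11=0.90,
                              mu=(0.0, 2.0), sigma=1.0, seed=0):
    """Gaussian white noise whose mean switches with a two-state first-order Markov chain."""
    rng = np.random.default_rng(seed)
    s = np.empty(T, dtype=int)
    y = np.empty(T)
    s[0] = 0
    for t in range(T):
        if t > 0:
            stay = p00 if s[t - 1] == 0 else p11
            s[t] = s[t - 1] if rng.uniform() < stay else 1 - s[t - 1]
        y[t] = mu[s[t]] + sigma * rng.standard_normal()
    return y, s

y, s = simulate_markov_switching()
print(y[s == 0].mean(), y[s == 1].mean())   # sample means within each (latent) regime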
Appendices
Appendix A
Ait-Sahalia, Y. and Hansen, L.P. eds. (2010), Handbook of Financial Econometrics. Amsterdam: North-
Holland.
Ait-Sahalia, Y. and Jacod, J. (2014), High-Frequency Financial Econometrics, Princeton University Press.
Beran, J., Feng, Y., Ghosh, S. and Kulik, R. (2013), Long-Memory Processes: Probabilistic Properties and
Statistical Methods, Springer.
Box, G.E.P. and Jenkins, G.M. (1970), Time Series Analysis, Forecasting and Control, Prentice-Hall.
Davidson, R. and MacKinnon, J. (1993), Estimation and Inference in Econometrics, Oxford University Press.
Douc, R., Moulines, E. and Stoffer, D.S. (2014), Nonlinear Time Series: Theory, Methods, and Applications
with R Examples, Chapman and Hall.
Durbin, J. and Koopman, S.J. (2001), Time Series Analysis by State Space Methods, Oxford University
Press.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, Chapman and Hall.
Elliott, G., Granger, C.W.J. and Timmermann, A., eds. (2006), Handbook of Economic Forecasting, Volume
1, North-Holland.
Elliott, G., Granger, C.W.J. and Timmermann, A., eds. (2013), Handbook of Economic Forecasting, Volume
2, North-Holland.
Engle, R.F. and McFadden, D., eds. (1995), Handbook of Econometrics, Volume 4, North-Holland.
Geweke, J. (2010), Complete and Incomplete Econometric Models, Princeton University Press.
Geweke, J., Koop, G. and van Dijk, H., eds. (2011), The Oxford Handbook of Bayesian Econometrics, Oxford
University Press.
Granger, C.W.J. and Newbold, P. (1977), Forecasting Economic Time Series, Academic Press.
Granger, C.W.J. and Teräsvirta, T. (1996), Modeling Nonlinear Economic Relationships, Oxford University Press.
Hammersley, J.M. and Handscomb, D.C. (1964), Monte Carlo Methods, Chapman and Hall.
Hansen, L.P. and Sargent, T.J. (2013), Recursive Models of Dynamic Linear Economies, Princeton University
Press.
Harvey, A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge Uni-
versity Press.
Harvey, A.C. (2013), Dynamic Models for Volatility and Heavy Tails, Cambridge University Press.
Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining,
Inference and Prediction, Springer-Verlag.
Kim, C.-J. and Nelson, C.R. (1999), State-Space Models with Regime Switching, MIT Press.
Nerlove, M., Grether, D.M., Carvalho, J.L. (1979), Analysis of Economic Time Series: A Synthesis, Academic
Press.
Silverman, B.W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall.
Whittle, P. (1963), Prediction and Regulation by Linear Least Squares Methods, University of Minnesota
Press.
Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, John Wiley and Sons.
Appendix B
B.1 DIFFUSIONS
dx = dW,
where
where
W(t) = ∫₀ᵗ ε(u) du
(that is, it is an additive process). W(t) is the continuous-time analog of a discrete-time driftless Gaussian random walk. Intuitively, the normality arises from central-limit considerations stemming from the additive nature of the process. A key property of Brownian motion is its independent Gaussian increments,
(W(t) − W(s)) ∼ iid N(0, t − s),   ∀ 0 ≤ s ≤ t ≤ ∞
Brownian motion is fundamental, because processes with richer dynamics are built up from it, via location
and scale shifts. “W” stands for “Wiener process.” Standard Brownian motion is the simplest example of
the slightly more general Wiener process.
dx = α dt + σ dW.
Figure: Gaussian random walk with drift, optimal point and interval forecasts
The Wiener process arises as the continuous limit of a discrete-time binomial tree. Discrete periods ∆t. Each period the process moves up by ∆h w.p. p, and down by ∆h w.p. 1 − p. If we take limits as ∆t → 0 and adjust ∆h and p appropriately (as they depend on ∆t), we obtain the Wiener process. Useful for simplified derivatives pricing, as in Cox, Ross and Rubinstein (1979, JFE).
dx = α dt + σ dW.
• Ito process
dx = α(x, t) dt + σ(x, t) dW
An important generalization of a Wiener process.
dx = α x dt + σ x dW.
Simple and important Ito process.
Figure: Exp of a logarithmic Gaussian random walk with drift, optimal point and interval forecasts
• Ornstein-Uhlenbeck process
dx = (α + βx) dt + σ dW.
Simple and important Ito process. Reverts to a mean of −α/β. Priestley (1980) shows how it arises as one passes to continuous time when starting from a discrete-time AR(1) process.
Figure: Gaussian AR(1) with nonzero mean, optimal point and interval forecasts
dx = (α + βx) dt + σ √x dW.
This is an important example of an Ito process – this time heteroskedastic, as the variance depends on the level.
dr = (a + βr) dt + ψ rᵞ dW
Discrete approximation (we change the notation from x to r, in keeping with the fact that the models are commonly used for interest rates):
r1,t = a + b r1,t−1 + ε1,t,   where b = 1 + β and ε1,t|Ωt−1 ∼ N(0, ψ² r^{2γ}_{1,t−1}).
Iterating h periods,
r1,t = a Σ_{i=0}^{h−1} bⁱ + bʰ r1,t−h + Σ_{i=0}^{h−1} bⁱ ε1,t−i.
i=0 i=0
Thus the h-aggregated series follows:
rh,t = a Σ_{i=0}^{h−1} bⁱ + bʰ rh,t−1 + εh,t
εh,t | Ωt−1 ∼ N( 0, Σ_{i=0}^{h−1} b²ⁱ var(ε1,t−i) ),
where
var(ε1,t−i) = ψ² r^{2γ}_{1,t−i−1}.
Note that, although the ”discretization interval” must be set by the investigator, and is therefore
subject to discretion, from that point on the parameter estimates are (asymptotically) invariant to the
data recording interval.
• Diffusion limit of GARCH (Nelson, 1990; Drost-Werker, 1996)
drt = σt dWpt
dσt² = θ(ω − σt²) dt + √(2λθ) σt² dWst
ω > 0, θ > 0, 0 < λ < 1, Wpt independent of Wst
Drost and Werker (1996) show that approximate discretizations are available, which follow weak-GARCH
processes and are therefore closed under temporal aggregation. They provide formulae for the
continuous-time coefficients in terms of the discrete weak-GARCH coefficients at any aggregation level.
Makes for a tidy framework bridging continuous and discrete time. See also Andersen and Bollerslev (1998).
dx = α(x, t) dt + σ(x, t) dP
where
dP = 0 w.p. 1 − λ dt;   dP = u w.p. λ dt

Ito's Lemma: for F = F(x, t),
dF = [ ∂F/∂t + α(x, t) ∂F/∂x + (1/2) σ²(x, t) ∂²F/∂x² ] dt + σ(x, t) (∂F/∂x) dW
Ito’s Lemma is central because we often need to characterize the diffusion followed by a
function of an underlying diffusion, as in derivatives pricing.
B.2 JUMPS
B.7 NOTES
Appendix C
yit = X′it βi + εit
cov(εit, εjt) = σij,   Σ = [σij]
i = 1, ..., N;  t = 1, ..., T
y = Xβ + ε
cov(ε) = Σ ⊗ I ≡ Ω
β̂SUR = ( X′ Ω̂⁻¹ X )⁻¹ X′ Ω̂⁻¹ y
Bibliography
Aldrich, E.M., F. Fernández-Villaverde, A.R. Gallant, and J.F. Rubio-Ramírez (2011), "Tapping the Supercomputer Under Your Desk: Solving Dynamic Equilibrium Models with Graphics Processors," Journal of Economic Dynamics and Control, 35, 386-393.
Aruoba, S.B., F.X. Diebold, J. Nalewaik, F. Schorfheide, and D. Song (2013), “Improving GDP Measurement:
A Measurement Error Perspective,” Working Paper, University of Maryland, Federal Reserve Board, and
University of Pennsylvania.
Nerlove, M., D.M. Grether, and J.L. Carvalho (1979), Analysis of Economic Time Series: A Synthesis. New
York: Academic Press. Second Edition.
Ruge-Murcia, Francisco J. (2010), “Estimating Nonlinear DSGE Models by the Simulated Method of Mo-
ments,” Manuscript, University of Montreal.
Yu, Yaming and Xiao-Li Meng (2010), “To Center or Not to Center: That is Not the Question An Ancillarity-
Sufficiency Interweaving Strategy (ASIS) for Boosting MCMC Efficiency,” Manuscript, Harvard Univer-
sity.