
Chapter 2

Returns: Properties and Models

The Questions

1. How do we define returns for equities, bonds, credit instruments, futures?

2. What are the stylized properties of returns?

3. Why is volatility an important measure for risk and portfolio construction?

4. What is GARCH? How do we use it?

5. How do we use Kalman Filtering to estimate volatility?

6. How do we model multivariate returns?

Draft (April 30, 2024). Please read the chapter carefully and send
comments and corrections to the author. Any contribution will be
acknowledged in the final copy.
Email: paleologo@gmail.com
Xtwitter: @__paleologo (DM me)
LinkedIn: https://www.linkedin.com/in/gappy/ (connect, then message)

We start with models of univariate returns for two reasons. First, single-asset returns are the basic constituents of portfolios. We cannot hope to understand the behavior of portfolios without a solid understanding of their building blocks, so it is necessary to summarize the salient empirical properties of stock returns and the most common processes employed to model them, and specifically to model volatility effectively. These models have general applicability, and are even more useful when combined with other families of models for multivariate returns. GARCH and exponential moving averages are essential tools of the working modeler. In the process, I introduce models that justify their use. Exponential moving averages find their motivation in linear state-space models; GARCH is an instance of a nonlinear state-space model. These models will be your friends for life.
There are five parts to the chapter. First, we lay out definitions of returns; second, we summarize some "stylized facts" (empirical features of returns that are ubiquitous and relevant to risk management); third, we skim GARCH models and realized volatility models. Because both topics have been covered extensively in textbooks, my goal here is to introduce the essentials and their associated insights and give a jumping-off point for the reader. Then, I touch on state-space models for variance estimation. Lastly, I cover spherical and elliptical distributions.

2.1 Returns
2.1.1 Definitions
We have a set of n assets and a currency, also called the numeraire.¹ We will use dollars throughout as currency. It is customary to assume that each of these assets is infinitely divisible. We buy one unit of currency's worth of asset $i$ today, and we denote the value of this position tomorrow by $R_i$. An equivalent way to define returns is from the closing prices of security $i$ on days 0 and 1, $P_i(0)$ and $P_i(1)$, respectively. The return is defined as

$$r_i(1) := \frac{P_i(1) - P_i(0)}{P_i(0)}$$

We extend this definition to the case in which the security pays a dividend. The holder of the asset receives an amount $D_i(1)$, and the dividend-adjusted return is defined as

$$r_i(1) := \frac{P_i(1) + D_i(1) - P_i(0)}{P_i(0)}$$

¹This word comes to English from the Latin numerarius, or "a number", "a unit", through the French numéraire.

We denote the vector of daily returns at time $t$ as $(r_1(t), \dots, r_n(t))$. A great deal of equity risk management deals with the properties of this vector. For a portfolio $w \in \mathbb{R}^n$, where $w_i$ is the dollar investment in asset $i$, the Profit and Loss (PnL) in a single period is given by the change in the value of the portfolio: $\sum_i w_i\,P_i(1)/P_i(0) - \sum_i w_i = \sum_i w_i\, r_i(1)$. In vector form, this equals $w'r$.

2.1.2 Log Returns


If $r$ follows a multivariate Gaussian distribution, then so does the portfolio PnL. The variance of this portfolio can be computed using just two pieces of information: the portfolio weights and the covariance matrix of the returns. The question of whether net returns are Gaussian is an empirical one. We at least know that if net returns are Gaussian, they are very tractable for analysis at a given point in time. However, they are not easily tractable in time series analysis. For example, define the cumulative total return over periods $1, \dots, T$:

$$r_i(1{:}T) := \frac{P_i(T)}{P_i(0)} - 1 = \frac{P_i(T)}{P_i(T-1)}\,\frac{P_i(T-1)}{P_i(T-2)}\cdots\frac{P_i(1)}{P_i(0)} - 1 = (r_i(T)+1)\times(r_i(T-1)+1)\times\cdots\times(r_i(1)+1) - 1$$

If the $r_i(t)$ are normally distributed, the cumulative total return is not normally distributed, and its distribution rapidly diverges from the normal distribution. The variance of the cumulative returns is not a simple function of the single-period variances.

Log returns, on the other hand, behave well under compounding. Let $\tilde r_i(t) := \log(1 + r_i(t))$. Then the log of the compound return is equal to the sum of the log returns over the same period, and if the log returns are normal, so is the log of the compound return. If the returns are independent, the variance of the log compound return is equal to the sum of the variances. We can reconcile the two views of returns – raw and log – if the approximation $\log(x) = x - 1 + o(|x - 1|)$ is sufficiently accurate, i.e., if net returns are small. In this case, we can make the approximation $\tilde r_i \simeq r_i$.
A common approximation for the compounded net return of an asset over time is given by

$$\prod_t (r(t) + 1) - 1 = \exp\!\left(\sum_t \tilde r(t)\right) - 1 \simeq 1 + \sum_t \tilde r(t) - 1 \simeq \sum_t r(t).$$

Always verify the accuracy of the approximation, for example by comparing the estimates of models developed using $r$ and $\tilde r$. When the assets are equities, the approximation is usually considered adequate for daily measurement intervals or shorter.
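As a quick numerical check of this approximation, here is a minimal Python sketch with simulated daily returns (the return magnitudes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(0.0005, 0.01, size=252)       # hypothetical daily net returns, ~1% vol

compounded = np.prod(1 + r) - 1              # exact compounded net return
via_logs = np.exp(np.sum(np.log1p(r))) - 1   # through log returns (identical up to rounding)
sum_raw = np.sum(r)                          # crude approximation: sum of net returns

print(compounded, via_logs, sum_raw)
```

For daily-sized returns the three numbers agree to a few basis points; the gap widens as return magnitudes grow.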

2.1.3 Estimating Prices and Returns


To estimate returns, we need prices. Prices, however, depend crucially on the way a market is designed. Over-the-counter markets (Harris, 2003) differ from exchanges that employ limit-order books (Bouchaud et al., 2018). Within a single exchange, the trading mechanism can change over the course of the day, with auctions often taking place at the beginning and at the close of the trading day. As a result of market design, the observation of prices exhibits measurement error. The most conspicuous example of such an error is the bid-ask spread. In limit order books, buy orders have a price attribute (the "bidding" price per share the buyer is willing to pay) and a quantity. Similarly, sell orders have a price attribute, or "asking price", and a quantity. Asking prices are higher than bidding prices, and the difference is called the bid-ask spread. This spread is a multiple of the minimum tick size.² For a transaction to occur, a buy order or a sell order must cross the spread; either event can occur. As a result, the transaction price will be either at the top or at the bottom of the bid-ask spread interval. Successive transactions will have different price marks due to the partial randomness of buying and selling transactions. The bid-ask spread bounce is not the only source of measurement error. For example, prices can differ across exchanges, and the selection of a price by timestamp depends on the choice of data integration. Then, there may be outright measurement errors. It is important to consider early on the fact that prices are imperfectly observed, rather than ignore measurement errors and their impact and face unintended consequences.

²As of publication time, the minimum tick size is $0.01 in US exchanges for shares trading above $1.

Perhaps the simplest model for asset prices is the Roll model (Roll, 1984). In this model, the "true" price $m_t$ of an asset evolves as an arithmetic random walk, and we imperfectly observe the price $p_t$. In formulas:

$$m_{t+1} = m_t + \sigma_\epsilon\, \epsilon_{t+1} \qquad \text{(evolution)}$$
$$p_{t+1} = m_{t+1} + \sigma_\eta\, \eta_{t+1} \qquad \text{(observation)}$$

with $\epsilon_t, \eta_t$ independent random variables (serially and from each other), distributed according to a standard normal.

Before we try to estimate prices, the model has an immediate and testable consequence: consecutive price differences are negatively correlated. The price difference is $\Delta p_{t+1} := \sigma_\epsilon\,\epsilon_{t+1} + \sigma_\eta(\eta_{t+1} - \eta_t)$, which is zero in expectation. However,

$$E(\Delta p_{t+1}\,\Delta p_t) = -\sigma_\eta^2$$
$$E(\Delta p_{t+1}\,\Delta p_s) = 0, \qquad s < t$$
The lag-one autocorrelation can also be used to estimate the measurement error. The presence of large non-zero autocorrelations beyond lag one may point to model inadequacy, in the sense that there are actual long-term dependencies in the price process $m_t$. The model can be extended; see Section 2.4. An optimal estimator for $m_t$ is provided by the Kalman filter. The filter is covered in the Appendix, Section 14.3, and specifically in Example 1 of Subsection 14.3.1. The estimator is given by

$$\hat m_{t+1|t} = (1 - K)\,\hat m_{t|t-1} + K p_t$$

where the explicit formula for $K \in (0, 1)$ is given in the Appendix. The smaller the ratio $\sigma_\eta/\sigma_\epsilon$, the higher the $K$, which makes sense: we do not need to average observations if the price observations are accurate. The gist of the model is that an exponential moving average of prices is preferable to just taking the last price in the measurement period. If we want the daily closing price, for example, we may want to use a weighted average of 5-minute interval prices in the preceding interval. There is a caveat, however. Suppose we have estimates $\hat m_t$, and we use these estimates to compute returns at intervals $T$; i.e., $r_T := \hat m_{nT}/\hat m_{(n-1)T} - 1$. Because we employ the same observed prices $p$ both in $\hat m_{(n-1)T}$ and in $\hat m_{nT}$, the two estimates are positively correlated. One should always check that $(1-K)^T \ll 1$ to alleviate this spurious correlation.
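To make the bid-ask bounce concrete, here is a minimal simulation sketch of the Roll model; the noise levels and the smoothing gain K are made-up illustrative values (the optimal K is derived in the Appendix):

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma_eps, sigma_eta = 100_000, 0.02, 0.05        # hypothetical noise levels

m = np.cumsum(sigma_eps * rng.standard_normal(T))    # true price: arithmetic random walk
p = m + sigma_eta * rng.standard_normal(T)           # observed price with measurement noise

dp = np.diff(p)
lag1_autocov = np.mean(dp[1:] * dp[:-1])             # should be close to -sigma_eta**2
print(lag1_autocov, -sigma_eta**2)

# Exponential smoothing of observed prices (steady-state Kalman recursion)
K = 0.3                                              # illustrative gain, not the optimal one
m_hat = np.empty(T)
m_hat[0] = p[0]
for t in range(1, T):
    m_hat[t] = (1 - K) * m_hat[t - 1] + K * p[t - 1]
```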

2.1.4 Stylized Facts


Before building the house, we need to look at the bricks, namely, the statistical properties of single-stock returns. Below I list some "stylized facts" about stock returns, and discuss their relevance to risk modeling and management. Gross returns $R(t)$ have a lower bound at 0, and we usually characterize the properties of $r(t) := \log R(t)$. We focus on the properties of $r(t)$, but also of $|r(t)|$ and $r^2(t)$, which are proxies for the volatility of the log returns. Here are some properties; see (Cont, 2001; Taylor, 2007; Ratliff-Crain et al., 2023).

1. Absence of autocorrelations. Lagged autocorrelations are small unless you observe prices and returns at time scales at which market microstructure becomes relevant (say, intraday). See Fig. 2.1.

                Skewness                 Kurtosis
   Stock   Mean   Left   Right    Mean   Left   Right
   AAPL    -0.2   -0.5    0.2      5.7    3.6    7.8
   IBM      0.1   -0.2    0.5      7.1    5.4    8.7
   NRG      0.4   -0.5    1.2     14.3    7.9   20.0
   WAT     -2.0   -3.3   -0.6     29.8   12.8   48.1
   SPY     -0.1   -0.7    0.6     11.4    6.5   16.0

Table 2.1: Sample skewness and kurtosis of daily log returns, with p = 0.01 confidence intervals (Left/Right bounds) estimated using nonparametric bootstrap with replacement (5000 variates). Range: 1/3/2001-12/8/2017.

2. Heavy tails. The unconditional distribution of returns shows heavy-tailed behavior. This will be made more precise in the following section, but the probability of a large return is higher than what would be consistent with any "thin-tailed" distribution, i.e., one with finite moments of all orders. Examples of sample kurtosis are in Table 2.1. The conditional (say, conditional on the return's entire history up to time t) distribution of returns may show heavy-tailed behavior as well, but with lighter tails than the unconditional one.

3. Autocorrelation of absolute returns and second moments. The time series $|r(t)|$ and $r^2(t)$ show strong autocorrelation. The autocorrelation of absolute values is the greatest and is called the "Taylor Effect" in the literature (Taylor, 1986; Granger and Ding, 1995).
Figure 2.1: Autocorrelation plot of daily log returns (range: 1/3/2000-12/8/2017) for (a) AAPL, (b) IBM, (c) NRG, (d) WAT, (e) SPY, (f) XLK.

4. Aggregational Gaussianity. At longer time scales (say, weekly or monthly returns, as opposed to daily or intraday returns), the distribution of returns becomes closer to a Gaussian distribution.

Reality³ is in stark contrast with simple models of univariate price dynamics like the geometric diffusion process at the core of simple derivative pricing models:

$$dP(t) = \mu P(t)\,dt + \sigma P(t)\,dW(t) \qquad (2.1)$$

This model predicts Gaussian, independent log returns, which are inconsistent with the empirical evidence. First, returns show little serial autocorrelation. This does not mean that returns are independent, nor that returns are unpredictable based on the history of returns or some additional explanatory variables. Regarding the former point: zero autocorrelation does not imply independence. Regarding the latter: returns are predictable. This is not only an article of faith of active investors, who usually do a terrible job at it, but also a relatively uncontroversial empirical finding among academics⁴. Nevertheless, even though returns are predictable, they are not trivially predictable.
Regarding heavy tails: for asset returns, we restrict our attention to power-tailed distributions: the complement of the cumulative distribution function follows a power law, $\bar F(x) := P(r > x) = C x^{-\alpha}$, with $\alpha > 0$. Compare this to Gaussian returns: if $r \sim N(0, 1)$, then a common approximation (Wasserman, 2004) for the tail probability is

$$\frac{1}{\sqrt{2\pi}}\left(\frac{1}{x} - \frac{1}{x^3}\right) e^{-x^2/2} \;\le\; \bar F(x) \;\le\; \frac{1}{\sqrt{2\pi}}\,\frac{e^{-x^2/2}}{x} \qquad (2.2)$$

For the case $|x| \ge 1$, the right-side inequality can be used to bound the quantiles of the right tail, and the symmetric inequality those of the left tail:

$$\bar F(x) \le \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \;\Rightarrow\; \bar F^{-1}(\delta) \le \sqrt{2\log\!\left[1/\big(\sqrt{2\pi}\,\delta\big)\right]} \qquad (2.3)$$
$$F(x) \le \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \;\Rightarrow\; F^{-1}(\delta) \ge -\sqrt{2\log\!\left[1/\big(\sqrt{2\pi}\,\delta\big)\right]} \qquad (2.4)$$
³Note, however, that I am not including the Leverage Effect among the stylized facts. In the words of Cont (2001), "most measures of volatility of an asset are negatively correlated with the returns of that asset". This effect is not sufficiently strong in recent data, as shown by Ratliff-Crain et al. (2023).
⁴John Cochrane has written extensively on this, e.g., Cochrane (2008) and the blog entry "Predictability and correlation" (http://johnhcochrane.blogspot.com/2014/01/predictability-and-correlation.html).
The approximation is quite accurate: the quantile bound $\sqrt{2\log[1/(\sqrt{2\pi}\,\delta)]}$ closely tracks the exact Gaussian quantile over the range $10^{-10} < \delta < 10^{-1}$. A Gaussian random variable has finite moments of any order. A power-tail random variable with exponent $\alpha$ has finite moments only up to order $\alpha$. A Gaussian random variable has quantiles bounded, up to constants, by $\sqrt{\log(1/\delta)}$, while a power-tail one has a quantile of the form $(1/\delta)^{1/\alpha}$. It is not controversial that the unconditional log returns have heavy tails. It is still not settled what the exponent $\alpha$ associated with the distribution is; it seems, however, that $\alpha \simeq 4$. This is important for estimation purposes. A sufficient condition for the estimability of the volatility of returns is that their fourth moment is finite. To see this, recall that the Central Limit Theorem says that, if the $x_t$ are iid random variables with mean $\mu$ and variance $\sigma^2$, then $T^{-1/2}\sum_{t=1}^{T}(x_t - \mu)$ converges in distribution to a Gaussian random variable with mean 0 and variance $\sigma^2$. The theorem allows us to establish an asymptotic result on $E(r^2)$: assume that the $r_t$ are iid and set $x_t := r_t^2$. If we want to estimate $E(r_t^2)$ using the CLT, then we need finiteness of $E(r_t^4)$. This seems to be the case. However, a related question is whether the conditional return distribution is heavy-tailed. If the heavy-tailed character of conditional returns is ignored or considered inessential, then it is possible to model returns as a process with conditionally Gaussian returns and heavy-tailed unconditional ones. This family, denoted Conditional Heteroscedastic Models, is rich and is the subject of the following subsection. We won't cover models with long-range dependence and/or heavy-tailed conditional and unconditional returns, like Lévy processes and FARIMA models. No model covers all the empirical features observed in stock returns. GARCH models (and mixture models in general) have the benefit of being easy to interpret, simulate, and estimate.

2.2 Conditional Heteroscedastic Models (CHM)

This family of models was first proposed in the early 1980s by Engle (1982) and Engle and Bollerslev (1986). By the next decade they had been generalized and applied to several economic domains. They are extensively covered in any econometrics textbook.

The most popular and studied model in this family is the GARCH(1,1) model. It has good empirical properties, its theoretical properties have been characterized, and it can be estimated efficiently. It also conveys the gist of the larger set of models in this family. The fundamental insight of the model is to make the parameters in the model a part of the state of the stochastic process. The laws for GARCH(1,1) are
$$r_t = h_t\,\epsilon_t \qquad (2.5)$$
$$h_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 h_{t-1}^2 \qquad (2.6)$$
$$\epsilon_t \sim N(0, 1) \qquad (2.7)$$

To gain some intuition, let us look at the second equation of the GARCH process when we remove the term $\alpha_1 r_{t-1}^2$. The equation

$$h_t^2 = \alpha_0 + \beta_1 h_{t-1}^2 \qquad (2.8)$$

can be rewritten as

$$h_t^2 - \bar h^2 = \beta_1\,(h_{t-1}^2 - \bar h^2)$$

where $\bar h^2 := \alpha_0/(1 - \beta_1)$. The value of $h_t^2$ converges to $\bar h^2$ at a geometric rate, so long as $|\beta_1| < 1$. High values of the squared return $r_{t-1}^2$ shock the volatility upward, provided that $\alpha_1 > 0$. This in turn increases the probability of large squared returns in the following period, giving rise to a rich dynamic behavior. The increase in volatility cannot continue unabated, because the term $\beta_1(h_{t-1}^2 - \bar h^2)$ will dampen variances that are much greater than the "equilibrium level" $\bar h^2$. This can be seen through substitution in the second equation of the model:

$$h_t^2 = \frac{\alpha_0}{1 - \beta_1} + \alpha_1 \sum_{i=1}^{\infty} \beta_1^{\,i-1}\, r_{t-i}^2 \qquad (2.9)$$

One could replace the true values of $\alpha_0, \alpha_1, \beta_1$ with estimates, and interpret the formula by saying that the variance estimate is an exponential moving average of non-iid squared returns, since they are modulated by $h_t$, in light of Equation (2.5).
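To see the volatility clustering and excess kurtosis this recursion produces, here is a minimal simulation sketch of Equations (2.5)-(2.7); the parameter values are purely illustrative:

```python
import numpy as np

def simulate_garch11(T, alpha0, alpha1, beta1, seed=0):
    """Simulate r_t = h_t * eps_t with h_t^2 = alpha0 + alpha1*r_{t-1}^2 + beta1*h_{t-1}^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T)
    r = np.empty(T)
    h2 = alpha0 / (1.0 - alpha1 - beta1)      # start at the unconditional variance
    for t in range(T):
        r[t] = np.sqrt(h2) * eps[t]
        h2 = alpha0 + alpha1 * r[t] ** 2 + beta1 * h2
    return r

r = simulate_garch11(50_000, alpha0=1e-6, alpha1=0.08, beta1=0.90)
print("sample kurtosis:", np.mean(r**4) / np.mean(r**2) ** 2)   # exceeds 3 for alpha1 > 0
```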

2.2.1 GARCH as random recursive equations?

We now look at GARCH(1,1) through different modeling approaches. First, we could reformulate it as a random iterated function. Rewrite Equation (2.6) as

$$h_t^2 = \alpha_0 + \alpha_1 h_{t-1}^2 \epsilon_{t-1}^2 + \beta_1 h_{t-1}^2$$

Set

$$a_t := \beta_1 + \alpha_1 \epsilon_{t-1}^2$$
The random variables $a_t$ are iid. Then

$$h_t^2 = a_t h_{t-1}^2 + \alpha_0$$

This formulation shows that the process is Markovian, and that the variance process is governed by an autoregressive equation with random coefficients. This allows us to study the process using the toolkit of random recursive equations. By recursion (Lindner, 2009), we can rewrite the equation as

$$h_t^2 = \left(\prod_{i=0}^{k} a_{t-i}\right) h_{t-k-1}^2 + \alpha_0 \sum_{i=0}^{k}\prod_{j=0}^{i-1} a_{t-j}$$

The product $x_t := \prod_{i=0}^{t} a_i$ plays an important role (Nelson, 1990). If we can identify the conditions under which it converges to zero almost surely (a.s.), and fast enough to guarantee that $\sum_{i=0}^{k}\prod_{j=0}^{i-1} a_{t-j}$ is finite a.s. as $k \to \infty$, then we have proven the existence of an asymptotic limit for $h_t^2$. Let us consider the process $\{x_t : t > 0\}$. First, note that $x_t \ge 0$; it diverges if and only if $\log x_t \to \infty$. We then have to find the conditions under which

$$\sum_{i=0}^{t} \log(\beta_1 + \alpha_1 \epsilon_{i-1}^2) \to \infty \quad \text{a.s.}$$

Since this is the sum of iid random variables, a necessary and sufficient condition for this is that $\mu := E[\log(\beta_1 + \alpha_1 \epsilon_0^2)] > 0$, provided that the variance of $\log(\beta_1 + \alpha_1 \epsilon_0^2)$ is finite. If that is the case, then we can apply the Strong Law of Large Numbers:

$$\frac{1}{t}\sum_{i=0}^{t} \log(\beta_1 + \alpha_1 \epsilon_{i-1}^2) \to \mu \quad \text{a.s.}$$

Conversely, assume that $E[\log(\beta_1 + \alpha_1 \epsilon_0^2)] < 0$. Then $\log x_t \to -\infty$ a.s., and $x_t \to 0$ a.s. Under this condition, the unconditional variance is

$$h_t^2 = \alpha_0 \sum_{i=0}^{\infty}\prod_{j=0}^{i-1} a_{t-j} \qquad (2.10)$$

The kurtosis of the process is

$$k = \frac{3(1 + \alpha_1 + \beta_1)(1 - \alpha_1 - \beta_1)}{1 - \beta_1^2 - 2\alpha_1\beta_1 - 3\alpha_1^2} = 3\,\frac{1 - (\alpha_1 + \beta_1)^2}{1 - (\alpha_1 + \beta_1)^2 - 2\alpha_1^2} > 3$$

so the process is leptokurtic as long as $\alpha_1 > 0$. How about skewness? The unconditional returns are not skewed, because

$$E\big((r_1 - E r_1)^3\big) = E(h_1^3)\,E(\epsilon_t^3) = 0$$

Finally, we point out that not only are the unconditional returns leptokurtic, but they do in fact have Pareto tails, provided the process is stationary: $P(r_t > x) \sim x^{-\alpha}$, for some $\alpha > 0$; see Mikosch and Stărică (2000); Buraczewski et al. (2016).

2.2.2 GARCH(1,1) and Return Stylized Facts

The GARCH model improves on the distributional properties of the returns $r_t$: the standardized residuals $r_t/h_t$ are closer to normal than the raw returns; see Figure 2.2. How does the GARCH(1,1) model stack up against the stylized facts?

1. Absence of autocorrelations. This property is satisfied (not hard to verify directly).

2. Heavy Tails. The unconditional returns are leptokurtic. Moreover (Mikosch and Stărică, 2000), the unconditional returns have heavy, power-law tails. So, this checks out. However, wait until point 3 below (autocorrelation of absolute returns) before you celebrate.

3. Autocorrelation of absolute and squared returns. The ACF for GARCH(1,1) is positive for both absolute and squared returns. For squared returns, it has the form (He and Teräsvirta, 1999; Ruppert and Matteson, 2015)

$$\rho_n = \begin{cases} \dfrac{\alpha_1\,(1 - \alpha_1\beta_1 - \beta_1^2)}{1 - 2\alpha_1\beta_1 - \beta_1^2} & \text{if } n = 1 \\[2ex] \rho_1\,(\alpha_1 + \beta_1)^{n-1} & \text{if } n > 1 \end{cases}$$

However, if we look at kurtosis and lag-1 autocorrelation for common stock indices, it appears that the autocorrelation is too high for a given observed kurtosis level. See Teräsvirta (2009a).

4. Aggregational Gaussianity. Although there are no known theoretical results on this property, to the best of my knowledge, it is satisfied empirically.

Summing up, some but not all of the stylized facts about log returns are captured by GARCH(1,1).
Figure 2.2: Quantile-Quantile plot for daily log returns (blue dots) and GARCH(1,1) residuals (orange dots) of log returns against the theoretical normal distribution for (a) AAPL, (b) IBM, (c) NRG, (d) WAT, (e) SPY, (f) XLK. Return range: 1/3/2001-12/8/2017.

Table 2.2: Kolmogorov-Smirnov distances between the theoretical normal distribution and the empirical distribution of log returns and of GARCH(1,1) residuals of log returns. The distance is reduced in all instances, with the largest improvements for the two proxies for the market (SPY) and the technology sector (XLK). For background on the Kolmogorov-Smirnov distance, see DeGroot and Schervish (2012), Ch. 10.

   Stock   Unconditional   GARCH(1,1)
   AAPL        0.067          0.044
   IBM         0.078          0.047
   NRG         0.088          0.060
   WAT         0.109          0.091
   SPY         0.098          0.040
   XLK         0.091          0.043

Table 2.3: Estimated $\alpha$ for the left and right tail of the probability density function $p(x) \propto x^{-\alpha}$. We use the MLE estimator $\hat\alpha = 1 + n\big[\sum_i \log(x_i/x_{\min})\big]^{-1}$, where $n$ is the number of observations beyond a cut-off value $x_{\min}$. The value of $x_{\min}$ is set to -2.5 and 2.5 for the left and right tail, respectively. The value of $\alpha$ increases sizably for the two indices SPY and XLK.

                  Left Tail                        Right Tail
   Stock   Unconditional  GARCH(1,1)      Unconditional  GARCH(1,1)
   AAPL         4.8          4.6               4.8           4.9
   IBM          4.3          3.9               4.2           4.6
   NRG          4.0          5.9               3.8           4.1
   WAT          3.4          3.2               4.3           4.0
   SPY          4.1          5.9               4.2           8.5
   XLK          5.0          6.3               4.4           5.9
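A minimal sketch of the tail-exponent estimator in the table caption, checked on a synthetic Pareto sample rather than the actual return data:

```python
import numpy as np

def tail_alpha(x, x_min):
    """MLE of the exponent alpha in p(x) ~ x**(-alpha), fit to the tail beyond x_min."""
    tail = x[x > x_min]
    return 1.0 + tail.size / np.sum(np.log(tail / x_min))

# Synthetic check: a Pareto sample with survival P(X > x) ~ x**(-3) has density exponent 4.
rng = np.random.default_rng(0)
sample = rng.pareto(3.0, size=200_000) + 1.0
print(tail_alpha(sample, x_min=2.5))   # should be close to 4
```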

2.2.3 GARCH(1,1) Estimation?

The vast majority of CHM applications are of order p = q = 1, so we restrict our analysis to this case for simplicity. Generalization to finite-order processes is straightforward. Define $\theta := (\alpha_0, \alpha_1, \beta_1)$, and let $f$ be the log density function of the standard normal distribution.

$$r_t = h_t\,\epsilon_t$$
$$h_t^2 = g(h_{t-1}^2, r_{t-1}^2, \theta)$$

By repeated substitution, we can express the unobserved variance $h_t^2$ as a function of the sequence $r_1, \dots, r_{t-1}$ and $\theta$. The log-likelihood of the sequence of residuals $\epsilon_t = r_t/h_t$ is given by

$$L(\theta) = \sum_{t=1}^{T} f\!\left(\frac{r_t}{h_t(r_1, \dots, r_{t-1}, \theta)}\right)$$

We can then estimate the parameters $\theta$ of the model by maximizing the log-likelihood. As an example, consider the GARCH(1,1) model. The recursive equation for $h_t^2$ is given by Equation (2.9), so we solve

$$\min_\theta \; \sum_{t=1}^{T}\left(\log h_t^2 + \frac{r_t^2}{h_t^2}\right)$$
$$\text{s.t.} \quad h_t = \left(\alpha_0\,\frac{1 - \beta_1^{\,t-1}}{1 - \beta_1} + \alpha_1 \sum_{i=1}^{t-1} \beta_1^{\,i-1}\, r_{t-i}^2\right)^{1/2}, \qquad t = 1, \dots, T$$
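A minimal sketch of this estimation in Python, using the recursion (2.6) for $h_t^2$ instead of the truncated sum in the constraint, and an off-the-shelf optimizer; the starting values and bounds are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

def garch11_negloglik(theta, r):
    """Negative log-likelihood (up to constants): sum of log(h_t^2) + r_t^2 / h_t^2."""
    alpha0, alpha1, beta1 = theta
    h2 = np.var(r)                       # initialize at the sample variance
    nll = 0.0
    for rt in r:
        nll += np.log(h2) + rt**2 / h2   # contribution of r_t given h_t^2
        h2 = alpha0 + alpha1 * rt**2 + beta1 * h2
    return nll

def fit_garch11(r):
    x0 = np.array([0.1 * np.var(r), 0.05, 0.90])
    bounds = [(1e-12, None), (0.0, 1.0), (0.0, 1.0)]
    res = minimize(garch11_negloglik, x0, args=(r,), bounds=bounds, method="L-BFGS-B")
    return res.x                         # (alpha0_hat, alpha1_hat, beta1_hat)
```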

2.2.4 Realized Volatility

CHMs model the asset volatility as an (unobserved) state of the return stochastic process. Once we have an estimate of the volatility of returns at time t, the rest is trivial. An alternative route would be to estimate the volatility directly from the data, for example with a simple moving-window estimator of the empirical volatility. This approach would not work if the epochs for which we need the estimates are days, and we only have daily data. In recent years, tick-level price data have become widely available; indeed, order-book-level data are also available (with the entire process of order arrivals, fills, and cancellations). It is now possible to compute 1-minute returns, enabling us to estimate the volatility of returns for daily predictions by using these high-frequency data. Below we review some of the statistical properties of realized volatility measurements. The starting point is Equation (2.1), i.e., a diffusion process for the log price $p(t) = \log P(t)$:

$$dp = \alpha\, dt + \sigma\, dW$$

where $W(t)$ is a Brownian process, and $\alpha \in \mathbb{R}$ (the drift) and $\sigma > 0$ (the volatility) are constants. In all applications of interest, the drift is much smaller than the volatility: $|\alpha| \ll \sigma$. The quantity $\alpha/\sigma$ is termed the (daily) Sharpe Ratio and will figure prominently in the rest of the book.⁵ We observe the process in the interval $[0, 1]$ and measure the state variable $p$ at intervals of length $1/n$. The measured return is $r(j) := p(j/n) - p((j-1)/n)$. Clearly, the $r(j)$ are iid random variables, and $r(j) \sim N(\alpha/n, \sigma^2/n)$. The maximum likelihood estimators for the drift and the variance are

$$\hat\alpha = \sum_j r(j) = p(1) - p(0)$$
$$\hat\sigma_1^2 = \sum_j \big[r(j) - \hat\alpha/n\big]^2$$

We also consider the uncentered estimator of the variance:

$$\hat\sigma_2^2 = \sum_j r^2(j) \qquad (2.11)$$

The first remarkable phenomenon is that the MLE estimator for the drift does not depend on the number of intervals n. Moreover, one can show that $\mathrm{var}(\hat\alpha) = \mathrm{var}(p(1) - p(0))$, and $p(1) - p(0) \sim N(\alpha, \sigma^2)$, so that $\mathrm{var}(\hat\alpha) = \sigma^2$. The estimation error does not depend on the number of intervals either. To estimate the variance of $\hat\sigma_1^2$ and $\hat\sigma_2^2$ we need a few formulas. The moments of $r(j)$ are those of a Gaussian random variable with mean $\alpha/n$ and variance $\sigma^2/n$:

$$E[r(j)] = \frac{\alpha}{n} \qquad (2.12)$$
$$E[r^2(j)] = \left(\frac{\alpha}{n}\right)^2 + \frac{\sigma^2}{n} \qquad (2.13)$$
$$E[r^4(j)] = \left(\frac{\alpha}{n}\right)^4 + 6\left(\frac{\alpha}{n}\right)^2 \frac{\sigma^2}{n} + 3\left(\frac{\sigma^2}{n}\right)^2 \qquad (2.14)$$

so that

$$\mathrm{var}(r^2(j)) = 2\left(\frac{\sigma^2}{n}\right)^2 + 4\left(\frac{\alpha}{n}\right)^2 \frac{\sigma^2}{n} \qquad (2.15)$$

and

$$E(\hat\sigma_2^2) = \sigma^2 + \frac{\alpha^2}{n} \qquad \text{from Equation (2.13)}$$
$$\mathrm{var}(\hat\sigma_2^2) = 2\,\frac{\sigma^4}{n} + 4\sigma^2\left(\frac{\alpha}{n}\right)^2 \qquad \text{from Equation (2.15)}$$

⁵This is the Sharpe Ratio of log returns, which is to a first approximation close to the daily Sharpe Ratio computed on returns.
The estimator $\hat\sigma_2^2$ has a small finite-sample bias and is asymptotically consistent.

Let us reflect on the steps we took. We discretized the interval over which the price process occurs into n subintervals, and retained only the last price within each interval of length 1/n, assuming the price had no measurement error. We saw that the drift estimator is unbiased, but its variance does not depend on the discretization: we have more estimates of the drift, but they are noisier. Unfortunately, there is no easy way to measure the drift, i.e., the expected return, of a security; otherwise, all Statistics undergraduates would be rich. Conversely, we have identified an uncentered estimator of the true variance $\sigma^2$. As the number n of intervals approaches infinity, the estimator is unbiased. Its variance decreases like $2\sigma^4/n$, which is good news: in principle, we can estimate the volatility of the returns at time t to arbitrary accuracy; and provided that the true volatility varies very little over time, we can use this estimate to predict the variance at time t+1. The good news here is that if you need volatility estimates over a long time scale for your decisions (e.g., days), but have data over a shorter time scale (e.g., minutes), you do not have to devise a generative model like CHMs or others. What assumptions do not hold in this line of reasoning? Here is a list of issues to consider:

1. We ignored market microstructure. One source of noise is the bid-ask spread (Harris, 2003). When the seller initiates the transaction, she receives the bid price; when the buyer initiates it, he pays the ask price. There is an intrinsic error in the measurement of the price, which is approximately equal to half the bid-ask spread. Measured log prices in interval t are $p_t + \epsilon_t$, where the noise terms $\epsilon_t$ are iid random variables, independent of the length of the measurement interval.

2. Another form of microstructure noise arises with thinly traded securities. If a stock trades less than once every five minutes on average, then using one-minute intervals is probably not a good modeling choice.

3. We assumed that volatility changes slowly, or ideally is constant. This is not the case in practice. One approach is to impose a model on the time series of realized variances, so that we can produce an error estimate, e.g., a simple AR(1) model $\hat\sigma(t+1) = a + b\,\hat\sigma(t) + \tau\,\epsilon(t+1)$, with $\epsilon(t+1) \sim N(0, 1)$.

4. We ignored the distinction between the open-to-close and close-to-open intervals. Close-to-open returns are often fundamentally driven. Also, we are ignoring the large volatility and bid-ask spreads in the first minutes of the trading day.

For the rest of us, the question is: what to choose? Liu et al. (2015) compare a broad set of estimators, with several choices of parameters, for assets in different asset classes (equities, futures, indices). They use Romano and Wolf's procedure for multiple comparison (Romano and Wolf, 2005) and Hansen et al.'s "model confidence set" (Hansen et al., 2011). They find that the vanilla RV at 5-minute intervals performs competitively across various assets and asset classes. There are a few cases where this is not true. When higher-frequency measurements are available, this estimator is outperformed by a one-minute subsampled RV, and by 1- and 5-second interval realized kernels. In addition, at lower frequencies, 5- and 15-minute truncated RV (Mancini, 2009, 2011) also outperform vanilla RV. However, where available, 5-minute nonoverlapping intervals seem to be a reasonable choice.
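As a minimal sketch, here is the vanilla realized-variance estimator of Equation (2.11) applied to one day of simulated 1-minute log prices, sampled at 5-minute intervals (the price path and volatility are illustrative):

```python
import numpy as np

def realized_variance(log_prices, step):
    """Vanilla RV: sum of squared log returns, sampled every `step` observations."""
    sampled = log_prices[::step]
    returns = np.diff(sampled)
    return np.sum(returns**2)

# Simulated 1-minute log prices for one trading day (390 minutes), true daily vol = 2%.
rng = np.random.default_rng(0)
sigma_daily = 0.02
p = np.cumsum(sigma_daily / np.sqrt(390) * rng.standard_normal(391))

rv_5min = realized_variance(p, step=5)
print(np.sqrt(rv_5min), sigma_daily)   # realized vol vs. true daily vol
```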

2.2.5 Combining CHM and Realized Volatility

Is it possible to have the best of both worlds, GARCH models and realized volatility? Hansen et al. (2012) present a model, RealGARCH(1,1), that combines both:

$$r_t = h_t\,\epsilon_t$$
$$h_t^2 = \alpha_0 + \beta_1 h_{t-1}^2 + \gamma\, x_{t-1}$$
$$x_t = \xi + \varphi\, h_t^2 + u_t \qquad (2.16)$$

The first two equations are similar to the standard GARCH(1,1) model, with one difference: the term proportional to $r_{t-1}^2$ has been replaced by a term proportional to $x_{t-1}$. This variable is the observed estimate of the realized variance at time t; when this estimator is more accurate than the rough variance estimate $r_{t-1}^2$, the model will probably outperform GARCH(1,1). The last equation, (2.16), models the dynamic behavior of the realized variance. It posits a linear dependence on $h_t^2$ and on a stochastic term $u_t$. The random variables $u_1, \dots, u_t$ are iid, not necessarily with zero mean.

2.3 State-Space Estimation of Variance

2.3.1 Muth's Original Model: EWMA

A very popular estimator of the expected value of a time series $\{x_s\}$, based on data up to time t, is the exponentially weighted moving average (or EWMA). It takes the form

$$\hat x_t = (1 - K)\sum_{s=0}^{\infty} K^s\, x_{t-s}$$

for some $0 < K < 1$. We discount the past by giving its observations exponentially decreasing weights, which makes sense, and even more so when we write the estimate as a recursion:

$$\hat x_t = (1 - K)\,x_t + K\,\hat x_{t-1}$$

A low value of K forgets the past faster. The formula is computationally efficient both in terms of storage and computation. For uncentered variance estimation of a return, this takes the form

$$\hat\sigma_t^2 = (1 - K)\,r_t^2 + K\,\hat\sigma_{t-1}^2 \qquad (2.17)$$

Insight 2.1: GARCH are EWMA with an offset

Recall Equation (2.9):

$$h_t^2 = \frac{\alpha_0}{1 - \beta_1} + \alpha_1 \sum_{i=1}^{\infty} \beta_1^{\,i-1}\, r_{t-i}^2$$

This is, save for an offset, very similar to Equation (2.17), written in expanded form:

$$\hat\sigma_t^2 = (1 - K)\sum_{i=1}^{\infty} K^{i-1}\, r_{t-i}^2$$
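A minimal sketch of the recursion (2.17); K = 0.94 is the classic RiskMetrics decay for daily data, used here only as an illustrative default:

```python
import numpy as np

def ewma_variance(returns, K=0.94, init=None):
    """Recursive EWMA of squared returns: sigma2_t = (1-K)*r_t**2 + K*sigma2_{t-1}."""
    sigma2 = np.var(returns) if init is None else init
    out = np.empty(len(returns))
    for t, r in enumerate(returns):
        sigma2 = (1.0 - K) * r**2 + K * sigma2
        out[t] = sigma2
    return out
```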

In academic journals, EWMA receives relatively little attention compared to GARCH models (for a rare example, see Ding and Meade (2010)); among practitioners, including major commercial risk model providers like RiskMetrics, Barra, and Axioma, it is the other way around. Aside from these practical considerations, is it possible to motivate the approach based on a model? We devote this section to understanding and extending this simple formula.

We will employ linear state-space models and Kalman Filters, which are briefly covered in the Appendix, Section 14.3. Rather than giving a general treatment and then specializing to a specific model, we will jump right in the middle with a relevant example. As it happens, this example is also the simplest non-trivial example of a state-space model. The model (Muth, 1960) posits that there is a scalar state $x_t$ that evolves randomly over time by the addition of a Gaussian disturbance to its previous value. We observe the state imperfectly; the observation $y_t$ is a noisy measurement of $x_t$. In formulas:

$$x_{t+1} = x_t + \tau_\epsilon\,\epsilon_{t+1}$$
$$y_{t+1} = x_{t+1} + \tau_\eta\,\eta_{t+1}$$
$$\epsilon_t \sim N(0, 1)$$
$$\eta_t \sim N(0, 1)$$

The innovations and the measurement noises are Gaussian with mean zero, and they are independent of each other: $\epsilon_s \perp \epsilon_t$, $\eta_s \perp \eta_t$ for all $s \ne t$, and $\epsilon_s \perp \eta_t$ for all $t$ and $s$. I skipped the derivation, which the interested reader can find in the Appendix. Define the ratio of measurement to innovation noise $\kappa := \tau_\eta/\tau_\epsilon$. The stationary variance $\hat\sigma^2_{t+1|t}$ of the state estimate is given by

$$\hat\sigma^2_{t+1|t} = \tau_\epsilon^2\,\frac{1 + \sqrt{(2\kappa)^2 + 1}}{2}$$

and the optimal estimation recursion is

$$K := \frac{\hat\sigma^2_{t+1|t}}{\hat\sigma^2_{t+1|t} + \tau_\eta^2}$$
$$\hat x_{t+1|t} = (1 - K)\,\hat x_{t|t-1} + K\,y_t$$

For $\kappa \gg 1$ the formula simplifies:

$$\hat x_{t|t} = \frac{\kappa}{1 + \kappa}\,\hat x_{t|t-1} + \frac{1}{1 + \kappa}\,y_t$$

This is an exponentially weighted average with a simple interpretation. Imagine that the state does not change at all. Then we want to use all the history we can, since old observations and new ones are drawn from the same distribution. The half-life of the EWMA is indeed long. Conversely, when the state changes at a rapid pace, i.e., $\kappa \simeq 0$, then we want to discount the past very aggressively. According to Muth's original model applied to volatility estimation, the state is the instantaneous variance, and the observation $y_t$ is $r_t^2$, which is equal to $\sigma_t^2$ in expectation.
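A minimal sketch of the resulting steady-state filter applied to squared returns; the noise ratio κ is an arbitrary illustrative value:

```python
import numpy as np

def muth_gain(kappa):
    """Steady-state Kalman gain for the random-walk-plus-noise model, kappa = tau_eta / tau_eps."""
    sigma2 = (1.0 + np.sqrt((2.0 * kappa) ** 2 + 1.0)) / 2.0     # in units of tau_eps**2
    return sigma2 / (sigma2 + kappa**2)

def filter_variance(returns, kappa=5.0):
    """EWMA of squared returns with the gain implied by the Muth model."""
    K = muth_gain(kappa)
    x_hat = returns[0] ** 2
    estimates = []
    for r in returns:
        estimates.append(x_hat)                   # prediction made before observing r_t
        x_hat = (1.0 - K) * x_hat + K * r**2      # update with the observation y_t = r_t**2
    return np.array(estimates)
```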
***
Figure 2.3: Relationship between K and $\kappa := \tau_\eta/\tau_\epsilon$.

The model has obvious shortcomings. If returns are normally distributed, then the observation error is not normally distributed. More importantly, the model allows for negative values of the variance, and additionally models the variance evolution as the sum of iid innovations. Over time, the distribution of the variance becomes more and more spread out: the standard deviation of the distribution grows as the square root of the number of periods. In practice, however, volatility appears to revert to a long-term average.

We cannot directly address the first problem. Kalman filters can work well with non-normal innovations and measurement errors, provided that these are not too heavy-tailed. As for the other shortcomings, we can refine the model to accommodate them. For example, we can introduce a mean-reverting model of variance, so that it behaves like an autoregressive process. We extend slightly the state equation by adding a mean-reversion term:

$$x_{t+1} = x_t - \gamma(x_t - \mu) + \tau_\epsilon\,\epsilon_{t+1}$$

The state reverts to the value $\mu$ when it is away from this equilibrium value. The stationary distribution of $x_t$ is Gaussian, with expected value equal to $\mu$ and standard deviation equal to $\tau_\epsilon/\sqrt{2\gamma - \gamma^2}$. The optimal variance estimator is still

$$\hat x_{t+1|t} = (1 - K)\,\hat x_{t|t-1} + K\,y_t$$

However, compared to the first model, the value of K when $\gamma > 0$ is smaller. Otherwise stated, the mean-reversion term makes the distribution of the true variance more concentrated around its long-term mean. This means that we discount the past less. The detailed derivation of these formulas is in the Appendix, Section 14.3.1.

2.3.2 The Harvey-Shephard Model?

As a final example of the flexibility that linear state-space models can offer, I present the model by Harvey and Shephard (1996), which has several desirable features: it has a closed-form solution; the volatility is by design positive, and the distribution of the volatility itself is log-normal, hence right-skewed, as we would expect; and the stock returns are locally lognormal.

The generating process for returns $r_t$ is assumed to be

$$r_t = e^{\mu + \exp(x_t/2)\,\xi_t} - 1 \qquad (2.18)$$

where $\mu$ is a known constant, and $\xi_t \sim N(0, 1)$; hence gross returns $1 + r_t$ are, at any point in time, lognormally distributed. Define

$$u_t := \log(1 + r_t) - \mu$$
$$\Rightarrow\; u_t = \exp(x_t/2)\,\xi_t$$
$$\Rightarrow\; \log u_t^2 = x_t + \log \xi_t^2 = x_t + \eta_t + c$$

where $c := E(\log \xi_t^2) \simeq -1.27$, and $\eta_t$ is a zero-mean random variable with standard deviation $\mathrm{stdev}(\log \xi_t^2) \simeq 2.22$. Define

$$y_t := \log u_t^2 - c = \log\big[(\log(1 + r_t) - \mu)^2\big] - c$$

so that we get an observation equation:

$$y_t = x_t + \eta_t$$

Now, we posit an evolution equation for $x_t$:

$$x_{t+1} = b + a x_t + \epsilon_t$$

This is the same AR(1) model as Example 2 in the Appendix, from which we obtain an estimate $\hat x_t$. If $\mu = 0$, then the formulas take a simple form: $u_t \simeq r_t$ and the state estimate is given by

$$\hat x_{t+1|t} = (1 - K)\,\hat x_{t|t-1} + K\big[\log[(\log(1 + r_t))^2] - c\big]$$

Since $R_t = \exp(\exp(x_t/2)\,\xi_t)$ is a lognormal random variable, the estimated standard deviation of $R_t$ is

$$\hat\sigma_{t+1|t} = \sqrt{\big(e^{\exp(\hat x_{t+1|t})} - 1\big)\,e^{\exp(\hat x_{t+1|t})}}$$

A simplified Harvey-Shephard model starts with Equation (2.18), to which it applies the first-order approximation $e^x - 1 \simeq x$ and the parameter $\mu = 0$:

$$r_t = \exp(x_t/2)\,\xi_t$$

Define

$$\log r_t^2 = x_t + \log \xi_t^2 = x_t + \eta_t + c$$

where $c$ and $\eta_t$ are defined as for the Harvey-Shephard model above. The model is completed by the equations, also from the original model,

$$x_{t+1} = b + a x_t + \epsilon_t$$
$$y_t = \log r_t^2 - c$$

The state estimate and volatility estimates are

$$\hat x_{t+1|t} = (1 - K)\,\hat x_{t|t-1} + K\big[\log r_t^2 - c\big]$$
$$\hat\sigma_{t+1|t} = e^{\hat x_{t+1|t}/2}$$
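A minimal sketch of this simplified recursion; the gain K is an arbitrary illustrative value, and the offset uses c ≈ -1.27 as in the text:

```python
import numpy as np

C = -1.27   # E[log(xi_t**2)] for xi_t ~ N(0, 1)

def hs_volatility(returns, K=0.06):
    """Simplified Harvey-Shephard filter: EWMA of log(r_t^2) - c, exponentiated to a vol estimate."""
    x_hat = np.log(np.var(returns))          # initialize the log-variance state
    vols = []
    for r in returns:
        vols.append(np.exp(x_hat / 2.0))     # volatility prediction for the current period
        y = np.log(r**2) - C                 # observation (undefined if r == 0)
        x_hat = (1.0 - K) * x_hat + K * y
    return np.array(vols)
```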

2.4 Further Reading

Definitions of returns, log returns, and dividend-adjusted returns are in Ruppert and Matteson (2015) and Connor et al. (2010). A slightly dated reference on stylized facts for asset returns is Cont (2001); more recent ones are Taylor (2007); Zivot (2009); Ratliff-Crain et al. (2023). Gaussian bounds like Equation (2.2) are classic results; see Wasserman (2004). Vershynin (2018) on high-dimensional probability is a comprehensive reference for various finite-sample bounds. The literature on GARCH models alone is immense; Tsay (2010); Zivot and Wang (2003); Cižek et al. (2011); Ruppert and Matteson (2015); Lütkepohl (2005) are standard references, and surveys are Andersen et al. (2006, 2013). The handbook Andersen et al. (2009) has dedicated chapters covering univariate GARCH (Teräsvirta, 2009a), multivariate GARCH (Teräsvirta, 2009b), moments of GARCH models (Lindner, 2009), and their detailed extremal properties (Davis and Mikosch, 2009). For a recent empirical paper on the performance of GARCH, TARCH, EGARCH and a few other models, see Hansen and Lunde (2005); Brownlees et al. (2011).

An early analysis of the "vanilla" Realized Variance estimator is Barndorff-Nielsen and Shephard (2002), and a survey is Andersen and Benzoni (2009). Also useful are the surveys of Andersen et al. (2006, 2013), which situate realized volatility in the context of risk management techniques. Essential readings on realized volatility estimators are Zhang et al. (2005), which presents several estimators and introduces the idea of subsampling for RV; the series of papers Barndorff-Nielsen et al. (2008, 2009) on kernel-based estimators; and the empirical paper by Liu et al. (2015), comparing several estimators, which includes subsampling and kernels. This list of estimators is not exhaustive. For example, Hansen and Lunde (2006b) analyze an autocorrelation-adjusted estimator introduced in French et al. (1987). Bipower estimators are studied by Podolskij and Vetter (2009), and maximum likelihood ones by Aït-Sahalia et al. (2005). Moreover, these estimators depend on several parameters, like sampling and subsampling intervals, or the choice of kernel.

The convergence properties of Random Recursive Equations (RREs) were first studied by Kesten (1973); Diaconis and Freedman (1999) survey the general recursive equations $x_t = f(x_{t-1}, \epsilon_{t+1})$, where $(\epsilon_t)_{t=1}^{\infty}$ is an iid random sequence, of which RREs are a special case. A monograph on RREs, covering both the univariate and multivariate case, is Buraczewski et al. (2016).

There are, to a first approximation, infinite references on Kalman filtering. Aside from the original Kalman (1960); Kalman and Bucy (1961), more modern textbook treatments are Harvey (1990); Simon (2006); Whittle (1996).

Roll introduced his model in Roll (1984), and a detailed discussion of the Roll model and its extensions is in Hasbrouck (2007).

2.5 Exercises

Exercise 2.1 (Portfolio Covariances).

1. (5) Prove that, if the vector of net returns $r$ is Gaussian with covariance matrix $\Omega_r$, then the net return of portfolio $w$ has variance $w'\Omega_r w$.

2. (10) Generalize this result. Let $x$ be a random vector taking values in $\mathbb{R}^n$ with covariance matrix $\Omega$. Let $A$ be an $m \times n$ matrix. Prove that the covariance matrix of the random vector $Ax$ is $A\Omega A'$.

3. (10) Say that a random vector $x$ follows a multivariate normal distribution with covariance matrix $\Omega$. Let the Singular Value Decomposition of $\Omega$ be $U\Lambda U'$, and define

$$\Omega^{1/2} := U \begin{pmatrix} \lambda_1^{1/2} & 0 & \dots & 0 \\ 0 & \lambda_2^{1/2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_n^{1/2} \end{pmatrix} U'$$

Let $\xi$ be a Gaussian random vector with unit covariance matrix. Prove that $\Omega^{1/2}\xi$ has covariance $\Omega$.

Exercise 2.2. (15) Prove that $\left(\prod_{t=1}^{T}(1 + r(t))\right)^{1/T} - 1 \le T^{-1}\sum_{t=1}^{T} r(t)$.

Exercise 2.3. (20) Provide an example of two random variables that are uncorrelated but dependent.

Exercise 2.4. (25) Provide a second example, employing an entirely different rationale for the lack of correlation from the first one.

Exercise 2.5. (30) Let $X, Y$ be two random variables taking values in $\mathbb{R}^+$. Show that $\mathrm{cor}(X^2, Y^2) > 0$ if and only if $\mathrm{cor}(X, Y) > 0$.

Exercise 2.6. (15) Derive the formula for $E(h_1^2)$ from Equation (2.10).

Exercise 2.7. (10) Prove that if $E(h_1^2)$ is finite, i.e. $\alpha_1 + \beta_1 < 1$, then a stationary distribution exists, i.e. $E[\log(\beta_1 + \alpha_1\epsilon_0^2)] < 0$. (Hint: use Jensen's inequality.)
The Takeaways

1. Security returns exhibit heavy tails and low autocorrelation, but high autocorrelation of absolute or squared returns; the distribution of log returns becomes closer to Gaussian over longer time intervals.

2. GARCH models capture most properties of returns, and can be used to estimate volatility.

3. The GARCH volatility estimates are exponentially weighted averages of non-iid squared returns.

4. State-space models can be used to model a variety of volatility dynamics.

5. They will appear again in factor models.


and let $w(\hat\Omega_r)$ be its solution. Denote by $\mathrm{var}(w(\hat\Omega_r), \Omega_r)$ the realized variance of the portfolio.

The realized volatility of portfolio $w(\hat\Omega_r)$ is greater than that of $w(\Omega_r)$, and the two are identical if and only if $\Omega_r \propto \hat\Omega_r$.

Proof. The solution of Problem (14.3) is $w(\hat\Omega_r) = (b'\hat\Omega_r^{-1}b)^{-1}\hat\Omega_r^{-1}b$. The ratio between the realized variance of the portfolio constructed on $\hat\Omega_r$ and that of the portfolio constructed on $\Omega_r$ is

$$\frac{\mathrm{var}(w(\hat\Omega_r), \Omega_r)}{\mathrm{var}(w(\Omega_r), \Omega_r)} = \frac{b'\hat\Omega_r^{-1}\Omega_r\hat\Omega_r^{-1}b}{b'\hat\Omega_r^{-1}b}\cdot\frac{b'\Omega_r^{-1}b}{b'\hat\Omega_r^{-1}b}$$

One can verify directly that if $\hat\Omega_r \propto \Omega_r$ the ratio is one. Let $\hat\Omega_r = \hat U\hat S\hat U'$ and $\Omega_r = U S U'$. Let $x := \hat S^{-1/2}\hat U' b$ and $H := \hat S^{1/2}\hat U' U S^{-1} U'\hat U\hat S^{1/2}$. Then we rewrite the variance ratio as

$$\frac{\mathrm{var}(w(\hat\Omega_r), \Omega_r)}{\mathrm{var}(w(\Omega_r), \Omega_r)} = \frac{x'\hat S^{-1/2}\hat U' U S U'\hat U\hat S^{-1/2}\,x}{\|x\|^2}\cdot\frac{x'\hat S^{1/2}\hat U' U S^{-1} U'\hat U\hat S^{1/2}\,x}{\|x\|^2} = \frac{x' H^{-1} x}{\|x\|^2}\cdot\frac{x' H x}{\|x\|^2}$$

Consider now the SVD $H = VDV'$ and define $y := V'x$. We have

$$\frac{\mathrm{var}(w(\hat\Omega_r), \Omega_r)}{\mathrm{var}(w(\Omega_r), \Omega_r)} = \left(\sum_i \frac{y_i^2}{\sum_j y_j^2}\, d_i\right)\left(\sum_i \frac{y_i^2}{\sum_j y_j^2}\, d_i^{-1}\right)$$

The term on the right-hand side can be interpreted as $E(\xi)\,E(1/\xi)$, where $\xi$ is a random variable taking value $d_i$ in state $i$ with probability $p_i := y_i^2/\sum_j y_j^2$. By Jensen's inequality, $E(1/\xi) \ge 1/E(\xi)$, and the result follows.

14.3 The Kalman Filter

This section contains a short treatment of the Kalman Filter (KF). The Kalman Filter predates Kalman's original articles of the early 1960s (Kalman, 1960; Kalman and Bucy, 1961). At the time of their publication, computers had become available that made the calculations feasible in real time. This made the (re)discovery of the filter by Kalman very timely. Rockets used by the Apollo program contained implementations of the Kalman Filter in 2KB of RAM. Since the 60s, the topic of linear control and filtering has flourished. Thousands of papers have been written on it, and there are several monographs covering the Kalman Filter in detail from different perspectives: control (Simon, 2006), statistical (Harvey, 1990), econometric (Hansen and Sargent, 2008). I cover the KF for two reasons. First, because, for somewhat mysterious reasons, the derivation of the KF is often more complicated than it should be. A rigorous yet, I hope, intuitive proof essentially fits in half a page and should save the reader a few hours. Secondly, I wanted to present the problem under two different lenses, and show its close connection to the Linear Quadratic Regulator (LQR). Both problems are essential tools in the arsenal of the quantitative finance researcher, so there is value in catching two birds with one stone.²

We need the following elementary fact. Let $Z := [x, y]'$ be a multivariate normal random vector with mean and covariance matrix

$$\mu_Z := \begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix} \qquad \mathrm{cov}(Z) = \begin{bmatrix}\Sigma_{x,x} & \Sigma_{x,y} \\ \Sigma_{y,x} & \Sigma_{y,y}\end{bmatrix}$$

The random vector $x$, conditional on $y = b$, is still normally distributed, with conditional mean and covariance matrix equal to

$$E(x \mid y = b) = \mu_x + \Sigma_{x,y}\Sigma_{y,y}^{-1}(b - \mu_y)$$
$$\mathrm{cov}(x \mid y = b) = \Sigma_{x,x} - \Sigma_{x,y}\Sigma_{y,y}^{-1}\Sigma_{y,x}$$

This can be verified directly by integration.


Our model has two components. The first is a state, represented by a random
vector xt . This vector follows a simple evolution rule: xt+1 = Axt + ✏t+1 .
The vector ✏t is random, serially independent, and distributed according to
a multivariate normal distribution. The state is not observable directly; the
only thing we know is its probability distribution at time 1. We assume it is
normal with known mean and covariance matrix. In addition, over time we
observe is a vector yt , which is a linear transformation of xt , corrupted by noise:
yt+1 = Bxt+1 + ⌘ t+1 . Note the similarity with the factor model equation:

state $ factor return


observation $ asset return

What is di↵erent is that factors returns are usually not modeled as being serially
dependent.
2 However, should you catch birds, please don’t use stones, but nets, or food.
The vector $\eta_t$ is random, serially independent, independent of $(\epsilon_t)_{t=1}^{\infty}$, and distributed according to a multivariate normal distribution. Summing up, the distributions of $x_1, \epsilon_t, \eta_t$ are given by

$$x_1 \sim N(\hat x_0, \hat\Sigma_0)$$
$$\epsilon_t \sim N(0, \Sigma_\epsilon) \qquad \epsilon_t \perp \epsilon_s,\; \epsilon_t \perp \eta_{s+1},\; s \le t$$
$$\eta_t \sim N(0, \Sigma_\eta) \qquad \eta_t \perp \eta_s,\; \eta_t \perp \epsilon_{s+1},\; s \le t$$

And the Linear State Space Model is given by

$$x_{t+1} = Ax_t + \epsilon_{t+1} \qquad (14.4)$$
$$y_{t+1} = Bx_{t+1} + \eta_{t+1} \qquad (14.5)$$

I denote by $\hat x_{t|t-1}, \hat\Sigma_{t|t-1}$ the conditional estimates for the mean and covariance matrix of the state $x_t$, based on the information $y_0, \dots, y_{t-1}$; and by $\hat x_{t|t}, \hat\Sigma_{t|t}$ the estimates based on information $y_0, \dots, y_t$.

The vector $Z_t$ is defined as the combination of state and observation:

$$Z_t := \begin{bmatrix}x_t \\ y_t\end{bmatrix}$$

Based on information up to time $t-1$, the covariance of $Z_t$ is

$$\mathrm{cov}(Z_t) = \begin{bmatrix}\hat\Sigma_{t|t-1} & \hat\Sigma_{t|t-1}B' \\ B\hat\Sigma_{t|t-1} & B\hat\Sigma_{t|t-1}B' + \Sigma_\eta\end{bmatrix}$$

We observe $y_t$. The vector $x_t$ is normally distributed. The conditional covariance of $x_t$ given $y_t$ is

$$\hat\Sigma_{t|t} = \hat\Sigma_{t|t-1} - \hat\Sigma_{t|t-1}B'(B\hat\Sigma_{t|t-1}B' + \Sigma_\eta)^{-1}B\hat\Sigma_{t|t-1} \qquad \text{(update step)}$$
$$\phantom{\hat\Sigma_{t|t}} = \big[I - \hat\Sigma_{t|t-1}B'(B\hat\Sigma_{t|t-1}B' + \Sigma_\eta)^{-1}B\big]\hat\Sigma_{t|t-1} \qquad (14.6)$$
$$\hat x_{t|t} = \hat x_{t|t-1} + \hat\Sigma_{t|t-1}B'(B\hat\Sigma_{t|t-1}B' + \Sigma_\eta)^{-1}(y_t - B\hat x_{t|t-1}) \qquad (14.7)$$

Once we have the posterior distribution given the observation $y_t$, the conditional distribution of $x_{t+1}$ follows from Equation (14.4). $x_{t+1}$ is Gaussian with the following conditional mean and covariance matrix:

$$\hat\Sigma_{t+1|t} = A\hat\Sigma_{t|t}A' + \Sigma_\epsilon \qquad \text{(prediction step)} \quad (14.8)$$
$$\hat x_{t+1|t} = A\hat x_{t|t-1} + A\hat\Sigma_{t|t-1}B'(B\hat\Sigma_{t|t-1}B' + \Sigma_\eta)^{-1}(y_t - B\hat x_{t|t-1}) \qquad (14.9)$$

The measurement and time update equations above are the whole of the Kalman
Filter. If we combine Equations (14.6) and (14.8), the covariance matrix evolves
according to the equation:

\hat{\Sigma}_{t+1|t} = A \left( \hat{\Sigma}_{t|t-1} - \hat{\Sigma}_{t|t-1} B' (B \hat{\Sigma}_{t|t-1} B' + \Sigma_\eta)^{-1} B \hat{\Sigma}_{t|t-1} \right) A' + \Sigma_\epsilon

This is called a Riccati recursion. In steady state the covariance matrix does not
change in consecutive periods: \hat{\Sigma}_{t+1|t} = \hat{\Sigma}_{t|t-1}. We can solve for the stationary
matrix:

X = A X A' - A X B' (B X B' + \Sigma_\eta)^{-1} B X A' + \Sigma_\epsilon

This is a discrete-time algebraic Riccati equation.


The matrix

K_t := \hat{\Sigma}_{t|t-1} B' (B \hat{\Sigma}_{t|t-1} B' + \Sigma_\eta)^{-1}

is called the optimal Kalman gain. The equations become

\hat{\Sigma}_{t|t} = [I - K_t B] \hat{\Sigma}_{t|t-1}     (14.10)
\hat{x}_{t|t} = (I - K_t B) \hat{x}_{t|t-1} + K_t y_t     (14.11)
\hat{\Sigma}_{t+1|t} = A \hat{\Sigma}_{t|t} A' + \Sigma_\epsilon     (14.12)
\hat{x}_{t+1|t} = A \hat{x}_{t|t}     (14.13)
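The recursion (14.10)-(14.13) maps directly to code; here is a minimal sketch in Python/numpy. The function name and interface are my own choices, not from the text.

```python
import numpy as np

def kalman_filter(y, A, B, S_eps, S_eta, x0, P0):
    """Run the Kalman Filter on observations y[0], ..., y[T-1].

    Returns the one-step-ahead predictions and their covariances,
    following Equations (14.10)-(14.13).
    """
    x_pred, P_pred = x0.copy(), P0.copy()
    preds, covs = [], []
    for yt in y:
        preds.append(x_pred.copy())
        covs.append(P_pred.copy())
        # Measurement update: optimal gain, then Equations (14.10)-(14.11).
        S = B @ P_pred @ B.T + S_eta
        K = P_pred @ B.T @ np.linalg.inv(S)
        x_filt = (np.eye(len(x_pred)) - K @ B) @ x_pred + K @ yt
        P_filt = (np.eye(len(x_pred)) - K @ B) @ P_pred
        # Time update: Equations (14.12)-(14.13).
        x_pred = A @ x_filt
        P_pred = A @ P_filt @ A.T + S_eps
    return np.array(preds), np.array(covs)
```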

14.3.1 Examples
Example 1 (Muth, 1960):

x_{t+1} = x_t + \tau_\epsilon \epsilon_{t+1}     (14.14)
y_{t+1} = x_{t+1} + \tau_\eta \eta_{t+1}     (14.15)

The stationary \hat{\sigma}^2_{t+1|t} is given by the solution to the Riccati equation:

\frac{\hat{\sigma}^4_{t+1|t}}{\hat{\sigma}^2_{t+1|t} + \tau_\eta^2} = \tau_\epsilon^2
\quad \Rightarrow \quad
\hat{\sigma}^2_{t+1|t} = \frac{\tau_\epsilon^2}{2} \left( 1 + \sqrt{(2\kappa)^2 + 1} \right)

K = \frac{\hat{\sigma}^2_{t+1|t}}{\hat{\sigma}^2_{t+1|t} + \tau_\eta^2}
\hat{x}_{t+1|t} = (1 - K) \hat{x}_{t|t-1} + K y_t

where we have introduced the parameter

\kappa := \frac{\tau_\eta}{\tau_\epsilon}

Loosely, this is a noise-to-signal ratio. It is high when the measurement error
is high compared to the typical change of the state per period. For \kappa \gg 1 the
formula simplifies: K \simeq 1/(\kappa + 1), and

\hat{x}_{t|t} = \frac{\kappa}{1 + \kappa} \hat{x}_{t|t-1} + \frac{1}{1 + \kappa} y_t
Example 2 (AR(1) model): In this model, the state equation is

x_{t+1} = b + a x_t + \tau_\epsilon \epsilon_{t+1}     (14.16)

To have a mean-reverting process, introduce a long-term mean value \mu > 0 and
a relaxation constant \gamma > 0, and set

a := 1 - \gamma     (14.17)
b := \gamma \mu     (14.18)

Equation (14.16) becomes

x_{t+1} = x_t - \gamma (x_t - \mu) + \tau_\epsilon \epsilon_{t+1}

The state reverts to the value \mu when it is away from this equilibrium value. The
stationary distribution of x_t is Gaussian, with mean \mu and standard deviation
\tau_\epsilon / \sqrt{2\gamma - \gamma^2}.

Define:

u_t := x_t - \mu     (14.19)
v_t := y_t - \mu     (14.20)

We rewrite the state equation as

x_{t+1} - \mu = x_t - \mu + (a - 1)(x_t - \mu) + \tau_\epsilon \epsilon_{t+1}
u_{t+1} = u_t + (a - 1) u_t + \tau_\epsilon \epsilon_{t+1}
u_{t+1} = a u_t + \tau_\epsilon \epsilon_{t+1}

The state space equations are

u_{t+1} = a u_t + \tau_\epsilon \epsilon_{t+1}
v_{t+1} = u_{t+1} + \tau_\eta \eta_{t+1}

The Riccati equation is

(1 - a^2) \hat{\sigma}^2_{t+1|t} + \frac{a^2 \hat{\sigma}^4_{t+1|t}}{\hat{\sigma}^2_{t+1|t} + \tau_\eta^2} = \tau_\epsilon^2

\Rightarrow \hat{\sigma}^2_{t+1|t} = \frac{1}{2} \left[ (a^2 - 1)\tau_\eta^2 + \tau_\epsilon^2 + \sqrt{(a^2 - 1)^2 \tau_\eta^4 + \tau_\epsilon^4 + 2(a^2 + 1)\tau_\eta^2 \tau_\epsilon^2} \right]
= \frac{1}{2} \left[ (a^2 - 1)\tau_\eta^2 + \tau_\epsilon^2 + \sqrt{\left[ (a^2 - 1)\tau_\eta^2 + \tau_\epsilon^2 \right]^2 + 4 \tau_\eta^2 \tau_\epsilon^2} \right]
= \frac{\tau_\epsilon^2}{2} \left[ (a^2 - 1)\kappa^2 + 1 \right] \left[ 1 + \sqrt{1 + \left( \frac{2\kappa}{(a^2 - 1)\kappa^2 + 1} \right)^2} \right]

K = \frac{\hat{\sigma}^2_{t+1|t}}{\hat{\sigma}^2_{t+1|t} + \tau_\eta^2}
\hat{u}_{t+1|t} = (1 - K) \hat{u}_{t|t-1} + K v_t

Now replace u, v using Equations (14.19) and (14.20):

\Rightarrow \hat{x}_{t+1|t} = (1 - K) \hat{x}_{t|t-1} + K y_t
For a = 1 the formula is identical to that of Example 1; and it is straightforward
to verify that \hat{\sigma}^2_{t+1|t} is increasing in a, and consequently K is also increasing in
a. There are two insights to be drawn from this (a numerical check follows the list):

1. The EWMA is still an optimal estimator for a mean-reverting model of
volatility.

2. In the presence of mean reversion (a < 1), K decreases, everything else being
equal. We discount the past less, because mean reversion causes volatility
to be more concentrated. When the volatility is changing less from period
to period, past observations become more informative.
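As promised, here is a small numerical check (my own illustration, with arbitrary parameters): iterating the scalar Riccati recursion for several values of a shows the steady-state gain shrinking as mean reversion strengthens, i.e., as a moves away from 1.

```python
import numpy as np

def steady_state_gain(a, tau_eps, tau_eta, n_iter=2_000):
    """Iterate the scalar Riccati recursion to convergence and return K."""
    P = tau_eps**2
    for _ in range(n_iter):
        P = a**2 * (P - P**2 / (P + tau_eta**2)) + tau_eps**2
    return P / (P + tau_eta**2)

for a in (1.0, 0.99, 0.95, 0.8):
    print(a, round(steady_state_gain(a, tau_eps=0.1, tau_eta=1.0), 4))
# The gain decreases as a decreases, i.e., as mean reversion gets stronger.
```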
Example 3 (Harvey and Shephard, 1996): The generating process for gross
returns R_t = P_t / P_{t-1} is assumed to be

R_t = e^{\mu + \exp(h_t/2) \xi_t}

where \mu is a known constant, and \xi_t \sim N(0, 1). Define u_t = \log R_t - \mu. Then
u_t = \exp(h_t/2) \xi_t. Square u_t and take the logarithm to linearize the equation:

\log u_t^2 = h_t + \log \xi_t^2
          = h_t + \eta_t + \kappa

where \kappa := E(\log \xi_t^2) \simeq -1.2703, and \eta_t is a zero-mean random variable. Define

y_t = \log u_t^2 - \kappa
    = 2 \log |\log R_t - \mu| - \kappa

so that we get an observation equation:

y_t = h_t + \eta_t
Now, we posit an evolution equation for h_t:

h_{t+1} = b + a h_t + \epsilon_{t+1}

This is the same model as the AR(1) of Example 2, from which we obtain an
estimate \hat{h}_t. When \mu is negligible, the formulas take a simple form: u_t \simeq r_t and
the volatility estimate is given by

\hat{\sigma}_t \simeq \exp\left[ K \sum_{s=0}^{\infty} (1 - K)^s \left( \log |\log R_{t-1-s}| - \kappa/2 \right) \right]
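Putting the pieces together, a sketch of the resulting volatility filter might look as follows. Here \mu is taken to be negligible and the steady-state gain K is treated as an input; both are my own simplifications, and the function and parameter names are mine.

```python
import numpy as np

KAPPA = -1.2703   # E[log(xi^2)] for xi ~ N(0, 1)

def sv_filter(gross_returns, K):
    """EWMA filter for log-variance built on y_t = 2*log|log R_t| - kappa."""
    h = 0.0                                   # running estimate of h_t
    vols = []
    for R in gross_returns:
        y = 2.0 * np.log(np.abs(np.log(R))) - KAPPA
        h = (1 - K) * h + K * y               # steady-state Kalman update
        vols.append(np.exp(h / 2.0))          # sigma_t = exp(h_t / 2)
    return np.array(vols)

# Example on simulated data with constant volatility of 1% per period.
rng = np.random.default_rng(2)
R = np.exp(0.01 * rng.standard_normal(5_000))
print(sv_filter(R, K=0.05)[-1])               # should hover around 0.01
```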

14.4 Linear Regression


Linear models are by far the most important class of models in Statistics. There
are more books on the subject than citizens of the sovereign state of the Vatican³.
In fact, one could argue there is so much material on linear models that two
humans on planet Earth may have completely different interpretations of them. In order
to have some common ground, I will describe some less-well-known
aspects which will be needed later. Our setting is as follows. We are given
a pair (y, x), where y is a random variable taking values in R and x is a random
vector taking values in R^m. y and x are in general dependent random
variables: knowing the value of a realization of x tells us something about the
values of y, and this makes the problem infinitely interesting. Say that we want
to provide a forecast of y, which we denote \hat{y}(x). One way to select such a forecast
is to try to minimize a loss function; we should pay a price for being wrong. One
natural choice of loss is the quadratic loss: it is nonnegative; it is symmetric; it
is differentiable; and it penalizes more for large errors. The problem we face is

\min E[(\hat{y}(x) - y)^2 | x]     (14.21)

3 Not a joke: as of October 2017, the Vatican has 842 citizens; Amazon lists 1,392 books in the
"Probability and Statistics" section with "regression" in their title or subject, the vast majority of them
covering linear models.

One basic result in statistics and in control theory is that, if E(y^2) < \infty, the
function that minimizes this expectation is the conditional expectation of y
given x. We introduce a new variable \epsilon:

y = E(y|x) + \epsilon     (14.22)

It follows that E(\epsilon) = E(y) - E(E(y|x)) = E(y) - E(y) = 0. Then use the
following chain of (in)equalities:

E[(\hat{y}(x) - y)^2 | x] = E[(\hat{y}(x) - E(y|x) + E(y|x) - y)^2 | x]     (14.23)
  = E[\epsilon^2 | x] + E[(\hat{y}(x) - E(y|x))^2 | x] - 2 E[\epsilon | x] (\hat{y}(x) - E(y|x))     (14.24)
  = E[\epsilon^2 | x] + E[(\hat{y}(x) - E(y|x))^2 | x]     (14.25)
  \geq E[\epsilon^2 | x]     (14.26)

The equality holds only if \hat{y}(x) = E(y|x). The term E[\epsilon^2 | x] is finite, because

E(\epsilon^2) \leq 2 E(y^2) + 2 E[E(y|x)^2]     (14.27)
  \leq 2 E(y^2) + 2 E[E(y^2|x)]     (Jensen)     (14.28)
  = 4 E(y^2)     (Iterated Expectation)     (14.29)
  < \infty     (14.30)
In applications, we have n samples (y_i, x_i) and we choose a functional form for
\hat{y} = g(x, \theta), where \theta is a finite- or infinite-dimensional vector. We then minimize
the empirical squared loss \frac{1}{n} \sum_i (y_i - g(x_i, \theta))^2. The simplest form of g is linear:
g(x, \beta) = \sum_i \beta_i x_i. In matrix form, Equation (14.22) becomes

y = X\beta + \epsilon     (14.31)

where y \in R^n, X \in R^{n \times m}, \beta \in R^m; n is the number of observations, and m the
number of "features". We want to estimate the parameters \beta, and estimates for X\beta. We
then minimize the empirical loss

\min_\beta \| y - X\beta \|^2     (14.32)

A different way to arrive at the same problem is to posit that the true model is
Equation (14.31), and to further assume that \epsilon \sim N(0, \sigma^2 I_n). If we fix \beta, we
have \epsilon = y - X\beta; and since we know the distribution of \epsilon, we can associate to
a choice of \beta a likelihood f(\epsilon | \beta). If we choose the parameter to maximize the
likelihood, we end up solving the same problem as Equation (14.32). The choice
of maximizing the likelihood is called the Maximum Likelihood Principle⁴.
4 For a detailed discussion of the MLP, see Robert (2007).

Finally, there is a geometrical interpretation of the regression problem. You
can interpret the set S := \{X\beta \mid \beta \in R^m\} as a subspace of R^n. The columns of
X are a (generally non-orthonormal) basis of the subspace. We are then given a
point y \in R^n and find the point \hat{y} \in S that is closest to y. This is the definition
of a projection of y on S. The projection is a linear operator. The minimum⁵ is
attained at

\hat{\beta} = (X'X)^{-1} X' y     (14.33)

and the fitted values, our estimates of E(y|x), are

\hat{y} = X\hat{\beta} = X (X'X)^{-1} X' y     (14.34)

The matrix H = X(X'X)^{-1}X' is called the hat matrix or projection matrix.

The estimated residuals are

\hat{\epsilon} = (I - H) y     (14.35)
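As an illustration (simulated data, my own choice of dimensions), Equations (14.33)-(14.35) map directly to a few lines of numpy. In practice one would prefer a QR- or SVD-based solver to forming (X'X)^{-1} explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 3
X = rng.standard_normal((n, m))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # Equation (14.33)
H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
y_hat = H @ y                                    # fitted values (14.34)
resid = (np.eye(n) - H) @ y                      # residuals (14.35)
print(beta_hat, np.allclose(H @ y_hat, y_hat))   # projection is idempotent
```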

Intuitively, the optimal estimates should not change if we change the basis of
the subspace. To see this rigorously, transform X into XQ, where Q \in R^{m \times m} is
non-singular. The transformed set of predictors spans the same subspace as X.
Then

\hat{y} = XQ ((XQ)' XQ)^{-1} (XQ)' y
    = XQ (Q' X' X Q)^{-1} Q' X' y
    = X Q Q^{-1} (X'X)^{-1} (Q')^{-1} Q' X' y
    = X (X'X)^{-1} X' y     (14.36)

hence \hat{y} is independent of the basis representation.


Another property of the estimate ŷ is that, if we iterate the estimation process
on the estimate ŷ, we obtain again ŷ. This also has geometric interpretation.
Once a point has been projected on a hyperplane, the projection of the projection
is unchanged; see Figure xxx. In algebraic terms, Hŷ = H2 y = Hy = ŷ.
5 The minimum is unique if the rank of X is m, i.e., if all the columns of X are linearly independent.
In Chapter 7 we will encounter cases of rank-deficient matrices.

Here is another facet of linear regression, tying together its geometric and
algebraic interpretations. Decompose X using the SVD: X = U\Lambda V'.
U is an orthonormal basis for the column subspace of X. Then

\hat{y} = U\Lambda V' (V\Lambda U' U\Lambda V')^{-1} V\Lambda U' y
    = U U' y

So y is projected on the column space of U.
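This identity is easy to verify numerically; the short sketch below, on arbitrary simulated data, compares the SVD-based projection with the hat-matrix form.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 4))
y = rng.standard_normal(100)

U, lam, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(lam) V'
y_hat_svd = U @ (U.T @ y)
y_hat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat_svd, y_hat_ols))             # True
```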


Replace Equation (14.31) in the beta estimation formula (14.33) to obtain

\hat{\beta} = (X'X)^{-1} X' (X\beta + \epsilon)     (14.37)
    = \beta + (X'X)^{-1} X' \epsilon.     (14.38)

The estimate of beta is unbiased, because E[(X'X)^{-1} X' \epsilon] = 0; and the covariance
matrix of \hat{\beta} is

\operatorname{var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}     (14.39)

Similarly,

\operatorname{var}(\hat{y}) = \sigma^2 X (X'X)^{-1} X'     (14.40)

We can write these formulas using the SVD:

\operatorname{var}(\hat{\beta}) = \sigma^2 V \Lambda^{-2} V'     (14.41)
\operatorname{var}(\hat{y}) = \sigma^2 U U'     (14.42)

The variance of the estimates \operatorname{var}(\hat{\beta}) becomes larger as the columns of X become
more collinear. In our interpretation of the matrix X, this occurs when we
include factors that overlap heavily with pre-existing factors.
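The effect of collinearity on var(\hat{\beta}) can be seen in a small simulation (illustrative only): as the second column becomes nearly proportional to the first, the diagonal of \sigma^2 (X'X)^{-1} explodes.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n = 1.0, 500
x1 = rng.standard_normal(n)
for eps in (1.0, 0.1, 0.01):
    x2 = x1 + eps * rng.standard_normal(n)      # nearly collinear as eps -> 0
    X = np.column_stack([x1, x2])
    var_beta = sigma2 * np.linalg.inv(X.T @ X)  # Equation (14.39)
    print(eps, np.diag(var_beta))
```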

14.4.1 Regression Decomposition


Split Equation (14.31) into two parts:

y = X_1 \beta_1 + X_2 \beta_2 + \epsilon     (14.43)

where we have partitioned the predictors X = (X_1 | X_2). Equation (14.33) can
be rewritten by using block submatrices for X'X and X'y, and the formula for
the inverse of block matrices, in order to obtain \hat{\beta}_1, \hat{\beta}_2. It can be shown that
the coefficient \hat{\beta}_2 can be estimated by a two-step process. First, regress the
columns of X_2 on X_1: X_2 = X_1\Gamma + U, where the residuals U are orthogonal to X_1.
Second, regress y on U: y = U\beta_3 + v. The least-squares coefficient of this regression is the same as
\hat{\beta}_2, i.e., \hat{\beta}_3 = \hat{\beta}_2. The proof can be found in Hansen (2022).
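The two-step construction can be checked numerically; the sketch below (simulated data, my own notation for the intermediate matrices) compares \hat{\beta}_2 from the full regression with the coefficient from the residual regression.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
X1 = rng.standard_normal((n, 2))
X2 = X1 @ np.array([[0.5], [-0.3]]) + rng.standard_normal((n, 1))
y = X1 @ np.array([1.0, 2.0]) + X2[:, 0] * (-1.5) + 0.1 * rng.standard_normal(n)

# Full regression on (X1 | X2): the last coefficient is beta_2.
X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# Two-step: residualize X2 on X1, then regress y on the residuals.
G = np.linalg.solve(X1.T @ X1, X1.T @ X2)
U = X2 - X1 @ G
beta_2 = np.linalg.solve(U.T @ U, U.T @ y)
print(beta_full[-1], beta_2[0])   # the two estimates coincide
```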

Exercise 14.1. If a matrix X \in R^{n \times m} has nearly collinear columns, then there
is a unit-norm vector u such that \|Xu\|^2 = h for some small positive h.

1. Show that u'(X'X)u = h.

2. Show that \min_i \lambda_i^2 \leq h, where the \lambda_i are the singular values of X.

3. From this, show that the largest eigenvalue of \operatorname{var}(\hat{\beta}), equal to \sigma^2 \max_i \lambda_i^{-2}, is at least \sigma^2 / h.

14.5 Asymptotic Properties of Principal Component Analysis
This is a summary of the asymptotic properties of PCA in the regime where
the number of variables n is constant and the number of observations T goes to
infinity. We have T realizations of iid random vectors x_t \sim N(0, \Sigma), from which
we want to estimate \Sigma. We assume that the x_{t,i} have finite fourth moments. Let
\hat{\Sigma}_T := T^{-1} \sum_{t=1}^T x_t x_t'. By the Law of Large Numbers, \hat{\Sigma}_T \to \Sigma almost surely.
Both the eigenvalues and the eigenvectors converge to those of the covariance matrix. Anderson
(1963) proves a CLT for the eigenvalues of the covariance matrix. Decompose the
empirical and true covariance matrices into their eigenvalues and eigenvectors:

\Sigma = U \Lambda U'     (14.44)
\hat{\Sigma}_T = \hat{U} \hat{\Lambda} \hat{U}'     (14.45)

with \lambda_1 > \lambda_2 > \ldots > \lambda_n; all eigenvalues are assumed to be distinct. Anderson
proves that, as T \to \infty,

\sqrt{T} (\hat{\lambda} - \lambda) \sim N(0, 2\Lambda^2)     (14.46)
\sqrt{T} (\hat{u}_i - u_i) \sim N(0, E_i)     (14.47)
E_i := U \operatorname{diag}\!\left( \frac{\lambda_1 \lambda_i}{(\lambda_1 - \lambda_i)^2}, \ldots, \frac{\lambda_{i-1} \lambda_i}{(\lambda_{i-1} - \lambda_i)^2}, 0, \frac{\lambda_{i+1} \lambda_i}{(\lambda_{i+1} - \lambda_i)^2}, \ldots, \frac{\lambda_n \lambda_i}{(\lambda_n - \lambda_i)^2} \right) U'     (14.48)

where the ith diagonal entry is zero. Therefore:


1. the standard error on \hat{\lambda}_i is \sqrt{2}\,\lambda_i / \sqrt{T}.

2. the standard error on the principal components, defined as \sqrt{E(\|\hat{u}_i - u_i\|^2)},
is

\frac{1}{\sqrt{T}} \sqrt{ \sum_{k=1, k \neq i}^{n} \frac{\lambda_k \lambda_i}{(\lambda_k - \lambda_i)^2} }     (14.49)

The relative error depends on the separation between eigenvalues.
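A small simulation (arbitrary parameters, my own sketch) illustrates the first of these standard errors: the empirical dispersion of the top sample eigenvalue matches \sqrt{2}\,\lambda_1 / \sqrt{T} closely.

```python
import numpy as np

rng = np.random.default_rng(7)
lam = np.array([4.0, 2.0, 1.0, 0.5])      # true eigenvalues
Sigma = np.diag(lam)                      # eigenvectors = identity
T, n_sims = 1_000, 2_000

top_eigs = []
for _ in range(n_sims):
    X = rng.multivariate_normal(np.zeros(4), Sigma, size=T)
    S_hat = X.T @ X / T
    top_eigs.append(np.linalg.eigvalsh(S_hat)[-1])

print(np.std(top_eigs))                   # empirical std of the top eigenvalue
print(np.sqrt(2) * lam[0] / np.sqrt(T))   # asymptotic standard error
```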
