Basic Econometrics

Riccardo (Jack) Lucchetti

4th December 2023


Foreword
This is a very basic course in econometrics, in that it only covers basic tech-
niques, although I tried to avoid the scourge of over-simplification, so some may
find it not so basic in style. What makes it perhaps a little different from others
you find on the Net is that I made a few not-so-common choices.

1. Separating clearly the properties OLS has by construction from those it


has when interpreted as an estimator.

2. Using matrix algebra whenever possible.

3. Using asymptotic inference only.

Point number one is modelled after the ideas in the two great masterpieces,
Davidson and MacKinnon (1993) and Davidson and MacKinnon (2004). I have
several reasons for this choice, but it is mainly a pedagogical one. The students I
am writing for are people who often don’t feel at ease with the tools of statistical
inference: they have learned the properties of estimators by heart, they are not
sure they can read a test, find the concept of the distribution of a statistic a little
unclear (never mind asymptotic distributions), get confused between the vari-
ance of an estimator and an estimator of the variance. In the best cases. Never
mind; no big deal.
There’s an awful lot you can say on the base tool in econometrics (OLS) even
without all this, and that’s good to know. Once a student has learned how to
handle OLS properly as a mere computational tool, the issues of its usage and
interpretation as an estimator and of how to read the associated test statistics
can be grasped more correctly. If you mix the two aspects too early, a beginner
is prone to mistake properties of least squares that are true by construction for
properties that depend on some probabilistic assumptions.
Point number two is motivated by laziness. In my teaching career, I have
found that once students get comfortable with matrices, my workload halves. Of
course, it takes some initial sunk cost to convey properly ideas such as projec-
tions and properties of quadratic forms, but the payoff is very handsome. This
book contains no systematic account of matrix algebra; we’re using just the ba-
sics, so anything you find on the Net by googling “matrix algebra lecture notes”
is probably good enough.
As for probability and statistics, I will only assume some familiarity with the
very basics: simple descriptive statistics and basic properties of probability, ran-
dom variables and expectations. Chapter 2 contains a cursory treatment of the
concepts I will use later, but I wouldn’t recommend it as a general reference on
the subject. Its purpose is mainly to make the notation explicit and clarify a few
points. For example, I will avoid any kind of reference to maximum likelihood
methods.

I don’t think I have to justify point number three. I am writing this in 2023,
when typical data sets have hundreds, if not thousands, of observations and no-
body would ever dream of running any kind of inferential procedure with less
than 50 data points. Apart from OLS, there is no econometric technique in actual
use that does not depend vitally on asymptotics, so I guess that readers should
get familiar with the associated concepts if there is a remote chance that this
will not put them off econometrics completely. The t test, the F tests and,
in general, all kinds of degrees-of-freedom corrections are ad-hockeries of the
past; unbiasedness is overrated. Get over it.
I promise I’ll try to be respectful of the readers and not treat them like id-
iots. I assume that if you’re reading this, you want to know more than you do
about econometrics, but this doesn’t give me the right to assume that you need
to be taken by the hand and treated like an 11-year-old.
All the examples and scripts in this book are replicable. All the material is in
a zip file you can download from this link. The software I used throughout the
book is gretl, so data and scripts are in gretl format, but if you insist on using
inferior software (;-)), data are in CSV format too.
Finally, a word of gratitude. A book like this is akin to a software project,
and there’s always one more bug to fix. So, I’d like to thank first all my students
who helped me eradicate quite a few. Then, my colleagues Allin Cottrell, Stefano
Fachin, Francesca Mariani, Giulio Palomba, Luca Pedini, Matteo Picchio, Clau-
dia Pigini, Alessandro Pionati and Francesco Valentini for making many valuable
suggestions. Needless to say, the remaining shortcomings are all mine. Claudia
also allowed me to grab a few things from her slides on IV estimation, so thanks
for that too. If you want to join the list, please send me bug reports and fea-
ture requests. Also, I’m not an English native speaker (I suppose it shows). So,
Anglophones of the world, please correct me whenever needed.
The structure of this book is as follows: chapter 1 explores the properties
of OLS as a descriptive statistic. Inference comes into play at chapter 2 with
some general concepts, while their application to OLS is the object of chapter 3,
with some basic ideas on diagnostic testing and heteroskedasticity in Chapter
4. Extensions of basic OLS are considered in the subsequent chapters: Chapter
5 deals with dynamic models, chapter 6 with instrumental variable estimation
and, finally, Chapter 7 considers linear models for panel data. Each chapter has
an appendix, named “Assorted results”, where I discuss some of the material I
use during the chapter in a little more detail.

In some cases, I will use a special format for short pieces of text, like this. They contain extra stuff that I consider interesting, but not indispensable for the overall comprehension of the main topic.

This work is licensed under a Creative Commons “Attribution-ShareAlike 3.0 Unported” licence.

Contents

Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i

1 OLS: algebraic and geometric properties 1


1.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 OLS as a descriptive statistic . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 OLS on a dummy variable . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Collinearity and the dummy trap . . . . . . . . . . . . . . . . . 14
1.3.4 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 The geometry of OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.1 Projection matrices . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.2 Measures of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.3 Reparametrisations . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.4 The Frisch-Waugh theorem . . . . . . . . . . . . . . . . . . . . 27
1.5 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.A Assorted results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.A.1 Matrix differentiation rules . . . . . . . . . . . . . . . . . . . . 31
1.A.2 Vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.A.3 Rank of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.A.4 Rank and inversion . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.A.5 Step-by-step derivation of the sum of squares function . . . . 36
1.A.6 Numerical collinearity . . . . . . . . . . . . . . . . . . . . . . . 36
1.A.7 Definiteness of square matrices . . . . . . . . . . . . . . . . . 37
1.A.8 A few more results on projection matrices . . . . . . . . . . . 38

2 Some statistical inference 41


2.1 Why do we need statistical inference? . . . . . . . . . . . . . . . . . . 41
2.2 A crash course in probability . . . . . . . . . . . . . . . . . . . . . . . 43
2.2.1 Probability and random variables . . . . . . . . . . . . . . . . 43
2.2.2 Independence and conditioning . . . . . . . . . . . . . . . . . 45
2.2.3 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2.4 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . 48
2.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

2.3.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3.2 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . . 54
2.4 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.1 The p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.5 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.A Assorted results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.A.1 Jensen’s lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.A.2 More on consistency . . . . . . . . . . . . . . . . . . . . . . . . 68
2.A.3 Why √n? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.A.4 The normal and χ2 distributions . . . . . . . . . . . . . . . . . 71
2.A.5 Gretl script to reproduce example 2.6 . . . . . . . . . . . . . . 74

3 Using OLS as an inferential tool 77


3.1 The regression function . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 Main statistical properties of OLS . . . . . . . . . . . . . . . . . . . . 80
3.2.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.2 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . . 81
3.2.3 In short . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3 Specification testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.1 Tests on a single coefficients . . . . . . . . . . . . . . . . . . . 85
3.3.2 More general tests . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.4 Example: reading the output of a software package . . . . . . . . . . 89
3.4.1 The top table: the coefficients . . . . . . . . . . . . . . . . . . 89
3.4.2 The bottom table: other statistics . . . . . . . . . . . . . . . . 91
3.5 Restricted Least Squares and hypothesis testing . . . . . . . . . . . . 93
3.5.1 Two alternative test statistics . . . . . . . . . . . . . . . . . . . 96
3.6 Exogeneity and causal effects . . . . . . . . . . . . . . . . . . . . . . . 98
3.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.8 The so-called “omitted-variable bias” . . . . . . . . . . . . . . . . . . 102
3.A Assorted results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.A.1 Consistency of σ̂2 . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.A.2 The classical assumptions . . . . . . . . . . . . . . . . . . . . . 105
3.A.3 The Gauss-Markov theorem . . . . . . . . . . . . . . . . . . . . 106
3.A.4 Cross-validation and leverage . . . . . . . . . . . . . . . . . . . 108
3.A.5 Derivation of RLS . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.A.6 Asymptotic properties of the RLS estimator . . . . . . . . . . 113

4 Diagnostic testing in cross-sections 115


4.1 Diagnostics for the conditional mean . . . . . . . . . . . . . . . . . . 116
4.1.1 The RESET test . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.1.2 Interactions and the Chow test . . . . . . . . . . . . . . . . . . 119
4.2 Heteroskedasticity and its consequences . . . . . . . . . . . . . . . . 123
4.2.1 If Σ were known . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.2.2 Robust estimation . . . . . . . . . . . . . . . . . . . . . . . . . 127

4.2.3 White’s test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.2.4 So, in practice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.A Assorted results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.A.1 Proof that full interactions are equivalent to split-sample
estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.A.2 Proof that GLS is more efficient than OLS . . . . . . . . . . . . 136
4.A.3 The “vec” and “vech” operators . . . . . . . . . . . . . . . . . . 137
4.A.4 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5 Dynamic Models 141


5.1 Dynamic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2 Manipulating difference equations . . . . . . . . . . . . . . . . . . . . 146
5.2.1 The lag operator . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.2 Dynamic multipliers . . . . . . . . . . . . . . . . . . . . . . . . 149
5.2.3 Interim and long-run multipliers . . . . . . . . . . . . . . . . . 151
5.3 Inference on OLS with time-series data . . . . . . . . . . . . . . . . . 153
5.3.1 Martingale differences . . . . . . . . . . . . . . . . . . . . . . . 153
5.3.2 Testing for autocorrelation and the general-to-specific ap-
proach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.4 An example, perhaps? . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.5 The ECM representation . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6 Hypothesis tests on the long-run multiplier . . . . . . . . . . . . . . 162
5.7 Forecasting and Granger causality . . . . . . . . . . . . . . . . . . . . 164
5.A Assorted results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.A.1 Inverting polynomials . . . . . . . . . . . . . . . . . . . . . . . 169
5.A.2 Basic concepts on stochastic processes . . . . . . . . . . . . . 171
5.A.3 Why martingale difference sequences are serially uncorre-
lated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
5.A.4 From ADL to ECM . . . . . . . . . . . . . . . . . . . . . . . . . 173

6 Instrumental Variables 175


6.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.1.1 Measurement error . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.1.2 Simultaneous equation systems . . . . . . . . . . . . . . . . . 177
6.2 The IV estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.2.1 The generalised IV estimator . . . . . . . . . . . . . . . . . . . 180
6.2.2 The instruments . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.3 An example with real data . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.4 The Hausman test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.5 Two-stage estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.5.1 The control function approach . . . . . . . . . . . . . . . . . . 191
6.6 The examples, revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.6.1 Measurement error . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.6.2 Simultaneous equation systems . . . . . . . . . . . . . . . . . 194

6.7 Are my instruments OK? . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.7.1 The Sargan test . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.7.2 Weak instruments . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.A Assorted results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.A.1 Asymptotic properties of the IV estimator . . . . . . . . . . . 202
6.A.2 Proof that OLS is more efficient than IV . . . . . . . . . . . . . 204
6.A.3 Covariance matrix for the Hausman test (scalar case) . . . . . 204
6.A.4 Hansl script for the weak instrument simulation study . . . . 205

7 Panel data 207


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.2 Individual effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.3 Fixed effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
7.3.1 Using dummy variables . . . . . . . . . . . . . . . . . . . . . . 211
7.3.2 The “within” transformation . . . . . . . . . . . . . . . . . . . 214
7.3.3 Asymptotics for the FE estimator . . . . . . . . . . . . . . . . . 218
7.3.4 Heteroskedasticity and dependence between observations . 219
7.4 Random effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
7.4.1 The Hausman test . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.4.2 Correlated Random Effects, aka “the Mundlak trick” . . . . . 224
7.5 An example with real data . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.5.1 The Kuznets curve . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.5.2 Fixed-effects estimates . . . . . . . . . . . . . . . . . . . . . . . 227
7.5.3 Random-effects estimates . . . . . . . . . . . . . . . . . . . . . 228
7.5.4 Correlated random effects . . . . . . . . . . . . . . . . . . . . . 230
7.A Assorted results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
7.A.1 The Kronecker product . . . . . . . . . . . . . . . . . . . . . . 231
7.A.2 The trace operator . . . . . . . . . . . . . . . . . . . . . . . . . 232
7.A.3 A neat matrix inversion trick . . . . . . . . . . . . . . . . . . . 233
7.A.4 Time dummies . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.A.5 Proof that Q = MD . . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.A.6 The estimator of the variance in the within regression . . . . 235
7.A.7 The RE estimator as FGLS . . . . . . . . . . . . . . . . . . . . . 236
7.A.8 Proof that CRE yields FE . . . . . . . . . . . . . . . . . . . . . . 238

Bibliography 239

Chapter 1

OLS: algebraic and geometric


properties

1.1 Models

I won’t even attempt to give the reader an account of the theory of econometric
modelling. For our present purposes, suffice it to say that we econometricians
like to call a model a mathematical description of something, that doesn’t aim
at being 100% accurate, but still, hopefully, useful.1
We have a quantity of interest, also called the dependent variable, which
we observe more than once: a collection of numbers y 1 , y 2 , . . . , y n , where n is the
size of our data set. These numbers can be anything that can be given a coherent
numerical representation; in this course, however, we will confine ourselves to
the case where the i -th observation y i is a real number. So for example, we could
record the income for n individuals, the export share for n firms, the inflation
rate for a given country at n points in time.
Now suppose that, for each data point, we also have a vector of k elements
containing auxiliary data possibly helpful in better understanding the differ-
ences between the y i s; we call these explanatory variables,2 or xi in symbols.3
To continue the previous examples, xi may include a numerical description of
the individuals we recorded the income of (such as age, gender, educational at-
tainment and so on), or characteristics of the firms we want to study the export
propensity for (size, turnover, R&D expenditure and so on), or the conditions of
the economy at the time the inflation rate was recorded (interest rate, level of
output, and so forth).

1 “All models are wrong, but some are useful” (G. E. P. Box). In fact, one may argue that, in order

to be useful, a model may have to be inaccurate. More on this in section 1.4.2.


2 Terminology is very much field-specific here; statisticians traditionally tend to use the term

covariates, while people from the machine learning community like the word features.
3 I will almost always use boldface symbols to indicate vectors.


What we call a model is a formula like the following:

y i ≃ m(xi )

where we implicitly assume that if xi is not too different from x j , then we should
expect y i to be broadly close to y j : if we pick two people of the same age, with
the same educational level and many other characteristics in common we would
expect that their income should be roughly the same. Of course this won’t be
true in all cases (in fact, chances are that this will never be true exactly), but
hopefully our model won’t lead us to catastrophic mistakes.
The reason why we want to build models is that, once the function m(·) is
known, it becomes possible to ask ourselves interesting questions by inspect-
ing the characteristics of that function. So for example, if it turned out that the
export share of a firm is increasing in the expenditure in R&D, we may make con-
jectures about the reasons why it should be so, look for some economic theory
that could explain the result, and wonder if one could improve export competi-
tiveness by giving the firms incentives to do research.
Moreover, the door is open to forecasting: given the characteristics of a hy-
pothetical firm or individual, the model makes it possible to guess what their
export share or income (respectively) should be. I don’t think I have to convince
the reader of how useful this could be in practice.
Of course, we will want to build our model in the best possible way. In other
words, our aim will be choosing the function m(·) according to some kind of
optimality criterion. This is what the present course is about.
But there’s more: as we will see, building an optimal model is impossible in
general. At most, we may hope to build the best possible model for the data
that we have available. Of course, there is no way of knowing if the model we
built, that perhaps works rather well with our data, will keep working equally
well with new data. Imagine you built a model for the inflation rate in a country
with monthly data from January 2000 to December 2017. It may well be that your
model performs (or, as we say, “fits the data”) very well for that period, but what
guarantee do you have that it will keep doing so in 2018, or in the more distant
future? The answer is: you have none. But still, this is something that we’d like to
do; our mind has a natural tendency to generalise, to infer, to extrapolate. And
yet, there is no logical compelling basis for proving that it’s a good idea to do
so.4 The way out is framing the problem in a probabilistic setting, and this is the
reason why econometrics is so intimately related with probability and statistics.
For the moment, we’ll start with the problem of choosing m(·) in a very sim-
ple case, that is when we have no extra information xi . In this case, the function
becomes a constant:
y i ≃ m(xi ) = m
and the problem is very much simplified, because it means we have to pick a
number m in some optimal way, given the data y 1 , y 2 , . . . , y n . In other words, we
4 The philosophically inclined reader may at this point google for “Bertrand Russell’s turkey”.

have to find a function of the data which returns the number m. Of course, a
function of the data is what we call a statistic. In the next section, I will prove
that the statistic we’re looking for is, in this case, the average of the $y_i$s, that is
$\bar Y = \frac{1}{n}\sum_{i=1}^n y_i$.

1.2 The average


What is a descriptive statistic? It is a function of the data which synthesises a
particular feature of interest of the data; of course, the more informative, the
better. The idea behind descriptive statistics is more or less: we have some data
on some real-world phenomenon; our data set, unfortunately, is too “large”, and
we don’t have time/can’t/don’t feel like going through the whole thing. Hence,
we are looking for a function of these data to tell us what we want, without being
bothered with unnecessary details.
The most obvious example of a descriptive statistic is, of course, the sample
average. Let’s stick our observations y 1 , y 2 , . . . , y n into a column vector y; the
sample average is nothing but

$$\bar Y = \frac{1}{n}\sum_{i=1}^n y_i = \frac{1}{n}\iota'\mathbf{y}, \qquad (1.1)$$

where ι is a column vector full of ones. The “sum” notation is probably more
familiar to most readers; I prefer the matrix-based one not only because I find
it more elegant, but also because it’s far easier to generalise. The nice feature
of the vector ι is that its inner product with any conformable vector x yields the
sum of the elements of x.5
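
If you want to see this identity in action, here is a minimal hansl (gretl) sketch; the vector y is made up purely for illustration.

# made-up data, for illustration only
matrix y = {2; 5; 3; 6}
matrix iota = ones(rows(y), 1)     # the vector of ones
eval iota' * y / rows(y)           # iota'y divided by n ...
eval meanc(y)                      # ... gives the same number as the sample average
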
We use averages all the time. Why is the average so popular? As I said, we’re
looking for a descriptive statistic m, as a synthesis of the information contained
in our data set.
In 1929, Oscar Chisini (pronounced kee-zee-nee) pro-
posed the following definition: for a function of interest
g (·), the mean of the vector y is the number m that yields
the unique solution to g (y) = g (m · ι). Powerful idea: for ex-
ample, the average is the solution of the special case when
the g (·) function is the sum of the vector’s elements, and
the reader may want to spend some time with more exotic
cases.
Chisini’s idea may be further generalised: if our aim is
to use m — that we haven’t yet chosen — as an imperfect

5 Reminder: the inner product of two vectors a and b is defined as $\sum_i a_i b_i$. Mathematicians
like the notation 〈a, b〉 for the inner product, on the grounds of its greater generality (google
“Hilbert space” if you’re curious), but we econometricians are more accustomed to the “matrix”
notation a′ b, where the apostrophe means “transposed”.

but parsimonious description of the whole data set, the question that naturally
arises is: how much information is lost?
If all we knew, for a given data set, was m, what could we say about each
single observation? If we lack any more information, the most sensible thing to
say is that, for a generic i , y i should more or less be m. Consider the case of
A. S. Tudent, who belongs to a class for which the “typical” grade in economet-
rics is 23;6 the most sensible answer we could give to the question “What was the
grade that A. S. Tudent got in econometrics?” would be “Around 23, I guess”. If
the actual grade that A. S. got were in fact 23, OK. Otherwise, we could measure
by how much we were wrong by taking the difference between the actual grade
and our guess, e i = y i − m. We call these quantities the residuals; the vector of
residuals is, of course, e = y − ι · m.
In the ideal case, using m to summarise the data should entail no informa-
tion loss at all, and the difference between y i and m should be 0 for all i (all stu-
dents got 23). If it weren’t so, we may measure how well m does its job through
the size of the residuals. Let’s define a function, called loss function, which mea-
sures the cost we incur because of the information loss.

L(m) = C [e(m)]

In principle, there are not many properties such a function should be assumed
to have. It seems reasonable that C (0) = 0:7 if all the residuals are 0, no approxi-
mation errors occur and the cost is nil. Another reasonable idea is C (e) ≥ 0: you
can’t gain from a mistake.8 Apart from this, there is not very much you can say:
the L(·) function cannot be assumed to be convex, or symmetric, or anything
else. It depends on the context.
Whatever the shape of this function, however, we’ll want to choose m so that
L(m) is as small as possible. In math-speak: for a given problem, we can write
down the loss function and choose the statistic which minimises it. In formulae:

$$\hat m = \underset{m\in\mathbb{R}}{\operatorname{Argmin}}\, L(m) = \underset{m\in\mathbb{R}}{\operatorname{Argmin}}\, C(\mathbf{y} - \iota\cdot m), \qquad (1.2)$$

where you read the above as: m with a hat on is that number that you find if you
choose, among all real numbers, the one that makes the function L(m) as small
as possible.
In practice, by finding the minimum of the L(·) function for a given prob-
lem, we can be confident that we are using our data in the best possible way. At
this point, the first thing that crosses a reasonable person’s mind is “How do I
choose L(·)? I mean, what should it look like?”. Fair point. Apart from extraordi-
nary cases when the loss function is a natural consequence of the problem itself,
6 Note for international readers: in the Italian academic system, which is what I’m used to,

grades go from 18 (barely pass) to 30 (full marks).


7 I use a boldface 0 to indicate a vector full of zeros, as in 0 · ι = 0.
8 Warning: the converse is not necessarily true. It’s possible that the cost is nil even with non-

zero errors. For example, in some contexts a “small” error may be irrelevant.

writing down its exact mathematical form may be complicated. What does the
L(m) function look like for the grades in econometrics of our hypothetical class?
Hard to say.
Moreover, we often must come up with a summary statistic without know-
ing in advance what it will be used for. Obviously, in these cases finding a one-
size-fits-all optimal solution is downright impossible. We have to make do with
something that is not too misleading. A possible choice is
$$L(m) = \sum_{i=1}^n (y_i - m)^2 = (\mathbf{y} - \iota\cdot m)'(\mathbf{y} - \iota\cdot m) = \mathbf{e}'\mathbf{e} \qquad (1.3)$$

The above criterion is a function of m based on the sum of squared residuals,


that enjoys several desirable properties. Not only is it simple to manipulate al-
gebraically: it’s symmetric and convex, so that positive and negative residuals
are penalised equally, and large errors are more costly than small ones. It’s not
unreasonable to take this loss function as an acceptable approximation. More-
over, this choice makes it extremely easy to solve the associated minimisation
problem.
Minimising L(m) with respect to m leads to the so-called least squares prob-
lem. All that is needed to find the minimum in (1.3) is taking the derivative of L(m)
with respect to m;

$$\frac{dL(m)}{dm} = \sum_{i=1}^n \frac{d\,(y_i - m)^2}{dm} = -2\sum_{i=1}^n (y_i - m)$$

The derivative must be 0 for a minimum, so that


$$\sum_{i=1}^n (y_i - \hat m) = 0$$

which in turn implies


$$n\cdot\hat m = \sum_{i=1}^n y_i$$
and therefore $\hat m = n^{-1}\sum_{i=1}^n y_i = \bar Y$. The reader is invited to verify that $\hat m$ is
indeed a minimum, by checking that the second derivative $\frac{d^2 L(m)}{dm^2}$ is positive.
is positive.
Things are even smoother in matrix notation:

L(m) = (y − ιm)′ (y − ιm) = y′ y − 2m · ι′ y + m 2 ι′ ι,

so the derivative is
$$\frac{dL(m)}{dm} = -2\iota'\mathbf{y} + 2m\cdot\iota'\iota = -2\iota'(\mathbf{y} - \iota m) = 0$$
whence
ι′ y = (ι′ ι) · m̂ =⇒ m̂ = (ι′ ι)−1 ι′ y = Ȳ

because of course $\iota'\iota = n$. The value of $L(m)$ at the minimum, that is $L(\hat m) = \mathbf{e}'\mathbf{e} = \sum_{i=1}^n (y_i - \bar Y)^2$, is a quantity that in this case we call deviance, but that we will more often call SSR, as in Sum of Squared Residuals.
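
As a quick numerical sanity check of the derivation above, the following hansl sketch (with a made-up four-element y, chosen purely for illustration) computes m̂ as (ι′ι)⁻¹ι′y together with the corresponding SSR; the result is, of course, the sample average.

# made-up data, for illustration only
matrix y = {2; 5; 3; 6}
matrix iota = ones(rows(y), 1)
matrix m_hat = inv(iota' * iota) * (iota' * y)   # equals the sample average, 4
matrix e = y - iota * m_hat                      # residuals
print m_hat
eval e' * e                                      # the SSR (here, the deviance of y)
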


The mathematically sophisticated way to say the same, that we used a few
pages back, is
$$\hat m = \underset{m\in\mathbb{R}}{\operatorname{Argmin}}\, L(m);$$

where again, the hat ( ˆ ) on m indicates that, among all possible real numbers,
we are choosing the one that minimises our loss function.
The argument above, which leads to choosing the average as an optimal
summary is, in fact, much more general than it may seem: many of the descrip-
tive statistics we routinely use are special cases of the average, where the data
y are subject to some preliminary transformation. In practice: the average of z,
where z i = h(y i ) can be very informative, if we choose the function h(·) wisely.
The variance is the most obvious example: the sample variance9 is just the aver-
age of z i = (y i − Ȳ )2 , which measures how far y i is from Ȳ .
Things get even more interesting when we express a frequency as an average:
define the event E = {y i ∈ A}, where A is some subset of the possible values for
y i ; now define the variable z i = I(y i ∈ A), where I(·) is the so-called “indicator
function”, that gives 1 when its argument is true and 0 when false. Evidently, the
average of the $z_i$, $\bar Z$, is the relative frequency of $E$:

$$\bar Z = \frac{\sum_{i=1}^n z_i}{n} = K/n;$$

since $z_i$ can only be 0 or 1, $K = \sum_{i=1}^n z_i$ is just the number of times the event $E$
has occurred. I’m sure you can come up with more examples.
has occurred. I’m sure you can come up with more examples.

1.3 OLS as a descriptive statistic


1.3.1 OLS on a dummy variable
Now let’s bring the explanatory variables xi back in. For the moment, let’s con-
sider the special case where xi is a one-element vector, that is a scalar.
A possible way to check if y i and x i are related to each other is to see if y i is
“large” or “small” when x i is “large” or “small”. Define

z i = (y i − Ȳ )(x i − X̄ )

which is, in practice, a sort of indicator of “matching magnitudes”: z i is positive


when y i > Ȳ and x i > X̄ (both are “large”) or when y i < Ȳ and x i < X̄ (both
are “small”); on the contrary, z i is negative when magnitudes don’t match. As
9 I’m not applying the “degrees of freedom correction”; I don’t see why I should, as long as I’m
using the variance as a descriptive statistic.



is well known, the average of z i is known as covariance; but this is just boring
elementary statistics.
The reason why I brought this up is to highlight the main problem with co-
variance (and correlation, that is just covariance rescaled so that it’s guaranteed
to be between -1 and 1): it’s a symmetric concept. The variables y i and x i are
treated equally: the covariance between y i and x i is by construction the same as
between x i and y i . On the contrary, we often like to think in terms of y i = m(x i ),
because what we have in mind is an interpretation where y i “depends” on x i ,
and not the other way around.10 This is why we call y i the dependent variable
and x i the explanatory variable. In this context, it’s rather natural to see what
happens if you split y into several sub-vectors, according to the values that x i
takes. In a probabilistic context, we’d call this conditioning (see section 2.2.2).
Simple example: suppose our vector y includes observations on n people,
with n m males and n f = n − n m females. The information on gender is in a vari-
able x i , that equals 1 for males and 0 for females. As is well known, a 0/1 variable
may be called “binary”, “Boolean”, “dichotomic”, but we econometricians tradi-
tionally call it a dummy variable.11
Common sense suggests that, if we take into account the information we
have on gender, the average by gender will give us a data description which
should be slightly less concise than the overall average (since we’re using two num-
bers instead of one), but certainly not less accurate. Evidently, we can define

$$\bar Y_m = \frac{\sum_{x_i=1} y_i}{n_m} = \frac{S_m}{n_m} \qquad\qquad \bar Y_f = \frac{\sum_{x_i=0} y_i}{n_f} = \frac{S_f}{n_f}$$
where S m and S f are the sums of y i for males and females, respectively.
Now, everything becomes more elegant and exciting if we formalise the prob-
lem in a similar way to what we did with the average. We would like to use in the
best possible way the information (that we assume we have) on the gender of the
i -th individual. So, instead of summarising the data by a number, we are going
to use a function, that is something like
m(x i ) = m m · x i + m f · (1 − x i )
which evidently equals m m for men (since x i = 1) and m f for women (since x i =
0). Our summary will be a rule giving us ‘representative’ values of y i according
to x i .
Let’s go back to our definition of residuals as approximation errors: in this
case, you clearly have that e i ≡ y i − m(x i ), and therefore
y i = m m x i + m f (1 − x i ) + e i (1.4)
10 I’m being deliberately vague here: in everyday speech, saying that A depends on B may mean

many things, not necessarily consistent. For example, “dependence” may not imply a cause-effect
link. This problem is much less trivial than it seems at first sight, and we’ll leave it to professional
epistemologists.
11 I am aware that there are people who don’t fit into the traditional male/female distinction, and

I don’t mean to disrespect them. Treating gender as a binary variable just makes for a nice and
simple example here, ok?

Equation (1.4) is a simple example of an econometric model. The number


y i is split into two additive components: a systematic part, that depends on the
variable x i (a linear function of x i , to be precise), plus a remainder term, that we
just call the residual for now. In this example, m(x i ) = m m x i + m f (1 − x i ).
It is convenient to rewrite (1.4) as
$$y_i = m_f + (m_m - m_f)\,x_i + e_i = \begin{bmatrix} 1 & x_i \end{bmatrix}\begin{bmatrix} m_f \\ m_m - m_f \end{bmatrix} + e_i$$

so we can use matrix notation, which is much more compact and elegant

y = Xβ + e, (1.5)

where

$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} m_f \\ m_m - m_f \end{bmatrix}$$
and X is a matrix with n rows and 2 columns; the first column is ι and the second
one is x. The i -th row of X is [1, 1] if the corresponding individual is male and
[1, 0] otherwise. To be explicit:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{n-1} \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{n-1} \\ e_n \end{bmatrix}$$

Therefore, the problem of choosing m m and m f optimally is transformed


into the problem of finding the vector β that minimises the loss function e′ e.
The solution is not difficult: find the solutions to12
$$\frac{d}{d\beta}\,\mathbf{e}'\mathbf{e} = \frac{d}{d\beta}(\mathbf{y} - X\beta)'(\mathbf{y} - X\beta) = \frac{d}{d\beta}\left(\mathbf{y}'\mathbf{y} - 2\beta'X'\mathbf{y} + \beta'X'X\beta\right) = 0$$

By using the well-known13 rules for matrix differentiation, you have

X′ y = X′ X · β̂ (1.6)

What we have to do now is solve equation (1.6) for β̂ . The solution is unique

if X′X is invertible (if you need a refresher on matrix inversion, and related matters, subsection 1.A.3 is for you):

$$\hat\beta = \left(X'X\right)^{-1} X'\mathbf{y}. \qquad (1.7)$$

12 Need I remind the reader of the rule for transposing a matrix product, that is (AB )′ = B ′ A ′ ?

Obviously not.
13 Not so well-known, maybe? Jump to subsection 1.A.1.

Equation (1.7) is the single most important equation in this book, and this is
why I framed it into a box. The vector β̂ is defined as the vector that minimises
the sum of squared residuals among all vectors with k elements (where k = 2 in
this case):
$$\hat\beta = \underset{\beta\in\mathbb{R}^k}{\operatorname{Argmin}}\; \mathbf{e}'\mathbf{e},$$

and the expression in equation (1.7) turns the implicit definition into an explicit
formula that you can use to calculate β̂ .
The coefficients β̂ obtained from (1.7) are known as OLS coefficients, or OLS
statistic, from Ordinary Least Squares.14 A very common idiom that economists
use when referring to the calculation of OLS is “regressing y on X”. The usage of
the word “regression” here might seem odd, but will be justified in chapter 3.
The “hat” symbol has exactly the same meaning as in eq. (1.2): of all the
possible choices for β , we pick the one that makes eq. (1.6) true, and therefore
minimises the associated loss function e′ e. The vector

ŷ = Xβ̂

is our approximation to y. The term we normally use for the elements of ŷ are
the fitted values: the closer they are to y, the better we say that the model fits
the data.
In this example, a few simple calculations suffice to show that
· ¸
′ n nm
XX =
nm nm
· Pn ¸ · ¸
′ i =1 yi Sm + S f
Xy = P =
x i =1 y i Sm
where $S_m = \sum_{x_i=1} y_i$ and $S_f = \sum_{x_i=0} y_i$: the sums of $y_i$ for males and females,
respectively. By using the standard rule for inverting (2 × 2) matrices, which I
will also assume known,15

$$(X'X)^{-1} = \frac{1}{n_m n_f}\begin{bmatrix} n_m & -n_m \\ -n_m & n \end{bmatrix}$$

so that

$$\hat\beta = \frac{1}{n_m n_f}\begin{bmatrix} n_m & -n_m \\ -n_m & n \end{bmatrix}\begin{bmatrix} S_m + S_f \\ S_m \end{bmatrix} = \frac{1}{n_m n_f}\begin{bmatrix} n_m S_f \\ n_f S_m - n_m S_f \end{bmatrix}$$

and finally

$$\hat\beta = \begin{bmatrix} \dfrac{S_f}{n_f} \\ \dfrac{S_m}{n_m} - \dfrac{S_f}{n_f} \end{bmatrix} = \begin{bmatrix} \bar Y_f \\ \bar Y_m - \bar Y_f \end{bmatrix}$$
nm

14 Why “ordinary”? Well, because there are more sophisticated variants, so we call these “ordi-

nary” as in “not extraordinary”. We’ll see one of those variants in section 4.2.1.
15 If you’re in trouble, go to subsection 1.A.4.

so that our model is:

$$\hat y_i = \bar Y_f + \left(\bar Y_m - \bar Y_f\right) x_i$$

and it’s easy to see that the fitted value for males (x i = 1) is Ȳm , while the one for
the females (x i = 0) is Ȳ f .

Example 1.1
Let me give you a numerical example of the above: suppose we have 80 individ-
uals (50 males and 30 females) and that we’re interested in their monthly wage.
Moreover, $S_m = \sum_{x_i=1} y_i$ = € 60000 and $S_f = \sum_{x_i=0} y_i$ = € 42000: therefore, the
average wage is $\bar Y_m$ = 1200 = 60000/50 for males and $\bar Y_f$ = 1400 = 42000/30 for
females. After ordering observations by putting the data for males first,16 the X
matrix looks like

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{bmatrix}$$

where the top block of rows has 50 rows and the bottom one has 30. As the reader
may easily verify,

$$X'X = \begin{bmatrix} 80 & 50 \\ 50 & 50 \end{bmatrix} \qquad\qquad X'\mathbf{y} = \begin{bmatrix} 102000 \\ 60000 \end{bmatrix}$$

By performing the appropriate calculations, one finds that

$$\hat\beta = (X'X)^{-1} X'\mathbf{y} = \begin{bmatrix} 1400 \\ -200 \end{bmatrix}$$

and the model can be written as:

ŷ i = 1400 − 200x i ,

which reads: for females, x i = 0, so their typical income is €1400; for males, in-
stead, x i = 1, so their income is given by 1400 − 200 · 1 = €1200.
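
Since all the quantities involved are just the cross-products given in the example, a hansl sketch reproducing these numbers does not even need to build the 80-row X matrix; the figures below are exactly those assumed above.

# Example 1.1: 50 males (x=1), 30 females (x=0), S_m = 60000, S_f = 42000
matrix XtX = {80, 50; 50, 50}      # X'X = [n, n_m; n_m, n_m]
matrix Xty = {102000; 60000}       # X'y = [S_m + S_f; S_m]
matrix b = inv(XtX) * Xty          # OLS coefficients
print b                            # should print 1400 and -200
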

Once again, opting for a quadratic loss function (and therefore minimising

e′ e) delivers a solution consistent with common sense, and our approximate de-
scription of the vector y uses a function whose parameters are the statistics we
are interested in.
16 With no loss of generality, as a mathematician would say.

1.3.2 The general case


In reading the previous subsection, the discerning reader will have noticed that,
in fact, the assumption that x is a dummy variable plays a very marginal role.
There is no reason why the equation m(x i ) = β1 + β2 x i should not hold when
x i contains generic numeric data. The solution to the problem remains un-
changed; clearly, the vector β̂ will not contain the averages by sub-samples, but
the fact that the loss function is minimised by β̂ = (X′ X)−1 X′ y keeps being true.

Example 1.2
Suppose that

$$\mathbf{y} = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 3 \\ 0 \\ 1 \end{bmatrix} \qquad\qquad \mathbf{x} = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 5 \\ 1 \\ 1 \end{bmatrix}$$

The reader is invited to check that17

$$X'X = \begin{bmatrix} 6 & 16 \\ 16 & 56 \end{bmatrix} \;\Rightarrow\; (X'X)^{-1} = \begin{bmatrix} 0.7 & -0.2 \\ -0.2 & 3/40 \end{bmatrix} \qquad\qquad X'\mathbf{y} = \begin{bmatrix} 10 \\ 33 \end{bmatrix}$$

and therefore

$$\hat\beta = \begin{bmatrix} 0.4 \\ 0.475 \end{bmatrix} \qquad \hat{\mathbf{y}} = \begin{bmatrix} 2.3 \\ 1.825 \\ 1.35 \\ 2.775 \\ 0.875 \\ 0.875 \end{bmatrix} \qquad \mathbf{e} = \begin{bmatrix} -1.3 \\ 1.175 \\ 0.65 \\ 0.225 \\ -0.875 \\ 0.125 \end{bmatrix}$$
0.875 0.125

Hence, the function m(x i ) minimising the sum of squared residuals is m(x i ) =
0.4 + 0.475x i and e′ e equals 4.325.
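
The check the reader is invited to perform can also be delegated to gretl; a minimal hansl script with the same six data points would be something like the following (gretl’s built-in mols() function should give the same coefficient vector in one call).

# data from Example 1.2
matrix y = {1; 3; 2; 3; 0; 1}
matrix x = {4; 3; 2; 5; 1; 1}
matrix X = ones(6,1) ~ x             # remember to stick iota and x together
matrix b = inv(X' * X) * (X' * y)    # OLS coefficients: 0.4 and 0.475
matrix e = y - X * b                 # residuals
print b
eval e' * e                          # sum of squared residuals: 4.325
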

In traditional textbooks, at this point you always get a picture similar to the
one in Figure 1.1, which is supposed to aid intuition; I don’t like it very much, and
will explain why shortly. Nevertheless, let me show it to you: the figure uses the same data as in the present example.
In Figure 1.1, each black dot corresponds to a (x i , y i ) pair; the dashed line
plots the m(x) function and the residuals are the vertical differences between the
dots and the dashed line; the least squares criterion makes the line go through
the dots in such a way that the sum of these differences (squared) is minimal.
So, for example, for observation number 1 the observed value of x i is 4 and the
17 Before you triumphantly shout “It’s wrong!”, remember to stick ι and x together.

Figure 1.1: OLS on six data points
[Scatterplot of the (x_i, y_i) pairs with the fitted line m(x) = 0.4 + 0.475x; the points (x_1, y_1) and (x_1, ŷ_1) are marked for observation 1.]

observed value for y i is 1; the approximation yields ŷ 1 = 0.4 + 0.475 × 4 =


2.3 (observe the position of the white dot). Therefore, e 1 = y 1 − ŷ 1 = −1.3 (the
vertical distance between the black dot and the white dot).
The example above can be generalised by considering the case where we
have more than one explanatory variable, except for the fact that producing a
figure akin to Figure 1.1 becomes difficult, if not impossible. Here, the natural
thing to do is expressing our approximation as a function of the vector xi , that is
m(xi ) = x′i β , or more explicitly,

ŷ i = β0 + β1 x 1i + β2 x 2i + . . . + βk x ki

For example, suppose we have data on each student in the class A. S. Tudent
belongs to: how many hours each student spent studying econometrics, their
previous grades in related subjects, and so on; these data, for the i -th student,
are contained in the vector x′i , which brings us back to equation (1.5).
The algebraic apparatus we need for dealing with the generalised problem
is, luckily, unchanged; allow me to recap it briefly. If the residual we use for
minimising the loss function is e i (β ) = y i − x′i β , then the vector of residuals is

e(β ) = y − Xβ (1.8)

so the function to minimise is L(β ) = e(β )′ e(β ).


Since the derivative of e(β ) with respect to β is −X, we can use the chain rule
and write
X′ e(β̂ ) = 0 (1.9)

(a more detailed proof, should you need it, is in subsection 1.A.5). By putting
together (1.8) and (1.9) you get a system of equations sometimes referred to as
normal equations:
X′ X · β̂ = X′ y (1.10)

and therefore, if X′ X is invertible, β̂ = (X′ X)−1 X′ y.



If you think that all this is very clever, well, you’re right.
The inventor of OLS is arguably the greatest mathemati-
cian of all time: the great Carl Friedrich Gauss, also known
as the princeps mathematicorum.18
Note, again, that the average can be obtained as the
special case when X = ι. Moreover, it’s nice to observe that
the above formulae make it possible to compute all the rel-
evant quantities without necessarily observing the matri-
ces X and y; in fact, all the elements you need are the fol-
lowing:

1. the scalar y′ y;

2. the k-element vector X′ y and

3. the k × k matrix X′ X (or equivalently, its inverse).

where k is the number of columns of X, the number of unknown coefficients in


our m(·) function. Given these quantities, β̂ is readily computed, but also e′ e:

e′ e = (y − Xβ̂ )′ (y − Xβ̂ ) = y′ y − y′ Xβ̂ − β̂ ′ X′ y + β̂ ′ (X′ X)β̂

and using (1.10) you have


e′ e = y′ y − β̂ ′ (X′ y). (1.11)
Equation (1.11) expresses the SSR as the difference between a scalar and the
inner product of two k-element vectors β̂ and (X′ y). The number of rows of y,
that is the number of observations n, never comes into play, and could well be
huge.
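
To make the point concrete, here is a hansl sketch in which β̂ and the SSR are computed from these three ingredients alone; the numbers are those of Example 1.2, but n itself never appears.

scalar yty = 24                  # y'y for the data of Example 1.2
matrix Xty = {10; 33}            # X'y
matrix XtX = {6, 16; 16, 56}     # X'X
matrix b = inv(XtX) * Xty        # OLS coefficients: 0.4 and 0.475
eval yty - b' * Xty              # the SSR via equation (1.11): 4.325
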
I guess you now understand my lack of enthusiasm for Figure 1.1: if X has
3 columns, drawing a similar picture is difficult. For 4 or more columns, it be-
comes impossible. Worse, the geometric intuition that it conveys may overlap
with another geometric interpretation of OLS, which I consider more interesting
and more useful, and is the object of section 1.4.
A nice feature of a linear function like (1.5) is that the coefficients β can be
interpreted as marginal effects, or partial derivatives if you prefer. In the previ-
ous example, the coefficient associated to the number of hours that each student
spent studying econometrics may be defined as

$$\frac{\partial m(\mathbf{x})}{\partial x_j} = \beta_j \qquad (1.12)$$

and therefore can be read as the partial derivative of the m(·) function with re-
spect to the number of hours. Clearly, you may attempt to interpret these mag-
nitudes by their sign (do more hours of study improve your grade?) and by their
18 To be fair, the French mathematician Adrien-Marie Legendre rediscovered it independently a

few years later.



magnitude (if so, by how much?). However, you should resist the temptation to
give the coefficients a counterfactual interpretation (If A. S. Tudent had studied
2 more hours, instead of watching that football game, by how much would their
mark have improved?); this is possible, in some circumstances, but not always
(more on this in Section 3.6).

Focusing on marginal effects is what we do most often in econometrics, because the question of interest is not really approximating y given x, but rather understanding what the effect of x on y is (and, possibly, how general and robust this effect is). In other words, the object of interest in econometrics is much more often β, rather than m(x). The opposite happens in a broad class of statistical methods that go, collectively, by the name of machine learning methods and focus much more on prediction than interpretation. In order to predict correctly, these models use much more sophisticated ways of handling the data than a simple linear function, and even writing the rule that links x to ŷ is impossible.
Machine learning tools have been getting quite popular at the beginning of the XXI century, and are the tools that companies like Google and Amazon use to predict what video you’d like to see on Youtube or what book you’d like to buy when you open their website. As we all know, these models perform surprisingly well in practice, but nobody would be able to reconstruct how their predictions come about. The phrase some people use is that machine learning procedures are “black boxes”: they work very well, but they don’t provide you with an explanation of why you like that particular video. The pros and cons of econometric models versus machine learning tools are still under scrutiny by the scientific community, and, if you’re curious, I’ll just give you a pointer to Mullainathan and Spiess (2017).

1.3.3 Collinearity and the dummy trap


Of course, for solving equation (1.10), X′ X must be invertible. Now, you may ask:
what if it’s singular? This is an interesting case. The solution ŷ can still be found,
but there is more than one vector β̂ associated with it. In fact, there are infinitely
many. Let me give you an example. Suppose that X contains only one non-zero
column, x1 . The solution is easy to find:

$$\hat\beta_1 = \frac{\mathbf{x}_1'\mathbf{y}}{\mathbf{x}_1'\mathbf{x}_1},$$

so that ŷ = β1 x1 . Now, add to X a second column, x2 , which happens to be equal


to x1 , so x2 = x1 . Evidently, x2 adds no information to our model, because it con-
tains exactly the same information as x1 so ŷ remains the same. Now, however,
we can write it in infinitely may ways:

ŷ = β1 x1 = 0.5β1 x1 + 0.5β1 x2 = 0.01β1 x1 + 0.99β1 x2 = . . .

because obviously β1 x2 = β1 x1 . In other words, there are infinitely many ways to


combine x1 and x2 to obtain ŷ, even though the latter is unique and the objective
function has a well-defined minimum. It is rather easy to generalise the example
above when x2 is a multiple of x1 , that is x2 = αx1 .

We call this problem collinearity, or multicollinearity, and it can be solved


quite easily: all you have to do is drop the collinear columns until X has full rank.
Therefore, in the example above, you can choose to leave out x1 or x2 ; whatever
your choice, problem solved.
In practice, things are not always so easy, because (as is well known) digital
computers work with finite numerical precision, but in the cases we will con-
sider we should have no problems. The interested reader may want to have a
look at section 1.A.6.
A situation where this issue may arise goes commonly under the name of
dummy trap. Suppose that you want to include in your model a qualitative vari-
able, in which we have a conventional coding. For example, the marital status of
an individual, and you conventionally code this information as 1=single, 2=mar-
ried, 3=divorced, 4=it’s complicated, etc.
Clearly, using this variable “as is” makes no sense: a function like ŷ i = β1 +
β2 x i would consider x i as a proper numerical value, whereas in fact its coding is
purely conventional. The solution is recoding x i as a set of dummy variables, in
which each dummy corresponds to one category: so for example the vector

$$\mathbf{x} = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 1 \\ 2 \\ 3 \\ 3 \end{bmatrix}$$
would be substituted by the matrix

$$Z = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$
so the first column of Z is a dummy variable for x i = 1, the second one is a
dummy variable for x i = 2, and so on.
However, using the matrix Z unmodified leads to a collinearity problem if
the model contains a constant, since the sum of the columns of the matrix Z is
by construction equal to ι. Hence, the matrix

X = [ι Z]

does not have full rank, so X′ X doesn’t have full rank either, and consequently is not
invertible.19
19 If you have problems following this argument, sections 1.A.3 and 1.A.4 may be of help.

The remedy you normally adopt is to drop one of the column of Z, and the
corresponding category becomes the so-called “reference” category. For exam-
ple, suppose you have a geographical variable x i conventionally coded from 1 to
3 (1=North, 2=Centre, 3=South). The model m(x i ) = β1 + β2 x i is clearly mean-
ingless, but one could think of setting up an alternative model like

ŷ i = β1 + β2 Ni + β3C i + β4 S i ,

where Ni = 1 if the i -th observation pertains to the North, and so on. This would
make more sense, as all the variables in the model have a proper numerical in-
terpretation. However, in this case we would have a collinearity problem for the
reasons given above, that is Ni + C i + S i = 1 by construction for all observations
i.
The solution is dropping one of the geographical dummies from the model:
for example, let’s say we drop the “South” dummy S i : the model would become

ŷ i = β1 + β2 Ni + β3C i ;

observe that with the above formulation the fitted value for a southern observa-
tion would be
ŷ i = β1 + β2 × 0 + β3 × 0 = β1
whereas for a northern one you would have

ŷ i = β1 + β2 × 1 + β3 × 0 = β1 + β2 ,

so β2 indicates the difference between a northern observation and a southern


one, in the same way as β3 indicates the difference between Centre and South.
More generally, after dropping one of the dummies, the coefficient for each of
the remaining ones indicates the difference between that category and the one
you chose as a reference.
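
The dummy trap is easy to reproduce on the computer; the following hansl sketch uses the small coding example above and gretl’s rank() function to show that [ι Z] is rank-deficient, while dropping one column of Z restores full rank.

# the dummy expansion Z of the conventionally coded variable, as in the text
matrix Z = {1,0,0; 0,0,1; 0,1,0; 1,0,0; 0,1,0; 0,0,1; 0,0,1}
matrix iota = ones(rows(Z), 1)
matrix X_trap = iota ~ Z            # constant plus all three dummies
eval rank(X_trap)                   # 3, although X_trap has 4 columns: X'X is singular
matrix X_ok = iota ~ Z[,1:2]        # drop the last dummy (the "reference" category)
eval rank(X_ok)                     # 3 = number of columns: full rank
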

1.3.4 Nonlinearity
A further step in enhancing this setup would be allowing for the possibility that
the function m(x i ) is non-linear. In a traditional econometric setting this idea
would take us to consider the so-called NLS (Nonlinear Least Squares) tech-
nique. I won’t go into this either, for two reasons.
First, because minimising a loss function like $L(\beta) = \sum_{i=1}^n \left[ y_i - m(x_i, \beta) \right]^2$,

where m(·) is some kind of crazy arbitrary function may be a difficult problem:
it could have more than one solution, or none, or maybe one that cannot be
written in closed form.
Second, the linear model is in fact more general than it seems, since in order
to use OLS it is sufficient that the model be linear in the parameters, not nec-
essarily in the variables. For example, suppose that we have one explanatory
variable; it is perfectly possible to use a model formulation like

$$m(x_i) = \beta_1 + \beta_2 x_i + \beta_3 x_i^2. \qquad (1.13)$$

The equation above contains a non-linear transformation of x i (the square), but


the function itself is just a linear combination of observable data: in this case,
we use a formulation that implies that the effect of x i on m(x i ) is nonlinear, but
this is still achieved by employing a linear combination of observable variables.
To be more explicit, the X matrix would be, in this case,

$$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}$$

and the algebra would proceed as usual.


This device is very common in applied econometrics, where powers of ob-
servable variables are used to accommodate nonlinear effects in the model with-
out having to give up the computational simplicity of OLS. The parameter β3 is
also quite easy to read: if it’s positive (negative), the m(x i ) function is convex
(concave).
The only caveat we have to be aware of is that, of course, you cannot inter-
pret the β vector as marginal effects, as the right-hand side of equation (1.12)
is no longer a fixed scalar. In fact, the marginal effects for each variable in the
model become functions of the whole parameter vector β and of xi ; in other
terms, marginal effects may be different for each observation in our sample. For
example, for the model in equation (1.13) the marginal effect of x i would be

∂m(x i )
= β2 + 2β3 x i ;
∂x i

and its sign would depend on the condition $x_i > -\frac{\beta_2}{2\beta_3}$, so it’s entirely possible
that the marginal effect of x i on y i is positive for some units in our sample and
negative for others.

Example 1.3
Suppose you have the following model:
$$\hat y_i = m(x_i) = -1 + 2x_i - 0.4x_i^2 + 2\sqrt{x_i}$$

A plot of this function is depicted in Figure 1.2. The marginal effect of x i on y i is


easy to find as
$$\frac{\partial m(x_i)}{\partial x_i} = 2 - 0.8x_i + \frac{1}{\sqrt{x_i}}$$
by differentiating each term. As you can see, the effect of x i on y i becomes
individual-specific: for two individuals with a different x i , the effect of a rise
in x i on y i would depend on x i , and can even change sign. So, what is a good
thing for someone could be a bad thing for someone else.

Figure 1.2: m(x_i) and its derivative as functions of x_i in example 1.3
[Plot showing the two curves, labelled m(x) and ∂m(x)/∂x.]

More generally, what we can treat via OLS is the class of models that can be
written as

$$m(\mathbf{x}_i) = \sum_{j=1}^{k} \beta_j\, g_j(\mathbf{x}_i),$$

where xi are our “base” explanatory variables and g j (·) is a sequence of arbitrary
transformations, no matter how crazy. Each element of this sequence becomes
a column of the X matrix. Clearly, once you have computed the β̂ vector, the
marginal effects are easy to calculate (of course, as long as the g j (·) functions
are differentiable):
∂m(x_i)/∂x_i = Σ_{j=1}^{k} β̂_j · ∂g_j(x_i)/∂x_i .
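If you want to see this machinery at work, here is a minimal numerical sketch, written in Python with NumPy (a convenient choice for illustration, not a tool this book otherwise relies on); the data are invented. It builds the columns g_j(x_i) explicitly, computes β̂ by least squares and then evaluates the observation-specific marginal effects.

import numpy as np

rng = np.random.default_rng(42)               # invented data, purely for illustration
n = 200
x = rng.uniform(0.5, 10.0, n)
y = -1 + 2*x - 0.4*x**2 + 2*np.sqrt(x) + rng.normal(0, 0.5, n)

# each column of X is a transformation g_j(x): constant, x, x^2, sqrt(x)
X = np.column_stack([np.ones(n), x, x**2, np.sqrt(x)])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]     # same numbers as (X'X)^{-1} X'y

# observation-specific marginal effects: sum_j beta_hat_j * dg_j(x)/dx
dgdx = np.column_stack([np.zeros(n), np.ones(n), 2*x, 0.5/np.sqrt(x)])
marg_eff = dgdx @ beta_hat
print(beta_hat)
print(marg_eff.min(), marg_eff.max())         # the effect differs across observations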

1.4 The geometry of OLS


The OLS statistic and associated concepts can be given an interpretation that
has very little to do with statistics; instead, it’s a geometrical interpretation. Given
the typical audience of this book, a few preliminaries may be in order here.
The first concept we’ll want to define is the concept of distance (also known
as metric). Given two objects a and b, their distance is a function that should
enjoy four properties:

1. d (a, b) = d (b, a)

2. d (a, b) ≥ 0

3. d (a, b) = 0 ⇔ a = b

4. d (a, b) + d (b, c) ≥ d (a, c)



The first three are obvious; as for the last one, called triangle inequality, it just
means that the shortest way is the straight one. The objects in question may be
of various sorts, but we will only consider the case when they are vectors. The
distance of a vector from zero is its norm, written as ∥x∥ = d (x, 0).
Many functions d (·) enjoy the four properties above, but the concept of dis-
tance we use in everyday life is the so-called Euclidean distance, defined as
d(x, y) = √[(x − y)′(x − y)]

and the reader may verify that the four properties are satisfied by this definition.
Obviously, the formula for the Euclidean norm is ∥x∥ = √(x′x).

The second concept I will use is the idea of a vector space. If you’re not
familiar with vector spaces, linear combinations and the rank of a matrix, then
sections 1.A.2 and 1.A.3 are for you.20 In brief, I use the expression Sp (X) to
indicate the set of all vectors that can be obtained as a linear combination of the
columns of X.
Consider the space Rn , where you have a vector y and a few vectors x j , with
j = 1 . . . k and k < n, all packed in a matrix X. What we want to find is the element
of Sp (X) which is closest to y. In formulae:

ŷ = Argmin ∥y − x∥;
x∈Sp(X)

since the optimal point must belong to Sp (X), the problem can be rephrased as:
find the vector β such that Xβ (that belongs to Sp (X) by construction) is closest
to y:
β̂ = Argmin ∥y − Xβ ∥. (1.14)
β ∈Rk

If we decide to adopt the Euclidean definition of distance, then the solution


is exactly the same as the one to the statistical problem of Section 1.3.2: since
the “square root” function is monotone, minimising ∥y − Xβ∥ is the same as
minimising (y − Xβ)′(y − Xβ), and therefore

Argmin ∥y − Xβ ∥ = β̂ = (X′ X)−1 X′ y


β ∈Rk

from which
ŷ = Xβ̂ = X(X′ X)−1 X′ y.

Note that ŷ is a linear transform of y: you obtain ŷ by premultiplying y by the


matrix X(X′ X)−1 X′ ; this kind of transformation is called a “projection”.

20 If, on the other hand, you find the topic intriguing and want a rigorous yet very readable book

on this subject, check out Axler (2015).



1.4.1 Projection matrices


In the previous subsection I pointed out that ŷ is a linear transform of y. The
matrix that operates the transform is said to be a projection matrix.21 To see
why, there’s an example I always use: the fly in the cinema. Imagine you’re sitting
in a cinema, and there’s a fly somewhere in the room. You see a dot on the screen:
the fly’s shadow. The position of the fly is y, the space spanned by X is the screen
and the shadow of the fly is ŷ.
The matrix that turns the position of the fly into the position of its shadow
is X(X′ X)−1 X′ . To be more precise, this matrix projects onto Sp (X) any vector it
premultiplies, and it’s such a handy tool that it has its own abbreviation: PX .

PX ≡ X(X′ X)−1 X′

and ŷ can be written as ŷ = PX y. The reader may find it amusing that in the
econometrics jargon the PX matrix is sometimes referred to as the “hat” matrix,
because PX “puts a hat on y”.

Figure 1.3: Example: projection of a vector on another one

[The figure plots y, its projection ŷ onto Sp(x), and the residual e, against coordinates 1 and 2.]

In this simple example, x = (3, 1) and y = (5, 3); the reader may want to check that
ŷ = (5.4, 1.8) and e = (−0.4, 1.2).
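One way to do that check is on a computer; here is a sketch in Python with NumPy (used purely as a calculator, not a tool the book itself relies on), which builds the projection matrix and reproduces ŷ and e.

import numpy as np

x = np.array([[3.0], [1.0]])                 # the single regressor of the figure
y = np.array([5.0, 3.0])

P = x @ np.linalg.inv(x.T @ x) @ x.T         # projection matrix onto Sp(x)
y_hat = P @ y                                # [5.4, 1.8]
e = y - y_hat                                # [-0.4, 1.2]
print(y_hat, e)
print(x.T @ e)                               # essentially zero: e is orthogonal to Sp(x)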

The base property of PX is that, by construction, PX X = X, as you can easily


check. Moreover, it’s symmetric and idempotent.

PX = PX ′            PX PX = PX .

We call idempotent something that does not change when multiplied by itself;
for example, the real numbers 0 and 1 are idempotent. A nice way to understand
the meaning of idempotency is by reflecting on its geometrical implication: the
21 If I were pedantic, I’d have to say orthogonal projection, because you also get a tool called

oblique projection. We’ll never use it in this book, apart from a passing reference in chapter 6.

matrix PX takes a vector from wherever it is and moves it onto the closest point of
Sp (X); but if the starting point already belongs to Sp (X), obviously no movement
takes place at all, so applying PX to a vector more than once produces no extra
effects (PX y = PX PX y = PX PX · · · PX y).
It can also be proven that PX is singular;22 again, this algebraic property can
be given a nice intuitive geometric interpretation: a projection entails a loss
of information, because some of the original coordinates get “squashed” onto
Sp (X): in the fly example, it’s impossible to know the exact position of the fly
from its shadow, because one of the coordinates (the distance from the screen)
is lost. In formulae, the implication of PX being singular is that no matrix A exists
such that A·PX = I, and therefore no matrix exists such that Aŷ = y, which means
that y is impossible to reconstruct from its projection.
In practice, when you regress y on X, you are performing exactly the same
calculations that are necessary to find the projection of y onto Sp (X), and the
vector β̂ contains the coordinates for locating ŷ in that space.
There is another interesting matrix we’ll be using often:

M X = I − PX .

By definition, therefore, MX y = y− ŷ = e. The MX matrix performs a complemen-


tary task to that of PX : when you apply MX to a vector, it returns the difference
between the original point and its projection. We may say that e = MX y contains
the information that is lost in the projection. It is easily checked that MX X = [0]
and as a consequence,
MX PX = PX MX = [0],

where I’m using the notation [0] for “a matrix full of zeros”.
Some more noteworthy properties: MX is symmetric, idempotent and sin-
gular, just like PX .23 As for its rank, it can be proven that rk (MX ) = n − k,
where n is the number of rows of X and k = rk (X).
A fundamental property this matrix enjoys is that every vector of the type
MX y is orthogonal to Sp (X), so it forms a 90° angle with any vector that can be
written as Xλ.24 These properties are very convenient in many cases; a notable
one is the possibility of rewriting the SSR as a quadratic form:25

e′ e = (MX y)′ (MX y) = y′ MX MX y = y′ MX y


22 To be specific: it can be proven that rk (PX ) = rk (X), so PX is an n × n matrix with rank k; evidently, in the situation we're considering here, n > k. Actually, it can be proven that no idempotent matrix is invertible, the identity matrix being the only exception.
23 In fact, MX is itself a projection matrix, but let's not get into this, ok?
24 Let me remind the reader that two vectors are said to be orthogonal if their inner product is

0. In formulae: x ⊥ y ⇔ x′ y = 0. A vector is orthogonal to a space if it’s orthogonal to all the points


that belong to that space: y ⊥ Sp (X) ⇔ y′ X = 0, so y ⊥ Xλ for any λ.
25 A quadratic form is an expression like x′ Ax, where x is a vector and A is a square matrix,

usually symmetric. I sometimes use the metaphor of a sandwich and call x the “bread” and A the
“cheese”.

where the second equality comes from symmetry and the third one from idem-
potency. By the way, the above expression could be further manipulated to re-
obtain equation (1.11):

y′ MX y = y′ (I − PX )y = y′ y − y′ PX y = y′ y − y′ X(X′ X)−1 X′ y = y′ y − β̂ ′ (X′ y).

Example 1.4
Readers are invited to check (by hand or using a computer program of their
choice) that, with the matrices used in example 1.2, PX equals
 
PX = [ 0.3    0.2     0.1    0.4      0       0
       0.2    0.175   0.15   0.225    0.125   0.125
       0.1    0.15    0.2    0.05     0.25    0.25
       0.4    0.225   0.05   0.575   −0.125  −0.125
       0      0.125   0.25  −0.125    0.375   0.375
       0      0.125   0.25  −0.125    0.375   0.375 ]

and that does in fact satisfy the idempotency property.
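If you want to run the same kind of check on a computer without retyping the matrices of example 1.2, any full-rank X will do; the sketch below (Python with NumPy, purely illustrative, with an arbitrary made-up X) verifies symmetry, idempotency and the rank properties discussed above.

import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
X = rng.normal(size=(n, k))                             # any full-rank X will do
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

print(np.allclose(P, P.T), np.allclose(P @ P, P))       # symmetric and idempotent
print(np.allclose(M @ X, 0), np.allclose(M @ P, 0))     # M_X X = 0 and M_X P_X = 0
print(np.linalg.matrix_rank(P), np.linalg.matrix_rank(M))   # k and n - k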

In the present context, the advantage of using projection matrices is that the
main quantities that appear in the statistical problem of approximating y i via xi
become easy to represent in a compact and intuitive way:
Magnitude                     Symbol     Formula
OLS Coefficients              β̂          (X′X)−1 X′y
Fitted values                 ŷ          PX y
Residuals                     e          MX y
Sum of squared residuals      SSR        e′e = y′MX y

Take for example the special case X = ι. As we now know, the optimal solu-
tion to the statistical problem is using the sample average, so β̂ = Ȳ : the fitted
values are Pι y = ι · Ȳ and the residuals are simply the deviations from the mean:
e = Mι y = y − ι · Ȳ . Finally, deviance can be written as y′ Mι y.

1.4.2 Measures of fit


We are now ready to tackle a very important issue. How good is our model? We
know that β̂ is the best we can choose if we want to approximate y i via ŷ i =
x′i β , but nobody guarantees that our best should be particularly good. A natural
way to rephrase this question is: how much information are we losing in the
projection? We know the information loss is minimal, but it could still be quite
large.
In order to answer this question, let us start from the following two inequal-
ities:
0 ≤ ŷ′ ŷ = y′ PX y ≤ y′ y; (1.15)

the first one is rather obvious, considering that ŷ′ ŷ is a sum of squares, and
therefore non-negative. The other one, instead, can be motivated via y′ PX y =
y′ y−y′ MX y = y′ y−e′ e; since e′ e is also a sum of squares, y′ PX y ≤ y′ y. If we divide
everything by y′ y, we get

0 ≤ ŷ′ŷ / (y′y) = 1 − e′e / (y′y) = R²u ≤ 1.          (1.16)

This index bears the name R u2 (“uncentred R-squared”), and, as the above
expression shows, it’s bounded by construction between 0 and 1. It can be given
a very intuitive geometric interpretation: evidently, in Rn the points 0, y and ŷ
form a right triangle (see also Figure 1.3), in which you get a “good” leg, that is ŷ,
and a “bad” one, the segment linking ŷ and y, which is congruent to e: we’d like
the bad leg to be as short as possible. After Pythagoras’ theorem, the R u2 index
gives us (the square of) the ratio between the good leg and the hypotenuse. Of
course, we’d like this ratio to be as close to 1 as possible.

Example 1.5
With the matrices used in example 1.2, you get that y′ y = 24 and e′ e = 4.325;
therefore,
R²u = 1 − 4.325/24 ≃ 81.98%

The R u2 index makes perfect sense geometrically, but hardly any from a sta-
tistical point of view: the quantity y′ y has a natural geometrical interpretation,
but statistically it doesn’t mean much, unless we give it the meaning

y′ y = (y − 0)′ (y − 0),

that is, the SSR for a model in which ŷ = 0. Such a model would be absolutely
minimal, but rather silly as a model. Instead, we might want to use as a bench-
mark our initial proposal described in section 1.2, where X = ι. In this case,
the SSR is just the deviance of y, that is the sum of squared deviations from the
mean, which can be written as y′ Mι y.
If ι ∈ Sp (X) (typically, when the model contains a constant term, but not
necessarily), then a decomposition similar to (1.15) is possible: since y = ŷ + e,
then obviously
y′ Mι y = ŷ′ Mι ŷ + e′ Mι e = ŷ′ Mι ŷ + e′ e (1.17)

because if ι ∈ Sp (X), then Mι e = e.26 Therefore,

0 ≤ e′ e ≤ y′ Mι y,
26 Subsection 1.A.8 should help the readers who want this result proven.

where the second inequality comes from the fact that ŷ′ Mι ŷ is a sum of squares
and therefore non-negative.27 The modified version of R 2 is known as centred
R-square:
R² = 1 − e′e / (y′Mι y).          (1.18)
The concept of R 2 that we normally use in econometrics is the centred one, and
this is why the index defined at equation (1.16) has the “u” as a footer (from the
word uncentred).
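For concreteness, here is how the two indices could be computed on a toy dataset; the sketch is in Python with NumPy and the numbers are invented, so only the mechanics matter.

import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
R2_u = 1 - (e @ e) / (y @ y)                  # uncentred: the benchmark is yhat = 0
d = y - y.mean()
R2 = 1 - (e @ e) / (d @ d)                    # centred: the benchmark is the sample mean
print(R2_u, R2)       # R2_u is never smaller than R2 when the model contains a constant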
In a way, the definition of R 2 is implicitly based on a comparison between
different models: one which uses all the information contained in X and another
(smaller) one, which only uses ι, because y′ Mι y is just the SSR of a model in
which we regress y on ι. Therefore, equation (1.18) can be read as a way to com-
pare the loss function for those two models.
In fact, this same idea can be pushed a little bit further: imagine that we
wanted to compare model A and model B, in which B contains the same ex-
planatory variables as A, plus some more. In practice:

Model A y ≃ Xβ
Model B y ≃ Xβ + Zγ = Wθ
where W = [X Z] and θ = [β′ γ′]′.
The matrix Z contains additional regressors to model A. It is important to
realise that the information contained in Z could be perfectly relevant and le-
gitimate, but also ridiculously useless. For example, a model for the academic
performance of A. S. Tudent could well contain, as an explanatory variable, the
number of pets A. S. Tudent’s neighbours have, or the number of consonants in
A. S. Tudent’s mother’s surname.
It’s easy to prove that the SSR for model B is always smaller than that for A:

SSR A = e′a ea SSR B = e′b eb

where ea = MX y and eb = MW y. Since X ∈ Sp (W), clearly28 PW X = X and therefore

MW MX = MW ,

which implies MW ea = eb ; as a consequence,

SSR B = e′b eb = e′a MW ea = e′a ea − e′a PW ea ≤ e′a ea = SSR A .

More generally, if Sp (W) ⊃ Sp (X), then y′ MW y ≤ y′ MX y for any vector y.


27 Note that, in the rare but not impossible case ι ∉ Sp (X), it is perfectly possible that e′e > y′Mι y, so the centred version of the R 2 index may be negative.
28 Some may say: “well, not so clearly”. OK, here goes: X ∈ Sp (W) implies that there is a matrix H

such that X = WH. Hence, PW X = PW WH = WH = X.



The implication is: if we had to choose between A and B by using the SSR
as a criterion, model B would always be the winner, no matter how absurd the
choice of the variables Z is. The R 2 index isn’t any better: proving that

SSR B ≤ SSR A ⇒ R B2 ≥ R 2A .

is a trivial exercise, so if you add any explanatory variable to an existing model,


the R 2 index cannot become smaller.
A possible solution could be using a slight variation of the index, which goes
by the name of adjusted R 2 :

R̄² = 1 − (e′e / (y′Mι y)) · (n − 1)/(n − k) ,          (1.19)

where n is the size of our dataset and k is the number of explanatory variables. If
you add silly variables to a model, the SSR shrinks only slightly while k grows, so
the (n − 1)/(n − k) factor rises and offsets that effect: the adjusted index can actually fall. However,
as we will see in section 3.3.2, the best way of choosing between models is by
framing the decision in a proper inferential context.
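The point is easy to reproduce numerically: in the sketch below (Python with NumPy, made-up data) a column of pure noise is appended to a perfectly sensible model; R² cannot fall, but the adjusted index may.

import numpy as np

def r2(y, X):
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    d = y - y.mean()
    return 1 - (e @ e) / (d @ d)

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(size=n)

XA = np.column_stack([np.ones(n), x])               # model A
XB = np.column_stack([XA, rng.normal(size=n)])      # model B: A plus a useless regressor

R2A, R2B = r2(y, XA), r2(y, XB)
adjA = 1 - (1 - R2A) * (n - 1) / (n - XA.shape[1])
adjB = 1 - (1 - R2B) * (n - 1) / (n - XB.shape[1])
print(R2B >= R2A)            # always True
print(adjA, adjB)            # the adjusted index can go the other way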
One final thing on the R 2 index. Although it’s perfectly legitimate to think
that 0 is “bad” and 1 is “good”, it would be unwise to automatically consider a
number close to 0 (say, 10%) as “rather bad” or, symmetrically, a number close
to 1 (say, 90%) as “pretty good”: a model is an approximate description of the
dependent variable y i , insofar as the explanatory variables xi contain relevant
information. It may well be that the main determinants of y i are unobservable,
and therefore xi only manages to capture a small portion of the overall disper-
sion of y i . In these cases, the R 2 index will be very small, but it doesn’t neces-
sarily follow that our model is worthless: the relationship that it reveals between
the dependent variable and the explanatory variables may be extremely valu-
able, even if the fraction of variance we explain is small. But again, this idea is
more properly framed as a statistical inference issue, which is what chapter 3 is
about.

1.4.3 Reparametrisations
Suppose that there are two researchers (Alice and Bob), who have the same
dataset, which contains three variables: y i , x i and z i . Alice performs OLS on
the model
y i ≃ β1 x i + β2 z i
Bob, instead, computes the new variables s i = x i + z i and d i = x i − z i and com-
putes his coefficients using the transformed regressors as

y i ≃ γ1 s i + γ2 d i .

How different will the two models be? Before delving into algebra, it is worth
observing that Alice and Bob are using the same data, and it would be surprising

if they arrived at different conclusions. Moreover, Alice and Bob’s choices are
simply a matter of taste, and there’s no “right” way to set up a model. One could
compute s i and d i from x i and z i , or the other way around. In other words,
the set of explanatory variables Alice and Bob are using are invertible transfor-
mations of one another, and therefore must contain the same information, ex-
pressed in a different way.
With this in mind, a relationship between the two sets of parameters is easy
to find: start from Bob’s model

yi ≃ γ1 s i + γ2 d i =
= γ1 (x i + z i ) + γ2 (x i − z i ) =
= (γ1 + γ2 )x i + (γ1 − γ2 )z i ,

so β1 = (γ1 + γ2) and β2 = (γ1 − γ2). Clearly, this entails that Bob's parameters can
be recovered from Alice's as γ1 = (β1 + β2)/2 and γ2 = (β1 − β2)/2. It is perfectly legitimate
to surmise that the two models are in fact equivalent, and should give the same
fit.
More generally, it is possible to show that Alice’s model can be written as
y ≃ Xβ and Bob’s model as y ≃ Zγ , where Z = XA and A is square and invertible.
In the example above,
A = [ 1    1
      1   −1 ]

This simple fact has a very nice consequence on the respective projection
matrices:

PZ = Z(Z′ Z)−1 Z′
= XA(A ′ X′ XA)−1 A ′ X′
= XA(A)−1 (X′ X)−1 (A ′ )−1 A ′ X′
= X(X′ X)−1 X′ = PX ,

that is, the two projection matrices are the same.29 Therefore, Sp (X) = Sp (Z): Al-
ice and Bob are projecting y onto the same space. It should be no surprise that
they will get the same fitted values ŷ and the same residuals e. As a further con-
sequence, all the quantities that depend on the projection will be the same, such
as the sum of squared residuals, the R 2 index and so on. As a matter of fact, Al-
ice’s and Bob’s models are just the same model written in a different way, by a
different representation choice, which uses different parameters. The relation-
ship between the two sets of parameters is easy to show: since ŷ is the same for
the two models, then it must also hold

ŷ = Zγ̂ = XA γ̂ = Xβ̂ .
29 If you find some of the passages above unclear, then Section 1.A.4 may be useful.

and therefore β̂ = A γ̂ (and of course γ̂ = A −1 β̂ ). The word we use in this con-


text is reparametrisation: Bob’s model is a reparametrisation of Alice’s and vice
versa. The difference between the two is just aesthetic, so to speak: in some
cases, it could be more natural to interpret the coefficients of a model written in
a certain way than another. This is a very common trick in applied economics,
and an egregious example will be given in Section 5.5.
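Alice and Bob's situation can be replayed on a computer in a few lines; the sketch below (Python with NumPy, invented data, illustration only) confirms that the fitted values coincide and that β̂ = Aγ̂.

import numpy as np

rng = np.random.default_rng(3)
n = 50
x, z = rng.normal(size=n), rng.normal(size=n)
y = x + 2 * z + rng.normal(size=n)

X = np.column_stack([x, z])                   # Alice's regressors
A = np.array([[1.0, 1.0], [1.0, -1.0]])
Z = X @ A                                     # Bob's regressors: s = x + z, d = x - z

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
gamma_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
print(np.allclose(X @ beta_hat, Z @ gamma_hat))     # identical fitted values
print(np.allclose(beta_hat, A @ gamma_hat))         # beta_hat = A gamma_hat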

1.4.4 The Frisch-Waugh theorem


Projection matrices are also useful to illustrate a remarkable result, known as the
Frisch-Waugh theorem:30 given a model of the kind y = Xβ̂ + e, split X vertically
into two sub-matrices Z and W, and β accordingly
ŷ = [ Z  W ] [ β̂1
               β̂2 ]

Applying equation (1.7) we get the following:


[ β̂1 ]   [ Z′Z   Z′W ]⁻¹ [ Z′y ]
[ β̂2 ] = [ W′Z   W′W ]   [ W′y ]

It would seem that finding an analytical closed form for β1 and β2 as func-
tions of Z, W and y is quite difficult; fortunately, it isn’t so: start from

y = ŷ + e = Zβ̂1 + Wβ̂2 + e

and premultiply the equation above by MW :

MW y = MW Zβ̂1 + e,

since MW W = 0 (by construction) and MW e = e (because e = MX y, but Sp (W) ⊂


Sp (X), so MW MX = MX ).31 Now premultiply by Z′ :

Z′ MW y = Z′ MW Zβ̂1

since Z′ e = 0, because Z′ MX = 0. As a consequence,


β̂1 = (Z′MW Z)⁻¹ Z′MW y          (1.20)

Since MW is idempotent, an alternative way to write (1.20) could be


β̂1 = [(Z′MW)(MW Z)]⁻¹ (Z′MW)(MW y);

30 In fact, many call this theorem the Frisch-Waugh-Lovell theorem, as it was Michael Lovell who, in a paper that appeared in 1963, generalised the original result that Frisch and Waugh had obtained 30 years earlier to its present form.
31 If you’re getting a bit confused, you may want to take a look at section 1.A.8.

therefore β̂1 is the vector of the coefficients for a model in which the dependent
variable is the vector of the residuals of y with respect to W and the regressor
matrix is the matrix of residuals of Z with respect to W. For symmetry reasons,
you also obviously get a corresponding expression for β̂2 :
β̂2 = (W′MZ W)⁻¹ W′MZ y

In practice, a perfectly valid algorithm for computing β̂1 could be the following (a numerical check is sketched right after the list):

1. regress y on W; take the residuals and call them ỹ;

2. regress each column of Z on W; form a matrix with the residuals and call it
Z̃;

3. regress ỹ on Z̃: the result is β̂1 .
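The equivalence between the three-step recipe above and plain OLS on the full regressor set can be checked numerically; here is a sketch in Python with NumPy (invented data, illustration only).

import numpy as np

def ols(y, X):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(4)
n = 200
W = np.column_stack([np.ones(n), rng.integers(0, 2, n)])     # constant plus a dummy
Z = rng.normal(size=(n, 2))
y = Z @ np.array([1.0, -0.5]) + W @ np.array([2.0, 0.3]) + rng.normal(size=n)

beta_full = ols(y, np.column_stack([Z, W]))[:2]       # coefficients on Z, full regression

MW = np.eye(n) - W @ np.linalg.inv(W.T @ W) @ W.T     # residual-maker for W
beta_fw = ols(MW @ y, MW @ Z)                         # Frisch-Waugh route
print(np.allclose(beta_full, beta_fw))                # True: the two routes coincide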

This result is not just a mathematical curiosity, nor a computational gim-


mick: it comes in handy in a variety of situations for proving theoretical results.
For example, we’ll use this theorem more than once in chapters 3, 6 and 7.
An interpretation that the Frisch-Waugh theorem can be given is the follow-
ing: the coefficients for a group of regressors measure the response of ŷ having
taken into account the other ones or, as we say, “everything else being equal”.
The phrase normally used in the profession is “controlling for”. For example:
suppose that y contains data on the wages for n employees, that Z is their educa-
tion level and W is a geographical dummy variable (North vs South). The vector
ỹ = MW y will contain the differences between the individual wages and the aver-
age wage of the region where they live, in the same way as Z̃ = MW Z contains the
data on education as deviation from the regional mean. Therefore, regressing ỹ
on Z̃ is a way to implicitly take into account that differences in wages between
regions may depend on different educational levels. Consequently, by regressing
y on both the “education” variable and the regional dummy variable, the coeffi-
cient for education will measure its effect on wages controlling for geographical
effects.

1.5 An example
For this example, I got some data from the 2016 SHIW dataset;32 our dataset
contains 1917 individuals, who are full-time employees.33 We are going to use
four variables, briefly described in Table 1.1. Our dependent variable is going to
be w, the natural logarithm of the hourly wage in Euro. The set of explanatory
32 SHIW is the acronym for “Survey on Household Income and Wealth”, provided by the Bank of

Italy, which is a very rich and freely available dataset: see


https://www.bancaditalia.it/statistiche/tematiche/indagini-famiglie-imprese/
bilanci-famiglie/.
33 I can send you the details on the construction of the dataset from the raw data, if you’re inter-

ested. Just send me an email.



variables was chosen in accordance with some vague and commonsense idea
of the factors that can account for differences in wages. We would expect that
people with higher education and/or longer work experience should command
a higher wage, but we would also use the information on gender, because we are
aware of an effect called “gender gap”, that we might want to take into account.

Variable   Description                 Mean     Median   S. D.    Min     Max
w          Log hourly wage             2.22     2.19     0.364    0.836   4.50
g          dummy, male = 1             0.601    1.00     0.490    0.00    1.00
e          education (years)           11.7     13.0     3.60     0.00    21.0
a          work experience (years)     27.4     29.0     10.9     0.00    58.0

Table 1.1: Wage example

The data that we need to compute β̂ are:


 
X′X = [  1917    1153    22493     52527
         1153    1153    13299     32691
        22493   13299   288731    594479
        52527   32691   594479   1666703 ]

X′y = [   4253.3716
          2633.5507
         51038.9769
        116972.6710 ]

y′y = 9690.62

The reader is invited to check that the inverse of X′ X is (roughly)


 
(X′X)⁻¹ = 10⁻⁵ · [ 1356.9719   −120.7648   −63.9068   −17.6027
                   −120.7648    220.4901     1.2056    −0.9488
                    −63.9068      1.2056     4.4094     0.4177
                    −17.6027     −0.9488     0.4177     0.4844 ]

and therefore we have


 
β̂ = [ 1.3289
       0.1757
       0.0526
       0.0061 ]          e′e = 177.9738          R² = 29.76%

but it is much more common to see results presented in a table like Table 1.2.
At this point, there are quite a few numbers in the table above that we don’t
know how to read yet, but we have time for this: chapter 3 is devoted entirely to
this purpose. The important thing for now is that we have a reasonably efficient
way to summarise the information on wages via the following model:

ŵ i = 1.33 + 0.176g i + 0.053e i + 0.006a i



coefficient std. error t-ratio p-value


----------------------------------------------------------
const 1.32891 0.0355309 37.40 2.86e-230 ***
male 0.175656 0.0143224 12.26 2.42e-33 ***
educ 0.0526218 0.00202539 25.98 1.02e-127 ***
wexp 0.00608615 0.000671303 9.066 2.97e-19 ***

Mean dependent var 2.218765 S.D. dependent var 0.363661


Sum squared resid 177.9738 S.E. of regression 0.305015
R-squared 0.297629 Adjusted R-squared 0.296528

Table 1.2: Wage example — OLS output

where w i is the log wage for individual i , g i is their gender, and the rest follows.
In practice, if we had a guy who studied for 13 years and has worked for 20
years, we would guess that the log of his hourly wage would be

1.33 + 0.176 · 1 + 0.052 · 13 + 0.006 · 20 ≃ 2.31

which is roughly €10 an hour (which sounds reasonable).
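If you want to redo these computations yourself, the cross-product matrices printed above are all you need; the sketch below (Python with NumPy, used simply as a calculator) reproduces β̂ and the fitted value for our hypothetical individual.

import numpy as np

XtX = np.array([[ 1917,  1153,  22493,   52527],
                [ 1153,  1153,  13299,   32691],
                [22493, 13299, 288731,  594479],
                [52527, 32691, 594479, 1666703]], dtype=float)
Xty = np.array([4253.3716, 2633.5507, 51038.9769, 116972.6710])

beta_hat = np.linalg.solve(XtX, Xty)          # (X'X)^{-1} X'y
print(beta_hat)                               # roughly [1.329, 0.176, 0.053, 0.006]

w_hat = beta_hat @ np.array([1, 1, 13, 20])   # male, 13 years of education, 20 of experience
print(w_hat, np.exp(w_hat))                   # about 2.31, i.e. roughly 10 euro per hour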


The quality of the approximation is not bad: the R 2 index is roughly 30%,
which means that if we compare the loss functions for our model and the one
we get if we had just used the average wage, we get

0.298 = 1 − e′e / (y′Mι y)    =⇒    e′e = 0.702 · y′Mι y;

if you consider the dazzling complexity of the factors that potentially dictate why
two individuals get different wages, the fact that a simple linear rule involving
only three variables manages to describe 30% of the heterogeneity between in-
dividuals is surprisingly good.
Of course, nothing is stopping us from interpreting the sign and magnitude
of our OLS coefficients: for example, the coefficient for education is about 5%,
and therefore the best way to use the educational attainment variable for sum-
marising the data we have on wages is by saying that each year of extra edu-
cation gives you a guess which is about 5% higher.34 Does this imply that you
get positive returns to education in the Italian labour market? Strictly speaking,
it doesn’t. This number yields a fairly decent approximation to our dataset of
1917 people. To assume that the same regularity should hold for others is totally
unwarranted. And the same goes for the gender gap: it would seem that being
male shifts your fitted wage by 17.5%. But again, at the risk of being pedantic,
all we can say is that among our 1917 data points, males get (on average) more
34 One of the reasons why we economists love logarithms is that they auto-magically turn absolute changes into relative ones: β2 = dw/de = d ln(W)/de = (1/W) · dW/de ≃ (∆W/W)/∆e. In other words, the coefficient associated with the educational variable gives you a measure of the relative change in wage in response to a unit increase in education.

money than females with the same level of experience and education. Coinci-
dence? We should be wary of generalisations, however tempting they may be to
our sociologist self.
And yet, these thoughts are perfectly natural. The key ingredient to give sci-
entific legitimacy to this sort of mental process is to frame it in the context of
statistical inference, which is the object of the next chapter.

1.A Assorted results


This section contains several results on matrix algebra, in the simplest form pos-
sible. If you want an authoritative reference, my advice is to get one of Horn and
Johnson (2012), Abadir and Magnus (2005) or Lütkepohl (1996), which are all ex-
cellent and use a notation and style that is close to what we use in econometrics.

1.A.1 Matrix differentiation rules


The familiar concept of a derivative of a function of a scalar can be generalised
to functions of a vector
y = f (x),
where you have a real number y for every possible vector x. For example, if
y = x + w^z , you can define the vector x = [x, w, z]′ . The generalisation of the
concept of derivative is what we call the gradient, that is a vector collecting the
partial derivatives with respect to the corresponding elements of x. We adopt the
convention by which the gradient is a row vector; hence, for the example above,
the gradient is
∂y/∂x = [ ∂y/∂x   ∂y/∂w   ∂y/∂z ] = [ 1   z·w^(z−1)   log(w)·w^z ]
The cases we’ll need are very simple, because they generalise the simple uni-
variate functions y = ax and y = ax 2 . Let’s begin by
f(x) = a′x = Σ_{i=1}^{n} a_i x_i ;

evidently, the partial derivative of f (x) with respect to x i is just a i ; by stacking all
the partial derivatives into a vector, the result is just the vector a, and therefore
d(a′x)/dx = a′

note that the familiar rule d(ax)/dx = a is just a special case when a and x are
scalars.
As for the quadratic form
f(x) = x′Ax = Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij x_i x_j ;

it can be proven easily (but it’s rather boring) that

d(x′Ax)/dx = x′(A + A′)

and of course if A is symmetric (as in most cases), then d(x′Ax)/dx = 2 · x′A. Again,
note that the scalar case d(ax²)/dx = 2ax is easy to spot as a special case.
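A quick way to convince yourself that these rules are right is to compare them with numerical derivatives; the sketch below does so in Python with NumPy (the matrices are random and the finite-difference step is arbitrary).

import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 3))
a = rng.normal(size=3)
x = rng.normal(size=3)

def num_grad(f, x, h=1e-6):
    # crude central-difference approximation of the gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

print(np.allclose(num_grad(lambda v: a @ v, x), a))                    # d(a'x)/dx = a'
print(np.allclose(num_grad(lambda v: v @ A @ v, x), x @ (A + A.T)))    # d(x'Ax)/dx = x'(A+A')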
One last thing: the convention by which differentiation expands “by row”
turns out to be very useful because it makes the chain rule for the derivatives
“just work” automatically. For example, suppose you have y = Ax and z = B y;
of course, if you need the derivative of z with respect to x you may proceed by
defining C = B · A and observing that

z = B(Ax) = Cx    =⇒    ∂z/∂x = C
but you may also get the same result via the chain rule, as

∂z/∂x = (∂z/∂y) · (∂y/∂x) = B · A = C.

1.A.2 Vector spaces


Here we will draw heavily on the fact that a vector with n elements can be thought
of as a point in an n-dimensional space: a scalar is a point on the real line, a vec-
tor with two elements is a point on a plane, and so on. Actually, the notation
x ∈ Rn is a concise way of saying that x has n elements.
There are two basic operations we can perform on vectors: (i) multiplying a
vector by a scalar and (ii) summing two vectors. In both cases, the result you get
is another vector. Therefore, if you consider k vectors with n elements each, it
makes sense to define an operation called a linear combination of them:
z = λ1 x1 + λ2 x2 + · · · + λk xk = Σ_{j=1}^{k} λ_j x_j ;

note that the above could have been written more compactly in matrix notation
as z = Xλ, where X is a matrix whose columns are the vectors x j and λ is a k-
element vector.
The result is, of course, an n-element vector, that is a point in Rn . But the
k vectors x1 , . . . , xk are also a cloud of k points in Rn ; so we may ask ourselves if
there is any kind of geometrical relationship between z and x1 , x2 , . . . , xk .
Begin by considering the special case k = 1. Here z is just a multiple of x1 ;
longer, if |λ1 | > 1, shorter otherwise; mirrored across the origin if λ1 < 0, in the
same quadrant otherwise. Easy, boring. Note that, if you consider the set of all
the vectors z you can obtain by all possible choices for λ1 , you get a straight line
going through the origin, and of course x1 ; this set of points is called the space
spanned, or generated by x1 ; or, in symbols, Sp (x1 ). It’s important to note that

this won’t work if x1 = 0: in this case, Sp (x1 ) is not a straight line, but rather a
point (the origin).
If you have two vectors, instead, the standard case occurs when they are not
aligned with respect to the origin. In this case, Sp (x1 , x2 ) is a plane and z = λ1 x1 +
λ2 x2 is a point somewhere on that plane. Its exact location depends on λ1 and
λ2 , but note that

• by a suitable choice of λ1 and λ2 , no point on the plane is unreachable;

• no matter how you choose λ1 and λ2 , you can’t end up outside the plane.

However, if x2 is a multiple of x1 , then x2 ∈ Sp (x1 ) and Sp (x1 , x2 ) = Sp (x1 ), that is


a line, and not a plane. In this case, considering x2 won’t make Sp (x1 ) “grow” in
dimension, since x2 is already contained in it, so to speak.
In order to fully generalise the point, we use the concept of linear indepen-
dence: a set of k vectors x1 , . . . , xk is said to be linearly independent if none of
them can be expressed as a linear combination of the remaining ones.35 The
case I called “standard” a few lines above happens when x1 and x2 are linearly
independent.

1.A.3 Rank of a matrix


If we take k vectors with n elements each and we arrange them side by side so as
to form an (n × k) matrix (call it X), the maximum number of linearly indepen-
dent columns of X is the rank of X (rk (X) in formulae). The rank function enjoys
several nice properties:36

1. 0 ≤ rk (X) ≤ k (by definition);

2. rk (X) = rk (X′) ;

3. 0 ≤ rk (X) ≤ min(k, n) (by putting together the previous two); but if rk (X) =
min(k, n), and the rank hits its maximal value, the matrix is said to have
“full rank”;

4. rk (A · B) ≤ min(rk (A) , rk (B)); but in the special case when A′ = B, then
equality holds, and rk (B′B) = rk (BB′) = rk (B).

We can use the rank function to measure the dimension of the space spanned
by X. For example, if rk (X) = 1, then Sp (X) is a line, if rk (X) = 2, then Sp (X) is a
plane, and so on. This number may be smaller than the number of columns of
X.
35 The usual definition is that x1 , . . . , xk are linearly independent if no linear combination Σ_{j=1}^{k} λ_j x_j is zero unless all the λ_j are zero. The reader is invited to check that the two definitions are equivalent.
36 I’m not proving them for the sake of brevity: if you’re curious, have a look at https://en.

wikipedia.org/wiki/Rank_(linear_algebra).

A result we will not use very much (only in chapter 6), but is quite useful to
know in more advanced settings is that, if you have a matrix A with n rows, k
columns and rank r , it is always possible to write it as

A = UV ′

where U is (n × r ), V is (k × r ), and both have rank r . For example, the matrix

A = [ 1   0
      0   0
      0   0 ]

can be written as
A = (1, 0, 0)′ · (1, 0)

where

U = (1, 0, 0)′        and        V = (1, 0)′ .
Note that such decomposition is not unique: there are infinitely many pairs
of matrices that satisfy the decomposition above. The example above would
have worked just as well with
 
P = (−10, 0, 0)′        and        Q = (−0.1, 0)′

and the reader can easily verify that A = UV ′ = PQ ′ .

1.A.4 Rank and inversion


A square matrix A is said to be “invertible” if there is another matrix B such that
AB = B A = I , where I is the identity matrix. If B exists, it's also notated as A −1
and called the inverse of A; otherwise, A is said to be singular.
The mathematically accepted way to say “A is non-singular” is by writing
|A| ̸= 0, where the symbol |A| is used for the determinant of the matrix A, which
is a scalar function such that |A| = 0 if and only if A is singular.37
The concept of matrix inversion becomes quite intuitive if you look at it ge-
ometrically. Take a vector x with n elements. If you pre-multiply it by a square
matrix A you get another vector with n elements:

y = A x;
37 You almost never need to compute a determinant by hand, so I’ll spare you its definition. If

you’re curious, there’s always Wikipedia.



in practice, A defines a displacement that takes you from a point in Rn to a point


in the same space. Is it possible to “undo” this movement? If A takes you from
x to y, is there a matrix B that performs the return trip? If such a matrix exists,
then
x = B y.
The only way to guarantee that this happens for every pair of vectors x and y is
to have
AB = B A = I .
Now, note that y is a linear combination of the columns of A; intuition sug-
gests that all the n separate pieces of information originally contained in x can
get preserved during the trip only if the rank of A is n. In fact, it can be proven
formally that if A is an (n × n) matrix, then rk (A) = n is a necessary and suffi-
cient condition for A −1 to exist: for square matrices, full rank is the same thing
as invertibility.

The world of matrix algebra is populated with results that appear unintuitive when you're used to the algebra of scalars. A notable one is: any matrix A (even singular ones; even non-square ones) admits a matrix B such that ABA = A and BAB = B; B is called the “Moore-Penrose” pseudo-inverse, or “generalised” inverse. For example, the matrix A = [1 0; 0 0] is singular, and therefore has no inverse. However, it's got a pseudo-inverse, which is A itself. In fact, all projection matrices are their own pseudo-inverses.

Roger Penrose has been awarded the 2020 Nobel prize for physics. Not for the generalised inverse, but you get the idea of how brilliant the guy is.
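NumPy computes Moore-Penrose inverses via numpy.linalg.pinv; a tiny sketch (illustration only) confirms the two claims made in the box.

import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 0.0]])
B = np.linalg.pinv(A)                          # Moore-Penrose pseudo-inverse
print(B)                                       # equals A itself
print(np.allclose(A @ B @ A, A), np.allclose(B @ A @ B, B))

# a projection matrix is its own pseudo-inverse as well
x = np.array([[3.0], [1.0]])
P = x @ np.linalg.inv(x.T @ x) @ x.T
print(np.allclose(np.linalg.pinv(P), P))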

Computing an inverse in practice is very boring: this is one of those tasks


that computers are very good at, while humans are not. The only interesting
cases for which I’m giving you instructions on how to invert a matrix by hand
are:
1. when A is 2 × 2. In this case, it’s easy to memorise the explicit formula for
the inverse:

[ a  b ]⁻¹ = (1 / (ad − bc)) · [  d  −b ]
[ c  d ]                       [ −c   a ] ;
2. when A is block-diagonal, that is it can be written as
 
A1 0 · · · 0
 0 A2 · · · 0 
 
A=  .. .. .. ;
.. 
 . . . . 
0 0 ... Am
if the inverse exists, it has the same structure:
 
A −1
1 0 ··· 0
 0 A −1 ··· 0 
 
−1 2
A = .  .. .. .
.. 
 .. . . . 
0 0 . . . A −1
m

There are many nice properties that invertible matrices enjoy. For example:

• the inverse, if it exists, is unique; that is, if AB = I = AC , then B = C ;

• the inverse of a symmetric matrix is also symmetric;

• the transpose of the inverse is the inverse of the transpose ((A ′ )−1 = (A −1 )′ );

• if a matrix is positive definite (see section 1.A.7), then its inverse is positive
definite too;

• if A is invertible, then the only solution to Ax = 0 is x = 0; conversely, if A


is singular, then there exists at least one non-zero vector such that Ax = 0.

• if A and B are invertible, then (AB )−1 = B −1 A −1 .

1.A.5 Step-by-step derivation of the sum of squares function


The function we have to differentiate with respect to β is

L(β ) = e(β )′ e(β );

the elegant way to do this is by using the chain rule:

∂L(β)/∂β = [∂e(β)/∂β]′ e(β) + [e(β)′ ∂e(β)/∂β]′ = 2 · [∂e(β)/∂β]′ e(β);

the reason why we have to transpose the second element of the sum in the equa-
tion above is conformability: you can’t sum a row vector and a column vector.
Therefore, since e(β ) is defined as e(β ) = y − Xβ , we have

∂e(β)/∂β = −X

and the necessary condition for minimisation is ∂L(β)/∂β = −2X′e = 0, which of
course implies equation 1.9.

1.A.6 Numerical collinearity


Collinearity can sometimes be a problem as a consequence of finite precision of
computer algebra.38 For example, suppose you have the following matrix X:
 
X = [ 1    1
      2    2
      3    3
      4    4 + ϵ ]
38 If you find this kind of things intriguing I cannot but recommend chapter 1 in Epperson

(2013); actually, the whole book!



ϵ           (X′X)⁻¹ (X′X)

0.1         [ 1              0            ]
            [ 9.09495e−13    1            ]
0.01        [ 1             −1.16415e−10  ]
            [ 0              1            ]
0.001       [ 1              7.45058e−09  ]
            [ 2.23517e−08    1            ]
0.0001      [ 0.999999       0            ]
            [ 9.53674e−07    1            ]
1e-05       [ 0.999756       0            ]
            [ 0              0.999878     ]
1e-06       [ 0.992188       0            ]
            [ 0              0.992188     ]
1e-07       [ 0.5            0            ]
            [ 0.5            1            ]
1e-08       [ 1              1            ]
            [ 1              1            ]

Table 1.3: Numerical precision

For ϵ > 0, the rank of X is, clearly, 2; nevertheless, if ϵ is a very small number, a
computer program39 goes berserk; technically, this situation is known as quasi-
collinearity. To give you an example, I used gretl to compute (X′ X)−1 (X′ X) for
decreasing values of ϵ; Table 1.3 contains the results. Ideally, the right-hand side
column in the table should only contain identity matrices. Instead, results are
quite disappointing for ϵ = 1e − 05 or smaller. Note that this is not a problem
specific to gretl (which internally uses the very high quality LAPACK routines),
but a consequence of finite precision of digital computers.
This particular example is easy to follow, because X is a small matrix. But if
that matrix had contained hundreds or thousands of rows, things wouldn’t have
been so obvious.
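The experiment behind Table 1.3 was run in gretl; the sketch below repeats the idea in Python with NumPy, just to show that the issue is a property of finite-precision arithmetic rather than of any particular package. The exact figures you obtain will depend on your machine and library versions.

import numpy as np

for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-8]:
    X = np.array([[1.0, 1.0],
                  [2.0, 2.0],
                  [3.0, 3.0],
                  [4.0, 4.0 + eps]])
    XtX = X.T @ X
    try:
        check = np.linalg.inv(XtX) @ XtX       # should be the 2x2 identity matrix
        print(eps, np.round(check, 6).tolist())
    except np.linalg.LinAlgError:
        print(eps, "X'X is numerically singular")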

1.A.7 Definiteness of square matrices


A square matrix B is positive definite (pd for short) if the quadratic form x′ B x
returns a positive number for any choice of x and positive semi-definite (psd for
short) if x′ B x ≥ 0.
If B is positive (semi-)definite, then −B is negative (semi-)definite. Of course,
it is entirely possible that x′ B x can take positive or negative values depending
on x, in which case B is said to be indefinite. If B is semi-definite and invertible,
then it’s also definite. Figure 1.4 may be helpful.40
There are many interesting facts on psd matrices. A nice one is: if a matrix
39 I should say: a computer program not explicitly designed to operate with arbitrary precision.

There are a few, but no statistical package belongs to this category, for very good reasons.
40 In fact, figure 1.4 contains a slight inaccuracy. Finding it is left as an exercise to the reader.

Figure 1.4: Square matrices

[Diagram: the nested sets of positive/negative semi-definite matrices, positive/negative definite matrices, and invertible matrices.]

H exists such that B = H H ′ , then B is psd.41 This, for example, gives you a quick
way to prove that I is pd and PX is psd.

1.A.8 A few more results on projection matrices


Consider an n-dimensional space and a matrix X with n rows, k columns and
full rank. Of course, the columns of this matrix define a k-dimensional subspace
that we call Sp (X).
We would like to say something about the space spanned by matrices de-
fined as W = X · A. There are two cases of interest. The first one arises when A
is square and invertible: in this case, Sp (X) = Sp (W), so PX = PW . The result is
easy to prove: for any y ∈ Sp (X), there must be a vector β such that Xβ = y. But
then, by choosing γ = A −1 β , it’s easy to see that y can also be written as Wγ and
therefore y ∈ Sp (W); by a similar reasoning, it can also be proven that if y ∈ Sp (W)
then y also belongs to Sp (X), and therefore

y ∈ Sp (X) ⇐⇒ y ∈ Sp (W)

and the two sets are the same.


The equivalence of the two projection matrices can also be proven directly
by using elementary results on matrix inversion (see section 1.A.4):
PW = XA [A′X′XA]⁻¹ A′X′ = XA A⁻¹ [X′X]⁻¹ (A′)⁻¹ A′X′ = X [X′X]⁻¹ X′ = PX .

Let’s now consider the case when A is a matrix with rank less than k (for
example, a column vector). Evidently, any linear combination of the columns of
W is also a linear combination of the columns of X, and therefore each column
of W is an element of Sp (X). As a consequence, any vector that belongs to Sp (W)
also belongs to Sp (X).
41 Easy to prove, too. Try.

The converse is not true, however: some elements of Sp (X) do not belong
to Sp (W) (allow me to skip the proof ). In short, Sp (W) is a subset of Sp (X); in
formulae, Sp (W) ⊂ Sp (X).
A typical example occurs when W contains some of the columns of X, but
not all. Let’s say, without loss of generality, that W contains the leftmost k − p
columns of X. In this case, the matrix A can be written as
A = [ I
      0 ]

where the identity matrix above has k − p rows and columns, and the zero
matrix below has p rows and k − p columns.

        PW         MW          PX         MX
PW      PW         0           PW         0
MW      0          MW          PX − PW    MX
PX      PW         PX − PW     PX         0
MX      0          MX          0          MX
Important: it is assumed that Sp (W) ⊂ Sp (X). All products commute.

Table 1.4: Projection matrices “multiplication table”

In this situation, the property PX W = PX XA = XA = W implies some interest-


ing consequences on the projection matrices for the spaces Sp (W) and Sp (X),
that can be summarised in the “multiplication table” shown in Table 1.4. The
reader is invited to prove them; it shouldn’t take long.
Chapter 2

Some statistical inference

2.1 Why do we need statistical inference?


So far, we have followed a purely descriptive approach, trying to find the smartest
possible method for compressing as much information as we can from the orig-
inal data into a small, manageable container.
However, we are often tempted to read the evidence we have in a broader
context. Strictly speaking, any statistic we compute on a body of data tells us
something about those data, and nothing else. Thus, the OLS coefficients we
compute are nothing but a clever way to squeeze the relevant information out
of our dataset; however, we would often like to interpret the size and the mag-
nitudes of the coefficients that we get out of our OLS calculations as something
that tells us a more general story. In other words, we would like to perform what
in philosophical language is known as induction.
In the 18th century, the Scottish philosopher David
Hume famously argued against induction.

When it is asked, What is the nature of all


our reasonings concerning matter of fact? the
proper answer seems to be, that they are
founded on the relation of cause and effect.
When again it is asked, What is the foundation
of all our reasonings and conclusions concern-
ing that relation? it may be replied in one word, Experience. But if
we still carry on our sifting humour, and ask, What is the founda-
tion of all conclusions from experience? this implies a new question,
which may be of more difficult solution and explication.1

Inductive reasoning can be broadly formalised as follows:

1. Event X has always happened.


1 D. Hume, An Enquiry Concerning Human Understanding (1748).


2. The future will be like the past.

3. Therefore, event X will happen in the future.

Even if you could establish statement 1 beyond any doubt, statement 2 is ba-
sically an act of faith. You may believe in it, but there is no rational argument one
could convincingly use to support it. And yet, we routinely act on the premise of
statement 2. Hume considered our natural tendency to rely on it as a biological
feature of the human mind. And it’s a good thing: if we didn’t have this fun-
damental psychological trait, we’d be unable to learn anything at all;2 the only
problem is, it’s logically unfounded.
Statistical inference is a way to make an inductive argument more rigorous
by replacing statement number 2 with some assumptions that translate into for-
mal statements our tendency to generalise, by introducing uncertainty into the
picture. Uncertainty is the concept we use to handle situations in which our
knowledge is partial. So for example we cannot predict which number will show
up when we roll a die, although in principle it would be perfectly predictable,
given initial conditions, using the laws of physics. We simply don’t have the re-
sources to perform such a monster computation, so we represent our imperfect
knowledge through the language of probability, or, more correctly, via a proba-
bilistic model; then, we assume that the same model will keep being valid in the
future. Therefore, if we rolled a die 10 times and obtained something like

x = [1, 5, 4, 6, 3, 3, 2, 6, 3, 4]

we would act on the assumption that if we keep rolling the same die we will
observe something that, in our eyes, looks “just as random” as x. To put it differ-
ently, our aim will not be to predict exactly which side the die will land on, but
rather to make statements on how surprising or unsurprising certain outcomes
will be.
Therefore, we use the idea of a Data Generating Process, or DGP. We assume
that the DGP is the mechanism that Nature (or any divinity of your choice) has
used to produce the data we observe, and will continue doing so for the data
we have not observed yet. By describing the DGP via a mathematical structure
(usually, but not necessarily, via a probability distribution), we try to come up
with statistics T (x) whose aim is not as much to describe the available data x,
but rather to describe the DGP that generated x, and therefore, to provide us
with some insight that goes beyond merely descriptive statistics.
Of course, in order to accomplish such an ambitious task, we need a set of
tools to represent imperfect knowledge in a mathematical way. This is why we
need probability theory.

2 To be fair, this only applies to what Immanuel Kant called “synthetic” propositions. But maybe

I’m boring you?



2.2 A crash course in probability


Disclaimer: in this section, I’ll just go quickly though a few concepts that the
reader should already be familiar with; as a consequence, it is embarrassingly
simplistic and probably quite misleading in many ways. The reader who wants
to go for the real thing might want to read (in increasing order of difficulty): Gal-
lant (1997); Bierens (2011); Davidson (1994); Billingsley (1986). Having said this,
let’s go ahead.

2.2.1 Probability and random variables


The concept of probability has been the object of philo-
sophical debate for centuries. The meaning of probability
is still open for discussion,3 but fortunately the syntax of
probability is clear and undisputed since the great Soviet
mathematician Andrej Nikolaevič Kolmogorov made prob-
ability a proper branch of measure theory.
The meaning I give to the word probability here is
largely operational: probability is a number between 0 and
1 that we attach to something called an event. Loosely
speaking, an event is a statement that in our eyes could be
conceivably true or false. Formally, an event is defined as a
subset of an imaginary set, called the state space, and usually denoted by the
letter Ω, whose elements ω are all the states of the world that our mind can
conceive as possible.

Probability is a function of subsets of Ω, which obeys a few properties the reader


should already know, such as

P (Ω) = 1,        P (∅) = 0,        P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

and so forth. Event A can be defined as the subset of Ω including all the states
ω in which a statement A is true, and only those. P (A) is the measure of A,
where the technical word “measure” is a generalisation of our intuitive notion of
“extension” (length, area, volume).4 The familiar laws of probability are simple
consequences of the way usual set operations (complement, union, intersec-
tion) work; let’s waste no time on those.5
Random variables are a convenient way to map events to segments on the
real line. That is, a random variable X is defined as a measurable function from
3 The interested reader might want to have a look at Freedman and Stark (2016), section 2. You

can download it from https://www.stat.berkeley.edu/~stark/Preprints/611.pdf.


4 Warning: not all subsets can be associated with a corresponding probability: some are “non-

measurable”. Providing a simple example is difficult, this is deep measure theory: google “Vitali
set” if you’re curious.
5 In most cases, intuition will suffice. For tricky cases, I should explain what a σ-algebra is, but

I don’t think that this is the right place for this, really.

Ω to R; or, to put it differently, for any ω in Ω you get a corresponding real num-
ber X (ω). The requisite of measurability is necessary to avoid paradoxical cases,
and simply amounts to requiring that, if we define A as the the subset of Ω such
that
a < X (ω) ≤ b ⇐⇒ ω ∈ A,
then A is a proper event. In practice, it must be possible to define P (a < X ≤ b)
for any a and b. I will sometimes adopt the convention of using the acronym
“rv” for random variables.
There are two objects that a random variable comes equipped with: the first
is its support, which is the subset of R with all the values that X can take; in
formulae, X : Ω 7→ S ⊆ R, and the set S is sometimes indicated as S(X ). For a
six-sided die, S(X ) = {1, 2, 3, 4, 5, 6}; if X is the time before my car breaks down,
then S(X ) = [0, ∞), and so on.
The other one is its distribution function, or cumulative distribution func-
tion (often abbreviated as cdf), defined as

F X (a) = P (X ≤ a),

which of course makes it easy to compute the probability of X being inside an


interval as
P (a < X ≤ b) = F X (b) − F X (a).
By their definition, cdfs enjoy three basic properties:

• lima→−∞ F X (a) = 0;

• lima→∞ F X (a) = 1;

• if b > a, then F X (b) ≥ F X (a); that is, F X (·) is non-decreasing.

Apart from this, there’s very little that can be said in general. However, in many
cases it is assumed that F X (a) has a known functional form, which depends on
a vector of parameters θ.
Two special cases are of interest:

1. The cdf is a function that goes up in steps; the support is a countable set,
and the corresponding rv is said to be discrete; for every member of the
support x it is possible to define p(x) = P (X = x) > 0; the function p(x) is
the so-called probability function.

2. The cdf is everywhere differentiable; the support is an interval on R, (pos-


sibly, the whole real line), and the corresponding rv is said to be continu-
ous; the derivative of F x (a) is known as the density function of X , or f X (a)
and therefore, by definition,
P (a < X ≤ b) = ∫_a^b f_X(z) dz;

in most cases, when the meaning is clear from the context, we just write
the density function for X as f (x).6

In the rest of the book, I will mostly use continuous random variables for exam-
ples; hopefully, generalisations to discrete rvs should be straightforward.
Of course, you can collect a bunch of random variables into a vector, so you
have a multivariate random variable, or random vector. The multivariate ex-
tension of the concepts I sketched above is a little tricky from a technical view-
point, but for our present needs intuition will again suffice. I will only mention
that for a multivariate random variable x with k elements you have that

F x (a) = P [(x 1 ≤ a 1 ) ∩ (x 2 ≤ a 2 ) ∩ . . . ∩ (x k ≤ a k )]

If all the k elements of x are continuous random variables, then you can define
the joint density as
f_x(z) = ∂^k F_x(z) / (∂z_1 ∂z_2 · · · ∂z_k).
The marginal density of the i -th element of x is just the density of x i taken in iso-
lation. For example, suppose you have a trivariate random vector w = [X , Y , Z ]:
the marginal density for Y is
f_Y(a) = ∫_{S(X)} ∫_{S(Z)} f_w(x, a, z) dz dx

2.2.2 Independence and conditioning


If P (A) is the probabilistic evaluation we give of A, we may ask ourselves if we
would change our mind when additional information becomes available. If we
receive the news that event B has occurred, then we can safely exclude the event
B from Ω.7 In fact, after receiving the message “B is true”, our state space Ω
shrinks to B , because states in which B is false are no longer conceivable as pos-
sible.
The consequence for A is that the subset A ∩ B is no longer possible; hence,
we must update our probability measure so that P (B ) becomes 1, and conse-
quently8 P (A) must be revised as

P (A|B) ≡ P (A ∩ B) / P (B)          (2.1)
6 I imagine the reader doesn’t need reminding that the density at x is not the probability that

X = x; for continuous random variables, the probability is only defined for intervals by the for-
mula in the text, from which it follows that P (X = x) is 0.
7 When speaking about sets, I use the bar to indicate the complement.
8 Technically, it’s more complicated that this, because P (B ) may be 0, in which case the defini-

tion has to be adapted and becomes more technical. If you’re interested, chapter 10 in Davidson
(1994) is absolutely splendid.

You read the left-hand side of this definition as “the probability of A given B ”,
which is of course what we call conditional probability. It should be clear that

P (A) = P (A|B ) ⇐⇒ P (A ∩ B ) = P (A) · P (B )

Which means: “if you don’t need to revise your evaluation of A after having
received some message about B , then A and B have nothing to do with each
other”; in this situation, A and B are said to be independent, and we write A ⊥
⊥ B,
so independence can be thought of as lack of mutual information. Note that in-
dependence is a symmetric concept: if A is independent of B , then B is inde-
pendent of A, and vice versa.

Equation (2.1) has the following implication:

P (A ∩ B) = P (A|B) · P (B) = P (B|A) · P (A),

so that

P (A|B) = P (B|A) · P (A) / P (B).

The expression above is interesting for many reasons. One is: in general, P (A|B) ̸= P (B|A), so, for example, the probability of dying from COVID if you're not vaccinated is not the same thing as the probability that someone who died from COVID was a no-vaxxer (think about it).
Another reason is that this expression is the cornerstone of an approach to statistics known as Bayesian, after the English statistician and clergyman Reverend Thomas Bayes, who lived in the 18th century; I'm not going to use anything Bayesian in this book, but Bayesian methods are getting increasingly popular in many areas of econometrics.

The same concept can be applied to random variables. If

F Y (z) = F Y (z|a < X ≤ b)

for any a and b, then evidently X carries no information about Y , and we say
that the two random variables are independent: Y ⊥⊥ X. If this is not the case,
it makes sense to consider the conditional distribution of Y on X , which de-
scribes our uncertainty about Y once we have information about X . So for ex-
ample, if Y is the yearly expenditure on food by a household and X is the number
of its components, it seems safe to say that F (Y |X > 6) should be different from
F (Y |X < 3), because more people eat more food.
The case a = b is important9 , because it gives us a tool for evaluating proba-
bilities about Y in a situation when X is not uncertain at all, because in fact we
observe its realisation X = x. In this case, we can define the conditional density
as
$$f_{Y|X=x}(z) = \frac{f_{Y,X}(z, x)}{f_X(x)} \qquad (2.2)$$
and when what we mean is clear from the context, we simply write f (y|x).
Therefore, in many cases we will use the intuitive notion of X , the set of
random variables we are conditioning Y on, as being “the relevant information
9 Albeit special: a moment’s reflection is enough to convince the reader that if X is continuous,

the event X = x has probability 0, and our naïve definition of conditioning breaks down. But
again, treating the subject rigorously implies using measure theory, σ-algebras and other tools
that I’m not willing to use in this book.

about Y that we have”; in certain contexts, this idea is expressed by the notion
of an information set. However, a formalised description of this idea is, again,
far beyond the scope of this book and I am content to leave this to the reader's
intuition.

2.2.3 Expectation
The expectation of a random variable is a tremendously important concept.
A rigorous definition, valid in all cases, would require a technical tool called
Lebesgue integral, that I’d rather avoid introducing. Luckily, in the two elemen-
tary special cases listed in section 2.2.1, its definition is quite simple:
$$E[X] = \sum_{x \in S(X)} x \cdot p(x) \quad \text{for discrete rvs} \qquad (2.3)$$
$$E[X] = \int_{S(X)} z \cdot f_X(z)\, dz \quad \text{for continuous rvs.} \qquad (2.4)$$

The expectation of a function of X, E[h(X)], is defined simply as
$$E[h(X)] = \int_{S(X)} h(z) \cdot f_X(z)\, dz$$

for continuous random variables and the parallel definition for the discrete case
is obvious. The extension to multivariate rvs should also be straightforward: the
expectation of a vector is the vector of expectations.
Some care must be taken, since E [X ] may not exist, even in apparently harm-
less cases.

Example 2.1
If X is a uniform continuous random variable between 0 and 1, its density func-
tion is f (x) = 1 for 0 < x ≤ 1. Its expectation is easy to find as
$$E[X] = \int_0^1 x \cdot 1\, dx = \left[\frac{x^2}{2}\right]_0^1 = 1/2;$$

however, it’s not difficult to prove that E [1/X ] does not exist (the corresponding
integral diverges):
$$E[1/X] = \int_0^1 \frac{1}{x} \cdot 1\, dx = \big[\log x\big]_0^1 = \infty.$$
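If you want to see this phenomenon on your own computer, here is a minimal simulation sketch (mine, in Python with numpy; the seed and sample sizes are arbitrary choices): the sample mean of X settles nicely around 1/2, while the sample mean of 1/X keeps growing, precisely because E[1/X] does not exist.

import numpy as np

# Illustration of Example 2.1 by simulation (a sketch, not from the text):
# mean(X) stabilises near 0.5, mean(1/X) does not stabilise at all.
rng = np.random.default_rng(42)
for n in (10**3, 10**5, 10**7):
    x = rng.uniform(0.0, 1.0, size=n)        # U(0,1) draws
    print(f"n = {n:>8}:  mean(X) = {x.mean():.4f}   mean(1/X) = {(1/x).mean():.2f}")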

However, it can be proven that if the support of X is finite, then E [X ] is al-


ways bounded, and therefore exists. To be more specific: if S(X ) = [a, b], where

a and b are finite, then a < E [X ] < b. The proof is easy and left to the reader as
an exercise.
The expectation operator E [·] is linear, and therefore we have the following
simple rule for affine transforms (A and b must be non-stochastic):

E [Ax + b] = A · E [x] + b (2.5)


For nonlinear transformations, things are not so easy. As a rule, E[g(X)] ≠ g[E[X]],
and there’s very little you can say in general.10
The expectation of the k-th power of X is called its k-th moment, so the first moment is E[X], the second moment is $E[X^2]$ and so on. Of course, $E[X^n]$ (with n ≥ 1) may not exist, but if it does then $E[X^{n-1}]$ is guaranteed to exist too.

The most conspicuous example of the usefulness of moments is the definition of variance:11 $V[X] = E[X^2] - E[X]^2$. The variance is always non-negative, and is the most widely used indicator of dispersion. Of course, in order to exist, the second moment of X must exist. Its multivariate generalisation is the covariance matrix, defined as
$$\text{Cov}[\mathbf{x}] = E[\mathbf{x}\mathbf{x}'] - E[\mathbf{x}]\, E[\mathbf{x}]'; \qquad (2.6)$$
The properties of Cov [x] should be well known, but let’s briefly mention the
most important ones: if Σ = Cov [x], then
• Σ is symmetric;

• if $x_i \perp\!\!\!\perp x_j$, then $\Sigma_{ij} = 0$ (warning: the converse is not necessarily true);

• Σ is positive semi-definite.12
Definition 2.6 makes it quite easy to calculate the covariance matrix of an
affine transform:13
Cov [Ax + b] = A · Cov [x] · A ′ . (2.7)
Note that this result makes it quite easy to prove that if X and Y are independent
rvs, then V [X + Y ] = V [X ] + V [Y ] (hint: put X and Y into a vector and observe
that its covariance matrix is diagonal).
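A quick numerical check of rule (2.7) may be reassuring; the following sketch (mine, in Python; the matrices A and b, the seed and the sample size are arbitrary choices) compares the sample covariance matrix of an affine transform with A·Cov[x]·A′:

import numpy as np

# Minimal sketch of rule (2.7): Cov[Ax + b] should equal A Cov[x] A'.
rng = np.random.default_rng(1)
n, k = 100_000, 3

x = rng.standard_normal((n, k)) @ np.array([[1.0, 0.5, 0.0],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 1.0]])   # correlated data
A = np.array([[2.0, -1.0, 0.5],
              [0.0,  1.0, 1.0]])
b = np.array([1.0, -2.0])

y = x @ A.T + b                       # affine transform, observation by observation
lhs = np.cov(y, rowvar=False)         # sample Cov[Ax + b]
rhs = A @ np.cov(x, rowvar=False) @ A.T
print(np.round(lhs - rhs, 3))         # should be (nearly) a 2x2 zero matrix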

2.2.4 Conditional expectation


The easiest way to see the conditional expectation of Y given X is by defining it
as the expectation of Y with respect to f (Y |X = x), that is
Z
E [Y |X = x] = z · f Y |X =x (z) dz.
S(Y )
10 In fact, there is something more general we can say, when the transformation g (X ) is concave

on the whole support of X : it’s called Jensen’s lemma. We will not use this result in this book, but
the result is widely used in economics and econometrics; if you’re interested, the idea is briefly
explained in section 2.A.1.
11 An alternative equivalent definition, perhaps more common, is V [X ] = E (X − E [X ])2 .
£ ¤
12 If you’re wondering what “semi-definite” means, you may want to go back to section 1.A.7.
13 The proof is an easy exercise, left to the reader.

If f Y |X =x (z) changes with x, the result of the integral (if it exists) should change
£ ¤
with x too, so we may see E y|x = m(x) as a function of x. This function is
sometimes called the regression function of Y on X .
£ ¤
Does E y|x have a closed functional form? Not necessarily, but if it does, it
hopefully depends on a small number of parameters θ.

Example 2.2
Assume that you have a bivariate variable (Y , X ) where Y is 1 if an individual
catches COVID and 0 otherwise, and X is 1 if the same individual is vaccinated.
Suppose that the joint probability is

         X = 0    X = 1
Y = 0    0.1      0.3
Y = 1    0.3      0.3

The probability of catching COVID among vaccinated people is 0.3/(0.3 + 0.3) = 50%, while for unvaccinated people it's 0.3/(0.1 + 0.3) = 75%. The same statement could have
been stated in formulae as

E [Y |X ] = 0.75 − 0.25X ,

which gives 0.5 if X = 1 and 0.75 if X = 0. The regression function of Y on X is


linear (E [Y |X ] = θ0 + θ1 X ), and it depends on the vector θ = [θ0 , θ1 ].
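Small as the example is, it fits in a few lines of code; here is a minimal Python sketch (mine, not part of the original text) that recovers the conditional expectations from the joint probability table:

import numpy as np

# The joint probability table of Example 2.2: rows index Y (0, 1), columns X (0, 1).
p = np.array([[0.1, 0.3],
              [0.3, 0.3]])

p_x = p.sum(axis=0)                   # marginal of X: [0.4, 0.6]
e_y_given_x = p[1, :] / p_x           # P(Y=1 | X=x) = E[Y | X=x]
print(e_y_given_x)                    # [0.75, 0.5], i.e. 0.75 - 0.25*X

# as a check: averaging E[Y|X] over the distribution of X gives E[Y] = 0.6
print(e_y_given_x @ p_x, p[1, :].sum())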

Of course, if x is a random variable, m(x) is too. Does it have an expectation?


If so,
$$E\big[E[y|x]\big] = E[y]. \qquad (2.8)$$
This is called law of iterated expectations, and is in fact more general than
it appears at first sight. For example, it applies to density functions too:
$$f(y) = E\big[f(y|x)\big]$$

To continue with example 2.2, note that, since E[X] = 0.6, E[Y] = E[0.75 − 0.25 · X] = 0.75 − 0.25 · E[X] = 0.75 − 0.25 × 0.6 = 0.6.

Example 2.3
As a more elaborate example, suppose that

$$E[Y|X] = m(X) = 4X - 0.5X^2$$
and that E[X] = V[X] = 1. It follows that
$$E[Y] = E[m(X)] = 4 \cdot E[X] - 0.5 \cdot E[X^2] = 4 - 0.5 \cdot 2 = 3$$
where I used $V[X] = E[X^2] - E[X]^2$.

It must be stressed that expressing the relationship between two random variables by means of the conditional expectation carries no implication of a causal relationship. Example 2.2 above should not be taken to imply, by itself, that if
you get vaccinated your chances of getting ill are lower, although the idea is very
natural. More on this in Section 3.1.

2.3 Estimation
The best way to define the concept of an estimator is to assume that we observe
some data x, and that the DGP which generated x can be described by means
of a vector of parameters θ. We assume to know nothing about θ, apart from
the fact that it can be thought of as a vector with a certain number of elements
(say, k), and that it belongs to a subset S of Rᵏ called the parameter space. An
estimator is a statistic θ̂ = T (x) that should be “likely” to yield a value “close to”
the parameters of interest θ.
To state the same idea more formally: since x is random and θ̂ is a function
of x, then θ̂ is a random variable too, and therefore it must have a support and
a distribution function. Clearly, both will depend on those of x, but ideally, we’d
like to choose the function T (·) so that the support S(θ̂) contains at least a neigh-
bourhood of θ, and we’d like the probability of observing a realisation of θ̂ that
is “near” θ, P (θ − ϵ < θ̂ < θ + ϵ) = P (|θ̂ − θ| < ϵ), to be as close to 1 as possible.
The indispensable ingredient for evaluating those probabilities would be the
distibution of θ̂ = T (x). However, it is almost always tremendously difficult to
pin it down exactly, either because of the characteristics of x, which could be a
very complex random variable, or because the function T (·) could be very intri-
cate. In fact, the cases when we’re able to work out the exact distibution of θ̂ are
exceptionally few. In very simple cases,14 we may be able to compute E[θ̂] and perhaps even V[θ̂], which leads us to the well known concepts of unbiasedness and efficiency:

• the bias of θ̂ is the difference E[θ̂] − θ; therefore θ̂ is said to be unbiased if E[θ̂] = θ;

• θ̂ is more efficient than θ̀ if V[θ̀] − V[θ̂] > 0 (if both are unbiased).

The problem that makes these concepts not very useful is that, in many cases of
interest, it’s very hard, if not impossible, to compute the moments of θ̂ (in some
cases, θ̂ may even possess no moments at all). So we need to use something else.
Fortunately, asymptotic theory comes to the rescue.

2.3.1 Consistency
The estimator θ̂ is consistent if its probability limit is the parameter we want to
estimate. To explain what this means, let us first define convergence in proba-
14 Notably, when θ̂ is an affine function of x.

bility:
$$X_n \xrightarrow{p} X \iff \lim_{n\to\infty} P\big[|X_n - X| < \epsilon\big] = 1 \qquad (2.9)$$

Also notated as plim(Xₙ) = X.


A description in words of the definition above is: given a sequence of random
variables X 1 , X 2 , . . . , X n , we define a parallel sequence of events of the kind

|X 1 − X | < ϵ, |X 2 − X | < ϵ, ..., |X n − X | < ϵ;

the sequence above can be read as a sequence of events, where |X i − X | < ϵ


means “X i is more or less X ”.15 Convergence in probability means that the se-
quence of probabilities for those events tends to 1; that is, the probability of X n
and X being “substantially different” becomes negligible if n is large.16
In general, the limit X could be a random variable, but we'll be mostly interested in the case when the limit is a constant: if $X_n \xrightarrow{p} a$, the chances of
X n being far from a become zero, and therefore the cdf of X n tends to a step
function which is 0 before a and 1 after it. Or, if X n is continuous, the density
function f (X n ) collapses to a point.
This is exactly what happens, in many circumstances, when we compute the
sample average in a data set. Imagine you have n observations: you can com-
pute the average of the observations you have as they become available; that
is,
$$\bar X_1 = X_1, \quad \bar X_2 = \frac{X_1 + X_2}{2}, \quad \bar X_3 = \frac{X_1 + X_2 + X_3}{3}, \;\cdots;$$
does the sequence X̄ n have a limit in probability? Or, in other words, if n is large
enough, do we have good chances that X̄ n will be a number arbitrarily near to
something? The question may sound abstract and technical, but in fact this is
something that we implicitly do all the time, when we try something many times
in the hope that our knowledge stabilises with repetition.
The conditions that must occur for this idea to make sense are studied by
the so-called Laws of Large Numbers, or LLNs for short.17 There are many dif-
ferent LLNs, that cover different cases. Basically, there are three dimensions to
the problem that must be considered:

1. How heterogeneous are the X i variables?

2. Are the X i variables independent?

3. Can we assume the existence of at least some of the moments?

The simplest version of the LLN is due to the Soviet mathematician Alek-
sandr Khinchin, and sets very strong bounds on the first two conditions and
15 Where ϵ > 0 is the mathematically respectable way of saying “more or less”.
16 The curious reader might be interested in knowing that there are several other ways to define

a similar concept. A particularly intriguing one is the so-called “almost sure” convergence.
17 Technically, these are the weak LLNs. The strong version uses a different concept of limit.

relaxes the third one as much as possible: if x₁, x₂, . . . , xₙ are independent and identically distributed (iid for short) and E[xᵢ] = m, then $\bar X \xrightarrow{p} m$. Other ver-
sions exist: for example, a different version of the LLN can be used if obser-
vations are not independent, but in that case more stringent assumptions are
needed; allow me to skip these complications. For the curious, an example is
provided in section 2.A.2.

Example 2.4
Let’s toss a coin n times. The random variable representing the i -th toss is x i ,
which is assumed to obey the following probability distribution (often referred
to as a Bernoulli distribution):
$$x_i = \begin{cases} 1 & \text{with probability } \pi \\ 0 & \text{with probability } 1 - \pi \end{cases}$$
Note that the probability π is assumed to be the same for all x i ; that is, the coin
we toss does not change its physical properties during the experiment. More-
over, it is safe to assume that what happens at the i -th toss has no consequences
on all the other ones. In short, the x i random variables are iid.
Does xᵢ have a mean? Yes: E[xᵢ] = 1 · π + 0 · (1 − π) = π. Together with the iid property, this is enough for invoking the LLN and establishing that $\bar X = \hat p \xrightarrow{p} \pi$.
Therefore, we can take the empirical frequency p̂ as a consistent estimator of
the true probability π.
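A small simulation may help visualise what this means in practice; here is a minimal sketch (mine, in Python; the seed and the choice π = 0.5 are arbitrary):

import numpy as np

# Example 2.4 by simulation: the empirical frequency of "heads" gets
# arbitrarily close to the true pi as the number of tosses grows.
rng = np.random.default_rng(7)
true_pi = 0.5

for n in (10, 100, 10_000, 1_000_000):
    tosses = rng.binomial(1, true_pi, size=n)   # n Bernoulli(pi) draws
    print(f"n = {n:>9}:  p_hat = {tosses.mean():.4f}")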

The LLN becomes enormously powerful when coupled with another won-
derful result, which is a special case of a powerful tool called Slutsky’s Theorem,
that I'm not presenting in full here. If $X_n \xrightarrow{p} a$ and g(·) is continuous at a, then $g(X_n) \xrightarrow{p} g(a)$ (note how much easier this property makes it to work with prob-
ability limits rather than expectations).
In the context of estimation, obviously we will want our estimators to be con-
sistent:
$$\hat\theta \xrightarrow{p} \theta \iff \lim_{n\to\infty} P\big[|\hat\theta - \theta| < \epsilon\big] = 1; \qquad (2.10)$$
that is, we will want to use as estimators statistics that become increasingly un-
likely to be grossly wrong. Fortunately, the combination of the LLN and Slutsky’s
Theorem provides a very nice way to devise estimators that are consistent by
construction. If the average has a probability limit that is a continuous, invert-
ible function of the parameter we want, we just apply a suitable transformation
to the average and we're done: so for example if E[xᵢ] = 1/θ, then θ̂ = 1/X̄; if E[xᵢ] = e^θ, then θ̂ = log(X̄); if E[xᵢ] = θ², then θ̂ = √X̄; and so on.
More generally, the extension to the case when θ is a vector is technically
messier, but conceptually identical. This is known as the method of moments:
it is by no means the only one used in inferential statistics, but it will suffice for
our purposes. The core intuition that motivates it is relatively straightforward:

1. Express the moments of the observables as continuous functions of the parameters of interest θ: m = m(θ).

2. Estimate m via the corresponding sample moments m̂, using the LLN, so that $\hat m \xrightarrow{p} m$.

3. Estimate θ by inverting the correspondence between parameters and moments: $\hat\theta = m^{-1}(\hat m)$. This should guarantee consistency: $\hat\theta \xrightarrow{p} \theta$.

Example 2.5
Suppose you have a sample of iid random variables for which you know that
$$E[X] = \frac{p}{\alpha} \qquad E[X^2] = \frac{p(p+1)}{\alpha^2};$$
and define the two statistics $m_1 = \bar X = n^{-1}\sum_i x_i$ and $m_2 = n^{-1}\sum_i x_i^2$. Clearly
$$m_1 \xrightarrow{p} \frac{p}{\alpha} \qquad m_2 \xrightarrow{p} \frac{p(p+1)}{\alpha^2}.$$
Now consider the statistic $\hat p = \frac{m_1^2}{m_2 - m_1^2}$. Since $\hat p$ is a continuous function of both $m_1$ and $m_2$,
$$\hat p = \frac{m_1^2}{m_2 - m_1^2} \xrightarrow{p} \frac{p^2/\alpha^2}{p(p+1)/\alpha^2 - p^2/\alpha^2} = \frac{p^2}{p^2 + p - p^2} = p,$$
so $\hat p$ is a consistent estimator of p.
But then, by the same token, by dividing $\hat p$ by $m_1$ you get that
$$\frac{\hat p}{m_1} = \frac{m_1}{m_2 - m_1^2} \xrightarrow{p} \frac{p}{p/\alpha} = \alpha,$$
so you get a second statistic, $\hat\alpha = \frac{m_1}{m_2 - m_1^2}$, which estimates α consistently.
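If you want to try the recipe on artificial data, note that one distribution whose first two moments match those above is the Gamma with shape p and rate α; the following sketch (mine, in Python; the parameter values and seed are arbitrary) uses it to check that p̂ and α̂ do what they promise:

import numpy as np

# Method-of-moments estimates of p and alpha as in Example 2.5,
# on simulated Gamma(shape=p, rate=alpha) data.
rng = np.random.default_rng(3)
p_true, alpha_true, n = 2.0, 0.5, 100_000

x = rng.gamma(shape=p_true, scale=1.0 / alpha_true, size=n)
m1, m2 = x.mean(), (x**2).mean()

p_hat = m1**2 / (m2 - m1**2)
alpha_hat = m1 / (m2 - m1**2)
print(f"p_hat = {p_hat:.3f}   alpha_hat = {alpha_hat:.3f}")   # close to 2 and 0.5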

From the discussion above, the meaning of an estimator being consistent


should be rather clear: a consistent estimator is a statistic that becomes arbi-
trarily precise if your dataset is large enough, because its distribution tends to
collapse to a single point if n goes to infinity. Interestingly, there are two ways in
which an estimator may not be consistent: one case arises when θ̂ has a probability limit, but that limit is different from the desired point; in other words, $\hat\theta \xrightarrow{p} c \ne \theta$.
However, it may be the case that θ̂ does not have a probability limit at all. In that
case, lack of consistency is a consequence of the distribution of our statistic not
collapsing to a single point, but rather remaining spread across a set of values
no matter how large n is.

2.3.2 Asymptotic normality


Consistency is important because we want our estimators to be reasonably pre-
cise for large samples, but this is almost never enough, as we may need to make
more precise statements on the distribution of our estimators.
For example, imagine that we have two consistent estimators for the same
quantity: that is, two different statistics θ̂ and θ̃ that have the same probability
limit θ. How do we choose which one to use? Consistency can’t be used as a
criterion, since they are both consistent: if we define

= P |θ̂ − θ| < ϵ
£ ¤
Pbn
= P |θ̃ − θ| < ϵ
£ ¤
Pen

clearly limn→∞ Pbn = limn→∞ Pen = 1, so a decision can’t be made on these grounds.
Nevertheless, if we could establish that, for n large enough, Pbn > Pen , so that our
probability of being grossly wrong is lower if we use θ̂ instead of θ̃, our preferred
course of action would be obvious. Unfortunately, this is not an easy check: Pbn
is defined as Z θ+ϵ
Pn =
b fˆ(x)dx,
θ−ϵ

where fˆ(x) is the density function for θ̂ (clearly, a parallel definition holds for
Pen ). In most cases, the analytical form of fˆ(x) is very hard to establish, if not at
all impossible. However, we could try to approximate the actual densities with
something good enough to perform the required check. This is almost invari-
ably achieved by resorting to a property called asymptotic normality, by which
the unknown density fˆ(x) can be approximated via a suitably chosen Gaussian
density.18
At first sight, this sounds like a very ambitious task: how can we hope to
make general statements on the distribution of an arbitrary function of arbitrar-
ily distributed random variables? Besides, why the Gaussian density, rather than
something else? What’s so special about the bell-shaped curve?
And yet, there is a result that applies in a surprisingly large number of cases,
and goes under the name of Central Limit Theorem, or CLT for short. Basically,
the CLT says that, under appropriate conditions, when you observe a random
variable X that can be conceivably thought of as the accumulation of a large
number of random causes that are reasonably independent of each other, with
none of them dominating the others in magnitude, there are very good chances
that the distribution of X should be approximately normal.
The practical effect of this theorem is ubiquitous in nature; most natural
phenomena follow (at least approximately) a Gaussian distribution; the width
of leaves, the length of fish, the height of humans. The French mathematician
Henri Poincaré is credited with the following remark:
18 I assume that the reader is reasonably comfortable with the Gaussian distribution, but section

2.A.4 is there, just in case.



Everyone is sure of this, Mr. Lippman told me one day, since the
experimentalists believe that it is a mathematical theorem, and the
mathematicians that it is an experimentally determined fact.19

In order to illustrate the concept, we have to begin by


defining convergence in distribution:

$$X_n \xrightarrow{d} X \iff F_{X_n}(z) \to F_X(z) \qquad (2.11)$$

When X n converges in distribution to X , the difference


between P (a < X n ≤ b) and P (a < X ≤ b) becomes negli-
gible for large n. So, for large n, we can approximate quite
accurately the probability of events defined for X n via the
corresponding event defined for X .20
Note the fundamental difference between convergence in probability and in
p
distribution: if X n −→ X , then for large n each time we observe a realisation of
X n and X we can be fairly confident that the two numbers will be very close. If
d
X n −→ X instead, there is no guarantee that |X n − X | will be small: the only thing
we know is that they come from (nearly) the same probability distribution, and
therefore all we can say is that P (a < X n ≤ b) should be very close to P (a < X ≤
b). Convergence in distribution is useful because probabilities involving X are
often much easier to compute than probabilities involving X n .

Convergence in distribution is a much weaker concept than convergence in probability: for example, take a sequence X₁, X₂, . . . , Xₙ of iid random variables with the same distribution F. Of course, by the definition we can say that $X_n \xrightarrow{d} X$, where the distribution of X is, again, F, but there is very little we can say about the behaviour of the sequence itself. On the other hand, if $X_n \xrightarrow{p} X$, the fact that $\lim_{n\to\infty} P[|X_n - X| < \epsilon] = 1$ implies that, when n is large, P(a < Xₙ < b) ≃ P(a < X < b) for every interval (a, b), and therefore $X_n \xrightarrow{d} X$. This result is often spelt “convergence in probability implies convergence in distribution, but not vice versa”, or $X_n \xrightarrow{p} X \Rightarrow X_n \xrightarrow{d} X$.

Now imagine that the LLN holds and $\bar X \xrightarrow{p} m$. Clearly, $\bar X - m \xrightarrow{p} 0$. In many cases, it can be proven that multiplying that quantity by $\sqrt{n}$ gives you something that doesn't collapse to 0 but does not diverge to infinity either. The Central Limit Theorems analyse the conditions under which
$$\sqrt{n}\,(\bar X - m) \xrightarrow{d} N(0, v), \qquad (2.12)$$
19 French original: Tout le monde y croit cependant, me disait un jour M. Lippmann, car les ex-

périmentateurs s’imaginent que c’est un théorème de mathématiques, et les mathématiciens que


c’est un fait expérimental.
20 If I had wanted to interrupt the flow of the argument for the sake of accuracy, I should have

said at this point that in many cases we should take into account the fact that the support of X n
may be discrete, and special care is needed to interpret what happens when F X n (z) “takes a step”.
I thought that would have been rather pedantic, so this remark is confined to a footnote.

so that we can use a Gaussian density to approximate the distribution of the


average. A multivariate version also exists, which is slightly more intricate from
a technical point of view, but the intuition carries over straightforwardly.21
The approximation provided by the CLT can also be stated by using the symbol $\overset{a}{\sim}$, which means “approximately distributed as” (where the approximation gets better and better as n grows):
$$\sqrt{n}\,(\mathbf{w} - \mathbf{m}) \xrightarrow{d} N(0, \Sigma) \implies \mathbf{w} \overset{a}{\sim} N\!\left(\mathbf{m}, \frac{1}{n}\Sigma\right)$$

If a certain quantity w converges in distribution to a Normal rv with covari-


ance Σ, then we call Σ the asymptotic variance of w, by which we mean that the
distribution of w resembles more and more one of a normal rv whose variance
is Σ so we can take Σ as an approximation of the variance of w.22 This is usually
notated as AV [w] = Σ.
In the same way as the LLN, there are many versions of the CLT, designed
to cover different cases. A simple version, close in spirit to Khinchin’s LLN,
was provided by Lindeberg and Lévy: if x 1 , x 2 , . . . , x n are iid, E [x i ] = m, and
V [x i ] = v, then equation (2.12) holds. In practice, the conditions are the same as
in Khinchin’s LLN, with the additional requirement that the variance of x i must
exist.

Example 2.6
Let’s go back to example 2.4 (the coin-tossing experiment). Here not only the
mean exists, but also the variance:

V [x i ] = E x i2 − E [x i ]2 = π − π2 = π(1 − π)
£ ¤

Therefore, the Lindeberg-Lévy version of the CLT is readily applicable, and we


have
$$\sqrt{n}\,(\hat p - \pi) \xrightarrow{d} N\big(0, \pi(1-\pi)\big),$$
so the corresponding asymptotic approximation is
$$\hat p \overset{a}{\sim} N\!\left(\pi, \frac{\pi(1-\pi)}{n}\right).$$

In practice, if you toss a fair coin (π = 0.5) n = 100 times, the distribution
of the relative frequency you get is very well approximated by a Gaussian ran-
dom variable with mean 0.5 and variance 0.0025. Just so you appreciate how
well the approximation works, consider that the event 0.35 < p̂ ≤ 0.45 has a true
probability of 18.234%, while the approximation the CLT gives you is 18.219%. If
21 At this point, the inquisitive reader may ask: why the square root of n? Why not n itself, or

the cube root, or some other function of n? Section 2.A.3 offers an intuitive explanation of why it
should be so.
22 Note that, from a technical point of view, w may not have a variance for any n, although its

limit distribution does. But let’s not be pedantic.



you’re interested in reproducing these numbers, Section 2.A.5 contains a small


gretl script with all the necessary steps. Of course, you’re strongly encouraged to
translate it to any other software you like better.
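The gretl script itself is not reproduced here, but the same numbers can be obtained with a few lines of Python (a sketch of mine, using scipy; note that reproducing the 18.219% figure appears to require the usual half-unit continuity correction, which is an assumption on my part about how the quoted value was computed):

from scipy.stats import binom, norm

# Exact probability of 0.35 < p_hat <= 0.45 for n = 100, pi = 0.5,
# against the CLT approximation (with a half-unit continuity correction).
n, pi = 100, 0.5

exact = binom.cdf(45, n, pi) - binom.cdf(35, n, pi)      # P(36 <= heads <= 45)
sd = (pi * (1 - pi) / n) ** 0.5                          # 0.05
approx = norm.cdf((45.5 / n - pi) / sd) - norm.cdf((35.5 / n - pi) / sd)

print(f"exact: {exact:.5f}   CLT approximation: {approx:.5f}")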

The CLT, by itself, describes the convergence in distribution of averages. How-


ever, we need to see what happens to our estimators, that are usually functions
of those averages. There are two tools that come in especially handy. The first
one is sometimes called Cramér's theorem: if $X_n \xrightarrow{p} a$ (where a is a constant) and $Y_n \xrightarrow{d} Y$, then
$$X_n \cdot Y_n \xrightarrow{d} a \cdot Y. \qquad (2.13)$$

The second result we will often use is the delta method: if your estimator θ̂ is
defined as a differentiable transformation of a quantity which obeys a LLN and
a CLT, there is a relatively simple rule to obtain the limit in distribution of θ̂;
$$\begin{Bmatrix} \bar X \xrightarrow{p} m \\[2pt] \sqrt{n}\,(\bar X - m) \xrightarrow{d} N(0, \Sigma) \end{Bmatrix} \implies \begin{Bmatrix} \hat\theta = g(\bar X) \xrightarrow{p} \theta = g(m) \\[2pt] \sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N\big(0, J\Sigma J'\big) \end{Bmatrix} \qquad (2.14)$$
where n is the sample size and J is the Jacobian $\left.\dfrac{\partial g(x)}{\partial x}\right|_{x=m}$.

Example 2.7
Given a sample of iid random variables $x_i$ for which $E[x_i] = 1/a$ and $V[x_i] = 1/a^2$, it is straightforward to construct a consistent estimator of the parameter a as
$$\hat a = \frac{1}{\bar X} \xrightarrow{p} \frac{1}{1/a} = a.$$
Its asymptotic distribution is easy to find: start from the CLT:
$$\sqrt{n}\left(\bar X - 1/a\right) \xrightarrow{d} N\left(0, 1/a^2\right).$$
All we need is the Jacobian term, which is
$$J = \text{plim}\, \frac{d\hat a}{d\bar X} = -\text{plim}\, \frac{1}{\bar X^2} = -\frac{1}{1/a^2} = -a^2;$$
therefore, the asymptotic variance of $\hat a$ is given by
$$AV[\hat a] = (-a^2)\,\frac{1}{a^2}\,(-a^2) = a^2,$$
and therefore
$$\sqrt{n}\,(\hat a - a) \xrightarrow{d} N\left(0, a^2\right)$$
so we can use the approximation $\hat a \overset{a}{\sim} N\!\left(a, \frac{a^2}{n}\right)$.
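A Monte Carlo check of this result is easy to set up; in the sketch below (mine, in Python; the parameter values and seed are arbitrary), the data are generated from an exponential distribution with rate a, which indeed has mean 1/a and variance 1/a²:

import numpy as np

# Delta-method check for Example 2.7: the dispersion of a_hat = 1/mean(x)
# across replications should be close to sqrt(a^2 / n).
rng = np.random.default_rng(5)
a, n, n_rep = 2.0, 500, 10_000

x = rng.exponential(scale=1.0 / a, size=(n_rep, n))
a_hat = 1.0 / x.mean(axis=1)            # one estimate per replication

print(f"simulated sd: {a_hat.std():.4f}   delta-method sd: {(a**2 / n)**0.5:.4f}")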

By using these tools, we construct estimators satisfying not only the consis-
tency property, but also asymptotic normality. These estimators are sometimes
termed CAN estimators (Consistent and Asymptotically Normal). Asymptotic
normality is important for three reasons:

1. it can be used to compare two consistent estimators in terms of their rel-


ative asymptotic efficiency. Given two consistent estimators a and b for
the same parameter m, we’ll say that a is asymptotically more efficient
than b if AV [b] − AV [a] is positive semi-definite (see page 38). This cri-
terion is easy to understand when a and b are scalars: AV [b] ≥ AV [a].
The vector case is more subtle: if our object of interest is estimating some
scalar function of m (say, g (m)), then the two natural competing estima-
tors would be g (a) and g (b), respectively, that are both consistent by Slut-
sky’s theorem. However, by applying the delta method, it can be proven
that AV[g(b)] ≥ AV[g(a)] in all cases.

2. It provides a fairly general way to construct statistics for testing hypothe-


ses, which is probably the most useful thing an applied scientist might
want to do with data. The next section is just about this.

3. asymptotic normality makes it quite easy to construct confidence inter-


vals: in order to illustrate the concept, suppose we have a scalar estimator
θ̂, whose asymptotic distribution is
$$\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N(0, \omega);$$

this means that, for a decently large value of n, we can approximate the distribution of θ̂ as $\hat\theta \overset{a}{\sim} N\!\left(\theta, \frac{\omega}{n}\right)$. This, in turn, implies that
$$P\left[\,\left|\frac{\hat\theta - \theta}{\sqrt{\omega/n}}\right| < 1.96\right] \simeq 95\%;$$

therefore, the chances that the interval


$$\hat\theta \pm 1.96 \times \sqrt{\omega/n}$$

contains the true value of θ are roughly 95%. This is what is called a 95%
confidence interval. Of course, a 99% confidence interval would be some-
what larger. Generalising this to a vector of parameters would lead us to
speaking of confidence sets; for example, when θ is a 2-parameter vector,
the confidence set would be an ellipse.

Example 2.8
Consider the same setup as example 2.6, that is tossing a coin n = 100 times,
and suppose we get “heads” 45 times. Therefore, our estimate of π would be
p̂ = 45/100 = 0.45.

Of course, this would imply that the asymptotic variance of our estimator
can be itself estimated as
$$\hat v = \frac{\hat p(1 - \hat p)}{n} = \frac{0.45 \cdot 0.55}{100} = 0.002475$$
so, since its square root is $\sqrt{\hat v} = 0.04975$, we may say that the interval

A = [0.45 − 0.04975 × 1.96, 0.45 + 0.04975 × 1.96] = 0.45 ± 0.0975 ≃ [0.35, 0.55]

has a very good chance (95%) of containing the true value of π.
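The computation is trivial, but just to fix ideas, here is a minimal Python sketch (mine) reproducing the interval:

from scipy.stats import norm

# The 95% confidence interval of Example 2.8.
n, p_hat = 100, 0.45

se = (p_hat * (1 - p_hat) / n) ** 0.5        # estimated standard error
z = norm.ppf(0.975)                          # 1.959964..., rounded to 1.96 in the text
print(p_hat - z * se, p_hat + z * se)        # roughly [0.35, 0.55]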


This example should help the reader steer clear of a common misconcep-
tion: it is often said “the parameter has a 95% chance of being between a and
b” as if the parameter was random and the interval was fixed. It’s the other way
around. The value of the parameter is non-random (and unknown), whereas
the bounds of the interval are random, since they are a function of the estima-
tor. Therefore, a better choice of words would be “the interval between a and b
has a 95% chance of containing the parameter”.

2.4 Hypothesis Testing


The starting point is a tentative conjecture (called the null hypothesis, or H0 )
that we make about the parameters of a DGP.23 As I said at the beginning of
Section 2.3, we take it for granted that the DGP parameters belong to a certain
set S (the parameter space), but we may conjecture that in fact there is a certain
subset of S (say, H ), that also contains θ. In formulae:

H0 : θ ∈ H ⊂ S.

We would like to check whether our belief is consistent with the observed data.
If what we see is at odds with our hypothesis, then reason dictates we should
drop it in favour of something else.
The coin toss is a classic example. The one parameter in our DGP is π, that
is a probability, so the parameter space is S = [0, 1]. However, there is a subset
of S that is of special significance, namely the point H = {0.5}, because if π ∈ H
(which trivially means π = 0.5), then the coin is fair.
We presume that the coin is fair, but it’d be nice if we could check. What we
can do is flip it a number of times, and then decide, on the basis of the results,
if our original conjecture is still tenable. After flipping the coin n times, we ob-
tain a vector x of zeros and ones. What we want is a function T (x) (called a test
statistic) such that we can decide whether to reject H0 or not.
23 Disclaimer: this section is horribly simplistic. Any decent statistics textbook is far better than

this. My aim here is just to lay down a few concepts that I will use in subsequent chapters, with no
claim to rigour or completeness. My advice to the interested reader is to get hold of Casella and
Berger (2002) or Gourieroux and Monfort (1995, volume 2). Personally, I adore Spanos’ historical
approach to the matter.

Figure 2.1: Types of Error

Note: in this case, H0 is that the person is not pregnant.

By fixing beforehand a subset R of the the support of T (x) (called the “rejec-
tion region”), we can follow a simple rule: we reject H0 if and only if T (x) ∈ R.
Since T (x) is a random variable, the probability of rejecting H0 will be between 0
and 1 regardless of the actual truth of H0 .24 Therefore, there is a possibility that
we’ll end up rejecting H0 while it’s in fact true, but the opposite case, when we
don’t reject while in fact we should, is also possible. These two errors are known
as type I and type II errors, respectively, and the following 4-way table appears
in all statistics textbooks, but memorising the difference could be easier if you
look at figure 2.1:

             Not Reject       Reject
H0 true      OK               Type I error
H0 false     Type II error    OK

This situation is not unlike the job of a judge in a criminal case. The judge
starts from the premise that the defendant is not guilty, and then evidence is ex-
amined. By the “presumption of innocence” principle, the judge declares the
defendant guilty only if the available evidence is overwhelmingly against H0 .
Thus, type I error happens when an innocent person goes to jail; type II error
is when a criminal gets acquitted.
This line of thought is very much in line with the idea philosophers call fal-
sificationism, whose most notable exponent was the Austrian-British philoso-
pher Karl Popper (see eg Popper (1968)). According to the falsificationist point
of view, scientific progress happens via a series of rejections of previously made
conjectures.
24 Unless, of course, we do something silly such as deciding to always reject, or never. But in that

case, what’s the point of performing the experiment at all?



The hallmark of science (as opposed to other kinds of


perfectly respectable mental endeavours, such as literary
criticism, musical composition or political analysis) is that
a conjecture can be falsified: we don’t use evidence to verify
a theory, but we can use it to prove that it’s wrong.
Scientific theories always sound like: “when A, then
B ”. Clearly, you can never verify a theory, because you can
never rule out the possibility of observing A and not B ;
at most, you can say “the theory holds so far”. But once
you’ve observed one single case that disproves it, you need
no more evidence to decide on the theory. When a theory is proved to be in-
consistent with the available evidence, we move on to something better, and
progress is made; but as long as a conjecture is not rejected, we adopt it as a tentative
(and possibly wrong) explanation.

I am fully aware that the debate on philosophy of science has long established that falsificationism is untenable, as a description of scientific progress, on several accounts. What I'm saying is just that the statistical theory of hypothesis testing borrows quite a few ideas from the falsificationist approach. For a fuller account, check out Andersen and Hepburn (2016).

Therefore, strictly speaking, “not rejecting” doesn’t mean “accepting”. Rejec-


tion is always final; failure to reject is always provisional. That said, it is quite
common (although incorrect) to use the word “accept” instead of “not reject”,
and I will do the same here. However, the reader should bear in mind that
“accepting” really means “accepting for now”.25
The recipe we are going to use for constructing test statistics is simple: first,
we will formulate our hypothesis of interest as H0 : g (θ) = 0, where θ are the DGP
parameters and g (·) is a differentiable function. Then, given a CAN estimator θ̂,
we evaluate the function at that point. Given consistency, we would expect $g(\hat\theta)$ to be “small” if H0 is true, since under H0, $g(\hat\theta) \xrightarrow{p} 0$.
In order to build a rejection region, we need some criterion for deciding
when g (θ̂) is large enough to force us to abandon H0 ; we do so by exploiting
asymptotic normality. By using the delta method, we can find an asymptotic
approximation to the distribution of g (θ̂) as

a
g (θ̂) ∼ N g (θ), Σ ;
¡ ¢

under consistency, Σ should tend to a zero matrix; as for g (θ), that should be
0 if and only if H0 is true. These two statements imply that the quadratic form
g (θ̂)′ Σ−1 g (θ̂) should behave very differently in the two cases: if H0 is false, it

25 Some people use the word “retain” instead of “accept”, which is certainly more correct, but

unfortunately not very common.



should diverge to infinity, since plim[g(θ̂)] ≠ 0; if H0 is true, instead, approximate normality implies that26
$$W = g(\hat\theta)'\, \Sigma^{-1}\, g(\hat\theta) \overset{a}{\sim} \chi^2_p$$

where p = rk (Σ). Hence, under H0 , the W statistic should take values typical of
a χ2 random variable.27 Therefore, we should expect to see “small” values of W
when H0 is true and large values when it’s false. The natural course of action is,
therefore, to set the rejection region as R = (c, ∞), where c, the critical value,
is some number to be determined. Granted, there is always the possibility that
W > c even if H0 is true. In that case, our decision to reject would imply a type
I error. But since we can calculate the distribution function for W , we can set
c to a prudentially large value. What is normally done is to set c such that the
probability of a type I error (called the size of the test, and usually denoted by
the Greek letter α) is a small number, typically 5%.
What people do in most cases is deciding which α they want to use and then
set c accordingly, so that in many cases you see c expressed as a function of α
(and written c α ), rather than the other way around.
But, I hear you say, what about type II error? Well, if W in fact diverges when
H0 is false, the probability of rejection (also known as the power of the test)
should approach 1, and we should be OK, at least when our dataset is reasonably
large.28 There are many interesting things that could and should be said about
the power of tests, especially a truly marvellous result known as the Neyman-
Pearson lemma, but I’m afraid this is not the place for this. See the literature
cited at footnote 23.

Example 2.9
Let’s continue example 2.6 here. So we have that the relative frequency is a CAN
estimator for the probability of a coin showing “heads”.
$$\sqrt{n}\,(\hat p - \pi) \xrightarrow{d} N\big(0, \pi(1-\pi)\big).$$

Let’s use this result for building a test for the “fair coin” hypothesis, H0 : π = 0.5.
We need a differentiable function g (x) such that g (x) = 0 if and only if x = 0.5.
One possible choice is
$$g(\pi) = 2\pi - 1$$
What we have to find is the asymptotic variance of $g(\hat p)$, which is $AV[g(\hat p)] = J \cdot \pi(1-\pi) \cdot J' = \omega$, where $J = \text{plim}\left(\frac{\partial g(x)}{\partial x}\right) = 2$, so
$$\sqrt{n}\,\big(g(\hat p) - g(\pi)\big) \xrightarrow{d} N(0, \omega).$$
26 To see why, see section 2.A.4
27 The reason why I’m using the letter W to indicate the test is that, in a less cursory treatment

of the matter, the test statistic constructed in this way could be classified as a “Wald-type” test.
28 When the power of a test goes to 1 asymptotically, the test is said to be consistent. I know, it’s

confusing.

Under the null, g (π) = 0 and ω = 1; therefore, the approximate distribution for
g (p̂) is
$$g(\hat p) \overset{a}{\sim} N\big(0, n^{-1}\big)$$

and our test statistic is easy to build as


$$W = g(\hat p) \left[\frac{1}{n}\right]^{-1} g(\hat p) = n \cdot (2\hat p - 1)^2$$
A simple numerical example: suppose n = 100 and p̂ = 46%. The value of W equals W = 100 · 0.08² = 0.64. Is the number 0.64 incompatible with the pre-
sumption that the coin is fair? Not at all: if the coin is in fact fair W should come
from a random variable that in 95% of the cases takes values below 3.84. There-
fore, there is no reason to change our mind about H0 . The reader may want to
check what happens if the sample size of our experiment is set to n = 1000.
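Here is a minimal Python sketch (mine) that computes W, the 5% critical value and the corresponding p-value (a concept discussed in the next subsection), both for n = 100 and for the n = 1000 case suggested above:

from scipy.stats import chi2

# Wald statistic of Example 2.9, for n = 100 and n = 1000 with p_hat = 46%.
for n, p_hat in [(100, 0.46), (1000, 0.46)]:
    W = n * (2 * p_hat - 1) ** 2
    print(f"n = {n:>4}:  W = {W:.2f}   5% critical value = {chi2.ppf(0.95, 1):.2f}"
          f"   p-value = {chi2.sf(W, 1):.4f}")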

2.4.1 The p-value


The way to make decisions on H0 that I illustrated in the previous section is per-
fectly legitimate, but what people do in practice is slightly different, and involves
a quantity known as the p-value.
The traditional method of checking the null hypothesis is based on the con-
struction of a test statistic with a known distribution under H0 ; once the size of
the test (most often, 5%) is decided, the corresponding critical value c is found
and then the realised statistic W is compared to c. If W > c, reject; otherwise,
don’t. This method makes perfect sense if finding the critical value for a given
size α is complicated: you compile a table of critical values for a given α once
and for all, and then every time you perform a test you just check your realised
value of W against the number in the table. All 20th century statistics textbooks
contained an appendix with tables of critical values for a variety of distributions.
With the advent of cheap computing, it has become easy to compute c as
a function of α, as well as performing the inverse calculation, since algorithms
for computing the cdf of a χ2 random variable are fast, precise and efficient.
Thus, an equivalent route is often followed: after computing W , you calculate
the probability that a χ2 variable should take values greater than W .
That number is called the p-value for the test statistic
W . Clearly, if W > c, the p-value must be smaller than α, so
the decision rule can be more readily stated as “reject H0 if
the p-value is smaller than α”.
In fact, according to its inventor, Sir Ronald Aylmer
Fisher (arguably, the greatest statistician of all time), the p-
value can be seen as a continuous (or monotonous) sum-
mary statistic of how well the data are compatible with the
hypothesis;29 in Fisher's own words, when we see a small
29 Thanks to Sven Schreiber for putting it so clearly and concisely.

p-value, “[e]ither an exceptionally rare chance has occurred or the theory [...] is
not true”.30
Figure 2.2 shows an example in which W = 9, and is compared against a χ²₃ distribution. The corresponding 95th percentile is 7.815, so with α = 0.05 the
null should be rejected. Alternatively, we could compute the area to the right of
the number 9 (shaded in the Figure), which is 2.92%; obviously, 2.92% < 5%, so
we reject.

Figure 2.2: p-value example (the realised W, the 5% critical value, and the shaded p-value area)
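The numbers behind Figure 2.2 are easy to reproduce; a minimal sketch (mine, using scipy):

from scipy.stats import chi2

# Critical value and p-value for W = 9 against a chi-square with 3 degrees of freedom.
W, df, alpha = 9.0, 3, 0.05

print(f"5% critical value: {chi2.ppf(1 - alpha, df):.3f}")   # 7.815
print(f"p-value:           {chi2.sf(W, df):.4f}")            # roughly 2.9%, the shaded area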

To make results even quicker to read, most statistical packages adopt a graph-
ical convention, based on decorating the test statistic with a variable number of ’*’
characters, usually called “stars”. Their meaning is as follows:

Stars Meaning
(none) p-value greater than 10%
* p-value between 5% and 10%
** p-value between 1% and 5%
*** p-value under 1%
Therefore, when you see 3 stars you “strongly” reject the null, but you don’t
reject H0 when no stars are printed. One star means “use your common sense”.
In fact, I’d like add a few words on “using your common sense”: relying on
the p-value for making decisions is OK; after all, that’s what it was invented for.
However, you should avoid blindly following the rule “above 5% → yes, below
5% → no”. You should always be aware of the many limitations of this kind of
approach: for example,

• even if all the statistical assumptions of your model are met, the χ2 distri-
bution is just an approximation to the actual density of the test. Therefore,
30 Fisher RA. Statistical Methods and Scientific Inference. Ed 2, 1959. On this subject, if you’re

into the history of statistics, you might like Biau et al. (2009).

the quantiles of the χ2 density may be (slightly) misleading, especially so


when your sample is not very large;

• even if the test was in fact exactly distributed as a χ2 variable, type I and
type II errors are always possible; actually, if you choose 5% as your signif-
icance level (like everybody does), you will make a mistake in rejecting H0
one time out of twenty;

• and besides, why 5%? Why not 6%? Or 1%? In fact, someone once said:

Q: Why do so many colleges and grad schools teach p = 0.05?


A: Because that’s still what the scientific community and jour-
nal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.

(for more details, see Wasserstein and Lazar (2016)).

That said, I don’t want you to think “OK, the p-value is rubbish”: it isn’t. Ac-
tually, it’s the best tool we have for the purpose. But like any other tool (be it a
screwdriver, a microwave oven or a nuclear reactor), in order to use it effectively,
you must be aware of its shortcomings.

2.5 Identification
A common problem in econometrics is identification of a model. The issue is
quite complex, and I cannot do justice to it in an introductory book such as this,
so I’ll just sketch the main ideas with no pretence to rigour or completeness.
Basically, a model is said to be identified with reference to a question of interest
if the model’s probabilistic structure is informative on that question.
A statistical model is, essentially, a probabilistic description of the data that
we observe. When we perform inference on a dataset we assume that

• our available dataset is a realisation of some probabilistic mechanism (the


DGP — see section 2.1);

• the salient features of the DGP can be described by a parameter vector θ;

• the data we observe are such that asymptotic theory is applicable (for ex-
ample, n is large and the data are iid) and we can define statistics that we
can use as estimators or tests;

• our question of interest can be phrased as a statement on the vector θ.

The importance of the first three items in the list should be clear to the reader
from the past sections of this chapter. In this section, we will discuss the fourth
one.

The vector θ contains parameters that describe the probability distribution


of our data, but in principle the empirical problem we are ultimately interested
in is expressed as a vector of parameters of interest ψ. That is, the parameters
θ characterise the DGP, while ψ is a formalised description of the aspect of re-
ality we are trying to analyse. For example, in a typical econometric model, ψ
may contain quantities such as the elasticity of demand for a certain good to
its own price, the causal effect of a policy on a target variable, the risk aversion
parameter for the representative individual in a macroeconomic model, and so
on.
What is the relationship between ψ and θ? If we take ψ as being a stylised
description of reality, and θ as a stylised description of what we observe, then θ
should be a function of ψ, that we assume known:

θ = M (ψ).

In some cases, the relationship is trivial; often, M (·) is just the identity function,
θ = ψ, but sometimes this is not the case.
Statistics gives us the tools to estimate θ; is this enough to estimate ψ? It
depends on the function M (·); if the function is invertible, and we have a CAN
estimator θ̂, a possible estimator for ψ is

ψ̂ = M −1 (θ̂).

If M (·) is continuous and differentiable, then its inverse will share these prop-
erties, so we can use Slutsky’s theorem and the delta method and ψ̂ is a CAN
estimator too. In this case, we say that the model is identified.
In some cases, however, the function M (·) is not invertible, typically when
different values of ψ give rise to the same θ.31 In other terms, two alternative
descriptions of the world give rise to the same observable consequences: if

M (ψ1 ) = M (ψ2 )

for ψ1 ̸= ψ2 , we would observe data from the same DGP (described by θ) in both
cases; this situation is known as observational equivalence, and ψ1 and ψ2 are
said to be observationally equivalent. In these cases, being able to estimate θ,
even in an arbitrarily precise way, doesn’t tell us if the “true” description of the
world is ψ1 or ψ2 . This unfortunate case is known in econometrics as under-
identification.

Example 2.10
Suppose you have an urn full of balls, some white and some red. Call w the
number of white balls and r the number of red balls. We want to estimate both
w and r .
31 The technical way to say this would be “the M (·) function is not injective”.

Suppose also that the only experiment we can perform works as follows: we
can extract one ball from the urn as many times as we want, but we must put it
back after extraction (statisticians call this “sampling with replacement”). De-
fine the random variable x i as 1 if the ball is red. Clearly
$$x_i = \begin{cases} 1 & \text{with probability } \pi = \frac{r}{w+r} \\[4pt] 0 & \text{with probability } (1-\pi) = \frac{w}{w+r} \end{cases}$$
In this case, the probability distribution of our data is completely characterised
by the parameter π; as we know, we have a perfectly good way to estimate π;
since the data are iid, X̄ is a CAN estimator of π and testing hypotheses on π is
easy.
If, however, the parameters of interest are ψ = [r, w], there is no way to es-
timate them separately, because the function θ = M (ψ) is not invertible, for the
very simple reason that the relationship between the DGP parameter π and our
parameters of interest w and r
$$\pi = \frac{r}{w+r}$$
is one equation in two unknowns. Therefore, in the absence of extra information
we are able to estimate π (the proportion of red balls) as precisely as wanted, but
there is no way to estimate r (the number of red balls).
Even if we knew the true value of π, there would still be an infinite array of
observationally equivalent descriptions of the urn. If, say, π = 0.3 the alterna-
tives ψ1 = [3, 10], ψ2 = [15, 50], ψ3 = [3000, 10000], etc would all be observation-
ally equivalent.

Identification of a model can be a very serious concern in some settings:


if a model is under-identified, we may be able to estimate consistently the pa-
rameters that describe the data, but this wouldn’t be helpful for the economic
question we are ultimately after. In this book, we will not encounter any of these
cases, except for the models I will describe in chapter 6, but you should be aware
of the potential importance of the problem.

2.A Assorted results


2.A.1 Jensen’s lemma
As we argued in Section 2.2.3, if g(·) is not a linear function, generally E[g(X)] ≠ g[E[X]]. However, when g(·) is concave, we have a usable result that comes as an inequality:
$$E[g(X)] \le g(E[X])$$
For example, if E[X] = 1, we can be sure that E[log(X)] is negative, just because the logarithm is a concave function (provided, of course, that E[log(X)] exists), since E[log(X)] ≤ log(1) = 0.

This remarkable result is easy to prove if g (·) is also differentiable, since in


this case g (·) is said to be concave between a and b if

g (x) ≤ g (x ∗ ) + g ′ (x ∗ )(x − x ∗ ) (2.15)

for each x ∗ ∈ (a, b). Now assume that the interval (a, b) is the support of the rv
X , which possesses an expectation. Clearly, a < µ = E [X ] < b; this implies that
equation (2.15) holds when x ∗ = µ, and therefore

$$E[g(X)] \le E\big[g(\mu) + g'(\mu)(X - \mu)\big] = g(\mu) + g'(\mu) \cdot E[X - \mu]$$
Since obviously $E[X - \mu] = 0$, it follows that
$$E[g(X)] \le g(\mu) = g[E[X]],$$

as required. Note that

• by linearity, E[−X] = −E[X], so if the function is convex instead of concave, you can flip the inequality, because the negative of a concave function is convex: E[g(X)] ≥ g(E[X]);

• it is possible to prove Jensen’s lemma in the more general case when g (·) is
not everywhere differentiable in (a, b), but that’s a bit more intricate (see
for example Williams (1991), page 61).
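To see the inequality at work numerically, here is a small sketch (mine, in Python; the lognormal choice is arbitrary, rescaled so that the sample mean of X is 1):

import numpy as np

# Jensen's lemma for the (concave) logarithm: E[log X] <= log(E[X]).
rng = np.random.default_rng(11)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)
x = x / x.mean()                       # force the sample mean of X to be 1

print(np.log(x).mean())                # E[log X]: clearly negative
print(np.log(x.mean()))                # log(E[X]) = log(1) = 0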

2.A.2 More on consistency


We stated in section 2.3.1 that if x₁, x₂, . . . , xₙ are independent and identically distributed (iid for short) and E[xᵢ] = m, then $\bar X \xrightarrow{p} m$. However, this is only a
sufficient condition, and is by no means necessary.
In this subsection, I will provide an example of an alternative scenario under
which $\bar X \xrightarrow{p} m$ even though the xᵢ variables may not be iid: let's just say that
E [x i ] = m, although nothing is said about heterogeneity or independence. In
this example, however, it is crucial that they all possess a variance v i = V [x i ].
Let’s begin by a technical result, called Markov’s inequality. It states that if
W is a random variable with positive support and expectation E [W ] = m, then
$$P[W \ge a] \le \frac{m}{a}. \qquad (2.16)$$
for any a. The proof is surprisingly easy:
$$m = \int_0^\infty w f(w)\,dw = \int_0^a w f(w)\,dw + \int_a^\infty w f(w)\,dw \ge \int_a^\infty w f(w)\,dw \ge$$
$$\ge \int_a^\infty a f(w)\,dw = a \int_a^\infty f(w)\,dw = a \cdot P[W \ge a]$$

So for example if you knew that the expectation of a non-negative random vari-
able X was 4, you could safely say that P [X > 8] ≤ 1/2 without knowing anything
on the distribution of X . Nice.

Now let’s have a look at the moments of X̄ ; its first moment is trivial to find,
since
$$E[\bar X] = E\left[\frac{1}{n}\sum x_i\right] = \frac{1}{n}\sum E[x_i] = \frac{nm}{n} = m;$$
as for its variance, just assume that it exists (which in turn requires existence of
all the variances v i ) and that
$$\lim_{n\to\infty} V[\bar X] = 0;$$

that is, its variance shrinks as the dataset grows larger (this may be tricky: see
also Section 2.A.3).
Now define $W = (\bar X - m)^2$, so E[W] is the variance of X̄. But the most important thing is that W cannot be negative (it's a square), so we can use Markov's inequality (2.16) directly and say32
$$P[W \ge \varepsilon^2] \le \frac{V[\bar X]}{\varepsilon^2};$$
of course, the left-hand side of this inequality can be rewritten as P [| X̄ −m| ≥ ε],
which is in turn equal to 1 − P [| X̄ − m| < ε]. Therefore,
$$P[|\bar X - m| < \varepsilon] \ge 1 - \frac{V[\bar X]}{\varepsilon^2}.$$
If we assume that $V[\bar X] \to 0$, then we have the desired result:
$$\lim_{n\to\infty} P[|\bar X - m| < \varepsilon] = 1 \iff \bar X \xrightarrow{p} m.$$

Note that this case is nearly useless in practice, because being able to com-
pute the moments of our quantities of interest is extremely rare, but still gives
you a nice idea of the kind of conditions that can be used to prove consistency.

2.A.3 Why √n?
Here I’ll give you an intuitive account of the reason why, in the standard cases,
the Central Limit Theorem works by using √n as the normalising transforma-
tion instead of some other power of n. Suppose we have a vector x of size n
containing our observations, that are not necessarily independent nor identi-
cal. However, we do require that they possess second moments and use Σ to
indicate the covariance matrix of x:
 
$$V[\mathbf{x}] = \Sigma = \begin{bmatrix} V[x_1] & \text{Cov}[x_1, x_2] & \ldots & \text{Cov}[x_1, x_n] \\ \text{Cov}[x_1, x_2] & V[x_2] & \ldots & \text{Cov}[x_2, x_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}[x_1, x_n] & \text{Cov}[x_2, x_n] & \ldots & V[x_n] \end{bmatrix}$$
32 This special case of Markov’s inequality is sometimes called Chebyshev’s inequality.

Suppose also that X̄ has a probability limit that we call m: $\bar X \xrightarrow{p} m$. Now note that the average X̄ can be written as $\bar X = \frac{1}{n}\iota'\mathbf{x}$, and therefore its variance can be easily calculated by the rule (2.7). Therefore,
$$V[\bar X] = \frac{1}{n^2} \cdot \iota'\Sigma\iota$$

What can we say about ι′ Σι? First, given the properties of ι, this is simply
the sum of all the elements of Σ; second, since Σ is positive semi-definite by
construction, this cannot be a negative number, but it may be a large positive
one. Especially so, considering that the size of Σ grows with n.
We must now examine what happens to ι′ Σι as n → ∞. When the x i rvs
are iid, this is easy, since in this special case Σ is just a multiple of the identity
matrix; hence, in the iid case, Σ = v ·I and ι′ Σι = n · v. In a more general case, the
non-diagonal elements may be non-zero (which could happen for dependent
observations), or the elements on the diagonal may be heterogeneous (which
could happen in the non-identical case). However, it may still be that, despite
these complications, ι′ Σι behaves asymptotically as a linear function of n. To be
more precise, it may happen that

$$\lim_{n\to\infty} \frac{\iota'\Sigma\iota}{n} = K,$$

where K is some constant. For example, in the iid case, K would just be equal to
v. In all these cases, you have that
$$\lim_{n\to\infty} n \cdot V[\bar X] = K$$

and therefore, for large n,


$$V[\bar X] \simeq \frac{K}{n}$$
The most immediate consequence of the equation above is that $V[\bar X]$ tends to 0
for large n, which is what you would expect from a consistent estimator. More-
over,
$$V\left[n^\alpha\left(\bar X - m\right)\right] \simeq K n^{2\alpha - 1}.$$
Therefore, the only way to multiply (X̄ − m) by a power of n so that the variance of the result is a constant is to choose α = 1/2, which of course gives you √n.
When observations are not iid, there may be cases when ι′ Σι grows at a rate
that is different from n. In these cases, the normalising factor needed to achieve
convergence in distribution is actually different from the square root. This typ-
ically happens when the x i rvs come from a time-series sample, and the degree
of dependence between nearby observations can be substantial. The beginning
of chapter 5 contains a brief discussion of “persistence” in time series.

2.A.4 The normal and χ2 distributions


A continuous random variable X is a standard normal random variable when its support is $\mathbb{R}$, and its density function is
\[
\varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{x^2}{2}\right\}
\]
as depicted in figure 2.3.

Figure 2.3: Standard normal density function (the density $\varphi(x)$ plotted for x between −3 and 3)

As is well known, ϕ(x) has no closed-form indefinite integral: that is, it can be proven that the function Φ(x), whose derivative is ϕ(x), does exist, but cannot be written as a combination of "simple" functions (the proof is very technical). Nevertheless, it's quite easy to approximate numerically, so every statistical program (heck, even spreadsheets) will give you excellent approximations via clever numerical methods. If you're into this kind of stuff, Marsaglia (2004) is highly recommended.

As the reader certainly knows, this object was discovered33 by C. F. Gauss (the guy on page 13), so it's also known as a Gaussian random variable. By playing with integrals a little34, it can be proven that E[X] = 0 and V[X] = 1. One of the many nice properties of Gaussian rvs is that an affine transformation of a normal rv is also normal. Therefore, by the rules for expected values (see section 2.2.3), if X is a standard normal rv, then Y = m + s · X is a normal rv with mean m and variance s². Its density function is
\[
f(y) = \frac{1}{\sqrt{2\pi s^2}} \exp\left\{-\frac{(y-m)^2}{2s^2}\right\}
\]
33 Or invented? Interesting point.
34 If you want to have some fun with the moments of the standard normal distribution, you'll find the result $\frac{d\varphi(x)}{dx} = -x\varphi(x)$ very useful, because it implies that $\int x\varphi(x)\,dx = -\varphi(x)$.

A compact way to say this is $Y \sim N(m, s^2)$.

In fact, one can define a multivariate normal random variable as a random vector x with density
\[
f(x) = (2\pi)^{-n/2}\, |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - m)'\Sigma^{-1}(x - m)\right\},
\]

or, in short, x ∼ N (m, Σ), where n is the dimension of x, m is its expectation and
Σ its covariance matrix. The multivariate version of this random variable also
enjoys the linearity property, so if x ∼ N (m, Σ), then

\[
y = Ax + b \sim N\left(Am + b,\; A\Sigma A'\right). \qquad (2.17)
\]

It is easy to overlook how amazing this result is: the fact that E [Ax + b] =
AE [x] + b is true for any distribution and does not depend on Gaussianity; and
the same holds for the parallel property of the variance. The special thing about
the Gaussian distribution is that a linear transformation of a Gaussian rv is itself
Gaussian. And this is a very special property, that is only shared by a few distri-
butions (for example: if you take a linear combination of two Bernoulli rvs, the
result is not Bernoulli-distributed).
The Gaussian distribution has a very convenient feature: contrary to what
happens in general, if X and Y have a joint normal distribution (that is, the vec-
tor x = [Y , X ] is a bivariate normal rv), absence of correlation implies indepen-
dence (again, this can be proven quite easily: nice exercise left to the reader).
Together with the linearity property, this also implies another very important
result: if y and x are jointly Gaussian, then the conditional density f (y|x) is Gaus-
sian as well. In formulae:

\[
f(y|x) = (2\pi)^{-n/2}\, |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}(y - m)'\Sigma^{-1}(y - m)\right\},
\]
where
\[
m = E[y|x] = E[y] + B'(x - E[x]) \qquad
B = \Sigma_x^{-1}\Sigma_{x,y} \qquad
\Sigma = V[y|x] = \Sigma_y - \Sigma_{x,y}'\Sigma_x^{-1}\Sigma_{x,y}
\]
where $\Sigma_y$ is the covariance matrix of y, $\Sigma_x$ is the covariance matrix of x and $\Sigma_{x,y}$ is the matrix of covariances between x and y.

Example 2.11
For example, suppose that the joint distribution of y and $x = [x_1, x_2]$ is normal, with
\[
E\begin{bmatrix} y \\ x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
\qquad
V\begin{bmatrix} y \\ x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 3 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}
\]

then you have
\[
E[y] = 1 \qquad E[x] = [2, 3]' \qquad \Sigma_y = 3 \qquad
\Sigma_x = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \qquad \Sigma_{x,y} = [0, 1]'
\]
and therefore
\[
B = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}^{-1}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}
\qquad
\Sigma = 3 - \begin{bmatrix} 0 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}^{-1}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = 3 - 1 = 2,
\]
since $\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}^{-1} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix}$. Thus, the conditional expectation of y given x equals
\[
E[y|x] = 1 + [-1, 1]\begin{bmatrix} x_1 - 2 \\ x_2 - 3 \end{bmatrix} = -x_1 + x_2
\]
and in conclusion
\[
y|x \sim N(x_2 - x_1, 2).
\]
Note that:

• the conditional mean is a linear function of the x; this needn't happen in general: it's a miraculous property of Gaussian random variables;

• the conditional variance is not a function of the x variables (it's a constant); again, this doesn't happen in general, but with Gaussian random variables, it does;

• if you apply the Law of Iterated Expectations (eq. (2.8)) you get
\[
E\left[E[y|x]\right] = -E[x_1] + E[x_2] = -2 + 3 = 1 = E[y];
\]
which is, in fact, unsurprising, but it's nice and reassuring.
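If you want to double-check the matrix algebra with a computer, here is a tiny gretl snippet that reproduces the numbers of Example 2.11 (nothing new here: Sx, Sxy and Sy are just the blocks of the covariance matrix above):

matrix Sx  = {1, 1; 1, 2}      # covariance matrix of x
matrix Sxy = {0; 1}            # covariances between x and y
scalar Sy  = 3                 # variance of y
matrix B = inv(Sx) * Sxy       # should equal [-1; 1]
scalar cvar = Sy - Sxy' * inv(Sx) * Sxy   # conditional variance, should equal 2
print B
printf "conditional variance = %g\n", cvar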

If one instead needs to investigate the distribution of quadratic forms of


Gaussian rvs, then another distribution arises, namely the chi-square distribu-
tion (χ2 in symbols). The general result is that, if x ∼ N (0, Σ), then x′ Σ−1 x ∼ χ2n ,
where n is the number of elements of x, commonly known as the “degrees of
freedom” parameter.
The support of the χ² density is over the non-negative reals; its shape depends on n (the degrees of freedom) in the following way:
\[
f(x) = \frac{0.5^{n/2}}{\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}.
\]

Figure 2.4: Density function of $\chi^2_p$, for p = 1 . . . 4
The most common cases, where n ranges from 1 to 4, are shown in Figure 2.4.35
Like the normal density, there is no way to write down the distribution function
of χ2 random variables, but numerical approximations work very well, so critical
values are easy to compute via appropriate software. The 95% critical values for
the cases n = 1 . . . 4 are

degrees of freedom 1 2 3 4
critical value at 95% 3.84 5.99 7.81 9.49

For example, a χ21 random variable takes values from 0 to 3.84 with probabil-
ity 95%. Memorising them may turn out to be handy from time to time.
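In case you'd rather not memorise them, they are easy to retrieve in gretl via the critical() function, where X is the code for the χ² distribution:

loop p = 1..4
    printf "chi-square(%d): 5%% critical value = %g\n", p, critical(X, p, 0.05)
endloop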

2.A.5 Gretl script to reproduce example 2.6


Input:

set verbose off
clear

# characteristics of the event
scalar p = 0.5
scalar n = 100
scalar lo = 36
scalar hi = 45
35 In case you're wondering what Γ(n/2) is, just google for "gamma function"; it's a wonderful object, you won't be disappointed. Suffice it to say that, if x is a positive integer, then Γ(x) = (x−1)!, but the gamma function is defined for all real numbers, except nonpositive integers.

# true probability via the binomial distribution
matrix bin = pdf(B, p, n, seq(lo, hi)')   # Binomial probabilities
scalar true = sumc(bin)

# approximation via the Central Limit Theorem
scalar m = p*n                  # mean
scalar s = sqrt(p * (1-p) * n)  # standard error
scalar z0 = (lo - 0.5 - m) / s  # subtract 0.5 to compensate for continuity
scalar z1 = (hi + 0.5 - m) / s  # add 0.5 to compensate for continuity

# "cnorm" = Normal distribution function
scalar appr = cnorm(z1) - cnorm(z0)

# printout
printf "probability of \"heads\" = %g\n", p
printf "number of tosses = %g\n", n
printf "probability of heads between %d and %d:\n", lo, hi
printf "true = %g, approximate via CLT = %g\n", true, appr

Output:

probability of "heads" = 0.5


number of tosses = 100
probability of heads between 36 and 45:
true = 0.182342, approximate via CLT = 0.182194
Chapter 3

Using OLS as an inferential tool

3.1 The regression function


In this chapter, we will revisit the OLS statistic and give it an inferential interpre-
tation. As we will see, under many circumstances OLS is a consistent and asymp-
totically normal estimator. The first question that springs to mind is: what is OLS
an estimator of, exactly?
Generally speaking, statistical inference can be a very useful tool when an
observable variable y can be thought of as being influenced by a vector of ob-
servables x, but only via a complicated causal chain, that possibly includes sev-
eral other unobservable factors. Thus, we represent y as a random variable, so
as to acknowledge our uncertainty about it; however, we have reasons to be-
lieve that y may not be independent from x, so further insight can be gained by
studying the conditional distribution f (y|x).

Figure 3.1: Conditional distribution of the stature of children given their parents' (one boxplot of child height, in inches, for each observed parent height)

Figure 3.1 is my rendition of a celebrated dataset, that was studied by Galton


(1886). Galton assembled data on the body height of 928 individuals (y), and


matched them against the average height of their parents (x). Data are in inches. It is natural to think that somebody's stature is the result of a multiplicity of causes, but surely the hereditary component cannot be negligible; hence the interest in f(y|x). For each observed value of x in the sample, Figure 3.1 shows the corresponding boxplot.
Maybe not all readers are familiar with boxplots, so allow me to explain how to read the "candlesticks" in the figure: each vertical object consists of a central "box", from which two "whiskers" depart, upwards and downwards. The central box encloses the middle 50% of the data, i.e. it is bounded by the first and third quartiles. The "whiskers" extend from each end of the box for a range equal at most to 1.5 times the interquartile range. Observations outside that range are considered outliers1 and represented via dots. A line is drawn across the box at the median. Additionally, a black dot indicates the average. [Portrait: Francis Galton]
The most notable feature of Figure 3.1 is that the boxes seem to go up to-
gether with x; that is, the distribution of y shifts towards higher values as x grows.
However, even considering the subset of observations defined as the children
whose parents were of a certain height, some dispersion remains. For example,
if we focus on x = 65.5, you see from the third candlestick from the left that the
minimum height is about 62 and the maximum is about 72, while the mean is
between 66 and 68 (in fact, the precise figure is 67.059).

Historical curiosity: if you use OLS to go through those points such as to minimise the SSR, you will find that the fitted line is $\hat c_i = 23.9 + 0.646\, p_i$, where $c_i$ stands for "child" and $p_i$ for "parent". The fact that the slope of the fitted line is less than 1 prompted Galton to observe that the tendency for taller parents was to have children who were taller than the average, but not as much as themselves (and of course the same, in reverse, happened to shorter parents). Galton described this state of things as "Regression towards Mediocrity", and the term stuck.

This method of inquiry is certainly interesting, but also very demanding: if


x had been a vector of characteristics, rather than a simple scalar, the analysis
would have been too complex to undertake. However, we may focus on a more
limited problem, that is hopefully amenable to a neat and tidy solution.
Instead of studying f (y|x), we could focus on the conditional expectation
£ ¤
E y|x (assuming it exists): this object contains the information on how the cen-
tre of the distribution of y varies across different values of x, and in most cases is
just what we want. If you join the dots in figure 3.1, you get an upward-sloping
line like in Figure 3.2, that suits very well our belief that taller parents should
have, as a rule, taller children.
1 An “outlier” is a data point that is far away from the rest.

Figure 3.2: Regression function of the stature of children given their parents' (average child height, in inches, plotted against parent height)

The first step for making this intuition operational is to define the random variable $\varepsilon \equiv y - E[y|x]$, so that y can be written (by definition) as $E[y|x] + \varepsilon$. For historical reasons, the random variable ε is called the disturbance (see also section 3.A.2). A very important property of the random variable ε is that it's orthogonal to x by construction:2
\[
E[x \cdot \varepsilon] = 0 \qquad (3.1)
\]
The proof is simple: call $E[y|x] = m(x)$. Since $\varepsilon = y - m(x)$, clearly
\[
E[\varepsilon|x] = E[y|x] - E[m(x)|x] = m(x) - m(x) = 0;
\]
therefore, by the law of iterated expectations (see Section 2.2.4),
\[
E[x \cdot \varepsilon] = E\left[x \cdot E[\varepsilon|x]\right] = E[x \cdot 0] = 0.
\]

Finally, assume that m(x) is a simple function, whose shape is governed by a few parameters.3 The choice that is nearly universally made is that of a linear4 function: $E[y|x] = x'\beta$. If we observe multiple realisations of y and x, then we can write
\[
y_i = x_i'\beta + \varepsilon_i \qquad (3.2)
\]
or, in matrix form,
\[
y = X\beta + \varepsilon \qquad (3.3)
\]
Note the difference between equation (3.3) and the parallel OLS decompo-
sition y = Xβ̂ + e, where everything on the right-hand side of the equation is an
2 Warning: as shown in the text, E [ε|x] = 0 =⇒ E [x · ε] = 0, but the converse is not necessarily

true.
3 In fact, there are techniques for estimating the regression function directly, without resorting

to assumptions on its functional form. These techniques are grouped under the term nonpara-
metric regression. Their usage in econometrics is rather limited, however, chiefly because of their
greater computational complexity and of the difficulty of computing marginal effects.
4 As for what we mean exactly by “linear”, see 1.3.2.

observable statistic. Instead, the only observable item in the right-hand side of
(3.3) is X: β is an unobservable vector of parameters and ε is an unobservable
vector of random variables. Still, the similarity is striking; it should be no sur-
prise that, under appropriate conditions, β̂ is a CAN estimator of β , which we
prove in the next section.

3.2 Main statistical properties of OLS


A handy consequence of equation (3.3) is that the OLS statistic can be written as

β̂ = (X′ X)−1 X′ y = β + (X′ X)−1 X′ ε (3.4)

As any estimator, β̂ has a distribution, and its finite-sample properties can be studied, in some cases. For example, its unbiasedness is very easy to prove: if $E[\varepsilon|X] = 0$, then
\[
E[\hat\beta|X] = \beta + E\left[(X'X)^{-1}X'\varepsilon \mid X\right] = \beta + (X'X)^{-1}X'E[\varepsilon|X] = \beta.
\]
And therefore, by the law of iterated expectations,
\[
E[\hat\beta] = E\left[E[\hat\beta|X]\right] = E[\beta] = \beta.
\]

However, nobody cares about unbiasedness nowadays. Moreover, in order


to say something on the distribution of β̂ we’d need assumptions on the distri-
bution of ε, which is something we’d rather avoid doing. Therefore, we’ll use
asymptotic results. Of course, we will assume that the data are such that limit
theorems apply (iid being but an example).

3.2.1 Consistency
In order to prove consistency, start from equation (3.4) and rewrite matrix products as sums:
\[
\hat\beta = \beta + \left[\sum_i x_i x_i'\right]^{-1}\sum_i x_i\varepsilon_i
= \beta + \left[\frac{1}{n}\sum_i x_i x_i'\right]^{-1}\frac{1}{n}\sum_i x_i\varepsilon_i. \qquad (3.5)
\]
Let's analyse the two terms on the right-hand side separately: in order to do so, it will be convenient to define the vector
\[
z_i = x_i\varepsilon_i; \qquad (3.6)
\]
given equation (3.1), $E[z] = 0$, so a straightforward application of the LLN gives
\[
\frac{1}{n}\sum_i x_i\varepsilon_i \xrightarrow{p} 0. \qquad (3.7)
\]

As for the limit of the first term, assume that $n^{-1}X'X$ has one, and call it Q:5
\[
\frac{1}{n}\sum_i x_i x_i' \xrightarrow{p} Q; \qquad (3.8)
\]
if Q is invertible, then we can exploit the fact that inversion is a continuous transformation as follows
\[
\left[\frac{1}{n}\sum_i x_i x_i'\right]^{-1} \xrightarrow{p} Q^{-1},
\]
so, after putting the two pieces together,
\[
\hat\beta \xrightarrow{p} \beta + Q^{-1}\cdot 0 = \beta.
\]
The OLS statistic, therefore, is a consistent estimator of the parameters of the conditional mean, or to be more technical, of the derivative of $E[y|x]$ with respect to x, which is constant by the linearity hypothesis.

One may think that the whole argument would break down if the assumption of linearity were violated. This is not completely true: even in many cases when $E[y|x]$ is non-linear, it may be proven that β̂ is a consistent estimator of the parameters of an object called Optimal Linear Predictor, which includes the linear case as a special case. But this is far too advanced for a book like this.

It's important here to ensure that $\frac{1}{n}\sum_i x_i x_i'$ converges to an invertible matrix; there are two main reasons why this requirement may fail to hold:

1. it may not converge to any limit; this would be the case if, for example, the vector x possessed no second moments;6

2. it may converge to a singular matrix; this, for example, would happen in cases such as $x_t = \phi^t$, where $|\phi| < 1$.

However, in ordinary circumstances, such problems should not arise.
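A quick way to convince yourself of the consistency result is a little simulation; in the gretl sketch below the data-generating process (a constant plus one standard normal regressor, standard normal disturbances, true coefficients 1 and 0.5) is entirely made up for the occasion.

set verbose off
set seed 54321
matrix beta = {1; 0.5}
loop foreach i 50 500 5000 50000
    scalar n = $i
    matrix X = ones(n, 1) ~ mnormal(n, 1)   # constant plus one regressor
    matrix e = mnormal(n, 1)                # disturbances, independent of X
    matrix y = X*beta + e
    matrix bhat = mols(y, X)                # OLS coefficients
    printf "n = %6d: bhat = (%8.5f, %8.5f)\n", n, bhat[1], bhat[2]
endloop

As n grows, the printed estimates should drift ever closer to (1, 0.5).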

3.2.2 Asymptotic normality


From equation (3.5),
\[
\sqrt{n}\left(\hat\beta - \beta\right) = \left[\frac{1}{n}\sum_i x_i x_i'\right]^{-1}\frac{1}{\sqrt{n}}\sum_i x_i\varepsilon_i
\]
5 Ordinarily, Q will be equal to $E(x_i x_i')$. It may be interesting to know that the properties of OLS can be worked out in more exotic cases, where Q is more complicated or may even not exist. These cases, however, are far too complicated to be analysed here.
6 In fact, there are cases when this situation may be handled by using a scaling factor other than $n^{-1}$; but let's ignore such acrobatics.



we already know from the previous subsection that $\left[\frac{1}{n}\sum_i x_i x_i'\right]^{-1} \xrightarrow{p} Q^{-1}$, but what happens to the second term as n grows to infinity? Define $z_i$ as in equation (3.6). Therefore, by the CLT,
\[
\frac{1}{\sqrt{n}}\sum_i x_i\varepsilon_i = \sqrt{n}\,\bar Z \xrightarrow{d} N(0, \Omega),
\]
where $\Omega \equiv V[z_i] = E[z_i z_i']$. Therefore, we can use Cramér's theorem (see (2.13)) as follows: since
\[
\sqrt{n}\left(\hat\beta - \beta\right) = \left[\frac{1}{n}\sum_i x_i x_i'\right]^{-1}\frac{1}{\sqrt{n}}\sum_i x_i\varepsilon_i
\qquad
\left[\frac{1}{n}\sum_i x_i x_i'\right]^{-1} \xrightarrow{p} Q^{-1}
\qquad
\frac{1}{\sqrt{n}}\sum_i x_i\varepsilon_i \xrightarrow{d} N(0, \Omega)
\]
the quantity $\sqrt{n}(\hat\beta - \beta)$ converges to a normal rv multiplied by the nonstochastic matrix $Q^{-1}$; therefore, the linearity property for Gaussian rvs applies (see eq. (2.17)), and as a consequence
\[
\sqrt{n}\left(\hat\beta - \beta\right) \xrightarrow{d} N\left(0, Q^{-1}\Omega Q^{-1}\right). \qquad (3.9)
\]

In order to say something on Ω, we can use the law of iterated expectations (see section 2.2.4) to re-write it as follows:
\[
\Omega = E[z_i z_i'] = E\left[\varepsilon_i^2 \cdot x_i x_i'\right] = E\left[E[\varepsilon_i^2|x_i] \cdot x_i x_i'\right].
\]
The quantity $E[\varepsilon_i^2|x_i]$ is a bit of a problem here. We proved quite easily that $E[\varepsilon|x] = 0$ as a natural consequence of the way ε is defined (see section 3.1, page 79). However, we know nothing about its conditional second moment (the conditional variance of y, if you like). For all we know, it may even not exist; or if it does, it could be an arbitrary (and possibly quite weird) function of x. The only thing we can be sure of is that the function $h(x) = E[\varepsilon^2|x]$ (sometimes called the skedastic function) must be positive, since it's the expectation of a square and of course the support of ε² is the positive real line, or possibly a subset.

In some cases, one could be tempted to set up a model in which we assume a functional form for the skedastic function in the same way as we do for the regression function. This, however, is very seldom done: the computational complexity is greater than OLS and there is little interest in the parameters of the conditional variance. That said, there are certain situations where the main object of interest is the conditional variance instead of the conditional mean, like for example in certain models used in finance. A fuller discussion of this topic would lead to a concept called heteroskedasticity, which is the object of section 4.2.

In the present chapter, we will assume that $E[\varepsilon^2|x]$ is a positive constant, traditionally labelled σ²:
\[
E[\varepsilon^2|x] = \sigma^2; \qquad (3.10)
\]
the most important implication of this assumption is that the conditional variance $V[y_i|x_i]$ is constant for all observations i = 1 . . . n; this idea is known as homoskedasticity.7 This assumption can be visualised in terms of Figure 3.1 as the idea that all the boxplots look roughly the same, and all you get by moving along the horizontal axis is that they may go up and down as an effect of $E[y|x]$ not being constant, but never change their shape and size. How realistic this assumption is in practice remains to be seen, and a sizeable part of chapter 4 will be devoted to this issue, but for the moment let's just pretend this is not a problem.
Therefore, under the homoskedasticity assumption,
\[
\Omega = E[z_i z_i'] = E\left[\sigma^2 \cdot x_i x_i'\right] = \sigma^2 \cdot E[x_i x_i'] = \sigma^2 Q
\]
and equation (3.9) simplifies to
\[
\sqrt{n}\left(\hat\beta - \beta\right) \xrightarrow{d} N\left(0, \sigma^2 Q^{-1}\right). \qquad (3.11)
\]

This result is also important because it provides the basis for justifying the usage
of OLS as an estimator of β on the grounds of its efficiency. Traditionally, this is
proven via the so-called Gauss-Markov theorem, which, however, relies quite
heavily on small-sample results that I don’t like very much.8 In fact, there is
a much more satisfactory proof that OLS is asymptotically semiparametrically
efficient, but it’s considerably technical, so it’s way out of scope here.9 Suffice it
to say that, under homoskedasticity, OLS is hard to beat in terms of efficiency.
We can estimate consistently Q via $n^{-1}X'X$ and σ² via10
\[
\hat\sigma^2 = \frac{e'e}{n} \xrightarrow{p} \sigma^2 \qquad (3.12)
\]
so the approximate distribution for OLS that we use is
\[
\hat\beta \overset{a}{\sim} N\left(\beta, \hat V\right)
\]
where $\hat V = \hat\sigma^2 (X'X)^{-1}$.


A word of warning: the expression above V̂ is not the one used by the major-
ity of econometrics textbooks, and by none of the major econometric software
7 From the Greek prefix “homo” (same); “skedastic”, in an econometric context, means “that

has to do with variance”.


8 If you really really care, a proof is given in section 3.A.3, but I don’t care about it very much

myself.
9 See Hansen (2019), sections 7.20 and 7.21 if you’re interested.
10 Proof of this is unnecessary, but if you insist, go to subsection 3.A.1.

packages. Instead, in the more popular variant an alternative estimator of σ² is used:
\[
s^2 = \frac{e'e}{n - k},
\]
where k is the number of elements of β; the number n − k is commonly known as degrees of freedom. It's easy to prove that s² is also consistent, but (as can be proven via some neat algebra trick that I'm sparing you here) it's also unbiased: $E[s^2] = \sigma^2$.

The difference between the two variants is negligible if the sample size n is
reasonably large, so you can use either; or to put it otherwise, if using s 2 instead
of σ̂2 makes a substantial difference, then n is probably so small that in my opin-
ion you shouldn’t be using statistical inference in the first place. And besides, I
see no convincing reason why unbiasedness should be considered a virtue in
this context. The usage of σ̂2 , instead, makes many things easier and is con-
sistent with all the rest of procedures that econometricians use beyond OLS, in
which asymptotic results are uniformly used. Having said this, let’s move on.
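As a practical matter, both variance estimators are easy to compute by hand; the sketch below shows how you could do it in gretl after any OLS regression (the variable names y, x1 and x2 are placeholders, not an actual dataset):

ols y const x1 x2 --quiet
matrix e = {$uhat}               # residuals, as a column vector
matrix X = {const, x1, x2}       # regressor matrix
scalar n = $nobs
scalar k = cols(X)
scalar s2_hat = e'*e / n         # sigma-hat squared (asymptotic variant)
scalar s2_dof = e'*e / (n - k)   # s squared ("degrees of freedom" variant)
matrix V_hat = s2_hat * inv(X'*X)
matrix V_dof = s2_dof * inv(X'*X)  # this is the variant packages report ($vcv in gretl)
print V_hat V_dof

With more than a few hundred observations, the two matrices should be practically indistinguishable.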

3.2.3 In short
To summarise: a set of conditions under which OLS can be interpreted as a CAN estimator of something meaningful is the following:

1. we have a sample of n observations on $y_i$ (our dependent variable) and $x_i$ (our explanatory variables) that satisfies some basic requirements so that asymptotic theory is applicable as a reasonable approximation of the behaviour of sample statistics. For example, the observations are iid and all moments exist.

2. The conditional expectation of y on x exists and is linear: $E[y|x] = x'\beta$.

3. The matrix $n^{-1}X'X$ converges in probability to an invertible matrix Q.

4. The conditional variance $V[y|x]$ is a constant, that we call σ².

If the above is true, then β̂ can be regarded as a CAN estimator of the parameters
of the conditional mean, that can be used, in turn, to compute marginal effects
or, as we shall see in the next section, to perform hypothesis tests. Note that
the above hypotheses are sufficient, but some of them may be relaxed to some
degree, and we will do so in the next chapters.
The reader may also find it interesting that an alternative set of assumptions
customarily known as the classical assumptions was traditionally made when
teaching econometrics in the twentieth century. In my opinion, using the clas-
sical assumptions for justifying the usage of OLS as an estimator is a relic of the
past, but if you’re into the history of econometric thought, I wrote a brief de-
scription in section 3.A.2.

3.3 Specification testing


3.3.1 Tests on a single coefficient
Sometimes, we would like to test hypotheses on elements of β, so that we can decide which explanatory variables to include in our regression function. This is often called "specification testing".
Let's begin by testing a very simple hypothesis: $H_0: \beta_i = 0$. The practical implication of $H_0$ in a model like
\[
E[y|x] = x_1\beta_1 + x_2\beta_2 + \cdots + x_k\beta_k = \sum_{j=1}^{k} x_j\beta_j
\]
is that the impact of $x_i$ on $E[y|x]$ is 0, and therefore that the i-th explanatory variable is irrelevant, since it has no effect on the regression function.
Note that under H0 there are two equally valid representations of the regres-
sion function; one that includes x i , the other one that doesn’t. For example, for
i = 2 and k = 3,

Model A y i = x 1i β1 + x 2i β2 + x 3i β3 + εi (3.13)
Model B y i = x 1i β1 + x 3i β3 + εi (3.14)

Clearly, if H0 was true, model B would be preferable, chiefly on the grounds of


parsimony;11 however, if H0 was false, only model A would be a valid represen-
tation of the regression function.
As I explained in section 2.4, in order to test H0 we need to find a differen-
tiable function of β such that g (β ) = 0 if and only if H0 is true. In this case, this is
very easy: define ui as the “extraction vector”, that is a vector of zeros, except for
the i-th element, which is 1.12 The extraction vector takes its name from the fact that the inner product of $u_j$ with any vector a returns the j-th element of a. More generally, the product $A u_j$ yields the j-th column of A, while $u_i'A$ yields its i-th row.13 Evidently, $u_i'A u_j = A_{ij}$, the element of A on row i and column j.
By defining $g(\beta) = u_i'\beta$, the hypothesis $H_0: \beta_i = 0$ can be written as $H_0: u_i'\beta = 0$, and the Jacobian term is simply $u_i'$. Hence
\[
\sqrt{n}\left[u_i'\hat\beta - u_i'\beta\right] \xrightarrow{d} N\left(0,\; u_i' V u_i\right)
\]
so our test statistic is
\[
W = \hat\beta' u_i \left(u_i'\hat V u_i\right)^{-1} u_i'\hat\beta = \frac{\hat\beta_i^2}{v_{ii}} = t_i^2 \qquad (3.15)
\]
11 In fact, we will argue in section 3.5 that OLS on model B would produce a more efficient esti-

mator of the remaining coefficients.


12 Some call the vector u a basis vector. Others simply say “the i -th column of the identity
i
matrix”.
13 As always, the reader should verify the claim, instead of trusting me blindly.

where $t_i = \hat\beta_i / se_i$, $v_{ii}$ is the i-th element on the diagonal of V̂ and $se_i = \sqrt{v_{ii}}$. The quantity $se_i$ is usually referred to as the standard error of $\hat\beta_i$. Of course, the null hypothesis would be rejected if W > 3.84 (the 5% critical value for a $\chi^2_1$ distribution), which implies that we'd reject when $|t| > \sqrt{3.84} = 1.96$.
In fact, it’s rather easy to prove that we could use a slight generalisation of
the above for constructing a test for H0 : βi = a, where a is any real number you
want, and that such a test takes the form

\[
t_{\beta_i = a} = \frac{\hat\beta_i - a}{se_i} \qquad (3.16)
\]

Clearly, we can use the t statistic to decide whether a certain explanatory vari-
able is irrelevant or not, and therefore choose between model A and model B. In
the next subsection, I will show how this idea can be nicely generalised so as to
frame the decision on the best specification via hypothesis testing.
Note, also, that the t statistic can also be used "in reverse" to construct confidence intervals in the same way as discussed at the end of Section 2.3.2: instead of asking ourselves what the decision on $H_0$ would be for a given a, we may look for the values of a that would lead us to a given decision. A hypothesis of the kind $H_0: \beta_j = a$ is not rejected whenever
\[
-1.96 < \frac{\hat\beta_j - a}{se_j} < 1.96;
\]
therefore, the range of values for a that would lead to accepting $H_0$ is
\[
\hat\beta_j - 1.96\cdot se_j < a < \hat\beta_j + 1.96\cdot se_j.
\]
In other words, the interval $\hat\beta_j \pm 1.96\cdot se_j$ contains all the values of a that we may consider not contradictory to the observed data.
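In gretl, the whole exercise takes a couple of lines; in this sketch y, x1 and x2 are placeholder variable names, j picks the coefficient of interest and a is the value under the null:

ols y const x1 x2 --quiet
scalar j = 2                      # position of the coefficient we care about
scalar a = 1                      # hypothesised value
scalar bj  = $coeff[j]
scalar sej = $stderr[j]
scalar t = (bj - a) / sej
printf "t = %g, asymptotic p-value = %g\n", t, 2 * (1 - cnorm(abs(t)))
printf "95%% confidence interval: [%g, %g]\n", bj - 1.96*sej, bj + 1.96*sej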

3.3.2 More general tests


The general idea I will pursue here can be stated as follows: we assume we have
a true representation of the conditional mean, that I will call the unrestricted
model; in the previous subsection, that was called “model A”. Additionally, we
conjecture that the parameters β may obey some restrictions (also known as
constraints); by incorporating those into our unrestricted model, we would ob-
tain a more compact representation of the regression function. This, however,
would be valid only if our conjecture was true. This model is known as the re-
stricted model, and was labelled “model B” in the previous subsection. We make
our decision on the model to adopt by testing our conjecture via a standard hypothesis test.
In order to exemplify this idea, I will use the following unrestricted model:

y i = x 1i β1 + x 2i β2 + x 3i β3 + εi ; (3.17)

in the previous subsection, we saw that the restricted model corresponding to the constraint β2 = 0 is
y i = x 1i β1 + x 3i β3 + εi .

Suppose now that instead of β2 = 0, our conjecture was β1 = 1; by inserting this


equality into (3.17) we obtain

y i = x 1i + x 2i β2 + x 3i β3 + εi

and therefore the restricted version of (3.17) would become

y i − x 1i = x 2i β2 + x 3i β3 + εi ,

so that, in fact, we would be studying the regression function of the observable


variable ỹ i = y i − x 1i on x 2i and x 3i . Note that in this case we would have to
redefine the dependent variable of our model.
One more example: suppose we combine (3.17) with the restriction β2 +β3 =
0: in this case, the constrained model turns out to be

y i = x 1i β1 + (x 2i − x 3i )β2 + εi .

Of course you can combine more than one restriction into a system:
\[
\begin{cases}
\beta_1 = 1 \\
\beta_2 + \beta_3 = 0,
\end{cases}
\]

and if you applied these to (3.17), the constrained model would turn into

y i − x 1i = (x 2i − x 3i )β2 + εi .

The best way to represent constraints of the kind we just analysed is via the matrix equation
\[
R\beta = d,
\]
where the matrix R and the vector d are chosen so as to express the constraints we want to test. The examples above on model (3.17) are translated into the $R\beta = d$ form in the following table:

Constraint: $\beta_3 = 0$; R = [0 0 1]; d = 0; restricted model: $E[y_i|x_i] = x_{1i}\beta_1 + x_{2i}\beta_2$
Constraint: $\beta_1 = 1$; R = [1 0 0]; d = 1; restricted model: $E[y_i - x_{1i}|x_i] = x_{2i}\beta_2 + x_{3i}\beta_3$
Constraint: $\beta_2 + \beta_3 = 0$; R = [0 1 1]; d = 0; restricted model: $E[y_i|x_i] = x_{1i}\beta_1 + (x_{2i} - x_{3i})\beta_2$
Constraints: $\beta_1 = 1$, $\beta_2 = \beta_3$; R = [1 0 0; 0 1 −1]; d = [1, 0]'; restricted model: $E[y_i - x_{1i}|x_i] = (x_{2i} + x_{3i})\beta_2$

Since the Jacobian of $R\beta - d$ with respect to β is just the matrix R, we can adapt the apparatus of Section 3.3.1 to the present case and decide on the appropriateness of the restriction by computing the statistic
\[
W = (R\hat\beta - d)'\left[R\hat V R'\right]^{-1}(R\hat\beta - d) \qquad (3.18)
\]
and matching it to the χ² distribution with p degrees of freedom, p being the number of constraints (the number of rows of the R matrix, if you prefer).
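Computationally, this is just a few lines of matrix algebra; the gretl sketch below uses the last row of the table above (β1 = 1 and β2 = β3) as its example, with y, x1, x2, x3 as placeholder variable names:

ols y x1 x2 x3 --quiet
matrix b = $coeff
matrix V = $vcv                   # estimated covariance matrix of b
matrix R = {1, 0, 0; 0, 1, -1}
matrix d = {1; 0}
matrix m = R*b - d
scalar W = m' * inv(R*V*R') * m
scalar p = rows(R)
printf "W = %g, p-value = %g\n", W, pvalue(X, p, W)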
It should be noted, at this point, that checking for the compatibility of a con-
jecture such as R β = d may be a good idea for several reasons that go beyond the
simple task of choosing the most parsimonious representation for the regression
function. The hypothesis itself could be of interest: for example, the coefficient
β j could measure the response of the incidence of autism to the percentage of
vaccinated children. From a policy perspective, it would be extremely important
if H0 : β j = 0 were rejected (I’d be very surprised if it were).
Additionally, econometric models are often written in terms of parameters that can be given a direct interpretation in terms of economic theory. As an example, take a Cobb-Douglas production function: $Q = AK^{\alpha_1}L^{\alpha_2}$. The reader is doubtlessly familiar enough with microeconomics to need no reminder that scale economies are constant if and only if $\alpha_1 + \alpha_2 = 1$. The production function, in logs, reads
\[
q = a + \alpha_1 k + \alpha_2 l,
\]

where k = log(K ) and l = log(L). If we could perform an experiment in which we


vary k and l to our liking and observe the resulting q, it would be very natural to
estimate the parameter vector

a
 

β =  α1 
α2

by means of an OLS regression, and the hypothesis “constant returns to scale”


would simply amount to R β = d, with

R = [0 1 1] d = 1.

Nevertheless, if we knew for certain (by some supernatural revelation) that


our production function displays in fact constant returns to scale, we would like
our estimate of β to incorporate the information α1 + α2 = 1, and there is no
reason why β̂ should.
In section 3.5, we will develop an alternative estimator, known as the Re-
stricted Least Squares estimator (or RLS for short), which integrates sample
data with one or more a priori constraints on the parameter vector. As we will
see, this will have the additional advantage of providing us with an alternative way to calculate the W test statistic, by comparing the SSRs for the two versions of the model.

3.4 Example: reading the output of a software package


Now it’s time for a hands-on example:14 Table 3.1 contains a regression on a
dataset containing data about 2610 home sales in Stockton, CA from Oct 1, 1996
to Nov 30, 1998;15 the dependent variable is the natural logarithm of their sale
price. For this example, the software package I used is gretl, but the output is
more or less the same with every other program.
The model we’re going to estimate can be written as follows:

p i = β 0 + β 1 s i + β 2 b i + β 3 a i + β 4 x i + εi (3.19)

where p i is the log price of the i -th house and the explanatory variables are:

Legend
lsize si log of living area, hundreds of square feet
baths bi number of baths
age ai age of home at time of sale, years
pool xi = 1 if home has pool, 0 otherwise

Models of this type, where the dependent variable is the price of a good and
the explanatory variables are its features, are commonly called hedonic models.
In this case (like in most hedonic models), the dependent variable is in loga-
rithm; therefore, the effect of all coefficients must be interpreted as the impact
on that variable on the relative change in the house price (see footnote 34 in
Chapter 1).
As you can see, the output is divided into two tables; the most interesting is
the top one, which contains β̂ and some more statistics. I’ll describe the con-
tents of the bottom table in section 3.4.2.

3.4.1 The top table: the coefficients


The top table contains one row for each regressor. In the five columns you have:

1. the regressor name

2. the corresponding element of β̂ , that is β̂i ;


3. the corresponding standard error, that is $se_i = \sqrt{s^2\cdot (X'X)^{-1}_{ii}}$;

4. the ratio of those two numbers, that is the t -ratio (see eq. 3.15)
14 After reading this section, the reader might want to go back to section 1.5 and read it again

from a different perspective.


15 Data are taken from Hill et al. (2018); if you use gretl, you can find the data in gdt format

at http://www.learneconometrics.com/gretl/poe5/data/stockton5.gdt. This is part of


the rich offering you find on Lee Adkins’ excellent website http://www.learneconometrics.
com/gretl/, which also contains Lee’s book. Highly recommended.

Dependent variable: lprice

coefficient std. error t-ratio p-value


----------------------------------------------------------
const 8.85359 0.0483007 183.3 0.0000 ***
lsize 1.03696 0.0232121 44.67 0.0000 ***
baths -0.00515142 0.0130688 -0.3942 0.6935
age -0.00238675 0.000270997 -8.807 2.29e-18 ***
pool 0.106793 0.0226757 4.710 2.61e-06 ***

Mean dependent var 11.60193 S.D. dependent var 0.438325


Sum squared resid 157.8844 S.E. of regression 0.246187
R-squared 0.685027 Adjusted R-squared 0.684544
F(4, 2605) 1416.389 P-value(F) 0.000000
Log-likelihood -42.58860 Akaike criterion 95.17721
Schwarz criterion 124.5127 Hannan-Quinn 105.8041

Table 3.1: Example: house prices in the US

5. the corresponding p-value, possibly with the little stars on the right (see
section 2.4.1).

Note that gretl, like all econometric packages I know, gives you the “finite-
sample” version of the standard errors, that is those computed by using s 2 as
a variance estimator instead of σ̂2 , which is what I personally prefer, but the
difference would be minimal.
For the interpretation of each row, let's begin with the lsize variable:16 the
coefficient is positive, so that in our dataset bigger houses sell for higher prices,
which of course stands to reason. However, the magnitude of the coefficient is
also interesting: 1.037 is quite close to one. Since the house size is also expressed
in logs, we could say that the relative response of the house price to a relative
variation in the house size is 1.037. For example, if we compared two houses
where house A is bigger than house B by 10% (and all other characteristics were
the same), we would expect the value of house A to be 10.37% higher than that
of house B.
As the reader knows, this is what in economics we call an elasticity: for a
continuous function you have that
\[
\eta = \frac{dy}{dx}\cdot\frac{x}{y} = \frac{d\log y}{d\log x}
\]
because $\frac{d\log y}{dy} = \frac{1}{y}$ and therefore $d\log y = \frac{dy}{y}$. So, any time you see something like $\log(y) = a + b\log(x)$, you can safely interpret b as the elasticity of y to x.
From an economic point of view, therefore, we would say that the elasticity
of the house price to its size is about 1. What is more interesting, the standard
16 “Why not the constant?” you may ask. Nobody cares about the constant.

error for that coefficient is about 0.023, which gives a t -ratio of 44.67, and the
corresponding p-value is such an infinitesimal number that the software just
gives you 0.17 If we conjectured that there was no effect of size on price, that
hypothesis would be strongly rejected on the grounds of empirical evidence. In
the jargon of applied economists, we would say that size is significant (in this
case, very significant).
If, instead, we wanted to test the more meaningful hypothesis $H_0: \beta_1 = 1$, it would be quite easy to compute the appropriate t statistic as per equation (3.16):
\[
t = \frac{\hat\beta_1 - 1}{se_1} = \frac{1.03696 - 1}{0.0232121} = 1.592
\]
and the corresponding p-value would be about 11.1%, so we wouldn't reject $H_0$.
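(If you want to replicate the number, two lines of gretl will do; the figures are just those from Table 3.1 and the p-value uses the normal approximation.)

scalar t = (1.03696 - 1) / 0.0232121
printf "t = %g, p-value = %g\n", t, 2 * (1 - cnorm(t))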
On the other hand, we get a slightly negative effect for the number of baths
(−0.00515142). At first sight, this does not make much sense, since you would
expect that the more baths you have in the house, the more valuable your prop-
erty is. How come we observe a negative effect?
There are two answers to this question: first, the p-value associated to the
coefficient is 0.6935, which is way over the customary 0.05 level. In other words,
an applied economist would say that the baths variable is not significant. This
does not mean that we can conclusively deduce that there is no effect. It means
that, if there is one, it’s too weak for us to detect (and, for all we know, it might as
well be positive instead, albeit quite limited). Moreover, this is the effect of the
number of baths other things being equal, as we know from the Frisch-Waugh
theorem (see section 1.4.4). In this light, the result is perhaps less surprising:
why should the number of baths matter, given the size of the house? Actually, a
small house filled with baths wouldn’t seem such a great idea, at least to me.
On the contrary, the two variables age and pool are highly significant. The
coefficient for age, for example, is about -0.002: each year of age decreases the
house value by about 0.2%, which makes sense. The coefficient for the dummy
variable pool is about 0.107, so it would seem that having a pool increases the
house value by a little over 10%, which, again, makes sense.

3.4.2 The bottom table: other statistics


Let’s begin with the easy bits; the first line of the bottom table contains descrip-
tive statistics for the dependent variable: mean (about 11.6) and standard de-
viation (about 0.43). The next line contains the sum of squared residuals e′ e
(157.88) and the square root of s 2 = e′ e/(n − k), which is in this case about 0.246.
Since our dependent variable is in logs, this means that the “typical” size of the
approximation errors for our model is roughly 25%. The line below contains
the R 2 index and its adjusted variant (see eq. 1.19). Both versions are around
68.5%, so that our model, all in all, does a fair job at describing price differen-
tials between houses, especially given how little information on each property
17 In case you’re curious: I can’t compute the number exactly, but it’s smaller than 10−310 .

we have. However, since our estimate of σ2 is quite sizeable, we shouldn’t expect


our model to give us a detailed description of individual house prices.
The line below contains a test18 commonly called “overall specification test”:
it is a joint test for all coefficients being zero apart from the constant. The null
hypothesis is, basically, that none of your regressors make any sense and your
model is utter rubbish. Luckily, in our case the p-value is infinitesimal, so we
reject.
On the next line, you get the log-likelihood for the model:
\[
L = -\frac{n}{2}\left[1 + \ln(2\pi) + \ln(\hat\sigma^2)\right].
\]

This number is of very little use by itself;19 in this book, it’s only important be-
cause it provides the essential ingredient for calculating the so-called Informa-
tion Criteria (IC), that are widely used tools for comparing non-nested models.
We say that a model nests another one when the latter is a special case of
the former. For example, the two models (3.13) and (3.14) are nested, because
model (3.14) is just model (3.13) in the special case β2 = 0. If model B is nested
in model A, choosing between A and B is rather easy: all you need is a proper
test statistic; I will provide a detailed exposition in Section 3.5. However, we may
have to choose between two alternatives for which nesting is impossible:
\[
y_i \simeq x_i'\beta \qquad\qquad y_i \simeq z_i'\gamma
\]

Information criteria start from the value of the log-likelihood (multiplied by -2)
and add a penalty function, which is increasing in the number of parameters.
The gretl package, that I’m using here, reports three criteria: the Akaike IC (AIC),
the Schwarz IC (BIC, where B is for "Bayesian") and the one by Hannan and
Quinn (HQC), which differ by the choice of penalty function:

AIC = −2L + 2k (3.20)


BIC = −2L + k log n (3.21)
HQC = −2L + 2k log log n (3.22)

The rule is to pick the model that minimises the information criteria. It may be interesting to know that, for the case of the linear model that we are examining in the present context, the quantity −2L equals
\[
-2L = n\left[K + \log(\hat\sigma^2)\right],
\]

18 This test, technically, is of the F variety — see section 3.5.1 for its definition.
19 If the data are iid and f (y|x) is Gaussian, then β̂ and σ̂2 are the maximum likelihood estima-

tors of β and σ2 . I chose not to include this topic into this book, but the interested reader will find
a nice and compact exposition in Verbeek (2017), chapter 6. Other excellent treatments abound,
but the curious reader may want to check out chapters 14–15 of Ruud (2000). If you want to go for
the real thing, grab Gourieroux and Monfort (1995), volume 1, chapter 7.

where K is a not particularly interesting constant. Therefore, minimising the in-


formation criteria amounts to choosing a model that fits the data “well” without
using “too many” parameters.
At times, it may happen that this algorithm gives conflicting results depend-
ing on which IC you choose. There is a huge literature on this, but my advice in
these cases is “don’t trust the AIC much”.20 An alternative to information criteria
that has become very popular in recent years (especially because the machine
learning people are crazy about it) is the so-called cross-validation criterion:
you’ll find more about it in Section 3.A.4.21
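Just to make the formulae above concrete, the snippet below reconstructs the three criteria reported in Table 3.1 from the log-likelihood, the number of parameters and the sample size:

scalar L = -42.5886    # log-likelihood from Table 3.1
scalar k = 5           # number of estimated coefficients
scalar n = 2610        # number of observations
printf "AIC = %g\n", -2*L + 2*k               # about 95.18
printf "BIC = %g\n", -2*L + k*log(n)          # about 124.51
printf "HQC = %g\n", -2*L + 2*k*log(log(n))   # about 105.80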

3.5 Restricted Least Squares and hypothesis testing


The Restricted Least Squares statistic (or RLS for short) is an estimator of the
parameter vector that, like OLS, uses the available data in the most effective way
but at the same time, unlike OLS, satisfies by construction a set of p restrictions
of the type R β = d. In other words, we are looking for a vector β̃ that minimises
the SSR under the condition that a certain set of linear constraints are satisfied:

β̃ = Argmin ∥y − Xβ ∥; (3.23)
R β =d

compare (3.23) with equation (1.14): OLS is defined as the unconstrained SSR
minimiser (we can choose β̂ among all k-element vectors); RLS, instead, can
only be chosen among those vectors β that satisfy R β = d. Figure 3.3 exemplifies
the situation for k = 2.
Define the restricted residuals as ẽ = y − Xβ̃ ; we will be interested in com-
paring them with the OLS residuals, so in this section we will denote them as
ê = y − Xβ̂ = MX y to make the distinction typographically evident.
A couple of remarks can already be made even without knowing what the
solution to the problem in (3.23) is. First, since β̂ is an unrestricted minimiser,
ê′ ê cannot be larger than the constrained minimum ẽ′ ẽ. However, the inequality
ê′ ê ≤ ẽ′ ẽ can be made more explicit by noting that
£ ¤
MX ẽ = MX y − Xβ̃ = MX y = ê

and therefore
ê′ ê = ẽ′ MX ẽ = ẽ′ ẽ − ẽ′ PX ẽ
so that

ẽ′ ẽ − ê′ ê = ẽ′ PX ẽ (3.24)


20 If you want some more detail, see section 15.4 in Davidson and MacKinnon (2004) or section

3.2.2 in Verbeek (2017). However, the literature on statistical methods for selecting the “best”
model (whatever that may mean) is truly massive; see for example the “model selection” entry in
(Durlauf and Blume, 2008).
21 In fact, the cross validation criterion can be shown to be roughly equivalent to the AIC.

Figure 3.3: Example: two-parameter vector. The ellipses are the contour lines of the function e′e. The constraint is β1 = 3β2. The number of parameters k is 2 and the number of constraints p is 1. The unconstrained minimum is (β̂1, β̂2); the constrained minimum is (β̃1, β̃2).

where the right-hand side of the equation is non-negative, since ẽ′ PX ẽ can be
written as (PX ẽ)′ (PX ẽ), which is a sum of squares.22
In order to solve (3.23) for β̃ , we need to solve a constrained optimisation
problem, which is not complicated once you know how to set up a Lagrangean.
The details, however, are not important here and I’ll give you the solution straight
away:
\[
\tilde\beta = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d); \qquad (3.25)
\]

derivation of this result is provided in the separate subsection 3.A.5, so you can
skip it if you want.
The statistical properties of β̃ are proven in section 3.A.6, but the most important points to make here are: if $R\beta = d$ holds, then β̃ is consistent just like the OLS estimator β̂, but has the additional advantage of being more efficient. If, on the contrary, $R\beta \ne d$, then β̃ is inconsistent. The practical consequence of this fact is that, if we were certain that the restriction $R\beta = d$ holds, we would be much better off by using an estimator that incorporates this information; but if our conjecture is wrong, our inference would be invalid.
It’s also worth noting that nobody uses expression (3.25) as a computational
device. The simplest way to compute β̃ is to run OLS on the restricted model and
then “undo” the restrictions: for example, if you take model (3.17), reproduced

22 In fact, the same claim follows more elegantly by the fact that P is, by construction, positive
X
semi-definite.

here for your convenience

y i = x 1i β1 + x 2i β2 + x 3i β3 + εi

and want to impose the set of restrictions β1 = 1 and β2 = β3 , what you would
do is estimating the constrained version

y i − x 1i = (x 2i + x 3i )β2 + εi , (3.26)

that can be “unravelled” as

y i = x 1i · 1 + x 2i β2 + x 3i β2 + εi

and then forming β̃ as [1, β̃2, β̃2], where β̃2 is the OLS estimate of β2 in equation (3.26).
Nevertheless, equation (3.25) is useful for proving an important result. Let’s
define the vector λ as
\[
\lambda = \left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d);
\]

by premultiplying (3.25) by X we get:

Xβ̃ = ỹ = ŷ − X(X′ X)−1 R ′ λ

which in turn implies


ẽ = ê + X(X′ X)−1 R ′ λ
By using ê = MX y =⇒ ê′ X = 0, we can use the above equation to get

ẽ′ ẽ = ê′ ê + λ′ R(X′ X)−1 R ′ λ

but by the definition of λ,


\[
\lambda' R(X'X)^{-1}R'\lambda = (R\hat\beta - d)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d)
\]
so finally
\[
\tilde e'\tilde e - \hat e'\hat e = (R\hat\beta - d)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d). \qquad (3.27)
\]
Note that the right-hand side of equation (3.27) is very similar to (3.18). In
fact, if our estimator for σ2 is σ̂2 = ê′ ê/n, we can combine equations (3.18), (3.24)
and (3.27) to write the W statistic as:
\[
W = \frac{(R\hat\beta - d)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d)}{\hat\sigma^2}
= n\,\frac{\tilde e'\tilde e - \hat e'\hat e}{\hat e'\hat e}. \qquad (3.28)
\]
Therefore, we can compute the same number in two different ways: one im-
plies a rather boring sequence of matrix operations, using only ingredients that
are available after the estimation of the unrestricted model. The other one, in-
stead, requires estimating both models, but at that point 3 scalars (the SSRs and
the number of observations) are sufficient for computing W .

3.5.1 Two alternative test statistics


There are two other statistics that can be used to perform a test on H0 instead of
the W test. One is the so-called F test, which is the traditional statistic taught in
all elementary econometric courses, treated with reverence in all introductory
econometrics books and the one that all software packages report. It can be
written as
ẽ′ ẽ − ê′ ê 1 ẽ′ ẽ − ê′ ê n − k
F= = · ; (3.29)
s2 p ê′ ê p
it can be easily seen that there are two differences between F and W : F uses
s 2 , the unbiased estimator of σ2 that I showed you at the end of section 3.2.2,
instead of σ̂2 ; moreover, you also have the number of restrictions p in the de-
nominator. Of course, it’s easy to compute them from one another:
\[
W = \frac{n}{n-k}\, p\cdot F \iff F = \frac{n-k}{n}\,(W/p)
\]
so in the standard case, when n is much larger than k, you have that W ≃ p · F .
Since their p-values are always practically the same, there is no statistical ground
for preferring either. The reason why the econometricians of yore were attached
to the F test was because its distribution is known even for small samples if εi is
normal, so you don’t need asymptotics. In my opinion, however, small samples
are something you should steer clear of anyway, and postulating normality of εi
is, as a rule, just wishful thinking. So, my advice is: use W .
The other statistic we can use is more interesting: if H0 is true, the restricted
model is just as correct as the unrestricted one. Therefore, one could conceiv-
ably estimate σ² by using ẽ instead of ê:
\[
\tilde\sigma^2 = \frac{\tilde e'\tilde e}{n} \qquad (3.30)
\]
It can be proven that this intuition is right, and if $H_0$ is true σ̃² is indeed consistent for σ². If we use σ̃² instead of σ̂² in equation (3.28), we obtain the so-called LM statistic:23
\[
LM = n\,\frac{\tilde e'\tilde e - \hat e'\hat e}{\tilde e'\tilde e} = n\,\frac{\tilde e' P_X \tilde e}{\tilde e'\tilde e}, \qquad (3.31)
\]
where the second equality comes from (3.24). Since ẽ′ẽ cannot be less than ê′ê, in finite samples LM will always be smaller than W. However, under $H_0$ the two variance estimators tend to the same probability limit, and therefore under the null LM will also be asymptotically distributed as $\chi^2_p$.24
The nice feature of the LM statistic is that it can be computed via a neat trick,
known as auxiliary regression:
23 The reason for the name is that this test statistic can be shown to be a “Lagrange Multiplier”

test if normality of εi was assumed. Its validity, however, does not depend on this assumption. A
fuller discussion of this point would imply showing that OLS is a maximum likelihood estimator
under normality, which is something I’m not willing to do. See also footnote 19 in Section 3.4.2.
24 The reader may want to verify that alternative formulations of the W and LM statistics are

possible using σ̂2 and σ̃2 , or the R 2 indices from the two models.

1. run OLS on the constrained model and compute the residuals ẽ;

2. run OLS on a model where the dependent variable is ẽ and the regressors
are the same as in the unconstrained model;

3. take R 2 from this regression and multiply it by n. What you get is the LM
statistic.
The last step is motivated by the fact that you can write R² as $\frac{y' P_X y}{y'y}$, so in the present case $\frac{\tilde e' P_X \tilde e}{\tilde e'\tilde e}$ is R² from the auxiliary regression.

Example 3.1
As an example, let’s go back to the house pricing model we used as an example
in Section 3.4. In Section 3.4.1 we already discussed two hypotheses of interest,
namely:

• The price elasticity is 1, and

• the number of baths has no effect.

Testing for these two hypotheses separately is easy, via t -tests, which is just what
we did a few pages back. As for the joint hypothesis, the easiest thing to do is
setting up the restricted model as follows: combine equation (3.19) with β1 = 1
and β2 = 0. The restricted model becomes

p i − s i = β 0 + β 3 a i + β 4 x i + εi . (3.32)

Note the redefinition of the dependent variable: if $p_i$ is the log of the house price and $s_i$ the log of its size, then $p_i - s_i$ is the log of its price per square foot, or unit price if you prefer. In fact, the hypothesis β1 = 1 implies that if
you have two houses (A and B) that are identical on all counts, except that A is
twice as big as B, then the price of A should be twice that for B. Therefore, this
hypothesis says implicitly that you can take into account appropriately the size
of the property simply by focusing on its price per square foot, which is what we
do in model (3.32). Estimating (3.32) via OLS gives

Dependent variable: lup

coefficient std. error t-ratio p-value


----------------------------------------------------------
const 8.94603 0.00833147 1074 0.0000 ***
age -0.00247396 0.000232353 -10.65 6.07e-26 ***
pool 0.115810 0.0221914 5.219 1.94e-07 ***

Mean dependent var 8.881042 S.D. dependent var 0.253030


Sum squared resid 158.1204 S.E. of regression 0.246277
R-squared 0.053395 Adjusted R-squared 0.052669
98 CHAPTER 3. USING OLS AS AN INFERENTIAL TOOL

Superficially, one may think that our restricted model is much worse than
the unrestricted one, as the R 2 drops from 68.5% to 5.3%. However, this is not a
fair comparison, because in the restricted model the dependent variable is rede-
fined and the denominator of the two R 2 indices is not the same. The SSRs are,
instead, perfectly comparable, and the change you have between the full model
and the unrestricted one is 157.88 → 158.12, which looks far less impressive, so
we are drawn to think that the restricted model is not much worse in terms of
fit. We could take this as an indication that our maintained hypothesis is not
particularly at odds with the data.
This argument can be made rigorous by computing the W statistic:
\[
W = 2610 \cdot \frac{158.1204 - 157.8844}{157.8844} = 3.9014
\]
you get a statistic that is smaller than the 5% critical value of the $\chi^2_2$ distribution (5.99), so we accept both hypotheses again (the p-value is about 0.142). This time, however,
the test was performed on the joint hypothesis. It may well happen (examples
are not hard to construct) that you may accept two hypotheses separately but
reject them jointly (the converse should never happen, though).
The LM test, instead, can be computed via an auxiliary regression as follows:
take the residuals from model (3.32) and regress them against the explanatory
variables of the unrestricted model (3.19). In this case, you get
coefficient std. error t-ratio p-value
---------------------------------------------------------
const -0.0924377 0.0483007 -1.914 0.0558 *
lsize 0.0369564 0.0232121 1.592 0.1115
baths -0.00515142 0.0130688 -0.3942 0.6935
age 8.72115e-05 0.000270997 0.3218 0.7476
pool -0.00901706 0.0226757 -0.3977 0.6909

SSR = 157.884, R-squared = 0.001493

The auxiliary regression per se is not particularly interesting: its parameters don't have any meaningful interpretation.25 For us, it's just a computational
device we use to compute the LM test statistic: take R 2 = 0.001493 and multiply
it by the number of observations (2610); you get
LM = 0.001493 × 2610 = 3.89558,
which is practically identical to W , hence the conclusion is the same.
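For the record, here is a gretl script that carries out all the computations of this example in one go; it assumes the dataset mentioned in footnote 15 (stockton5.gdt) is available, with the variable names used in Table 3.1.

open stockton5.gdt
ols lprice const lsize baths age pool --quiet
scalar SSRu = $ess                  # unrestricted SSR
scalar n = $nobs
series lup = lprice - lsize         # log of unit price
ols lup const age pool --quiet      # restricted model (3.32)
scalar SSRr = $ess
series etilde = $uhat               # restricted residuals
scalar W = n * (SSRr - SSRu) / SSRu
printf "W = %g, p-value = %g\n", W, pvalue(X, 2, W)
# LM via the auxiliary regression
ols etilde const lsize baths age pool --quiet
scalar LM = n * $rsq
printf "LM = %g, p-value = %g\n", LM, pvalue(X, 2, LM)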

3.6 Exogeneity and causal effects

This section is short, but very important: the 2021 Nobel Prize for Economics
was awarded to David Card, Joshua Angrist and Guido Imbens, precisely for
their work on causal effects, which has been enormously influential, especially
in labour economics. The issue here is not about the statistical properties of β̂ ,
but rather on its interpretation as an estimator of β , so it fits in well at this point
of the book, although the point we pursue here will be discussed in much greater
detail in Chapter 6.
What does β measure? If $E[y|\mathbf{x}] = \mathbf{x}'\beta$, then β is simply defined as
$$\beta = \frac{\partial E[y|\mathbf{x}]}{\partial \mathbf{x}};$$
therefore, in a wage equation, the coefficient for education tells us simply by
how much, on average, wage varies across different education levels.
It would be tempting to say “by how much your wage would change if you
got one extra year of education”, but unfortunately this statement would be un-
warranted.26 There are many reasons why the conditional mean may not be a
good indicator of causality: for example, people may just stop attending school,
or university, the moment they are able to earn a decent wage. If that were the
case, the regression function of wage with respect to education would be flat, if
not negative (because the smartest people would spend a shorter time in edu-
cation). But this wouldn’t mean that education has a negative effect on wages.
In fact, quite the contrary: people who get more education would do so to com-
pensate for their lesser ability. Of course, this example is a bit of a stretch, but
should give you a hint as to why inferring causality from a regression coefficient
may be a very bad idea.
More generally, there are situations where the causal relationship between
y and x works in such a way that the conditional mean does not capture causality,
but only the outcome of the process, which can be quite different, as in the
example I just made.
In these cases, the traditional phrase that we use in the economics community
is "x is endogenous" (as opposed to exogenous). If regressors are endogenous,
then the regression parameters have nothing to do with causal effects;
put another way, the parameters of interest β are not those that describe the
conditional mean, and therefore, if you define $\varepsilon_i$ as $y_i - \mathbf{x}_i'\beta$, the first consequence
is that the property $E[\varepsilon_i|\mathbf{x}_i] = 0$ doesn't hold anymore, and so $E[\varepsilon_i \mathbf{x}_i] = \mathrm{Cov}[\varepsilon_i, \mathbf{x}_i] \neq 0$.
This is why in many cursory treatments of the subject, endogeneity
is described as an "illness" that happens when the regressors are correlated
with the disturbance term. Of course, that is an oversimplification: a more rigorous
statement would be that in some cases the causal effects can be different
from the conditional mean; if you define the disturbances as deviations from
the causal effects, non-zero correlation between regressors and disturbances
follows by construction.
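To make the point concrete, here is a tiny simulated example along the lines of the schooling story above (everything is made up: "ability" is a hypothetical unobserved variable, and the causal effect of education on log wages is set to 1 by construction):

# Simulated endogeneity: the causal effect of educ on wage is 1 by construction,
# but OLS estimates something close to 0, because education is chosen in
# response to unobserved ability.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
ability = rng.standard_normal(n)
educ = -ability + rng.standard_normal(n)        # smarter people quit school earlier
wage = 1.0 * educ + 2.0 * ability + rng.standard_normal(n)

X = np.column_stack([np.ones(n), educ])
beta_hat = np.linalg.lstsq(X, wage, rcond=None)[0]
print(beta_hat[1])   # close to 0, nowhere near the causal effect of 1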
26 The ongoing debate in contemporary econometrics on the issue of differentiating between

correlation and causation is truly massive. For a quick account, read chapter 3 in the latest best-
seller in econometrics, that is Angrist and Pischke (2008), or simply google for “Exogeneity”.

Note that this problem is not a shortcoming of OLS per se: the job of OLS is
to estimate consistently the parameters of the conditional expectation, that is $\partial E[y|\mathbf{x}]/\partial \mathbf{x}$. If
the nature of the problem is such that our parameters of interest β are a different
object and we insist on equating them with what OLS returns (thereby giving
OLS a misleading interpretation), it's a hermeneutical problem, not a statistical
one.
The preferred tool in the econometric tradition for estimating causal effects
is an estimator called Instrumental Variable estimator (or IV for short), but
you’ll have to wait until chapter 6 for it.

3.7 Prediction
Once a model is estimated and we have CAN estimators of β and σ², we may
want to answer the following question: if we were given a new datapoint for
which the vector of covariates is known and equal to $\check{\mathbf{x}}$, what could we say about
the value of the dependent variable $\check{y}$ for that new observation?
In order to give a sensible answer, let's begin by noting a few obvious facts:
of course, y is a random variable, so we cannot predict it exactly. If we knew the
true DGP parameters β and σ² we could say, however, that
$$E[\check{y}|\check{\mathbf{x}}] = \check{\mathbf{x}}'\beta \qquad V[\check{y}|\check{\mathbf{x}}] = \sigma^2.$$
If we were willing to entertain the claim that ε is normal, we could even build a
confidence interval27 and say
$$P(|\check{y} - m| < 1.96\sigma) \simeq 95\% \qquad (3.33)$$
where $m = \check{\mathbf{x}}'\beta$. The problem is that we don't observe β; instead, we observe $\hat\beta$,
so, assuming that the best way to make a point prediction of a random variable
is to take its expectation,28 the best we can do to predict $\check{y}$ is computing
$$\hat{y} = \check{\mathbf{x}}'\hat\beta.$$

Note, however, that $\hat\beta$ is a random variable, with its own variance, so the confidence
interval around $\hat{y}$ has to take this into account. Formally, let us define the
prediction error as
$$e^* = \check{y} - \hat{y} = (\check{\mathbf{x}}'\beta + \check\varepsilon) - \check{\mathbf{x}}'\hat\beta = \check\varepsilon + \check{\mathbf{x}}'(\beta - \hat\beta);$$
the expression above reveals that our prediction can be wrong for two reasons:
(a) because $\check\varepsilon$ is inherently unpredictable: our model does not contain all the
features that describe the dependent variable, and its variance is a measure of
how bad our model is; and (b) our sample is not infinite, and therefore we don't
observe the DGP parameter β, but only its estimate $\hat\beta$.

27 If you need to refresh the notion of confidence interval, go back to the end of section 2.3.2.
28 This may seem obvious, but actually isn't: this choice is optimal if the loss function we employ for evaluating prediction is quadratic (see section 1.2). If the loss function was linear, for example, we'd have to use the median. But let's just stick to what everybody does.

If $\check\varepsilon$ can be assumed independent of $\hat\beta$ (as is normally safe to do), then the
variance of the difference is the sum of the variances:
$$V[e^*] = V[(\check{y} - \hat{y})|\check{\mathbf{x}}] = V[\check\varepsilon] + V[\check{\mathbf{x}}'(\beta - \hat\beta)] = \sigma^2 + \check{\mathbf{x}}'V\check{\mathbf{x}}.$$
Of course, when computing this quantity with real data, we replace variances
with their estimates, so we use $\hat\sigma^2$ (or $s^2$) in place of $\sigma^2$, and $\hat{V}$ for $V$.

Example 3.2
Suppose we use the model shown in section 3.4.1 to predict the price for a house
built 5 years ago, with 1500 square feet of living area, 2 baths and no pool. In this
case,
$$\check{\mathbf{x}}' = [1 \quad 2.708 \quad 2 \quad 5 \quad 0]$$
(the number 2.708 is just log(1500/100)); since
$$\hat\beta' = [8.8536 \quad 1.037 \quad -0.00515 \quad -0.00239 \quad 0.107],$$
simple multiplication yields $\hat{y} = 11.6395$. Therefore the predicted price is about
US$ 113491 (that is, exp(11.6395)).29
As for the variance,30 we need $\hat{V}$, that is $s^2(X'X)^{-1}$:
$$\hat{V} = 0.0001 \times \begin{bmatrix}
 23.330 & -9.914 &  2.288 & -0.025 &  1.724 \\
 -9.914 &  5.388 & -2.245 & -0.010 & -0.705 \\
  2.288 & -2.245 &  1.708 &  0.016 & -0.021 \\
 -0.025 & -0.010 &  0.016 &  0.001 & -0.001 \\
  1.724 & -0.705 & -0.021 & -0.001 &  5.142
\end{bmatrix}$$
It turns out that
$$V[\hat{y}|\check{\mathbf{x}}] = \hat\sigma^2 + \check{\mathbf{x}}'\hat{V}\check{\mathbf{x}} = 0.0606082 + 6.28845\cdot 10^{-5} = 0.0606711;$$
since $\sqrt{0.0606711} \simeq 0.2463$, we can even calculate a 95% confidence interval
around our prediction as
$$\hat{y} \pm 1.96\sqrt{V[\hat{y}]} = 11.6395 \pm 1.96 \times 0.2463,$$
so we could expect that, with a probability of 0.95, the log price of our hypothetical
house would be between 11.157 and 12.122, and therefore the price itself
between $ 70000 and $ 180000 (roughly). You may feel unimpressed by such a
wide range, and I wouldn't disagree. But on the other hand, consider that this
is a very basic model, which only takes into account very few features of the
property, so it would be foolish to expect it to be razor-sharp when it comes to
prediction.

29 Actually, the expectation of the exponential is not the exponential of the expectation, since the exponential function is everywhere convex (see Section 2.A.1), but details are not important here.
30 Of course, we could have used the asymptotic version $\hat\sigma^2(X'X)^{-1}$ and very little would have changed.
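For the record, the computations in Example 3.2 take only a few lines of code; this is just a sketch with the numbers typed in from the estimates above (numpy assumed):

# Sketch: point prediction and 95% interval for the log price of the
# hypothetical house in Example 3.2 (all numbers copied from the text)
import numpy as np

x_new = np.array([1.0, 2.708, 2.0, 5.0, 0.0])        # const, lsize, baths, age, pool
beta_hat = np.array([8.8536, 1.037, -0.00515, -0.00239, 0.107])
s2 = 0.0606082                                        # estimate of sigma^2
Vhat = 0.0001 * np.array([
    [23.330, -9.914,  2.288, -0.025,  1.724],
    [-9.914,  5.388, -2.245, -0.010, -0.705],
    [ 2.288, -2.245,  1.708,  0.016, -0.021],
    [-0.025, -0.010,  0.016,  0.001, -0.001],
    [ 1.724, -0.705, -0.021, -0.001,  5.142]])

y_hat = x_new @ beta_hat                              # about 11.6395
pred_var = s2 + x_new @ Vhat @ x_new                  # about 0.06067
lo, hi = y_hat - 1.96 * np.sqrt(pred_var), y_hat + 1.96 * np.sqrt(pred_var)
print(np.exp([lo, y_hat, hi]))                        # prices in dollars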

One last thing: you may have noticed, in the example above, that the vari-
ance of the predictor depends almost entirely on the “model uncertainty” com-
ponent σ̂2 and very little on the “parameter uncertainty” component x̌′V̂ x̌. This
is not surprising, in the light of the fact that, as n → ∞, the latter component
should vanish, since β̂ is consistent. Therefore, in many settings (notably, in
time-series models, that we’ll deal with in Chapter 5), the uncertainty on the
prediction is tacitly assumed to come only from σ2 .

3.8 The so-called “omitted-variable bias”


In many econometrics textbooks, you can find an argument that goes more or
less like this: assume that the true model is
$$y_i = x_i\beta_1 + z_i\beta_2 + \varepsilon_i; \qquad (3.34)$$
if you try to estimate $\beta_1$ via a regression of $y_i$ on $x_i$ alone, you're going to end up
with a bad estimate. The proof is rather easy:31
$$\hat\beta_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i\left(x_i\beta_1 + z_i\beta_2 + \varepsilon_i\right)}{\sum_{i=1}^n x_i^2} = \beta_1 + \beta_2\frac{\sum_{i=1}^n x_i z_i}{\sum_{i=1}^n x_i^2} + \frac{\sum_{i=1}^n x_i\varepsilon_i}{\sum_{i=1}^n x_i^2} \qquad (3.35)$$

If you take probability limits, you'll find that, even if $E[\varepsilon_i|x_i] = 0$,
$$\hat\beta_1 \overset{p}{\longrightarrow} \beta_1 + \beta_2\frac{E[x_i z_i]}{E[x_i^2]}$$
and the estimator would be inconsistent unless $\beta_2 = 0$ and/or $E[x_i z_i] = 0$. The
solemn maxim the student receives at this point is "if you omit a relevant regressor
($z_i$ in our case, that is relevant if $\beta_2 \neq 0$) then your estimates will be incorrect,
unless the omitted variable miraculously happens to be uncorrelated with $x_i$".
The phenomenon is usually called omitted variable bias.
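If you want to see the formula above at work, here is a small simulation with made-up numbers: by construction $E[x_i z_i]/E[x_i^2] = 0.5$, so the regression of y on x alone should converge to $1 + 2\times 0.5 = 2$ rather than to $\beta_1 = 1$.

# Simulation of the "omitted variable bias" formula: y = x*b1 + z*b2 + eps,
# but z is left out of the regression (no constant, as in equation (3.35)).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
b1, b2 = 1.0, 2.0
x = rng.standard_normal(n)
z = 0.5 * x + rng.standard_normal(n)        # correlated with x: E[xz]/E[x^2] = 0.5
y = b1 * x + b2 * z + rng.standard_normal(n)

beta1_hat = np.sum(x * y) / np.sum(x * x)   # regression of y on x alone
print(beta1_hat)                            # close to 2, not to 1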
I've always considered this remark quite useless, if not outright misleading,
and seeing econometricians, who are far better than myself, repeating it over
and over to generations of students is a constant source of wonder to me. I'll try
to illustrate why, and convince you of my point.

31 I could extend this example to matrices, but it would be totally unnecessary to grasp my point.
The parameter $\beta_1$ in (3.34) is defined as the partial effect of $x_i$ on the conditional
mean of $y_i$ on both $x_i$ and $z_i$; that is, the effect of x on y given z. It would
be silly to think that this quantity could be estimated consistently without using
any information on $z_i$.32 The statistic $\hat\beta_1$, as defined in (3.35) (which does
ignore $z_i$), is nevertheless a consistent estimator of a different quantity, namely
the partial effect of $x_i$ on the conditional mean of $y_i$ on $x_i$ alone, that is
$$E[y_i|x_i] = \beta_1 x_i + \beta_2 E[z_i|x_i].$$

The objection that some put forward, at this point, is: “OK; but assume that
equation (3.34) is my object of interest, and z i is unobservable, or unavailable.
Surely, you must be aware that the estimate you get by using x i only is bogus.”
Granted. But then, I may reply, do you ever get a real-life case when you observe
all the variables you would like to have in your conditioning set? I don’t think
so; take for example the model presented in section 3.4: in order to set up a truly
complete model, you would have to have data on the state of the building, on
the quality of life in the neighbourhood, on the pleasantness of the view, and
so on. You should always keep in mind that the parameters of your model only
make sense in the context of the observable explanatory variable that you use
for conditioning.33
This doesn’t mean that you should not worry about omitted variable bias at
all. The message to remember is: the quantity we would like to measure (ideally)
is “the effect of x on y all else being equal”; but what we measure by OLS is the
effect of x on y conditional on z. Clearly, in order to interpret our estimate the
way we would like to, z should be as close to “all else” as possible, and if you omit
relevant factors from your analysis (by choice, or impossibility) you have to be
extra careful in interpreting your results.

Example 3.3
I downloaded some data from the World Development Indicators34 website. The
variables I’m using for this example are

32 I should add that if we had an observable variable $w_i$, which we knew for certain to be uncorrelated with $z_i$, we could estimate $\beta_1$ consistently via a technique called instrumental variable estimation, which is the object of Chapter 6.
33 In fact, there is an interesting link between the bias you get from variable omission and the one you get from endogeneity (see section 3.6). Maybe I'll write it down at some point.
34 The World Development Indicator (or WDI for short) is a wonderful database, maintained

by the World Bank, that collects a wide variety of variables for over 200 countries over a large
time span. It is one of the most widely used resources in development economics and is publicly
available at http://wdi.worldbank.org or through DBnomics https://db.nomics.world/.

Variable name        Description
NY.GDP.PCAP.PP.KD    GDP per capita based on purchasing power parity (PPP)
SH.MED.BEDS.ZS       Hospital beds (per 1,000 people)
NV.AGR.TOTL.ZS       Agriculture, value added (% of GDP)

For each country, I computed the logarithm of the average (between 2014
and 2018) of the available data, which left me with data for 69 countries. The
three resulting variables are called l_gdp, l_hbeds and l_agri. Now consider
Table 3.2, which reports two OLS regressions. In the first one, we regress the
number of hospital beds on the share of GDP from agriculture. As you can see,
the parameter is negative and significant. However, when we add GDP to the
equation, the coefficient of l_agri becomes insignificant (and besides, its sign
changes). On the contrary, you find that GDP matters a lot.

Dependent variable: l_hbeds

                    (1)            (2)
  const          1.090∗∗       −4.876∗∗
                (0.1168)        (1.231)
  l_agri       −0.2916∗∗        0.08467
               (0.05794)      (0.09217)
  l_gdp                        0.5655∗∗
                               (0.1163)
  R̄²             0.2635         0.4496
(standard error in parentheses)

Table 3.2: Two regressions on WDI data

The correct interpretation for this result is: there is a significant link between
medical quality (as measured by the number of hospital beds per 1000 inhabi-
tants) and the share of GDP from agriculture. In other words, if you travel to a
country where everybody works in the fields, you’d better not get ill. However,
this fact is simply a by-product of differences between countries in terms of eco-
nomic development.
Once you consider the conditional expectation of l_hbeds on a wider infor-
mation set, which includes GDP per capita,35 the effect disappears. That is, for
a given level of economic development36 there is no visible link between hospi-
tal beds and agriculture. To put it more explicitly: if you compare two countries
where the agricultural sectors have a different size (say, Singapore and Burundi),
you're likely to find differences in their health system quality. However, if you
compare two countries with the same per capita GDP (say, Croatia vs Greece, or
Vietnam vs Bolivia) you shouldn't expect to find any association between agriculture
and hospital beds.

35 In the applied economic jargon: "once you control for GDP".
36 OK, GDP per capita is not a perfect measure of economic development, nor of happiness, nor of well-being. I know. I know about the Human Development Index. I know about that Latouche guy. I know about all these things. Just give me a break, will you?
Does this mean that model (1) is “wrong”? No: it simply means that the two
coefficients in the two models measure two different things: a “gross” effect in
equation (1) and a “net” effect in equation (2).37 Does this mean that model (2)
is preferable? Yes: model (2) gives you a richer picture (see how much larger R̄ 2
is) because it’s based on a larger information set.

3.A Assorted results


3.A.1 Consistency of σ̂2
From (3.3), $\mathbf{e} = M_X\mathbf{y} = M_X\varepsilon$. Therefore, the sum of squared residuals can be written as
$$\mathbf{e}'\mathbf{e} = \varepsilon' M_X\varepsilon = \varepsilon'\varepsilon - \varepsilon'X(X'X)^{-1}X'\varepsilon;$$
now, given the definition of $\hat\sigma^2$,
$$\hat\sigma^2 = \frac{\mathbf{e}'\mathbf{e}}{n},$$
divide everything by n and take probability limits; the first bit is easy:
$$\frac{\varepsilon'\varepsilon}{n} = \frac{1}{n}\sum_{i=1}^n \varepsilon_i^2 \overset{p}{\longrightarrow} E[\varepsilon_i^2] = \sigma^2.$$
On the other hand, equations (3.7) and (3.8) say that
$$\frac{\varepsilon'X}{n} \overset{p}{\longrightarrow} 0 \qquad \frac{X'X}{n} \overset{p}{\longrightarrow} Q$$
and therefore
$$\hat\sigma^2 = \frac{\varepsilon'M_X\varepsilon}{n} = \frac{\varepsilon'\varepsilon}{n} - \frac{\varepsilon'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n} \overset{p}{\longrightarrow} \sigma^2 - 0'Q^{-1}\cdot 0 = \sigma^2.$$

3.A.2 The classical assumptions


The classical assumptions were used in the infancy of econometrics to justify
OLS as an inferential method. They reflect a point of view that was quite natural
in those days, that is the idea that statistical methods could be borrowed from
other sciences and be employed on economic data with little or no modification.


The starting point is the linear model in matrix form
$$\mathbf{y} = X\beta + \varepsilon$$
which of course implies $y_i = \mathbf{x}_i'\beta + \varepsilon_i$. The classical assumptions are:

1. X is a $n \times k$ non-stochastic matrix, with $n > k$ and $\mathrm{rk}(X) = k$;

2. $\varepsilon \sim N(0, \sigma^2 I)$.

In this context, x′i β is interpreted as a “law of nature”, which describes what


happens to y i under certain conditions, described by xi ; this idea is borrowed
directly from experimental science, where y i is the outcome of the i -th experi-
ment and xi are the conditions under which the i -th experiment took place.
The disturbance term εi is just “random noise”, coming from experimental
errors, bad measurement or some other factor that is impossible to control fully;
the idea here is that if εi was 0, by observing y i we would observe the “law of
Nature” x′i β in its “uncontaminated” form.
In a controlled experiment, these hypotheses are perfectly natural: the xi are
obviously non-random (because are decided by the experimenter); to surmise
that εi is Gaussian is also quite natural, since it is the outcome of a multitude
of a large number of small imperfections and some faith in the Central Limit
Theorem is not totally unjustified.
In the adaptation to economic data, the "fixed-X" assumption was recognised
as untenable, so a second version of the classical assumptions allows for
the possibility that X may be random. In that case, assumption 2 is replaced by
$$\varepsilon|X \sim N(0, \sigma^2 I),$$
which in turn implies $E[\varepsilon_i|X] = E[\varepsilon_i|\mathbf{x}_i] = 0$, and as a consequence $E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\beta$.
Finally, note that in the classical world ε is assumed to be Gaussian, while
no assumption of that kind was made in Section 3.3. Normality is necessary to
derive the distribution of hypothesis tests such as the t test or the F test when
the sample size is small; needless to say, this is neither necessary nor desirable
in modern econometrics, where datasets are almost always rather large and the
normality assumption is, at best, questionable. This is why in contemporary
econometrics (and, as a result, in this book) we mainly rely on asymptotic infer-
ence, where Gaussianity is nearly useless.

3.A.3 The Gauss-Markov theorem


The Gauss-Markov theorem states that, under homoskedasticity, OLS is the most
efficient estimator among all those that are (a) unbiased and (b) linear. Unbiasedness
means that $E[\hat\beta] = \beta$; linearity means that $\hat\beta$ can be written as $\hat\beta = L'\mathbf{y}$.
The OLS statistic enjoys both properties ($L'$ being equal to $(X'X)^{-1}X'$ for OLS), but
other statistics may too. This property is often condensed in the phrase "OLS is
BLUE", where BLUE stands for "Best Linear Unbiased Estimator".
The proof is simple if we concentrate on the case when X is a matrix of fixed
constants and does not contain random variables, because in this case we can
shuffle X in and out of the expectation operator $E[\cdot]$ any way we want. Considering
the case when X contains random variables makes the proof more involved.
Here goes: take a linear estimator $\tilde\beta$ defined as $\tilde\beta = L'\mathbf{y}$. In order for it to be
unbiased, the following must hold
$$E[\tilde\beta] = E[L'(X\beta + \varepsilon)] = L'X\beta + L'E[\varepsilon] = \beta;$$
it is safe to assume that $E[\varepsilon] = 0$, so the unbiasedness requirement amounts to
$L'X = I$. Note that, in the standard case, there are infinitely many matrices that
satisfy this requirement, since $n > k$ and X is a "tall" matrix. In the OLS case,
$L' = (X'X)^{-1}X'$, and the requirement is met. Therefore, under unbiasedness, $\tilde\beta$
can be written as $\tilde\beta = \beta + L'\varepsilon$.
Now consider the variance of $\tilde\beta$:
$$V[\tilde\beta] = E[(\tilde\beta - \beta)(\tilde\beta - \beta)'] = E[L'\varepsilon\varepsilon'L] = L'E[\varepsilon\varepsilon']L = L'V[\varepsilon]L;$$
under homoskedasticity, $V[\varepsilon] = \sigma^2 I$, so
$$V[\tilde\beta] = \sigma^2 L'L; \qquad (3.36)$$
again, OLS is just a special case, so the variance of $\hat\beta$ is easy to compute as
$V[\hat\beta] = \sigma^2\left[(X'X)^{-1}X'\right]\left[X(X'X)^{-1}\right] = \sigma^2(X'X)^{-1}$.
The gist of the theorem lies in proving that the difference
$$V[\tilde\beta] - V[\hat\beta] = \sigma^2 L'L - \sigma^2(X'X)^{-1} = \sigma^2\left[L'L - (X'X)^{-1}\right]$$
is positive semidefinite any time $L' \neq (X'X)^{-1}X'$, and therefore OLS is more efficient
than $\tilde\beta$. This is relatively easy: define $D \equiv L - X(X'X)^{-1}$, which has to be
nonzero unless $\hat\beta = \tilde\beta$. Therefore, $D'D$ must be positive semidefinite (see section
1.A.7):
$$D'D = \left[L' - (X'X)^{-1}X'\right]\left[L - X(X'X)^{-1}\right] = L'L - (X'X)^{-1}X'L - L'X(X'X)^{-1} + (X'X)^{-1};$$
under unbiasedness, $L'X = X'L = I$, so the expression above becomes
$$D'D = L'L - (X'X)^{-1} \qquad (3.37)$$
and the claim is proven.


Having said this, let me add that the relevance of the Gauss-Markov theorem
in modern econometrics is quite limited: the assumption that X is a fixed matrix
makes sense in the context of a randomised experiment, but the data we have
in economics are rarely compatible with this idea; the same goes, perhaps even
more strongly, for homoskedasticity. Moreover, one does not see why the linearity
requirement should be important, aside from computational convenience;
and a similar remark holds for unbiasedness, that is nice to have but not really
important if our dataset is of a decent size and we can rely on consistency.
One may see why people insisted so much on the Gauss-Markov theorem
in the early days of econometrics, when samples were small, computers were
rare and statistical methods were borrowed from other disciplines with very few
adjustments. Nowadays, it’s just a nice exercise in matrix algebra.
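A quick numerical illustration of the theorem is easy to set up: pick any linear unbiased estimator other than OLS (below, a weighted least-squares one with arbitrary made-up weights) and check that the difference between the two variance matrices is positive semi-definite. This is only a sketch, not part of the formal argument above:

# Numerical check of the Gauss-Markov result: an arbitrary linear unbiased
# estimator (weighted least squares with made-up weights) has a variance
# matrix that exceeds the OLS one in the psd sense.
import numpy as np

rng = np.random.default_rng(7)
n, k = 200, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
W = np.diag(rng.uniform(0.5, 2.0, n))           # arbitrary positive weights

XtX_inv = np.linalg.inv(X.T @ X)
L_alt = W @ X @ np.linalg.inv(X.T @ W @ X)      # another L with L'X = I

print(np.allclose(L_alt.T @ X, np.eye(k)))      # unbiasedness condition: True
diff = L_alt.T @ L_alt - XtX_inv                # proportional to V[b_tilde] - V[b_hat]
print(np.linalg.eigvalsh(diff).min() >= -1e-12) # all eigenvalues >= 0: True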

3.A.4 Cross-validation and leverage


Given our usual regression model

y = Xβ + ε, (3.38)

we may indulge in the following thought exercise: “We have n datapoints, and
we use all the available information to compute all the OLS-related statistics.
But what if we had only n − 1? We could pretend that the value of y n was un-
available. What happens if we compute β̂ only using the first n − 1 datapoints?
How different would it be from its full-sample equivalent? And if we used β̂ and
xn to predict y n , what should we expect?”. As we will see, pursuing this idea will
lead us to developing useful tools for identifying influential observations and
testing the specification of our model.
Suppose we have n observations but we leave the i -th one aside, and in-
troduce the following convention: the “(−i )” index means “excluding the i -th
observation”; hence, X(−i ) is a (n − 1) × k matrix, equal to X with the i -th row
dropped, and the same interpretation holds for y(−i ) .38 The reason why we may
want to do this is to check what happens to our model if a certain observation
had not been available. There are several insights we can gain from doing so.
In order to perform the necessary calculations, it is useful to consider a model
where you add to X a dummy variable identifying the i-th observation, that is an
additional column $\mathbf{d}$, containing all zeros save for the i-th row, which contains 1.
In practice, our model becomes
$$y_i = \mathbf{x}_i'\beta + d_i\lambda + \varepsilon_i = \mathbf{w}_i'\gamma + \varepsilon_i. \qquad (3.39)$$
For example, if $i = n$, $\mathbf{d}$ would be a vector of zeros with one 1 at the bottom and
in matrix form the model would look like this:
$$\mathbf{y} = \begin{bmatrix} \mathbf{y}_{(-i)} \\ y_i \end{bmatrix} \qquad W = \begin{bmatrix} X_{(-i)} & 0 \\ \mathbf{x}_i' & 1 \end{bmatrix} \qquad \gamma = \begin{bmatrix} \beta \\ \lambda \end{bmatrix}.$$
Clearly, our original model (3.38) is just the special case $\lambda = 0$. Here are a few
results that will be useful later on:39
$$X'M_d = \begin{bmatrix} X_{(-i)}' & 0 \end{bmatrix} \qquad X'M_d X = X_{(-i)}'X_{(-i)} = \sum_{j\neq i}\mathbf{x}_j\mathbf{x}_j'$$
$$X'M_d\mathbf{y} = X_{(-i)}'\mathbf{y}_{(-i)} = \sum_{j\neq i}\mathbf{x}_j y_j$$
$$\mathbf{d}'M_X\mathbf{d} = m_i \qquad \mathbf{d}'M_X\mathbf{y} = \mathbf{d}'\tilde{\mathbf{e}} = \tilde{e}_i$$
where $\tilde{\mathbf{e}}$ are the OLS residuals for the full-sample model, that is equation (3.38);
$m_i$ is the i-th element on the diagonal of $M_X$, that is $1 - \mathbf{x}_i'(X'X)^{-1}\mathbf{x}_i$. Let's also
define $h_i = 1 - m_i = \mathbf{x}_i'(X'X)^{-1}\mathbf{x}_i$, the i-th element on the diagonal of $P_X$. It can
be proven that $0 \le m_i \le 1$, so that the same holds for $h_i$ too.40

38 Note: this is not standard notation. I adopted it just for this section.
39 These are easy to prove, and provide a nice exercise on matrix algebra. Hint: start by computing $\mathbf{d}'\mathbf{d}$ and $\mathbf{d}\mathbf{d}'$.
40 The proof is surprisingly easy: $m_i$ lies on the diagonal of $M_X$; since $M_X$ is positive semi-definite, the diagonal elements cannot be negative. But the same holds for $h_i$ and $P_X$. Since $h_i + m_i = 1$, the proof is complete.

Some readers may find the choice of symbols surprising: if the $m_i$ values are the diagonal of $M_X$, then it would have been natural to use $p_i$ for the diagonal of $P_X$. The reason for using $h_i$ instead comes from calling $P_X$ the "hat matrix" (see section 1.4.1).

The Frisch-Waugh theorem (see section 1.4.4) makes it easy to compute the
OLS estimates for model (3.39):
$$\hat\beta = \left[X'M_d X\right]^{-1}X'M_d\mathbf{y} = \left[X_{(-i)}'X_{(-i)}\right]^{-1}X_{(-i)}'\mathbf{y}_{(-i)}$$
$$\hat\lambda = (\mathbf{d}'M_X\mathbf{d})^{-1}\mathbf{d}'M_X\mathbf{y} = \frac{\tilde{e}_i}{m_i}$$
The $\hat\beta$ vector is nothing but the OLS statistic you would have found after
dropping the i-th observation. The $\hat\lambda$ parameter is more interesting: let's begin
by considering the residuals of (3.39): $\hat{\mathbf{e}} = M_W\mathbf{y}$. Its i-th element, $\hat{e}_i$, is defined as
$$\hat{e}_i = y_i - \mathbf{x}_i'\hat\beta - \hat\lambda.$$
This quantity is identically 0; to see why, note that $\mathbf{d}$ is an extraction vector (see
section 3.3.1), so you can write
$$\hat{e}_i = \mathbf{d}'\hat{\mathbf{e}} = \mathbf{d}'M_W\mathbf{y} = 0;$$
since $\mathbf{d} \in \mathrm{Sp}(W)$, then $\mathbf{d}'M_W = 0'$, and therefore $\mathbf{d}'\hat{\mathbf{e}} = \hat{e}_i = 0$. By putting the two
equations above together, you get
$$\hat\lambda = y_i - \mathbf{x}_i'\hat\beta,$$

which, in turn, means that $\hat\lambda$ is the prediction error you get if you try to predict
the i-th observation by using all the other ones. Or, put another way, if you
want to compute what the prediction error for $y_i$ would be (based on the remaining
observations), all you have to do is stick an appropriate dummy into
your model and take its coefficient. Note that in fact there is an even easier way:
since $\hat\lambda = \tilde{e}_i/m_i$, you may just as well run OLS on the full-sample model (3.38), save
its residuals and the $m_i$ series, and divide one by the other.
The cross-validation criterion is a model selection tool that is based on just
that: you simulate the out-of-sample performance of your model by adding the
squares of the n prediction errors you find by omitting each observation in turn:
$$CV = \sum_{i=1}^n e_{(-i)}^2 = \sum_{i=1}^n\left(\frac{\tilde{e}_i}{m_i}\right)^2$$

When you compare two models (say, A and B), it may well happen that model A
yields a smaller sum of squared residuals than B, but B outperforms A in terms of
the cross-validation criterion. Usually, this happens when A has a richer structure
than B (in the OLS context, more regressors); in these cases, the canonical
interpretation is that A is only apparently a better model than B: some of the
apparently significant regressors catch in fact spurious regularities that cannot
be expected to hold in general. In these cases, the term we customarily use is
overfitting.
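In code, the recipe above boils down to a handful of lines once you have the diagonal of the hat matrix; here is a generic sketch (y and X are hypothetical numpy arrays of your own):

# Sketch: leverage values and the cross-validation criterion from a single
# full-sample OLS fit (y and X are assumed to be given numpy arrays).
import numpy as np

def leverage_and_cv(y, X):
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # diagonal of P_X (leverages)
    m = 1.0 - h                                   # diagonal of M_X
    e = y - X @ (XtX_inv @ (X.T @ y))             # full-sample residuals
    loo_errors = e / m                            # leave-one-out prediction errors
    cv = np.sum(loo_errors ** 2)                  # cross-validation criterion
    return h, cv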

Data scientists are inordinately fond of the cross-validation concept, and in many cases they use sophisticated variations of this idea to pick the best forecasting model for a given problem. The variant of the cross-validation method I just illustrated, where you exclude one observation at a time, has a lot in common with an old and established statistical technique called jackknifing, which in turn is a close relative of bootstrapping (see Section 4.A.4).
In statistical learning and similar disciplines, the approach presented here is often generalised by excluding entire subsets of the entire dataset instead of one observation only, possibly choosing them in very elaborate ways. They call this folding.

In the light of the discussion above, there is something interesting we can say
on the interpretation of the magnitude $m_i$ and its complement to 1, $h_i = 1 - m_i$:
from the definition of $\hat{\mathbf{e}}$ we have
$$\mathbf{y} = X\hat\beta + \mathbf{d}\hat\lambda + \hat{\mathbf{e}};$$
by premultiplying the above by $M_X$, you get
$$M_X\mathbf{y} = \tilde{\mathbf{e}} = M_X\mathbf{d}\hat\lambda + \hat{\mathbf{e}}.$$
Therefore, since $\mathbf{d}'M_X\hat{\mathbf{e}} = \mathbf{d}'\hat{\mathbf{e}} = \hat{e}_i = 0$,
$$\tilde{\mathbf{e}}'\tilde{\mathbf{e}} = \mathbf{d}'M_X\mathbf{d}\,\hat\lambda^2 + \hat{\mathbf{e}}'\hat{\mathbf{e}}$$
so, finally,
$$\tilde{e}_i^2/m_i = \tilde{\mathbf{e}}'\tilde{\mathbf{e}} - \hat{\mathbf{e}}'\hat{\mathbf{e}}.$$

Which means: if you compare the SSR for the two models you get by using the
full sample or omitting the i -th observation, you get that their difference is al-
ways non-negative, and equals ẽ i2 /m i . Clearly, if the difference is large, the re-
sults you get by adding/removing the i -th observation are dramatically different,
so that data point deserves special attention.
The quantity ẽ i2 /m i may be large either because (a) ẽ i is large in absolute
value and/or (b) m i is close to 0 (which implies that h i is close to 1). This is why
the h i values are sometimes used as descriptive statistics to check for “influen-
tial observations”, and are sometimes referred to as leverage values. Note that
h i only depends on the regressors X, and not on y. Therefore, large values of
h i indicate observations for which the combination of explanatory variables we
have is uncommon enough to exert a substantial effect on the final estimates.

3.A.5 Derivation of RLS


As the reader doubtlessly already knows, the standard method for finding the
extrema of a function subject to constraints is the so called “Lagrange multipliers
method”. For a full description of the method, the reader had better refer to one
of the many existing texts of mathematics for economists41 , but here I’ll give you
a super-simplified account for your convenience.
If you have to find maxima and/or minima of a function $f(\mathbf{x})$ subject to a
system of constraints $g(\mathbf{x}) = 0$, you set up a function, called the Lagrangean, in
which you sum the objective function to a linear combination of the constraints,
like this:
$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda' g(\mathbf{x}).$$
The elements of the vector λ are known as "Lagrange multipliers".

[Margin portrait: Joseph Louis Lagrange]

For example, the classic microeconomic problem of a utility-maximising consumer
is represented as
$$\mathcal{L}(\mathbf{x}, \lambda) = U(\mathbf{x}) + \lambda\left(Y - \mathbf{p}'\mathbf{x}\right)$$
where $\mathbf{x}$ is the bundle of goods, $U(\cdot)$ is the utility function, Y is disposable income
and $\mathbf{p}$ is the vector of prices. In this example, the only constraint you have
is the budget constraint, so λ is a scalar.
The solution has to obey two conditions, known as the "first order conditions":
$$\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = 0 \qquad \frac{\partial\mathcal{L}}{\partial\lambda} = 0, \qquad (3.40)$$
so in practice you differentiate the Lagrangean with respect to your variables
and λ, and then check if there are any solutions to the system of equations you
get by setting the partial derivatives to 0. If the solution is unique, you're all
set. In the utility function example, applying equations (3.40) gives the standard
microeconomic textbook solution to the problem:
$$\frac{\partial U(\mathbf{x})}{\partial\mathbf{x}} = \lambda\mathbf{p} \qquad Y = \mathbf{p}'\mathbf{x};$$
in words: at the maximum, (a) marginal utilities are proportional to prices (or
$\frac{\partial U}{\partial x_i}\big/\frac{\partial U}{\partial x_j} = \frac{p_i}{p_j}$, if you prefer) and (b) you should spend all your income.

41 One I especially like is Dixit (1990), but for a nice introductory treatment I find Dadkhah (2011) hard to beat.
In the case of RLS, the Lagrangean is42
$$\mathcal{L}(\beta, \lambda) = \frac{1}{2}\mathbf{e}'\mathbf{e} + \lambda'(R\beta - d),$$
where of course $\mathbf{e} = \mathbf{y} - X\beta$. The derivative of $\mathcal{L}$ with respect to λ is just the
constraint; as for the other one, since the derivative of $\mathbf{e}$ with respect to β is $-X$,
we can use the chain rule like in section 1.A.5, arranging all the products in an
appropriate way so as to obtain column vectors, which gives
$$\frac{\partial\mathcal{L}}{\partial\beta} = -X'\mathbf{e} + R'\lambda,$$
so the first order condition with respect to β can be written as
$$X'\tilde{\mathbf{e}} = R'\lambda, \qquad (3.41)$$

where $\tilde{\mathbf{e}}$ is the vector that satisfies equation (3.41), defined as $\mathbf{y} - X\tilde\beta$. By premultiplying
(3.41) by $(X'X)^{-1}$ we get
$$(X'X)^{-1}X'(\mathbf{y} - X\tilde\beta) = \hat\beta - \tilde\beta = (X'X)^{-1}R'\lambda,$$
which of course implies
$$\tilde\beta = \hat\beta - (X'X)^{-1}R'\lambda. \qquad (3.42)$$
So the constrained solution $\tilde\beta$ can be expressed as the OLS vector $\hat\beta$, plus a "correction
factor", proportional to λ. If we premultiply (3.42) by R we get
$$\lambda = \left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d) \qquad (3.43)$$
because $R\tilde\beta = d$ by construction. Interestingly, λ itself is proportional to $(R\hat\beta - d)$,
that is precisely the quantity we use for the construction of the W statistic in
(3.18). Finally, equation (3.25) is obtained by combining (3.42) and (3.43):
$$\tilde\beta = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d).$$
42 Note that I divided the objective function by 2. Clearly, the solution is the same, but the alge-

bra is somewhat simplified.
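For completeness, the closed-form expression above is straightforward to code; the following is a generic sketch (y, X, R and d are hypothetical numpy arrays, with the restriction written as Rβ = d):

# Sketch: restricted least squares via the closed-form expression above.
import numpy as np

def rls(y, X, R, d):
    XtX_inv = np.linalg.inv(X.T @ X)
    b_ols = XtX_inv @ X.T @ y
    A = XtX_inv @ R.T
    # [R (X'X)^-1 R']^-1 (R b_ols - d), solved rather than explicitly inverted
    lam_term = np.linalg.solve(R @ A, R @ b_ols - d)
    return b_ols - A @ lam_term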



3.A.6 Asymptotic properties of the RLS estimator


Begin with (3.25):
$$\tilde\beta = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - d);$$
from this equation, it is quite easy to see that $\tilde\beta$ is an affine function of $\hat\beta$; this
will be quite useful. Now define H as
$$H = \mathrm{plim}\left((X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\right) = Q^{-1}R'\left[RQ^{-1}R'\right]^{-1}; \qquad (3.44)$$
normally, this is a $(k \times p)$ matrix with rank p; also note that HR is idempotent.
If $\hat\beta$ is consistent, then
$$\tilde\beta \overset{p}{\longrightarrow} \beta - H\cdot(R\beta - d),$$
so if $R\beta = d$, consistency is guaranteed; otherwise, it isn't.


As for efficiency, let me briefly remind you what we mean by "efficiency":43
an estimator a is more efficient than a competing estimator b if the difference
$V(b) - V(a)$ is positive. If the two estimators are vectors, then the criterion generalises
to the requirement that $V(b) - V(a)$ is a positive semi-definite matrix (psd
for short — see section 1.A.7, especially figure 1.4).
There are a few matrix algebra results that we are going to need here:44

1. if A is pd, then $A^{-1}$ is pd too;

2. if A is psd, then $B'AB$ is also psd for any matrix B.

Using these, we will prove that the asymptotic variance of $\tilde\beta$ is smaller (in a matrix
sense) than that of $\hat\beta$:
$$AV[\tilde\beta] = (I - HR)\cdot AV[\hat\beta]\cdot(I - R'H').$$
Since $AV[\hat\beta] = \sigma^2 Q^{-1}$, note that
$$HRQ^{-1} = Q^{-1}R'\left[RQ^{-1}R'\right]^{-1}RQ^{-1}, \qquad (3.45)$$
which is evidently symmetric, so $HRQ^{-1} = Q^{-1}R'H'$; furthermore, by using the
fact that HR is idempotent, you get
$$HRQ^{-1} = HRHRQ^{-1} = HRQ^{-1}R'H'.$$
As a consequence, the asymptotic variance of $\tilde\beta$ can be written as
$$AV[\tilde\beta] = \sigma^2\left\{Q^{-1} - HRQ^{-1} - Q^{-1}R'H' + HRQ^{-1}R'H'\right\} = \sigma^2\left[Q^{-1} - HRQ^{-1}\right] = AV[\hat\beta] - \sigma^2 HRQ^{-1};$$

43 A slightly fuller discussion is at the end of section 2.3.2.


44 They’re both easy to prove; try!

the last thing we need to prove is that $HRQ^{-1}$ is also positive semi-definite: for
this, we'll use the right-hand side of (3.45).
Since Q is pd, then $Q^{-1}$ is pd as well (property 1); therefore, $RQ^{-1}R'$ is also
pd (property 2), and so is $\left[RQ^{-1}R'\right]^{-1}$ (property 1). Finally, by using property
2 again, we find that $Q^{-1}R'\left[RQ^{-1}R'\right]^{-1}RQ^{-1}$ is positive semi-definite and the
result follows.
Chapter 4

Diagnostic testing in cross-sections

In order to justify the usage of OLS as an estimator, we made some assumptions
in section 3.2. Roughly:

1. the data we observe are realisations of random variables such that it makes
sense to assume that we are observing the same DGP in all the n cases
in our dataset; or, more succinctly, there are no structural breaks in our
dataset.

2. We can trust asymptotic results as a reliable guide to the distribution of
our estimators, as the n observations we have are sufficiently homogeneous
and sufficiently independent, so that the LLN and the CLT can be
taken as valid;

3. the conditional expectation $E[y|\mathbf{x}]$ exists and is linear: $E[y|\mathbf{x}] = \mathbf{x}'\beta$;

4. the conditional variance $V[y|\mathbf{x}]$ exists and does not depend on $\mathbf{x}$ at all, so
it's a positive constant: $V[y|\mathbf{x}] = \sigma^2$.

Assumption number 2 may be inappropriate for two reasons: one is that our
sample size is too small to justify asymptotic results as a reasonable approximation
to the actual properties of our statistics; the other one is that our observations
may not be identical, nor independent. The first case cannot really be tested
formally; in most cases, the data we have are given and economists almost never
enjoy the privileges of experimenters, who can have as many data points as they
want (of course, given sufficient resources). Therefore, we just assume that our
dataset is good enough for our purposes, and hope for the best. Certainly, intel-
lectual honesty dictates that we should be quite wary of drawing conclusions on
the basis of few data points, but there is not much more we can do. As for the
second problem, we will defer the possible lack of independence to chapter 5,
since the issue is most likely to arise with time-series data.


In the next section, we will consider a way of testing assumptions 1 (to some
extent) and 3. If they fail, consistency of β̂ may be at risk. Conversely, assump-
tion number 4 is crucial for our hypothesis testing apparatus, and will need
some extra tools; this will be the object of section 4.2.

4.1 Diagnostics for the conditional mean


Our main tool in proving that $\hat\beta \overset{p}{\longrightarrow} \beta$ was that $E[y|\mathbf{x}] = \mathbf{x}'\beta$ (see section 3.2.1).
But this statement may be false. We will not explore the problem in its full generality:
we'll just focus on two possible issues that often arise in practice.

1. the regression function is nonlinear, but can be approximated via a linear
function (see the discussion in section 1.3.2). In the case of a scalar
regressor,
$$E[y_i|x_i] = m(x_i) \simeq \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_q x_i^q = \sum_{j=0}^q \beta_j x_i^j.$$

2. Our data comprise observations for which the DGP is partly different. That
is, we have $j = 1\ldots m$ separate sub-populations, for which
$$E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\beta_j$$
where j is the class that observation i belongs to. For example, we have
data on European and American firms, and the vector β is different on the
two sides of the Atlantic (in this case, $m = 2$).

4.1.1 The RESET test


As I repeatedly said earlier, the hypothesis of linearity simply means that $E[y_i|\mathbf{x}_i]$
can be written as a linear combination of observable variables. The short phrase
we use is: the model has to be linear in the parameters, not necessarily in the
variables (see Section 1.3.4 for a fuller discussion).
For example, suppose $x_i$ is a scalar; it is perfectly possible to accommodate
something like
$$E[y_i|x_i] = \beta_1 x_i + \beta_2 x_i^2;$$
(to ease exposition, I am assuming here that the conditional mean has no constant
term). Suppose that the expression above holds, but in the model we estimate
the quadratic term $x_i^2$ is dropped. That is, we estimate a model like
$$y_i = \gamma x_i + u_i$$
by OLS, so that we would obtain a statistic $\hat\gamma$ defined as
$$\hat\gamma = (\mathbf{x}'\mathbf{x})^{-1}\mathbf{x}'\mathbf{y} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.$$

Clearly, there is no value of γ that can make $u_i$ the difference between $y_i$ and
$E[y_i|x_i]$, so we can't expect $\hat\gamma$ to have all the nice asymptotic properties that OLS
has. In fact, it can be proven that (in standard cases) the statistic $\hat\gamma$ does have a
limit in probability, but the number it tends to is neither $\beta_1$ nor a simple function
of it, so technically there is no way we can use $\hat\gamma$ to estimate $\beta_1$ consistently.

The limit in probability of $\hat\gamma$ is technically known as a pseudo-true value, which is far too complex a concept for me to attempt an exposition here. The inquisitive reader may want to have a look at Cameron and Trivedi (2005), section 4.7 or (more technical) Gourieroux and Monfort (1995). The ultimate bible on this is White (1994).

In the present case, the remedy is elementary: add $x_i^2$ to the list of your regressors
and, voilà, you get perfectly good CAN estimates of $\beta_1$ and $\beta_2$.1 However,
in a real-life case, where you have a vector of explanatory variables $\mathbf{x}_i$,
things are not so simple. In order to have quadratic effects, you should include
all possible cross-products between regressors. For example, a model like
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i$$
would become
$$y_i = \underbrace{\beta_0 + \beta_1 x_i + \beta_2 z_i}_{\text{linear part}} + \underbrace{\beta_3 x_i^2 + \beta_4 x_i\cdot z_i + \beta_5 z_i^2}_{\text{quadratic part}} + \varepsilon_i$$
and it's very easy to show that the number of quadratic terms becomes rapidly
unmanageable for a realistic model: if the original model has k regressors, the
quadratic one can have up to $k(k+1)/2$ additional terms.2 I don't think I have to
warn the reader on how much of a headache it would be to incorporate cubic or
quartic terms.
The RESET test (the name stands for REgression Specification Error Test) is a way to
check whether a given specification needs additional nonlinear effects or not.
The intuition is simple and powerful: instead of augmenting our model with all
the possible order-2 terms (squares and cross-products), we just use the square
of the fitted values; that is, instead of $x_i^2$, $x_i\cdot z_i$ and $z_i^2$ in the example above, we
would use
$$\hat{y}_i^2 = \left(\hat\beta_1 x_i + \hat\beta_2 z_i\right)^2.$$
Clearly, a similar strategy could be extended to cubic terms; in the example
above, we would replace the linear combination of $x_i^3$, $x_i^2\cdot z_i$, $x_i\cdot z_i^2$ and $z_i^3$ with
the simple scalar term $\hat{y}_i^3$. Then, we just check if the added terms are significant;
since this is a test for addition of variables to a pre-existing model, the most convenient
way to perform the test is by using the LM statistic (see section 3.5.1).
The procedure is then:
1 Of course, you’d have to be careful in computing your marginal effects, but if you have read

section 1.3.4, you know that, don’t you?


2 The reader is invited to work out what happens for various values of k.

1. Run OLS, save the residuals $e_i$ and the fitted values $\hat{y}_i$;

2. generate squares and cubes $\hat{y}_i^2$, $\hat{y}_i^3$;

3. run the auxiliary regression
$$e_i = \gamma'\mathbf{x}_i + \delta_1\hat{y}_i^2 + \delta_2\hat{y}_i^3 + u_i;$$

4. compute $LM = n\cdot R^2 \overset{a}{\sim} \chi^2_2$.
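Put into code, the four steps above take only a few lines; here is a generic Python sketch (y and X are hypothetical numpy arrays, with the constant included among the columns of X):

# Sketch of the RESET test: y and X are assumed to be numpy arrays,
# with X containing the constant term among its columns.
import numpy as np
from scipy.stats import chi2

def reset_test(y, X):
    # step 1: full-sample OLS, residuals and fitted values
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ b
    e = y - yhat
    # steps 2-3: auxiliary regression of e on X, yhat^2 and yhat^3
    Z = np.column_stack([X, yhat ** 2, yhat ** 3])
    g = np.linalg.lstsq(Z, e, rcond=None)[0]
    u = e - Z @ g
    R2 = 1.0 - np.sum(u ** 2) / np.sum((e - e.mean()) ** 2)
    # step 4: LM statistic and its asymptotic p-value
    LM = len(y) * R2
    return LM, chi2.sf(LM, 2)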

Example 4.1
Let us compute the RESET test to check the hedonic model used as an exam-
ple in Section 3.4 for possible neglected nonlinearities; the auxiliary regression
yields

Dependent variable: ehat

coefficient std. error t-ratio p-value


-------------------------------------------------------
const 195.825 81.6622 2.398 0.0166 **
lsize 38.9198 17.1107 2.275 0.0230 **
baths -0.205739 0.0861412 -2.388 0.0170 **
age -0.0900385 0.0393760 -2.287 0.0223 **
pool 3.97729 1.76245 2.257 0.0241 **
yh2 -3.42319 1.40727 -2.433 0.0151 **
yh3 0.103615 0.0399778 2.592 0.0096 ***

Mean dependent var 0.000000 S.D. dependent var 0.245999


Sum squared resid 152.9225 S.E. of regression 0.242381
R-squared 0.031428 Adjusted R-squared 0.029195

In this case, the LM statistic equals $n\cdot R^2 = 2610\times 0.031 = 82.0263$, which is much
bigger than 5.99 (the 5% critical value for the $\chi^2_2$ density); in fact, the p-value is
a puny $1.54243\cdot 10^{-18}$. Therefore, we reject the null and we conclude that the
model has a specification problem.

One final note: the usage of powers to model nonlinearity is widespread


in applied econometrics, but it is by no means the only available choice. If
you're interested, you may want to spend some time googling for "cubic splines"
or "fractional polynomials", the former being hugely popular among data scientists;
these techniques are useful for approximating arbitrary smooth functions
via linear combinations of observable variables, and are therefore perfectly
suited for OLS estimation. If you're into even more exotic stuff, try "loess" or
"Nadaraya-Watson".

4.1.2 Interactions and the Chow test


The problem of possible differences in the parameters between sub-samples is
best illustrated in a simple setting: we have a scalar regressor $x_i$ and two sub-populations.
A dummy variable $d_i$ tells us which group observation i belongs
to. Suppose that the DGP could be described as follows:
$$\text{Subsample 1:}\quad y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad\qquad \text{Subsample 2:}\quad y_i = \beta_0 + \beta_2 x_i + \varepsilon_i$$
with $d_i = 0$ in subsample 1 and $d_i = 1$ in subsample 2. Note that the model could
be rewritten as
$$y_i = \beta_0 + \left[\beta_1 + (\beta_2 - \beta_1)d_i\right]x_i + \varepsilon_i = \beta_0 + \beta_1 x_i + \gamma d_i\cdot x_i + \varepsilon_i = \beta_0 + \beta_1 x_i + \gamma z_i + \varepsilon_i \qquad (4.1)$$

where γ = β2 −β1 . Again, note that model (4.1) is perfectly fit for OLS estimation,
since the product z i = d i · x i is just another observable variable, which happens
to be equal to x i when d i = 1 and 0 otherwise.
If the effect of x i on y i is in fact homogeneous across the two categories,
then γ = 0; therefore, testing for the equality of β1 and β2 is easy: all you need to
do is check whether the regressor z i is significant. Explanatory variables of this
kind, that you obtain by multiplying a regressor by a dummy variable, are of-
ten called interactions in the applied economics jargon. If the interaction term
turns out to be significant, then the effect of x on y is different across the two
subcategories, since the interaction term in your model measures how different
the effect is across the two subgroups.
Clearly, you can interact as many regressors as you want: in the example
above, you could also imagine that the intercept could be different across the
two subpopulations as well, so the model would become something like

y i = β0 + β1 x i + γ0 d i + γ1 d i · x i + εi ,

because interacting the constant by d i just gives you d i .


It should be noted that interactions can be viewed as introducing a peculiar
form of nonlinearity, so you should keep this in mind when computing marginal
effects. The marginal effect for $x_i$ in equation (4.1), for example, would be
$$\frac{\partial E[y_i|x_i]}{\partial x_i} = \beta_1 + \gamma d_i,$$
that is, $\beta_1$ if $d_i = 0$ and $\beta_1 + \gamma$ if $d_i = 1$.
When you interact all the parameters by a dummy, then the test for equality
of coefficients across the two subsamples is particularly simple, and amounts to
what is known as the Chow test, since the SSR for the unrestricted model (that is,
the one with all the interactions) is just the sum of the SSRs from the two separate regressions:
if you have two subgroups, you can compute
1. the SSR for the OLS model on the whole sample (call it $S_T$);

2. the SSR for the OLS model using only the data in subsample 1 (call it $S_1$);

3. the SSR for the OLS model using only the data in subsample 2 (call it $S_2$).

Then, the Chow test is simply
$$W = n\cdot\frac{S_T - (S_1 + S_2)}{S_1 + S_2} \qquad (4.2)$$

because the SSR for the model with all the interactions is equal to the sum of the
SSRs for the separate submodels. Of course, the appropriate number of degrees
of freedom to use for the p-value would be k, the difference between the number
of parameters in the unrestricted model (k + k) and those in the restricted one
(k). The proof is contained in section 4.A.1, where I also generalise this idea to
the case when you have more than 2 subsamples.
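Computationally, the whole procedure amounts to three regressions and a ratio; here is a minimal sketch (y, X and the 0/1 indicator group are hypothetical numpy arrays):

# Sketch of the Chow test from three SSRs; y, X and the 0/1 vector `group`
# are assumed to be numpy arrays that you already have.
import numpy as np
from scipy.stats import chi2

def ssr(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

def chow_test(y, X, group):
    S_T = ssr(y, X)
    S_1 = ssr(y[group == 0], X[group == 0])
    S_2 = ssr(y[group == 1], X[group == 1])
    W = len(y) * (S_T - (S_1 + S_2)) / (S_1 + S_2)
    return W, chi2.sf(W, X.shape[1])     # k degrees of freedom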

Example 4.2 (bike sharing and the weather)


Figure 4.1 comes from a dataset on bike sharing provided in Fanaee-T and Gama
(2014) and depicts the relationship between the temperature on a given day (in
Celsius, x i in formulae) and the number of bikes rented (in thousands, y i in
formulae).

Figure 4.1: Relationship between temperature and bikes rented
[Scatter plot of daily rentals, in thousands, against temperature in °C, with the fitted quadratic line Y = -1.90 + 0.522X - 0.00896X^2]

Once you fit a quadratic model to the data, things appear to be basically OK:
R² is not at all bad, and the estimated regression line makes perfect sense, with
the negative concavity indicating that most people want to ride a bike when the
weather is warm, but not too hot.
Table 4.1: Various models for bike rentals

            Full sample     Sunny days     Cloudy/rainy    Full + interactions
const       −1.902∗∗∗       −2.659∗∗∗      −0.6698         −0.6698
            (−4.974)        (−6.281)       (−0.9444)       (−1.009)
x            0.5221∗∗∗       0.6563∗∗∗      0.3165∗∗∗       0.3165∗∗∗
            (12.70)         (14.54)        (4.005)         (4.279)
x²          −0.008956∗∗∗    −0.01243∗∗∗    −0.003717∗      −0.003717∗
            (−8.895)        (−11.41)       (−1.835)        (−1.960)
d                                                          −1.989∗∗
                                                           (−2.494)
d·x                                                         0.3398∗∗∗
                                                           (3.876)
d·x²                                                       −0.008712∗∗∗
                                                           (−3.940)
n            731             463            268             731
R²           0.4532          0.5222         0.3965          0.5115
SSR          1498.035        779.7312       558.5052        1338.2364

Note: t -statistics in parenthesis.

However, we may surmise that what happens on sunny days may be differ-
ent from rainy days. Fortunately, we also have the dummy variable d i , which
equals 1 if the weather on that day was sunny and 0 if it was cloudy or rainy.
Splitting the sample in two gives the estimates in Table 4.1: the first column
gives the estimates on the full sample (the same as in Figure 4.1). Column 2, in-
stead, contains the estimates obtained using only the sunny days and column 3
only the ones for the bad weather days.
As you can see, the estimates for sunny days are numerically different from
the ones for cloudy days. For example, the quadratic effect in column 3 seems to
be much less significant than the one in column 2. However, the real question
is: are they statistically different? Or, in other words: is there a reason to believe
that the relationship between the number of rented bikes and air temperature
depends on the weather?
In order to do so, we can run a Chow test. The mechanical way to do this
would be adding to the base model all the interactions with the “sunny” dummy.
The corresponding estimates are found in column 4 of Table 4.1. Note that the
first three coefficients in column 4 are exactly equal to those in column 3,3 and
that the coefficients in column 2 can be obtained by summing the correspon-

3 The standard errors are not: this is a side effect of the fact that model 3 and model 4 use

different estimators for σ2 and hence the two coefficient covariance matrices are different.

dent coefficients in column 4 to its “interacted” counterpart. For example, the


coefficient for x i reported in column 2 (0.6563) can be calculated as the two en-
tries in column 4 for x and its interaction with d (0.3165 + 0.3398). Put differently,
the interaction terms contain the differences between the “good weather” and
“bad weather” coefficients. Another point worth noting is that the SSR for col-
umn 4 is exactly the sum for those in columns 2 and 3, as dictated by equation
(4.2).
Of course, the fact that those interaction terms are individually significant
would be enough to conclude that the null hypothesis of homogeneity between
the two regimes has to be rejected. However, the Chow test is easy to compute
using the SSRs:
$$W = 731 \times \frac{1498.035 - (779.7312 + 558.5052)}{779.7312 + 558.5052} = 87.2888,$$
where W is a rather astronomical value for a $\chi^2_3$ distribution (the corresponding
p-value is 8.37114e-19), so we have to reject the null: we conclude that the
regression function for sunny days is different from the one for rainy days.

Historically, the Chow test has mostly been used with time-series data, where
each row of the dataset refers to a certain time period and the rows are consec-
utive. For example, data on the economy of a certain country (GDP, interest rate
etc.) in which each row refers to a quarter, so for example the dataset starts
in 1980q1, the next row is 1980q2, and so on. Regressions on data of this kind
present the user with special issues, that have to be analysed separately, and we
will do so in Chapter 5. However, it can be seen rather easily that the Chow test
lends itself very naturally to testing whether a model remains stable before and
after a certain event: just imagine that in equation (4.1) d i equals 0 up to a cer-
tain point in time and d i = 1 after that. It is for this reason that the Chow test
is sometimes referred to as the structural stability test. Rejection of the Chow
test would in this case point to something we economists often call structural
break or regime change, obvious examples being the introduction of the single
currency in the Euro Area, the COVID pandemic, etc.

In a time series context, assuming that the putative date for the break is known a priori may be unwarranted. In some cases, we may suppose that a structural break has occurred at some point, without knowing exactly when. For these situations, some clever test procedures are available (one is the so-called CUSUM test), as well as methods for estimating the timing of the break. These, however, are too advanced for this book, since they require a fairly sophisticated inferential apparatus.

4.2 Heteroskedasticity and its consequences


As the reader might recall, the homoskedasticity assumption was a fundamental
ingredient in the derivation of the asymptotic covariance matrix of the OLS
estimator (see 3.2). While the linearity assumption $E[y|\mathbf{x}] = \mathbf{x}'\beta$ makes consistency
almost automatic ($E[\varepsilon\cdot\mathbf{x}] = 0$ implies $\hat\beta \overset{p}{\longrightarrow} \beta$ if some kind of LLN can be
invoked), it's impossible to derive the parallel result for asymptotic normality
$$\sqrt{n}\left(\hat\beta - \beta\right) \overset{d}{\longrightarrow} N\left(0, \sigma^2 Q^{-1}\right)$$
without assuming homoskedasticity, that is $E[\varepsilon^2|\mathbf{x}] = \sigma^2$.


£ ¤

Like all assumptions, homoskedasticity is easier to justify in certain circum-


stances than others. If data come from controlled experiments, εi can often be
interpreted as the disturbance term that contaminated the i -th experiment; it
is normally safe to think that εi should be independent from xi , the conditions
under which the experiment was performed. Therefore, if εi ⊥ ⊥ xi , no moments
of εi can depend on xi , and homoskedasticity follows.
This is almost never the case in economics, where virtually all data come
from non-experimental settings. This is particularly true for cross-sectional data,
where we collect data about individuals who did not take part in an experiment
at all. When we estimate a wage equation, our dependent variable y i (typically,
the log wage for individual i ) will be matched against a vector of explanatory
variables xi that contain a description of that individual (education, age, work
experience and so on), and εi is simply defined as the deviation of y i from its
conditional expectation, so in principle there is no reason to think that it should
enjoy any special properties except E [εi |xi ] = 0, which holds by construction
under the linearity hypothesis.
Therefore, assuming that $\varepsilon_i$ has a finite second moment, in general all we
can say is that $E[\varepsilon_i^2|\mathbf{x}_i]$ is some function of $\mathbf{x}_i$:
$$E[\varepsilon_i^2|\mathbf{x}_i] = h(\mathbf{x}_i) = \sigma_i^2, \qquad (4.3)$$
where the function $h(\cdot)$ is of unknown form (but certainly non-linear, since $\sigma_i^2$
can never be negative). Since the variances $\sigma_i^2$ may be different across observations,
we use the term heteroskedasticity.
The reader may recall (see page 82) that this function is known as the “skedas-
tic” function, and in principle one could try to carry out an inferential analysis of
the h(xi ) function very much like we do with the regression function. However,
in this section we will keep to the highest level of generality and simply allow
for the possibility that the sequence σ21 , σ22 · · · , σ2n contains potentially different
numbers, without committing to a specific formula for h(xi ).

Contrary to what many people think, heteroskedasticity is not a property of the data, but only of the model we use, since it depends on the conditioning set you use. For example, assume that $E[y_i|x_i]$ is a constant, so the linear model we would use is
$$y_i = \beta_0 + \varepsilon_i,$$
but the variance of $\varepsilon_i$ depends on $x_i$ (for example, $\sigma_i^2 = x_i^2$). If you estimate a model in which the only regressor is the constant, the model is homoskedastic. If, however, you estimate the model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$
that is a perfectly valid representation of the data, since the true value of $\beta_1$ is 0, then the model becomes heteroskedastic, since the variance of $\varepsilon_i$ is a function of the explanatory variables.
Having said this, it is very common for applied economists to say "the data are heteroskedastic", when you can't get rid of heteroskedasticity in any meaningful model you may think of.

To simplify notation, in this section all expectation operators will be implicitly
understood as conditional on X. In other words, we will treat X as if it were a
matrix of constants. Therefore,
$$\mathbf{y} = X\beta + \varepsilon \qquad V[\varepsilon] = E[\varepsilon\varepsilon'] = \Sigma.$$
If our model is heteroskedastic, then Σ is a diagonal matrix, where elements
along the main diagonal need not be equal to each other, and it would look like
this:4
$$\Sigma = \begin{bmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_n^2 \end{bmatrix}$$

The variance of $\hat\beta$ can be simply computed as
$$V[\hat\beta] = V\left[\beta + (X'X)^{-1}X'\varepsilon\right] = (X'X)^{-1}X'\Sigma X(X'X)^{-1} \qquad (4.4)$$
and clearly if $\Sigma = \sigma^2 I$ we go back to $\sigma^2(X'X)^{-1}$. But under heteroskedasticity
the $\hat{V} = \hat\sigma^2(X'X)^{-1}$ matrix may be very far from the actual asymptotic covariance
matrix of $\hat\beta$ (shown in equation (4.4)), even asymptotically; therefore, our test
statistics are very unlikely to be $\chi^2$-distributed under $H_0$, which makes our p-values
all wrong and inference impossible.
In order to see how the situation can be remedied, it’s instructive to consider
a case of limited practical relevance, but that provides a few insights that may
help later: the case when Σ is known.

4 In fact, some of our considerations carry over to more general cases, in which Σ is a generic

symmetric, positive semi-definite matrix, but let’s not complicate matters.



4.2.1 If Σ were known


If the matrix Σ were observable, then the σ2i variances would be known (they are
just the diagonal elements of Σ), and getting rid of heteroskedasticity would be
easy: define
    ẏ_i = y_i / σ_i        ẋ_i = x_i / σ_i        u_i = ε_i / σ_i
and the model becomes
ẏ i = ẋ′i β + u i (4.5)
but clearly V [u i ] = 1 by construction,5 so you can happily run OLS on the trans-
formed variables, since ẏ i and ẋi would both be observable. The resulting esti-
mator

    β̃ = ( Σ_{i=1}^n ẋ_i ẋ_i′ )^{-1} Σ_{i=1}^n ẋ_i ẏ_i = (Ẋ′Ẋ)^{-1} Ẋ′ẏ

could also be written as

    β̃ = ( Σ_{i=1}^n x_i x_i′ / σ_i^2 )^{-1} Σ_{i=1}^n x_i y_i / σ_i^2 = (X′Σ^{-1}X)^{-1} X′Σ^{-1}y        (4.6)
and is called GLS. It can be proven that GLS is more efficient than OLS (the proof
is in subsection 4.A.2), and that its covariance matrix equals

    V[β̃] = (X′Σ^{-1}X)^{-1}

Since GLS is just OLS on suitably transformed variables, all standard properties
of OLS in the homoskedastic case remain valid, so for example you could test
hypotheses by the usual techniques.6
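As a concrete illustration, here is a minimal NumPy sketch (not the book's gretl code) in which the σ_i^2 are treated as known: it verifies that OLS on the rescaled data and the explicit formula in (4.6) return the same numbers. All data-generating choices below are mine and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    x = rng.uniform(1.0, 4.0, size=n)
    X = np.column_stack([np.ones(n), x])          # constant + one regressor
    sigma2 = 0.2 * x**2                           # "known" variances (illustrative)
    y = X @ np.array([1.0, 0.5]) + rng.normal(0.0, np.sqrt(sigma2))

    # GLS as OLS on transformed data: divide every observation by sigma_i
    w = 1.0 / np.sqrt(sigma2)
    beta_a = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]

    # GLS via equation (4.6): (X' Sigma^-1 X)^-1 X' Sigma^-1 y
    Sinv = np.diag(1.0 / sigma2)
    beta_b = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)

    assert np.allclose(beta_a, beta_b)            # the two routes coincide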

Some readers may find it intriguing to know that GLS has more or less the geometrical interpretation of OLS that I described in Section 1.4, once a more general definition of "distance" is adopted. GLS arises if ordinary Euclidean distance is generalised to

    d(x_1, x_2) = sqrt[ (x_1 − x_2)′ Σ^{-1} (x_1 − x_2) ],

which obviously becomes Euclidean distance if Σ = I. (The fact that OLS equals GLS if Σ is a scalar multiple of I is a trivial consequence.) You can apply all the usual concepts of projections etc., with the only difference that the space you're considering is somewhat "distorted".

It is interesting to note that, in the case we are considering here, Σ is di-


agonal, and therefore the operation that makes GLS equivalent to OLS on the
transformed data can be written very simply as in equation (4.5). However, it
can easily be proven that the formula (4.6) applies far more generally: all that is
5 Readers would hopefully not feel offended if I reminded them that a straightforward applica-

tion of equation (2.7) yields V [X /b] = V [X ] /b 2 .


6 In fact, we wouldn’t even have to observe Σ, as long as we had a matrix which is proportional

to Σ by a (possibly unknown) scalar factor. If we had Ω = c · Σ, where c is an unknown positive


scalar, we could use Ω instead of Σ in equation (4.6), since the scalar c would cancel out.

required is that Σ is a proper covariance matrix, that is, symmetric and positive
definite.
Of course, in ordinary circumstances Σ is unknown, but we could use this
idea to explore alternative avenues:

1. In some cases, you may have reason to believe that σ2i should be roughly
proportional to some observable variable. For example, if y i is an average
from some sampled values and n i is the size of the i -th sample, it would
be rather natural to conjecture that σ_i^2 ≃ K·n_i^{-1}, where K is some constant.
Therefore, by dividing all the observables by √n_i you get an equivalent
representation of the model, in which heteroskedasticity is less likely to be
a problem, since in the transformed model all variances should be roughly
equal to K. The resulting estimator is sometimes called WLS (for Weighted
Least Squares), because you "weight" each observation by an observable
quantity w_i. In our example, w_i = 1/√n_i.

2. The idea above can be generalised: one could try to reformulate the model
in such a way that the heteroskedasticity problem might be attenuated.
For example, it is often the case that, rather than a model like

Yi = α0 + α1 X i + u i ,

a formulation in natural logs, like

ln Yi = β0 + β1 ln X i + εi

not only leads to a more natural interpretation of the parameters (since β1


can be read as an elasticity), but also alleviates heteroskedasticity prob-
lems.
3. Even more generally, it can be proven that, if we have Σ̂ −→p Σ, we can use
it in the so-called “feasible” version of GLS, or FGLS for short:

β̃ = (X′ Σ̂−1 X)−1 X′ Σ̂−1 y;

in principle, this can be accomplished by setting an explicit functional


form for the conditional variance (the function h(·) in 4.3). It can be done,
but in most cases it’s much more difficult computationally: the resulting
estimator cannot be written in closed form as an explicit function of the
observables, but only in implicit form as the minimiser of the least squares
function. This, in turn, involves computational techniques that are stan-
dard nowadays, but are far beyond the scope of an introductory treatment
like this one.
In some cases, however, Σ can be assumed to be a function of a small set of
parameters, which can be consistently estimated separately. In that case,
FGLS is a perfectly sensible option. One example will be provided in sec-
tion 7.4.

In many cases, however, neither strategy is possible, so we may have to make do


with OLS; the next section illustrates how you can make good use of OLS even
under heteroskedasticity.

4.2.2 Robust estimation


As the previous section should have made clear, heteroskedasticity doesn’t af-
fect consistency of OLS, which therefore remains a perfectly valid estimator (it
wouldn’t be as efficient as GLS, but this is something we can live with). The real
problem is that using the "regular" estimator for V[β̂], that is

    V̂ = σ̂^2 (X′X)^{-1}

for hypothesis testing yields statistics that are not asymptotically χ2 -distributed,
so all our p-values would be wrong. On the other hand, if we could use the
correct variance for OLS (given in equation (4.4) that I’m reporting here for your
convenience)
    V[β̂] = (X′X)^{-1} (X′ΣX) (X′X)^{-1},

or anything asymptotically equivalent, inference would be perfectly standard.


This seems impossible, given that Σ is unobservable: the middle matrix in the
equation above could be written as
  
    X′ΣX = [x_1  x_2  …  x_n] · diag(σ_1^2, σ_2^2, …, σ_n^2) · [x_1  x_2  …  x_n]′ = Σ_{i=1}^n σ_i^2 x_i x_i′        (4.7)

and it would seem that in order to compute an asymptotically equivalent expres-


sion we would need the σ2i variances (or at the very least consistent estimates).
However, although Σ does in fact contain n distinct unknown elements, the
size of X′ ΣX is k × k,7 a fixed number of elements about which, in principle, we
may hope to say something as n → ∞. In other words, even if we can’t estimate
consistently the individual variances σ2i , we may be able to estimate consistently
the individual elements of the matrix X′ ΣX.
This is the basic idea that was put forward in White (1980): first, observe that
under heteroskedasticity, OLS is still consistent, so
    β̂ −→p β   =⇒
    e_i − ε_i = (y_i − x_i′β̂) − (y_i − x_i′β) = x_i′(β − β̂) −→p x_i′ 0 = 0   =⇒
    e_i −→p ε_i   =⇒   e_i^2 −→p ε_i^2 :

the difference between the OLS residuals e i and the disturbances εi should be
“small” in large samples, and likewise for their squares.
7 In fact, it’s a symmetric matrix, so the number of its distinct elements is k(k + 1)/2.

Next, define a random variable η_i as

    η_i = ε_i^2 − E[ε_i^2 | x_i]

and by a similar argument to that used in Section 3.1, you get that

    ε_i^2 = σ_i^2 + η_i

where E[η_i | x_i] = 0 by definition. Therefore, in large samples, you can approximate σ_i^2 by e_i^2 − η_i.
If you substitute this into (4.7), you get

    X′ΣX ≃ Σ_{i=1}^n e_i^2 x_i x_i′ − Σ_{i=1}^n η_i x_i x_i′.

The two elements of the right-hand side are interesting,


because the former is observable, while the latter can be
easily proven8 to be a sum of zero-mean variables, which
should converge in probability to [0] if divided by n, where
I’m using the [0] notation for “a matrix full of zeros”.
As a consequence, we'd expect that the average of e_i^2 − σ_i^2 should be a small quantity, so that

    (1/n) Σ_{i=1}^n (e_i^2 − σ_i^2) x_i x_i′ −→p [0].

Now rewrite (4.4) as


à !
n
′ −1
σ2i xi x′i (X′ X)−1
£ ¤ X
V β̂ = (X X)
i =1

Therefore, asymptotically you can estimate V[β̂] via
à !
n
′ −1
e i2 xi x′i (X′ X)−1
X
Ve = (X X) (4.8)
i =1

In fact, many variants have been proposed since White’s 1980 paper, that
seem to have better performance in finite samples, and most packages use one
of the later solutions. The principle they are based on, however, is the original
one.
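As a minimal sketch of the original (HC0) formula in (4.8), again in Python/NumPy and assuming y and X are already available as arrays (function name and layout are mine):

    import numpy as np

    def ols_hc0(y, X):
        """OLS coefficients plus White's (1980) heteroskedasticity-consistent
        covariance matrix, variant HC0 (equation (4.8))."""
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        e = y - X @ beta_hat                      # OLS residuals
        meat = (X * (e**2)[:, None]).T @ X        # sum_i e_i^2 x_i x_i'
        V_tilde = XtX_inv @ meat @ XtX_inv        # the "sandwich"
        return beta_hat, V_tilde

    # robust standard errors would then be np.sqrt(np.diag(V_tilde))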
A clever variation on the same principle that goes under the name of cluster-
robust estimation has become very fashionable in recent years. I'm not going
to describe it in this book, but you should be aware that in some circles you will
8 It's easy, really: E[η_i | x_i] = 0 means that E[η_i x_i x_i′] = E[ E[η_i | x_i] x_i x_i′ ] = [0].

be treated like the village idiot if you don’t use “clustering”. In some cases, peo-
ple do this just because it’s cool and trendy. In some contexts, however, cluster-
robust inference is quite appropriate and should be considered as a very use-
ful tool; for example, with panel datasets, which I’ll describe in Chapter 7, giv-
ing a summary treatment of clustering in Section 7.3.4. For more details, read
Cameron and Miller (2010), Cameron and Miller (2015) and MacKinnon et al.
(2023).
An even more radical solution for dealing with heteroskedasticity has be-
come quite popular over the recent past because of the enormous advancement
of our computing capabilities: it’s called the bootstrap. In many respects, the
bootstrap is a very ingenious solution for performing inference with estimators
whose covariance matrix could be unreliable, for various reasons. In a book like
this, giving a full account of the bootstrap is far too ambitious a task, and I’ll just
give you a cursory description in section 4.A.4. Nevertheless, the reader ought
to be aware that “bootstrapped standard errors” are becoming more and more
widely used in the applied literature.

Heteroskedasticity-robust standard errors, variant HC0

coefficient std. error t-ratio p-value


-----------------------------------------------------------
const 8.85359 0.0557726 158.7 0.0000 ***
lsize 1.03696 0.0270429 38.34 1.85e-255 ***
baths -0.00515142 0.0150608 -0.3420 0.7323
age -0.00238675 0.000300502 -7.943 2.92e-15 ***
pool 0.106793 0.0239646 4.456 8.69e-06 ***

Mean dependent var 11.60193 S.D. dependent var 0.438325


Sum squared resid 157.8844 S.E. of regression 0.246187
R-squared 0.685027 Adjusted R-squared 0.684544
F(4, 2605) 929.7044 P-value(F) 0.000000
Log-likelihood -42.58860 Akaike criterion 95.17721
Schwarz criterion 124.5127 Hannan-Quinn 105.8041

Table 4.2: Example: houses prices in the US (with robust standard errors)

Example 4.3
The hedonic model presented in section 3.4 was re-estimated with robust stan-
dard errors, and the results are shown in Table 4.2.
As the reader can check, all the figures in Table 4.2 are exactly the same as
those in Table 3.1, except for those that depend on the covariance matrix of
the parameters: the standard errors (and therefore, the t -statistics and their p
values) and the overall specification test. In this case, I instructed gretl to use
White’s original formula, but this is not the software’s default choice (although
results would change but marginally).

4.2.3 White’s test


Is it possible to test for homoskedasticity? Yes. In fact, many tests exist, and
they all have in common the fact that, under H0, σ_1^2 = σ_2^2 = … = σ_n^2 (that is,
homoskedasticity). In this section, I will focus on one of the most widely used,
also due to Hal White; other similar tests (that I will not describe here) go under
the names of the Breusch–Pagan and Koenker tests.
White's idea is both simple and powerful: if ε_i is homoskedastic, then both
estimators of the parameters' covariance matrix are consistent, so V̂ = σ̂^2 (X′X)^{-1}
and its robust counterpart Ṽ should be similar in large samples. Otherwise, the
two matrices should diverge from one another. Therefore, one can indirectly
spot the problem by comparing the two matrices.9
If we re-write V̂ as
à !
¡ 2 ′ ¢ ′ −1 n
′ −1 ′ −1 2
V̂ = (X X) σ̂ X X (X X) = (X X) σ̂ xi xi (X′ X)−1 ,

X
i =1

and compare this expression to (4.8), it's clear to see that any difference between the two variance estimators comes from the matrix in the middle, which equals Σ_{i=1}^n σ̂^2 x_i x_i′ for V̂ and Σ_{i=1}^n e_i^2 x_i x_i′ for Ṽ. Therefore, the difference between them

    (1/n) Σ_{i=1}^n (e_i^2 − σ̂^2) x_i x_i′

is the quantity of interest. We need a test for the hypothesis that the proba-
bility limit of the expression above is a matrix of zeros. If it were so, then the
two estimators would converge to the same limit, and therefore the two esti-
mators would coincide asymptotically; this, of course, wouldn’t happen under
heteroskedasticity. Therefore, the null hypothesis of White’s test is homoskedas-
ticity.
Note that the alternative hypothesis is left unspecified: that is, the alterna-
tive hypothesis is simply that there is at least one variance σ_i^2 that differs from
the other ones. This has two implications, one good and one bad. The good
one is that this is a fairly general test and is not specific to any assumption we
may make on the skedastic function h(xi ). The bad one is that the test is “non-
constructive”: if the null is rejected the test gives us no indication on what to
do.
It would seem that performing such a test is difficult; fortunately, an asymp-
totically equivalent test is easy to compute by means of an auxiliary regression:

e i2 = γ0 + z′i γ + u i ;

the vector zi can be defined, technically, as

    z_i = vech(x_i x_i′).

9 A generalisation of the same principle is known among econometricians as the Hausman test,

after the great Jerry Hausman. More on this in Section 6.4.



The definition of the vech (·) operator is given in Subsection 4.A.3, but in prac-
tice, zi contains the non-duplicated cross-products of xi , that is all combina-
tions of the kind x l i ·x mi (with l , m = 1 . . . k); some of them could cause collinear-
ity, so they must be dropped from the auxiliary regression (see below for an ex-
ample). Of course, if xi contains a constant term, then zi would contain all the
elements of xi , as the products 1 · x mi .
Like in all auxiliary regression, we don’t really care about its results; running
it is just a computationally convenient way to calculate the test statistic we need,
namely
LM = n · R 2 .
Under the null of homoskedasticity, this statistic will be asymptotically distributed
as χ2p , where p is the size of the vector zi .
For example: suppose that xi contains:

1. the constant;

2. two continuous variables x i and w i ;

3. a dummy variable d i

The cross products could be written as per the following “multiplication ta-
ble”
           1        x_i          w_i          d_i
    1      1        x_i          w_i          d_i
    x_i    x_i      x_i^2        x_i·w_i      x_i·d_i
    w_i    w_i      x_i·w_i      w_i^2        w_i·d_i
    d_i    d_i      x_i·d_i      w_i·d_i      d_i^2
where I indicated the elements to keep by shading the corresponding cell in grey.
Of course the lower triangle is redundant, because it reproduces the upper one,
but the element in the South-East corner must be dropped too: since d i is a
dummy variable, it only contains zeros and ones, so its square d i2 contains the
same entries as d i itself; clearly, inserting both d i and d i2 into zi would make the
auxiliary regression collinear.
Therefore, the vector zi would contain

z′i = [x i , wi , di , x i2 , xi · w i , xi · di , w i2 , w i · di ]

so the auxiliary regression would read

e i2 = γ0 + γ1 x i + γ2 w i + γ3 d i +
+ γ4 x i2 + γ5 x i w i + γ6 x i d i +
+ γ7 w i2 + γ8 w i d i + u i

where u i is the error term of the auxiliary regression. In this case, p = 8.10
10 It’s easy to prove that, if you have k regressors, then p ≤ k(k+1) − 1.
2
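Putting the recipe together, here is a hedged NumPy sketch of the whole procedure: build z_i from the non-duplicated cross-products, drop redundant (collinear) columns, run the auxiliary regression and return nR² with its asymptotic p-value. The function name and all implementation details are mine, not gretl's.

    import numpy as np
    from itertools import combinations_with_replacement
    from scipy.stats import chi2

    def white_test(y, X):
        """White's test: nR^2 from the regression of squared OLS residuals
        on the non-duplicated cross-products of the regressors."""
        n, k = X.shape
        e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]     # OLS residuals
        e2 = e**2

        # constant + all products x_l * x_m (upper triangle = vech of x_i x_i')
        cols = [np.ones(n)]
        for l, m in combinations_with_replacement(range(k), 2):
            cols.append(X[:, l] * X[:, m])

        # drop collinear columns (e.g. squares of dummies, copies of the constant)
        Z = []
        for c in cols:
            trial = np.column_stack(Z + [c])
            if np.linalg.matrix_rank(trial) > len(Z):
                Z.append(c)
        Z = np.column_stack(Z)

        gamma = np.linalg.lstsq(Z, e2, rcond=None)[0]        # auxiliary regression
        u = e2 - Z @ gamma
        R2 = 1.0 - (u @ u) / ((e2 - e2.mean()) @ (e2 - e2.mean()))
        p = Z.shape[1] - 1                                   # df: excludes the constant
        LM = n * R2
        return LM, chi2.sf(LM, p)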

Example 4.4
Running White’s heteroskedasticity test on the hedonic model for houses (see
section 3.4) yields:

White’s test for heteroskedasticity


OLS, using observations 1-2610
Dependent variable: uhat^2

coefficient std. error t-ratio p-value


------------------------------------------------------------
const 0.806236 0.160351 5.028 5.30e-07 ***
lsize -0.710993 0.134434 -5.289 1.33e-07 ***
baths 0.0936181 0.0514158 1.821 0.0688 *
age 0.00464135 0.00119047 3.899 9.91e-05 ***
pool 0.267089 0.110699 2.413 0.0159 **
sq_lsize 0.159277 0.0307527 5.179 2.40e-07 ***
X2_X3 -0.0409056 0.0240448 -1.701 0.0890 *
X2_X4 -0.00163418 0.000527404 -3.099 0.0020 ***
X2_X5 -0.164525 0.0498979 -3.297 0.0010 ***
sq_baths 0.00360873 0.00513648 0.7026 0.4824
X3_X4 0.000204179 0.000252621 0.8082 0.4190
X3_X5 0.0653657 0.0290985 2.246 0.0248 **
sq_age -4.93515e-08 4.95338e-06 -0.009963 0.9921
X4_X5 0.00245662 0.000753929 3.258 0.0011 ***

Unadjusted R-squared = 0.054056

Test statistic: TR^2 = 141.085963,


with p-value = P(Chi-square(13) > 141.085963) = 0.000000

The cross-products are: (a) the original regressors first (because the original
model has a constant) and (b) all the cross-products, except for the square of
pool, which is a dummy variable. The total number of regressors in the auxiliary
model is 14 including the constant, so the degrees of freedom for our test statistic
is 14 − 1 = 13.
Since the LM statistic is 141.1 (which is a huge number, compared to the χ^2_13 distribution), the null hypothesis of homoskedasticity is strongly rejected.
Therefore, the standard errors presented in Table 4.2 are a much better choice
than those in Table 3.1.

4.2.4 So, in practice. . .


In practice, when you estimate a model in which heteroskedasticity is a possi-
ble problem (that is, virtually every time you have cross-sectional data), you should
in principle strive for maximal efficiency, and you can do so by employing the
following algorithm, graphically depicted in Figure 4.2.

1. Start with OLS on a tentative model



2. Perform White’s test; if it doesn’t reject H0 , fine. Otherwise

3. can you reformulate the model so as to achieve homoskedasticity? If you


can, try a different formulation and start back from the top. Otherwise,

4. see if you can use FGLS. If you can, do it; otherwise

5. stick to OLS with robust standard errors.

[Flowchart: estimate the model by OLS → does White's test reject H0? If no, keep your model. If yes, can you reformulate? If yes, update the model and start again; if no, is GLS feasible? If yes, use FGLS; if no, use a robust covariance matrix.]

Figure 4.2: Heteroskedasticity flowchart

The things you can do at points 3 and 4 are many: for example, you can try
transforming your dependent variable and/or use weighting; for more details,
go back to section 4.2.1.
Note, however, that this algorithm often ends at point 5; this is so com-
mon that many people, in the applied economics community, don’t even bother
checking for heteroskedasticity and start directly from there.11 This is especially
true in some cases, where you know from the outset what the situation is. The
11 In fact, some researchers show sometimes an inclination to disregard specification issues in

hope that robust inference will magically take care of everything, which is of course not the case.
For an insightful analysis, see King and Roberts (2015).

so-called linear probability model (often abbreviated as LPM) is a notable ex-


ample.
The LPM is what you get when your dependent variable is a dummy. So for
example you may want to set up a model where y i is the employment status of an
individual, so y i = 1 if the i -th person has a job and y i = 0 otherwise. Contrary
to what happens in most cases, we know exactly what the distribution of the
dependent variable is: it’s a Bernoulli random variable:
    y_i = 1 with probability π_i,  0 with probability 1 − π_i.        (4.9)

The linearity hypothesis implies that E[y_i | x_i] = π_i = x_i′β, since the expected
value of a Bernoulli rv is, by construction, the probability of success. This is quite


weird already, because π is a probability, and therefore has to be between 0 and
1, whereas if it really was a linear function of the x variables, you could always
imagine finding an observation for which the predicted probability is outside
the [0, 1] interval. Many applied econometricians are OK with that: they con-
cede that the linearity assumption is inappropriate after all, but assume that it
shouldn’t be a problem in practice, and use it as a convenient approximation.12
But then, you also have that for a Bernoulli rv V[y_i] = π_i · (1 − π_i), and therefore

    V[y_i | x_i] = π_i (1 − π_i) = x_i′β · (1 − x_i′β),
so the conditional variance cannot be constant unless the conditional mean is


constant too. The vector of parameters β enjoys a special nature, being the vec-
tor of parameters that determine both the conditional mean and the conditional
variance. In theory, it is possible to estimate β by an elaborate FGLS strategy, but
in these cases practitioners always just use OLS with robust standard errors.

4.A Assorted results


4.A.1 Proof that full interactions are equivalent to split-sample esti-
mation
Suppose you have m categories in which you can split your sample and that all
the parameters in your model are liable to be different between the m subsam-
ples.13 Then, you can write the model as
    y_i = Σ_{j=1}^m (d_{ji} · x_i)′ β_j + ε_i        (4.10)

12 Models that overcome this questionable approach have existed for a long time: you’ll find a

thorough description of logit and probit models in any decent econometrics textbook, but for
some bizarre reason they are going out of fashion.
13 The classic Chow test occurs when m = 2; in order to study the argument below, I suggest you

to start with the special case m = 2 and generalise later.



where d j i = 1 if observation i belongs to sub-population j , and 0 otherwise.


This model can be written in matrix notation as
       
    [ y_1 ]   [ X_1   0   ...   0  ] [ β_1 ]   [ ε_1 ]
    [ y_2 ]   [  0   X_2  ...   0  ] [ β_2 ]   [ ε_2 ]
    [  :  ] = [  :    :   ...   :  ]·[  :  ] + [  :  ]        (4.11)
    [ y_m ]   [  0    0   ...  X_m ] [ β_m ]   [ ε_m ]

where y j is the segment of the y vector containing the observations for the j -th
subsample, and so forth. If you apply the OLS formula to equation (4.11), you
get
−1
X′1
    
β̂1 0 ... 0 X1 0 ... 0
 β̂ 
 2 

 0 X′2 ... 0  
  0 X2 ... 0 

 ..  =  .. .. .. .. · .. .. .. ..  ×
. .
     
 .   . . .   . . . 
β̂m 0 0 ... X′m 0 0 ... Xm
X′1
   
0 ... 0 y1

 0 X′2 ... 0  
  y2 

×  .. .. .. .. · .. =
.
   
 . . .   . 
0 0 ... X′m ym
−1 
X′1 X1 X′1 y1
 
0 ... 0

 0 X′2 X2 ... 0 


 X′2 y2 

=  .. .. .. ..  · .. 
.
   
 . . .   . 
0 0 ... X′m Xm X′m ym
(X′1 X1 )−1 X′1 y1
 

 (X′2 X2 )−1 X′2 y2 

= 
 .. 

 . 
(X′m Xm )−1 X′m ym

So clearly each β̂ j coefficient can be calculated by an OLS regression using the


data for subsample j only. Therefore, the residuals for subsample j are e j =
y j − X j β̂ j . As a consequence,

    e′e = Σ_{j=1}^m e_j′ e_j,

which in words reads: the SSR for model (4.10) is the same as the sum of the
SSRs you get for the m separate submodels. Equation (4.2) is a simple special
case when m = 2; the corresponding generalisation for a generic m is

    W = n · ( S_T − Σ_{j=1}^m S_j ) / ( Σ_{j=1}^m S_j )        (4.12)

and the degrees of freedom for the test equals k · (m − 1).



Now note that if you take q to be the “reference” category14 , you can rewrite
equation (4.10) as
    y_i = x_i′β + Σ_{j≠q} d_{ji} x_i′γ_j + ε_i

where γ j = β j − βq by a simple generalisation of the argument at the start of


section 4.1.2. As a consequence, you can compare the model above with the
model where all the γ j vectors are 0 by comparing the SSR for the restricted
model y i = x′i β + εi (call it e′ e) against the sum of the SSRs of the m separate
submodels (call them e_j′e_j, with j = 1 . . . m), and the corresponding Wald-type
statistic would be exactly equation (4.12).

4.A.2 Proof that GLS is more efficient than OLS


In order to prove that V[β̂] − V[β̃] is psd, I'll use the properties of psd matrices
that I listed in section 3.A.6, plus a few more

1. if A and B are invertible and A − B is psd, then B −1 − A −1 is also psd;

2. if A is psd, there always exists a matrix H such that A = H H ′ ;15

3. all idempotent matrices are psd.

Therefore, to check the relative efficiency of β̂ and β̃ , we’ll perform an equiv-


alent check on ∆ ≡ V[β̃]^{-1} − V[β̂]^{-1} (by property 1 above). To prove that ∆ is psd,
start from its definition:

    ∆ ≡ V[β̃]^{-1} − V[β̂]^{-1} = X′Σ^{-1}X − (X′X)(X′ΣX)^{-1}(X′X);

since Σ is pd, we can write it as Σ = H H ′ (by property 2), so that Σ−1 = (H ′ )−1 H −1 :

    ∆ = X′(H′)^{-1}H^{-1}X − (X′X)(X′HH′X)^{-1}(X′X) =
      = (H^{-1}X)′ [ I − H′X(X′HH′X)^{-1}X′H ] (H^{-1}X).

Now define W = H ′ X and re-express ∆ as:

    ∆ = (H^{-1}X)′ [ I − W(W′W)^{-1}W′ ] (H^{-1}X) = (H^{-1}X)′ M_W (H^{-1}X);

since MW is idempotent, it is psd (property 3); but then, the same is true of
(H −1 X)′ MW (H −1 X); therefore, the claim follows.
Note that under heteroskedasticity Σ is assumed to be diagonal, but the above
proof holds for any non-singular covariance matrix Σ.
14 I will not offend the reader’s intelligence by writing the obvious double inequality 1 ≤ q ≤ m.
15 Note: H is not unique, but that doesn’t matter here. By the way, it is also true that if a matrix

H exists such that A = H H ′ , then A is psd, but we won’t use this result here.

4.A.3 The “vec” and “vech” operators


In some cases, it can be useful to reshape the contents of a matrix so as to trans-
form it into a vector. The “vec” operator does just that: it stacks the columns of
a matrix below one another. For example,
 
    vec [ a  c ]  =  (a, b, c, d)′
        [ b  d ]

or more generally

    vec [ x_1  x_2  …  x_k ]  =  (x_1′, x_2′, …, x_k′)′.
The “vech” operator works in a similar way, but is generally applied to sym-
metric matrices: the difference from “vec” is that the redundant elements are
not considered. For example:

    vech [ x  y ]  =  (x, y, z)′.
         [ y  z ]

More generally, if A is an n ×n symmetric matrix, vech (A) is a vector holding the


n(n+1)/2 elements on and below its diagonal.
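For the computationally minded, here is a tiny NumPy sketch of a vech() helper (my own function, not part of NumPy), stacking the lower triangle column by column:

    import numpy as np

    def vech(A):
        """Stack the elements of A on and below the main diagonal, column by column."""
        A = np.asarray(A)
        rows, cols = np.tril_indices(A.shape[0])
        order = np.argsort(cols, kind="stable")   # tril_indices walks by rows; re-sort by column
        return A[rows[order], cols[order]]

    A = np.array([[1., 2., 4.],
                  [2., 3., 5.],
                  [4., 5., 6.]])
    print(vech(A))    # -> [1. 2. 4. 3. 5. 6.]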

4.A.4 The bootstrap


For a reliable account, get hold of Efron and Hastie (2016) (Bradley Efron is
none other than the inventor of the technique), or MacKinnon (2006) for a more
econometrics-oriented approach. Here, I’m just giving you a basic intuition on
what the bootstrap is. Suppose you have an estimator

θ̂ = T (X),

where X is a data matrix with n rows. Clearly, in order to perform inference,


you need to have an idea of what the distribution of the random variable θ̂ is.
Asymptotically, the CLT may be of help, but perhaps your sample size is not
large enough to trust the asymptotic approximation given by the CLT; and even
if you’re willing to take the asymptotic distribution as an acceptable approxima-
tion, the covariance matrix of θ̂ may be unknown, or difficult to compute.
Of course, given your data X you can compute θ̂ just once, but if you could
observe many different datasets with the same distribution, then you could com-
pute your estimator many times and get an idea of the distribution of your statis-
tic by looking at the different values of θ̂ you get each time.

The idea is to use your observed data X to produce, with the aid of computer-
generated pseudo random numbers, H alternative datasets Xh , with h = 1 . . . H ,
and compute your estimator for each of them, so you end up with a collection
θ̂1 , θ̂2 , . . . , θ̂H . This procedure is what we call bootstrapping,16 and the H realisa-
tions you get of your statistic are meant to give you an idea of the actual, finite-
sample distribution of the statistic itself.
Then, one possible way of computing V (θ̂) is just to take the sample variance
of the bootstrap estimates:

    θ̄ = (1/H) Σ_{h=1}^H θ̂_h
    Ṽ(θ̂) = (1/H) Σ_{h=1}^H ( θ̂_h − θ̄ )^2

How do you generate your artificial datasets Xh ? There is a myriad of ways


to do this, but when the observations are iid,17 the simplest solution is just to
resample from the rows of X with replacement, as exemplified in Table 4.3; the
example uses the scripting language of gretl, but it should be relatively easy to
translate this into any language that you like better.18 Note that the rows are
picked with replacement, which means that you have near-certainty that some
of the rows of X will be present in your “fake” dataset Xh more than once and
some others won’t be there at all.
You may find it puzzling, but a simple argument should give you an idea of
why this is done. Suppose you have only 3 data points; x 1 , x 2 and x 3 . If your
data are iid, then each of your observations is equally likely, so you could have
observed, with the same probability, each of the following 27 datasets:

X1 = (x 1 , x 1 , x 1 )
X2 = (x 1 , x 1 , x 2 )
X3 = (x 1 , x 1 , x 3 )
X4 = (x 1 , x 2 , x 1 )
X5 = (x 1 , x 2 , x 2 )
X6 = (x 1 , x 2 , x 3 )
..
.
X26 = (x 3 , x 3 , x 2 )
X27 = (x 3 , x 3 , x 3 )
16 According to Efron, “[i]ts name celebrates Baron Munchausen’s success in pulling himself up

by his own bootstraps from the bottom of a lake” (Efron and Hastie, 2016, p. 177), although the
story is reportedly a little different. However, the name was chosen to convey the idea of the
accomplishment of something apparently impossible without external help.
17 When data are not independent, things get a bit more involved.
18 Warning: the algorithm in Table 4.3 wouldn’t be a computationally efficient way to get the job

done. It’s just meant to illustrate the procedure in the most transparent way possible.

ad it’s only by chance that you observed X6 instead of any of the others. The
number 27 comes from the fact that the number of possible datasets is n^n, so in this case 3^3 = 27. Clearly, the estimator θ̂_h can be computed for each of the 27 cases and various descriptive statistics can be computed easily. In realistic cases, computing θ̂_h for each possible sample is impossible, since n^n is astronomical:
therefore, we just randomly extract H samples and use those.

# allocate space for H estimates (H is the number of bootstrap replications)
matrix thetas = zeros(H, 1)

# generate H simulated datasets and corresponding estimators
loop h = 1..H
    Xh = zeros(n, k)              # start with a matrix of zeros
    loop i = 1..n                 # for each row of our dataset
        idx = randgen1(i, 1, n)   # pick a random row index between 1 and n
        # put the idx-th row of the true data into the i-th row of
        # the simulated data
        Xh[i,] = X[idx,]
    endloop
    # now compute the estimator on the generated data Xh and store it
    thetas[h] = estimator(Xh)     # this would be the T(X) function
endloop

# compute the variance of the simulated thetas
V = mcov(thetas)

Note: this is not meant to run “out of the box”. The script above assumes that a
few objects, such as the scalars n or H, or the function estimator() have already
been defined.

Table 4.3: Elementary example of bootstrap


Chapter 5

Dynamic Models

5.1 Dynamic regression


In cross sectional datasets, it is quite natural to assume that the most useful in-
formation set on which to condition the distribution of y i is xi . Why should we
consider, for the conditional distribution of y i , the information available for in-
dividual j as relevant (with i ̸= j )? In some cases, there could be something to
this; perhaps individuals i and j have some unobservable feature in common,
but in most cross-sectional datasets this shouldn’t be something to worry about.
This argument does not apply to time-series datasets. Here, we have two
fundamental differences from cross-sectional datasets:

1. Data have a natural ordering.

2. At any given point in time, we can take as known what happened in the
past (and, possibly, at present time), but the future remains unknown.

This means that, if we want to condition y_t on something1, we may proceed
as in chapter 3 and consider E[y_t | x_t], but this is unlikely to be a good idea, es-
pecially in the light of a feature that most economics time series display, that is,
they are very persistent.
Persistence is a loose term we use for describing the quality, that time se-
ries often possess, whereby contiguous observations look more like each other
than distant ones. In other words, persistence is the observable consequence of
the fact that most phenomena evolve gradually through time.2 In fact, you may
think of time series as something with “memory” of the past. The information
embodied in a time series dataset is not only in the numbers it contains, but
1 Attention: for this chapter, I’m going to switch to a slightly different notation convention than

what I used in the previous chapters. Since we’re dealing with time series, I will use the symbols t
and T instead of i and n, so for example the dependent variable has values y 1 , . . . , y t , . . . , y T .
2 In fact, the econometric treatment of time series data has become, since the 1980s, such a

vast and complex subject that we may legitimately treat time-series econometrics as a relatively
autonomous scientific field (with financial econometrics as a notable sub-field).


also in the sequence in which they come, as if the data told you a story. If you
scramble the ordering of the rows in a cross-sectional dataset, the information
remains intact; in a time series dataset, most of it is gone.
For example: figure 5.1 shows log of real GDP and log of private consump-
tion in the Euro area between 1995 and 2019 (y and c, respectively).3 By looking
at the plot, it just makes sense to surmise that c t −1 may contain valuable infor-
mation about c t , even more than y t does.

[time-series plot of the two series y and c, 1995–2020]

Figure 5.1: Consumption and income in the Euro area (in logs)

Therefore:
• the choice of x_t as the conditioning set for E[y_t | x_t] says implicitly that
information on what happened before time t is not of our interest (which
is silly);

• since observations are very unlikely to be independent, there is no ground


for assuming that the covariance matrix of y_t − E[y_t | x_t] is diagonal.

In the early days of econometrics, this situation was treated in pretty much
the same way as we did with heteroskedasticity in section 4.2, that is, by consid-
ering a model like
y t = x′t β + εt (5.1)

and working out solutions to deal with the fact that E[εε′] = Σ is not a diagonal matrix (although the elements on the diagonal might well be constant).


The presence of non-zero entries outside the diagonal was commonly called
the “autocorrelation” or “serial correlation” problem. In order to define this con-
cept,4 let us begin by defining what the autocovariance of a sequence of random
3 Source: Eurostat.
4 You may also want to take a look at Section 5.A.2.

variables is: suppose you have T random variables observed through time

z1 , z2 , . . . , z t , . . . , zT .

The covariance between z t and z s is an autocovariance, since it’s the covariance


of a random variable “with itself at a different time”, so to speak. Clearly, if this
quantity is different from 0, the two random variables z t and z s cannot be inde-
pendent. If we standardise this covariance as

    ρ_{t,s} = Cov[z_s, z_t] / sqrt( V[z_t] · V[z_s] )

we have something called autocorrelation. In most cases, it makes sense to as-


sume that the correlation between z s and z t is only a function of how far they
are from each other; that is, assume that ρ t ,s is just a function of |t − s|; if this is
the case, the quantity Corr [z t −1 , z t ] = Corr [z t , z t +1 ] = . . . is called first-order au-
tocorrelation or autocorrelation of order 1. Generalisation is straightforward.

[correlogram: sample ACF of c up to lag 20, with ±1.96/√T bands]

Figure 5.2: Sample autocorrelation for the log consumption series

Example 5.1
Figure 5.2 displays the sample autocorrelations for the log consumption series
shown in Figure 5.1. As you can see, the numbers are very different from 0. For
example, the first 3 sample correlations equal

ρ̂ 1 = Corr [z t , z t −1 ] = 0.9627
ρ̂ 2 = Corr [z t , z t −2 ] = 0.9259
ρ̂ 3 = Corr [z t , z t −3 ] = 0.8881

and it would be hard to argue that the random variables contained in this time
series are independent.
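Sample autocorrelations like these are easy to compute by hand; here is a minimal NumPy sketch (the function and its conventions are mine; z is assumed to be a plain one-dimensional array):

    import numpy as np

    def sample_acf(z, max_lag):
        """Sample autocorrelations rho_hat_1 ... rho_hat_max_lag, dividing every
        autocovariance by the lag-0 sample autocovariance."""
        z = np.asarray(z, dtype=float)
        zc = z - z.mean()
        gamma0 = zc @ zc / len(z)
        return np.array([(zc[k:] @ zc[:-k]) / len(z) / gamma0
                         for k in range(1, max_lag + 1)])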

Clearly, if the autocorrelation between εt and εs is nonzero for some t and s,



Σ cannot be diagonal, so GLS solutions have been devised5 , and a clever general-
isation of White’s robust estimator (due to Whitney Newey and Kenneth West) is
also available, but instead of “fixing” OLS, a much better strategy is to rethink our
conditioning strategy. That is, instead of employing clever methods to perform
acceptable inference on equation (5.1), we’d be much better off if we redefined
our object of interest altogether.
What we want to do is using all the possibly relevant available information
as our conditioning set; to this end, define the information set at time t as6
    ℑ_t = { x_1, x_2, …, x_t, y_1, y_2, …, y_{t-1} };

(note that ℑt includes xt ). For example, in order to build a model where con-
sumption is the dependent variable and the only explanatory variable is income
(a dynamic consumption function, if you will), it may make sense to condition
consumption on the whole information set ℑt .
Therefore, the conditioning operation will be done by using all the variables
relevant for the distribution of y t that can be assumed to be known at that time.
Clearly, that includes the current value of xt , but also the past of both y t and xt .
Possibly, even future variables that are known with certainty at time t ; variables
such as these are normally said to be deterministic. Apart from the constant
term (x t = 1), popular examples include time trends (eg x t = t ), seasonal dummy
variables (eg x t = 1 if t is the month of May), or more exotic choices, such as the
number of days in a given month, that is known in advance. Note that (this will
be very important) ℑt is an element of a sequence where ℑt −1 ⊆ ℑt ⊆ ℑt +1 ; in
other words, the sequence of information sets is increasing.7
Now consider the conditional expectation E[y_t | ℑ_t]; even under the linearity
assumption, this object may have two potentially troublesome characteristics:
1. since the sequence ℑ_t is increasing, E[y_t | ℑ_t] may contain information
   that goes indefinitely back into the past, and

2. E[y_t | ℑ_t] could be different for each t.

If none of the above is true, things are much simplified; under the additional
assumption of linearity of the conditional mean,
    E[y_t | ℑ_t] = Σ_{i=1}^p α_i y_{t-i} + Σ_{i=0}^q β_i′ x_{t-i},

5 For the readers who are into the history of econometrics: the so-called Cochrane-Orcutt es-

timator and its refinements are totally forgotten today, but they were a big thing back in the 1960s
and 1970s.
6 To be rigorous, we should define the information set by using a technical tool called a σ-

field. This ensures that ℑt contains all possible functions of the elements listed above (∆y t −1 , for
example). But in an introductory treatment such as this, I’ll just use the reader’s intuition and use
ℑt as “all the things we knows at time t ”.
7 Or, if you will, we are assuming that we always learn and never forget.

where p and q are finite numbers. Although in principle ℑt contains all the past,
no matter how remote, only the most recent elements of ℑt actually enter the
conditional expectation. A slightly more technical way of expressing the same
concept is: we are assuming that there is a subset of ℑt (call it Ft ), that contains
only recent information, such that conditioning on ℑt or Ft makes no differ-
ence:
    E[y_t | ℑ_t] = E[y_t | F_t],        (5.2)
where ℑt ⊃ Ft . In practice, Ft is the relevant information at time t .
The linearity assumption makes the regression function of y t on ℑt a differ-
ence equation, that is, a relationship in which an element of a sequence y t is
determined by a linear combination of its own past and the present and past of
another sequence xt ;8 if we proceed in a similar way as in chapter 3, and define
ε_t ≡ y_t − E[y_t | ℑ_t], we can write the so-called ADL model:

    y_t = Σ_{i=1}^p α_i y_{t-i} + Σ_{i=0}^q β_i′ x_{t-i} + ε_t.        (5.3)

The ADL acronym is for Autoregressive Distributed Lags (some people prefer the
ARDL acronym): in many cases, we call the above an ADL(p, q) model, to make
it explicit that the conditional mean contains p lags of the dependent variable
and q lags of the explanatory variables.
Of course, it would be very nice if we could estimate the above parameters
via OLS. Clearly, the first few observations would have to be discarded, but once
this is done, we may stack the remaining observations (t = p+1, …, T) into a vector y and a matrix W whose rows are the vectors w_t′ defined below, so that the model reads9

    y = Wγ + ε

where wt is defined as

w′t = [y t −1 , y t −2 , . . . , y t −p , x′t , x′t −1 , . . . , x′t −q ],

and
γ ′ = [α1 , α2 , . . . , αp , β0′ , β1′ , . . . , βq′ ].
8 Note: this definition works for our present purposes, but in some cases you may want to con-

sider non-linear relationships, or cases which involve future values.


9 I assumed for simplicity that p ≥ q; of course, potentially collinear deterministic terms would

have to be dropped.

Given this setup, clearly the OLS statistic can be readily computed with the
usual formula (W′ W)−1 W′ y, but given the nature of the conditioning, one may
wonder if OLS is a CAN estimator of the α and β parameters. As we will see in
section 5.3, the answer is positive, under certain conditions.
Before we focus on the possible inferential difficulties, however, it is instruc-
tive to consider another problem. Even if the parameters of the conditional ex-
pectation E[y_t | ℑ_t] were known and didn't have to be estimated, how do we in-
terpret them?

5.2 Manipulating difference equations


Given the difference equation
    y_t = Σ_{i=1}^p α_i y_{t-i} + Σ_{i=0}^q β_i′ x_{t-i}

we may ask ourselves: what is the effect of x on y after a given period? That is:
how does xt affect y t +h ? Since the coefficients αi and βi do not depend on t ,
we may rephrase the question as: what is the impact on y t of something that
happened h periods ago, that is xt −h ? Clearly, if h = 0 we have a quantity that
is straightforward to interpret, that is the instantaneous impact of xt in y t , but
much is to be gained by considering magnitudes like

    d_h = ∂y_t / ∂x_{t-h} = ∂y_{t+h} / ∂x_t;        (5.4)
the d h parameters take the name of dynamic multipliers, or just multipliers for
short. In order to find a practical and general way to compute them, we will need
a few extra tools. Read on.

5.2.1 The lag operator


Time series are nothing but sequences of numbers, with a natural ordering given
by time. In many cases, we may want to manipulate sequences by means of
appropriate operators. The lag operator is generally denoted by the letter L by
econometricians (statisticians prefer B — savages!); it’s an operator that turns
a sequence x t into another sequence, that contains the same objects as x t , but
shifted back by one period.10 If you apply L to a constant, the result is the same
constant. In formulae,
Lx t = x t −1
Repeated application of the L operator n times is indicated by L n , and therefore
L n x t = x t −n . By convention, L 0 = 1. The L operator is linear, which means that,
10 In certain cases, you might want to use the lead operator, usually notated as F , which is de-

fined as the inverse to the lag operator (F x t = x t +1 , or F = L −1 ). I’m not using it in this book, but
its usage is very common in economic models with rational expectations.

if a and b are constant, then L(ax t + b) = aLx t + b = ax t −1 + b. These simple


properties have the nice consequence that, in many cases, we can manipulate
the L operator algebraically as if it was a number. This trick is especially useful
when dealing with polynomials in L. Allow me to exemplify:

Example 5.2
Call b t the money you have at time t , and s t the difference between the money
you earn and the money you spend between t − 1 and t (in other words, your
savings). Of course,
b t = b t −1 + s t .
Now the same thing with the lag operator:

b t = Lb t + s t → b t − Lb t = (1 − L)b t = ∆b t = s t

The ∆ operator, which I suppose not unknown to the reader, is defined as (1−L),
that is a polynomial in L of degree 1. The above expression simply says that the
variation in the money you have is your net saving.

Example 5.3
Call q t the GDP for the Kingdom of Verduria in quarter t . Obviously, yearly GDP
is given by
y t = q t + q t −1 + q t −2 + q t −3 = (1 + L + L 2 + L 3 )q t
Since (1 + x + x 2 + x 3 )(1 − x) = (1 − x 4 ), if you “multiply” the equation above11 by
(1 − L) you get
∆y t = (1 − L 4 )q t = q t − q t −4 ;
The variation in yearly GDP between quarters is just the difference between the
quarterly figures a year apart from each other.

A polynomial P (x) may be evaluated at any value, but two cases are of special
interest. Obviously, if you evaluate P (x) for x = 0 you get the “constant” coeffi-
cient of the polynomial, since P (0) = p 0 + p 1 · 0 + p 2 · 0 + · · · = p 0 ; instead, if you
evaluate P (1) you get the sum of the polynomial coefficients:
    P(1) = Σ_{j=0}^n p_j · 1^j = Σ_{j=0}^n p_j.

This turns out to be quite handy when you apply a lag polynomial to a constant,
since
    P(L)µ = Σ_{j=0}^n p_j µ = µ Σ_{j=0}^n p_j = P(1)µ.

11 To be precise, we should say: ‘if you apply the (1 − L) operator to the expression above’.

There are two more routine results that come in very handy: the first one has
to do with inverting polynomials of order 1. It can be proven that, if |α| < 1,

    (1 − αL)^{-1} = 1 + αL + α^2 L^2 + ⋯ = Σ_{i=0}^∞ α^i L^i;        (5.5)

the other one is that a polynomial P (x) is invertible if and only if all its roots are
greater than one in absolute value:
    1/P(x) exists  iff  P(x) = 0 ⇒ |x| > 1.        (5.6)
The proofs are in subsection 5.A.1.

Example 5.4 (The Keynesian multiplier)


Let me illustrate a possible use of polynomial manipulation by a very old-school
macro example: the simplest possible version of the Keynesian multiplier idea.
Suppose that

Yt = Ct + It ; (5.7)
Ct = αY t −1 ; (5.8)

where Y t is GDP, C t is aggregate consumption and I t is investment; 0 < α < 1 is


the marginal propensity to consume.
By combining the two equations,

Y t = αY t −1 + I t → (1 − αL)Y t = I t .

therefore, by applying the first degree polynomial A(L) = (1 − αL) to the Y t se-
quence (national income), you get the time series for investments, simply be-
cause I t = Y t −C t = Y t − αY t −1 .
If you now invert the A(L) = (1 − αL) operator,

    Y_t = (1 + αL + α^2 L^2 + ⋯) I_t = Σ_{i=0}^∞ α^i I_{t-i}:

aggregate demand at time t can be seen as a weighted sum of past and present
investment. Suppose that investment goes from 0 to 1 at time 0. This brings
about a unit increase in GDP via equation (5.7); but then, at time 1 consumption
goes up by α, by force of equation (5.8), so at time 2 it increases by α2 and so on.
Since 0 < α < 1, the effect dies out eventually.
If investments were constant through time, then I_t = Ī; therefore, A(L)Y_t = Ī becomes

    Y_t = (1/A(L)) Ī = (1/A(1)) Ī = Ī / (1 − α),

where the second equality comes from the fact that Ī is constant. The rightmost
expression is nothing but the familiar “Keynesian multiplier” formula.

A word of caution: in many cases, it’s OK to manipulate L algebraically as if


it was a number, but sometimes it’s not: the reader should always keep in mind
that the expression Lx t does not mean ‘L times x t ’, but ‘L applied to x t ’. The
following example should, hopefully, convince you.

Example 5.5
Given two sequences x_t and y_t, define the sequence z_t as z_t = x_t · y_t. Obviously,
z t −1 = x t −1 y t −1 ; however, one may be tempted to argue that

z t −1 = x t −1 y t −1 = Lx t Ly t = L 2 x t y t = L 2 z t = z t −2

which is obviously absurd.

5.2.2 Dynamic multipliers


When considering an ADL model, the problem that we are ultimately after is:
how do we interpret its parameters? Let’s start from a difference equation like
the following:
A(L)y t = B (L)x t

where the degrees of the A(L) and B (L) polynomials are p and q, respectively.
If the polynomial A(L) is invertible, the difference equation is said to be stable.
In this case, we may define D(L) = A(L)−1 B (L) = B (L)/A(L); as a rule, D(L) is of
infinite order (although not necessarily so):


    y_t = D(L) x_t = Σ_{i=0}^∞ d_i x_{t-i}.

This is all we need for dealing with our problem, if you consider that the dynamic
multipliers as defined in equation (5.4),

    d_i = ∂y_t / ∂x_{t-i} = ∂y_{t+i} / ∂x_t,

are simply the coefficients of the D(L) polynomial.12 It is possible to calculate


them analytically by inverting the A(L) polynomial, but doing so is neither in-
structive nor enjoyable. On the contrary, the same effect can be achieved by
using a nice recursive algorithm.
The impact multiplier d 0 is easy to find, since d 0 = D(0) = B (0)/A(0), which
simply equals β0 (since A(0) = 1). All other multipliers can be found by means of
(5.4), which can be used to express d i in terms of d i −1 , d i −2 etc. Once you have
d 0 , the rest of the sequence follows.
12 As we will see in section 5.3.1, invertibility of A(L) is not only required for the calculation of

the multipliers, but also for the CAN property of OLS.



Let me show you a practical example. For an ADL(1,1) model,

y t = αy t −1 + β0 x t + β1 x t −1 , (5.9)

use the definition of a multiplier as a derivative and write

    d_0 = ∂y_t/∂x_t = ∂(α y_{t-1} + β_0 x_t + β_1 x_{t-1})/∂x_t = β_0
    d_1 = ∂y_t/∂x_{t-1} = ∂(α y_{t-1} + β_0 x_t + β_1 x_{t-1})/∂x_{t-1} = α ∂y_{t-1}/∂x_{t-1} + β_1 = α d_0 + β_1,

where we used the property

    ∂y_{t-1}/∂x_{t-1} = ∂y_t/∂x_t = d_0

in such a way that d 1 is expressed as a function of d 0 ; similarly,

    d_2 = ∂y_t/∂x_{t-2} = ∂(α y_{t-1} + β_0 x_t + β_1 x_{t-1})/∂x_{t-2} = α ∂y_{t-1}/∂x_{t-2} = α d_1

and so on, recursively.


A nice and cool way to express the above is by saying that the multipliers can
be calculated through a difference equation with the same polynomials as the
original one; the sequence of multipliers obeys the relationship

A(L)d i = B (L)u i , (5.10)

where u i is a sequence that contains 1 for u 0 , and 0 everywhere else. This makes
it easy to calculate the multipliers numerically, given the polynomial coefficients,
via appropriate software.

Example 5.6 (Multiplier calculation)


Take for example the following difference equation:

y t = 0.2y t −1 + 0.4x t + 0.3x t −2 .

In this case, A(L) = 1 − 0.2L and B (L) = 0.4 + 0.3L 2 . The inverse of A(L) is

    A(L)^{-1} = 1 + 0.2L + 0.04L^2 + 0.008L^3 + ⋯ = Σ_{i=0}^∞ 0.2^i L^i;

therefore,

    B(L)/A(L) = (0.4 + 0.3L^2) × (1 + 0.2L + 0.04L^2 + 0.008L^3 + ⋯).

The two polynomials can be multiplied directly, as in

    B(L)/A(L) = 0.4 × (1 + 0.2L + 0.04L^2 + 0.008L^3 + ⋯) +
                + 0.3L^2 × (1 + 0.2L + 0.04L^2 + 0.008L^3 + ⋯) =
              = 0.4 + 0.08L + 0.016L^2 + 0.0032L^3 + ⋯ +
                + 0.3L^2 + 0.06L^3 + 0.012L^4 + 0.0024L^5 + ⋯ =
              = 0.4 + 0.08L + 0.316L^2 + 0.0632L^3 + ⋯

but it’s really boring. The recursive approach is much quicker:

    d_0 = B(0)/A(0) = 0.4/1 = 0.4
    d_1 = 0.2 · d_0 = 0.08
    d_2 = 0.2 · d_1 + 0.3 = 0.016 + 0.3 = 0.316
    d_3 = 0.2 · d_2 = 0.0632

and so on.
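The recursion in (5.10) takes a handful of lines of code; here is a minimal NumPy sketch (my own helper, not the book's software) that reproduces the numbers above, with A(L) and B(L) passed as coefficient vectors:

    import numpy as np

    def dynamic_multipliers(a, b, horizon):
        """Multipliers d_0 ... d_horizon of A(L) y_t = B(L) x_t via the recursion
        A(L) d_i = B(L) u_i, with u_0 = 1 and u_i = 0 otherwise.
        a = [1, -alpha_1, ..., -alpha_p], b = [beta_0, ..., beta_q]."""
        d = np.zeros(horizon + 1)
        for i in range(horizon + 1):
            bi = b[i] if i < len(b) else 0.0
            # d_i = beta_i + alpha_1 d_{i-1} + ... + alpha_p d_{i-p}
            d[i] = bi - sum(a[j] * d[i - j] for j in range(1, min(i, len(a) - 1) + 1))
        return d

    # the example above: y_t = 0.2 y_{t-1} + 0.4 x_t + 0.3 x_{t-2}
    print(dynamic_multipliers([1, -0.2], [0.4, 0.0, 0.3], 3))   # -> 0.4, 0.08, 0.316, 0.0632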

In certain cases, the multipliers d_i may all have the same sign. If so, the sequence π_i = d_i / c has all the characteristics of a discrete probability distribution: all the π_i coefficients are non-negative and sum to 1. Therefore, it makes sense to compute quantities such as the mean lag or the median lag. For example, the mean lag can be defined as m = Σ_{i=0}^∞ i · π_i, and can be given a nice interpretation as the "average" time span it takes x_t to affect y_t.
Note, however, that in general the sequence d_i may well include positive and negative numbers, and the long-run multiplier c could even be 0; in those cases, the notion itself of mean lag is meaningless.

5.2.3 Interim and long-run multipliers


If you go back to the definition of the multipliers, (5.4), that is d_i = ∂y_t/∂x_{t-i} = ∂y_{t+i}/∂x_t,
it is quite natural to interpret the magnitude d h as the effect of something that
happened h periods ago on what we see today. The implicit idea in this defini-
tion is that the source of the dynamic behaviour in our system is a one-off event.
In many cases, instead, we could be interested in computing the effect on
y t of a permanent change in x t . Clearly, at time 0 the effect will be equal to the
impact multiplier d 0 , but after one period the instantaneous effect will overlap
with the lagged one, so the effect will be equal to d 0 + d 1 . By induction, we may
define a new sequence of multipliers as

    c_j = d_0 + d_1 + ⋯ + d_j = Σ_{i=0}^j d_i.        (5.11)

These are called interim multipliers and measure the effect on y t of a perma-
nent change in x t that took place j periods ago. In order to see what happens in
the long run after a permanent change, we may also want to consider the long-
run multiplier c = lim_{j→∞} c_j. Calculating c is much easier than it may seem,
since

    lim_{j→∞} c_j = Σ_{i=0}^∞ d_i = D(1);

that is: c is the number you get by evaluating the polynomial D(z) at z = 1; since D(z) = B(z)/A(z), c can be easily computed as c = D(1) = B(1)/A(1).

Example 5.7 (interim multipliers)


Let’s go back to the difference equation we used in example 5.6:

y t = 0.2y t −1 + 0.4x t + 0.3x t −2 .

Interim multipliers are easily computed:

c0 = d 0 = 0.4
c1 = d 0 + d 1 = c 0 + d 1 = 0.48
c2 = d 0 + d 1 + d 2 = c 1 + d 2 = 0.796

and so on. The limit of this sequence (the long-run multiplier) is also easy to
compute:
    c = D(1) = B(1)/A(1) = 0.7/0.8 = 0.875
Et voilà.
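In code, interim multipliers are just cumulative sums of the dynamic multipliers, and the long-run multiplier is B(1)/A(1); here is a self-contained NumPy sketch for the example above:

    import numpy as np

    # dynamic multipliers of y_t = 0.2 y_{t-1} + 0.4 x_t + 0.3 x_{t-2}
    d = np.array([0.4, 0.08, 0.316, 0.0632])

    c_interim = np.cumsum(d)        # c_0, c_1, c_2, c_3
    print(c_interim)                # -> 0.4, 0.48, 0.796, 0.8592

    A1, B1 = 1 - 0.2, 0.4 + 0.3     # A(1) and B(1)
    print(B1 / A1)                  # long-run multiplier c = 0.875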

The long-run multiplier c is very important, because it describes the rela-


tionship between y t and x t in steady state. The concept of steady state is of
paramount importance in econometrics, because it is the closest you get to what
you refer to as “equilibrium” in theoretical economics: by “equilibrium”, we usu-
ally mean that there is no internal force that pushes the state of the system away
from where it currently is. Therefore, if a system is in equilibrium, all the vari-
ables that describe it will remain stable through time until an external shock
occurs.
When the dynamic behaviour of a system is described by a difference equa-
tion, the concept of steady state can be explained as follows: suppose we fix x t
at a certain value that stays the same forever. Is there a limit value for y t ? It
can be shown that the limit exists as long as the difference equation is stable; if
this condition is met, then the system admits a steady state. The steady state is
a long-run equilibrium: in steady state, neither y t nor x t change until external
shocks come from outside the system to perturb it.

Mathematically, if the system is in steady state, both variables are invariant


through time, so in steady state y t = Y and x t = X (note that Y and X bear no
subscript); as a consequence,

B (1)
A(L)y t = B (L)x t ⇒ A(L)Y = B (L)X ⇒ A(1)Y = B (1)X ⇒ Y = X = cX,
A(1)

where we used the property that, if X is a constant sequence, L n X = X . There-


fore, the system is not in equilibrium any time y t ̸= c x t . As we will see, this trivial
observation will become important later.

5.3 Inference on OLS with time-series data


At this point, we know how to interpret the coefficients of a difference equation.
An ADL model (equation (5.3), reproduced here for the reader’s convenience in
lag-polynomial notation)
A(L)y t = B (L)xt + εt (5.12)

is basically a difference equation plus an error term; therefore, the coefficients


of the two polynomials A(L) and B (L) are unobservable, but perhaps we could
find a CAN estimator.
We showed in section 5.1 how OLS can be applied to a dynamic model by
defining the X matrix and the y vector appropriately. The question now is: is
OLS a CAN estimator of the ADL parameters? The answer is positive, if certain
conditions are satisfied.

5.3.1 Martingale differences


Define wt like at the end of section 5.1, as

w′t = [y t −1 , y t −2 , . . . , y t −p , x′t , x′t −1 , . . . , x′t −q ],

so that we can write our dynamic model as

y t = w′t γ + εt

where of course γ ′ = [α1 , α2 , . . . , αp , β0′ , β1′ , . . . , βq′ ].


The first important requirement is that the second moments of wt exist and
that
T⁻¹ Σ_{t=1}^{T} w_t w′_t →p Q

where Q is invertible. The conditions under which we can expect this to happen
are quite tricky to lay down formally. Here, I’ll just say that in order for every-
thing to work as expected, it is sufficient that our observed data are realisations

of covariance-stationary and ergodic stochastic processes.13 For a summary de-


scription of what this means, I have written subsection 5.A.2 at the end of this
chapter. If you can’t be bothered, just take this to mean that all moments up to
the fourth order of all the observables exist and are stable through time.
On top of this, a fundamental ingredient for OLS being a CAN estimator of
the parameters in equation (5.12) is that εt be a martingale difference sequence
(or MDS for short).14
Roughly speaking, a MDS is a sequence of random variables whose expected
value (conditional on a certain information set) meets certain requirements:

• a martingale with respect to ℑ_{t−1} is a sequence of random variables X_t
  such that E[X_t |ℑ_{t−1}] = X_{t−1};

• if X_t is a martingale, then ∆X_t is a MDS: E[∆X_t |ℑ_{t−1}] = 0.

Of course, if we could condition y t on ℑt then εt would be a MDS by con-


struction:
E[ε_t |ℑ_t] = E[ y_t − E[y_t |ℑ_t] | ℑ_t ] = E[y_t |ℑ_t] − E[y_t |ℑ_t] = 0

(this is essentially the same argument we used in section 3.1). But of course we
can’t use ℑ_t in practice; however, we’re assuming (see equation 5.2) that there
exists a subset F_t ⊂ ℑ_t such that E[y_t |ℑ_t] = E[y_t |F_t], so F_t (which is usable,
because it’s finite) is just as good. So if you condition y_t on F_t, the quantity
ε_t = y_t − E[y_t |F_t] is a MDS and all is well.

However, what happens if you use a conditioning set G_t that is “too small”?
That is, that doesn’t contain F_t? In that case, the difference u_t = y_t − E[y_t |G_t] is
not a MDS with respect to ℑ_t: if E[y_t |G_t] ≠ E[y_t |F_t], then

E[u_t |ℑ_t] = E[ y_t − E[y_t |G_t] | ℑ_t ] = E[y_t |F_t] − E[y_t |G_t] ≠ 0.

On the contrary, it is easy to prove that in the opposite case, when you condition
on a subset of ℑt that is larger than Ft , no problems arise.
This remark is extremely important in practice because the order of the poly-
nomials A(L) and B (L) (p and q, respectively) are not known: what happens if
we get them wrong? Well, if they are larger than the “true” ones, then our con-
ditioning set contains Ft , and all is well. But if they’re smaller, the disturbance
term of our model is not a MDS, and all inference collapses. For example, if
p = 2 and q = 3, then Ft contains y t −1 , y t −2 , x t , x t −1 , x t −2 and x t −3 . Any set of
regressors that doesn’t include at least these renders inference invalid.
13 I’m being very vague and unspecific here: if you want an authoritative source on the asymp-

totics for dynamic models, you’ll want to check chapters 6 and 7 in Davidson (2000).
14 MDSs arise quite naturally in inter-temporal optimisation problems, so their usage in eco-

nomic and finance models with uncertainty is very common. In these contexts, an MDS is, so to
speak, something that cannot be predicted in any way from the past. For a thorough discussion,
see Hansen and Sargent (2013), chapter 2.

Therefore, εt is a MDS, if we pick p and q large enough. The obvious impli-


cation is that E[ε_t |w_t] = 0 (since w_t is contained in ℑ_t). If we also add a ho-
moskedasticity assumption E[ε²_t |w_t] = σ², then we have a set of results that par-

allel completely those in section 3.2. To put it simply, everything works exactly
the same way as in cross-sectional models: the martingale property ensures that
E[w_t · ε_t |ℑ_t] = 0, and therefore γ̂ →p γ; additionally,

√T (γ̂ − γ) →d N(0, σ² Q⁻¹).

In practice, the whole testing apparatus we set up for cross sectional datasets
remains valid; the t statistic, the W statistic, everything. Nice, isn’t it? In addi-
tion, since the dynamic multipliers are continuous and differentiable functions
of the ADL parameters γ , we can simply compute the multipliers from the esti-
mated parameters γ̂ and get automatically CAN estimators of the multipliers.15

The homoskedasticity assumption is not normally a problem, except for financial data at
high frequencies (eg daily); for those cases, you get a separate class of models, the most
notable example of which is the so-called GARCH model. I will not consider these models
here, but they are extremely important in the field of financial econometrics. In case we
want to stick with OLS, robust estimation is perfectly viable.

5.3.2 Testing for autocorrelation and the general-to-specific approach

Basically, we need a test for deciding, on the basis of the OLS residuals, whether
εt is a MDS or not. Because if it were not, the OLS estimator would not be con-
sistent for the ADL parameters, let alone have the asymptotic distribution we
require for carrying out tests. As I argued in the previous section, εt cannot be a
MDS if we estimate a model in which the orders p and q that we use for the two
polynomials A(L) and B (L) are too small.
Most tests hinge on the fact that a MDS cannot be autocorrelated:16 for the
sake of brevity, I don’t prove this here, but the issue is discussed in section 5.A.3
if you’re interested. Therefore, in practice, the most important diagnostic check
on a dynamic regression model is checking for autocorrelation: if we reject the
null of no autocorrelation, then εt cannot be a MDS.
All econometric software pays tribute to tradition by reporting a statistic in-
vented by James Durbin and Geoffrey Watson in 1950, called DW statistic in their
honour. Its support is, by construction, the interval between 0 and 4, and ide-
ally it should be close to 2. It is practically useless, because it only checks for
autocorrelation of order 1, and there are several cases in which it doesn’t work
15 Unfortunately, the function linking multipliers and parameters is nonlinear, so you need the

delta method to compute their asymptotic variance. See section 2.3.2.


16 If I were insufferably pedantic, I would say “a MDS with finite second moments”.

(notably, when lags of the dependent variable are among the regressors); there-
fore, nobody uses it anymore, although all software packages routinely print it
out as a homage to tradition.
The Godfrey test (also known as the Breusch-Godfrey test, or the LM test for
autocorrelation) is much better; the idea is to augment the original model with lagged residuals:

A(L)y t = B (L)xt + γ1 e t −1 + γ2 e t −2 + · · · + γh e t −h + εt

where e t is the t -th OLS residual and h is known as the order of the test. There
is no precise rule for choosing h; the most important aspect to consider is “how
long is the period we can reasonably expect to consider long enough for dynamic
effects to show up?”. When dealing with macro time series, a common choice is
2 years. That is, it is tacitly assumed that nothing can happen now, provoke no
effects for two years, and then suddenly do something.17 Therefore, you would
use h = 2 for yearly data, h = 8 for quarterly data, and so on. But clearly, this is a
very subjective criterion, so take it with a pinch of salt and be ready to adjust it
to your particular dataset.
This test, being a variable addition test, is typically implemented as an LM
test (see section 3.5.1) and is asymptotically distributed (under H0 ) as χ2h . In
practice, you carry out an auxiliary regression of the OLS residuals e t against wt
and h lags of e t ; you multiply R 2 by T and you’re done.
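To make the recipe concrete, here is a minimal sketch of the auxiliary-regression version of
the test (not from the book; plain Python/numpy, with a made-up function name, and the
zero-padding of the first h lagged residuals is just the usual convention, stated here as an
assumption):

    import numpy as np
    from scipy import stats

    def breusch_godfrey(e, W, h):
        # e: OLS residuals (length T); W: matrix of original regressors w_t (T x k),
        # including the constant; h: order of the test
        T = len(e)
        lags = np.column_stack([np.r_[np.zeros(j), e[:-j]] for j in range(1, h + 1)])
        Z = np.column_stack([W, lags])          # auxiliary regressors: w_t plus h lags of e_t
        b = np.linalg.lstsq(Z, e, rcond=None)[0]
        ehat = e - Z @ b
        R2 = 1.0 - (ehat @ ehat) / (e @ e)      # e has zero mean when W contains a constant
        LM = T * R2                             # asymptotically chi-square with h df under H0
        return LM, stats.chi2.sf(LM, df=h)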
The Godfrey test is the cornerstone of the so-called general-to-specific esti-
mation strategy: since the polynomial orders p and q are not known in practice,
one has to make a guess. There are three possible situations:

1. your guess is exactly right; you’re a lucky bastard.

2. Your guess is wrong because you overestimated p and/or q: in this case,


your model contains the “true” one and the disturbance term will still be a
MDS; hence, the probability of the Godfrey test rejecting the null hypoth-
esis is 5%. The only slight inconvenience is that you’re using too many pa-
rameters. This is not a problem, however, because asymptotic inference is
valid and you can trim your model down by using ordinary specification
tests (see section 3.3).

3. Your guess is wrong because you underestimated one of p and q: your


model does not contain the “true” one and the disturbance term will not
be a MDS. In this case, the Godfrey test should reject the null.

So the idea of the general-to-specific approach is: start from a large model,
possibly ridiculously oversized. Then you can start refining it by ordinary hy-
pothesis tests, running diagnostics18 at each step to make sure your reduction
was not too aggressive.
17 “My cousin told me he knows a secret move. . . ” (“Mi ha detto mio cuggino che sa un colpo segreto. . . ”, EELST).
18 The most important test to run at this stage is of course the Godfrey test, but other diagnostics,

such as the RESET test for example, won’t hurt.



5.4 An example, perhaps?


If we were to ignore the points I raised at the beginning of this chapter, we could
simply use the data depicted in Figure 5.1 to estimate the parameters of what an
economist in the 1970s would have called a “consumption function” and regress
consumption at time t on a time trend and GDP at time t . If we did, we’d obtain
a “static model”
c t = β0 + β1 t + β2 y t + ε t ,
whose output is in Table 5.1.
OLS, using observations 1995:1-2019:4 (T = 100)
Dependent variable: c

coefficient std. error t-ratio p-value


----------------------------------------------------------
const -0.372424 0.391519 -0.9512 0.3439
time -0.000436889 0.000107381 -4.069 9.64e-05 ***
y 0.986243 0.0272713 36.16 4.18e-58 ***

Mean dependent var 13.95257 S.D. dependent var 0.100926


Sum squared resid 0.007333 S.E. of regression 0.008695
R-squared 0.992728 Adjusted R-squared 0.992578
F(2, 97) 6621.001 P-value(F) 1.9e-104
Log-likelihood 334.1324 Akaike criterion -662.2647
Schwarz criterion -654.4492 Hannan-Quinn -659.1016
rho 0.866455 Durbin-Watson 0.254935

Breusch-Godfrey test for autocorrelation up to order 4


TR^2 = 78.249011, with p-value = 4.08e-16

Table 5.1: Static regression example

Superficially, it would look as if the static model is a rather good one: R 2


looks great, but this is common with trending data (as macro time series typi-
cally are). The important thing is that ρ̂ = 0.866 and the Godfrey test rejects the
null hypothesis with a vengeance. In an equation like the one above, there is no
way the disturbance term εt can be thought of as an MDS. Therefore, not only
is inference invalid: there is also much more that can be said about how income affects
consumption through time.
If instead we enlarge the information set to ℑt , the model we come up with
is an ADL(1,2) model. In practice:
c t ≃ k + αc t −1 + β0 y t + β1 y t −1 + β2 y t −2 ;
table 5.2 contains the OLS estimates, that is α̂ = 0.894, β̂0 = 0.589, and so on.
Also note that the time trend, which appeared to be highly significant in the
static model, drops out in the dynamic model.
In this case the Godfrey test cannot reject the null, so we may be confident
that inference is correct. The next thing we want to do now is interpreting the
output from an economic point of view.

Model 2: OLS, using observations 1995:3-2019:4 (T = 98)


Dependent variable: c

coefficient std. error t-ratio p-value


--------------------------------------------------------
const 0.220191 0.0643345 3.423 0.0009 ***
c_1 0.893634 0.0378199 23.63 4.39e-41 ***
y 0.588653 0.0648987 9.070 1.90e-14 ***
y_1 -0.612108 0.109440 -5.593 2.23e-07 ***
y_2 0.110455 0.0655676 1.685 0.0954 *

Mean dependent var 13.95702 S.D. dependent var 0.096924


Sum squared resid 0.001048 S.E. of regression 0.003357
R-squared 0.998850 Adjusted R-squared 0.998800
F(4, 93) 20190.32 P-value(F) 1.0e-135
Log-likelihood 421.7841 Akaike criterion -833.5681
Schwarz criterion -820.6433 Hannan-Quinn -828.3403
rho 0.079847 Durbin’s h 0.852445

Breusch-Godfrey test for autocorrelation up to order 4:


TR^2 = 3.484938, with p-value = 0.48

Table 5.2: Dynamic regression example

Example 5.8 (Multipliers for the Euro consumption function)


The calculation of the sequence of multipliers for the model in Table 5.2 can be
undertaken by using equation (5.10); the estimates of two polynomials we need
are

Â(L) = 1 − 0.893634L
B̂(L) = 0.588653 − 0.612108L + 0.110455L²,

so in this case we have

d i = 0.893634d i −1 + 0.588653u i − 0.612108u i −1 + 0.110455u i −2 ,

where u 0 = 1 and u i = 0 for i ̸= 0. Therefore, d 0 equals

d 0 = 0.588653

while for d 1 and d 2 we have

d1 = 0.893634 · d 0 − 0.612108 = −0.0860674


d2 = 0.893634 · d 1 + 0.110455 = 0.0335424,

and so on. With a little effort (and appropriate software), you get the following
results:

i di ci
0 0.588653 0.588653
1 -0.0860674 0.502585
2 0.0335424 0.536128
3 0.0299747 0.566103
4 0.0267864 0.592889
5 0.0239372 0.616826
6 0.0213911 0.638217
.. .. ..
. . .

where I also added a column for the interim (cumulated) multipliers. Moreover,
you have that A(1) = 1 − 0.893634 = 0.106366, B(1) = 0.087, and therefore the
long-run multiplier equals c = 0.087/0.106366 = 0.81793.

5.5 The ECM representation


As I argued in section 5.2.2, the best way to interpret the parameters of an ADL
model is by computing the dynamic multipliers (and possibly cumulating them).
The multipliers that are presumably of most interest from an economic view-
point are (a) the impact multiplier d 0 (because it measures what happens in-
stantaneously) and (b) the long run-multiplier (because it measures what hap-
pens when all adjustment has taken place).
Both are easy to compute, since d 0 = B (0)/A(0) and c = B (1)/A(1). Neverthe-
less, there is a way to rewrite an ADL model in such a way that these quantities
are even more evident: the so-called ECM representation. This device amounts,
essentially, to expressing a difference equation in a slightly modified form, so
that certain quantities appear more clearly. In fact, this is an example of the “re-
parametrisation” trick I described in section 1.4.3: the difference equation
that underlies the statistical model is exactly the same, just written in a different
way.
As for what the acronym means. . . it’s a long story. Sir
David Hendry, who is considered the father of ECM (or at
least, one of the fathers) is adamant on Equilibrium Correc-
tion Mechanism, which is probably the most precise way to
express the concept. Unfortunately, this is not the original
choice. Back in the day, when the phrase was introduced in
the deservedly famous article by Davidson et al. (1978), the
original expansion was Error Correction Model, and most
people I know (including myself) keep using the old name.
To illustrate how the ECM representation works, let’s
start from the simple case of an ADL(1,1) (here xt is a vector):

y t = αy t −1 + β0′ xt + β1′ xt −1 ;

from the definition of the ∆ operator, evidently y t = y t −1 + ∆y t and xt = xt −1 +


∆xt . After substitution,
∆y t = (α − 1)y t −1 + β0′ ∆xt + (β0 + β1 )′ xt −1
which, after rearranging terms, yields
∆y_t = β₀′∆x_t + (α − 1)[ y_{t−1} − ((β₀ + β₁)′/(1 − α)) x_{t−1} ]    (5.13)
Which means: the time variation in y t (on the left-hand side) may come
from variation in xt , with response β0 (the impact multiplier); however, even if
∆xt = 0 there may be some variation in y t if the term in square brackets is non-
zero. This term can also be written as
y t −1 − c′ xt −1
where c = (β₀ + β₁)/(1 − α), that is the long-run multiplier vector. In practice, the above
expression, commonly referred to as ECM term, gives you the difference (at t −
1) between the actual value y t −1 and the value that (given xt −1 ) the dependent
variable should have taken if the system had been in equilibrium.
If |α| < 1, then (α − 1) is negative: if the ECM term is positive (so y t −1 was
larger than its equilibrium value), then ∆y t will be negative, so y t would tend
to get closer to equilibrium. Evidently, this situation is reversed when the ECM
term is negative, so if (α − 1) < 0, the dynamic system has an inherent tendency
to go back to a steady state. To be more precise, the number 1 − α can be seen
as the fraction of disequilibrium that gets re-absorbed in one period, so that the
closer α is to 0, the faster adjustment occurs.
You can always go from the ADL representation to the ECM representation
(and back), for polynomials A(L) and B(L) of any order: for the algebra-loving
reader, the formal proof is in section 5.A.4. In general, however, if
A(L)y t = m t + B (L)xt + εt
where the order of A(L) is p and the order of B (L) is q, then the ECM represen-
tation is
H (L)∆y t = m t + K (L)∆xt − A(1)y t −1 + B (1)xt −1 + εt ,
where the orders of H(L) and K(L) are p − 1 and q − 1, respectively. For example:

Example 5.9 (ECM Representation)


Let’s take another look at the difference equation I used in example 5.7:
y t = 0.2y t −1 + 0.4x t + 0.3x t −2
and compute the ECM representation. The quickest way to do this is to re-
express all the terms relative to time t − 1:
yt = y t −1 + ∆y t
xt = x t −1 + ∆x t
x t −2 = x t −1 − ∆x t −1 ;

now substitute

y t −1 + ∆y t = 0.2y t −1 + 0.4(x t −1 + ∆x t ) + 0.3(x t −1 − ∆x t −1 )

and collect
∆y t = −0.8y t −1 + 0.7x t −1 + 0.4∆x t − 0.3∆x t −1

so finally
∆y_t = 0.4∆x_t − 0.3∆x_{t−1} − 0.8 [ y_{t−1} − 0.875x_{t−1} ];

the impact multiplier is 0.4, the long-run multiplier is 0.875; the fraction of dis-
equilibrium that re-adjusts each period is 0.8.

Note that the ADL model and the ECM are not two different models, but are
simply two ways of expressing the same difference equation. As a consequence,
you can use OLS on either and get the same residuals. The only difference be-
tween them is that the ECM form makes it more immediate for the human eye
to calculate the parameters that are most likely to be important for the dynamic
properties of the model: that is, the long-run multipliers and the convergence
speed. On the other hand, the ADL form allows for simple (and, most impor-
tantly, mechanical) calculation of the whole sequence of dynamic multipliers.
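If you want to check this equivalence numerically, here is a small simulation sketch (not
from the book; plain Python/numpy, with made-up parameter values): the same ADL(1,2) is
estimated by OLS in both forms, the residuals coincide, and A(1), B(1) and the long-run
multiplier can be read off either set of estimates.

    import numpy as np

    rng = np.random.default_rng(1)
    T = 500
    x = rng.normal(size=T)
    y = np.zeros(T)
    for t in range(2, T):   # simulate an ADL(1,2) with known parameters
        y[t] = 0.2 + 0.9*y[t-1] + 0.6*x[t] - 0.6*x[t-1] + 0.1*x[t-2] + rng.normal(scale=0.1)

    ols = lambda X, z: np.linalg.lstsq(X, z, rcond=None)[0]

    # ADL form: y_t on [1, y_{t-1}, x_t, x_{t-1}, x_{t-2}]
    X_adl = np.column_stack([np.ones(T-2), y[1:-1], x[2:], x[1:-1], x[:-2]])
    b_adl = ols(X_adl, y[2:])
    e_adl = y[2:] - X_adl @ b_adl

    # ECM form: dy_t on [1, dx_t, dx_{t-1}, y_{t-1}, x_{t-1}]
    dy, dx = np.diff(y), np.diff(x)
    X_ecm = np.column_stack([np.ones(T-2), dx[1:], dx[:-1], y[1:-1], x[1:-1]])
    b_ecm = ols(X_ecm, dy[1:])
    e_ecm = dy[1:] - X_ecm @ b_ecm

    print(np.allclose(e_adl, e_ecm))                          # identical residuals: True
    print(1 - b_adl[1], -b_ecm[3])                            # A(1), from either form
    print((b_adl[2] + b_adl[3] + b_adl[4]) / (1 - b_adl[1]),  # long-run multiplier (ADL form)
          b_ecm[4] / -b_ecm[3])                               # long-run multiplier (ECM form)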

OLS, using observations 1995:3-2019:4 (T = 98)


Dependent variable: dc

coefficient std. error t-ratio p-value


--------------------------------------------------------
const 0.220191 0.0643345 3.423 0.0009 ***
dy 0.588653 0.0648987 9.070 1.90e-14 ***
dy_1 -0.110455 0.0655676 -1.685 0.0954 *
c_1 -0.106366 0.0378199 -2.812 0.0060 ***
y_1 0.0870005 0.0333187 2.611 0.0105 **

Mean dependent var 0.003681 S.D. dependent var 0.004886


Sum squared resid 0.001048 S.E. of regression 0.003357
R-squared 0.547335 Adjusted R-squared 0.527866
F(4, 93) 28.11251 P-value(F) 2.61e-15
Log-likelihood 421.7841 Akaike criterion -833.5681
Schwarz criterion -820.6433 Hannan-Quinn -828.3403

Table 5.3: Dynamic regression in ECM form

Example 5.10 (ECM on real data)


The ECM representation of the model shown in table 5.2 is easily computed after

performing the following substitutions:

ct = c t −1 + ∆c t
yt = y t −1 + ∆y t
y t −2 = y t −1 − ∆y t −1

Hence,

∆c_t = k + (α − 1)c_{t−1} + β₀∆y_t + (β₀ + β₁ + β₂)y_{t−1} − β₂∆y_{t−1} + ε_t,

that is
∆c_t = k + β₀∆y_t − A(1)[ c_{t−1} − c·y_{t−1} ] − β₂∆y_{t−1} + ε_t;

so, after substituting the estimated numerical values (and rounding results a lit-
tle),

∆c_t = 0.220 + 0.589∆y_t − 0.110∆y_{t−1} − 0.106 [ c_{t−1} − 0.818y_{t−1} ] + ε_t.

Note, however, that this representation could have been calculated directly by
applying OLS to the model in ECM form: it is quite clear from Table 5.3 that what
gets estimated is the same model in a different form. Not only can the parameters of
each representation be calculated from the other one: the objective function
(the SSR) is identical for both models (and equals 0.001048); clearly, the same
happens for all the statistics based on the SSR. The only differences (eg the R 2
index) come from the fact that the model is transformed in such a way that the
dependent variable is not the same (it’s c t in the ADL form and ∆c t in the ECM
form).

5.6 Hypothesis tests on the long-run multiplier


In some cases, it may be of interest to test hypotheses on c, such as H0 : c =
k. One way to do this could be to use the estimator of c provided by the OLS
estimates
ĉ = B̂(1)/Â(1)
and then working out its asymptotic distribution, but this is complicated by the
fact that ĉ is a nonlinear function of the estimated parameters,19 so the delta
method (see section 2.3.2, particularly equation (2.14)) would be required. A
much simpler way comes from observing that

c = k ⇐⇒ B (1) − k · A(1) = 0

which is a linear test and, as such, falls under the R β = d jurisdiction.


19 You have Â(1) in the denominator, so for example in an ADL(1,1) c = (β₀ + β₁)/(1 − α), and the
Jacobian term would be J = (1/(1 − α)) [1 1 c].

It may be worth mentioning here that tests of this type behave in the ordinary way only if the
assumptions we made in section 5.3.1 are valid. There are some important cases when this may
not be true, notably when the data we are working with are generated by non-stationary DGPs.

The test is particularly easy when k = 1, which is a common hypothesis to


test, since it implies, if true, that the two variables under consideration are pro-
portional to each other in the long run. In this case, the hypothesis becomes

H0 : α1 + · · · + αp + β0 + · · · + βq = 1

that can be tested quite easily.


The test is even easier if you start from the estimates of the model in ECM
form: all you have to do is set up a test that involves just 2 parameters, since the
parameter for y_{t−1} is just −Â(1) (note the minus sign) and the parameter for x_{t−1}
is B̂(1).

Example 5.11
Suppose that we have the following estimates:

ŷ t = 0.75y t −1 + 0.53x t − 0.24x t −1

with the following covariance matrix:

V̂_ADL = 0.001 ×  [ 5    0.5   −2
                         5     4
                               5 ]

(the matrix is symmetric; only the upper triangle is shown)
The hypothesis c = 1 implies α+β0 +β1 = 1. Therefore, a Wald test can be set up
with R = [1 1 1] and d = 1 (see section 3.3.2 for details). Therefore
R β̂ − d = [1 1 1] [0.75  0.53  −0.24]′ − 1 = 0.04
R · V̂_ADL · R′ = 0.001 × 20 = 0.02
W = 0.04² / 0.02 = 0.08
which leads of course to accepting H0 , since its p-value is way larger than 5%
(P(χ²₁ > 0.08) = 0.777). The same test could have been performed even more
easily from the ECM representation:

∆ŷ_t = 0.53∆x_t − 0.25y_{t−1} + 0.29x_{t−1}

with the associated covariance matrix


V̂_ECM = 0.001 ×  [ 5    0.5     9
                         5   −1.5
                               18 ]

(again symmetric, upper triangle shown)

In this case the hypothesis can be written as H0 : B (1) − A(1) = 0, so for the ECM
form

R β̂ − d = [0 1 1] [0.53  −0.25  0.29]′ = 0.04
R · V̂_ECM · R′ = 0.001 × 20 = 0.02

and of course the W statistic is the same as above.
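For the record, the arithmetic of this example is easily checked in a few lines; this is just
a sketch (not from the book), using numpy and scipy with the numbers of the ADL form:

    import numpy as np
    from scipy import stats

    beta = np.array([0.75, 0.53, -0.24])                    # ADL(1,1) estimates
    V = 0.001 * np.array([[5, 0.5, -2], [0.5, 5, 4], [-2, 4, 5]])
    R = np.array([[1.0, 1.0, 1.0]])
    d = np.array([1.0])

    r = R @ beta - d                                        # 0.04
    W = r @ np.linalg.solve(R @ V @ R.T, r)                 # 0.08
    print(W, stats.chi2.sf(W, df=len(d)))                   # 0.08, p-value about 0.777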

5.7 Forecasting and Granger causality


One of the cool things you can do with an ADL model is forecasting. Here’s how
it works: suppose we have data that goes from t = 1 to t = T , and that our model
of choice is an ADL(1,1). What can we say, on these premises, about y_{T+1}, which
we haven’t yet observed? The random variable y_{T+1} can be represented as

y T +1 = αy T + β0 x T +1 + β1 x T + εT +1 ; (5.14)

of all the objects that appear on the right-hand side of the equation, the only
ones that are known with certainty at time T are y T and x T . Suppose we also
know for certain what the future value x T +1 will be, and call it x T +1 = x̌ T +1 .
Therefore, since εt is a martingale difference sequence, its conditional expec-
tation with respect to ℑT +1 is 0,20 so

E[y_{T+1} |ℑ_T] = α y_T + β₀ x̌_{T+1} + β₁ x_T
V[y_{T+1} |ℑ_T] = σ²

Following the same logic as in Section 3.7, we can use the conditional expecta-
tion as predictor and the estimated values for the parameters instead of the true
ones. Therefore, our prediction will be

ŷ T +1 = α̂y T + β̂0 x̌ T +1 + β̂1 x T

and a 95% confidence interval can be constructed as

ŷ T +1 ± 1.96 × σ̂

where it is implicitly assumed that εt is normal and uncertainty about the pa-
rameters is ignored.
Now, there are two points I’d like to draw your attention to. First: in order to
predict y T +1 we need x T +1 ; but then, we could generalise this idea and imagine
20 We’re sticking to the definition of ℑ_{T+1} I introduced in Section 5, so ℑ_{T+1} includes x_{T+1} but
not y_{T+1}; later in this section, we’ll use a different convention.

that we could make guesses about x T +2 , x T +3 , . . . as well. What keeps us from


predicting y t farther into the future? To cut a long story short, performing multi-
step forecasts is rather easy if you use your own predictions in lieu of the future
values of y t and proceed recursively. In other words, once you have ŷ T +1 you
can push equation (5.14) one step ahead in time and write

y T +2 = αy T +1 + β0 x T +2 + β1 x T +1 + εT +2 ;

next, we operate in a similar way as we just did, using the conditional expecta-
tion as predictor
ŷ T +2 = α̂ ŷ T +1 + β̂0 x̌ T +2 + β̂1 x̌ T +1 ,

repeating the process with the obvious adjustments for T + 3, T + 4 etc. It can be
proven (nice exercise for the reader) that the variance you should use for con-
structing confidence intervals for multi-step forecasts would be in this case

V[ŷ_{T+k}] = ((1 − α^{2k}) / (1 − α²)) σ².

Extending the formulae above to the general ADL(p, q) case is trivial but boring,
and I’ll just skip it.
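In case you want to see the recursion at work, here is a minimal sketch (not from the book;
plain Python/numpy, with a hypothetical function name and made-up parameter values and
future values x̌) for the ADL(1,1) case:

    import numpy as np

    def adl11_forecast(y_T, x_T, x_future, a, b0, b1, sigma):
        # recursive point forecasts and 95% bands for an ADL(1,1)
        yhat, lo, hi = [], [], []
        y_prev, x_prev = y_T, x_T
        for k, x_next in enumerate(x_future, start=1):
            y_next = a * y_prev + b0 * x_next + b1 * x_prev     # conditional expectation
            se = sigma * np.sqrt((1 - a**(2*k)) / (1 - a**2))   # k-step forecast std. error
            yhat.append(y_next); lo.append(y_next - 1.96*se); hi.append(y_next + 1.96*se)
            y_prev, x_prev = y_next, x_next                     # use the forecast recursively
        return np.array(yhat), np.array(lo), np.array(hi)

    # made-up numbers, just to show the mechanics
    print(adl11_forecast(y_T=1.0, x_T=2.0, x_future=[2.0, 2.0, 2.0],
                         a=0.75, b0=0.53, b1=-0.24, sigma=0.1)[0])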
The second point I want to make comes from considering the possibility that
the β₀ and β₁ coefficients were 0 in equation (5.14). In this case, there would be
no need to conjecture anything about x_t in order to forecast y_t. In other words,
x t has no predicting power for y t . This is a hypothesis we may want to test.
As a general rule, in the context of dynamic regression
models, it is difficult to formulate hypotheses of economic
interest that can be tested through restrictions on coeffi-
cients, since the coefficients of the A(L) and B (L) polyno-
mials normally don’t have a natural economic interpreta-
tion per se, and this is why we compute multipliers.
However, there are exceptions: we just saw one of
them in the previous section. Another one is the so-called
Granger-causality test, after the great Clive Granger, Nobel
Prize winner in 2003.21 The idea on which the test is built
is that, whenever A causes B , the cause should come, in
time, before the effect. Therefore, if A does not cause B , it
should have no effect on the quantity we normally use for prediction, i. e. the
conditional expectation.
The only difference with the ADL models we’ve considered so far is that,
since we’re dealing with predictions about the future, we will want to base our

21 C. W. Granger is one of the founding fathers of modern time series econometrics; his most fa-

mous brainchild, that earned him the Nobel Prize, is a concept called cointegration, that I will skip
in this book, but is absolutely indispensable if you want to engage in applied macroeconomics.
166 CHAPTER 5. DYNAMIC MODELS

inference on an information set that collects everything that is known at time


t − 1, namely
ℑ*_{t−1} = { y_{t−1}, y_{t−2}, . . . , x_{t−1}, x_{t−2}, . . . };

note that, contrary to the concept of information set ℑt we used so far (defined
in section 5.1), ℑ∗t does not include xt +1 ; in practice, it collects all information
on y t and xt that is available up to time t . Forecasting, therefore, amounts to
finding
ŷ_{T+1|T} = E[ y_{T+1} |ℑ*_T ].

The subscript “T + 1|T ” is customarily read as “at time T + 1, based on the infor-
mation available at time T ”.
There is no doubt that the discerning reader has spotted, by now, a fundamental difference
between the information set ℑ*_{T−1} that we are using here and the information set ℑ_T we
use in the rest of this chapter: the latter includes x_T, while the former does not. Since
ℑ*_{T−1} ⊂ ℑ_T, predictions on y_t made using ℑ*_{T−1} are obviously going to be less accurate,
but have the advantage of being possible one period earlier. Moreover, this also makes it
possible to forecast x̂_{T+1|T} = E[x_{T+1} |ℑ*_T]. This seemingly innocent remark paves the way
to multi-step forecasts, where we use the predictions for T to forecast T + 1, which in turn
we use for forecasting T + 2, and so on. This is the principle used in the so-called VAR
model, which is probably the main empirical tool in modern macroeconometrics. If you’re
curious, check out Lütkepohl (2005).

As a consequence, our ADL model

A(L)y t = B (L)x t + εt

will not include xt , but only its lags:

y t = α1 y t −1 + α2 y t −2 + · · · + β1′ xt −1 + β2′ xt −2 + · · · + ε∗t

where ε*_t is defined as y_t − E[y_t |ℑ*_{t−1}]. Clearly, this can also be written as an
ordinary ADL model in which B (0) = 0. The idea that x t does not cause y t is
equivalent to the idea B (L) = 0; since a polynomial is 0 if and only if all its coef-
ficients are, it is easy to formulate the hypothesis of no-Granger-causality as

H 0 : β1 = β2 = . . . = β q = 0

which is of course a system of linear restrictions, that we can handle just fine via
the R β = d machine we described in Section 3.3.2.
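Operationally, this is just an exclusion test on the lags of x_t; here is a minimal sketch
(not from the book; plain Python/numpy and scipy, with a made-up function name and
hypothetical series y and x), using the F version of the test:

    import numpy as np
    from scipy import stats

    def granger_F(y, x, p=2):
        # F test of H0: lags of x do not help predicting y, given p lags of both
        T = len(y)
        Y = y[p:]
        lags = lambda z: np.column_stack([z[p - j:T - j] for j in range(1, p + 1)])
        Xu = np.column_stack([np.ones(T - p), lags(y), lags(x)])   # unrestricted
        Xr = np.column_stack([np.ones(T - p), lags(y)])            # restricted (no x lags)
        ssr = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0])**2)
        ssr_u, ssr_r = ssr(Xu), ssr(Xr)
        df = (T - p) - Xu.shape[1]
        F = ((ssr_r - ssr_u) / p) / (ssr_u / df)
        return F, stats.f.sf(F, p, df)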
In the late 1960s, when this idea was introduced, it was hailed as a break-
through in economic theory, because for a while this seemed to provide a data-
based way to ascertain causal links. For example, a hotly debated point among
macroeconomists in the 1970s and 80s was: is there a causality direction between
the quantity of money and GDP in an economy? If there is, the repercussions on
economic policy (notably, on the effectiveness of monetary policy) are huge. In

[two time-series panels: “chicken” and “egg”, annual data, 1930–2002]

Figure 5.3: Thurman & Fisher data on chickens and eggs

those days, the concept of Granger-causality seemed to provide a convincing


answer.22

Example 5.12
In a humorous article, Thurman and Fisher (1988) collected data on the produc-
tion of chickens and eggs from 1930 to 2004, that are depicted in Figure 5.3.
After taking logs, we estimate by OLS the following 2 equations:

c t = m 1 + α1 c t −1 + α2 c t −2 + β1 e t −1 + β2 e t −2 + εt (5.15)
e t = µ1 + γ1 c t −1 + γ2 c t −2 + λ1 e t −1 + λ2 e t −2 + η t (5.16)

where c t is the log of chickens at time t and e t is the log of eggs.


The hypothesis that chickens don’t Granger-cause eggs is H0 : γ1 = γ2 = 0;
the analogous hypothesis, that eggs don’t Granger-cause chickens, is β1 = β2 =
0. The results are in Table 5.4. As can be seen, the hypothesis of absence of
Granger-causality is rejected in the egg → chicken direction, but not the other
way round; hence, the perennial question “what comes first?” has finally found
an answer: it’s the egg that comes first.

There are a few issues that may be raised here: one is statistical, and pertains
to the fact that the test is relative to a certain conditioning set. You may see this
as a variation on the same theme I discussed in Section 3.8, especially example
3.3. It may well be that A turns out to be Granger-causal for B in a model, and the
reverse happens in another model, in which some other variables are included
22 Readers who are into the history of economics and econometrics might want to take a look at

Sims (1972).

Dependent variable: l_chicken

Coefficient Std. Error t -ratio p-value


const 2.1437 0.7715 2.7788 0.0070
l_chicken_1 0.4037 0.1389 2.9054 0.0049
l_chicken_2 0.4362 0.1320 3.3037 0.0015
l_egg_1 0.8627 0.2011 4.2906 0.0001
l_egg_2 −0.8724 0.1999 −4.3642 0.0000
Mean dependent var 12.92290 S.D. dependent var 0.100867
Sum squared resid 0.125953 S.E. of regression 0.043038
R2 0.828060 Adjusted R 2 0.817946

Granger-causality test egg → chicken: F (2, 68) = 9.57089, p-value = 0.000217573

Dependent variable: l_egg

Coefficient Std. Error t -ratio p-value


const 0.8816 0.5292 1.6660 0.1003
l_chicken_1 −0.1196 0.0953 −1.2548 0.2139
l_chicken_2 0.0695 0.0906 0.7673 0.4456
l_egg_1 1.5302 0.1379 11.0961 0.0000
l_egg_2 −0.5570 0.1371 −4.0624 0.0001
Mean dependent var 8.577181 S.D. dependent var 0.204279
Sum squared resid 0.059259 S.E. of regression 0.029520
R2 0.980277 Adjusted R 2 0.979117

Granger-causality test chicken → egg: F (2, 68) = 1.28907, p-value = 0.282174

Table 5.4: Granger causality tests between chickens and eggs



or excluded. This is why, in some cases, people perform Granger-causality tests


on models in which the only variables considered are the ones that come directly
into play. I’ll leave it to the reader to judge whether this approach leads to results
that have a sensible statistical interpretation.
Another one is more substantial in nature, and has to do with the fact that
in economics it may well be that the cause comes after the effect, because ex-
pectations play a major role in human behaviour; people may do something at
a certain time in view of something that they expect to happen in the future. In
fact, standard economic theory assumes that agents are rational and forward-
looking: they base all their choices on the expectations they have about the fu-
ture.
There are many examples I could give you, but I’ll simply hint at a widely
used one: if people anticipate that a company is going to go bust, everyone
will sell that stock, causing its price to drop. If one should mechanically assess
causality from time precedence, it would be legitimate to say that the drop in the
stock price drove the company bankrupt, rather than the other way around. The
problem here is that in this case the statistical concept of Granger causality does
not agree very much with the notion of causality we use in everyday life (which is
arguably what we care about in economics).
to consider the Granger-causality test as a device for assessing predictive power;
whether predictive power can be considered a sign of a causal chain depends on
the circumstances.

5.A Assorted results


5.A.1 Inverting polynomials
Let us begin by noting that, for any a ̸= 1,
Σ_{i=0}^{n} a^i = (1 − a^{n+1}) / (1 − a),    (5.17)

which is easy to prove: call


S = Σ_{i=0}^{n} a^i = 1 + a + a² + · · · + aⁿ;    (5.18)

of course
a · S = a + a² + · · · + a^{n+1}    (5.19)
and therefore, by subtracting (5.19) from (5.18), S(1 − a) = 1 − a^{n+1}, and hence
equation (5.17).
If a is a small number (|a| < 1), then aⁿ → 0, and therefore Σ_{i=0}^{∞} a^i = 1/(1 − a). By
setting a = αL, you may say that, for |α| < 1, the inverse of (1 − αL) is (1 + αL +
α2 L 2 + · · · ), that is
(1 − αL)(1 + αL + α2 L 2 + · · · ) = 1,

or, alternatively,
1/(1 − αL) = Σ_{i=0}^{∞} α^i L^i

provided that |α| < 1. Now consider an n-th degree polynomial P(x):

P(x) = Σ_{j=0}^{n} p_j x^j

If P (0) = p 0 = 1, then P (x) can be written as the product of n first-degree poly-


nomials as follows:23
P(x) = Π_{j=1}^{n} ( 1 − x/λ_j )    (5.20)

where the numbers λ_j are the roots of P(x): if x = λ_j, then 1 − (1/λ_j)x = 0 and con-
sequently P(x) = 0. Therefore, if P(x)⁻¹ exists, it must satisfy

1/P(x) = Π_{j=1}^{n} ( 1 − x/λ_j )⁻¹ ;

but if at least one of the roots λ_j is smaller than 1 in modulus,24 then 1/|λ_j| is
larger than 1 and, as a consequence, (1 − x/λ_j)⁻¹ does not exist, and neither does
P(x)⁻¹.

Example 5.13
Consider the polynomial A(x) = 1 − 1.2x + 0.32x 2 ; is it invertible? Let’s check its
roots:

A(x) = 0  ⟺  x = (1.2 ± √(1.44 − 1.28)) / 0.64 = (1.2 ± 0.4)/0.64
so λ1 = 2.5 and λ2 = 1.25. Both are larger than 1 in modulus, so the polynomial
is invertible. Specifically,

A(x) = (1 − λ₁⁻¹x)(1 − λ₂⁻¹x) = (1 − 0.4x)(1 − 0.8x)

and

1/A(x) = (1 − 0.4x)⁻¹(1 − 0.8x)⁻¹ = (1 + 0.4x + 0.16x² + · · · )(1 + 0.8x + 0.64x² + · · · )
       = 1 + 1.2x + 1.12x² + 0.96x³ + 0.7936x⁴ + · · ·

23 If you don’t believe me, google for “Fundamental theorem of algebra”.


24 Warning: the roots may be complex, but this is not particularly important. If z is a complex
number of the form z = a + bi (where i = √−1), then |z| = √(a² + b²).

In practice: if the sequence a t is defined as the result of the application of


the operator P (L) to the sequence u t , that is a t = P (L)u t , then reconstructing
the sequence u t from a t is only possible if P (L) is invertible. In this case,

u_t = P(L)⁻¹ a_t = (1/P(L)) a_t.
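If you want to check the expansion in example 5.13 numerically, here is a small sketch (not
from the book; plain Python/numpy, with a made-up function name) that computes the
coefficients of P(x)⁻¹ recursively from the identity P(x)·P(x)⁻¹ = 1:

    import numpy as np

    def invert_poly(p, n):
        # first n coefficients of 1/P(x), where p = [p_0, p_1, ...] and p_0 = 1
        inv = np.zeros(n)
        inv[0] = 1.0
        for i in range(1, n):
            # the coefficient of x^i in P(x) * inv(x) must be zero
            inv[i] = -sum(p[j] * inv[i - j] for j in range(1, min(i, len(p) - 1) + 1))
        return inv

    print(invert_poly([1, -1.2, 0.32], 5))   # 1, 1.2, 1.12, 0.96, 0.7936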

5.A.2 Basic concepts on stochastic processes


This section is just meant to give you a rough idea of some of the concepts I hinted
at in section 5.3.1; if you want the real thing, go for Brockwell and Davis (1991).
Suppose you have an infinitely long sequence of random variables

. . . , x t −1 , x t , x t +1 , . . .

where the index t is normally taken to mean “time” (although not necessarily).
This sequence is a stochastic process.25 When we observe a time series, we ob-
serve a part of the realisation of a stochastic process (also called a trajectory
of the process). Just in the same way as the DGP for the toss of a coin can be
thought of as the machine that nature uses for giving us a binary number that
we cannot predict, a stochastic process is a machine that nature uses for giving
us an infinitely long trajectory through time, and what we observe is just a short
segment of it. This idea may be unintuitive at first (it certainly was for me, back
in the day), but I find it very useful.
If we take two different elements of the sequence, say x s and x t (with s ̸= t ),
we could wonder what their joint distribution is. The two fundamental proper-
ties of the joint distribution that we are interested in are:

1. is the joint distribution stable through time? That is, is the joint distribu-
tion of (x s , x t ) the same as (x s+1 , x t +1 ) ?

2. Is it likely that x s and x t become independent (or nearly so) if |t − s| is


large?

Property number 1 refers to the idea that the point in time when we ob-
serve the process should be irrelevant: the probability distribution of the data
we see today (x s , x t ) should be the same as the one for an observer in the past
(x s−100 , x t −100 ) or in the future (x s+100 , x t +100 ). This gives rise to the concept of
stationarity. A stochastic process is said to be weakly stationary, or covariance
stationary, or second-order stationary if the covariance between x s and x t (also
known as autocovariance) exists and is independent of time. In formulae:

γh = Cov [x t , x t +h ]
25 It’s not inappropriate to think of stochastic processes as infinite-dimensional random

variables. Using the same terminology as in section 2.2.1, we may think of the sequence
. . . , x t −1 (ω), x t (ω), x t +1 (ω), . . . as the infinite-dimensional outcome of one point in the state space
ω ∈ Ω.

note that γh , the autocovariance of order h, is a function of h only, not of t ; of


course, γ0 is just V [x t ]. If this is the case, the internal structure of correlation
between points in time is often described via the autocorrelation sequence (or
autocorrelation function, often abbreviated as ACF), defined as
ρ_h = γ_h / γ_0 .
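Just to fix ideas, the sample counterparts of γ_h and ρ_h are easily computed; here is a
minimal sketch (not from the book; plain Python/numpy, with a made-up function name),
applied to a simulated AR(1) process:

    import numpy as np

    def acf(x, maxlag):
        # sample autocovariances and autocorrelations up to order maxlag
        x = np.asarray(x, dtype=float)
        T, xbar = len(x), x.mean()
        gamma = np.array([np.sum((x[h:] - xbar) * (x[:T - h] - xbar)) / T
                          for h in range(maxlag + 1)])
        return gamma, gamma / gamma[0]

    rng = np.random.default_rng(0)
    e = rng.normal(size=1000)
    y = np.empty(1000)
    y[0] = e[0]
    for t in range(1, 1000):          # an AR(1) with autoregressive coefficient 0.8
        y[t] = 0.8 * y[t - 1] + e[t]
    print(acf(y, 3)[1])               # roughly 1, 0.8, 0.64, 0.51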

Property number 2, instead, is what we realistically imagine should happen


when we observe many phenomena through time: if s and t are very far apart,
what happened at time s one should contain little or no information on what
happened at time t . For example: the temperature at Cape North on May 29th,
1453 at 12am should contain no useful information on the temperature at Cape
North right now. This intuition can be translated into maths in a number of dif-
ferent ways. A common one is ergodicity. While a formal definition of ergodicity
would require a hefty investment in measure theory, if a process is covariance
stationary, ergodicity amounts to absolute summability of its autocovariances.
The property
Σ_{i=0}^{∞} |γ_i| = M < ∞

ensures that limh→∞ |γh | = 0 (so correlation between distant events should be
negligible), but most importantly, that the sample mean of an observed stochas-
tic process is a consistent estimator of the true mean of the process:

(1/T) Σ_{t=1}^{T} x_t →p E[x_t].

Note that the above expression can be considered as one of the many versions
of the Law of Large Numbers, applicable when observations are not necessarily
independent.
It goes without saying that, in the same way as we can define multivariate
random variables, it is perfectly possible to define multivariate stochastic pro-
cesses, that is, sequences of random vectors: modern macroeconometrics is pri-
marily built upon these objects. A large part of the statistical analysis of time
series is based on the idea that the time series we observe are realisations of sta-
tionary and ergodic processes (or can be transformed to this effect).
How do you adapt statistical inference to such a context? The main idea un-
derlying most approaches is to describe the DGP in such a way that the whole
autocovariance structure of a stochastic process (the sequence γ0 , γ1 , γ2 , ...) can
be expressed as a function of a finite set of parameters θ; if the process is sta-
tionary and ergodic, then maybe the available data x 1 , . . . , x T can be used to
construct CAN estimators of θ. ARIMA models are one of the most celebrated
instances of this approach, and the literature that has developed after their in-
troduction in the late 1960s is truly gigantic. If you’re interested, Brockwell and
Davis (1991) is an excellent starting point.

5.A.3 Why martingale difference sequences are serially uncorrelated


Here’s a rapid proof: if εt is a MDS with respect to ℑt , then

E[ε_t |ℑ_t] = E[ ε_t | x_t, x_{t−1}, . . . , y_{t−1}, y_{t−2}, . . . ] = 0.

Now, observe that ε_{t−1} is defined as ε_{t−1} = y_{t−1} − E[y_{t−1} |ℑ_{t−1}]; since both ele-
ments of the right-hand side of the equation are contained in ℑ_t, then ε_{t−1} ∈ ℑ_t;
moreover, ℑ_t ⊇ ℑ_{t−1} ⊇ ℑ_{t−2} . . ., so clearly all lags of ε_t are contained in ℑ_t (see
footnote 6 in this chapter). As a consequence, we can use the law of iterated
expectations (2.8) as follows:

Cov[ε_t, ε_{t−k}] = E[ε_t · ε_{t−k}] = E[ E[ε_t · ε_{t−k} |ℑ_t] ] =
               = E[ E[ε_t |ℑ_t] · ε_{t−k} ] = E[0 · ε_{t−k}] = 0

A second argument, perhaps more intuitive, rests directly on the definition


of a MDS: if εt is a MDS with respect to ℑt , then its expectation conditional on
ℑt −k (for k > 0) must also be 0, because ℑt −k is a subset of ℑt . But that means
that the expectation of any future element εt +k conditional on the present infor-
mation set ℑt is 0. In formulae:
E[ε_t |ℑ_t] = 0  and  ℑ_{t−k} ⊆ ℑ_t for k > 0   ⟹   E[ε_t |ℑ_{t−k}] = E[ε_{t+k} |ℑ_t] = 0
This is tantamount to saying that εt is effectively unpredictable. But then, if
Cov [εt , εt −k ] ̸= 0, εt wouldn’t be totally unpredictable, because there would be
some information in the past about the future. Therefore, the autocorrelations
of a MDS must be 0 for any k.

5.A.4 From ADL to ECM


Let’s begin with a preliminary result (which I’m not going to prove):
If P (x) is a polynomial whose degree is n > 0 and a is a scalar, you can always find
a polynomial Q(x), whose degree is (n − 1), such that

P (x) = P (a) +Q(x)(a − x);

if n = 0, obviously Q(x) = 0.

For example, the reader is invited to check that, if we choose a = 1, the poly-
nomial P (x) = 0.8x 2 − 1.8x + 1.4 can be written as

P (x) = 0.4 + (1 − 0.8x)(1 − x)

where P (1) = 0.4 and Q(x) = 1 − 0.8x.



Now consider P (L), a polynomial in the lag operator of degree n ≥ 1, and


apply the result above twice in a row, once with a = 0 and then with a = 1:

P (L) = P (0) −Q(L) · L (5.21)



Q(L) = Q(1) + P*(L)(1 − L)    (5.22)

If n = 1, evidently P ∗ (L) = 0. Otherwise, the order of Q(L) is (n − 1) and the


order of P ∗ (L) is (n − 2). If you evaluate equation (5.21) in L = 1, you have P (1) =
P (0) −Q(1), so that equation (5.22) becomes

Q(L) = P (0) − P (1) + P ∗ (L)(1 − L)

and therefore, using equation (5.21) again,

P(L) = P(0) − [ P(0) − P(1) + P*(L)(1 − L) ] · L = P(0)∆ + P(1)L − P*(L)∆ · L.

The actual form of the P ∗ (L) polynomial is not important: all we need is know-
ing that it exists, so that the decomposition of P (L) we just performed is always
possible. As a consequence, every sequence P (L)z t can be written as:

P (L)z t = P (0)∆z t + P (1)z t −1 − P ∗ (L)∆z t −1 .

Now apply this result to both sides of the ADL model A(L)y t = B (L)xt + εt :

∆y t + A(1)y t −1 − A ∗ (L)∆y t −1 = B (0)∆xt + B ∗ (L)∆xt −1 + B (1)xt −1 + εt ;

(note that A(0) = 1 by construction). After rearranging terms, you obtain the
ECM representation proper:

∆y_t = B(0)∆x_t + A*(L)∆y_{t−1} + B*(L)∆x_{t−1} − A(1)[ y_{t−1} − c′x_{t−1} ] + ε_t

where c′ = B(1)/A(1) contains the long-run multipliers. In other words, the variation
of y t over time is expressed as the sum of three components:

1. the external unpredictable shock εt ;

2. a short-run transitory component: B (0)∆xt +A ∗ (L)∆y t −1 +B ∗ (L)∆xt −1 ; the


first coefficient, B (0), gives you the instantaneous effect of xt on y t ;

3. a long-run component whose base ingredient is the long-run multiplier c.


Chapter 6

Instrumental Variables

The arguments I presented in chapter 3 should have convinced the reader that
OLS is an excellent solution to the problem of estimating linear models of the
kind
y = Xβ + ε,
where ε is defined as y − E[y|X], with the appropriate adjustments for dynamic
models; the derived property E [ε|X] = 0 is the key ingredient for guaranteeing
consistency of OLS as an estimator of β . With some extra effort, we can also
derive asymptotic normality and have all the hypothesis testing apparatus at
our disposal.
In some cases, however, this is not what we need. What we have implicitly
assumed so far is that the parameters of economic interest are the same as the
statistical parameters that describe the conditional expectation (or functions
thereof, like for example marginal effects or multipliers in dynamic models).
Sometimes, this might not be the case. As anticipated in section 3.6, this
happens when the model we have in mind contains explanatory variables that,
in common economics parlance, are said to be endogenous. In the next section,
I will give you a few examples where the quantities of interpretative interest are
not computable from the regression parameters. Hence, it should come as no
surprise that OLS is not a usable tool for this purpose: this is why we’ll want to
use a different estimator, known as instrumental variables estimator, or IV for
short.

6.1 Examples
6.1.1 Measurement error
Measurement error is what you get when one or more of your explanatory vari-
ables are measured imperfectly. Suppose you have the simplest version of a linear
model, where everything is a scalar:

y i = x i∗ β + εi (6.1)


where E[y_i |x*_i] = x*_i β and β is our parameter of interest. The problem is that we
do not observe x*_i directly; instead, all we have is a version of x*_i that is contam-
inated by some measurement error:

x i = x i∗ + η i (6.2)

where η i is a zero-mean random variable, independent of x i∗ and εi , with vari-


ance σ²_η > 0; clearly, the larger σ²_η is, the worse the quality of our measure-
ment for the variable of interest x i∗ . One may think that, since η i is, so to speak,
“neutral”, setting up a model using x i instead of x i∗ would do no harm. Instead,
this is not the case: unfortunately, an OLS regression of y_i on x_i won’t give you a con-
sistent estimator of β. This is quite easy to prove: combine the two equations
above to get
y_i = x_i β + (ε_i − βη_i) = x_i β + u_i    (6.3)

so
β̂ = (Σ_{i=1}^{n} x_i y_i) / (Σ_{i=1}^{n} x_i²) = (Σ_{i=1}^{n} x_i (x_i β + u_i)) / (Σ_{i=1}^{n} x_i²) = β + (Σ_{i=1}^{n} x_i u_i) / (Σ_{i=1}^{n} x_i²)

From the assumptions above, you get

E[x_i u_i] = E[(x*_i + η_i)(ε_i − βη_i)] = E[x*_i ε_i] − βE[x*_i η_i] + E[η_i ε_i] − βE[η²_i] =
          = −βσ²_η

If we define Q = E[x_i²], clearly

β̂ →p β − βσ²_η/Q = β ( 1 − σ²_η/Q ) ≠ β

It can be proven that 0 < σ2η < Q,1 so two main conclusions can be drawn from
the equation above: first, the degree of inconsistency of OLS is proportional to
the size of measurement error σ2η relative to Q; second, the asymptotic bias is
such that |plim(β̂)| < |β|; that is, the estimated effect is smaller than the true
one. This is often called attenuation.
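A quick simulation makes the attenuation effect visible; this is just a sketch (not from the
book; plain Python/numpy, with made-up numbers):

    import numpy as np

    rng = np.random.default_rng(42)
    n, beta, s_eta = 100_000, 1.0, 0.5
    xstar = rng.normal(size=n)                      # true regressor
    y = beta * xstar + rng.normal(scale=0.3, size=n)
    x = xstar + rng.normal(scale=s_eta, size=n)     # observed, error-ridden regressor

    bhat = (x @ y) / (x @ x)                        # OLS (no constant needed here)
    Q = np.mean(x**2)                               # sample counterpart of E[x_i^2]
    print(bhat, beta * (1 - s_eta**2 / Q))          # both close to 0.8, not 1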


As the rest of this chapter should hopefully make clear, the reason why OLS
doesn’t work as we’d like it to work lies in the fact that equation (6.3) does not
split y i into a conditional expectation and a disturbance term. It can be shown
that the regression function E[y_i |x_i] is not equal to x_i β: if it were, E[x_i u_i] would

be zero, but we just showed it isn’t. Since OLS is programmed to estimate the
parameters of a conditional expectation, you can’t expect it to come up with
anything else.

1 Come on, it’s easy, do it by yourself.



This argument turned out to be important in an economic


theory controversy in the 1950s about the consumption
function. In those days, orthodoxy was Keynes’ idea that

[T]he fundamental psychological law [. . . ] is


that men are disposed, as a rule and on the av-
erage, to increase their consumption as their
income increases but not by as much as the in-
crease in the income.2
In formulae, this was translated as

C = C 0 + cY .

with 0 < c < 1. As the reader knows, c is the “marginal propensity to consume”,
which is a key ingredient in mainstream Keynesian macroeconomics.
In the 1950s, few people would dissent from the received wisdom: one of
them was Milton Friedman, who would argue that c should not be less than 1 (at
least in the long run), since the only thing income is good for is buying things.
Over the span of your life, it would be silly to save money unconditionally: a
rational individual with perfect foresight should die penniless.3
Back in the day, economists thought of measuring c by running regressions
on the consumption function, and regularly found estimates that were signif-
icantly smaller than 1. Friedman, however, put forward a counter-argument,
based on the “permanent income” concept: consumption is not based on cur-
rent income, but rather on a concept of income that takes into account your ex-
pectations about the future. For example, if you knew with certainty that you’re
going to inherit a disgustingly large sum from a moribund distant uncle, you
would probably start squandering money today (provided, of course, you find
somebody willing to lend you money), far beyond your level of current income.
In this case, your observed actual income x i does not coincide with your
permanent income x i∗ (which is unobservable), and estimated values of c lower
than 1 could well be the product of attenuation.

6.1.2 Simultaneous equation systems


Simultaneous equation systems make for another nice example. Inclined as I
am to put econometric concepts in a historical context, I would love to inflict on
the reader a long, nostalgic account about the early days of econometrics, the
great Norwegian pioneer Trygve Haavelmo and Lawrence Klein4 and the Cowles
Commission, but this is not the place for it. Suffice it to say that estimation of
2 Keynes, J.M. (1936) The General Theory of Employment, Interest and Money
3 “As for avarice in old age, I cannot understand what it aims at: can anything be more absurd
than to seek more provisions for the journey, the less of the road remains?” (“Avaritia vero senilis
quid sibi velit, non intellego; potest enim quicquam esse absurdius quam, quo viae minus restet, eo
plus viatici quaerere?”, Marcus Tullius Cicero, De senectute).
4 Klein and Haavelmo got the Nobel Prize for their work in 1980 and 1989, respectively.

systems of equations is the first autonomous contribution of econometrics to


the general arsenal of statistical tools.
The reason why the estimation of parameters in simul-
taneous systems may be tricky is relatively easy to see by
focusing on the distinction between parameters of interest
and parameters of the conditional mean.
Consider one of the simplest examples of simultane-
ous equation system in economics: a two-equation linear
model of supply and demand for a good.

qt = α0 − α1 p t + u t (6.4)
pt = β0 + β1 q t + v t , (6.5)
equation (6.4) is the demand equation (quantity at time t
as a function of price at time t ), (6.5) is supply (price as a function of quan-
tity); the two disturbance terms u t and v t represent random shocks to the two
curves. For example, u t could incorporate random fluctuations in demand
due to shifting customer preferences, fluctuations in disposable income and so
forth; v_t , instead, could be non-zero because of productivity shifts due to ran-
dom events (think for example weather for agricultural produce). Assume that
E [u t ] = E [v t ] = 0.
If you considered the two equations separately, you might think of estimat-
ing their parameters by using OLS, but this would be a big mistake, since the
“systematic part” of each of the two equations is not a conditional expectation.
An easy way to convince yourself is simply to consider that if E[q_t |p_t] is
upward (downward) sloping, the correlation between q_t and p_t must be posi-
tive (negative), and therefore there’s no way the reverse conditional expectation
E[p_t |q_t] can be downward (upward) sloping. Since the demand function goes
down and the supply function goes up, at least one of them cannot be a condi-
tional expectation.
However, a more rigorous proof can be given: take the demand curve (6.4):
if the expression (α₀ − α₁p_t) were in fact the conditional expectation of q_t given p_t,
then E[u_t |p_t] should be 0. Now substitute (6.4) into equation (6.5):

pt = β0 + β1 (α0 − α1 p t + u t ) + v t
= (β0 + β1 α0 ) − (β1 α1 )p t + (v t + β1 u t ) ⇒
(1 + β1 α1 )p t = (β0 + β1 α0 ) + (v t + β1 u t ) ⇒ (6.6)
pt = π1 + η t , (6.7)
where the constant π₁ is (β₀ + β₁α₀)/(1 + β₁α₁) and η_t = (v_t + β₁u_t)/(1 + β₁α₁) is a zero-mean random vari-
able. The covariance between p_t and u_t is easy to compute:

Cov[p_t, u_t] = E[p_t · u_t] = E[u_t π₁ + u_t η_t] = 0 + E[u_t · (v_t + β₁u_t)] / (1 + β₁α₁) =
             = ( Cov[v_t, u_t] + β₁ V(u_t) ) / (1 + β₁α₁)

Now, unless the covariance between v_t and u_t happens to be exactly equal to
−β₁V(u_t) (and there is no reason why it should), Cov[p_t, u_t] ≠ 0. Borrowing on
the definition I gave in Section 3.6, the variable p_t is clearly endogenous.
But if Cov[p_t, u_t] ≠ 0, then E[u_t |p_t] can’t be 0 either; therefore, (α₀ − α₁p_t)
can’t be E[q_t |p_t], and as a consequence there’s no way that OLS applied to equa-
tion (6.4) (that is, regressing quantity on a constant and price) could be a con-
sistent estimator of α₀ and α₁.
To be more specific: even assuming that E[q_t |p_t] is a linear function like

E[q_t |p_t] = γ₀ + γ₁ p_t,

OLS gives you an excellent estimate of the coefficients γ0 and γ1 ; unfortunately,


they are not the same thing as α0 and α1 .
Of course, the same argument in reverse could be applied to the supply
equation so regressing p t on q t won’t give you good estimates of β0 and β1 , ei-
ther. This example will be generalised in section 6.6.2.

6.2 The IV estimator


Consider a standard linear model y_i = x′_i β + ε_i. As we argued in chapter 3, the assump-
tion x′β = E[y|x] is crucial for the consistency of the OLS statistic as an estimator
of β; in fact, you could see this assumption as a definition of β, in that β is the
only vector for which the following equation is true:

E[X′(y − Xβ)] = 0.    (6.8)

The OLS statistic β̂ , instead, is implicitly defined by the relationship

X′ (y − Xβ̂ ) = 0. (6.9)

which corresponds to the first-order conditions for the minimisation of the sum
of squared residuals (see section 1.3.2, especially equation (1.10)); note that equa-
tion (6.9) can be seen as the sample equivalent of equation (6.8). The fact that
the OLS statistic β̂ works quite nicely as an estimator of its counterpart β just
agrees with common sense.
If, on the contrary, the parameter of interest β satisfies an equation other
than (6.8), then we may proceed by analogy and use, as an estimator, a statistic
β̃ that satisfies the corresponding sample property. In this chapter, we assume
we have a certain number of observable variables W for which

E[W′(y − Xβ)] = 0.          (6.10)

The corresponding statistic will then be implicitly defined by

W′ (y − Xβ̃ ) = 0. (6.11)

The variables W are known as instrumental variables, or, more concisely, in-
struments. The so-called “simple” IV estimator can then be defined as follows:
if we have a matrix W, of the same size as X, satisfying (6.10), we may define
a statistic β̃ such that (6.11) holds:

W′ (y − Xβ̃ ) = 0 =⇒ W′ X · β̃ = W′ y.

In a parallel fashion, the difference ε = y − Xβ is not defined as the difference


between y and its conditional mean, but rather as a zero-mean random variable
which describes how much y deviates from its “standard” value, as described by
the “structural” relationship Xβ . A term that we use in this context is structural
disturbance, or just “disturbance” when no confusion arises. Given this defini-
tion, there is no guarantee that the structural disturbance should be orthogonal
to the regressors.
Since W has as many columns as X, the matrix W′X is square; if it’s also
invertible, then β̃ is
β̃ = (W′ X)−1 W′ y; (6.12)

this is sometimes called the “simple” IV estimator.
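To make the formula concrete, here is a minimal hansl sketch (simulated data; all names and numbers are invented for the illustration) that computes (6.12) by matrix algebra and checks it against gretl’s built-in tsls command:

nulldata 500
set seed 123
series w = normal()                        # the instrument
series u = normal()
series x = 0.8*w + u                       # endogenous regressor
series y = 1 + 0.5*x + 0.6*u + normal()    # disturbance correlated with x
list Xl = const x
list Wl = const w
matrix mX = {Xl}
matrix mW = {Wl}
matrix my = {y}
matrix b_simple = inv(mW'mX) * mW'my       # equation (6.12)
tsls y const x ; const w --quiet
print b_simple
print $coeff                               # same numbers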


The actual availability of the variables W may be a problem, sometimes. In
fact, collecting observable data that can be used as instruments is a bit of an art,
although in many cases the choice of instruments is dictated by economic in-
tuition. In Section 6.6, we will look at the examples provided in Section 6.1 and
suggest possible solutions. However, before doing so, it is convenient to con-
sider a generalisation.

6.2.1 The generalised IV estimator


What if the number of columns of W (call it m) were different from the number of
columns of X (call it k)? Of course, the matrix W′X wouldn’t be square and
therefore not invertible. While there’s no remedy for the case m < k, one may
argue that in the opposite case we could just drop m − k columns from W and
proceed as above. While this makes sense, the reader will probably feel uneasy
at the thought of dumping information deliberately. And besides, how do we
choose which columns to drop from W?
Fortunately, there is a solution: assume, for simplicity, that the covariance
matrix of the structural disturbances is a multiple of the identity matrix:5

E[εε′ | W] = σ²I.

By hypothesis, E[ε | W] = 0; therefore,

E[W′εε′W | W] = σ²W′W = σ²Ω.

5 In fact, this assumption is not strictly necessary, but makes for a cleaner exposition.

Now define v, C and e as

v = W′y (m×1),    C = W′X (m×k),    e = W′ε (m×1);

so the following equality holds:

v = C β + e, (6.13)

Equation (6.13) may be seen as a linear model where the disturbance term has
zero mean and covariance matrix σ2 Ω. The number of explanatory variables is
k (the column size of X) but the peculiar feature of this model is that the number
of “observations” is m (the column size of W).
Since Ω is observable (up to a constant), we may apply the GLS estimator
(see 4.2.1) to (6.13) and write
β̃ = [C′Ω⁻¹C]⁻¹ C′Ω⁻¹v = [X′W(W′W)⁻¹W′X]⁻¹ X′W(W′W)⁻¹W′y = (X′P_W X)⁻¹X′P_W y.          (6.14)

This clever idea is due to the English econometrician


Denis Sargan,6 whose name will also crop up later, in sec-
tion 6.7.1.
The estimator β̃ in the equation above is technically
called the Generalised IV Estimator, or GIVE for short.
However, proving that (6.12) is just a special case of (6.14)
when m = k is a simple exercise in matrix algebra, left to
the reader, so when I speak of the IV estimator, what I mean
is (6.14).
When m = k, the model is said to be exactly identified,
as the estimator is based on solving (6.11), which is a sys-
tem of m equations in m unknowns; if W′ X is invertible, it has one solution.
On the contrary, if m > k, (6.11) becomes a system with more equations than
unknowns, so a solution does not ordinarily exist. The statistic we use is not a
solution of (6.11), but is rather defined by re-casting the problem as a sui generis
OLS model as in (6.13).
In this case, we say the model is over-identified and the difference (m − k)
is referred to as over-identification rank. The opposite case, when m < k, is a
textbook case of under-identification, which I described in section 2.5. In short,
one may say that a necessary condition for the existence of the IV estimator is
that m ≥ k; this is known as the order condition.
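As a quick numerical check of (6.14), the following hansl sketch (simulated data, invented names) builds the projection matrix P_W explicitly in an over-identified case with m = 3 instruments and k = 2 regressors, and compares the result with tsls:

nulldata 500
set seed 456
series w1 = normal()
series w2 = normal()
series u = normal()
series x = 0.7*w1 + 0.4*w2 + u             # endogenous regressor
series y = 1 + 0.5*x + 0.6*u + normal()
list Xl = const x
list Wl = const w1 w2
matrix mX = {Xl}
matrix mW = {Wl}
matrix my = {y}
matrix PW = mW * inv(mW'mW) * mW'          # projection on Sp(W)
matrix b_give = inv(mX'PW*mX) * mX'PW*my   # equation (6.14)
tsls y const x ; const w1 w2 --quiet
print b_give
print $coeff                               # same numbers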
6 For historical accuracy, it must be said that the idea of IV estimation was first put forward

as early as 1953 by the Dutch genius Henri Theil. But it was Sargan who created the modern
approach, in an article that appeared in 1958.

As a by-product from estimating β , you also get a residual vector εe = y − Xβ̃ ,


so that an estimator of σ2 is readily available:

σ̃² = ε̃′ε̃ / n.
As shown in section 6.A.1, it can be proven that, under the set of assumptions I
just made, the statistics β̃ and σ̃2 are CAN estimators. Therefore, the whole test-
ing apparatus we developed in Chapter 3 can be applied without modifications
since
√n(β̃ − β) →d N(0, V).
The precise form of the asymptotic covariance matrix V is not important here;
see 6.A.1. What is important in practice is that, under homoskedasticity, we have
an asymptotically valid matrix we can use for hypothesis testing, which is

V̂ = σ̃2 (X′ PW X)−1 .

In more general cases, robust alternatives (see section 4.2.2) are available.

Just like OLS, the IV estimator may be defined as the solution of an optimisation problem:

β̃ = argmin_{β ∈ R^k} ε(β)′ P_W ε(β)

(compare the above expression with equation (1.14)). In this book, we will not make much use of
this property. However, defining β̃ in this way would be the first step towards seeing it as a
member of a very general category of estimators known as GMM (Generalised Method of
Moments) estimators, which includes practically all estimators used in modern economet-
rics. The theory of GMM is beautiful: as a starting point, I heartily recommend Hayashi
(2000).

6.2.2 The instruments


I will not distract the reader here with technicalities on the asymptotics of the IV
estimator; you’ll find those in Section 6.A.1. Here, I’m going to focus on two nec-
essary conditions for consistency of β̃ and explore what requisites they imply
for the variables we choose as instruments. The two conditions are:

1. (1/n) Σᵢ w_i ε_i →p 0;

2. (1/n) Σᵢ x_i w_i′ →p A, where A is a k × m matrix with rank k.

Condition 1 is more or less guaranteed by (6.10), that is by wi being exoge-


nous, which basically means “uncorrelated with the structural disturbance” εi ;
if this requirement isn’t met, the limit in probability of β̃ is not β . End of story.
The implications of condition 2 are more subtle. The first one is: since the
rank of A cannot be k if m < k, you need to have at least as many instruments
as regressors. The good news is, this condition is not as stringent as it may seem
at first sight: the fact that E [xi · εi ] is not a vector of zeros does not necessarily

mean that all its elements are nonzero. Some of the explanatory variables may
be exogenous; in fact, in empirical models the subset of explanatory variables
that may be suspected of endogeneity is typically rather small. Therefore, the
exogenous subset of xi is perfectly adequate to serve as an instrument, obvious
examples being deterministic variables such as the constant. What the order
condition really means is that, for each endogenous explanatory variable, we
need at least one instrument not also used as a regressor.
Clearly, m ≥ k is not a sufficient condition for A to have rank k. For exam-
ple, A may be square but have a column full of zeros. This would happen, for
example, if the corresponding instrument was independent of all regressors x.
The generalisation of this idea leads to the concept of relevance.7 The instru-
ments must not only be exogenous, but must also be related to the explanatory
variables.8
Note a fundamental difference between the order condition and the rele-
vance condition: the order condition can be checked quite easily (all you have
to do is count the variables); and even if you can’t be bothered with checking, if
the order condition fails the IV estimator β̃ is not computable, since (X′ PW X) is
singular and your software will complain about this.
The relevance condition, instead, is much trickier to spot, since (with prob-
ability 1) (1/n) Σᵢ x_i w_i′ will have rank k even if A doesn’t. Hence, if rk(A) < k, you will
be able to compute β̃, but unfortunately it will be completely useless as an esti-


mator. It can be proven that, in such an unfortunate case, the limit in probability
of β̃ is not a constant, but rather a random variable, so there’s no way it can be a
consistent estimator for any parameter.
In order to make this point clearer, let me give you an example. Suppose
that the three random variables y i , x i and w i were continuous and scalar and
imagine that w_i is not relevant. The IV estimator would simply be

β̃ = ( (1/n) Σᵢ w_i y_i ) / ( (1/n) Σᵢ w_i x_i );

now focus on the denominator of the expression above: clearly, the probability
that (1/n) Σᵢ w_i x_i = 0 is 0, so the probability that β̃ exists is 1 for any finite n.
However, if you compute its probability limit, you see the problem very quickly:
if w_i is not relevant, then (1/n) Σᵢ w_i x_i →p A = 0 (which has, of course, rank 0
instead of 1, as we would require). Therefore the denominator will be a nonzero


number which becomes smaller and smaller as n → ∞. The reader should easily
see that in this case we can’t expect the asymptotic distribution of β̃ to collapse
to a point. In this case the estimator is not inconsistent because it converges to
the wrong value, but rather because it doesn’t converge at all.
7 In other contexts, what I call the relevance condition is known as the rank condition.
8 It should be noted that these properties are, to some extent, contradictory: if X and ε are
correlated, any variable perfectly correlated to X could not be orthogonal to ε. The trick here is
that W is not perfectly correlated to X.

Finally, instruments should be as “strong” as possible. The precise mean-


ing of this phrase is the object of Section 6.7.2: here, I’ll just mention the fact
that inference with IV models could be quite problematic in finite samples, since
the asymptotic approximations that we ordinarily use may work quite poorly. A
common source of problems is the case of weak instruments: variables that are
relevant, but whose connection with the regressors is so feeble that you need an
inordinately large data set to use them for your purposes. This point will (hope-
fully) become clearer later, in the context of “two-stage” estimation (Section 6.5).

6.3 An example with real data

For this example, we are going to use a great classic from applied labour eco-
nomics: the “Mincer wage equation”; the idea is roughly to have a model like
the following:
y i = z′i β0 + e i β1 + εi (6.15)

where y i is the log wage for an individual, e i is their education level and the
vector zi contains other characteristics we want to control for (gender, work ex-
perience, etc). The parameter of interest is β1 , which measures the returns to
education and that we would expect to be positive.
The reader, at this point, may dimly recall that we already estimated an equa-
tion like this, in section 1.5. Back then, we did not have the tools yet for interpret-
ing the results from an inferential perspective, but the results were in agreement
with common sense. Why would we want to go back to a wage equation here?
The literature has long recognised that education may be endogenous, be-
cause the amount of education individuals receive is (ordinarily) decided by the
individuals themselves. In practice, if the only reason to get an education is to
have access to more lucrative jobs, individuals solve an optimisation problem
where they decide, among other things, their own education level. This gives
rise to an endogeneity problem.9
For the reader’s convenience, I’ll reproduce here OLS estimates in Table 6.1.
If education is endogenous, as economic theory suggests it may be, then the “re-
turns to education” parameter we find in the OLS output (about 5.3%) is a valid
estimate of the marginal effect of education on the conditional expectation of
wage, but is not a valid measure of the causal effect of education on wages, that
is the increment in wage that an individual would have had if they had received
an extra year of education.
I will now estimate the same equation via IV: the instruments I chose for
this purpose are (apart from the three regressors other than education, which
I take as exogenous) two variables that I will assume, for the moment, as valid
instruments:
9 The literature on this topic is truly massive. A good starting point is Card (1999).

OLS, using observations 1-1917


Dependent variable: lw

coefficient std. error t-ratio p-value


----------------------------------------------------------
const 1.32891 0.0355309 37.40 2.86e-230 ***
male 0.175656 0.0143224 12.26 2.42e-33 ***
wexp 0.00608615 0.000671303 9.066 2.97e-19 ***
educ 0.0526218 0.00202539 25.98 1.02e-127 ***

Mean dependent var 2.218765 S.D. dependent var 0.363661


Sum squared resid 177.9738 S.E. of regression 0.305015
R-squared 0.297629 Adjusted R-squared 0.296528
F(3, 1913) 270.2107 P-value(F) 3.3e-146
Log-likelihood -441.8652 Akaike criterion 891.7304
Schwarz criterion 913.9645 Hannan-Quinn 899.9118

Table 6.1: Wage equation on the SHIW dataset — OLS estimates

• the individual’s age: the motivation for this is that you don’t choose when
you’re born, and therefore age can be safely considered exogenous; at the
same time, regulations on compulsory education have changed over time,
so it is legitimate to think that older people may have spent less time in
education, so there are good chances age may be relevant.

• Parents’ education level (measured as the higher between mother’s and


father’s): it is a known fact that family environment is a powerful factor in
educational choice. Yet, individuals can’t decide on the educational level
of their parents, so we may conjecture that this variable is both exogenous
and relevant.
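In gretl, the two sets of estimates boil down to a couple of commands; a minimal hansl sketch, assuming the SHIW dataset is loaded with the variable names shown in the tables, is:

# OLS, as in Table 6.1
ols lw const male wexp educ
# IV (TSLS), as in Table 6.2: educ instrumented by age and peduc
tsls lw const male wexp educ ; const male wexp age peduc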

Table 6.2 shows the output from IV estimation; in fact, gretl (that is what I
used) gives you richer output than this, but I’ll focus on the part of immediate
interest.
As you can see from the output, you get substantially different coefficients:
not only do the returns to education appear to be quite a bit stronger
(7.9% versus 5.3%), but the other coefficients become larger too. Clearly, the
question at this point becomes: OK, the two methods give different numbers.
But are they significantly different? The tool we use to answer this question is
the so-called Hausman test, which is the object of the next section.

6.4 The Hausman test


So far, we have taken as given that some of the variables in the regressor matrix
X were endogenous, and that, as a consequence, OLS wouldn’t yield consistent
estimates of the parameters of interest. But of course, we don’t know with cer-

TSLS, using observations 1-1917


Dependent variable: lw
Instrumented: educ
Instruments: const male wexp age peduc

coefficient std. error t-ratio p-value


---------------------------------------------------------
const 0.943553 0.0684539 13.78 2.80e-41 ***
male 0.182926 0.0149929 12.20 5.02e-33 ***
wexp 0.00860475 0.000795375 10.82 1.62e-26 ***
educ 0.0792106 0.00449758 17.61 1.85e-64 ***

Mean dependent var 2.218765 S.D. dependent var 0.363661


Sum squared resid 194.0070 S.E. of regression 0.318457
R-squared 0.292476 Adjusted R-squared 0.291366
F(3, 1913) 144.8624 P-value(F) 1.38e-84

Table 6.2: Wage equation on the SHIW dataset — IV estimates

tainty. One may argue that, if we have instruments whose quality we’re confi-
dent about, we might as well stay on the safe side and use IV anyway. If we do,
however, we may be using an inefficient estimator: it can be proven that, if X is
exogenous, OLS is more efficient than IV under standard conditions.10
The Hausman test is based on the idea of comparing
the two estimators and checking if their difference is sta-
tistically significant.11 If it is, we conclude that OLS and
IV have different probability limits, and therefore OLS can’t
be consistent, so our estimator of choice has to be IV. Oth-
erwise, there is no ground for considering X endogenous,
and we may well opt for OLS, which is more efficient.12
This idea can be generalised: if you have two estima-
tors, one of which (θ̃) is robust to some problem and the
other one isn’t (θ̂), the difference δ = θ̂ − θ̃ should converge
to a non-zero value if the problem is there, and to 0 otherwise. Therefore, we
could set up a Wald-like statistic
H = δ′ [V̂(δ)]⁻¹ δ;          (6.16)

where V̂(δ) is a consistent estimator of AV[δ], and, under some standard reg-
ularity conditions, it can be proven that H is asymptotically χ2 under the null
10 If you’re curious, the proof is in section 6.A.2.
11 As always, there is some paternity debate: some people call this the Wu-Hausman test; some

others, the Durbin-Wu-Hausman test. While it is technically true that the same test statistic had
been independently derived before (by Durbin in 1954 and by Wu in 1973), the idea became main-
stream only after the publication of Hausman (1978).
12 I know what you’re thinking: this is the same logic we used in section 4.2.3 for the White test

for heteroskedasticity. You’re right.



(H0 : plim (δ) = 0).13


The problem is, how do you compute V̂(δ)? For the special case when the
non-robust estimator is also efficient, we have a very nice result: the variance of
the difference is the difference of the variances. A general proof is quite involved,
but for a sketch in the scalar case you can jump to section 6.A.3. In practice, if
you have two estimators, and the situation is the one described in Table 6.3, an
asymptotically valid procedure is just to compute H as

H = (θ̂ − θ̃)′ ( AV[θ̃] − AV[θ̂] )⁻¹ (θ̂ − θ̃).

Table 6.3: Hausman Test — special case

          if H0 is true              if H0 is false
θ̂         CAN and efficient          Inconsistent
θ̃         CAN but not efficient      CAN but not efficient

In our case, the two estimators to compare are β̂ and β̃, so δ = β̂ − β̃. If we
assume that OLS is efficient (which would be the case under homoskedasticity), then

V[δ] = V[β̃] − V[β̂].
Since under H0 OLS is consistent, σ̂² is a consistent estimator of σ² and we
can use the matrix

σ̂² [ (X′P_W X)⁻¹ − (X′X)⁻¹ ];

therefore,

H = (β̃ − β̂)′ [ (X′P_W X)⁻¹ − (X′X)⁻¹ ]⁻¹ (β̃ − β̂) / σ̂².          (6.17)
In practice, actual computation of the test is performed even more simply,
via an auxiliary regression: consider the model
y = Xβ + X̂γ + residuals. (6.18)
where X̂ ≡ P_W X. By the Frisch–Waugh theorem (see section 1.4.4)

γ̂ = [X̂′M_X X̂]⁻¹ X̂′M_X y;

now rewrite the two matrices on the right-hand side of the equation above as

X̂′M_X X̂ = X̂′X̂ − X̂′P_X X̂ = X̂′X̂ − X̂′X(X′X)⁻¹X′X̂ =
         = (X̂′X̂) [ (X̂′X̂)⁻¹ − (X′X)⁻¹ ] (X̂′X̂)          (6.19)
X̂′M_X y = X̂′y − X̂′X(X′X)⁻¹X′y = (X̂′X̂)(β̃ − β̂)          (6.20)
13 The number of degrees of freedom for the test is not as straightforward to figure out as it may

seem. See below.



where we repeatedly used the equality X̂′X = X′P_W X = X̂′X̂; therefore,

γ̂ = (X̂′X̂)⁻¹ [ (X̂′X̂)⁻¹ − (X′X)⁻¹ ]⁻¹ (β̃ − β̂).          (6.21)

A Wald test for γ = 0 is

W = γ̂′ [X̂′M_X X̂] γ̂ / σ̂²,

so, after performing a few substitutions, you get

W = (β̃ − β̂)′ [ (X̂′X̂)⁻¹ − (X′X)⁻¹ ]⁻¹ (β̃ − β̂) / σ̂² = H.
Of course, the possibility of X and W having some columns in common com-
plicates slightly the setup above, and X̂ should only contain the projection on W
of the endogenous regressors, because if a regressor is also contained in W its
projection is obviously identical to the original. This means that the degrees of
freedom for the Hausman test is equal to the number of explanatory variables
that are effectively treated as endogenous (that is, are not present in the instru-
ment matrix W).
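In hansl, the auxiliary-regression route takes only a few lines; the sketch below re-uses the SHIW variable names of Section 6.3 (hat_educ is just an illustrative name), and the zero restriction on hat_educ is asymptotically equivalent to (6.17):

# project the endogenous regressor on the instrument set
ols educ const male wexp age peduc --quiet
series hat_educ = $yhat
# augmented regression (6.18): testing hat_educ = 0 is the Hausman test
ols lw const male wexp educ hat_educ --quiet
omit hat_educ --test-only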

Example 6.1
Let us go back to the wage equation example illustrated in Section 6.3. While
commenting Table 6.2, I mentioned the fact that my software of choice (gretl)
offers richer output than what I reported. Part of it is the outcome of the Haus-
man test, that compares IV vs OLS:

Hausman test -
Null hypothesis: OLS estimates are consistent
Asymptotic test statistic: Chi-square(1) = 50.2987
with p-value = 1.32036e-12

As you can see, our original impression that the two sets of coefficients were
substantially different was definitely right. The p-value for the test leads to re-
jecting the null hypothesis very strongly. Therefore, IV and OLS have different
limits in probability, which we take as a sign that education is, in fact, endoge-
nous.
Note that the test statistic is matched against a χ2 distribution with 1 degree
of freedom, because there is one endogenous variable in the regressors list (that
is, education).

6.5 Two-stage estimation


The IV estimator is also called two-stage estimator (hence the acronyms TSLS,
for Two-Stage Least Squares or 2SLS, for 2-Stage Least Squares).

The reason is that the β̃ statistic may be computed by two successive ap-
plications of OLS, called the two “stages”.14 In the era when computation was
expensive, this was a nice trick to calculate β̃ without the need for other soft-
ware than OLS, but seeing IV as the product of a two-stage procedure has other
advantages too.
In order to see what the two stages are, define X̂ = PW X and rewrite (6.14) as
follows:
β̃ = (X̂′ X̂)−1 X̂′ y, (6.22)
The matrix X̂ contains, in the j -th column, the fitted value of a regression of the
j -th column of X on W; the regression

xi = Πwi + ui (6.23)

is what we call the first stage regression. In the second stage, you just regress y
on X̂: the OLS coefficient equals β̃ . Note: this is a numerically valid procedure
for computing β̃ , but the standard errors you get are not valid for inference. This
is because second stage residuals are e = y − X̂β̃ , which is a different vector from

the IV residuals ε̃ = y − Xβ̃. Consequently, the statistic e′e/n does not provide a valid
estimator of σ², which in turn makes the estimated covariance matrix invalid.

Readers who liked the geometrical interpretation of OLS as a projection might like to consider
a different way of writing equation (6.22), that is β̃ = (X̂′X)⁻¹X̂′y, from which you have that
the fitted values from the GIVE estimator can be written as

ỹ = Xβ̃ = X(X̂′X)⁻¹X̂′y = Q_{X̂,X} y.

The matrix Q_{X̂,X} is square and idempotent (but not necessarily symmetric) and performs
what is called an oblique projection.

So, if the only computing facility you have is OLS (which was often the case
in the 1950s and 1960s), you can compute the IV estimator via a repeated appli-
cation of OLS. Moreover, you don’t really have to run as many first-stage regres-
sions as the number of regressors. You just have to run one first-stage regression
for each endogenous element of X (recall the discussion in subsection 6.4 on the
degrees of freedom for the Hausman test).

Example 6.2
Let’s estimate the same model we used in section 6.3 via the two-stage method:
the output from the first stage regression is in Table 6.4: as you can see, the de-
pendent variable here is education, while the explanatory variables are the full
instrument matrix W. There is not much to say about the first stage regression,
except noting that the two “real” instruments (age and parents’ education) are
both highly significant. This will be important in the context of weak instru-
ments (see section 6.7.2).
14 In fact, the word Henri Theil used when he invented this method was “rounds”, but subse-

quent literature has settled on “stages”.



OLS, using observations 1-1917


Dependent variable: educ

coefficient std. error t-ratio p-value


---------------------------------------------------------
const 5.43051 0.491957 11.04 1.65e-27 ***
male -0.137838 0.142828 -0.9651 0.3346
wexp -0.205149 0.0129572 -15.83 3.78e-53 ***
age 0.192950 0.0146614 13.16 6.26e-38 ***
peduc 0.386233 0.0207287 18.63 2.47e-71 ***

Mean dependent var 11.73344 S.D. dependent var 3.598508


Sum squared resid 17665.37 S.E. of regression 3.039607
R-squared 0.287996 Adjusted R-squared 0.286507
F(4, 1912) 193.3448 P-value(F) 2.6e-139
Log-likelihood -4848.785 Akaike criterion 9707.570
Schwarz criterion 9735.363 Hannan-Quinn 9717.797

Table 6.4: Wage equation on the SHIW dataset — first stage

OLS, using observations 1-1917


Dependent variable: lw

coefficient std. error t-ratio p-value


---------------------------------------------------------
const 0.943553 0.0711035 13.27 1.64e-38 ***
male 0.182926 0.0155733 11.75 8.22e-31 ***
wexp 0.00860475 0.000826161 10.42 9.55e-25 ***
hat_educ 0.0792106 0.00467166 16.96 3.48e-60 ***

SSR = 209.316, R-squared = 0.173936

Table 6.5: Wage equation on the SHIW dataset — second stage



Once the first stage regression is computed, we save the fitted values into
a new variable called hat_educ. Then, we replace the original education variable
educ with hat_educ in the list of regressors and perform the second stage.
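A minimal hansl sketch of the two stages just described (same variable names) would be:

# first stage (Table 6.4)
ols educ const male wexp age peduc
series hat_educ = $yhat
# second stage (Table 6.5): same coefficients as TSLS, standard errors not valid
ols lw const male wexp hat_educ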
Results are in Table 6.5; a comparison of these results with those reported in
Table 6.2 reveals that:
1. the coefficients are identical;

2. the standard errors are not, because the statistic you obtain by dividing the
SSR from the second-stage regression (209.316) by the number of obser-
vations (1917) is not a consistent estimator for σ2 ; therefore, the standard
errors reported in table 6.5 differ from the correct ones by a constant scale
factor (1.0387 in this case).

6.5.1 The control function approach


The method I described in the previous subsection to compute the IV estimator
via two successive stages is the traditional one. A slight variation on that proce-
dure gives rise to what is sometimes called the control function approach.
The main idea can be grasped by considering a minimal model such as

yi = x i · β + εi (6.24)
xi = w i · π + ui , (6.25)

where equation (6.25) is a “proper” linear model, and Cov [u i , εi ] is some real
number, not necessarily 0. If w i is exogenous, the only way x i can be correlated
with εi is if Cov [u i , εi ] ̸= 0. It is easy to show that OLS will overestimate β if
Cov [u i , εi ] > 0 and underestimate it if Cov [u i , εi ] < 0. If, however, we defined

νi = εi − E [εi |u i ]

and assumed linearity of E [εi |u i ], we could write εi = u i · θ + νi and recast (6.24)


as
y i = x i · β + u i · θ + νi (6.26)
If we could observe u i , we wouldn’t need the IV estimator at all, because νi is
orthogonal to both x i and u i ,15 so OLS would be our tool of choice to estimate
β and θ at the same time.
Unfortunately, we don’t observe u i directly, but once we’ve run the first stage
regression (6.25), we have the first-stage residuals û i , which hopefully shouldn’t
be too different: after all, (6.25) is a perfectly valid regression model, and the OLS
estimate of π is consistent. Therefore, the difference

u i − û i = (x i − w i π) − (x i − w i π̂) = (π̂ − π) w i
15 Lazy writers like myself love the sentence: “the proof is left to the reader as an exercise”.

should go to 0 asymptotically, since π̂ →p π; on these premises, the possibility of
using û i in place of the “true” u i is tempting.
From a computational point of view, hence, the control function approach
differs from the traditional two-stage method only in the second stage. Once you
have performed the first stage, you use, in the second stage, the residuals from
the first stage and add them as extra explanatory variables to the main equation.
Call E the first-stage residuals and perform an OLS regression of y on X and E
together:
y = Xβ + Eθ + ν ; (6.27)
the estimates we get have a very nice interpretation.
Let’s begin with β : by the Frisch-Waugh theorem (see section 1.4.4), the OLS
estimate of β is
(X′ ME X)−1 X′ ME y;
now focus on the matrix X′ ME :

X′ ME = X′ − X′ PE = X′ − X′ E(E′ E)−1 E′ .

Since, by definition, E = MW X, by substitution we have

X′ ME = X′ − X′ MW X(X′ MW X)−1 X′ MW = X′ − X′ MW = X′ PW ;

therefore, the coefficient vector becomes

(X′ ME X)−1 X′ ME y = (X′ PW X)−1 X′ PW y = β̃

So the OLS estimate of the coefficients associated with X in equation (6.27)


is exactly equal to the IV estimator β̃ . Warning: just like the two-stage proce-
dure, the control function approach does not yield correct standard errors for β̃ ;
explaining precisely why is far beyond the scope of this book, and readers will
have to content themselves with knowing that this is a consequence of using the
first-stage residuals û i in place of the true u i series.16
Moreover, the control function regression gives you, as a nice by-product,
the Hausman test, as an exclusion test for the first-stage residuals: by applying
the Frisch-Waugh theorem again, the OLS estimate of θ in (6.27) is

θ̂ = [E′M_X E]⁻¹ E′M_X y;
of course ê = MX y are the OLS residuals. Now use the definition of E as E = MW X


again:
E′ MX y = E′ ê = X′ MW ê = X′ ê − X′ PW ê = −X′ PW ê;
16 If you really want to know, the problem arises because we are using π̂ to compute û by treat-
i
ing it as if it was the true π, and hence ignoring the fact that π̂ is an estimate with a non-zero
variance. This is a case of generated regressors; section 6.1 of Wooldridge (2010) is where you
want to start from. Besides, Section 6.2 of the same book contains a much more accurate and
thorough treatment of the control function approach than what I’m giving you here.

where the last equality comes from the OLS residuals ê being orthogonal to X;
therefore

E′M_X y = −X′P_W y + X′P_W X · β̂ = [X′P_W X] (β̂ − β̃),

and as a consequence θ̂ is just

θ̂ = [E′M_X E]⁻¹ [X′P_W X] (β̂ − β̃);

since the matrix [E′M_X E]⁻¹ [X′P_W X] is invertible, θ̂ can be zero if and only if
β̂ = β̃. Therefore, the hypothesis H0 : θ = 0 is logically equivalent to the null
hypothesis of the Hausman test, and the test can be performed by a simple zero
restriction on θ̂ . If (as often happens) θ is a scalar, the result of the Hausman test
is immediately visible as the significance t -test associated with that coefficient.
A very nice feature of the control function approach is that it generalises
very naturally to settings where our estimator of choice is not a least
squares estimator, which happens quite often in applied work. But this book is
entitled “basic econometrics”, and I think I’ll just stop here.

OLS, using observations 1-1917


Dependent variable: lw

coefficient std. error t-ratio p-value


---------------------------------------------------------
const 0.943553 0.0647377 14.58 1.04e-45 ***
male 0.182926 0.0141790 12.90 1.42e-36 ***
wexp 0.00860475 0.000752195 11.44 2.34e-29 ***
educ 0.0792106 0.00425341 18.62 2.89e-71 ***
resid -0.0341349 0.00481934 -7.083 1.98e-12 ***

SSR = 173.423, R-squared = 0.315587

Table 6.6: Wage equation on the SHIW dataset — control function

Example 6.3
Using the SHIW data again, after storing the residuals from the first-stage regres-
sion (see Table 6.4) under the name resid, you can run an OLS regression like
the one in Table 6.1, with resid added to the list of regressors. The results are in
Table 6.6. Again, the coefficients for the original regressors are β̃ and again, the
standard errors are not to be trusted (they’re a bit smaller than the correct ones,
listed in Table 6.2). The t -test for the resid variable, instead, is interpretable as
a perfectly valid Hausman test, and the fact that we strongly reject the null again
is no coincidence.
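For completeness, here is a hansl sketch of the steps described in the example (same variable names as before):

# first stage and its residuals
ols educ const male wexp age peduc --quiet
series resid = $uhat
# control function regression (6.27): the coefficients on const, male, wexp
# and educ reproduce the IV estimates; the t-ratio on resid is the Hausman test
ols lw const male wexp educ resid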

6.6 The examples, revisited


6.6.1 Measurement error
A typical way to use IV techniques to overcome measurement error is the usage
of a second measurement of the latent variable, whose contamination error is
independent of the first one. In formulae: together with the two equations (6.1)
and (6.2)

yi = x i∗ β + εi
xi = x i∗ + η i

suppose you have a third observable variable w i such that

w i = x i∗ + νi

If η_i and ν_i are uncorrelated, then w_i is a valid instrument and the statistic

β̃ = ( Σᵢ w_i y_i ) / ( Σᵢ w_i x_i )

is a consistent estimator of β.
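A tiny simulation may help to see why this works. In the hansl sketch below (all names and numbers invented), OLS suffers from the usual attenuation bias, while IV based on the second measurement recovers the true coefficient:

nulldata 2000
set seed 789
series xstar = normal()               # the latent variable
series x = xstar + normal()           # first, contaminated measurement
series w = xstar + normal()           # second measurement, independent error
series y = 0.5*xstar + normal()       # true beta = 0.5
ols y x --quiet
printf "OLS: %g\n", $coeff(x)         # biased towards zero
tsls y x ; w --quiet
printf "IV:  %g\n", $coeff(x)         # close to 0.5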
One famous application of this principle is provided in Griliches (1976), a
landmark article in labour economics, where the author has two measurements
of individual ability and uses one for instrumenting the other.

6.6.2 Simultaneous equation systems


Consider again equations (6.4)–(6.5):

qt = α0 − α1 p t + u t
pt = β0 + β1 q t + v t .

As we proved in section 6.1.2, the systematic part of these two equations are not
conditional means, so there’s no way we can estimate their parameters consis-
tently via OLS.
On the other hand, we can use (6.7) to deduce that E[p_t] = π1; clearly, the
parameter π1 can be estimated consistently very simply by taking the average of
p_t (or regressing p_t on a constant, if you will). The same holds for E[q_t] = π0.
Can we use these two parameters to estimate the structural ones? The an-
swer is no: the relationship between the (π0 , π1 ) pair and the structural parame-
ters is
π0 = (α0 − β0 α1)/(1 + β1 α1)          π1 = (β0 + β1 α0)/(1 + β1 α1),
which is a system of 2 equations in 4 unknowns; as such, it has infinitely many
solutions. This is exactly the under-identification scenario we analysed in sec-
tion 2.5.

Let us now consider this example in greater generality: a system of linear


equations can be written as
Γyt = B xt + εt ; (6.28)
the yt vector contains the q endogenous variables, while xt is an m-vector hold-
ing the exogenous variables. The matrix Γ (assumed non-singular) is q × q and
B is a q × m matrix. In the demand-supply example,

Γ = [ 1  α1 ; −β1  1 ]          B = [ α0 ; β0 ].

Equation (6.28) is known as the structural form of the system, because its pa-
rameters have a behavioural interpretation and are our parameters of interest.
By pre-multiplying (6.28) by Γ−1 , you get the so-called reduced form:

yt = Πxt + ut , (6.29)

where Π = Γ−1 B and ut = Γ−1 εt . In our example, the matrix Π is a column vector,
containing π0 and π1 .
If x_t is exogenous, then Cov[x_t , ε_t] = 0, so the correlation between x_t and u_t is
zero; hence, OLS is a consistent estimator for the parameters of the reduced
form. However, by postmultiplying (6.29) by x′t you get:

yt x′t = Πxt x′t + ut x′t ,

which implies E[y_t x_t′] = Π E[x_t x_t′]. Ordinarily, this matrix should not contain
zeros; if variables were centred in mean, it would be the covariance matrix be-
tween the vector yt and the vector xt . Therefore, each element of xt is correlated
with each yt despite being uncorrelated with εt . In other words, xt is both rele-
vant and exogenous and, as such, is a perfect instrument.
In the example above, xt contains only the constant term, and the reduced
form looks like:
[ q_t ; p_t ] = [ π0 ; π1 ] · 1 + u_t ;
it should be clear where under-identification comes from: in the demand equa-
tion you have 2 regressors (the constant and p t ) but only one instrument (the
constant), and the same goes, mutatis mutandis, for the supply equation.
Consider now a different formulation, where:

qt = α0 − α1 p t + α2 y t + u t (6.30)
pt = β0 + β1 q t + β2 m t + v t , (6.31)

where we use the two new variables y t , the per-capita income at time t and m t ,
the price of raw materials at time t ; assume both are exogenous.
In this case, both equations are estimable via IV, because we have three re-
gressors and three instruments for each (the same for both: constant, y t and

m t ). In the context of simultaneous systems, the order condition I stated in sec-


tion 6.2.2 is often translated as: for each equation in the system the number of
included endogenous variables cannot be greater than the number of excluded
exogenous variables. I’ll leave it to the reader to work out the equivalence.
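With observable series for quantity, price, income and the price of raw materials (the names q, p, y and m below are hypothetical), each equation could be estimated in gretl along these lines:

# demand (6.30): p is instrumented by the excluded exogenous variable m
tsls q const p y ; const y m
# supply (6.31): q is instrumented by the excluded exogenous variable y
tsls p const q m ; const y m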

6.7 Are my instruments OK?


A fundamental requirement for the whole IV strategy is that the variables that
we choose to use as instruments are (i) exogenous and (ii) relevant. Of course,
we cannot assume that they are just because we say so, and it’d be nice to have
some way of testing these assumptions. It turns out that neither property can be
verified directly, but we do have statistics that we can interpret in a useful way.

6.7.1 The Sargan test


The Sargan test is often interpreted as a test for exogeneity of the instruments,
but in fact things are a bit more subtle.
Exogeneity, in our context, means uncorrelatedness between the instruments
w_i and the structural disturbances ε_i. Assume we are in the simplest case, where
V[ε] = σ²I and (1/n)W′W →p B. If the structural disturbances ε_i were observable, a
test would be straightforward to construct: under H0,

(1/√n) W′ε →d N(0, σ²B)

and it can be proven that, under H0,

ε′P_W ε / σ² →d χ²_m ,          (6.32)
where m is the size of wi . Unfortunately, ε is unobservable, and therefore the
quantity above is not a statistic and cannot be used as a test.
The idea of substituting disturbances with residuals like we did in section
6.5.1 takes us to the Sargan test. Its most important feature is that this test has
a different asymptotic distribution than (6.32), since the degrees of freedom of
the limit χ2 is not m (the number of instruments), but rather m − k, where k is
the number of elements in β : in other terms, the over-identification rank (see
section 6.2.1). In formulae,
S = ε̃′P_W ε̃ / σ̂² →d χ²_{m−k} .          (6.33)
This result may appear surprising at first. Consider, however, that under ex-
act identification the numerator of S is identically zero,17 so, in turn, the S statis-
tic is identically 0, not a χ2m variable. Can this result be generalised?
17 Blitz proof: if m = k, then β̃ = (W′X)⁻¹W′y. Therefore, P_W ε̃ = P_W(y − Xβ̃); however, observe
that P_W Xβ̃ = W(W′W)⁻¹W′X(W′X)⁻¹W′y = P_W y. As a consequence, P_W ε̃ = P_W y − P_W y = 0.

Take the conditional expectation E[y_i | w_i]; assuming linearity, if w_i is exoge-
nous we could estimate its parameters via OLS in a model like

y_i = w_i′π + u_i ;          (6.34)

in the simultaneous system jargon, equation (6.34) would be the reduced-form


equation for y i . To see how the parameters π relate to β , write the first-stage
equation (6.25) in matrix form as

X = WΠ + E

so that the i -th row of X can be written as x′i = w′i Π + e′i ; now substitute in the
structural equation:

y i = x′i β + εi = w′i Πβ + (εi + e′i β ); (6.35)

clearly, the two models (6.34) and (6.35) become equivalent only if π = Πβ ; in
fact, the expression Πβ can be seen as a restricted version of π , where the con-
straint is that π must be a linear combination of the columns of Π (or, more
concisely, that π ∈ Sp (Π)).
The Sargan test is precisely a test for those restrictions; this raises three ques-
tions:

1. How do we compute the test statistic?

2. What is its limit distribution?

3. Which interpretation must be given to a rejection?

Number 1 is quite easy: the IV residuals εe are the residuals from the restricted
model. All we have to do is apply the LM principle (see Section 3.5.1) and regress
those on the explanatory variables from the unrestricted model. Compute nR 2 ,
and your job is done. If you do, you end up exactly with the statistic I called S in
equation (6.33).
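In hansl, and re-using the wage equation of Section 6.3, the recipe amounts to a handful of lines (a sketch; gretl reports the same statistic automatically after tsls, as shown below in Table 6.7):

tsls lw const male wexp educ ; const male wexp age peduc --quiet
series ehat = $uhat                    # IV residuals
ols ehat const male wexp age peduc --quiet
scalar S = $nobs * $rsq                # the statistic in (6.33)
pvalue X 1 S                           # over-identification rank m - k = 1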
As for its limit distribution, the fact that the number of degrees of freedom of
the χ2 limit distribution is m −k can be intuitively traced back to the fact that the
number of parameters of the unrestricted model (6.34) is m, while the number
of the restricted parameters is k. Therefore, the number of constraints is m−k.18
Now we can tackle point 3: the null hypothesis in the Sargan test is that the
m relationships implicit in the equation π = Πβ are non-contradictory; if they
were contradictory, it would mean that at least one element of the vector E[w_i ε_i] is non-zero,
18 A more rigorous argument goes as follows: if the restriction is true, then P_Π π = π, or, equiva-
lently, M_Π π = 0. Since the rank of M_Π is m − k, we can write

M_Π π = UV′π = 0

where V and U are matrices with m rows and m − k columns (see section 1.A.3). So, the null
hypothesis implicit in the restriction is H0 : V′π = 0, which is a system of m − k constraints.

and therefore at least one instrument is not exogenous. Unfortunately, the test
cannot identify the culprit.
To clarify the matter, take for example a simple DGP where y i = x i β + εi and
we have two potential instruments, w 1,i and w 2,i . We can choose among three
possible IV estimators for β:
1. a simple IV estimator using w 1,i only (call it β1 );

2. a simple IV estimator using w 2,i only (call it β2 );

3. the GIVE estimator using both w 1,i and w 2,i (call it β12 ).
Suppose β1 turns out to be positive (and very significant), and β2 turns out to
be negative (and very significant); clearly, there must be something wrong. At
least one between β1 and β2 cannot be consistent. The Sargan test, applied to
the third model, would reject the null hypothesis and inform us that at least one
of our two instruments is probably not exogenous. Hence, β12 would be cer-
tainly inconsistent, and we’d have to decide which one to keep between β1 and
β2 (usually, on the basis of some economic reasoning). If, on the other hand, we
were unable to reject the null, then we would probably want to use β12 for effi-
ciency reasons, since it’s the one that incorporates all the information available
from the data.
In view of these features, the Sargan test is often labelled overidentification
test, since what it can do is, at most, detect whether there is a contradiction
between the m assumptions we make when we say “I believe instrument i is
exogenous”.
Sargan over-identification test -
Null hypothesis: all instruments are valid
Test statistic: LM = 0.138522
with p-value = P(Chi-square(1) > 0.138522) = 0.709755

Table 6.7: Wage equation — Sargan test

Example 6.4
Table 6.7 is, again, an excerpt from the full output that gretl gives you after IV
estimation (the main table is 6.2) and shows the Sargan test for the wage equa-
tion we’ve been using as an example; in this case, we have 1 endogenous vari-
able (education) and two instruments (age and parents’ education), so the over-
identification rank is 2 − 1 = 1. The p-value for the test is over 70%, so the null
hypothesis cannot be rejected. Hence, we conclude that our instruments form a
coherent set and the estimates that we would have obtained by using age alone
or parents’ education alone would not have been statistically different from one
another. Either our instruments are all exogenous or they are all endogenous,
but in the latter case they would all be wrong in exactly the same way, which
seems quite unlikely.

6.7.2 Weak instruments


The relevance condition for instruments looks deceptively simple to check: in-
struments are relevant if the matrix A = E xi w′i is full-rank. Of course, this ma-
£ ¤

trix is quite easy to estimate consistently, so in principle testing for relevance is


straightforward.

In fact, the point above is subtler than it looks: estimating the rank of a matrix is not exactly
trivial, because the rank is an integer, so most of our intuitive ideas on the relationship be-
tween an estimator and the unknown parameter (that make perfect sense when the latter is
a real number) break down. Constructing a test for the hypothesis rk(A) = k is possible, but re-
quires some mathematical tools I chose not to include in this book. The interested reader may
want to google for “canonical correlations”.

The practical problem we are often confronted with is that, although an in-
strument is technically relevant, its correlation with the regressors could be so
small that finite-sample effects may become important. In this case, that instru-
ment is said to be weak.19
The problem is best exemplified by a little simulation study: consider the
same model we used in section 6.5.1:

yi = x i · β + εi (6.36)
xi = w i · π + ui (6.37)
(ε_i , u_i)′ ∼ N( 0, [ 1  0.75 ; 0.75  1 ] )

with the added stipulation that w i ∼ U (0, 1). Of course, equation (6.36) is our
equation of interest, while (6.37) is the “first stage” equation. Since εi and u i
are correlated, then x i is itself correlated with εi , and therefore endogenous;
however, we have the variable w i , which meets all the requirements for being
a perfectly valid instrument: it is exogenous (uncorrelated with εi ) and relevant
(correlated with x i ), as long as the parameter π is nonzero.
However, if π is a small number the correlation between x i and w i is very
faint, so w i is weak, despite being relevant. In this experiment, the IV estimator
is simply

β̃ = (W′X)⁻¹W′y = ( Σᵢ w_i y_i ) / ( Σᵢ w_i x_i ):

if you scale the denominator by 1/n, it’s easy to see that its probability limit is
non-zero; however, its finite-sample distribution could well be spread out over
a wide interval of the real axis, so you can end up dividing the numerator by
19 Compared to the rest of the material contained in this book, inference under weak instru-

ments is a fairly recent strand in econometric research. A recent review article I heartily recom-
mend is Andrews et al. (2019), but a fairly accessible introductory treatment can also be found
in Hill et al. (2018), Chapter 10. Chapter 12 of Hansen (2019) is considerably more technical, but
highly recommended.

an infinitesimally small number and have an inordinately large statistic (not to


mention the possibility of getting a wrong sign). Let me stress that this is a finite-
sample issue: asymptotically, there are no problems at all; but since all we ever
have are finite samples, the problem deserves consideration.
In order to show you what the consequences are, I generated 10000 artifi-
cial samples with 400 observations each and ran OLS and IV on equation (6.36),
setting β = 1 and π = 1.
The results of the experiment are plotted in the left picture in Figure 6.1. If
you want to repeat the experiment on your PC, the gretl code is at subsection
6.A.4. As you can see, everything works as expected: OLS has a smaller vari-
ance, but is inconsistent (none of the simulated β̂ gets anywhere near the true
value β = 1); IV, on the other hand, shows larger dispersion, but its distribution
is nicely centred around the true value.

[Two boxplot panels comparing the simulated OLS and IV estimates of β; left panel: π = 1, right panel: π = 1/3.]

Figure 6.1: Weak instruments: simulation study

If instead you set π = 1/3, the simulation gives you the results plotted in the
right-hand panel. Asymptotically, nothing changes: however, the finite-sample
distribution of β̃ is worrying. Not only is its dispersion rather large (and there are
quite a few cases when the estimated value for β is negative): its distribution is
very far from being symmetric, which makes it questionable to use asymptotic
normality for hypothesis testing. I’ll leave it to the reader to figure out what
happens if the instrumental variable w i becomes very weak, which is what you
would get by setting π to a very small value, such as π = 0.1.
More generally, the most troublesome finite-sample consequences of weak
instruments are:

• the IV estimator is severely biased; that is, the expected value of its finite
sample distribution may be very far from the true value β ;20

• even more worryingly, the asymptotic approximations we use for our test
statistics may be very misleading.
20 I should add “provided it exists”; there are cases when the distribution of the IV estimator has

no finite moments.

How do we spot the problem? Since this is a small-sample problem, it is not


easy to construct a test for weak instruments: what should its null hypothesis be,
precisely? What we can do, at most, is use some kind of descriptive statistic
telling us if the potential defects of the IV estimator are likely to be an actual
problem for the data we have.
For the simplest case, where you only have one endogenous variable in your
model, the tool everybody uses is the so-called “first-stage F test”, also labelled
partial F statistic. You compute it as follows: take the first-stage regression
(6.23) and perform an F -test (see Section 3.5.1) for the exclusion of the “true”
instruments (that is, the elements of wi not contained in xi ). The suggestion
contained in Staiger and Stock (1997) was that a value less than 10 could be taken
as an indication of problems related to weak instruments.
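Computationally, this is just an exclusion test in the first-stage regression; a hansl sketch for the SHIW example (gretl reports the statistic automatically after tsls, as in Example 6.5 below) is:

ols educ const male wexp age peduc --quiet
omit age peduc --test-only     # exclusion test for the "true" instruments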

Example 6.5
The weak instrument test for the example on the SHIW data we’ve been using in
this chapter gives:

Weak instrument test -


First-stage F-statistic (2, 1912) = 271.319
Critical values for desired TSLS maximal size, when running
tests at a nominal 5% significance level:

size 10% 15% 20% 25%


value 19.93 11.59 8.75 7.25

Maximal size is probably less than 10%

The first-stage F statistic for the wage equation, as reported by gretl is 271.319,
which is way above 10, so we don’t have to worry.

The generalisation of this statistic is the so-called Cragg-Donald statistic,


whose description and interpretation is somewhat more involved, and I’ll just
point you to the bibliographic references I made at the start of this section.
Finally, a warning: problems similar to weak instruments may also arise
when the overidentification rank becomes large: the over-identification rank
is usually a rather small number, but in some contexts it could happen that we
have an abundance of instruments. Common sense dictates that we should use
as much information as we have available, but in finite samples things may not
be so straightforward. A thorough analysis of the problem quickly becomes very
technical, so I’ll just quote Hall (2005), which contains an excellent treatment of
the issue.

There is a far more complex relationship between the behaviour of


the [IV] estimator and the properties of the instrument vector in fi-
nite samples than is predicted by asymptotic theory.

6.A Assorted results


6.A.1 Asymptotic properties of the IV estimator
The limit in probability of the IV estimator (see 6.2.1 for its derivation)

β̃ = (X′ PW X)−1 X′ PW y,

can be calculated by rewriting the equation above as a function of statistics for


which the probability limits can be computed easily. Clearly, some regularity
conditions (such as the observations being iid, for example) are assumed to hold
so that convergence occurs; we’ll take these as given and assume that sample
moments converge in probability to the relevant moments.
Given the linear model y_i = x_i′β + ε_i, we assume that:

1. (1/n) Σ_t x_t w_t′ = X′W/n →p A, where rk(A) = k;

2. (1/n) Σ_t w_t w_t′ = W′W/n →p B, where B is invertible;

3. (1/n) Σ_t w_t ε_t = W′ε/n →p 0;

then β̃ = (X′P_W X)⁻¹X′P_W y →p β. The proof is a simple application of Slutsky’s
theorem:

β̃ = β + (X′P_W X)⁻¹X′P_W ε =
   = β + [ (X′W/n)(W′W/n)⁻¹(W′X/n) ]⁻¹ (X′W/n)(W′W/n)⁻¹(W′ε/n),

so that

β̃ →p β + [AB⁻¹A′]⁻¹ AB⁻¹ · 0 = β.          (6.38)

It is instructive to consider the role played by the ranks of A and B ; the matrix
B must be invertible, because otherwise AB −1 A ′ wouldn’t exist. Since B is the
probability limit of the second moments of the instruments, this requirement is
equivalent to saying that all instruments must carry separate information, and
cannot be collinear.

Note that the requisite is only that the instruments shouldn’t be collinear: the stronger req-
uisite of independence is not needed. As a consequence, it is perfectly OK to use nonlin-
ear transformations of one instrument to create additional ones.
For example, if you have a variable w_i that you assume independent of ε_i, you can use
as instruments w_i, w_i², log(w_i), . . . (provided of course that the transformed variables have fi-
nite moments).
This strategy is a special case of something called identification through nonlinearity; al-
though it feels a bit like cheating (and is frowned upon by some), it is perfectly legiti-
mate, at least asymptotically, as long as each transformation carries some extra amount of
information.

The rank of A, instead, must be k for [AB⁻¹A′] to be invertible. This means
that the instruments w_i must be relevant (see Section 6.2.2) for all the regressors
x_i. If [AB⁻¹A′] is not invertible, then the probability limit in (6.38) does not
exist. If, instead, it’s invertible, but very close to being singular (as in the case
of weak instruments — see Section 6.7.2), then its inverse will be a matrix with
inordinately large values. This is mainly a problem for the distribution of β̃: if
we also assume

4. (1/√n) Σ_t w_t ε_t = W′ε/√n →d N(0, Q);

then √n(β̃ − β) →d N(0, Σ), where

Σ = [AB⁻¹A′]⁻¹ AB⁻¹QB⁻¹A′ [AB⁻¹A′]⁻¹.

In the standard case Q = σ²B (from the homoskedasticity assumption E[εε′ | W] = σ²I), and therefore

√n(β̃ − β) →d N( 0, σ² [AB⁻¹A′]⁻¹ ).          (6.39)

So the precision of the IV estimator is severely impaired any time the matrix
[AB⁻¹A′] is close to being singular.
The last thing is proving that σ̃² is consistent: from

ε̃ = X(β − β̃) + ε,

you get

ε̃′ε̃ = (β − β̃)′X′X(β − β̃) + 2(β − β̃)′X′ε + ε′ε

and therefore

(1/n) ε̃′ε̃ = (β − β̃)′ (X′X/n) (β − β̃) + 2(β − β̃)′ (X′ε/n) + (ε′ε/n).

By taking probability limits,

σ̃² = (1/n) ε̃′ε̃ →p 0′(Q)0 + 2 · 0′λ + σ² = σ²,

where Q = plim(X′X/n) and λ = plim(X′ε/n). Note that σ̃² is consistent even though
λ ̸= 0. Consistency of σ̃2 is important because it implies that we can use the
empirical counterparts of the asymptotic covariance matrix in equation (6.39)
and use σ̃2 (X′ PW X)−1 as a valid covariance matrix for Wald tests.

6.A.2 Proof that OLS is more efficient than IV


In the OLS vs IV case, the proof that OLS is more efficient than IV if X is exoge-
nous can be given as follows: given the model y i = x′i β + εi , define the following
quantities:

X′ X W′ X W′ W
µ ¶ µ ¶ µ ¶
Q = plim A = plim B = plim ;
n n n

under homoskedasticity, we have that AV[β̂] = σ²Q⁻¹ (see section 3.2.2) and
AV[β̃] = σ²[A′B⁻¹A]⁻¹ (see 6.A.1), where σ² = V[ε_i]. In order to prove that
AV[β̃] − AV[β̂] is positive semi-definite, we re-use two of the results on positive
definite matrices that we employed in section 4.A.2:
definite matrices that we employed in section 4.A.2:

1. if Q and P are invertible and Q − P is psd, then P −1 −Q −1 is also psd;

2. if Q is psd, then P ′QP is also psd for any matrix P .

Begin by applying property 1 above and define

∆ ≡ σ² · ( AV[β̂]⁻¹ − AV[β̃]⁻¹ ) = Q − A′B⁻¹A;

Since σ2 > 0, it is sufficient to prove that ∆ is psd. Now define the vector z′i =
[x′i w′i ] and consider the probability limit of its second moments:

C = plim(Z′Z/n) = [ Q  A′ ; A  B ]

where C is clearly psd; now define H as

H = [ I   −A′B⁻¹ ]

so, by property 2, the product HC H ′ is also psd; but

HCH′ = [ I  −A′B⁻¹ ] [ Q  A′ ; A  B ] [ I ; −B⁻¹A ] = Q − A′B⁻¹A = ∆,

and the proof is complete.

6.A.3 Covariance matrix for the Hausman test (scalar case)


Suppose we have two consistent estimators of a scalar parameter θ; call them θ̂
and θ̃; assume also that the joint asymptotic distribution of θ̂ and θ̃ is normal.
Then,
AV[ (θ̂, θ̃)′ ] = [ a  b ; b  c ].

Consider now the statistic θ̀(λ) = λθ̂ + (1 − λ)θ̃, where λ ∈ R. Obviously, θ̀(λ) is
also a consistent estimator for any λ:
θ̀(λ) →p λθ + (1 − λ)θ = θ.

Its asymptotic variance is

AV[θ̀(λ)] = (λ, 1 − λ) · [ a  b ; b  c ] · (λ, 1 − λ)′ = λ²a + 2λ(1 − λ)b + (1 − λ)²c.

If you choose λ so that AV[θ̀(λ)] is minimised, you get

λ* = (c − b)/(a − 2b + c);

Now, if θ̂ is efficient, the optimal value of λ* must be 1, because θ̀(λ*) cannot be more efficient than θ̂, so the two statistics must coincide. But if λ* = 1, then a = b. Therefore, if θ̂ is efficient, the joint asymptotic covariance matrix is

AV[(θ̂, θ̃)′] = [a  a; a  c],

so

AV[θ̂ − θ̃] = (1, −1) [a  a; a  c] (1, −1)′ = c − a = AV[θ̃] − AV[θ̂].
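To see what this result buys us, here is the scalar Hausman statistic written out as a small LaTeX fragment (this is the standard use of the result; the vector-valued version is the test discussed in Section 6.4 and again in Chapter 7):

% under H0 both estimators are consistent and theta-hat is efficient,
% so the asymptotic variance of the contrast is just c - a
H = \frac{(\hat{\theta}-\tilde{\theta})^{2}}
         {\widehat{AV}[\tilde{\theta}]-\widehat{AV}[\hat{\theta}]}
    \xrightarrow{d} \chi^{2}(1).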

6.A.4 Hansl script for the weak instrument simulation study

function matrix run_experiment (scalar pi)
    # function to generate the simulations

    scalar rho = 0.75       # correlation between u and eps

    series w = uniform()    # generate the instrument
    scalar H = 10000        # number of replications
    matrix b = zeros(H, 2)  # allocate space for the statistics

    loop h = 1..H --quiet
        # the structural disturbances (unit variance)
        series eps = normal()
        # the reduced form disturbances (unit variance,
        # correlated with eps by construction)
        series u = rho * eps + sqrt(1 - rho^2) * normal()
        # generate x via the first-stage equation
        series x = pi*w + u
        # generate y via the structural equation
        series y = x + eps
        # estimate beta by OLS
        ols y x --quiet
        # store OLS estimate into the 1st column of b
        b[h, 1] = $coeff
        # estimate beta by IV
        tsls y x ; w --quiet
        # store IV estimate into the 2nd column of b
        b[h, 2] = $coeff
    endloop

    cnameset(b, strsplit("OLS IV"))  # set column names for matrix b

    return b
end function

###
### main
###

set verbose off
nulldata 400
set seed 1234  # set random seed for replicability

# run experiment with pi = 1.0 and plot results
b10 = run_experiment(1.0)
boxplot --matrix=b10 --output=display
# run experiment with pi = 0.333 and plot results
b03 = run_experiment(1/3)
boxplot --matrix=b03 --output=display
Chapter 7

Panel data

7.1 Introduction
So far, we have made a sharp distinction between cross-sectional and time-
series datasets. In a cross-section, you observe a “screenshot” of many individ-
uals at a certain time; a time series, instead, observes one thing through time.
In panel datasets, you observe multiple individuals (that we will generally
call units) through time. Therefore, the typical element of a variable y will bear
a double subscript: y i ,t is the value for unit i at time t . In a parallel fashion,
the explanatory variables will be indexed similarly, as xi ,t . As a consequence, we
merge the two conventions used earlier and assume that i = 1 . . . n and t = 1 . . . T
so, for example, a typical excerpt of a panel dataset looks more or less like this:

id    year    y    x    z
...
451   2015   12    1    1
451   2016   14    1    0
451   2017   11    3    0
452   2010   12    5    0
452   2011   12    2    1
...

In this example, the “id” column identifies the different units, and the “year”
column identifies time, so the first row shown says that the value of y for unit
451 in the year 2015 is 12, or, in formulae, y 451,2015 = 12, x 452,2010 = 5, and so on.
In this chapter I will use the symbol N for the total number of observations.
From a practical point of view, a panel dataset may be balanced or unbalanced:
in the former case, you observe data over a common time range for each unit,
so you get valid data for each (i , t ) combination and N = n · T . Otherwise, some
rows may be missing, and not all time periods are available for all units, so N <
nT . This is the most common case in practice.


Typically, most panel datasets will contain data for many units for short time
periods: this situation is normally referred to as the “large n, small T ” case,
but other cases are possible. For example, macroeconomists regularly deal with
datasets where units are countries and the amount of data can be considerable
in the time dimension. In most microeconomic applications, however, you have
many individuals observed for short time spans. As we will see, this aspect be-
comes important for the asymptotic analysis of the estimators we have for panel
datasets.

In some cases, it makes sense to consider the factors that provoke the appearance or disappearance of a certain unit in the dataset. A classic example is firms going bankrupt. Of course, these random factors may interact with the Data Generating Process in very subtle ways. This phenomenon is known as sample attrition and in some cases may be very relevant to the empirical analysis. In the elementary treatment we give here, however, we assume that this issue is moot, as the factors that determine whether a unit is observable or not are completely independent from the DGP we want to study.

The importance of panel datasets has grown exponentially since the IT rev-
olution of the 1980s-1990s: more and more datasets of this type are available,
simply as a consequence of the mechanisation of databases. For example: I have
been doing my weekly shopping for more than thirty years always at the same
supermarket chain, and I regularly pass my customer card each time. Those
guys, potentially, know everything about my habits: what I like, what I dislike,
how much I spend each week, what I buy only during a discount promotion,
and so on. And they have the same information about millions of customers.
Just imagine the kind of datasets giants like Amazon possess. It should be no sur-
prise that econometricians have devoted a lot of energy into methods for panel
datasets and, as always, this book will only scratch the surface. If you want to go
deeper, Wooldridge (2010) is what everybody considers the ultimate reference,
but in my opinion Hsiao (2022) is also a must-have.
A mechanical application of the line of thought we followed in chapter 3 would disregard the panel nature of the dataset entirely and just focus on the regression function E[y_{i,t} | x_{i,t}]. Of course this approach is possible, and leads to our usual OLS statistic, which in this context is often called the pooled estimator of the conditional mean parameters. While this is a technically valid procedure, it is almost never a good idea, because we can do clever things with the information contained in the panel structure of the dataset and redefine the object of our interest (from E[y_{i,t} | x_{i,t}] to something else), like we did in chapter 5, so as to give a much more meaningful description of the DGP.

7.2 Individual effects


Consider the balanced panel dataset displayed in Table 7.1, where you have N =
18 observations, pertaining to n = 3 different units, with T = 6. The application

Table 7.1: Small example panel dataset

id time y x
1 1 1.6 1.6
1 2 1.0 1.8
1 3 2.2 1.0
1 4 2.0 1.0
1 5 1.8 1.0
1 6 2.2 0.8
2 1 3.2 4.2
2 2 3.4 3.2
2 3 3.0 4.2
2 4 3.6 2.4
2 5 3.8 3.2
2 6 3.2 3.6
3 1 3.8 6.8
3 2 5.0 4.8
3 3 5.2 5.4
3 4 4.6 5.8
3 5 4.4 6.0
3 6 3.6 7.0

of OLS to these data gives the “pooled” estimate of E[y|x], which is

ŷ = 1.62 + 0.45x,

with an R 2 index of 60.7%. The slope parameter, our customary indicator of the
relationship between x and y, equals 0.45 and is very significant (its t -ratio is
4.97). What you see is a strong, significant positive link between x and y.

Figure 7.1: Example data with OLS line


Figure 7.1 displays the data together with the fitted line, using different symbols to identify different units. In this context, the model we’re fitting is

y i ,t = x′i ,t β + εi ,t (7.1)

and we’re not using the information we have on the different units at all.
How can we improve on the above? The idea is to introduce heterogeneity
between units into the picture, and generalise the DGP by allowing for the pos-
sibility of each unit having its own set of parameters. A fully general application
of this principle would entail considering an object like
m_i(x) = E[y_{i,t} | x_{i,t}]

(note that the regression function m has a subscript i ). In principle, this ap-
proach would lead us to estimating a different regression function for each unit,
which is undesirable for various reasons: first, in the typical “large n, small T ”
scenario, it is quite possible that T , the number of observations you have for
one unit, is smaller than k, the number of parameters in your model, which
would make estimation impossible. Moreover, that level of generality is not even
needed. In most contexts, it is perfectly reasonable to assume that heterogene-
ity between units does not affect the marginal effects of x on y. In other words,
even if individuals are different, it’s often likely that the way they respond to vari-
ations in the observables is the same. If this is the case, then β is the same for all
units and we may settle for

y i ,t = x′i ,t β + αi + εi ,t , (7.2)

where the αi term is commonly known as the individual effect. We can use
vectors and matrices for writing (7.2) more compactly, expressing all the obser-
vations for unit i as
yi = Xi β + αi ι + εi (7.3)
where of course y_i is a T × 1 vector, X_i is a T × k matrix, and ι is, as usual, a conformable vector of ones.1 The presence of the individual effects in equation (7.2)
means that each unit is potentially different from all the others because there is
a term αi , constant through time, that shifts the level of y i ,t by some amount.
The β vector, instead, is homogeneous across units.
In the simplest cases, it is customary to assume that, once heterogeneity is
taken into account, the disturbances are well-behaved, so the covariance matrix
for the whole ε vector is σ2ε I , where σ2ε is a positive scalar and I is a N ×N identity
matrix. More general scenarios will be considered in section 7.3.4.
There are two main points to note about individual effects:

1. individual effects are unobservable (anything observable can be part of


the set of explanatory variables);
1 If the panel were unbalanced, each unit would have its own time span, and we should say T
i
rather than T . But we’ll avoid this complication.

2. individual effects are time invariant.

For example, imagine that y i ,t is the percentage of malnourished popula-


tion in country i at time t . There could be many factors that explain differences
between observations, the most obvious one being GDP per capita; this is ob-
servable, so it goes into xi ,t . Another one could be the fertility of soil; for the
sake of the example, assume that fertility is unobservable. Clearly, the soil of
each country is typical of that country. If we also assume that its characteristics
don’t change through time, soil fertility is one of the many possible factors that
may contribute to αi .
At this stage, we’re making no assumptions on the relationship between ob-
servables and individual effects. For all we know, soil quality and per capita GDP
could be related or not. Another example, close to the one I used in section 6.3,
is the Mincer wage equation: if you have a panel dataset with individuals’ wage
and education, the individual effect could be rightfully interpreted as “unob-
servable ability”. Is it independent of education? Maybe it is, maybe not. In the
toy dataset depicted in Figure 7.1, the average value of x seems to be different
across units, so one could think that observable and unobservable factors are
unlikely to be independent of one another.
For the estimation of β in equation (7.2), there are two ways to take individ-
ual effects into account:

fixed-effects approach: treat the individual effects as parameters to estimate


and make no assumptions about them.2 This approach is described in
Section 7.3.

random-effects approach: make some assumptions on the individual effects


and treat them as random variables. This leads to more efficient esti-
mators, provided certain conditions are met (but if they aren’t the con-
sequences could be catastrophic). Section 7.4 is about this.

7.3 Fixed effects


7.3.1 Using dummy variables
In this section, we treat individual effects as parameters. Therefore, a very crude
way to estimate equation (7.2) is to add individual dummies, that is

y i ,t = x′i ,t β + α1 d i1,t + α2 d i2,t + · · · + αn d in,t + εi ,t (7.4)

2 This approach, in principle, may lead to some complications because, apart from β , you have

n different αi parameters to estimate. In general, when the number of parameters to estimate is


not fixed, but is a function of the sample size, we may not be able to estimate consistently any
of them. In the statistical literature, this is known as the incidental parameters problem, but
fortunately in linear models (the only ones we consider here) we don’t have to worry about this
issue: more on this at the end of section 7.3.3.

where d i1,t , d i2,t etc are a set of dummies for unit 1, 2 and so on, respectively (so
no, the number near the letter d is not an exponent). Therefore, the model for
unit k would simply reduce to (7.2), since for that unit d ik,t = 1 and all the other
dummies are 0. If units differ on account of some unobserved factor that shifts
the level of y for each one of them but keeps the marginal effects β equal across
units, then we have an ordinary linear model in which each unit has its own
intercept.
In matrix notation, eq. (7.4) would read

y = Xβ + Dα + ε, (7.5)

where, with the vector notation used in equation (7.3), the relevant matrices look like this:

y = [y_1; y_2; …; y_n]    X = [X_1; X_2; …; X_n]    D = [ι 0 … 0; 0 ι … 0; …; 0 0 … ι]    ε = [ε_1; ε_2; …; ε_n];

note that X is an N × k matrix, whereas D is N × n.3


As usual, the parameters we’re interested in are the β vector, and the esti-
mate you get by applying OLS to (7.4) is known as the LSDV (Least Squares with
Dummy Variables) estimator. In principle, one could also consider the estimates
of the individual effects αi , but this is less interesting and is not done very often.
What is the interpretation of β in this context, and how different is it from
its “pooled” counterpart? The estimate we get of β from equation (7.4) is the
marginal effect of x on y once heterogeneity between units has been taken into
account.
For example, the estimates you’d get by applying model (7.4) to the example
dataset in Table 7.1 are

ŷ i ,t = −0.62x i ,t + 2.54d i1,t + 5.53d i2,t + 8.15d i3,t

and the fitted value lines are displayed in Figure 7.2. Not only does the R² jump to 92.8% here, but the slope coefficient changes sign (also: it’s even more significant)! Do
we have a contradiction here? Not really: if we look at what happens if we follow
each unit through time, we have a negative association between y and x. In each
individual’s experience, when x goes up, y goes down (on average). However, in
our example units with larger values of x generally have larger values of y, so
the overall conditional mean of y on x has a positive slope, because it doesn’t
take into account unobservable differences between individuals. By explicitly
3 For you linear algebra addicts: the structure of the D matrix could be handled in a very ele-

gant and effective way using a cool tool called Kronecker product: those interested may jump to
Section 7.A.1.

Figure 7.2: Example data with FE estimate

considering individual effects, we eliminate heterogeneity and shed light on the


negative relationship that each individual observes.
From a practical point of view, the insertion of the unit dummies creates a
few issues. First, the regressor matrix would have k + n columns, so if n is large
OLS estimation involves the inversion of a disproportionately large matrix, but
even that wouldn’t be a serious problem for modern computers unless n is in the
thousands or so. In addition, to carry out estimation you can’t have a constant
in your model unless you drop one of the unit dummies to avoid the collinearity
problem known as the “dummy trap” (see Section 1.3.3), but apart from this, es-
timation is a straightforward application of OLS. In the example above, inserting
a constant and dropping the dummy for unit 1 would give

ŷ i ,t = 2.54 − 0.62x i ,t + 2.98d i2,t + 5.60d i3,t ,

which is clearly equivalent (apart from rounding errors).


Finally, the possibility of time-invariant regressors raises a somewhat more
delicate point: these cannot coexist with the unit dummies for collinearity rea-
sons. The classic example is gender: if units are persons, the possibility of ob-
serving an individual changing their gender in our sample is usually very low.
Therefore, one of the columns of the X matrix will contain, for each unit, the
same value repeated from t = 1 to T . It is a simple exercise to prove that such a
column would be a linear combination of the columns of D, and therefore OLS
would be unfeasible. However, this issue can be circumvented by using a slightly
different estimation technique, that I’ll illustrate in Section 7.4.2.
One nice thing about this setup is that it makes heterogeneity testable rather easily. Assuming (without loss of generality) that we drop the individual dummy for the first unit, the null hypothesis for the test is H0 : α2 = α3 = … = αn = 0, which would be equivalent to homogeneity across units. Under H0 the preferred model would be the pooled one, so this kind of test is often termed a poolability test. The details of the test are unimportant: this is just a linear test on parameters of the Rβ = d form, so you have the choice of using any of the procedures described in section 3.5; of course, this is a joint test where the number of hypotheses is n − 1.
In the simple example above the F -form of the test would yield a p-value of
7.55615e-08, so the visual impression that units 1, 2 and 3 are indeed different
from each other would be strongly confirmed.
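A minimal hansl sketch of the LSDV approach, using the toy data of Table 7.1 (typed in by hand from the table, so the estimates should match those reported above up to rounding); genr unitdum is gretl's shortcut for creating the unit dummies, and the final omit command should deliver the poolability F-test just mentioned.

nulldata 18
setobs 6 1:1 --stacked-time-series   # 3 units, T = 6

# toy data from Table 7.1
matrix my = {1.6; 1.0; 2.2; 2.0; 1.8; 2.2; 3.2; 3.4; 3.0; 3.6; 3.8; 3.2; 3.8; 5.0; 5.2; 4.6; 4.4; 3.6}
matrix mx = {1.6; 1.8; 1.0; 1.0; 1.0; 0.8; 4.2; 3.2; 4.2; 2.4; 3.2; 3.6; 6.8; 4.8; 5.4; 5.8; 6.0; 7.0}
series y = my
series x = mx

# pooled OLS, for comparison
ols y const x

# LSDV: unit dummies du_1, du_2, du_3; no constant, to avoid the dummy trap
genr unitdum
ols y x du_1 du_2 du_3

# equivalent version with a constant, dropping du_1; the poolability test is
# then the joint test on the remaining dummies
ols y const x du_2 du_3
omit du_2 du_3 --test-only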

7.3.2 The “within” transformation


The LSDV approach is computationally very inefficient if you’re in the typical
“large n, small T ” case, because of the column size of the regressor matrix.4 For-
tunately, there is a very handy approach for obtaining the same statistic in a
different way. This approach also has the virtue of highlighting a few features of
the estimator, and is based on the so-called within transformation.
The within transformation for a variable essentially amounts to subtracting
the per-unit averages. For example, the within transformation for y i is

yei ,t ≡ y i ,t − ȳ i ,

where ȳ i is the average of the observations for unit i . Following most of the
literature, I will use the tilde as a decoration for within-transformed variables.
The name “within” comes from a traditional decomposition of the variance of a variable. A precise definition of the decomposition of variance into “within” and “between” components is one of those pedantic things that make descriptive statistics one of the most boring things on Earth. Let’s just say that the transformation above annihilates all the differences between units (the per-unit average of ỹ_{i,t} is 0 by construction) and all the information that is left comes from variability within units through time (hence the name). Therefore, the within transformation of a time-invariant variable, such as gender in the example above, gives you a vector of zeros.
The matrix representation of the within transformation is very useful: at the
very beginning of this book (see Section 1.2) I showed that the average of yi can
be written as
ȳ i = (ι′ ι)−1 ι′ yi .

Therefore, we can easily compute the vector of the deviations of y_i from its own mean as

ỹ_i = y_i − ι ȳ_i = y_i − ι(ι′ι)⁻¹ι′ y_i = y_i − P y_i = Q y_i;

where I used P and Q as synonyms for Pι and Mι , respectively (we’ll use these
matrices quite often in Section 7.4, so it’s good to have a quick alternative nota-
tion; besides, I’m trying to stay consistent with the notation traditionally used in
4 Conversely, if T really is a small number, nothing prevents you from also adding “time dum-

mies” for t = 1, t = 2 etc. This is actually quite common practice. Section 7.A.4 shows how this
works in practice.

most textbooks).5 The reader is invited to check that applying the within trans-
formation to the toy example in Table 7.1 gives the data shown in Table 7.2.

Table 7.2: Within tranformation

y x ȳ x̄ ỹ x̃
1.6 1.6 1.8 1.2 -0.2 0.4
1 1.8 1.8 1.2 -0.8 0.6
2.2 1 1.8 1.2 0.4 -0.2
2 1 1.8 1.2 0.2 -0.2
1.8 1 1.8 1.2 0 -0.2
2.2 0.8 1.8 1.2 0.4 -0.4
3.2 4.2 3.367 3.467 -0.167 0.733
3.4 3.2 3.367 3.467 0.033 -0.267
3 4.2 3.367 3.467 -0.367 0.733
3.6 2.4 3.367 3.467 0.233 -1.067
3.8 3.2 3.367 3.467 0.433 -0.267
3.2 3.6 3.367 3.467 -0.167 0.133
3.8 6.8 4.433 5.967 -0.633 0.833
5 4.8 4.433 5.967 0.567 -1.167
5.2 5.4 4.433 5.967 0.767 -0.567
4.6 5.8 4.433 5.967 0.167 -0.167
4.4 6 4.433 5.967 -0.033 0.033
3.6 7 4.433 5.967 -0.833 1.033

With the help of the within transformation, we’ll rewrite equation (7.2) so
as to eliminate the individual effects.6 If you average observations for unit i
through time, you get

ȳ_i ≡ (1/T) Σ_{t=1}^{T} y_{i,t} = (1/T) Σ_{t=1}^{T} (x_{i,t}′β + α_i + ε_{i,t}) = x̄_i′β + α_i + ε̄_i,    (7.6)

in obvious notation. Now subtract equation (7.6) from (7.2):

ỹ_{i,t} = x̃_{i,t}′β + ε̃_{i,t}    (7.7)

and the α_i terms have disappeared. In vector form, the above would read

ỹ_i = Q y_i = Q X_i β + Q ι α_i + Q ε_i = X̃_i β + ε̃_i,    (7.8)

where the main simplification trivially comes from Q ι = 0. Naturally, we are assuming that X contains no time-invariant regressors, which would become columns of zeros, for the reasons given above.
5 Here we’re assuming the panel is balanced to minimise the fuss, but the extension to unbalanced panels is straightforward, as long as you admit that the P and Q matrices could have different sizes for different individuals.
6 The within transformation is a convenient way to sweep out the α_i terms, but it’s by no means the only one: ∆s would work just the same, with a few slight adjustments.

Intuition suggests that, having removed the individual effect by means of the
within transformation, you can estimate β by applying OLS to (7.7). This is in-
deed the case, and the result is known, unsurprisingly, as the “within” estimator.
The amazing result is that this statistic is exactly the same as you’d get from
using OLS on (7.4). The proof is quite simple if we consider the within trans-
formation as a matrix operation: the within transformation can be expressed in
matrix terms as the premultiplication of the original data by an N × N square
and singular matrix that we call Q:

ỹ = Q y    X̃ = Q X.

The Q matrix is a block-diagonal matrix, where all elements on the diagonal are the Q matrices defined above, so it looks like this:

Q ≡ [Q 0 … 0; 0 Q … 0; …; 0 0 … Q].    (7.9)

Clearly, it is also possible to define P = P_D analogously:

P ≡ [P 0 … 0; 0 P … 0; …; 0 0 … P].

We won’t need P now, but we’ll use it later in Section 7.4. Therefore,

Q y = [Q 0 … 0; 0 Q … 0; …; 0 0 … Q] [y_1; y_2; …; y_n] = [Q y_1; Q y_2; …; Q y_n] = [ỹ_1; ỹ_2; …; ỹ_n]

and the algebra for X̃ is just the same. As a consequence, the within estimator, which is just OLS on (7.7), can be written in matrix notation as7

β̂ = (X′QX)⁻¹ X′Qy = (X̃′X̃)⁻¹ X̃′ỹ,    (7.10)

and the corresponding model is called the within regression.


To prove that (7.10) is just the LSDV estimator, note that Q = MD , where D
is the N × n matrix with all the unit dummies I used in equation (7.5) (the proof
is in section 7.A.5). Therefore, equivalence between the LSDV and within esti-
mators follows from the Frisch-Waugh theorem: the OLS estimate for equation
(7.5) satisfies
y = Xβ̂ + Dα̂ + e, (7.11)
7 Quite evidently, Q is idempotent: I’ll leave the proof to the reader.

so
MD y = MD Xβ̂ + e =⇒ X′ MD y = X′ MD Xβ̂ ,

which implies (7.10) (you may need to go back to Section 1.4.4 for the intermediate steps).
In practice, then, the LSDV and within estimators are exactly the same thing,
so they have the same interpretation. For example, if you regress ye on xe in Table
7.2, the OLS coefficient you get is -0.62, exactly equal to the one we found for
model (7.4). You may use either term for them, or even a third alternative, pos-
sibly even more popular: the fixed-effects estimator, or FE for short, which I’ll
indicate by β̂F E .
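In hansl, the within transformation is one line, thanks to the pmean() function, which returns the per-unit mean of a series in a panel dataset. Continuing the sketch at the end of Section 7.3.1 (so the toy data and the panel structure are already in place), the following fragment should reproduce the within estimate:

# within transformation: subtract per-unit means
series y_w = y - pmean(y)
series x_w = x - pmean(x)

# OLS on the transformed data = the within (fixed-effects) estimator;
# no constant needed, since the transformed data have zero mean by unit
ols y_w x_w

# the built-in panel command (default: fixed effects) gives the same slope,
# with the degrees-of-freedom correction discussed in Section 7.3.3
panel y const x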

Some readers may be troubled by the fact that the disturbances in equation (7.7) are correlated. This is easily seen by considering the vector representation (7.8): since ε̃_i = Q ε_i, it follows that

V[ε̃_i] = Q V[ε_i] Q′.

Even in the ideal case, where V[ε_i] = σ²_ε I, the covariance matrix of ε̃_i would be V[ε̃_i] = σ²_ε Q, which is obviously non-diagonal (keep in mind that Q is symmetric and idempotent). This, however, is not a problem, since it can be proven that in this case OLS coincides with GLS, so OLS takes care of the problem quite effectively. I’m not proving this because we’d need a slightly more sophisticated definition of GLS than I gave in chapter 4.2.1, on account of the fact that Q is singular and I’d have to use the “Moore-Penrose” inverse I hinted at in Section 1.A.4. Just trust me, OK?

With the LSDV approach, the estimates for the individual effects α̂i are ob-
tained directly. However, calculating them via the within estimator is also rather
easy: rewrite equation (7.11) as

y − Xβ̂F E = Dα̂ + e.

If you pick a single unit, this implies

y_i − X_i β̂_FE = ι α̂_i + e_i;

now premultiply by (1/T) ι′ and use the fact that ι′e_i = 0: the result is

α̂_i = (1/T) ι′ (y_i − X_i β̂_FE) = (1/T) Σ_{t=1}^{T} u_{i,t},    (7.12)

where

u_{i,t} = y_{i,t} − x_{i,t}′ β̂_FE.    (7.13)

So, all you have to do is compute the residuals you’d get by using the within
estimate on the untransformed data and take means by unit.

7.3.3 Asymptotics for the FE estimator


While the meaning of the word “asymptotic” is straightforward in cross-sectional
or time-series datasets, it is not so for panel data. The number of rows in our
dataset N can go to infinity if either n or T do so, or both. In this book, we’ll
concentrate on the case when T is fixed and n → ∞; the reader, however, should be aware that in more sophisticated scenarios the case T → ∞ may have to be considered too.
The best starting point to analyse the asymptotic behaviour of the fixed-
effect estimator is to consider its LSDV representation: under the hypothesis
that no heteroskedasticity or serial correlation issues arise, standard OLS infer-
ence applies to equation (7.5), and therefore β̂F E is consistent and asymptoti-
cally normal, with a limit covariance matrix given by
V = σ²_ε [E(x̃_{i,t} x̃_{i,t}′)]⁻¹.

Assuming we have a consistent estimator for σ²_ε, then a consistent estimate of V[β̂_FE] is

V̂ = σ̂²_ε (X̃′X̃)⁻¹,

where we’re keeping T fixed here, as usual. The questions of interest are:

1. Do we have a consistent estimator for σ2ε ?

2. Is E[x̃_{i,t} x̃_{i,t}′] nonsingular, for n → ∞?

For devising an estimator of the variance, the customary ingredient we’ve been using all along is the sum of squared residuals. So far, the SSR divided by the number of observations has always done the trick. In the context of FE estimation, however, things are not so simple, and the appropriate estimator to use is

σ̂²_ε = SSR / (N − n).    (7.14)
The reason why the denominator is different from the total number of observa-
tions is very interesting, but a bit too distracting at this point, so the interested
reader should jump to Section 7.A.6.8
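On the toy data, with N = 18 and n = 3, equation (7.14) can be computed by hand from the within regression of the previous sketch (gretl's panel command applies this correction, or the N − n − k variant mentioned in the footnote, automatically):

ols y_w x_w --quiet
scalar SSR = $ess                  # sum of squared residuals of the within regression
scalar s2_e = SSR / ($nobs - 3)    # N - n, with n = 3 units here
printf "estimate of sigma^2_eps: %g\n", s2_e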
As for the asymptotic behaviour of (1/N) X̃′X̃, we must assume that there is sufficient time-variability in the regressors not only to compute the estimator, but also to allow E[x̃ x̃′] to have full rank. Of course this excludes time-invariant regressors, since the within transformation turns them into columns of zeros, but also explanatory variables for which the within variation cannot be assumed to increase for n → ∞. This is in fact a rather general point: consistency of β̂_FE depends on its variance going to 0: this happens only if the within variation in the regressors grows without bounds (the X̃′X̃ matrix goes to infinity). Therefore,
8 Most textbooks, and all the software I’m aware of, use in fact a slightly different formula, where the SSR is divided by N − n − k rather than N − n; asymptotically, it makes no difference.



if we have one or more regressors whose variation through time is limited, we


shouldn’t expect our estimates to have nice properties in terms of precision.
For example: suppose you have a dataset with 1000 individuals and you ob-
serve two of them changing their gender. In this case, the gender dummy be-
comes technically time-varying, so you can use it for fixed-effect estimation.
However, you can’t expect your estimates to be particularly precise, as your X̃′X̃
matrix will be near-singular. Moreover, in order to consider your estimator as
consistent, you must be willing to assume that, in principle, the number of trans-
gender people in your sample would increase as n grows (which could be rea-
sonable or not).
Therefore, as long as our object of interest is inference on β , we can just
happily use the within regression, provided we make the necessary adjustment
to our estimator of σ2ε . If we wanted to make inference on the individual effects,
instead, things are not so simple, since in the usual “large n, small T ” scenario,
the estimates of αi you get from LSDV are not consistent for n → ∞. This is easy
to see by considering equation (7.12): the statistic α̂_i is calculated on the T observations we have for the i-th individual, so its variance is not a function of n at all, and n going to infinity has no effect on the distribution of α̂_i. Fortunately, this is not a problem for estimating β, because our estimator β̂_FE doesn’t depend on the estimated individual effects. Nevertheless, the reader should be aware that in statistical models where the object of interest is not a linear regression function, it may be impossible to estimate the parameters of interest separately from the individual effects, and inconsistency may be a very serious problem. This is the so-called “incidental parameters” problem I hinted at a few pages back.

7.3.4 Heteroskedasticity and dependence between observations


In fact, we could allow for greater generality by considering several extensions:
the first one that comes to mind is heteroskedasticity, with unit-specific vari-
ances for εi ,t . This is not a particularly serious problem, since appropriate adap-
tations of the robust estimators à la White (see Section 4.2.2) are quite simple
and effective.
More worryingly: if we consider equation (7.2), it is clear that the hypothesis V(ε) = σ²_ε I implies that, apart from the individual effects, all observations are uncorrelated with each other. This includes observations pertaining to the same unit at different times. In many cases, this could be unrealistic.
In fact, time persistence can be a very likely possibility, since model (7.2) is
almost certain to neglect some time-varying unobservable factors that evolve
gradually through time. By applying the same logic as in chapter 5, we could
allow for some kind of ADL structure in equation (7.2). The kind of models you’d
get are normally called dynamic panel models, and have become increasingly
popular since the late 1980s. However, inference is considerably more complex
than in the static models we consider here: the tool that is almost invariably
used is the Generalised Method of Moments (GMM), which you can think of

as a generalisation of the IV technique that chapter 6 is about. The obligatory


reference for these cases is again Wooldridge (2010), but Biørn (2017) is also very
good.
A different possibility for dealing with the persistence issue is to stick with
the static formulation (7.2) and assume that any kind of time-dependence be-
tween observations can be accommodated via correlation between disturbances.
It turns out that, if this is the case, the fixed-effects estimator β̂F E is consistent
under a fairly large spectrum of conditions. The only problem is, like in static
models with heteroskedasticity, that in order to perform inference correctly, the
covariance matrix for β̂F E needs an appropriate adjustment. This leads us to the
idea of clustered covariance matrix.
A full description of cluster-robust inference has no place in this book; suf-
fice it to say that you divide your observations in observable groups called clus-
ters, and you allow the εi ,t random variable to be arbitrarily correlated inside
the group. The variable which tells you which group an observation belongs to
is called the “clustering” variable.
Of course, the most obvious choice for clustering is the variable indexing
units (the “id” variable in Table 7.1). In this case, equation (7.3) would be gener-
alised so as to allow the covariance matrix to be pretty much anything, instead
of a scalar matrix:
V[ε_i] = E[ε_i ε_i′] = Σ_i.

Note that the covariance matrix bears the subscript i , so we’re also implicitly
allowing for arbitrary forms of heteroskedasticity. Other choices, however, are
possible. For example, it may not be unrealistic to imagine that some correlation may exist across different units: the classic example is a panel of geographical entities, where units are regions and clusters are countries, but one could also think of individuals belonging to the same household, firms in the same sector, etc. Let me just say that the literature on this topic has
exploded in the past 15 years, and that Cameron and Miller (2010) or Cameron
and Miller (2015) provide excellent surveys.
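In gretl, cluster-robust inference for the FE estimator is a matter of a single option flag; a minimal sketch on the toy data follows (my understanding is that, for panel datasets, the --robust flag produces an Arellano-type covariance matrix clustered by unit, which allows arbitrary correlation within each unit plus heteroskedasticity across units, but do check the documentation of your version):

# fixed effects with standard errors clustered by unit
panel y const x --robust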

7.4 Random effects


The basic idea that gives rise to the random effects estimator (abbreviated as
RE) is that in some cases we may be willing to put some structure on the in-
dividual effects, rather than being completely agnostic about them as we do in
fixed-effects estimation.
Since individual effects are taken to represent a heap of time invariant, un-
observed, and possibly very diverse factors that describe how units differ from
one another, it’d be natural to describe the αi terms as random variables. If we
assume the existence of moments, then the assumption E [αi ] = 0 implies no
loss of generality, and V[α_i] = σ²_α is nothing more than a mild regularity condition. With these assumptions, equation (7.2) can be rewritten as

y i ,t = x′i ,t β + ωi ,t (7.15)

where ωi ,t = αi + εi ,t . The same equation for unit i can be written as

yi = Xi β + ωi (7.16)

where ω_i ≡ α_i ι + ε_i. By making the harmless assumption that E[α_i ε_{i,t}] = 0 for all i and t, the covariance matrix of ω_i equals

Σ = V[ω_i] = E[ω_i ω_i′] = V[ε_i] + σ²_α ιι′ = σ²_ε I + σ²_α ιι′,    (7.17)

where the last equality comes from the assumption that the disturbances are
well-behaved. If we also assume independence between units, equation (7.15)
for the whole sample would therefore become

y = Xβ + ω (7.18)

where

V[ω] = Ω = [Σ 0 … 0; 0 Σ … 0; …; 0 0 … Σ]    (7.19)

is a block-diagonal matrix.
We are now in the position to substantiate the claim I made at the end of Section 7.1, when I said that using the pooled OLS estimator is almost never a good idea with a panel dataset: for a start, the covariance matrix of the disturbance term is not scalar, which means that even if OLS on (7.15) were consistent, valid inference would require at least some form of robust covariance matrix estimation (see Section 4.2.2). Besides, consistency itself may be at risk: even if E[α_i] = 0, there is no guarantee that α_i and x_i should be independent, or at least uncorrelated (see the discussion at the end of Section 7.2). If E[α_i | x_i] ≠ 0, it follows that E[ω_{i,t} | x_i] ≠ 0 and therefore E[y_{i,t} | x_{i,t}] ≠ x_{i,t}′β: the classic endogeneity problem we analysed in Chapter 6, that renders the pooled estimator inconsistent.
In the light of these two possible problems, what could an effective strategy
be? Let’s put the endogeneity issue aside for the moment (we’ll come back to it
in section 7.4.2). If E [αi |xi ] = 0, one may conjecture that OLS should be more
efficient than the FE estimator, since the FE estimator uses only the “within”
variation in the data, but we could use the “between” information (that is, dif-
ferences between units) to gain some efficiency.
In fact, we can do even better than OLS: from equation (7.17), the structure of the covariance matrix Σ is known, bar two scalars, σ²_ε and σ²_α. Therefore,

if these two scalars were known, we could use the GLS estimator, described in Section 4.2.1, which I’m reproducing here for your convenience:

β̃ = [X′Ω⁻¹X]⁻¹ X′Ω⁻¹y.

This solution would take care of two problems at once: we’d be using the most
efficient estimator possible, and we wouldn’t have to worry about robust infer-
ence. In practice, the two variances we need to get the job done are unknown,
but asymptotically consistent estimators would be just as good, so a FGLS esti-
mator would be available. This is what we call the RE estimator.
It turns out that, as often happens, once the original data are suitably mod-
ified, the RE estimator can be rewritten as OLS on the transformed data. The
transformation we need is known as “quasi-differencing”: for each observation,
we subtract a fraction of the per-unit average from the original data:

y̆ i ,t = y i ,t − θ ȳ i ,

where 0 ≤ θ ≤ 1. In vector form,

y̆i = yi − θ ι ȳ i = (I − θP ) yi ,

where, again, P is an alias for Pι . Quasi-differencing for the whole sample can
be written as
y̆ = (I − θP) y,
where P was defined in section 7.3.2 or, equivalently, as

y̆ = [Q + (1 − θ)P] y, (7.20)

given that Q = I − P.
For a given value of θ, the RE estimator is just OLS on the quasi-differenced
data, that is
β̀(θ) = [X̆′X̆]⁻¹ X̆′y̆.    (7.21)
As is easy to check, quasi-differencing with θ = 1 is just the within transformation, so β̀(1) = β̂_FE. At the other end of the spectrum, where θ = 0, the original data are unmodified, so β̀(0) is just the pooled OLS estimator. Note that, for θ < 1, time-invariant variables do not become zero, and so they are perfectly usable.
Derivation of the optimal choice of θ for GLS estimation is a bit technical, and is in Section 7.A.7 for those interested. Here, I’m just giving you the solution straight away, which is

θ = 1 − √( σ²_ε / (σ²_ε + T σ²_α) ).    (7.22)
Note that, when σ²_ε is large compared to σ²_α, θ will be near 0: heterogeneity between units is negligible and the optimal estimator is practically OLS. Conversely, if σ²_ε is very small compared to σ²_α, then θ is close to 1 and the within

estimator is optimal, since all the variance in ωi ,t comes from individual effects,
which are eliminated by the within transformation.
As I said earlier, the two variances σ2ε and σ2α are unknown in practice, so
they must be estimated. The almost universal solution is to use FE for σ2ε ; as
for σ2α , there are various alternatives and it is not clear if the “best” one even
exists. Shortly after the RE estimator was invented, in the late 1960s, quite a lot
of work was devoted to this issue, and the method most software uses is the one
by Swamy and Arora (1972), but you should be aware that you may get different
results from different programs because different (equally defensible) methods
are adopted.
Anyway: once we have consistent estimates of σ2ε and σ2α , to compute FGLS
we just plug them into equation (7.22) and obtain
θ̂ = 1 − √( σ̂²_ε / (σ̂²_ε + T σ̂²_α) ),    (7.23)

a consistent estimate of θ. By using θ̂, we can compute the quasi-differenced


data y̆ and X̆ and, finally, compute β̂RE as OLS on the quasi-differenced data.
And that is what we call the RE estimator:9

β̂RE = β̀ (θ̂). (7.24)
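The following hansl fragment spells out the quasi-differencing mechanics of equations (7.21)–(7.23) on the toy data; the two variance estimates are made-up numbers, used for illustration only, since the one-command route (panel with the --random-effects flag) estimates them internally, typically with the Swamy–Arora method mentioned above.

# made-up variance estimates, for illustration only
scalar s2_e = 0.5                  # stands in for an estimate of sigma^2_eps
scalar s2_a = 1.2                  # stands in for an estimate of sigma^2_alpha
scalar T = 6

scalar theta = 1 - sqrt(s2_e / (s2_e + T*s2_a))   # equation (7.23)

# quasi-differencing: subtract a fraction theta of the per-unit means
series y_qd = y - theta * pmean(y)
series x_qd = x - theta * pmean(x)
series c_qd = 1 - theta            # the constant gets quasi-differenced too

# OLS on the quasi-differenced data: the (F)GLS, i.e. RE, estimator
ols y_qd c_qd x_qd

# one-command equivalent, with the variances estimated internally
panel y const x --random-effects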

7.4.1 The Hausman test

Having dealt with the particular covariance structure of ωi ,t , we now turn to the
other issue I mentioned above, that is the possibility of the observables xi ,t being
correlated with the individual effect αi . In many cases, this is a very real possi-
bility: think about the example I used on page 211, where GDP per capita is one
of the regressors xi ,t and soil fertility is one of the things that go into the indi-
vidual effect αi : who says that GDP per capita and soil fertility are independent?
More generally, it’s easy to imagine other examples, such as unobserved ability
and schooling in a Mincer wage equation.
As I argued above, this is not a problem for the FE estimator, since the within
transformation just sweeps the individual effect away, but it would make OLS
and the RE estimator inconsistent. Therefore, we could compare the FE and RE
estimators to see if they are similar, much in the same way as we did in Section
6.4 when we compared the OLS and IV estimators. This comparison gave rise
to the “Hausman test”, and this case is just the same. In fact, the original article
(Hausman, 1978) uses exactly the two examples we have in this book, that is OLS
vs IV and RE vs FE.
9 As the reader might imagine, robust versions of the RE estimator exist, but I’ll refrain from illustrating the details, and I’ll just say that there is no additional worry compared to the FE case, and they work as one would expect.

What should we expect from the comparison? β̂F E is robust but inefficient;
β̂RE is efficient but potentially inconsistent.10 Under the null hypothesis of no
correlation between xi ,t and αi , the difference

δ = β̂F E − β̂RE
should converge to 0 in probability, because both statistics share the same limit.
Conversely, large values of δ should be taken as an indicator of endogeneity of
xi ,t .
The Hausman test can be carried out in a variety of ways, some numerically
equivalent, some only asymptotically. A choice that is used by several software
packages is to perform an auxiliary regression of the form

y̆_i = X̆_i β + X̃_i γ + u_i,    (7.25)

and then a Wald test for the hypothesis H0 : γ = 0. With a bit of algebra, it can be proven that this amounts to testing whether β̂_FE − β̂_RE →p 0.
Therefore, the course of action one may take is very simple: after RE estimation, look at the Hausman test. If the null is rejected, β̂_RE is probably inconsistent, and β̂_FE is preferable. Otherwise, we may happily use β̂_RE, which is better than β̂_FE because it’s more efficient. As simple as that.
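A sketch of the auxiliary-regression version (7.25) in hansl, reusing the quasi-differenced and within-transformed series built in the previous sketches; keep in mind that gretl's panel command with the --random-effects flag reports a Hausman test of its own, so this is only meant to make the mechanics explicit.

# regressors: quasi-differenced x plus its within-transformed copy
ols y_qd c_qd x_qd x_w --quiet
# Wald (F) test for gamma = 0: a version of the Hausman test
omit x_w --test-only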

7.4.2 Correlated Random Effects, aka “the Mundlak trick”


An alternative strategy for dealing with the possible correlation between the re-
gressors xi ,t and the individual effect αi comes from modelling explicitly the
correlation between them. This gives rise to an estimator sometimes called the
correlated random effects estimator, or CRE for short, proposed first by Mund-
lak (1978). As we will see shortly, however, the result will be less exciting than
one may hope, but side benefits will be substantial.
The key intuition is to consider the conditional expectation of αi to x̄i and
assume it is a linear function,

E [αi |x̄i ] = x̄′i γ ; (7.26)

note that the conditioning variable we’re using here is not xi ,t , but rather its av-
erage through time. Since αi is time-invariant, it is quite natural to assume that
a time average of the xi ,t should capture the effect we’re after.
Therefore, if you define u_i = α_i − x̄_i′γ, you can re-write (7.2) as

y_{i,t} = x_{i,t}′β + x̄_i′γ + u_i + ε_{i,t} = x_{i,t}′β + x̄_i′γ + η_{i,t},    (7.27)

where, by construction, neither of the two error terms u_i and ε_{i,t} is correlated with the explanatory variables. In vector form,

y_i = X_i β + P X_i γ + η_i = X_i β + X̄_i γ + η_i.
10 Naturally, we have both parameters only for time-varying regressors, so the comparison is limited to the subset of β̂_RE that matches β̂_FE.



If you substitute αi with u i , and therefore ωi ,t in equation (7.15) with η i ,t


in equation (7.27), you see that the structure of the covariance matrix of ηi is
absolutely identical, so nothing stops you from using FGLS on equation (7.27).
Therefore, in practice, Mundlak’s CRE estimator is just the RE estimator with the
time averages of Xi as additional regressors.
The first thing to say is that the estimate of β you get is nothing new. With a little bit of matrix algebra, it can be proven (I do it in Section 7.A.8) that the estimated β vector is numerically equal to the within estimator β̂_FE. Therefore, one may think that the Mundlak procedure is just a tortuous avenue to get something we already had. Not quite: one nice thing about the CRE estimator is that it provides us with a way to use time-invariant explanatory variables, which is impossible with the LSDV or the within approaches.
Moreover, testing the hypothesis H0 : γ = 0 is very interesting: under the
null, the endogeneity problem just goes away. Therefore, rejection of the null
would imply we have to stick with FE, but otherwise we could gain efficiency and
go with RE. It should come as no surprise that testing this hypothesis is equiva-
lent to the Hausman test I described in the previous subsection.
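In hansl the Mundlak device takes a couple of lines: add the per-unit means of the time-varying regressors and re-estimate by RE. A sketch on the toy data follows (the test on the added term should essentially replicate the Hausman test, as just argued):

# per-unit averages of the time-varying regressor(s)
series mx = pmean(x)

# CRE = RE with the time averages added; the coefficient on x reproduces
# the FE estimate
panel y const x mx --random-effects

# testing gamma = 0 (a single restriction here)
omit mx --test-only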

7.5 An example with real data


7.5.1 The Kuznets curve
The American economist Simon Kuznets (Nobel prize winner in 1971) is credited with an idea that has become known as the “Kuznets curve”. In short, the basic intuition is that developing economies go through several structural changes that provoke an increase in inequality in the early stages, and a decrease later. Clearly, this idea is too mechanical and simplistic to paint an accurate picture, but if there is something to it, we should observe that inequality is highest in middle-income economies.
I collected some data from the World Bank’s WDI database: per capita income and the Gini index (the standard measure of income inequality) for the years between 2008 and 2022.11
As often happens in these cases, the panel is heavily unbalanced. We have
lots of data for some economies, but for some countries we only have one or
two datapoints. Having said this, our dataset comprises 1044 observations for
157 countries. A scatterplot of the Gini index versus log GDP per capita is shown
in Figure 7.3.
The curve you see in the figure is the fitted line from a pooled OLS regres-
sion of the Gini index versus GDP per capita (in logs) and its square. I added to
11 In the interest of replicability: the measure of GDP per capita I used is GDP per capita in con-

stant 2015 US$ (WDI code: NY.GDP.PCAP.KD.) The WDI code for the Gini index is SI.POV.GINI.

[Scatter plot of the Gini index (vertical axis) against log GDP per capita (horizontal axis), with the fitted quadratic Y = -10.3 + 12.8X - 0.822X^2.]

Figure 7.3: The Kuznets curve

this model a dummy for European countries, since these countries have had a
historical and cultural preference for social equality that some people consider
a dangerous socialist drift. The results are shown in Table 7.3.
Pooled OLS, using 1044 observations
Included 157 cross-sectional units
Time-series length: minimum 1, maximum 15
Dependent variable: Gini

coefficient std. error t-ratio p-value


--------------------------------------------------------
const -15.2335 7.78943 -1.956 0.0508 *
y 12.7708 1.74755 7.308 5.40e-13 ***
y2 -0.712896 0.0969748 -7.351 3.97e-13 ***
Europe -9.35281 0.426650 -21.92 7.23e-88 ***

Mean dependent var 36.39588 S.D. dependent var 7.629774


Sum squared resid 35233.57 S.E. of regression 5.820518
R-squared 0.419705 Adjusted R-squared 0.418031
F(3, 1040) 250.7304 P-value(F) 2.1e-122

Table 7.3: The Kuznets curve: OLS estimates

Here we seem to have a confirmation of Kuznets’ hypothesis: the curvature is negative (the coefficient for y2 is negative and significant) and the distribution of income for European countries is confirmed to be more even than in other countries with similar levels of GDP per capita.
However, this is a pooled estimate, conceptually similar to the plot I showed you earlier, in Figure 7.1. Could it be that the results we are seeing neglect the effect of unobserved heterogeneity between countries? To find out, we turn to FE estimates.

7.5.2 Fixed-effects estimates


A word of warning on the presence of a constant in the FE estimate. Strictly
speaking, the intercept is a time-invariant regressor, so it should not appear in
the FE output. However, most econometric software (including gretl, which is
what I’m using) adopt a slightly different convention on the definition of the
matrix D in (7.5), so that an intercept is in fact calculated.12

Fixed-effects, using 1044 observations


Included 157 cross-sectional units
Time-series length: minimum 1, maximum 15
Dependent variable: Gini
Omitted due to exact collinearity: Europe

coefficient std. error t-ratio p-value


-------------------------------------------------------
const 69.4905 20.2572 3.430 0.0006 ***
y -0.546958 4.60603 -0.1187 0.9055
y2 -0.331499 0.260585 -1.272 0.2037

Mean dependent var 36.39588 S.D. dependent var 7.629774


Sum squared resid 2687.375 S.E. of regression 1.742579
LSDV R-squared 0.955739 Within R-squared 0.129623
LSDV F(158, 885) 120.9498 P-value(F) 0.000000
Log-likelihood -1974.926 Akaike criterion 4267.851
Schwarz criterion 5055.031 Hannan-Quinn 4566.408
rho 0.615823 Durbin-Watson 0.582907

Test for differing group intercepts -


Null hypothesis: The groups have a common intercept
Test statistic: F(155, 885) = 69.1486
with p-value = P(F(155, 885) > 69.1486) = 0

Table 7.4: The Kuznets curve: fixed-effects estimates

Having said this, Table 7.4 is relatively straightforward to comment on: the “Europe” dummy drops out of the equation on account of it being time-invariant, as explained in Section 7.3.1. Moreover, the poolability test rejects the null very strongly (the p-value is so small that the software just prints 0). This means that heterogeneity between units (countries in this case) is substantial and a simple pooled model may yield misleading results, as long as we’re interested in the effect of GDP on inequality. In fact, the Kuznets curve simply disappears: the coefficients on per capita GDP and its square are not significant.
Nevertheless, it can be verified that the joint hypothesis of both coefficients
being zero delivers a very small p-value (2.09218e-27): dropping the quadratic
12 The difference amounts to modifying the within transformation by adding back, for each observation, the overall mean: ỹ_{i,t} = y_{i,t} − ȳ_i + ȳ.



term gives a marginal effect of -6.36334, with a t -statistic of -11.41:13 it seems


we do have a uniformly inverse relationship, instead of a concave curve.
Therefore, having eliminated the variation between countries, what we ob-
serve is the relationship between GDP and inequality through time: if we con-
centrate on the individual history of each country, we observe that on average
inequality decreases with economic growth, instead of the “inverted-U” rela-
tionship described by Kuznets.
One last thing to note is that the estimated value for the first-order autocor-
relation of residuals ρ̂ is 0.6158, so we have a substantial autocorrelation prob-
lem. In principle, we should go for a dynamic model, but here we’re following
the easier route of just using cluster-robust standard errors, by unit. That is,
we employ a different estimator for the variance of β̂F E , that permits (a) arbi-
trary correlation through time between observations for the same country and
(b) heteroskedasticity between countries.

coefficient std. error t-ratio p-value


-------------------------------------------------------
const 69.4905 40.3405 1.723 0.0869 *
y -0.546958 8.90973 -0.06139 0.9511
y2 -0.331499 0.491751 -0.6741 0.5012

Table 7.5: The Kuznets curve: fixed-effects estimates with robust standard errors

As can be seen in Table 7.5, the estimated standard errors are quite different from Table 7.4. This is in fact a very common phenomenon: while it is very rare in cross-sectional models that robust inference delivers substantially divergent results from plain estimation, in panel datasets clustering by unit almost always inflates standard errors by a great deal, and the interpretation of the results may have to be adjusted, even radically.
In this case, however, the meaning conveyed by the model stays the same: the Kuznets curve vanishes, although the joint test still rejects the null (the p-value is 1.14766e-06) and the conclusions are the same.
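For completeness, here is the sequence of gretl commands that should reproduce Tables 7.3–7.6, assuming the WDI extract has been loaded, declared as a panel, and that the variables carry the names used in the tables (Gini, y for log per capita GDP, y2 for its square, and the Europe dummy):

# Table 7.3: pooled OLS
ols Gini const y y2 Europe

# Table 7.4: fixed effects (Europe is dropped automatically)
panel Gini const y y2 Europe

# Table 7.5: fixed effects, standard errors clustered by unit
panel Gini const y y2 Europe --robust

# Table 7.6: random effects with clustered standard errors
panel Gini const y y2 Europe --random-effects --robust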

7.5.3 Random-effects estimates

Having established that heterogeneity between countries is something we can-


not ignore, maybe we could gain efficiency by using the RE estimator (Section
7.4); to be on the safe side, I’ll use cluster-robust inference.
Note that in this case the quasi-differencing operation I described in section
7.4 is a little bit more complicated, because the panel is heavily unbalanced and
you have different numbers of observations for different countries. Equation
(7.22) contains the symbol T , so what should we use here? The solution is to
adopt a different θ for countries with different numbers of observations, so data
13 I’m not reporting the whole restricted regression for the sake of brevity. This chapter is already

long enough. You can try it yourself if you want.



for each unit are quasi-differenced using

θ̂_i = 1 − √( σ̂²_ε / (σ̂²_ε + T_i σ̂²_α) );

(note the “i” subscript). Clearly, (7.22) is just a special case of the equation above, that applies to balanced panels where T_i = T for all units. Gretl reports the average value of θ used, which is 0.85298.
Random-effects (GLS), using 1044 observations
Included 157 cross-sectional units
Time-series length: minimum 1, maximum 15
Dependent variable: Gini
Standard errors clustered by unit

coefficient std. error z p-value


------------------------------------------------------
const 18.7662 17.9769 1.044 0.2965
y 7.38777 4.16427 1.774 0.0760 *
y2 -0.579803 0.237245 -2.444 0.0145 **
Europe -3.53635 1.32306 -2.673 0.0075 ***

Mean dependent var 36.39588 S.D. dependent var 7.629774


Sum squared resid 48078.17 S.E. of regression 6.795925
Log-likelihood -3480.511 Akaike criterion 6969.022
Schwarz criterion 6988.825 Hannan-Quinn 6976.533
rho 0.615823 Durbin-Watson 0.582907

’Between’ variance = 39.864


’Within’ variance = 3.03658
mean theta = 0.85298
corr(y,yhat)^2 = 0.256021

Breusch-Pagan test -
Null hypothesis: Variance of the unit-specific error = 0
Asymptotic test statistic: Chi-square(1) = 3065.99
with p-value = 0

Hausman test -
Null hypothesis: GLS estimates are consistent
Asymptotic test statistic: Chi-square(2) = 22.3276
with p-value = 1.4178e-05

Table 7.6: The Kuznets curve: random-effects estimates

Comparing β̂_FE with β̂_RE, we observe a striking difference:

variable      FE        RE
y          -0.547     7.388
y2         -0.331    -0.580
It looks as if the two estimates should come out as significantly unlike one another, and this is indeed the case: the Hausman test rejects quite strongly (p-value = 1.4178e-05), so it’s unlikely that the two estimators converge to the same probability limit. This is what happens when one or more of the explanatory
variables (presumably GDP per capita, in our case) is correlated with the indi-
vidual effect αi , on account of the endogeneity problem that this provokes. In
cases like these, the RE estimator is inconsistent, so we’d better stay with FE.
Finally, note that gretl (like all other software packages) reports a test labelled as the Breusch-Pagan test. This is a test for the hypothesis H0 : σ²_α = 0: under the null, the individual effects are in fact not even random variables at all, because they have zero mean and zero variance, so α_i = 0 for all units. Therefore, it can be seen as the random-effects equivalent of the poolability test I described earlier. In this case, again, we find that heterogeneity is substantial. Note that this is an entirely different test from the BP test for heteroskedasticity I mentioned earlier in Section 4.2.3. The two tests share the same authors, but the similarity stops there.

7.5.4 Correlated random effects

Random-effects (GLS), using 1044 observations


Included 157 cross-sectional units
Time-series length: minimum 1, maximum 15
Dependent variable: Gini
Standard errors clustered by unit

coefficient std. error z p-value


---------------------------------------------------------
const 8.96061 16.6681 0.5376 0.5909
y -0.546958 8.92260 -0.06130 0.9511
y2 -0.331499 0.492462 -0.6731 0.5009
Europe -7.90436 1.20454 -6.562 5.30e-11 ***
Py 8.10053 10.4543 0.7748 0.4384
Py2 -0.118126 0.575871 -0.2051 0.8375

Table 7.7: The Kuznets curve: CRE estimates

The final estimate we see is the CRE estimate (see Section 7.4.2). There's hardly anything new to see here: the coefficients for the time-varying variables y_{i,t} and y²_{i,t} are absolutely identical to those in Table 7.5, as they should be; their standard errors are not exactly the same, but that's a consequence of using robust SEs. If we had used plain GLS standard errors, they would have been identical too. The difference is minor anyway. So, for the time-varying variables we have nothing more than the FE estimate, and the interpretation is obviously the same. On the other hand, the CRE technique allows us to keep the time-invariant dummy for Europe in the model, which is (unsurprisingly) negative and significant.

Finally, note the insertion of the two "Mundlak" extra regressors, labelled Py and Py2 in the table, which contain the per-unit averages of y_{i,t} and y²_{i,t}, respectively. Although they are not significant individually, an F test for their joint significance yields 11.1638, with a p-value of 1.59597e-05, which is (unsurprisingly) nearly identical to the p-value of the Hausman test shown in Table 7.6.
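As an aside, building the Mundlak regressors requires nothing more than per-unit averages of the time-varying regressors. A numpy sketch (hypothetical array names, not the script used for Table 7.7):

import numpy as np

def mundlak_terms(X, unit):
    """Per-unit means of the columns of X, repeated over each unit's rows."""
    M = np.empty_like(X, dtype=float)
    for u in np.unique(unit):
        mask = unit == u
        M[mask] = X[mask].mean(axis=0)   # unit-specific average of each column
    return M

In our example, Py and Py2 are the two columns obtained by applying this function to the matrix containing y_{i,t} and y²_{i,t}.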

7.A Assorted results


In this chapter, I used several matrix algebra concepts and results that had not
been necessary before. Therefore, this section starts with a quick and rudimen-
tary treatment of a few linear algebra topics. For more details, see Lütkepohl
(1996), Abadir and Magnus (2005) or Horn and Johnson (2012).

7.A.1 The Kronecker product


The usual way of multiplying two matrices, in which C = AB comes from taking all possible inner products of the rows of A and the columns of B, is not the only way to define a product of two matrices.
An alternative is provided by the so-called Kronecker product, also known as the tensor product, which is defined as follows. Take two matrices A and B, and say that A is r × c and B is m × n. Then their Kronecker product A ⊗ B is a matrix with r · m rows and c · n columns, in which each element of A multiplies the whole matrix B:
\[
A \otimes B =
\begin{bmatrix}
a_{1,1}B & a_{1,2}B & \dots & a_{1,c}B \\
a_{2,1}B & a_{2,2}B & \dots & a_{2,c}B \\
\vdots   & \vdots   & \ddots & \vdots  \\
a_{r,1}B & a_{r,2}B & \dots & a_{r,c}B
\end{bmatrix}.
\]

Note that, as a consequence of its definition, no conformability issues arise with the Kronecker product. On the other hand, just like the ordinary matrix product, the Kronecker product is not commutative: A ⊗ B ≠ B ⊗ A.
The Kronecker product has many nice properties, but the only ones we will
need concern their combination with transposition, inversion and the ordinary
matrix product. It can be proven that

\[
(A \otimes B)' = A' \otimes B' \qquad (A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \qquad (A \otimes B)(C \otimes D) = AC \otimes BD
\]

Note: the last equality assumes that the matrices are conformable.
In many cases, the Kronecker product makes it much easier to work with
“large matrices with a structure”. For example, if the panel is balanced the D
matrix defined in equation (7.5) can be written as D = I ⊗ ι and the variance of
ω in equation (7.19) is V [ω ] = I ⊗ Σ, where I is n × n; unfortunately, with unbal-
anced panels such elegance is unattainable.
Finally: the “vec” operator I illustrated in Section 4.A.3 and the Kronecker
product play together very nicely. The basic property you need to know is that

\[
\operatorname{vec}(ABC) = \left(C' \otimes A\right)\operatorname{vec}(B),
\]

so for example if
\[
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \qquad
B = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad
C = \begin{bmatrix} 3 & 6 & 9 \end{bmatrix}
\]

you may verify that


\[
ABC = \begin{bmatrix} -3 & -6 & -9 \\ -3 & -6 & -9 \end{bmatrix}
\]
so
\[
\operatorname{vec}(ABC) = \begin{bmatrix} -3 \\ -3 \\ -6 \\ -6 \\ -9 \\ -9 \end{bmatrix}
\]

which is equal to
\[
\left(C' \otimes A\right)\operatorname{vec}(B) =
\begin{bmatrix}
3 & 6 \\ 9 & 12 \\ 6 & 12 \\ 18 & 24 \\ 9 & 18 \\ 27 & 36
\end{bmatrix}
\begin{bmatrix} 1 \\ -1 \end{bmatrix}.
\]
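If you would rather not do these multiplications by hand, the following throwaway numpy check (illustrative only) reproduces the numbers above:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[1], [-1]])
C = np.array([[3, 6, 9]])

vecABC = (A @ B @ C).flatten(order="F")        # vec(): stack the columns
rhs = np.kron(C.T, A) @ B.flatten(order="F")   # (C' ⊗ A) vec(B)

print(vecABC)                    # [-3 -3 -6 -6 -9 -9]
print(np.allclose(vecABC, rhs))  # True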

7.A.2 The trace operator

Given a square matrix C with n rows and columns, the trace operator is defined
simply as
\[
\operatorname{tr}(C) = \sum_{i=1}^{n} C_{i,i},
\]

that is, the sum of all the elements on the diagonal. Clearly, the trace of a scalar
is the scalar itself.
This operator is useful in many contexts, mostly related to the fact that, for
any given r × c matrix A (possibly, with r ̸= c),

\[
\operatorname{tr}(A'A) = \sum_{i=1}^{r}\sum_{j=1}^{c} A_{i,j}^2.
\]

The two notable properties of the trace operator we use in our context are:

Linearity : tr (A + B ) = tr (A) + tr (B ), and it is also true that tr (λC ) = λ · tr (C ),


where λ is a scalar. Note that linearity implies that the trace and expecta-
tion operators can be interchanged: if C is a random matrix,

E [tr (C )] = tr (E [C ]) .

Commutation : tr(AB) = tr(BA), which implies the amusing property I like to call the “train” property:
\[
\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA);
\]
that is: the argument of the trace operator is like a train, where you can detach a wagon from one end and stick it to the other end. For example, if x is a vector, the trace of the xx′ matrix can be computed very easily as
\[
\operatorname{tr}\left(xx'\right) = \operatorname{tr}\left(x'x\right) = x'x,
\]
with the second equality coming from x′x being a scalar.
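Both properties are easy to check numerically; a quick numpy illustration (arbitrary matrices, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

# the "train" property: detach a factor from one end, stick it to the other
print(np.allclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))   # True
print(np.allclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))   # True

x = rng.normal(size=(6, 1))
print(np.allclose(np.trace(x @ x.T), (x.T @ x).item()))        # tr(xx') = x'x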

7.A.3 A neat matrix inversion trick


Suppose P is idempotent and Q = I −P ; therefore Q is idempotent too and PQ =
QP = [0]. Assume that a matrix A can be written as
A = αP + βQ,
where α and β are nonzero scalars. Then, there is an amazingly simple way to
write the inverse of A:
\[
A^{-1} = \frac{1}{\alpha}P + \frac{1}{\beta}Q.
\]
The proof is by direct multiplication:
\[
\left(\alpha P + \beta Q\right)\left(\frac{1}{\alpha}P + \frac{1}{\beta}Q\right) = \frac{\alpha}{\alpha}P + \frac{\beta}{\beta}Q = P + Q = I
\]
because PQ = QP = [0] by construction and the cross-products drop out.
Note that, by the same logic, it's also possible to compute the “inverse square root” of A, that is, a matrix that gives A⁻¹ when multiplied by itself:
\[
A^{-1/2} = \frac{1}{\sqrt{\alpha}}P + \frac{1}{\sqrt{\beta}}Q,
\]
and again, the proof is by direct multiplication. In fact, the result could be generalised to any exponent k:
\[
A^{k} = \alpha^{k} P + \beta^{k} Q.
\]
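A small numerical check may be reassuring; the numpy sketch below (P is taken to be the projection on a vector of ones, the case we need in this chapter) verifies both formulas:

import numpy as np

T = 5
iota = np.ones((T, 1))
P = iota @ iota.T / T          # projection on the constant, idempotent
Q = np.eye(T) - P              # also idempotent, PQ = QP = 0

alpha, beta = 3.0, 0.25
A = alpha * P + beta * Q

A_inv = P / alpha + Q / beta
A_isqrt = P / np.sqrt(alpha) + Q / np.sqrt(beta)   # the "inverse square root"

print(np.allclose(A @ A_inv, np.eye(T)))      # True
print(np.allclose(A_isqrt @ A_isqrt, A_inv))  # True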

7.A.4 Time dummies


The addition of time dummies to a fixed-effect model is straightforward, and
amounts to adding to the dataset a set of T dummies identifying time periods;
actually, you normally add T − 1 to avoid the dummy trap.
Therefore, equation (7.4) would become, after dropping the dummies for
unit 1 and time 1,
\[
y_{i,t} = x_{i,t}'\beta + \alpha_2 d^{2}_{i,t} + \dots + \alpha_n d^{n}_{i,t} + \gamma_2 t^{2}_{i,t} + \dots + \gamma_T t^{T}_{i,t} + \varepsilon_{i,t};
\]
this model is often called the two-way fixed-effects model. In the toy dataset in
Table 7.1, this would give:

id time y x t2 t3 ... t6
1 1 1.6 1.6 0 0 ... 0
1 2 1 1.8 1 0 ... 0
1 3 2.2 1 0 1 ... 0
1 4 2 1 0 0 ... 0
1 5 1.8 1 0 0 ... 0
1 6 2.2 0.8 0 0 ... 1
2 1 3.2 4.2 0 0 ... 0
2 2 3.4 3.2 1 0 ... 0
2 3 3 4.2 0 1 ... 0
2 4 3.6 2.4 0 0 ... 0
2 5 3.8 3.2 0 0 ... 0
2 6 3.2 3.6 0 0 ... 1
3 1 3.8 6.8 0 0 ... 0
3 2 5 4.8 1 0 ... 0
3 3 5.2 5.4 0 1 ... 0
3 4 4.6 5.8 0 0 ... 0
3 5 4.4 6 0 0 ... 0
3 6 3.6 7 0 0 ... 1

Note that in the “large n, small T ” scenario the number of dummies you use is
in fact relatively small, and does not create any computational problem. From
the viewpoint of the interpretation of results, the effect you have is that in your
estimate you not only get rid of heterogeneity across units, but also across time
periods. This is especially useful when some unobserved factor affects all units
in a given period. For example, imagine your dataset describes turnover by firms
and includes year 2020: surely you’ll want to control for the COVID pandemic,
since it’s reasonable to assume that it affected most, if not all, the units you ob-
serve.
Alternatively, you may want to economise on the number of regressors used
to clean unobservable time effects by using a time trend, and possibly its square.
How advisable this is depends on the data you have.
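In terms of mechanics, creating the dummies is trivial; a numpy sketch (hypothetical names, shown only to fix ideas) could be:

import numpy as np

def time_dummies(time, drop_first=True):
    """0/1 dummies for each time period, from a vector of period indices."""
    periods = np.unique(time)
    if drop_first:
        periods = periods[1:]            # drop one period to avoid the dummy trap
    return (time[:, None] == periods[None, :]).astype(float)

# toy dataset of Table 7.1: 3 units observed over 6 periods
time = np.tile(np.arange(1, 7), 3)
Dt = time_dummies(time)                  # 18 x 5 matrix, columns t2 ... t6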

7.A.5 Proof that Q = MD

Here we assume that the panel is balanced for simplicity, although the unbal-
anced case would be completely analogous and the conclusion would be the
same, but the algebra would be somewhat messier. The Q matrix, defined in
equation (7.9) and repeated here for convenience, is:
\[
\mathbf{Q} \equiv
\begin{bmatrix}
Q & 0 & \dots & 0 \\
0 & Q & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & Q
\end{bmatrix}.
\]

Now we prove that Q is in fact M_D. First, note that D′D = T · I:
\[
D'D =
\begin{bmatrix}
\iota' & 0 & \dots & 0 \\
0 & \iota' & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \iota'
\end{bmatrix}
\begin{bmatrix}
\iota & 0 & \dots & 0 \\
0 & \iota & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \iota
\end{bmatrix}
=
\begin{bmatrix}
\iota'\iota & 0 & \dots & 0 \\
0 & \iota'\iota & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \iota'\iota
\end{bmatrix}
= T \cdot I
\]
Therefore, (D′D)⁻¹ = (1/T)·I. As a consequence,
\[
P_D = \frac{1}{T} D D' =
\frac{1}{T}
\begin{bmatrix}
\iota & 0 & \dots & 0 \\
0 & \iota & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \iota
\end{bmatrix}
\begin{bmatrix}
\iota' & 0 & \dots & 0 \\
0 & \iota' & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \iota'
\end{bmatrix}
=
\begin{bmatrix}
P & 0 & \dots & 0 \\
0 & P & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & P
\end{bmatrix}
\]
and so
\[
M_D = I - P_D =
\begin{bmatrix}
I - P & 0 & \dots & 0 \\
0 & I - P & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & I - P
\end{bmatrix}
=
\begin{bmatrix}
Q & 0 & \dots & 0 \\
0 & Q & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & Q
\end{bmatrix}
= \mathbf{Q}.
\]

A more compact proof can be given by using the Kronecker product, de-
scribed in Section 7.A.1: with a balanced panel dataset one can write D as I ⊗ ι,
where I is n × n and ι is T × 1, and therefore
\[
P_D = (I \otimes \iota)\left[(I \otimes \iota)'(I \otimes \iota)\right]^{-1}(I \otimes \iota') = (I \otimes \iota)\,[T \cdot I]^{-1}\,(I \otimes \iota') = \frac{1}{T}(I \otimes \iota\iota') = I \otimes P;
\]
as a consequence,
\[
M_D = I \otimes (I - P) = I \otimes Q = \mathbf{Q},
\]
as claimed.
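Both versions of the proof are easy to verify numerically; here is a numpy check based on the Kronecker representation (a sketch for a small balanced panel, n = 3 and T = 4):

import numpy as np

n, T = 3, 4
iota = np.ones((T, 1))
P = iota @ iota.T / T
Q = np.eye(T) - P

D = np.kron(np.eye(n), iota)                    # matrix of unit dummies
P_D = D @ np.linalg.inv(D.T @ D) @ D.T          # projection on the dummies
M_D = np.eye(n * T) - P_D

print(np.allclose(D.T @ D, T * np.eye(n)))      # D'D = T·I
print(np.allclose(M_D, np.kron(np.eye(n), Q)))  # M_D = I ⊗ Q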

7.A.6 The estimator of the variance in the within regression


In order to derive equation (7.14), we need to proceed in steps. Again, I’ll as-
sume that our panel is balanced for simplicity, but this restriction could be easily
dropped at the cost of more cumbersome notation.
First, let’s define the residuals from the within regression as

\[
u_{i,t} = \tilde y_{i,t} - \tilde x_{i,t}'\hat\beta_{FE}.
\]

Now note that the SSR from the within regression can be written as
\[
SSR_W = \sum_{i=1}^{n}\sum_{t=1}^{T} u_{i,t}^2 = \sum_{i=1}^{n} u_i'u_i;
\]

if we maintain independence between units, the rightmost expression is the sum


of n independent rv, and the probability limit of

\[
\frac{1}{n}\,SSR_W = \frac{1}{n}\sum_{i=1}^{n} u_i'u_i
\]

should just equal E[u_i′u_i].

On the other hand, consistency of β̂_FE implies that the within residuals u_{i,t} converge to the centred disturbances ε̃_{i,t} as n → ∞, and so, by extension, does the whole vector for a single unit:
\[
u_i \xrightarrow{\;p\;} \tilde\varepsilon_i.
\]

Therefore, one may say that, for n → ∞, E[u_i′u_i] should converge to E[ε̃_i′ε̃_i], and therefore
\[
\frac{1}{n}\,SSR_W \xrightarrow{\;p\;} E\left[\tilde\varepsilon_i'\tilde\varepsilon_i\right].
\]
This limit can be computed by noting that εei = Q εi , and so, by using the
properties of the trace operator (if you’re not 100% confident on the trace oper-
ator, section 7.A.2 is for you):

\[
\tilde\varepsilon_i'\tilde\varepsilon_i = \operatorname{tr}\left(\tilde\varepsilon_i'\tilde\varepsilon_i\right) = \operatorname{tr}\left(\tilde\varepsilon_i\tilde\varepsilon_i'\right) = \operatorname{tr}\left(Q\,\varepsilon_i\varepsilon_i'\,Q\right).
\]

The expected value of the above is

\[
E\left[\operatorname{tr}\left(Q\varepsilon_i\varepsilon_i'Q\right)\right] = \operatorname{tr}\left(E\left[Q\varepsilon_i\varepsilon_i'Q\right]\right) = \operatorname{tr}\left(Q\,E\left[\varepsilon_i\varepsilon_i'\right]\,Q\right) = \operatorname{tr}\left(\sigma^2_\varepsilon QQ\right) = \sigma^2_\varepsilon \operatorname{tr}(Q)
\]

As for the trace of Q, note that Q = Mι , so

\[
\operatorname{tr}(Q) = \operatorname{tr}(I) - \operatorname{tr}\left(\iota(\iota'\iota)^{-1}\iota'\right) = T - \operatorname{tr}\left((\iota'\iota)^{-1}\iota'\iota\right) = T - 1
\]

so, finally,
\[
E\left[\tilde\varepsilon_i'\tilde\varepsilon_i\right] = (T-1)\,\sigma^2_\varepsilon.
\]

By combining results, it’s easy to see that


\[
\frac{SSR_W}{n} \xrightarrow{\;p\;} (T-1)\,\sigma^2_\varepsilon
\]
and therefore a consistent estimator of σ²_ε is provided by
\[
\hat\sigma^2_\varepsilon = \frac{SSR_W}{n(T-1)} = \frac{SSR_W}{N - n}.
\]
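The result lends itself to a quick simulation check. The numpy sketch below (arbitrary parameter values, illustrative only) generates a balanced panel with one regressor and known σ²_ε, runs the within regression and compares SSR_W/(N − n) with the true value:

import numpy as np

rng = np.random.default_rng(42)
n, T, s2_e, s2_a = 2000, 5, 1.0, 4.0

alpha = rng.normal(scale=np.sqrt(s2_a), size=(n, 1))   # individual effects
x = rng.normal(size=(n, T)) + alpha                    # regressor correlated with alpha
eps = rng.normal(scale=np.sqrt(s2_e), size=(n, T))
y = 1.0 + 0.5 * x + alpha + eps

# within transformation: subtract unit means
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)

beta_fe = (xd * yd).sum() / (xd ** 2).sum()            # within (FE) estimator
ssr_w = ((yd - beta_fe * xd) ** 2).sum()

print(beta_fe)                   # close to 0.5
print(ssr_w / (n * (T - 1)))     # close to s2_e = 1.0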

7.A.7 The RE estimator as FGLS


Let's begin with a brief restatement of what a GLS estimator is: suppose we have a model of the form
\[
y = X\beta + \varepsilon \qquad V[\varepsilon] = \Omega;
\]
if we can find a matrix H such that
\[
H \Omega H' = k I, \qquad (7.28)
\]
where k is some arbitrary positive scalar, then we can transform the model above by premultiplying everything by H:
\[
H y = H X\beta + H\varepsilon, \qquad \text{that is} \qquad \breve y = \breve X\beta + \breve\varepsilon.
\]

It’s easy to check that the covariance matrix of ε̆ is H V [ε] H ′ = k I , so the trans-
formed model is homoskedastic and OLS on the transformed data is efficient
and standard inference applies. The GLS estimator is therefore

β̃ = (X̆′ X̆)−1 X̆′ y̆ = (X′ Ω−1 X)−1 X′ Ω−1 y,


where the second equality comes from

Ω−1 = (1/k)H ′ H ,

which I’m not proving, but it’s easy enough for the reader to demonstrate as an
exercise.
The matrix Ω is in our case given in equation (7.19), but in fact the peculiar
structure of the matrix implies that all we need to do is find a transformation for
the model for each individual, that is equation (7.16) (reported here for conve-
nience):
yi = Xi β + ωi ;
As argued above (see equation (7.17)), the covariance matrix of ωi is14

Σ = σ2ε I + σ2α ιι′

therefore, a simple solution to the GLS problem lies in finding a matrix H such
that H ΣH ′ is a scalar multiple of the identity matrix or, equivalently, a matrix H
such that H ′ H is a scalar multiple of Σ−1 .
In order to do so, it is useful to rewrite Σ in terms of the idempotent matrices
P and Q:
\[
\Sigma = \sigma^2_\varepsilon I + \sigma^2_\alpha \iota\iota' = \sigma^2_\varepsilon Q + (\sigma^2_\varepsilon + T\sigma^2_\alpha)P = \sigma^2_\varepsilon\left[Q + \frac{\sigma^2_\varepsilon + T\sigma^2_\alpha}{\sigma^2_\varepsilon}\,P\right]
\]
Therefore, via the result shown in Section 7.A.3, it's easy to see that the appropriate matrix H is the “inverse square root of Σ”, H = Σ^{-1/2}, which (apart from an irrelevant scale factor) can be written as
\[
H = Q + \sqrt{\frac{\sigma^2_\varepsilon}{\sigma^2_\varepsilon + T\sigma^2_\alpha}}\;P
  = (I - P) + \sqrt{\frac{\sigma^2_\varepsilon}{\sigma^2_\varepsilon + T\sigma^2_\alpha}}\;P
  = I + \left(\sqrt{\frac{\sigma^2_\varepsilon}{\sigma^2_\varepsilon + T\sigma^2_\alpha}} - 1\right)P
  = I - \theta P
\]

where
\[
\theta \equiv 1 - \sqrt{\frac{\sigma^2_\varepsilon}{\sigma^2_\varepsilon + T\sigma^2_\alpha}}.
\]
14 As usual, I’m using the convenient simplification of assuming that the dataset is balanced and

you have T observations for each unit. Again, generalisation to unbalanced panels is possible but
somewhat messier.
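To see that H = I − θP does the trick, a short numpy check (arbitrary variance values, for illustration only) confirms that HΣH′ is proportional to the identity matrix:

import numpy as np

T, s2_e, s2_a = 4, 1.0, 2.5
iota = np.ones((T, 1))
P = iota @ iota.T / T

Sigma = s2_e * np.eye(T) + s2_a * (iota @ iota.T)
theta = 1.0 - np.sqrt(s2_e / (s2_e + T * s2_a))
H = np.eye(T) - theta * P

print(np.round(H @ Sigma @ H.T, 6))   # s2_e times the identity matrix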

7.A.8 Proof that CRE yields FE


As noted in Section 7.4 (equation (7.20)), the quasi-differenced version of y can be written as
\[
\breve y = \left[Q + (1-\theta)P\right] y = \tilde y + (1-\theta)\bar y.
\]
It also follows that

\[
Q\breve y = Q\left[Q + (1-\theta)P\right] y = Qy = \tilde y \qquad (7.29)
\]
\[
P\breve y = P\left[Q + (1-\theta)P\right] y = (1-\theta)Py = (1-\theta)\bar y \qquad (7.30)
\]

and analogous expressions trivially apply to X. Now write the augmented model
as
y = Xβ + X̄γ + ε
and apply quasi-differencing so that GLS is just OLS on the transformed model

y̆ = X̆β + X̄[(1 − θ) · γ ] + ε̆.

To find the estimate of β , apply Frisch-Waugh:


\[
\hat\beta = \left[\breve X' M_{\bar X} \breve X\right]^{-1} \breve X' M_{\bar X}\, \breve y
\]

From equation (7.30), it follows that

X̆′ X̄ = X′ [Q + (1 − θ)P] PX = (1 − θ)X̄′ X̄.

and therefore

\[
\breve X' M_{\bar X} = \breve X' - \breve X'\bar X(\bar X'\bar X)^{-1}\bar X' = \left[\tilde X' + (1-\theta)\bar X'\right] - (1-\theta)\bar X' = \tilde X'
\]

so
\[
\hat\beta = \left[\tilde X'\tilde X\right]^{-1}\tilde X'\breve y = \hat\beta_{FE},
\]
where the last equality comes from writing X̃′y̆ as X′Qy̆ and applying (7.29).
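Like the other results in this appendix, the claim is easy to verify numerically. The sketch below (arbitrary parameter values; any θ strictly between 0 and 1 will do, since the algebra above never uses its specific value) quasi-demeans the data, adds the Mundlak term and checks that the coefficient on the time-varying regressor coincides with the within estimate:

import numpy as np

rng = np.random.default_rng(1)
n, T, theta = 200, 6, 0.6

alpha = rng.normal(size=(n, 1))
x = rng.normal(size=(n, T)) + alpha
y = 0.5 * x + alpha + rng.normal(size=(n, T))

xbar = x.mean(axis=1, keepdims=True) * np.ones((1, T))   # Mundlak term (unit means)
x_qd = x - theta * x.mean(axis=1, keepdims=True)         # quasi-demeaned regressor
y_qd = y - theta * y.mean(axis=1, keepdims=True)

# pooled OLS of quasi-demeaned y on a constant, quasi-demeaned x and the unit means
Z = np.column_stack([np.ones(n * T), x_qd.ravel(), xbar.ravel()])
coef = np.linalg.lstsq(Z, y_qd.ravel(), rcond=None)[0]

# within (FE) estimator, for comparison
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_fe = (xd * yd).sum() / (xd ** 2).sum()

print(np.allclose(coef[1], beta_fe))   # True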
Bibliography

Abadir, K. and J. Magnus (2005): Matrix Algebra, Cambridge University Press.
Andersen, H. and B. Hepburn (2016): "Scientific Method," in The Stanford Encyclopedia of Philosophy, ed. by E. N. Zalta, Metaphysics Research Lab, Stanford University, summer 2016 ed.
Andrews, I., J. H. Stock, and L. Sun (2019): "Weak instruments in instrumental variables regression: Theory and practice," Annual Review of Economics, 11.
Angrist, J. D. and J.-S. Pischke (2008): Mostly Harmless Econometrics: An Empiricist's Companion, Princeton University Press.
Axler, S. (2015): Linear Algebra Done Right, Springer, 2nd ed.
Biau, D. J., B. M. Jolles, and R. Porcher (2009): "P value and the theory of hypothesis testing: an explanation for new researchers," Clinical Orthopaedics and Related Research, 468, 885–892.
Bierens, H. J. (2011): Introduction to the Mathematical and Statistical Foundations of Econometrics, Cambridge University Press.
Billingsley, P. (1986): Probability and Measure, Wiley Series in Probability and Mathematical Statistics, John Wiley and Sons, 2nd ed.
Biørn, E. (2017): Econometrics of Panel Data: Methods and Applications, Oxford University Press.
Brockwell, P. and R. Davis (1991): Time Series: Theory and Methods, Springer Series in Statistics, Springer.
Cameron, A. C. and D. L. Miller (2010): "Robust inference with clustered data," Tech. rep., Working paper.
——— (2015): "A practitioner's guide to cluster-robust inference," Journal of Human Resources, 50, 317–372.
Cameron, A. C. and P. K. Trivedi (2005): Microeconometrics, Cambridge University Press.
Card, D. (1999): "The causal effect of education on earnings," in Handbook of Labor Economics, ed. by O. Ashenfelter and D. Card, Elsevier, vol. 3, Part A, chap. 30, 1801–1863, 1st ed.
Casella, G. and R. L. Berger (2002): Statistical Inference, Duxbury, Pacific Grove, CA, 2nd ed.
Dadkhah, K. (2011): Foundations of Mathematical and Computational Economics, Berlin: Springer, 2nd ed.
Davidson, J. (1994): Stochastic Limit Theory: An Introduction for Econometricians, Oxford University Press.
——— (2000): Econometric Theory, Wiley-Blackwell.
Davidson, J., D. Hendry, F. Srba, and S. Yeo (1978): "Econometric Modelling of the Aggregate Time-Series Relationship between Consumers' Expenditure and Income in the United Kingdom," Economic Journal, 88, 661–692.
Davidson, R. and J. G. MacKinnon (1993): Estimation and Inference in Econometrics, Oxford University Press.
——— (2004): Econometric Theory and Methods, Oxford University Press, New York.
Dixit, A. K. (1990): Optimization in Economic Theory, Oxford University Press.
Durlauf, S. and L. Blume (2008): The New Palgrave Dictionary of Economics, Palgrave Macmillan UK.
Efron, B. and T. Hastie (2016): Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Cambridge University Press, 1st ed.
Epperson, J. F. (2013): An Introduction to Numerical Methods and Analysis, Wiley Publishing, 2nd ed.
Fanaee-T, H. and J. Gama (2014): "Event labeling combining ensemble detectors and background knowledge," Progress in Artificial Intelligence, 2, 113–127.
Freedman, D. and P. Stark (2016): "What is the chance of an earthquake?" Tech. Rep. 611, Department of Statistics, University of California, Berkeley.
Gallant, R. A. (1997): An Introduction to Econometric Theory, Princeton University Press.
Galton, F. (1886): "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
Gourieroux, C. and A. Monfort (1995): Statistics and Econometric Models, Cambridge University Press.
Griliches, Z. (1976): "Wages of Very Young Men," Journal of Political Economy, 84, 69–85.
Hall, A. (2005): Generalized Method of Moments, Advanced Texts in Econometrics, Oxford University Press.
Hansen, B. E. (2019): "Econometrics," https://www.ssc.wisc.edu/~bhansen/econometrics/.
Hansen, L. P. and T. J. Sargent (2013): Recursive Models of Dynamic Linear Economies, no. 10141 in Economics Books, Princeton University Press.
Hausman, J. A. (1978): "Specification Tests in Econometrics," Econometrica, 46, 1251–1271.
Hayashi, F. (2000): Econometrics, Princeton: Princeton University Press.
Hill, R., W. Carter, E. Griffiths, and G. Lim (2018): Principles of Econometrics, John Wiley and Sons, 5th ed.
Horn, R. A. and C. R. Johnson (2012): Matrix Analysis, Cambridge University Press, 2nd ed.
Hsiao, C. (2022): Analysis of Panel Data, Econometric Society Monographs, Cambridge University Press, 4th ed.
King, G. and M. E. Roberts (2015): "How robust standard errors expose methodological problems they do not fix, and what to do about it," Political Analysis, 23, 159–179.
Lütkepohl, H. (1996): Handbook of Matrices, John Wiley and Sons.
Lütkepohl, H. (2005): New Introduction to Multiple Time Series Analysis, Springer.
MacKinnon, J. G. (2006): "Bootstrap Methods in Econometrics," Economic Record, 82, S2–S18.
MacKinnon, J. G., M. Ø. Nielsen, and M. D. Webb (2023): "Cluster-robust inference: A guide to empirical practice," Journal of Econometrics, 232, 272–299.
Marsaglia, G. (2004): "Evaluating the Normal Distribution," Journal of Statistical Software, 11, 1–11.
Mullainathan, S. and J. Spiess (2017): "Machine Learning: An Applied Econometric Approach," Journal of Economic Perspectives, 31, 87–106.
Mundlak, Y. (1978): "On the pooling of time series and cross section data," Econometrica, 46, 69–85.
Popper, K. R. (1968): Conjectures and Refutations: The Growth of Scientific Knowledge, New York: Harper & Row.
Ruud, P. A. (2000): An Introduction to Classical Econometric Theory, Oxford University Press.
Sims, C. A. (1972): "Money, Income, and Causality," American Economic Review, 62, 540–552.
Spanos, A. (1999): Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press.
Staiger, D. and J. Stock (1997): "Instrumental Variables Regression with Weak Instruments," Econometrica, 65, 557–586.
Swamy, P. A. V. B. and S. S. Arora (1972): "The Exact Finite Sample Properties of the Estimators of Coefficients in the Error Components Regression Models," Econometrica, 40, 261–275.
Thurman, W. N. and M. E. Fisher (1988): "Chickens, Eggs, and Causality, or Which Came First?" American Journal of Agricultural Economics, 70, 237–238.
Verbeek, M. (2017): A Guide to Modern Econometrics, John Wiley and Sons, 5th ed.
Wasserstein, R. L. and N. A. Lazar (2016): "Editorial," The American Statistician, 70, 129–133.
White, H. (1980): "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity," Econometrica, 48, 817–838.
——— (1994): Estimation, Inference and Specification Analysis, Cambridge University Press.
Williams, D. (1991): Probability with Martingales, Cambridge University Press.
Wooldridge, J. M. (2010): Econometric Analysis of Cross Section and Panel Data, The MIT Press, 2nd ed.
