Basic Econometrics
Point number one is modelled after the ideas in the two great masterpieces,
Davidson and MacKinnon (1993) and Davidson and MacKinnon (2004). I have
several reasons for this choice, but it is mainly a pedagogical one. The students I
am writing for are people who often don’t feel at ease with the tools of statistical
inference: they have learned the properties of estimators by heart, are not sure
they can read a test, find the concept of the distribution of a statistic a little
unclear (never mind asymptotic distributions), and get confused between the vari-
ance of an estimator and an estimator of the variance. In the best cases. Never
mind; no big deal.
There’s an awful lot you can say about the basic tool of econometrics (OLS) even
without all this, and that’s good to know. Once a student has learned how to
handle OLS properly as a mere computational tool, the issues of its usage and
interpretation as an estimator and of how to read the associated test statistics
can be grasped more correctly. If you mix the two aspects too early, a beginner
is prone to mistake properties of least squares that are true by construction for
properties that depend on some probabilistic assumptions.
Point number two is motivated by laziness. In my teaching career, I have
found that once students get comfortable with matrices, my workload halves. Of
course, it takes some initial sunk cost to convey properly ideas such as projec-
tions and properties of quadratic forms, but the payoff is very handsome. This
book contains no systematic account of matrix algebra; we’re using just the ba-
sics, so anything you find on the Net by googling “matrix algebra lecture notes”
is probably good enough.
As for probability and statistics, I will only assume some familiarity with the
very basics: simple descriptive statistics and basic properties of probability, ran-
dom variables and expectations. Chapter 2 contains a cursory treatment of the
concepts I will use later, but I wouldn’t recommend it as a general reference on
the subject. Its purpose is mainly to make the notation explicit and clarify a few
points. For example, I will avoid any kind of reference to maximum likelihood
methods.
I don’t think I have to justify point number three. I am writing this in 2023,
when typical data sets have hundreds, if not thousands, of observations and nobody
would ever dream of running any kind of inferential procedure with less
than 50 data points. Apart from OLS, there is no econometric technique in actual
use that does not depend vitally on asymptotics, so I guess that readers should
get familiar with the associated concepts, if there is a remote chance that this
will not put them off econometrics completely. The t test, the F test and,
in general, all kinds of degrees-of-freedom corrections are ad-hockeries of the
past; unbiasedness is overrated. Get over it.
I promise I’ll try to be respectful of the readers and not treat them like id-
iots. I assume that if you’re reading this, you want to know more than you do
about econometrics, but this doesn’t give me the right to assume that you need
to be taken by the hand and treated like an 11-year-old.
All the examples and scripts in this book are replicable. All the material is in
a zip file you can download from this link. The software I used throughout the
book is gretl, so data and scripts are in gretl format, but if you insist on using
inferior software (;-)), data are in CSV format too.
Finally, a word of gratitude. A book like this is akin to a software project,
and there’s always one more bug to fix. So, I’d like to thank first all my students
who helped me eradicate quite a few. Then, my colleagues Allin Cottrell, Stefano
Fachin, Francesca Mariani, Giulio Palomba, Luca Pedini, Matteo Picchio, Clau-
dia Pigini, Alessandro Pionati and Francesco Valentini for making many valuable
suggestions. Needless to say, the remaining shortcomings are all mine. Claudia
also allowed me to grab a few things from her slides on IV estimation, so thanks
for that too. If you want to join the list, please send me bug reports and fea-
ture requests. Also, I’m not an English native speaker (I suppose it shows). So,
Anglophones of the world, please correct me whenever needed.
The structure of this book is as follows: chapter 1 explores the properties
of OLS as a descriptive statistic. Inference comes into play at chapter 2 with
some general concepts, while their application to OLS is the object of chapter 3,
with some basic ideas on diagnostic testing and heteroskedasticity in Chapter
4. Extensions of basic OLS are considered in the subsequent chapters: Chapter
5 deals with dynamic models, Chapter 6 with instrumental variable estimation
and, finally, Chapter 7 considers linear models for panel data. Each chapter has
an appendix, named “Assorted results”, where I discuss some of the material I
use during the chapter in a little more detail.
In some cases, I will use a special format for short pieces of text, like this. They
contain extra stuff that I consider interesting, but not indispensable for the overall
comprehension of the main topic.
Chapter 1

OLS: algebraic and geometric properties
1.1 Models
I won’t even attempt to give the reader an account of the theory of econometric
modelling. For our present purposes, suffice it to say that we econometricians
like to call a model a mathematical description of something, that doesn’t aim
at being 100% accurate, but still, hopefully, useful.1
We have a quantity of interest, also called the dependent variable, which
we observe more than once: a collection of numbers y 1 , y 2 , . . . , y n , where n is the
size of our data set. These numbers can be anything that can be given a coherent
numerical representation; in this course, however, we will confine ourselves to
the case where the i -th observation y i is a real number. So for example, we could
record the income for n individuals, the export share for n firms, the inflation
rate for a given country at n points in time.
Now suppose that, for each data point, we also have a vector of k elements
containing auxiliary data possibly helpful in better understanding the differ-
ences between the y i s; we call these explanatory variables,2 or xi in symbols.3
To continue the previous examples, xi may include a numerical description of
the individuals we recorded the income of (such as age, gender, educational at-
tainment and so on), or characteristics of the firms we want to study the export
propensity for (size, turnover, R&D expenditure and so on), or the conditions of
the economy at the time the inflation rate was recorded (interest rate, level of
output, and so forth).
1 “All models are wrong, but some are useful” (G. E. P. Box).
2 Statisticians often call them covariates, while people from the machine learning community like the word features.
3 I will almost always use boldface symbols to indicate vectors.
In short, our model will be something like

y_i ≃ m(x_i)
where we implicitly assume that if xi is not too different from x j , then we should
expect y i to be broadly close to y j : if we pick two people of the same age, with
the same educational level and many other characteristics in common we would
expect that their income should be roughly the same. Of course this won’t be
true in all cases (in fact, chances are that this will never be true exactly), but
hopefully our model won’t lead us to catastrophic mistakes.
The reason why we want to build models is that, once the function m(·) is
known, it becomes possible to ask ourselves interesting questions by inspect-
ing the characteristics of that function. So for example, if it turned out that the
export share of a firm is increasing in the expenditure in R&D, we may make con-
jectures about the reasons why it should be so, look for some economic theory
that could explain the result, and wonder if one could improve export competi-
tiveness by giving the firms incentives to do research.
Moreover, the door is open to forecasting: given the characteristics of a hy-
pothetical firm or individual, the model makes it possible to guess what their
export share or income (respectively) should be. I don’t think I have to convince
the reader of how useful this could be in practice.
Of course, we will want to build our model in the best possible way. In other
words, our aim will be choosing the function m(·) according to some kind of
optimality criterion. This is what the present course is about.
But there’s more: as we will see, building an optimal model is impossible in
general. At most, we may hope to build the best possible model for the data
that we have available. Of course, there is no way of knowing if the model we
built, that perhaps works rather well with our data, will keep working equally
well with new data. Imagine you built a model for the inflation rate in a country
with monthly data from January 2000 to December 2017. It may well be that your
model performs (or, as we say, “fits the data”) very well for that period, but what
guarantee do you have that it will keep doing so in 2018, or in the more distant
future? The answer is: you have none. But still, this is something that we’d like to
do; our mind has a natural tendency to generalise, to infer, to extrapolate. And
yet, there is no logical compelling basis for proving that it’s a good idea to do
so.4 The way out is framing the problem in a probabilistic setting, and this is the
reason why econometrics is so intimately related with probability and statistics.
For the moment, we’ll start with the problem of choosing m(·) in a very sim-
ple case, that is when we have no extra information xi . In this case, the function
becomes a constant:
y i ≃ m(xi ) = m
and the problem is very much simplified, because it means we have to pick a
number m in some optimal way, given the data y 1 , y 2 , . . . , y n . In other words, we
4 The philosophically inclined reader may at this point google for “Bertrand Russell’s turkey”.
have to find a function of the data which returns the number m. Of course, a
function of the data is what we call a statistic. In the next section, I will prove
that the statistic we’re looking for is, in this case, the average of the y i s, that is
Ȳ = (1/n) Σ y_i.

1.2 The average

The sample average can be written in two equivalent ways:
\[
\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\,\iota' y, \qquad (1.1)
\]
where ι is a column vector full of ones. The “sum” notation is probably more
familiar to most readers; I prefer the matrix-based one not only because I find
it more elegant, but also because it’s far easier to generalise. The nice feature
of the vector ι is that its inner product with any conformable vector x yields the
sum of the elements of x.5
We use averages all the time. Why is the average so popular? As I said, we’re
looking for a descriptive statistic m, as a synthesis of the information contained
in our data set.
In 1929, Oscar Chisini (pronounced kee-zee-nee) pro-
posed the following definition: for a function of interest
g (·), the mean of the vector y is the number m that yields
the unique solution to g (y) = g (m · ι). Powerful idea: for ex-
ample, the average is the solution of the special case when
the g (·) function is the sum of the vector’s elements, and
the reader may want to spend some time with more exotic
cases.
Chisini’s idea may be further generalised: if our aim is
to use m — that we haven’t yet chosen — as an imperfect
but parsimonious description of the whole data set, the question that naturally
arises is: how much information is lost?
If all we knew, for a given data set, was m, what could we say about each
single observation? If we lack any more information, the most sensible thing to
say is that, for a generic i , y i should more or less be m. Consider the case of
A. S. Tudent, who belongs to a class for which the “typical” grade in economet-
rics is 23;6 the most sensible answer we could give to the question “What was the
grade that A. S. Tudent got in econometrics?” would be “Around 23, I guess”. If
the actual grade that A. S. got were in fact 23, OK. Otherwise, we could measure
by how much we were wrong by taking the difference between the actual grade
and our guess, e i = y i − m. We call these quantities the residuals; the vector of
residuals is, of course, e = y − ι · m.
In the ideal case, using m to summarise the data should entail no informa-
tion loss at all, and the difference between y i and m should be 0 for all i (all stu-
dents got 23). If it weren’t so, we may measure how well m does its job through
the size of the residuals. Let’s define a function, called loss function, which mea-
sures the cost we incur because of the information loss.
L(m) = C [e(m)]
In principle, there are not many properties such a function should be assumed
to have. It seems reasonable that C (0) = 0:7 if all the residuals are 0, no approxi-
mation errors occur and the cost is nil. Another reasonable idea is C (e) ≥ 0: you
can’t gain from a mistake.8 Apart from this, there is not very much you can say:
the L(·) function cannot be assumed to be convex, or symmetric, or anything
else. It depends on the context.
Whatever the shape of this function, however, we’ll want to choose m so that
L(m) is as small as possible. In math-speak: for a given problem, we can write
down the loss function and choose the statistic which minimises it. In formulae:
\[
\hat{m} = \mathrm{Argmin}_{m \in \mathbb{R}}\; L(m) \qquad (1.2)
\]
where you read the above as: m with a hat on is that number that you find if you
choose, among all real numbers, the one that makes the function L(m) as small
as possible.
In practice, by finding the minimum of the L(·) function for a given prob-
lem, we can be confident that we are using our data in the best possible way. At
this point, the first thing that crosses a reasonable person’s mind is “How do I
choose L(·)? I mean, what should it look like?”. Fair point. Apart from extraordi-
nary cases when the loss function is a natural consequence of the problem itself,
6 Note for international readers: in the Italian academic system, which is what I’m used to, grades range from 18 (the pass mark) to 30.
7 The converse need not hold: the cost may be nil even with non-zero errors. For example, in some contexts a “small” error may be irrelevant.
writing down its exact mathematical form may be complicated. What does the
L(m) function look like for the grades in econometrics of our hypothetical class?
Hard to say.
Moreover, we often must come up with a summary statistic without know-
ing in advance what it will be used for. Obviously, in these cases finding a one-
size-fits-all optimal solution is downright impossible. We have to make do with
something that is not too misleading. A possible choice is
\[
L(m) = \sum_{i=1}^{n} (y_i - m)^2 = (y - \iota\, m)'(y - \iota\, m) = e'e \qquad (1.3)
\]
Since L(m) is differentiable, we can locate its minimum by setting its derivative to zero. The derivative is
\[
\frac{dL(m)}{dm} = \sum_{i=1}^{n} \frac{d(y_i - m)^2}{dm} = -2\sum_{i=1}^{n}(y_i - m),
\]
so that, in matrix notation, the first-order condition is
\[
\frac{dL(m)}{dm} = -2\,\iota' y + 2m\cdot\iota'\iota = -2\,\iota'(y - \iota m) = 0
\]
whence
\[
\iota' y = (\iota'\iota)\,\hat{m} \;\Longrightarrow\; \hat{m} = (\iota'\iota)^{-1}\iota' y = \bar{Y}
\]
where again, the hat ( ˆ ) on m indicates that, among all possible real numbers,
we are choosing the one that minimises our loss function.
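Readers who like to see things happen on a computer may try the following hansl (gretl) sketch; it is not one of the book’s official scripts, and the data are just made-up numbers, but it shows that the matrix formula (ι′ι)⁻¹ι′y reproduces the ordinary sample mean:

    # quick check that (iota'iota)^{-1} iota'y is just the sample mean
    matrix y = {1; 5; 4; 6; 3; 3; 2; 6; 3; 4}    # made-up data
    matrix iota = ones(rows(y), 1)               # the vector of ones
    matrix m_hat = inv(iota'*iota) * (iota'*y)   # the "matrix" formula
    matrix m_bar = meanc(y)                      # gretl's built-in column mean
    print m_hat m_bar                            # the two numbers coincide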
The argument above, which leads to choosing the average as an optimal
summary is, in fact, much more general than it may seem: many of the descrip-
tive statistics we routinely use are special cases of the average, where the data
y are subject to some preliminary transformation. In practice: the average of z,
where z i = h(y i ) can be very informative, if we choose the function h(·) wisely.
The variance is the most obvious example: the sample variance9 is just the aver-
age of z i = (y i − Ȳ )2 , which measures how far y i is from Ȳ .
Things get even more interesting when we express a frequency as an average:
define the event E = {y i ∈ A}, where A is some subset of the possible values for
y i ; now define the variable z i = I(y i ∈ A), where I(·) is the so-called “indicator
function”, that gives 1 when its argument is true and 0 when false. Evidently, the
average of the z_i, Z̄, is the relative frequency of E:
\[
\bar{Z} = \frac{\sum_{i=1}^{n} z_i}{n} = K/n;
\]
since z_i can only be 0 or 1, K = Σ z_i is just the number of times the event E
has occurred. I’m sure you can come up with more examples.
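If you want to see the indicator trick at work, here is a tiny hansl sketch (made-up data, arbitrary event):

    # the relative frequency of the event y_i > 3 as the average of an indicator
    matrix y = {1; 5; 4; 6; 3; 3; 2; 6; 3; 4}    # made-up data
    matrix z = (y .> 3)        # z_i = 1 if y_i > 3, 0 otherwise
    matrix relfreq = meanc(z)  # average of z = relative frequency of the event
    print relfreq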
What if the transformation involves a second observed variable x_i? Take for example
\[
z_i = (y_i - \bar{Y})(x_i - \bar{X});
\]
as is well known, the average of z_i is known as the covariance; but this is just boring
elementary statistics.
The reason why I brought this up is to highlight the main problem with co-
variance (and correlation, that is just covariance rescaled so that it’s guaranteed
to be between -1 and 1): it’s a symmetric concept. The variables y i and x i are
treated equally: the covariance between y i and x i is by construction the same as
between x i and y i . On the contrary, we often like to think in terms of y i = m(x i ),
because what we have in mind is an interpretation where y i “depends” on x i ,
and not the other way around.10 This is why we call y i the dependent variable
and x i the explanatory variable. In this context, it’s rather natural to see what
happens if you split y into several sub-vectors, according to the values that x i
takes. In a probabilistic context, we’d call this conditioning (see section 2.2.2).
Simple example: suppose our vector y includes observations on n people,
with n m males and n f = n − n m females. The information on gender is in a vari-
able x i , that equals 1 for males and 0 for females. As is well known, a 0/1 variable
may be called “binary”, “Boolean”, “dichotomic”, but we econometricians tradi-
tionally call it a dummy variable.11
Common sense suggests that, if we take into account the information we
have on gender, the average by gender will give us a data description which
should be slightly less concise than overall average (since we’re using two num-
bers instead of one), but certainly not less accurate. Evidently, we can define
\[
\bar{Y}_m = \frac{\sum_{x_i=1} y_i}{n_m} = \frac{S_m}{n_m}
\qquad\qquad
\bar{Y}_f = \frac{\sum_{x_i=0} y_i}{n_f} = \frac{S_f}{n_f}
\]
where S m and S f are the sums of y i for males and females, respectively.
Now, everything becomes more elegant and exciting if we formalise the prob-
lem in a similar way to what we did with the average. We would like to use in the
best possible way the information (that we assume we have) on the gender of the
i -th individual. So, instead of summarising the data by a number, we are going
to use a function, that is something like
m(x i ) = m m · x i + m f · (1 − x i )
which evidently equals m m for men (since x i = 1) and m f for women (since x i =
0). Our summary will be a rule giving us ‘representative’ values of y i according
to x i .
Let’s go back to our definition of residuals as approximation errors: in this
case, you clearly have that e i ≡ y i − m(x i ), and therefore
y i = m m x i + m f (1 − x i ) + e i (1.4)
10 I’m being deliberately vague here: in everyday speech, saying that A depends on B may mean
many things, not necessarily consistent. For example, “dependence” may not imply a cause-effect
link. This problem is much less trivial than it seems at first sight, and we’ll leave it to professional
epistemologists.
11 I am aware that there are people who don’t fit into the traditional male/female distinction, and
I don’t mean to disrespect them. Treating gender as a binary variable just makes for a nice and
simple example here, ok?
so we can use matrix notation, which is much more compact and elegant
y = Xβ + e, (1.5)
where
\[
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} m_f \\ m_m - m_f \end{bmatrix}
\]
and X is a matrix with n rows and 2 columns; the first column is ι and the second
one is x. The i -th row of X is [1, 1] if the corresponding individual is male and
[1, 0] otherwise. To be explicit:
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix}
=
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{n-1} \\ 1 & x_n \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}
+
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{n-1} \\ e_n \end{bmatrix}
\]
Minimising the sum of squared residuals e′e with respect to β yields, just as before, the condition
\[
X'y = X'X\,\hat{\beta} \qquad (1.6)
\]
What we have to do now is solve equation (1.6) for β̂. The solution is unique
if X′X is invertible (if you need a refresher on matrix inversion, and related matters,
subsection 1.A.3 is for you):
\[
\hat{\beta} = \left(X'X\right)^{-1} X'y. \qquad (1.7)
\]
12 Need I remind the reader of the rule for transposing a matrix product, that is (AB )′ = B ′ A ′ ?
Obviously not.
13 Not so well-known, maybe? Jump to subsection 1.A.1.
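In hansl, equation (1.7) is literally one line of code. The sketch below is not one of the book’s official scripts and uses made-up data; it computes (1.7) explicitly and compares the result with gretl’s built-in matrix-OLS function mols():

    # equation (1.7) at work on simulated data
    set seed 123
    matrix X = ones(20,1) ~ mnormal(20,2)    # a constant and two random regressors
    matrix y = mnormal(20,1)
    matrix b_explicit = inv(X'*X) * X'*y     # equation (1.7)
    matrix b_builtin = mols(y, X)            # gretl's own matrix-OLS function
    print b_explicit b_builtin               # the two vectors coincide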
Equation (1.7) is the single most important equation in this book, and this is
why I framed it into a box. The vector β̂ is defined as the vector that minimises
the sum of squared residuals among all vectors with k elements (where k = 2 in
this case):
\[
\hat{\beta} = \mathrm{Argmin}_{\beta \in \mathbb{R}^k}\; e'e,
\]
and the expression in equation (1.7) turns the implicit definition into an explicit
formula that you can use to calculate β̂ .
The coefficients β̂ obtained from (1.7) are known as OLS coefficients, or OLS
statistic, from Ordinary Least Squares.14 A very common idiom that economists
use when referring to the calculation of OLS is “regressing y on X”. The usage of
the word “regression” here might seem odd, but will be justified in chapter 3.
The “hat” symbol has exactly the same meaning as in eq. (1.2): of all the
possible choices for β , we pick the one that makes eq. (1.6) true, and therefore
minimises the associated loss function e′ e. The vector
ŷ = Xβ̂
is our approximation to y. The term we normally use for the elements of ŷ is
the fitted values: the closer they are to y, the better we say that the model fits
the data.
In this example, a few simple calculations suffice to show that
\[
X'X = \begin{bmatrix} n & n_m \\ n_m & n_m \end{bmatrix}
\qquad
X'y = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{x_i=1} y_i \end{bmatrix} = \begin{bmatrix} S_m + S_f \\ S_m \end{bmatrix}
\]
where S_m = Σ_{x_i=1} y_i and S_f = Σ_{x_i=0} y_i: the sums of y_i for males and females,
respectively. By using the standard rule for inverting (2 × 2) matrices, which I
will also assume known,15
\[
(X'X)^{-1} = \frac{1}{n_m n_f} \begin{bmatrix} n_m & -n_m \\ -n_m & n \end{bmatrix}
\]
so that
\[
\hat{\beta} = \frac{1}{n_m n_f} \begin{bmatrix} n_m & -n_m \\ -n_m & n \end{bmatrix} \begin{bmatrix} S_m + S_f \\ S_m \end{bmatrix}
= \frac{1}{n_m n_f} \begin{bmatrix} n_m S_f \\ n_f S_m - n_m S_f \end{bmatrix}
\]
and finally
\[
\hat{\beta} = \begin{bmatrix} \dfrac{S_f}{n_f} \\[6pt] \dfrac{S_m}{n_m} - \dfrac{S_f}{n_f} \end{bmatrix}
= \begin{bmatrix} \bar{Y}_f \\ \bar{Y}_m - \bar{Y}_f \end{bmatrix}
\]
14 Why “ordinary”? Well, because there are more sophisticated variants, so we call these “ordi-
nary” as in “not extraordinary”. We’ll see one of those variants in section 4.2.1.
15 If you’re in trouble, go to subsection 1.A.4.
and it’s easy to see that the fitted value for males (x i = 1) is Ȳm , while the one for
the females (x i = 0) is Ȳ f .
Example 1.1
Let me give you a numerical example of the above: suppose we have 80 individ-
uals (50 males and 30 females) and that we’re interested in their monthly wage.
Moreover, S_m = Σ_{x_i=1} y_i = €60000 and S_f = Σ_{x_i=0} y_i = €42000; therefore, the
average wage is Ȳ_m = 60000/50 = 1200 for males and Ȳ_f = 42000/30 = 1400 for
females. After ordering observations by putting the data for males first,16 the X
matrix looks like
\[
X = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{bmatrix}
\]
where the top block of rows has 50 rows and the bottom one has 30. As the reader
may easily verify,
\[
X'X = \begin{bmatrix} 80 & 50 \\ 50 & 50 \end{bmatrix}
\qquad
X'y = \begin{bmatrix} 102000 \\ 60000 \end{bmatrix}
\]
so that
\[
\hat{\beta} = (X'X)^{-1} X'y = \begin{bmatrix} 1400 \\ -200 \end{bmatrix}
\]
and the fitted values are
\[
\hat{y}_i = 1400 - 200\, x_i,
\]
which reads: for females, x i = 0, so their typical income is €1400; for males, in-
stead, x i = 1, so their income is given by 1400 − 200 · 1 = €1200.
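A quick hansl check of these numbers, using only the cross-products X′X and X′y (which is all we need here, since the individual data are not shown):

    # example 1.1: group averages via the OLS formula
    matrix XtX = {80, 50; 50, 50}
    matrix Xty = {102000; 60000}
    matrix b = inv(XtX) * Xty    # expected result: (1400, -200)'
    print b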
Once again, opting for a quadratic loss function (and therefore minimising e′e)
delivers a solution consistent with common sense, and our approximate description
of the vector y uses a function whose parameters are the statistics we
are interested in.
16 With no loss of generality, as a mathematician would say.
Example 1.2
Suppose that
\[
y = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 3 \\ 0 \\ 1 \end{bmatrix}
\qquad
x = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 5 \\ 1 \\ 1 \end{bmatrix}
\]
and therefore
\[
\hat{\beta} = \begin{bmatrix} 0.4 \\ 0.475 \end{bmatrix}
\qquad
\hat{y} = \begin{bmatrix} 2.3 \\ 1.825 \\ 1.35 \\ 2.775 \\ 0.875 \\ 0.875 \end{bmatrix}
\qquad
e = \begin{bmatrix} -1.3 \\ 1.175 \\ 0.65 \\ 0.225 \\ -0.875 \\ 0.125 \end{bmatrix}
\]
Hence, the function m(x i ) minimising the sum of squared residuals is m(x i ) =
0.4 + 0.475x i and e′ e equals 4.325.
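The following hansl sketch reproduces the numbers of example 1.2; it is just a check, so feel free to skip it:

    # example 1.2 by explicit computation
    matrix y = {1; 3; 2; 3; 0; 1}
    matrix x = {4; 3; 2; 5; 1; 1}
    matrix X = ones(6,1) ~ x
    matrix b = inv(X'*X) * X'*y    # expected: (0.4, 0.475)'
    matrix yhat = X*b
    matrix e = y - yhat
    matrix SSR = e'*e              # expected: 4.325
    print b yhat e SSR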
In traditional textbooks, at this point you always get a picture similar to the
one in Figure 1.1, which is supposed to aid intuition; I don’t like it very much, and
will explain why shortly. Nevertheless, let me show it to you: the figure uses the
same data as in example 1.2.
In Figure 1.1, each black dot corresponds to a (x i , y i ) pair; the dashed line
plots the m(x) function and the residuals are the vertical differences between the
dots and the dashed line; the least squares criterion makes the line go through
the dots in such a way that the sum of these differences (squared) is minimal.
So, for example, for observation number 1 the observed value of x_i is 4, the
observed y_1 equals 1 and the corresponding fitted value is ŷ_1 = 2.3, so the residual is the vertical gap between the two.
[Figure 1.1: the data of example 1.2 with the fitted line; the points (x_1, y_1) and (x_1, ŷ_1) are marked.]
The natural generalisation of this idea is to allow for more than one explanatory variable, so that the approximating function becomes
\[
\hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki}
\]
For example, suppose we have data on each student in the class A. S. Tudent
belongs to: how many hours each student spent studying econometrics, their
previous grades in related subjects, and so on; these data, for the i-th student,
are contained in the vector x′i , which brings us back to equation (1.5).
The algebraic apparatus we need for dealing with the generalised problem
is, luckily, unchanged; allow me to recap it briefly. If the residual we use for
minimising the loss function is e_i(β) = y_i − x_i′β, then the vector of residuals is
\[
e(\beta) = y - X\beta \qquad (1.8)
\]
and minimising the sum of squared residuals e(β)′e(β) leads to the first-order condition
\[
X'e(\hat{\beta}) = 0 \qquad (1.9)
\]
(a more detailed proof, should you need it, is in subsection 1.A.5). By putting
together (1.8) and (1.9) you get a system of equations sometimes referred to as
normal equations:
\[
X'X\,\hat{\beta} = X'y \qquad (1.10)
\]
If you think that all this is very clever, well, you’re right.
The inventor of OLS is arguably the greatest mathematician of all time: the great Carl Friedrich Gauss, also known as the princeps mathematicorum.18
Note, again, that the average can be obtained as the special case when X = ι. Moreover, it’s nice to observe that the above formulae make it possible to compute all the relevant quantities without necessarily observing the matrices X and y; in fact, all the elements you need are the following:
1. the scalar y′y;
2. the k-element vector X′y;
3. the k × k matrix X′X.
Suppose, for example, that one of the explanatory variables, x_j, is the number of hours A. S. Tudent spent studying; the corresponding coefficient satisfies
\[
\frac{\partial m(x)}{\partial x_j} = \beta_j \qquad (1.12)
\]
and therefore can be read as the partial derivative of the m(·) function with respect
to the number of hours. Clearly, you may attempt to interpret these magnitudes
by their sign (do more hours of study improve your grade?) and by their
magnitude (if so, by how much?). However, you should resist the temptation to
give the coefficients a counterfactual interpretation (if A. S. Tudent had studied
2 more hours, instead of watching that football game, by how much would their
mark have improved?); this is possible, in some circumstances, but not always
(more on this in Section 3.6).
18 To be fair, the French mathematician Adrien-Marie Legendre rediscovered it independently a few years later.
Focusing on marginal effects is what we do most often in econometrics, because the question of interest is not really approximating y given x, but rather understanding what the effect of x on y is (and, possibly, how general and robust this effect is). In other words, the object of interest in econometrics is much more often β, rather than m(x). The opposite happens in a broad class of statistical methods that go, collectively, by the name of machine learning methods and focus much more on prediction than interpretation. In order to predict correctly, these models use much more sophisticated ways of handling the data than a simple linear function, and even writing the rule that links x to ŷ is impossible.
Machine learning tools have been getting quite popular at the beginning of the XXI century, and are the tools that companies like Google and Amazon use to predict what video you’d like to see on Youtube or what book you’d like to buy when you open their website. As we all know, these models perform surprisingly well in practice, but nobody would be able to reconstruct how their predictions come about. The phrase some people use is that machine learning procedures are “black boxes”: they work very well, but they don’t provide you with an explanation of why you like that particular video. The pros and cons of econometric models versus machine learning tools are still under scrutiny by the scientific community, and, if you’re curious, I’ll just give you a pointer to Mullainathan and Spiess (2017).
\[
\hat{\beta}_1 = \frac{x_1' y}{x_1' x_1},
\]
X = [ι Z]
does not have full rank, so X′X doesn’t have full rank either, and consequently is not
invertible.19
19 If you have problems following this argument, sections 1.A.3 and 1.A.4 may be of help.
The remedy you normally adopt is to drop one of the columns of Z, and the
corresponding category becomes the so-called “reference” category. For example,
suppose you have a geographical variable x_i conventionally coded from 1 to
3 (1=North, 2=Centre, 3=South). The model m(x_i) = β1 + β2 x_i is clearly meaningless,
but one could think of setting up an alternative model like
ŷ i = β1 + β2 Ni + β3C i + β4 S i ,
where Ni = 1 if the i -th observation pertains to the North, and so on. This would
make more sense, as all the variables in the model have a proper numerical in-
terpretation. However, in this case we would have a collinearity problem for the
reasons given above, that is Ni + C i + S i = 1 by construction for all observations
i.
The solution is dropping one of the geographical dummies from the model:
for example, let’s say we drop the “South” dummy S i : the model would become
ŷ i = β1 + β2 Ni + β3C i ;
observe that with the above formulation the fitted value for a southern observa-
tion would be
ŷ i = β1 + β2 × 0 + β3 × 0 = β1
whereas for a northern one you would have
ŷ_i = β1 + β2 × 1 + β3 × 0 = β1 + β2,
so that β2 measures the difference between the North and the reference category (the South).
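Here is a small hansl sketch of the dummy trap, with made-up geographical codes (1=North, 2=Centre, 3=South, as in the example above): with all three dummies plus the constant the regressor matrix is rank-deficient, and dropping one dummy fixes the problem.

    # the dummy trap: rank check
    matrix x = {1; 2; 3; 1; 2; 3; 1; 2}        # made-up codes
    matrix N = (x .= 1)
    matrix C = (x .= 2)
    matrix S = (x .= 3)
    matrix X_all = ones(rows(x),1) ~ N ~ C ~ S
    matrix X_ref = ones(rows(x),1) ~ N ~ C     # "South" is the reference category
    printf "rank with all dummies: %d (out of 4 columns)\n", rank(X_all)
    printf "rank after dropping one: %d (out of 3 columns)\n", rank(X_ref)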
1.3.4 Nonlinearity
A further step in enhancing this setup would be allowing for the possibility that
the function m(x i ) is non-linear. In a traditional econometric setting this idea
would take us to consider the so-called NLS (Nonlinear Least Squares) tech-
nique. I won’t go into this either, for two reasons.
First, because minimising a loss function like
\[
L(\beta) = \sum_{i=1}^{n} \left[ y_i - m(x_i, \beta) \right]^2,
\]
where m(·) is some kind of crazy arbitrary function, may be a difficult problem:
it could have more than one solution, or none, or maybe one that cannot be
written in closed form.
Second, the linear model is in fact more general than it seems, since in order
to use OLS it is sufficient that the model be linear in the parameters, not nec-
essarily in the variables. For example, suppose that we have one explanatory
variable; it is perfectly possible to use a model formulation like
\[
m(x_i) = \beta_1 + \beta_2 x_i + \beta_3 x_i^2. \qquad (1.13)
\]
In this case, the marginal effect of x_i is
\[
\frac{\partial m(x_i)}{\partial x_i} = \beta_2 + 2\beta_3 x_i
\]
and its sign would depend on the condition x_i > −β_2/(2β_3), so it’s entirely possible
that the marginal effect of x i on y i is positive for some units in our sample and
negative for others.
Example 1.3
Suppose you have the following model:
\[
\hat{y}_i = m(x_i) = -1 + 2 x_i - 0.4 x_i^2 + 2\sqrt{x_i}
\]
[Figure: m(x) and its marginal effect ∂m(x)/∂x, plotted as functions of x.]
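A possible hansl sketch for this example (the grid of x values is arbitrary) computes the function and its marginal effect, whose analytical form is ∂m(x)/∂x = 2 − 0.8x + x^(−1/2):

    # example 1.3: m(x) and its marginal effect on a grid of points
    matrix x = seq(1, 40)' ./ 10                  # grid: 0.1, 0.2, ..., 4.0
    matrix m = -1 + 2*x - 0.4*x.^2 + 2*sqrt(x)    # the function
    matrix d = 2 - 0.8*x + x.^(-0.5)              # its derivative
    print m d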
More generally, what we can treat via OLS is the class of models that can be
written as
\[
m(x_i) = \sum_{j=1}^{k} \beta_j\, g_j(x_i),
\]
where xi are our “base” explanatory variables and g j (·) is a sequence of arbitrary
transformations, no matter how crazy. Each element of this sequence becomes
a column of the X matrix. Clearly, once you have computed the β̂ vector, the
marginal effects are easy to calculate (of course, as long as the g j (·) functions
are differentiable):
\[
\frac{\partial m(x_i)}{\partial x_i} = \sum_{j=1}^{k} \hat{\beta}_j\, \frac{\partial g_j(x_i)}{\partial x_i}.
\]
1.4 The geometry of OLS

The first concept we need is that of distance: a distance is a function d(·, ·) of two objects that enjoys the following properties:

1. d(a, b) = d(b, a)
2. d(a, b) ≥ 0
3. d(a, b) = 0 ⇔ a = b
4. d(a, b) ≤ d(a, c) + d(c, b)

The first three are obvious; as for the last one, called triangle inequality, it just
means that the shortest way is the straight one. The objects in question may be
of various sorts, but we will only consider the case when they are vectors. The
distance of a vector from zero is its norm, written as ∥x∥ = d (x, 0).
Many functions d(·) enjoy the four properties above, but the concept of distance
we use in everyday life is the so-called Euclidean distance, defined as
\[
d(x, y) = \sqrt{(x - y)'(x - y)}
\]
and the reader may verify that the four properties are satisfied by this definition.
Obviously, the formula for the Euclidean norm is ∥x∥ = √(x′x).
The second concept I will use is the idea of a vector space. If you’re not
familiar with vector spaces, linear combinations and the rank of a matrix, then
sections 1.A.2 and 1.A.3 are for you.20 In brief, I use the expression Sp (X) to
indicate the set of all vectors that can be obtained as a linear combination of the
columns of X.
Consider the space Rn , where you have a vector y and a few vectors x j , with
j = 1 . . . k and k < n, all packed in a matrix X. What we want to find is the element
of Sp (X) which is closest to y. In formulae:
\[
\hat{y} = \mathrm{Argmin}_{x \in Sp(X)}\; \|y - x\|;
\]
since the optimal point must belong to Sp (X), the problem can be rephrased as:
find the vector β such that Xβ (that belongs to Sp (X) by construction) is closest
to y:
\[
\hat{\beta} = \mathrm{Argmin}_{\beta \in \mathbb{R}^k}\; \|y - X\beta\|. \qquad (1.14)
\]
Since minimising the norm is the same as minimising its square, that is the sum of squared residuals, the solution is again the one given by equation (1.7), from which
\[
\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y.
\]
20 If, on the other hand, you find the topic intriguing and want a rigorous yet very readable book
The matrix
\[
P_X \equiv X(X'X)^{-1}X'
\]
is known as a projection matrix,21 and ŷ can be written as ŷ = P_X y. The reader may find it amusing that in the
econometrics jargon the PX matrix is sometimes referred to as the “hat” matrix,
because PX “puts a hat on y”.
[Figure: the vector y, its projection ŷ onto Sp(x) and the residual e, drawn in the plane with axes “coordinate 1” and “coordinate 2”.]
In this simple example, x = (3, 1) and y = (5, 3); the reader may want to check that
ŷ = (5.4, 1.8) and e = (−0.4, 1.2).
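The same two-dimensional example can be checked numerically; the hansl sketch below (again, not part of the book’s scripts) builds the projection matrix and verifies that the residual is orthogonal to x:

    # projecting y = (5,3)' onto the space spanned by x = (3,1)'
    matrix x = {3; 1}
    matrix y = {5; 3}
    matrix Px = x * inv(x'*x) * x'    # projection matrix onto Sp(x)
    matrix yhat = Px*y                # expected: (5.4, 1.8)'
    matrix e = y - yhat               # expected: (-0.4, 1.2)'
    matrix ortho = x'*e               # should be zero: e is orthogonal to Sp(x)
    print yhat e ortho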
The matrix P_X is symmetric and idempotent:
\[
P_X = P_X' \qquad\qquad P_X P_X = P_X.
\]
We call idempotent something that does not change when multiplied by itself;
for example, the real numbers 0 and 1 are idempotent. A nice way to understand
the meaning of idempotency is by reflecting on its geometrical implication: the
21 If I were pedantic, I’d have to say orthogonal projection, because you also get a tool called
oblique projection. We’ll never use it in this book, apart from a passing reference in chapter 6.
matrix PX takes a vector from wherever it is and moves it onto the closest point of
Sp (X); but if the starting point already belongs to Sp (X), obviously no movement
takes place at all, so applying PX to a vector more than once produces no extra
effects (PX y = PX PX y = PX PX · · · PX y).
It can also be proven that PX is singular;22 again, this algebraic property can
be given a nice intuitive geometric interpretation: a projection entails a loss
of information, because some of the original coordinates get “squashed” onto
Sp (X): in the fly example, it’s impossible to know the exact position of the fly
from its shadow, because one of the coordinates (the distance from the screen)
is lost. In formulae, the implication of PX being singular is that no matrix A exists
such that A·PX = I, and therefore no matrix exists such that Aŷ = y, which means
that y is impossible to reconstruct from its projection.
In practice, when you regress y on X, you are performing exactly the same
calculations that are necessary to find the projection of y onto Sp (X), and the
vector β̂ contains the coordinates for locating ŷ in that space.
There is another interesting matrix we’ll be using often:
M X = I − PX .
A first important property is that M_X annihilates X:
\[
M_X X = (I - P_X)X = X - X = [0],
\]
where I’m using the notation [0] for “a matrix full of zeros”.
Some more noteworthy properties: M_X is symmetric, idempotent and singular,
just like P_X.23 As for its rank, it can be proven that it equals n − k,
where n is the number of rows of X and k = rk(X).
A fundamental property this matrix enjoys is that every vector of the type
MX y is orthogonal to Sp (X), so it forms a 90° angle with any vector that can be
written as Xλ.24 These properties are very convenient in many cases; a notable
one is the possibility of rewriting the SSR as a quadratic form:25
\[
e'e = y' M_X' M_X\, y = y' M_X M_X\, y = y' M_X\, y,
\]
25 A quadratic form is an expression of the kind x′Ax, where the matrix A is usually symmetric. I sometimes use the metaphor of a sandwich and call x the “bread” and A the “cheese”.
where the second equality comes from symmetry and the third one from idem-
potency. By the way, the above expression could be further manipulated to reobtain
equation (1.11):
\[
y' M_X\, y = y'y - y' P_X\, y = y'y - y'X(X'X)^{-1}X'y.
\]
Example 1.4
Readers are invited to check (by hand or using a computer program of their
choice) that, with the matrices used in example 1.2, P_X equals
\[
P_X = \begin{bmatrix}
0.3 & 0.2 & 0.1 & 0.4 & 0 & 0 \\
0.2 & 0.175 & 0.15 & 0.225 & 0.125 & 0.125 \\
0.1 & 0.15 & 0.2 & 0.05 & 0.25 & 0.25 \\
0.4 & 0.225 & 0.05 & 0.575 & -0.125 & -0.125 \\
0 & 0.125 & 0.25 & -0.125 & 0.375 & 0.375 \\
0 & 0.125 & 0.25 & -0.125 & 0.375 & 0.375
\end{bmatrix}
\]
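If you’d rather let the computer do the checking, a hansl sketch along the following lines computes P_X and M_X for the data of example 1.2 and verifies the properties discussed above; because of floating-point arithmetic, the “zeros” will in fact be very small numbers.

    # example 1.4: P_X, M_X and their properties
    matrix y = {1; 3; 2; 3; 0; 1}
    matrix x = {4; 3; 2; 5; 1; 1}
    matrix X = ones(6,1) ~ x
    matrix P = X * inv(X'*X) * X'
    matrix M = I(6) - P
    matrix asym = P - P'        # symmetry: a matrix of zeros
    matrix idem = P*P - P       # idempotency: a matrix of zeros
    matrix MX_X = M*X           # M_X annihilates X: a matrix of zeros
    print P asym idem MX_X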
In the present context, the advantage of using projection matrices is that the
main quantities that appear in the statistical problem of approximating y i via xi
become easy to represent in a compact and intuitive way:
Magnitude                   Symbol   Formula
OLS coefficients            β̂        (X′X)⁻¹X′y
Fitted values               ŷ        P_X y
Residuals                   e        M_X y
Sum of squared residuals    SSR      e′e = y′M_X y
Take for example the special case X = ι. As we now know, the optimal solu-
tion to the statistical problem is using the sample average, so β̂ = Ȳ : the fitted
values are Pι y = ι · Ȳ and the residuals are simply the deviations from the mean:
e = Mι y = y − ι · Ȳ . Finally, deviance can be written as y′ Mι y.
Since ŷ and e are orthogonal, the following decomposition holds:26
\[
y'y = \hat{y}'\hat{y} + e'e, \qquad (1.15)
\]
and two inequalities follow: 0 ≤ ŷ′ŷ and ŷ′ŷ ≤ y′y. The
first one is rather obvious, considering that ŷ′ŷ is a sum of squares, and
therefore non-negative. The other one, instead, can be motivated via y′ PX y =
y′ y−y′ MX y = y′ y−e′ e; since e′ e is also a sum of squares, y′ PX y ≤ y′ y. If we divide
everything by y′ y, we get
\[
0 \le \frac{\hat{y}'\hat{y}}{y'y} = 1 - \frac{e'e}{y'y} = R_u^2 \le 1. \qquad (1.16)
\]
This index bears the name R u2 (“uncentred R-squared”), and, as the above
expression shows, it’s bounded by construction between 0 and 1. It can be given
a very intuitive geometric interpretation: evidently, in Rn the points 0, y and ŷ
form a right triangle (see also Figure 1.3), in which you get a “good” leg, that is ŷ,
and a “bad” one, the segment linking ŷ and y, which is congruent to e: we’d like
the bad leg to be as short as possible. After Pythagoras’ theorem, the R u2 index
gives us (the square of) the ratio between the good leg and the hypotenuse. Of
course, we’d like this ratio to be as close to 1 as possible.
Example 1.5
With the matrices used in example 1.2, you get that y′y = 24 and e′e = 4.325;
therefore,
\[
R_u^2 = 1 - \frac{4.325}{24} \simeq 81.98\%
\]
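The corresponding hansl check (same data as example 1.2):

    # example 1.5: uncentred R-squared
    matrix y = {1; 3; 2; 3; 0; 1}
    matrix x = {4; 3; 2; 5; 1; 1}
    matrix X = ones(6,1) ~ x
    matrix e = y - X * inv(X'*X) * X'*y
    matrix R2u = 1 - (e'*e) ./ (y'*y)    # expected: about 0.8198
    print R2u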
The R u2 index makes perfect sense geometrically, but hardly any from a sta-
tistical point of view: the quantity y′ y has a natural geometrical interpretation,
but statistically it doesn’t mean much, unless we give it the meaning
y′ y = (y − 0)′ (y − 0),
that is, the SSR for a model in which ŷ = 0. Such a model would be absolutely
minimal, but rather silly as a model. Instead, we might want to use as a bench-
mark our initial proposal described in section 1.2, where X = ι. In this case,
the SSR is just the deviance of y, that is the sum of squared deviations from the
mean, which can be written as y′ Mι y.
If ι ∈ Sp (X) (typically, when the model contains a constant term, but not
necessarily), then a decomposition similar to (1.15) is possible: since y = ŷ + e,
then obviously
y′ Mι y = ŷ′ Mι ŷ + e′ Mι e = ŷ′ Mι ŷ + e′ e (1.17)
0 ≤ e′ e ≤ y′ Mι y,
26 Subsection 1.A.8 should help the readers who want this result proven.
where the second inequality comes from the fact that ŷ′ Mι ŷ is a sum of squares
and therefore non-negative.27 The modified version of R 2 is known as centred
R-square:
\[
R^2 = 1 - \frac{e'e}{y' M_\iota\, y}. \qquad (1.18)
\]
The concept of R 2 that we normally use in econometrics is the centred one, and
this is why the index defined at equation (1.16) has the “u” as a subscript (from the
word uncentred).
In a way, the definition of R 2 is implicitly based on a comparison between
different models: one which uses all the information contained in X and another
(smaller) one, which only uses ι, because y′ Mι y is just the SSR of a model in
which we regress y on ι. Therefore, equation (1.18) can be read as a way to com-
pare the loss function for those two models.
In fact, this same idea can be pushed a little bit further: imagine that we
wanted to compare model A and model B, in which B contains the same ex-
planatory variables as A, plus some more. In practice:
Model A:  y ≃ Xβ
Model B:  y ≃ Xβ + Zγ = Wθ

where W = [X Z] and θ′ = [β′  γ′].
The matrix Z contains additional regressors to model A. It is important to
realise that the information contained in Z could be perfectly relevant and le-
gitimate, but also ridiculously useless. For example, a model for the academic
performance of A. S. Tudent could well contain, as an explanatory variable, the
number of pets A. S. Tudent’s neighbours have, or the number of consonants in
A. S. Tudent’s mother’s surname.
It’s easy to prove that the SSR for model B is always smaller than that for A:
since Sp(X) ⊂ Sp(W), it holds that
\[
M_W M_X = M_W,
\]
so the residuals of model B can be written as e_B = M_W y = M_W M_X y = M_W e_A; therefore
\[
SSR_B = e_A' M_W\, e_A = e_A'e_A - e_A' P_W\, e_A \le e_A'e_A = SSR_A.
\]
The implication is: if we had to choose between A and B by using the SSR
as a criterion, model B would always be the winner, no matter how absurd the
choice of the variables Z is. The R² index isn’t any better: proving that
\[
SSR_B \le SSR_A \;\Longrightarrow\; R^2_B \ge R^2_A
\]
is a trivial exercise. This is one of the reasons why a corrected version, known as the adjusted R-squared, is often reported:
\[
\bar{R}^2 = 1 - \frac{e'e}{y' M_\iota\, y} \cdot \frac{n-1}{n-k}, \qquad (1.19)
\]
where n is the size of our dataset and k is the number of explanatory variables. It
is easy to prove that if you add silly variables to a model, so that the SSR changes
only slightly, the n − k in the denominator should offset that effect. However,
as we will see in section 3.3.2, the best way of choosing between models is by
framing the decision in a proper inferential context.
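The point about silly regressors is easy to verify by simulation. In the hansl sketch below (all numbers are made up) a completely irrelevant variable is appended to the regressor matrix: the SSR never goes up, no matter how useless the addition is.

    # a useless regressor never increases the SSR
    set seed 1234
    scalar n = 50
    matrix x = mnormal(n, 1)
    matrix y = 1 + 0.5*x + mnormal(n, 1)     # made-up data
    matrix junk = mnormal(n, 1)              # pure noise, unrelated to y
    matrix XA = ones(n,1) ~ x
    matrix XB = XA ~ junk
    matrix eA = y - XA * inv(XA'*XA) * XA'*y
    matrix eB = y - XB * inv(XB'*XB) * XB'*y
    matrix SSRs = (eA'*eA) ~ (eB'*eB)        # SSR of model A and model B
    print SSRs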
One final thing on the R 2 index. Although it’s perfectly legitimate to think
that 0 is “bad” and 1 is “good”, it would be unwise to automatically consider a
number close to 0 (say, 10%) as “rather bad” or, symmetrically, a number close
to 1 (say, 90%) as “pretty good”: a model is an approximate description of the
dependent variable y i , insofar as the explanatory variables xi contain relevant
information. It may well be that the main determinants of y i are unobservable,
and therefore xi only manages to capture a small portion of the overall disper-
sion of y i . In these cases, the R 2 index will be very small, but it doesn’t neces-
sarily follow that our model is worthless: the relationship that it reveals between
the dependent variable and the explanatory variables may be extremely valu-
able, even if the fraction of variance we explain is small. But again, this idea is
more properly framed as a statistical inference issue, which is what chapter 3 is
about.
1.4.3 Reparametrisations
Suppose that there are two researchers (Alice and Bob), who have the same
dataset, which contains three variables: y i , x i and z i . Alice performs OLS on
the model
y i ≃ β1 x i + β2 z i
Bob, instead, computes the new variables s i = x i + z i and d i = x i − z i and com-
putes his coefficients using the transformed regressors as
y i ≃ γ1 s i + γ2 d i .
How different will the two models be? Before delving into algebra, it is worth
observing that Alice and Bob are using the same data, and it would be surprising
if they arrived at different conclusions. Moreover, Alice and Bob’s choices are
simply a matter of taste, and there’s no “right” way to set up a model. One could
compute s i and d i from x i and z i , or the other way around. In other words,
the set of explanatory variables Alice and Bob are using are invertible transfor-
mations of one another, and therefore must contain the same information, ex-
pressed in a different way.
With this in mind, a relationship between the two sets of parameters is easy
to find: start from Bob’s model
yi ≃ γ1 s i + γ2 d i =
= γ1 (x i + z i ) + γ2 (x i − z i ) =
= (γ1 + γ2 )x i + (γ1 − γ2 )z i ,
so β1 = (γ1 + γ2) and β2 = (γ1 − γ2). Clearly, this entails that Bob’s parameters can
be recovered from Alice’s as γ1 = (β1 + β2)/2 and γ2 = (β1 − β2)/2. It is perfectly legitimate
to surmise that the two models are in fact equivalent, and should give the same
fit.
More generally, it is possible to show that Alice’s model can be written as
y ≃ Xβ and Bob’s model as y ≃ Zγ , where Z = XA and A is square and invertible.
In the example above,
\[
A = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\]
This simple fact has a very nice consequence on the respective projection
matrices:
PZ = Z(Z′ Z)−1 Z′
= XA(A ′ X′ XA)−1 A ′ X′
= XA(A)−1 (X′ X)−1 (A ′ )−1 A ′ X′
= X(X′ X)−1 X′ = PX ,
that is, the two projection matrices are the same.29 Therefore, Sp (X) = Sp (Z): Al-
ice and Bob are projecting y onto the same space. It should be no surprise that
they will get the same fitted values ŷ and the same residuals e. As a further con-
sequence, all the quantities that depend on the projection will be the same, such
as the sum of squared residuals, the R 2 index and so on. As a matter of fact, Al-
ice’s and Bob’s models are just the same model written in a different way, by a
different representation choice, which uses different parameters. The relation-
ship between the two sets of parameters is easy to show: since ŷ is the same for
the two models, then it must also hold
ŷ = Zγ̂ = XA γ̂ = Xβ̂ .
29 If you find some of the passages above unclear, then Section 1.A.4 may be useful.
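A hansl sketch of the Alice-and-Bob story, with simulated data: the two coefficient vectors differ, but the fitted values are identical and β̂ = Aγ̂ holds.

    # Alice's and Bob's regressions give the same fit
    set seed 42
    matrix x = mnormal(30,1)
    matrix z = mnormal(30,1)
    matrix y = x + 2*z + mnormal(30,1)    # made-up data
    matrix XA = x ~ z                     # Alice's regressors
    matrix XB = (x + z) ~ (x - z)         # Bob's regressors: s and d
    matrix bA = inv(XA'*XA) * XA'*y       # Alice's beta-hat
    matrix bB = inv(XB'*XB) * XB'*y       # Bob's gamma-hat
    matrix A = {1, 1; 1, -1}
    matrix chk1 = bA - A*bB               # beta-hat = A * gamma-hat: zeros
    matrix chk2 = XA*bA - XB*bB           # identical fitted values: zeros
    print bA bB chk1 chk2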
It would seem that finding an analytical closed form for β̂1 and β̂2 as functions
of Z, W and y is quite difficult; fortunately, it isn’t so: start from
\[
y = \hat{y} + e = Z\hat{\beta}_1 + W\hat{\beta}_2 + e;
\]
premultiplying both sides by M_W, and noting that M_W W = [0] while M_W e = e (the residuals are orthogonal to Sp(W)), you get
\[
M_W\, y = M_W Z\, \hat{\beta}_1 + e,
\]
and premultiplying again by Z′ (remember that Z′e = 0) gives
\[
Z' M_W\, y = Z' M_W Z\, \hat{\beta}_1
\]
so that
\[
\hat{\beta}_1 = (Z' M_W Z)^{-1} Z' M_W\, y = \left[(M_W Z)'(M_W Z)\right]^{-1} (M_W Z)'(M_W y);
\]
30 In fact, many call this theorem the Frisch–Waugh–Lovell theorem, as it was Michael Lovell
who, in a paper that appeared in 1963, generalised the original result that Frisch and Waugh had
obtained 30 years earlier to its present form.
31 If you’re getting a bit confused, you may want to take a look at section 1.A.8.
therefore β̂1 is the vector of the coefficients for a model in which the dependent
variable is the vector of the residuals of y with respect to W and the regressor
matrix is the matrix of residuals of Z with respect to W. For symmetry reasons,
you also obviously get a corresponding expression for β̂2 :
\[
\hat{\beta}_2 = \left(W' M_Z\, W\right)^{-1} W' M_Z\, y
\]
In practice, the theorem suggests the following recipe for computing β̂1:

1. regress y on W and call ỹ the resulting vector of residuals;
2. regress each column of Z on W; form a matrix with the residuals and call it Z̃;
3. regress ỹ on Z̃: the coefficients you get are β̂1.
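The recipe is easy to check numerically. In the hansl sketch below (simulated data, one variable in Z and three in W) the first element of the full-regression coefficient vector coincides with the coefficient from the residual-on-residual regression:

    # a numerical check of the Frisch-Waugh result
    set seed 99
    matrix W = ones(40,1) ~ mnormal(40,2)     # the "other" regressors
    matrix Z = mnormal(40,1)                  # the regressor of interest
    matrix c = {1; 0.5; -0.5}
    matrix y = Z + W*c + mnormal(40,1)        # made-up data
    matrix X = Z ~ W
    matrix b = inv(X'*X) * X'*y               # full regression; b[1] is beta1-hat
    matrix MW = I(40) - W * inv(W'*W) * W'
    matrix ytilde = MW*y                      # step 1
    matrix Ztilde = MW*Z                      # step 2
    matrix b1 = inv(Ztilde'*Ztilde) * Ztilde'*ytilde   # step 3
    print b b1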
1.5 An example
For this example, I got some data from the 2016 SHIW dataset;32 our dataset
contains 1917 individuals, who are full-time employees.33 We are going to use
four variables, briefly described in Table 1.1. Our dependent variable is going to
be w, the natural logarithm of the hourly wage in Euro. The set of explanatory
32 SHIW is the acronym for “Survey on Household Income and Wealth”, provided by the Bank of Italy.
variables was chosen in accordance with some vague and commonsense idea
of the factors that can account for differences in wages. We would expect that
people with higher education and/or longer work experience should command
a higher wage, but we would also use the information on gender, because we are
aware of an effect called “gender gap”, that we might want to take into account.
but it is much more common to see results presented in a table like Table 1.2.
At this point, there are quite a few numbers in the table above that we don’t
know how to read yet, but we have time for this: chapter 3 is devoted entirely to
this purpose. The important thing for now is that we have a reasonably efficient
way to summarise the information on wages via the following model:
\[
\hat{w}_i = \hat{\beta}_1 + \hat{\beta}_2\, g_i + \hat{\beta}_3\, \mathrm{educ}_i + \hat{\beta}_4\, \mathrm{exper}_i
\]
where w i is the log wage for individual i , g i is their gender, and the rest follows.
In practice, if we had a guy who studied for 13 years and has worked for 20
years, we could plug those values into the estimated coefficients and get a guess of the
log of his hourly wage. As for the overall fit, the R² of the model is about 0.298, that is
\[
0.298 = 1 - \frac{e'e}{y' M_\iota\, y} \;\Longrightarrow\; e'e = 0.702 \cdot y' M_\iota\, y;
\]
if you consider the dazzling complexity of the factors that potentially dictate why
two individuals get different wages, the fact that a simple linear rule involving
only three variables manages to describe 30% of the heterogeneity between in-
dividual is surprisingly good.
Of course, nothing is stopping us from interpreting the sign and magnitude
of our OLS coefficients: for example, the coefficient for education is about 5%,
and therefore the best way to use the educational attainment variable for sum-
marising the data we have on wages is by saying that each year of extra edu-
cation gives you a guess which is about 5% higher.34 Does this imply that you
get positive returns to education in the Italian labour market? Strictly speaking,
it doesn’t. This number yields a fairly decent approximation to our dataset of
1917 people. To assume that the same regularity should hold for others is totally
unwarranted. And the same goes for the gender gap: it would seem that being
male shifts your fitted wage by 17.5%. But again, at the risk of being pedantic,
all we can say is that among our 1917 data points, males get (on average) more
34 One of the reasons why we economists love logarithms is that they auto-magically turn absolute differences into relative (percentage) ones.
money than females with the same level of experience and education. Coinci-
dence? We should be wary of generalisations, however tempting they may be to
our sociologist self.
And yet, these thoughts are perfectly natural. The key ingredient to give sci-
entific legitimacy to this sort of mental process is to frame it in the context of
statistical inference, which is the object of the next chapter.
1.A Assorted results

1.A.1 Matrix differentiation

Start from a linear function of the vector x, f(x) = a′x = Σ a_i x_i;
evidently, the partial derivative of f(x) with respect to x_i is just a_i; by stacking all
the partial derivatives into a vector, the result is just the vector a, and therefore
\[
\frac{d}{dx}\, a'x = a'
\]
note that the familiar rule d(ax)/dx = a is just a special case when a and x are
scalars.
As for the quadratic form
\[
f(x) = x'Ax = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\, x_i x_j,
\]
a similar argument shows that
\[
\frac{d}{dx}\, x'Ax = x'(A + A')
\]
and of course if A is symmetric (as in most cases), then d(x′Ax)/dx = 2·x′A. Again,
note that the scalar case d(ax²)/dx = 2ax is easy to spot as a special case.
One last thing: the convention by which differentiation expands “by row”
turns out to be very useful because it makes the chain rule for the derivatives
“just work” automatically. For example, suppose you have y = Ax and z = B y;
of course, if you need the derivative of z with respect to x you may proceed by
defining C = B · A and observing that
\[
z = B(Ax) = Cx \;\Longrightarrow\; \frac{\partial z}{\partial x} = C
\]
but you may also get the same result via the chain rule, as
\[
\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\, \frac{\partial y}{\partial x} = B \cdot A = C.
\]
1.A.2 Vector spaces and linear combinations

Take k vectors x_1, x_2, . . . , x_k, each with n elements, and consider the linear combination
\[
z = \sum_{j=1}^{k} \lambda_j x_j;
\]
note that the above could have been written more compactly in matrix notation
as z = Xλ, where X is a matrix whose columns are the vectors x_j and λ is a k-element
vector.
The result is, of course, an n-element vector, that is a point in Rn . But the
k vectors x1 , . . . , xk are also a cloud of k points in Rn ; so we may ask ourselves if
there is any kind of geometrical relationship between z and x1 , x2 , . . . , xk .
Begin by considering the special case k = 1. Here z is just a multiple of x1 ;
longer, if |λ1 | > 1, shorter otherwise; mirrored across the origin if λ1 < 0, in the
same quadrant otherwise. Easy, boring. Note that, if you consider the set of all
the vectors z you can obtain by all possible choices for λ1 , you get a straight line
going through the origin, and of course x1 ; this set of points is called the space
spanned, or generated by x1 ; or, in symbols, Sp (x1 ). It’s important to note that
this won’t work if x1 = 0: in this case, Sp (x1 ) is not a straight line, but rather a
point (the origin).
If you have two vectors, instead, the standard case occurs when they are not
aligned with respect to the origin. In this case, Sp (x1 , x2 ) is a plane and z = λ1 x1 +
λ2 x2 is a point somewhere on that plane. Its exact location depends on λ1 and
λ2 , but note that
• no matter how you choose λ1 and λ2 , you can’t end up outside the plane.
1.A.3 The rank of a matrix

The rank function has a few useful properties:36

1. 0 ≤ rk(X) ≤ k;
2. rk(X) = rk(X′);
3. 0 ≤ rk(X) ≤ min(k, n) (by putting together the previous two); but if rk(X) =
min(k, n), and the rank hits its maximal value, the matrix is said to have
“full rank”;
We can use the rank function to measure the dimension of the space spanned
by X. For example, if rk (X) = 1, then Sp (X) is a line, if rk (X) = 2, then Sp (X) is a
plane, and so on. This number may be smaller than the number of columns of
X.
35 The usual definition is that x_1, . . . , x_k are linearly independent if no linear combination
Σ_{j=1}^{k} λ_j x_j is zero unless all the λ_j are zero. The reader is invited to check that the two definitions are equivalent.
36 I’m not proving them for the sake of brevity: if you’re curious, have a look at https://en.wikipedia.org/wiki/Rank_(linear_algebra).
A result we will not use very much (only in chapter 6), but is quite useful to
know in more advanced settings is that, if you have a matrix A with n rows, k
columns and rank r , it is always possible to write it as
\[
A = UV'
\]
where U is n × r and V is k × r. For example, the matrix
\[
A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}
\]
can be written as
\[
A = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \end{bmatrix}
\]
where
\[
U = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \text{and} \qquad V = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\]
Note that such decomposition is not unique: there are infinitely many pairs
of matrices that satisfy the decomposition above. The example above would
have worked just as well with
\[
P = \begin{bmatrix} -10 \\ 0 \\ 0 \end{bmatrix} \qquad \text{and} \qquad Q = \begin{bmatrix} -0.1 \\ 0 \end{bmatrix}.
\]
1.A.4 Matrix inversion

Consider the transformation of a vector x into another vector y via multiplication by a square matrix A:
y = Ax;
37 You almost never need to compute a determinant by hand, so I’ll spare you its definition. If
The world of matrix algebra is populated with results that appear unintuitive when you’re used to the algebra of scalars. A notable one is: any matrix A (even singular ones; even non-square ones) admits a matrix B such that ABA = A and BAB = B; B is called the “Moore-Penrose” pseudo-inverse, or “generalised” inverse. For example, the matrix
\[
A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
\]
is singular, and therefore has no inverse. However, it’s got a pseudo-inverse, which is A itself. In fact, all projection matrices are their own pseudo-inverses.
Roger Penrose has been awarded the 2020 Nobel prize for physics. Not for the generalised inverse, but you get the idea of how brilliant the guy is.
There are many nice properties that invertible matrices enjoy. For example:
• the transpose of the inverse is the inverse of the transpose ((A ′ )−1 = (A −1 )′ );
• if a matrix is positive definite (see section 1.A.7), then its inverse is positive
definite too;
the reason why we have to transpose the second element of the sum in the equa-
tion above is conformability: you can’t sum a row vector and a column vector.
Therefore, since e(β ) is defined as e(β ) = y − Xβ , we have
\[
\frac{\partial e(\beta)}{\partial \beta} = -X
\]
and the necessary condition for minimisation is ∂L(β)/∂β = −2X′e = 0, which of
course implies equation 1.9.
For ϵ > 0, the rank of X is, clearly, 2; nevertheless, if ϵ is a very small number, a
computer program39 goes berserk; technically, this situation is known as quasi-
collinearity. To give you an example, I used gretl to compute (X′ X)−1 (X′ X) for
decreasing values of ϵ; Table 1.3 contains the results. Ideally, the right-hand side
column in the table should only contain identity matrices. Instead, results are
quite disappointing for ϵ = 1e − 05 or smaller. Note that this is not a problem
specific to gretl (which internally uses the very high quality LAPACK routines),
but a consequence of finite precision of digital computers.
This particular example is easy to follow, because X is a small matrix. But if
that matrix had contained hundreds or thousands of rows, things wouldn’t have
been so obvious.
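For the record, here is a sketch of the kind of experiment described in the text. The exact X behind Table 1.3 is not reproduced here, so I just build an arbitrary nearly-collinear 3 × 2 matrix whose columns differ by ε:

# sketch: inv(X'X)*(X'X) should be the 2x2 identity matrix, but degrades
# as eps shrinks (the choice of X is mine, for illustration only)
set verbose off
loop i = 1..4
    scalar eps = 10^(-2*i - 1)        # 1e-3, 1e-5, 1e-7, 1e-9
    matrix X = ones(3, 2)
    X[2,2] = 1 + eps
    X[3,2] = 1 - eps
    matrix check = inv(X'*X) * (X'*X)
    printf "eps = %g\n", eps
    print check
endloop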
There are a few, but no statistical package belongs to this category, for very good reasons.
40 In fact, figure 1.4 contains a slight inaccuracy. Finding it is left as an exercise to the reader.
[Figure: diagram of how positive/negative definite, positive/negative semi-definite and invertible matrices relate to each other]
If a matrix H exists such that B = HH′, then B is psd.41 This, for example, gives you a quick way to prove that I is pd and PX is psd.
y ∈ Sp (X) ⇐⇒ y ∈ Sp (W)
Let’s now consider the case when A is a matrix with rank less than k (for
example, a column vector). Evidently, any linear combination of the columns of
W is also a linear combination of the columns of X, and therefore each column
of W is an element of Sp (X). As a consequence, any vector that belongs to Sp (W)
also belongs to Sp (X).
41 Easy to prove, too. Try.
The converse is not true, however: some elements of Sp (X) do not belong
to Sp (W) (allow me to skip the proof ). In short, Sp (W) is a subset of Sp (X); in
formulae, Sp (W) ⊂ Sp (X).
A typical example occurs when W contains some of the columns of X, but
not all. Let’s say, without loss of generality, that W contains the leftmost k − p
columns of X. In this case, the matrix A can be written as

A = [I; 0]

where the identity matrix above has k − p rows and columns, and the zero matrix below has p rows and k − p columns.
        PW        MW        PX        MX
PW      PW        0         PW        0
MW      0         MW        PX − PW   MX
PX      PW        PX − PW   PX        0
MX      0         MX        0         MX

Important: it is assumed that Sp (W) ⊂ Sp (X). All products commute.
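A quick numerical check of the table, with W made up of a subset of the columns of an arbitrary X (sizes and seed are my choice):

# sketch: build PX, PW, MX, MW and verify a couple of entries of the table
set seed 1234
matrix X = mnormal(20, 3)
matrix W = X[, 1:2]                              # Sp(W) is a subspace of Sp(X)
matrix PX = X * inv(X'*X) * X'
matrix PW = W * inv(W'*W) * W'
matrix MX = I(20) - PX
matrix MW = I(20) - PW
scalar d1 = maxc(vec(abs(PW*PX - PW)))           # should be (numerically) zero
scalar d2 = maxc(vec(abs(MW*PX - (PX - PW))))    # ditto
printf "max |PW*PX - PW| = %g\n", d1
printf "max |MW*PX - (PX - PW)| = %g\n", d2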
Even if you could establish statement 1 beyond any doubt, statement 2 is ba-
sically an act of faith. You may believe in it, but there is no rational argument one
could convincingly use to support it. And yet, we routinely act on the premise of
statement 2. Hume considered our natural tendency to rely on it as a biological
feature of the human mind. And it’s a good thing: if we didn’t have this fun-
damental psychological trait, we’d be unable to learn anything at all;2 the only
problem is, it’s logically unfounded.
Statistical inference is a way to make an inductive argument more rigorous by replacing statement number 2 with some assumptions that translate our tendency to generalise into formal statements, introducing uncertainty into the picture. Uncertainty is the concept we use to handle situations in which our
knowledge is partial. So for example we cannot predict which number will show
up when we roll a die, although in principle it would be perfectly predictable,
given initial conditions, using the laws of physics. We simply don’t have the re-
sources to perform such a monster computation, so we represent our imperfect
knowledge through the language of probability, or, more correctly, via a proba-
bilistic model; then, we assume that the same model will keep being valid in the
future. Therefore, if we rolled a die 10 times and obtained something like
x = [1, 5, 4, 6, 3, 3, 2, 6, 3, 4]
we would act on the assumption that if we keep rolling the same die we will
observe something that, in our eyes, looks “just as random” as x. To put it differ-
ently, our aim will not be to predict exactly which side the die will land on, but
rather to make statements on how surprising or unsurprising certain outcomes
will be.
Therefore, we use the idea of a Data Generating Process, or DGP. We assume
that the DGP is the mechanism that Nature (or any divinity of your choice) has
used to produce the data we observe, and will continue doing so for the data
we have not observed yet. By describing the DGP via a mathematical structure
(usually, but not necessarily, via a probability distribution), we try to come up
with statistics T (x) whose aim is not so much to describe the available data x as to describe the DGP that generated x, and therefore, to provide us
with some insight that goes beyond merely descriptive statistics.
Of course, in order to accomplish such an ambitious task, we need a set of
tools to represent imperfect knowledge in a mathematical way. This is why we
need probability theory.
2 To be fair, this only applies to what Immanuel Kant called “synthetic” propositions. But maybe
states of the world that our mind can conceive as possible.
and so forth. Event A can be defined as the subset of Ω including all the states
ω in which a statement A is true, and only those. P (A) is the measure of A,
where the technical word “measure” is a generalisation of our intuitive notion of
“extension” (length, area, volume).4 The familiar laws of probability are simple
consequences of the way usual set operations (complement, union, intersec-
tion) work; let’s waste no time on those.5
Random variables are a convenient way to map events to segments on the
real line. That is, a random variable X is defined as a measurable function from
3 The interested reader might want to have a look at Freedman and Stark (2016), section 2. You
measurable”. Providing a simple example is difficult, this is deep measure theory: google “Vitali
set” if you’re curious.
5 In most cases, intuition will suffice. For tricky cases, I should explain what a σ-algebra is, but
I don’t think that this is the right place for this, really.
Ω to R; or, to put it differently, for any ω in Ω you get a corresponding real num-
ber X (ω). The requirement of measurability is necessary to avoid paradoxical cases, and simply amounts to requiring that, if we define A as the subset of Ω such
that
a < X (ω) ≤ b ⇐⇒ ω ∈ A,
then A is a proper event. In practice, it must be possible to define P (a < X ≤ b)
for any a and b. I will sometimes adopt the convention of using the acronym
“rv” for random variables.
There are two objects that a random variable comes equipped with: the first
is its support, which is the subset of R with all the values that X can take; in
formulae, X : Ω 7→ S ⊆ R, and the set S is sometimes indicated as S(X ). For a
six-sided die, S(X ) = {1, 2, 3, 4, 5, 6}; if X is the time before my car breaks down,
then S(X ) = [0, ∞), and so on.
The other one is its distribution function, or cumulative distribution func-
tion (often abbreviated as cdf), defined as
F X (a) = P (X ≤ a),
• lima→−∞ F X (a) = 0;
• lima→∞ F X (a) = 1;
Apart from this, there’s very little that can be said in general. However, in many
cases it is assumed that F X (a) has a known functional form, which depends on
a vector of parameters θ.
Two special cases are of interest:
1. The cdf is a function that goes up in steps; the support is a countable set,
and the corresponding rv is said to be discrete; for every member of the
support x it is possible to define p(x) = P (X = x) > 0; the function p(x) is
the so-called probability function.
2. The cdf is a continuous and differentiable function; the corresponding rv is said to be continuous, and in this case there is a function fX(x), called the density function, such that P (a < X ≤ b) = ∫_{a}^{b} fX(x) dx; in most cases, when the meaning is clear from the context, we just write the density function for X as f (x).6
In the rest of the book, I will mostly use continuous random variables for exam-
ples; hopefully, generalisations to discrete rvs should be straightforward.
Of course, you can collect a bunch of random variables into a vector, so you
have a multivariate random variable, or random vector. The multivariate ex-
tension of the concepts I sketched above is a little tricky from a technical view-
point, but for our present needs intuition will again suffice. I will only mention
that for a multivariate random variable x with k elements you have that
F x (a) = P [(x 1 ≤ a 1 ) ∩ (x 2 ≤ a 2 ) ∩ . . . ∩ (x k ≤ a k )]
If all the k elements of x are continuous random variables, then you can define
the joint density as
fx(z) = ∂^k Fx(z) / (∂z1 ∂z2 · · · ∂zk).
The marginal density of the i -th element of x is just the density of x i taken in iso-
lation. For example, suppose you have a trivariate random vector w = [X , Y , Z ]:
the marginal density for Y is
fY(a) = ∫_{S(X)} ∫_{S(Z)} fw(x, a, z) dz dx
P (A|B) ≡ P (A ∩ B) / P (B)   (2.1)
6 I imagine the reader doesn’t need reminding that the density at x is not the probability that
X = x; for continuous random variables, the probability is only defined for intervals by the for-
mula in the text, from which it follows that P (X = x) is 0.
7 When speaking about sets, I use the bar to indicate the complement.
8 Technically, it's more complicated than this, because P (B ) may be 0, in which case the defini-
tion has to be adapted and becomes more technical. If you’re interested, chapter 10 in Davidson
(1994) is absolutely splendid.
You read the left-hand side of this definition as “the probability of A given B ”,
which is of course what we call conditional probability. It should be clear that

P (A|B) = P (A) ⇐⇒ P (A ∩ B) = P (A) · P (B).

Which means: "if you don't need to revise your evaluation of A after having received some message about B, then A and B have nothing to do with each other"; in this situation, A and B are said to be independent, and we write A ⊥⊥ B,
so independence can be thought of as lack of mutual information. Note that in-
dependence is a symmetric concept: if A is independent of B , then B is inde-
pendent of A, and vice versa.
Equation (2.1) has the following implication:

P (A ∩ B) = P (A|B) · P (B) = P (B|A) · P (A),

so that

P (A|B) = P (B|A) · P (A) / P (B).

The expression above is interesting for many reasons. One is: in general, P (A|B) ≠ P (B|A), so, for example, the probability of dying from COVID if you're not vaccinated is not the same thing as the probability that someone who died from COVID was a no-vaxxer (think about it). Another reason is that this expression is the cornerstone of an approach to statistics known as Bayesian, after the English statistician and clergyman Reverend Thomas Bayes, who lived in the 18th century; I'm not going to use anything Bayesian in this book, but Bayesian methods are getting increasingly popular in many areas of econometrics.
for any a and b, then evidently X carries no information about Y , and we say
that the two random variables are independent: Y ⊥ ⊥ X . If this is not the case,
it makes sense to consider the conditional distribution of Y on X , which de-
scribes our uncertainty about Y once we have information about X . So for ex-
ample, if Y is the yearly expenditure on food by a household and X is the number
of its components, it seems safe to say that F (Y |X > 6) should be different from
F (Y |X < 3), because more people eat more food.
The case a = b is important9 , because it gives us a tool for evaluating proba-
bilities about Y in a situation when X is not uncertain at all, because in fact we
observe its realisation X = x. In this case, we can define the conditional density
as
fY|X=x(z) = fY,X(z, x) / fX(x)   (2.2)
and when what we mean is clear from the context, we simply write f (y|x).
Therefore, in many cases we will use the intuitive notion of X , the set of
random variables we are conditioning Y on, as being “the relevant information
9 Albeit special: a moment’s reflection is enough to convince the reader that if X is continuous,
the event X = x has probability 0, and our naïve definition of conditioning breaks down. But
again, treating the subject rigorously implies using measure theory, σ-algebras and other tools
that I’m not willing to use in this book.
about Y that we have”; in certain contexts, this idea is expressed by the notion
of an information set. However, a formalised description of this idea is, again,
far beyond the scope of this book and I am content to leave this to the reader's
intuition.
2.2.3 Expectation
The expectation of a random variable is a tremendously important concept.
A rigorous definition, valid in all cases, would require a technical tool called
Lebesgue integral, that I’d rather avoid introducing. Luckily, in the two elemen-
tary special cases listed in section 2.2.1, its definition is quite simple:
E [X ] = Σ_{x∈S(X)} x · p(x)   for discrete rvs   (2.3)
E [X ] = ∫_{S(X)} z · fX(z) dz   for continuous rvs.   (2.4)
More generally, the expectation of a transformation of X can be computed as E [g(X)] = ∫_{S(X)} g(z) fX(z) dz for continuous random variables, and the parallel definition for the discrete case is obvious. The extension to multivariate rvs should also be straightforward: the
expectation of a vector is the vector of expectations.
Some care must be taken, since E [X ] may not exist, even in apparently harm-
less cases.
Example 2.1
If X is a uniform continuous random variable between 0 and 1, its density func-
tion is f (x) = 1 for 0 < x ≤ 1. Its expectation is easy to find as
E [X ] = ∫_0^1 x · 1 dx = [x²/2]_0^1 = 1/2;

however, it's not difficult to prove that E [1/X ] does not exist (the corresponding integral diverges):

E [1/X ] = ∫_0^1 (1/x) · 1 dx = [log x]_0^1 = ∞.
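A small Monte Carlo sketch of the example (sample sizes and seed are arbitrary choices of mine): the sample mean of X settles around 0.5, while the sample mean of 1/X keeps drifting as n grows, which is what you'd expect when the expectation does not exist.

# sketch: sample means of X and 1/X for X uniform on (0,1)
set verbose off
set seed 42
matrix u = muniform(1000000, 1)
loop i = 2..6
    scalar n = 10^i
    scalar m1 = meanc(u[1:n])              # should approach 0.5
    scalar m2 = meanc(u[1:n].^(-1))        # has no limit to approach
    printf "n = %7g   mean(X) = %6.4f   mean(1/X) = %10.2f\n", n, m1, m2
endloop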
A useful consequence of the definition: if the support of X is an interval (a, b) and a and b are finite, then a < E [X ] < b. The proof is easy and left to the reader as an exercise.
The expectation operator E [·] is linear, and therefore we have the following simple rule for affine transforms (A and b must be non-stochastic):

E [Ax + b] = A E [x] + b.
n ≥ 1) may not exist, but if it does then E [X^(n−1)] is guaranteed to exist too.
the most widely used indicator of dispersion. Of course, in order to exist, the sec-
ond moment of X must exist. Its multivariate generalisation is the covariance
matrix, defined as
Cov [x] = E [xx′] − E [x] E [x]′;   (2.6)
The properties of Cov [x] should be well known, but let’s briefly mention the
most important ones: if Σ = Cov [x], then
• Σ is symmetric;
• Σ is positive semi-definite.12
Definition 2.6 makes it quite easy to calculate the covariance matrix of an
affine transform:13
Cov [Ax + b] = A · Cov [x] · A ′ . (2.7)
Note that this result makes it quite easy to prove that if X and Y are independent
rvs, then V [X + Y ] = V [X ] + V [Y ] (hint: put X and Y into a vector and observe
that its covariance matrix is diagonal).
on the whole support of X : it’s called Jensen’s lemma. We will not use this result in this book, but
the result is widely used in economics and econometrics; if you’re interested, the idea is briefly
explained in section 2.A.1.
11 An alternative equivalent definition, perhaps more common, is V [X ] = E [(X − E [X ])²].
12 If you’re wondering what “semi-definite” means, you may want to go back to section 1.A.7.
13 The proof is an easy exercise, left to the reader.
If fY|X=x(z) changes with x, the result of the integral (if it exists) should change with x too, so we may see E [y|x] = m(x) as a function of x. This function is sometimes called the regression function of Y on X.

Does E [y|x] have a closed functional form? Not necessarily, but if it does, it hopefully depends on a small number of parameters θ.
Example 2.2
Assume that you have a bivariate variable (Y , X ) where Y is 1 if an individual
catches COVID and 0 otherwise, and X is 1 if the same individual is vaccinated.
Suppose that the joint probability is
X =0 X =1
Y =0 0.1 0.3
Y =1 0.3 0.3
The probability of catching COVID among vaccinated people is 0.3/(0.3 + 0.3) = 50%, while for unvaccinated people it's 0.3/(0.1 + 0.3) = 75%. The same statement could have been stated in formulae as

E [Y |X ] = 0.75 − 0.25X ,
To continue with example 2.2, note that, since E [X ] = 0.6, E [Y ] = E [0.75 − 0.25 · X ] = 0.75 − 0.25 · E [X ] = 0.75 − 0.25 × 0.6 = 0.6.
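The numbers above are easy to reproduce from the joint probability table with a few lines of gretl (the matrix layout and names are mine):

# sketch: conditional probabilities and E[Y] from the joint table of Example 2.2
matrix P = {0.1, 0.3; 0.3, 0.3}    # rows: Y=0, Y=1; columns: X=0, X=1
matrix px = sumc(P)                # marginal of X: [0.4, 0.6]
scalar p1 = P[2,2] / px[2]         # P(Y=1 | X=1) = 0.5
scalar p0 = P[2,1] / px[1]         # P(Y=1 | X=0) = 0.75
scalar EY = P[2,1] + P[2,2]        # E[Y] = P(Y=1) = 0.6
printf "P(Y=1|X=1) = %g   P(Y=1|X=0) = %g   E[Y] = %g\n", p1, p0, EY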
Example 2.3
As a more elaborate example, suppose that
E [Y |X ] = m(X ) = 4X − 0.5X ²

where I used V [X ] = E [X ²] − E [X ]².
2.3 Estimation
The best way to define the concept of an estimator is to assume that we observe
some data x, and that the DGP which generated x can be described by means
of a vector of parameters θ. We assume to know nothing about θ, apart from
the fact that it can be thought of as a vector with a certain number of elements
(say, k), and that it belongs to a subset S of Rk called the parameter space. An
estimator is a statistic θ̂ = T (x) that should be “likely” to yield a value “close to”
the parameters of interest θ.
To state the same idea more formally: since x is random and θ̂ is a function
of x, then θ̂ is a random variable too, and therefore it must have a support and
a distribution function. Clearly, both will depend on those of x, but ideally, we’d
like to choose the function T (·) so that the support S(θ̂) contains at least a neigh-
bourhood of θ, and we’d like the probability of observing a realisation of θ̂ that
is “near” θ, P (θ − ϵ < θ̂ < θ + ϵ) = P (|θ̂ − θ| < ϵ), to be as close to 1 as possible.
The indispensable ingredient for evaluating those probabilities would be the
distibution of θ̂ = T (x). However, it is almost always tremendously difficult to
pin it down exactly, either because of the characteristics of x, which could be a
very complex random variable, or because the function T (·) could be very intri-
cate. In fact, the cases when we’re able to work out the exact distibution of θ̂ are
exceptionally few. In very simple cases,14 we may be able to compute E [θ̂] and V [θ̂], which form the basis of the traditional notions of unbiasedness and efficiency:

• the bias of θ̂ is the difference E [θ̂] − θ; therefore θ̂ is said to be unbiased if E [θ̂] = θ;
The problem that makes these concepts not very useful is that, in many cases of
interest, it’s very hard, if not impossible, to compute the moments of θ̂ (in some
cases, θ̂ may even possess no moments at all). So we need to use something else.
Fortunately, asymptotic theory comes to the rescue.
2.3.1 Consistency
The estimator θ̂ is consistent if its probability limit is the parameter we want to
estimate. To explain what this means, let us first define convergence in proba-
14 Notably, when θ̂ is an affine function of x.
bility:

Xn −→p X ⇐⇒ lim_{n→∞} P [|Xn − X | < ϵ] = 1   (2.9)
The simplest version of the LLN is due to the Soviet mathematician Alek-
sandr Khinchin, and sets very strong bounds on the first two conditions and
15 Where ϵ > 0 is the mathematically respectable way of saying “more or less”.
16 The curious reader might be interested in knowing that there are several other ways to define
a similar concept. A particularly intriguing one is the so-called “almost sure” convergence.
17 Technically, these are the weak LLNs. The strong version uses a different concept of limit.
Example 2.4
Let’s toss a coin n times. The random variable representing the i -th toss is x i ,
which is assumed to obey the following probability distribution (often referred
to as a Bernoulli distribution):
xi = 1 with probability π, and xi = 0 with probability 1 − π
Note that the probability π is assumed to be the same for all x i ; that is, the coin
we toss does not change its physical properties during the experiment. More-
over, it is safe to assume that what happens at the i -th toss has no consequences
on all the other ones. In short, the x i random variables are iid.
Does xi have a mean? Yes: E [xi ] = 1 · π + 0 · (1 − π) = π. Together with the iid property, this is enough for invoking the LLN and establishing that X̄ = p̂ −→p π.
Therefore, we can take the empirical frequency p̂ as a consistent estimator of
the true probability π.
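A quick simulation sketch of the example (π = 0.5; the sample sizes and the seed are arbitrary): the relative frequency p̂ gets closer and closer to π as n grows.

# sketch: the LLN at work on Bernoulli(0.5) draws
set verbose off
set seed 101
loop i = 1..4
    scalar n = 10^(i+1)                   # 100, 1000, 10000, 100000
    matrix x = (muniform(n, 1) .< 0.5)    # Bernoulli(0.5) draws
    scalar phat = meanc(x)
    printf "n = %7g   phat = %6.4f\n", n, phat
endloop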
The LLN becomes enormously powerful when coupled with another won-
derful result, which is a special case of a powerful tool called Slutsky’s Theorem,
that I'm not presenting in full here. If Xn −→p a and g (·) is continuous at a, then g (Xn ) −→p g (a) (note how much easier this property makes it to work with prob-
ability limits rather than expectations).
In the context of estimation, obviously we will want our estimators to be con-
sistent:
θ̂ −→p θ ⇐⇒ lim_{n→∞} P [|θ̂ − θ| < ϵ] = 1;   (2.10)
that is, we will want to use as estimators statistics that become increasingly un-
likely to be grossly wrong. Fortunately, the combination of the LLN and Slutsky’s
Theorem provides a very nice way to devise estimators that are consistent by
construction. If the average has a probability limit that is a continuous, invert-
ible function of the parameter we want, we just apply a suitable transformation
the the average and we’re done: so for example ifpE [x i ] = 1/θ, then θ̂ = 1/ X̄ ; if
E [x i ] = e θ , then θ̂ = log( X̄ ); if E [x i ] = θ 2 , then θ̂ = X̄ ; and so on.
More generally, the extension to the case when θ is a vector is technically
messier, but conceptually identical. This is known as the method of moments:
it is by no means the only one used in inferential statistics, but it will suffice for
our purposes. The core intuition that motivates it is relatively straightforward:
2. Estimate m via the corresponding sample moments m̂, using the LLN, so
p
that m̂ −→ m.
Example 2.5
Suppose you have a sample of iid random variables for which you know that
E [X ] = p/α
E [X ²] = p(p + 1)/α²;
and define the two statistics m1 = X̄ = n⁻¹ Σ_i xi and m2 = n⁻¹ Σ_i xi². Clearly

m1 −→p p/α
m2 −→p p(p + 1)/α².
Now consider the statistic p̂ = m1²/(m2 − m1²). Since p̂ is a continuous function of both m1 and m2,

p̂ = m1²/(m2 − m1²) −→p (p²/α²) / [p(p + 1)/α² − p²/α²] = p²/(p² + p − p²) = p,

so p̂ is a consistent estimator of p.
But then, by the same token, by dividing p̂ by m1 you get that

p̂/m1 = m1/(m2 − m1²) −→p p/(p/α) = α,

so you get a second statistic, α̂ = m1/(m2 − m1²), which estimates α consistently.
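Just to see the machinery at work, here is a small simulation sketch: a χ²₃ random variable belongs to the family above with p = 1.5 and α = 0.5, so p̂ and α̂ should settle near those values (the way the draws are generated below is my choice, not part of the example).

# sketch: method-of-moments estimates from simulated chi-square(3) data
set verbose off
set seed 7
scalar n = 100000
matrix x = sumr(mnormal(n, 3).^2)    # chi-square(3) = sum of 3 squared N(0,1)
scalar m1 = meanc(x)
scalar m2 = meanc(x.^2)
scalar phat = m1^2 / (m2 - m1^2)
scalar ahat = m1 / (m2 - m1^2)
printf "phat = %g (should be near 1.5)   ahat = %g (near 0.5)\n", phat, ahat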
P̂n = P [|θ̂ − θ| < ϵ]
P̃n = P [|θ̃ − θ| < ϵ]
clearly lim_{n→∞} P̂n = lim_{n→∞} P̃n = 1, so a decision can't be made on these grounds. Nevertheless, if we could establish that, for n large enough, P̂n > P̃n, so that our probability of being grossly wrong is lower if we use θ̂ instead of θ̃, our preferred course of action would be obvious. Unfortunately, this is not an easy check: P̂n is defined as

P̂n = ∫_{θ−ϵ}^{θ+ϵ} f̂(x) dx,

where f̂(x) is the density function for θ̂ (clearly, a parallel definition holds for P̃n). In most cases, the analytical form of f̂(x) is very hard to establish, if not impossible altogether. However, we could try to approximate the actual densities with
something good enough to perform the required check. This is almost invari-
ably achieved by resorting to a property called asymptotic normality, by which
the unknown density fˆ(x) can be approximated via a suitably chosen Gaussian
density.18
At first sight, this sounds like a very ambitious task: how can we hope to
make general statements on the distribution of an arbitrary function of arbitrar-
ily distributed random variables? Besides, why the Gaussian density, rather than
something else? What’s so special about the bell-shaped curve?
And yet, there is a result that applies in a surprisingly large number of cases,
and goes under the name of Central Limit Theorem, or CLT for short. Basically,
the CLT says that, under appropriate conditions, when you observe a random
variable X that can be conceivably thought of as the accumulation of a large
number of random causes that are reasonably independent of each other, with
none of them dominating the others in magnitude, there are very good chances
that the distribution of X should be approximately normal.
The practical effect of this theorem is ubiquitous in nature; most natural phenomena follow (at least approximately) a Gaussian distribution: the width of leaves, the length of fish, the height of humans. The French mathematician
Henri Poincaré is credited with the following remark:
18 I assume that the reader is reasonably comfortable with the Gaussian distribution, but section
Everyone is sure of this, Mr. Lippman told me one day, since the
experimentalists believe that it is a mathematical theorem, and the
mathematicians that it is an experimentally determined fact.19
Xn −→d X ⇐⇒ FXn(z) → FX(z)   (2.11)
Convergence in distribution is a much weaker concept than convergence in probability: for example, take a sequence X1, X2, . . . Xn of iid random variables with the same distribution F. Of course, by the definition we can say that Xn −→d X, where the distribution of X is, again, F, but there is very little we can say about the behaviour of the sequence itself. On the other hand, if Xn −→p X, the fact that lim_{n→∞} P [|Xn − X | < ϵ] = 1 implies that, when n is large, P (a < Xn < b) ≃ P (a < X < b) for every interval (a, b), and therefore Xn −→d X. This result is often spelt "convergence in probability implies convergence in distribution, but not vice versa", or Xn −→p X ⇒ Xn −→d X.
Now imagine that the LLN holds and X̄ −→p m. Clearly, X̄ − m −→p 0. In many cases, it can be proven that multiplying that quantity by √n gives you something that doesn't collapse to 0 but does not diverge to infinity either. The Central Limit Theorems analyse the conditions under which

√n (X̄ − m) −→d N (0, v) ,   (2.12)
19 French original: Tout le monde y croit cependant, me disait un jour M. Lippmann, car les ex-
said at this point that in many cases we should take into account the fact that the support of X n
may be discrete, and special care is needed to interpret what happens when F X n (z) “takes a step”.
I thought that would have been rather pedantic, so this remark is confined to a footnote.
√n (w − m) −→d N (0, Σ)   =⇒   w ∼a N (m, Σ/n)
Example 2.6
Let’s go back to example 2.4 (the coin-tossing experiment). Here not only the
mean exists, but also the variance:
V [xi ] = E [xi²] − E [xi ]² = π − π² = π(1 − π)

so that

p̂ ∼a N (π, π(1 − π)/n) .
In practice, if you toss a fair coin (π = 0.5) n = 100 times, the distribution
of the relative frequency you get is very well approximated by a Gaussian ran-
dom variable with mean 0.5 and variance 0.0025. Just so you appreciate how
well the approximation works, consider that the event 0.35 < p̂ ≤ 0.45 has a true
probability of 18.234%, while the approximation the CLT gives you is 18.219%. If
21 At this point, the inquisitive reader may ask: why the square root of n? Why not n itself, or
the cube root, or some other function of n? Section 2.A.3 offers an intuitive explanation of why it
should be so.
22 Note that, from a technical point of view, w may not have a variance for any n, although its
The second result we will often use is the delta method: if your estimator θ̂ is
defined as a differentiable transformation of a quantity which obeys a LLN and
a CLT, there is a relatively simple rule to obtain the limit in distribution of θ̂;
X̄ −→p m and √n (X̄ − m) −→d N (0, Σ)   =⇒   θ̂ = g (X̄ ) −→p θ = g (m) and √n (θ̂ − θ) −→d N (0, J ΣJ ′)   (2.14)

where n is the sample size and J is the Jacobian ∂g (x)/∂x evaluated at x = m.
Example 2.7
Given a sample of iid random variables x i for which E [x i ] = 1/a and V [x i ] =
1/a 2 , it is straightforward to construct a consistent estimator of the parameter a
as
â = 1/X̄ −→p 1/(1/a) = a.

Its asymptotic distribution is easy to find: start from the CLT:

√n (X̄ − 1/a) −→d N (0, 1/a²) .

The Jacobian term is

J = plim (dâ/dX̄ ) = −plim (1/X̄ ²) = −1/(1/a²) = −a²;

hence

AV [â] = (−a²) · (1/a²) · (−a²) = a²,

and therefore

√n (â − a) −→d N (0, a²)

so we can use the approximation â ∼a N (a, a²/n).
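A Monte Carlo sketch of the example (my own setup: a = 2, exponential draws generated as −log(U)/a so that E [xi ] = 1/a and V [xi ] = 1/a²; sample size and number of replications are arbitrary) confirms that n · V [â] is close to a²:

# sketch: sampling distribution of ahat = 1/mean(x) and the delta method
set verbose off
set seed 123
scalar a = 2
scalar n = 200
scalar nrep = 10000
matrix ahat = zeros(nrep, 1)
loop r = 1..nrep
    matrix x = -log(muniform(n, 1)) / a   # exponential draws, mean 1/a
    scalar mx = meanc(x)
    ahat[r] = 1 / mx
endloop
scalar m = meanc(ahat)
scalar v = meanc((ahat - m).^2)
printf "mean of ahat = %g (true a = %g)\n", m, a
printf "n * var(ahat) = %g (delta method: a^2 = %g)\n", n*v, a^2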
By using these tools, we construct estimators satisfying not only the consis-
tency property, but also asymptotic normality. These estimators are sometimes
termed CAN estimators (Consistent and Asymptotically Normal). Asymptotic
normality is important for three reasons:
this means that, for a decently large value of n, we can approximate the distribution of θ̂ as θ̂ ∼a N (θ, ω/n). This, in turn, implies that

P [ |θ̂ − θ| / √(ω/n) < 1.96 ] ≃ 95%;

that is, the chances that the interval θ̂ ± 1.96 · √(ω/n) contains the true value of θ are roughly 95%. This is what is called a 95%
confidence interval. Of course, a 99% confidence interval would be some-
what larger. Generalising this to a vector of parameters would lead us to
speaking of confidence sets; for example, when θ is a 2-parameter vector,
the confidence set would be an ellipse.
Example 2.8
Consider the same setup as example 2.6, that is tossing a coin n = 100 times,
and suppose we get “heads” 45 times. Therefore, our estimate of π would be
p̂ = 45/100 = 0.45.
Of course, this would imply that the asymptotic variance of our estimator
can be itself estimated as
v̂ = p̂(1 − p̂)/n = (0.45 · 0.55)/100 = 0.002475

so, since its square root is √v̂ = 0.04975, we may say that the interval

A = [0.45 − 0.04975 × 1.96, 0.45 + 0.04975 × 1.96] = 0.45 ± 0.0975 ≃ [0.35, 0.55]
H0 : θ ∈ H ⊂ S.
We would like to check whether our belief is consistent with the observed data.
If what we see is at odds with our hypothesis, then reason dictates we should
drop it in favour of something else.
The coin toss is a classic example. The one parameter in our DGP is π, that
is a probability, so the parameter space is S = [0, 1]. However, there is a subset
of S that is of special significance, namely the point H = {0.5}, because if π ∈ H
(which trivially means π = 0.5), then the coin is fair.
We presume that the coin is fair, but it’d be nice if we could check. What we
can do is flip it a number of times, and then decide, on the basis of the results,
if our original conjecture is still tenable. After flipping the coin n times, we ob-
tain a vector x of zeros and ones. What we want is a function T (x) (called a test
statistic) such that we can decide whether to reject H0 or not.
23 Disclaimer: this section is horribly simplistic. Any decent statistics textbook is far better than
this. My aim here is just to lay down a few concepts that I will use in subsequent chapters, with no
claim to rigour or completeness. My advice to the interested reader is to get hold of Casella and
Berger (2002) or Gourieroux and Monfort (1995, volume 2). Personally, I adore Spanos’ historical
approach to the matter.
By fixing beforehand a subset R of the support of T (x) (called the "rejec-
tion region”), we can follow a simple rule: we reject H0 if and only if T (x) ∈ R.
Since T (x) is a random variable, the probability of rejecting H0 will be between 0
and 1 regardless of the actual truth of H0 .24 Therefore, there is a possibility that
we’ll end up rejecting H0 while it’s in fact true, but the opposite case, when we
don’t reject while in fact we should, is also possible. These two errors are known
as type I and type II errors, respectively, and the following 4-way table appears
in all statistics textbooks, but memorising the difference could be easier if you
look at figure 2.1:
This situation is not unlike the job of a judge in a criminal case. The judge
starts from the premise that the defendant is not guilty, and then evidence is ex-
amined. By the “presumption of innocence” principle, the judge declares the
defendant guilty only if the available evidence is overwhelmingly against H0 .
Thus, type I error happens when an innocent person goes to jail; type II error
is when a criminal gets acquitted.
This line of thought is very much in line with the idea philosophers call fal-
sificationism, whose most notable exponent was the Austrian-British philoso-
pher Karl Popper (see eg Popper (1968)). According to the falsificationist point
of view, scientific progress happens via a series of rejections of previously made
conjectures.
24 Unless, of course, we do something silly such as deciding to always reject, or never. But in that
I am fully aware that the debate on philosophy of science has long established that falsificationism is untenable, as a description of scientific progress, on several accounts. What I'm saying is just that the statistical theory of hypothesis testing borrows quite a few ideas from the falsificationist approach. For a fuller account, check out Andersen and Hepburn (2016).
g (θ̂) ∼a N (g (θ), Σ) ;

under consistency, Σ should tend to a zero matrix; as for g (θ), that should be 0 if and only if H0 is true. These two statements imply that the quadratic form W = g (θ̂)′ Σ⁻¹ g (θ̂) should behave very differently in the two cases: if H0 is false, it should diverge; if H0 is true, instead, it can be shown26 to converge in distribution to a χ² random variable with p degrees of freedom,
25 Some people use the word “retain” instead of “accept”, which is certainly more correct, but
where p = rk (Σ). Hence, under H0 , the W statistic should take values typical of
a χ2 random variable.27 Therefore, we should expect to see “small” values of W
when H0 is true and large values when it’s false. The natural course of action is,
therefore, to set the rejection region as R = (c, ∞), where c, the critical value,
is some number to be determined. Granted, there is always the possibility that
W > c even if H0 is true. In that case, our decision to reject would imply a type
I error. But since we can calculate the distribution function for W , we can set
c to a prudentially large value. What is normally done is to set c such that the
probability of a type I error (called the size of the test, and usually denoted by
the Greek letter α) is a small number, typically 5%.
What people do in most cases is decide which α they want to use and then set c accordingly, so that in many cases you see c expressed as a function of α (and written cα), rather than the other way around.
But, I hear you say, what about type II error? Well, if W in fact diverges when H0 is false, the probability of rejection (also known as the power of the test)
should approach 1, and we should be OK, at least when our dataset is reasonably
large.28 There are many interesting things that could and should be said about
the power of tests, especially a truly marvellous result known as the Neyman-
Pearson lemma, but I’m afraid this is not the place for this. See the literature
cited at footnote 23.
Example 2.9
Let’s continue example 2.6 here. So we have that the relative frequency is a CAN
estimator for the probability of a coin showing “heads”.
√n (p̂ − π) −→d N (0, π(1 − π)) .
Let’s use this result for building a test for the “fair coin” hypothesis, H0 : π = 0.5.
We need a differentiable function g (x) such that g (x) = 0 if and only if x = 0.5.
One possible choice is
g (π) = 2π − 1
What we have to find is the asymptotic variance of g (p̂), which is AV [g (p̂)] = J · π(1 − π) · J ′ = ω, where J = plim ∂g (x)/∂x = 2, so

√n (g (p̂) − g (π)) −→d N (0, ω) .
26 To see why, see section 2.A.4
27 The reason why I’m using the letter W to indicate the test is that, in a less cursory treatment
of the matter, the test statistic constructed in this way could be classified as a “Wald-type” test.
28 When the power of a test goes to 1 asymptotically, the test is said to be consistent. I know, it’s
confusing.
Under the null, g (π) = 0 and ω = 1; therefore, the approximate distribution for
g (p̂) is
g (p̂) ∼a N (0, n⁻¹)
p-value, “[e]ither an exceptionally rare chance has occurred or the theory [...] is
not true”.30
Figure 2.2 shows an example in which W = 9, compared against a χ²₃ distribution. The corresponding 95th percentile is 7.815, so with α = 0.05 the
null should be rejected. Alternatively, we could compute the area to the right of
the number 9 (shaded in the Figure), which is 2.92%; obviously, 2.92% < 5%, so
we reject.
[Figure 2.2: the χ²₃ density, with the realised W = 9, the 5% critical value and the p-value area marked]
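Both numbers quoted above are easy to reproduce with gretl's critical() and pvalue() functions (a minimal sketch):

# sketch: 5% critical value of a chi-square(3) and right-tail area beyond W = 9
scalar W = 9
scalar cv = critical(X, 3, 0.05)   # should print 7.815 (approximately)
scalar pv = pvalue(X, 3, W)        # should print about 0.0292
printf "critical value = %g   p-value = %g\n", cv, pv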
To make results even quicker to read, most statistical packages adopt a graph-
ical convention, based on ornating the test statistic with a variable number of ’*’
characters, usually called “stars”. Their meaning is as follows:
Stars Meaning
(none) p-value greater than 10%
* p-value between 5% and 10%
** p-value between 1% and 5%
*** p-value under 1%
Therefore, when you see 3 stars you "strongly" reject the null, but you don't reject H0 when no stars are printed. One star means "use your common sense".
In fact, I'd like to add a few words on "using your common sense": relying on
the p-value for making decisions is OK; after all, that’s what it was invented for.
However, you should avoid blindly following the rule “above 5% → yes, below
5% → no”. You should always be aware of the many limitations of this kind of
approach: for example,
• even if all the statistical assumptions of your model are met, the χ² distri-
bution is just an approximation to the actual density of the test. Therefore,
30 Fisher RA. Statistical Methods and Scientific Inference. Ed 2, 1959. On this subject, if you’re
into the history of statistics, you might like Biau et al. (2009).
• even if the test was in fact exactly distributed as a χ2 variable, type I and
type II errors are always possible; actually, if you choose 5% as your signif-
icance level (like everybody does), you will make a mistake in rejecting H0
one time out of twenty;
• and besides, why 5%? Why not 6%? Or 1%? In fact, someone once said:
That said, I don’t want you to think “OK, the p-value is rubbish”: it isn’t. Ac-
tually, it’s the best tool we have for the purpose. But like any other tool (be it a
screwdriver, a microwave oven or a nuclear reactor), in order to use it effectively,
you must be aware of its shortcomings.
2.5 Identification
A common problem in econometrics is identification of a model. The issue is
quite complex, and I cannot do justice to it in an introductory book such as this,
so I’ll just sketch the main ideas with no pretence to rigour or completeness.
Basically, a model is said to be identified with reference to a question of interest
if the model’s probabilistic structure is informative on that question.
A statistical model is, essentially, a probabilistic description of the data that
we observe. When we perform inference on a dataset we assume that
• the data we observe are such that asymptotic theory is applicable (for ex-
ample, n is large and the data are iid) and we can define statistics that we
can use as estimators or tests;
The importance of the first three items in the list should be clear to the reader
from the past sections of this chapter. In this section, we will discuss the fourth
one.
θ = M (ψ).
In some cases, the relationship is trivial; often, M (·) is just the identity function,
θ = ψ, but sometimes this is not the case.
Statistics gives us the tools to estimate θ; is this enough to estimate ψ? It
depends on the function M (·); if the function is invertible, and we have a CAN
estimator θ̂, a possible estimator for ψ is
ψ̂ = M −1 (θ̂).
If M (·) is continuous and differentiable, then its inverse will share these prop-
erties, so we can use Slutsky’s theorem and the delta method and ψ̂ is a CAN
estimator too. In this case, we say that the model is identified.
In some cases, however, the function M (·) is not invertible, typically when
different values of ψ give rise to the same θ.31 In other terms, two alternative
descriptions of the world give rise to the same observable consequences: if
M (ψ1 ) = M (ψ2 )
for ψ1 ̸= ψ2 , we would observe data from the same DGP (described by θ) in both
cases; this situation is known as observational equivalence, and ψ1 and ψ2 are
said to be observationally equivalent. In these cases, being able to estimate θ,
even in an arbitrarily precise way, doesn’t tell us if the “true” description of the
world is ψ1 or ψ2 . This unfortunate case is known in econometrics as under-
identification.
Example 2.10
Suppose you have an urn full of balls, some white and some red. Call w the
number of white balls and r the number of red balls. We want to estimate both
w and r .
31 The technical way to say this would be “the M (·) function is not injective”.
Suppose also that the only experiment we can perform works as follows: we
can extract one ball from the urn as many times as we want, but we must put it
back after extraction (statisticians call this “sampling with replacement”). De-
fine the random variable x i as 1 if the ball is red. Clearly
xi = 1 with probability π = r/(w + r), and xi = 0 with probability 1 − π = w/(w + r).
In this case, the probability distribution of our data is completely characterised
by the parameter π; as we know, we have a perfectly good way to estimate π;
since the data are iid, X̄ is a CAN estimator of π and testing hypotheses on π is
easy.
If, however, the parameters of interest are ψ = [r, w], there is no way to es-
timate them separately, because the function θ = M (ψ) is not invertible, for the
very simple reason that the relationship between the DGP parameter π and our
parameters of interest w and r
π = r/(w + r)
is one equation in two unknowns. Therefore, in the absence of extra information
we are able to estimate π (the proportion of red balls) as precisely as wanted, but
there is no way to estimate r (the number of red balls).
Even if we knew the true value of π, there would still be an infinite array of
observationally equivalent descriptions of the urn. If, say, π = 0.3, the alternatives ψ1 = [3, 7], ψ2 = [15, 35], ψ3 = [3000, 7000], etc. would all be observationally equivalent.
for each x ∗ ∈ (a, b). Now assume that the interval (a, b) is the support of the rv
X , which possesses an expectation. Clearly, a < µ = E [X ] < b; this implies that
equation (2.15) holds when x ∗ = µ, and therefore
E [g (X )] ≤ g (µ) = g (E [X ]),
• it is possible to prove Jensen’s lemma in the more general case when g (·) is
not everywhere differentiable in (a, b), but that’s a bit more intricate (see
for example Williams (1991), page 61).
So for example if you knew that the expectation of a non-negative random vari-
able X was 4, you could safely say that P [X > 8] ≤ 1/2 without knowing anything
on the distribution of X . Nice.
Now let’s have a look at the moments of X̄ ; its first moment is trivial to find,
since
E [X̄ ] = E [(1/n) Σ_i xi ] = (1/n) Σ_i E [xi ] = nm/n = m;
as for its variance, just assume that it exists (which in turn requires existence of
all the variances v i ) and that
lim_{n→∞} V [X̄ ] = 0;
that is, its variance shrinks as the dataset grows larger (this may be tricky: see
also Section 2.A.3).
Now define W = ( X̄ −m)2 , so E [W ] is the variance of X̄ . But the most impor-
tant thing is that W cannot be negative (it’s a square), so we can use Markov’s
inequality (2.16) directly and say32
P [W ≥ ε²] ≤ V [X̄ ]/ε²;

of course, the left-hand side of this inequality can be rewritten as P [|X̄ − m| ≥ ε], which is in turn equal to 1 − P [|X̄ − m| < ε]. Therefore,

P [|X̄ − m| < ε] ≥ 1 − V [X̄ ]/ε².
If we assume that V [X̄ ] → 0, then we have the desired result:

lim_{n→∞} P [|X̄ − m| < ε] = 1 ⇐⇒ X̄ −→p m.
Note that this case is nearly useless in practice, because being able to compute the moments of our quantities of interest is extremely rare, but it still gives you a nice idea of the kind of conditions that can be used to prove consistency.
2.A.3 Why √n?

Here I'll give you an intuitive account of the reason why, in the standard cases, the Central Limit Theorem works by using √n as the normalising transformation instead of some other power of n. Suppose we have a vector x of size n
containing our observations, that are not necessarily independent nor identi-
cal. However, we do require that they possess second moments and use Σ to
indicate the covariance matrix of x:
V [x] = Σ =
    V [x1]          Cov [x1, x2]   . . .   Cov [x1, xn]
    Cov [x1, x2]    V [x2]         . . .   Cov [x2, xn]
    ...             ...            . . .   ...
    Cov [x1, xn]    Cov [x2, xn]   . . .   V [xn]
32 This special case of Markov’s inequality is sometimes called Chebyshev’s inequality.
Suppose also that X̄ has a probability limit that we call m: X̄ −→p m. Now note that the average X̄ can be written as X̄ = (1/n) ι′x, and therefore its variance can be easily calculated by the rule (2.7). Therefore,

V [X̄ ] = (1/n²) · ι′Σι
What can we say about ι′ Σι? First, given the properties of ι, this is simply
the sum of all the elements of Σ; second, since Σ is positive semi-definite by
construction, this cannot be a negative number, but it may be a large positive
one. Especially so, considering that the size of Σ grows with n.
We must now examine what happens to ι′ Σι as n → ∞. When the x i rvs
are iid, this is easy, since in this special case Σ is just a multiple of the identity
matrix; hence, in the iid case, Σ = v ·I and ι′ Σι = n · v. In a more general case, the
non-diagonal elements may be non-zero (which could happen for dependent
observations), or the elements on the diagonal may be heterogeneous (which
could happen in the non-identical case). However, it may still be that, despite
these complications, ι′ Σι behaves asymptotically as a linear function of n. To be
more precise, it may happen that

lim_{n→∞} ι′Σι/n = K ,
where K is some constant. For example, in the iid case, K would just be equal to v. In all these cases, you have that

n · lim_{n→∞} V [X̄ ] = K .

Therefore, the only way to multiply (X̄ − m) by a power of n, say n^α, and have that the variance of the result is a constant is to choose α = 1/2, which of course gives you √n.
When observations are not iid, there may be cases when ι′ Σι grows at a rate
that is different from n. In these cases, the normalising factor needed to achieve
convergence in distribution is actually different from the square root. This typ-
ically happens when the x i rvs come from a time-series sample, and the degree
of dependence between nearby observations can be substantial. The beginning
of chapter 5 contains a brief discussion of “persistence” in time series.
[Figure: the density ϕ(x) of the standard Gaussian distribution]
As is well known, ϕ(x) has no closed-form indefinite integral: that is, it can be proven that the function Φ(x), whose derivative is ϕ(x), does exist, but cannot be written as a combination of "simple" functions (the proof is very technical). Nevertheless, it's quite easy to approximate numerically, so every statistical program (heck, even spreadsheets) will give you excellent approximations via clever numerical methods. If you're into this kind of stuff, Marsaglia (2004) is highly recommended.
or, in short, x ∼ N (m, Σ), where n is the dimension of x, m is its expectation and
Σ its covariance matrix. The multivariate version of this random variable also
enjoys the linearity property, so if x ∼ N (m, Σ), then
y = Ax + b ∼ N (Am + b, AΣA′) .   (2.17)
It is easy to overlook how amazing this result is: the fact that E [Ax + b] =
AE [x] + b is true for any distribution and does not depend on Gaussianity; and
the same holds for the parallel property of the variance. The special thing about
the Gaussian distribution is that a linear transformation of a Gaussian rv is itself
Gaussian. And this is a very special property, that is only shared by a few distri-
butions (for example: if you take a linear combination of two Bernoulli rvs, the
result is not Bernoulli-distributed).
The Gaussian distribution has a very convenient feature: contrary to what
happens in general, if X and Y have a joint normal distribution (that is, the vec-
tor x = [Y , X ] is a bivariate normal rv), absence of correlation implies indepen-
dence (again, this can be proven quite easily: nice exercise left to the reader).
Together with the linearity property, this also implies another very important
result: if y and x are jointly Gaussian, then the conditional density f (y|x) is Gaus-
sian as well. In formulae:
y|x ∼ N (m, Σ)

where

m = E [y|x] = E [y] + B′(x − E [x])
B = Σx⁻¹ Σx,y
Σ = V [y|x] = Σy − Σ′x,y Σx⁻¹ Σx,y
Example 2.11
For example, suppose that the joint distribution of y and x = [x 1 , x 2 ] is normal,
with
E [(y, x1, x2)′] = (1, 2, 3)′
V [(y, x1, x2)′] = [3, 0, 1; 0, 1, 1; 1, 1, 2]
and therefore

B = [1, 1; 1, 2]⁻¹ [0; 1] = [−1; 1]
Σ = 3 − [0, 1] [1, 1; 1, 2]⁻¹ [0; 1] = 3 − 1 = 2,

since [1, 1; 1, 2]⁻¹ = [2, −1; −1, 1]. Thus, the conditional expectation of y given x equals

E [y|x] = 1 + [−1, 1] [x1 − 2; x2 − 3] = −x1 + x2
and in conclusion
y|x ∼ N [x 2 − x 1 , 2] .
Note that:

• if you apply the Law of Iterated Expectations (eq. (2.8)) you get

E [E [y|x]] = −E [x1 ] + E [x2 ] = −2 + 3 = 1 = E [y] ;
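The matrix algebra above is easy to replicate in gretl (the partitioning of V into blocks is mine):

# sketch: conditional mean coefficients and conditional variance for example 2.11
matrix V = {3, 0, 1; 0, 1, 1; 1, 1, 2}     # covariance of [y, x1, x2]
matrix Sxx = V[2:3, 2:3]
matrix Sxy = V[2:3, 1]
matrix B = inv(Sxx) * Sxy                  # should be (-1, 1)'
scalar S = V[1,1] - Sxy' * inv(Sxx) * Sxy  # should be 2
print B
printf "conditional variance = %g\n", S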
[Figure 2.4: χ² densities for 1 to 4 degrees of freedom]
The most common cases, where n ranges from 1 to 4, are shown in Figure 2.4.35
Like the normal density, there is no way to write down the distribution function
of χ2 random variables, but numerical approximations work very well, so critical
values are easy to compute via appropriate software. The 95% critical values for
the cases n = 1 . . . 4 are
degrees of freedom 1 2 3 4
critical value at 95% 3.84 5.99 7.81 9.49
For example, a χ²₁ random variable takes values from 0 to 3.84 with probability 95%. Memorising them may turn out to be handy from time to time.
set verbose off
clear

# characteristics of the event
scalar p = 0.5
scalar n = 100
scalar lo = 36
scalar hi = 45
35 In case you're wondering what Γ(n/2) is, just google for "gamma function"; it's a wonderful object, you won't be disappointed. Suffice it to say that, if x is a positive integer, then Γ(x) = (x − 1)!, but the gamma function is defined for all real numbers, except nonpositive integers.
# CLT approximation with continuity correction (the remaining lines of the
# original script were lost; this is a minimal reconstruction)
scalar se = sqrt(p*(1-p)/n)
scalar clt = cnorm(((hi + 0.5)/n - p)/se) - cnorm(((lo - 0.5)/n - p)/se)

# printout
printf "CLT approximation = %.5f\n", clt

Output:
Figure 3.1: Conditional distribution of the stature of children given their parents'
[boxplots of "child" stature (vertical axis, roughly 60–74 inches) against "parent" stature (horizontal axis, roughly 64–73 inches)]
matched them against the average height of their parents (x). Data are in inches.
It is natural to think that somebody’s stature is the re-
sult of a multiplicity of causes, but surely the hereditary
component cannot be negligible. Hence our interest
in f (y|x). For each observed value of x in the sample,
Figure 3.1 shows the corresponding boxplot.
Maybe not all readers are familiar with boxplots, so al-
low me to explain how to read the “candlesticks” in the fig-
ure: each vertical object consists of a central “box”, from
which two “whiskers” depart, upwards and downwards.
The central box encloses the middle 50% of the data, i.e.
it is bounded by the first and third quartiles. The “whiskers” extend from each
end of the box for a range equal at most to 1.5 times the interquartile range. Ob-
servations outside that range are considered outliers1 and represented via dots.
A line is drawn across the box at the median. Additionally, a black dot indicates
the average.
The most notable feature of Figure 3.1 is that the boxes seem to go up to-
gether with x; that is, the distribution of y shifts towards higher values as x grows.
However, even considering the subset of observations defined as the children
whose parents were of a certain height, some dispersion remains. For example,
if we focus on x = 65.5, you see from the third candlestick from the left that the
minimum height is about 62 and the maximum is about 72, while the mean is
between 66 and 68 (in fact, the precise figure is 67.059).
Historical curiosity: if you use OLS to go through those points such as to minimise the SSR, you will find that the fitted line is ĉi = 23.9 + 0.646pi , where ci stands for "child" and pi for "parent". The fact that the slope of the fitted line is less than 1 prompted Galton to observe that the tendency for taller parents was to have children who were taller than the average, but not as much as themselves (and of course the same, in reverse, happened to shorter parents). Galton described this state of things as "Regression towards Mediocrity", and the term stuck.
Figure 3.2: Regression function of the stature of children given their parents'
[plot of "child" stature (vertical axis) against "parent" stature (horizontal axis)]
The first step for making this intuition operational is to define the random variable ε ≡ y − E [y|x], so that y can be written (by definition) as E [y|x] + ε. For historical reasons, the random variable ε is called the disturbance (see also section 3.A.2). A very important property of the random variable ε is that it's orthogonal to x by construction:2

E [x · ε] = 0   (3.1)

In fact,

E [ε|x] = E [y|x] − E [m(x)|x] = m(x) − m(x) = 0;
E [x · ε] = E [x · E [ε|x]] = E [x · 0] = 0.
If we assume that the regression function m(x) is linear,4 we can write
y i = x′i β + εi (3.2)
or, in matrix form,
y = Xβ + ε (3.3)
Note the difference between equation (3.3) and the parallel OLS decompo-
sition y = Xβ̂ + e, where everything on the right-hand side of the equation is an
2 Warning: as shown in the text, E [ε|x] = 0 =⇒ E [x · ε] = 0, but the converse is not necessarily
true.
3 In fact, there are techniques for estimating the regression function directly, without resorting
to assumptions on its functional form. These techniques are grouped under the term nonpara-
metric regression. Their usage in econometrics is rather limited, however, chiefly because of their
greater computational complexity and of the difficulty of computing marginal effects.
4 As for what we mean exactly by “linear”, see 1.3.2.
observable statistic. Instead, the only observable item in the right-hand side of
(3.3) is X: β is an unobservable vector of parameters and ε is an unobservable
vector of random variables. Still, the similarity is striking; it should be no sur-
prise that, under appropriate conditions, β̂ is a CAN estimator of β , which we
prove in the next section.
3.2.1 Consistency
In order to prove consistency, start from equation (3.4) and rewrite matrix prod-
ucts as sums:

β̂ = β + (Σ_i xi x′i)⁻¹ Σ_i xi εi = β + [(1/n) Σ_i xi x′i]⁻¹ (1/n) Σ_i xi εi .   (3.5)
Let’s analyse the two terms on the right-hand side separately: in order to do so,
it will be convenient to define the vector
zi = xi εi ; (3.6)
(1/n) Σ_i xi εi −→p 0.   (3.7)
As for the limit of the first term, assume that n −1 X′ X has one, and call it Q:5
(1/n) Σ_i xi x′i −→p Q;   (3.8)

if Q is invertible, then we can exploit the fact that inversion is a continuous transformation as follows

[(1/n) Σ_i xi x′i]⁻¹ −→p Q⁻¹,
One may think that the whole argument would break down if the assumption of linearity were violated. This is not completely true: even in many cases when E [y|x] is non-linear, it may be proven that β̂ is a consistent estimator of the parameters of an object called the Optimal Linear Predictor, which includes linearity as a special case. But this is far too advanced for a book like this.
there are two main reasons why this requirement may fail to hold:
1. it may not converge to any limit; this would be the case if, for example, the
vector x possessed no second moments;6
we already know from the previous subsection that [(1/n) Σ_i xi x′i]⁻¹ −→p Q⁻¹; as for the second term, under appropriate conditions a CLT applies, so that

(1/√n) Σ_i xi εi = √n z̄ −→d N (0, Ω) ,

where Ω ≡ V [zi ] = E [zi z′i]. Therefore, we can use Cramér's theorem (see (2.13))
as follows: since

√n (β̂ − β) = [(1/n) Σ_i xi x′i]⁻¹ (1/√n) Σ_i xi εi

[(1/n) Σ_i xi x′i]⁻¹ −→p Q⁻¹
(1/√n) Σ_i xi εi −→d N (0, Ω)

the quantity √n (β̂ − β) converges to a normal rv multiplied by the nonstochastic matrix Q⁻¹; therefore, the linearity property for Gaussian rvs applies (see eq. (2.17)), and as a consequence

√n (β̂ − β) −→d N (0, Q⁻¹ΩQ⁻¹) .   (3.9)
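Just to see the result at work, here is a minimal simulation sketch in gretl; the DGP (β = (1, 2)′, standard normal regressor and disturbance, n = 500) is an arbitrary choice of mine, not taken from the text:

# sketch: OLS on simulated data; estimates should be close to (1, 2)
set verbose off
set seed 2023
nulldata 500
series x = normal()
series y = 1 + 2*x + normal()
ols y const x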
The quantity E [ε²i |xi ] is a bit of a problem here. We proved quite easily that E [ε|x] = 0 as a natural consequence of the way ε is defined (see section 3.1, page 79). However, we know nothing about its conditional second moment (the conditional variance of y, if you like). For all we know, it may even not exist; or if it does, it could be an arbitrary (and possibly quite weird) function of x. The only thing we can be sure of is that the function h(x) = E [ε²|x] (sometimes called the skedastic function) must be positive, since it's the expectation of a square and of course the support of ε² is the positive real line, or possibly a subset.
In some cases, one could be tempted to set up a model in which we assume a functional form for the skedastic function in the same way as we do for the regression function. This, however, is very seldom done: the computational complexity is greater than OLS and there is little interest in the parameters of the conditional variance. That said, there are certain situations where the main object of interest is the conditional variance instead of the conditional mean, like for example in certain models used in finance. A fuller discussion of this topic would lead to a concept called heteroskedasticity, which is the object of section 4.2.
traditionally labelled σ²:

E [ε²|x] = σ²;   (3.10)

the most important implication of this assumption is that the conditional variance V [yi |xi ] is constant for all observations i = 1 . . . n; this idea is known as homoskedasticity.7 This assumption can be visualised in terms of Figure 3.1 as the idea that all the boxplots look roughly the same, and all you get by moving along the horizontal axis is that they may go up and down as an effect of E [y|x]
not being constant, but never change their shape and size. How realistic this
assumption is in practice remains to be seen, and a sizeable part of chapter 4
will be devoted to this issue, but for the moment let’s just pretend this is not a
problem.
Therefore, under the homoskedasticity assumption, the matrix Ω takes a particularly simple form:
\[
\Omega = \mathrm{E}[\varepsilon_i^2 x_i x_i'] = \sigma^2\,\mathrm{E}[x_i x_i'] = \sigma^2 Q ,
\]
so that the asymptotic covariance matrix in (3.9) collapses to $Q^{-1}\Omega Q^{-1} = \sigma^2 Q^{-1}$.
This result is also important because it provides the basis for justifying the usage
of OLS as an estimator of β on the grounds of its efficiency. Traditionally, this is
proven via the so-called Gauss-Markov theorem, which, however, relies quite
heavily on small-sample results that I don’t like very much.8 In fact, there is
a much more satisfactory proof that OLS is asymptotically semiparametrically
efficient, but it’s considerably technical, so it’s way out of scope here.9 Suffice it
to say that, under homoskedasticity, OLS is hard to beat in terms of efficiency.
We can consistently estimate Q via $n^{-1}X'X$ and σ² via10
\[
\hat\sigma^2 = \frac{e'e}{n} \xrightarrow{\;p\;} \sigma^2 . \tag{3.12}
\]
9 See Hansen (2019), sections 7.20 and 7.21 if you’re interested.
10 Proof of this is unnecessary, but if you insist, go to subsection 3.A.1.
The difference between the two variants (σ̂² = e'e/n and the "unbiased" version s² = e'e/(n − k)) is negligible if the sample size n is reasonably large, so you can use either; or, to put it otherwise, if using s² instead of σ̂² makes a substantial difference, then n is probably so small that in my opinion you shouldn't be using statistical inference in the first place. And besides, I see no convincing reason why unbiasedness should be considered a virtue in this context. The usage of σ̂², instead, makes many things easier and is consistent with all the other procedures that econometricians use beyond OLS, in which asymptotic results are uniformly used. Having said this, let's move on.
3.2.3 In short
To summarise, here is a set of conditions under which OLS can be interpreted as a CAN estimator of something meaningful:

1. the conditional mean of $y_i$ is linear in the explanatory variables, that is $\mathrm{E}[y_i|x_i] = x_i'\beta$;

2. the observations are such that a law of large numbers and a central limit theorem apply (for example, because they are iid draws from the same DGP);

3. $n^{-1}X'X$ converges in probability to a finite, invertible matrix Q.
If the above is true, then β̂ can be regarded as a CAN estimator of the parameters
of the conditional mean, that can be used, in turn, to compute marginal effects
or, as we shall see in the next section, to perform hypothesis tests. Note that
the above hypotheses are sufficient, but some of them may be relaxed to some
degree, and we will do so in the next chapters.
The reader may also find it interesting that an alternative set of assumptions
customarily known as the classical assumptions was traditionally made when
teaching econometrics in the twentieth century. In my opinion, using the clas-
sical assumptions for justifying the usage of OLS as an estimator is a relic of the
past, but if you’re into the history of econometric thought, I wrote a brief de-
scription in section 3.A.2.
3.3 Specification testing

Suppose we have to choose between two alternative specifications of our model, one of which is a special case of the other; for example:
Model A y i = x 1i β1 + x 2i β2 + x 3i β3 + εi (3.13)
Model B y i = x 1i β1 + x 3i β3 + εi (3.14)
The choice between the two can be framed as a test of the hypothesis $H_0: \beta_2 = 0$.11 For a single coefficient $\beta_i$, the test statistic takes the form
\[
W = \hat\beta'\,u_i\big(u_i'\hat V u_i\big)^{-1}u_i'\,\hat\beta = \frac{\hat\beta_i^2}{v_{ii}} = t_i^2 , \tag{3.15}
\]
where $u_i$ is the vector that extracts the i-th element of β̂, $t_i = \hat\beta_i/se_i$, $v_{ii}$ is the i-th element on the diagonal of V̂ and $se_i = \sqrt{v_{ii}}$. The quantity $se_i$ is usually referred to as the standard error of β̂_i. Of course, the null hypothesis would be rejected if W > 3.84 (the 5% critical value for a χ²₁ distribution); equivalently, we'd reject when $|t| > \sqrt{3.84} = 1.96$.

11 In fact, we will argue in section 3.5 that OLS on model B would produce a more efficient estimator of the remaining parameters, provided that β2 is really 0.
In fact, it’s rather easy to prove that we could use a slight generalisation of
the above for constructing a test for H0 : βi = a, where a is any real number you
want, and that such a test takes the form
\[
t_{\beta_i = a} = \frac{\hat\beta_i - a}{se_i}\,. \tag{3.16}
\]
Clearly, we can use the t statistic to decide whether a certain explanatory vari-
able is irrelevant or not, and therefore choose between model A and model B. In
the next subsection, I will show how this idea can be nicely generalised so as to
frame the decision on the best specification via hypothesis testing.
Note, also, that the t statistic can also be used “in reverse” to construct confi-
dence intervals in the same way as discussed at the end of Section 2.3.2: instead
of asking ourselves what the decision on H0 would be for a given a, we may look
for the values of a that would lead us to a given decision. A hypothesis of the
kind $H_0: \beta_i = a$ is not rejected whenever
\[
-1.96 < \frac{\hat\beta_i - a}{se_i} < 1.96 ;
\]
in other words, the interval $\hat\beta_i \pm 1.96\cdot se_i$ contains all the values of a that we may consider not contradictory to the observed data.
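As a quick numerical check (my own sketch, in Python rather than gretl; the coefficient and standard error are the ones reported for lsize in the house-price example of section 3.4), inverting the t statistic gives exactly the interval β̂_i ± 1.96·se_i:

    # Sketch (not from the book): 95% confidence interval by inverting the t statistic.
    import numpy as np

    bhat, se = 1.03696, 0.0232121
    lo, hi = bhat - 1.96 * se, bhat + 1.96 * se
    print(f"95% CI: [{lo:.4f}, {hi:.4f}]")

    # equivalent check: a value a is "not rejected" iff |(bhat - a)/se| < 1.96
    grid = np.linspace(bhat - 0.2, bhat + 0.2, 2001)
    accepted = grid[np.abs((bhat - grid) / se) < 1.96]
    print(accepted.min(), accepted.max())   # roughly the same endpoints as above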
More general hypotheses can be expressed as linear restrictions on the parameters. Take for example the model
\[
y_i = x_{1i}\beta_1 + x_{2i}\beta_2 + x_{3i}\beta_3 + \varepsilon_i ; \tag{3.17}
\]
suppose we want to impose the restriction β1 = 1: the model becomes
\[
y_i = x_{1i} + x_{2i}\beta_2 + x_{3i}\beta_3 + \varepsilon_i
\]
or, moving the term with no parameters to the left-hand side,
\[
y_i - x_{1i} = x_{2i}\beta_2 + x_{3i}\beta_3 + \varepsilon_i ,
\]
so the restricted model has $y_i - x_{1i}$ as the dependent variable and only $x_{2i}$ and $x_{3i}$ as regressors. If, instead, the restriction were β2 + β3 = 0 (that is, β3 = −β2), the model would become
\[
y_i = x_{1i}\beta_1 + (x_{2i} - x_{3i})\beta_2 + \varepsilon_i .
\]
Of course you can combine more than one restriction into a system:
\[
\begin{cases}
\beta_1 = 1 \\
\beta_2 + \beta_3 = 0 ,
\end{cases}
\]
and if you applied these to (3.17), the constrained model would turn into
\[
y_i - x_{1i} = (x_{2i} - x_{3i})\beta_2 + \varepsilon_i .
\]
The best way to represent constraints of the kind we just analysed is via the matrix equation
\[
R\beta = d ,
\]
where the matrix R and the vector d are chosen so as to express the constraints we want to test. The examples above on model (3.17), where β = [β1, β2, β3]′, are translated into the Rβ = d form as follows:

    constraint            R                    d
    β1 = 1                [1 0 0]              1
    β2 + β3 = 0           [0 1 1]              0
    both restrictions     [1 0 0; 0 1 1]       [1; 0]
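A tiny sketch of mine (Python rather than gretl) of how such restriction matrices can be written down and checked mechanically:

    # Sketch (not from the book): restrictions beta1 = 1 and beta2 + beta3 = 0
    # on model (3.17), in the R @ beta = d form.
    import numpy as np

    R = np.array([[1.0, 0.0, 0.0],     # beta1         = 1
                  [0.0, 1.0, 1.0]])    # beta2 + beta3 = 0
    d = np.array([1.0, 0.0])

    beta_ok  = np.array([1.0, 0.7, -0.7])   # satisfies both restrictions
    beta_bad = np.array([0.9, 0.7, -0.2])   # violates both

    print(np.allclose(R @ beta_ok,  d))     # True
    print(np.allclose(R @ beta_bad, d))     # False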
3.4 Example: reading the output of a software package

As a worked example, consider a dataset of 2,610 house sales in the US, on which we estimate the model
\[
p_i = \beta_0 + \beta_1 s_i + \beta_2 b_i + \beta_3 a_i + \beta_4 x_i + \varepsilon_i \tag{3.19}
\]
where p i is the log price of the i -th house and the explanatory variables are:
Legend
lsize si log of living area, hundreds of square feet
baths bi number of baths
age ai age of home at time of sale, years
pool xi = 1 if home has pool, 0 otherwise
Models of this type, where the dependent variable is the price of a good and
the explanatory variables are its features, are commonly called hedonic models.
In this case (like in most hedonic models), the dependent variable is in logarithm; therefore, each coefficient must be interpreted as the impact of the corresponding variable on the relative change in the house price (see footnote 34 in Chapter 1).
As you can see, the output is divided into two tables; the most interesting is
the top one, which contains β̂ and some more statistics. I’ll describe the con-
tents of the bottom table in section 3.4.2.
The top table has one row for each explanatory variable; for each of them, the columns report:14

1. the name of the variable;
2. the estimated coefficient β̂_i;
3. the corresponding standard error se_i;
4. the ratio of those two numbers, that is the t-ratio (see eq. 3.15);
5. the corresponding p-value, possibly with the little stars on the right (see section 2.4.1).

14 After reading this section, the reader might want to go back to section 1.5 and read it again.
Note that gretl, like all econometric packages I know, gives you the "finite-sample" version of the standard errors, that is those computed by using s² as a variance estimator instead of σ̂² (the variant I personally prefer); the difference, however, would be minimal.
For the interpretation of each row, let's begin with the lsize variable:16 the
coefficient is positive, so that in our dataset bigger houses sell for higher prices,
which of course stands to reason. However, the magnitude of the coefficient is
also interesting: 1.037 is quite close to one. Since the house size is also expressed
in logs, we could say that the relative response of the house price to a relative
variation in the house size is 1.037. For example, if we compared two houses
where house A is bigger than house B by 10% (and all other characteristics were
the same), we would expect the value of house A to be 10.37% higher than that
of house B.
As the reader knows, this is what in economics we call an elasticity: for a
continuous function you have that
\[
\eta = \frac{dy}{dx}\cdot\frac{x}{y} = \frac{d\log y}{d\log x}
\]
because $\frac{d\log y}{dy} = \frac{1}{y}$ and therefore $d\log y = \frac{dy}{y}$. So, any time you see something
like log(y) = a + b log(x), you can safely interpret b as the elasticity of y to x.
From an economic point of view, therefore, we would say that the elasticity of the house price to its size is about 1. What is more interesting, the standard error for that coefficient is about 0.023, which gives a t-ratio of 44.67, and the corresponding p-value is such an infinitesimal number that the software just gives you 0.¹⁷ If we conjectured that there was no effect of size on price, that hypothesis would be strongly rejected on the grounds of empirical evidence. In the jargon of applied economists, we would say that size is significant (in this case, very significant).

16 "Why not the constant?" you may ask. Nobody cares about the constant.
If, instead, we wanted to test the more meaningful hypothesis $H_0: \beta_1 = 1$, it would be quite easy to compute the appropriate t statistic as per equation (3.16):
\[
t = \frac{\hat\beta_1 - 1}{se_1} = \frac{1.03696 - 1}{0.0232121} = 1.592
\]
and the corresponding p-value would be about 11.1%, so we wouldn’t reject H0 .
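For completeness, here is how that p-value can be reproduced (a sketch of mine, in Python; the two-sided p-value uses the standard normal approximation that the asymptotic theory above justifies):

    # Sketch (not from the book): two-sided p-value for the t statistic testing
    # H0: beta1 = 1 in the house-price example, using the N(0,1) approximation.
    from scipy.stats import norm

    t = (1.03696 - 1) / 0.0232121
    pval = 2 * norm.sf(abs(t))
    print(round(t, 3), round(pval, 3))   # roughly 1.592 and 0.111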
On the other hand, we get a slightly negative effect for the number of baths
(−0.00515142). At first sight, this does not make much sense, since you would
expect that the more baths you have in the house, the more valuable your prop-
erty is. How come we observe a negative effect?
There are two answers to this question: first, the p-value associated to the
coefficient is 0.6935, which is way over the customary 0.05 level. In other words,
an applied economist would say that the baths variable is not significant. This
does not mean that we can conclusively deduce that there is no effect. It means
that, if there is one, it’s too weak for us to detect (and, for all we know, it might as
well be positive instead, albeit quite limited). Moreover, this is the effect of the
number of baths other things being equal, as we know from the Frisch-Waugh
theorem (see section 1.4.4). In this light, the result is perhaps less surprising:
why should the number of baths matter, given the size of the house? Actually, a
small house filled with baths wouldn’t seem such a great idea, at least to me.
On the contrary, the two variables age and pool are highly significant. The
coefficient for age, for example, is about -0.002: each year of age decreases the
house value by about 0.2%, which makes sense. The coefficient for the dummy
variable pool is about 0.107, so it would seem that having a pool increases the
house value by a little over 10%, which, again, makes sense.
Among the statistics in the bottom table you will also find the log-likelihood. This number is of very little use by itself;19 in this book, it's only important because it provides the essential ingredient for calculating the so-called Information Criteria (IC), which are widely used tools for comparing non-nested models.
We say that a model nests another one when the latter is a special case of
the former. For example, the two models (3.13) and (3.14) are nested, because
model (3.14) is just model (3.13) in the special case β2 = 0. If model B is nested
in model A, choosing between A and B is rather easy: all you need is a proper
test statistic; I will provide a detailed exposition in Section 3.5. However, we may
have to choose between two alternative specifications in which nesting is impossible, such as
\[
y_i \simeq x_i'\beta \qquad\text{or}\qquad y_i \simeq z_i'\gamma .
\]
Information criteria start from the value of the log-likelihood (multiplied by −2) and add a penalty function, which is increasing in the number of parameters. The gretl package, which I'm using here, reports three criteria: the Akaike IC (AIC), the Schwarz IC (BIC, where B is for "Bayesian") and the one by Hannan and Quinn (HQC), which differ by the choice of penalty function; in their most common form,
\[
\mathrm{AIC} = -2L + 2k \qquad
\mathrm{BIC} = -2L + k\log n \qquad
\mathrm{HQC} = -2L + 2k\log\log n .
\]
The rule is to pick the model that minimises the information criteria. It may be interesting to know that, for the case of the linear model that we are examining in the present context, the quantity −2L equals
\[
-2L = n\big[K + \log(\hat\sigma^2)\big],
\]
where K = 1 + log(2π) is a constant that does not depend on the parameters.
18 This test, technically, is of the F variety — see section 3.5.1 for its definition.
19 If the data are iid and f (y|x) is Gaussian, then β̂ and σ̂2 are the maximum likelihood estima-
tors of β and σ2 . I chose not to include this topic into this book, but the interested reader will find
a nice and compact exposition in Verbeek (2017), chapter 6. Other excellent treatments abound,
but the curious reader may want to check out chapters 14–15 of Ruud (2000). If you want to go for
the real thing, grab Gourieroux and Monfort (1995), volume 1, chapter 7.
3.5 Restricted least squares and hypothesis testing

The restricted least squares (RLS) estimator β̃ is defined as the solution of the constrained minimisation problem
\[
\tilde\beta = \mathop{\mathrm{Argmin}}_{R\beta = d}\;\lVert y - X\beta \rVert ; \tag{3.23}
\]
compare (3.23) with equation (1.14): OLS is defined as the unconstrained SSR
minimiser (we can choose β̂ among all k-element vectors); RLS, instead, can
only be chosen among those vectors β that satisfy R β = d. Figure 3.3 exemplifies
the situation for k = 2.
Define the restricted residuals as ẽ = y − Xβ̃ ; we will be interested in com-
paring them with the OLS residuals, so in this section we will denote them as
ê = y − Xβ̂ = MX y to make the distinction typographically evident.
A couple of remarks can already be made even without knowing what the
solution to the problem in (3.23) is. First, since β̂ is an unrestricted minimiser,
ê′ ê cannot be larger than the constrained minimum ẽ′ ẽ. However, the inequality
ê′ ê ≤ ẽ′ ẽ can be made more explicit by noting that
\[
M_X \tilde e = M_X\big(y - X\tilde\beta\big) = M_X y = \hat e
\]
and therefore
\[
\hat e'\hat e = \tilde e' M_X \tilde e = \tilde e'\tilde e - \tilde e' P_X \tilde e ,
\]
so that
\[
\tilde e'\tilde e - \hat e'\hat e = \tilde e' P_X \tilde e \ge 0 , \tag{3.24}
\]
3.2.2 in Verbeek (2017). However, the literature on statistical methods for selecting the “best”
model (whatever that may mean) is truly massive; see for example the “model selection” entry in
(Durlauf and Blume, 2008).
21 In fact, the cross validation criterion can be shown to be roughly equivalent to the AIC.
Figure 3.3: The ellipses are the contour lines of the function e′e. The constraint is β1 = 3β2. The number of parameters k is 2 and the number of constraints p is 1. The unconstrained minimum is (β̂1, β̂2); the constrained minimum is (β̃1, β̃2). (Axes: β1 and β2.)
where the right-hand side of the equation is non-negative, since ẽ′ PX ẽ can be
written as (PX ẽ)′ (PX ẽ), which is a sum of squares.22
In order to solve (3.23) for β̃ , we need to solve a constrained optimisation
problem, which is not complicated once you know how to set up a Lagrangean.
The details, however, are not important here and I’ll give you the solution straight
away:
\[
\tilde\beta = \hat\beta - (X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - d); \tag{3.25}
\]
derivation of this result is provided in the separate subsection 3.A.5, so you can
skip it if you want.
The statistical properties of β̃ are proven in section 3.A.6, but the most important points to make here are: if Rβ = d holds (that is, if the restriction is true), then β̃ is consistent just like the OLS estimator β̂, but has the additional advantage of being more efficient. If, on the contrary, Rβ ≠ d, then β̃ is inconsistent. The practical consequence of this fact is that, if we were certain that the equation Rβ = d holds, we would be much better off by using an estimator that incorporates this information; but if our conjecture is wrong, our inference would be invalid.
It’s also worth noting that nobody uses expression (3.25) as a computational
device. The simplest way to compute β̃ is to run OLS on the restricted model and
then “undo” the restrictions: for example, if you take model (3.17), reproduced
22 In fact, the same claim follows more elegantly by the fact that P is, by construction, positive
X
semi-definite.
3.5. RESTRICTED LEAST SQUARES AND HYPOTHESIS TESTING 95
y i = x 1i β1 + x 2i β2 + x 3i β3 + εi
and want to impose the set of restrictions β1 = 1 and β2 = β3 , what you would
do is estimating the constrained version
y i − x 1i = (x 2i + x 3i )β2 + εi , (3.26)
y i = x 1i · 1 + x 2i β2 + x 3i β2 + εi
and then forming β̂ as [1, β̃2 , β̃2 ], where β̃2 is the OLS estimate of equation (3.26).
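A quick numerical check of the equivalence between formula (3.25) and the "estimate the restricted model, then undo the restrictions" route (my own sketch, on simulated data in Python; the book's own scripts are in gretl):

    # Sketch (not from the book): RLS for beta1 = 1 and beta2 = beta3 on a
    # simulated version of model (3.17), computed in two equivalent ways.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, 0.4, 0.4]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ X.T @ y                      # unrestricted OLS

    # (a) formula (3.25), with R*beta = d encoding beta1 = 1 and beta2 - beta3 = 0
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, -1.0]])
    d = np.array([1.0, 0.0])
    A = R @ XtX_inv @ R.T
    btilde_a = bhat - XtX_inv @ R.T @ np.linalg.solve(A, R @ bhat - d)

    # (b) restricted regression (3.26): y - x1 on (x2 + x3), then undo restrictions
    z = X[:, 1] + X[:, 2]
    b2 = np.sum(z * (y - X[:, 0])) / np.sum(z * z)
    btilde_b = np.array([1.0, b2, b2])

    print(btilde_a)
    print(btilde_b)        # the two coincide up to rounding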
Nevertheless, equation (3.25) is useful for proving an important result. Let's define the vector λ as
\[
\lambda = \big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - d);
\]
substituting (3.25) into the definition of the restricted residuals ẽ and noting that X′ê = 0, a little algebra gives, finally,
\[
\tilde e'\tilde e - \hat e'\hat e = (R\hat\beta - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - d). \tag{3.27}
\]
Note that the right-hand side of equation (3.27) is very similar to (3.18). In
fact, if our estimator for σ2 is σ̂2 = ê′ ê/n, we can combine equations (3.18), (3.24)
and (3.27) to write the W statistic as:
\[
W = \frac{(R\hat\beta - d)'\big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - d)}{\hat\sigma^2}
  = n\,\frac{\tilde e'\tilde e - \hat e'\hat e}{\hat e'\hat e}. \tag{3.28}
\]
Therefore, we can compute the same number in two different ways: one im-
plies a rather boring sequence of matrix operations, using only ingredients that
are available after the estimation of the unrestricted model. The other one, in-
stead, requires estimating both models, but at that point 3 scalars (the SSRs and
the number of observations) are sufficient for computing W .
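The numerical equivalence of the two routes in (3.28) is easy to verify; here is a minimal sketch of mine on simulated data (Python, not gretl):

    # Sketch (not from the book): the W statistic of (3.28) computed (a) in matrix
    # form from the unrestricted model only, and (b) from the two SSRs.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 400
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, 0.3, -0.3]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ X.T @ y
    ehat = y - X @ bhat
    sig2 = ehat @ ehat / n

    # restrictions: beta1 = 1, beta2 + beta3 = 0
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])
    d = np.array([1.0, 0.0])

    # (a) matrix form
    diff = R @ bhat - d
    W_a = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / sig2

    # (b) SSR form: restricted model is  y - x1 = (x2 - x3)*beta2 + eps
    z = X[:, 1] - X[:, 2]
    b2 = np.sum(z * (y - X[:, 0])) / np.sum(z * z)
    etil = (y - X[:, 0]) - z * b2
    W_b = n * (etil @ etil - ehat @ ehat) / (ehat @ ehat)

    print(W_a, W_b)   # same number, up to rounding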
test if normality of εi was assumed. Its validity, however, does not depend on this assumption. A
fuller discussion of this point would imply showing that OLS is a maximum likelihood estimator
under normality, which is something I’m not willing to do. See also footnote 19 in Section 3.4.2.
24 The reader may want to verify that alternative formulations of the W and LM statistics are
possible using σ̂2 and σ̃2 , or the R 2 indices from the two models.
An alternative statistic for the same hypothesis, the LM (Lagrange multiplier) statistic, can be computed via a simple recipe:

1. run OLS on the constrained model and compute the residuals ẽ;
2. run OLS on a model where the dependent variable is ẽ and the regressors
are the same as in the unconstrained model;
3. take R 2 from this regression and multiply it by n. What you get is the LM
statistic.
The last step is motivated by the fact that you can write R² as $\frac{y'P_X y}{y'y}$, so in the present case $\frac{\tilde e' P_X \tilde e}{\tilde e'\tilde e}$ is the R² from the auxiliary regression.
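Again, a small sketch of mine (Python) of the recipe above; under the null the statistic is asymptotically χ² with as many degrees of freedom as there are restrictions:

    # Sketch (not from the book): LM statistic via the auxiliary regression
    # "restricted residuals on the unrestricted regressors", LM = n * R^2.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 400
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, 0.3, -0.3]) + rng.normal(size=n)

    # restricted model for beta1 = 1, beta2 + beta3 = 0
    z = X[:, 1] - X[:, 2]
    b2 = np.sum(z * (y - X[:, 0])) / np.sum(z * z)
    etil = (y - X[:, 0]) - z * b2

    # auxiliary regression of etil on all the unrestricted regressors
    g = np.linalg.solve(X.T @ X, X.T @ etil)
    fitted = X @ g
    R2 = (fitted @ fitted) / (etil @ etil)      # uncentred R^2, as in the text
    LM = n * R2
    print(LM)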
Example 3.1
As an example, let’s go back to the house pricing model we used as an example
in Section 3.4. In Section 3.4.1 we already discussed two hypotheses of interest,
namely H0: β1 = 1 (the price elasticity with respect to size is one) and H0: β2 = 0 (the number of baths is irrelevant).
Testing for these two hypotheses separately is easy, via t-tests, which is just what we did a few pages back. As for the joint hypothesis, the easiest thing to do is setting up the restricted model as follows: combine equation (3.19) with β1 = 1
and β2 = 0. The restricted model becomes
p i − s i = β 0 + β 3 a i + β 4 x i + εi . (3.32)
Note the redefinition of the dependent variable: if p_i is the log of the house price and s_i the log of its size, then p_i − s_i is (up to a constant) the log of its price per square foot, or unit price if you prefer. In fact, the hypothesis β1 = 1 implies that if you have two houses (A and B) that are identical on all counts, except that A is twice as big as B, then the price of A should be twice that of B. Therefore, this
hypothesis says implicitly that you can take into account appropriately the size
of the property simply by focusing on its price per square foot, which is what we
do in model (3.32). Estimating (3.32) via OLS gives
Superficially, one may think that our restricted model is much worse than
the unrestricted one, as the R 2 drops from 68.5% to 5.3%. However, this is not a
fair comparison, because in the restricted model the dependent variable is rede-
fined and the denominator of the two R 2 indices is not the same. The SSRs are,
instead, perfectly comparable, and the change between the unrestricted model and the restricted one is 157.88 → 158.12, which looks far less impressive, so
we are drawn to think that the restricted model is not much worse in terms of
fit. We could take this as an indication that our maintained hypothesis is not
particularly at odds with the data.
This argument can be made rigorous by computing the W statistic:
\[
W = 2610\cdot\frac{158.1204 - 157.8844}{157.8844} = 3.9014 ,
\]
a statistic that is smaller than the 5% critical value of a χ²₂ distribution (5.99), so we
accept both hypotheses again (the p-value is about 0.124). This time, however,
the test was performed on the joint hypothesis. It may well happen (examples
are not hard to construct) that you may accept two hypotheses separately but
reject them jointly (the converse should never happen, though).
The LM test, instead, can be computed via an auxiliary regression as follows: take the residuals from model (3.32) and regress them against the explanatory variables of the unrestricted model (3.19). In this case, you get
coefficient std. error t-ratio p-value
---------------------------------------------------------
const -0.0924377 0.0483007 -1.914 0.0558 *
lsize 0.0369564 0.0232121 1.592 0.1115
baths -0.00515142 0.0130688 -0.3942 0.6935
age 8.72115e-05 0.000270997 0.3218 0.7476
pool -0.00901706 0.0226757 -0.3977 0.6909
their work on causal effects, which has been enormously influential, especially
in labour economics. The issue here is not about the statistical properties of β̂ ,
but rather on its interpretation as an estimator of β , so it fits in well at this point
of the book, although the point we pursue here will be discussed in much greater
detail in Chapter 6.
What does β measure? If $\mathrm{E}[y|x] = x'\beta$, then β is simply defined as
\[
\beta = \frac{\partial\,\mathrm{E}[y|x]}{\partial x};
\]
correlation and causation is truly massive. For a quick account, read chapter 3 in the latest best-
seller in econometrics, that is Angrist and Pischke (2008), or simply google for “Exogeneity”.
Note that this problem is not a shortcoming of OLS per se: the job of OLS is to estimate consistently the parameters of the conditional expectation, that is $\partial\,\mathrm{E}[y|x]/\partial x$. If the nature of the problem is such that our parameters of interest β are a different object and we insist on equating them with what OLS returns (thereby giving OLS a misleading interpretation), it's a hermeneutical problem, not a statistical one.
The preferred tool in the econometric tradition for estimating causal effects
is an estimator called Instrumental Variable estimator (or IV for short), but
you’ll have to wait until chapter 6 for it.
3.7 Prediction
Once a model is estimated and we have CAN estimators of β and σ2 , we may
want to answer the following question: if we were given a new datapoint for
which the vector of covariates is known and equal to x̌, what could we say about
the value of the dependent variable y̌ for that new observation?
In order to give a sensible answer, let’s begin by noting a few obvious facts:
of course, y is a random variable, so we cannot predict it exactly. If we knew the
true DGP parameters β and σ² we could say, however, that $\mathrm{E}[\check y|\check x] = \check x'\beta$ and $V[\check y|\check x] = \sigma^2$. If we were willing to entertain the claim that ε is normal, we could even build a confidence interval27 and say that, with 95% probability, y̌ should lie between $\check x'\beta - 1.96\sigma$ and $\check x'\beta + 1.96\sigma$. Since the true parameters are unknown, the obvious choice28 is to predict y̌ via
\[
\hat y = \check x'\hat\beta .
\]
Note, however, that β̂ is a random variable, with its own variance, so the confi-
dence interval around ŷ has to take this into account. Formally, let us define the
prediction error as
\[
\check e = \check y - \hat y = \check x'\beta + \check\varepsilon - \check x'\hat\beta = \check\varepsilon - \check x'(\hat\beta - \beta);
\]
the expression above reveals that our prediction can be wrong for two reasons:
(a) because ε̌ is inherently unpredictable: our model does not contain all the
27 If you need to refresh the notion of confidence interval, go back to the end of section 2.3.2.
28 This may seem obvious, but actually isn’t: this choice is optimal if the loss function we employ
for evaluating prediction is quadratic (see section 1.2). If the loss function was linear, for example,
we’d have to use the median. But let’s just stick to what everybody does.
features that describe the dependent variable, and its variance is a measure of
how bad our model is and (b) our sample is not infinite, and therefore we don’t
observe the DGP parameter β , but only its estimate β̂ .
If ε̌ can be assumed independent of β̂ (as is normally safe to do), then the
variance of the difference is the sum of the variances:
\[
V[\check e] = \sigma^2 + \check x'\,V[\hat\beta]\,\check x .
\]
Of course, when computing this quantity with real data, we replace variances
with their estimates, so we use σ̂2 (or s 2 ) in place of σ2 , and V̂ for V .
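A minimal numerical sketch of the formula above (mine, not the book's; Python on simulated data):

    # Sketch (not from the book): point prediction and 95% interval for a new
    # observation, combining "model uncertainty" (sigma^2) and "parameter
    # uncertainty" (xcheck' V xcheck).
    import numpy as np

    rng = np.random.default_rng(3)
    n = 300
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ X.T @ y
    e = y - X @ bhat
    sig2 = e @ e / n
    V = sig2 * XtX_inv                      # estimated covariance matrix of bhat

    xcheck = np.array([1.0, 0.3, -0.2])     # covariates of the new datapoint
    yhat = xcheck @ bhat
    pred_var = sig2 + xcheck @ V @ xcheck
    half = 1.96 * np.sqrt(pred_var)
    print(yhat - half, yhat + half)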
Example 3.2
Suppose we use the model shown in section 3.4.1 to predict the price for a house
built 5 years ago, with 1500 square feet of living area, 2 baths and no pool. In this
case,
\[
\check x' = [\,1 \;\; 2.708 \;\; 2 \;\; 5 \;\; 0\,]
\]
(the number 2.708 is just log(1500/100)). Plugging x̌ into the estimated model gives ŷ = x̌′β̂ = 11.6395, while the estimated variance of the prediction error is 0.0606711;30 since √0.0606711 ≃ 0.2463, we can even calculate a 95% confidence interval around our prediction as
\[
\hat y \pm 1.96\sqrt{\hat V[\hat y]} = 11.6395 \pm 1.96\times 0.2463 ,
\]
so we could expect that, with a probability of 0.95, the log price of our hypothet-
ical house would be between 11.157 and 12.122, and therefore the price itself
29 Actually, the expectation of the exponential is not the exponential of the expectation, since
the exponential function is everywhere convex (see Section 2.A.1), but details are not important
here.
30 Of course, we could have used the asymptotic version σ̂2 (X′ X)−1 and very little would have
changed.
between $ 70000 and $ 180000 (roughly). You may feel unimpressed by such a
wide range, and I wouldn’t disagree. But on the other hand, consider that this
is a very basic model, which only takes into account very few features of the
property, so it would be foolish to expect it to be razor-sharp when it comes to
prediction.
One last thing: you may have noticed, in the example above, that the vari-
ance of the predictor depends almost entirely on the “model uncertainty” com-
ponent σ̂2 and very little on the “parameter uncertainty” component x̌′V̂ x̌. This
is not surprising, in the light of the fact that, as n → ∞, the latter component
should vanish, since β̂ is consistent. Therefore, in many settings (notably, in
time-series models, that we’ll deal with in Chapter 5), the uncertainty on the
prediction is tacitly assumed to come only from σ2 .
Consider the model
\[
y_i = x_i\beta_1 + z_i\beta_2 + \varepsilon_i ; \tag{3.34}
\]
if we regress $y_i$ on $x_i$ alone, that is if we compute
\[
\hat\beta_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \tag{3.35}
\]
the resulting statistic converges to
\[
\hat\beta_1 \xrightarrow{\;p\;} \beta_1 + \beta_2\,\frac{\mathrm{E}[x_i z_i]}{\mathrm{E}[x_i^2]}\,.
\]
The emphasis that this result, usually taught under the label of "omitted variable bias", receives in traditional textbooks and is handed over to generations of students is a constant source of wonder to me. I'll try to illustrate why, and convince you of my point.
The parameter β1 in (3.34) is defined as the partial effect of x i on the condi-
tional mean of y i on both x i and z i ; that is, the effect of x on y given z. It would
be silly to think that this quantity could be estimated consistently without us-
ing any information on z i .32 The statistic β̂1 , as defined in (3.35) (which does
ignore z i ), is nevertheless a consistent estimator of a different quantity, namely
the partial effect of x i on the conditional mean of y i on x i alone, that is
\[
\mathrm{E}[y_i|x_i] = \beta_1 x_i + \beta_2\,\mathrm{E}[z_i|x_i] .
\]
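Here is a small simulation sketch of the probability limit above (mine, not the book's; Python, with an arbitrary DGP in which x and z are correlated):

    # Sketch (not from the book): the "short" regression coefficient converges to
    # beta1 + beta2 * E[x z] / E[x^2] when z is omitted.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 200_000
    beta1, beta2 = 1.0, 0.5

    x = rng.normal(size=n)
    z = 0.8 * x + rng.normal(size=n)          # correlated with x on purpose
    y = beta1 * x + beta2 * z + rng.normal(size=n)

    b_short = np.sum(x * y) / np.sum(x * x)   # regression of y on x alone, no constant
    plim    = beta1 + beta2 * np.mean(x * z) / np.mean(x * x)
    print(b_short, plim)                      # both close to 1.4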
The objection that some put forward, at this point, is: “OK; but assume that
equation (3.34) is my object of interest, and z i is unobservable, or unavailable.
Surely, you must be aware that the estimate you get by using x i only is bogus.”
Granted. But then, I may reply, do you ever get a real-life case when you observe
all the variables you would like to have in your conditioning set? I don’t think
so; take for example the model presented in section 3.4: in order to set up a truly
complete model, you would have to have data on the state of the building, on
the quality of life in the neighbourhood, on the pleasantness of the view, and
so on. You should always keep in mind that the parameters of your model only
make sense in the context of the observable explanatory variables that you use
for conditioning.33
This doesn’t mean that you should not worry about omitted variable bias at
all. The message to remember is: the quantity we would like to measure (ideally)
is “the effect of x on y all else being equal”; but what we measure by OLS is the
effect of x on y conditional on z. Clearly, in order to interpret our estimate the
way we would like to, z should be as close to “all else” as possible, and if you omit
relevant factors from your analysis (by choice, or impossibility) you have to be
extra careful in interpreting your results.
Example 3.3
I downloaded some data from the World Development Indicators34 website. The
variables I’m using for this example are
32 I should add that if we had an observable variable w_i, which we knew for certain to be uncorrelated with z_i, you could estimate β1 consistently via a technique called instrumental variable estimation, which is the object of Chapter 6.
33 In fact, there is an interesting link between the bias you get from variable omission and the one you get from endogeneity (see section 3.6). Maybe I'll write it down at some point.
34 The World Development Indicator (or WDI for short) is a wonderful database, maintained
by the World Bank, that collects a wide variety of variables for over 200 countries over a large
time span. It is one of the most widely used resources in development economics and is publicly
available at http://wdi.worldbank.org or through DBnomics https://db.nomics.world/.
For each country, I computed the logarithm of the average (between 2014
and 2018) of the available data, which left me with data for 69 countries. The
three resulting variables are called l_gdp, l_hbeds and l_agri. Now consider
Table 3.2, which reports two OLS regressions. In the first one, we regress the
number of hospital beds on the share of GDP from agriculture. As you can see,
the parameter is negative and significant. However, when we add GDP to the
equation, the coefficient of l_agri becomes insignificant (and besides, its sign
changes). On the contrary, you find that GDP matters a lot.
Table 3.2: two OLS regressions (dependent variable: l_hbeds)

                 (1)            (2)
    const      1.090**       −4.876**
              (0.1168)       (1.231)
    l_agri    −0.2916**       0.08467
              (0.05794)      (0.09217)
    l_gdp                     0.5655**
                             (0.1163)
    R̄²         0.2635         0.4496

    (standard errors in parentheses)
The correct interpretation for this result is: there is a significant link between
medical quality (as measured by the number of hospital beds per 1000 inhabi-
tants) and the share of GDP from agriculture. In other words, if you travel to a
country where everybody works in the fields, you’d better not get ill. However,
this fact is simply a by-product of differences between countries in terms of eco-
nomic development.
Once you consider the conditional expectation of l_hbeds on a wider infor-
mation set, which includes GDP per capita,35 the effect disappears. That is, for
a given level of economic development36 there is no visible link between hospi-
tal beds and agriculture. To put it more explicitly: if you compare two countries
35 In the applied economic jargon: “once you control for GDP”.
36 OK, GDP per capita is not a perfect measure of economic development, nor of happiness, nor
of well-being. I know. I know about the Human Development Index. I know about that Latouche
guy. I know about all these things. Just give me a break, will you?
where the agricultural sectors have a different size (say, Singapore and Burundi),
you’re likely to find differences in their health system quality. However, if you
compare two countries with the same per capita GDP (say, Croatia vs Greece, or
Vietnam vs Bolivia) you shouldn’t expect to find any association between agri-
culture and hospital beds.
Does this mean that model (1) is “wrong”? No: it simply means that the two
coefficients in the two models measure two different things: a “gross” effect in
equation (1) and a “net” effect in equation (2).37 Does this mean that model (2)
is preferable? Yes: model (2) gives you a richer picture (see how much larger R̄ 2
is) because it’s based on a larger information set.
3.A Assorted results

3.A.1 Consistency of σ̂²

Since e = M_X ε, we can write σ̂² = e′e/n = ε′M_Xε/n. By the law of large numbers,
\[
\frac{\varepsilon'\varepsilon}{n} = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 \xrightarrow{\;p\;} \mathrm{E}[\varepsilon_i^2] = \sigma^2 ,
\]
and moreover
\[
\frac{\varepsilon' X}{n} \xrightarrow{\;p\;} 0' \qquad\qquad \frac{X'X}{n} \xrightarrow{\;p\;} Q ,
\]
and therefore
\[
\hat\sigma^2 = \frac{\varepsilon' M_X \varepsilon}{n}
  = \frac{\varepsilon'\varepsilon}{n} - \frac{\varepsilon' X}{n}\Big(\frac{X'X}{n}\Big)^{-1}\frac{X'\varepsilon}{n}
  \xrightarrow{\;p\;} \sigma^2 - 0'\,Q^{-1}\cdot 0 = \sigma^2 .
\]
3.A.2 The classical assumptions

The "classical" linear regression model is based on assumptions of the following kind:

1. y = Xβ + ε, where X is a matrix of non-stochastic regressors;
2. ε ∼ N(0, σ²I).

If X is instead treated as random, assumption 2 is stated conditionally, as
\[
\varepsilon\,|\,X \sim N(0, \sigma^2 I),
\]
which in turn implies $\mathrm{E}[\varepsilon_i|X] = \mathrm{E}[\varepsilon_i|x_i] = 0$, and as a consequence $\mathrm{E}[y_i|x_i] = x_i'\beta$.
Finally, note that in the classical world ε is assumed to be Gaussian, while
no assumption of that kind was made in Section 3.3. Normality is necessary to
derive the distribution of hypothesis tests such as the t test or the F test when
the sample size is small; needless to say, this is neither necessary nor desirable
in modern econometrics, where datasets are almost always rather large and the
normality assumption is, at best, questionable. This is why in contemporary
econometrics (and, as a result, in this book) we mainly rely on asymptotic infer-
ence, where Gaussianity is nearly useless.
Another result often invoked in the classical framework is the Gauss–Markov theorem, which concerns estimators that are linear in y, that is of the form β̃ = L′y, and unbiased. The OLS statistic enjoys both properties (L′ being equal to (X′X)−1X′ for OLS), but other statistics may too. The theorem states that, among all such estimators, OLS is the most efficient; this is often condensed in the phrase "OLS is BLUE", where BLUE stands for "Best Linear Unbiased Estimator".
The proof is simple if we concentrate on the case when X is a matrix of fixed
constants and does not contain random variables, because in this case we can
shuffle X in and out of the expectation operator E [·] any way we want. Consider-
ing the case when X contains random variables makes the proof more involved.
Here goes: take a linear estimator β̃ defined as β̃ = L′y. In order for it to be unbiased, the following must hold:
\[
\mathrm{E}[\tilde\beta] = \mathrm{E}[L'(X\beta + \varepsilon)] = L'X\beta + L'\mathrm{E}[\varepsilon] = \beta ,
\]
which requires L′X = I. The variance of such an estimator is
\[
V[\tilde\beta] = \sigma^2 L'L ; \tag{3.36}
\]
again, OLS is just a special case, so the variance of β̂ is easy to compute as
\[
V[\hat\beta] = \sigma^2\big[(X'X)^{-1}X'\big]\big[X(X'X)^{-1}\big] = \sigma^2 (X'X)^{-1}.
\]
The last step is to show that the difference $V[\tilde\beta] - V[\hat\beta]$ is positive semidefinite any time L ≠ (X′X)−1X′, and therefore OLS is more efficient than β̃. This is relatively easy: define D ≡ L′ − (X′X)−1X′, which has to be nonzero unless β̂ = β̃; then D′D must be positive semidefinite (see section 1.A.7). Expanding the product,
\[
D'D = \big[L' - (X'X)^{-1}X'\big]\big[L - X(X'X)^{-1}\big]
    = L'L - (X'X)^{-1}X'L - L'X(X'X)^{-1} + (X'X)^{-1};
\]
since unbiasedness implies L′X = I (and hence X′L = I), this simplifies to D′D = L′L − (X′X)−1, so that
\[
V[\tilde\beta] - V[\hat\beta] = \sigma^2\big[L'L - (X'X)^{-1}\big] = \sigma^2 D'D ,
\]
which is positive semidefinite, as claimed.
The proof above also relies, rather strongly, on homoskedasticity. Moreover, one does not see why the linearity requirement should be important, aside from computational convenience;
and a similar remark holds for unbiasedness, that is nice to have but not really
important if our dataset is of a decent size and we can rely on consistency.
One may see why people insisted so much on the Gauss-Markov theorem
in the early days of econometrics, when samples were small, computers were
rare and statistical methods were borrowed from other disciplines with very few
adjustments. Nowadays, it’s just a nice exercise in matrix algebra.
Given the usual model
\[
y = X\beta + \varepsilon , \tag{3.38}
\]
we may indulge in the following thought exercise: “We have n datapoints, and
we use all the available information to compute all the OLS-related statistics.
But what if we had only n − 1? We could pretend that the value of y n was un-
available. What happens if we compute β̂ only using the first n − 1 datapoints?
How different would it be from its full-sample equivalent? And if we used β̂ and
xn to predict y n , what should we expect?”. As we will see, pursuing this idea will
lead us to developing useful tools for identifying influential observations and
testing the specification of our model.
Suppose we have n observations but we leave the i -th one aside, and in-
troduce the following convention: the “(−i )” index means “excluding the i -th
observation”; hence, X(−i ) is a (n − 1) × k matrix, equal to X with the i -th row
dropped, and the same interpretation holds for y(−i ) .38 The reason why we may
want to do this is to check what happens to our model if a certain observation
had not been available. There are several insights we can gain from doing so.
In order to perform the necessary calculations, it is useful to consider a model
where you add to X a dummy variable identifying the i -th observation, that is an
additional column d, containing all zeros save for the i -th row, that contains 1.
In practice, our model becomes
\[
y = X\beta + d\lambda + \varepsilon = W\gamma + \varepsilon , \tag{3.39}
\]
where W = [X  d] and γ = [β′, λ]′. For example, if i = n, d would be a vector of zeros with one 1 at the bottom, and in matrix form the model would look like this:
\[
y = \begin{bmatrix} y_{(-i)} \\ y_i \end{bmatrix} \qquad
W = \begin{bmatrix} X_{(-i)} & 0 \\ x_i' & 1 \end{bmatrix} \qquad
\gamma = \begin{bmatrix} \beta \\ \lambda \end{bmatrix}.
\]
Clearly, our original model (3.38) is just the special case λ = 0. Here are a few useful results:39
\[
d' M_X d = m_i \qquad\qquad d' M_X y = d'\tilde e = \tilde e_i ,
\]
where ẽ are the OLS residuals for the full-sample model, that is equation (3.38);
m i is the i -th element on the diagonal of MX , that is 1 − x′i (X′ X)−1 xi . Let’s also
define h i = 1 − m i = x′i (X′ X)−1 xi , the i -th element on the diagonal of PX . It can
be proven that 0 ≤ m i ≤ 1, so that the same holds for h i too.40
Some readers may find the choice of symbols surprising: if the m_i values are the diagonal of M_X, then it would have been natural to use p_i for the diagonal of P_X. The reason for using h_i instead comes from calling P_X the "hat matrix" (see section 1.4.1).

38 Note: this is not standard notation. I adopted it just for this section.
39 These are easy to prove, and provide a nice exercise on matrix algebra. Hint: start by comput-
The Frisch-Waugh theorem (see section 1.4.4) makes it easy to compute the
OLS estimates for model (3.39):
\[
\hat\beta = (X' M_d X)^{-1} X' M_d y = \big(X_{(-i)}' X_{(-i)}\big)^{-1} X_{(-i)}' y_{(-i)}
\]
\[
\hat\lambda = (d' M_X d)^{-1} d' M_X y = \frac{\tilde e_i}{m_i}
\]
The β̂ vector is nothing but the OLS statistic you would have found after dropping the i-th observation. The λ̂ parameter is more interesting: let's begin by considering the residuals of (3.39), ê = M_W y. The i-th element, ê_i, is defined as
\[
\hat e_i = y_i - x_i'\hat\beta - \hat\lambda .
\]
This quantity is identically 0; to see why, note that d is an extraction vector (see section 3.3.1), so you can write
\[
\hat e_i = d'\hat e = d' M_W y = 0
\]
and therefore
\[
\hat\lambda = y_i - x_i'\hat\beta ,
\]
which, in turn, means that λ̂ is the prediction error you get if you try to pre-
dict the i -th observation by using all the other ones. Or, put another way, if you
want to compute what the prediction error for y i would be (based on the re-
maining observations) all you have to do is stick an appropriate dummy into
your model and take its coefficient. Note that in fact there is an even easier way:
since λ̂ = ẽ_i / m_i, you may just as well run OLS on the full-sample model (3.38), save
its residuals and the m i series, and divide one by the other.
The cross-validation criterion is a model selection tool that is based on just
that: you simulate the out-of-sample performance of your model by adding the
squares of the n prediction errors you find by omitting each observation in turn:
\[
CV = \sum_{i=1}^{n} e_{(-i)}^2 = \sum_{i=1}^{n}\Big(\frac{\tilde e_i}{m_i}\Big)^2 .
\]
When you compare two models (say, A and B), it may well happen that model A
yields a smaller sum of squared residuals than B, but B outperforms A in terms of
the cross-validation criterion. Usually, this happens when A has a richer struc-
ture than B (in the OLS context, more regressors); in these cases, the canonical
interpretation is that A is only apparently a better model than B: some of the
apparently significant regressors catch in fact spurious regularities than cannot
be expected to hold in general. In these cases, the term we customarily use is
overfitting.
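As a sketch (mine, in Python) of how the leave-one-out criterion can be computed without actually re-running n regressions, using the residuals and the diagonal of M_X as described above:

    # Sketch (not from the book): leave-one-out cross-validation for OLS via
    # CV = sum( (e_i / m_i)^2 ), where m_i = 1 - h_i and h_i is the leverage.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)              # full-sample residuals
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverages, diagonal of P_X
    cv_fast = np.sum((e / (1.0 - h)) ** 2)

    # brute-force check: drop each observation in turn and predict it
    cv_slow = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        b = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
        cv_slow += (y[i] - X[i] @ b) ** 2

    print(cv_fast, cv_slow)   # identical up to rounding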
Data scientists are inordinately fond of the cross-validation concept, and in many cases they use sophisticated variations of this idea to pick the best forecasting model for a given problem. The variant of the cross-validation method I just illustrated, where you exclude one observation at a time, has a lot in common with an old and established statistical technique called jackknifing, which in turn is a close relative of bootstrapping (see Section 4.A.4). In statistical learning and similar disciplines, the approach presented here is often generalised by excluding entire subsets of the dataset instead of one observation only, possibly choosing them in very elaborate ways. They call this folding.
In the light of the discussion above, there is something interesting we can say
on the interpretation of the magnitude m i and its complement to 1, h i = 1 − m i :
from the definition of ê we have
\[
M_X y = \tilde e = M_X d\,\hat\lambda + \hat e .
\]
Therefore, since $d' M_X \hat e = d'\hat e = 0$,
\[
\tilde e'\tilde e = d' M_X d\,\hat\lambda^2 + \hat e'\hat e ,
\]
so, finally,
\[
\tilde e'\tilde e - \hat e'\hat e = \tilde e_i^2 / m_i .
\]
Which means: if you compare the SSR for the two models you get by using the
full sample or omitting the i -th observation, you get that their difference is al-
ways non-negative, and equals ẽ i2 /m i . Clearly, if the difference is large, the re-
sults you get by adding/removing the i -th observation are dramatically different,
so that data point deserves special attention.
The quantity ẽ i2 /m i may be large either because (a) ẽ i is large in absolute
value and/or (b) m i is close to 0 (which implies that h i is close to 1). This is why
the h i values are sometimes used as descriptive statistics to check for “influen-
tial observations”, and are sometimes referred to as leverage values. Note that
h i only depends on the regressors X, and not on y. Therefore, large values of
h i indicate observations for which the combination of explanatory variables we
have is uncommon enough to exert a substantial effect on the final estimates.
3.A.5 Derivation of the RLS estimator

The standard tool for constrained optimisation problems is the Lagrangean:41 to optimise a function subject to a set of equality constraints, you form a new function (the Lagrangean) by adding to the original objective the constraints, each multiplied by an auxiliary variable. The elements of the vector λ of these auxiliary variables are known as "Lagrange multipliers".

For example, the classic microeconomic problem of a utility-maximising consumer is represented as
\[
\mathcal{L}(x, \lambda) = U(x) + \lambda\cdot\big(Y - p'x\big)
\]
where x is the bundle of goods, U (·) is the utility function, Y is disposable in-
come and p is the vector of prices. In this example, the only constraint you have
is the budget constraint, so λ is a scalar.
The solution has to obey two conditions, known as the “first order condi-
tions”:
\[
\frac{\partial \mathcal{L}}{\partial x} = 0 \qquad\qquad \frac{\partial \mathcal{L}}{\partial \lambda} = 0 , \tag{3.40}
\]
so in practice you differentiate the Lagrangean with respect to your variables
and λ, and then check if there are any solutions to the system of equations you
41 One I especially like is Dixit (1990), but for a nice introductory treatment I find Dadkhah (2011)
hard to beat.
get by setting the partial derivatives to 0. If the solution is unique, you’re all
set. In the utility function example, applying equations (3.40) gives the standard
microeconomic textbook solution to the problem:
\[
\frac{\partial U(x)}{\partial x} = \lambda p \qquad\qquad Y = p'x ;
\]
in words: at the maximum, (a) marginal utilities are proportional to prices (or $\frac{\partial U/\partial x_i}{\partial U/\partial x_j} = \frac{p_i}{p_j}$, if you prefer) and (b) you should spend all your income.
In the case of RLS, the Lagrangean is42
\[
\mathcal{L}(\beta, \lambda) = \frac{1}{2}\,e'e + \lambda'(R\beta - d).
\]
The derivative with respect to β is
\[
\frac{\partial \mathcal{L}}{\partial \beta} = -X'e + R'\lambda ,
\]
and setting it to zero yields
\[
X'\tilde e = R'\lambda , \tag{3.41}
\]
where ẽ is the vector that satisfies equation (3.41), defined as y − Xβ̃. By premultiplying (3.41) by (X′X)−1 we get
\[
\tilde\beta = \hat\beta - (X'X)^{-1}R'\lambda . \tag{3.42}
\]
So the constrained solution β̃ can be expressed as the OLS vector β̂ , plus a “cor-
rection factor”, proportional to λ. If we premultiply (3.42) by R we get
\[
\lambda = \big[R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - d); \tag{3.43}
\]
from this equation, it is quite easy to see that β̃ is an affine function of β̂ ; this
will be quite useful. Now define H as
\[
H = \mathrm{plim}\Big( (X'X)^{-1}R'\big[R(X'X)^{-1}R'\big]^{-1} \Big) = Q^{-1}R'\big[RQ^{-1}R'\big]^{-1} . \tag{3.44}
\]
Using these, we will prove that the asymptotic variance of β̃ is smaller (in a matrix sense) than that of β̂. Since β̃ is an affine function of β̂,
\[
AV[\tilde\beta] = (I - HR)\cdot AV[\hat\beta]\cdot (I - R'H') ,
\]
where $AV[\hat\beta] = \sigma^2 Q^{-1}$. Two useful facts about the matrix H are that
\[
HRQ^{-1} = Q^{-1}R'\big[RQ^{-1}R'\big]^{-1}RQ^{-1} , \tag{3.45}
\]
which is a symmetric matrix (so that $HRQ^{-1} = Q^{-1}R'H'$), and that HR is idempotent, so that
\[
HRQ^{-1} = HR\,HRQ^{-1} = HRQ^{-1}R'H' .
\]
Putting the pieces together,
\[
AV[\tilde\beta] = \sigma^2\big\{ Q^{-1} - HRQ^{-1} - Q^{-1}R'H' + HRQ^{-1}R'H' \big\}
             = \sigma^2\big[ Q^{-1} - HRQ^{-1} \big]
             = AV[\hat\beta] - \sigma^2\,HRQ^{-1} ;
\]
the last thing we need to prove is that $HRQ^{-1}$ is also positive semi-definite: for
this, we’ll use the right-hand side of (3.45).
Since Q is pd, $Q^{-1}$ is pd as well (property 1); therefore, $RQ^{-1}R'$ is also pd (property 2), and so is $\big[RQ^{-1}R'\big]^{-1}$ (property 1). Finally, by using property 1 again, we find that $Q^{-1}R'\big[RQ^{-1}R'\big]^{-1}RQ^{-1}$ is positive semi-definite and the result follows.
Chapter 4

Diagnostic testing in cross-sections
The inferential procedures described in the previous chapter rest on a set of assumptions, which can be summarised roughly as follows:

1. the data we observe are realisations of random variables such that it makes sense to assume that we are observing the same DGP in all the n cases in our dataset; or, more succinctly, there are no structural breaks in our dataset;

2. the observations are independent and the sample is large enough for asymptotic results to provide a reasonable approximation;

3. the conditional mean of the dependent variable is a linear function of the explanatory variables: $\mathrm{E}[y|x] = x'\beta$;

4. the conditional variance $V[y|x]$ exists and does not depend on x at all, so it's a positive constant: $V[y|x] = \sigma^2$.
Assumption number 2 may be inappropriate for two reasons: one is that our
sample size is too small to justify asymptotic results as a reasonable approxima-
tion to the actual properties of our statistics; the other one is that our observations may not be identically distributed, nor independent. The first case cannot really be tested
formally; in most cases, the data we have are given and economists almost never
enjoy the privileges of experimenters, who can have as many data points as they
want (of course, given sufficient resources). Therefore, we just assume that our
dataset is good enough for our purposes, and hope for the best. Certainly, intel-
lectual honesty dictates that we should be quite wary of drawing conclusions on
the basis of few data points, but there is not much more we can do. As for the
second problem, we will defer the possible lack of independence to chapter 5,
since the issue is most likely to arise with time-series data.
In the next section, we will consider a way of testing assumptions 1 (to some
extent) and 3. If they fail, consistency of β̂ may be at risk. Conversely, assump-
tion number 4 is crucial for our hypothesis testing apparatus, and will need
some extra tools; this will be the object of section 4.2.
4.1 Diagnostics for the conditional mean

The assumption we want to check here is that the same linear conditional mean $\mathrm{E}[y_i|x_i] = x_i'\beta$ holds for all the observations in our sample. But this statement may be false. We will not explore the problem in its full gen-
erality: we’ll just focus on two possible issues that often arise in practice.
1. The conditional mean is not, in fact, a linear function of $x_i$: the true relationship contains nonlinear terms that our specification omits.

2. Our data comprise observations for which the DGP is partly different. That is, we have j = 1 ... m separate sub-populations, for which
\[
\mathrm{E}[y_i|x_i] = x_i'\beta_j
\]
where j is the class that observation i belongs to. For example, we have
data on European and American firms, and the vector β is different on the
two sides of the Atlantic (in this case, m = 2).
Let us start from the first issue with a stylised example: suppose the true conditional mean is quadratic,
\[
\mathrm{E}[y_i|x_i] = \beta_1 x_i + \beta_2 x_i^2
\]
(to ease exposition, I am assuming here that the conditional mean has no constant term). Suppose that the expression above holds, but in the model we estimate the quadratic term $x_i^2$ is dropped. That is, we estimate a model like
\[
y_i = \gamma x_i + u_i ;
\]
Clearly, there is no value of γ that can make u i the difference between y i and
E y i |x i , so we can’t expect γ̂ to have all the nice asymptotic properties that OLS
£ ¤
has. In fact, it can be proven that (in standard cases) the statistic γ̂ does have a
limit in probability, but the number it tends to is neither β1 nor a simple function
of it, so technically there is no way we can use γ̂ to estimate β1 consistently.
The limit in probability of γ̂ is technically known as a pseudo-true value, which is far too complex a concept for me to attempt an exposition here. The inquisitive reader may want to have a look at Cameron and Trivedi (2005), section 4.7 or (more technical) Gourieroux and Monfort (1995). The ultimate bible on this is White (1994).
In the present case, the remedy is elementary: add x i2 to the list of your re-
gressors and, voilà, you get perfectly good CAN estimates of β1 and β2 .1 How-
ever, in a real-life case, where you have a vector of explanatory variables xi ,
things are not so simple. In order to have quadratic effects, you should include
all possible cross-products between regressors. For example, a model like
y i = β 0 + β 1 x i + β 2 z i + εi
would become
\[
y_i = \underbrace{\beta_0 + \beta_1 x_i + \beta_2 z_i}_{\text{linear part}}
    + \underbrace{\beta_3 x_i^2 + \beta_4 x_i z_i + \beta_5 z_i^2}_{\text{quadratic part}} + \varepsilon_i
\]
and it’s very easy to show that the number of quadratic terms becomes rapidly
unmanageable for a realistic model: if the original model has k regressors the
quadratic one can have up to k(k+1) 2 additional terms.2 . I don’t think I have to
warn the reader on how much of a headache it would be to incorporate cubic or
quartic terms.
The RESET test (stands for REgression Specification Error Test) is a way to
check whether a given specification needs additional nonlinear effects or not.
The intuition is simple and powerful: instead of augmenting our model with all
the possible order 2 terms (squares and cross-products), we just use the square
of the fitted values, that is instead of x i2 , x i · z i and z i2 in the example above, we
would use
would use
\[
\hat y_i^2 = \big(\hat\beta_1 x_i + \hat\beta_2 z_i\big)^2 .
\]
In practice, the test can be computed via the following recipe:

1. estimate the model by OLS, saving the residuals $e_i$ and the fitted values $\hat y_i$;
2. compute the powers $\hat y_i^2$ and $\hat y_i^3$;
3. run the auxiliary regression
\[
e_i = \gamma' x_i + \delta_1 \hat y_i^2 + \delta_2 \hat y_i^3 + u_i ;
\]
4. compute $LM = n\cdot R^2 \overset{a}{\sim} \chi^2_2$.
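A minimal sketch of mine of the recipe (Python on simulated data, not the book's gretl scripts):

    # Sketch (not from the book): RESET test for neglected nonlinearity,
    # LM = n * R^2 from the auxiliary regression of the residuals on the
    # original regressors plus squared and cubed fitted values.
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(6)
    n = 500
    x = rng.normal(size=n)
    y = 0.5 + 1.0 * x + 0.8 * x**2 + rng.normal(size=n)    # true DGP is quadratic

    X = np.column_stack([np.ones(n), x])                    # (mis)specified linear model
    b = np.linalg.solve(X.T @ X, X.T @ y)
    yhat = X @ b
    e = y - yhat

    Z = np.column_stack([X, yhat**2, yhat**3])              # auxiliary regressors
    g = np.linalg.solve(Z.T @ Z, Z.T @ e)
    efit = Z @ g
    R2 = 1 - np.sum((e - efit)**2) / np.sum((e - e.mean())**2)
    LM = n * R2
    print(LM, chi2.sf(LM, 2))    # large LM, tiny p-value: nonlinearity detected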
Example 4.1
Let us compute the RESET test to check the hedonic model used as an exam-
ple in Section 3.4 for possible neglected nonlinearities; the auxiliary regression
yields
Let us now consider the second issue: suppose the effect of $x_i$ on $y_i$ may differ between two subsamples, identified by a dummy variable $d_i$ (equal to 1 for observations in subsample 2 and 0 otherwise):

    Subsample 1:  y_i = β0 + β1 x_i + ε_i
    Subsample 2:  y_i = β0 + β2 x_i + ε_i

The two equations can be written jointly as
\[
y_i = \beta_0 + \big[\beta_1 + (\beta_2 - \beta_1)d_i\big]x_i + \varepsilon_i
    = \beta_0 + \beta_1 x_i + \gamma\,d_i x_i + \varepsilon_i
    = \beta_0 + \beta_1 x_i + \gamma z_i + \varepsilon_i \tag{4.1}
\]
where γ = β2 −β1 . Again, note that model (4.1) is perfectly fit for OLS estimation,
since the product z i = d i · x i is just another observable variable, which happens
to be equal to x i when d i = 1 and 0 otherwise.
If the effect of x i on y i is in fact homogeneous across the two categories,
then γ = 0; therefore, testing for the equality of β1 and β2 is easy: all you need to
do is check whether the regressor z i is significant. Explanatory variables of this
kind, that you obtain by multiplying a regressor by a dummy variable, are of-
ten called interactions in the applied economics jargon. If the interaction term
turns out to be significant, then the effect of x on y is different across the two
subcategories, since the interaction term in your model measures how different
the effect is across the two subgroups.
Clearly, you can interact as many regressors as you want: in the example
above, you could also imagine that the intercept could be different across the
two subpopulations as well, so the model would become something like
\[
y_i = \beta_0 + \beta_1 x_i + \gamma_0 d_i + \gamma_1 d_i x_i + \varepsilon_i ,
\]
so that the marginal effect of $x_i$ on the conditional mean is
\[
\frac{\partial\,\mathrm{E}[y_i|x_i]}{\partial x_i} = \beta_1 + \gamma_1 d_i ,
\]
that is, β1 if d_i = 0 and β1 + γ1 if d_i = 1.
When you interact all the parameters by a dummy, then the test for equality
of coefficients across the two subsamples is particularly simple, and amounts to
what is known as the Chow test, since the SSR for the unrestricted model (that is,
the one with all the interactions) is just the sum of the two separate regressions:
if you have two subgroups, you can compute
1. the SSR for the OLS model on the whole sample (call it S T );
2. the SSR for the OLS model using only the data in subsample 1 (call it S 1 );
3. the SSR for the OLS model using only the data in subsample 2 (call it S 2 ).
and then compute
\[
W = n\cdot\frac{S_T - (S_1 + S_2)}{S_1 + S_2} \tag{4.2}
\]
because the SSR for the model with all the interactions is equal to the sum of the
SSRs for the separate submodels. Of course, the appropriate number of degrees
of freedom to use for the p-value would be k, the difference between the number
of parameters in the unrestricted model (k + k) and those in the restricted one
(k). The proof is contained in section 4.A.1, where I also generalise this idea to
the case when you have more than 2 subsamples.
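A compact sketch of mine of the computation (Python; the same number would come out of estimating the fully interacted model):

    # Sketch (not from the book): Chow-type test via equation (4.2),
    # W = n * (S_T - (S_1 + S_2)) / (S_1 + S_2), asymptotically chi2 with k dof.
    import numpy as np
    from scipy.stats import chi2

    def ssr(X, y):
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        return e @ e

    rng = np.random.default_rng(7)
    n = 600
    d = (rng.uniform(size=n) < 0.5)                 # subsample indicator
    x = rng.normal(size=n)
    y = 1.0 + np.where(d, 1.5, 0.8) * x + rng.normal(size=n)   # slope differs on purpose

    X = np.column_stack([np.ones(n), x])            # k = 2 parameters
    S_T = ssr(X, y)
    S_1 = ssr(X[~d], y[~d])
    S_2 = ssr(X[d], y[d])

    W = n * (S_T - (S_1 + S_2)) / (S_1 + S_2)
    print(W, chi2.sf(W, X.shape[1]))                # large W, tiny p-value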
Figure 4.1: Rented bikes (thousands) against air temperature (°C), with the fitted quadratic regression Y = −1.90 + 0.522X − 0.00896X².
Once you fit a quadratic model to the data, things appear to be basically OK: R² is not at all bad, and the estimated regression line makes perfect sense, with
the negative concavity indicating that most people want to ride a bike when the
weather is warm, but not too hot.
Table 4.1: bike rentals and temperature. Columns: (1) full sample, (2) sunny days, (3) cloudy/rainy days, (4) full sample with interactions (t-ratios in parentheses; only the rows that survive in this extract are shown).

                 (1)         (2)         (3)          (4)
    d                                             −1.989**
                                                  (−2.494)
    d·x                                            0.3398***
                                                   (3.876)
    d·x²                                          −0.008712***
                                                  (−3.940)
    n            731         463         268          731
    R²           0.4532      0.5222      0.3965       0.5115
    SSR          1498.035    779.7312    558.5052     1338.2364
However, we may surmise that what happens on sunny days may be differ-
ent from rainy days. Fortunately, we also have the dummy variable d i , which
equals 1 if the weather on that day was sunny and 0 if it was cloudy or rainy.
Splitting the sample in two gives the estimates in Table 4.1: the first column
gives the estimates on the full sample (the same as in Figure 4.1). Column 2, in-
stead, contains the estimates obtained using only the sunny days and column 3
only the ones for the bad weather days.
As you can see, the estimates for sunny days are numerically different from
the ones for cloudy days. For example, the quadratic effect in column 3 seems to
be much less significant than the one in column 2. However, the real question
is: are they statistically different? Or, in other words: is there a reason to believe
that the relationship between the number of rented bikes and air temperature
depends on the weather?
In order to do so, we can run a Chow test. The mechanical way to do this
would be adding to the base model all the interactions with the “sunny” dummy.
The corresponding estimates are found in column 4 of Table 4.1. Note that the
first three coefficients in column 4 are exactly equal to those in column 3,3 and
that the coefficients in column 2 can be obtained by summing the corresponding pairs of coefficients in column 4 (base effect plus interaction).
3 The standard errors are not: this is a side effect of the fact that model 3 and model 4 use
different estimators for σ2 and hence the two coefficient covariance matrices are different.
Historically, the Chow test has mostly been used with time-series data, where
each row of the dataset refers to a certain time period and the rows are consec-
utive. For example, data on the economy of a certain country (GDP, interest rate
etc.) in which each row refers to a quarter, so for example the dataset starts
in 1980q1, the next row is 1980q2, and so on. Regressions on data of this kind
present the user with special issues, that have to be analysed separately, and we
will do so in Chapter 5. However, it can be seen rather easily that the Chow test
lends itself very naturally to testing whether a model remains stable before and
after a certain event: just imagine that in equation (4.1) d i equals 0 up to a cer-
tain point in time and d i = 1 after that. It is for this reason that the Chow test
is sometimes referred to as the structural stability test. Rejection of the Chow
test would in this case point to something we economists often call structural
break or regime change, obvious examples being the introduction of the single
currency in the Euro Area, the COVID pandemic, etc.
In a time series context, assuming that the putative date for the break is known a priori may be unwarranted. In some cases, we may suppose that a structural break has occurred at some point, without knowing exactly when. For these situations, some clever test procedures are available (one is the so-called CUSUM test), as well as methods for estimating the timing of the break. These, however, are too advanced for this book, since they require a fairly sophisticated inferential apparatus.
4.2 Heteroskedasticity and its consequences

Let us now turn to assumption 4, and consider the case in which the conditional variance is some function of the explanatory variables,
\[
\sigma_i^2 = V[y_i|x_i] = h(x_i),
\]
where the function h(·) is of unknown form (but certainly non-linear, since σ²_i
can never be negative). Since the variances σ2i may be different across observa-
tions, we use the term heteroskedasticity.
The reader may recall (see page 82) that this function is known as the “skedas-
tic” function, and in principle one could try to carry out an inferential analysis of
the h(xi ) function very much like we do with the regression function. However,
in this section we will keep to the highest level of generality and simply allow
for the possibility that the sequence σ21 , σ22 · · · , σ2n contains potentially different
numbers, without committing to a specific formula for h(xi ).
Contrary to what many people think, heteroskedasticity is not a property of the data, but only of the model we use, since it depends on the conditioning set you use. For example, assume that E[y_i|x_i] is a constant, so the linear model we would use is y_i = β0 + ε_i, but the variance of ε_i depends on x_i (for example, σ²_i = x_i²). If you estimate a model in which the only regressor is the constant, the model is homoskedastic. If, however, you estimate the model y_i = β0 + β1 x_i + ε_i, which is a perfectly valid representation of the data, since the true value of β1 is 0, then the model becomes heteroskedastic, since the variance of ε_i is a function of the explanatory variables. Having said this, it is very common for applied economists to say "the data are heteroskedastic" when you can't get rid of heteroskedasticity in any meaningful model you may think of.
In matrix notation, heteroskedasticity means that the covariance matrix of the disturbance vector is no longer a scalar matrix, but rather
\[
V[\varepsilon] = \mathrm{E}[\varepsilon\varepsilon'] = \Sigma ,
\]
where Σ is a diagonal matrix whose diagonal elements are the variances σ²_i.4 For the model
\[
y = X\beta + \varepsilon ,
\]
if Σ were known, one could compute the estimator
\[
\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y ,
\]
which is called GLS. It can be proven that GLS is more efficient than OLS (the proof is in subsection 4.A.2), and that its covariance matrix equals
\[
V[\hat\beta_{GLS}] = (X'\Sigma^{-1}X)^{-1} .
\]
Since GLS is just OLS on suitably transformed variables, all standard properties of OLS in the homoskedastic case remain valid, so for example you could test hypotheses by the usual techniques.6
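A small sketch of mine of GLS as "OLS on transformed variables" (Python, with a known diagonal Σ, which is of course the unrealistic part):

    # Sketch (not from the book): with a known diagonal Sigma, GLS equals OLS
    # on variables divided by sigma_i (weighted least squares).
    import numpy as np

    rng = np.random.default_rng(8)
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    sigma_i = 0.5 + np.abs(X[:, 1])               # known skedastic function (illustrative)
    y = X @ np.array([1.0, 2.0]) + sigma_i * rng.normal(size=n)

    # GLS via the matrix formula ...
    Sigma_inv = np.diag(1.0 / sigma_i**2)
    b_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

    # ... and via OLS on the transformed (weighted) variables
    Xw, yw = X / sigma_i[:, None], y / sigma_i
    b_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

    print(b_gls, b_wls)    # identical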
Some readers may find it intriguing to know that GLS has more or less the geometrical interpretation of OLS that I described in Section 1.4, once a more general definition of "distance" is adopted. GLS arises if ordinary Euclidean distance is generalised to
\[
d(x_1, x_2) = \sqrt{(x_1 - x_2)'\Sigma^{-1}(x_1 - x_2)} ,
\]
which obviously becomes Euclidean distance if Σ = I. (The fact that OLS equals GLS if Σ is a scalar multiple of I is a trivial consequence.) You can apply all the usual concepts of projections etc., with the only difference that the space you're considering is somewhat "distorted".
4 In fact, some of our considerations carry over to more general cases, in which Σ is a generic matrix; all that is required is that Σ is a proper covariance matrix, that is, symmetric and positive definite.
Of course, in ordinary circumstances Σ is unknown, but we could use this
idea to explore alternative avenues:
1. In some cases, you may have reason to believe that $\sigma_i^2$ should be roughly proportional to some observable variable. For example, if $y_i$ is an average of some sampled values and $n_i$ is the size of the $i$-th sample, it would be rather natural to conjecture that $\sigma_i^2 \simeq K n_i^{-1}$, where $K$ is some constant. Therefore, by dividing all the observables by $\sqrt{n_i^{-1}}$ you get an equivalent representation of the model, in which heteroskedasticity is less likely to be a problem, since in the transformed model all variances should be roughly equal to $K$. The resulting estimator is sometimes called WLS (for Weighted Least Squares), because you "weight" each observation by an observable quantity $w_i$; in our example, $w_i = 1/\sqrt{n_i}$ (a minimal gretl sketch of this transformation appears right after this list).
2. The idea above can be generalised: one could try to reformulate the model
in such a way that the heteroskedasticity problem might be attenuated.
For example, it is often the case that, rather than a model like
$$Y_i = \alpha_0 + \alpha_1 X_i + u_i,$$
a specification in logarithms, such as
$$\ln Y_i = \beta_0 + \beta_1 \ln X_i + \varepsilon_i,$$
is preferable, because the log transformation often attenuates heteroskedasticity considerably.
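As a concrete illustration of point 1, here is a minimal hansl (gretl) sketch; the variable names y, x and ni are hypothetical, with ni assumed to hold the size of the sample from which each observation was averaged.

    # WLS "by hand": divide every observable by w_i = 1/sqrt(ni), then run OLS
    series w  = 1/sqrt(ni)
    series ys = y/w            # same as y*sqrt(ni)
    series os = 1/w            # the transformed intercept (no longer constant)
    series xs = x/w
    ols ys os xs               # no automatic constant here: "os" plays that role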
for hypothesis testing yields statistics that are not asymptotically χ2 -distributed,
so all our p-values would be wrong. On the other hand, if we could use the
correct variance for OLS (given in equation (4.4) that I’m reporting here for your
convenience)
$$V[\hat\beta] = (X'X)^{-1}(X'\Sigma X)(X'X)^{-1},$$
the difference between the OLS residuals e i and the disturbances εi should be
“small” in large samples, and likewise for their squares.
7 In fact, it’s a symmetric matrix, so the number of its distinct elements is k(k + 1)/2.
and by a similar argument to that used in Section 3.1, you get that
$$\varepsilon_i^2 = \sigma_i^2 + \eta_i,$$
where $\eta_i$ is a zero-mean disturbance, so that we could estimate $\sigma_i^2$ by $e_i^2 - \eta_i$. If you substitute this into (4.7), you get
$$X'\Sigma X \simeq \sum_{i=1}^n e_i^2\, x_i x_i' - \sum_{i=1}^n \eta_i\, x_i x_i',$$
and the second term is asymptotically negligible, since
$$\frac{1}{n}\sum_{i=1}^n (e_i^2 - \sigma_i^2)\, x_i x_i' \stackrel{p}{\longrightarrow} [0].$$
Therefore, asymptotically you can estimate $V[\hat\beta]$ via
$$\tilde V = (X'X)^{-1}\left(\sum_{i=1}^n e_i^2\, x_i x_i'\right)(X'X)^{-1}. \qquad (4.8)$$
In fact, many variants have been proposed since White’s 1980 paper, that
seem to have better performance in finite samples, and most packages use one
of the later solutions. The principle they are based on, however, is the original
one.
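As a quick check of formula (4.8), here is a minimal hansl sketch that computes the sandwich estimator by hand and compares it with gretl's built-in robust option; the variable names y, x1, x2 are hypothetical, and the line "set hc_version 0" selects White's original formula.

    set hc_version 0
    ols y const x1 x2 --robust
    matrix V_gretl = $vcv
    # the same thing by hand, as in equation (4.8)
    matrix X = {const, x1, x2}
    matrix e = {$uhat}
    matrix XX = inv(X'X)
    matrix meat = zeros(cols(X), cols(X))
    loop i=1..rows(X)
        matrix xi = X[i,]'
        meat += e[i]^2 * xi*xi'        # e_i^2 * x_i x_i'
    endloop
    matrix V_hand = XX * meat * XX
    print V_gretl V_hand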
A clever variation on the same principle, which goes under the name of cluster-robust estimation, has become very fashionable in recent years. I'm not going to describe it in this book, but you should be aware that in some circles you will
8 It's easy, really: $E[\eta_i|x_i] = 0$ means that $E[\eta_i\, x_i x_i'] = E\left[E[\eta_i|x_i]\, x_i x_i'\right] = [0]$.
be treated like the village idiot if you don’t use “clustering”. In some cases, peo-
ple do this just because it’s cool and trendy. In some contexts, however, cluster-
robust inference is quite appropriate and should be considered as a very use-
ful tool; for example, with panel datasets, which I’ll describe in Chapter 7, giv-
ing a summary treatment of clustering in Section 7.3.4. For more details, read
Cameron and Miller (2010), Cameron and Miller (2015) and MacKinnon et al.
(2023).
An even more radical solution for dealing with heteroskedasticity has be-
come quite popular over the recent past because of the enormous advancement
of our computing capabilities: it’s called the bootstrap. In many respects, the
bootstrap is a very ingenious solution for performing inference with estimators
whose covariance matrix could be unreliable, for various reasons. In a book like
this, giving a full account of the bootstrap is far too ambitious a task, and I’ll just
give you a cursory description in section 4.A.4. Nevertheless, the reader ought
to be aware that “bootstrapped standard errors” are becoming more and more
widely used in the applied literature.
Table 4.2: Example: house prices in the US (with robust standard errors)
Example 4.3
The hedonic model presented in section 3.4 was re-estimated with robust stan-
dard errors, and the results are shown in Table 4.2.
As the reader can check, all the figures in Table 4.2 are exactly the same as
those in Table 3.1, except for those that depend on the covariance matrix of
the parameters: the standard errors (and therefore, the t -statistics and their p
values) and the overall specification test. In this case, I instructed gretl to use
White’s original formula, but this is not the software’s default choice (although
results would change only marginally).
and compare this expression to (4.8), it is clear that any difference between the two variance estimators comes from the matrix in the middle, which equals $\sum_{i=1}^n \hat\sigma^2 x_i x_i'$ for $\hat V$ and $\sum_{i=1}^n e_i^2 x_i x_i'$ for $\tilde V$. Therefore, the difference between them,
$$\frac{1}{n}\sum_{i=1}^n (e_i^2 - \hat\sigma^2)\, x_i x_i',$$
is the quantity of interest. We need a test for the hypothesis that the proba-
bility limit of the expression above is a matrix of zeros. If it were so, then the
two estimators would converge to the same limit, and therefore the two esti-
mators would coincide asymptotically; this, of course, wouldn’t happen under
heteroskedasticity. Therefore, the null hypothesis of White’s test is homoskedas-
ticity.
Note that the alternative hypothesis is left unspecified: that is, the alternative hypothesis is simply that there is at least one variance $\sigma_i^2$ that differs from the others. This has two implications, one good and one bad. The good one is that this is a fairly general test, not tied to any specific assumption we may make on the skedastic function $h(x_i)$. The bad one is that the test is "non-constructive": if the null is rejected, the test gives us no indication on what to do.
It would seem that performing such a test is difficult; fortunately, an asymp-
totically equivalent test is easy to compute by means of an auxiliary regression:
$$e_i^2 = \gamma_0 + z_i'\gamma + u_i; \qquad z_i = \mathrm{vech}\left(x_i x_i'\right).$$
9 A generalisation of the same principle is known among econometricians as the Hausman test,
The definition of the vech(·) operator is given in Subsection 4.A.3, but in practice, $z_i$ contains the non-duplicated cross-products of $x_i$, that is, all combinations of the kind $x_{li} \cdot x_{mi}$ (with $l, m = 1 \ldots k$); some of them could cause collinearity, so they must be dropped from the auxiliary regression (see below for an example). Of course, if $x_i$ contains a constant term, then $z_i$ would contain all the elements of $x_i$, as the products $1 \cdot x_{mi}$.
As in all auxiliary regressions, we don't really care about its results per se; running it is just a computationally convenient way to calculate the test statistic we need, namely
$$LM = n \cdot R^2.$$
Under the null of homoskedasticity, this statistic is asymptotically distributed as $\chi^2_p$, where $p$ is the size of the vector $z_i$.
For example: suppose that $x_i$ contains:
1. the constant;
2. two quantitative variables $x_i$ and $w_i$;
3. a dummy variable $d_i$.
The cross products could be written as per the following "multiplication table":

                1          x_i          w_i          d_i
    1           1          x_i          w_i          d_i
    x_i         x_i        x_i^2        x_i·w_i      x_i·d_i
    w_i         w_i        x_i·w_i      w_i^2        w_i·d_i
    d_i         d_i        x_i·d_i      w_i·d_i      d_i^2

The elements to keep are those of the upper triangle, apart from the 1 in the North-West corner (which is already in the auxiliary regression as the constant). Of course the lower triangle is redundant, because it reproduces the upper one, but the element in the South-East corner must be dropped too: since $d_i$ is a dummy variable, it only contains zeros and ones, so its square $d_i^2$ contains the same entries as $d_i$ itself; clearly, inserting both $d_i$ and $d_i^2$ into $z_i$ would make the auxiliary regression collinear.
Therefore, the vector $z_i$ would contain
$$z_i' = [x_i,\ w_i,\ d_i,\ x_i^2,\ x_i w_i,\ x_i d_i,\ w_i^2,\ w_i d_i]$$
and the auxiliary regression would read
$$e_i^2 = \gamma_0 + \gamma_1 x_i + \gamma_2 w_i + \gamma_3 d_i + \gamma_4 x_i^2 + \gamma_5 x_i w_i + \gamma_6 x_i d_i + \gamma_7 w_i^2 + \gamma_8 w_i d_i + u_i,$$
where $u_i$ is the error term of the auxiliary regression. In this case, $p = 8$.10
10 It's easy to prove that, if you have $k$ regressors, then $p \le \frac{k(k+1)}{2} - 1$.
Example 4.4
Running White's heteroskedasticity test on the hedonic model for houses (see section 3.4) yields an auxiliary regression built as follows: (a) the original regressors first (because the original model has a constant) and (b) all the cross-products, except for the square of pool, which is a dummy variable. The total number of regressors in the auxiliary
model is 14 including the constant, so the degrees of freedom for our test statistic
is 14 − 1 = 13.
Since the LM statistic is 141.1 (a huge number compared to the $\chi^2_{13}$ distribution), the null hypothesis of homoskedasticity is strongly rejected.
Therefore, the standard errors presented in Table 4.2 are a much better choice
than those in Table 3.1.
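For readers who want to see the mechanics, here is a minimal hansl sketch of the test "by hand", with made-up variable names loosely inspired by the hedonic example (price, sqft, baths and the dummy pool); in practice, gretl's modtest --white command, issued after estimation, does all of this for you.

    ols price const sqft baths pool
    series e2 = $uhat^2              # squared OLS residuals
    # non-redundant cross-products (pool is a dummy, so its square is dropped)
    series sqft2  = sqft^2
    series baths2 = baths^2
    series s_b    = sqft*baths
    series s_p    = sqft*pool
    series b_p    = baths*pool
    ols e2 const sqft baths pool sqft2 baths2 s_b s_p b_p
    scalar LM = $trsq                # n times the R-squared of the auxiliary regression
    pvalue X 8 LM                    # compare with a chi-square with p = 8 d.f.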
The resulting strategy can be summarised as the following sequence of steps:

1. estimate your model by OLS;
2. run White's test: if the null hypothesis is not rejected, keep your model;
3. if it is rejected, ask whether the model can be reformulated; if so, update your model and go back to step 1;
4. if it cannot, ask whether GLS is feasible; if so, use FGLS;
5. otherwise, use a robust covariance matrix.
The things you can do at points 3 and 4 are many: for example, you can try
transforming your dependent variable and/or use weighting; for more details,
go back to section 4.2.1.
Note, however, that this algorithm often ends at point 5; this is so com-
mon that many people, in the applied economics community, don’t even bother
checking for heteroskedasticity and start directly from there.11 This is especially
true in some cases, where you know from the outset what the situation is. The
11 In fact, some researchers sometimes show an inclination to disregard specification issues in the hope that robust inference will magically take care of everything, which is of course not the case.
For an insightful analysis, see King and Roberts (2015).
The linearity hypothesis implies that $E[y_i|x_i] = \pi_i = x_i'\beta$, since the expected value of a 0/1 variable equals the probability that it takes the value 1; but then, the variance of such a variable is $\pi_i(1-\pi_i)$, and therefore
$$V[y_i|x_i] = \pi_i(1-\pi_i) = x_i'\beta \cdot (1 - x_i'\beta),$$
so the model is heteroskedastic by construction.
12 Models that overcome this questionable approach have existed for a long time: you’ll find a
thorough description of logit and probit models in any decent econometrics textbook, but for
some bizarre reason they are going out of fashion.
13 The classic Chow test occurs when m = 2; in order to study the argument below, I suggest you
where y j is the segment of the y vector containing the observations for the j -th
subsample, and so forth. If you apply the OLS formula to equation (4.11), you
get
$$
\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_m \end{bmatrix}
=
\left(
\begin{bmatrix} X_1' & 0 & \cdots & 0 \\ 0 & X_2' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_m' \end{bmatrix}
\begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_m \end{bmatrix}
\right)^{-1}
\begin{bmatrix} X_1' & 0 & \cdots & 0 \\ 0 & X_2' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_m' \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
=
\begin{bmatrix} X_1'X_1 & 0 & \cdots & 0 \\ 0 & X_2'X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_m'X_m \end{bmatrix}^{-1}
\begin{bmatrix} X_1'y_1 \\ X_2'y_2 \\ \vdots \\ X_m'y_m \end{bmatrix}
=
\begin{bmatrix} (X_1'X_1)^{-1}X_1'y_1 \\ (X_2'X_2)^{-1}X_2'y_2 \\ \vdots \\ (X_m'X_m)^{-1}X_m'y_m \end{bmatrix}
$$
Moreover, the residual vector satisfies
$$e'e = \sum_{j=1}^m e_j'e_j,$$
which in words reads: the SSR for model (4.10) is the same as the sum of the
SSRs you get for the m separate submodels. Equation (4.2) is a simple special
case when m = 2; the corresponding generalisation for a generic m is
$$W = n \cdot \frac{S_T - \sum_{j=1}^m S_j}{\sum_{j=1}^m S_j} \qquad (4.12)$$
Now note that if you take q to be the “reference” category14 , you can rewrite
equation (4.10) as
$$y_i = x_i'\beta + \sum_{j \ne q} \left(d_{ji}\, x_i'\gamma_j\right) + \varepsilon_i$$
since Σ is pd, we can write it as Σ = H H ′ (by property 2), so that Σ−1 = (H ′ )−1 H −1 :
since MW is idempotent, it is psd (property 3); but then, the same is true of
(H −1 X)′ MW (H −1 X); therefore, the claim follows.
Note that under heteroskedasticity Σ is assumed to be diagonal, but the above
proof holds for any non-singular covariance matrix Σ.
14 I will not offend the reader’s intelligence by writing the obvious double inequality 1 ≤ q ≤ m.
15 Note: H is not unique, but that doesn’t matter here. By the way, it is also true that if a matrix
H exists such that A = H H ′ , then A is psd, but we won’t use this result here.
or more generally
$$\mathrm{vec}\left(\begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix}\right) = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{bmatrix}.$$
The “vech” operator works in a similar way, but is generally applied to sym-
metric matrices: the difference from “vec” is that the redundant elements are
not considered. For example:
$$\mathrm{vech}\left(\begin{bmatrix} x & y \\ y & z \end{bmatrix}\right) = \begin{bmatrix} x \\ y \\ z \end{bmatrix}.$$
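Incidentally, gretl has vech() (and vec()) built in as matrix functions, so the example above can be checked directly; a one-minute hansl sketch:

    matrix A = {1, 2; 2, 3}      # a symmetric 2x2 matrix
    print A
    matrix v = vech(A)           # stacks the non-redundant elements: (1, 2, 3)'
    print v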
$$\hat\theta = T(X).$$
The idea is to use your observed data X to produce, with the aid of computer-
generated pseudo random numbers, H alternative datasets Xh , with h = 1 . . . H ,
and compute your estimator for each of them, so you end up with a collection
θ̂1 , θ̂2 , . . . , θ̂H . This procedure is what we call bootstrapping,16 and the H realisa-
tions you get of your statistic are meant to give you an idea of the actual, finite-
sample distribution of the statistic itself.
Then, one possible way of computing V (θ̂) is just to take the sample variance
of the bootstrap estimates:
$$\bar\theta = \frac{1}{H}\sum_{h=1}^H \hat\theta_h \qquad\qquad \tilde V(\hat\theta) = \frac{1}{H}\sum_{h=1}^H \left(\hat\theta_h - \bar\theta\right)^2$$
$$\begin{aligned}
X_1 &= (x_1, x_1, x_1) \\
X_2 &= (x_1, x_1, x_2) \\
X_3 &= (x_1, x_1, x_3) \\
X_4 &= (x_1, x_2, x_1) \\
X_5 &= (x_1, x_2, x_2) \\
X_6 &= (x_1, x_2, x_3) \\
&\;\;\vdots \\
X_{26} &= (x_3, x_3, x_2) \\
X_{27} &= (x_3, x_3, x_3)
\end{aligned}$$
16 According to Efron, “[i]ts name celebrates Baron Munchausen’s success in pulling himself up
by his own bootstraps from the bottom of a lake” (Efron and Hastie, 2016, p. 177), although the
story is reportedly a little different. However, the name was chosen to convey the idea of the
accomplishment of something apparently impossible without external help.
17 When data are not independent, things get a bit more involved.
18 Warning: the algorithm in Table 4.3 wouldn’t be a computationally efficient way to get the job
done. It’s just meant to illustrate the procedure in the most transparent way possible.
and it's only by chance that you observed X6 instead of any of the others. The
number 27 comes from the fact that the number of possible datasets is n n , so in
this case 33 = 27. Clearly, the estimator θ̂h can be computed for each of the 27
cases and various descriptive statistics can be computed easily. In realistic cases,
computing θ̂h for each possible sample is impossible, since n n is astronomical:
therefore, we just randomly extract H samples and use those.
Note: this is not meant to run “out of the box”. The script above assumes that a
few objects, such as the scalars n or H, or the function estimator() have already
been defined.
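To fix ideas, here is a minimal, self-contained hansl sketch of a nonparametric bootstrap for the simplest possible estimator, the sample mean; everything here (sample size, number of replications, the use of resample()) is purely illustrative.

    nulldata 50
    set seed 1234
    series x = normal()
    scalar H = 1000
    matrix theta = zeros(H, 1)
    loop h=1..H
        series xb = resample(x)    # draw n observations with replacement
        theta[h] = mean(xb)
    endloop
    scalar se_boot = sdc(theta)    # std. dev. of the H bootstrap means
    printf "bootstrap std. error of the mean: %g\n", se_boot
    printf "textbook formula sd(x)/sqrt(n):   %g\n", sd(x)/sqrt($nobs)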
Dynamic Models
2. At any given point in time, we can take as known what happened in the
past (and, possibly, at present time), but the future remains unknown.
what I used in the previous chapters. Since we’re dealing with time series, I will use the symbols t
and T instead of i and n, so for example the dependent variable has values y 1 , . . . , y t , . . . , y T .
2 In fact, the econometric treatment of time series data has become, since the 1980s, such a
vast and complex subject that we may legitimately treat time-series econometrics as a relatively
autonomous scientific field (with financial econometrics as a notable sub-field).
also in the sequence in which they come, as if the data told you a story. If you
scramble the ordering of the rows in a cross-sectional dataset, the information
remains intact; in a time series dataset, most of it is gone.
For example: figure 5.1 shows log of real GDP and log of private consump-
tion in the Euro area between 1995 and 2019 (y and c, respectively).3 By looking
at the plot, it just makes sense to surmise that c t −1 may contain valuable infor-
mation about c t , even more than y t does.
Figure 5.1: Consumption and income in the Euro area (in logs)
Therefore:
• the choice of $x_t$ as the conditioning set for $E[y_t|x_t]$ says implicitly that information on what happened before time $t$ is of no interest to us (which is silly);
In the early days of econometrics, this situation was treated in pretty much
the same way as we did with heteroskedasticity in section 4.2, that is, by consid-
ering a model like
$$y_t = x_t'\beta + \varepsilon_t \qquad (5.1)$$
and working out solutions to deal with the fact that $E[\varepsilon\varepsilon'] = \Sigma$ is not a diagonal matrix.
variables is: suppose you have T random variables observed through time
$$z_1, z_2, \ldots, z_t, \ldots, z_T.$$
The correlation between any two of them, say $z_s$ and $z_t$, is
$$\rho_{t,s} = \frac{\mathrm{Cov}[z_s, z_t]}{\sqrt{V[z_t]\, V[z_s]}}$$
Figure 5.2: ACF for c, with approximate ±1.96/√T confidence bands
Example 5.1
Figure 5.2 displays the sample autocorrelations for the log consumption series
shown in Figure 5.1. As you can see, the numbers are very different from 0. For
example, the first 3 sample correlations equal
$$\hat\rho_1 = \mathrm{Corr}[z_t, z_{t-1}] = 0.9627 \qquad \hat\rho_2 = \mathrm{Corr}[z_t, z_{t-2}] = 0.9259 \qquad \hat\rho_3 = \mathrm{Corr}[z_t, z_{t-3}] = 0.8881$$
and it would be hard to argue that the random variables contained in this time
series are independent.
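In gretl, these numbers are a one-liner; the following hansl sketch (the series name z is hypothetical) computes the first three sample autocorrelations by hand and then calls the built-in correlogram command.

    series z1 = z(-1)
    series z2 = z(-2)
    series z3 = z(-3)
    printf "rho_1 = %.4f\n", corr(z, z1)
    printf "rho_2 = %.4f\n", corr(z, z2)
    printf "rho_3 = %.4f\n", corr(z, z3)
    corrgm z 20        # correlogram up to lag 20 (values may differ marginally,
                       # since the ACF proper uses a single overall mean)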
Σ cannot be diagonal, so GLS solutions have been devised5 , and a clever general-
isation of White’s robust estimator (due to Whitney Newey and Kenneth West) is
also available, but instead of “fixing” OLS, a much better strategy is to rethink our
conditioning strategy. That is, instead of employing clever methods to perform
acceptable inference on equation (5.1), we’d be much better off if we redefined
our object of interest altogether.
What we want to do is to use all the potentially relevant available information as our conditioning set; to this end, define the information set at time t as6
$$\Im_t = \left\{x_1, x_2, \ldots, x_t,\ y_1, y_2, \ldots, y_{t-1}\right\};$$
(note that ℑt includes xt ). For example, in order to build a model where con-
sumption is the dependent variable and the only explanatory variable is income
(a dynamic consumption function, if you will), it may make sense to condition
consumption on the whole information set ℑt .
Therefore, the conditioning operation will be done by using all the variables
relevant for the distribution of y t that can be assumed to be known at that time.
Clearly, that includes the current value of xt , but also the past of both y t and xt .
Possibly, it even includes future variables that are known with certainty at time t; such variables are normally said to be deterministic. Apart from the constant
term (x t = 1), popular examples include time trends (eg x t = t ), seasonal dummy
variables (eg x t = 1 if t is the month of May), or more exotic choices, such as the
number of days in a given month, that is known in advance. Note that (this will
be very important) ℑt is an element of a sequence where ℑt −1 ⊆ ℑt ⊆ ℑt +1 ; in
other words, the sequence of information sets is increasing.7
Now consider the conditional expectation $E[y_t|\Im_t]$; even under the linearity assumption, this object may have two potentially troublesome characteristics:
1. since the sequence $\Im_t$ is increasing, $E[y_t|\Im_t]$ may contain information that goes indefinitely back into the past, and
2. $E[y_t|\Im_t]$ could be different for each t.
If neither of the above is true, things become much simpler; under the additional assumption of linearity of the conditional mean,
$$E[y_t|\Im_t] = \sum_{i=1}^p \alpha_i y_{t-i} + \sum_{i=0}^q \beta_i' x_{t-i},$$
5 For the readers who are into the history of econometrics: the so-called Cochrane-Orcutt es-
timator and its refinements are totally forgotten today, but they were a big thing back in the 1960s
and 1970s.
6 To be rigorous, we should define the information set by using a technical tool called a σ-
field. This ensures that ℑt contains all possible functions of the elements listed above (∆y t −1 , for
example). But in an introductory treatment such as this, I’ll just use the reader’s intuition and use
$\Im_t$ as "all the things we know at time t".
7 Or, if you will, we are assuming that we always learn and never forget.
where p and q are finite numbers. Although in principle ℑt contains all the past,
no matter how remote, only the most recent elements of ℑt actually enter the
conditional expectation. A slightly more technical way of expressing the same
concept is: we are assuming that there is a subset of ℑt (call it Ft ), that contains
only recent information, such that conditioning on ℑt or Ft makes no differ-
ence:
$$E[y_t|\Im_t] = E[y_t|F_t], \qquad (5.2)$$
where ℑt ⊃ Ft . In practice, Ft is the relevant information at time t .
The linearity assumption makes the regression function of y t on ℑt a differ-
ence equation, that is, a relationship in which an element of a sequence y t is
determined by a linear combination of its own past and the present and past of
another sequence xt ;8 if we proceed in a similar way as in chapter 3, and define
$\varepsilon_t \equiv y_t - E[y_t|\Im_t]$, we can write the so-called ADL model:
$$y_t = \sum_{i=1}^p \alpha_i y_{t-i} + \sum_{i=0}^q \beta_i' x_{t-i} + \varepsilon_t. \qquad (5.3)$$
The ADL acronym is for Autoregressive Distributed Lags (some people prefer the
ARDL acronym): in many cases, we call the above an ADL(p, q) model, to make
it explicit that the conditional mean contains p lags of the dependent variable
and q lags of the explanatory variables.
Of course, it would be very nice if we could estimate the above parameters
via OLS. Clearly, the first few observations would have to be discarded, but once
this is done, we may construct our y and X matrices as9
$$
\begin{bmatrix} y_{p+1} \\ y_{p+2} \\ y_{p+3} \\ \vdots \end{bmatrix}
=
\begin{bmatrix}
y_p & y_{p-1} & \cdots & y_1 & x_{p+1}' & x_p' & \cdots & x_{p-q+1}' \\
y_{p+1} & y_p & \cdots & y_2 & x_{p+2}' & x_{p+1}' & \cdots & x_{p-q+2}' \\
y_{p+2} & y_{p+1} & \cdots & y_3 & x_{p+3}' & x_{p+2}' & \cdots & x_{p-q+3}' \\
\vdots & & & & & & & \vdots
\end{bmatrix}
\begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_p \\ \beta_0 \\ \vdots \\ \beta_q \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{p+1} \\ \varepsilon_{p+2} \\ \varepsilon_{p+3} \\ \vdots \end{bmatrix}
$$
or, more compactly,
$$y = W\gamma + \varepsilon,$$
where $w_t$ is defined as
$$w_t' = \left[y_{t-1}, y_{t-2}, \ldots, y_{t-p},\ x_t', x_{t-1}', \ldots, x_{t-q}'\right]$$
and
$$\gamma' = \left[\alpha_1, \alpha_2, \ldots, \alpha_p,\ \beta_0', \beta_1', \ldots, \beta_q'\right].$$
8 Note: this definition works for our present purposes, but in some cases you may want to con-
have to be dropped.
Given this setup, clearly the OLS statistic can be readily computed with the
usual formula (W′ W)−1 W′ y, but given the nature of the conditioning, one may
wonder if OLS is a CAN estimator of the α and β parameters. As we will see in
section 5.3, the answer is positive, under certain conditions.
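In gretl, estimating an ADL model by OLS is just a matter of listing the appropriate lags; a minimal hansl sketch for an ADL(2,2), with hypothetical variable names (the dataset must already be declared as a time series, and the first observations, for which the lags are unavailable, are dropped automatically):

    ols y const y(-1) y(-2) x x(-1) x(-2)
    series e = $uhat        # residuals, to be used later for diagnostic checks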
Before we focus on the possible inferential difficulties, however, it is instruc-
tive to consider another problem. Even if the parameters of the conditional expectation $E[y_t|\Im_t]$ were known and didn't have to be estimated, how do we interpret them?
we may ask ourselves: what is the effect of x on y after a given period? That is:
how does xt affect y t +h ? Since the coefficients αi and βi do not depend on t ,
we may rephrase the question as: what is the impact on y t of something that
happened h periods ago, that is $x_{t-h}$? Clearly, if h = 0 we have a quantity that is straightforward to interpret, that is, the instantaneous impact of $x_t$ on $y_t$, but much is to be gained by considering magnitudes like
$$d_h = \frac{\partial y_t}{\partial x_{t-h}} = \frac{\partial y_{t+h}}{\partial x_t}; \qquad (5.4)$$
the d h parameters take the name of dynamic multipliers, or just multipliers for
short. In order to find a practical and general way to compute them, we will need
a few extra tools. Read on.
fined as the inverse to the lag operator (F x t = x t +1 , or F = L −1 ). I’m not using it in this book, but
its usage is very common in economic models with rational expectations.
Example 5.2
Call b t the money you have at time t , and s t the difference between the money
you earn and the money you spend between t − 1 and t (in other words, your
savings). Of course,
b t = b t −1 + s t .
Now the same thing with the lag operator:
$$b_t = L b_t + s_t \;\Rightarrow\; b_t - L b_t = (1-L) b_t = \Delta b_t = s_t.$$
The $\Delta$ operator, which I suppose is not unknown to the reader, is defined as $(1-L)$, that is, a polynomial in L of degree 1. The above expression simply says that the variation in the money you have is your net saving.
Example 5.3
Call q t the GDP for the Kingdom of Verduria in quarter t . Obviously, yearly GDP
is given by
$$y_t = q_t + q_{t-1} + q_{t-2} + q_{t-3} = (1 + L + L^2 + L^3)\, q_t.$$
Since $(1 + x + x^2 + x^3)(1 - x) = (1 - x^4)$, if you "multiply" the equation above11 by $(1-L)$ you get
$$\Delta y_t = (1 - L^4)\, q_t = q_t - q_{t-4};$$
The variation in yearly GDP between quarters is just the difference between the
quarterly figures a year apart from each other.
A polynomial P(x) may be evaluated at any value, but two cases are of special interest. Obviously, if you evaluate P(x) at x = 0 you get the "constant" coefficient of the polynomial, since $P(0) = p_0 + p_1 \cdot 0 + p_2 \cdot 0^2 + \cdots = p_0$; instead, if you evaluate P(1) you get the sum of the polynomial coefficients:
$$P(1) = \sum_{j=0}^n p_j 1^j = \sum_{j=0}^n p_j.$$
This turns out to be quite handy when you apply a lag polynomial to a constant,
since
$$P(L)\mu = \sum_{j=0}^n p_j \mu = \mu \sum_{j=0}^n p_j = P(1)\mu.$$
11 To be precise, we should say: ‘if you apply the (1 − L) operator to the expression above’.
There are two more routine results that come in very handy: the first one has
to do with inverting polynomials of order 1. It can be proven that, if |α| < 1,
$$(1 - \alpha L)^{-1} = (1 + \alpha L + \alpha^2 L^2 + \cdots) = \sum_{i=0}^{\infty} \alpha^i L^i; \qquad (5.5)$$
the other one is that a polynomial P (x) is invertible if and only if all its roots are
greater than one in absolute value:
$$\frac{1}{P(x)} \ \text{exists iff}\ P(x) = 0 \Rightarrow |x| > 1. \qquad (5.6)$$
The proofs are in subsection 5.A.1.
$$Y_t = C_t + I_t; \qquad (5.7)$$
$$C_t = \alpha Y_{t-1}; \qquad (5.8)$$
substituting (5.8) into (5.7),
$$Y_t = \alpha Y_{t-1} + I_t \;\Rightarrow\; (1-\alpha L)\, Y_t = I_t;$$
therefore, by applying the first degree polynomial A(L) = (1 − αL) to the Y t se-
quence (national income), you get the time series for investments, simply be-
cause I t = Y t −C t = Y t − αY t −1 .
If you now invert the A(L) = (1 − αL) operator,
$$Y_t = (1 + \alpha L + \alpha^2 L^2 + \cdots)\, I_t = \sum_{i=0}^\infty \alpha^i I_{t-i}:$$
aggregate demand at time t can be seen as a weighted sum of past and present
investment. Suppose that investment goes from 0 to 1 at time 0. This brings
about a unit increase in GDP via equation (5.7); but then, at time 1 consumption
goes up by α, by force of equation (5.8), so at time 2 it increases by α² and so on.
Since 0 < α < 1, the effect dies out eventually.
If investments were constant through time, then $I_t = \bar I$; therefore, $A(L)Y_t = \bar I$ becomes
$$Y_t = \frac{1}{A(L)}\bar I = \frac{1}{A(1)}\bar I = \frac{\bar I}{1-\alpha},$$
where the second equality comes from the fact that I¯ is constant. The rightmost
expression is nothing but the familiar “Keynesian multiplier” formula.
Example 5.5
Given two sequences $x_t$ and $y_t$, define the sequence $z_t$ as $z_t = x_t \cdot y_t$. Obviously, $z_{t-1} = x_{t-1} y_{t-1}$; however, one may be tempted to argue that
$$z_{t-1} = x_{t-1} y_{t-1} = L x_t \cdot L y_t = L^2 x_t y_t = L^2 z_t = z_{t-2},$$
which is clearly absurd.
where the degrees of the A(L) and B (L) polynomials are p and q, respectively.
If the polynomial A(L) is invertible, the difference equation is said to be stable.
In this case, we may define D(L) = A(L)−1 B (L) = B (L)/A(L); as a rule, D(L) is of
infinite order (although not necessarily so):
$$y_t = D(L)\, x_t = \sum_{i=0}^\infty d_i x_{t-i}.$$
This is all we need for dealing with our problem, if you consider that the dynamic multipliers as defined in equation (5.4),
$$d_i = \frac{\partial y_t}{\partial x_{t-i}} = \frac{\partial y_{t+i}}{\partial x_t},$$
are just the coefficients of the D(L) polynomial. For example, take the ADL(1,1) model
$$y_t = \alpha y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1}; \qquad (5.9)$$
the first few multipliers can be computed directly:
$$d_0 = \frac{\partial y_t}{\partial x_t} = \frac{\partial}{\partial x_t}\left(\alpha y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1}\right) = \beta_0$$
$$d_1 = \frac{\partial y_t}{\partial x_{t-1}} = \frac{\partial}{\partial x_{t-1}}\left(\alpha y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1}\right) = \alpha\frac{\partial y_{t-1}}{\partial x_{t-1}} + \beta_1 = \alpha d_0 + \beta_1,$$
since
$$\frac{\partial y_{t-1}}{\partial x_{t-1}} = \frac{\partial y_t}{\partial x_t} = d_0;$$
$$d_2 = \frac{\partial y_t}{\partial x_{t-2}} = \frac{\partial}{\partial x_{t-2}}\left(\alpha y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1}\right) = \alpha\frac{\partial y_{t-1}}{\partial x_{t-2}} = \alpha d_1.$$
In general, the whole sequence of multipliers can be obtained by solving the difference equation $A(L)\, d_i = B(L)\, u_i$, where $u_i$ is a sequence that contains 1 for $u_0$, and 0 everywhere else. This makes it easy to calculate the multipliers numerically, given the polynomial coefficients, via appropriate software.
In this case, $A(L) = 1 - 0.2L$ and $B(L) = 0.4 + 0.3L^2$. The inverse of A(L) is
$$A(L)^{-1} = 1 + 0.2L + 0.04L^2 + 0.008L^3 + \cdots = \sum_{i=0}^\infty 0.2^i L^i;$$
therefore,
$$\frac{B(L)}{A(L)} = (0.4 + 0.3L^2) \times (1 + 0.2L + 0.04L^2 + 0.008L^3 + \cdots).$$
$$\begin{aligned}
\frac{B(L)}{A(L)} &= 0.4 \times (1 + 0.2L + 0.04L^2 + 0.008L^3 + \cdots) + 0.3L^2 \times (1 + 0.2L + 0.04L^2 + 0.008L^3 + \cdots) = \\
&= 0.4 + 0.08L + 0.016L^2 + 0.0032L^3 + \cdots + 0.3L^2 + 0.06L^3 + 0.012L^4 + 0.0024L^5 + \cdots = \\
&= 0.4 + 0.08L + 0.316L^2 + 0.0632L^3 + \cdots
\end{aligned}$$
and so on.
$$c_j = d_0 + d_1 + \cdots + d_j = \sum_{i=0}^j d_i. \qquad (5.11)$$
These are called interim multipliers and measure the effect on y t of a perma-
nent change in x t that took place j periods ago. In order to see what happens in
the long run after a permanent change, we may also want to consider the long-
run multiplier c = lim j →∞ c j . Calculating c is much easier than what may seem,
since
$$c = \lim_{j\to\infty} c_j = \sum_{i=0}^\infty d_i = D(1);$$
that is: c is the number you get by evaluating the polynomial D(z) at z = 1; since $D(z) = B(z)/A(z)$, c can be easily computed as $c = D(1) = \frac{B(1)}{A(1)}$.
$$c_0 = d_0 = 0.4 \qquad c_1 = d_0 + d_1 = c_0 + d_1 = 0.48 \qquad c_2 = d_0 + d_1 + d_2 = c_1 + d_2 = 0.796$$
and so on. The limit of this sequence (the long-run multiplier) is also easy to compute:
$$c = D(1) = \frac{B(1)}{A(1)} = 0.7/0.8 = 0.875.$$
Et voilà.
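The same numbers can be obtained mechanically via the recursion $A(L)d_i = B(L)u_i$ described above; a minimal hansl sketch for this example, with all coefficients hard-coded:

    scalar h = 8
    scalar alpha = 0.2
    matrix b = {0.4, 0, 0.3}          # beta_0, beta_1, beta_2
    matrix d = zeros(h+1, 1)
    d[1] = b[1]                        # d_0 = beta_0
    loop i=1..h
        scalar bi = (i <= 2) ? b[i+1] : 0
        d[i+1] = alpha * d[i] + bi     # d_i = alpha*d_{i-1} + beta_i
    endloop
    matrix c = cum(d)                  # interim (cumulated) multipliers
    print d c
    scalar B1 = b * ones(cols(b), 1)   # B(1)
    printf "long-run multiplier = %g\n", B1/(1 - alpha)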
$$A(L)y_t = B(L)x_t \;\Rightarrow\; A(L)Y = B(L)X \;\Rightarrow\; A(1)Y = B(1)X \;\Rightarrow\; Y = \frac{B(1)}{A(1)}\, X = cX,$$
$$y_t = w_t'\gamma + \varepsilon_t$$
where Q is invertible. The conditions under which we can expect this to happen
are quite tricky to lay down formally. Here, I’ll just say that in order for every-
thing to work as expected, it is sufficient that our observed data are realisations
(this is essentially the same argument we used in section 3.1). But of course we
can’t use ℑt in practice; however, we’re assuming (see equation 5.2) that there
exists a subset $F_t \subset \Im_t$ such that $E[y_t|\Im_t] = E[y_t|F_t]$, so $F_t$ (which is usable, unlike $\Im_t$) can be employed in its place.
However, what happens if you use a conditioning set G t that is “too small”?
That is, one that doesn't contain $F_t$? In that case, the difference $u_t = y_t - E[y_t|G_t]$ is not a MDS with respect to $\Im_t$: if $E[y_t|G_t] \ne E[y_t|F_t]$, then
$$E[u_t|\Im_t] = E\left[\,y_t - E[y_t|G_t]\,\middle|\,\Im_t\right] = E[y_t|F_t] - E[y_t|G_t] \ne 0.$$
On the contrary, it is easy to prove that in the opposite case, when you condition
on a subset of ℑt that is larger than Ft , no problems arise.
This remark is extremely important in practice because the orders of the polynomials A(L) and B(L) (p and q, respectively) are not known: what happens if
we get them wrong? Well, if they are larger than the “true” ones, then our con-
ditioning set contains Ft , and all is well. But if they’re smaller, the disturbance
term of our model is not a MDS, and all inference collapses. For example, if
p = 2 and q = 3, then Ft contains y t −1 , y t −2 , x t , x t −1 , x t −2 and x t −3 . Any set of
regressors that doesn’t include at least these renders inference invalid.
13 I’m being very vague and unspecific here: if you want an authoritative source on the asymp-
totics for dynamic models, you’ll want to check chapters 6 and 7 in Davidson (2000).
14 MDSs arise quite naturally in inter-temporal optimisation problems, so their usage in eco-
nomic and finance models with uncertainty is very common. In these contexts, an MDS is, so to
speak, something that cannot be predicted in any way from the past. For a thorough discussion,
see Hansen and Sargent (2013), chapter 2.
The relevant asymptotic results parallel completely those in section 3.2. To put it simply, everything works exactly the same way as in cross-sectional models: the martingale property ensures that $E[w_t \cdot \varepsilon_t|\Im_t] = 0$, and therefore $\hat\gamma \stackrel{p}{\longrightarrow} \gamma$; additionally,
$$\sqrt{T}\,(\hat\gamma - \gamma) \stackrel{d}{\longrightarrow} N\left(0, \sigma^2 Q^{-1}\right).$$
In practice, the whole testing apparatus we set up for cross sectional datasets
remains valid; the t statistic, the W statistic, everything. Nice, isn’t it? In addi-
tion, since the dynamic multipliers are continuous and differentiable functions
of the ADL parameters γ, we can simply compute the multipliers from the estimated parameters γ̂ and automatically get CAN estimators of the multipliers.15
The homoskedasticity assumption is not normally a problem, except for financial data at high frequencies (eg daily); for those cases, you get a separate class of models, the most notable example of which is the so-called GARCH model. These models will not be considered here, but they are extremely important in the field of financial econometrics. In case we want to stick with OLS, robust estimation is perfectly viable.
Basically, we need a test for deciding, on the basis of the OLS residuals, whether
εt is a MDS or not. Because if it were not, the OLS estimator would not be con-
sistent for the ADL parameters, let alone have the asymptotic distribution we
require for carrying out tests. As I argued in the previous section, εt cannot be a
MDS if we estimate a model in which the orders p and q that we use for the two
polynomials A(L) and B (L) are too small.
Most tests hinge on the fact that a MDS cannot be autocorrelated:16 for the
sake of brevity, I don’t prove this here, but the issue is discussed in section 5.A.3
if you’re interested. Therefore, in practice, the most important diagnostic check
on a dynamic regression model is checking for autocorrelation: if we reject the
null of no autocorrelation, then εt cannot be a MDS.
All econometric software pays tribute to tradition by reporting a statistic in-
vented by James Durbin and Geoffrey Watson in 1950, called DW statistic in their
honour. Its support is, by construction, the interval between 0 and 4, and ide-
ally it should be close to 2. It is practically useless, because it only checks for
autocorrelation of order 1, and there are several cases in which it doesn't work (notably, when lags of the dependent variable are among the regressors); therefore, nobody uses it anymore, although all software packages routinely print it out as a homage to tradition.
15 Unfortunately, the function linking multipliers and parameters is nonlinear, so you need the
The Godfrey test (also known as the Breusch-Godfrey test, or the LM test for autocorrelation) is much better: the model is augmented with lags of the OLS residuals,
$$A(L)y_t = B(L)x_t + \gamma_1 e_{t-1} + \gamma_2 e_{t-2} + \cdots + \gamma_h e_{t-h} + \varepsilon_t,$$
where $e_t$ is the t-th OLS residual and h is known as the order of the test. There
is no precise rule for choosing h; the most important aspect to consider is: how long a period is reasonably long enough for dynamic effects to show up? When dealing with macro time series, a common choice is
2 years. That is, it is tacitly assumed that nothing can happen now, provoke no
effects for two years, and then suddenly do something.17 Therefore, you would
use h = 2 for yearly data, h = 8 for quarterly data, and so on. But clearly, this is a
very subjective criterion, so take it with a pinch of salt and be ready to adjust it
to your particular dataset.
This test, being a variable addition test, is typically implemented as an LM
test (see section 3.5.1) and is asymptotically distributed (under H0 ) as χ2h . In
practice, you carry out an auxiliary regression of the OLS residuals e t against wt
and h lags of e t ; you multiply R 2 by T and you’re done.
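In gretl, the test is one command; a minimal hansl sketch with hypothetical variables (quarterly data, so h = 8), showing both the built-in version and the auxiliary regression done by hand (the two may differ marginally, because of how the initial missing lags of the residuals are handled):

    ols y const y(-1) y(-2) x x(-1) x(-2)
    modtest 8 --autocorr          # built-in Breusch-Godfrey test of order 8
    # ... and the same idea by hand:
    series e = $uhat
    ols e const y(-1) y(-2) x x(-1) x(-2) e(-1 to -8)
    scalar LM = $trsq
    pvalue X 8 LM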
The Godfrey test is the cornerstone of the so-called general-to-specific esti-
mation strategy: since the polynomial orders p and q are not known in practice,
one has to make a guess. There are three possible situations:
So the idea of the general-to-specific approach is: start from a large model,
possibly ridiculously oversized. Then you can start refining it by ordinary hy-
pothesis tests, running diagnostics18 at each step to make sure your reduction
was not too aggressive.
17 "My cousin told me he knows a secret move. . . ", EELST.
18 The most important test to run at this stage is of course the Godfrey test, but other diagnostics,
$$A(L) = 1 - 0.893634\,L, \qquad B(L) = 0.588653 - 0.612108\,L + 0.110455\,L^2,$$
so that
$$d_0 = 0.588653$$
and so on. With a little effort (and appropriate software), you get the following
results:
     i        d_i           c_i
     0       0.588653      0.588653
     1      -0.0860674     0.502585
     2       0.0335424     0.536128
     3       0.0299747     0.566103
     4       0.0267864     0.592889
     5       0.0239372     0.616826
     6       0.0213911     0.638217
    ...         ...           ...
where I also added a column for the interim (cumulated) multipliers. Moreover, you have that A(1) = 1 − 0.893634 = 0.106366 and B(1) = 0.087, and therefore the long-run multiplier equals c = 0.087/0.106366 = 0.81793.
$$y_t = \alpha y_{t-1} + \beta_0' x_t + \beta_1' x_{t-1};$$
now substitute
$$y_t = y_{t-1} + \Delta y_t, \qquad x_t = x_{t-1} + \Delta x_t, \qquad x_{t-2} = x_{t-1} - \Delta x_{t-1}$$
and collect:
$$\Delta y_t = -0.8\, y_{t-1} + 0.7\, x_{t-1} + 0.4\,\Delta x_t - 0.3\,\Delta x_{t-1},$$
so finally
$$\Delta y_t = 0.4\,\Delta x_t - 0.3\,\Delta x_{t-1} - 0.8\left[y_{t-1} - 0.875\, x_{t-1}\right];$$
the impact multiplier is 0.4, the long-run multiplier is 0.875; the fraction of dis-
equilibrium that re-adjusts each period is 0.8.
Note that the ADL model and the ECM are not two different models, but are
simply two ways of expressing the same difference equation. As a consequence,
you can use OLS on either and get the same residuals. The only difference be-
tween them is that the ECM form makes it more immediate for the human eye to calculate the parameters that are most likely to be important for the dynamic properties of the model: that is, the long-run multipliers and the convergence
speed. On the other hand, the ADL form allows for simple (and, most impor-
tantly, mechanical) calculation of the whole sequence of dynamic multipliers.
$$c_t = c_{t-1} + \Delta c_t \qquad y_t = y_{t-1} + \Delta y_t \qquad y_{t-2} = y_{t-1} - \Delta y_{t-1}$$
Hence,
$$\Delta c_t = k + (\alpha - 1)\, c_{t-1} + \beta_0\,\Delta y_t + (\beta_0 + \beta_1 + \beta_2)\, y_{t-1} - \beta_2\,\Delta y_{t-1} + \varepsilon_t,$$
that is
$$\Delta c_t = k + \beta_0\,\Delta y_t - A(1)\left[c_{t-1} - c\, y_{t-1}\right] - \beta_2\,\Delta y_{t-1} + \varepsilon_t;$$
so, after substituting the estimated numerical values (and rounding results a lit-
tle),
Note, however, that this representation could have been calculated directly by
applying OLS to the model in ECM form: it is quite clear from Table 5.3 that what
gets estimated is the same model in a different form. Not only can the parameters of each representation be calculated from the other one: the objective function
(the SSR) is identical for both models (and equals 0.001048); clearly, the same
happens for all the statistics based on the SSR. The only differences (eg the R 2
index) come from the fact that the model is transformed in such a way that the
dependent variable is not the same (it’s c t in the ADL form and ∆c t in the ECM
form).
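This equivalence is easy to verify in gretl; a minimal hansl sketch with hypothetical series c and y (both in logs), assuming the ADL(1,2) specification used above:

    ols c const c(-1) y y(-1) y(-2)            # ADL form
    scalar ssr_adl = $ess
    series d_c = diff(c)
    series d_y = diff(y)
    ols d_c const d_y d_y(-1) c(-1) y(-1)      # ECM form of the same model
    scalar ssr_ecm = $ess
    printf "SSR, ADL form: %g\nSSR, ECM form: %g\n", ssr_adl, ssr_ecm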
$$c = k \iff B(1) - k \cdot A(1) = 0$$
It may be worth mentioning here that tests of this type behave in the ordinary way only if the assumptions we made in section 5.3.1 are valid. There are some important cases when this may not be true, notably when the data we are working with are generated by non-stationary DGPs.
$$H_0: \alpha_1 + \cdots + \alpha_p + \beta_0 + \cdots + \beta_q = 1$$
Example 5.11
Suppose that, for an ADL(1,1) model, we have the estimates $\hat\beta = [0.75,\ 0.53,\ -0.24]'$ (the coefficients on $y_{t-1}$, $x_t$ and $x_{t-1}$, respectively) and
$$\hat V_{ADL} = 0.001 \times \begin{bmatrix} 5 & 0.5 & -2 \\ 0.5 & 5 & 4 \\ -2 & 4 & 5 \end{bmatrix}.$$
The hypothesis c = 1 implies $\alpha + \beta_0 + \beta_1 = 1$. Therefore, a Wald test can be set up with $R = [1\ 1\ 1]$ and $d = 1$ (see section 3.3.2 for details). Hence
$$R\hat\beta - d = [1\ 1\ 1]\begin{bmatrix} 0.75 \\ 0.53 \\ -0.24 \end{bmatrix} - 1 = 0.04$$
$$R \cdot \hat V_{ADL} \cdot R' = 0.001 \times 20 = 0.02$$
$$W = \frac{0.04^2}{0.02} = 0.08,$$
which of course leads to accepting $H_0$, since its p-value is way larger than 5% ($P(\chi^2_1 > 0.08) = 0.777$). The same test could have been performed even more easily from the ECM representation:
$$\widehat{\Delta y}_t = 0.53\,\Delta x_t - 0.25\, y_{t-1} + 0.29\, x_{t-1}$$
In this case the hypothesis can be written as $H_0: B(1) - A(1) = 0$, so for the ECM form
$$R\hat\beta - d = [0\ 1\ 1]\begin{bmatrix} 0.53 \\ -0.25 \\ 0.29 \end{bmatrix} = 0.04$$
$$R \cdot \hat V_{ECM} \cdot R' = 0.001 \times 20 = 0.02$$
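In gretl, the same hypothesis can be tested without computing anything by hand, via a restrict block; a minimal hansl sketch with hypothetical series y and x, where b[2], b[3] and b[4] are the coefficients on y(-1), x and x(-1), given the ordering of the regressors:

    ols y const y(-1) x x(-1)
    restrict
        b[2] + b[3] + b[4] = 1     # alpha + beta_0 + beta_1 = 1, i.e. c = 1
    end restrict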
$$y_{T+1} = \alpha y_T + \beta_0 x_{T+1} + \beta_1 x_T + \varepsilon_{T+1}; \qquad (5.14)$$
of all the objects that appear on the right-hand side of the equation, the only
ones that are known with certainty at time T are y T and x T . Suppose we also
know for certain what the future value x T +1 will be, and call it x T +1 = x̌ T +1 .
Therefore, since εt is a martingale difference sequence, its conditional expec-
tation with respect to ℑT +1 is 0,20 so
$$E[y_{T+1}|\Im_T] = \alpha y_T + \beta_0 \check x_{T+1} + \beta_1 x_T \qquad\qquad V[y_{T+1}|\Im_T] = \sigma^2$$
Following the same logic as in Section 3.7, we can use the conditional expecta-
tion as predictor and the estimated values for the parameters instead of the true
ones. Therefore, our prediction will be
$$\hat y_{T+1} = \hat\alpha\, y_T + \hat\beta_0\, \check x_{T+1} + \hat\beta_1\, x_T,$$
with an associated 95% confidence interval
$$\hat y_{T+1} \pm 1.96 \times \hat\sigma,$$
where it is implicitly assumed that $\varepsilon_t$ is normal and uncertainty about the parameters is ignored.
Now, there are two points I'd like to draw your attention to. First: in order to predict $y_{T+1}$ we need $x_{T+1}$; but then, we could generalise this idea and imagine
20 We're sticking to the definition of $\Im_{T+1}$ I introduced in Section 5, so $\Im_{T+1}$ includes $x_{T+1}$ but not $y_{T+1}$; later in this section, we'll use a different convention.
$$y_{T+2} = \alpha y_{T+1} + \beta_0 x_{T+2} + \beta_1 x_{T+1} + \varepsilon_{T+2};$$
next, we operate in a similar way as we just did, using the conditional expecta-
tion as predictor
$$\hat y_{T+2} = \hat\alpha\, \hat y_{T+1} + \hat\beta_0\, \check x_{T+2} + \hat\beta_1\, \check x_{T+1},$$
repeating the process with the obvious adjustments for T + 3, T + 4 etc. It can be
proven (nice exercise for the reader) that the variance you should use for constructing confidence intervals for multi-step forecasts would be, in this case,
$$V\left[\hat y_{T+k}\right] = \frac{1 - (\alpha^2)^k}{1 - \alpha^2}\,\sigma^2.$$
Extending the formulae above to the general ADL(p, q) case is trivial but boring,
and I’ll just skip it.
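A minimal hansl sketch of the recursive forecasts just described, for an ADL(1,1) with hypothetical series y and x; the assumed future values of x are made-up numbers stored in the vector xf:

    ols y const y(-1) x x(-1)
    matrix b = $coeff                 # [constant, alpha, beta_0, beta_1]
    scalar yhat  = y[$t2]             # last observed y
    scalar xlast = x[$t2]             # last observed x
    matrix xf = {1.05, 1.07, 1.08}    # assumed (made-up) future values of x
    loop h=1..3
        yhat = b[1] + b[2]*yhat + b[3]*xf[h] + b[4]*xlast
        printf "forecast for T+%d: %g\n", h, yhat
        xlast = xf[h]
    endloop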
The second point I want to make comes by considering the possibility that
the β0 and β1 coefficients were 0 in equation (5.14). In this case, there would be
no need to conjecture anything about x t , in order to forecast y t . In other words,
x t has no predicting power for y t . This is a hypothesis we may want to test.
As a general rule, in the context of dynamic regression
models, it is difficult to formulate hypotheses of economic
interest that can be tested through restrictions on coeffi-
cients, since the coefficients of the A(L) and B (L) polyno-
mials normally don’t have a natural economic interpreta-
tion per se, and this is why we compute multipliers.
However, there are exceptions: we just saw one of
them in the previous section. Another one is the so-called
Granger-causality test, after the great Clive Granger, Nobel
Prize winner in 2003.21 The idea on which the test is built
is that, whenever A causes B , the cause should come, in
time, before the effect. Therefore, if A does not cause B , it
should have no effect on the quantity we normally use for prediction, i. e. the
conditional expectation.
The only difference with the ADL models we’ve considered so far is that,
since we’re dealing with predictions about the future, we will want to base our
21 C. W. Granger is one of the founding fathers of modern time series econometrics; his most famous brainchild, which earned him the Nobel Prize, is a concept called cointegration, which I will skip in this book, but which is absolutely indispensable if you want to engage in applied macroeconomics.
forecasts on a slightly different information set, $\Im^*_t = \{x_1, \ldots, x_t,\ y_1, \ldots, y_t\}$; note that, contrary to the concept of information set $\Im_t$ we used so far (defined in section 5.1), $\Im^*_t$ does not include $x_{t+1}$; in practice, it collects all information on $y_t$ and $x_t$ that is available up to time t. Forecasting, therefore, amounts to finding
$$\hat y_{T+1|T} = E\left[y_{T+1}\,|\,\Im^*_T\right].$$
The subscript “T + 1|T ” is customarily read as “at time T + 1, based on the infor-
mation available at time T ”.
h i
There is no doubt that the discerning reader sible to forecast x̂T +1|T = E xT +1|T |ℑ∗
T
. This
has spotted, by now, a fundamental difference seemingly innocent remark paves the way to
between the information set ℑ∗ T −1
that we are multi-step forecasts, where we use the predic-
using here and the information set ℑT we use tions for T to forecast T + 1, which in turn we
in the rest of this chapter: the latter includes xt , use for forecasting T + 2, and so on.
while the former does not.
Since ℑ∗ T −1
⊂ ℑT , predictions on y t made using This is the principle used in the so-called VAR
ℑ∗T −1
are obviously going to be less accurate, model, which is probably the main empirical
but have the advantage of being possible one tool in modern macroeconometrics. If you’re
period earlier. Moreover, this makes also pos- curious, check out Lütkepohl (2005).
$$A(L)\, y_t = B(L)\, x_t + \varepsilon_t,$$
which is an ordinary ADL model in which B(0) = 0 (the contemporaneous value $x_t$ does not appear among the regressors). The idea that $x_t$ does not cause $y_t$ is equivalent to the idea that B(L) = 0; since a polynomial is 0 if and only if all its coefficients are, it is easy to formulate the hypothesis of no Granger-causality as
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_q = 0,$$
which is of course a system of linear restrictions, one that we can handle just fine via the $R\beta = d$ machine we described in Section 3.3.2.
In the late 1960s, when this idea was introduced, it was hailed as a break-
through in economic theory, because for a while this seemed to provide a data-
based way to ascertain causal links. For example, a hotly debated point among
macroeconomists in the 1970s and 80s was: is there a causality direction between
the quantity of money and GDP in an economy? If there is, the repercussions on
economic policy (notably, on the effectiveness of monetary policy) are huge. In
Figure 5.3: time series of chicken production (left panel) and egg production (right panel), 1930-2002
Example 5.12
In a humorous article, Thurman and Fisher (1988) collected data on the produc-
tion of chickens and eggs from 1930 to 2004, that are depicted in Figure 5.3.
After taking logs, we estimate by OLS the following 2 equations:
$$c_t = m_1 + \alpha_1 c_{t-1} + \alpha_2 c_{t-2} + \beta_1 e_{t-1} + \beta_2 e_{t-2} + \varepsilon_t \qquad (5.15)$$
$$e_t = \mu_1 + \gamma_1 c_{t-1} + \gamma_2 c_{t-2} + \lambda_1 e_{t-1} + \lambda_2 e_{t-2} + \eta_t \qquad (5.16)$$
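Operationally, each test is an ordinary joint-significance (omit) test on the lags of the other variable; a minimal hansl sketch for equation (5.15), with hypothetical series names c and e for log chickens and log eggs (the test for (5.16) is symmetric):

    series c1 = c(-1)
    series c2 = c(-2)
    series e1 = e(-1)
    series e2 = e(-2)
    ols c const c1 c2 e1 e2
    omit e1 e2     # H0: beta_1 = beta_2 = 0, i.e. eggs do not Granger-cause chickens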
There are a few issues that may be raised here: one is statistical, and pertains
to the fact that the test is relative to a certain conditioning set. You may see this
as a variation on the same theme I discussed in Section 3.8, especially example
3.3. It may well be that A turns out to be Granger-causal for B in a model, and the
reverse happens in another model, in which some other variables are included
22 Readers who are into the history of economics and econometrics might want to take a look at
Sims (1972).
of course
$$a \cdot S = a + a^2 + \cdots + a^{n+1} \qquad (5.19)$$
and therefore, by subtracting (5.19) from (5.18), $S(1-a) = 1 - a^{n+1}$, and hence equation (5.17).
If a is a small number ($|a| < 1$), then $a^n \to 0$, and therefore $\sum_{i=0}^\infty a^i = \frac{1}{1-a}$. By setting $a = \alpha L$, you may say that, for $|\alpha| < 1$, the inverse of $(1-\alpha L)$ is $(1 + \alpha L + \alpha^2 L^2 + \cdots)$, that is
$$(1-\alpha L)(1 + \alpha L + \alpha^2 L^2 + \cdots) = 1,$$
or, alternatively,
$$\frac{1}{1-\alpha L} = \sum_{i=0}^\infty \alpha^i L^i,$$
provided that $|\alpha| < 1$. Now consider an n-th degree polynomial P(x):
$$P(x) = \sum_{j=0}^n p_j x^j,$$
which can be factorised as a product of terms of the form $\left(1 - \frac{x}{\lambda_j}\right)$, where the numbers $\lambda_j$ are the roots of P(x): if $x = \lambda_j$, then $1 - \frac{x}{\lambda_j} = 0$ and consequently P(x) = 0. Therefore, if $P(x)^{-1}$ exists, it must satisfy
$$\frac{1}{P(x)} = \prod_{j=1}^n \left(1 - \frac{x}{\lambda_j}\right)^{-1};$$
but if at least one of the roots $\lambda_j$ is smaller than 1 in modulus,24 then $1/|\lambda_j|$ is larger than 1 and, as a consequence, $\left(1 - \frac{x}{\lambda_j}\right)^{-1}$ does not exist, and neither does $P(x)^{-1}$.
Example 5.13
Consider the polynomial $A(x) = 1 - 1.2x + 0.32x^2$; is it invertible? Let's check its roots:
$$A(x) = 0 \iff x = \frac{1.2 \pm \sqrt{1.44 - 1.28}}{0.64} = (1.2 \pm 0.4)/0.64,$$
so $\lambda_1 = 2.5$ and $\lambda_2 = 1.25$. Both are larger than 1 in modulus, so the polynomial is invertible. Specifically,
$$A(x) = (1 - \lambda_1^{-1}x)(1 - \lambda_2^{-1}x) = (1 - 0.4x)(1 - 0.8x)$$
and
$$\frac{1}{A(x)} = (1 - 0.4x)^{-1}(1 - 0.8x)^{-1} = (1 + 0.4x + 0.16x^2 + \cdots)(1 + 0.8x + 0.64x^2 + \cdots) = 1 + 1.2x + 1.12x^2 + 0.96x^3 + 0.7936x^4 + \cdots$$
$$u_t = P(L)^{-1} a_t = \frac{1}{P(L)}\, a_t.$$
$$\ldots, x_{t-1}, x_t, x_{t+1}, \ldots$$
where the index t is normally taken to mean “time” (although not necessarily).
This sequence is a stochastic process.25 When we observe a time series, we ob-
serve a part of the realisation of a stochastic process (also called a trajectory
of the process). Just in the same way as the DGP for the toss of a coin can be
thought of as the machine that nature uses for giving us a binary number that
we cannot predict, a stochastic process is a machine that nature uses for giving
us an infinitely long trajectory through time, and what we observe is just a short
segment of it. This idea may be unintuitive at first (it certainly was for me, back in the day), but I find it very useful.
If we take two different elements of the sequence, say x s and x t (with s ̸= t ),
we could wonder what their joint distribution is. The two fundamental proper-
ties of the joint distribution that we are interested in are:
1. is the joint distribution stable through time? That is, is the joint distribu-
tion of (x s , x t ) the same as (x s+1 , x t +1 ) ?
Property number 1 refers to the idea that the point in time when we ob-
serve the process should be irrelevant: the probability distribution of the data
we see today (x s , x t ) should be the same as the one for an observer in the past
(x s−100 , x t −100 ) or in the future (x s+100 , x t +100 ). This gives rise to the concept of
stationarity. A stochastic process is said to be weakly stationary, or covariance
stationary, or second-order stationary if the covariance between x s and x t (also
known as autocovariance) exists and is independent of time. In formulae:
$$\gamma_h = \mathrm{Cov}[x_t, x_{t+h}]$$
25 It’s not inappropriate to think of stochastic processes as infinite-dimensional random
variables. Using the same terminology as in section 2.2.1, we may think of the sequence
. . . , x t −1 (ω), x t (ω), x t +1 (ω), . . . as the infinite-dimensional outcome of one point in the state space
ω ∈ Ω.
ensures that limh→∞ |γh | = 0 (so correlation between distant events should be
negligible), but most importantly, that the sample mean of an observed stochas-
tic process is a consistent estimator of the true mean of the process:
$$\frac{1}{T}\sum_{t=1}^T x_t \stackrel{p}{\longrightarrow} E[x_t].$$
Note that the above expression can be considered as one of the many versions
of the Law of Large Numbers, applicable when observations are not necessarily
independent.
It goes without saying that, in the same way as we can define multivariate
random variables, it is perfectly possible to define multivariate stochastic pro-
cesses, that is, sequences of random vectors: modern macroeconometrics is pri-
marily built upon these objects. A large part of the statistical analysis of time
series is based on the idea that the time series we observe are realisations of sta-
tionary and ergodic processes (or can be transformed to this effect).
How do you adapt statistical inference to such a context? The main idea un-
derlying most approaches is to describe the DGP in such a way that the whole
autocovariance structure of a stochastic process (the sequence γ0 , γ1 , γ2 , ...) can
be expressed as a function of a finite set of parameters θ; if the process is sta-
tionary and ergodic, then maybe the available data x 1 , . . . , x T can be used to
construct CAN estimators of θ. ARIMA models are one of the most celebrated
instances of this approach, and the literature that has developed after their in-
troduction in the late 1960s is truly gigantic. If you’re interested, Brockwell and
Davis (1991) is an excellent starting point.
if n = 0, obviously Q(x) = 0.
For example, the reader is invited to check that, if we choose a = 1, the polynomial $P(x) = 0.8x^2 - 1.8x + 1.4$ can be written as
$$P(x) = P(1) + (x-1)(0.8x - 1) = 0.4 + (x-1)(0.8x - 1).$$
The actual form of the P ∗ (L) polynomial is not important: all we need is know-
ing that it exists, so that the decomposition of P (L) we just performed is always
possible. As a consequence, every sequence P (L)z t can be written as:
Now apply this result to both sides of the ADL model A(L)y t = B (L)xt + εt :
(note that A(0) = 1 by construction). After rearranging terms, you obtain the
ECM representation proper:
where $c' = \frac{B(1)}{A(1)}$ contains the long-run multipliers. In other words, the variation
of y t over time is expressed as the sum of three components:
Instrumental Variables
The arguments I presented in chapter 3 should have convinced the reader that
OLS is an excellent solution to the problem of estimating linear models of the
kind
y = Xβ + ε,
where $\varepsilon$ is defined as $y - E[y|X]$, with the appropriate adjustments for dynamic models; the derived property $E[\varepsilon|X] = 0$ is the key ingredient for guaranteeing
consistency of OLS as an estimator of β . With some extra effort, we can also
derive asymptotic normality and have all the hypothesis testing apparatus at
our disposal.
In some cases, however, this is not what we need. What we have implicitly
assumed so far is that the parameters of economic interest are the same as the
statistical parameters that describe the conditional expectation (or functions
thereof, like for example marginal effects or multipliers in dynamic models).
Sometimes, this might not be the case. As anticipated in section 3.6, this
happens when the model we have in mind contains explanatory variables that,
in common economics parlance, are said to be endogenous. In the next section,
I will give you a few examples where the quantities of interpretative interest are
not computable from the regression parameters. Hence, it should come as no
surprise that OLS is not a usable tool for this purpose: this is why we’ll want to
use a different estimator, known as instrumental variables estimator, or IV for
short.
6.1 Examples
6.1.1 Measurement error
Measurement error is what you get when one or more of your explanatory variables are measured imperfectly. Suppose you have the simplest version of a linear model, where everything is a scalar:
$$y_i = x_i^*\beta + \varepsilon_i \qquad (6.1)$$
$$x_i = x_i^* + \eta_i \qquad (6.2)$$
so
$$\hat\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i (x_i\beta + u_i)}{\sum_{i=1}^n x_i^2} = \beta + \frac{\sum_{i=1}^n x_i u_i}{\sum_{i=1}^n x_i^2},$$
where $u_i = \varepsilon_i - \beta\eta_i$. The term $E[x_i u_i]$, however, is not zero:
$$E[x_i u_i] = E\left[(x_i^* + \eta_i)(\varepsilon_i - \beta\eta_i)\right] = E[x_i^*\varepsilon_i] - \beta E[x_i^*\eta_i] + E[\eta_i\varepsilon_i] - \beta E[\eta_i^2] = -\beta\sigma^2_\eta.$$
If we define $Q = E[x_i^2]$, clearly
$$\hat\beta \stackrel{p}{\longrightarrow} \beta - \frac{\beta\sigma^2_\eta}{Q} = \beta\left(1 - \frac{\sigma^2_\eta}{Q}\right) \ne \beta.$$
It can be proven that $0 < \sigma^2_\eta < Q$,1 so two main conclusions can be drawn from the equation above: first, the degree of inconsistency of OLS is proportional to the size of the measurement error $\sigma^2_\eta$ relative to Q; second, the asymptotic bias is such that $|\mathrm{plim}(\hat\beta)| < |\beta|$; that is, the estimated effect is smaller (in absolute value) than the true one, a phenomenon known as attenuation.
be zero, but we just showed it isn’t. Since OLS is programmed to estimate the
parameters of a conditional expectation, you can’t expect it to come up with
anything else.
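Attenuation is easy to visualise by simulation; here is a minimal hansl sketch in which every number is made up (true β = 1, measurement-error variance 0.49, so the plim of the OLS slope is 1 − 0.49/1.49 ≈ 0.67):

    nulldata 1000
    set seed 4321
    series xstar = normal()               # the "true" regressor
    series x     = xstar + 0.7*normal()   # observed with measurement error
    series y     = xstar + normal()       # true beta = 1
    ols y const x                         # the slope should be biased towards 0
    printf "theoretical plim of the slope: %g\n", 1 - 0.49/1.49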
$$C = C_0 + cY,$$
with 0 < c < 1. As the reader knows, c is the "marginal propensity to consume", which is a key ingredient in mainstream Keynesian macroeconomics.
In the 1950s, few people would dissent from the received wisdom: one of
them was Milton Friedman, who would argue that c should not be less than 1 (at
least in the long run), since the only thing income is good for is buying things.
Over the span of your life, it would be silly to save money unconditionally: a
rational individual with perfect foresight should die penniless.3
Back in the day, economists thought of measuring c by running regressions
on the consumption function, and regularly found estimates that were signif-
icantly smaller than 1. Friedman, however, put forward a counter-argument,
based on the “permanent income” concept: consumption is not based on cur-
rent income, but rather on a concept of income that takes into account your ex-
pectations about the future. For example, if you knew with certainty that you’re
going to inherit a disgustingly large sum from a moribund distant uncle, you
would probably start squandering money today (provided, of course, you find
somebody willing to lend you money), far beyond your level of current income.
In this case, your observed actual income x i does not coincide with your
permanent income x i∗ (which is unobservable), and estimated values of c lower
than 1 could well be the product of attenuation.
than to seek more provisions for the journey, the less of the road remains?" Marcus Tullius Cicero, De senectute.
4 Klein and Haavelmo got the Nobel Prize for their work in 1980 and 1989, respectively.
$$q_t = \alpha_0 - \alpha_1 p_t + u_t \qquad (6.4)$$
$$p_t = \beta_0 + \beta_1 q_t + v_t, \qquad (6.5)$$
equation (6.4) is the demand equation (quantity at time t
as a function of price at time t ), (6.5) is supply (price as a function of quan-
tity); the two disturbance terms u t and v t represent random shocks to the two
curves. For example, u t could incorporate random fluctuations in demand
due to shifting customer preferences, fluctuations in disposable income and so
forth; v t , instead could be non-zero because of productivity shifts due to ran-
dom events (think for example weather for agricultural produce). Assume that
E [u t ] = E [v t ] = 0.
If you considered the two equations separately, one may think of estimat-
ing their parameters by using OLS, but this would be a big mistake, since the
“systematic part” of each of the two equations is not a conditional expectation.
An easy way to convince yourself is simply to consider that if $E[q_t|p_t]$ is upward (downward) sloping, the correlation between $q_t$ and $p_t$ must be positive (negative), and therefore there's no way the reverse conditional expectation $E[p_t|q_t]$ can be downward (upward) sloping. Since the demand function goes down and the supply function goes up, at least one of them cannot be a conditional expectation.
However, a more rigorous proof can be given: take the demand curve (6.4):
if the expression $(\alpha_0 - \alpha_1 p_t)$ were in fact the conditional expectation of $q_t$ given $p_t$, then $E[u_t|p_t]$ should be 0. Now substitute (6.4) into equation (6.5):
$$\begin{aligned}
p_t &= \beta_0 + \beta_1(\alpha_0 - \alpha_1 p_t + u_t) + v_t = (\beta_0 + \beta_1\alpha_0) - (\beta_1\alpha_1)\, p_t + (v_t + \beta_1 u_t) \;\Rightarrow\; \\
(1 + \beta_1\alpha_1)\, p_t &= (\beta_0 + \beta_1\alpha_0) + (v_t + \beta_1 u_t) \;\Rightarrow\; \qquad (6.6) \\
p_t &= \pi_1 + \eta_t, \qquad (6.7)
\end{aligned}$$
where the constant $\pi_1$ is $\frac{\beta_0 + \beta_1\alpha_0}{1 + \beta_1\alpha_1}$ and $\eta_t = \frac{v_t + \beta_1 u_t}{1 + \beta_1\alpha_1}$ is a zero-mean random variable. The covariance between $p_t$ and $u_t$ is easy to compute:
$$\mathrm{Cov}[p_t, u_t] = E[p_t \cdot u_t] = E[u_t\pi_1 + u_t\eta_t] = 0 + \frac{E\left[u_t\cdot(v_t + \beta_1 u_t)\right]}{1 + \beta_1\alpha_1} = \frac{\mathrm{Cov}[v_t, u_t] + \beta_1 V(u_t)}{1 + \beta_1\alpha_1}$$
E[q_t | p_t] = γ_0 + γ_1 p_t,
E[X′(y − Xβ)] = 0.   (6.8)
X′(y − Xβ̂) = 0.   (6.9)
which corresponds to the first-order conditions for the minimisation of the sum of squared residuals (see section 1.3.2, especially equation (1.10)); note that equation (6.9) can be seen as the sample equivalent of equation (6.8). The fact that
the OLS statistic β̂ works quite nicely as an estimator of its counterpart β just
agrees with common sense.
If, on the contrary, the parameter of interest β satisfies an equation other
than (6.8), then we may proceed by analogy and use, as an estimator, a statistic
β̃ that satisfies the corresponding sample property. In this chapter, we assume
we have a certain number of observable variables W for which
E[W′(y − Xβ)] = 0.   (6.10)
W′ (y − Xβ̃ ) = 0. (6.11)
The variables W are known as instrumental variables, or, more concisely, in-
struments. The so-called “simple” IV estimator can then be defined as follows:
if we had a matrix W, of the same size as X, satisfying (6.10), then we may define
a statistic β̃ such that (6.11) holds:
W′ (y − Xβ̃ ) = 0 =⇒ W′ X · β̃ = W′ y.
E[εε′ | W] = σ² I.
E[W′εε′W | W] = σ² W′W = σ² Ω.
5 In fact, this assumption is not strictly necessary, but makes for a cleaner exposition.
v = W′y,   C = W′X,   e = W′ε
(with dimensions m×1, m×k and m×1, respectively);
v = Cβ + e,   (6.13)
Equation (6.13) may be seen as a linear model where the disturbance term has
zero mean and covariance matrix σ2 Ω. The number of explanatory variables is
k (the column size of X) but the peculiar feature of this model is that the number
of “observations” is m (the column size of W).
Since Ω is observable (up to a constant), we may apply the GLS estimator
(see 4.2.1) to (6.13) and write
β̃ = \left[C′Ω^{-1}C\right]^{-1} C′Ω^{-1}v = \left[X′W(W′W)^{-1}W′X\right]^{-1} X′W(W′W)^{-1}W′y = (X′P_W X)^{-1} X′P_W y.   (6.14)
as early as 1953 by the Dutch genius Henri Theil. But it was Sargan who created the modern approach, in an article that appeared in 1958.
σ̃² = \frac{ε̃′ε̃}{n}.
As shown in section 6.A.1, it can be proven that, under the set of assumptions I
just made, the statistics β̃ and σ̃2 are CAN estimators. Therefore, the whole test-
ing apparatus we developed in Chapter 3 can be applied without modifications
since
\sqrt{n}(β̃ − β) \xrightarrow{d} N(0, V).
The precise form of the asymptotic covariance matrix V is not important here;
see 6.A.1. What is important in practice is that, under homoskedasticity, we have
an asymptotically valid matrix we can use for hypothesis testing, which is σ̃²(X′P_W X)^{-1}.
In more general cases, robust alternatives (see section 4.2.2) are available.
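Just to make the algebra concrete, here is a minimal hansl (gretl) sketch of the formulae above; it assumes that y, X and W are already available as matrices of dimensions n × 1, n × k and n × m, so it is nothing more than a transcription of (6.14) and of the variance estimator, not the way you would normally compute IV in practice (the tsls command does all of this, and more, for you).

# minimal matrix sketch of eq. (6.14); y, X and W are assumed to exist
# already as n x 1, n x k and n x m matrices
matrix A    = X'W * invpd(W'W)       # X'W (W'W)^{-1}
matrix XPX  = A * W'X                # X' P_W X
matrix beta = invpd(XPX) * A * W'y   # the GIVE estimator, eq. (6.14)
matrix e    = y - X*beta             # IV residuals
scalar s2   = e'e / rows(y)          # sigma-tilde squared
matrix V    = s2 * invpd(XPX)        # covariance matrix usable for Wald tests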
Just like OLS, the IV estimator may be defined as the solution of an optimisation problem:
β̃ = \mathrm{Argmin}_{β ∈ R^k} \; ε(β)′ P_W ε(β)
(compare the above expression with equation (1.14)). In this book, we will not make much use of this property. However, defining β̃ in this way would be the first step towards seeing it as a member of a very general category of estimators known as GMM (Generalised Method of Moments) estimators, which includes practically all estimators used in modern econometrics. The theory of GMM is beautiful: as a starting point, I heartily recommend Hayashi (2000).
2. \frac{1}{n} \sum_i x_i w_i′ \xrightarrow{p} A, where A is a k × m matrix with rank k.
mean that all its elements are nonzero. Some of the explanatory variables may
be exogenous; in fact, in empirical models the subset of explanatory variables
that may be suspected of endogeneity is typically rather small. Therefore, the
exogenous subset of xi is perfectly adequate to serve as an instrument, obvious
examples being deterministic variables such as the constant. What the order
condition really means is that, for each endogenous explanatory variable, we
need at least one instrument not also used as a regressor.
Clearly, m ≥ k is not a sufficient condition for A to have rank k. For exam-
ple, A may be square but have a column full of zeros. This would happen, for
example, if the corresponding instrument was independent of all regressors x.
The generalisation of this idea leads to the concept of relevance.7 The instru-
ments must not only be exogenous, but must also be related to the explanatory
variables.8
Note a fundamental difference between the order condition and the rele-
vance condition: the order condition can be checked quite easily (all you have
to do is count the variables); and even if you can’t be bothered with checking, if
the order condition fails the IV estimator β̃ is not computable, since (X′ PW X) is
singular and your software will complain about this.
The relevance condition, instead, is much trickier to spot, since (with probability 1) \frac{1}{n}\sum_i x_i w_i′ will have rank k even if A doesn't. Hence, if rk(A) < k, you will
now focus on the denominator of the expression above: clearly, the probability that n^{-1}\sum_{i=1}^n w_i x_i = 0 is 0, so the probability that β̃ exists is 1 for any finite n. However, if you compute its probability limit, you see the problem very quickly: if w_i is not relevant, then n^{-1}\sum_{i=1}^n w_i x_i \xrightarrow{p} A = 0 (which has, of course, rank 0
For this example, we are going to use a great classic from applied labour eco-
nomics: the “Mincer wage equation”; the idea is roughly to have a model like
the following:
y_i = z_i′ β_0 + e_i β_1 + ε_i   (6.15)
where y i is the log wage for an individual, e i is their education level and the
vector zi contains other characteristics we want to control for (gender, work ex-
perience, etc). The parameter of interest is β1 , which measures the returns to
education and that we would expect to be positive.
The reader, at this point, may dimly recall that we already estimated an equation like this, in section 1.5. Back then, we did not have the tools yet for interpreting the results from an inferential perspective, but the results were in agreement with common sense. Why would we want to go back to a wage equation here?
The literature has long recognised that education may be endogenous, be-
cause the amount of education individuals receive is (ordinarily) decided by the
individuals themselves. In practice, if the only reason to get an education is to
have access to more lucrative jobs, individuals solve an optimisation problem
where they decide, among other things, their own education level. This gives
rise to an endogeneity problem.9
For the reader’s convenience, I’ll reproduce here OLS estimates in Table 6.1.
If education is endogenous, as economic theory suggests it may be, then the "re-
turns to education” parameter we find in the OLS output (about 5.3%) is a valid
estimate of the marginal effect of education on the conditional expectation of
wage, but is not a valid measure of the causal effect of education on wages, that
is the increment in wage that an individual would have had if they had received
an extra year of education.
I will now estimate the same equation via IV: the instruments I chose for
this purpose are (apart from the three regressors other than education, which
I take as exogenous) two variables that I will assume, for the moment, as valid
instruments:
9 The literature on this topic is truly massive. A good starting point is Card (1999).
• the individual’s age: the motivation for this is that you don’t choose when
you’re born, and therefore age can be safely considered exogenous; at the
same time, regulations on compulsory education have changed over time,
so it is legitimate to think that older people may have spent less time in
education, so there are good chances age may be relevant.
Table 6.2 shows the output from IV estimation; in fact, gretl (that is what I
used) gives you richer output than this, but I’ll focus on the part of immediate
interest.
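For the record, the gretl command behind an output like Table 6.2 would look more or less like the two lines below; apart from educ, the series names are hypothetical stand-ins for the actual SHIW variables (log wage, gender, work experience, age and parents' education).

list EXOG = const gender exper          # the three regressors taken as exogenous
tsls lwage EXOG educ ; EXOG age paredu  # IV estimation; the richer output mentioned
                                        # above (Hausman, Sargan and weak-instrument
                                        # diagnostics) comes with it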
As you can see from the output, you get substantially different coefficients: not only do the returns to education appear to be quite a bit stronger (7.9% versus 5.3%), but the other coefficients become larger too. Clearly, the
question at this point becomes: OK, the two methods give different numbers.
But are they significantly different? The tool we use to answer this question is
the so-called Hausman test, which is the object of the next section.
tainty. One may argue that, if we have instruments whose quality we're confident about, we might as well stay on the safe side and use IV anyway. If we do, however, we may be using an inefficient estimator: it can be proven that, if X is exogenous, OLS is more efficient than IV under standard conditions.10
The Hausman test is based on the idea of comparing
the two estimators and checking if their difference is sta-
tistically significant.11 If it is, we conclude that OLS and
IV have different probability limits, and therefore OLS can’t
be consistent, so our estimator of choice has to be IV. Oth-
erwise, there is no ground for considering X endogenous,
and we may well opt for OLS, which is more efficient.12
This idea can be generalised: if you have two estima-
tors, one of which (θ̃) is robust to some problem and the
other one isn’t (θ̂), the difference δ = θ̂ − θ̃ should converge J ERRY H AUSMAN
to a non-zero value if the problem is there, and to 0 otherwise. Therefore, we
could set up a Wald-like statistic
H = δ′ \left[V(δ)\right]^{-1} δ;   (6.16)
where V (δ) is a consistent estimator of AV [δ], and, under some standard reg-
ularity conditions, it can be proven that H is asymptotically χ2 under the null
10 If you’re curious, the proof is in section 6.A.2.
11 As always, there is some paternity debate: some people call this the Wu-Hausman test; some
others, the Durbin-Wu-Hausman test. While it is technically true that the same test statistic had
been independently derived before (by Durbin in 1954 and by Wu in 1973), the idea became main-
stream only after the publication of Hausman (1978).
12 I know what you’re thinking: this is the same logic we used in section 4.2.3 for the White test
therefore,
H = \frac{(β̃ − β̂)′ \left[ (X′P_W X)^{-1} − (X′X)^{-1} \right]^{-1} (β̃ − β̂)}{σ̂²}.   (6.17)
In practice, actual computation of the test is performed even more simply,
via an auxiliary regression: consider the model
y = Xβ + X̂γ + residuals. (6.18)
where X̂ ≡ PW X. By the Frisch–Waugh theorem (see section 1.4.4)
γ̂ = \left[X̂′ M_X X̂\right]^{-1} X̂′ M_X y;
now rewrite the two matrices on the right-hand side of the equation above as
X̂′ M_X X̂ = X̂′X̂ − X̂′P_X X̂ = X̂′X̂ − X̂′X (X′X)^{-1} X′X̂ = (X̂′X̂) \left[ (X̂′X̂)^{-1} − (X′X)^{-1} \right] (X̂′X̂)   (6.19)
X̂′ M_X y = X̂′y − X̂′X (X′X)^{-1} X′y = (X̂′X̂)(β̃ − β̂)   (6.20)
13 The number of degrees of freedom for the test is not as straightforward to figure out as it may
Example 6.1
Let us go back to the wage equation example illustrated in Section 6.3. While
commenting on Table 6.2, I mentioned the fact that my software of choice (gretl)
offers richer output than what I reported. Part of it is the outcome of the Haus-
man test, that compares IV vs OLS:
Hausman test -
Null hypothesis: OLS estimates are consistent
Asymptotic test statistic: Chi-square(1) = 50.2987
with p-value = 1.32036e-12
As you can see, our original impression that the two sets of coefficients were
substantially different was definitely right. The p-value for the test leads to re-
jecting the null hypothesis very strongly. Therefore, IV and OLS have different
limits in probability, which we take as a sign that education is, in fact, endoge-
nous.
Note that the test statistic is matched against a χ2 distribution with 1 degree
of freedom, because there is one endogenous variable in the regressors list (that
is, education).
The reason is that the β̃ statistic may be computed by two successive ap-
plications of OLS, called the two “stages”.14 In the era when computation was
expensive, this was a nice trick to calculate β̃ without the need for other soft-
ware than OLS, but seeing IV as the product of a two-stage procedure has other
advantages too.
In order to see what the two stages are, define X̂ = PW X and rewrite (6.14) as
follows:
β̃ = (X̂′ X̂)−1 X̂′ y, (6.22)
The matrix X̂ contains, in the j -th column, the fitted value of a regression of the
j -th column of X on W; the regression
xi = Πwi + ui (6.23)
is what we call the first stage regression. In the second stage, you just regress y
on X̂: the OLS coefficient equals β̃ . Note: this is a numerically valid procedure
for computing β̃ , but the standard errors you get are not valid for inference. This
is because second stage residuals are e = y − X̂β̃, which is a different vector from the IV residuals ε̃ = y − Xβ̃. Consequently, the statistic \frac{e′e}{n} does not provide a valid estimator of σ², which in turn makes the estimated covariance matrix invalid.
Readers who liked the geometrical interpretation of OLS as a projection might like to consider a different way of writing equation (6.22), that is
β̃ = (X̂′X)^{-1} X̂′y,
from which you have that the fitted values from the GIVE estimator can be written as
ỹ = Xβ̃ = X(X̂′X)^{-1}X̂′y = Q_{X̂,X} y.
The matrix Q_{X̂,X} is square and idempotent (but not necessarily symmetric) and performs what is called an oblique projection.
So, if the only computing facility you have is OLS (which was often the case
in the 1950s and 1960s), you can compute the IV estimator via a repeated appli-
cation of OLS. Moreover, you don’t really have to run as many first-stage regres-
sions as the number of regressors. You just have to run one first-stage regression
for each endogenous element of X (recall the discussion in subsection 6.4 on the
degrees of freedom for the Hausman test).
Example 6.2
Let’s estimate the same model we used in section 6.3 via the two-stage method:
the output from the first stage regression is in Table 6.4: as you can see, the de-
pendent variable here is education, while the explanatory variables are the full
instrument matrix W. There is not much to say about the first stage regression,
except noting that the two “real” instruments (age and parents’ education) are
both highly significant. This will be important in the context of weak instru-
ments (see section 6.7.2).
14 In fact, the word Henri Theil used when he invented this method was “rounds”, but subse-
Once the first stage regression is computed, we save the fitted values into
a new variable called hat_educ. Then, we replace the original education educ
variable with hat_educ in the list of regressors and perform the second stage.
Results are in Table 6.5; a comparison of these results with those reported in
Table 6.2 reveals that:
1. the coefficients are identical;
2. the standard errors are not, because the statistic you obtain by dividing the SSR from the second-stage regression (209.316) by the number of observations (1917) is not a consistent estimator for σ²; therefore, the standard errors reported in Table 6.5 differ from the correct ones by a constant scale factor (1.0387 in this case).
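In hansl, the two-stage computation just described takes a handful of lines; as before, all series names except educ and hat_educ are hypothetical stand-ins for the SHIW variables.

# first stage: education on the full instrument set
ols educ const gender exper age paredu
series hat_educ = $yhat                 # save the fitted values
# second stage: replace educ with hat_educ
ols lwage const gender exper hat_educ   # same coefficients as IV, but the
                                        # standard errors are not valid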
yi = x i · β + εi (6.24)
xi = w i · π + ui , (6.25)
where equation (6.25) is a “proper” linear model, and Cov [u i , εi ] is some real
number, not necessarily 0. If w i is exogenous, the only way x i can be correlated
with εi is if Cov [u i , εi ] ̸= 0. It is easy to show that OLS will overestimate β if
Cov [u i , εi ] > 0 and underestimate it if Cov [u i , εi ] < 0. If, however, we defined
νi = εi − E [εi |u i ]
u i − û i = (x i − w i π) − (x i − w i π̂) = (π̂ − π) w i
15 Lazy writers like myself love the sentence: “the proof is left to the reader as an exercise”.
192 CHAPTER 6. INSTRUMENTAL VARIABLES
should go to 0 asymptotically, since π̂ \xrightarrow{p} π; on these premises, the possibility of using û_i in place of the "true" u_i is tempting.
From a computational point of view, hence, the control function approach
differs from the traditional two-stage method only in the second stage. Once you
have performed the first stage, you use, in the second stage, the residuals from
the first stage and add them as extra explanatory variables to the main equation.
Call E the first-stage residuals and perform an OLS regression of y on X and E
together:
y = Xβ + Eθ + ν ; (6.27)
the estimates we get have a very nice interpretation.
Let’s begin with β : by the Frisch-Waugh theorem (see section 1.4.4), the OLS
estimate of β is
(X′ ME X)−1 X′ ME y;
now focus on the matrix X′ ME :
X′ ME = X′ − X′ PE = X′ − X′ E(E′ E)−1 E′ .
X′ ME = X′ − X′ MW X(X′ MW X)−1 X′ MW = X′ − X′ MW = X′ PW ;
where the last equality comes from the OLS residuals ê being orthogonal to X;
therefore
E′M_X y = −X′P_W y + X′P_W X β̂ = X′P_W X (β̂ − β̃),
so that
θ̂ = \left[E′M_X E\right]^{-1} \left[X′P_W X\right] (β̂ − β̃);
since the matrix \left[E′M_X E\right]^{-1} \left[X′P_W X\right] is invertible, θ̂ can be zero if and only if
β̂ = β̃. Therefore, the hypothesis H_0 : θ = 0 is logically equivalent to the null
hypothesis of the Hausman test, and the test can be performed by a simple zero
restriction on θ̂ . If (as often happens) θ is a scalar, the result of the Hausman test
is immediately visible as the significance t -test associated with that coefficient.
A very nice feature of the control function approach is that it generalises very naturally to settings where our estimator of choice is not a least squares estimator, which happens quite often in applied work. But this book is entitled "basic econometrics", and I think I'll just stop here.
Example 6.3
Using the SHIW data again, after storing the residuals from the first-stage regres-
sion (see Table 6.4) under the name resid, you can run an OLS regression like
the one in Table 6.1, with resid added to the list of regressors. The results are in
Table 6.6. Again, the coefficients for the original regressors are β̃ and again, the
standard errors are not to be trusted (they’re a bit smaller than the correct ones,
listed in Table 6.2). The t -test for the resid variable, instead, is interpretable as
a perfectly valid Hausman test, and the fact that we strongly reject the null again
is no coincidence.
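For completeness, here is a sketch of the control-function route in hansl, with the same caveat as before on the hypothetical series names (only educ and resid appear in the text).

ols educ const gender exper age paredu   # first stage
series resid = $uhat                     # first-stage residuals (the control function)
ols lwage const gender exper educ resid  # main equation plus resid: the coefficients on
                                         # the original regressors equal the IV ones, and
                                         # the t-ratio on resid is the Hausman test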
y_i = x_i^* β + ε_i
x_i = x_i^* + η_i
w_i = x_i^* + ν_i
is a consistent estimator of β.
One famous application of this principle is provided in Griliches (1976), a
landmark article in labour economics, where the author has two measurements
of individual ability and uses one for instrumenting the other.
q_t = α_0 − α_1 p_t + u_t
p_t = β_0 + β_1 q_t + v_t.
As we proved in section 6.1.2, the systematic parts of these two equations are not conditional means, so there's no way we can estimate their parameters consistently via OLS.
On the other hand, we can use (6.7) to deduce that E[p_t] = π_1; clearly, the
Can we use these two parameters to estimate the structural ones? The an-
swer is no: the relationship between the (π0 , π1 ) pair and the structural parame-
ters is
π_0 = \frac{α_0 − β_0 α_1}{1 + β_1 α_1} \qquad π_1 = \frac{β_0 + β_1 α_0}{1 + β_1 α_1},
which is a system of 2 equations in 4 unknowns; as such, it has infinitely many
solutions. This is exactly the under-identification scenario we analysed in sec-
tion 2.5.
Γ = \begin{bmatrix} 1 & α_1 \\ −β_1 & 1 \end{bmatrix} \qquad B = \begin{bmatrix} α_0 \\ β_0 \end{bmatrix}.
Equation (6.28) is known as the structural form of the system, because its pa-
rameters have a behavioural interpretation and are our parameters of interest.
By pre-multiplying (6.28) by Γ−1 , you get the so-called reduced form:
yt = Πxt + ut , (6.29)
where Π = Γ−1 B and ut = Γ−1 εt . In our example, the matrix Π is a column vector,
containing π0 and π1 .
If x_t is exogenous, then Cov[x_t, ε_t] = 0, so the correlation between x_t and u_t is zero; hence, OLS is a consistent estimator for the parameters of the reduced form. However, by postmultiplying (6.29) by x_t′ you get
y_t x_t′ = Π x_t x_t′ + u_t x_t′,
which implies E[y_t x_t′] = Π E[x_t x_t′]. Ordinarily, this matrix should not contain
zeros; if variables were centred in mean, it would be the covariance matrix be-
tween the vector yt and the vector xt . Therefore, each element of xt is correlated
with each yt despite being uncorrelated with εt . In other words, xt is both rele-
vant and exogenous and, as such, is a perfect instrument.
In the example above, xt contains only the constant term, and the reduced
form looks like:
\begin{bmatrix} q_t \\ p_t \end{bmatrix} = \begin{bmatrix} π_0 \\ π_1 \end{bmatrix} \cdot 1 + u_t;
it should be clear where under-identification comes from: in the demand equa-
tion you have 2 regressors (the constant and p t ) but only one instrument (the
constant), and the same goes, mutatis mutandis, for the supply equation.
Consider now a different formulation, where:
q_t = α_0 − α_1 p_t + α_2 y_t + u_t   (6.30)
p_t = β_0 + β_1 q_t + β_2 m_t + v_t,   (6.31)
where we use the two new variables y t , the per-capita income at time t and m t ,
the price of raw materials at time t ; assume both are exogenous.
In this case, both equations are estimable via IV, because we have three re-
gressors and three instruments for each (the same for both: constant, y t and
\frac{ε′ P_W ε}{σ²} \xrightarrow{d} χ²_m,   (6.32)
where m is the size of wi . Unfortunately, ε is unobservable, and therefore the
quantity above is not a statistic and cannot be used as a test.
The idea of substituting disturbances with residuals like we did in section
6.5.1 takes us to the Sargan test. Its most important feature is that this test has
a different asymptotic distribution than (6.32), since the degrees of freedom of
the limit χ2 is not m (the number of instruments), but rather m − k, where k is
the number of elements in β : in other terms, the over-identification rank (see
section 6.2.1). In formulae,
S = \frac{ε̃′ P_W ε̃}{σ̂²} \xrightarrow{d} χ²_{m−k}.   (6.33)
This result may appear surprising at first. Consider, however, that under ex-
act identification the numerator of S is identically zero,17 so, in turn, the S statis-
tic is identically 0, not a χ2m variable. Can this result be generalised?
17 Blitz proof: if m = k, then β̃ = (W′X)^{-1}W′y. Therefore, P_W ε̃ = P_W(y − Xβ̃); however, observe that P_W Xβ̃ = W(W′W)^{-1}W′X(W′X)^{-1}W′y = W(W′W)^{-1}W′y = P_W y. As a consequence, P_W ε̃ = P_W y − P_W y = 0.
Take the conditional expectation E[y_i | w_i]; assuming linearity, if w_i is exogenous we could estimate its parameters via OLS in a model like
y_i = w_i′ π + u_i;   (6.34)
X = WΠ + E
so that the i-th row of X can be written as x_i′ = w_i′ Π + e_i′; now substitute in the structural equation:
y_i = w_i′ (Πβ) + (ε_i + e_i′ β);   (6.35)
clearly, the two models (6.34) and (6.35) become equivalent only if π = Πβ ; in
fact, the expression Πβ can be seen as a restricted version of π , where the con-
straint is that π must be a linear combination of the columns of Π (or, more
concisely, that π ∈ Sp (Π)).
The Sargan test is precisely a test for those restrictions; this raises three questions:
1. how do we compute the test statistic in practice?
2. what is its limit distribution?
3. what exactly is the null hypothesis we are testing?
Number 1 is quite easy: the IV residuals εe are the residuals from the restricted
model. All we have to do is apply the LM principle (see Section 3.5.1) and regress
those on the explanatory variables from the unrestricted model. Compute nR 2 ,
and your job is done. If you do, you end up exactly with the statistic I called S in
equation (6.33).
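In hansl, the whole procedure boils down to a few lines; series names other than educ are hypothetical, and I'm assuming the $trsq accessor returns n·R² from the last regression.

tsls lwage const gender exper educ ; const gender exper age paredu
series e_iv = $uhat                      # the IV residuals (restricted model)
ols e_iv const gender exper age paredu   # regress them on the full instrument set W
scalar S = $trsq                         # n * R^2, the statistic in (6.33)
pvalue X 1 S                             # m - k = 1 in this example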
As for its limit distribution, the fact that the number of degrees of freedom of
the χ2 limit distribution is m −k can be intuitively traced back to the fact that the
number of parameters of the unrestricted model (6.34) is m, while the number
of the restricted parameters is k. Therefore, the number of constraints is m−k.18
Now we can tackle point 3: the null hypothesis in the Sargan test is that the m relationships implicit in the equation π = Πβ are non-contradictory; if they were contradictory, it would mean that at least one element of the vector E[w_i ε_i] is non-zero,
18 A more rigorous argument goes as follows: if the restriction is true, then P_Π π = π, or, equivalently, M_Π π = 0. Since the rank of M_Π is m − k, we can write
M_Π π = UV′π = 0
where U and V are matrices with m rows and m − k columns (see section 1.A.3). So, the null hypothesis implicit in the restriction is H_0 : V′π = 0, which is a system of m − k constraints.
and therefore at least one instrument is not exogenous. Unfortunately, the test
cannot identify the culprit.
To clarify the matter, take for example a simple DGP where y i = x i β + εi and
we have two potential instruments, w 1,i and w 2,i . We can choose among three
possible IV estimators for β:
1. a simple IV estimator using w_{1,i} only (call it β_1);
2. a simple IV estimator using w_{2,i} only (call it β_2);
3. the GIVE estimator using both w_{1,i} and w_{2,i} (call it β_{12}).
Suppose β1 turns out to be positive (and very significant), and β2 turns out to
be negative (and very significant); clearly, there must be something wrong. At
least one between β1 and β2 cannot be consistent. The Sargan test, applied to
the third model, would reject the null hypothesis and inform us that at least one
of our two instruments is probably not exogenous. Hence, β12 would be cer-
tainly inconsistent, and we’d have to decide which one to keep between β1 and
β2 (usually, on the basis of some economic reasoning). If, on the other hand, we
were unable to reject the null, then we would probably want to use β12 for effi-
ciency reasons, since it’s the one that incorporates all the information available
from the data.
In view of these features, the Sargan test is often labelled an overidentification test, since what it can do is, at most, find whether there is a contradiction between the m assumptions we make when we say "I believe instrument i is
exogenous”.
Sargan over-identification test -
Null hypothesis: all instruments are valid
Test statistic: LM = 0.138522
with p-value = P(Chi-square(1) > 0.138522) = 0.709755
Example 6.4
Table 6.7 is, again, an excerpt from the full output that gretl gives you after IV
estimation (the main table is 6.2) and shows the Sargan test for the wage equa-
tion we’ve been using as an example; in this case, we have 1 endogenous vari-
able (education) and two instruments (age and parents’ education), so the over-
identification rank is 2 − 1 = 1. The p-value for the test is over 70%, so the null
hypothesis cannot be rejected. Hence, we conclude that our instruments form a
coherent set and the estimates that we would have obtained by using age alone
or parents’ education alone would not have been statistically different from one
another. Either our instruments are all exogenous or they are all endogenous,
but in the latter case they would all be wrong in exactly the same way, which
seems quite unlikely.
In fact, the point above is subtler than it looks: estimating the rank of a matrix is not exactly trivial, because the rank is an integer, so most of our intuitive ideas on the relationship between an estimator and the unknown parameter (that make perfect sense when the latter is a real number) break down. Constructing a test for the hypothesis rk(A) = k is possible, but requires some mathematical tools I chose not to include in this book. The interested reader may want to google for "canonical correlations".
The practical problem we are often confronted with is that, although an in-
strument is technically relevant, its correlation with the regressors could be so
small that finite-sample effects may become important. In this case, that instru-
ment is said to be weak.19
The problem is best exemplified by a little simulation study: consider the
same model we used in section 6.5.1:
y_i = x_i · β + ε_i   (6.36)
x_i = w_i · π + u_i   (6.37)
\begin{bmatrix} ε_i \\ u_i \end{bmatrix} \sim N\left(0, \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}\right)
with the added stipulation that w i ∼ U (0, 1). Of course, equation (6.36) is our
equation of interest, while (6.37) is the “first stage” equation. Since εi and u i
are correlated, then x i is itself correlated with εi , and therefore endogenous;
however, we have the variable w i , which meets all the requirements for being
a perfectly valid instrument: it is exogenous (uncorrelated with εi ) and relevant
(correlated with x i ), as long as the parameter π is nonzero.
However, if π is a small number the correlation between x i and w i is very
faint, so w i is weak, despite being relevant. In this experiment, the IV estimator
is simply
β̃ = (W′X)^{-1} W′y = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i x_i}:
if you scale the denominator by 1/n, it's easy to see that its probability limit is non-zero; however, its finite-sample distribution could well be spread out over a wide interval of the real axis, so you can end up dividing the numerator by
19 Compared to the rest of the material contained in this book, inference under weak instru-
ments is a fairly recent strand in econometric research. A recent review article I heartily recom-
mend is Andrews et al. (2019), but a fairly accessible introductory treatment can also be found
in Hill et al. (2018), Chapter 10. Chapter 12 of Hansen (2019) is considerably more technical, but
highly recommended.
[Figure: simulated finite-sample distributions of the OLS and IV estimators; left panel π = 1, right panel π = 1/3]
If instead you set π = 1/3, the simulation gives you the results plotted in the
right-hand panel. Asymptotically, nothing changes: however, the finite-sample
distribution of β̃ is worrying. Not only is its dispersion rather large (and there are
quite a few cases when the estimated value for β is negative): its distribution is
very far from being symmetric, which makes it questionable to use asymptotic
normality for hypothesis testing. I’ll leave it to the reader to figure out what
happens if the instrumental variable w i becomes very weak, which is what you
would get by setting π to a very small value, such as π = 0.1.
More generally, the most troublesome finite-sample consequences of weak
instruments are:
• the IV estimator is severely biased; that is, the expected value of its finite
sample distribution may be very far from the true value β ;20
• even more worryingly, the asymptotic approximations we use for our test
statistics may be very misleading.
20 I should add “provided it exists”; there are cases when the distribution of the IV estimator has
no finite moments.
Example 6.5
The weak instrument test for the example on the SHIW data we’ve been using in
this chapter gives:
The first-stage F statistic for the wage equation, as reported by gretl, is 271.319, which is way above 10, so we don't have to worry.
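If you want to reproduce that number by hand, one way is to run the first-stage regression and test the joint significance of the excluded instruments (the usual caveat on hypothetical series names applies):

ols educ const gender exper age paredu   # first-stage regression
omit age paredu                          # joint F test for the excluded instruments:
                                         # this is the first-stage F statistic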
β̃ = (X′ PW X)−1 X′ PW y,
1. \frac{1}{n} \sum_{t=1}^n x_t w_t′ = \frac{X′W}{n} \xrightarrow{p} A, where rk(A) = k;
2. \frac{1}{n} \sum_{t=1}^n w_t w_t′ = \frac{W′W}{n} \xrightarrow{p} B, where B is invertible;
3. \frac{1}{n} \sum_{t=1}^n w_t ε_t = \frac{W′ε}{n} \xrightarrow{p} 0;
then β̃ = (X′P_W X)^{-1} X′P_W y \xrightarrow{p} β. The proof is a simple application of Slutsky's theorem:
β̃ = β + (X′P_W X)^{-1} X′P_W ε = β + \left[ \left(\frac{X′W}{n}\right) \left(\frac{W′W}{n}\right)^{-1} \left(\frac{W′X}{n}\right) \right]^{-1} \left(\frac{X′W}{n}\right) \left(\frac{W′W}{n}\right)^{-1} \left(\frac{W′ε}{n}\right)
so that
β̃ \xrightarrow{p} β + \left[ A B^{-1} A′ \right]^{-1} A B^{-1} · 0 = β.   (6.38)
It is instructive to consider the role played by the ranks of A and B ; the matrix
B must be invertible, because otherwise AB −1 A ′ wouldn’t exist. Since B is the
probability limit of the second moments of the instruments, this requirement is
equivalent to saying that all instruments must carry separate information, and
cannot be collinear.
Note that the requisite is only that the instruments shouldn't be collinear: the stronger requisite of independence is not needed. As a consequence, it is perfectly OK to use nonlinear transformations of one instrument to create additional ones. For example, if you have a variable w_i that you assume independent of ε_i, you can use as instruments w_i, w_i², log(w_i), . . . (provided of course that the transformed variables have finite moments).
This strategy is a special case of something called identification through nonlinearity; although it feels a bit like cheating (and is frowned upon by some), it is perfectly legitimate, at least asymptotically, as long as each transformation carries some extra amount of information.
that instruments wi must be relevant (see Section 6.2.2) for all the regressors
x_i. If \left[A B^{-1} A′\right] is not invertible, then the probability limit in (6.38) does not
exist. If, instead, it’s invertible, but very close to being singular (as in the case
of weak instruments — see Section 6.7.2), then its inverse will be a matrix with
inordinately large values. This is mainly a problem for the distribution of β̃ : if
we also assume
4. \frac{1}{\sqrt{n}} \sum_{t=1}^n w_t ε_t = \frac{W′ε}{\sqrt{n}} \xrightarrow{d} N(0, Q);
then \sqrt{n}(β̃ − β) \xrightarrow{d} N(0, Σ), where
Σ = \left[A B^{-1} A′\right]^{-1} A B^{-1} Q B^{-1} A′ \left[A B^{-1} A′\right]^{-1}.
Under homoskedasticity, Q = σ² B, and the expression above simplifies to
\sqrt{n}(β̃ − β) \xrightarrow{d} N\left(0, σ² \left[A B^{-1} A′\right]^{-1}\right).   (6.39)
So the precision of the IV estimator is severely impaired any time the matrix \left[A B^{-1} A′\right] is close to being singular.
The last thing is proving that σ̃² is consistent: from
ε̃ = X(β − β̃) + ε,
you get
ε̃′ε̃ = (β − β̃)′ X′X (β − β̃) + 2 (β − β̃)′ X′ε + ε′ε
and therefore
\frac{1}{n} ε̃′ε̃ = (β − β̃)′ \left(\frac{X′X}{n}\right) (β − β̃) + 2 (β − β̃)′ \left(\frac{X′ε}{n}\right) + \left(\frac{ε′ε}{n}\right).
σ̃² = \frac{1}{n} ε̃′ε̃ \xrightarrow{p} 0′ Q\, 0 + 2 · 0′ λ + σ² = σ²,
where Q = plim\left(\frac{X′X}{n}\right) and λ = plim\left(\frac{X′ε}{n}\right). Note that σ̃² is consistent even though
λ ̸= 0. Consistency of σ̃2 is important because it implies that we can use the
empirical counterparts of the asymptotic covariance matrix in equation (6.39)
and use σ̃2 (X′ PW X)−1 as a valid covariance matrix for Wald tests.
Q = plim\left(\frac{X′X}{n}\right) \qquad A = plim\left(\frac{W′X}{n}\right) \qquad B = plim\left(\frac{W′W}{n}\right);
Since σ² > 0, it is sufficient to prove that Δ is psd. Now define the vector z_i′ = [x_i′ \; w_i′] and consider the probability limit of its second moments:
C = plim\left(\frac{Z′Z}{n}\right) = \begin{bmatrix} Q & A′ \\ A & B \end{bmatrix}
H = \begin{bmatrix} I & −A′B^{-1} \end{bmatrix}
H C H′ = \begin{bmatrix} I & −A′B^{-1} \end{bmatrix} \begin{bmatrix} Q & A′ \\ A & B \end{bmatrix} \begin{bmatrix} I \\ −B^{-1}A \end{bmatrix} = Q − A′B^{-1}A = Δ,
Consider now the statistic θ̀(λ) = λθ̂ + (1 − λ)θ̃, where λ ∈ R. Obviously, θ̀(λ) is also a consistent estimator for any λ:
θ̀(λ) \xrightarrow{p} λθ + (1 − λ)θ = θ.
AV\left[θ̀(λ)\right] = \begin{bmatrix} λ & 1−λ \end{bmatrix} \begin{bmatrix} a & b \\ b & c \end{bmatrix} \begin{bmatrix} λ \\ 1−λ \end{bmatrix} = λ²a + 2λ(1−λ)b + (1−λ)²c.
λ^* = \frac{c − b}{a − 2b + c};
AV\begin{bmatrix} θ̂ \\ θ̃ \end{bmatrix} = \begin{bmatrix} a & a \\ a & c \end{bmatrix},
so
AV\left[θ̂ − θ̃\right] = \begin{bmatrix} 1 & −1 \end{bmatrix} \begin{bmatrix} a & a \\ a & c \end{bmatrix} \begin{bmatrix} 1 \\ −1 \end{bmatrix} = c − a = AV\left[θ̃\right] − AV\left[θ̂\right].
### main

set verbose off
nulldata 400
set seed 1234           # set random seed for replicability

# setup (reconstructed: the values below are assumptions of mine,
# chosen to match the simulation described in section 6.7.2)
scalar H = 1024         # number of replications (my choice)
scalar rho = 0.75       # correlation between eps and u, as in the text
scalar pi = 1           # first-stage coefficient: try 1/3 or 0.1 as well
series w = uniform()    # the instrument, w ~ U(0,1)
matrix b = zeros(H, 2)  # storage for the OLS and IV estimates

loop h=1..H --quiet
    # the structural disturbances (unit variance)
    series eps = normal()
    # the reduced form disturbances (unit variance,
    # correlated with eps by construction)
    series u = rho*eps + sqrt(1-rho^2)*normal()
    # generate x via the first-stage equation
    series x = pi*w + u
    # generate y via the structural equation (true beta = 1)
    series y = x + eps
    # estimate beta by OLS
    ols y x --quiet
    # store OLS estimate into the 1st column of b
    b[h,1] = $coeff
    # estimate beta by IV
    tsls y x ; w --quiet
    # store IV estimate into the 2nd column of b
    b[h,2] = $coeff
endloop
Panel data
7.1 Introduction
So far, we have made a sharp distinction between cross-sectional and time-
series datasets. In a cross-section, you observe a “screenshot” of many individ-
uals at a certain time; a time series, instead, observes one thing through time.
In panel datasets, you observe multiple individuals (that we will generally
call units) through time. Therefore, the typical element of a variable y will bear
a double subscript: y i ,t is the value for unit i at time t . In a parallel fashion,
the explanatory variables will be indexed similarly, as xi ,t . As a consequence, we
merge the two conventions used earlier and assume that i = 1 . . . n and t = 1 . . . T
so, for example, a typical excerpt of a panel dataset looks more or less like this:
id year y x z
..
.
451 2015 12 1 1
451 2016 14 1 0
451 2017 11 3 0
452 2010 12 5 0
452 2011 12 2 1
..
.
In this example, the “id” column identifies the different units, and the “year”
column identifies time, so the first row shown says that the value of y for unit
451 in the year 2015 is 12, or, in formulae, y 451,2015 = 12, x 452,2010 = 5, and so on.
In this chapter I will use the symbol N for the total number of observations.
From a practical point of view, a panel dataset may be balanced or unbalanced:
in the former case, you observe data over a common time range for each unit,
so you get valid data for each (i , t ) combination and N = n · T . Otherwise, some
rows may be missing, and not all time periods are available for all units, so N <
nT . This is the most common case in practice.
Typically, most panel datasets will contain data for many units for short time
periods: this situation is normally referred to as the “large n, small T ” case,
but other cases are possible. For example, macroeconomists regularly deal with
datasets where units are countries and the amount of data can be considerable
in the time dimension. In most microeconomic applications, however, you have
many individuals observed for short time spans. As we will see, this aspect be-
comes important for the asymptotic analysis of the estimators we have for panel
datasets.
In some cases, it makes sense to consider the factors that provoke the appearance or disappearance of a certain unit in the dataset. A classic example is firms going bankrupt. Of course, these random factors may interact with the Data Generating Process in very subtle ways. This phenomenon is known as sample attrition and in some cases may be very relevant to the empirical analysis.
In the elementary treatment we give here, however, we assume that this issue is moot, as the factors that determine whether a unit is observable or not are completely independent from the DGP we want to study.
The importance of panel datasets has grown exponentially since the IT rev-
olution of the 1980s-1990s: more and more datasets of this type are available,
simply as a consequence of the mechanisation of databases. For example: I have
been doing my weekly shopping for more than thirty years always at the same
supermarket chain, and I regularly pass my customer card each time. Those
guys, potentially, know everything about my habits: what I like, what I dislike,
how much I spend each week, what I buy only during a discount promotion,
and so on. And they have the same information about millions of customers.
Just imagine the kind of datasets giants like Amazon possess. It should be no sur-
prise that econometricians have devoted a lot of energy into methods for panel
datasets and, as always, this book will only scratch the surface. If you want to go
deeper, Wooldridge (2010) is what everybody considers the ultimate reference,
but in my opinion Hsiao (2022) is also a must-have.
A mechanical application of the line of thought we followed in chapter 3 would disregard the panel nature of the dataset entirely and just focus on the regression function E[y_{i,t} | x_{i,t}]. Of course this approach is possible, and leads to our usual OLS statistic, which in this context is often called the pooled estimator of the conditional mean parameters. While this is a technically valid procedure, it is almost never a good idea, because we can do clever things with the information contained in the panel structure of the dataset and redefine the object of our interest (from E[y_{i,t} | x_{i,t}] to something else), like we did in chapter 5, so as to give a much more meaningful description of the DGP.
id time y x
1 1 1.6 1.6
1 2 1.0 1.8
1 3 2.2 1.0
1 4 2.0 1.0
1 5 1.8 1.0
1 6 2.2 0.8
2 1 3.2 4.2
2 2 3.4 3.2
2 3 3.0 4.2
2 4 3.6 2.4
2 5 3.8 3.2
2 6 3.2 3.6
3 1 3.8 6.8
3 2 5.0 4.8
3 3 5.2 5.4
3 4 4.6 5.8
3 5 4.4 6.0
3 6 3.6 7.0
Applying OLS to these data gives the "pooled" estimate of E[y|x], which is
ŷ = 1.62 + 0.45x,
with an R 2 index of 60.7%. The slope parameter, our customary indicator of the
relationship between x and y, equals 0.45 and is very significant (its t -ratio is
4.97). What you see is a strong, significant positive link between x and y.
Figure 7.1 displays the data together with the fitted line, using different sym-
bols to identify different units. In this context, the model we’re fitting is
y i ,t = x′i ,t β + εi ,t (7.1)
and we’re not using the information we have on the different units at all.
How can we improve on the above? The idea is to introduce heterogeneity
between units into the picture, and generalise the DGP by allowing for the pos-
sibility of each units having its own set of parameters. A fully general application
of this principle would entail considering an object like
m_i(x) = E[y_{i,t} | x_{i,t}]
(note that the regression function m has a subscript i ). In principle, this ap-
proach would lead us to estimating a different regression function for each unit,
which is undesirable for various reasons: first, in the typical “large n, small T ”
scenario, it is quite possible that T , the number of observations you have for
one unit, is smaller than k, the number of parameters in your model, which
would make estimation impossible. Moreover, that level of generality is not even
needed. In most contexts, it is perfectly reasonable to assume that heterogene-
ity between units does not affect the marginal effects of x on y. In other words,
even if individuals are different, it’s often likely that the way they respond to vari-
ations in the observables is the same. If this is the case, then β is the same for all
units and we may settle for
y i ,t = x′i ,t β + αi + εi ,t , (7.2)
where the αi term is commonly known as the individual effect. We can use
vectors and matrices for writing (7.2) more compactly, expressing all the obser-
vations for unit i as
yi = Xi β + αi ι + εi (7.3)
where of course y is a T × 1 vector, X is a T × k matrix, and ι is, as usual, a con-
formable vector of ones.1 The presence of the individual effects in equation (7.2)
means that each unit is potentially different from all the others because there is
a term αi , constant through time, that shifts the level of y i ,t by some amount.
The β vector, instead, is homogeneous across units.
In the simplest cases, it is customary to assume that, once heterogeneity is
taken into account, the disturbances are well-behaved, so the covariance matrix
for the whole ε vector is σ2ε I , where σ2ε is a positive scalar and I is a N ×N identity
matrix. More general scenarios will be considered in section 7.3.4.
There are two main points to note about individual effects:
2 This approach, in principle, may lead to some complications because, apart from β , you have
where d i1,t , d i2,t etc are a set of dummies for unit 1, 2 and so on, respectively (so
no, the number near the letter d is not an exponent). Therefore, the model for
unit k would simply reduce to (7.2), since for that unit d ik,t = 1 and all the other
dummies are 0. If units differ on account of some unobserved factor that shifts
the level of y for each one of them but keeps the marginal effects β equal across
units, then we have an ordinary linear model in which each unit has its own
intercept.
In matrix notation, eq. (7.4) would read
y = Xβ + Dα + ε, (7.5)
where, with the vector notation used in equation (7.3), the relevant matrices look
like this:
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \quad X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \quad D = \begin{bmatrix} ι & 0 & \cdots & 0 \\ 0 & ι & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & ι \end{bmatrix} \quad ε = \begin{bmatrix} ε_1 \\ ε_2 \\ \vdots \\ ε_n \end{bmatrix};
and the fitted value lines are displayed in Figure 7.2. Not only does R² jump to 92.8% here, but the slope coefficient changes sign (also: it's even more significant)! Do
we have a contradiction here? Not really: if we look at what happens if we follow
each unit through time, we have a negative association between y and x. In each
individual’s experience, when x goes up, y goes down (on average). However, in
our example units with larger values of x generally have larger values of y, so
the overall conditional mean of y on x has a positive slope, because it doesn’t
take into account unobservable differences between individuals. By explicitly
3 For you linear algebra addicts: the structure of the D matrix could be handled in a very ele-
gant and effective way using a cool tool called Kronecker product: those interested may jump to
Section 7.A.1.
test on parameters in the Rβ = d form, so you have the choice of using any of
the procedures described in section 3.5; of course, this is a joint test where the
number of hypotheses is n − 1.
In the simple example above the F -form of the test would yield a p-value of
7.55615e-08, so the visual impression that units 1, 2 and 3 are indeed different
from each other would be strongly confirmed.
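The whole sequence (pooled OLS, LSDV, poolability test) can be replicated on the toy data of Table 7.1 with a few lines of hansl; I'm assuming the four columns are available as series named id, time, y and x, and that the panel structure is declared via setobs.

setobs id time --panel-vars   # declare the panel structure
ols y const x                 # pooled estimate, eq. (7.1)
series d2 = (id == 2)         # unit dummies for the LSDV regression
series d3 = (id == 3)
ols y const x d2 d3           # LSDV: unit-specific intercepts, common slope
omit d2 d3                    # joint test for equal intercepts (poolability)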
ỹ_{i,t} ≡ y_{i,t} − ȳ_i,
where ȳ i is the average of the observations for unit i . Following most of the
literature, I will use the tilde as a decoration for within-transformed variables.
The reason why this is called the "within" transformation is motivated by a
traditional decomposition of the variance of a variable. A precise definition of
the decomposition of variance in “within” and “between” components is one
of those pedantic things that make descriptive statistics one of the most bor-
ing things on Earth. Let’s just say that the transformation above annihilates all
the differences between units (the per-unit average of yei ,t is 0 by construction)
and all the information that is left comes from variability within units through
time (hence the name). Therefore, the within transformation of a time-invariant
variable, such as gender in the example above, gives you a vector of zeros.
The matrix representation of the within transformation is very useful: at the
very beginning of this book (see Section 1.2) I showed that the average of yi can
be written as
ȳ i = (ι′ ι)−1 ι′ yi .
Therefore, we can easily compute the vector of the deviations of yi from its own
mean as
ỹ_i = y_i − ι ȳ_i = y_i − ι(ι′ι)^{-1}ι′ y_i = y_i − P y_i = Q y_i;
where I used P and Q as synonyms for Pι and Mι , respectively (we’ll use these
matrices quite often in Section 7.4, so it’s good to have a quick alternative nota-
tion; besides, I’m trying to stay consistent with the notation traditionally used in
4 Conversely, if T really is a small number, nothing prevents you from also adding “time dum-
mies” for t = 1, t = 2 etc. This is actually quite common practice. Section 7.A.4 shows how this
works in practice.
most textbooks).5 The reader is invited to check that applying the within trans-
formation to the toy example in Table 7.1 gives the data shown in Table 7.2.
y x ȳ x̄ ỹ x̃
1.6 1.6 1.8 1.2 -0.2 0.4
1 1.8 1.8 1.2 -0.8 0.6
2.2 1 1.8 1.2 0.4 -0.2
2 1 1.8 1.2 0.2 -0.2
1.8 1 1.8 1.2 0 -0.2
2.2 0.8 1.8 1.2 0.4 -0.4
3.2 4.2 3.367 3.467 -0.167 0.733
3.4 3.2 3.367 3.467 0.033 -0.267
3 4.2 3.367 3.467 -0.367 0.733
3.6 2.4 3.367 3.467 0.233 -1.067
3.8 3.2 3.367 3.467 0.433 -0.267
3.2 3.6 3.367 3.467 -0.167 0.133
3.8 6.8 4.433 5.967 -0.633 0.833
5 4.8 4.433 5.967 0.567 -1.167
5.2 5.4 4.433 5.967 0.767 -0.567
4.6 5.8 4.433 5.967 0.167 -0.167
4.4 6 4.433 5.967 -0.033 0.033
3.6 7 4.433 5.967 -0.833 1.033
With the help of the within transformation, we’ll rewrite equation (7.2) so
as to eliminate the individual effects.6 If you average observations for unit i
through time, you get
\bar{y}_i \equiv \frac{1}{T}\sum_{t=1}^T y_{i,t} = \frac{1}{T}\sum_{t=1}^T \left[ x_{i,t}′β + α_i + ε_{i,t} \right] = \bar{x}_i′β + α_i + \bar{ε}_i,   (7.6)
and subtracting (7.6) from (7.2) gives
ỹ_{i,t} = x̃_{i,t}′β + ε̃_{i,t}   (7.7)
and the α_i terms have disappeared. In vector form, the above would read
ỹ_i = X̃_i β + ε̃_i.   (7.8)
balanced panels is straightforward, as long as you admit that the P and Q matrices could have different size for different individuals.
6 The within transformation is a convenient way to sweep out the α_i terms, but it's by no means the only one: ∆s would work just the same, with a few slight adjustments.
Intuition suggests that, having removed the individual effect by means of the
within transformation, you can estimate β by applying OLS to (7.7). This is in-
deed the case, and the result is known, unsurprisingly, as the “within” estimator.
The amazing result is that this statistic is exactly the same as you’d get from
using OLS on (7.4). The proof is quite simple if we consider the within trans-
formation as a matrix operation: the within transformation can be expressed in
matrix terms as the premultiplication of the original data by an N × N square
and singular matrix that we call Q:
ỹ = Qy   X̃ = QX.
The Q matrix is a block-diagonal matrix, where all elements on the diagonal are
the Q matrices defined above, so it looks like this:
Q ≡ \begin{bmatrix} Q & 0 & \cdots & 0 \\ 0 & Q & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Q \end{bmatrix}   (7.9)
We won’t need P now, but we’ll use it later in Section 7.4. Therefore,
Q 0 ... 0 y1 Qy1 y1
e
0 Q . . . 0 y2 Qy2 e y2
Qy = .
.. . . . . = . = .
.. . . .. .. .. ..
0 0 ... Q yn Qyn yn
e
β̂ = (X′ QX)−1 X′ Qy = (X
e′X
e )−1 Xy,
ee (7.10)
so
MD y = MD Xβ̂ + e =⇒ X′ MD y = X′ MD Xβ̂ ,
which implies (7.10) (you may need to go back to Section 1.4.4 for the different
passages).
In practice, then, the LSDV and within estimators are exactly the same thing,
so they have the same interpretation. For example, if you regress ye on xe in Table
7.2, the OLS coefficient you get is -0.62, exactly equal to the one we found for
model (7.4). You may use either term for them, or even a third alternative, pos-
sibly even more popular: the fixed-effects estimator, or FE for short, which I’ll
indicate by β̂F E .
Some readers may be troubled by the fact that the disturbances in equation (7.7) are correlated. This is easily seen by considering the vector representation (7.8): since ε̃_i = Qε_i, it follows that
V[ε̃_i] = Q V[ε_i] Q′.
Even in the ideal case, where V[ε_i] = σ²_ε I, the covariance matrix of ε̃_i would be V[ε̃_i] = σ²_ε Q, which is obviously non-diagonal (keep in mind that Q is symmetric and idempotent).
This, however, is not a problem, since it can be proven that in this case OLS coincides with GLS, so OLS takes care of the problem quite effectively. I'm not proving this because we'd need a slightly more sophisticated definition of GLS than I gave in chapter 4.2.1, on account of the fact that Q is singular and I'd have to use the "Moore-Penrose" inverse I hinted at in Section 1.A.4. Just trust me, OK?
With the LSDV approach, the estimates for the individual effects α̂i are ob-
tained directly. However, calculating them via the within estimator is also rather
easy: rewrite equation (7.11) as
y − Xβ̂_{FE} = Dα̂ + e;
now premultiply by \frac{1}{T} ι′ and use the fact that ι′ e_i = 0: the result is
α̂_i = \frac{1}{T} ι′ (y_i − X_i β̂_{FE}) = \frac{1}{T} \sum_{t=1}^T u_{i,t},   (7.12)
where
u_{i,t} = y_{i,t} − x_{i,t}′ β̂_{FE}.   (7.13)
So, all you have to do is compute the residuals you’d get by using the within
estimate on the untransformed data and take means by unit.
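Here is a minimal hansl sketch of the whole thing for a model with a single regressor; it assumes a panel dataset is in place and that the pmean() function returns the per-unit time averages.

series y_w = y - pmean(y)     # within transformation of y
series x_w = x - pmean(x)     # within transformation of x
ols y_w x_w                   # the within (FE) estimate of beta, eq. (7.10)
scalar b_fe = $coeff(x_w)
series u = y - b_fe * x       # residuals on the untransformed data, eq. (7.13)
series a_hat = pmean(u)       # per-unit averages = estimated individual effects, eq. (7.12)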
where we’re keeping T fixed here, as usual. The questions of interest are:
x′ nonsingular, for n → ∞?
£ ¤
2. Is E e
x·e
Note that the covariance matrix bears the subscript i , so we’re also implicitly
allowing for arbitrary forms of heteroskedasticity. Other choices, however, are
possible. For example, it may be not unrealistic to imagine that some correlation may exist across different units: the classic example is a panel where units are geographical entities, say regions clustered into countries, but one could also think of individuals belonging to the same household, firms in the same sector, etc.
exploded in the past 15 years, and that Cameron and Miller (2010) or Cameron
and Miller (2015) provide excellent surveys.
y i ,t = x′i ,t β + ωi ,t (7.15)
yi = Xi β + ωi (7.16)
where the last equality comes from the assumption that the disturbances are
well-behaved. If we also assume independence between units, equation (7.15)
for the whole sample would therefore become
y = Xβ + ω (7.18)
where
V[ω] = Ω = \begin{bmatrix} Σ & 0 & \cdots & 0 \\ 0 & Σ & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Σ \end{bmatrix}   (7.19)
is a block-diagonal matrix.
We are now in the position to substantiate the claim I made at the end of Section 7.1, when I said that using the pooled OLS estimator is almost never a good idea with a panel dataset: for a start, the covariance matrix of the disturbance term is not scalar, which suggests that even though OLS on (7.15) was consistent, valid inference would require at least some form of robust covariance matrix estimation (see Section 4.2.2). Besides, consistency itself may be at risk: even if E[α_i] = 0, there is no guarantee that α_i and x_i should be independent, or at least uncorrelated (see the discussion at the end of Section 7.2). If E[α_i | x_i] ≠ 0, it follows that E[ω_{i,t} | x_i] ≠ 0 and therefore E[y_{i,t} | x_{i,t}] ≠ x_{i,t}′ β: the classic endo-
if these two scalars were known, we could use the GLS estimator, described in Section 4.2.1, which I'm reproducing here for your convenience:
β̃ = \left[X′Ω^{-1}X\right]^{-1} X′Ω^{-1}y.
This solution would take care of two problems at once: we’d be using the most
efficient estimator possible, and we wouldn’t have to worry about robust infer-
ence. In practice, the two variances we need to get the job done are unknown,
but asymptotically consistent estimators would be just as good, so a FGLS esti-
mator would be available. This is what we call the RE estimator.
It turns out that, as often happens, once the original data are suitably mod-
ified, the RE estimator can be rewritten as OLS on the transformed data. The
transformation we need is known as “quasi-differencing”: for each observation,
we subtract a fraction of the per-unit average from the original data:
y̆ i ,t = y i ,t − θ ȳ i ,
y̆i = yi − θ ι ȳ i = (I − θP ) yi ,
where, again, P is an alias for Pι . Quasi-differencing for the whole sample can
be written as
y̆ = (I − θP) y,
where P was defined in section 7.3.2 or, equivalently, as
y̆ = [Q + (1 − θ)P] y, (7.20)
given that Q = I − P
For a given value of θ, the RE estimator is just OLS on the quasi-differenced
data, that is
¤−1
β̀ (θ) = X̆′ X̆ X̆′ y̆.
£
(7.21)
As is easy to check, quasi-differencing with θ = 1 is just the within transfor-
mation, so β̀ (1) = β̂F E . At the other end of the spectrum, where θ = 0, the orig-
inal data are unmodified, so β̀(0) is just the pooled OLS estimator. Note that, for θ < 1, time-invariant variables do not become zero, and so they are perfectly usable.
Derivation of the optimal choice of θ for GLS estimation is a bit technical, and is in Section 7.A.7 for those interested. Here, I'm just giving you the solution straight away, which is
θ = 1 − \sqrt{\frac{σ²_ε}{σ²_ε + T σ²_α}}.   (7.22)
Note that, when σ2ε is large compared to σ2α , θ will be near 0: heterogeneity
between units is negligible and the optimal estimator is practically OLS. Con-
versely, if σ2ε is very small compared to σ2α , then θ is close to 1 and the within
estimator is optimal, since all the variance in ωi ,t comes from individual effects,
which are eliminated by the within transformation.
As I said earlier, the two variances σ2ε and σ2α are unknown in practice, so
they must be estimated. The almost universal solution is to use FE for σ2ε ; as
for σ2α , there are various alternatives and it is not clear if the “best” one even
exists. Shortly after the RE estimator was invented, in the late 1960s, quite a lot
of work was devoted to this issue, and the method most software uses is the one
by Swamy and Arora (1972), but you should be aware that you may get different
results form different programs because different (equally defensible) methods
are adopted.
Anyway: once we have consistent estimates of σ²_ε and σ²_α, to compute FGLS we just plug them into equation (7.22) and obtain
θ̂ = 1 − \sqrt{\frac{σ̂²_ε}{σ̂²_ε + T σ̂²_α}},   (7.23)
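As a sketch, assuming the two variance estimates are already stored in the scalars s2e and s2a, that T holds the time length, and that pmean() works as in the previous sketch, the RE estimator could be computed by hand as follows; in practice, gretl's panel command with the --random-effects option does all of this (variance estimation included) for you.

scalar theta = 1 - sqrt(s2e / (s2e + T * s2a))   # eq. (7.23)
series y_q = y - theta * pmean(y)                # quasi-differenced data
series x_q = x - theta * pmean(x)
series c_q = 1 - theta                           # the constant gets quasi-demeaned too
ols y_q c_q x_q                                  # RE (FGLS) estimate of beta, eq. (7.21)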
Having dealt with the particular covariance structure of ωi ,t , we now turn to the
other issue I mentioned above, that is the possibility of the observables xi ,t being
correlated with the individual effect αi . In many cases, this is a very real possi-
bility: think about the example I used on page 211, where GDP per capita is one
of the regressors xi ,t and soil fertility is one of the things that go into the indi-
vidual effect αi : who says that GDP per capita and soil fertility are independent?
More generally, it’s easy to imagine other examples, such as unobserved ability
and schooling in a Mincer wage equation.
As I argued above, this is not a problem for the FE estimator, since the within
transformation just sweeps the individual effect away, but it would make OLS
and the RE estimator inconsistent. Therefore, we could compare the FE and RE
estimators to see if they are similar, much in the same way as we did in Section
6.4 when we compared the OLS and IV estimators. This comparison gave rise
to the “Hausman test”, and this case is just the same. In fact, the original article
(Hausman, 1978) uses exactly the two examples we have in this book, that is OLS
vs IV and RE vs FE.
9 As the reader might imagine, robust versions of the RE estimator exist, but I'll refrain from
illustrating the details, and I’ll just say that there is no additional worry compared to the FE case,
and they work as one would expect.
What should we expect from the comparison? β̂F E is robust but inefficient;
β̂RE is efficient but potentially inconsistent.10 Under the null hypothesis of no
correlation between xi ,t and αi , the difference
δ = β̂F E − β̂RE
should converge to 0 in probability, because both statistics share the same limit.
Conversely, large values of δ should be taken as an indicator of endogeneity of
xi ,t .
The Hausman test can be carried out in a variety of ways, some numerically
equivalent, some only asymptotically. A choice that is used by several software
packages is to perform an auxiliary regression of the form
y̆_i = X̆_i β + X̃_i γ + u_i,          (7.25)

and then a Wald test for the hypothesis H0 : γ = 0. With a bit of algebra, it can be proven that this is equivalent to a test based on the difference δ defined above.
Therefore, the course of action one may take is very simple: after RE estimation, look at the Hausman test. If the null is rejected, β̂_RE is probably inconsistent, and β̂_FE is preferable. Otherwise, we may happily use β̂_RE, which is better than β̂_FE because it's more efficient. As simple as that.
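For the curious, here is a bare-bones numpy sketch of the auxiliary-regression version of the test; the function name and interface are mine, θ̂ is assumed to come from a previous RE estimation step, X is assumed to contain only time-varying regressors (no constant), and the Wald statistic uses the plain, non-robust covariance matrix:

    import numpy as np
    from scipy import stats

    def hausman_aux(y, X, unit, theta):
        # Auxiliary-regression Hausman test: regress the quasi-differenced y on the
        # quasi-differenced X plus the within-transformed X, then Wald-test the latter.
        ybar = np.empty_like(y, dtype=float)
        Xbar = np.empty_like(X, dtype=float)
        for u in np.unique(unit):
            m = unit == u
            ybar[m] = y[m].mean()
            Xbar[m] = X[m].mean(axis=0)
        y_qd = y - theta * ybar          # quasi-differenced dependent variable
        X_qd = X - theta * Xbar          # quasi-differenced regressors
        X_w = X - Xbar                   # within-transformed regressors
        Z = np.column_stack([X_qd, X_w])
        coef, *_ = np.linalg.lstsq(Z, y_qd, rcond=None)
        resid = y_qd - Z @ coef
        k = X.shape[1]
        sigma2 = resid @ resid / (len(y) - Z.shape[1])
        V = sigma2 * np.linalg.inv(Z.T @ Z)
        gamma, V_gamma = coef[k:], V[k:, k:]
        W = gamma @ np.linalg.solve(V_gamma, gamma)
        return W, stats.chi2.sf(W, df=k)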
Note that the conditioning variable we're using here is not x_{i,t}, but rather its average through time. Since α_i is time-invariant, it is quite natural to assume that a time average of the x_{i,t} should capture the effect we're after.
Therefore, if you define u_i = α_i − x̄′_i γ you can re-write (7.2) as

y_{i,t} = x′_{i,t} β + x̄′_i γ + u_i + ε_{i,t},

where, by construction, neither of the two error terms u_i and ε_{i,t} is correlated with the explanatory variables. In vector form,

y_i = X_i β + P X_i γ + η_i = X_i β + X̄_i γ + η_i,

with η_i = ι u_i + ε_i.
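In practice, building the extra regressors is just a matter of computing per-unit averages and appending them to X; a minimal sketch (the function name is mine):

    import numpy as np

    def add_mundlak_terms(X, unit):
        # Append the per-unit time averages of the columns of X (the "Mundlak"
        # regressors) to the original regressor matrix.
        Xbar = np.empty_like(X, dtype=float)
        for u in np.unique(unit):
            m = unit == u
            Xbar[m] = X[m].mean(axis=0)
        return np.column_stack([X, Xbar])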
10 Naturally, we have both parameters only for time-varying regressors, so the comparison is
stant 2015 US$ (WDI code: NY.GDP.PCAP.KD.) The WDI code for the Gini index is SI.POV.GINI.
[Figure: scatter of the Gini index (vertical axis, roughly 20–65) against per-capita GDP (horizontal axis, roughly 6–11), with the fitted quadratic Y = −10.3 + 12.8X − 0.822X² superimposed.]
To this model we also add a dummy for European countries, since these countries have had a historical and cultural preference for social equality that some people consider a dangerous socialist drift. The results are shown in Table 7.3.
Pooled OLS, using 1044 observations
Included 157 cross-sectional units
Time-series length: minimum 1, maximum 15
Dependent variable: Gini
Table 7.4: The Kuznets curve: fixed-effects estimates.
Having said this, Table 7.4 is relatively straightforward to comment on: the “Europe” dummy drops out of the equation on account of it being time-invariant, as explained in Section 7.3.1. Moreover, the poolability test rejects the null very strongly (the p-value is so small that the software just prints 0). This means that heterogeneity between units (countries in this case) is substantial and a simple pooled model may yield misleading results, at least as far as the effect of GDP on inequality is concerned. In fact, the Kuznets curve simply disappears: the coefficients on per capita GDP and its square are not significant.
Nevertheless, it can be verified that the joint hypothesis of both coefficients
being zero delivers a very small p-value (2.09218e-27): dropping the quadratic
12 The difference amounts to modifying the within transformation by adding back, for each
Table 7.5: The Kuznets curve: fixed-effects estimates with robust standard errors
As can be seen in Table 7.5, the estimated standard errors are quite different from those in Table 7.4. This is in fact a very common phenomenon: while it is rare in cross-sectional models for robust inference to deliver substantially different results from plain estimation, in panel datasets clustering by unit almost always inflates standard errors by a great deal, and the interpretation of the results may have to be adjusted, even radically.
In this case, however, the message conveyed by the model stays the same: the Kuznets curve vanishes, although the joint test still rejects the null (the p-value is 1.14766e-06).
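For reference, here is a minimal numpy sketch of unit-clustered standard errors for a generic OLS regression; applied to within-transformed data, this is essentially the kind of “robust” covariance panel packages compute, but note that no finite-sample correction is applied here, and packages differ precisely on that point, so don't expect the numbers to match any given program exactly:

    import numpy as np

    def ols_cluster_by_unit(y, X, unit):
        # OLS point estimates plus standard errors clustered by unit: the usual
        # "sandwich" formula, where the meat sums X_g' u_g u_g' X_g across units.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        u = y - X @ beta
        bread = np.linalg.inv(X.T @ X)
        meat = np.zeros((X.shape[1], X.shape[1]))
        for g in np.unique(unit):
            m = unit == g
            s = X[m].T @ u[m]
            meat += np.outer(s, s)
        V = bread @ meat @ bread
        return beta, np.sqrt(np.diag(V))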
Breusch-Pagan test -
Null hypothesis: Variance of the unit-specific error = 0
Asymptotic test statistic: Chi-square(1) = 3065.99
with p-value = 0
Hausman test -
Null hypothesis: GLS estimates are consistent
Asymptotic test statistic: Chi-square(2) = 22.3276
with p-value = 1.4178e-05
probability limit. This is what happens when one or more of the explanatory
variables (presumably GDP per capita, in our case) is correlated with the indi-
vidual effect αi , on account of the endogeneity problem that this provokes. In
cases like these, the RE estimator is inconsistent, so we’d better stay with FE.
Finally, note that gretl (like all other software packages) reports a test labelled as the Breusch–Pagan test. This is a test for the hypothesis H0 : σ²α = 0: under the null, the individual effects are in fact not even random variables at all, because they have zero mean and zero variance, so α_i = 0 for all units. Therefore, it can be seen as the random-effects equivalent of the poolability test I described earlier. In this case, the test confirms once again that heterogeneity is substantial. Note that this is an entirely different test from the BP test for heteroskedasticity I mentioned earlier in Section 4.2.3. The two tests share the same authors, but the similarity stops there.
The final estimate we see is the CRE estimate (see Section 7.4.2). There's hardly anything to see here: the coefficients for the time-varying variables y_{i,t} and y²_{i,t} are absolutely identical to those in Table 7.5, as they should be; their standard errors are not exactly the same, but that's a consequence of using robust SEs. If we had used plain GLS standard errors, they would have been identical too. The difference is minor anyway. So, for the time-varying variables we have nothing more than the FE estimate, and the interpretation is obviously the same. On the contrary, the CRE technique allows us to keep the time-invariant dummy for Europe in the model, which is (unsurprisingly) negative and significant.
Finally, note the insertion of the two “Mundlak” extra regressors, labelled Py and Py2 in the table, which contain the per-unit averages of y_{i,t} and y²_{i,t}, respectively. Although they are not significant individually, an F test for their joint significance yields 11.1638, with a p-value
Note that, as a consequence of its definition, with the Kronecker product no conformability issues arise. On the other hand, as with the ordinary matrix product, the Kronecker product is not commutative: in general, A ⊗ B ≠ B ⊗ A.
The Kronecker product has many nice properties, but the only ones we will need concern its combination with transposition, inversion and the ordinary matrix product. It can be proven that

(A ⊗ B)′ = A′ ⊗ B′        (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹        (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).

Note: the last equality assumes that the matrices are conformable.
In many cases, the Kronecker product makes it much easier to work with
“large matrices with a structure”. For example, if the panel is balanced the D
matrix defined in equation (7.5) can be written as D = I ⊗ ι and the variance of
ω in equation (7.19) is V [ω ] = I ⊗ Σ, where I is n × n; unfortunately, with unbal-
anced panels such elegance is unattainable.
Finally: the “vec” operator I illustrated in Section 4.A.3 and the Kronecker product play together very nicely. The basic property you need to know is that

vec(ABC) = (C′ ⊗ A) vec(B);

so for example if

A = [ 1 2 ; 3 4 ]        B = [ 1 ; −1 ]        C = [ 3 6 9 ],

then

(C′ ⊗ A) vec(B) = [ 3 6 ; 9 12 ; 6 12 ; 18 24 ; 9 18 ; 27 36 ] [ 1 ; −1 ] = [ −3 ; −3 ; −6 ; −6 ; −9 ; −9 ],

which is equal to vec(ABC).
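The same check can be done numerically in a couple of lines (numpy calls the operation kron):

    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[1], [-1]])
    C = np.array([[3, 6, 9]])

    vec = lambda M: M.flatten(order="F").reshape(-1, 1)   # stack the columns
    lhs = vec(A @ B @ C)
    rhs = np.kron(C.T, A) @ vec(B)
    print(np.allclose(lhs, rhs))    # True
    print(rhs.ravel())              # [-3. -3. -6. -6. -9. -9.]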
Given a square matrix C with n rows and columns, the trace operator is defined simply as

tr(C) = Σ_{i=1}^n C_{i,i},

that is, the sum of all the elements on the diagonal. Clearly, the trace of a scalar is the scalar itself.
This operator is useful in many contexts, mostly related to the fact that, for any given r × c matrix A (possibly with r ≠ c),

tr(A′A) = Σ_{i=1}^r Σ_{j=1}^c A²_{i,j}.
The two notable properties of the trace operator we use in our context are its invariance to cyclical permutations, tr(AB) = tr(BA), and the fact that it commutes with the expectation operator:

E[tr(C)] = tr(E[C]).
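These facts (together with the sum-of-squares identity above) are easy to verify numerically on random matrices; for instance:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(3, 5))
    B = rng.normal(size=(5, 3))

    print(np.isclose(np.trace(A.T @ A), (A ** 2).sum()))   # tr(A'A) equals the sum of squared elements
    print(np.isclose(np.trace(A @ B), np.trace(B @ A)))    # tr(AB) = tr(BA)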
id time y x t2 t3 ... t6
1 1 1.6 1.6 0 0 ... 0
1 2 1 1.8 1 0 ... 0
1 3 2.2 1 0 1 ... 0
1 4 2 1 0 0 ... 0
1 5 1.8 1 0 0 ... 0
1 6 2.2 0.8 0 0 ... 1
2 1 3.2 4.2 0 0 ... 0
2 2 3.4 3.2 1 0 ... 0
2 3 3 4.2 0 1 ... 0
2 4 3.6 2.4 0 0 ... 0
2 5 3.8 3.2 0 0 ... 0
2 6 3.2 3.6 0 0 ... 1
3 1 3.8 6.8 0 0 ... 0
3 2 5 4.8 1 0 ... 0
3 3 5.2 5.4 0 1 ... 0
3 4 4.6 5.8 0 0 ... 0
3 5 4.4 6 0 0 ... 0
3 6 3.6 7 0 0 ... 1
Note that in the “large n, small T” scenario the number of dummies you use is in fact relatively small, and does not create any computational problem. As for the interpretation of results, the effect is that your estimate gets rid of heterogeneity not only across units, but also across time
periods. This is especially useful when some unobserved factor affects all units
in a given period. For example, imagine your dataset describes turnover by firms
and includes year 2020: surely you’ll want to control for the COVID pandemic,
since it’s reasonable to assume that it affected most, if not all, the units you ob-
serve.
Alternatively, you may want to economise on the number of regressors used
to clean unobservable time effects by using a time trend, and possibly its square.
How advisable this is depends on the data you have.
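If you use Python rather than gretl for data preparation, a minimal pandas sketch of building period dummies like the t2–t6 columns in the table above (using just a fragment of those data) could look like this:

    import pandas as pd

    # a fragment of the example dataset above, in long format
    df = pd.DataFrame({
        "id":   [1, 1, 1, 2, 2, 2],
        "time": [1, 2, 3, 1, 2, 3],
        "y":    [1.6, 1.0, 2.2, 3.2, 3.4, 3.0],
        "x":    [1.6, 1.8, 1.0, 4.2, 3.2, 4.2],
    })

    # one dummy per period, dropping the first to avoid collinearity with the constant
    time_dummies = pd.get_dummies(df["time"], prefix="t", drop_first=True)
    print(df.join(time_dummies))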
Here we assume that the panel is balanced for simplicity; the unbalanced case is completely analogous and the conclusion is the same, but the algebra would be somewhat messier. The Q matrix, defined in equation (7.9) and repeated here for convenience, is

Q ≡ diag(Q, Q, …, Q),

that is, a block-diagonal matrix with the T × T matrix Q repeated n times along the diagonal.
On the other hand, since D′D = T · I for a balanced panel,

P_D = D(D′D)⁻¹D′ = (1/T) DD′ = (1/T) · diag(ι, ι, …, ι) · diag(ι′, ι′, …, ι′) = diag(P, P, …, P)
and so
M_D = I − P_D = diag(I − P, …, I − P) = diag(Q, …, Q) = Q.
A more compact proof can be given by using the Kronecker product, de-
scribed in Section 7.A.1: with a balanced panel dataset one can write D as I ⊗ ι,
where I is n × n and ι is T × 1, and therefore
P_D = (I ⊗ ι) [(I ⊗ ι)′(I ⊗ ι)]⁻¹ (I ⊗ ι′) = (I ⊗ ι) [T · I]⁻¹ (I ⊗ ι′) = (1/T)(I ⊗ ιι′) = I ⊗ P;
as a consequence,
MD = I ⊗ (I − P ) = I ⊗Q = Q,
as claimed.
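If you don't trust the algebra, a short numpy check on a tiny balanced panel confirms both block-diagonal results:

    import numpy as np

    n, T = 3, 4
    iota = np.ones((T, 1))
    P = iota @ iota.T / T               # per-unit averaging matrix
    Q = np.eye(T) - P                   # per-unit centring matrix

    D = np.kron(np.eye(n), iota)        # dummy matrix for a balanced panel
    P_D = D @ np.linalg.inv(D.T @ D) @ D.T
    M_D = np.eye(n * T) - P_D

    print(np.allclose(P_D, np.kron(np.eye(n), P)))   # P_D = I (kron) P
    print(np.allclose(M_D, np.kron(np.eye(n), Q)))   # M_D = I (kron) Q, i.e. the big Q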
Let the residuals of the within regression be

u_{i,t} = ỹ_{i,t} − x̃′_{i,t} β̂_FE.
Now note that the SSR from the within regression can be written as
SSR_W = Σ_{i=1}^n Σ_{t=1}^T u²_{i,t} = Σ_{i=1}^n u′_i u_i,

so that

(1/n) SSR_W = (1/n) Σ_{i=1}^n u′_i u_i.
On the other hand, consistency of β̂_FE implies that the within residuals u_{i,t} converge to the centred disturbances ε̃_{i,t} as n → ∞, and so, by extension, does the whole vector for a single unit:

u_i →ᵖ ε̃_i.

Therefore, one may say that, for n → ∞, E[u′_i u_i] should converge to E[ε̃′_i ε̃_i], and therefore

(1/n) SSR_W →ᵖ E[ε̃′_i ε̃_i].
This limit can be computed by noting that ε̃_i = Q ε_i, and so, by using the properties of the trace operator (if you're not 100% confident on the trace operator, section 7.A.2 is for you):

E[ε̃′_i ε̃_i] = E[ε′_i Q ε_i] = E[tr(ε′_i Q ε_i)] = E[tr(Q ε_i ε′_i)] = tr(Q E[ε_i ε′_i]) = σ²ε tr(Q);

since tr(Q) = tr(I) − tr(P) = T − 1, finally

E[ε̃′_i ε̃_i] = (T − 1) σ²ε.
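A quick simulation makes the claim tangible; here there are no regressors at all, so the within “residuals” are just the centred data (all numbers invented):

    import numpy as np

    rng = np.random.default_rng(42)
    n, T, sigma_eps = 20000, 5, 1.5
    alpha = rng.normal(size=(n, 1))                      # individual effects
    eps = rng.normal(scale=sigma_eps, size=(n, T))       # idiosyncratic disturbances
    y = alpha + eps                                      # no regressors at all

    y_within = y - y.mean(axis=1, keepdims=True)         # the within transformation
    print((y_within ** 2).sum() / n)                     # close to ...
    print((T - 1) * sigma_eps ** 2)                      # ... 9.0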
H y = H Xβ + H ε,    that is,    y̆ = X̆β + ε̆.
It's easy to check that the covariance matrix of ε̆ is H V[ε] H′ = k I, so the transformed model is homoskedastic, OLS on the transformed data is efficient, and standard inference applies. The GLS estimator is therefore

β̂_GLS = (X′Ω⁻¹X)⁻¹ X′Ω⁻¹y = (X̆′X̆)⁻¹ X̆′y̆,

because Ω⁻¹ = (1/k) H′H, which I'm not proving, but it's easy enough for the reader to demonstrate as an exercise.
The matrix Ω is in our case given in equation (7.19), but in fact the peculiar
structure of the matrix implies that all we need to do is find a transformation for
the model for each individual, that is equation (7.16) (reported here for conve-
nience):
yi = Xi β + ωi ;
As argued above (see equation (7.17)), the covariance matrix of ω_i is14

V[ω_i] = Σ = σ²ε I + σ²α ιι′;
therefore, a simple solution to the GLS problem lies in finding a matrix H such
that H ΣH ′ is a scalar multiple of the identity matrix or, equivalently, a matrix H
such that H ′ H is a scalar multiple of Σ−1 .
In order to do so, it is useful to rewrite Σ in terms of the idempotent matrices
P and Q:
Σ = σ²ε I + σ²α ιι′ = σ²ε Q + (σ²ε + T σ²α) P = σ²ε [ Q + ((σ²ε + T σ²α)/σ²ε) P ].
Therefore, via the result shown in Section 7.A.3, it's easy to see that the appropriate matrix H is the “inverse square root of Σ”, H = Σ^{−1/2}, which can be written as (apart from the σ²ε scalar)
H = Q + √[ σ²ε / (σ²ε + T σ²α) ] · P
  = (I − P) + √[ σ²ε / (σ²ε + T σ²α) ] · P
  = I + ( √[ σ²ε / (σ²ε + T σ²α) ] − 1 ) P = I − θP,

where

θ ≡ 1 − √[ σ²ε / (σ²ε + T σ²α) ].
14 As usual, I’m using the convenient simplification of assuming that the dataset is balanced and
you have T observations for each unit. Again, generalisation to unbalanced panels is possible but
somewhat messier.
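Before moving on, a quick numerical check (on arbitrary variance values) that H = I − θP does indeed turn Σ into a scalar multiple of the identity:

    import numpy as np

    T, s2_eps, s2_alpha = 4, 1.0, 2.0                    # arbitrary values
    iota = np.ones((T, 1))
    P = iota @ iota.T / T
    Sigma = s2_eps * np.eye(T) + s2_alpha * (iota @ iota.T)

    theta = 1 - np.sqrt(s2_eps / (s2_eps + T * s2_alpha))
    H = np.eye(T) - theta * P
    print(np.round(H @ Sigma @ H.T, 6))                  # sigma2_eps times the identity matrix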
Qy̆ = Q[Q + (1 − θ)P] y = Qy = ỹ          (7.29)
Py̆ = P[Q + (1 − θ)P] y = (1 − θ)Py = (1 − θ)ȳ          (7.30)
and analogous expressions trivially apply to X. Now write the augmented model as

y = Xβ + X̄γ + ε

and apply quasi-differencing, so that GLS is just OLS on the transformed model; by (7.29) and (7.30), the quasi-differenced regressors are X̆ = X̃ + (1 − θ)X̄ and (1 − θ)X̄, and therefore the transformed model can be rearranged as

y̆ = X̃β + (1 − θ)X̄(β + γ) + ε̆.

Since X̃ and X̄ are orthogonal (QP = 0), the OLS coefficient on X̃ can be computed separately, so

β̂ = (X̃′X̃)⁻¹ X̃′y̆ = β̂_FE,

where the last equality comes from writing X̃′y̆ as X′Qy̆ and applying (7.29).
Bibliography
Biau, D. J., B. M. Jolles, and R. Porcher (2009): “P value and the theory of hypothesis testing: an explanation for new researchers,” Clinical Orthopaedics and Related Research, 468, 885–892.
Brockwell, P. and R. Davis (1991): Time Series: Theory and Methods, Springer Series in Statistics, Springer.
——— (2004): Econometric Theory and Methods, Oxford University Press, New York.
Fanaee-T, H. and J. Gama (2014): “Event labeling combining ensemble detectors and background knowledge,” Progress in Artificial Intelligence, 2, 113–127.
Hansen, B. E. (2019): “Econometrics,” https://www.ssc.wisc.edu/~bhansen/econometrics/.
Mundlak, Y. (1978): “On the pooling of time series and cross section data,” Econometrica, 69–85.
Verbeek, M. (2017): A Guide to Modern Econometrics, John Wiley and Sons, 5th ed.