Michael Creel - Econometrics
Michael Creel - Econometrics
c
Michael Creel
List of Figures 14
List of Tables 17
3.7.2. Normality 41
3.7.3. The variance of the OLS estimator and the Gauss-Markov theorem 42
3.8. Example: The Nerlove model 47
3.8.1. Theoretical background 47
3.8.2. Cobb-Douglas functional form 48
3.8.3. The Nerlove data and OLS 49
Exercises 53
6.2.1. t-test 83
6.2.2. F test 86
6.2.3. Wald-type tests 87
6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests) 88
6.2.5. Likelihood ratio-type tests 90
6.3. The asymptotic equivalence of the LR, Wald and score tests 92
6.4. Interpretation of test statistics 96
6.5. Confidence intervals 96
6.6. Bootstrapping 97
6.7. Testing nonlinear restrictions, and the Delta Method 100
6.8. Example: the Nerlove data 104
Exercises 179
Exercises 179
20.1.2. ML 428
20.1.3. GMM 429
20.1.4. Kernel regression 431
Bibliography 433
Bibliography 441
Bibliography 489
Index 490
List of Figures
1.2.1 LYX 20
1.2.2 Octave 21
3.5.1 Uncentered R2 37
15.10.2 IV 325
17
CHAPTER 1
This document integrates lecture notes for a one year graduate level course with
computer programs that illustrate and apply the methods that are studied. The im-
mediate availability of executable (and modifiable) example programs when using the
PDF1 version of the document is one of the advantages of the system that has been
used. On the other hand, when viewed in printed form, the document is a somewhat
terse approximation to a textbook. These notes are not intended to be a perfect substi-
tute for a printed textbook. If you are a student of mine, please note that last sentence
carefully. There are many good textbooks available. A few of my favorites are listed
in the bibliography.
With respect to contents, the emphasis is on estimation and inference within the
world of stationary data, with a bias toward microeconometrics. The second half is
somewhat more polished than the first half, since I have taught that course more often.
If you take a moment to read the licensing information in the next section, you’ll see
that you are free to copy and modify the document. If anyone would like to contribute
material that expands the contents, it would be very welcome. Error corrections and
other additions are also welcome. As an example of a project that has made use of
these notes, see these very nice lecture slides.
1It is possible to have the program links open up in an editor, ready to run using keyboard macros. To
do this with the PDF version you need to do some setup work. See the bootable CD described below.
18
1.2. OBTAINING THE MATERIALS 19
1.1. License
All materials are copyrighted by Michael Creel with the date that appears above.
They are provided under the terms of the GNU General Public License, which forms
Section 24 of the notes. The main thing you need to know is that you are free to modify
and distribute these materials in any way you like, as long as you do so under the terms
of the GPL. In particular, you must make available the source files, in editable form,
for your modified version of the materials.
The materials are available on my web page, in a variety of forms including PDF
and the editable sources, at pareto.uab.es/mcreel/Econometrics/. In addition to the
final product, which you’re looking at in some form now, you can obtain the ed-
itable sources, which will allow you to create your own version, if you like, or send
error corrections and contributions. The main document was prepared using LYX
(www.lyx.org) and Octave (www.octave.org). LYX is a free2 “what you see is what
you mean” word processor, basically working as a graphical frontend to LATEX. It
(with help from other applications) can export your work in LATEX, HTML, PDF and
several other forms. It will run on Linux, Windows, and MacOS systems. Figure 1.2.1
shows LYX editing this document.
GNU Octave has been used for the example programs, which are scattered though
the document. This choice is motivated by two factors. The first is the high quality of
the Octave environment for doing applied econometrics. The fundamental tools exist
and are implemented in a way that make extending them fairly easy. The example
programs included here may convince you of this point. Secondly, Octave’s licensing
philosophy fits in with the goals of this project. Thirdly, it runs on Linux, Windows
2”Free” is used in the sense of ”freedom”, but L X is also free of charge.
Y
1.3. AN EASY WAY TO USE LYX AND OCTAVE TODAY 20
and MacOS. Figure 1.2.2 shows an Octave program being edited by NEdit, and the
result of running the program in a shell window.
The example programs are available as links to files on my web page in the PDF
version, and here. Support files needed to run these are available here. The files won’t
run properly from your browser, since there are dependencies between files - they are
only illustrative when browsing. To see how to use these files (edit and run them),
you should go to the home page of this document, since you will probably want to
download the pdf version together with all the support files and examples. Then set
1.3. AN EASY WAY TO USE LYX AND OCTAVE TODAY 21
the base URL of the PDF file to point to wherever the Octave files are installed. All of
this may sound a bit complicated, because it is. An easier solution is available:
The file pareto.uab.es/mcreel/Econometrics/econometrics.iso distribution of Linux
is an ISO image file that may be burnt to CDROM. It contains a bootable-from-CD
Gnu/Linux system that has all of the tools needed to edit this document, run the Octave
example programs, etcetera. In particular, it will allow you to cut out small portions
of the notes and edit them, and send them to me as LYX (or TEX) files for inclusion in
future versions. Think error corrections, additions, etc.! The CD automatically detects
the hardware of your computer, and will not touch your hard disk unless you explicitly
tell it to do so. It is based upon the ParallelKnoppix GNU/Linux distribution. The
1.4. KNOWN BUGS 22
reason why these notes are integrated into a Linux distribution for parallel computing
will be apparent if you get to Chapter 20.
• The PDF version has hyperlinks to figures that jump to the wrong figure. The
numbers are correct, but the links are not. ps2pdf bugs?
CHAPTER 2
Economic theory tells us that the demand function for a good is something like:
x = x(p, m, z)
xi = xi (pi , mi , zi )
A step toward an estimable econometric model is to suppose that the model may be
written as
xi = β1 + p0i β p + mi βm + w0i βw + εi
• The functions xi (·) which in principle may differ for all i have been restricted
to all belong to the same parametric family.
• Of all parametric families of functions, we have restricted the model to the
class of linear in the variables functions.
• The parameters are constant across individuals.
• There is a single unobservable component, and we assume it is additive.
If we assume nothing about the error term ε, we can always write the last equation.
But in order for the β coefficients to have an economic meaning, and in order to be
able to estimate them from sample data, we need to make additional assumptions.
These additional assumptions have no theoretical basis, they are assumptions on top
of those needed to prove the existence of a demand function. The validity of any results
we obtain using this model will be contingent on these additional restrictions being at
least approximately correct. For this reason, specification testing will be needed, to
check that the model seems to be reasonable. Only when we are convinced that the
model is at least approximately correct should we use it for economic analysis.
When testing a hypothesis using an econometric model, three factors can cause a
statistical test to reject the null hypothesis:
We would like to ensure that the third reason is not contributing to rejections, so that
rejection will be due to either the first or second reasons. Hopefully the above example
makes it clear that there are many possible sources of misspecification of econometric
models. In the next few sections we will obtain results supposing that the economet-
ric model is entirely correctly specified. Later we will examine the consequences of
misspecification and see some methods for determining if a model is correctly spec-
ified. Later on, econometric methods that seek to minimize maintained assumptions
are introduced.
CHAPTER 3
vector of explanatory variables, and β0 = ( β01 β02 · · · β0k ) . The superscript “0” in
0
β0 means this is the ”true value” of the unknown parameter. It will be defined more
precisely later, and usually suppressed when it’s not necessary for clarity.
Suppose that we want to use data to try to determine the best linear approximation
to y using the variables x. The data {(yt , xt )} ,t = 1, 2, ..., n are obtained by some form
of sampling1. An individual observation is thus
yt = xt0 β + εt
1For example, cross-sectional data may be obtained by random sampling. Time series data accumulate
historically.
26
3.2. ESTIMATION BY LEAST SQUARES 27
(3.1.1) y = Xβ + ε,
0 0
where y = y1 y2 · · · yn is n × 1 and X = x1 x2 · · · xn .
Linear models are more general than they might first appear, since one can employ
nonlinear transformations of the variables:
h i
ϕ0 (z) = ϕ1 (w) ϕ2 (w) · · · ϕ p (w) β+ε
where the φi () are known functions. Defining y = ϕ0 (z), x1 = ϕ1 (w), etc. leads to a
model in the form of equation 3.6.1. For example, the Cobb-Douglas model
β β
z = Aw2 2 w3 3 exp(ε)
ln z = ln A + β2 ln w2 + β3 ln w3 + ε.
If we define y = ln z, β1 = ln A,etc., we can put the model in the form needed. The
approximation is linear in the parameters, but not necessarily linear in the variables.
Figure 3.2.1, obtained by running TypicalData.m shows some data that follows the
linear model yt = β1 +β2 xt2 +εt . The green line is the ”true” regression line β1 +β2 xt2 ,
and the red crosses are the data points (xt2 , yt ), where εt is a random error that has mean
zero and is independent of xt2 . Exactly how the green line is defined will become clear
later. In practice, we only have the data, and we don’t know where the green line lies.
We need to gain information about the straight line that best fits the data points.
3.2. ESTIMATION BY LEAST SQUARES 28
-5
-10
-15
0 2 4 6 8 10 12 14 16 18 20
X
The ordinary least squares (OLS) estimator is defined as the value that minimizes
the sum of the squared errors:
where
n 2
s(β) = ∑ yt − xt0 β
t=1
= (y − Xβ)0 (y − Xβ)
= y0 y − 2y0 Xβ + β0 X0 Xβ
= k y − Xβ k2
3.2. ESTIMATION BY LEAST SQUARES 29
This last expression makes it clear how the OLS estimator is defined: it minimizes the
Euclidean distance between y and X β. The fitted OLS coefficients will define the best
linear approximation to y using x as basis functions, where ”best” means minimum
Euclidean distance. One could think of other estimators based upon other metrics. For
n
example, the minimum absolute distance (MAD) minimizes ∑t=1 |yt − xt0 β|. Later, we
will see that which estimator is best in terms of their statistical properties, rather than
in terms of the metrics that define them, depends upon the properties of ε, about which
we have as yet made no assumptions.
• To minimize the criterion s(β), find the derivative with respect to β and it to
zero:
so
β̂ = (X0 X)−1 X0 y.
Since ρ(X) = K, this matrix is positive definite, since it’s a quadratic form in
a p.d. matrix (identity matrix of order n), so β̂ is in fact a minimizer.
• The fitted values are in the vector ŷ = Xβ̂.
• The residuals are in the vector ε̂ = y − Xβ̂
• Note that
y = Xβ + ε
= Xβ̂ + ε̂
3.3. GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION 30
X0 y − X0 Xβ̂ = 0
X y − Xβ̂ = 0
0
X0 ε̂ = 0
which is to say, the OLS residuals are orthogonal to X. Let’s look at this more
carefully.
3.3.1. In X ,Y Space. Figure 3.3.1 shows a typical fit to data, along with the true
regression line. Note that the true line and the estimated line are different. This fig-
ure was created by running the Octave program OlsFit.m . You can experiment with
changing the parameter values to see how this affects the fit, and to see how the fitted
line will sometimes be close to the true line, and sometimes rather far away.
10
-5
-10
-15
0 2 4 6 8 10 12 14 16 18 20
X
Observation 2
e = M_xY S(x)
x*beta=P_xY
Observation 1
3.3. GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION 32
−1
X β̂ = X X 0 X X 0y
PX = X (X 0X )−1 X 0
since
X β̂ = PX y.
ε̂ = y − X β̂
= y − X (X 0X )−1 X 0 y
= In − X (X 0X )−1 X 0 y.
So the matrix that projects y onto the space orthogonal to the span of X is
MX = In − X (X 0X )−1 X 0
= In − PX .
We have
ε̂ = MX y.
Therefore
y = PX y + MX y
= X β̂ + ε̂.
3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS 33
These two projection matrices decompose the n dimensional vector y into two orthog-
onal components - the portion that lies in the K dimensional space defined by X , and
the portion that lies in the orthogonal n − K dimensional space.
β̂i = (X 0X )−1 X 0 i·
y
= c0i y
This is how we define a linear estimator - it’s a linear function of the dependent
variable. Since it’s a linear combination of the observations on the dependent vari-
able, where the weights are detemined by the observations on the regressors, some
observations may have more influence than others. Define
ht = (PX )tt
= et0 PX et
= k PX et k2
≤ k et k2 = 1
3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS 34
TrPX = K ⇒ h = K/n.
So, on average, the weight on the yt ’s is K/n. If the weight is much higher, then the
observation has the potential to affect the fit importantly. The weight, ht is referred to
as the leverage of the observation. However, an observation may also be influential
due to the value of yt , rather than the weight it is multiplied by, which only depends on
the xt ’s.
To account for this, consider estimation of β without using the t th observation
(designate this estimator as β̂(t) ). One can show (see Davidson and MacKinnon, pp.
32-5 for proof) that
1
β̂ (t)
= β̂ − (X 0 X )−1 Xt0 ε̂t
1 − ht
so the change in the t th observations fitted value is
ht
Xt β̂ − Xt β̂ (t)
= ε̂t
1 − ht
While an observation may be influential if it doesn’t affect its own fitted value, it
certainly is influential if it does. A fast means of identifying influential observations
ht
is to plot 1−h t
ε̂t (which I will refer to as the own influence of the observation) as a
function of t. Figure 3.4.1 gives an example plot of data, fit, leverage and influence.
The Octave program is InfluentialObservation.m . If you re-run the program you will
see that the leverage of the last observation (an outlying value of x) is always high, and
the influence is sometimes high.
After influential observations are detected, one needs to determine why they are
influential. Possible causes include:
3.5. GOODNESS OF FIT 35
10
-2
0 0.5 1 1.5 2 2.5 3
X
• data entry error, which can easily be corrected once detected. Data entry
errors are very common.
• special economic factors that affect some observations. These would need to
be identified and incorporated in the model. This is the idea behind structural
change: the parameters may not be constant across all observations.
• pure randomness may have caused us to sample a low-probability observa-
tion.
ε̂0 ε̂
R2u = 1 −
y0 y
β̂0 X 0 X β̂
=
y0 y
k PX y k2
=
k y k2
= cos2 (φ),
Mι = In − ι(ι0 ι)−1 ι0
= In − ιι0 /n
3.5. GOODNESS OF FIT 37
Mι y just returns the vector of deviations from the mean. In terms of deviations from
the mean, equation 3.5.1 becomes
y0 Mι y = β̂0 X 0 Mι X β̂ + ε̂0 Mι ε̂
ε̂0 ε̂ ESS
R2c = 1 − = 1−
y Mι y
0 T SS
n
where ESS = ε̂0 ε̂ and T SS = y0 Mι y=∑t=1 (yt − ȳ)2 .
Supposing that X contains a column of ones (i.e., there is a constant term),
X 0 ε̂ = 0 ⇒ ∑ ε̂t = 0
t
3.6. THE CLASSICAL LINEAR REGRESSION MODEL 38
So
RSS
R2c =
T SS
where RSS = β̂0 X 0 Mι X β̂
• Supposing that a column of ones is in the space spanned by X (PX ι = ι), then
one can show that 0 ≤ R2c ≤ 1.
Up to this point the model is empty of content beyond the definition of a best linear
approximation to y and some geometrical properties. There is no economic content
to the model, and the regression parameters have no economic interpretation. For
example, what is the partial derivative of y with respect to x j ? The linear approximation
is
y = β1 x1 + β2 x2 + ... + βk xk + ε
1
(3.6.2) lim X0 X = QX
n
where QX is a finite positive definite matrix. This is needed to be able to identify the
individual effects of the explanatory variables.
Independently and identically distributed errors:
Nonautocorrelated errors:
(3.6.5) E (εt εs ) = 0, ∀t 6= s
Optionally, we will sometimes assume that the errors are normally distributed.
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 40
(3.6.6) ε ∼ N(0, σ2 In )
Up to now, we have only examined numeric properties of the OLS estimator, that
always hold. Now we will examine statistical properties. The statistical properties
depend upon the assumptions we can make.
β̂ = (X 0 X )−1 X 0 (X β + ε)
= β + (X 0X )−1 X 0 ε
= (X 0X )−1 X 0 Eε
= 0
so the OLS estimator is unbiased under the assumptions of the classical model.
Figure 3.7.1 shows the results of a small Monte Carlo experiment where the OLS
estimator was calculated for 10000 samples from the classical model with y = 1 + 2x +
ε, where n = 20, σ2ε = 9, and x is fixed across samples. We can see that the β2 appears
to be estimated without bias. The program that generates the plot is Unbiased.m , if
you would like to experiment with this.
With time series data, the OLS estimator will often be biased. Figure 3.7.2 shows
the results of a small Monte Carlo experiment where the OLS estimator was calculated
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 41
0.1
0.08
0.06
0.04
0.02
0
-3 -2 -1 0 1 2 3
for 1000 samples from the AR(1) model with yt = 0 + 0.9yt−1 + εt , where n = 20 and
σ2ε = 1. In this case, assumption 3.6.2 does not hold: the regressors are stochastic. We
can see that the bias in the estimation of β2 is about -0.2.
The program that generates the plot is Biased.m , if you would like to experiment
with this.
β̂ ∼ N β, (X 0X )−1 σ20
since a linear function of a normal random vector is also normally distributed. In Fig-
ure 3.7.1 you can see that the estimator appears to be normally distributed. It in fact
is normally distributed, since the DGP (see the Octave program) has normal errors.
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 42
0.12
0.1
0.08
0.06
0.04
0.02
0
-1.2 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4
Even when the data may be taken to be IID, the assumption of normality is often ques-
tionable or simply untenable. For example, if the dependent variable is the number of
automobile trips per week, it is a count variable with a discrete distribution, and is thus
not normally distributed. Many variables in economics can take on only nonnegative
values, which, strictly speaking, rules out normality.2
3.7.3. The variance of the OLS estimator and the Gauss-Markov theorem.
Now let’s make all the classical assumptions except the assumption of normality. We
2Normality may be a good model nonetheless, as long as the probability of a negative value occuring is
negligable under the model. This depends upon the mean being large enough in relation to the variance.
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 43
= (X 0 X )−1 σ20
The OLS estimator is a linear estimator, which means that it is a linear function of
the dependent variable, y.
β̂ = (X 0 X )−1 X 0 y
= Cy
where C is a function of the explanatory variables only, not the dependent variable. It is
also unbiased under the present assumptions, as we proved above. One could consider
other weights W that are a function of X that define some other linear estimator. We’ll
still insist upon unbiasedness. Consider β̃ = Wy, where W = W (X ) is some k × n
matrix function of X . Note that since W is a function of X , it is nonstochastic, too. If
the estimator is unbiased, then we must have W X = IK :
E (Wy) = E (W X β0 +W ε)
= W X β0
= β0
WX = IK
The variance of β̃ is
V (β̃) = WW 0 σ20 .
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 44
Define
D = W − (X 0 X )−1 X 0
so
W = D + (X 0 X )−1 X 0
Since W X = IK , DX = 0, so
0
V (β̃) = D + (X 0 X )−1 X 0 D + (X 0 X )−1 X 0 σ20
−1 2
= DD0 + X 0 X σ0
So
V (β̃) ≥ V (β̂)
The inequality is a shorthand means of expressing, more formally, that V ( β̃) − V (β̂)
is a positive semi-definite matrix. This is a proof of the Gauss-Markov Theorem. The
OLS estimator is the ”best linear unbiased estimator” (BLUE).
• It is worth emphasizing again that we have not used the normality assumption
in any way to prove the Gauss-Markov theorem, so it is valid if the errors are
not normally distributed, as long as the other assumptions hold.
To illustrate the Gauss-Markov result, consider the estimator that results from splitting
the sample into p equally-sized parts, estimating using each part of the data separately
by OLS, then averaging the p resulting estimators. You should be able to show that this
estimator is unbiased, but inefficient with respect to the OLS estimator. The program
Efficiency.m illustrates this using a small Monte Carlo experiment, which compares
the OLS estimator and a 3-way split sample estimator. The data generating process
follows the classical model, with n = 21. The true parameter value is β = 2. In Figures
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 45
0.09
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
0 0.5 1 1.5 2 2.5 3 3.5 4
3.7.3 and 3.7.4 we can see that the OLS estimator is more efficient, since the tails of
its histogram are more narrow.
0 −1
We have that E(β̂) = β and Var(β̂) = X X σ20 , but we still need to estimate
the variance of ε, σ20 , in order to have an idea of the precision of the estimates of β. A
commonly used estimator of σ20 is
c2 = 1 0
σ 0 ε̂ ε̂
n−K
0.1
0.08
0.06
0.04
0.02
0
0 0.5 1 1.5 2 2.5 3 3.5 4
c2 = 1 0
σ 0 ε̂ ε̂
n−K
1 0
= ε Mε
n−K
c2 ) = 1
E (σ 0 E(Trε0 Mε)
n−K
1
= E(TrMεε0)
n−K
1
= TrEX Eε|X (Mεε0 )
n−K
1
= σ20 EX TrM
n−K
1
= σ2 (n − k)
n−K 0
= σ20
3.8. EXAMPLE: THE NERLOVE MODEL 47
where we use the fact that Tr(AB) = Tr(BA) when both products are conformable.
Thus, this estimator is also unbiased under these assumptions.
3.8.1. Theoretical background. For a firm that takes input prices w and the out-
put level q as given, the cost minimization problem is to choose the quantities of inputs
x to solve the problem
min w0 x
x
f (x) = q.
The solution is the vector of factor demands x(w, q). The cost function is obtained by
substituting the factor demands into the criterion function:
∂C(w, q)
≥0
∂w
Remember that these derivatives give the conditional factor demands (Shep-
hard’s Lemma).
• Homogeneity The cost function is homogeneous of degree 1 in input prices:
C(tw, q) = tC(w, q) where t is a scalar constant. This is because the factor
3.8. EXAMPLE: THE NERLOVE MODEL 48
demands are homogeneous of degree zero in factor prices - they only depend
upon relative prices.
• Returns to scale The returns to scale parameter γ is defined as the inverse of
the elasticity of cost with respect to output:
−1
∂C(w, q) q
γ=
∂q C(w, q)
β β
C = Aw1 1 ...wg g qβq eε
This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy
to interpret, since they are the elasticities of the dependent variable with respect to the
3.8. EXAMPLE: THE NERLOVE MODEL 49
the cost share of the jth input. So with a Cobb-Douglas cost function, β j = s j (w, q).
The cost shares are constants.
Note that after a logarithmic transformation we obtain
lnC = α + β1 ln w1 + ... + βg ln wg + βq ln q + ε
where α = ln A . So we see that the transformed model is linear in the logs of the data.
One can verify that the property of HOD1 implies that
g
∑ βg = 1
i=1
1
γ= =1
βq
3.8.3. The Nerlove data and OLS. The file nerlove.data contains data on 145
electric utility companies’ cost of production, output and input prices. The data are
for the U.S., and were collected by M. Nerlove. The observations are by row, and
the columns are COMPANY, COST (C), OUTPUT (Q), PRICE OF LABOR (PL ),
3.8. EXAMPLE: THE NERLOVE MODEL 50
PRICE OF FUEL (PF ) and PRICE OF CAPITAL (PK ). Note that the data are sorted
by output level (the third column).
We will estimate the Cobb-Douglas model
(3.8.1) lnC = β1 + β2 ln Q + β3 ln PL + β4 ln PF + β5 ln PK + ε
using OLS. To do this yourself, you need the data file mentioned above, as well as
Nerlove.m (the estimation program) , and the library of Octave functions mentioned
in the introduction to Octave that forms section 22 of this document.3
The results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943
*********************************************************
While we will use Octave programs as examples in this document, since following the
programming statements is a useful way of learning how theory is put into practice,
3If you are running the bootable CD, you have all of this installed and ready to run.
3.8. EXAMPLE: THE NERLOVE MODEL 51
Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in
the bootable CD mentioned in the introduction. I recommend using GRETL to repeat
the examples that are done using Octave.
The previous properties hold for finite sample sizes. Before considering the as-
ymptotic properties of the OLS estimator it is useful to review the MLE estimator,
since under the assumption of normal errors the two estimators coincide.
EXERCISES 53
Exercises
(1) Prove that the split sample estimator used to generate figure 3.7.4 is unbiased.
(2) Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and
provide printouts of the results. Interpret the results.
(3) Do an analysis of whether or not there are influential observations for OLS esti-
mation of the Nerlove model. Discuss.
(4) Using GRETL, examine the residuals after OLS estimation and tell me whether or
not you believe that the assumption of independent identically distributed normal
errors is warranted. No need to do formal tests, just look at the plots. Print out any
that you think are relevant, and interpret them.
(5) For a random vector X ∼ N(µx , Σ), what is the distribution of AX + b, where A and
b are conformable matrices of contants?
(6) Using Octave, write a little program that verifies that Tr(AB) = Tr(BA) for A and
B 4x4 matrices of random numbers. Note: there is an Octave function trace.
(7) For the model with a constant and a single regressor, yt = β1 + β2 xt + εt , which
satisfies the classical assumptions, prove that the variance of the OLS estimator
declines to zero as the sample size increases.
CHAPTER 4
Suppose we have a sample of size n of the random vectors y and z. Suppose the
joint density of Y = y1 . . . yn and Z = z1 . . . zn is characterized by a
parameter vector ψ0 :
fY Z (Y, Z, ψ0 ).
This is the joint density of the sample. This density can be factored as
Note that if θ0 and ρ0 share no elements, then the maximizer of the conditional
likelihood function fY |Z (Y |Z, θ) with respect to θ is the same as the maximizer of
the overall likelihood function fY Z (Y, Z, ψ) = fY |Z (Y |Z, θ) fZ (Z, ρ), for the elements
of ψ that correspond to θ. In this case, the variables Z are said to be exogenous for
estimation of θ, and we may more conveniently work with the conditional likelihood
function fY |Z (Y |Z, θ) for the purposes of estimating θ0 .
L(Y, θ) = f (y1 |z1 , θ) f (y2 |y1 , z2 , θ) f (y3 |y1 , y2 , z3 , θ) · · · f (yn |y1, y2 , . . . yt−n , zn , θ)
1 1 n
sn (θ) = ln L(Y, θ) = ∑ ln f (yt |xt , θ)
n n t=1
where the set maximized over is defined below. Since ln(·) is a monotonic increasing
function, ln L and L maximize at the same value of θ. Dividing by n has no effect on θ̂.
4.1.1. Example: Bernoulli trial. Suppose that we are flipping a coin that may
be biased, so that the probability of a heads may not be 0.5. Maybe we’re interested
in estimating the probability of a heads. Let y = 1(heads) be a binary variable that
indicates whether or not a heads is observed. The outcome of a toss is a Bernoulli
random variable:
y
fY (y, p0 ) = p0 (1 − p0 )1−y , y ∈ {0, 1}
= 0, y ∈
/ {0, 1}
fY (y, p) = py (1 − p)1−y
and
ln fY (y, p) = y ln p + (1 − y) ln (1 − p)
4.1. THE LIKELIHOOD FUNCTION 57
∂ ln fY (y, p) y (1 − y)
= −
∂p p (1 − p)
y− p
=
p (1 − p)
∂sn (p) 1 n yi − p
= ∑
∂p n i=1 p (1 − p)
Now
∂pi (β)
= pi (1 − pi ) xi
∂β
so
∂ ln fY (y, β) y − pi
= pi (1 − pi ) xi
∂β pi (1 − pi )
= (yi − p(xi , β)) xi
4.2. CONSISTENCY OF MLE 58
Uniform convergence:
u.a.s
sn (θ) → lim Eθ0 sn (θ) ≡ s∞ (θ, θ0 ), ∀θ ∈ Θ.
n→∞
We have suppressed Y here for simplicity. This requires that almost sure convergence
holds for all possible parameter values. For a given parameter value, an ordinary Law
of Large Numbers will usually imply almost sure convergence to the limit of the ex-
pectation. Convergence for a single element of the parameter space, combined with
the assumption of a compact parameter space, ensures uniform convergence.
First, θ̂n certainly exists, since a continuous function has a maximum on a compact
set.
Second, for any θ 6= θ0
L(θ) L(θ)
E ln ≤ ln E
L(θ0 ) L(θ0 )
since L(θ0 ) is the density function of the observations, and since the integral of any
density is 1. Therefore, since ln(1) = 0,
L(θ)
E ln ≤ 0,
L(θ0 )
or
E (sn (θ)) − E (sn (θ0 )) ≤ 0.
Suppose that θ∗ is a limit point of θˆn (any sequence from a compact set has at least
one limit point). Since θ̂n is a maximizer, independent of n, we must have
4.3. THE SCORE FUNCTION 60
s∞ (θ∗ , θ0 ) − s∞ (θ0 , θ0 ) ≥ 0.
θ∗ = θ0 , a.s.
Thus there is only one limit point, and it is equal to the true parameter value with
probability one. In other words,
lim θ̂ = θ0 , a.s.
n→∞
This completes the proof of strong consistency of the MLE. One can use weaker as-
sumptions to prove weak consistency (convergence in probability to θ 0 ) of the MLE.
This is omitted here. Note that almost sure convergence implies convergence in prob-
ability.
gn (Y, θ) = Dθ sn (θ)
1 n
= ∑ Dθ ln f (yt |xx, θ)
n t=1
1 n
≡ ∑ gt (θ).
n t=1
4.3. THE SCORE FUNCTION 61
This is the score vector (with dim K × 1). Note that the score function has Y as an
argument, which implies that it is a random function. Y (and any exogeneous variables)
will often be suppressed for clarity, but one should not forget that they are still there.
The ML estimator θ̂ sets the derivatives to zero:
1 n
gn (θ̂) = ∑ gt (θ̂) ≡ 0.
n t=1
We will show that Eθ [gt (θ)] = 0, ∀t. This is the expectation taken with respect to
the density f (θ), not necessarily f (θ0 ) .
Z
Eθ [gt (θ)] = [Dθ ln f (yt |xt , θ)] f (yt |x, θ)dyt
1
Z
= [Dθ f (yt |xt , θ)] f (yt |xt , θ)dyt
f (yt |xt , θ)
Z
= Dθ f (yt |xt , θ)dyt .
= Dθ 1
= 0
Recall that we assume that sn (θ) is twice continuously differentiable. Take a first
order Taylor’s series expansion of g(Y, θ̂) about the true value θ0 :
0 ≡ g(θ̂) = g(θ0 ) + (Dθ0 g(θ∗ )) θ̂ − θ0
H(θ∗ ) θ̂ − θ0 = −g(θ0 ),
where θ∗ = λθ̂ + (1 − λ)θ0 , 0 < λ < 1. Assume H(θ∗ ) is invertible (we’ll justify this
in a minute). So
√ √
n θ̂ − θ0 = −H(θ∗ )−1 ng(θ0 )
= D2θ sn (θ∗ )
1 n 2
= ∑ Dθ ln ft (θ∗)
n t=1
don’t assume any particular set here, since the appropriate assumptions will depend
upon the particularities of a given model. However, we assume that a SLLN applies.
Also, since we know that θ̂ is consistent, and since θ∗ = λθ̂ + (1 − λ)θ0 , we have
a.s.
that θ∗ → θ0 . Also, by the above differentiability assumtion, H(θ) is continuous in θ.
Given this, H(θ∗ ) converges to the limit of it’s expectation:
a.s.
H(θ∗ ) → lim E D2θ sn (θ0 ) = H∞ (θ0 ) < ∞
n→∞
= D2θ s∞ (θ0 , θ0 )
i.e., θ0 maximizes the limiting objective function. Since there is a unique maximizer,
and by the assumption that sn (θ) is twice continuously differentiable (which holds in
the limit), then H∞ (θ0 ) must be negative definite, and therefore of full rank. Therefore
the previous inversion is justified, asymptotically, and we have
√ a.s. √
(4.4.1) n θ̂ − θ0 → −H∞ (θ0 )−1 ng(θ0 ).
4.4. ASYMPTOTIC NORMALITY OF MLE 64
√
Now consider ng(θ0 ). This is
√ √
ngn (θ0 ) = nDθ sn (θ)
√ n
n
= ∑ Dθ ln ft (yt |xt , θ0)
n t=1
1 n
= √ ∑ gt (θ0 )
n t=1
We’ve already seen that Eθ [gt (θ)] = 0. As such, it is reasonable to assume that a CLT
applies.
a.s.
Note that gn (θ0 ) → 0, by consistency. To avoid this collapse to a degenerate r.v. (a
√
constant vector) we need to scale by n. A generic CLT states that, for Xn a random
vector that satisfies certain conditions,
d
Xn − E(Xn ) → N(0, limV (Xn ))
The “certain conditions” that Xn must satisfy depend on the case at hand. Usually, Xn
√
will be of the form of an average, scaled by n:
n
√ ∑t=1 Xt
Xn = n
n
√
This is the case for ng(θ0 ) for example. Then the properties of Xn depend on the
properties of the Xt . For example, if the Xt have finite variances and are not too strongly
dependent, then a CLT for dependent processes will apply. Supposing that a CLT
√
applies, and noting that E( ngn (θ0 ) = 0, we get
√ d
I∞ (θ0 )−1/2 ngn (θ0 ) → N [0, IK ]
4.4. ASYMPTOTIC NORMALITY OF MLE 65
where
I∞ (θ0 ) = lim Eθ0 n [gn (θ0 )] [gn (θ0 )]0
n→∞
√
= lim Vθ0 ngn (θ0 )
n→∞
√ d
(4.4.2) ngn (θ0 ) → N [0, I∞ (θ0 )]
√ a
n θ̂ − θ0 ∼ N 0, H∞ (θ0 )−1 I∞ (θ0 )H∞ (θ0 )−1 .
√
D EFINITION 1 (CAN). An estimator θ̂ of a parameter θ0 is n-consistent and
asymptotically normally distributed if
√ d
(4.4.3) n θ̂ − θ0 → N (0,V∞ )
√ p
There do exist, in special cases, estimators that are consistent such that n θ̂ − θ0 →
√
0. These are known as superconsistent estimators, since normally, n is the highest
factor that we can multiply by an still get convergence to a stable limiting distribution.
Estimators that are CAN are asymptotically unbiased, though not all consistent
estimators are asymptotically unbiased. Such cases are unusual, though. An example
is
1 − n1 , θ̂ = θ0
f (θ̂) =
1
n, θ̂ = n
Show that this estimator is consistent but asymptotically biased. Also ask yourself how
you could define an estimator that would have this density.
We will show that H∞ (θ) = −I∞ (θ). Let ft (θ) be short for f (yt |xt , θ)
Z
1 = ft (θ)dy, so
Z
0 = Dθ ft (θ)dy
Z
= (Dθ ln ft (θ)) ft (θ)dy
1
Now sum over n and multiply by n
" #
n n
1 1
Eθ ∑
n t=1
[Ht (θ)] = −Eθ ∑
n t=1
[gt (θ)] [gt (θ)]0
The scores gt and gs are uncorrelated for t 6= s, since for t > s, ft (yt |y1 , ..., yt−1 , θ) has
conditioned on prior information, so what was random in s is fixed in t. (This forms the
basis for a specification test proposed by White: if the scores appear to be correlated
one may question the specification of the model). This allows us to write
Eθ [H(θ)] = −Eθ n [g(θ)][g(θ)]0
since all cross products between different periods expect to zero. Finally take limits,
we get
√ a.s.
n θ̂ − θ0 → N 0, H∞ (θ0 )−1 I∞ (θ0 )H∞ (θ0 )−1
simplifies to
√ a.s.
(4.6.3) n θ̂ − θ0 → N 0, I∞ (θ0 )−1
H\
∞ (θ0 ) = H(θ̂).
4.7. THE CRAMÉR-RAO LOWER BOUND 68
−1
V\ \
∞ (θ0 ) = −H∞ (θ0 )
−1
V\ \
∞ (θ0 ) = I∞ (θ0 )
−1 −1
V\ \ \ \
∞ (θ0 ) = H∞ (θ0 ) I∞ (θ0 )H∞ (θ0 )
These are known as the inverse Hessian, outer product of the gradient (OPG) and
sandwich estimators, respectively. The sandwich form is the most robust, since it
coincides with the covariance estimator of the quasi-ML estimator.
lim Eθ (θ̃ − θ) = 0
n→∞
Differentiate wrt θ0 :
Z
Dθ0 lim Eθ (θ̃ − θ) = lim Dθ0 f (Y, θ) θ̃ − θ dy
n→∞ n→∞
Now note that Dθ0 θ̃ − θ = −IK , and f (Y, θ)(−IK )dy = −IK . With this we have
R
Z
lim θ̃ − θ f (θ)Dθ0 ln f (θ)dy = IK .
n→∞
Note that the bracketed part is just the transpose of the score vector, g(θ), so we can
write
√ √
lim Eθ n θ̃ − θ ng(θ)0 = IK
n→∞
√
This means that the covariance of the score function with n θ̃ − θ , for θ̃ any CAN
√
estimator, is an identity matrix. Using this, suppose the variance of n θ̃ − θ tends
to V∞ (θ̃). Therefore,
√
n θ̃ − θ V (θ̃) IK
(4.7.1) V∞ √ = ∞ .
ng(θ) IK I∞ (θ)
This simplifies to
α0 V∞ (θ̃) − I∞−1 (θ) α ≥ 0.
4.7. THE CRAMÉR-RAO LOWER BOUND 70
Since α is arbitrary, V∞ (θ̃) − I∞ (θ) is positive semidefinite. This conludes the proof.
This means that I∞−1 (θ) is a lower bound for the asymptotic variance of a CAN
estimator.
• Consistent
• Asymptotically normal (CAN)
• Asymptotically efficient
• Asymptotically unbiased
• This is for general MLE: we haven’t specified the distribution or the lineari-
ty/nonlinearity of the estimator
EXERCISES 71
Exercises
(1) Consider coin tossing with a single possibly biased coin. The density function for
the random variable y = 1(heads) is
y
fY (y, p0 ) = p0 (1 − p0 )1−y , y ∈ {0, 1}
= 0, y ∈
/ {0, 1}
Suppose that we have a sample of size n. We know from above that the ML esti-
mator is pb0 = ȳ. We also know from the theory above that
√ a
n (ȳ − p0 ) ∼ N 0, H∞ (p0 )−1 I∞ (p0 )H∞ (p0 )−1
a) find the analytical expressions for H∞ (p0 ) and I∞ (p0 ) for this problem
√
b) Write an Octave program that does a Monte Carlo study that shows that n (ȳ − p0 )
is approximately normally distributed when n is large. Please give me histograms
√
that show the sampling frequency of n (ȳ − p0 ) for several values of n.
(2) Consider the model yt = xt0 β + αεt where the errors follow the Cauchy (Student-t
with 1 degree of freedom) density. So
1
f (εt ) = , −∞ < εt < ∞
π 1 + εt2
The Cauchy density has a shape similar to a normal density, but with much thicker
tails. Thus, extremely small and large errors occur much more frequently with this
density than would happen if the errors were normally distributed. Find the score
0
function gn (θ) where θ = β0 α .
(3) Consider the model classical linear regression model yt = xt0 β + εt where εt ∼
0
IIN(0, σ2). Find the score function gn (θ) where θ = β0 σ .
EXERCISES 72
(4) Compare the first order conditional that define the ML estimators of problems 2
and 3 and interpret the differences. Why are the first order conditions that define
an efficient estimator different in the two cases?
CHAPTER 5
The OLS estimator under the classical assumptions is unbiased and BLUE, for all
sample sizes. Now let’s see what happens when the sample size tends to infinity.
5.1. Consistency
β̂ = (X 0 X )−1 X 0 y
= (X 0 X )−1 X 0 (X β + ε)
= β0 + (X 0 X )−1 X 0 ε
0 −1 0
XX Xε
= β0 +
n n
−1
X 0X X 0X
Consider the last two terms. By assumption limn→∞ n = QX ⇒ limn→∞ n =
Q−1
X , since the inverse of a nonsingular matrix is a continuous function of the elements
X 0ε
of the matrix. Considering n ,
X 0ε 1 n
= ∑ xt εt
n n t=1
V (xt εt ) = xt xt0 σ2 .
As long as these are finite, and given a technical condition1, the Kolmogorov SLLN
applies, so
1 n a.s.
∑ xt εt → 0.
n t=1
This implies that
a.s.
β̂ → β0 .
This is the property of strong consistency: the estimator converges in almost surely to
the true value.
We’ve seen that the OLS estimator is normally distributed under the assumption
of normal errors. If the error distribution is unknown, we of course don’t know the
distribution of the estimator. However, we can get asymptotic results. Assuming the
distribution of ε is unknown, but the the other classical assumptions hold:
β̂ = β0 + (X 0 X )−1 X 0 ε
β̂ − β0 = (X 0 X )−1 X 0 ε
0 −1 0
√ XX Xε
n β̂ − β0 = √
n n
1For application of LLN’s and CLT’s, of which there are very many to choose from, I’m going to avoid
the technicalities. Basically, as long as terms of an average have finite variances and are not too strongly
dependent, one will be able to find a LLN or CLT to apply.
5.3. ASYMPTOTIC EFFICIENCY 75
−1
X 0X
• Now as before, n → Q−1
X .
√ε,
X 0
• Considering n
the limit of the variance is
X 0ε X 0 εε0 X
lim V √ = lim E
n→∞ n n→∞ n
= σ20 QX
X 0ε d
√ → N 0, σ20 QX
n
Therefore,
√
d
2 −1
n β̂ − β0 → N 0, σ0 QX
n 2
s(β) = ∑ yt − xt0 β
t=1
y = X β0 + ε,
5.3. ASYMPTOTIC EFFICIENCY 76
ε ∼ N(0, σ20 In ), so
n
1 ε2
f (ε) = ∏ √ exp − t 2
t=1 2πσ
2 2σ
The joint density for y can be constructed using a change of variables. We have ε =
∂ε ∂ε
y − X β, so ∂y0 = In and | ∂y 0 | = 1, so
n
1 (yt − xt0 β)2
f (y) = ∏ √ exp − .
t=1 2πσ
2 2σ2
Taking logs,
√ (yt − xt0 β)2
n
ln L(β, σ) = −n ln 2π − n ln σ − ∑ .
t=1 2σ2
It’s clear that the fonc for the MLE of β0 are the same as the fonc for OLS (up to multi-
plication by a constant), so the estimators are the same, under the present assumptions.
Therefore, their properties are the same. In particular, under the classical assumptions
with normality, the OLS estimator β̂ is asymptotically efficient.
As we’ll see later, it will be possible to use (iterated) linear estimation methods and
still achieve asymptotic efficiency even if the assumption that Var(ε) 6= σ 2 In , as long
as ε is still normally distributed. This is not the case if ε is nonnormal. In general with
nonnormal errors it will be necessary to use nonlinear estimation methods to achieve
asymptotically efficient estimation.
CHAPTER 6
ln q = β0 + β1 ln p1 + β2 ln p2 + β3 ln m + ε,
k0 ln q = β0 + β1 ln kp1 + β2 ln kp2 + β3 ln km + ε,
so
β1 ln p1 + β2 ln p2 + β3 ln m = β1 ln kp1 + β2 ln kp2 + β3 ln km
= (ln k) (β1 + β2 + β3 ) + β1 ln p1 + β2 ln p2 + β3 ln m.
β1 + β2 + β3 = 0,
77
6.1. EXACT LINEAR RESTRICTIONS 78
y = Xβ + ε
Rβ = r
Let’s consider how to estimate β subject to the restrictions Rβ = r. The most obvious
approach is to set up the Lagrangean
1
min s(β) = (y − X β)0 (y − X β) + 2λ0 (Rβ − r).
β n
The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc
are
We get
−1
β̂R X 0 X R0 X 0y
= .
λ̂ R 0 r
For the masochists: Stepwise Inversion
6.1. EXACT LINEAR RESTRICTIONS 79
Note that
(X 0 X )−1 0 X 0 X R0
≡ AB
−R (X 0 X )−1 IQ R 0
IK (X 0 X )−1 R0
=
0 −R (X 0 X )−1 R0
IK (X 0 X )−1 R0
≡
0 −P
≡ C,
and
IK (X 0 X )−1 R0 P−1 IK (X 0 X )−1 R0
≡ DC
0 −P−1 0 −P
= IK+Q ,
so
DAB = IK+Q
DA = B−1
−1
IK (X X ) R P
0 −1 0 −1 (X X )
0 0
B−1 =
−1
0 −P −1 −R (X X )0 IQ
(X 0 X )−1 − (X 0 X )−1 R0 P−1 R (X 0 X )−1 (X 0X )−1 R0 P−1
= ,
−1 0 −1 −1
P R (X X ) −P
6.1. EXACT LINEAR RESTRICTIONS 80
The fact that β̂R and λ̂ are linear functions of β̂ makes it easy to determine their dis-
tributions, since the distribution of β̂ is already known. Recall that for x a random
vector, and for A and b a matrix and vector of constants, respectively, Var (Ax + b) =
AVar(x)A0.
Though this is the obvious way to go about finding the restricted estimator, an
easier way, if the number of restrictions is small, is to impose them by substitution.
Write
y = X1 β1 + X2 β2 + ε
h i β1
R1 R2 = r
β2
1 r − R 1 R 2 β2 .
β1 = R−1 −1
6.1. EXACT LINEAR RESTRICTIONS 81
1 r − X1 R1 R2 β2 + X2 β2 + ε
y = X1 R−1 −1
h i
y − X1 R1 r = X2 − X1 R1 R2 β2 + ε
−1 −1
yR = XR β2 + ε.
This model satisfies the classical assumptions, supposing the restriction is true. One
can estimate by OLS. The variance of β̂2 is as before
−1
V (β̂2 ) = XR0 XR σ20
where one estimates σ20 in the normal way, using the restricted model, i.e.,
0
yR − XR βb2 yR − XR βb2
c2 =
σ 0
n − (K − Q)
To recover β̂1 , use the restriction. To find the variance of β̂1 , use the fact that it is a
linear function of β̂2 , so
0
V (β̂1 ) = R−1
1 R 2 V ( β̂ 2 )R 0
2 R −1
1
−1 0 −1 0 2
= R−1
1 R 2 X 0
2 2X R2 R1 σ0
6.1. EXACT LINEAR RESTRICTIONS 82
β̂R − β = (X 0X )−1 X 0 ε
Noting that the crosses between the second term and the other terms expect to zero,
and that the cross of the first and third has a cancellation with the square of the third,
we obtain
MSE(β̂R ) = (X 0 X )−1 σ2
So, the first term is the OLS covariance. The second term is PSD, and the third term is
NSD.
• If the restriction is true, the second term is 0, so we are better off. True
restrictions improve efficiency of estimation.
• If the restriction is false, we may be better or worse off, in terms of MSE,
depending on the magnitudes of r − Rβ and σ2 .
6.2. TESTING 83
6.2. Testing
In many cases, one wishes to test economic theories. If theory suggests parame-
ter restrictions, as in the above homogeneity example, one can test theory by testing
parameter restrictions. A number of tests are available.
y = Xβ + ε
and one wishes to test the single restriction H0 :Rβ = r vs. HA :Rβ 6= r . Under H0 ,
with normality of the errors,
Rβ̂ − r ∼ N 0, R(X 0X )−1 R0 σ20
so
Rβ̂ − r Rβ̂ − r
q = p ∼ N (0, 1) .
R(X 0 X )−1 R0 σ20 σ0 R(X X ) R
0 −1 0
c2 in place
The problem is that σ20 is unknown. One could use the consistent estimator σ 0
of σ20 , but the test would only be valid asymptotically in this case.
P ROPOSITION 4.
N(0, 1)
(6.2.1) q 2 ∼ t(q)
χ (q)
q
(6.2.2) x0 x ∼ χ2 (n, λ)
6.2. TESTING 84
We’ll prove this one as an indication of how the following unproven propositions
could be proved.
Proof: Factor V −1 as PP0 (this is the Cholesky factorization). Then consider y =
P0 x. We have
y ∼ N(0, P0V P)
but
V PP0 = In
P0V PP0 = P0
y0 y = x0 PP0 x = xV −1 x
(6.2.3) x0 Bx ∼ χ2 (ρ(B))
An immediate consequence is
P ROPOSITION 8. If the random vector (of dimension n) x ∼ N(0, I), and B is idem-
potent with rank r, then
(6.2.4) x0 Bx ∼ χ2 (r).
ε̂0 ε̂ ε 0 MX ε
=
σ20 σ20
0
ε ε
= MX
σ0 σ0
∼ χ2 (n − K)
P ROPOSITION 9. If the random vector (of dimension n) x ∼ N(0, I), then Ax and
x0 Bx are independent if AB = 0.
Now consider (remember that we have only one restriction in this case)
√ Rβ̂−r
σ0 R(X 0 X)−1 R0 Rβ̂ − r
q 0 = p
ε̂ ε̂ σ
c0 R(X 0 X )−1 R0
(n−K)σ20
This will have the t(n − K) distribution if β̂ and ε̂0 ε̂ are independent. But β̂ = β +
(X 0 X )−1 X 0 ε and
(X 0 X )−1 X 0 MX = 0,
so
Rβ̂ − r Rβ̂ − r
p = ∼ t(n − K)
σ
c0 R(X 0X )−1 R0 σ̂Rβ̂
6.2. TESTING 86
β̂i
∼ t(n − K)
σ̂β̂i
• Note: the t− test is strictly valid only if the errors are actually normally
distributed. If one has nonnormal errors, one could use the above asymptotic
result to justify taking critical values from the N(0, 1) distribution, since t(n −
d
K) → N(0, 1) as n → ∞. In practice, a conservative procedure is to take critical
values from the t distribution if nonnormality is suspected. This will reject
H0 less often since the t distribution is fatter-tailed than is the normal.
x/r
(6.2.5) ∼ F(r, s)
y/s
P ROPOSITION 11. If the random vector (of dimension n) x ∼ N(0, I), then x 0 Ax
and x0 Bx are independent if AB = 0.
Using these results, and previous results on the χ2 distribution, it is simple to show
that the following statistic has the F distribution:
0 −1
Rβ̂ − r R (X 0 X )−1 R0 Rβ̂ − r
F= ∼ F(q, n − K).
qσ̂2
A numerically equivalent expression is
6.2. TESTING 87
(ESSR − ESSU ) /q
∼ F(q, n − K).
ESSU /(n − K)
• Note: The F test is strictly valid only if the errors are truly normally dis-
tributed. The following tests will be appropriate when one cannot assume
normally distributed errors.
6.2.3. Wald-type tests. The Wald principle is based on the idea that if a restriction
is true, the unrestricted model should “approximately” satisfy the restriction. Given
that the least squares estimator is asymptotically normally distributed:
√
d
n β̂ − β0 → N 0, σ20 Q−1
X
√
d
2
n Rβ̂ − r → N 0, σ0 RQX R
−1 0
so by Proposition [6]
0 −1
d
n Rβ̂ − r σ20 RQ−1
X R
0
Rβ̂ − r → χ2 (q)
2
X or σ0 are not observable. The test statistic we use substitutes the con-
Note that Q−1
sistent estimators. Use (X 0 X /n)−1 as the consistent estimator of Q−1
X . With this, there
• The Wald test is a simple way to test restrictions without having to estimate
the restricted model.
• Note that this formula is similar to one of the formulae provided for the F
test.
6.2. TESTING 88
6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests). In some cases,
an unrestricted model may be nonlinear in the parameters, but the model is linear in
the parameters under the null hypothesis. For example, the model
y = (X β)γ + ε
• Score-type tests are based upon the general principle that the gradient vec-
tor of the unrestricted model, evaluated at the restricted estimate, should be
asymptotically normally distributed with mean zero, if the restrictions are
true. The original development was for ML estimation, but the principle is
valid for a wide variety of estimation methods.
−1
λ̂ = R(X 0X )−1 R0 Rβ̂ − r
= P −1
Rβ̂ − r
Given that
√
d
2
n Rβ̂ − r → N 0, σ0 RQX R
−1 0
√
d
nλ̂ → N 0, σ20 P−1 RQ−1
X R 0 −1
P
or
√
d
nλ̂ → N 0, σ20 lim n (nP)−1 RQ−1
X R 0 −1
P
6.2. TESTING 89
since the n’s cancel and inserting the limit of a matrix of constants changes nothing.
However,
√ d
nλ̂ → N 0, σ20 lim nP−1
In this case,
R(X 0 X )−1 R0 d
λ̂ 0
2
λ̂ → χ2 (q)
σ0
since the powers of n cancel. To get a usable test statistic substitute a consistent esti-
mator of σ20 .
• This makes it clear why the test is sometimes referred to as a Lagrange mul-
tiplier test. It may seem that one needs the actual Lagrange multipliers to
calculate this. If we impose the restrictions by substitution, these are not
available. Note that the test can be written as
0
R0 λ̂ (X 0X )−1 R0 λ̂ d
2
→ χ2 (q)
σ0
−X 0 y + X 0 X β̂R + R0 λ̂
6.2. TESTING 90
to get that
R0 λ̂ = X 0 (y − X β̂R)
= X 0 ε̂R
give us
R0 λ̂ = X 0 y − X 0 X β̂R
and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the
restricted estimator. The scores evaluated at the unrestricted estimate are identically
zero. The logic behind the score test is that the scores evaluated at the restricted esti-
mate should be approximately zero, if the restriction is true. The test is also known as
a Rao test, since P. Rao first proposed it in 1948.
6.2.5. Likelihood ratio-type tests. The Wald test can be calculated using the un-
restricted model. The score test can be calculated using only the restricted model. The
likelihood ratio test, on the other hand, uses both the restricted and the unrestricted
estimators. The test statistic is
LR = 2 ln L(θ̂) − ln L(θ̃)
6.2. TESTING 91
where θ̂ is the unrestricted estimate and θ̃ is the restricted estimate. To show that it is
asymptotically χ2 , take a second order Taylor’s series expansion of ln L(θ̃) about θ̂ :
n 0
ln L(θ̃) ' ln L(θ̂) + θ̃ − θ̂ H(θ̂) θ̃ − θ̂
2
(note, the first order term drops out since Dθ ln L(θ̂) ≡ 0 by the fonc and we need to
multiply the second-order term by n since H(θ) is defined in terms of n1 ln L(θ)) so
0
LR ' −n θ̃ − θ̂ H(θ̂) θ̃ − θ̂
a 0
LR = n θ̃ − θ̂ I∞ (θ0 ) θ̃ − θ̂
√ a
n θ̂ − θ0 = I∞ (θ0 )−1 n1/2 g(θ0 ).
An analogous result for the restricted estimator is (this is unproven here, to prove
this set up the Lagrangean for MLE subject to Rβ = r, and manipulate the first order
conditions) :
√ a −1
n θ̃ − θ0 = I∞ (θ0 )−1 In − R0 RI∞ (θ0 )−1 R0 RI∞ (θ0 )−1 n1/2 g(θ0 ).
√ a −1
n θ̃ − θ̂ = −n1/2 I∞ (θ0 )−1 R0 RI∞ (θ0 )−1 R0 RI∞ (θ0 )−1 g(θ0 )
But since
d
n1/2 g(θ0 ) → N (0, I∞ (θ0 ))
d
RI∞ (θ0 )−1 n1/2 g(θ0 ) → N(0, RI∞ (θ0 )−1 R0 ).
We can see that LR is a quadratic form of this rv, with the inverse of its variance in the
middle, so
d
LR → χ2 (q).
6.3. The asymptotic equivalence of the LR, Wald and score tests
We have seen that the three tests all converge to χ2 random variables. In fact,
they all converge to the same χ2 rv, under the null hypothesis. We’ll show that the
Wald and LR tests are asymptotically equivalent. We have seen that the Wald test is
asymptotically equivalent to
0 −1
a d
W = n Rβ̂ − r σ20 RQ−1
X R 0
R β̂ − r → χ2 (q)
Using
β̂ − β0 = (X 0 X )−1 X 0ε
and
Rβ̂ − r = R(β̂ − β0 )
we get
√ √
nR(β̂ − β0 ) = nR(X 0 X )−1 X 0 ε
0 −1
XX
= R n−1/2 X 0 ε
n
6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 93
a −1
= ε0 X (X 0X )−1 R0 σ20 R(X 0X )−1 R0 R(X 0X )−1 X 0 ε
a ε0 A(A0 A)−1 A0 ε
=
σ20
a ε0 PR ε
=
σ20
• Note that this matrix is idempotent and has q columns, so the projection ma-
trix has rank q.
a −1
LR = n1/2 g(θ0 )0 I (θ0 )−1 R0 RI (θ0 )−1 R0 RI (θ0 )−1 n1/2 g(θ0 )
√ 1 (y − X β)0 (y − X β)
ln L(β, σ) = −n ln 2π − n ln σ − .
2 σ2
Using this,
1
g(β0 ) ≡ Dβ ln L(β, σ)
n
X (y − X β0 )
0
=
nσ2
X 0ε
=
nσ2
6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 94
so
I (θ0 )−1 = σ2 Q−1
X
a −1
LR = ε0 X 0 (X 0X )−1 R0 σ20 R(X 0 X )−1 R0 R(X 0 X )−1 X 0 ε
a ε0 PR ε
=
σ20
a
= W
This completes the proof that the Wald and LR tests are asymptotically equivalent.
Similarly, one can show that, under the null hypothesis,
a a a
qF = W = LM = LR
• The proof for the statistics except for LR does not depend upon normality of
the errors, as can be verified by examining the expressions for the statistics.
• The LR statistic is based upon distributional assumptions, since one can’t
write the likelihood function without them.
6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 95
• However, due to the close relationship between the statistics qF and LR, sup-
posing normality, the qF statistic can be thought of as a pseudo-LR statistic,
in that it’s like a LR statistic in that it uses the value of the objective functions
of the restricted and unrestricted models, but it doesn’t require distributional
assumptions.
• The presentation of the score and Wald tests has been done in the context
of the linear model. This is readily generalizable to nonlinear models and/or
other estimation methods.
Though the four statistics are asymptotically equivalent, they are numerically different
in small samples. The numeric values of the tests also depend upon how σ 2 is esti-
mated, and we’ve already seen than there are several ways to do this. For example all
of the following are consistent for σ2 under H0
ε̂0 ε̂
n−k
ε̂0 ε̂
n
ε̂0R ε̂R
n−k+q
ε̂0R ε̂R
n
and in general the denominator call be replaced with any quantity a such that lim a/n =
1.
It can be shown, for linear regression models subject to linear restrictions, and if
ε̂0 ε̂ ε̂0R ε̂R
n is used to calculate the Wald test and n is used for the score test, that
For this reason, the Wald test will always reject if the LR test rejects, and in turn the
LR test rejects if the LM test rejects. This is a bit problematic: there is the possibility
that by careful choice of the statistic used, one can manipulate reported results to favor
or disfavor a hypothesis. A conservative/honest approach would be to report all three
test statistics when they are available. In the case of linear models with normal errors
the F test is to be preferred, since asymptotic approximations are not an issue.
The small sample behavior of the tests can be quite different. The true size (proba-
bility of rejection of the null when the null is true) of the Wald test is often dramatically
higher than the nominal size associated with the asymptotic distribution. Likewise, the
true size of the score test is often smaller than the nominal size.
Now that we have a menu of test statistics, we need to know how to use them.
Confidence intervals for single coefficients are generated in the normal manner.
Given the t statistic
β̂ − β
t(β) =
σ
cβ̂
a 100 (1 − α) % confidence interval for β0 is defined by the bounds of the set of β such
that t(β) does not reject H0 : β0 = β, using a α significance level:
β̂ − β
C(α) = {β : −cα/2 < < cα/2 }
σ
cβ̂
A confidence ellipse for two coefficients jointly would be, analogously, the set of
{β1 , β2 } such that the F (or some other test statistic) doesn’t reject at the specified
critical value. This generates an ellipse, if the estimators are correlated.
• The region is an ellipse, since the CI for an individual coefficient defines a (in-
finitely long) rectangle with total prob. mass 1 − α, since the other coefficient
is marginalized (e.g., can take on any value). Since the ellipse is bounded
in both dimensions but also contains mass 1 − α, it must extend beyond the
bounds of the individual CI.
• From the pictue we can see that:
– Rejection of hypotheses individually does not imply that the joint test
will reject.
– Joint rejection does not imply individal tests will reject.
6.6. Bootstrapping
When we rely on asymptotic theory to use the normal distribution-based tests and
confidence intervals, we’re often at serious risk of making important errors. If the
sample size is small and errors are highly nonnormal, the small sample distribution
√
of n β̂ − β0 may be very different than its large sample distribution. Also, the
distributions of test statistics may not resemble their limiting distributions at all. A
means of trying to gain information on the small sample distribution of test statistics
and estimators is the bootstrap. We’ll consider a simple example, just to get the main
idea.
6.6. BOOTSTRAPPING 98
Suppose that
y = X β0 + ε
ε ∼ IID(0, σ20)
X is nonstochastic
(1) Draw n observations from ε̂ with replacement. Call this vector ε̃ j (it’s a
n × 1).
(2) Then generate the data by ỹ j = X β̂ + ε̃ j
(3) Now take this and estimate
β̃ j = (X 0X )−1 X 0ỹ j .
(4) Save β̃ j
(5) Repeat steps 1-4, until we have a large number, J, of β̃ j .
With this, we can use the replications to calculate the empirical distribution of β̃ j .
One way to form a 100(1-α)% confidence interval for β0 would be to order the β̃ j
from smallest to largest, and drop the first and last Jα/2 of the replications, and use
the remaining endpoints as the limits of the CI. Note that this will not give the shortest
CI if the empirical distribution is skewed.
• Suppose one was interested in the distribution of some function of β̂, for
example a test statistic. Simple: just calculate the transformation for each j,
and work with the empirical distribution of the transformation.
6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 100
• If the assumption of iid errors is too strong (for example if there is het-
eroscedasticity or autocorrelation, see below) one can work with a bootstrap
defined by sampling from (y, x) with replacement.
• How to choose J: J should be large enough that the results don’t change with
repetition of the entire bootstrap. This is easy to check. If you find the results
change a lot, increase J and try again.
• The bootstrap is based fundamentally on the idea that the empirical distri-
bution of the sample data converges to the actual sampling distribution as n
becomes large, so statistics based on sampling from the empirical distribution
should converge in distribution to statistics based on sampling from the actual
sampling distribution.
• In finite samples, this doesn’t hold. At a minimum, the bootstrap is a good
way to check if asymptotic theory results offer a decent approximation to the
small sample distribution.
Testing nonlinear restrictions of a linear model is not much more difficult, at least
when the model is linear. Since estimation subject to nonlinear restrictions requires
nonlinear estimation methods, which are beyond the score of this course, we’ll just
consider the Wald test for nonlinear restrictions on a linear model.
Consider the q nonlinear restrictions
r(β0 ) = 0.
where r(·) is a q-vector valued function. Write the derivative of the restriction evalu-
ated at β as
Dβ0 r(β)β = R(β)
6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 101
ρ(R(β)) = q
√ a √
nr(β̂) = nR(β0 )(β̂ − β0 )
√
We’ve already seen the distribution of n(β̂ − β0 ). Using this we get
√
d 0 2
nr(β̂) → N 0, R(β0 )QX R(β0 ) σ0 .
−1
under the null hypothesis. Substituting consistent estimators for β0, QX and σ20 , the
resulting statistic is
−1
r(β̂)0 R(β̂)(X 0X )−1 R(β̂)0 r(β̂) d
→ χ2 (q)
c2
σ
Note that this also gives a convenient way to estimate nonlinear functions and associ-
ated asymptotic confidence intervals. If the nonlinear function r(β 0 ) is not hypothe-
sized to be zero, we just have
√
d
0 2
n r(β̂) − r(β0 ) → N 0, R(β0 )QX R(β0 ) σ0
−1
∂ f (x) x
η(x) =
∂x f (x)
β
η(x) = x
x0 β
(note that this is the entire vector of elasticities). The estimated elasticities are
β̂
η
b(x) = x
x0 β̂
6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 103
∂η(x)
R(β) =
∂β0
x1 0 · · · 0 β1 x21 0 ··· 0
.. ..
0 x2 . 0 β2 x22 .
0
. x β − ..
.. .. ..
. 0 . . 0
0 · · · 0 xk 0 ··· 0 βk x2k
= .
(x0 β)2
To get a consistent estimator just substitute in β̂. Note that the elasticity and the stan-
dard error are functions of x. The program ExampleDeltaMethod.m shows how this
can be done.
In many cases, nonlinear restrictions can also involve the data, not just the param-
eters. For example, consider a model of expenditure shares. Let x(p, m) be a demand
funcion, where p is prices and m is income. An expenditure share system for G goods
is
pi xi (p, m)
si (p, m) = , i = 1, 2, ..., G.
m
Now demand must be positive, and we assume that expenditures sum to income, so we
have the restrictions
0 ≤ si (p, m) ≤ 1, ∀i
G
∑ si(p, m) = 1
i=1
It is fairly easy to write restrictions such that the shares sum to one, but the restriction
that the shares lie in the [0, 1] interval depends on both parameters and the values of p
and m. It is impossible to impose the restriction that 0 ≤ si (p, m) ≤ 1 for all possible p
and m. In such cases, one might consider whether or not a linear model is a reasonable
specification.
Remember that we in a previous example (section 3.8.3) that the OLS results for
the Nerlove model are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943
*********************************************************
*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.925652
Sigma-squared 0.155686
*******************************************************
Value p-value
F 0.574 0.450
Wald 0.594 0.441
LR 0.593 0.441
Score 0.592 0.442
*******************************************************
6.8. EXAMPLE: THE NERLOVE DATA 106
*******************************************************
Value p-value
F 256.262 0.000
Wald 265.414 0.000
LR 150.863 0.000
Score 93.771 0.000
Notice that the input price coefficients in fact sum to 1 when HOD1 is imposed.
HOD1 is not rejected at usual significance levels (e.g., α = 0.10). Also, R 2 does not
drop much when the restriction is imposed, compared to the unrestricted results. For
CRTS, you should note that βQ = 1, so the restriction is satisfied. Also note that the
hypothesis that βQ = 1 is rejected by the test statistics at all reasonable significance
6.8. EXAMPLE: THE NERLOVE DATA 107
levels. Note that R2 drops quite a bit when imposing CRTS. If you look at the unre-
stricted estimation results, you can see that a t-test for βQ = 1 also rejects, and that a
confidence interval for βQ does not overlap 1.
From the point of view of neoclassical economic theory, these results are not
anomalous: HOD1 is an implication of the theory, but CRTS is not.
E XERCISE 12. Modify the NerloveRestrictions.m program to impose and test the
restrictions jointly.
The Chow test. Since CRTS is rejected, let’s examine the possibilities more care-
fully. Recall that the data is sorted by output (the third column). Define 5 subsamples
of firms, with the first group being the 29 firms with the lowest output levels, then the
next 29 firms, etc. The five subsamples can be indexed by j = 1, 2, ..., 5, where j = 1
for t = 1, 2, ...29, j = 2 for t = 30, 31, ...58, etc. Define a piecewise linear model
j j j j j
(6.8.1) lnCt = β1 + β2 ln Qt + β3 ln PLt + β4 ln PFt + β5 ln PKt + εt
where j is a superscript (not a power) that inicates that the coefficients may be different
according to the subsample in which the observation falls. That is, the coefficients
depend upon j which in turn depends upon t. Note that the first column of nerlove.data
indicates this way of breaking up the sample. The new model may be written as
1
0 β ε 1
y1 X1 0 · · ·
2 2
y2 0 X2 β ε
. . .
(6.8.2) .. = .. X3 + ..
X4 0
y5 0 X5 β5 ε5
6.8. EXAMPLE: THE NERLOVE DATA 108
where y1 is 29×1, X1 is 29×5, β j is the 5×1 vector of coefficient for the jth subsample,
and ε j is the 29 × 1 vector of errors for the jth subsample.
The Octave program Restrictions/ChowTest.m estimates the above model. It also
tests the hypothesis that the five subsamples share the same parameter vector, or in
other words, that there is coefficient stability across the five subsamples. The null to
test is that the parameter vectors for the separate groups are all the same, that is,
β1 = β 2 = β 3 = β 4 = β 5
This type of test, that parameters are constant across different sets of data, is sometimes
referred to as a Chow test.
• There are 20 restrictions. If that’s not clear to you, look at the Octave pro-
gram.
• The restrictions are rejected at all conventional significance levels.
Since the restrictions are rejected, we should probably use the unrestricted model for
analysis. What is the pattern of RTS as a function of the output group (small to large)?
Figure 6.8.1 plots RTS. We can see that there is increasing RTS for small firms, but
that RTS is approximately constant for large firms.
6.8. EXAMPLE: THE NERLOVE DATA 109
2.4
2.2
1.8
1.6
1.4
1.2
0.8
1 1.5 2 2.5 3 3.5 4 4.5 5
Output group
(1) Using the Chow test on the Nerlove model, we reject that there is coefficient
stability across the 5 groups. But perhaps we could restrict the input price
coefficients to be the same but let the constant and output coefficients vary by
group size. This new model is
j j
(6.8.3) lnCi = β1 + β2 ln Qi + β3 ln PLi + β4 ln PFi + β5 ln PKi + εi
(a) estimate this model by OLS, giving R, estimated standard errors for coef-
ficients, t-statistics for tests of significance, and the associated p-values.
Interpret the results in detail.
(b) Test the restrictions implied by this model using the F, Wald, score and
likelihood ratio tests. Comment on the results.
6.8. EXAMPLE: THE NERLOVE DATA 110
(c) Plot the estimated RTS parameters as a function of firm size. Compare
the plot to that given in the notes for the unrestricted model. Comment
on the results.
dS = 1
(2) For the simple Nerlove model, estimated returns to scale is RT . Apply
βbq
the delta method to calculate the estimated standard error for estimated RTS.
Directly test H0 : RT S = 1 versus HA : RT S 6= 1 rather than testing H0 : βQ = 1
versus HA : βQ 6= 1. Comment on the results.
(3) Perform a Monte Carlo study that generates data from the model
y = −2 + 1x2 + 1x3 + ε
where the sample size is 30, x2 and x3 are independently uniformly distributed
on [0, 1] and ε ∼ IIN(0, 1)
(a) Compare the means and standard errors of the estimated coefficients us-
ing OLS and restricted OLS, imposing the restriction that β2 + β3 = 2.
(b) Compare the means and standard errors of the estimated coefficients us-
ing OLS and restricted OLS, imposing the restriction that β2 + β3 = 1.
(c) Discuss the results.
(4) Get the Octave scripts bootstrap_example1.m , bootstrap.m , bootstrap_resample_iid.m
and myols.m figure out what they do, run them, and interpret the results.
CHAPTER 7
εt ∼ IID(0, σ2),
or occasionally
εt ∼ IIN(0, σ2).
y = Xβ + ε
E (ε) = 0
V (ε) = Σ
β̂ = (X 0X )−1 X 0 y
= β + (X 0X )−1 X 0 ε
of is invalid. In particular, the formulas for the t, F, χ2 based tests given above
do not lead to statistics with these distributions.
• β̂ is still consistent, following exactly the same argument given before.
• If ε is normally distributed, then
β̂ ∼ N β, (X 0X )−1 X 0ΣX (X 0X )−1
√ √
n β̂ − β = n(X 0X )−1 X 0 ε
0 −1
XX
= n−1/2 X 0 ε
n
Suppose Σ were known. Then one could form the Cholesky decomposition
P0 P = Σ−1
P0 PΣ = In
7.2. THE GLS ESTIMATOR 114
so
P0 PΣP0 = P0 ,
y∗ = X ∗ β + ε ∗ .
This variance of ε∗ = Pε is
E (Pεε0 P0 ) = PΣP0
= In
y∗ = X ∗ β + ε ∗
E (ε∗ ) = 0
V (ε∗ ) = In
satisfies the classical assumptions. The GLS estimator is simply OLS applied to the
transformed model:
β̂GLS = (X ∗0 X ∗ )−1 X ∗0 y∗
The GLS estimator is unbiased in the same circumstances under which the OLS
estimator is unbiased. For example, assuming X is nonstochastic
E (β̂GLS ) = E (X 0Σ−1 X )−1 X 0 Σ−1 y
= E (X 0Σ−1 X )−1 X 0 Σ−1 (X β + ε
= β.
β̂GLS = (X ∗0 X ∗ )−1 X ∗0 y∗
= (X ∗0 X ∗ )−1 X ∗0 (X ∗ β + ε∗ )
= β + (X ∗0 X ∗ )−1 X ∗0 ε∗
so
0
E β̂GLS − β β̂GLS − β = E (X ∗0 X ∗ )−1 X ∗0 ε∗ ε∗0 X ∗ (X ∗0 X ∗ )−1
= (X ∗0X ∗ )−1
= (X 0Σ−1 X )−1
• All the previous results regarding the desirable properties of the least squares
estimator hold, when dealing with the transformed model, since the trans-
formed model satisfies the classical assumptions..
• Tests are valid, using the previous formulas, as long as we substitute X ∗ in
place of X . Furthermore, any test that involves σ2 can set it to 1. This is
preferable to re-deriving the appropriate formulas.
7.3. FEASIBLE GLS 116
• The GLS estimator is more efficient than the OLS estimator. This is a con-
sequence of the Gauss-Markov theorem, since the GLS estimator is based on
a model that satisfies the classical assumptions but the OLS estimator is not.
To see this directly, not that (the following needs to be completed)
The problem is that Σ isn’t known usually, so this estimator isn’t available.
• Consider the dimension of Σ : it’s an n × n matrix with n2 − n /2 + n =
n2 + n /2 unique elements.
• The number of parameters to estimate is larger than n and increases faster
than n. There’s no way to devise an estimator that satisfies a LLN without
adding restrictions.
7.3. FEASIBLE GLS 117
• The feasible GLS estimator is based upon making sufficient assumptions re-
garding the form of Σ so that a consistent estimator can be devised.
Σ = Σ(X , θ)
(1) Consistency
(2) Asymptotic normality
(3) Asymptotic efficiency if the errors are normally distributed. (Cramer-Rao).
(4) Test procedures are asymptotically valid.
7.4. Heteroscedasticity
E (εε0 ) = Σ
is a diagonal matrix, so that the errors are uncorrelated, but have different variances.
Heteroscedasticity is usually thought of as associated with cross sectional data, though
there is absolutely no reason why time series data cannot also be heteroscedastic. Ac-
tually, the popular ARCH (autoregressive conditionally heteroscedastic) models ex-
plicitly assume that a time series is heteroscedastic.
Consider a supply function
qi = β1 + β p Pi + βs Si + εi
where Pi is price and Si is some measure of size of the ith firm. One might suppose
that unobservable factors (e.g., talent of managers, degree of coordination between
production units, etc.) account for the error term εi . If there is more variability in these
factors for large firms than for small firms, then εi may have a higher variance when Si
is high than when it is low.
Another example, individual demand.
qi = β1 + β p Pi + βm Mi + εi
where P is price and M is income. In this case, εi can reflect variations in preferences.
There are more possibilities for expression of preferences when one is rich, so it is
possible that the variance of εi could be higher when M is high.
Add example of group means.
7.4. HETEROSCEDASTICITY 119
√
d
n β̂ − β → N 0, Q−1
X ΩQ −1
X
This matrix has dimension K × K and can be consistently estimated, even if we can’t
estimate Σ consistently. The consistent estimator, under heteroscedasticity but no au-
tocorrelation is
n
b = 1 ∑ x0 xt ε̂2
Ω
n t=1 t t
One can then modify the previous test statistics to obtain tests that are valid when there
is heteroscedasticity of unknown form. For example, the Wald test for H0 : Rβ − r = 0
would be
−1 −1 !−1
0 X 0X X 0X
a
n Rβ̂ − r R Ω̂ R0 Rβ̂ − r ∼ χ2 (q)
n n
7.4.2. Detection. There exist many tests for the presence of heteroscedasticity.
We’ll discuss three methods.
Goldfeld-Quandt. The sample is divided in to three parts, with n1 , n2 and n3 obser-
vations, where n1 + n2 + n3 = n. The model is estimated using the first and third parts
of the sample, separately, so that β̂1 and β̂3 will be independent. Then we have
0
ε̂10 ε̂1 ε1 M 1 ε1 d 2
= → χ (n1 − K)
σ2 σ2
and
7.4. HETEROSCEDASTICITY 120
0
ε̂30 ε̂3 ε3 M 3 ε3 d 2
= → χ (n3 − K)
σ2 σ2
so
ε̂10 ε̂1 /(n1 − K) d
→ F(n1 − K, n3 − K).
ε̂30 ε̂3 /(n3 − K)
The distributional result is exact if the errors are normally distributed. This test is a
two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas
about the possible magnitudes of the variances of the observations, one could order
the observations accordingly, from largest to smallest. In this case, one would use a
conventional one-tailed F-test. Draw picture.
• Ordering the observations is an important step if the test is to have any power.
• The motive for dropping the middle observations is to increase the difference
between the average variance in the subsamples, supposing that there exists
heteroscedasticity. This can increase the power of the test. On the other hand,
dropping too many observations will substantially increase the variance of the
statistics ε̂10 ε̂1 and ε̂30 ε̂3 . A rule of thumb, based on Monte Carlo experiments
is to drop around 25% of the observations.
• If one doesn’t have any ideas about the form of the het. the test will probably
have low power since a sensible data ordering isn’t available.
White’s test. When one has little idea if there exists heteroscedasticity, and no idea
of its potential form, the White test is a possibility. The idea is that if there is ho-
moscedasticity, then
E (εt2 |xt ) = σ2 , ∀t
so that xt or functions of xt shouldn’t help to explain E (εt2 ). The test works as follows:
(1) Since εt isn’t available, use the consistent estimator ε̂t instead.
7.4. HETEROSCEDASTICITY 121
(2) Regress
ε̂t2 = σ2 + zt0 γ + vt
P (ESSR − ESSU ) /P
qF =
ESSU / (n − P − 1)
Note that ESSR = T SSU , so dividing both numerator and denominator by this
we get
R2
qF = (n − P − 1)
1 − R2
Note that this is the R2 or the artificial regression used to test for heteroscedas-
ticity, not the R2 of the original model.
a
nR2 ∼ χ2 (P).
This doesn’t require normality of the errors, though it does assume that the fourth
moment of εt is constant, under the null. Question: why is this necessary?
• The White test has the disadvantage that it may not be very powerful unless
the zt vector is chosen well, and this is hard to do without knowledge of the
form of heteroscedasticity.
• It also has the problem that specification errors other than heteroscedasticity
may lead to rejection.
7.4. HETEROSCEDASTICITY 122
• Note: the null hypothesis of this test may be interpreted as θ = 0 for the
variance model V (εt2 ) = h(α + zt0 θ), where h(·) is an arbitrary function of un-
known form. The test is more general than is may appear from the regression
that is used.
Plotting the residuals. A very simple method is to simply plot the residuals (or
their squares). Draw pictures here. Like the Goldfeld-Quandt test, this will be more
informative if the observations are ordered according to the suspected form of the
heteroscedasticity.
yt = xt0 β + εt
δ
σt2 = E (εt2 ) = zt0 γ
δ
εt2 = zt0 γ + vt
and vt has mean zero. Nonlinear least squares could be used to estimate γ and δ con-
sistently, were εt observable. The solution is to substitute the squared OLS residuals
ε̂t2 in place of εt2 , since it is consistent by the Slutsky theorem. Once we have γ̂ and δ̂,
7.4. HETEROSCEDASTICITY 123
δ̂ p
σ̂t2 = zt0 γ̂ → σt2 .
In the second step, we transform the model by dividing by the standard deviation:
yt x0 β εt
= t +
σ̂t σ̂t σ̂t
or
yt∗ = xt∗0 β + εt∗ .
• This model is a bit complex in that NLS is required to estimate the model of
the variance. A simpler version would be
yt = xt0 β + εt
• Save the pairs (σ2m , δm ), and the corresponding ESSm . Choose the pair with
the minimum ESSm as the estimate.
• Next, divide the model by the estimated standard deviations.
• Can refine. Draw picture.
• Works well when the parameter to be searched over is low dimensional, as in
this case.
Groupwise heteroscedasticity
A common case is where we have repeated observations on each of a number of
economic agents: e.g., 10 years of macroeconomic data on each of a set of countries or
regions, or daily observations of transactions of 200 banks. This sort of data is a pooled
cross-section time-series model. It may be reasonable to presume that the variance is
constant over time within the cross-sectional units, but that it differs across them (e.g.,
firms or countries of different sizes...). The model is
E (ε2it ) = σ2i , ∀t
where i = 1, 2, ..., G are the agents, and t = 1, 2, ..., n are the observations on each agent.
To correct for heteroscedasticity, just estimate each σ2i using the natural estimator:
1 n 2
σ̂2i = ∑ ε̂it
n t=1
7.4. HETEROSCEDASTICITY 125
• Note that we use 1/n here since it’s possible that there are more than n re-
gressors, so n − K could be negative. Asymptotically the difference is unim-
portant.
• With each of these, transform the model as usual:
yit x0 β εit
= it +
σ̂i σ̂i σ̂i
Do this for each cross-sectional group. This transformed model satisfies the
classical assumptions, asymptotically.
7.4.4. Example: the Nerlove model (again!) Let’s check the Nerlove data for
evidence of heteroscedasticity. In what follows, we’re going to use the model with
the constant and output coefficient varying across 5 groups, but with the input price
coefficients fixed (see Equation 6.8.3 for the rationale behind this). Figure 7.4.1, which
is generated by the Octave program GLS/NerloveResiduals.m plots the residuals. We
can see pretty clearly that the error variance is larger for small firms than for larger
firms.
Now let’s try out some tests to formally check for heteroscedasticity. The Octave
program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the
above model. The results are
Value p-value
White’s test 61.903 0.000
Value p-value
GQ test 10.886 0.000
All in all, it is very clear that the data are heteroscedastic. That means that OLS
estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not
7.4. HETEROSCEDASTICITY 126
0.5
-0.5
-1
-1.5
0 20 40 60 80 100 120 140 160
valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming
homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald
test to check for CRTS and HOD1, but using a heteroscedastic-consistent covariance
estimator.1 The results are
Testing HOD1
Value p-value
Wald test 6.161 0.013
Testing CRTS
Value p-value
1By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator
directly to restrict the fully general model with all coefficients varying to the model with only the
constant and the output coefficient varying. But GLS/NerloveRestrictions-Het.m estimates the model
by substituting the restrictions into the model. The methods are equivalent, but the second is more
convenient and easier to understand.
7.4. HETEROSCEDASTICITY 127
We see that the previous conclusions are altered - both CRTS is and HOD1 are rejected
at the 5% level. Maybe the rejection of HOD1 is due to to Wald test’s tendency to over-
reject?
From the previous plot, it seems that the variance of ε is a decreasing function of
output. Suppose that the 5 size groups have different error variances (heteroscedastic-
ity by groups):
Var(εi ) = σ2j ,
*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800
*********************************************************
*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393
*********************************************************
Testing HOD1
Value p-value
Wald test 9.312 0.002
The first panel of output are the OLS estimation results, which are used to consistently
estimate the σ2j . The second panel of results are the GLS estimation results. Some
comments:
• The R2 measures are not comparable - the dependent variables are not the
same. The measure for the GLS results uses the transformed dependent vari-
able. One could calculate a comparable R2 measure, but I have not done so.
• The differences in estimated standard errors (smaller in general for GLS) can
be interpreted as evidence of improved efficiency of GLS, since the OLS stan-
dard errors are calculated using the Huber-White estimator. They would not
be comparable if the ordinary (inconsistent) estimator had been used.
7.5. AUTOCORRELATION 130
• Note that the previously noted pattern in the output coefficients persists. The
nonconstant CRTS result is robust.
• The coefficient on capital is now negative and significant at the 3% level.
That seems to indicate some kind of problem with the model or the data, or
economic theory.
• Note that HOD1 is now rejected. Problem of Wald test over-rejecting? Spec-
ification error in model?
7.5. Autocorrelation
Autocorrelation, which is the serial correlation of the error term, is a problem that
is usually associated with time series data, but also can affect cross-sectional data. For
example, a shock to oil prices will simultaneously affect all countries, so one could
expect contemporaneous correlation of macroeconomic variables across countries.
7.5.1. Causes. Autocorrelation is the existence of correlation across the error term:
E (εt εs ) 6= 0,t 6= s.
yt = xt0 β + εt ,
one could interpret xt0 β as the equilibrium value. Suppose xt is constant over
a number of observations. One can interpret εt as a shock that moves the
system away from equilibrium. If the time needed to return to equilibrium is
long with respect to the observation frequency, one could expect εt+1 to be
positive, conditional on εt positive, which induces a correlation.
7.5. AUTOCORRELATION 131
(2) Unobserved factors that are correlated over time. The error term is often
assumed to correspond to unobservable factors. If these factors are correlated,
there will be autocorrelation.
(3) Misspecification of the model. Suppose that the DGP is
yt = β0 + β1 xt + β2 xt2 + εt
but we estimate
yt = β0 + β1 xt + εt
7.5.2. Effects on the OLS estimator. The variance of the OLS estimator is the
same as in the case of heteroscedasticity - the standard formula does not apply. The
correct formula is given in equation 7.1.1. Next we discuss two GLS corrections for
OLS. These will potentially induce inconsistency when the regressors are nonstochas-
tic (see Chapter8) and should either not be used in that case (which is usually the
relevant case) or used with caution. The more recommended procedure is discussed in
section 7.5.5.
7.5.3. AR(1). There are many types of autocorrelation. We’ll consider two exam-
ples. The first is the most commonly encountered case: autoregressive order 1 (AR(1)
errors. The model is
yt = xt0 β + εt
εt = ρεt−1 + ut
ut ∼ iid(0, σ2u )
εt = ρεt−1 + ut
= ρ (ρεt−2 + ut−1 ) + ut
= ρ2 εt−2 + ρut−1 + ut
σ2u
=
1 − ρ2
V (εt ) = ρ2 E (εt−1
2
) + 2ρE (εt−1 ut ) + E (ut2 )
so
σ2u
V (εt ) =
1 − ρ2
• The variance is the 0th order autocovariance: γ0 = V (εt )
• Note that the variance does not depend on t
= ρV (εt )
ρσ2u
=
1 − ρ2
ρs σ2u
Cov(εt , εt−s ) = γs =
1 − ρ2
7.5. AUTOCORRELATION 134
cov(x, y)
corr(x, y) =
se(x)se(y)
but in this case, the two standard errors are the same, so the s-order autocorrelation ρ s
is
ρs = ρ s
• All this means that the overall matrix Σ has the form
1 ρ ρ2 · · · ρn−1
ρ
1 ρ · · · ρn−2
σ2u .
.. .. ..
Σ= . .
1−ρ 2
| {z } ..
this is the variance . ρ
ρn−1 · · · 1
| {z }
this is the correlation matrix
So we have homoscedasticity, but elements off the main diagonal are not zero.
All of this depends only on two parameters, ρ and σ2u . If we can estimate these
consistently, we can apply FGLS.
It turns out that it’s easy to estimate these consistently. The steps are
p
Since ε̂t → εt , this regression is asymptotically equivalent to the regression
εt = ρεt−1 + ut
1 n ∗ 2 p 2
σ̂2u = ∑ (ût ) → σu
n t=2
(3) With the consistent estimators σ̂2u and ρ̂, form Σ̂ = Σ(σ̂2u , ρ̂) using the previ-
ous structure of Σ, and estimate by FGLS. Actually, one can omit the factor
σ̂2u /(1 − ρ2 ), since it cancels out in the formula
−1
β̂FGLS = X 0 Σ̂−1 X (X 0Σ̂−1 y).
• One can iterate the process, by taking the first FGLS estimator of β, re-
estimating ρ and σ2u , etc. If one iterates to convergences it’s equivalent to
MLE (supposing normal errors).
• An asymptotically equivalent approach is to simply estimate the transformed
model
yt − ρ̂yt−1 = (xt − ρ̂xt−1 )0 β + ut∗
7.5.4. MA(1). The linear regression model with moving average order 1 errors is
yt = xt0 β + εt
εt = ut + φut−1
ut ∼ iid(0, σ2u )
In this case,
h i
2
V (εt ) = γ0 = E (ut + φut−1 )
= σ2u + φ2 σ2u
= σ2u (1 + φ2 )
Similarly
= φσ2u
and
= 0
7.5. AUTOCORRELATION 137
so in this case
1 + φ2 φ 0 ··· 0
φ 1 + φ2 φ
2
.. ..
Σ = σu 0 φ . .
..
..
. . φ
0 ··· φ 1 + φ2
Note that the first order autocorrelation is
φσ2u γ1
ρ1 = σu (1+φ2 )
2 =
γ0
φ
=
(1 + φ2 )
Again the covariance matrix has a simple structure that depends on only two parame-
ters. The problem in this case is that one can’t estimate φ using OLS on
ε̂t = ut + φut−1
because the ut are unobservable and they can’t be estimated consistently. However,
there is a simple way to estimate the parameters.
1 n 2
σε = σu (1 + φ ) = ∑ ε̂t
c2 2 \ 2
n t=1
7.5. AUTOCORRELATION 138
1 n 2
c2 (1 + b
σ u φ 2
) = ∑ ε̂t
n t=1
This is a consistent estimator, following a LLN (and given that the epsilon
hats are consistent for the epsilons). As above, this can be interpreted as
defining an unidentified estimator:
1 n
c2 =
φ̂σ u ∑ ε̂t ε̂t−1
n t=2
• Now solve these two equations to obtain identified (and therefore consistent)
estimators of both φ and σ2u . Define the consistent estimator
c2 )
Σ̂ = Σ(φ̂, σ u
following the form we’ve seen above, and transform the model using the
Cholesky decomposition. The transformed model satisfies the classical as-
sumptions asymptotically.
When the form of autocorrelation is unknown, one may decide to use the OLS es-
timator, without correction. We’ve seen that this estimator has the limiting distribution
√
d
n β̂ − β → N 0, Q−1
X ΩQ −1
X
where, as before, Ω is
X 0 εε0 X
Ω = lim E
n→∞ n
We need a consistent estimate of Ω. Define mt = xt εt (recall that xt is defined as a
K × 1 vector). Note that
ε1
h i ε2
X 0ε = x1 x2 · · · xn ..
.
εn
n
= ∑ xt εt
t=1
n
= ∑ mt
t=1
so that " ! !#
n n
1
Ω = lim E
n→∞ n
∑ mt ∑ mt0
t=1 t=1
We assume that mt is covariance stationary (so that the covariance between mt and
mt−s does not depend on t).
Define the v − th autocovariance of mt as
Γv = E (mt mt−v
0
).
Γv = E (mt mt−v
0
) 6= 0
Note that this autocovariance does not depend on t, due to covariance station-
arity.
• contemporaneously correlated ( E (mit m jt ) 6= 0 ), since the regressors in xt
will in general be correlated (more on this later).
• and heteroscedastic (E (m2it ) = σ2i , which depends upon i ), again since the
regressors will have different variances.
While one could estimate Ω parametrically, we in general have little information upon
which to base a parametric specification. Recent research has focused on consistent
nonparametric estimators of Ω.
Now define " ! !#
n n
1
Ωn = E
n ∑ mt ∑ mt0
t=1 t=1
We have (show that the following is true, by expanding sum and shifting rows to left)
n−1 n−2 1
Ωn = Γ0 + Γ1 + Γ01 + Γ2 + Γ02 · · · + Γn−1 + Γ0n−1
n n n
1 n
Γbv = ∑ m̂t m̂t−v
n t=v+1
0
.
where
m̂t = xt ε̂t
7.5. AUTOCORRELATION 141
(note: one could put 1/(n − v) instead of 1/n here). So, a natural, but inconsistent,
estimator of Ωn would be
c0 + n − 1 Γ
Ω̂n = Γ c0 + n − 2 Γ
c1 + Γ c0 + · · · + 1 Γd
c2 + Γ n−1 + d
Γ 0
n 1 n 2 n n−1
n−1
c0 + ∑ n − v Γbv + Γb0v .
= Γ
v=1 n
• The assumption that autocorrelations die off is reasonable in many cases. For
example, the AR(1) model with |ρ| < 1 has autocorrelations that die off.
n−v
• The term n can be dropped because it tends to one for v < q(n), given that
q(n) increases slowly relative to n.
• A disadvantage of this estimator is that is may not be positive definite. This
could cause one to calculate a negative χ2 statistic, for example!
• Newey and West proposed and estimator (Econometrica, 1987) that solves
the problem of possible nonpositive definiteness of the above estimator. Their
estimator is
q(n)
c0 + ∑ v b b
Ω̂n = Γ 1− Γv + Γv .
0
v=1 q+1
7.5. AUTOCORRELATION 142
consistently estimate the limiting distribution of the OLS estimator under heteroscedas-
ticity and autocorrelation of unknown form. With this, asymptotically valid tests are
constructed in the usual way.
• The null hypothesis is that the first order autocorrelation of the errors is zero:
H0 : ρ1 = 0. The alternative is of course HA : ρ1 6= 0. Note that the alternative
is not that the errors are AR(1), since many general patterns of autocorrelation
will have the first order autocorrelation different than zero. For this reason the
test is useful for detecting autocorrelation in general. For the same reason, one
shouldn’t just assume that an AR(1) model is appropriate when the DW test
rejects the null.
• Under the null, the middle term tends to zero, and the other two tend to one,
p
so DW → 2.
• Supposing that we had an AR(1) error process with ρ = 1. In this case the
p
middle term tends to −2, so DW → 0
7.5. AUTOCORRELATION 143
• Supposing that we had an AR(1) error process with ρ = −1. In this case the
p
middle term tends to 2, so DW → 4
• These are the extremes: DW always lies between 0 and 4.
• The distribution of the test statistic depends on the matrix of regressors, X ,
so tables can’t give exact critical values. The give upper and lower bounds,
which correspond to the extremes that are possible. See Figure 7.5.2. There
are means of determining exact critical values conditional on X .
• Note that DW can be used to test for nonlinearity (add discussion).
• The DW test is based upon the assumption that the matrix X is fixed in re-
peated samples. This is often unreasonable in the context of economic time
series, which is precisely the context where the test would have application. It
is possible to relate the DW test to other test statistics which are valid without
strict exogeneity.
Breusch-Godfrey test
This test uses an auxiliary regression, as does the White test for heteroscedasticity.
The regression is
and the test statistic is the nR2 statistic, just as in the White test. There are P restric-
tions, so the test statistic is asymptotically distributed as a χ2 (P).
• The intuition is that the lagged errors shouldn’t contribute to explaining the
current error if there is no autocorrelation.
• xt is included as a regressor to account for the fact that the ε̂t are not indepen-
dent even if the εt are. This is a technicality that we won’t go into here.
7.5. AUTOCORRELATION 144
• This test is valid even if the regressors are stochastic and contain lagged de-
pendent variables, so it is considerably more useful than the DW test for typ-
ical time series data.
• The alternative is not that the model is an AR(P), following the argument
above. The alternative is simply that some or all of the first P autocorrelations
are different from zero. This is compatible with many specific forms of auto-
correlation.
7.5.7. Lagged dependent variables and autocorrelation. We’ve seen that the
OLS estimator is consistent under autocorrelation, as long as plim Xnε = 0. This will
0
case of a single lag of the dependent variable with AR(1) errors. The model is
yt = xt0 β + yt−1 γ + εt
εt = ρεt−1 + ut
E (yt−1 εt ) = E (xt−1
0
β + yt−2 γ + εt−1 )(ρεt−1 + ut )
6= 0
X 0ε
plimβ̂ = β + plim
n
the OLS estimator is inconsistent in this case. One needs to estimate by instrumental
variables (IV), which we’ll get to later.
7.5.8. Examples.
Nerlove model, yet again. The Nerlove model uses cross-sectional data, so one
may not think of performing tests for autocorrelation. However, specification error
can induce autocorrelated errors. Consider the simple Nerlove model
lnC = β1 + β2 ln Q + β3 ln PL + β4 ln PF + β5 ln PK + ε
j j
lnC = β1 + β2 ln Q + β3 ln PL + β4 ln PF + β5 ln PK + ε.
7.5. AUTOCORRELATION 146
1.5
0.5
-0.5
-1
0 1 2 3 4 5 6 7 8 9 10
We have seen evidence that the extended model is preferred. So if it is in fact the
proper model, the simple model is misspecified. Let’s check if this misspecification
might induce autocorrelated errors.
The Octave program GLS/NerloveAR.m estimates the simple Nerlove model, and
plots the residuals as a function of ln Q, and it calculates a Breusch-Godfrey test statis-
tic. The residual plot is in Figure 7.6.1 , and the test results are:
Value p-value
Breusch-Godfrey test 34.930 0.000
E XERCISE 7.6. Repeat the autocorrelation tests using the extended Nerlove model
(Equation ??) to see the problem is solved.
7.5. AUTOCORRELATION 147
p g
Ct = α0 + α1 Pt + α2 Pt−1 + α3 (Wt +Wt ) + ε1t
The Octave program GLS/Klein.m estimates this model by OLS, plots the residuals,
and performs the Breusch-Godfrey test, using 1 lag of the residuals. The estimation
and test results are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732
1.5
0.5
-0.5
-1
-1.5
-2
-2.5
0 5 10 15 20 25
*********************************************************
Value p-value
Breusch-Godfrey test 1.539 0.215
and the residual plot is in Figure 7.6.2. The test does not reject the null of nonautocor-
relatetd errors, but we should remember that we have only 21 observations, so power
is likely to be fairly low. The residual plot leads me to suspect that there may be auto-
correlation - there are some significant runs below and above the x-axis. Your opinion
may differ.
Since it seems that there may be autocorrelation, lets’s try an AR(1) correction.
The Octave program GLS/KleinAR1.m estimates the Klein consumption equation as-
suming that the errors follow the AR(1) pattern. The results, with the Breusch-Godfrey
test for remaining autocorrelation are:
7.5. AUTOCORRELATION 149
*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171
*********************************************************
Value p-value
Breusch-Godfrey test 2.129 0.345
• The test is farther away from the rejection region than before, and the residual
plot is a bit more favorable for the hypothesis of nonautocorrelated residuals,
IMHO. For this reason, it seems that the AR(1) correction might have im-
proved the estimation.
• Nevertheless, there has not been much of an effect on the estimated coeffi-
cients nor on their estimated standard errors. This is probably because the
estimated AR(1) coefficient is not very large (around 0.2)
EXERCISES 150
Exercises
EXERCISES 151
(1) Comparing the variances of the OLS and GLS estimators, I claimed that the fol-
lowing holds:
(2)
0
Var(β̂) −Var(β̂GLS ) = AΣA
(4) The limiting distribution of the OLS estimator with heteroscedasticity of unknown
form is
√
d
n β̂ − β → N 0, QX ΩQX ,
−1 −1
where
X 0 εε0 X
lim E =Ω
n→∞ n
Explain why
n
b = 1 ∑ x0 xt ε̂2
Ω
n t=1 t t
is a consistent estimator of this matrix.
(5) Define the v − th autocovariance of a covariance stationary process mt , where
E(mt = 0) as
Γv = E (mt mt−v
0
).
0 ) = Γ0 .
Show that E (mt mt+v v
j j
lnC = β1 + β2 ln Q + β3 ln PL + β4 ln PF + β5 ln PK + ε
EXERCISES 152
Exercises
(a) Calculate the FGLS estimator and interpret the estimation results.
(b) Test the transformed model to check whether it appears to satisfy homoscedas-
ticity.
CHAPTER 8
Stochastic regressors
The model we’ll deal will involve a combination of the following assumptions
Linearity: the model is a linear function of the parameter vector β0 :
yt = xt0 β0 + εt ,
or in matrix form,
y = X β0 + ε,
0
where y is n × 1, X = x1 x2 · · · xn , where xt is K × 1, and β0 and ε are con-
formable.
Stochastic, linearly independent regressors
153
8.1. CASE 1 154
In both cases, xt0 β is the conditional mean of yt given xt : E(yt |xt ) = xt0 β
8.1. Case 1
= β0
β̂|X ∼ N β, (X 0X )−1 σ20
8.2. CASE 2 155
8.2. Case 2
β̂ = β0 + (X 0 X )−1 X 0 ε
0 −1 0
XX Xε
= β0 +
n n
8.2. CASE 2 156
Now
−1
X 0X p
→ Q−1
X
n
by assumption, and
X 0 ε n−1/2 X 0 ε p
= √ →0
n n
since the numerator converges to a N(0, QX σ2 ) r.v. and the denominator still goes
to infinity. We have unbiasedness and the variance disappearing, so, the estimator is
consistent:
p
β̂ → β0 .
so
√
d 2
n β̂ − β0 → N(0, Q−1
X σ0 )
directly following the assumptions. Asymptotic normality of the estimator still holds.
Since the asymptotic results on all test statistics only require this, all the previous
asymptotic results on test statistics are also valid in this case.
(5) Tests are asymptotically valid, but are not valid in small samples.
8.3. Case 3
where now xt contains lagged dependent variables. Clearly, even with E(εt |xt ) = 0, X
and ε are not uncorrelated, so one can’t show unbiasedness. For example,
E (εt−1 xt ) 6= 0
• This fact implies that all of the small sample properties such as unbiasedness,
Gauss-Markov theorem, and small sample validity of test statistics do not
hold in this case. Recall Figure 3.7.2. This is a case of weakly exogenous
regressors, and we see that the OLS estimator is biased in this case.
• Nevertheless, under the above assumptions, all asymptotic properties con-
tinue to hold, using the same arguments as before.
The most complicated case is that of dynamic models, since the other cases can be
treated as nested in this case. There exist a number of central limit theorems for de-
pendent processes, many of which are fairly technical. We won’t enter into details
(see Hamilton, Chapter 7 if you’re interested). A main requirement for use of standard
asymptotics for a dependent sequence
1 n
{st } = { ∑ zt }
n t=1
not depend on t.
• Covariance (weak) stationarity requires that the first and second moments of
this set not depend on t.
• An example of a sequence that doesn’t satisfy this is an AR(1) process with a
unit root (a random walk):
xt = xt−1 + εt
εt ∼ IIN(0, σ2)
One can show that the variance of xt depends upon t in this case.
Stationarity prevents the process from trending off to plus or minus infinity, and pre-
vents cyclical behavior which would allow correlations between far removed zt znd zs
to be high. Draw a picture here.
The AR(1) model with unit root is an example of a case where the dependence
is too strong for standard asymptotics to apply.
• The econometrics of nonstationary processes has been an active area of re-
search in the last two decades. The standard asymptotics don’t apply in this
case. This isn’t in the scope of this course.
EXERCISES 160
Exercises
(1) Show that for two random variables A and B, if E(A|B) = 0, then E (A f (B)) = 0.
How is this used in the Gauss-Markov theorem?
(2) If it possible for an AR(1) model for time series data, e.g., yt = 0 + 0.9yt−1 + εt
satisfy weak exogeneity? Strong exogeneity? Discuss.
CHAPTER 9
Data problems
In this section well consider problems associated with the regressor matrix: collinear-
ity, missing observation and measurement error.
9.1. Collinearity
where xi is the ith column of the regressor matrix X , and v is an n × 1 vector. In the
case that there exists collinearity, the variation in v is relatively small, so that there is
an approximately exact linear relation between the regressors.
In the extreme, if there are exact linear relationships (every element of v equal) then
ρ(X ) < K, so ρ(X 0X ) < K, so X 0 X is not invertible and the OLS estimator is not
uniquely defined. For example, if the model is
yt = β1 + β2 x2t + β3 x3t + εt
x2t = α1 + α2 x3t
161
9.1. COLLINEARITY 162
= β1 + β2 α1 + β2 α2 x3t + β3 x3t + εt
= γ1 + γ2 x3t + εt
• The γ0 s can be consistently estimated, but since the γ0 s define two equations in
three β0 s, the β0 s can’t be consistently estimated (there are multiple values of β
that solve the fonc). The β0 s are unidentified in the case of perfect collinearity.
• Perfect collinearity is unusual, except in the case of an error in construction
of the regressor matrix, such as including the same regressor twice.
Another case where perfect collinearity may be encountered is with models with dummy
variables, if one is not careful. Consider a model of rental price (y i ) of an apartment.
This could depend factors such as size, quality etc., collected in x i , as well as on the
location of the apartment. Let Bi = 1 if the ith apartment is in Barcelona, Bi = 0 other-
wise. Similarly, define Gi , Ti and Li for Girona, Tarragona and Lleida. One could use
a model such as
yi = β1 + β2 Bi + β3 Gi + β4 Ti + β5 Li + x0i γ + εi
60
55
50
45
40
6 35
30
25
4 20
15
-2
-4
-6
-6 -4 -2 0 2 4 6
9.1.2. Back to collinearity. The more common case, if one doesn’t make mis-
takes such as these, is the existence of inexact linear relationships, i.e., correlations
between the regressors that are less than one in absolute value, but not zero. The basic
problem is that when two (or more) variables move together, it is difficult to deter-
mine their separate influences. This is reflected in imprecise estimates, i.e., estimates
with high variances. With economic data, collinearity is commonly encountered, and
is often a severe problem.
When there is collinearity, the minimizing point of the objective function that de-
fines the OLS estimator (s(β), the sum of squared errors) is relatively poorly defined.
This is seen in Figures 9.1.1 and 9.1.2.
To see the effect of collinearity on variances, partition the regressor matrix as
h i
X= x W
9.1. COLLINEARITY 164
100
90
80
70
60
6 50
40
30
4 20
-2
-4
-6
-6 -4 -2 0 2 4 6
where x is the first column of X (note: we can interchange the columns of X isf we like,
so there’s no loss of generality in considering the first column). Now, the variance of
β̂, under the classical assumptions, is
−1
V (β̂) = X 0 X σ2
−1 −1
X 0X 1,1
= x0 x − x0W (W 0W )−1W 0 x
01
−1
0 0 0
= x In −W (W W ) W x
−1
= ESSx|W
9.1. COLLINEARITY 165
where by ESSx|W we mean the error sum of squares obtained from the regression
x = W λ + v.
Since
R2 = 1 − ESS/T SS,
we have
ESS = T SS(1 − R2 )
σ2
V (β̂x ) =
T SSx (1 − R2x|W )
We see three factors influence the variance of this coefficient. It will be high if
(1) σ2 is large
(2) There is little variation in x. Draw a picture here.
(3) There is a strong linear relationship between x and the other regressors, so
that W can explain the movement in x well. In this case, R2x|W will be close to
1. As R2x|W → 1,V (β̂x ) → ∞.
9.1.3. Detection of collinearity. The best way is simply to regress each explana-
tory variable in turn on the remaining regressors. If any of these auxiliary regressions
has a high R2 , there is a problem of collinearity. Furthermore, this procedure identifies
which parameters are affected.
where R and r are as in the case of exact linear restrictions, but v is a random vector.
For example, the model could be
y = Xβ + ε
Rβ = r + v
ε 0 σ2ε In 0n×q
∼ N ,
v 0 0q×n σ2v Iq
This sort of model isn’t in line with the classical interpretation of parameters as con-
stants: according to this interpretation the left hand side of Rβ = r + v is constant
but the right is random. This model does fit the Bayesian perspective: we combine
information coming from the model and the data, summarized in
y = Xβ + ε
ε ∼ N(0, σ2ε In )
Rβ ∼ N(r, σ2v Iq )
Since the sample is random it is reasonable to suppose that E (εv0 ) = 0, which is the
last piece of information in the specification. How can you estimate using this model?
The solution is to treat the restrictions as artificial data. Write
y X ε
= β+
r R v
This model is heteroscedastic, since σ2ε 6= σ2v . Define the prior precision k = σε /σv .
This expresses the degree of belief in the restriction relative to the variability of the
9.1. COLLINEARITY 168
is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It
is consistent, however, given that k is a fixed constant, even if the restriction is false
(this is in contrast to the case of false exact restrictions). To see this, note that there
are Q restrictions, where Q is the number of rows of R. As n → ∞, these Q artificial
observations have no weight in the objective function, so the estimator has the same
limiting objective function as the OLS estimator, and is therefore consistent.
To motivate the use of stochastic restrictions, consider the expectation of the squared
length of β̂:
0
−1 −1
E (β̂ β̂) = E
0
β+ X X 0
Xε 0
β+ X X 0
Xε
0
= β0 β + E ε0 X (X 0X )−1 (X 0 X )−1 X 0 ε
−1
= β0 β + Tr X 0 X σ2
K
= β β+σ
0 2
∑ λi(the trace is the sum of eigenvalues)
i=1
so
σ2
E (β̂0 β̂) > β0 β +
λmin(X 0 X)
where λmin(X 0 X) is the minimum eigenvalue of X 0 X (which is the inverse of the maxi-
mum eigenvalue of (X 0 X )−1 ). As collinearity becomes worse and worse, X 0 X becomes
more nearly singular, so λmin(X 0 X) tends to zero (recall that the determinant is the prod-
uct of the eigenvalues) and E (β̂0 β̂) tends to infinite. On the other hand, β0 β is finite.
9.1. COLLINEARITY 169
Now considering the restriction IK β = 0 + v. With this restriction the model be-
comes
y X ε
= β+
0 kIK kv
and the estimator is
−1
h i X h i y
β̂ridge = X 0 kIK X 0 IK
kIK 0
−1
= X 0 X + k2 IK X 0y
This is the ordinary ridge regression estimator. The ridge regression estimator can be
seen to add k2 IK , which is nonsingular, to X 0 X , which is more and more nearly singular
as collinearity becomes worse and worse. As k → ∞, the restrictions tend to β = 0,
that is, the coefficients are shrunken toward zero. Also, the estimator tends to
−1 −1 X 0y
β̂ridge = X 0 X + k2 IK X 0 y → k2 IK X 0y = →0
k2
so β̂0ridge β̂ridge → 0. This is clearly a false restriction in the limit, if our original model
is at al sensible.
There should be some amount of shrinkage that is in fact a true restriction. The
problem is to determine the k such that the restriction is correct. The interest in
ridge regression centers on the fact that it can be shown that there exists a k such
that MSE(β̂ridge ) < β̂OLS . The problem is that this k depends on β and σ2 , which are
unknown.
The ridge trace method plots β̂0ridge β̂ridge as a function of k, and chooses the value
of k that “artistically” seems appropriate (e.g., where the effect of increasing k dies
off). Draw picture here. This means of choosing k is obviously subjective. This is not
9.2. MEASUREMENT ERROR 170
a problem from the Bayesian perspective: the choice of k reflects prior beliefs about
the length of β.
In summary, the ridge estimator offers some hope, but it is impossible to guarantee
that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics,
and there is no clear solution to the problem.
Measurement error is exactly what it says, either the dependent variable or the re-
gressors are measured with error. Thinking about the way economic data are reported,
measurement error is probably quite prevalent. For example, estimates of growth of
GDP, inflation, etc. are commonly revised several times. Why should the last revision
necessarily be correct?
y∗ = X β + ε
y = y∗ + v
vt ∼ iid(0, σ2v )
y + v = Xβ + ε
9.2. MEASUREMENT ERROR 171
so
y = Xβ + ε − v
= Xβ + ω
yt = xt∗0 β + εt
xt = xt∗ + vt
vt ∼ iid(0, Σv)
yt = (xt − vt )0 β + εt
= xt0 β − vt0 β + εt
= xt0 β + ωt
9.2. MEASUREMENT ERROR 172
E (xt ωt ) = E (xt∗ + vt ) −vt0 β + εt
= −Σv β
where
Σv = E vt vt0 .
Because of this correlation, the OLS estimator is biased and inconsistent, just as in
the case of autocorrelated errors with lagged dependent variables. In matrix notation,
write the estimated model as
y = Xβ + ω
We have that
−1
X 0X X 0y
β̂ =
n n
and
−1
X 0X (X ∗0 +V 0 ) (X ∗ +V )
plim = plim
n n
= (QX ∗ + Σv )−1
V 0V 1 n
plim
n
= lim E ∑ vt vt0
n t=1
= Σv
Likewise,
X 0y (X ∗0 +V 0 ) (X ∗ β + ε)
plim = plim
n n
= QX ∗ β
9.3. MISSING OBSERVATIONS 173
so
plimβ̂ = (QX ∗ + Σv )−1 QX ∗ β
So we see that the least squares estimator is inconsistent when the regressors are mea-
sured with error.
Missing observations occur quite frequently: time series data may not be gath-
ered in a certain year, or respondents to a survey may not answer all questions. We’ll
consider two cases: missing observations on the dependent variable and missing ob-
servations on the regressors.
y = Xβ + ε
or
y1 X1 ε1
= β+
y2 X2 ε2
where y2 is not observed. Otherwise, we assume the classical assumptions hold.
y1 = X1 β + ε1
Since these observations satisfy the classical assumptions, one could estimate
by OLS.
9.3. MISSING OBSERVATIONS 174
• The question remains whether or not one could somehow replace the unob-
served y2 by a predictor, and improve over OLS in some sense. Let ŷ2 be the
predictor of y2 . Now
0 −1 0
X1
X X y
β̂ = 1 1 1
X2 X2
X2 ŷ2
0 −1 0
= X1 X1 + X20 X2 X1 y1 + X20 ŷ2
Likewise, an OLS regression using only the second (filled in) observations would give
Substituting these into the equation for the overall combined estimator gives
−1 h i
β̂ = X10 X1 + X20 X2 X10 X1 β̂1 + X20 X2 β̂2
−1 −1 0
= X10 X1 + X20 X2 X10 X1 β̂1 + X10 X1 + X20 X2 X2 X2 β̂2
where
−1 0
A ≡ X10 X1 + X20 X2 X1 X1
9.3. MISSING OBSERVATIONS 175
and we use
0 −1 0 −1 0
X1 X1 + X20 X2 X2 X2 = X10 X1 + X20 X2 X1 X1 + X20 X2 − X10 X1
−1 0
= IK − X10 X1 + X20 X2 X1 X1
= IK − A.
Now,
E (β̂) = Aβ + (IK − A)E β̂2
and this will be unbiased only if E β̂2 = β.
• The conclusion is the this filled in observations alone would need to define an
unbiased estimator. This will be the case only if
ŷ2 = X2 β + ε̂2
where ε̂2 has mean zero. Clearly, it is difficult to satisfy this condition without
knowledge of β.
• Note that putting ŷ2 = ȳ1 does not satisfy the condition and therefore leads to
a biased estimator.
• One possibility that has been suggested (see Greene, page 275) is to estimate
β using a first round estimation using only the complete observations
ŷ2 = X2 β̂1
Now, the overall estimate is a weighted average of β̂1 and β̂2 , just as above,
but we have
= β̂1
This shows that this suggestion is completely empty of content: the final esti-
mator is the same as the OLS estimator using only the complete observations.
9.3.2. The sample selection problem. In the above discussion we assumed that
the missing observations are random. The sample selection problem is a case where
the missing observations are not random. Consider the model
yt∗ = xt0 β + εt
which is assumed to satisfy the classical assumptions. However, yt∗ is not always
observed. What is observed is yt defined as
yt = yt∗ if yt∗ ≥ 0
15
10
-5
-10
0 2 4 6 8 10
The difference in this case is that the missing values are not random: they are
correlated with the xt . Consider the case
y∗ = x + ε
with V (ε) = 25, but using only the observations for which y∗ > 0 to estimate. Figure
9.3.1 illustrates the bias. The Octave program is sampsel.m
but we assume now that each row of X2 has an unobserved component(s). Again,
one could just estimate using the complete observations, but it may seem frustrating
to have to drop observations simply because of a single missing variable. In general,
9.3. MISSING OBSERVATIONS 178
if the unobserved X2 is replaced by some prediction, X2∗ , then we are in the case of
errors of observation. As before, this means that the OLS estimator is biased when
X2∗ is used instead of X2 . Consistency is salvaged, however, as long as the number of
missing observations doesn’t increase with n.
Exercises
(1) Consider the Nerlove model
j j
lnC = β1 + β2 ln Q + β3 ln PL + β4 ln PF + β5 ln PK + ε
When this model is estimated by OLS, some coefficients are not significant. This
may be due to collinearity.
Exercises
(a) Calculate the correlation matrix of the regressors.
(b) Perform artificial regressions to see if collinearity is a problem.
(c) Apply the ridge regression estimator.
Exercises
(i) Plot the ridge trace diagram
(ii) Check what happens as k goes to zero, and as k becomes very large.
CHAPTER 10
β β
c = Aw1 1 w2 2 qβq eε
ln c = β0 + β1 ln w1 + β2 ln w2 + βq ln q + ε
where β0 = ln A. Theory suggests that A > 0, β1 > 0, β2 > 0, β3 > 0. This model isn’t
compatible with a fixed cost of production since c = 0 when q = 0. Homogeneity of
degree one in input prices suggests that β1 + β2 = 1, while constant returns to scale
implies βq = 1.
While this model may be reasonable in some cases, an alternative
√ √ √ √
c = β 0 + β 1 w1 + β 2 w2 + β q q + ε
√
may be just as plausible. Note that x and ln(x) look quite alike, for certain values
of the regressors, and up to a linear transformation, so it may be difficult to choose
between these models.
180
10.1. FLEXIBLE FUNCTIONAL FORMS 181
The basic point is that many functional forms are compatible with the linear-in-
parameters model, since this model can incorporate a wide variety of nonlinear trans-
formations of the dependent variable and the regressors. For example, suppose that
g(·) is a real valued function and that x(·) is a K− vector-valued function. The follow-
ing model is linear in the parameters but nonlinear in the variables:
xt = x(zt )
yt = xt0 β + εt
Given that the functional form of the relationship between the dependent variable
and the regressors is in general unknown, one might wonder if there exist parametric
models that can closely approximate a wide variety of functional relationships. A
“Diewert-Flexible” functional form is defined as one such that the function, the vector
of first derivatives and the matrix of second derivatives can take on an arbitrary value
at a single data point. Flexibility in this sense clearly requires that there be at least
K = 1 + P + P2 − P /2 + P
free parameters: one for each independent effect that we wish to model.
Suppose that the model is
y = g(x) + ε
10.1. FLEXIBLE FUNCTIONAL FORMS 182
A second-order Taylor’s series expansion (with remainder term) of the function g(x)
about the point x = 0 is
x0 D2x g(0)x
g(x) = g(0) + x0 Dx g(0) + +R
2
Use the approximation, which simply drops the remainder term, as an approximation
to g(x) :
x0 D2x g(0)x
g(x) ' gK (x) = g(0) + x0 Dx g(0) +
2
As x → 0, the approximation becomes more and more exact, in the sense that g K (x) →
g(x), Dx gK (x) → Dx g(x) and D2x gK (x) → D2x g(x). For x = 0, the approximation is
exact, up to the second order. The idea behind many flexible functional forms is to
note that g(0), Dx g(0) and D2x g(0) are all constants. If we treat them as parameters, the
approximation will have exactly enough free parameters to approximate the function
g(x), which is of unknown form, exactly, up to second order, at the point x = 0. The
model is
gK (x) = α + x0 β + 1/2x0 Γx
y = α + x0 β + 1/2x0 Γx + ε
10.1.1. The translog form. In spite of the fact that FFF's aren't really flexible for
the purposes of econometric estimation and inference, they are useful, and they are
certainly subject to less bias due to misspecification of the functional form than are
many popular forms, such as the Cobb-Douglas or the simple linear in the variables
model. The translog model is probably the most widely used FFF. This model is as
above, except that the variables are subjected to a logarithmic transformation. Also, the
expansion point is usually taken to be the sample mean of the data, after the logarithmic
transformation. The model is defined by
y = ln(c)
x = ln(z/z̄) = ln(z) − ln(z̄)
y = α + x'β + 1/2 x'Γx + ε
In this model,
∂y/∂x = β + Γx
and, since the other part of x is constant,
∂y/∂x = ∂ ln(c)/∂ ln(z) = (∂c/∂z)(z/c)
which is the elasticity of c with respect to z. This is a convenient feature of the translog
model. Note that at the means of the conditioning variables, z̄, x = 0, so
∂y/∂x |_{z=z̄} = β
Consider an application to a cost function,
y = c(w, q)
where w is a vector of input prices and q is output. We could add other variables by
extending q in the obvious manner, but this is suppressed for simplicity. By Shephard's
lemma, the conditional factor demands are
x = ∂c(w, q)/∂w
so the cost shares of the factors are
s = wx/c = (∂c(w, q)/∂w)(w/c)
which is simply the vector of elasticities of cost with respect to input prices. If the cost
function is modeled using a translog function, we have
ln(c) = α + x'β + z'δ + 1/2 [ x'  z' ] [ Γ11  Γ12 ; Γ12'  Γ22 ] [ x ; z ]
Differentiating with respect to x, the share equations are s = β + Γ11 x + Γ12 z.
Therefore, the share equations and the cost equation have parameters in common. By
pooling the equations together and imposing the (true) restriction that the parameters
of the equations be the same, we can gain efficiency.
To illustrate in more detail, consider the case of two inputs, so
x = [ x1 ; x2 ].
The two cost shares of the inputs are the derivatives of ln c with respect to x1 and x2:
s1 = β1 + γ11 x1 + γ12 x2 + γ13 z
s2 = β2 + γ12 x1 + γ22 x2 + γ23 z
Note that the share equations and the cost equation have parameters in common.
One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations and therefore more
information, which will lead to improved efficiency. Note that this does assume that
the cost equation is correctly specified (i.e., not an approximation), since otherwise the
derivatives would not be the true derivatives of the log cost function, and would then
be misspecified for the shares. To pool the equations, write the model in matrix form
(adding in error terms):
[ ln c ]   [ 1  x1  x2  z  x1²/2  x2²/2  z²/2  x1x2  x1z  x2z ]       [ ε1 ]
[  s1  ] = [ 0   1   0  0   x1      0     0      x2    z    0  ]  θ + [ ε2 ]
[  s2  ]   [ 0   0   1  0    0     x2     0      x1    0    z  ]       [ ε3 ]
where θ = ( α, β1, β2, δ, γ11, γ22, γ33, γ12, γ13, γ23 )'.
This is one observation on the three equations. With the appropriate notation, a
single observation can be written as
yt = Xt θ + εt
The overall model would stack n observations on the three equations for a total of 3n
observations:
[ y1 ; y2 ; … ; yn ] = [ X1 ; X2 ; … ; Xn ] θ + [ ε1 ; ε2 ; … ; εn ]
Next we need to consider the errors. For observation t the errors can be placed in a
vector
εt = [ ε1t ; ε2t ; ε3t ]
First consider the covariance matrix of this vector: the shares are certainly corre-
lated since they must sum to one. (In fact, with 2 shares the variances are equal and
the covariance is -1 times the variance. General notation is used to allow easy exten-
sion to the case of more than 2 inputs). Also, it’s likely that the shares and the cost
equation have different variances. Supposing that the model is covariance stationary,
the variance of εt won't depend upon t:
Var εt = Σ0 = [ σ11  σ12  σ13 ; ·  σ22  σ23 ; ·  ·  σ33 ]
Note that this matrix is singular, since the shares sum to 1. Assuming that there is no
autocorrelation, the overall covariance matrix has the seemingly unrelated regressions
(SUR) structure:
Var [ ε1 ; ε2 ; … ; εn ] = Σ = [ Σ0  0  ⋯  0 ; 0  Σ0  ⋯  0 ; ⋮  ⋱  ⋮ ; 0  ⋯  0  Σ0 ] = In ⊗ Σ0
where the symbol ⊗ indicates the Kronecker product. The Kronecker product of two
matrices A and B is
A ⊗ B = [ a11 B  a12 B  ⋯  a1q B ; a21 B  a22 B  ⋯  a2q B ; ⋮  ⋱  ⋮ ; a_p1 B  ⋯  a_pq B ]
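To fix ideas, the Kronecker product is available in Octave as the built-in function kron. A minimal illustration (the matrices used here are arbitrary examples) is:

% illustration of the Kronecker product (arbitrary example matrices)
A = [1 2; 3 4];
B = eye(2);
kron(A, B)              % each element a_ij of A is replaced by the block a_ij * B
% e.g., with Sigma0 a 3x3 covariance and n observations, the overall
% covariance In (x) Sigma0 above could be formed as kron(eye(n), Sigma0)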
10.1.2. FGLS estimation of a translog model. So, this model has heteroscedasticity (the shares and the cost equation have different variances) and correlation of the
errors across equations, so OLS won't be efficient. The next question is: how do
we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error
covariance Σ̂. So we need to estimate Σ.
An asymptotically efficient procedure is (supposing normality of the errors) to estimate the error covariance consistently, drop one share equation so that the covariance estimate Σ̂*0 of the remaining equations is nonsingular, and transform the stacked model
y* = X* θ + ε*,   Σ̂* = In ⊗ Σ̂*0,
using
P̂0 = Chol(Σ̂*0)^{-1}
Finally, the FGLS estimator can be calculated by applying OLS to the transformed model
P̂0 y* = P̂0 X* θ + P̂0 ε*
β1 + β2 = 1
Σ_{i=1}^{3} γij = 0,  j = 1, 2, 3.
These are linear parameter restrictions, so they are easy to impose and will
improve efficiency if they are true.
(3) The estimation procedure outlined above can be iterated. That is, estimate
θ̂FGLS as above, then re-estimate Σ*0 using the errors calculated as
ε̂ = y − X θ̂FGLS
and re-estimate θ with the new covariance estimate. If this is repeated until the estimates don't change (i.e., iterated to convergence), the resulting estimator is the MLE. At any rate, the asymptotic properties of
the iterated and uniterated estimators are the same, since both are based upon
a consistent estimator of the error covariance.
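To make the mechanics concrete, here is a minimal Octave sketch of feasible GLS for a pooled system of this type. It assumes the data are already arranged into the stacked y and X of the pooled model, with the G equations for observation t in adjacent rows, and that a redundant share equation has already been dropped so the estimated covariance is nonsingular. The variable names are illustrative, and this is a sketch of the general idea rather than a transcription of the exact procedure above.

% feasible GLS for a pooled system: G equations, n observations
% y is (n*G) x 1 and X is (n*G) x k, with obs. t in rows (t-1)*G+1 : t*G
theta_ols = X \ y;                   % consistent but inefficient first round
e = reshape(y - X*theta_ols, G, n);  % G x n matrix of residuals
Sigma0 = (e * e') / n;               % estimate of the G x G error covariance
P = chol(inv(Sigma0));               % P'P = inv(Sigma0), so P*Sigma0*P' = I
ys = zeros(size(y));  Xs = zeros(size(X));
for t = 1:n
  idx = (t-1)*G + 1 : t*G;           % rows belonging to observation t
  ys(idx)    = P * y(idx);
  Xs(idx, :) = P * X(idx, :);
end
theta_fgls = Xs \ ys;                % FGLS: OLS applied to the transformed model
% to iterate: recompute e using theta_fgls, re-estimate Sigma0, and repeat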
10.2. Testing nonnested hypotheses
Given that the choice of functional form isn't perfectly clear, in that many possibilities exist, how can one choose between forms? When one form is a parametric
restriction of another, the previously studied tests such as Wald, LR, score or qF are
all possibilities. For example, the Cobb-Douglas model is a parametric restriction of
the translog: the translog is
yt = α + xt'β + 1/2 xt'Γxt + εt
while the Cobb-Douglas is
yt = α + xt'β + εt
so the Cobb-Douglas corresponds to the restriction Γ = 0.
When the models are not parametric restrictions of one another, consider two competing linear models
M1 : y = Xβ + ε,  ε ∼ iid(0, σ²ε)
M2 : y = Zγ + η,  η ∼ iid(0, σ²η)
• One could account for non-iid errors, but we’ll suppress this for simplicity.
• There are a number of ways to proceed. We’ll consider the J test, proposed
by Davidson and MacKinnon, Econometrica (1981). The idea is to artificially
nest the two models, e.g.,
y = (1 − α)X β + α(Zγ) + ω
If the first model is correctly specified, then the true value of α is zero. On
the other hand, if the second model is correctly specified then α = 1.
– The problem is that this model is not identified in general. For example,
if the models share some regressors, as in
M1 : yt = β1 + β2 x2t + β3 x3t + εt
M2 : yt = γ1 + γ2 x2t + γ3 x4t + ηt
yt = (1 − α)β1 + (1 − α)β2 x2t + (1 − α)β3 x3t + αγ1 + αγ2 x2t + αγ3 x4t + ωt
   = ((1 − α)β1 + αγ1) + ((1 − α)β2 + αγ2) x2t + (1 − α)β3 x3t + αγ3 x4t + ωt
   ≡ δ1 + δ2 x2t + δ3 x3t + δ4 x4t + ωt
The four δ's are consistently estimable, but α is not, since we have four equations in seven
unknowns, so one can't test the hypothesis that α = 0.
The idea of the J test is to substitute γ̂, the OLS estimate from M2, in place of γ. Then estimate by OLS the artificial regression
y = (1 − α)Xβ + α(Zγ̂) + ω = Xθ + αŷ + ω
where ŷ = Zγ̂, and test the hypothesis α = 0 using the ordinary t statistic
t = α̂ / σ̂α̂ ∼a N(0, 1)
(a simple Octave sketch of the procedure appears at the end of this section).
p
• If the second model is correctly specified, then t → ∞, since α̂ tends in prob-
ability to 1, while it’s estimated standard error tends to zero. Thus the test
will always reject the false null model, asymptotically, since the statistic will
eventually exceed any critical value with probability one.
• We can reverse the roles of the models, testing the second against the first.
• It may be the case that neither model is correctly specified. In this case,
the test will still reject the null hypothesis, asymptotically, if we use critical
values from the N(0, 1) distribution, since as long as α̂ tends to something
p
different from zero, |t| → ∞. Of course, when we switch the roles of the
models the other will also be rejected asymptotically.
• In summary, there are 4 possible outcomes when we test two models, each
against the other. Both may be rejected, neither may be rejected, or one of the
two may be rejected.
• There are other tests available for non-nested models. The J test is simple
to apply when both models are linear in the parameters. The P-test is similar,
but easier to apply when M1 is nonlinear.
• The above presentation assumes that the same transformation of the dependent variable is used by both models. MacKinnon, White and Davidson,
Journal of Econometrics, (1983) show how to deal with the case of different
transformations.
• Monte-Carlo evidence shows that these tests often over-reject a correctly
specified model. One can use bootstrap critical values to get better-performing
tests.
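Here is the Octave sketch of the J test referred to above. It assumes the data matrices X and Z (each including a constant) and the vector y are already in memory; the variable names are illustrative.

% J test of M1 (y = X*beta + eps) against M2 (y = Z*gamma + eta)
gamma_hat = Z \ y;           % OLS estimate of M2
yhat2     = Z * gamma_hat;   % fitted values from M2
XA        = [X yhat2];       % augment M1 with the M2 fit
b         = XA \ y;          % estimate y = X*theta + alpha*yhat2 + omega
e         = y - XA * b;
n = rows(y);  k = columns(XA);
sig2      = (e' * e) / (n - k);
V         = sig2 * inv(XA' * XA);
t_alpha   = b(end) / sqrt(V(end, end));   % t statistic for alpha = 0
% compare |t_alpha| to a N(0,1) critical value; then reverse the roles of the models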
CHAPTER 11
Several times we've encountered cases where correlation between regressors and
the error term leads to bias and inconsistency of the OLS estimator. Cases include autocorrelation with lagged dependent variables and measurement error in the
regressors. Another important case is that of simultaneous equations. The cause is
different, but the effect is the same.
In the classical model
y = Xβ + ε
we can, for purposes of estimation, treat X as fixed. This means that when estimating β we condition on X. When analyzing dynamic models, we're not interested in
conditioning on X, as we saw in the section on stochastic regressors. Nevertheless, the
OLS estimator obtained by treating X as fixed continues to have desirable asymptotic
properties even in that case.
11.1. Simultaneous equations
Consider a simple supply/demand system:
Demand:  qt = α1 + α2 pt + α3 yt + ε1t
Supply:   qt = β1 + β2 pt + ε2t
E( [ ε1t ; ε2t ] [ ε1t  ε2t ] ) = [ σ11  σ12 ; ·  σ22 ] ≡ Σ, ∀t
The presumption is that qt and pt are jointly determined at the same time by the intersection of these equations. We'll assume that yt is determined by some unrelated
process. It's easy to see that we have correlation between regressors and errors. Solving for pt:
α1 + α2 pt + α3 yt + ε1t = β1 + β2 pt + ε2t
β2 pt − α2 pt = α1 − β1 + α3 yt + ε1t − ε2t
pt = (α1 − β1)/(β2 − α2) + α3/(β2 − α2) yt + (ε1t − ε2t)/(β2 − α2)
so that, for example, E(pt ε1t) = (σ11 − σ12)/(β2 − α2), which is not zero in general.
Because of this correlation, OLS estimation of the demand equation will be biased and
inconsistent. The same applies to the supply equation, for the same reason.
In this model, qt and pt are the endogenous variables (endogs), that are determined
within the system. yt is an exogenous variable (exogs). These concepts are a bit tricky,
and we'll return to it in a minute. First, some notation. Suppose we group together
current endogs in the vector Yt. If there are G endogs, Yt is G × 1. Group current and
lagged exogs, as well as lagged endogs, in the vector Xt, which is K × 1. Stack the errors
of the G equations into the error vector Et. The model, with additional assumptions, can
be written as
Y Γ = XB + E
Et ∼ N(0, Σ), ∀t, with the Et independent over time
E(X'E) = 0_(K×G)
vec(E) ∼ N(0, Ψ)
where
Y = [ Y1' ; Y2' ; … ; Yn' ],   X = [ X1' ; X2' ; … ; Xn' ],   E = [ E1' ; E2' ; … ; En' ]
Y is n × G, X is n × K, and E is n × G.
• Since there is no autocorrelation of the Et's, and since the columns of E are
individually homoscedastic, then
Ψ = [ σ11 In  σ12 In  ⋯  σ1G In ; ·  σ22 In  ⋯  σ2G In ; ⋮  ⋱  ⋮ ; ·  ·  ⋯  σGG In ]
  = Σ ⊗ In
11.2. Exogeneity
The model defines a data generating process. The model involves two sets of
variables, Yt and Xt, as well as a parameter vector
θ = [ vec(Γ)'  vec(B)'  vec*(Σ)' ]'
• In general, without additional restrictions, θ is a G² + GK + (G² − G)/2 +
G dimensional vector. This is the parameter vector that we're interested in
estimating.
• In principle, there exists a joint density function for Yt and Xt, which depends
on a parameter vector φ. Write this density as
ft(Yt, Xt | φ, It)
where It is the information set in period t. This includes lagged Yt's and lagged
Xt's of course. This can be factored into the density of Yt conditional on Xt
times the marginal density of Xt:
ft(Yt, Xt | φ, It) = ft(Yt | Xt, φ, It) ft(Xt | φ, It)
This is a general factorization, but it may very well be the case that not all
parameters in φ affect both factors. So use φ1 to indicate elements of φ that
enter into the conditional density and write φ2 for parameters that enter into
the marginal. In general, φ1 and φ2 may share elements, of course. We have
Et ∼ N(0, Σ), ∀t.
Normality and lack of correlation over time imply that the observations are independent of one another, so we can write the log-likelihood function as the sum of likelihood contributions of each observation:
ln L(Y | θ, It) = Σ_{t=1}^{n} ln ft(Yt, Xt | φ, It)
              = Σ_{t=1}^{n} ln ( ft(Yt | Xt, φ1, It) ft(Xt | φ2, It) )
              = Σ_{t=1}^{n} ln ft(Yt | Xt, φ1, It) + Σ_{t=1}^{n} ln ft(Xt | φ2, It)
Weak exogeneity of Xt requires, roughly, that the parameters of interest depend only on φ1 and that φ1 and φ2 be variation free, i.e., that arbitrary combinations of (φ1, φ2) be admissible.
This implies that φ1 and φ2 cannot share elements if Xt is weakly exogenous, since
φ1 would change as φ2 changes, which prevents consideration of arbitrary combinations of (φ1, φ2).
Supposing that Xt is weakly exogenous, then the MLE of φ1 using the joint density
is the same as the MLE using only the conditional density
ln L(Y | X, θ, It) = Σ_{t=1}^{n} ln ft(Yt | Xt, φ1, It)
since the conditional likelihood doesn't depend on φ2. In other words, the joint and
conditional log-likelihoods maximize at the same value of φ1.
Lagged Yt aren't exogenous in the normal usage of the word, since their values are
determined within the model, just earlier on. Weakly exogenous variables
include exogenous (in the normal sense) variables as well as all predetermined
variables.
11.3. Reduced form
Recall that the model, for a single observation, is
Yt'Γ = Xt'B + Et,   V(Et) = Σ
Postmultiplying by Γ^{-1} (assuming it exists) gives
Yt' = Xt'BΓ^{-1} + Et Γ^{-1} = Xt'Π + Vt'
Now only one current period endog appears in each equation. This is the reduced form.
An example is our supply/demand system. The reduced form for quantity is obtained by solving the supply equation for price and substituting into demand:
qt = α1 + α2 (qt − β1 − ε2t)/β2 + α3 yt + ε1t
β2 qt − α2 qt = β2 α1 − α2 (β1 + ε2t) + β2 α3 yt + β2 ε1t
qt = (β2 α1 − α2 β1)/(β2 − α2) + (β2 α3)/(β2 − α2) yt + (β2 ε1t − α2 ε2t)/(β2 − α2)
   = π11 + π21 yt + V1t
Similarly, the rf for price is obtained by equating supply and demand:
β1 + β2 pt + ε2t = α1 + α2 pt + α3 yt + ε1t
β2 pt − α2 pt = α1 − β1 + α3 yt + ε1t − ε2t
pt = (α1 − β1)/(β2 − α2) + α3/(β2 − α2) yt + (ε1t − ε2t)/(β2 − α2)
   = π12 + π22 yt + V2t
The interesting thing about the rf is that the equations individually satisfy the classical
assumptions, since yt is uncorrelated with ε1t and ε2t by assumption, and therefore
E(yt Vit) = 0, i = 1, 2, ∀t. The errors of the rf are
[ V1t ; V2t ] = [ (β2 ε1t − α2 ε2t)/(β2 − α2) ; (ε1t − ε2t)/(β2 − α2) ]
More generally, writing the rf as Yt' = Xt'Π + Vt', we have that
Vt = (Γ^{-1})'Et ∼ N( 0, (Γ^{-1})'ΣΓ^{-1} ), ∀t
and that the Vt are timewise independent (note that this wouldn't be the case if the Et
were autocorrelated).
11.4. IV estimation
The IV estimator may appear a bit unusual at first, but it will grow on you over
time.
Y Γ = XB + E
Considering the first equation (this is without loss of generality, since we can always
reorder the equations) we can partition the Y matrix as
Y = [ y  Y1  Y2 ]
where y is the first endog, Y1 are the other endogs that enter the first equation, and Y2 are the endogs excluded from it. Similarly, partition X as
X = [ X1  X2 ]
where X1 are the exogs included in the first equation and X2 are the excluded exogs.
Assume that Γ has ones on the main diagonal. These are normalization restrictions
that simply scale the remaining coefficients on each equation, and which scale the
variances of the error terms.
Given this scaling and our partitioning, the coefficient matrices can be written as
Γ = [ 1  Γ12 ; −γ1  Γ22 ; 0  Γ32 ]
B = [ β1  B12 ; 0  B22 ]
With this, the first equation can be written as
y = Y1 γ1 + X1 β1 + ε = Zδ + ε
where Z = [ Y1  X1 ] and δ stacks γ1 and β1.
The problem, as we've seen, is that Z is correlated with ε, since Y1 is formed of endogs.
Now, let’s consider the general problem of a linear regression model with correla-
tion between regressors and the error term:
y = Xβ + ε
ε ∼ iid(0, In σ²)
E(X'ε) ≠ 0.
The present case of a structural equation from a system of equations fits into this notation, but so do other problems, such as measurement error or lagged dependent variables with autocorrelated errors. Consider some matrix W which is formed of variables
uncorrelated with ε. This matrix defines a projection matrix
PW = W(W'W)^{-1}W'
so that anything that is projected onto the space spanned by W will be uncorrelated
with ε, by the definition of W. Transforming the model with this projection matrix we
get
PW y = PW Xβ + PW ε
or
y* = X*β + ε*
The transformed regressors are much less problematic, since
E(X*'ε*) = E(X'PW'PW ε) = E(X'PW ε)
and
PW X = W(W'W)^{-1}W'X
is formed of linear combinations of the columns of W, which are uncorrelated with ε by assumption. Applying OLS to the transformed model
will lead to a consistent estimator, given a few more assumptions. This is the generalized instrumental variables estimator. W is known as the matrix of instruments. The
estimator is
β̂IV = (X'PW X)^{-1} X'PW y
so that
β̂IV = (X'PW X)^{-1} X'PW (Xβ + ε) = β + (X'PW X)^{-1} X'PW ε
so
β̂IV − β = (X'PW X)^{-1} X'PW ε
Assuming that each of the terms with an n in the denominator satisfies a LLN, so that
• W'W/n →p QWW, a finite pd matrix
• X'W/n →p QXW, a finite matrix with rank K (= cols(X))
• W'ε/n →p 0
then the plim of the rhs is zero. This last term has plim 0 since we assume that W and
ε are uncorrelated, e.g.,
E(Wt'εt) = 0,
so
β̂IV →p β.
Furthermore, scaling by √n and assuming that a CLT applies to n^{-1/2} W'ε, we get
√n ( β̂IV − β ) →d N( 0, (QXW QWW^{-1} QXW')^{-1} σ² )
The estimators for QXW and QWW are the obvious ones. An estimator for σ² is
σ̂²IV = (1/n) ( y − X β̂IV )' ( y − X β̂IV ).
This estimator is consistent following the proof of consistency of the OLS estimator of
σ², when the classical assumptions hold.
The formula used to estimate the variance of β̂IV is
V̂(β̂IV) = ( X'W (W'W)^{-1} W'X )^{-1} σ̂²IV
The IV estimator is
(1) Consistent
(2) Asymptotically normally distributed
(3) Biased in general, since even though E(X'PW ε) = 0, E[ (X'PW X)^{-1} X'PW ε ]
may not be zero, since (X'PW X)^{-1} and X'PW ε are not independent.
An important point is that the asymptotic distribution of β̂IV depends upon QXW and
QWW, and these depend upon the choice of W. The choice of instruments influences
the efficiency of the estimator.
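A minimal Octave sketch of the generalized IV estimator and its estimated covariance, under the assumption that y, X and an instrument matrix W are already in memory (the variable names are illustrative):

% generalized instrumental variables estimator
PW      = W * inv(W' * W) * W';          % projection onto the space spanned by W
beta_iv = (X' * PW * X) \ (X' * PW * y);
e       = y - X * beta_iv;
n       = rows(y);
sig2_iv = (e' * e) / n;                  % estimator of the error variance
V_iv    = inv(X' * W * inv(W' * W) * W' * X) * sig2_iv;  % estimated covariance
se_iv   = sqrt(diag(V_iv));              % standard errors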
The reason for this is that PW X becomes closer and closer to X itself as the number of
instruments increases.
• IV estimation can clearly be used in the case of simultaneous equations. The
only issue is which instruments to use.
• The necessary and sufficient condition for identification is simply that the
matrix QXW QWW^{-1} QXW' be positive definite, and that the instruments be (asymptotically) uncorrelated with ε.
• For this matrix to be positive definite, we need that the conditions noted above
hold: QWW must be positive definite and QXW must be of full rank (K).
• These identification conditions are not that intuitive nor is it very obvious how
to check them.
11.5. Identification by exclusion restrictions
If we use IV estimation of a single equation of the system, the equation can be written as
y = Zδ + ε
where
Z = [ Y1  X1 ]
Notation:
• Let K* = cols(X1) be the number of included exogs, and K** = cols(X2) = K − K* the number of excluded exogs.
• Let G* = cols(Y1) + 1 be the number of included endogs (counting the left-hand side variable), and G** = cols(Y2) = G − G* the number of excluded endogs.
• Now the X1 are weakly exogenous and can serve as their own instruments.
• It turns out that X exhausts the set of possible instruments, in that if the vari-
ables in X don’t lead to an identified model then no other instruments will
identify the model either. Assuming this is true (we’ll prove it in a moment),
then a necessary condition for identification is that cols(X2 ) ≥ cols(Y1 ) since
if not then at least one instrument must be used twice, so W will not have full
column rank:
• To show that this is in fact a necessary condition consider some arbitrary set
of instruments W. A necessary condition for identification is that
ρ( plim (1/n) W'Z ) = K* + G* − 1
where
Z = [ Y1  X1 ]
Recall that the model is
Y Γ = XB + E
with the partitions
Y = [ y  Y1  Y2 ],   X = [ X1  X2 ]
so that the rf for the included endogs (other than y) is
Y1 = X1 Π12 + X2 Π22 + V1
and therefore
(1/n) W'Z = (1/n) W' [ X1 Π12 + X2 Π22 + V1    X1 ]
Because the W's are uncorrelated with the V1's, by assumption, the cross between W
and V1 converges in probability to zero, so
plim (1/n) W'Z = plim (1/n) W' [ X1 Π12 + X2 Π22    X1 ]
Since the far rhs term is formed only of linear combinations of columns of X , the rank
of this matrix can never be greater than K, regardless of the choice of instruments. If
Z has more than K columns, then it is not of full column rank. When Z has more than
K columns we have
G∗ − 1 + K ∗ > K
or noting that K ∗∗ = K − K ∗ ,
G∗ − 1 > K ∗∗
In this case, the limiting matrix is not of full column rank, and the identification con-
dition fails.
Recall that the structural model, for a single observation, is
Yt'Γ = Xt'B + Et,   V(Et) = Σ
while the reduced form is
Yt' = Xt'Π + Vt',   V(Vt) = (Γ^{-1})'ΣΓ^{-1} = Ω
The reduced form parameters are consistently estimable, but none of them are known
a priori, and there are no restrictions on their values. The problem is that more than
one structural form has the same reduced form, so knowledge of the reduced form
parameters alone isn't enough to determine the structural parameters. To see this,
consider postmultiplying the coefficient matrices by an arbitrary nonsingular G × G matrix F:
Yt'ΓF = Xt'BF + Et F,   V(Et F) = F'ΣF
The rf of this new model is
Yt' = Xt'BF(ΓF)^{-1} + Et F(ΓF)^{-1} = Xt'BΓ^{-1} + Et Γ^{-1} = Xt'Π + Vt'
with rf error covariance
V(Vt) = (Γ^{-1})'ΣΓ^{-1} = Ω
just as before.
Since the two structural forms lead to the same rf, and the rf is all that is directly
estimable, the models are said to be observationally equivalent. What we need for
identification are restrictions on Γ and B such that the only admissible F is an identity
matrix (if all of the equations are to be identified). Take the coefficient matrices as
partitioned before:
[ Γ ; B ] = [ 1  Γ12 ; −γ1  Γ22 ; 0  Γ32 ; β1  B12 ; 0  B22 ]
The coefficients of the first equation of the transformed model are simply these coefficients multiplied by the first column of F. This gives
[ Γ ; B ] [ f11 ; F2 ] = [ 1  Γ12 ; −γ1  Γ22 ; 0  Γ32 ; β1  B12 ; 0  B22 ] [ f11 ; F2 ]
For identification of the first equation we need that there be enough restrictions so that
the only admissible
[ f11 ; F2 ]
is the first column of an identity matrix. The exclusion restrictions on the first equation require that the transformed first column also have zeros in the corresponding places, which means that
[ Γ32 ; B22 ] F2 = 0
and the only way this can hold, without additional restrictions on the model's parameters, is if F2 is a vector of zeros. Given that F2 is a vector of zeros, then the first
equation's normalization,
[ 1  Γ12 ] [ f11 ; F2 ] = 1, implies f11 = 1.
Therefore, as long as
ρ( [ Γ32 ; B22 ] ) = G − 1
then
[ f11 ; F2 ] = [ 1 ; 0_{G−1} ]
The first equation is identified in this case, so the condition is sufficient for identifica-
tion. It is also necessary, since the condition implies that this submatrix must have at
least G − 1 rows. Since this matrix has
G** + K** = G − G* + K**
rows, we obtain
G − G* + K** ≥ G − 1
or
K** ≥ G* − 1
• These results are valid assuming that the only identifying information comes
from knowing which variables appear in which equations, e.g., by exclusion
restrictions, and through the use of a normalization. There are other sorts of
identifying information that can be used. These include
(1) Cross equation restrictions
(2) Additional restrictions on parameters within equations (as in the Klein
model discussed below)
(3) Restrictions on the covariance matrix of the errors
(4) Nonlinearities in variables
• When these sorts of information are available, the above conditions aren’t
necessary for identification, though they are of course still sufficient.
To give an example of how other information can be used, consider the model
Y Γ = XB + E
where Γ is an upper triangular matrix with 1’s on the main diagonal. This is a triangu-
lar system of equations. In this case, the first equation is
y1 = X B·1 + E·1
This equation has K ∗∗ = 0 excluded exogs, and G∗ = 2 included endogs, so it fails the
order (necessary) condition for identification.
• However, suppose that we have the restriction Σ21 = 0, so that the first and
second structural errors are uncorrelated. In this case
E(y1t ε2t) = E[ (Xt'B·1 + ε1t) ε2t ] = 0
To give an example, consider Klein's Model 1. The Γ and B matrices written below correspond to the system
Consumption:   Ct = α0 + α1 Pt + α2 Pt−1 + α3 (Wtp + Wtg) + ε1t
Investment:    It = β0 + β1 Pt + β2 Pt−1 + β3 Kt−1 + ε2t
Private wages: Wtp = γ0 + γ1 Xt + γ2 Xt−1 + γ3 At + ε3t
Output:        Xt = Ct + It + Gt
Profits:       Pt = Xt − Tt − Wtp
Capital stock: Kt = Kt−1 + It
The other variables are the government wage bill, Wtg, taxes, Tt, government nonwage
spending, Gt, and a time trend, At. The endogenous variables are the lhs variables,
Yt' = [ Ct  It  Wtp  Xt  Pt  Kt ]
and the predetermined variables are
Xt' = [ 1  Wtg  Gt  Tt  At  Pt−1  Kt−1  Xt−1 ].
The model assumes that the errors of the equations are contemporaneously correlated,
but nonautocorrelated. The model written as Y Γ = XB + E gives

Γ =
  1     0     0    −1     0     0
  0     1     0    −1     0    −1
 −α3    0     1     0     1     0
  0     0    −γ1    1    −1     0
 −α1   −β1    0     0     1     0
  0     0     0     0     0     1

B =
  α0    β0    γ0    0     0     0
  α3    0     0     0     0     0
  0     0     0     1     0     0
  0     0     0     0    −1     0
  0     0     γ3    0     0     0
  α2    β2    0     0     0     0
  0     β3    0     0     0     1
  0     0     γ2    0     0     0
To check the identification of the consumption equation, we need to extract Γ32 and
B22, the submatrices of coefficients of endogs and exogs that don't appear in this equation. These are the rows that have zeros in the first column, after dropping the
first column itself. We need to find a set of 5 rows of this matrix that gives a full-rank 5×5 matrix. For
example, selecting rows 3, 4, 5, 6, and 7 we obtain the matrix
A =
  0    0    0    0    1
  0    0    1    0    0
  0    0    0   −1    0
  0    γ3   0    0    0
  β3   0    0    0    1
This matrix is of full rank, so the sufficient condition for identification is met. Counting
included endogs, G* = 3, and counting excluded exogs, K** = 5, so
K** − L = G* − 1
5 − L = 3 − 1
L = 3
11.6. 2SLS
2SLS estimates the equations of the system one at a time, using the equation of interest plus the full list of weakly exogenous variables.
• This isn't always efficient, as we'll see, but it has the advantage that misspecifications in other equations will not affect the consistency of the estimator of
the parameters of the equation of interest.
• Also, estimation of the equation won't be affected by identification problems
in other equations.
The 2SLS estimator is very simple: in the first stage, each column of Y1 is regressed on
all the weakly exogenous variables in the system, e.g., the entire X matrix. The fitted
values are
Ŷ1 = PX Y1 = X Π̂1
Since these fitted values are the projection of Y1 on the space spanned by X, and since
any vector in this space is uncorrelated with ε by assumption, Ŷ1 is uncorrelated with
ε. Since Ŷ1 is simply the reduced-form prediction, it is correlated with Y1. The only
other requirement is that the instruments be linearly independent. This should be the
case when the order condition is satisfied, since there are more columns in X2 than in
Y1 in this case.
The second stage substitutes Ŷ1 in place of Y1, and estimates by OLS. The original
model is
y = Y1 γ1 + X1 β1 + ε = Zδ + ε
and the second-stage model is
y = Ŷ1 γ1 + X1 β1 + ε.
Since X1 lies in the space spanned by X, PX X1 = X1, so the second-stage regression can also be written as
y = PX Y1 γ1 + PX X1 β1 + ε ≡ PX Zδ + ε
The OLS estimator applied to this model is
δ̂ = (Z'PX Z)^{-1} Z'PX y
which is exactly what we get if we estimate using IV, with the reduced form predictions
of the endogs used as instruments. Note that if we define
Ẑ = PX Z = [ Ŷ1  X1 ]
then the 2SLS estimator can also be written as
δ̂ = (Ẑ'Z)^{-1} Ẑ'y
• Important note: OLS on the transformed model can be used to calculate the
2SLS estimate of δ, since we see that it's equivalent to IV using a particular
set of instruments. However the OLS covariance formula is not valid. We
need to apply the IV covariance formula already seen above (a short Octave sketch follows).
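To make the note concrete, the following Octave sketch computes 2SLS as IV with Ẑ and uses the IV covariance formula rather than the second-stage OLS one. It assumes y, Y1, X1 and the full exog matrix X are in memory; the variable names are illustrative.

% 2SLS for y = Y1*gamma1 + X1*beta1 + eps, instruments = all columns of X
PX     = X * inv(X' * X) * X';
Y1hat  = PX * Y1;                      % first stage: reduced-form fitted values
Zhat   = [Y1hat X1];
Z      = [Y1 X1];
delta  = (Zhat' * Z) \ (Zhat' * y);    % 2SLS = IV using Zhat as instruments
e      = y - Z * delta;                % residuals use the original Y1, not Y1hat
sig2   = (e' * e) / rows(y);
V      = inv(Zhat' * Zhat) * sig2;     % valid 2SLS covariance (not the 2nd-stage OLS one)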
Applying the IV covariance formula with
Ẑ = PX Z = [ Ŷ1  X1 ]
as the instruments gives
V̂(δ̂) = (Z'Ẑ)^{-1} (Ẑ'Ẑ) (Ẑ'Z)^{-1} σ̂²IV
Note that
Ẑ'Z = [ Ŷ1  X1 ]' [ Y1  X1 ] = [ Y1'PX Y1   Y1'PX X1 ; X1'Y1   X1'X1 ]
but since PX is idempotent and since PX X = X, we can write
[ Ŷ1  X1 ]' [ Y1  X1 ] = [ Y1'PX PX Y1   Y1'PX X1 ; X1'PX Y1   X1'X1 ]
                       = [ Ŷ1  X1 ]' [ Ŷ1  X1 ]
                       = Ẑ'Ẑ
Therefore, the second and last terms in the variance formula cancel, so the 2SLS varcov
estimator simplifies to
V̂(δ̂) = (Z'Ẑ)^{-1} σ̂²IV
which, following some algebra similar to the above, can also be written as
V̂(δ̂) = (Ẑ'Ẑ)^{-1} σ̂²IV
Finally, recall that though this is presented in terms of the first equation, it is general
since any equation can be placed first.
Properties of 2SLS:
(1) Consistent
(2) Asymptotically normal
(3) Biased when the mean exists (the existence of moments is a technical issue
we won't go into here).
(4) Asymptotically inefficient, except in special circumstances (more on this later).
11.7. Testing the overidentifying restrictions
The selection of which variables are endogs and which are exogs is part of the
specification of the model. As such, there is room for error here: one might erroneously
classify a variable as exog when it is in fact correlated with the error term. A general
test for the specification of the model can be formulated as follows:
The IV estimator can be calculated by applying OLS to the transformed model, so
the IV objective function at the minimized value is
s(β̂IV) = ( y − X β̂IV )' PW ( y − X β̂IV ),
but
ε̂IV = y − X β̂IV
    = y − X (X'PW X)^{-1} X'PW y
    = [ I − X (X'PW X)^{-1} X'PW ] y
    = [ I − X (X'PW X)^{-1} X'PW ] (Xβ + ε)
    = A (Xβ + ε)
where
A ≡ I − X (X'PW X)^{-1} X'PW
so
s(β̂IV) = ( ε' + β'X' ) A'PW A ( Xβ + ε )
Since PW is idempotent,
A'PW A = [ I − PW X (X'PW X)^{-1} X' ] PW [ I − X (X'PW X)^{-1} X'PW ]
       = [ PW − PW X (X'PW X)^{-1} X'PW ] [ I − X (X'PW X)^{-1} X'PW ]
       = PW − PW X (X'PW X)^{-1} X'PW .
Furthermore, A is orthogonal to X:
A X = [ I − X (X'PW X)^{-1} X'PW ] X = X − X = 0
so
s(β̂IV) = ε' A'PW A ε
Supposing the ε are normally distributed, with variance σ², then the random variable
s(β̂IV)/σ² = ε'A'PW Aε / σ²
is a quadratic form in a standard normal vector, with an idempotent weighting matrix, so
s(β̂IV)/σ̂² ∼a χ²( ρ(A'PW A) )
• Even if the ε aren't normally distributed, the asymptotic result still holds. The
last thing we need to determine is the rank of the idempotent matrix. We have
A'PW A = PW − PW X (X'PW X)^{-1} X'PW
so
ρ(A'PW A) = Tr[ PW − PW X (X'PW X)^{-1} X'PW ]
          = Tr[ W (W'W)^{-1} W' ] − KX
          = Tr[ W'W (W'W)^{-1} ] − KX
          = KW − KX
• This test is an overall specification test: the joint null hypothesis is that the
model is correctly specified and that the W form valid instruments (e.g., that
the variables classified as exogs really are uncorrelated with ε). Rejection can
mean that either the model y = Zδ + ε is misspecified, or that there is correlation between X and ε.
• This is a particular case of the GMM criterion test, which is covered in the
second half of the course. See Section 15.8.
• Note that since
ε̂IV = Aε
and
s(β̂IV) = ε'A'PW Aε
we can write
s(β̂IV)/σ̂² = ε̂' W (W'W)^{-1} W' ε̂ / ( ε̂'ε̂/n )
           = n ( RSS_{ε̂IV | W} / TSS_{ε̂IV} )
           = n R²u
where R²u is the uncentered R² from a regression of the IV residuals on all of the
instruments W. An Octave sketch of this computation follows the list.
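A sketch of the test in its nR²u form, continuing the Octave IV sketch given earlier (y, X, W assumed in memory; names illustrative):

% test of the overidentifying restrictions (valid when cols(W) > cols(X))
PW      = W * inv(W' * W) * W';
beta_iv = (X' * PW * X) \ (X' * PW * y);
e       = y - X * beta_iv;                % IV residuals
R2u     = (e' * PW * e) / (e' * e);       % uncentered R^2 of e regressed on W
stat    = rows(y) * R2u;                  % asympt. chi^2(cols(W) - cols(X)) under the null
pval    = 1 - chi2cdf(stat, columns(W) - columns(X));  % chi2cdf: Octave statistics package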
Consider the case of exact identification, where the number of instruments equals the number of regressors, so that W'X is square and nonsingular. Transforming the model
y = Xβ + ε
as before,
PW y = PW Xβ + PW ε
The IV estimator is
β̂IV = (X'PW X)^{-1} X'PW y
and, in the just identified case,
(X'PW X)^{-1} = [ X'W (W'W)^{-1} W'X ]^{-1} = (W'X)^{-1} (W'W) (X'W)^{-1}
so
β̂IV = (W'X)^{-1} (W'W) (X'W)^{-1} X'PW y
     = (W'X)^{-1} (W'W) (X'W)^{-1} X'W (W'W)^{-1} W'y
     = (W'X)^{-1} W'y
The objective function at the minimum is
s(β̂IV) = ( y − X β̂IV )' PW ( y − X β̂IV )
        = y'PW ( y − X β̂IV ) − β̂IV' X'PW y + β̂IV' X'PW X β̂IV
        = y'PW ( y − X β̂IV ) − β̂IV' ( X'PW y − X'PW X β̂IV )
        = y'PW ( y − X β̂IV )
by the fonc for generalized IV. However, when we're in the just identified case, this
is
s(β̂IV) = y'PW [ y − X (W'X)^{-1} W'y ]
        = y' [ W (W'W)^{-1} W' − W (W'W)^{-1} W'X (W'X)^{-1} W' ] y
        = y' [ W (W'W)^{-1} W' − W (W'W)^{-1} W' ] y
        = 0
The value of the objective function of the IV estimator is zero in the just identified case.
This makes sense, since we’ve already shown that the objective function after dividing
by σ2 is asymptotically χ2 with degrees of freedom equal to the number of overidenti-
fying restrictions. In the present case, there are no overidentifying restrictions, so we
have a χ2 (0) rv, which has mean 0 and variance 0, e.g., it’s simply 0. This means we’re
not able to test the identifying restrictions in the case of exact identification.
11.8. System methods of estimation
Single-equation methods such as 2SLS use only the information contained in the equation being estimated; the parameters of the other equations are not involved, so they don't need to be specified (except for defining what the exogs are, so 2SLS can
use the complete set of instruments). The disadvantage of 2SLS is that it's inefficient,
in general.
Recall the overall model
Y Γ = XB + E
E(X'E) = 0_(K×G)
vec(E) ∼ N(0, Ψ)
• Since there is no autocorrelation of the Et's, and since the columns of E are
individually homoscedastic, then
Ψ = [ σ11 In  σ12 In  ⋯  σ1G In ; ·  σ22 In  ⋯  σ2G In ; ⋮  ⋱  ⋮ ; ·  ·  ⋯  σGG In ]
  = Σ ⊗ In
This means that the structural equations are heteroscedastic and correlated
with one another
• In general, ignoring this will lead to inefficient estimation, following the sec-
tion on GLS. When equations are correlated with one another estimation
should account for the correlation in order to obtain efficiency.
• Also, since the equations are correlated, information about one equation is
implicitly information about all equations. Therefore, overidentification re-
strictions in any equation improve efficiency for all equations, even the just
identified equations.
• Single equation methods can’t use these types of information, and are there-
fore inefficient (in general).
11.8.1. 3SLS. Note: It is easier and more practical to treat the 3SLS estimator
as a generalized method of moments estimator (see Chapter 15). I no longer teach
the following section, but it is retained for its possible historical interest. Another
alternative is to use FIML (Subsection 11.8.2), if you are willing to make distributional
assumptions on the errors. This is computationally feasible with modern computers.
Following our above notation, each structural equation can be written as
yi = Yi γi + Xi βi + εi
   = Zi δi + εi
or, stacking all G equations,
y = Zδ + ε
where y, δ and ε stack the equation-by-equation quantities and Z is block diagonal with blocks Zi. The error covariance is
E(εε') = Ψ = Σ ⊗ In
The 3SLS estimator is just 2SLS combined with a GLS correction that takes advantage
of the structure of Ψ. Define Ẑ as
Ẑ = [ X(X'X)^{-1}X'Z1   0   ⋯   0 ; 0   X(X'X)^{-1}X'Z2   ⋯   0 ; ⋮   ⋱   ⋮ ; 0   ⋯   0   X(X'X)^{-1}X'ZG ]
  = [ [Ŷ1 X1]   0   ⋯   0 ; 0   [Ŷ2 X2]   ⋯   0 ; ⋮   ⋱   ⋮ ; 0   ⋯   0   [ŶG XG] ]
These instruments are simply the unrestricted rf predictions of the endogs, combined with the exogs. The distinction is that if the model is overidentified, then
Π = BΓ^{-1}
may be subject to some zero restrictions, depending on the restrictions on Γ and B, and the unrestricted estimator Π̂ does not impose them. The IV estimator that uses Ẑ as instruments is
δ̂ = (Ẑ'Z)^{-1} Ẑ'y
as can be verified by simple multiplication, and noting that the inverse of a block-diagonal matrix is just the matrix with the inverses of the blocks on the main diagonal.
This IV estimator still ignores the covariance information. The natural extension is
to add the GLS transformation, putting the inverse of the error covariance into the
formula, which gives the 3SLS estimator
δ̂3SLS = [ Ẑ'(Σ ⊗ In)^{-1} Z ]^{-1} Ẑ'(Σ ⊗ In)^{-1} y
       = [ Ẑ'(Σ^{-1} ⊗ In) Z ]^{-1} Ẑ'(Σ^{-1} ⊗ In) y
This requires knowledge of Σ. A feasible version is based on a consistent estimator using the 2SLS residuals
ε̂i = yi − Zi δ̂i,2SLS
(IMPORTANT NOTE: this is calculated using Zi, not Ẑi). Then the element i, j of Σ
is estimated by
σ̂ij = ε̂i'ε̂j / n
Substitute Σ̂ into the formula above to get the feasible 3SLS estimator.
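A compact Octave sketch of feasible 3SLS, assuming the per-equation data are held in cell arrays ys and Zs (one entry per equation) and X holds all the exogs; the names are illustrative and the sketch only mirrors the formulas above.

% feasible 3SLS: ys{i} is y_i (n x 1), Zs{i} is Z_i (n x k_i), X is n x K
G = numel(ys);
n = rows(X);
PX = X * inv(X' * X) * X';
E = zeros(n, G);
Zhat_c = cell(1, G);  Z_c = cell(1, G);  yv = [];
for i = 1:G
  d2sls  = (Zs{i}' * PX * Zs{i}) \ (Zs{i}' * PX * ys{i});  % 2SLS equation by equation
  E(:,i) = ys{i} - Zs{i} * d2sls;        % residuals computed with Z_i, not Zhat_i
  Zhat_c{i} = PX * Zs{i};
  Z_c{i}    = Zs{i};
  yv = [yv; ys{i}];
end
Sig   = (E' * E) / n;                    % estimated contemporaneous covariance
Zhat  = blkdiag(Zhat_c{:});
Zbig  = blkdiag(Z_c{:});
Om    = kron(inv(Sig), eye(n));          % (Sigma^-1 kron I_n)
d3sls = (Zhat' * Om * Zbig) \ (Zhat' * Om * yv);
V3sls = inv(Zhat' * Om * Zhat);          % estimated covariance of the 3SLS estimator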
Analogously to what we did in the case of 2SLS, the asymptotic distribution of the
3SLS estimator can be shown to be
√n ( δ̂3SLS − δ ) ∼a N( 0, lim_{n→∞} E[ Ẑ'(Σ ⊗ In)^{-1} Ẑ / n ]^{-1} )
A formula for estimating the variance of the 3SLS estimator in finite samples (cancelling out the powers of n) is
V̂( δ̂3SLS ) = [ Ẑ' ( Σ̂^{-1} ⊗ In ) Ẑ ]^{-1}
• This is analogous to the 2SLS covariance formula given above, combined with the
GLS correction.
• In the case that all equations are just identified, 3SLS is numerically equiva-
lent to 2SLS. Proving this is easiest if we use a GMM interpretation of 2SLS
and 3SLS. GMM is presented in the next econometrics course. For now, take
it on faith.
The 3SLS estimator is based upon the rf parameter estimator Π̂, calculated equation
by equation using OLS:
Π̂ = (X'X)^{-1} X'Y
which is simply
Π̂ = (X'X)^{-1} X' [ y1  y2  ⋯  yG ]
that is, OLS equation by equation using all the exogs in the estimation of each column
of Π.
It may seem odd that we use OLS on the reduced form, since the rf equations are
correlated: writing the rf as
Yt' = Xt'Π + Vt'
we have
Vt = (Γ^{-1})'Et ∼ N( 0, (Γ^{-1})'ΣΓ^{-1} ), ∀t
and define
Ξ = (Γ^{-1})'ΣΓ^{-1}.
The rf equations, written one at a time, are
yi = Xπi + vi
where yi is the n × 1 vector of observations of the ith endog, X is the entire n × K matrix
of exogs, πi is the ith column of Π, and vi is the ith column of V. Use the notation
y = Xπ + v
to indicate the pooled model.
to indicate the pooled model. Following this notation, the error covariance matrix is
V (v) = Ξ ⊗ In
• In the special case that all the Xi are the same, which is true in the present
case of estimation of the rf parameters, SUR ≡ OLS. To show this, note that in
this case the pooled regressor matrix is IG ⊗ X. Using the rules
(1) (A ⊗ B)^{-1} = (A^{-1} ⊗ B^{-1})
(2) (A ⊗ B)' = (A' ⊗ B') and
(3) (A ⊗ B)(C ⊗ D) = (AC ⊗ BD), we get
π̂SUR = [ (IG ⊗ X)' (Ξ ⊗ In)^{-1} (IG ⊗ X) ]^{-1} (IG ⊗ X)' (Ξ ⊗ In)^{-1} y
     = [ (Ξ^{-1} ⊗ X') (IG ⊗ X) ]^{-1} (Ξ^{-1} ⊗ X') y
     = [ Ξ ⊗ (X'X)^{-1} ] (Ξ^{-1} ⊗ X') y
     = [ IG ⊗ (X'X)^{-1} X' ] y
     = [ π̂1 ; π̂2 ; … ; π̂G ]
so the SUR estimator is just OLS, equation by equation.
11.8.2. FIML. Full information maximum likelihood is an alternative estimation method; an ML estimator is asymptotically efficient among estimators that use the same
information set. The 2SLS and 3SLS estimators don't require distributional assumptions, while FIML of course does. Our model is, recall,
Yt'Γ = Xt'B + Et,   Et ∼ N(0, Σ), ∀t
with the Et independent over time.
The joint normality of Et means that the density for Et is the multivariate normal,
which is
f(Et) = (2π)^{-G/2} ( det Σ^{-1} )^{1/2} exp( −(1/2) Et'Σ^{-1}Et )
The transformation from Et to Yt requires the Jacobian
| det( dEt/dYt' ) | = | det Γ |
Given the assumption of independence over time, the joint log-likelihood function is
ln L(B, Γ, Σ) = −(nG/2) ln(2π) + n ln(| det Γ|) + (n/2) ln det Σ^{-1} − (1/2) Σ_{t=1}^{n} ( Yt'Γ − Xt'B ) Σ^{-1} ( Yt'Γ − Xt'B )'
• One can calculate the FIML estimator by iterating the 3SLS estimator, thus
avoiding the use of a nonlinear optimizer. The steps are
(1) Calculate Γ̂3SLS and B̂3SLS as normal.
(2) Calculate Π̂ = B̂3SLS Γ̂3SLS^{-1}. This is new; we didn't estimate Π in this way
before. This estimator may have some zeros in it. When Greene says
iterated 3SLS doesn’t lead to FIML, he means this for a procedure that
doesn’t update Π̂, but only updates Σ̂ and B̂ and Γ̂. If you update Π̂ you
do converge to FIML.
(3) Calculate the instruments Ŷ = X Π̂ and calculate Σ̂ using Γ̂ and B̂ to get
the estimated errors, applying the usual estimator.
(4) Apply 3SLS using these new instruments and the estimate of Σ.
(5) Repeat steps 2-4 until there is no change in the parameters.
• FIML is fully efficient, since it’s an ML estimator that uses all information.
This implies that 3SLS is fully efficient when the errors are normally dis-
tributed. Also, if each equation is just identified and the errors are normal,
then 2SLS will be fully efficient, since in this case 2SLS≡3SLS.
• When the errors aren’t normally distributed, the likelihood function is of
course different than what’s written above.
11.9. Example: 2SLS and Klein's Model 1
The Octave program Simeq/Klein.m performs 2SLS estimation for the 3 equations
of Klein's model 1, assuming nonautocorrelated errors, so that lagged endogenous
variables can be used as instruments. The results are:
CONSUMPTION EQUATION
*******************************************************
*******************************************************
INVESTMENT EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184
*******************************************************
WAGES EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427
*******************************************************
The above results are not valid (specifically, they are inconsistent) if the errors
are autocorrelated, since lagged endogenous variables will not be valid instruments
in that case. You might consider eliminating the lagged endogenous variables as instruments, and re-estimating by 2SLS, to obtain consistent parameter estimates in this
more complex case. Standard errors will still be estimated inconsistently, unless one uses a
Newey-West type covariance estimator. Food for thought...
CHAPTER 12
We’ll begin with study of extremum estimators in general. Let Zn be the available
data, based on a sample of size n.
Because the logarithmic function is strictly increasing on (0, ∞), maximization of the
average logarithm of the likelihood function is achieved at the same θ̂ as for the likelihood function:
θ̂ ≡ arg max_Θ sn(θ) = (1/n) ln Ln(θ) = −(1/2) ln 2π − (1/n) Σ_{t=1}^{n} (yt − θ)²/2
• Define
m1(θ) = µ1(θ) − µ̂1
In this case µ1(θ) = θ and µ̂1 = Σ_{t=1}^{n} yt/n, so
m1(θ̂) = θ̂ − Σ_{t=1}^{n} yt/n = 0.
Since Σ_{t=1}^{n} yt/n →p θ0 by the LLN, the estimator is consistent.
More on the method of moments
Continuing with the above example, the variance of a χ²(θ0) r.v. is
V(yt) = E(yt − θ0)² = 2θ0.
• Define
m2(θ) = 2θ − Σ_{t=1}^{n} (yt − ȳ)² / n
• The MM estimator would set
m2(θ̂) = 2θ̂ − Σ_{t=1}^{n} (yt − ȳ)² / n ≡ 0.
Again, by the LLN, the sample variance is consistent for the true variance,
that is,
Σ_{t=1}^{n} (yt − ȳ)² / n →p 2θ0.
So,
θ̂ = Σ_{t=1}^{n} (yt − ȳ)² / (2n),
which is obtained by inverting the moment-parameter equation, is consistent
(a small numerical illustration follows).
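A tiny Octave illustration of these two method of moments estimators, using simulated χ²(θ0) data (the value θ0 = 5 here is arbitrary):

% method of moments for chi^2(theta0) data: two alternative estimators
theta0 = 5;  n = 1000;
y = sum(randn(n, theta0).^2, 2);            % chi^2(theta0) draws (sums of squared normals)
theta_mm1 = mean(y);                        % sets m1(theta) = theta - mean(y) to zero
theta_mm2 = sum((y - mean(y)).^2) / (2*n);  % sets m2(theta) = 2*theta - sample var to zero
% the two estimates differ in finite samples, but both converge to theta0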
From the first example, define m1t(θ) = θ − yt. We already have that m1(θ) is the
sample average of m1t(θ), i.e.,
m1(θ) = (1/n) Σ_{t=1}^{n} m1t(θ) = θ − Σ_{t=1}^{n} yt / n.
Clearly, when evaluated at the true parameter value θ0, both E[m1t(θ0)] = 0 and
E[m1(θ0)] = 0.
From the second example we define additional moment conditions
m2t(θ) = 2θ − (yt − ȳ)²
and
m2(θ) = 2θ − Σ_{t=1}^{n} (yt − ȳ)² / n.
Again, it is clear from the LLN that m2(θ0) →a.s. 0. The MM estimator would choose θ̂ to
set either m1(θ̂) = 0 or m2(θ̂) = 0. In general, no single value of θ will solve the two
equations simultaneously.
One of the focal points of the course will be nonlinear models. This is not to
suggest that linear models aren’t useful. Linear models are more general than they
might first appear, since one can employ nonlinear transformations of the variables:
ϕ0(yt) = [ ϕ1(xt)  ϕ2(xt)  ⋯  ϕp(xt) ] θ + εt
For example,
ln yt = α + βx1t + γx1t² + δx1t x2t + εt
• The important point is that the model is linear in the parameters but not nec-
essarily linear in the variables.
In spite of this generality, situations often arise which simply can not be convincingly
represented by linear in the parameters models. Also, theory that applies to nonlinear
models also applies to linear models, so one may as well start off with the general case.
Example: Expenditure shares
Roy's Identity states that the quantity demanded of the ith of G goods is
xi = − (∂v(p, y)/∂pi) / (∂v(p, y)/∂y).
An expenditure share is
si ≡ pi xi / y,
so necessarily si ∈ [0, 1] and Σi si = 1. No linear-in-the-parameters model for si
with a parameter space that is defined independent of the data can guarantee that either
of these conditions holds. These constraints will often be violated by estimated linear
models, which calls into question their appropriateness in cases of this sort.
Example: Binary limited dependent variable
The referendum contingent valuation (CV) method of inferring the social value of
a project provides a simple example. This example is a special case of more general
discrete choice (or binary response) models. Individuals are asked if they would pay
an amount A for provision of a project. Indirect utility in the base case (no project) is
v0(m, z) + ε0, where m is income and z is a vector of other variables such as prices,
personal characteristics, etc. After provision, utility is v1(m, z) + ε1. The random terms
εi, i = 0, 1, reflect variations of preferences in the population. With this, an individual
agrees1 to pay A if
ε0 − ε1 < v1(m − A, z) − v0(m, z)
i.e., if ε < ∆v(w, A), where ε ≡ ε0 − ε1 and ∆v(w, A) ≡ v1(m − A, z) − v0(m, z), with w = (m, z).
To simplify notation, define p(w, A) ≡ Fε [∆v(w, A)]. To make the example specific,
suppose that
v1 (m, z) = α − βm
v0 (m, z) = −βm
and ε0 and ε1 are i.i.d. extreme value random variables. That is, utility depends only
on income, preferences in both states are homothetic, and a specific distributional as-
sumption is made on the distribution of preferences in the population. With these
assumptions (the details are unimportant here, see articles by D. McFadden if you’re
interested) it can be shown that
p(A, θ) = Λ (α + βA) ,
Λ(z) = (1 + exp(−z))−1 .
1We assume here that responses are truthful, that is there is no strategic behavior and that individuals
are able to order their preferences in this hypothetical situation.
This is the simple logit model: the choice probability is the logit function of a linear in
parameters function.
Now, the response y (equal to 1 if the individual agrees to pay A, 0 otherwise) is either 0 or 1, and the expected value of y is Λ(α + βA). Thus, we can write
y = Λ(α + βA) + η
E(η) = 0.
The main point is that it is impossible that Λ(α + βA) can be written as a linear in the
parameters model, in the sense that, for arbitrary A, there are no θ, ϕ(A) such that
Λ(α + βA) = ϕ(A)'θ, ∀A
The general form of such a model is
yt = f(xt) + εt
or, in parametric form,
yt = f(xt, θ) + εt,   θ ∈ Θ, φ ∈ Φ
where f(·) and perhaps Fε(z|φ, xt) are of known functional form. This is important
since economic theory gives us general information about functions and the signs of
their derivatives, but not about their specific form.
Then we'll look at simulation-based methods in econometrics. These methods
allow us to substitute computer power for mental power. Since computer power is
becoming relatively cheap compared to mental effort, any econometrician who lives
by the principles of economic theory should be interested in these techniques.
Finally, we'll look at how econometric computations can be done in parallel on a
cluster of computers. This allows us to harness more computational power to work
with more complex models than can be dealt with using a desktop computer.
CHAPTER 13
The simplest optimization problem has a quadratic objective function, e.g.,
s(θ) = a + b'θ + (1/2) θ'Cθ,
whose gradient,
Dθ s(θ) = b + Cθ,
is linear in θ, so the first order conditions can be solved analytically. Linear models estimated by least squares or (feasible) GLS fall into this category,
since conditional on the estimate of the varcov matrix, we have a quadratic objective
function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able to solve
for the maximizer analytically. This is when we need a numeric optimization method.
13.1. Search
The idea is to create a grid over the parameter space and evaluate the function at
each point on the grid. Select the best point. Then refine the grid in the neighborhood
of the best point, and continue until the accuracy is ”good enough”. See Figure 13.1.1.
One has to be careful that the grid is fine enough in relationship to the irregularity of
the function to ensure that sharp peaks are not missed entirely.
To check q values in each dimension of a K dimensional parameter space, we need
to check q^K points. For example, if q = 100 and K = 10, there would be 100^10 points
to check. If 1000 points can be checked in a second, it would take 3.171 × 10^9 years
to perform the calculations, which is approximately the age of the earth. The search
method is a very reasonable choice if K is small, but it quickly becomes infeasible if
K is moderate or large.
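A minimal Octave sketch of grid search with one refinement, for a one-dimensional problem; the objective function used here is just an arbitrary placeholder.

% grid search with one refinement step (K = 1 for illustration)
s = @(theta) -(theta - 1.7).^2;            % placeholder objective to be maximized
grid = linspace(-10, 10, 101);             % coarse grid
[dummy, i] = max(s(grid));
best = grid(i);
h = grid(2) - grid(1);
grid2 = linspace(best - h, best + h, 101); % refine around the best coarse point
[dummy, i2] = max(s(grid2));
theta_hat = grid2(i2)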
13.2. Derivative-based methods
The iteration method can be broken into two problems: choosing the stepsize ak (a
scalar) and choosing the direction of movement, dk, which is of the same dimension
as θ, so that
θ^(k+1) = θ^(k) + ak dk.
A direction of search d is locally increasing if
∃a : ∂s(θ + ad)/∂a > 0
for a positive but small. That is, if we go in direction d, we will improve on the
objective function, at least if we don't go too far in that direction.
• As long as the gradient at θ is not zero there exist increasing directions, and
they can all be represented as Qk g(θk), where Qk is a symmetric pd matrix and
g(θ) = Dθ s(θ) is the gradient at θ. To see this, take a T.S. expansion around
a0 = 0 and note that a small step in the direction Qk g(θk) increases the objective. The iteration moves in such a direction,
and we keep going until the gradient becomes zero, so that there is no increasing
direction. The problem is how to choose a and Q.
13.2.2. Steepest descent. Steepest descent (ascent if we're maximizing) just sets
Q to an identity matrix, since the gradient provides the direction of maximum rate of
change of the objective function.
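A minimal Octave sketch of such an iteration, using the steepest ascent direction (Q = I) and a crude step-halving rule for the stepsize; the objective function and starting value are arbitrary placeholders.

% steepest ascent with simple step-size halving
s = @(theta) -(theta(1)-1)^2 - 2*(theta(2)+0.5)^2;   % placeholder objective
g = @(theta) [-2*(theta(1)-1); -4*(theta(2)+0.5)];   % its gradient
theta = [0; 0];
for k = 1:100
  d = g(theta);                        % steepest ascent direction (Q = I)
  a = 1;
  while s(theta + a*d) <= s(theta) && a > 1e-10
    a = a / 2;                         % shrink the step until the objective improves
  end
  theta_new = theta + a*d;
  if norm(theta_new - theta) < 1e-8, break; end
  theta = theta_new;
end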
13.2.3. Newton-Raphson. The Newton-Raphson method uses a second-order Taylor approximation of sn(θ) about the current iterate θk.
To attempt to maximize sn(θ), we can maximize the portion of the right-hand side that
depends on θ, i.e., we can maximize
s̃(θ) = g(θk)'θ + (1/2) (θ − θk)' H(θk) (θ − θk)
This is a quadratic function of θ, with first order conditions
Dθ s̃(θ) = g(θk) + H(θk)(θ − θk) = 0
so the maximizer of the approximation is θ = θk − H(θk)^{-1} g(θk), which defines the Newton direction of search.
• A potential problem is that the Hessian may not be negative definite when
we're far from the maximizing point. So −H(θk)^{-1} may not be positive definite, and −H(θk)^{-1}g(θk) may not define an increasing direction of search.
This can happen when the objective function has flat regions, in which case
the Hessian matrix is very ill-conditioned (e.g., is nearly singular), or when
we're in the vicinity of a local minimum, where H(θk) is positive definite, and our
direction is a decreasing direction of search. Matrix inverses by computers are subject to large errors when the matrix is ill-conditioned.
• Quasi-Newton methods deal with this by building up an approximation to the Hessian from successive gradient evaluations, in a way designed to
ensure that the approximation is p.d. DFP and BFGS are two well-known
examples.
Stopping criteria
The last thing we need is to decide when to stop. A digital computer is subject to
limited machine precision and round-off errors. For these reasons, it is unreasonable
to hope that a program can exactly find the point that maximizes a function. We need
to define acceptable tolerances. Some stopping criteria are:
|θj^k − θj^{k−1}| < ε1, ∀j
|(θj^k − θj^{k−1}) / θj^{k−1}| < ε2, ∀j
|gj(θ^k)| < ε4, ∀j
Starting values
The Newton-Raphson and related algorithms work well if the objective function
is concave (when maximizing), but not so well if there are convex regions and local
minima or multiple local maxima. The algorithm may converge to a local minimum
or to a local maximum that is not optimal. The algorithm may also have difficulties
converging at all.
• The usual way to “ensure” that a global maximum has been found is to use
many different starting values, and choose the solution that returns the highest
objective function value. THIS IS IMPORTANT in practice. More on this
later.
Calculating derivatives
The Newton-Raphson algorithm requires first and second derivatives. It is often
difficult to calculate derivatives (especially the Hessian) analytically if the function
sn (·) is complicated. Possible solutions are to calculate derivatives numerically, or to
use programs such as MuPAD or Mathematica to calculate analytic derivatives. For
example, Figure 13.2.3 shows MuPAD1 calculating a derivative that I didn’t know off
the top of my head, and one that I did know.
• Numeric derivatives are less accurate than analytic derivatives, and are usu-
ally more costly to evaluate. Both factors usually cause optimization pro-
grams to be less successful when numeric derivatives are used.
• One advantage of numeric derivatives is that you don’t have to worry about
having made an error in calculating the analytic derivative. When program-
ming analytic derivatives it’s a good idea to check that they are correct by
using numeric derivatives. This is a lesson I learned the hard way when writ-
ing my thesis.
1MuPAD is not a freely distributable program, so it’s not on the CD. You can download it from
http://www.mupad.de/download.shtml
• Numeric second derivatives are much more accurate if the data are scaled so
that the elements of the gradient are of the same order of magnitude. Example: if the model is yt = h(αxt + βzt) + εt, and estimation is by NLS, suppose
that Dα sn(·) = 1000 and Dβ sn(·) = 0.001. One could define α* = α/1000,
xt* = 1000xt, β* = 1000β, zt* = zt/1000. In this case, the gradients Dα* sn(·)
and Dβ* sn(·) will both be 1.
In general, estimation programs always work better if data is scaled in
this way, since roundoff errors are less likely to become important. This is
important in practice.
• There are algorithms (such as BFGS and DFP) that use the sequential gradi-
ent evaluations to build up an approximation to the Hessian. The iterations
are faster for this reason since the actual Hessian isn’t calculated, but more
iterations usually are required for convergence.
• Switching between algorithms during iterations is sometimes useful.
13.4. Examples
This section gives a few examples of how some nonlinear models may be estimated
using maximum likelihood.
13.4.1. Discrete Choice: The logit model. In this section we will consider maximum likelihood estimation of the logit model for binary 0/1 dependent variables. We
will use the BFGS algorithm to find the MLE.
The model is
y* = g(x) − ε
y = 1(y* > 0)
Pr(y = 1) = Fε[g(x)] ≡ p(x, θ)
The average log-likelihood function is
sn(θ) = (1/n) Σ_{i=1}^{n} ( yi ln p(xi, θ) + (1 − yi) ln[1 − p(xi, θ)] )
For the logit model (see the contingent valuation example above), the probability
has the specific form
p(x, θ) = 1 / (1 + exp(−x'θ))
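The file logit.m referred to below computes this average log-likelihood; a minimal sketch of what such a function might look like (not necessarily identical to that file, and with an illustrative function name) is:

% average logit log-likelihood: y is n x 1 (0/1), x is n x k, theta is k x 1
% (a file named, say, logit_loglik.m might contain:)
function lnL = logit_loglik(theta, y, x)
  p = 1 ./ (1 + exp(-x * theta));                   % choice probabilities
  lnL = mean(y .* log(p) + (1 - y) .* log(1 - p));  % average log-likelihood
endfunction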
You should download and examine LogitDGP.m, which generates data according
to the logit model, logit.m, which calculates the loglikelihood, and EstimateLogit.m,
which sets things up and calls the estimation routine, which uses the BFGS algorithm.
Here are some estimation results with n = 100, and the true θ = (0, 1)'.
***********************************************
Trial of MLE estimation of Logit model
Information Criteria
CAIC : 132.6230
BIC : 130.6230
AIC : 125.4127
***********************************************
13.4.2. Count Data: The Poisson model. Demand for health care is usually
thought of as a derived demand: health care is an input to a home production function that produces health, and health is an argument of the utility function. Grossman
(1972), for example, models health as a capital stock that is subject to depreciation
(e.g., the effects of ageing). Health care visits restore the stock. Under the home production framework, individuals decide when to make health care visits to maintain
their health stock, or to deal with negative shocks to the stock in the form of accidents
or illnesses. As such, individual demand will be a function of the parameters of the
individuals' utility functions.
The MEPS health data file, meps1996.data, contains 4564 observations on six
measures of health care usage. The data is from the 1996 Medical Expenditure Panel
Survey (MEPS). You can get more information at http://www.meps.ahrq.gov/.
The six measures of use are office-based visits (OBDV), outpatient visits (OPV),
inpatient visits (IPV), emergency room visits (ERV), dental visits (VDV), and number
of prescription drugs taken (PRESCR). These form columns 1 - 6 of meps1996.data.
The conditioning variables are public insurance (PUBLIC), private insurance (PRIV),
sex (SEX), age (AGE), years of education (EDUC), and income (INCOME). These
form columns 7 - 12 of the file, in the order given here. PRIV and PUBLIC are 0/1
binary variables, where a 1 indicates that the person has access to public or private
insurance coverage. SEX is also 0/1, where 1 indicates that the person is female. This
data will be used in examples fairly extensively in what follows.
The program ExploreMEPS.m shows how the data may be read in, and gives some
descriptive information about variables, which follows:
All of the measures of use are count data, which means that they take on the values
0, 1, 2, .... It might be reasonable to try to use this information by specifying the density
as a count data density. One of the simplest count data densities is the Poisson density,
which is
fY(y) = exp(−λ) λ^y / y! .
The average log-likelihood function is
sn(θ) = (1/n) Σ_{i=1}^{n} ( −λi + yi ln λi − ln yi! )
where the conditional mean is parameterized as
λi = exp(xi'β)
This ensures that the mean is positive, as is required for the Poisson model. Note that
for this parameterization
∂λ/∂xj = βj λ,  so  βj = (∂λ/∂xj)/λ
and therefore
βj xj = (∂λ/∂xj)(xj/λ) = η^λ_{xj},
the elasticity of the conditional mean of y with respect to the jth conditioning variable.
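A minimal Octave sketch of the average Poisson log-likelihood with this parameterization (not necessarily identical to the program mentioned below; names illustrative):

% average Poisson log-likelihood: y counts (n x 1), x regressors (n x k)
function lnL = poisson_loglik(beta, y, x)
  lambda = exp(x * beta);                                   % conditional mean, always positive
  lnL = mean(-lambda + y .* log(lambda) - gammaln(y + 1));  % gammaln(y+1) = ln(y!)
endfunction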
The program EstimatePoisson.m estimates a Poisson model using the full data set.
The results of the estimation, using OBDV as the dependent variable are here:
OBDV
******************************************************
Poisson model, MEPS 1996 full data set
Information Criteria
CAIC : 33575.6881 Avg. CAIC: 7.3566
BIC : 33568.6881 Avg. BIC: 7.3551
AIC : 33523.7064 Avg. AIC: 7.3452
******************************************************
13.4.3. Duration data and the Weibull model. In some cases the dependent variable may be the time that passes between the occurrence of two events. For example,
it may be the duration of a strike, or the time needed to find a job once one is unemployed. Such variables take on values on the positive real line, and are referred to as
duration data.
A spell is the period of time between the occurrence of the initial event and the concluding event. For example, the initial event could be the loss of a job, and the final
event is the finding of a new job. The spell is the period of unemployment.
Let t0 be the time the initial event occurs, and t1 be the time the concluding event
occurs. For simplicity, assume that time is measured in years. The random variable D
is the duration of the spell, D = t1 − t0. Define the density function of D, fD(t), with
distribution function FD(t) = Pr(D < t).
Several questions may be of interest. For example, one might wish to know the
expected time one has to wait to find a job given that one has already waited s years.
The density of D, conditional on the spell having lasted at least s years, is
fD(t | D > s) = fD(t) / (1 − FD(s)).
The expected additional time required for the spell to end, given that it has already
lasted s years, is the expectation of D with respect to this density, minus s:
E = E(D | D > s) − s = ∫_s^∞ z fD(z)/(1 − FD(s)) dz − s
To estimate this function, one needs to specify the density fD(t) as a parametric
density, then estimate by maximum likelihood. There are a number of possibilities
including the exponential density, the lognormal, etc. A reasonably flexible model that
is a generalization of the exponential density is the Weibull density
fD(t|θ) = e^{−(λt)^γ} λγ(λt)^{γ−1}.
According to this model, E(D) = λ^{−1} Γ(1 + 1/γ). The log-likelihood is just the sum of the log
densities.
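A minimal Octave sketch of the average Weibull log-likelihood implied by this density (the parameter vector and function name are illustrative; in practice one might reparameterize to keep λ and γ positive):

% average Weibull log-likelihood for durations t (n x 1), theta = [lambda; gamma]
function lnL = weibull_loglik(theta, t)
  lambda = theta(1);  gam = theta(2);
  lnf = -(lambda .* t).^gam + log(lambda) + log(gam) + (gam - 1) .* log(lambda .* t);
  lnL = mean(lnf);
endfunction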
To illustrate application of this model, 402 observations on the lifespan of mon-
gooses in Serengeti National Park (Tanzania) were used to fit a Weibull model. The
”spell” in this case is the lifetime of an individual mongoose. The parameter estimates
and standard errors are λ̂ = 0.559 (0.034) and γ̂ = 0.867 (0.033) and the log-likelihood
value is -659.3. Figure 13.4.1 presents fitted life expectancy (expected additional years
of life) as a function of age, with 95% confidence bands. The plot is accompanied by a
nonparametric Kaplan-Meier estimate of life-expectancy. This nonparametric estima-
tor simply averages all spell lengths greater than age, and then subtracts age. This is
consistent by the LLN.
In the figure one can see that the model doesn’t fit the data well, in that it pre-
dicts life expectancy quite differently than does the nonparametric model. For ages
4-6, the nonparametric estimate is outside the confidence interval that results from the
parametric model, which casts doubt upon the parametric model. Mongooses that are
between 2-6 years old seem to have a lower life expectancy than is predicted by the
Weibull model, whereas young mongooses that survive beyond infancy have a higher
life expectancy, up to a bit beyond 2 years. Due to the dramatic change in the death
13.4. EXAMPLES 269
rate as a function of t, one might specify f D (t) as a mixture of two Weibull densities,
−(λ1t)γ1 γ1 −1 −(λ2t)γ2 γ2 −1
fD (t|θ) = δ e λ1 γ1 (λ1t) + (1 − δ) e λ2 γ2 (λ2t) .
The parameters γi and λi , i = 1, 2 are the parameters of the two Weibull densities, and
δ is the parameter that mixes the two.
With the same data, θ can be estimated using the mixed model. The results are
a log-likelihood = -623.17. Note that a standard likelihood ratio test cannot be used
to chose between the two models, since under the null that δ = 1 (single density), the
two parameters λ2 and γ2 are not identified. It is possible to take this into account,
but this topic is out of the scope of this course. Nevertheless, the improvement in the
likelihood function is considerable. The parameter estimates are
Note that the mixture parameter is highly significant. This model leads to the fit in
Figure 13.4.2. Note that the parametric and nonparametric fits are quite close to one
another, up to around 6 years. The disagreement after this point is not too important,
since less than 5% of mongooses live more than 6 years, which implies that the Kaplan-
Meier nonparametric estimate has a high variance (since it’s an average of a small
number of observations).
Mixture models are often an effective way to model complex responses, though
they can suffer from overparameterization. Alternatives will be discussed later.
13.5. NUMERIC OPTIMIZATION: PITFALLS 270
In this section we’ll examine two common problems that can be encountered when
doing numeric optimization of nonlinear models, and some solutions.
13.5.1. Poor scaling of the data. When the data is scaled so that the magnitudes
of the first and second derivatives are of different orders, problems can easily result. If
we uncomment the appropriate line in EstimatePoisson.m, the data will not be scaled,
and the estimation program will have difficulty converging (it seems to take an infinite
amount of time). With unscaled data, the elements of the score vector have very differ-
ent magnitudes at the initial value of θ (all zeros). To see this run CheckScore.m. With
13.5. NUMERIC OPTIMIZATION: PITFALLS 271
unscaled data, one element of the gradient is very large, and the maximum and mini-
mum elements are 5 orders of magnitude apart. This causes convergence problems due
to serious numerical inaccuracy when doing inversions to calculate the BFGS direction
of search. With scaled data, none of the elements of the gradient are very large, and
the maximum difference in orders of magnitude is 3. Convergence is quick.
13.5.2. Multiple optima. Multiple optima (one global, others local) can compli-
cate life, since we have limited means of determining if there is a higher maximum the
the one we’re at. Think of climbing a mountain in an unknown range, in a very foggy
place (Figure 13.5.1). You can go up until there’s nowhere else to go up, but since
you’re in the fog you don’t know if the true summit is across the gap that’s at your
feet. Do you claim victory and go home, or do you trudge down the gap and explore
the other side?
The best way to avoid stopping at a local maximum is to use many starting values,
for example on a grid, or randomly generated. Or perhaps one might have priors about
possible values for the parameters (e.g., from previous studies of similar data).
Let’s try to find the true minimizer of minus 1 times the foggy mountain function
(since the algoritms are set up to minimize). From the picture, you can see it’s close
to (0, 0), but let’s pretend there is fog, and that we don’t know that. The program
FoggyMountain.m shows that poor start values can lead to problems. It uses SA, which
finds the true global minimum, and it shows that BFGS using a battery of random start
values can also find the global minimum help. The output of one run is here:
======================================================
BFGSMIN final results
13.5. NUMERIC OPTIMIZATION: PITFALLS 272
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value -0.0130329
Stepsize 0.102833
43 iterations
13.5. NUMERIC OPTIMIZATION: PITFALLS 273
------------------------------------------------------
16.000 -28.812
================================================
SAMIN final results
NORMAL CONVERGENCE
3.7417e-02 2.7628e-07
In that run, the single BFGS run with bad start values converged to a point far from the
true minimizer, which simulated annealing and BFGS using a battery of random start
values both found the true maximizaer. battery of random start values managed to find
the global max. The moral of the story is be cautious and don’t publish your results
too quickly.
EXERCISES 275
Exercises
(1) In octave, type ”help bfgsmin_example”, to find out the location of the file.
Edit the file to examine it and learn how to call bfgsmin. Run it, and examine the
output.
(2) In octave, type ”help samin_example”, to find out the location of the file. Edit
the file to examine it and learn how to call samin. Run it, and examine the output.
(3) Using logit.m and EstimateLogit.m as templates, write a function to calculate the
probit loglikelihood, and a script to estimate a probit model. Run it using data that
actually follows a logit model (you can generate it in the same way that is done in
the logit example).
(4) Study mle_results.m to see what it does. Examine the functions that mle_results.m
calls, and in turn the functions that those functions call. Write a complete descrip-
tion of how the whole chain works.
(5) Look at the Poisson estimation results for the OBDV measure of health care use
and give an economic interpretation. Estimate Poisson models for the other 5
measures of health care usage.
CHAPTER 14
Readings: Gourieroux and Monfort (1995), Vol. 2, Ch. 24∗ ; Amemiya, Ch. 4
section 4.1∗ ; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and
McFadden (1994), “Large Sample Estimation and Hypothesis Testing,” in Handbook
of Econometrics, Vol. 4, Ch. 36.
= 1/n k Y − X θ k2
14.2. Consistency
The following theorem is patterned on a proof in Gallant (1987) (the article, ref.
later), which we’ll see in its original form later in the course. It is interesting to com-
pare the following proof with Amemiya’s Theorem 4.1.1, which is done in terms of
convergence in probability.
(3) Identification: s∞ (·) has a unique global maximum at θ0 ∈ Θ, i.e., s∞ (θ0 ) >
s∞ (θ), ∀θ 6= θ0 , θ ∈ Θ
a.s.
Then θ̂n → θ0 .
Proof: Select a ω ∈ Ω and hold it fixed. Then {sn (ω, θ)} is a fixed sequence of
functions. Suppose that ω is such that sn (θ) converges uniformly to s∞ (θ). This hap-
pens with probability one by assumption (b). The sequence { θ̂n } lies in the compact
set Θ, by assumption (1) and the fact that maximixation is over Θ. Since every se-
quence from a compact set has at least one limit point (Davidson, Thm. 2.12), say that
θ̂ is a limit point of {θ̂n }. There is a subsequence {θ̂nm } ({nm } is simply a sequence of
14.2. CONSISTENCY 278
increasing integers) with limm→∞ θ̂nm = θ̂. By uniform convergence and continuity
To see this, first of all, select an element θ̂t from the sequence θ̂nm . Then uniform
convergence implies
lim snm (θ̂t ) = s∞ (θ̂t ).
m→∞
since the limit as t → ∞ of θ̂t is θ̂. So the above claim is true.
Next, by maximization
snm (θ̂nm ) ≥ snm (θ0 )
However,
lim snm (θ̂nm ) = s∞ (θ̂),
m→∞
by uniform convergence, so
s∞ (θ̂) ≥ s∞ (θ0 ).
since so far we have held ω fixed, but now we need to consider all ω ∈ Ω. Therefore
{θ̂n } has only one limit point, θ0 , except on a set C ⊂ Ω with P(C) = 0.
Discussion of the proof:
(2) Identification: Any point θ in Θ with s∞ (θ) ≥ s∞ (θ0 ) must be such that k θ − θ0 k=
0, which matches the way we will write the assumption in the section on nonparametric
inference.
We need a uniform strong law of large numbers in order to verify assumption (2)
of Theorem 19. The following theorem is from Davidson, pg. 337.
T HEOREM 20. [Uniform Strong LLN] Let {Gn (θ)} be a sequence of stochastic
real-valued functions on a totally-bounded metric space (Θ, ρ). Then
a.s.
sup |Gn (θ)| → 0
θ∈Θ
if and only if
a.s.
(a) Gn (θ) → 0 for each θ ∈ Θ0 , where Θ0 is a dense subset of Θ and
(b) {Gn (θ)} is strongly stochastically equicontinuous..
• The metric space we are interested in now is simply Θ ⊂ ℜK , using the Eu-
clidean norm.
• The pointwise almost sure convergence needed for assuption (a) comes from
one of the usual SLLN’s.
• Stronger assumptions that imply those of the theorem are:
– the parameter space is compact (this has already been assumed)
– the objective function is continuous and bounded with probability one on
the entire parameter space
– a standard SLLN can be shown to apply to some point in the parameter
space
• These are reasonable conditions in many cases, and henceforth when deal-
ing with specific estimators we’ll simply assume that pointwise almost sure
convergence can be extended to uniform almost sure convergence in this way.
• The more general theorem is useful in the case that the limiting objective
function can be continuous in θ even if sn (θ) is discontinuous. This can hap-
pen because discontinuities may be smoothed out as we take expectations
14.3. EXAMPLE: CONSISTENCY OF LEAST SQUARES 282
• Considering the second term, since E(ε) = 0 and w and ε are independent,
the SLLN implies that it converges to zero.
• Finally, for the first term, for a given θ, we assume that a SLLN applies so
that
n 2 a.s. Z 2
(14.3.1) 1/n ∑ xt0 0
θ −θ → x0 θ 0 − θ dµW
t=1 W
2 Z 2 Z
= α0 − α + 2 α0 − α β0 − β wdµW + β0 − β w2 dµW
W W
2 2
= α0 − α + 2 α0 − α β0 − β E(w) + β0 − β E w 2
14.4. ASYMPTOTIC NORMALITY 283
Finally, the objective function is clearly continuous, and the parameter space is as-
sumed to be compact, so the convergence is also uniform. Thus,
2 2
s∞ (θ) = α0 − α + 2 α0 − α β0 − β E(w) + β0 − β E w2 + σ2ε
E XERCISE 21. Show that in order for the above solution to be unique it is necessary
that E(w2 ) 6= 0. Discuss the relationship between this condition and the problem of
colinearity of regressors.
This example shows that Theorem 19 can be used to prove strong consistency of
the OLS estimator. There are easier ways to show this, of course - this is only an
example of application of the theorem.
A consistent estimator is oftentimes not very useful unless we know how fast it is
likely to be converging to the true value, and the probability that it is far away from the
true value. Establishment of asymptotic normality with a known scaling factor solves
these two problems. The following theorem is similar to Amemiya’s Theorem 4.1.3
(pg. 111).
√ d
Then n θ̂ − θ0 → N 0, J∞ (θ0 )−1 I∞ (θ0 )J∞ (θ0 )−1
Dθ sn (θ̂n ) = Dθ sn (θ0 ) + D2θ sn (θ∗ ) θ̂ − θ0
• Note that θ̂ will be in the neighborhood where D2θ sn (θ) exists with probability
one as n becomes large, by consistency.
• Now the l.h.s. of this equation is zero, at least asymptotically, since θ̂n is
a maximizer and the f.o.c. must hold exactly since the limiting objective
function is strictly concave in a neighborhood of θ0 .
a.s.
• Also, since θ∗ is between θ̂n and θ0 , and since θ̂n → θ0 , assumption (b) gives
a.s.
D2θ sn (θ∗ ) → J∞ (θ0 )
So
0 = Dθ sn (θ0 ) + J∞ (θ0 ) + o p (1) θ̂ − θ0
And
√ √
0= nDθ sn (θ0 ) + J∞ (θ0 ) + o p (1) n θ̂ − θ0
Now J∞ (θ0 ) is a finite negative definite matrix, so the o p (1) term is asymptotically
irrelevant next to J∞ (θ0 ), so we can write
a √ √
0= nDθ sn (θ0 ) + J∞ (θ0 ) n θ̂ − θ0
√ a √
n θ̂ − θ0 = −J∞ (θ0 )−1 nDθ sn (θ0 )
14.4. ASYMPTOTIC NORMALITY 285
Because of assumption (c), and the formula for the variance of a linear combination of
r.v.’s,
√ d
n θ̂ − θ0 → N 0, J∞ (θ0 )−1 I∞ (θ0 )J∞ (θ0 )−1
• Assumption (b) is not implied by the Slutsky theorem. The Slutsky theorem
a.s.
says that g(xn ) → g(x) if xn → x and g(·) is continuous at x. However, the
function g(·) can’t depend on n to use this theorem. In our case J n (θn ) is a
function of n. A theorem which applies (Amemiya, Ch. 4) is
• To apply this to the second derivatives, sufficient conditions would be that the
second derivatives be strongly stochastically equicontinuous on a neighbor-
hood of θ0 , and that an ordinary LLN applies to the derivatives when evalu-
ated at θ ∈ N(θ0 ).
• Stronger conditions that imply this are as above: continuous and bounded
second derivatives in a neighborhood of θ0 .
• Skip this in lecture. A note on the order of these matrices: Supposing that
sn (θ) is representable as an average of n terms, which is the case for all es-
timators we consider, D2θ sn (θ) is also an average of n matrices, the elements
of which are not centered (they do not have zero expectation). Supposing a
SLLN applies, the almost sure limit of D2θ sn (θ0 ), J∞ (θ0 ) = O(1), as we saw in
√ d
Example 51. On the other hand, assumption (c): nDθ sn (θ0 ) → N 0, I∞ (θ0 )
means that
√
nDθ sn (θ0 ) = O p ()
14.5. EXAMPLES 286
√
where we use the result of Example 49. If we were to omit the n, we’d have
1
Dθ sn (θ0 ) = n− 2 O p (1)
1
= O p n− 2
where we use the fact that O p (nr )O p (nq ) = O p (nr+q ). The sequence Dθ sn (θ0 )
√
is centered, so we need to scale by n to avoid convergence to zero.
14.5. Examples
y∗ = x 0 β − ε
y = 1(y∗ > 0)
ε ∼ N(0, 1)
If p(x, θ) = Λ(x0 θ), we have a logit model. If p(x, θ) = Φ(x0 θ), where Φ(·) is the
standard normal distribution function, then we have a probit model.
Regardless of the parameterization, we are dealing with a Bernoulli density,
so as long as the observations are independent, the maximum likelihood (ML) estima-
tor, θ̂, is the maximizer of
1 n
sn (θ) = ∑ (yi ln p(xi, θ) + (1 − yi) ln [1 − p(xi, θ)])
n i=1
1 n
(14.5.1) ≡ ∑ s(yi, xi, θ).
n i=1
Following the above theoretical results, θ̂ tends in probability to the θ0 that maximizes
the uniform almost sure limit of sn (θ). Noting that E yi = p(xi , θ0 ), and following
a SLLN for i.i.d. processes, sn (θ) converges almost surely to the expectation of a
representative term s(y, x, θ). First one can take the expectation conditional on x to get
Ey|x {y ln p(x, θ) + (1 − y) ln[1 − p(x, θ)]} = p(x, θ0 ) ln p(x, θ)+ 1 − p(x, θ0 ) ln [1 − p(x, θ)] .
where µ(x) is the (joint - the integral is understood to be multiple, and X is the support
of x) density function of the explanatory variables x. This is clearly continuous in θ,
as long as p(x, θ) is continuous, and if the parameter space is compact we therefore
have uniform almost sure convergence. Note that p(x, θ) is continous for the logit and
probit models, for example. The maximizing element of s∞ (θ), θ∗ , solves the first
14.5. EXAMPLES 288
order conditions
Z
p(x, θ0 ) ∂ 1 − p(x, θ0 ) ∂
p(x, θ ) −
∗
p(x, θ ) µ(x)dx = 0
∗
X p(x, θ∗ ) ∂θ 1 − p(x, θ∗ ) ∂θ
√ d
n θ̂ − θ0 → N 0, J∞ (θ0 )−1 I∞ (θ0 )J∞ (θ0 )−1 .
√
In the case of i.i.d. observations I∞ (θ0 ) = limn→∞ Var nDθ sn (θ0 ) is simply the ex-
pectation of a typical element of the outer product of the gradient.
• There’s no need to subtract the mean, since it’s zero, following the f.o.c. in
the consistency proof above and the fact that observations are i.i.d.
• The terms in n also drop out by the same argument:
√ √ 1
lim Var nDθ sn (θ0 ) = lim Var nDθ ∑ s(θ0 )
n→∞ n→∞ n t
1
= lim Var √ Dθ ∑ s(θ0 )
n→∞ n t
1
= lim Var ∑ Dθ s(θ0 )
n→∞ n t
= VarDθ s(θ0 )
So we get
0 ∂ 0 ∂ 0
I∞ (θ ) = E s(y, x, θ ) 0 s(y, x, θ ) .
∂θ ∂θ
Likewise,
∂2
J∞ (θ0 ) = E s(y, x, θ0 ).
∂θ∂θ0
14.5. EXAMPLES 289
Expectations are jointly over y and x, or equivalently, first over y conditional on x, then
over x. From above, a typical element of the objective function is
s(y, x, θ0 ) = y ln p(x, θ0 ) + (1 − y) ln 1 − p(x, θ0 ) .
Now suppose that we are dealing with a correctly specified logit model:
−1
p(x, θ) = 1 + exp(−x0 θ) .
∂ −2
p(x, θ) = 1 + exp(−x0 θ) exp(−x0 θ)x
∂θ
−1 exp(−x0 θ)
= 1 + exp(−x0 θ) x
1 + exp(−x0 θ)
= p(x, θ) (1 − p(x, θ)) x
= p(x, θ) − p(x, θ)2 x.
So
∂
(14.5.3) s(y, x, θ0 ) = y − p(x, θ0 ) x
∂θ
∂2 0
0 0 2
0
s(θ ) = − p(x, θ ) − p(x, θ ) xx .
∂θ∂θ0
0
Z
(14.5.4) I∞ (θ ) = EY y2 − 2p(x, θ0 )p(x, θ0 ) + p(x, θ0 )2 xx0 µ(x)dx
Z
(14.5.5) = p(x, θ0 ) − p(x, θ0 )2 xx0 µ(x)dx.
14.5. EXAMPLES 290
Note that we arrive at the expected result: the information matrix equality holds (that
is, J∞ (θ0 ) = −I∞ (θ0 )). With this,
√ d
n θ̂ − θ0 → N 0, J∞ (θ0 )−1 I∞ (θ0 )J∞ (θ0 )−1
simplifies to
√ d
n θ̂ − θ0 → N 0, −J∞ (θ0 )−1
√ d
n θ̂ − θ0 → N 0, I∞ (θ0 )−1 .
On a final note, the logit and standard normal CDF’s are very similar - the logit dis-
tribution is a bit more fat-tailed. While coefficients will vary slightly between the two
models, functions of interest such as estimated probabilities p(x, θ̂) will be virtually
identical for the two models.
yi = h(xi , θ0 ) + εi
where
εi ∼ iid(0, σ2 )
14.5. EXAMPLES 291
1 n
θ̂n = arg min ∑ (yi − h(xi , θ))2
n i=1
We’ll study this more later, but for now it is clear that the foc for minimization will
require solving a set of nonlinear equations. A common approach to the problem seeks
to avoid this difficulty by linearizing the model. A first order Taylor’s series expansion
about the point x0 with remainder gives
∂h(x0 , θ0 )
yi = h(x0 , θ0 ) + (xi − x0 )0 + νi
∂x
where νi encompasses both εi and the Taylor’s series remainder. Note that νi is no
longer a classical error - its mean is not zero. We should expect problems.
Define
∂h(x0 , θ0 )
α∗ = h(x0 , θ0 ) − x00
∂x
∂h(x0 , θ0 )
β∗ =
∂x
yi = α + βxi + νi
u.a.s.
sn (γ) → s∞ (γ) = EX EY |X (y − α − βx)2
14.5. EXAMPLES 292
Noting that
2 2
EX EY |X y − α − x0 β = EX EY |X h(x, θ0 ) + ε − α − βx
2
= σ2 + EX h(x, θ0 ) − α − βx
since cross products involving ε drop out. α0 and β0 correspond to the hyperplane
that is closest to the true regression function h(x, θ0 ) according to the mean squared
error criterion. This depends on both the shape of h(·) and the density function of the
conditioning variables.
x
Tangent line x
β
α x x
x x
x Fitted line
x_0
• It is clear that the tangent line does not minimize MSE, since, for example, if
h(x, θ0 ) is concave, all errors between the tangent line and the true function
are negative.
14.5. EXAMPLES 293
• Note that the true underlying parameter θ0 is not estimated consistently, either
(it may be of a different dimension than the dimension of the parameter of the
approximating model, which is 2 in this example).
• Second order and higher-order approximations suffer from exactly the same
problem, though to a less severe degree, of course. For this reason, translog,
Generalized Leontiev and other “flexible functional forms” based upon second-
order approximations in general suffer from bias and inconsistency. The bias
may not be too important for analysis of conditional means, but it can be very
important for analyzing first and second derivatives. In production and con-
sumer analysis, first and second derivatives (e.g., elasticities of substitution)
are often of interest, so in this case, one should be cautious of unthinking
application of models that impose stong restrictions on second derivatives.
• This sort of linearization about a long run equilibrium is a common practice in
dynamic macroeconomic models. It is justified for the purposes of theoretical
analysis of a model given the model’s parameters, but it is not justifiable for
the estimation of the parameters of the model using data. The section on
simulation-based methods offers a means of obtaining consistent estimators
of the parameters of dynamic macro models that are too complex for standard
methods of analysis.
14.5. EXAMPLES 294
Chapter Exercises
Readings: Hamilton Ch. 14∗ ; Davidson and MacKinnon, Ch. 17 (see pg. 587 for
refs. to applications); Newey and McFadden (1994), “Large Sample Estimation and
Hypothesis Testing,” in Handbook of Econometrics, Vol. 4, Ch. 36.
15.1. Definition
We’ve already seen one example of GMM in the introduction, based upon the
χ2 distribution. Consider the following example based upon the t-distribution. The
density function of a t-distributed r.v. Yt is
0 Γ θ0 + 1 /2 −(θ0 +1)/2
fYt (yt , θ ) = 1/2
1 + yt2 /θ0
(πθ0 ) Γ (θ0 /2)
Given an iid sample of size n, one could estimate θ0 by maximizing the log-likelihood
function
n
θ̂ ≡ arg max ln Ln (θ) =
Θ
∑ ln fYt (yt , θ)
t=1
• This approach is attractive since ML estimators are asymptotically efficient.
This is because the ML estimator uses all of the available information (e.g.,
the distribution is fully specified up to a parameter). Recalling that a dis-
tribution is completely characterized by its moments, the ML estimator is
interpretable as a GMM estimator that uses all of the moments. The method
of moments estimator uses only K moments to estimate a K− dimensional
295
15.1. DEFINITION 296
2
(15.1.1) θ̂ = n
1−
∑i y2i
This estimator is based on only one moment of the distribution - it uses less information
than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient
relative to the ML estimator.
3 (θ)2 1 n
m2 (θ) = − ∑ yt4
(θ − 2) (θ − 4) n t=1
This estimator isn’t efficient either, since it uses only one moment. A GMM estimator
would use the two moment conditions together to estimate the single parameter. The
15.1. DEFINITION 297
• As before, set mn (θ) = (m1 (θ), m2 (θ))0 . The n subscript is used to indicate the
sample size. Note that m(θ0 ) = O p (n−1/2 ), since it is an average of centered
random variables, whereas m(θ) = O p (1), θ 6= θ0 , where expectations are
taken using the true distribution with parameter θ0 . This is the fundamental
reason that GMM is consistent.
• A GMM estimator requires defining a measure of distance, d (m(θ)). A pop-
ular choice (for reasons noted below) is to set d (m(θ)) = m0Wn m, and we
minimize sn (θ) = m(θ)0Wn m(θ). We assume Wn converges to a finite positive
definite matrix.
• In general, assume we have g moment conditions, so m(θ) is a g -vector and
W is a g × g matrix.
For the purposes of this course, the following definition of the GMM estimator is
sufficiently general:
estimator. Keep in mind that the true distribution is not known so if we er-
roneously specify a distribution and estimate by MLE, the estimator will be
inconsistent in general (not always).
– Feasibility: in some cases the MLE estimator is not available, because
we are not able to deduce the likelihood function. More on this in the
section on simulation-based estimation. The GMM estimator may still
be feasible even though MLE is not possible.
15.2. Consistency
We simply assume that the assumptions of Theorem 19 hold, so the GMM estima-
tor is strongly consistent. The only assumption that warrants additional comments is
that of identification. In Theorem 19, the third assumption reads: (c) Identification:
s∞ (·) has a unique global maximum at θ0 , i.e., s∞ (θ0 ) > s∞ (θ), ∀θ 6= θ0 . Taking the
case of a quadratic objective function sn (θ) = mn (θ)0Wn mn (θ), first consider mn (θ).
a.s.
• Applying a uniform law of large numbers, we get mn (θ) → m∞ (θ).
• Since Eθ0 mn (θ0 ) = 0 by assumption, m∞ (θ0 ) = 0.
• Since s∞ (θ0 ) = m∞ (θ0 )0W∞ m∞ (θ0 ) = 0, in order for asymptotic identification,
we need that m∞ (θ) 6= 0 for θ 6= θ0 , for at least some element of the vector.
a.s.
This and the assumption that Wn → W∞ , a finite positive g × g definite g × g
matrix guarantee that θ0 is asymptotically identified.
• Note that asymptotic identification does not rule out the possibility of lack of
identification for a given data set - there may be multiple minimizing solutions
in finite samples.
15.3. ASYMPTOTIC NORMALITY 299
We also simply assume that the conditions of Theorem 22 hold, so we will have
asymptotic normality. However, we do need to find the structure of the asymptotic
variance-covariance matrix of the estimator. From Theorem 22, we have
√ d
n θ̂ − θ0 → N 0, J∞ (θ0 )−1 I∞ (θ0 )J∞ (θ0 )−1
∂2 √ ∂
where J∞ (θ0 ) is the almost sure limit of 0 0
∂θ∂θ0 sn (θ) and I∞ (θ ) = limn→∞ Var n ∂θ sn (θ ).
We need to determine the form of these matrices given the objective function s n (θ) =
mn (θ)0Wn mn (θ).
Now using the product rule from the introduction,
∂ ∂ 0
sn (θ) = 2 m (θ) Wn mn (θ)
∂θ ∂θ n
∂
(15.3.1) s(θ) = 2D(θ)W m (θ) .
∂θ
(Note that sn (θ), Dn (θ), Wn and mn (θ) all depend on the sample size n, but it is omitted
to unclutter the notation).
To take second derivatives, let Di be the i− th row of D(θ). Using the product rule,
∂2 ∂
s(θ) = 2Di (θ)Wn m (θ)
∂θ0 ∂θi ∂θ0
0 0 ∂ 0
= 2DiW D + 2m W D
∂θ0 i
15.3. ASYMPTOTIC NORMALITY 300
∂2
lim sn (θ0 ) = J∞ (θ0 ) = 2D∞W∞ D0∞ , a.s.,
∂θ∂θ0
where we define lim D = D∞ , a.s., and limW = W∞ , a.s. (we assume a LLN holds).
With regard to I∞ (θ0 ), following equation 15.3.1, and noting that the scores have
mean zero at θ0 (since E m(θ0 ) = 0 by assumption), we have
√ ∂
I∞ (θ0 ) = lim Var n sn (θ0 )
n→∞ ∂θ
= lim E 4nDnWn m(θ0 )m(θ)0Wn D0n
n→∞
√ √
= lim E 4DnWn nm(θ0 ) nm(θ)0 Wn D0n
n→∞
√ d
nm(θ0 ) → N(0, Ω∞ ),
where
Ω∞ = lim E nm(θ0 )m(θ0 )0 .
n→∞
15.4. CHOOSING THE WEIGHTING MATRIX 301
√ d h −1 −1 i
n θ̂ − θ0 → N 0, D∞W∞ D0∞ D∞W∞ Ω∞W∞ D0∞ D∞W∞ D0∞ ,
the asymptotic distribution of the GMM estimator for arbitrary weighting matrix Wn .
Note that for J∞ to be positive definite, D∞ must have full row rank, ρ(D∞ ) = k.
with a much larger than b. In this case, errors in the second moment condition have
less weight in the objective function.
• Since moments are not independent, in general, we should expect that there
be a correlation between the moment conditions, so it may not be desirable
to set the off-diagonal elements to 0. W may be a random, data dependent
matrix.
15.4. CHOOSING THE WEIGHTING MATRIX 302
• We have already seen that the choice of W will influence the asymptotic dis-
tribution of the GMM estimator. Since the GMM estimator is already ineffi-
cient w.r.t. MLE, we might like to choose the W matrix to make the GMM
estimator efficient within the class of GMM estimators defined by mn (θ).
• To provide a little intuition, consider the linear model y = x0 β + ε, where
ε ∼ N(0, Ω). That is, he have heteroscedasticity and autocorrelation.
• Let P be the Cholesky factorization of Ω−1 , e.g, P0 P = Ω−1 .
• Then the model Py = PXβ + Pε satisfies the classical assumptions of ho-
moscedasticity and nonautocorrelation, since V (Pε) = PV (ε)P0 = PΩP0 =
P(P0 P)−1 P0 = PP−1 (P0 )−1 P0 = In . (Note: we use (AB)−1 = B−1 A−1 for A,
B both nonsingular). This means that the transformed model is efficient.
• The OLS estimator of the model Py = PXβ + Pε minimizes the objective
function (y−Xβ)0 Ω−1 (y−Xβ). Interpreting (y − Xβ) = ε(β) as moment con-
ditions (note that they do have zero expectation when evaluated at β 0 ), the
optimal weighting matrix is seen to be the inverse of the covariance matrix of
the moment conditions. This result carries over to GMM estimation. (Note:
this presentation of GLS is not a GMM estimator, because the number of mo-
ment conditions here is equal to the sample size, n. Later we’ll see that GLS
can be put into the GMM framework defined above).
T HEOREM 25. If θ̂ is a GMM estimator that minimizes mn (θ)0Wn mn (θ), the as-
a.s
ymptotic variance of θ̂ will be minimized by choosing Wn so that Wn → W∞ = Ω−1
∞ ,
where Ω∞ = limn→∞ E nm(θ0 )m(θ0 )0 .
−1 −1
D∞W∞ D0∞ D∞W∞ Ω∞W∞ D0∞ D∞W∞ D0∞
15.4. CHOOSING THE WEIGHTING MATRIX 303
−1
simplifies to D∞ Ω−1
∞ D∞
0 . Now, for any choice such that W∞ 6= Ω−1
∞ , consider the
difference of the inverses of the variances when W = Ω−1 versus when W is some
arbitrary positive definite matrix:
0 −1
D∞ Ω−1
∞ D 0
∞ − D ∞W ∞ D 0
∞ D ∞W∞ Ω ∞W∞ D ∞ D ∞ W∞ D 0
∞
h −1 i
1/2
= D ∞ Ω∞
−1/2
I − Ω∞ W∞ D0∞ D∞W∞ Ω∞W∞ D0∞ D∞W∞ Ω1/2
∞ Ω−1/2
∞ D0∞
√ d h i
0 −1
(15.4.1) n θ̂ − θ0 → N 0, D∞ Ω−1
∞ D ∞
∂ 0
c∞ is simply
• The obvious estimator of D ∂θ mn θ̂ , which is consistent by the
∂ 0
consistency of θ̂, assuming that ∂θ mn is continuous in θ. Stochastic equicon-
∂ 0
tinuity results can give us this result even if ∂θ mn is not continuous. We now
turn to estimation of Ω∞ .
15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 304
ance will not depend on t if the moment conditions are covariance stationary.
• contemporaneously correlated, since the individual moment conditions will
not in general be independent of one another (E (mit m jt ) 6= 0).
• and have different variances (E (m2it ) = σ2it ).
Since we need to estimate so many components if we are to take the parametric ap-
proach, it is unlikely that we would arrive at a correct parametric specification. For
this reason, research has focused on consistent nonparametric estimators of Ω ∞ .
Henceforth we assume that mt is covariance stationary (the covariance between mt
and mt−s does not depend on t). Define the v −th autocovariance of the moment condi-
tions Γv = E (mt mt−s
0 ). Note that E (m m0 ) = Γ0 . Recall that m and m are functions of
t t+s v t
θ, so for now assume that we have some consistent estimator of θ 0 , so that m̂t = mt (θ̂).
Now
" ! !#
n n
Ωn = E nm(θ0 )m(θ0 )0 = E n 1/n ∑ mt 1/n ∑ mt0
t=1 t=1
" ! !#
n n
= E 1/n ∑ mt ∑ mt0
t=1 t=1
n−1 n−2 1
= Γ0 + Γ1 + Γ01 + Γ2 + Γ02 · · · + Γn−1 + Γ0n−1
n n n
15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 305
(you might use n − v in the denominator instead). So, a natural, but inconsistent,
estimator of Ω∞ would be
c n − 1 c c0 n − 2 c c0
d d
Ω̂ = Γ0 + Γ1 + Γ1 + Γ2 + Γ2 + · · · + Γn−1 + Γn−1
0
n n
n−1
c0 + ∑ n − v Γbv + Γb0v .
= Γ
v=1 n
• Note: the formula for Ω̂ requires an estimate of m(θ0 ), which in turn requires
an estimate of θ, which is based upon an estimate of Ω! The solution to this
circularity is to set the weighting matrix W arbitrarily (for example to an
identity matrix), obtain a first consistent but inefficient estimate of θ0 , then
15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 306
use this estimate to form Ω̂, then re-estimate θ0 . The process can be iterated
until neither Ω̂ nor θ̂ change appreciably between iterations.
This estimator is p.d. by construction. The condition for consistency is that n −1/4 q →
0. Note that this is a very slow rate of growth for q. This estimator is nonparametric -
we’ve placed no parametric restrictions on the form of Ω. It is an example of a kernel
estimator.
In a more recent paper, Newey and West (Review of Economic Studies, 1994) use
pre-whitening before applying the kernel estimator. The idea is to fit a VAR model
to the moment conditions. It is expected that the residuals of the VAR model will be
more nearly white noise, so that the Newey-West covariance estimator might perform
better with short lag lengths..
The VAR model is
This is estimated, giving the residuals ût . Then the Newey-West covariance estimator is
applied to these pre-whitened residuals, and the covariance Ω is estimated combining
the fitted VAR
c1 m̂t−1 + · · · + Θ
ĉt = Θ
m cp m̂t−p
with the kernel estimate of the covariance of the ut . See Newey-West for details.
This can be factored into a conditional expectation and an expectation w.r.t. the mar-
ginal density of X :
Z Z
E Y g(X ) = Y g(X ) f (Y |X )dY f (X )dX .
X Y
E Y g(X ) = 0
as claimed.
This is important econometrically, since models often imply restrictions on condi-
tional moments. Suppose a model tells us that the function K(yt , xt ) has expectation,
15.6. ESTIMATION USING CONDITIONAL MOMENTS 308
• For example, in the context of the classical linear model yt = xt0 β + εt , we can
set K(yt , xt ) = yt so that k(xt , θ) = xt0 β.
Eθ ht (θ)|It = 0.
where Z(t,·) is the t th row of Zn . This fits the previous treatment. An interesting ques-
tion that arises is how one should choose the instrumental variables Z(wt ) to achieve
maximum efficiency.
15.6. ESTIMATION USING CONDITIONAL MOMENTS 310
∂ 0
Note that with this choice of moment conditions, we have that Dn ≡ ∂θ m (θ) (a
K × g matrix) is
∂ 1 0 0
Dn (θ) = Zn hn (θ)
∂θ n
1 ∂ 0
= h (θ) Zn
n ∂θ n
Ωn = E nmn (θ0 )mn (θ0 )0
1 0 0 0 0
= E Zn hn (θ )hn (θ ) Zn
n
0 1 0 0 0
= Zn E hn (θ )hn (θ ) Zn
n
Φn
≡ Zn0 Zn
n
where we have defined Φn = Varhn (θ0 ). Note that the dimension of this matrix is
growing with the sample size, so it is not consistently estimable without additional
assumptions.
The asymptotic normality theorem above says that the GMM estimator using the
optimal weighting matrix is distributed as
√ d
n θ̂ − θ0 → N(0,V∞ )
15.6. ESTIMATION USING CONDITIONAL MOMENTS 311
where
−1 !−1
Hn Zn Zn0 Φn Zn Zn0 Hn0
(15.6.1) V∞ = lim .
n→∞ n n n
Zn = Φ−1 0
n Hn
and furthermore, this matrix is smaller that the limiting var-cov for any other choice
of instrumental variables. (To prove this, examine the difference of the inverses of the
var-cov matrices with the optimal intruments and with non-optimal instruments. As
above, you can show that the difference is positive semi-definite).
• Note that both Hn , which we should write more properly as Hn (θ0 ), since it
depends on θ0 , and Φ must be consistently estimated to apply this.
• Usually, estimation of Hn is straightforward - one just uses
b = ∂ h0n θ̃ ,
H
∂θ
Note that dynamic moment conditions simplify the var-cov matrix, but are often
harder to formulate. The will be added in future editions. For now, the Hansen appli-
cation below is enough.
The first order conditions for minimization, using the an estimate of the optimal
weighting matrix, are
∂ ∂ 0 −1
s(θ̂) = 2 mn θ̂ Ω̂ mn θ̂ ≡ 0
∂θ ∂θ
or
D(θ̂)Ω̂−1 mn (θ̂) ≡ 0
(15.8.1) m(θ̂) = mn (θ0 ) + D0n (θ0 ) θ̂ − θ0 + o p (1).
15.8. A SPECIFICATION TEST 313
D(θ̂)Ω̂−1 m(θ̂) = D(θ̂)Ω̂−1 mn (θ0 ) + D(θ̂)Ω̂−1 D(θ0 )0 θ̂ − θ0 + o p (1)
The lhs is zero, and since θ̂ tends to θ0 and Ω̂ tends to Ω∞ , we can write
0 a 0
D∞ Ω−1
∞ mn (θ ) = −D∞ Ω∞ D∞ θ̂ − θ
−1 0
or
√ a √
0 −1
n θ̂ − θ0 = − n D∞ Ω−1
∞ D∞ D∞ Ω−1 0
∞ mn (θ )
With this, and taking into account the original expansion (equation 15.8.1), we get
√ a √ √
0 −1
nm(θ̂) = nmn (θ0 ) − nD0∞ D∞ Ω−1
∞ D∞ D∞ Ω−1 0
∞ mn (θ ).
√
a √ 1/2 0 −1
nm(θ̂) = n Ω∞ − D0∞ D∞ Ω−1
∞ D ∞ D Ω
∞ ∞
−1/2
Ω−1/2
∞ mn (θ0 )
Or
√ −1/2
a √ −1 0 −1
nΩ∞ m(θ̂) = n Ig − Ω∞ D∞ D∞ Ω∞ D∞
−1/2 0
D∞ Ω∞
−1/2
Ω−1/2
∞ mn (θ0 )
Now
√ −1/2 d
nΩ∞ mn (θ0 ) → N(0, Ig )
d
nm(θ̂)0 Ω̂−1 m(θ̂) → χ2 (g − K)
or
d
n · sn (θ̂) → χ2 (g − K)
supposing the model is correctly specified. This is a convenient test since we just
multiply the optimized value of the objective function by n, and compare with a χ 2 (g −
K) critical value. The test is a general test of whether or not the moments used to
estimate are correctly specified.
• This won’t work when the estimator is just identified. The f.o.c. are
But with exact identification, both D and Ω̂ are square and invertible (at least
asymptotically, assuming that asymptotic normality hold), so
m(θ̂) ≡ 0.
So the moment conditions are zero regardless of the weighting matrix used.
As such, we might as well use an identity matrix and save trouble. Also
sn (θ̂) = 0, so the test breaks down.
• A note: this sort of test often over-rejects in finite samples. One should be
cautious in rejecting a model when this test rejects.
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 315
mt (β) = xt yt − xt0 β .
For any choice of W, m(β) will be identically zero at the minimum, due to exact iden-
tification. That is, since the number of moment conditions is identical to the number
of parameters, the foc imply that m(β̂) ≡ 0 regardless of W. There is no need to use the
“optimal” weighting matrix in this case, an identity matrix works just as well for the
purpose of estimation. Therefore
−1
β̂ = ∑ xt xt0 ∑ xt yt = (X0X)−1X0y,
t t
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 316
n−1
Ω̂ = Γ0 + ∑ Γv + Γv .
c b b0
v=1
X0 ÊX
=
n
This is the varcov estimator that White (1980) arrived at in an influential article. This
estimator is consistent under heteroscedasticity of an unknown form. If there is au-
tocorrelation, the Newey-West estimator can be used to estimate Ω - the rest is the
same.
15.9.2. Weighted Least Squares. Consider the previous example of a linear model
with heteroscedasticity of unknown form:
y = Xβ0 + ε
ε ∼ N(0, Σ)
−1
β̃ = X0 Σ−1 X X0 Σ−1 y)
xt yt xt xt0
m(β̃) = 1/n ∑ − 1/n ∑ σt (θ0) β̃ ≡ 0.
t σt (θ0 ) t
That is, the GLS estimator in this case has an obvious representation as a GMM estima-
tor. With autocorrelation, the representation exists but it is a little more complicated.
Nevertheless, the idea is the same. There are a few points:
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 318
yt = zt0 β + εt ,
or
y = Zβ + ε
using the usual construction, where β is K × 1 and εt is i.i.d. Suppose that this equation
is one of a system of simultaneous equations, so that zt contains both endogenous and
exogenous variables. Suppose that xt is the vector of all exogenous and predetermined
variables that are uncorrelated with εt (suppose that xt is r × 1).
m(β) = 1/n ∑ ẑt yt − zt0 β .
t
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 319
This is the standard formula for 2SLS. We use the exogenous variables and the reduced
form predictions of the endogenous variables as instruments, and apply IV estimation.
See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for
2SLS), and for how to deal with εt heterogeneous and dependent (basically, just use the
Newey-West or some other consistent estimator of Ω, and apply the usual formula).
Note that εt dependent causes lagged endogenous variables to loose their status as
legitimate instruments.
or in compact notation
yt = f (zt , θ0 ) + εt ,
We need to find an Ai × 1 vector of instruments xit , for each equation, that are
uncorrelated with εit . Typical instruments would be low order monomials in the ex-
ogenous variables in zt , with their lagged values. Then we can define the ∑G
i=1 Ai × 1
orthogonality conditions
(y1t − f1 (zt , θ1 )) x1t
(y2t − f2 (zt , θ2 )) x2t
mt (θ) = .. .
.
(yGt − fG (zt , θG )) xGt
variables have been selected to take advantage of all useful information). The likeli-
hood function is the joint density of the sample:
Define
mt (Yt , θ) ≡ Dθ ln f (yt |Yt−1 , θ)
as the score of the t th observation. It can be shown that, under the regularity condi-
tions, that the scores have conditional mean zero when evaluated at θ 0 (see notes to
Introduction to Econometrics):
which are precisely the first order conditions of MLE. Therefore, MLE can be inter-
−1
preted as a GMM estimator. The GMM varcov formula is V∞ = D∞ Ω−1 D0∞ .
Consistent estimates of variance components are as follows
• D∞
n
c∞ = ∂ m(Yt , θ̂) = 1/n ∑ D2θ ln f (yt |Yt−1 , θ̂)
D
∂θ0 t=1
• Ω
It is important to note that mt and mt−s , s > 0 are both conditionally and
unconditionally uncorrelated. Conditional uncorrelation follows from the fact
that mt−s is a function of Yt−s , which is in the information set at time t. Un-
conditional uncorrelation follows from the fact that conditional uncorrelation
hold regardless of the realization of Yt−1 , so marginalizing with respect to
Yt−1 preserves uncorrelation (see the section on ML estimation, above). The
fact that the scores are serially uncorrelated implies that Ω can be estimated
by the estimator of the 0th autocovariance of the moment conditions:
n n
b = 1/n ∑ mt (Yt , θ̂)mt (Yt , θ̂)0 = 1/n ∑ Dθ ln f (yt |Yt−1 , θ̂) Dθ ln f (yt |Yt−1 , θ̂) 0
Ω
t=1 t=1
Recall from study of ML estimation that the information matrix equality (equation ??)
states that
n 0 o
E Dθ ln f (yt |Yt−1 , θ0 ) Dθ ln f (yt |Yt−1 , θ0 ) = −E D2θ ln f (yt |Yt−1 , θ0 ) .
This result implies the well known (and already seeen) result that we can estimate V∞
in any of three ways:
15.10. EXAMPLE: THE HAUSMAN TEST 323
• or the inverse of the negative of the Hessian (since the middle and last term
cancel, except for a minus sign):
" #−1
n
∞ = −1/n ∑ Dθ ln f (yt |Yt−1 , θ̂)
c 2
V ,
t=1
• or the inverse of the outer product of the gradient (since the middle and last
cancel except for a minus sign, and the first term converges to minus the
inverse of the middle term, which is still inside the overall inverse)
( )−1
n 0
c
V ∞ = 1/n ∑ Dθ ln f (yt |Yt−1 , θ̂) Dθ ln f (yt |Yt−1 , θ̂) .
t=1
This simplification is a special result for the MLE estimator - it doesn’t apply to GMM
estimators in general.
Asymptotically, if the model is correctly specified, all of these forms converge to
the same limit. In small samples they will differ. In particular, there is evidence that the
outer product of the gradient formula does not perform very well in small samples (see
Davidson and MacKinnon, pg. 477). White’s Information matrix test (Econometrica,
1982) is based upon comparing the two ways to estimate the information matrix: outer
product of gradient or negative of the Hessian. If they differ by too much, this is
evidence of misspecification of the model.
This section discusses the Hausman test, which was originally presented in Haus-
man, J.A. (1978), Specification tests in econometrics, Econometrica, 46, 1251-71.
15.10. EXAMPLE: THE HAUSMAN TEST 324
Consider the simple linear regression model yt = xt0 β + εt . We assume that the func-
tional form and the choice of regressors is correct, but that the some of the regressors
may be correlated with the error term, which as you know will produce inconsistency
of β̂. For example, this will be a problem if
To illustrate, the Octave program biased.m performs a Monte Carlo experiment where
errors are correlated with regressors, and estimation is by OLS and IV. Figure 15.10.1
shows that the OLS estimator is quite biased, while Figure 15.10.2 shows that the IV
estimator is on average much closer to the true value. If you play with the program,
increasing the sample size, you can see evidence that the OLS estimator is asymptoti-
cally biased, while the IV estimator is consistent.
We have seen that inconsistent and the consistent estimators converge to different
probability limits. This is the idea behind the Hausman test - a pair of consistent esti-
mators converge to the same probability limit, while if one is consistent and the other
is not they converge to different limits. If we accept that one is consistent (e.g., the
IV estimator), but we are doubting if the other is consistent (e.g., the OLS estimator),
we might try to check if the difference between the estimators is significantly different
from zero.
• If we’re doubting about the consistency of OLS (or QML, etc.), why should
we be interested in testing - why not just use the IV estimator? Because the
OLS estimator is more efficient when the regressors are exogenous and the
other classical assumptions (including normality of the errors) hold. When
we have a more efficient estimator that relies on stronger assumptions (such
15.10. EXAMPLE: THE HAUSMAN TEST 325
0.12
0.1
0.08
0.06
0.04
0.02
0
2.26 2.28 2.3 2.32 2.34 2.36 2.38 2.4
F IGURE 15.10.2. IV
IV estimates
0.16
line 1
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
1.85 1.9 1.95 2 2.05 2.1 2.15
15.10. EXAMPLE: THE HAUSMAN TEST 326
as exogeneity) than the IV estimator, we might prefer to use it, unless we have
evidence that the assumptions are false.
So, let’s consider the covariance between the MLE estimator θ̂ (or any other fully
efficient estimator) and some other CAN estimator, say θ̃. Now, let’s recall some
results from MLE. Equation 4.4.1 is:
√ a.s. √
n θ̂ − θ0 → −H∞ (θ0 )−1 ng(θ0 ).
Equation 4.6.2 is
H∞ (θ) = −I∞ (θ).
√ a.s. √
n θ̂ − θ0 → I∞ (θ0 )−1 ng(θ0 ).
Also, equation 4.7.1 tells us that the asymptotic covariance between any CAN
estimator and the MLE score vector is
√
n θ̃ − θ V (θ̃) IK
V∞ √ = ∞ .
ng(θ) IK I∞ (θ)
Now, consider
√ √
IK 0K n θ̃ − θ n θ̃ − θ
a.s.
→ √
√ .
0K I∞ (θ) −1 ng(θ) n θ̂ − θ
15.10. EXAMPLE: THE HAUSMAN TEST 327
V∞ (θ̃) I∞ (θ)−1
= ,
I∞ (θ) −1 I∞ (θ) −1
So, the asymptotic covariance between the MLE and any other CAN estimator is equal
to the MLE asymptotic variance (the inverse of the information matrix).
Now, suppose we with to test whether the the two estimators are in fact both con-
verging to θ0 , versus the alternative hypothesis that the ”MLE” estimator is not in fact
consistent (the consistency of θ̃ is a maintained hypothesis). Under the null hypothesis
that they are, we have
√
h i n θ̃ − θ0 √
= n θ̃ − θ̂ ,
IK −IK
√
n θ̂ − θ0
√ d
n θ̃ − θ̂ → N 0,V∞ (θ̃) −V∞ (θ̂) .
So,
0 −1 d
n θ̃ − θ̂ V∞ (θ̃) −V∞ (θ̂) θ̃ − θ̂ → χ2 (ρ),
15.10. EXAMPLE: THE HAUSMAN TEST 328
where ρ is the rank of the difference of the asymptotic variances. A statistic that has
the same asymptotic distribution is
0 −1 d
θ̃ − θ̂ V̂ (θ̃) − V̂ (θ̂) θ̃ − θ̂ → χ2 (ρ).
This is the Hausman test statistic, in its original form. The reason that this test has
power under the alternative hypothesis is that in that case the ”MLE” estimator will
not be consistent, and will converge to θA , say, where θA 6= θ0 . Then the mean of the
√
asymptotic distribution of vector n θ̃ − θ̂ will be θ0 − θA , a non-zero vector, so the
test statistic will eventually reject, regardless of how small a significance level is used.
• Note: if the test is based on a sub-vector of the entire parameter vector of the
MLE, it is possible that the inconsistency of the MLE will not show up in the
portion of the vector that has been used. If this is the case, the test may not
have power to detect the inconsistency. This may occur, for example, when
the consistent but inefficient estimator is not identified for all the parameters
of the model.
• The rank, ρ, of the difference of the asymptotic variances is often less than
the dimension of the matrices, and it may be difficult to determine what the
true rank is. If the true rank is lower than what is taken to be true, the test will
be biased against rejection of the null hypothesis. The contrary holds if we
underestimate the rank.
• A solution to this problem is to use a rank 1 test, by comparing only a single
coefficient. For example, if a variable is suspected of possibly being endoge-
nous, that variable’s coefficients may be compared.
15.10. EXAMPLE: THE HAUSMAN TEST 329
• This simple formula only holds when the estimator that is being tested for
consistency is fully efficient under the null hypothesis. This means that it
must be a ML estimator or a fully efficient estimator that has the same asymp-
totic distribution as the ML estimator. This is quite restrictive since modern
estimators such as GMM and QML are not in general fully efficient.
Following up on this last point, let’s think of two not necessarily efficient estimators,
θ̂1 and θ̂2 , where one is assumed to be consistent, but the other may not be. We assume
for expositional simplicity that both θ̂1 and θ̂2 belong to the same parameter space, and
that they can be expressed as generalized method of moments (GMM) estimators. The
estimators are defined (suppressing the dependence upon data) by
The standard Hausman test is equivalent to a Wald test of the equality of θ 1 and θ2 (or
subvectors of the two) applied to the omnibus GMM estimator, but with the covariance
15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 330
While this is clearly an inconsistent estimator in general, the omitted Σ12 term cancels
out of the test statistic when one of the estimators is asymptotically efficient, as we
have seen above, and thus it need not be estimated.
The general solution when neither of the estimators is efficient is clear: the entire Σ
matrix must be estimated consistently, since the Σ12 term will not cancel out. Methods
for consistently estimating the asymptotic covariance of a vector of moment conditions
are well-known, e.g., the Newey-West estimator discussed previously. The Hausman
test using a proper estimator of the overall covariance matrix will now have an asymp-
totic χ2 distribution when neither estimator is efficient. However, the test suffers from
a loss of power due to the fact that the omnibus GMM estimator of equation 15.10.1
is defined using an inefficient weight matrix. A new test can be defined by using an
alternative omnibus GMM estimator
h i −1 m1 (θ1 )
(15.10.3) θ̂1 , θ̂2 = arg min m1 (θ1 )0 m2 (θ2 )0 Σ
e ,
Θ×Θ
m2 (θ2 )
pt+1 + dt+1
(1 + rt+1 ) =
pt
• Future net rates of return rt+s , s > 0 are not known in period t: the asset is
risky.
A partial set of necessary conditions for utility maximization have the form:
(15.11.2) u0 (ct ) = βE (1 + rt+1 ) u0 (ct+1 )|It .
To see that the condition is necessary, suppose that the lhs < rhs. Then by reducing
current consumption marginally would cause equation 15.11.1 to drop by u0 (ct ), since
there is no discounting of the current period. At the same time, the marginal reduc-
tion in consumption finances investment, which has gross return (1 + rt+1 ) , which
could finance consumption in period t + 1. This increase in consumption would cause
the objective function to increase by βE {(1 + rt+1 ) u0 (ct+1 )|It } . Therefore, unless the
condition holds, the expected discounted utility function is not maximized.
• To use this we need to choose the functional form of utility. A constant rela-
tive risk aversion form is
1−γ
ct −1
u(ct ) =
1−γ
−γ
u0 (ct ) = ct
so that we could use this to define moment conditions, it is unlikely that ct is stationary,
even though it is in real terms, and our theory requires stationarity. To solve this, divide
−γ
though by ct ( −γ )!
ct+1
E 1-β (1 + rt+1 ) |It = 0
ct
(note that ct can be passed though the conditional expectation since ct is chosen based
only upon information available in time t).
Now ( −γ )
ct+1
1-β (1 + rt+1 )
ct
is analogous to ht (θ) defined above: it’s a scalar moment condition. To get a vector of
moment conditions we need some instruments. Suppose that zt is a vector of variables
drawn from the information set It . We can use the necessary conditions to form the
expressions
−γ
ct+1
1 − β (1 + rt+1 ) ct zt ≡ mt (θ)
• θ represents β and γ.
• Therefore, the above expression may be interpreted as a moment condition
which can be used for GMM estimation of the parameters θ0 .
Note that at time t, mt−s has been observed, and is therefore an element of the infor-
mation set. By rational expectations, the autocovariances of the moment conditions
other than Γ0 should be zero. The optimal weighting matrix is therefore the inverse of
the variance of the moment conditions:
Ω∞ = lim E nm(θ0 )m(θ0 )0
15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 334
This process can be iterated, e.g., use the new estimate to re-estimate Ω, use this to
estimate θ0 , and repeat until the estimates don’t change.
******************************************************
Example of GMM estimation of rational expectations model
Value df p-value
X^2 test 0.001 1.000 0.971
******************************************************
Example of GMM estimation of rational expectations model
Value df p-value
X^2 test 3.523 3.000 0.318
Pretty clearly, the results are sensitive to the choice of instruments. Maybe there
is some problem here: poor instruments, or possibly a conditional moment that is not
very informative.
15.12. EMPIRICAL EXAMPLE: A PORTFOLIO MODEL 337
Exercises
(1) Show how to cast the generalized IV estimator presented in section 11.4 as
a GMM estimator. Identify what are the moment conditions, mt (θ), what is
the form of the the matrix Dn , what is the efficient weight matrix, and show
that the covariance matrix formula given previously corresponds to the GMM
covariance matrix formula.
(2) Using Octave, generate data from the logit dgp . Recall that E(yt |xt ) =
p(xt , θ) = [1 + exp(−xt 0θ)]−1 . Consider the moment condtions (exactly iden-
tified) mt (θ) = [yt − p(xt , θ)]xt
(a) Estimate by GMM, using these moments.
(b) Estimate by MLE.
(c) The two estimators should coincide. Prove analytically that the estima-
tors coicide.
(3) Verify the missing steps needed to show that n · m(θ̂)0 Ω̂−1 m(θ̂) has a χ2 (g −
K) distribution. That is, show that the monster matrix is idempotent and has
trace equal to g − K.
(4) For the portfolio example, experiment with the program using lags of 3 and 4
periods to define instruments
(a) Iterate the estimation of θ = (β, γ) and Ω to convergence.
(b) Comment on the results. Are the results sensitive to the set of instruments
used? (Look at Ω̂ as well as θ̂. Are these good instruments? Are the
instruments highly correlated with one another?
CHAPTER 16
Quasi-ML
338
16. QUASI-ML 339
1 1 n
sn (ρ) = ln L(Y|X, ρ) = ∑ ln pt (ρ)
n n t=1
The QML estimator is the argument that maximizes the misspecified average log like-
lihood, which we refer to as the quasi-log likelihood function. This objective function
is
1 n
sn (θ) = ∑
n t=1
ln ft (yt |Yt−1 , Xt , θ0 )
1 n
≡ ∑ ln ft (θ)
n t=1
a.s. 1 n
sn (θ) → lim E ∑ ln ft (θ) ≡ s∞ (θ)
n→∞ n
t=1
16.1. CONSISTENT ESTIMATION OF VARIANCE COMPONENTS 340
We assume that this can be strengthened to uniform convergence, a.s., following the
previous arguments. The “pseudo-true” value of θ is the value that maximizes s̄(θ):
√ d
n θ̂ − θ0 → N 0, J∞ (θ0 )−1 I∞ (θ0 )J∞ (θ0 )−1
where
J∞ (θ0 ) = lim E D2θ sn (θ0 )
n→∞
and
√
I∞ (θ0 ) = lim Var nDθ sn (θ0 ).
n→∞
• Note that asymptotic normality only requires that the additional assumptions
regarding J and I hold in a neighborhood of θ0 for J and at θ0 , for I , not
throughout Θ. In this sense, asymptotic normality is a local property.
1 n 2 a.s. 1 n 2
Jn (θ̂n ) = ∑ θ t n n→∞ n ∑ Dθ ln ft (θ0) = J∞(θ0).
n t=1
D ln f ( θ̂ ) → lim E
t=1
That is, just calculate the Hessian using the estimate θ̂n in place of θ0 .
Consistent estimation of I∞ (θ0 ) is more difficult, and may be impossible.
16.1. CONSISTENT ESTIMATION OF VARIANCE COMPONENTS 341
We need to estimate
√
I∞ (θ0 ) = lim Var nDθ sn (θ0 )
n→∞
√ 1 n
= lim Var n ∑ Dθ ln ft (θ0 )
n→∞ n t=1
n
1
= lim Var ∑ gt
n→∞ n
t=1
( ! !0 )
n n
1
= lim E
n→∞ n
∑ (gt − E gt ) ∑ (gt − E gt )
t=1 t=1
1 n
lim
n→∞ n
∑ (E gt ) (E gt )0
t=1
which will not tend to zero, in general. This term is not consistently estimable in
general, since it requires calculating an expectation using the true density under the
d.g.p., which is unknown.
• There are important cases where I∞ (θ0 ) is consistently estimable. For exam-
ple, suppose that the data come from a random sample (i.e., they are iid). This
would be the case with cross sectional data, for example. (Note: under i.i.d.
sampling, the joint distribution of (yt , xt ) is identical. This does not imply that
the conditional density f (yt |xt ) is identical).
• With random sampling, the limiting objective function is simply
s∞ (θ0 ) = EX E0 ln f (y|x, θ0 )
Dθ EX E0 ln f (y|x, θ0 ) = EX E0 Dθ ln f (y|x, θ0 ) = 0
1 n d
√ ∑ Dθ ln f (y|x, θ0 ) → N(0, I∞ (θ0 )).
n t=1
That is, it’s not necessary to subtract the individual means, since they are zero.
Given this, and due to independent observations, a consistent estimator is
1 n
I = ∑ Dθ ln ft (θ̂)Dθ0 ln ft (θ̂)
b
n t=1
This is an important case where consistent estimation of the covariance matrix is pos-
sible. Other cases exist, even for dynamically misspecified time series models.
To check the plausibility of the Poisson model for the MEPS data, we can compare
the sample unconditional variance with the estimated unconditional variance accord-
n λ̂
∑t=1
ing to the Poisson model: Vd
(y) = n .
t
Using the program PoissonVariance.m, for
OBDV and ERV, we get We see that even after conditioning, the overdispersion is not
TABLE 1. Marginal Variances, Sample and Estimated (Poisson)
OBDV ERV
Sample 38.09 0.151
Estimated 3.28 0.086
16.2. EXAMPLE: THE MEPS DATA 343
captured in either case. There is huge problem with OBDV, and a significant problem
with ERV. In both cases the Poisson model does not appear to be plausible. You can
check this for the other use measures if you like.
16.2.1. Infinite mixture models: the negative binomial model. Reference: Cameron
and Trivedi (1998) Regression analysis of count data, chapter 4.
The two measures seem to exhibit extra-Poisson variation. To capture unobserved
heterogeneity, a possibility is the random parameters approach. Consider the possibil-
ity that the constant term in a Poisson model were random:
exp(−θ)θy
fY (y|x, ε) =
y!
θ = exp(x0β + ε)
= exp(x0β) exp(ε)
= λν
where λ = exp(x0β) and ν = exp(ε). Now ν captures the randomness in the constant.
The problem is that we don’t observe ν, so we will need to marginalize it to get a
usable density
exp[−θ]θy
Z ∞
fY (y|x) = fv (z)dz
−∞ y!
This density can be used directly, perhaps using numerical integration to evaluate the
likelihood function. In some cases, though, the integral will have an analytic solution.
For example, if ν follows a certain one parameter gamma density, then
ψ y
Γ(y + ψ) ψ λ
(16.2.1) fY (y|x, φ) =
Γ(y + 1)Γ(ψ) ψ+λ ψ+λ
where φ = (λ, ψ). ψ appears since it is the parameter of the gamma density.
So both forms of the NB model allow for overdispersion, with the NB-II model allow-
ing for a more radical form.
Testing reduction of a NB model to a Poisson model cannot be done by testing
α = 0 using standard Wald or LR procedures. The critical values need to be adjusted to
account for the fact that α = 0 is on the boundary of the parameter space. Without get-
ting into details, suppose that the data were in fact Poisson, so there is equidispersion
and the true α = 0. Then about half the time the sample data will be underdispersed,
and about half the time overdispersed. When the data is underdispersed, the MLE of α
will be α̂ = 0. Thus, under the null, there will be a probability spike in the asymptotic
p √
distribution of n(α̂ − α) = nα̂ at 0, so standard testing methods will not be valid.
This program will do estimation using the NB model. Note how modelargs is used
to select a NB-I or NB-II density. Here are NB-I estimation results for OBDV:
OBDV
======================================================
BFGSMIN final results
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
16.2. EXAMPLE: THE MEPS DATA 345
******************************************************
Negative Binomial model, MEPS 1996 full data set
Information Criteria
CAIC : 20026.7513 Avg. CAIC: 4.3880
BIC : 20018.7513 Avg. BIC: 4.3862
AIC : 19967.3437 Avg. AIC: 4.3750
******************************************************
16.2. EXAMPLE: THE MEPS DATA 346
Note that the parameter values of the last BFGS iteration are different that those
reported in the final results. This reflects two things - first, the data were scaled be-
fore doing the BFGS minimization, but the mle_results script takes this into ac-
count and reports the results using the original scaling. But also, the parameterization
α = exp(α∗ ) is used to enforce the restriction that α > 0. The unrestricted parameter
α∗ = log α is used to define the log-likelihood function, since the BFGS minimiza-
tion algorithm does not do contrained minimization. To get the standard error and
t-statistic of the estimate of α, we need to use the delta method. This is done inside
mle_results, making use of the function parameterize.m .
Likewise, here are NB-II results:
OBDV
======================================================
BFGSMIN final results
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value 2.18496
Stepsize 0.0104394
13 iterations
------------------------------------------------------
******************************************************
Negative Binomial model, MEPS 1996 full data set
Information Criteria
CAIC : 20019.7439 Avg. CAIC: 4.3864
BIC : 20011.7439 Avg. BIC: 4.3847
AIC : 19960.3362 Avg. AIC: 4.3734
******************************************************
• For the OBDV usage measurel, the NB-II model does a slightly better job than
the NB-I model, in terms of the average log-likelihood and the information
criteria (more on this last in a moment).
• Note that both versions of the NB model fit much better than does the Poisson
model (see 13.4.2).
• The estimated α is highly significant.
To check the plausibility of the NB-II model, we can compare the sample uncon-
ditional variance with the estimated unconditional variance according to the NB-II
16.2. EXAMPLE: THE MEPS DATA 348
n λ̂ +α̂ λ̂ 2
∑t=1 ( t)
model: Vd
t
(y) = n . For OBDV and ERV (estimation results not reported),
we get For OBDV, the overdispersion problem is significantly better than in the Pois-
OBDV ERV
Sample 38.09 0.151
Estimated 30.58 0.182
son case, but there is still some that is not captured. For ERV, the negative binomial
model seems to capture the overdispersion adequately.
16.2.2. Finite mixture models: the mixed negative binomial model. The finite
mixture approach to fitting health care demand was introduced by Deb and Trivedi
(1997). The mixture approach has the intuitive appeal of allowing for subgroups of
the population with different health status. If individuals are classified as healthy or
unhealthy then two subgroups are defined. A finer classification scheme would lead to
more subgroups. Many studies have incorporated objective and/or subjective indica-
tors of health status in an effort to capture this heterogeneity. The available objective
measures, such as limitations on activity, are not necessarily very informative about a
person’s overall health status. Subjective, self-reported measures may suffer from the
same problem, and may also not be exogenous
Finite mixture models are conceptually simple. The density is
p−1
∑ πi fY
(i) p
fY (y, φ1 , ..., φ p, π1 , ..., π p−1) = (y, φi ) + π p fY (y, φ p ),
i=1
p−1 p
where πi > 0, i = 1, 2, ..., p, π p = 1 − ∑i=1 πi , and ∑i=1 πi = 1. Identification requires
that the πi are ordered in some way, for example, π1 ≥ π2 ≥ · · · ≥ π p and φi 6= φ j , i 6= j.
This is simple to accomplish post-estimation by rearrangement and possible elimina-
tion of redundant component densities.
16.2. EXAMPLE: THE MEPS DATA 349
The following results are for a mixture of 2 NB-II models, for the OBDV data, which
you can replicate using this program .
OBDV
******************************************************
Mixed Negative Binomial model, MEPS 1996 full data set
Information Criteria
CAIC : 19920.3807 Avg. CAIC: 4.3647
BIC : 19903.3807 Avg. BIC: 4.3610
AIC : 19794.1395 Avg. AIC: 4.3370
******************************************************
It is worth noting that the mixture parameter is not significantly different from zero,
but also not that the coefficients of public insurance and age, for example, differ quite
a bit between the two latent classes.
16.2.3. Information criteria. As seen above, a Poisson model can’t be tested (us-
ing standard methods) as a restriction of a negative binomial model. But it seems,
based upon the values of the likelihood functions and the fact that the NB model fits
the variance much better, that the NB model is more appropriate. How can we deter-
mine which of a set of competing models is the best?
16.2. EXAMPLE: THE MEPS DATA 351
The information criteria approach is one possibility. Information criteria are func-
tions of the log-likelihood, with a penalty for the number of parameters used. Three
popular information criteria are the Akaike (AIC), Bayes (BIC) and consistent Akaike
(CAIC). The formulae are
BIC = −2 ln L(θ̂) + k ln n
AIC = −2 ln L(θ̂) + 2k
It can be shown that the CAIC and BIC will select the correctly specified model from a
group of models, asymptotically. This doesn’t mean, of course, that the correct model
is necesarily in the group. The AIC is not consistent, and will asymptotically favor
an over-parameterized model over the correctly specified model. Here are information
criteria values for the models we’ve seen, for OBDV. Pretty clearly, the NB models
are better than the Poisson. The one additional parameter gives a very significant
improvement in the likelihood function value. Between the NB-I and NB-II models,
the NB-II is slightly favored. But one should remember that information criteria values
are statistics, with variances. With another sample, it may well be that the NB-I model
would be favored, since the differences are so small. The MNB-II model is favored
over the others, by all 3 information criteria.
16.2. EXAMPLE: THE MEPS DATA 352
Why is all of this in the chapter on QML? Let’s suppose that the correct model for
OBDV is in fact the NB-II model. It turns out in this case that the Poisson model will
give consistent estimates of the slope parameters (if a model is a member of the linear-
exponential family and the conditional mean is correctly specified, then the parame-
ters of the conditional mean will be consistently estimated). So the Poisson estimator
would be a QML estimator that is consistent for some parameters of the true model.
The ordinary OPG or inverse Hessinan ”ML” covariance estimators are however biased
and inconsistent, since the information matrix equality does not hold for QML estima-
tors. But for i.i.d. data (which is the case for the MEPS data) the QML asymptotic
covariance can be consistently estimated, as discussed above, using the sandwich form
for the ML estimator. mle_results in fact reports sandwich results, so the Poisson
estimation results would be reliable for inference even if the true model is the NB-I or
NB-II. Not that they are in fact similar to the results for the NB models.
However, if we assume that the correct model is the MNB-II model, as is favored by
the information criteria, then both the Poisson and NB-x models will have misspecified
mean functions, so the parameters that influence the means would be estimated with
bias and inconsistently.
EXERCISES 353
Exercises
Exercises
(1) Considering the MEPS data (the description is in Section 13.4.2), for the OBDV
(y) measure, let η be a latent index of health status that has expectation equal to
unity.1 We suspect that η and PRIV may be correlated, but we assume that η is
uncorrelated with the other regressors. We assume that
y ∼ Poisson(λ)
Since much previous evidence indicates that health care services usage is overdis-
persed2, this is almost certainly not an ML estimator, and thus is not efficient.
However, when η and PRIV are uncorrelated, this estimator is consistent for the β i
parameters, since the conditional mean is correctly specified in that case. When η
and PRIV are correlated, Mullahy’s (1997) NLIV estimator that uses the residual
function
y
ε= − 1,
λ
Nonlinear least squares (NLS) is a means of estimating the parameter of the model
yt = f (xt , θ0 ) + εt .
εt ∼ iid(0, σ2 )
and
ε = (ε1 , ε2 , ..., εn )0
y = f(θ) + ε
355
17.1. INTRODUCTION AND DEFINITION 356
1 1
θ̂ ≡ arg min sn (θ) = [y − f(θ)]0 [y − f(θ)] = k y − f(θ) k2
Θ n n
• The estimator minimizes the weighted sum of squared errors, which is the
same as minimizing the Euclidean distance between y and f(θ).
1 0
sn (θ) = y y − 2y0 f(θ) + f(θ)0 f(θ) ,
n
which gives the first order conditions
∂ ∂
− f(θ̂) y +
0
f(θ̂) f(θ̂) ≡ 0.
0
∂θ ∂θ
Define the n × K matrix
In shorthand, use F̂ in place of F(θ̂). Using this, the first order conditions can be written
as
−F̂0 y + F̂0 f(θ̂) ≡ 0,
or
(17.1.2) F̂0 y − f(θ̂) ≡ 0.
This bears a good deal of similarity to the f.o.c. for the linear model - the derivative of
the prediction is orthogonal to the prediction error. If f(θ) = Xθ, then F̂ is simply X,
so the f.o.c. (with spherical errors) simplify to
X0 y − X0 Xβ = 0,
17.2. IDENTIFICATION 357
• Note that the nonlinearity of the manifold leads to potential multiple local
maxima, minima and saddlepoints: the objective function sn (θ) is not neces-
sarily well-behaved and may be difficult to minimize.
17.2. Identification
1 n
sn (θ) = ∑ [yt − f (xt , θ)]2
n t=1
1 n 2
= ∑
n t=1
f (xt , θ0 ) + εt − ft (xt , θ)
1 n 2 1 n
= ∑ t
n t=1
f (θ 0
) − f t (θ) + ∑ (εt )2
n t=1
2 n
− ∑
n t=1
ft (θ0 ) − ft (θ) εt
1 n 2 a.s. Z 2
(17.2.1) ∑
n t=1
ft (θ0 ) − ft (θ) → f (z, θ0 ) − f (z, θ) dµ(z),
∂2 ∂2
Z 2
s ∞ (θ) = f (x, θ0 ) − f (x, θ) dµ(x)
∂θ∂θ0 ∂θ∂θ0
∂2
Z 2 Z 0
0
f (x, θ ) − f (x, θ) dµ(x) = 2 Dθ f (z, θ0 )0 Dθ0 f (z, θ0 ) dµ(z)
∂θ∂θ0 0 θ
the expectation of the outer product of the gradient of the regression function evaluated
at θ0 . (Note: the uniform boundedness we have already assumed allows passing the
derivative through the integral, by the dominated convergence theorem.) This matrix
will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The
tangent space to the regression manifold must span a K -dimensional space if we are
17.4. ASYMPTOTIC NORMALITY 359
17.3. Consistency
We simply assume that the conditions of Theorem 19 hold, so the estimator is con-
sistent. Given that the strong stochastic equicontinuity conditions hold, as discussed
above, and given the above identification conditions an a compact estimation space (the
closure of the parameter space Θ), the consistency proof’s assumptions are satisfied.
As in the case of GMM, we also simply assume that the conditions for asymptotic
normality as in Theorem 22 hold. The only remaining problem is to determine the form
of the asymptotic variance-covariance matrix. Recall that the result of the asymptotic
normality theorem is
√ d
n θ̂ − θ0 → N 0, J∞ (θ0 )−1 I∞ (θ0 )J∞ (θ0 )−1 ,
∂2
where J∞ (θ0 ) is the almost sure limit of ∂θ∂θ0 sn (θ) evaluated at θ0 , and
√
I∞ (θ0 ) = limVar nDθ sn (θ0 )
1 n
sn (θ) = ∑ [yt − f (xt , θ)]2
n t=1
17.4. ASYMPTOTIC NORMALITY 360
So
2 n
Dθ sn (θ) = − ∑ [yt − f (xt , θ)] Dθ f (xt , θ).
n t=1
Evaluating at θ0 ,
2 n
Dθ sn (θ0 ) = − ∑ εt Dθ f (xt , θ0).
n t=1
Note that the expectation of this is zero, since εt and xt are assumed to be uncorrelated.
So to calculate the variance, we can simply calculate the second moment about zero.
Also note that
n
∂ 0 0
∑ εt Dθ f (xt , θ0) =
∂θ
f(θ ) ε
t=1
= F0 ε
√
I∞ (θ0 ) = limVar nDθ sn (θ0 )
4 0 ’
= limnE F εε F
n2
F0 F
= 4σ2 lim E
n
the obvious estimator. Note the close correspondence to the results for the linear
model.
The mean of yt is λt , as is the variance. Note that λt must be positive. Suppose that the
true mean is
λt0 = exp(xt0 β0 ),
1 n 2
β̂ = arg min sn (β) = ∑ yt − exp(xt0 β)
T t=1
17.6. THE GAUSS-NEWTON ALGORITHM 362
We can write
1 n 2
sn (β) = ∑
T t=1
exp(xt0 β0 + εt − exp(xt0 β)
1 n 2 1 n 2 1 n
= ∑
T t=1
exp(xt
0 0
β − exp(xt
0
β) + ∑
T t=1
εt + 2 ∑
T t=1
εt exp(xt0 β0 − exp(xt0 β)
The last term has expectation zero since the assumption that E (yt |xt ) = exp(xt0 β0 )
implies that E (εt |xt ) = 0, which in turn implies that functions of xt are uncorrelated
with εt . Applying a strong LLN, and noting that the objective function is continuous
on a compact parameter space, we get
2
s∞ (β) = Ex exp(x0 β0 − exp(x0 β) + Ex exp(x0 β0 )
where the last term comes from the fact that the conditional variance of ε is the same
as the variance of y. This function is clearly minimized at β = β0 , so the NLS estimator
is consistent as long as identification holds.
√
E XERCISE 27. Determine the limiting distribution of n β̂ − β0 . This means
∂2
finding the the specific forms of ∂β∂β s (β), J (β 0 ), ∂sn (β) , and I (β0 ). Again, use a
0 n ∂β
y = f(θ) + ν
where ν is a combination of the fundamental error term ε and the error due to evaluat-
ing the regression function at θ rather than the true value θ0 . Take a first order Taylor’s
series approximation around a point θ1 :
y = f(θ1 ) + Dθ0 f θ1 θ − θ1 + ν + approximation error.
z = F(θ1 )b + ω,
where, as above, F(θ1 ) ≡ Dθ0 f(θ1 ) is the n × K matrix of derivatives of the regres-
sion function, evaluated at θ1 , and ω is ν plus approximation error from the truncated
Taylor’s series.
To see why this might work, consider the above approximation, but evaluated at the
NLS estimator:
y = f(θ̂) + F(θ̂) θ − θ̂ + ω
17.6. THE GAUSS-NEWTON ALGORITHM 364
−1
b̂ = F̂0 F̂ F̂0 y − f(θ̂) .
by definition of the NLS estimator (these are the normal equations as in equation
17.1.2, Since b̂ ≡ 0 when we evaluate at θ̂, updating would stop.
y = β1 + β2 xt β3 + εt
• Characteristics of individual: x
• Latent labor supply: s∗ = x0 β + ω
• Offer wage: wo = z0 γ + ν
• Reservation wage: wr = q0 δ + η
w∗ = z 0 γ + ν − q0 δ + η
≡ r0 θ + ε
s∗ = x 0 β + ω
w∗ = r0 θ + ε.
17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 366
Assume that
ω 0 σ2 ρσ
∼ N , .
ε 0 ρσ 1
We assume that the offer wage and the reservation wage, as well as the latent variable
s∗ are unobservable. What is observed is
w = 1 [w∗ > 0]
s = ws∗ .
In other words, we observe whether or not a person is working. If the person is work-
ing, we observe labor supply, which is equal to latent labor supply, s ∗ . Otherwise,
s = 0 6= s∗ . Note that we are using a simplifying assumption that individuals can freely
choose their weekly hours of work.
Suppose we estimated the model
s∗ = x0 β + residual
using only observations for which s > 0. The problem is that these observations are
those for which w∗ > 0, or equivalently, −ε < r0 θ and
E ω| − ε < r0 θ 6= 0,
since ε and ω are dependent. Furthermore, this expectation will in general depend on x
since elements of x can enter in r. Because of these two facts, least squares estimation
is biased and inconsistent.
Consider more carefully E [ω| − ε < r0 θ] . Given the joint normality of ω and ε, we
can write (see for example Spanos Statistical Foundations of Econometric Modelling,
17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 367
pg. 122)
ω = ρσε + η,
where η has mean zero and is independent of ε. With this we can write
s∗ = x0 β + ρσε + η.
z ∼ N(0, 1)
φ(z∗ )
E(z|z > z∗ ) = ,
Φ(−z∗ )
where φ (·) and Φ (·) are the standard normal density and distribution func-
tion, respectively. The quantity on the RHS above is known as the inverse
Mill’s ratio:
φ(z∗ )
IMR(z∗ ) =
Φ(−z∗ )
With this we can write (making use of the fact that the standard normal density
is symmetric about zero, so that φ(−a) = φ(a)):
17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 368
φ (r0 θ)
(17.7.1) s = x0 β + ρσ +η
Φ (r0 θ)
h i β
(17.7.2) φ(r0 θ) + η.
≡ x0 Φ(r θ)
0
ζ
where ζ = ρσ. The error term η has conditional mean zero, and is uncorrelated with
the regressors x0 φ(r0 θ) . At this point, we can estimate the equation by NLS.
Φ(r0 θ)
• Heckman showed how one can estimate this in a two step procedure where
first θ is estimated, then equation 17.7.2 is estimated by least squares using the
estimated value of θ to form the regressors. This is inefficient and estimation
of the covariance is a tricky issue. It is probably easier (and more efficient)
just to do MLE.
• The model presented above depends strongly on joint normality. There exist
many alternative models which weaken the maintained assumptions. It is
possible to estimate consistently without distributional assumptions. See Ahn
and Powell, Journal of Econometrics, 1994.
CHAPTER 18
Nonparametric inference
3x x 2
f (x) = 1 + −
2π 2π
The problem of interest is to estimate the elasticity of f (x) with respect to x, throughout
the range of x.
In general, the functional form of f (x) is unknown. One idea is to take a Taylor’s
series approximation to f (x) about some point x0 . Flexible functional forms such as the
transcendental logarithmic (usually know as the translog) can be interpreted as second
order Taylor’s series approximations. We’ll work with a first order approximation, for
simplicity. Approximating about x0 :
h(x) = a + bx
369
18.1. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: ESTIMATION 370
The coefficient a is the value of the function at x = 0, and the slope is the value of
the derivative at x = 0. These are of course not known. One might try estimation by
ordinary least squares. The objective function is
n
s(a, b) = 1/n ∑ (yt − h(xt ))2 .
t=1
The limiting objective function, following the argument we used to get equations
14.3.1 and 17.2.1 is
Z 2π
s∞ (a, b) = ( f (x) − h(x))2 dx.
0
The theorem regarding the consistency of extremum estimators (Theorem 19) tells
us that â and b̂ will converge almost surely to the values that minimize the limiting
objective function. Solving the first order conditions1 reveals that s∞ (a, b) obtains its
minimum at a0 = 67 , b0 = π1 . The estimated approximating function ĥ(x) therefore
tends almost surely to
In Figure 18.1.1 we see the true function and the limit of the approximation to see the
asymptotic bias as a function of x.
(The approximating model is the straight line, the true model has curvature.) Note
that the approximating model is in general inconsistent, even at the approximation
point. This shows that “flexible functional forms” based upon Taylor’s series approxi-
mations do not in general lead to consistent estimation of functions.
The approximating model seems to fit the true model fairly well, asymptotically.
However, we are interested in the elasticity of the function. Recall that an elasticity is
1The following results were obtained using the command maxima -b fff.mac You can get the source
file at http://pareto.uab.es/mcreel/Econometrics/Include/Nonparametric/fff.mac.
18.1. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: ESTIMATION 371
2.5
1.5
1
0 1 2 3 4 5 6 7
Good approximation of the elasticity over the range of x will require a good approxi-
mation of both f (x) and f 0 (x) over the range of x. The approximating elasticity is
In Figure 18.1.2 we see the true elasticity and the elasticity obtained from the limiting
approximating model.
The true elasticity is the line that has negative slope for large x. Visually we see
that the elasticity is not approximated so well. Root mean squared error in the approx-
imation of the elasticity is
Z 2π
1/2
2
(ε(x) − η(x)) dx = . 31546
0
18.1. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: ESTIMATION 372
0.6
0.5
0.4
0.3
0.2
0.1
0
0 1 2 3 4 5 6 7
Now suppose we use the leading terms of a trigonometric series as the approxi-
mating model. The reason for using a trigonometric series as an approximating model
is motivated by the asymptotic properties of the Fourier flexible functional form (Gal-
lant, 1981, 1982), which we will study in more detail below. Normally with this type
of model the number of basis functions is an increasing function of the sample size.
Here we hold the set of basis function fixed. We will consider the asymptotic behavior
of a fixed model, which we interpret as an approximation to the estimator’s behavior
in finite samples. Consider the set of basis functions:
h i
Z(x) = 1 x cos(x) sin(x) cos(2x) sin(2x) .
2.5
1.5
1
0 1 2 3 4 5 6 7
Maintaining these basis functions as the sample size increases, we find that the limiting
objective function is minimized at
7 1 1 1
a1 = , a2 = , a3 = − 2 , a4 = 0, a5 = − 2 , a6 = 0 .
6 π π 4π
Substituting these values into gK (x) we obtain the almost sure limit of the approxima-
tion
(18.1.1)
1 1
g∞ (x) = 7/6 + x/π + (cos x) − 2 + (sin x) 0 + (cos 2x) − 2 + (sin 2x) 0
π 4π
In Figure 18.1.3 we have the approximation and the true function: Clearly the trun-
cated trigonometric series model offers a better approximation, asymptotically, than
does the linear model. In Figure 18.1.4 we have the more flexible approximation’s
elasticity and that of the true function: On average, the fit is better, though there is
18.2. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: HYPOTHESIS TESTING 374
0.6
0.5
0.4
0.3
0.2
0.1
0
0 1 2 3 4 5 6 7
some implausible wavyness in the estimate. Root mean squared error in the approxi-
mation of the elasticity is
Z 2π 2 !1/2
g0 (x)x
ε(x) − ∞ dx = . 16213,
0 g∞ (x)
about half that of the RMSE when the first order approximation is used. If the trigono-
metric series contained infinite terms, this error measure would be driven to zero, as
we shall see.
• Consider means of testing for the hypothesis that consumers maximize utility.
A consequence of utility maximization is that the Slutsky matrix D2p h(p,U ),
18.2. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: HYPOTHESIS TESTING 375
where h(p,U ) are the a set of compensated demand functions, must be neg-
ative semi-definite. One approach to testing for utility maximization would
estimate a set of normal demand functions x(p, m).
• Estimation of these functions by normal parametric methods requires specifi-
cation of the functional form of demand, for example
x(p, m) = x(p, m, θ0 ) + ε, θ0 ∈ Θ0 ,
y = f (x) + ε,
xi ∂ f (x)
ξ xi = ,
f (x) ∂xi f (x)
at an arbitrary point xi .
The Fourier form, following Gallant (1982), but with a somewhat different parameter-
ization, may be written as
A J
(18.3.1) gK (x | θK ) = α + x β + 1/2x Cx +
0 0
∑∑ u jα cos( jk0α x) − v jα sin( jk0α x) .
α=1 j=1
A J
(18.3.3) Dx gK (x | θK ) = β + Cx + ∑∑ −u jα sin( jk0α x) − v jα cos( jk0α x) jkα
α=1 j=1
A J
(18.3.4) D2x gK (x|θK ) = C + ∑∑ −u jα cos( jk0α x) + v jα sin( jk0α x) j2 kα k0α
α=1 j=1
When λ is the zero vector, Dλ h(x) ≡ h(x). Taking this definition and the last few
equations into account, we see that it is possible to define (1 × K) vector Z λ (x) so that
• Both the approximating model and the derivatives of the approximating model
are linear in the parameters.
• For the approximating model to the function (not derivatives), write g K (x|θK ) =
z0 θK for simplicity.
The following theorem can be used to prove the consistency of the Fourier form.
T HEOREM 28. [Gallant and Nychka, 1987] Suppose that ĥn is obtained by max-
imizing a sample objective function sn (h) over HKn where HK is a subset of some
18.3. THE FOURIER FUNCTIONAL FORM 379
almost surely.
(d) Identification: Any point h in the closure of H with s∞ (h, h∗ ) ≥ s∞ (h∗ , h∗ ) must
have k h − h∗ k= 0.
Under these conditions limn→∞ k h∗ − ĥn k= 0 almost surely, provided that limn→∞ Kn =
∞ almost surely.
The modification of the original statement of the theorem that has been made is to
set the parameter space Θ in Gallant and Nychka’s (1987) Theorem 0 to a single point
and to state the theorem in terms of maximization rather than minimization.
This theorem is very similar in form to Theorem 19. The main differences are:
(1) A generic norm k h k is used in place of the Euclidean norm. This norm
may be stronger than the Euclidean norm, so that convergence with respect
to k h k implies convergence w.r.t the Euclidean norm. Typically we will
want to make sure that the norm is strong enough to imply convergence of all
functions of interest.
18.3. THE FOURIER FUNCTIONAL FORM 380
(2) The “estimation space” H is a function space. It plays the role of the parame-
ter space Θ in our discussion of parametric estimators. There is no restriction
to a parametric family, only a restriction to a space of functions that satisfy
certain conditions. This formulation is much less restrictive than the restric-
tion to a parametric family.
(3) There is a denseness assumption that was not present in the other theorem.
We will not prove this theorem (the proof is quite similar to the proof of theorem [19],
see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form
as the approximating model.
18.3.1. Sobolev norm. Since all of the assumptions involve the norm k h k , we
need to make explicit what norm we wish to use. We need a norm that guarantees
that the errors in approximation of the functions we are interested in are accounted
for. Since we are interested in first-order elasticities in the present case, we need close
approximation of both the function f (x) and its first derivative f 0 (x), throughout the
range of x. Let X be an open set that contains all values of x that we’re interested in.
The Sobolev norm is appropriate in this case. It is defined, making use of our notation
for partial derivatives, as:
k h km,X = max sup Dλ h(x)
|λ∗ |≤m X
k f (x) − gK (x | θK ) km,X .
We see that this norm takes into account errors in approximating the function and
partial derivatives up to order m. If we want to estimate first order elasticities, as is the
18.3. THE FOURIER FUNCTIONAL FORM 381
where D is a finite constant. In plain words, the functions must have bounded partial
derivatives of one order higher than the derivatives we seek to estimate.
18.3.3. The estimation space and the estimation subspace. Since in our case
we’re interested in consistent estimation of first-order elasticities, we’ll define the es-
timation space as follows:
D EFINITION 29. [Estimation space] The estimation space H = W2,X (D). The es-
timation space is an open set, and we presume that h∗ ∈ H .
So we are assuming that the function to be estimated has bounded second deriva-
tives throughout X .
With seminonparametric estimators, we don’t actually optimize over the estimation
space. Rather, we optimize over a subspace, HKn , defined as:
(1) The dimension of the parameter vector, dim θKn → ∞ as n → ∞. This is achieved
by making A and J in equation 18.3.1 increasing functions of n, the sample
size. It is clear that K will have to grow more slowly than n. The second
requirement is:
(2) We need that the HK be dense subsets of H .
The estimation subspace HK , defined above, is a subset of the closure of the estimation
space, H . A set of subsets Aa of a set A is “dense” if the closure of the countable
union of the subsets is equal to the closure of A :
∪∞
a=1 Aa = A
Use a picture here. The rest of the discussion of denseness is provided just for com-
pleteness: there’s no need to study it in detail. To show that HK is a dense subset of
H with respect to k h k1,X , it is useful to apply Theorem 1 of Gallant (1982), who in
turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by
Gallant, with minor notational changes, for convenience of reference:
T HEOREM 31. [Edmunds and Moscatelli, 1977] Let the real-valued function h ∗ (x)
be continuously differentiable up to order m on an open set containing the closure of
X . Then it is possible to choose a triangular array of coefficients θ 1 , θ2 , . . . θK , . . .,
18.3. THE FOURIER FUNCTIONAL FORM 383
such that for every q with 0 ≤ q < m, and every ε > 0, k h∗ (x) − hK (x|θK ) kq,X =
o(K −m+q+ε ) as K → ∞.
lim k h∗ − hK k1,X = 0,
K→∞
However,
∪∞ HK ⊂ H ,
so
∪∞ HK ⊂ H .
Therefore
H = ∪ ∞ HK ,
With random sampling, as in the case of Equations 14.3.1 and 17.2.1, the limiting
objective function is
Z
(18.3.6) s∞ (g, f ) = − ( f (x) − g(x))2 dµx − σ2ε .
X
where the true function f (x) takes the place of the generic function h ∗ in the presenta-
tion of the theorem. Both g(x) and f (x) are elements of ∪∞ HK .
The pointwise convergence of the objective function needs to be strengthened to
uniform convergence. We will simply assume that this holds, since the way to verify
this depends upon the specific application. We also have continuity of the objective
function in g, with respect to the norm k h k1,X since
lim s∞ g 1 , f ) − s ∞ g 0 , f )
kg1 −g0 k1,X →0
Z h 2 2 i
= lim g1 (x) − f (x) − g0 (x) − f (x) dµx.
kg1 −g0 k1,X →0 X
By the dominated convergence theorem (which applies since the finite bound D used
to define W2,Z (D) is dominated by an integrable function), the limit and the integral
can be interchanged, so by inspection, the limit is zero.
18.3.6. Identification. The identification condition requires that for any point (g, f )
in H × H , s∞ (g, f ) ≥ s∞ ( f , f ) ⇒ k g − f k1,X = 0. This condition is clearly satisfied
given that g and f are once continuously differentiable (by the assumption that defines
the estimation space).
• Estimation space H = W2,X (D): the function space in the closure of which
the true function must lie.
• Consistency norm k h k1,X . The closure of H is compact with respect to this
norm.
• Estimation subspace HK . The estimation subspace is the subset of H that is
representable by a Fourier form with parameter θK . These are dense subsets
of H .
• Sample objective function sn (θK ), the negative of the sum of squares. By
standard arguments this converges uniformly to the
• Limiting objective function s∞ ( g, f ), which is continuous in g and has a
global maximum in its first argument, over the closure of the infinite union of
the estimation subpaces, at g = f .
• As a result of this, first order elasticities
xi ∂ f (x)
f (x) ∂xi f (x)
gK (x|θK ) = z0 θK .
18.4. KERNEL REGRESSION ESTIMATORS 386
+
θ̂K = Z0K ZK Z0K y,
√ d
n z0 θ̂K − f (x) → N(0, AV ),
where " #
+
ZK ZK
0
AV = lim E z0 zσ̂2 .
n→∞ n
Formally, this is exactly the same as if we were dealing with a parametric lin-
ear model. I emphasize, though, that this is only valid if K grows very slowly
as n grows. If we can’t stick to acceptable rates, we should probably use some
other method of approximating the small sample distribution. Bootstrapping
is a possibility. We’ll discuss this in the section on simulation.
• Suppose we have an iid sample from the joint density f (x, y), where x is k
-dimensional. The model is
yt = g(xt ) + εt ,
where
E(εt |xt ) = 0.
f (x, y)
Z
g(x) = y dy
h(x)
1
Z
= y f (x, y)dy,
h(x)
• This suggests that we could estimate g(x) by estimating h(x) and y f (x, y)dy.
R
18.4.1. Estimation of the denominator. A kernel estimator for h(x) has the form
1 n K [(x − xt ) /γn ]
ĥ(x) = ∑
n t=1 γkn
,
In this respect, K(·) is like a density function, but we do not necessarily re-
strict K(·) to be nonnegative.
• The window width parameter, γn is a sequence of positive numbers that satis-
fies
lim γn = 0
n→∞
lim nγkn = ∞
n→∞
So, the window width must tend to zero, but not too quickly.
• To show pointwise consistency of ĥ(x) for h(x), first consider the expectation
of the estimator (since the estimator is an average of iid terms we only need
to consider the expectation of a representative term):
Z
E ĥ(x) = γ−k
n K [(x − z) /γn ] h(z)dz.
Z
∗ k ∗
E ĥ(x) = γ−k
n K (z ) h(x − γn z )γn dz
∗
Z
= K (z∗ ) h(x − γn z∗ )dz∗ .
18.4. KERNEL REGRESSION ESTIMATORS 389
Now, asymptotically,
Z
lim E ĥ(x) = lim K (z∗ ) h(x − γn z∗ )dz∗
n→∞ n→∞
Z
= lim K (z∗ ) h(x − γn z∗ )dz∗
n→∞
Z
= K (z∗ ) h(x)dz∗
Z
= h(x) K (z∗ ) dz∗
= h(x),
since γn → 0 and K (z∗ ) dz∗ = 1 by assumption. (Note: that we can pass the
R
nγknV ĥ(x) = γ−k
n V {K [(x − z) /γn ]}
n o
2 2
nγknV ĥ(x) = γ−k
n E (K [(x − z) /γ n ]) − γ−k
n {E (K [(x − z) /γn ])}
Z Z 2
2 k
= γn K [(x − z) /γn ] h(z)dz − γn
−k
γn K [(x − z) /γn ] h(z)dz
−k
Z h i2
2 k b
= γ−k
n K [(x − z) /γ n ] h(z)dz − γ n E h(x)
18.4. KERNEL REGRESSION ESTIMATORS 390
by the previous result regarding the expectation and the fact that γ n → 0.
Therefore,
Z
2
lim nγk V ĥ(x) = lim γ−k
n K [(x − z) /γn ] h(z)dz.
n→∞ n n→∞
Using exactly the same change of variables as before, this can be shown to be
Z
lim nγk V ĥ(x) = h(x) [K(z∗ )]2 dz∗ .
n→∞ n
Since both [K(z∗ )]2 dz∗ and h(x) are bounded, this is bounded, and since
R
V ĥ(x) → 0.
• Since the bias and the variance both go to zero, we have pointwise consistency
(convergence in quadratic mean implies convergence in probability).
mator of f (x, y). The estimator has the same form as the estimator for h(x), only with
one dimension more:
1 n K [(x − xt ) /γn ]
Z
y fˆ(y, x)dy = ∑ yt
n t=1 γkn
1
Z
ĝ(x) = y fˆ(y, x)dy
ĥ(x)
1 n K[(x−xt )/γn ]
n ∑t=1 yt γkn
= K[(x−x
1 n t )/γn ]
n ∑t=1 γkn
n
∑t=1 yt K [(x − xt ) /γn ]
= n .
∑t=1 K [(x − xt ) /γn ]
18.4.3. Discussion.
• The standard normal density is a popular choice for K(.) and K∗ (y, x), though
there are possibly better alternatives.
(1) Split the data. The out of sample data is yout and xout .
(2) Choose a window width γ.
(3) With the in sample data, fit ŷtout corresponding to each xtout . This fitted value
is a function of the in sample data, as well as the evaluation point xtout , but it
does not involve ytout .
(4) Repeat for all out of sample points.
(5) Calculate RMSE(γ)
(6) Go to step 2, or to the next step if enough window widths have been tried.
(7) Select the γ that minimizes RMSE(γ) (Verify that a minimum has been found,
for example by plotting RMSE as a function of γ).
(8) Re-estimate using the best γ and all of the data.
This same principle can be used to choose A and J in a Fourier form model.
The previous discussion suggests that a kernel density estimator may easily be
constructed. We have already seen how joint densities may be estimated. If were
interested in a conditional density, for example of y conditional on x, then the kernel
18.6. SEMI-NONPARAMETRIC MAXIMUM LIKELIHOOD 393
fˆ(x, y)
fby|x =
ĥ(x)
1 n K∗ [(y−yt )/γn ,(x−xt )/γn ]
n ∑t=1 γnk+1
=
1 n K[(x−xt )/γn ]
n ∑t=1 γkn
n
1 ∑t=1 K∗ [(y − yt ) /γn , (x − xt ) /γn ]
= n
γn ∑t=1 K [(x − xt ) /γn ]
where we obtain the expressions for the joint and marginal densities from the section
on kernel regression.
and η p (x, φ, γ) is a normalizing factor to make the density integrate (sum) to one. Be-
cause h2p (y|γ)/η p (x, φ, γ) is a homogenous function of θ it is necessary to impose a
normalization: γ0 is set to 1. The normalization factor η p (φ, γ) is calculated (following
Cameron and Johansson) using
∞
E(Y r ) = ∑ yr fY (y|φ, γ)
y=0
∞ [h p (y|γ)]2
= ∑ yr η p (φ, γ)
fY (y|φ)
y=0
∞ p p
= ∑ ∑ ∑ yr fY (y|φ)γk γl yk yl /η p(φ, γ)
y=0 k=0 l=0
( )
p p ∞
= ∑ ∑ γk γl ∑ yr+k+l fY (y|φ) /η p (φ, γ)
k=0 l=0 y=0
p p
= ∑ ∑ γk γl mk+l+r /η p(φ, γ).
k=0 l=0
Recall that γ0 is set to 1 to achieve identification. The mr in equation 18.6.1 are the
raw moments of the baseline density. Gallant and Nychka (1987) give conditions under
which such a density may be treated as correctly specified, asymptotically. Basically,
the order of the polynomial must increase as the sample size increases. However, there
are technicalities.
Similarly to Cameron and Johannson (1997), we may develop a negative binomial
polynomial (NBP) density for count data. The negative binomial baseline density may
18.6. SEMI-NONPARAMETRIC MAXIMUM LIKELIHOOD 395
where φ = {λ, ψ}, λ > 0 and ψ > 0. The usual means of incorporating conditioning
variables x is the parameterization λ = ex β . When ψ = λ/α we have the negative
0
binomial-I model (NB-I). When ψ = 1/α we have the negative binomial-II (NP-II)
model. For the NB-I density, V (Y ) = λ + αλ. In the case of the NB-II model, we have
V (Y ) = λ + αλ2 . For both forms, E(Y ) = λ.
The reshaped density, with normalization to sum to one, is
ψ y
[h p (y|γ)]2 Γ(y + ψ) ψ λ
(18.6.2) fY (y|φ, γ) = .
η p (φ, γ) Γ(y + 1)Γ(ψ) ψ+λ ψ+λ
−ψ
(18.6.3) MY (t) = ψψ λ − et λ + ψ .
To illustrate, Figure 18.6.1 shows calculation of the first four raw moments of the NB
density, calculated using MuPAD, which is a Computer Algebra System that (use to
be?) free for personal use. These are the moments you would need to use a second
order polynomial (p = 2). MuPAD will output these results in the form of C code,
which is relatively easy to edit to write the likelihood function for the model. This has
been done in NegBinSNP.cc, which is a C++ version of this model that can be compiled
to use with octave using the mkoctfile command. Note the impressive length of the
expressions when the degree of the expansion is 4 or 5! This is an example of a model
that would be difficult to formulate without the help of a program like MuPAD.
18.6. SEMI-NONPARAMETRIC MAXIMUM LIKELIHOOD 396
It is possible that there is conditional heterogeneity such that the appropriate re-
shaping should be more local. This can be accomodated by allowing the γ k parameters
to depend upon the conditioning variables, for example using polynomials.
Gallant and Nychka, Econometrica, 1987 prove that this sort of density can ap-
proximate a wide variety of densities arbitrarily well as the degree of the polynomial
increases with the sample size. This approach is not without its drawbacks: the sample
objective function can have an extremely large number of local maxima that can lead
to numeric difficulties. If someone could figure out how to do in a way such that the
sample objective function was nice and smooth, they would probably get the paper
published in a good journal. Any ideas?
Here’s a plot of true and the limiting SNP approximations (with the order of the
polynomial fixed) to four different count data densities, which variously exhibit over
18.7. EXAMPLES 397
and underdispersion, as well as excess zeros. The baseline model is a negative bino-
mial density.
Case 1 Case 2
.5
.4 .1
.3
.2 .05
.1
0 5 10 15 20 0 5 10 15 20 25
Case 3 Case 4
.25 .2
.2 .15
.15
.1
.1
.05
.05
18.7. Examples
We’ll use the MEPS OBDV data to illustrate kernel regression and semi-nonparametric
maximum likelihood.
18.7.1. Kernel regression estimation. Let’s try a kernel regression fit for the
OBDV data. The program OBDVkernel.m loads the MEPS OBDV data, scans over a
range of window widths and calculates leave-one-out CV scores, and plots the fitted
OBDV usage versus AGE, using the best window width. The plot is in Figure 18.7.1.
18.7. EXAMPLES 398
4.5
3.5
2.5
2
15 20 25 30 35 40 45 50 55 60 65
Note that usage increases with age, just as we’ve seen with the parametric models.
Once could use bootstrapping to generate a confidence interval to the fit.
18.7.2. Seminonparametric ML estimation and the MEPS data. Now let’s es-
timate a seminonparametric density for the OBDV data. We’ll reshape a negative bi-
nomial density, as discussed above. The program EstimateNBSNP.m loads the MEPS
OBDV data and estimates the model, using a NB-I baseline density and a 2nd order
polynomial expansion. The output is:
OBDV
======================================================
BFGSMIN final results
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value 2.17061
Stepsize 0.0065
24 iterations
------------------------------------------------------
******************************************************
NegBin SNP model, MEPS full data set
Information Criteria
CAIC : 19907.6244 Avg. CAIC: 4.3619
BIC : 19897.6244 Avg. BIC: 4.3597
AIC : 19833.3649 Avg. AIC: 4.3456
******************************************************
Note that the CAIC and BIC are lower for this model than for the models presented
in Table 3. This model fits well, still being parsimonious. You can play around trying
other use measures, using a NP-II baseline density, and using other orders of expan-
sions. Density functions formed in this way may have MANY local maxima, so you
need to be careful before accepting the results of a casual run. To guard against hav-
ing converged to a local maximum, one can try using multiple starting values, or one
could try simulated annealing as an optimization method. If you uncomment the rel-
evant lines in the program, you can use SA to do the minimization. This will take
a lot of time, compared to the default BFGS minimization. The chapter on parallel
computations might be interesting to read before trying this.
CHAPTER 19
Simulation-based estimation
19.1. Motivation
Simulation methods are of interest when the DGP is fully characterized by a pa-
rameter vector, but the likelihood function is not calculable. If it were available, we
would simply estimate by MLE, which is asymptotically fully efficient.
y∗i = Xi β + εi
(19.1.1) εi ∼ N(0, Ω)
y = τ(y∗ )
401
19.1. MOTIVATION 402
This mapping is such that each element of y is either zero or one (in some
cases only one element will be one).
• Define
Ai = A(yi ) = {y∗ |yi = τ(y∗ )}
where
−1/2 −ε0 Ω−1 ε
n(ε, Ω) = (2π) −M/2
|Ω| exp
2
is the multivariate normal density of an M -dimensional random vector. The
log-likelihood function is
1 n
ln L (θ) = ∑ ln pi (θ)
n i=1
1 n 1 n Dθ pi (θ̂)
∑ i
n i=1
g ( θ̂) = ∑ p (θ̂) ≡ 0.
n i=1 i
• The problem is that evaluation of Li (θ) and its derivative w.r.t. θ by standard
methods of numeric integration such as quadrature is computationally infea-
sible when m (the dimension of y) is higher than 3 or 4 (as long as there are
no restrictions on Ω).
19.1. MOTIVATION 403
• The mapping τ(y∗ ) has not been made specific so far. This setup is quite
general: for different choices of τ(y∗ ) it nests the case of dynamic binary
discrete choice models as well as the case of multinomial discrete choice (the
choice of one out of a finite set of alternatives).
– Multinomial discrete choice is illustrated by a (very simple) job search
model. We have cross sectional data on individuals’ matching to a set of
m jobs that are available (one of which is unemployment). The utility of
alternative j is
u j = X jβ + ε j
y j = 1 u j > uk , ∀k ∈ m, k 6= j
u jt = W jt β − ε jt ,
j ∈ {1, 2}
t ∈ {1, 2, ..., m}
Then
y∗ = u 2 − u 1
= (W2 −W1 )β + ε2 − ε1
≡ Xβ + ε
19.1. MOTIVATION 404
y = 1 [y∗ > 0] ,
exp(−λ)λi
Pr(y = i) =
i!
The mean and variance of the Poisson distribution are both equal to λ :
E (y) = V (y) = λ.
λi = exp(Xi β).
This ensures that the mean is positive (as it must be). Estimation by ML is straightfor-
ward.
Often, count data exhibits “overdispersion” which simply means that
If this is the case, a solution is to use the negative binomial distribution rather than the
Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into
the specification:
λi = exp(Xi β + ηi )
where ηi has some specified density with support S (this density may depend on addi-
tional parameters). Let dµ(ηi ) be the density of ηi . In some cases, the marginal density
of y
exp [− exp(Xiβ + ηi )] [exp(Xiβ + ηi )]yi
Z
Pr(y = yi ) = dµ(ηi )
S yi !
will have a closed-form solution (one can derive the negative binomial distribution in
the way if η has an exponential distribution), but often this will not be possible. In
this case, simulation is a means of calculating Pr(y = i), which is then used to do ML
estimation. This would be an example of the Simulated Maximum Likelihood (SML)
estimation.
• In this case, since there is only one latent variable, quadrature is probably
a better choice. However, a more flexible model with heterogeneity would
allow all parameters (not just the constant) to vary. For example
• W (0) = 0
• [W (s) −W (t)] ∼ N(0, s − t)
• [W (s) −W (t)] and [W ( j) −W (k)] are independent for s > t > j > k. That is,
non-overlapping segments are independent.
One can think of Brownian motion the accumulation of independent normally dis-
tributed shocks with infinitesimal variance.
To estimate a model of this sort, we typically have data that are assumed to be obser-
vations of yt in discrete points y1 , y2 , ...yT . That is, though yt is a continuous process it
is observed in discrete time.
To perform inference on θ, direct ML or GMM estimation is not usually feasible,
because one cannot, in general, deduce the transition density f (yt |yt−1 , θ). This den-
sity is necessary to evaluate the likelihood function or to evaluate moment conditions
(which are based upon expectations with respect to this density).
model is
εt ∼ N(0, 1)
The discretization induces a new parameter, φ (that is, the φ0 which defines
the best approximation of the discretization to the actual (unknown) discrete
time version of the model is not equal to θ0 which is the true parameter value).
This is an approximation, and as such “ML” estimation of φ (which is actu-
ally quasi-maximum likelihood, QML) based upon this equation is in general
biased and inconsistent for the original parameter, θ. Nevertheless, the ap-
proximation shouldn’t be too bad, which will be useful, as we will see.
• The important point about these three examples is that computational diffi-
culties prevent direct application of ML, GMM, etc. Nevertheless the model
is fully specified in probabilistic terms up to a parameter vector. This means
that the model is simulable, conditional on the parameter vector.
1 n
θ̂ML = arg max sn (θ) = ∑ ln p(yt |Xt , θ)
n t=1
where p(yt |Xt , θ) is the density function of the t th observation. When p(yt |Xt , θ) does
not have a known closed form, θ̂ML is an infeasible estimator. However, it may be
possible to define a random function such that
1 H
p̃ (yt , Xt , θ) = ∑ f (νts , yt , Xt , θ)
H s=1
• The SML simply substitutes p̃ (yt , Xt , θ) in place of p(yt |Xt , θ) in the log-
likelihood function, that is
1 n
θ̂SML = arg max sn (θ) = ∑ ln p̃ (yt , Xt , θ)
n i=1
u j = X jβ + ε j
y j = 1 u j > uk , k ∈ m, k 6= j
The problem is that Pr(y j = 1|θ) can’t be calculated when m is larger than 4 or 5.
However, it is easy to simulate this probability.
∑H
h=1 ỹi jh
πi j =
e
H
1 n 0
ln L (β, Ω) = ∑ yi ln p̃ (yi , Xi , θ)
n i=1
• The H draws of ε̃i are draw only once and are used repeatedly during the
iterations used to find β̂ and Ω̂. The draws are different for each i. If the ε̃i are
re-drawn at every iteration the estimator will not converge.
• The log-likelihood function with this simulator is a discontinuous function of
β and Ω. This does not cause problems from a theoretical point of view since
it can be shown that ln L (β, Ω) is stochastically equicontinuous. However,
it does cause problems if one attempts to use a gradient-based optimization
method such as Newton-Raphson.
• It may be the case, particularly if few simulations, H, are used, that some
πi are zero. If the corresponding element of yi is equal to 1, there
elements of e
will be a log(0) problem.
• Solutions to discontinuity:
– 1) use an estimation method that doesn’t require a continuous and dif-
ferentiable objective function, for example, simulated annealing. This is
computationally costly.
– 2) Smooth the simulated probabilities so that they are continuous func-
tions of the parameters. For example, apply a kernel transformation such
as
m m
ỹi j = Φ A × ui j − max uik + .5 × 1 ui j = max uik
k=1 k=1
19.2. SIMULATED MAXIMUM LIKELIHOOD (SML) 410
19.2.2. Properties. The properties of the SML estimator depend on how H is set.
The following is taken from Lee (1995) “Asymptotic Bias in Simulated Maximum
Likelihood Estimation of Discrete Choice Models,” Econometric Theory, 11, pp. 437-
83.
√ d
n θ̂SML − θ0 → N(0, I −1 (θ0 ))
√ d
n θ̂SML − θ0 → N(B, I −1 (θ0 ))
• This means that the SML estimator is asymptotically biased if H doesn’t grow
faster than n1/2 .
19.3. METHOD OF SIMULATED MOMENTS (MSM) 411
• The varcov is the typical inverse of the information matrix, so that as long
as H grows fast enough the estimator is consistent and fully asymptotically
efficient.
Suppose we have a DGP(y|x, θ) which is simulable given θ, but is such that the
density of y is not calculable.
Once could, in principle, base a GMM estimator upon the moment conditions
where
Z
k(xt , θ) = K(yt , xt )p(y|xt , θ)dy,
1 H
k (xt , θ) = ∑ K(e
e yth , xt )
H h=1
a.s.
• By the law of large numbers, e
k (xt , θ) → k (xt , θ) , as H → ∞, which provides
a clear intuitive basis for the estimator, though in fact we obtain consistency
even for H finite, since a law of large numbers is also operating across the
n observations of real data, so errors introduced by simulation cancel them-
selves out.
• This allows us to form the moment conditions
h i
(19.3.1) ft (θ) = K(yt , xt ) − e
m k (xt , θ) zt
19.3. METHOD OF SIMULATED MOMENTS (MSM) 412
1 n
m(θ)
e = ∑m
n i=1
ft (θ)
" #
1 n 1 H
(19.3.2) = ∑ K(yt , xt ) − H ∑ k(eyth, xt ) zt
n i=1 h=1
with which we form the GMM criterion and estimate as usual. Note that the
yth , xt ) appears linearly within the sums.
unbiased simulator k(e
19.3.1. Properties. Suppose that the optimal weighting matrix is used. McFad-
den (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distri-
bution of the MSM estimator is very similar to that of the infeasible GMM estimator.
In particular, assuming that the optimal weighting matrix is used, and for H finite,
√ 0
d 1
−1 0 −1
(19.3.3) n θ̂MSM − θ → N 0, 1 + D∞ Ω D∞
H
−1
where D∞ Ω−1 D0∞ is the asymptotic variance of the infeasible GMM estimator.
• That is, the asymptotic variance is inflated by a factor 1 + 1/H. For this rea-
son the MSM estimator is not fully asymptotically efficient relative to the
infeasible GMM estimator, for H finite, but the efficiency loss is small and
controllable, by setting H reasonably large.
• The estimator is asymptotically unbiased even for H = 1. This is an advantage
relative to SML.
• If one doesn’t use the optimal weighting matrix, the asymptotic varcov is just
the ordinary GMM varcov, inflated by 1 + 1/H.
• The above presentation is in terms of a specific moment condition based upon
the conditional mean. Simulated GMM can be applied to moment conditions
of any form.
19.3. METHOD OF SIMULATED MOMENTS (MSM) 413
19.3.2. Comments. Why is SML inconsistent if H is finite, while MSM is? The
reason is that SML is based upon an average of logarithms of an unbiased simulator
(the densities of the observations). To use the multinomial probit model as an example,
the log-likelihood function is
1 n 0
ln L (β, Ω) = ∑ yi ln pi(β, Ω)
n i=1
due to the fact that ln(·) is a nonlinear transformation. The only way for the two to be
equal (in the limit) is if H tends to infinite so that p̃ (·) tends to p (·).
The reason that MSM does not suffer from this problem is that in this case the
unbiased simulator appears linearly within every sum of terms, and it appears within
a sum over n (see equation [19.3.2]). Therefore the SLLN applies to cancel out sim-
ulation errors, from which we get consistency. That is, using simple notation for the
random sampling case, the moment conditions
" #
1 n 1 H
(19.3.4) m̃(θ) = ∑ K(yt , xt ) − H ∑ k(eyth , xt ) zt
n i=1 h=1
" #
1 n 1 H
(19.3.5) = ∑ k(xt , θ0) + εt − H ∑ [k(xt , θ) + ε̃ht ] zt
n i=1 h=1
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 414
• If you look at equation 19.3.5 a bit, you will see why the variance inflation
factor is (1 + H1 ).
The choice of which moments upon which to base a GMM estimator can have very
pronounced effects upon the efficiency of the estimator.
mt (θ) = Dθ ln pt (θ | It )
The efficient method of moments (EMM) (see Gallant and Tauchen (1996), “Which
Moments to Match?”, ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681)
seeks to provide moment conditions that closely mimic the score vector. If the approx-
imation is very good, the resulting estimator will be very nearly fully efficient.
The DGP is characterized by random sampling from the density
We can define an auxiliary model, called the “score generator”, which simply pro-
vides a (misspecified) parametric density
f (y|xt , λ) ≡ ft (λ)
1 n
λ̂ = arg max sn (λ) = ∑ ln ft (λ).
Λ n t=1
• After determining λ̂ we can calculate the score functions Dλ ln f (yt |xt , λ̂).
• The important point is that even if the density is misspecified, there is a
pseudo-true λ0 for which the true expectation, taken with respect to the true
but unknown density of y, p(y|xt , θ0 ), and then marginalized over x is zero:
0
0
Z Z
∃λ : EX EY |X Dλ ln f (y|x, λ ) = Dλ ln f (y|x, λ0 )p(y|x, θ0 )dydµ(x) = 0
X Y |X
p
• We have seen in the section on QML that λ̂ → λ0 ; this suggests using the
moment conditions
1 n
Z
(19.4.1) mn (θ, λ̂) = ∑ Dλ ln ft (λ̂)pt (θ)dy
n t=1
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 416
• These moment conditions are not calculable, since pt (θ) is not available, but
they are simulable using
1 n 1 H
fn (θ, λ̂) =
m ∑ ∑
n t=1 H h=1
yth |xt , λ̂)
Dλ ln f (e
where ỹth is a draw from DGP(θ), holding xt fixed. By the LLN and the fact
that λ̂ converges to λ0 ,
e ∞ (θ0 , λ0 ) = 0.
m
This is not the case for other values of θ, assuming that λ0 is identified.
• The advantage of this procedure is that if f (yt |xt , λ) closely approximates
p(y|xt , θ), then m
e n (θ, λ̂) will closely approximate the optimal moment con-
ditions which characterize maximum likelihood estimation, which is fully
efficient.
• If one has prior information that a certain density approximates the data well,
it would be a good choice for f (·).
• If one has no density in mind, there exist good ways of approximating un-
known distributions parametrically: Philips’ ERA’s (Econometrica, 1983)
and Gallant and Nychka’s (Econometrica, 1987) SNP density estimator which
we saw before. Since the SNP density is consistent, the efficiency of the in-
direct estimator is the same as the infeasible ML estimator.
19.4.1. Optimal weighting matrix. I will present the theory for H finite, and
possibly small. This is done because it is sometimes impractical to estimate with H
very large. Gallant and Tauchen give the theory for the case of H so large that it may
be treated as infinite (the difference being irrelevant given the numerical precision of
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 417
a computer). The theory for the case of H infinite follows directly from the results
presented here.
e λ̂) depends on the pseudo-ML estimate λ̂. We can
The moment condition m(θ,
apply Theorem 22 to conclude that
√
d
(19.4.2) n λ̂ − λ0 → N 0, J (λ0 )−1 I (λ0 )J (λ0 )−1
If the density f (yt |xt , λ̂) were in fact the true density p(y|xt , θ), then λ̂ would be the
maximum likelihood estimator, and J (λ0 )−1 I (λ0 ) would be an identity matrix, due
to the information matrix equality. However, in the present case we assume that
f (yt |xt , λ̂) is only an approximation to p(y|xt , θ), so there is no cancellation.
2
∂
Recall that J (λ0 ) ≡ p lim ∂λ∂λ s
0 n (λ 0 ) . Comparing the definition of s (λ) with
n
As in Theorem 22,
0 ∂sn (λ) ∂sn (λ)
I (λ ) = lim E n .
n→∞ ∂λ λ0 ∂λ0 λ0
In this case, this is simply the asymptotic variance covariance matrix of the moment
√
conditions, Ω. Now take a first order Taylor’s series approximation to nmn (θ0 , λ̂)
about λ0 :
√ √ √
nm̃n (θ0 , λ̂) = nm̃n (θ0 , λ0 ) + nDλ0 m̃(θ0 , λ0 ) λ̂ − λ0 + o p (1)
√
First consider nm̃n (θ0 , λ0 ). It is straightforward but somewhat tedious to show
1 0
that the asymptotic variance of this term is H I∞ (λ ).
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 418
√
a.s.
Next consider the second term nDλ0 m̃(θ0 , λ0 ) λ̂ − λ0 . Note that Dλ0 m̃n (θ0 , λ0 ) →
J (λ0 ), so we have
√ √
nDλ0 m̃(θ0 , λ0 ) λ̂ − λ0 = nJ (λ0 ) λ̂ − λ0 , a.s.
√
0 0 a
nJ (λ ) λ̂ − λ ∼ N 0, I (λ0)
Now, combining the results for the first and second terms,
√ 0 a 1 0
nm̃n (θ , λ̂) ∼ N 0, 1 + I (λ )
H
Suppose that Î(λ0) is a consistent estimator of the asymptotic variance-covariance matrix of the moment conditions. This may be complicated if the score generator is a poor approximator, since the individual score contributions may not have mean zero in this case (see the section on QML). Even if this is the case, the individual means can be calculated by simulation, so it is always possible to consistently estimate I(λ0) when the model is simulable. On the other hand, if the score generator is taken to be correctly specified, the ordinary estimator of the information matrix is consistent. Combining this with the result on the efficient GMM weighting matrix in Theorem 25, we see that defining θ̂ as

θ̂ = arg min_Θ mn(θ, λ̂)′ [ (1 + 1/H) Î(λ0) ]−1 mn(θ, λ̂)

gives the EMM estimator with the efficient choice of weighting matrix.
• If one has used the Gallant-Nychka ML estimator as the auxiliary model, the
appropriate weighting matrix is simply the information matrix of the auxil-
iary model, since the scores are uncorrelated. (e.g., it really is ML estimation
asymptotically, since the score generator can approximate the unknown den-
sity arbitrarily well).
19.4.2. Asymptotic distribution. Since we use the optimal weighting matrix, the
asymptotic distribution is as in Equation 15.4.1, so we have (using the result in Equa-
tion 19.4.2):
√n (θ̂ − θ0) →d N{ 0, [ D∞ ((1 + 1/H) I(λ0))−1 D∞′ ]−1 },

where

D∞ = lim_{n→∞} E[ Dθ mn′(θ0, λ0) ].
The asymptotic normality of the moment conditions, combined with the use of the efficient weighting matrix, implies that

n mn(θ̂, λ̂)′ [ (1 + 1/H) Î(λ̂) ]−1 mn(θ̂, λ̂) ∼a χ²(q)
where q is dim(λ) − dim(θ), since without dim(θ) moment conditions the model is
not identified, so testing is impossible. One test of the model is simply based on this
statistic: if it exceeds the χ2 (q) critical point, something may be wrong (the small
sample performance of this sort of test would be a topic worth investigating).
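To make the mechanics concrete, here is a minimal Octave sketch of the statistic. The names mhat (the vector of averaged simulated scores at (θ̂, λ̂)), Ihat (a consistent estimate of I(λ0)), H, n and q are assumed to be already defined; they are illustrative placeholders, not variables from the notes' programs.

  W = inv((1 + 1/H)*Ihat);          # efficient EMM weighting matrix
  J = n * mhat' * W * mhat;         # overidentification statistic
  pval = 1 - gammainc(J/2, q/2);    # chi^2(q) p-value via the regularized incomplete gamma
  printf("EMM overidentification statistic %f, p-value %f\n", J, pval);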
• The individual elements of √n mn(θ̂, λ̂), standardized using the corresponding diagonal elements of (1 + 1/H)Î(λ̂), give pseudo-t statistics that can be used to test which moments are not well modeled. Since these moments are related to parameters of the score generator, which are usually related to certain features of the model, this information can be used to revise the model. These pseudo-t statistics aren't actually distributed as N(0, 1), since √n mn(θ0, λ̂) and √n mn(θ̂, λ̂) have different distributions (that of √n mn(θ̂, λ̂) is somewhat more complicated). It can be shown that the pseudo-t statistics are biased toward nonrejection. See Gourieroux et al. or Gallant and Long, 1995, for more details.
19.5. Examples
εt ∼ N(0, 1)
Indicate these scores by mn(θ, φ̂). Then the system of stochastic differential equations is simulated over θ, and the scores are calculated and averaged over the simulations:

m̃n(θ, φ̂) = (1/N) Σ_{i=1}^N mn^i(θ, φ̂)
ηt ∼ N(0, τ)
n = rows(y);
scores = zeros(n, rows(phi));    # container for moment contributions
reps = columns(rand_draws);      # how many simulations?
for i = 1:reps
  e = rand_draws(:,i);
  y = feval(dgp, theta, x, e, dgpargs);                      # simulated data
  sgdata = [y x];                                            # simulated data for SG
  scores = scores + numgradient(sg, {phi, sgdata, sgargs});  # gradient of SG
endfor
scores = scores / reps;          # average over number of simulations
endfunction
LISTING 19.1
The file emm_example.m performs EMM estimation of the probit model, using a
logit model as the score generator. The results we obtain are
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value 0.281571
Stepsize 0.0279
15 iterations
------------------------------------------------------
======================================================
Model results:
******************************************************
EMM example
It might be interesting to compare the standard errors with those obtained from
ML estimation, to check efficiency of the EMM estimator. One could even do a Monte
Carlo study.
Exercises
the high-level matrix programming (HLMP) languages1 that allow the incorporation
of parallelism into programs written in these languages.
Following are examples of parallel implementations of several mainstream prob-
lems in econometrics. A focus of the examples is on the possibility of hiding paral-
lelization from end users of programs. If programs that run in parallel have an interface
that is nearly identical to the interface of equivalent serial versions, end users will find
it easy to take advantage of parallel computing’s performance. We continue to use
Octave, taking advantage of the MPI Toolbox (MPITB) for Octave, by Fernández
Baldomero et al. (2004). There are also parallel packages for Ox, R, and Python which
may be of interest to econometricians, but as of this writing, the following examples
are the most accessible introduction to parallel programming for econometricians.
This section introduces example problems from econometrics, and shows how they
can be parallelized in a natural way.
20.1.1. Monte Carlo. A Monte Carlo study involves repeating a random exper-
iment many times under identical conditions. Several authors have noted that Monte
Carlo studies are obvious candidates for parallelization (Doornik et al. 2002; Bruche,
2003) since blocks of replications can be done independently on different computers.
To illustrate the parallelization of a Monte Carlo study, we use the same trace test example as do Doornik et al. (2002). tracetest.m is a function that calculates the trace test
statistic for the lack of cointegration of integrated time series. This function is illustra-
tive of the format that we adopt for Monte Carlo simulation of a function: it receives
a single argument of cell type, and it returns a row vector that holds the results of
1By ”high-level matrix programming language” I mean languages such as MATLAB (TM the Math-
works, Inc.), Ox (TM OxMetrics Technologies, Ltd.), and GNU Octave (www.octave.org), for exam-
ple.
one random simulation. The single argument in this case is a cell array that holds the
length of the series in its first position, and the number of series in the second position.
It generates a random result through a process that is internal to the function, and it
reports some output in a row vector (in this case the result is a scalar).
mc_example1.m is an Octave script that executes a Monte Carlo study of the trace
test by repeatedly evaluating the tracetest.m function. The main thing to notice
about this script is that lines 7 and 10 call the function montecarlo.m. When called
with 3 arguments, as in line 7, montecarlo.m executes serially on the computer it is
called from. In line 10, there is a fourth argument. When called with four arguments,
the last argument is the number of slave hosts to use. We see that running the Monte
Carlo study on one or more processors is transparent to the user - he or she must only
indicate the number of slave computers to be used.
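To illustrate the calling format described above, here is a stripped-down serial version of such a wrapper (a sketch only: montecarlo.m itself also handles the parallel case, and the names mc_serial and simfunc are invented for this example, not taken from the notes' code).

  function results = mc_serial(simfunc, simargs, reps)
    # run a simulation function "reps" times; each call takes a single cell
    # argument and returns a row vector of results
    out1 = feval(simfunc, simargs);          # one trial, to learn the output dimension
    results = zeros(reps, columns(out1));
    results(1,:) = out1;
    for i = 2:reps
      results(i,:) = feval(simfunc, simargs);
    endfor
  endfunction

For example, results = mc_serial("tracetest", {1000, 5}, 100) would run 100 replications serially (the argument values here are arbitrary).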
20.1.2. ML. For a sample {(yt, xt), t = 1, ..., n} of n observations of a set of dependent and
explanatory variables, the maximum likelihood estimator of the parameter θ can be
defined as
θ̂ = arg max sn (θ)
where
sn(θ) = (1/n) Σ_{t=1}^n ln f(yt|xt, θ)
Here, yt may be a vector of random variables, and the model may be dynamic since xt
may contain lags of yt . As Swann (2002) points out, this can be broken into sums over
blocks of observations, for example two blocks:
sn(θ) = (1/n) [ Σ_{t=1}^{n1} ln f(yt|xt, θ) + Σ_{t=n1+1}^{n} ln f(yt|xt, θ) ]
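As a tiny numerical illustration (not code from the notes), the following lines compute the average log-likelihood of an N(θ, 1) sample directly and block by block; in a parallel implementation each block sum would be computed on a different node.

  n = 1000; theta = 0.5;
  y = theta + randn(n, 1);                        # simulated sample
  contrib = -0.5*log(2*pi) - 0.5*(y - theta).^2;  # ln f(y_t|theta), t = 1,...,n
  n1 = floor(n/2);                                # split into two blocks
  s_blocks = (sum(contrib(1:n1)) + sum(contrib(n1+1:n)))/n;
  printf("blockwise %f, direct %f\n", s_blocks, mean(contrib));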
20.1.3. GMM. For a sample as above, the GMM estimator of the parameter θ can
be defined as
θ̂ ≡ arg min sn (θ)
Θ
where
sn(θ) = mn(θ)′ Wn mn(θ)
and
mn(θ) = (1/n) Σ_{t=1}^n mt(yt|xt, θ)
Since mn (θ) is an average, it can obviously be computed blockwise, using for example
2 blocks:
(20.1.1)   mn(θ) = (1/n) [ Σ_{t=1}^{n1} mt(yt|xt, θ) + Σ_{t=n1+1}^{n} mt(yt|xt, θ) ]
20.1.4. Kernel regression. The kernel regression estimator of the conditional mean at a point is a weighted average of all of the yt, with weights given by a kernel function of the distance between the evaluation point and each xt, scaled by a bandwidth. We see that the weight depends upon every data point in the sample. To calculate the fit at every point in a sample of size n, on the order of n²k calculations must be done, where k is the dimension of the vector of explanatory variables, x. Racine (2002)
demonstrates that MPI parallelization can be used to speed up calculation of the kernel
regression estimator by calculating the fits for portions of the sample on different com-
puters. We follow this implementation here. kernel_example1.m is a script for serial
and parallel kernel regression. Serial execution is obtained by setting the number of
slaves equal to zero, in line 15. In line 17, a single slave is specified, so execution is in
parallel on the master and slave nodes.
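The following serial sketch (not kernel_example1.m itself) shows the n² structure of the fit, using a Gaussian kernel, a scalar regressor and an arbitrarily chosen bandwidth; the loop over evaluation points is what gets split across nodes in the parallel version.

  n = 500; h = 0.1;
  x = rand(n,1); y = sin(2*pi*x) + 0.1*randn(n,1);  # artificial data
  fit = zeros(n,1);
  for i = 1:n
    w = exp(-0.5*((x(i) - x)/h).^2);   # kernel weights: every data point contributes
    fit(i) = sum(w.*y)/sum(w);         # Nadaraya-Watson fit at x(i)
  endfor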
The example programs show that parallelization may be mostly hidden from end
users. Users can benefit from parallelization without having to write or understand
parallel code. The speedups one can obtain are highly dependent upon the specific
problem at hand, as well as the size of the cluster, the efficiency of the network, etc.
Some examples of speedups are presented in Creel (2005). Figure 20.1.1 reproduces
speedups for some econometric problems on a cluster of 12 desktop computers. The
speedup for k nodes is the time to finish the problem on a single node divided by the
time to finish the problem on k nodes. Note that you can get 10X speedups, as claimed
in the introduction. It’s pretty obvious that much greater speedups could be obtained
using a larger cluster, for the ”embarrassingly parallel” problems.
[Figure 20.1.1: speedups for the Monte Carlo, bootstrap, ML, GMM and kernel regression examples, plotted against the number of compute nodes (up to 12).]
Bibliography
[1] Bruche, M. (2003) A note on embarrassingly parallel computation using OpenMosix and Ox, work-
ing paper, Financial Markets Group, London School of Economics.
[2] Creel, M. (2005) User-friendly parallel computations with econometric examples, Computational
Economics, V. 26, pp. 107-128.
[3] Doornik, J.A., D.F. Hendry and N. Shephard (2002) Computationally-intensive econometrics us-
ing a distributed matrix-programming language, Philosophical Transactions of the Royal Society
of London, Series A, 360, 1245-1266.
[4] Fernández Baldomero, J. (2004) LAM/MPI parallel computing under GNU Octave,
atc.ugr.es/javier-bin/mpitb.
[5] Racine, Jeff (2002) Parallel distributed kernel estimation, Computational Statistics & Data Anal-
ysis, 40, 293-302.
[6] Swann, C.A. (2002) Maximum likelihood estimation using parallel computing: an introduction to
MPI, Computational Economics, 19, 145-178.
CHAPTER 21
21.1. Data
We’ll develop a model for private consumption and real gross private investment.
The data are obtained from the US Bureau of Economic Analysis (BEA) National
Income and Product Accounts (NIPA), Table 11.1.5, Lines 2 and 6 (you can download
quarterly data from 1947-I to the present). The data we use are in the file rbc_data.m.
This data is real (constant dollars).
The program plots.m will make a few plots, including Figures 21.1.1 through 21.1.3.
First looking at the plot for levels, we can see that real consumption and investment are
clearly nonstationary (surprise, surprise). There appears to be somewhat of a structural
change in the mid-1970’s.
Looking at growth rates, the series for consumption has an extended period of high
growth in the 1970’s, becoming more moderate in the 90’s. The volatility of growth of
consumption has declined somewhat, over time. Looking at investment, there are some
notable periods of high volatility in the mid-1970’s and early 1980’s, for example.
Since 1990 or so, volatility seems to have declined.
Economic models of growth often imply that there is no long term growth (!): the data that the models generate are stationary and ergodic, or else need to be passed through the inverse of a filter before they can be compared to the actual data. We'll follow this approach, and
generate stationary business cycle data by applying the bandpass filter of Christiano
and Fitzgerald (1999). The filtered data is in Figure 21.1.3. We’ll try to specify an
economic model that can generate similar data. To get data that look like the levels for
consumption and investment, we’d need to apply the inverse of the bandpass filter.
Consider a very simple stochastic growth model (the same used by Maliar and Maliar (2003), with minor notational difference):

max_{ct, kt}  E0 Σ_{t=0}^∞ βt U(ct)

ct + kt = (1 − δ) kt−1 + φt kt−1^α

log φt = ρ log φt−1 + εt

εt ∼ IIN(0, σ²ε)

it = kt − (1 − δ) kt−1

We would like to estimate the parameters θ = (β, γ, δ, α, ρ, σ²ε)′ using the data that
we have on consumption and investment. This problem is very similar to the GMM
estimation of the portfolio model discussed in Sections 15.11 and 15.12. One can
derive the Euler condition in the same way we did there, and use it to define a GMM
estimator. That approach was not very successful, recall. Now we’ll try to use some
more informative moment conditions to see if we get better results.
Macroeconomic time series data are often modeled using vector autoregressions.
A vector autoregression is just the vector version of an autoregressive model. Let yt be a
G-vector of jointly dependent variables. A VAR(p) model is

yt = c + A1 yt−1 + ··· + Ap yt−p + vt.

The error vt is allowed to be conditionally heteroscedastic: the log of the conditional variance of its jth element follows

log hjt = κj + P(j,.) ( |vt−1| − √(2/π) ) + ℵ(j,.) vt−1 + G(j,.) log ht−1
The variance of the VAR error depends upon its own past, as well as upon the past
realizations of the shocks.
ηt ∼ IIN (0, I2 )
ηt = Rt^{−1} vt
and we know how to calculate Rt and vt , given the data and the parameters. Thus,
it is straightforward to do estimation by maximum likelihood. This will be the score
generator.
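For concreteness, here is a minimal sketch of fitting a VAR(1) by least squares, equation by equation, to artificial data (illustrative only: the reduced form used here also models the conditional variance of vt, which requires ML estimation).

  T = 500; Phi = [0.5 0.1; 0.0 0.4];
  y = zeros(T,2);
  for t = 2:T
    y(t,:) = (Phi*y(t-1,:)')' + 0.01*randn(1,2);   # artificial VAR(1) data
  endfor
  Y = y(2:T,:); X = [ones(T-1,1), y(1:T-1,:)];
  A = X \ Y;        # column j holds (constant; lag coefficients) of equation j
  v = Y - X*A;      # residuals: the raw material for the variance equation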
or

ct = { β Et [ ct+1^{−γ} (1 − δ + α φt+1 kt^{α−1}) ] }^{−1/γ}

The problem is that we cannot solve for ct since we do not know the solution for the expectation in the previous equation.
The parameterized expectations algorithm (PEA: den Haan and Marcet, 1990) is a means of solving the problem. The expectations term is replaced by a parametric function. As long as the parametric function is a flexible enough function of variables that have been realized in period t, there exist parameter values that make the approximation as close to the true expectation as is desired. We will write the approximation

Et [ ct+1^{−γ} (1 − δ + α φt+1 kt^{α−1}) ] ≃ exp(ρ0 + ρ1 log φt + ρ2 log kt−1)
For given values of the parameters of this approximating function, we can solve for ct, and then for kt using the restriction that

ct + kt = (1 − δ) kt−1 + φt kt−1^α

This allows us to generate a series {(ct, kt)}. Then the expectations approximation is updated by fitting

ct+1^{−γ} (1 − δ + α φt+1 kt^{α−1}) = exp(ρ0 + ρ1 log φt + ρ2 log kt−1) + ηt
by nonlinear least squares. The two-step procedure of generating data and updating the parameters of the approximation to expectations is iterated until the parameters no longer change. When this is the case, the expectations function is the best fit to the generated data. As long as it is a rich enough parametric model to encompass the true expectations function, it can be made to be equal to the true expectations function by using a long enough simulation.
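Here is a minimal sketch of the data-generation step of the PEA, for arbitrary trial values of the structural and expectation parameters (the numbers below are placeholders chosen for illustration, not estimates from the notes):

  beta = 0.95; gam = 2; delta = 0.02; alpha = 0.33; rho_phi = 0.9; sig = 0.01;
  rho = [0.1; 0.1; 0.5];                   # trial values of the expectation parameters
  T = 200; k = ones(T,1); c = ones(T,1); logphi = zeros(T,1);
  for t = 2:T
    logphi(t) = rho_phi*logphi(t-1) + sig*randn;
    Ehat = exp(rho(1) + rho(2)*logphi(t) + rho(3)*log(k(t-1)));   # approximate expectation
    c(t) = (beta*Ehat)^(-1/gam);                                  # from the Euler condition
    k(t) = (1-delta)*k(t-1) + exp(logphi(t))*k(t-1)^alpha - c(t); # resource constraint
  endfor

The generated (ct, kt) series would then be used to update the expectation parameters by nonlinear least squares, and the two steps iterated to convergence.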
Thus, given the parameters of the structural model, θ = (β, γ, δ, α, ρ, σ²ε)′, we can generate data {(ct, kt)} using the PEA. From this we can get the series {(ct, it)} using it = kt − (1 − δ) kt−1. This can be used to do EMM estimation using the scores of the reduced form model to define moments, using the simulated data from the structural model.
Bibliography
CHAPTER 22
Introduction to Octave
Why is Octave being used here, since it's not that well-known by econometricians? Well, because it is a high quality environment that is easily extensible, uses well-tested and high performance numerical libraries, is licensed under the GNU GPL (so you can get it for free and modify it if you like), and runs on GNU/Linux, Mac OS X and Windows systems. It's also quite easy to learn.
Get the bootable CD, as was described in Section 1.3. Then burn the image, and
boot your computer with it. This will give you this same PDF file, but with all of the
example programs ready to run. The editor is configured with a macro to execute the
programs using Octave, which is of course installed. From this point, I assume you
are running the CD (or sitting in the computer room across the hall from my office),
or that you have configured your computer to be able to run the *.m files mentioned
below.
The objective of this introduction is to learn just the basics of Octave. There are
other ways to use Octave, which I encourage you to explore. These are just some
rudiments. After this, you can look at the example programs scattered throughout the
document (and edit them, and run them) to learn more about how Octave can be used
to do econometrics. Students of mine: your problem sets will include exercises that
can be done by modifying the example programs in relatively minor ways. So study
the examples!
Octave can be used interactively, or it can be used to run programs that are written
using a text editor. We’ll use this second method, preparing programs with NEdit, and
calling Octave from within the editor. The program first.m gets us started. To run this,
open it up with NEdit (by finding the correct file inside the /home/knoppix/Desktop/Econometrics
folder and clicking on the icon) and then type CTRL-ALT-o, or use the Octave item in
the Shell menu (see Figure 22.2.1).
Note that the output is not formatted in a pleasing way. That’s because printf()
doesn’t automatically start a new line. Edit first.m so that the 8th line reads ”printf(”hello
world\n”);” and re-run the program.
We need to know how to load and save data. The program second.m shows how.
Once you have run this, you will find the file "x" in the directory Econometrics/Include/OctaveIntro/. You might have a look at it with NEdit to see Octave's default format for saving data.
Basically, if you have data in an ASCII text file, named for example ”myfile.data”,
formed of numbers separated by spaces, just use the command ”load myfile.data”.
After having done so, the matrix ”myfile” (without extension) will contain the data.
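For example (this is not second.m, just a minimal illustration), the following commands create a matrix, save it as ASCII text and load it back into a variable named after the file:

  x = rand(5, 3);
  save -ascii mydata.txt x     # write x to an ASCII file
  clear x
  load mydata.txt              # creates the matrix "mydata"
  disp(mydata)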
Please have a look at CommonOperations.m for examples of how to do some basic
things in Octave. Now that we’re done with the basics, have a look at the Octave
programs that are included as examples. If you are looking at the browsable PDF
version of this document, then you should be able to click on links to open them.
If not, the example programs are available here and the support files needed to run
these are available here. Those pages will allow you to examine individual files, out of
context. To actually use these files (edit and run them), you should go to the home page
of this document, since you will probably want to download the pdf version together
with all the support files and examples. Or get the bootable CD.
There are some other resources for doing econometrics with Octave. You might
like to check the article Econometrics with Octave and the Econometrics Toolbox ,
which is for Matlab, but much of which could be easily used with Octave.
22.3. If you're running a Linux installation...

Then to get the same behavior as found on the CD, you need to:
• Get the collection of support programs and the examples, from the document
home page.
• Put them somewhere, and tell Octave how to find them, e.g., by putting a link
to the MyOctaveFiles directory in /usr/local/share/octave/site-m
• Make sure nedit is installed and configured to run Octave and use syntax
highlighting. Copy the file /home/econometrics/.nedit from the CD to do
this. Or, get the file NeditConfiguration and save it in your $HOME directory
with the name ”.nedit”. Not to put too fine a point on it, please note that
there is a period in that name.
• Associate *.m files with NEdit so that they open up in the editor when you
click on them. That should do it.
CHAPTER 23
• All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and errors is much appreciated). For example, if xt is a p × 1 vector, xt′ is a 1 × p vector. When I refer to a p-vector, I mean a column vector.
[3, Chapter 1]
Let s(·) : ℜp → ℜ be a real valued function of the p-vector θ. Then ∂s(θ)/∂θ is organized as a p-vector,

∂s(θ)/∂θ = [ ∂s(θ)/∂θ1, ∂s(θ)/∂θ2, ..., ∂s(θ)/∂θp ]′
Following this convention, ∂s(θ)/∂θ′ is a 1 × p vector, and ∂²s(θ)/∂θ∂θ′ is a p × p matrix. Also,

∂²s(θ)/∂θ∂θ′ = ∂/∂θ ( ∂s(θ)/∂θ′ ) = ∂/∂θ′ ( ∂s(θ)/∂θ ).
EXERCISE 33. For a and x both p-vectors, show that ∂(a′x)/∂x = a.

EXERCISE 34. For A a p × p matrix and x a p × 1 vector, show that ∂(x′Ax)/∂x = (A + A′)x.

has dimension n × r.

EXERCISE 35. For x and β both p × 1 vectors, show that ∂exp(x′β)/∂β = exp(x′β)x.
DEFINITION 36. A sequence is a mapping from the natural numbers {1, 2, ...} = {n}_{n=1}^∞ = {n} to some other set, so that the set is ordered according to the natural numbers associated with its elements.

Real-valued sequences:
It's important to note that Nεω depends upon ω, so that convergence may be much more rapid for certain ω than for others. Uniform convergence requires a similar rate of convergence throughout Ω.

(insert a diagram here showing the envelope around f(ω) in which fn(ω) must lie)
a collection of such mappings, i.e., each Xn (ω) is a random variable with respect to the
probability space (Ω, F , P). For example, given the model Y = Xβ0 + ε, the OLS estimator β̂n = (X′X)−1 X′Y, where n is the sample size, can be used to form a sequence of random vectors {β̂n}. A number of modes of convergence are in use when deal-
ing with sequences of random variables. Several such modes of convergence should
already be familiar:
Convergence in probability is written as Xn →p X, or plim Xn = X.

Almost sure convergence: {Xn(ω)} converges almost surely to X(ω) if lim_{n→∞} Xn(ω) = X(ω) for all ω ∈ A, where A ⊂ Ω is a set such that P(A) = 1. In other words, Xn(ω) → X(ω) (ordinary convergence of the two functions) except on a set C = Ω − A such that P(C) = 0. Almost sure convergence is written as Xn →a.s. X, or Xn → X, a.s. One can show that

Xn →a.s. X ⇒ Xn →p X.
Convergence in distribution is written as Xn →d X. It can be shown that convergence in probability implies convergence in distribution.
DEFINITION 43. [Uniform almost sure convergence] {Xn(ω, θ)} converges uniformly almost surely in Θ to X(ω, θ) if

lim_{n→∞} sup_{θ∈Θ} |Xn(ω, θ) − X(ω, θ)| = 0, (a.s.)

Implicit is the assumption that all Xn(ω, θ) and X(ω, θ) are random variables w.r.t. (Ω, F , P) for all θ ∈ Θ. We'll indicate uniform almost sure convergence by →u.a.s. and uniform convergence in probability by →u.p.

• An equivalent definition, based on the fact that "almost sure" means "with probability one" is

Pr( lim_{n→∞} sup_{θ∈Θ} |Xn(ω, θ) − X(ω, θ)| = 0 ) = 1
This has a form similar to that of the definition of a.s. convergence - the
essential difference is the addition of the sup.
It’s often useful to have notation for the relative magnitudes of quantities. Quanti-
ties that are small relative to others can often be ignored, which simplifies analysis.
DEFINITION 44. [Little-o] Let f(n) and g(n) be two real-valued functions. The notation f(n) = o(g(n)) means lim_{n→∞} f(n)/g(n) = 0.

DEFINITION 45. [Big-O] Let f(n) and g(n) be two real-valued functions. The notation f(n) = O(g(n)) means there exists some N such that for n > N, |f(n)/g(n)| < K, where K is a finite constant.

This definition doesn't require that f(n)/g(n) have a limit (it may fluctuate boundedly).
If {fn} and {gn} are sequences of random variables, analogous definitions are

DEFINITION 46. The notation f(n) = op(g(n)) means f(n)/g(n) →p 0.
EXAMPLE 47. The least squares estimator θ̂ = (X′X)−1X′Y = (X′X)−1X′(Xθ0 + ε) = θ0 + (X′X)−1X′ε. Since plim (X′X)−1X′ε / 1 = 0, we can write (X′X)−1X′ε = op(1) and θ̂ = θ0 + op(1). Asymptotically, the term op(1) is negligible. This is just a way of indicating that the LS estimator is consistent.
DEFINITION 48. The notation f(n) = Op(g(n)) means there exists some Nε such that for ε > 0 and all n > Nε,

P( |f(n)/g(n)| < Kε ) > 1 − ε,

where Kε is a finite constant.
Useful rules:
• O p (n p )O p (nq ) = O p (n p+q )
• o p (n p )o p (nq ) = o p (n p+q )
EXAMPLE 50. Consider a random sample of iid r.v.'s with mean 0 and variance σ². The estimator of the mean θ̂ = 1/n Σ_{i=1}^n xi is asymptotically normally distributed, e.g., n^{1/2} θ̂ ∼A N(0, σ²). So n^{1/2} θ̂ = Op(1), so θ̂ = Op(n^{−1/2}). Before we had θ̂ = op(1), now we have the stronger result that relates the rate of convergence to the sample size.
EXAMPLE 51. Now consider a random sample of iid r.v.'s with mean µ and variance σ². The estimator of the mean θ̂ = 1/n Σ_{i=1}^n xi is asymptotically normally distributed, e.g., n^{1/2}(θ̂ − µ) ∼A N(0, σ²). So n^{1/2}(θ̂ − µ) = Op(1), so θ̂ − µ = Op(n^{−1/2}), so θ̂ = Op(1).
These two examples show that averages of centered (mean zero) quantities typi-
cally have plim 0, while averages of uncentered quantities have finite nonzero plims.
Note that the definition of O p does not mean that f (n) and g(n) are of the same order.
Asymptotic equality ensures that this is the case.
DEFINITION 52. Two sequences of random variables {fn} and {gn} are asymptotically equal (written fn =a gn) if

plim f(n)/g(n) = 1
Finally, analogous almost sure versions of o p and O p are defined in the obvious
way.
Exercises
(1) For a and x both p × 1 vectors, show that ∂(a′x)/∂x = a.
(2) For A a p × p matrix and x a p × 1 vector, show that ∂(x′Ax)/∂x = (A + A′)x.
(3) For x and β both p × 1 vectors, show that Dβ exp(x′β) = exp(x′β)x.
(4) For x and β both p × 1 vectors, find the analytic expression for D²β exp(x′β).
(5) Write an Octave program that verifies each of the previous results by taking nu-
meric derivatives. For a hint, type help numgradient and help numhessian
inside octave.
CHAPTER 24
The GPL
This document and the associated examples and materials are copyright Michael
Creel, under the terms of the GNU General Public License. This license follows:
GNU GENERAL PUBLIC LICENSE Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place, Suite
330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute ver-
batim copies of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your freedom to share and
change it. By contrast, the GNU General Public License is intended to guarantee your
freedom to share and change free software–to make sure the software is free for all its
users. This General Public License applies to most of the Free Software Foundation’s
software and to any other program whose authors commit to using it. (Some other Free
Software Foundation software is covered by the GNU Library General Public License
instead.) You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our Gen-
eral Public Licenses are designed to make sure that you have the freedom to distribute
copies of free software (and charge for this service if you wish), that you receive source
code or can get it if you want it, that you can change the software or use pieces of it in
new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny you
these rights or to ask you to surrender the rights. These restrictions translate to certain
responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or for a fee,
you must give the recipients all the rights that you have. You must make sure that they,
too, receive or can get the source code. And you must show them these terms so they
know their rights.
We protect your rights with two steps: (1) copyright the software, and (2) offer
you this license which gives you legal permission to copy, distribute and/or modify the
software.
Also, for each author’s protection and ours, we want to make certain that everyone
understands that there is no warranty for this free software. If the software is modified
by someone else and passed on, we want its recipients to know that what they have
is not the original, so that any problems introduced by others will not reflect on the
original authors’ reputations.
Finally, any free program is threatened constantly by software patents. We wish to
avoid the danger that redistributors of a free program will individually obtain patent
licenses, in effect making the program proprietary. To prevent this, we have made it
clear that any patent must be licensed for everyone’s free use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.
GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPY-
ING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains a notice
placed by the copyright holder saying it may be distributed under the terms of this
General Public License. The "Program", below, refers to any such program or work,
and a "work based on the Program" means either the Program or any derivative work
under copyright law: that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another language. (Here-
inafter, translation is included without limitation in the term "modification".) Each
licensee is addressed as "you".
Activities other than copying, distribution and modification are not covered by this
License; they are outside its scope. The act of running the Program is not restricted,
and the output from the Program is covered only if its contents constitute a work based
on the Program (independent of having been made by running the Program). Whether
that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program’s source code
as you receive it, in any medium, provided that you conspicuously and appropriately
publish on each copy an appropriate copyright notice and disclaimer of warranty; keep
intact all the notices that refer to this License and to the absence of any warranty; and
give any other recipients of the Program a copy of this License along with the Program.
You may charge a fee for the physical act of transferring a copy, and you may at
your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion of it, thus
forming a work based on the Program, and copy and distribute such modifications or
work under the terms of Section 1 above, provided that you also meet all of these
conditions:
a) You must cause the modified files to carry prominent notices stating that you
changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in whole or in part
contains or is derived from the Program or any part thereof, to be licensed as a whole
at no charge to all third parties under the terms of this License.
c) If the modified program normally reads commands interactively when run, you
must cause it, when started running for such interactive use in the most ordinary way,
to print or display an announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide a warranty) and that
users may redistribute the program under these conditions, and telling the user how to
view a copy of this License. (Exception: if the Program itself is interactive but does not
normally print such an announcement, your work based on the Program is not required
to print an announcement.)
These requirements apply to the modified work as a whole. If identifiable sections
of that work are not derived from the Program, and can be reasonably considered
independent and separate works in themselves, then this License, and its terms, do not
apply to those sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based on the Program,
the distribution of the whole must be on the terms of this License, whose permissions
for other licensees extend to the entire whole, and thus to each and every part regardless
of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights to
work written entirely by you; rather, the intent is to exercise the right to control the
distribution of derivative or collective works based on the Program.
In addition, mere aggregation of another work not based on the Program with the
Program (or with a work based on the Program) on a volume of a storage or distribution
medium does not bring the other work under the scope of this License.
3. You may copy and distribute the Program (or a work based on it, under Section
2) in object code or executable form under the terms of Sections 1 and 2 above provided
that you also do one of the following:
4. You may not copy, modify, sublicense, or distribute the Program except as
expressly provided under this License. Any attempt otherwise to copy, modify, sub-
license or distribute the Program is void, and will automatically terminate your rights
under this License. However, parties who have received copies, or rights, from you un-
der this License will not have their licenses terminated so long as such parties remain
in full compliance.
5. You are not required to accept this License, since you have not signed it. How-
ever, nothing else grants you permission to modify or distribute the Program or its
derivative works. These actions are prohibited by law if you do not accept this Li-
cense. Therefore, by modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and all its terms and
conditions for copying, distributing or modifying the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the Program), the
recipient automatically receives a license from the original licensor to copy, distribute
or modify the Program subject to these terms and conditions. You may not impose any
further restrictions on the recipients’ exercise of the rights granted herein. You are not
responsible for enforcing compliance by third parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringement
or for any other reason (not limited to patent issues), conditions are imposed on you
(whether by court order, agreement or otherwise) that contradict the conditions of this
License, they do not excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this License and any
other pertinent obligations, then as a consequence you may not distribute the Program
at all. For example, if a patent license would not permit royalty-free redistribution of
the Program by all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to refrain entirely
from distribution of the Program.
If any portion of this section is held invalid or unenforceable under any particular
circumstance, the balance of the section is intended to apply and the section as a whole
is intended to apply in other circumstances.
It is not the purpose of this section to induce you to infringe any patents or other
property right claims or to contest validity of any such claims; this section has the sole
purpose of protecting the integrity of the free software distribution system, which is
implemented by public license practices. Many people have made generous contri-
butions to the wide range of software distributed through that system in reliance on
consistent application of that system; it is up to the author/donor to decide if he or she
is willing to distribute software through any other system and a licensee cannot impose
that choice.
This section is intended to make thoroughly clear what is believed to be a conse-
quence of the rest of this License. 8. If the distribution and/or use of the Program is
restricted in certain countries either by patents or by copyrighted interfaces, the orig-
inal copyright holder who places the Program under this License may add an explicit
geographical distribution limitation excluding those countries, so that distribution is
permitted only in or among countries not thus excluded. In such case, this License
incorporates the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions of the
General Public License from time to time. Such new versions will be similar in spirit
to the present version, but may differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Program specifies a
version number of this License which applies to it and "any later version", you have
the option of following the terms and conditions either of that version or of any later
version published by the Free Software Foundation. If the Program does not specify
a version number of this License, you may choose any version ever published by the
Free Software Foundation.
10. If you wish to incorporate parts of the Program into other free programs whose
distribution conditions are different, write to the author to ask for permission. For
software which is copyrighted by the Free Software Foundation, write to the Free
Software Foundation; we sometimes make exceptions for this. Our decision will be
guided by the two goals of preserving the free status of all derivatives of our free
software and of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE
IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE
COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM
"AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IM-
PLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE EN-
TIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED
TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY
WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMIT-
TED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GEN-
ERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it starts in
an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision comes
with ABSOLUTELY NO WARRANTY; for details type ‘show w’. This is free soft-
ware, and you are welcome to redistribute it under certain conditions; type ‘show c’
for details.
The hypothetical commands ‘show w’ and ‘show c’ should show the appropriate
parts of the General Public License. Of course, the commands you use may be called
something other than ‘show w’ and ‘show c’; they could even be mouse-clicks or menu
items–whatever suits your program.
You should also get your employer (if you work as a programmer) or your school,
if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample;
alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program ‘Gnomovi-
sion’ (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program into pro-
prietary programs. If your program is a subroutine library, you may consider it more
useful to permit linking proprietary applications with the library. If this is what you
want to do, use the GNU Library General Public License instead of this License.
CHAPTER 25
The attic
This holds material that is not really ready to be incorporated into the main body,
but that I don’t want to lose. Basically, ignore it, unless you’d like to help get it ready
for inclusion.
Returning to the Poisson model, let's look at actual and fitted count probabilities. Actual relative frequencies are f(y = j) = Σi 1(yi = j)/n and fitted frequencies are f̂(y = j) = Σ_{i=1}^n fY(j|xi, θ̂)/n. We see that for the OBDV measure, there are many more actual zeros than predicted. For ERV, there are somewhat more actual zeros than fitted, but the difference is not too important.
Why might OBDV not fit the zeros well? Perhaps the decision process has two stages: the patient decides whether to contact the doctor for a first visit when sick, and then the doctor decides whether or not follow-up visits are needed. This is a principal/agent type situation, where the total number of visits depends upon the decisions of both the patient and the doctor. Since
different parameters may govern the two decision-makers' choices, we might expect that different parameters govern the probability of zeros versus the other counts. Let λp be the parameter of the patient's demand for visits, and let λd be the parameter of the doctor's "demand" for visits. The patient will initiate visits according to a discrete choice model, for example, a logit model:
The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for
the observations where visits are positive, a truncated Poisson density is estimated.
This density is
fY(y, λd | y > 0) = fY(y, λd) / Pr(y > 0) = fY(y, λd) / (1 − exp(−λd)),

since Pr(y = 0) = exp(−λd) λd^0 / 0! = exp(−λd).
Since the hurdle and truncated components of the overall density for Y share no parameters, they may be estimated separately, which is computationally more efficient than estimating the overall model. (Recall that the BFGS algorithm, for example, will have to invert the approximated Hessian. The computational overhead is of order K², where K is the number of parameters, so splitting the estimation into two smaller problems reduces the cost.)
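As an illustration of the second component, here is a sketch of the log-density contributions of the zero-truncated Poisson for the positive-count observations (the function name and the parameterization λd = exp(xβ) are assumptions made for this example, not the notes' code):

  function contrib = lntruncpoisson(beta, y, x)
    # log-density of a Poisson truncated at zero: ln f_Y(y, lambda) - ln Pr(y > 0)
    lambda = exp(x*beta);          # conditional mean, lambda_d = exp(x*beta)
    contrib = -lambda + y.*log(lambda) - lgamma(y+1) - log(1 - exp(-lambda));
  endfunction

Summing contrib over the observations with y > 0 and maximizing over β gives the truncated-Poisson part of the hurdle model.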
Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program
**************************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500
Function value -0.58939
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant -1.5502 -2.5709 -2.5269 -2.5560
pub_ins 1.0519 3.0520 3.0027 3.0384
priv_ins 0.45867 1.7289 1.6924 1.7166
sex 0.63570 3.0873 3.1677 3.1366
age 0.018614 2.1547 2.1969 2.1807
educ 0.039606 1.0467 0.98710 1.0222
inc 0.077446 1.7655 2.1672 1.9601
Information Criteria
Consistent Akaike
639.89
Schwartz
632.89
Hannan-Quinn
614.96
Akaike
603.39
**************************************************************************
Fitted and actual probabilities (NB-II fits are provided as well) are:
TABLE 2. Actual and Hurdle Poisson fitted frequencies
For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit is not
so good. Zeros are exact, but 1’s and 2’s are underestimated, and higher counts are
overestimated. For the NB-II fits, performance is at least as good as the hurdle Poisson
model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.
25.1.1. Finite mixture models. The following are results for a mixture of 2 neg-
ative binomial (NB-I) models, for the OBDV data, which you can replicate using
this estimation program
**************************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value -2.2312
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant 0.64852 1.3851 1.3226 1.4358
pub_ins -0.062139 -0.23188 -0.13802 -0.18729
priv_ins 0.093396 0.46948 0.33046 0.40854
sex 0.39785 2.6121 2.2148 2.4882
age 0.015969 2.5173 2.5475 2.7151
educ -0.049175 -1.8013 -1.7061 -1.8036
inc 0.015880 0.58386 0.76782 0.73281
ln_alpha 0.69961 2.3456 2.0396 2.4029
constant -3.6130 -1.6126 -1.7365 -1.8411
pub_ins 2.3456 1.7527 3.7677 2.6519
priv_ins 0.77431 0.73854 1.1366 0.97338
sex 0.34886 0.80035 0.74016 0.81892
age 0.021425 1.1354 1.3032 1.3387
educ 0.22461 2.0922 1.7826 2.1470
inc 0.019227 0.20453 0.40854 0.36313
ln_alpha 2.8419 6.2497 6.8702 7.6182
logit_inv_mix 0.85186 1.7096 1.4827 1.7883
Information Criteria
Consistent Akaike
2353.8
Schwartz
2336.8
Hannan-Quinn
2293.3
Akaike
2265.2
**************************************************************************
Delta method for mix parameter st. err.
mix se_mix
0.70096 0.12043
• The 95% confidence interval for the mix parameter is perilously close to 1,
which suggests that there may really be only one component density, rather
than a mixture. Again, this is not the way to test this - it is merely suggestive.
• Education is interesting. For the subpopulation that is “healthy”, i.e., that
makes relatively few visits, education seems to have a positive effect on visits.
For the “unhealthy” group, education has a negative effect on visits. The other
results are more mixed. A larger sample could help clarify things.
The following are results for a 2 component constrained mixture negative binomial model where all the slope parameters in λj = exp(xβj) are the same across the two components. The constants and the overdispersion parameters αj are allowed to differ for the two components.
**************************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500
Function value -2.2441
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant -0.34153 -0.94203 -0.91456 -0.97943
pub_ins 0.45320 2.6206 2.5088 2.7067
priv_ins 0.20663 1.4258 1.3105 1.3895
sex 0.37714 3.1948 3.4929 3.5319
age 0.015822 3.1212 3.7806 3.7042
educ 0.011784 0.65887 0.50362 0.58331
inc 0.014088 0.69088 0.96831 0.83408
ln_alpha 1.1798 4.6140 7.2462 6.4293
const_2 1.2621 0.47525 2.5219 1.5060
lnalpha_2 2.7769 1.5539 6.4918 4.2243
logit_inv_mix 2.4888 0.60073 3.7224 1.9693
Information Criteria
Consistent Akaike
2323.5
Schwartz
2312.5
Hannan-Quinn
2284.3
Akaike
2266.1
**************************************************************************
Delta method for mix parameter st. err.
mix se_mix
0.92335 0.047318
25.2. Models for time series data

This section can be ignored in its present form. Just left in to form a basis for completion (by someone else?!) at some point.
Hamilton, Time Series Analysis is a good reference for this section. This is very
incomplete and contributions would be very welcome.
Up to now we’ve considered the behavior of the dependent variable yt as a function
of other variables xt . These variables can of course contain lagged dependent variables,
e.g., xt = (wt , yt−1 , ..., yt− j ). Pure time series methods consider the behavior of yt as
a function only of its own lagged values, unconditional on other observable variables.
One can think of this as modeling the behavior of yt after marginalizing out all other
variables. While it’s not immediately clear why a model that has other explanatory
variables should marginalize to a linear in the parameters time series model, most time
series work is done with linear models, though nonlinear time series is also a large and
growing field. We’ll stick with linear time series models.
(25.2.1)   {Yt}_{t=−∞}^{∞}

(25.2.2)   {yt}_{t=1}^{n}

where µt = E(yt).

µt = µ, ∀t

γjt = γj, ∀t

As we've seen, this implies that γj = γ−j: the autocovariances depend only on the interval between observations, not on the time of the observations.
(25.2.4)   (1/n) Σ_{t=1}^n yt →p µ

This implies that the autocovariances die off, so that the yt are not so strongly dependent that they don't satisfy a LLN.

(25.2.5)   ρj = γj / γ0
DEFINITION 60 (White noise). White noise is just the time series literature term for a classical error. εt is white noise if i) E(εt) = 0, ∀t, ii) V(εt) = σ², ∀t, and iii) εt and εs are independent, t ≠ s. Gaussian white noise just adds a normality assumption.
25.2.2. ARMA models. With these concepts, we can discuss ARMA models.
These are closely related to the AR and MA error processes that we’ve already dis-
cussed. The main difference is that the lhs variable is observed directly now.
MA(q) processes. A qth order moving average (MA) process is

yt = µ + εt + θ1 εt−1 + θ2 εt−2 + ··· + θq εt−q,

where εt is white noise. The variance is

γ0 = E(yt − µ)² = E(εt + θ1 εt−1 + θ2 εt−2 + ··· + θq εt−q)² = σ²(1 + θ1² + θ2² + ··· + θq²)

Similarly, the autocovariances are

γj = σ²(θj + θj+1 θ1 + θj+2 θ2 + ··· + θq θq−j), j ≤ q
   = 0, j > q
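A quick simulation check of these formulas for an MA(1) (a sketch, not code from the notes):

  n = 50000; theta = 0.5; sig = 1;
  e = sig*randn(n+1, 1);
  y = e(2:n+1) + theta*e(1:n);        # MA(1), mu = 0
  g0 = mean(y.^2);                    # approx. sig^2*(1 + theta^2)
  g1 = mean(y(2:n).*y(1:n-1));        # approx. sig^2*theta
  g2 = mean(y(3:n).*y(1:n-2));        # approx. 0, since j > q
  printf("g0 = %f, g1 = %f, g2 = %f\n", g0, g1, g2);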
AR(p) processes. An AR(p) process can be written as

yt = c + φ1 yt−1 + φ2 yt−2 + ··· + φp yt−p + εt.

The dynamic behavior of an AR(p) process can be studied by writing this pth order difference equation as a vector first order difference equation,

Yt = C + F Yt−1 + Et,

where Yt = (yt, yt−1, ..., yt−p+1)′, C = (c, 0, ..., 0)′, Et = (εt, 0, ..., 0)′, and F is the p × p matrix with first row (φ1, φ2, ..., φp), ones on the first subdiagonal, and zeros elsewhere. Substituting recursively,

Yt+1 = C + F Yt + Et+1 = C + F(C + F Yt−1 + Et) + Et+1,

and so on. In general, the effect on yt+j of a shock in period t is

∂yt+j / ∂εt = [F^j]_{(1,1)}.
If the system is to be stationary, then as we move forward in time this impact must
die off. Otherwise a shock causes a permanent change in the mean of yt . Therefore,
stationarity requires that
lim_{j→∞} [F^j]_{(1,1)} = 0
Consider the eigenvalues of the matrix F. These are the values λ such that

|F − λIP| = 0

The determinant here can be expressed as a polynomial. For example, for p = 1, the matrix F is simply F = φ1, so |φ1 − λ| = 0 can be written as φ1 − λ = 0. When p = 2, the matrix F is

F = [ φ1  φ2 ; 1  0 ]

so

|F − λIP| = λ² − λφ1 − φ2

and the eigenvalues are the roots of λ² − λφ1 − φ2 = 0, which can be found using the quadratic formula. This generalizes. For a pth order AR process, the eigenvalues are the roots of

λ^p − λ^{p−1}φ1 − λ^{p−2}φ2 − ··· − λφ_{p−1} − φp = 0
Supposing that all of the roots of this polynomial are distinct, then the matrix F can be
factored as
F = T ΛT −1
where T is the matrix which has as its columns the eigenvectors of F, and Λ is a
diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition,
we can write
F^j = (T Λ T−1)(T Λ T−1) ··· (T Λ T−1) = T Λ^j T−1,

and Λ^j is the diagonal matrix with λ1^j, λ2^j, ..., λp^j on the main diagonal.
Supposing that the λi, i = 1, 2, ..., p are all real valued, it is clear that

lim_{j→∞} [F^j]_{(1,1)} = 0

requires that

|λi| < 1, i = 1, 2, ..., p.
• It may be the case that some eigenvalues are complex-valued. The previous result generalizes to the requirement that the eigenvalues be less than one in modulus, where the modulus of a complex number a + bi is

mod(a + bi) = √(a² + b²)

This leads to the famous statement that "stationarity requires the roots of the determinantal polynomial to lie inside the complex unit circle." draw picture here.
• When there are roots on the unit circle (unit roots) or outside the unit circle, we leave the world of stationary processes.
• Dynamic multipliers: ∂yt+j/∂εt = [F^j]_{(1,1)} is a dynamic multiplier or an impulse-response function. Real eigenvalues lead to steady movements, whereas complex eigenvalues lead to oscillatory behavior. Of course, when there are multiple eigenvalues the overall effect can be a mixture. pictures
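A small numerical check of this stationarity condition, using the companion matrix F (a sketch with arbitrarily chosen coefficients, not code from the notes):

  phi = [1.1; -0.3];                      # AR(2) coefficients phi_1, phi_2
  p = rows(phi);
  F = [phi'; eye(p-1), zeros(p-1, 1)];    # companion matrix of the first order vector form
  lambda = eig(F);
  printf("largest eigenvalue modulus: %f\n", max(abs(lambda)));
  # the process is stationary if this is strictly less than one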
Invertibility of AR process
To begin with, define the lag operator L by

L yt = yt−1.

Its powers work in the obvious way:

L² yt = L(L yt) = L yt−1 = yt−2,

and, in general, L^j yt = yt−j. Using this operator, the AR(p) process (taking the mean to be zero for simplicity) can be written as
yt (1 − φ1 L − φ2 L2 − · · · − φ p L p ) = εt
Factor this polynomial as

1 − φ1 L − φ2 L² − ··· − φp L^p = (1 − λ1 L)(1 − λ2 L) ··· (1 − λp L)
For the moment, just assume that the λi are coefficients to be determined. Since L is defined to operate as an algebraic quantity, determination of the λi is the same as determination of the λi such that the following two expressions are the same for all z:

1 − φ1 z − φ2 z² − ··· − φp z^p = (1 − λ1 z)(1 − λ2 z) ··· (1 − λp z)
The LHS is precisely the determinantal polynomial that gives the eigenvalues of F.
Therefore, the λi that are the coefficients of the factorization are simply the eigenvalues
of the matrix F.
Now consider a different stationary process

(1 − φL) yt = εt.

Multiply both sides by 1 + φL + φ²L² + ... + φ^j L^j to get

(1 + φL + φ²L² + ... + φ^j L^j)(1 − φL) yt = (1 + φL + φ²L² + ... + φ^j L^j) εt

or, multiplying out the polynomial on the LHS,

(1 + φL + φ²L² + ... + φ^j L^j − φL − φ²L² − ... − φ^j L^j − φ^{j+1}L^{j+1}) yt = (1 + φL + φ²L² + ... + φ^j L^j) εt

so that

(1 − φ^{j+1}L^{j+1}) yt = (1 + φL + φ²L² + ... + φ^j L^j) εt

so

yt = φ^{j+1}L^{j+1} yt + (1 + φL + φ²L² + ... + φ^j L^j) εt
Given |φ| < 1 and stationarity of yt, the term φ^{j+1}L^{j+1} yt becomes negligible as j grows, so

yt ≅ (1 + φL + φ²L² + ... + φ^j L^j) εt

and the approximation becomes better and better as j increases. However, we started with

(1 − φL) yt = εt.

Substituting this into the previous expression, we have

yt ≅ (1 + φL + φ²L² + ... + φ^j L^j)(1 − φL) yt

so

(1 + φL + φ²L² + ... + φ^j L^j)(1 − φL) ≅ 1

and the approximation becomes exact as j → ∞: the operator (1 − φL) is invertible, with inverse Σ_{j=0}^∞ φ^j L^j. Applying the same reasoning to the AR(p) process,
yt (1 − φ1 L − φ2 L2 − · · · − φ p L p ) = εt
yt (1 − λ1 L)(1 − λ2 L) · · · (1 − λ p L) = εt
where the λ are the eigenvalues of F, and given stationarity, all the |λi | < 1. Therefore,
we can invert each first order polynomial on the LHS to get
yt = ( Σ_{j=0}^∞ λ1^j L^j )( Σ_{j=0}^∞ λ2^j L^j ) ··· ( Σ_{j=0}^∞ λp^j L^j ) εt
The RHS is a product of infinite-order polynomials in L, which can be written as

yt = (1 + ψ1 L + ψ2 L² + ···) εt
• The ψi are formed of products of powers of the λi , which are in turn functions
of the φi .
• The ψi are real-valued because any complex-valued λi always occur in con-
jugate pairs. This means that if a + bi is an eigenvalue of F, then so is a − bi.
In multiplication

(a + bi)(a − bi) = a² − abi + abi − b²i² = a² + b²

which is real-valued.
• This shows that an AR(p) process is representable as an infinite-order MA(q)
process.
• Recall before that by recursive substitution, an AR(p) process can be written as

Yt+j = C + FC + F²C + ··· + F^jC + F^{j+1}Yt−1 + F^jEt + F^{j−1}Et+1 + ··· + FEt+j−1 + Et+j

If the process is mean zero, then everything with a C drops out. Take this and lag it by j periods to get

Yt = F^{j+1}Yt−j−1 + F^jEt−j + F^{j−1}Et−j+1 + ··· + FEt−1 + Et

As j → ∞, the lagged Y on the RHS drops out. The Et−s are vectors of zeros except for their first element, so we see that the first equation here, in the
limit, is just
yt = Σ_{j=0}^∞ [F^j]_{(1,1)} εt−j
which makes explicit the relationship between the ψi and the φi (and the λi as
well, recalling the previous factorization of F j ).
Moments of an AR(p) process. Assuming stationarity and taking expectations of both sides of the AR(p) equation, the mean satisfies

µ = c + φ1µ + φ2µ + ... + φpµ

so

µ = c / (1 − φ1 − φ2 − ... − φp)

and

c = µ − φ1µ − ... − φpµ,

so the process can be written in deviations from the mean as

yt − µ = φ1(yt−1 − µ) + φ2(yt−2 − µ) + ... + φp(yt−p − µ) + εt.

With this, the second moments are easy to find: The variance is

γ0 = φ1γ1 + φ2γ2 + ... + φpγp + σ²

and the autocovariances for j > 0 satisfy

γj = E[(yt − µ)(yt−j − µ)] = E[(φ1(yt−1 − µ) + φ2(yt−2 − µ) + ... + φp(yt−p − µ) + εt)(yt−j − µ)] = φ1γj−1 + φ2γj−2 + ... + φpγj−p

Using the fact that γ−j = γj, one can take the p + 1 equations for j = 0, 1, ..., p, which have p + 1 unknowns (σ², γ0, γ1, ..., γp) and solve for the unknowns. With these, the γj for j > p can be solved for recursively.
Invertibility of MA(q) process. An MA(q) can be written as

yt − µ = (1 + θ1L + ... + θqL^q) εt.

As before, the polynomial on the RHS can be factored as

1 + θ1L + ... + θqL^q = (1 − η1L)(1 − η2L)···(1 − ηqL)

and each of the (1 − ηiL) can be inverted as long as |ηi| < 1. If this is the case, then we can write

(1 + θ1L + ... + θqL^q)−1 (yt − µ) = εt,

where (1 + θ1L + ... + θqL^q)−1 is an infinite-order polynomial in L whose coefficients δj are functions of the θ's, with δ0 = −1, or

yt = c + δ1 yt−1 + δ2 yt−2 + ... + εt

where

c = µ + δ1µ + δ2µ + ...
• It turns out that one can always manipulate the parameters of an MA(q) process to find an invertible representation. For example, the two MA(1) processes

yt − µ = (1 − θL) εt

and

yt* − µ = (1 − θ−1L) εt*

have exactly the same moments if

σ²ε* = σ²ε θ².

For example, in both cases the variance is γ0 = σ²ε(1 + θ²), so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked. This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.
• For a given MA(q) process, it’s always possible to manipulate the parameters
to find an invertible representation (which is unique).
• It's important to find an invertible representation, since it's the only representation that allows one to represent εt as a function of past y's. The other representations express εt in terms of future values of y.
• Why is invertibility important? The most important reason is that it provides
a justification for the use of parsimonious models. Since an AR(1) process
has an MA(∞) representation, one can reverse the argument and note that at
least some MA(∞) processes have an AR(1) representation. At the time of
estimation, it’s a lot easier to estimate the single AR(1) coefficient rather than
the infinite number of coefficients associated with the MA representation.
• This is the reason that ARMA models are popular. Combining low-order AR
and MA models can usually offer a satisfactory representation of univariate
time series data with a reasonable number of parameters.
• Stationarity and invertibility of ARMA models are similar to what we've seen - we won't go into the details. Likewise, calculating moments is similar.
Bibliography

[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford Univ.
Press.
[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford Univ. Press.
[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.
[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.
[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press
[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.
[7] Wooldridge (2003), Introductory Econometrics, Thomson. (undergraduate level, for supplemen-
tary use only).
Index
Cobb-Douglas model, 27
convergence, almost sure, 449
convergence, in distribution, 449
convergence, in probability, 449
Convergence, ordinary, 448
convergence, pointwise, 448
convergence, uniform, 448
convergence, uniform almost sure, 450
cross section, 23
leverage, 34
likelihood function, 54
matrix, idempotent, 33
matrix, projection, 32
matrix, symmetric, 33
parameter space, 54
Product rule, 447
R-squared, centered, 37
R-squared, uncentered, 36