Linear Model and Extensions
arXiv:2401.00649v1 [stat.ME] 1 Jan 2024
Peng Ding
Acronyms
Symbols
Preface

I Introduction
1 Motivations for Statistical Models
1.1 Data and statistical models
1.2 Why linear models?

15 Lasso
15.1 Introduction to the lasso
15.2 Comparing the lasso and ridge: a geometric perspective
15.3 Computing the lasso via coordinate descent
15.3.1 The soft-thresholding lemma
15.3.2 Coordinate descent for the lasso
15.4 Example: comparing OLS, ridge, and lasso
15.5 Other shrinkage estimators
15.6 Homework problems

17 Interaction
17.1 Two binary covariates interact
17.2 A binary covariate interacts with a general covariate
17.2.1 Treatment effect heterogeneity
17.2.2 Johnson–Neyman technique
17.2.3 Blinder–Oaxaca decomposition
17.2.4 Chow test
17.3 Difficulties of interaction
17.3.1 Removable interaction
17.3.2 Main effect in the presence of interaction
17.3.3 Power
17.4 Homework problems

IX Appendices
A Linear Algebra
A.1 Basics of vectors and matrices
A.2 Vector calculus
A.3 Homework problems

Bibliography
Acronyms
I try hard to avoid acronyms, to reduce unnecessary burden on the reader. The
following are standard and will be used repeatedly.
ANOVA (Fisher’s) analysis of variance
CLT central limit theorem
CV cross-validation
EHW Eicker–Huber–White (robust covariance matrix or standard error)
FWL Frisch–Waugh–Lovell (theorem)
GEE generalized estimating equation
GLM generalized linear model
HC heteroskedasticity-consistent (covariance matrix or standard error)
IID independent and identically distributed
LAD least absolute deviations
lasso least absolute shrinkage and selection operator
MLE maximum likelihood estimate
OLS ordinary least squares
RSS residual sum of squares
WLS weighted least squares
Symbols
All vectors are column vectors as in R unless stated otherwise. Let the superscript “t ” denote
the transpose of a vector or matrix.
$\stackrel{\text{a}}{\sim}$ approximation in distribution
R the set of all real numbers
β regression coefficient
ε error term
H hat matrix H = X(X t X)−1 X t
hii leverage score: the (i, i)th element of the hat matrix H
In identity matrix of dimension n × n
xi covariate vector for unit i
X covariate matrix
Y outcome vector
yi outcome for unit i
⫫ independence and conditional independence
Useful R packages
Preface
my teaching assistants to review the appendices in the first two lab sessions and assigned
homework problems from the appendices to remind the students to review the background
materials. Then you can cover Chapters 2–24. You can omit Chapter 18 and some sections
in other chapters due to their technical complications. If time permits, you can consider
covering Chapter 25 due to the importance of the generalized estimating equation as well
as its byproduct called the “cluster-robust standard error”, which is important for many
social science applications. Furthermore, you can consider covering Chapter 27 due to the
importance of the Cox proportional hazards model.
Homework problems
This book contains many homework problems. It is important to try some homework prob-
lems. Moreover, some homework problems contain useful theoretical results. Even if you do
not have time to figure out the details for those problems, it is helpful to at least read the
statements of the problems.
Omitted topics
Although “Linear Model” is a standard course offered by most statistics departments, it
is not entirely clear what we should teach, as the field of statistics is evolving. Despite the
suggestions to instructors above, you may still feel that this book omits some important
topics related to the linear model.
et al. (2012) is a canonical textbook on applied longitudinal data analysis. This book also
covers the Cox proportional hazards model in Chapter 27. For more advanced methods for
survival analysis, Kalbfleisch and Prentice (2011) is a canonical textbook.
Causal inference
I intentionally do not cover causal inference in this book. To minimize the overlap of the
materials, I wrote another textbook on causal inference (Ding, 2023). However, I did teach a
version of “Linear Model” with a causal inference unit after introducing the basics of the linear
model and the logistic model. Students seemed to like it because of the connections between
statistical models and causal inference.
• This book covers the theory of the linear model related to not only social sciences but
also biomedical studies.
• This book provides homework problems with different technical difficulties. The solu-
tions to the problems are available to instructors upon request.
Other textbooks may also have one or two of the above features. This book has the above
features simultaneously. I hope that instructors and readers find these features attractive.
Acknowledgments
Many students at UC Berkeley made critical and constructive comments on early versions of
my lecture notes. As teaching assistants for my “Linear Model” course, Sizhu Lu, Chaoran
Yu, and Jason Wu read early versions of my book carefully and helped me to improve the
book a lot.
Professors Hongyuan Cao and Zhichao Jiang taught related courses based on an early
version of the book. They made very valuable suggestions.
I am also very grateful for the suggestions from Nianqiao Ju.
When I was a student, I took a linear model course based on Weisberg (2005). In my
early years of teaching, I used Christensen (2002) and Agresti (2015) as reference books.
I also sat in Professor Jim Powell’s econometrics courses and got access to his wonderful
lecture notes. They all heavily impacted my understanding and formulation of the linear
model.
If you identify any errors, please feel free to email me.
Part I
Introduction
1
Motivations for Statistical Models
(Q3) Estimate the causal effect of some components in X on Y . What if we change some
components of X? How do we measure the impact of the hypothetical intervention of
some components of X on Y ? This is a much harder question because most statistical
tools are designed to infer association, not causation. For example, the U.S. Food and
Drug Administration (FDA) approves drugs based on randomized controlled trials
(RCTs) because RCTs are the most credible way to infer the causal effects of drugs on health
outcomes. Economists are interested in evaluating the effect of a job training program
on employment and wages. However, this is a notoriously difficult problem with only
observational data.
The above descriptions are about generic X and Y , which can be many different types.
We often use different statistical models to capture the features of different types of data.
I give a brief overview of models that will appear in later parts of this book.
(T1) X and Y are univariate and continuous. In Francis Galton’s1 classic example, X is the
parents’ average height and Y is the children’s average height (Galton, 1886). Galton
derived the following formula:
$$y = \bar{y} + \hat{\rho}\,\frac{\hat{\sigma}_y}{\hat{\sigma}_x}(x - \bar{x}),$$
which is equivalent to
$$\frac{y - \bar{y}}{\hat{\sigma}_y} = \hat{\rho}\,\frac{x - \bar{x}}{\hat{\sigma}_x}, \qquad (1.1)$$
where
$$\bar{x} = n^{-1}\sum_{i=1}^n x_i, \qquad \bar{y} = n^{-1}\sum_{i=1}^n y_i$$
are the sample means,
$$\hat{\sigma}_x^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar{x})^2, \qquad \hat{\sigma}_y^2 = (n-1)^{-1}\sum_{i=1}^n (y_i - \bar{y})^2$$
are the sample variances, and $\hat{\rho} = \hat{\sigma}_{xy}/(\hat{\sigma}_x \hat{\sigma}_y)$ is the sample Pearson correlation
coefficient with the sample covariance
$$\hat{\sigma}_{xy} = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$$
1 Who was Francis Galton? He was Charles Darwin’s half-cousin and was famous for his pioneering work in
statistics and for devising a method for classifying fingerprints that proved useful in forensic science. He
also coined the term eugenics, a field that causes a lot of controversy nowadays.
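To make formula (1.1) concrete, here is a minimal R sketch; the simulated heights and coefficient values are assumptions for illustration only, not Galton's data. It checks that the slope from lm() equals ρ̂σ̂_y/σ̂_x.

# Minimal R sketch of Galton's formula (1.1) on simulated data.
set.seed(1)
n <- 928
x <- rnorm(n, mean = 68, sd = 1.8)        # parents' average height (simulated)
y <- 22 + 0.65 * x + rnorm(n, sd = 2.2)   # children's average height (simulated)
xbar <- mean(x); ybar <- mean(y)
sx <- sd(x); sy <- sd(y)
rho <- cor(x, y)                          # sample Pearson correlation
slope     <- rho * sy / sx                # slope implied by (1.1)
intercept <- ybar - slope * xbar
c(intercept = intercept, slope = slope)
coef(lm(y ~ x))                           # agrees with the OLS fit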
(T3) Y binary or indicator of two classes, and X multivariate of mixed types. For example,
in the R package wooldridge, the dataset mroz contains an outcome of interest being the
binary indicator for whether a woman was in the labor force in 1975, and some useful
covariates are
(T4) Y categorical without ordering. For example, the choice of housing type, single-family
house, townhouse, or condominium, is a categorical variable.
(T5) Y categorical and ordered. For example, the final course evaluation at UC Berkeley
can take value in {1, 2, 3, 4, 5, 6, 7}. These numbers have clear ordering but they are
not the usual real numbers.
(T6) Y counts. For example, the number of times one went to the gym last week is a
non-negative integer representing counts.
(T7) Y time-to-event outcome. For example, in medical trials, a major outcome of interest
is the survival time; in labor economics, a major outcome of interest is the time to
find the next job. The former is called survival analysis in biostatistics and the latter
is called duration analysis in econometrics.
(T8) Y multivariate and correlated. In medical trials, the data are often longitudinal, mean-
ing that the patient’s outcomes are measured repeatedly over time. So each patient
has a multivariate outcome. In field experiments of public health and development
economics, the randomized interventions are often at the village level but the out-
come data are collected at the household level. So within villages, the outcomes are
correlated.
(R1) Linear models are simple but non-trivial starting points for learning.
(R2) Linear models can provide insights because we can derive explicit formulas based on
elegant algebra and geometry.
(R3) Linear models can handle nonlinearity by incorporating nonlinear terms, for example,
X can contain the polynomials or nonlinear transformations of the original covariates.
In statistics, “linear” often means linear in parameters, not necessarily in covariates.
(R5) Linear models are simpler than nonlinear models, but they do not necessarily perform
worse than more complicated nonlinear models. We have finite data so we cannot fit
arbitrarily complicated models.
If you are interested in nonlinear models, you can take another machine learning course.
2
Ordinary Least Squares (OLS) with a Univariate Covariate
[Figure: "Galton's regression" — scatter plot of childHeight (vertical axis) against midparentHeight (horizontal axis), with the fitted line y = 22.64 + 0.64x.]
With n data points $(x_i, y_i)_{i=1}^n$, our goal is to find the best linear fit of the data.
What do we mean by the “best” fit? Gauss proposed to use the following criterion, called ordinary least squares (OLS):
$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{a, b}\; n^{-1}\sum_{i=1}^n (y_i - a - b x_i)^2.$$
The OLS criterion is based on the squared “misfits” yi − a − bxi . Another intuitive
criterion is based on the absolute values of those misfits, which is called the least absolute
deviation (LAD). However, OLS is simpler because the objective function is smooth in (a, b).
We will discuss LAD in Chapter 26.
How do we solve the OLS minimization problem? The objective function is quadratic, and
as a and b diverge, it diverges to infinity. So it must have a unique minimizer $(\hat{\alpha}, \hat{\beta})$ which
satisfies the first-order condition:
$$\begin{cases} -\dfrac{2}{n}\displaystyle\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0, \\[6pt] -\dfrac{2}{n}\displaystyle\sum_{i=1}^n x_i (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0. \end{cases}$$
These two equations are called the Normal Equations of OLS. The first equation implies
ȳ = α̂ + β̂ x̄, (2.1)
that is, the OLS line must go through the sample mean of the data (x̄, ȳ). The second
equation implies
$$\overline{xy} = \hat{\alpha}\bar{x} + \hat{\beta}\,\overline{x^2}, \qquad (2.2)$$
where $\overline{xy}$ is the sample mean of the $x_i y_i$'s, and $\overline{x^2}$ is the sample mean of the $x_i^2$'s. Subtracting
(2.1)$\times\bar{x}$ from (2.2), we have
$$\hat{\beta} = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2}.$$
So the OLS coefficient of x equals the sample covariance between x and y divided by the
sample variance of x. From (2.1), we obtain that
α̂ = ȳ − β̂ x̄.
$$\begin{aligned} y = \hat{\alpha} + \hat{\beta}x &= \bar{y} - \hat{\beta}\bar{x} + \hat{\beta}x \\ \Longrightarrow\quad y - \bar{y} &= \hat{\beta}(x - \bar{x}) \\ \Longrightarrow\quad y - \bar{y} &= \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2}(x - \bar{x}) = \frac{\hat{\rho}_{xy}\hat{\sigma}_x\hat{\sigma}_y}{\hat{\sigma}_x^2}(x - \bar{x}) \\ \Longrightarrow\quad \frac{y - \bar{y}}{\hat{\sigma}_y} &= \hat{\rho}_{xy}\,\frac{x - \bar{x}}{\hat{\sigma}_x}, \end{aligned}$$
which recovers Galton's formula (1.1).
Ceres, and his work was published in 1809. Legendre’s work appeared in 1805 but Gauss claimed that he
had been using it since 1794 or 1795. Stigler (1981) reviews the history of OLS.
which equals
$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\langle x, y\rangle}{\langle x, x\rangle},$$
where $x$ and $y$ are the n-dimensional vectors containing all observations, and $\langle x, y\rangle = \sum_{i=1}^n x_i y_i$ denotes the inner product. Although not directly useful, this formula will be the
building block for many discussions later.
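Reading the display above as the coefficient of an OLS fit without an intercept, a minimal R check of β̂ = ⟨x, y⟩/⟨x, x⟩ might look like the sketch below; the simulated data are an assumption for illustration.

# R sketch: beta-hat = <x, y> / <x, x> for OLS without an intercept.
set.seed(2)
n <- 100
x <- runif(n, 1, 3)
y <- 1.5 * x + rnorm(n)
beta_inner <- sum(x * y) / sum(x * x)   # <x, y> / <x, x>
beta_lm <- coef(lm(y ~ x - 1))          # no-intercept OLS fit in R
c(beta_inner, beta_lm)                  # identical up to rounding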
is the weight proportional to the squared distance between xi and xj . In the above formulas,
we define bij = 0 if xi = xj .
Remark: Wu (1986) and Gelman and Park (2009) used this formula. Problem 3.9 gives
a more general result.
Part II
where $x_i^t = (x_{i1}, \ldots, x_{ip})$ is the row vector consisting of the covariates of unit i, and $X_j =
(x_{1j}, \ldots, x_{nj})^t$ is the column vector of the j-th covariate for all units.
We want to find the best linear fit of the data $(x_i, y_i)_{i=1}^n$ with $\hat{y}_i = x_i^t\hat{\beta}$,
where $\hat{\beta}$ is called the OLS coefficient, the $\hat{y}_i$'s are called the fitted values, and the $y_i - \hat{y}_i$'s
are called the residuals.
The objective function is quadratic in b and diverges to infinity as b diverges to
infinity. So it must have a unique minimizer $\hat{\beta}$ satisfying the first-order condition
$$-\frac{2}{n}\sum_{i=1}^n x_i(y_i - x_i^t\hat{\beta}) = 0,$$
which simplifies to
$$\sum_{i=1}^n x_i(y_i - x_i^t\hat{\beta}) = 0 \iff X^t(Y - X\hat{\beta}) = 0. \qquad (3.1)$$
The above equation (3.1) is called the Normal equation of the OLS, which implies the main
theorem:
Theorem 3.1 The OLS coefficient equals
$$\hat{\beta} = (X^t X)^{-1} X^t Y = \Big(\sum_{i=1}^n x_i x_i^t\Big)^{-1}\Big(\sum_{i=1}^n x_i y_i\Big)$$
if $X^t X = \sum_{i=1}^n x_i x_i^t$ is non-degenerate.
The equivalence of the two forms of the OLS coefficient follows from
$$X^t X = (x_1, \ldots, x_n)\begin{pmatrix} x_1^t \\ x_2^t \\ \vdots \\ x_n^t \end{pmatrix} = \sum_{i=1}^n x_i x_i^t$$
and
$$X^t Y = (x_1, \ldots, x_n)\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \sum_{i=1}^n x_i y_i.$$
For different purposes, both forms can be useful.
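The two forms can be verified numerically. A minimal R sketch, with simulated X and Y assumed only for illustration, computes β̂ once by solving the Normal equation in matrix form and once by accumulating Σ x_i x_i^t and Σ x_i y_i over units.

# R sketch: the two equivalent forms of the OLS coefficient in Theorem 3.1.
set.seed(3)
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # first column is the intercept
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
beta_matrix <- solve(t(X) %*% X, t(X) %*% y)          # (X^t X)^{-1} X^t Y
XtX <- matrix(0, p, p); XtY <- numeric(p)
for (i in 1:n) {
  xi <- X[i, ]
  XtX <- XtX + xi %*% t(xi)    # accumulate x_i x_i^t
  XtY <- XtY + xi * y[i]       # accumulate x_i y_i
}
beta_sum <- solve(XtX, XtY)
cbind(beta_matrix, beta_sum, coef(lm(y ~ X - 1)))     # all three columns agree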
The non-degeneracy of $X^t X$ in Theorem 3.1 requires that for any non-zero vector $\alpha \in \mathbb{R}^p$, we must have
$$\alpha^t X^t X \alpha = \lVert X\alpha\rVert^2 \neq 0,$$
which is equivalent to
$$X\alpha \neq 0,$$
i.e., the columns of X are linearly independent 1 . This effectively rules out redundant
columns in the design matrix X. If X1 can be represented by other columns X1 =
c2 X2 + · · · + cp Xp for some (c2 , . . . , cp ), then X t X is degenerate.
Throughout the book, we invoke the following condition unless stated otherwise.
Condition 3.1 The column vectors of X are linearly independent.
Xb = b1 X1 + · · · + bp Xp
represents a linear combination of the column vectors of the design matrix X. So the OLS
problem is to find the best linear combination of the column vectors of X to approximate the
response vector Y . Recall that all linear combinations of the column vectors of X constitute
1 This book uses different notions of “independence” which can be confusing sometimes. In linear algebra,
a set of vectors is linearly independent if any nonzero linear combination of them is not zero; see Chapter A.
In probability theory, two random variables are independent if their joint density factorizes into the product
of the marginal distributions; see Chapter B.
the column space of X, denoted by C(X) 2 . So the OLS problem is to find the vector in C(X)
that is the closest to Y . Geometrically, the vector must be the projection of Y onto C(X).
By projection, the residual vector ε̂ = Y −X β̂ must be orthogonal to C(X), or, equivalently,
the residual vector is orthogonal to X1 , . . . , Xp . This geometric intuition implies that
$$X_1^t\hat{\varepsilon} = 0, \;\ldots,\; X_p^t\hat{\varepsilon} = 0
\iff X^t\hat{\varepsilon} = \begin{pmatrix} X_1^t\hat{\varepsilon} \\ \vdots \\ X_p^t\hat{\varepsilon} \end{pmatrix} = 0
\iff X^t(Y - X\hat{\beta}) = 0,$$
which is essentially the Normal equation (3.1). The above argument gives a geometric deriva-
tion of the OLS formula in Theorem 3.1.
In Figure 3.1, since the triangle ABC has a right angle, the fitted vector $\hat{Y} = X\hat{\beta}$ is orthogonal to the residual vector $\hat{\varepsilon}$, and moreover, the Pythagorean theorem implies that
$$\lVert Y\rVert^2 = \lVert X\hat{\beta}\rVert^2 + \lVert\hat{\varepsilon}\rVert^2.$$
We can also verify algebraically that $\hat{\beta}$ is the minimizer. Decomposing $Y - Xb = (Y - X\hat{\beta}) + X(\hat{\beta} - b)$ and expanding $\lVert Y - Xb\rVert^2$, the first term equals $\lVert Y - X\hat{\beta}\rVert^2$ and the second term equals $\lVert X(\hat{\beta} - b)\rVert^2$. We need to show the two cross terms are zero. By symmetry of these two terms, we only need to show that the last term, $(\hat{\beta} - b)^t X^t(Y - X\hat{\beta})$, is zero. This is true by the Normal equation (3.1) of the OLS, which then implies that $\lVert Y - Xb\rVert^2 \geq \lVert Y - X\hat{\beta}\rVert^2$ with equality holding if and only if $b = \hat{\beta}$.
The matrix
$$H = X(X^t X)^{-1} X^t$$
is an n × n matrix. It is called the hat matrix because it puts a hat on Y when multiplying
Y. Algebraically, we can show that H is a projection matrix because
$$H^2 = X(X^t X)^{-1} X^t X(X^t X)^{-1} X^t = X(X^t X)^{-1} X^t = H$$
and
$$H^t = \{X(X^t X)^{-1} X^t\}^t = X(X^t X)^{-1} X^t = H.$$
Recall that C(X) is the column space of X. (G1) states that projecting any vector in
C(X) onto C(X) does not change the vector, and (G2) states that projecting any vector
orthogonal to C(X) onto C(X) results in a zero vector.
Proof of Proposition 3.1: I first prove (G1). If v ∈ C(X), then v = Xb for some b,
which implies that Hv = X(X t X)−1 X t Xb = Xb = v. Conversely, if v = Hv, then v =
X(X t X)−1 X t v = Xu with u = (X t X)−1 X t v, which ensures that v ∈ C(X).
I then prove (G2). If w ⊥ C(X), then w is orthogonal to all column vectors of X. So
$$X_j^t w = 0 \; (j = 1, \ldots, p) \;\Longrightarrow\; X^t w = 0 \;\Longrightarrow\; Hw = X(X^t X)^{-1} X^t w = 0.$$
It shows that the predicted value ŷi is a linear combination of all the outcomes. Moreover,
if X contains a column of intercepts 1n = (1, . . . , 1)t , then
$$H 1_n = 1_n \;\Longrightarrow\; \sum_{j=1}^n h_{ij} = 1 \quad (i = 1, \ldots, n),$$
which implies that ŷi is a weighted average of all the outcomes. Although the sum of the
weights is one, some of them can be negative.
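As a numerical illustration (with simulated covariates, assumed only for this sketch), the following R code verifies the algebraic properties of the hat matrix: idempotence, symmetry, Ŷ = HY, and unit row sums when X contains a column of ones.

# R sketch: properties of the hat matrix H = X (X^t X)^{-1} X^t.
set.seed(4)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # includes the intercept column 1_n
y <- rnorm(n)
H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% H - H))                        # ~ 0: H is idempotent
max(abs(t(H) - H))                           # ~ 0: H is symmetric
max(abs(H %*% y - fitted(lm(y ~ X - 1))))    # ~ 0: H y equals the fitted values
range(rowSums(H))                            # each row sums to 1 because of 1_n
min(H)                                       # some weights h_ij can be negative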
In general, the hat matrix has complex forms, but when the covariates are dummy
variables, it has more explicit forms. I give two examples below.
Example 3.1 In a treatment-control experiment with m treated and n control units, the
matrix X contains a column of ones and a dummy variable for the treatment:
$$X = \begin{pmatrix} 1_m & 1_m \\ 1_n & 0_n \end{pmatrix}.$$
Write
$$Y = \begin{pmatrix} y_1^t \\ \vdots \\ y_n^t \end{pmatrix} = (Y_1, \ldots, Y_q) \in \mathbb{R}^{n\times q}$$
and
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix} = (X_1, \ldots, X_p) \in \mathbb{R}^{n\times p}$$
as the response and covariate matrices, respectively. Define the multiple OLS coefficient
matrix as
$$\hat{B} = \arg\min_{B \in \mathbb{R}^{p\times q}} \sum_{i=1}^n \lVert y_i - B^t x_i\rVert^2.$$
Its columns equal the separate OLS coefficients of each outcome on X:
$$\hat{B}_1 = (X^t X)^{-1} X^t Y_1, \;\ldots,\; \hat{B}_q = (X^t X)^{-1} X^t Y_q.$$
Remark: This result tells us that the OLS fit with a vector outcome reduces to multiple
separate OLS fits, or, the OLS fit of a matrix Y on a matrix X reduces to the column-wise
OLS fits of Y on X.
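This reduction is easy to check in R, where lm() accepts a matrix response and fits each column separately; the simulated data below are an assumption for illustration.

# R sketch: OLS with a matrix outcome equals column-wise OLS fits.
set.seed(5)
n <- 100; p <- 2; q <- 3
X <- cbind(1, rnorm(n))                       # n x p covariate matrix (with intercept)
Y <- matrix(rnorm(n * q), n, q)               # n x q outcome matrix
B_joint <- solve(t(X) %*% X, t(X) %*% Y)      # p x q coefficient matrix in one step
B_bycol <- sapply(1:q, function(j) solve(t(X) %*% X, t(X) %*% Y[, j]))
max(abs(B_joint - B_bycol))                   # ~ 0: identical column by column
coef(lm(Y ~ X - 1))                           # lm() with a matrix response agrees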
Partition the data into K subsamples:
$$X = \begin{pmatrix} X_{(1)} \\ \vdots \\ X_{(K)} \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_{(1)} \\ \vdots \\ Y_{(K)} \end{pmatrix},$$
where the kth sample consists of $(X_{(k)}, Y_{(k)})$ with $X_{(k)} \in \mathbb{R}^{n_k\times p}$ and $Y_{(k)} \in \mathbb{R}^{n_k}$ being the
covariate matrix and outcome vector. Note that $n = \sum_{k=1}^K n_k$. Let $\hat{\beta}$ be the OLS coefficient
based on the full sample, and $\hat{\beta}_{(k)}$ be the OLS coefficient based on the kth sample. Show
that
$$\hat{\beta} = \sum_{k=1}^K W_{(k)}\hat{\beta}_{(k)},$$
where $W_{(k)} = (X^t X)^{-1} X_{(k)}^t X_{(k)}$.
For each such subset S, define $\hat{\beta}_S$ as the solution to $Y_S = X_S b$, that is,
$$\hat{\beta}_S = X_S^{-1} Y_S$$
if $X_S$ is invertible, and $\hat{\beta}_S = 0$ otherwise. Show that the OLS coefficient equals a weighted
average of these subset coefficients:
$$\hat{\beta} = \sum_S w_S \hat{\beta}_S.$$
Remark: To prove this result, we can use Cramer’s rule to express the OLS coefficient
and use the Cauchy–Binet formula to expand the determinant of X t X. This result extends
Problem 2.1. Berman (1988) attributed it to Jacobi. Wu (1986) used it in analyzing the
statistical properties of OLS.
4
The Gauss–Markov Model and Theorem
$$E(\hat{\beta}) = E\{(X^t X)^{-1} X^t Y\} = (X^t X)^{-1} X^t E(Y) = (X^t X)^{-1} X^t X\beta = \beta. \qquad \square$$
We can decompose the response vector as
Y = Ŷ + ε̂,
where the fitted vector is Ŷ = X β̂ = HY and the residual vector is ε̂ = Y − Ŷ = (In − H)Y.
The two matrices H and In − H are the keys, which have the following properties.
HX = X, (In − H)X = 0,
These follow from simple linear algebra, and I leave the proof as Problem 4.1. It states
that H and In − H are projection matrices onto the column space of X and its complement.
Algebraically, Ŷ and ε̂ are orthogonal by the OLS projection because Lemma 4.1 implies
Ŷ t ε̂ = Y t H t (In − H)Y
= Y t H(In − H)Y
= 0.
and
$$\operatorname{cov}\begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} = \sigma^2\begin{pmatrix} H & 0 \\ 0 & I_n - H \end{pmatrix}.$$
Please do not confuse the two statements above. First, Ŷ and ε̂ are orthogonal.
Second, Ŷ and ε̂ are uncorrelated. They have different meanings. The first statement is an
algebraic fact of the OLS procedure. It is about a relationship between two vectors Ŷ
and ε̂ which holds without assuming the Gauss–Markov model. The second statement is
stochastic. It is about a relationship between two random vectors Ŷ and ε̂ which requires
the Gauss–Markov model assumption.
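As a quick numerical check of the first, algebraic statement, the R sketch below uses arbitrary simulated data with no model assumed; the orthogonality Ŷᵗε̂ = 0 holds regardless.

# R sketch: the fitted vector and the residual vector are always orthogonal.
set.seed(6)
n <- 80
X <- cbind(1, rnorm(n), runif(n))
y <- rnorm(n)                      # no Gauss-Markov model assumed here
fit <- lm(y ~ X - 1)
sum(fitted(fit) * resid(fit))      # ~ 0 up to floating-point error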
Proof of Theorem 4.2: The conclusion follows from the simple fact that
$$\begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} = \begin{pmatrix} HY \\ (I_n - H)Y \end{pmatrix} = \begin{pmatrix} H \\ I_n - H \end{pmatrix} Y$$
is a linear transformation of Y. It has mean
$$E\begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} = \begin{pmatrix} H \\ I_n - H \end{pmatrix} E(Y) = \begin{pmatrix} H \\ I_n - H \end{pmatrix} X\beta = \begin{pmatrix} HX\beta \\ (I_n - H)X\beta \end{pmatrix} = \begin{pmatrix} X\beta \\ 0 \end{pmatrix},$$
and covariance matrix
$$\begin{aligned} \operatorname{cov}\begin{pmatrix} \hat{Y} \\ \hat{\varepsilon} \end{pmatrix} &= \begin{pmatrix} H \\ I_n - H \end{pmatrix} \operatorname{cov}(Y)\,\begin{pmatrix} H^t & (I_n - H)^t \end{pmatrix} \\ &= \sigma^2\begin{pmatrix} H \\ I_n - H \end{pmatrix}\begin{pmatrix} H & I_n - H \end{pmatrix} \\ &= \sigma^2\begin{pmatrix} H^2 & H(I_n - H) \\ (I_n - H)H & (I_n - H)^2 \end{pmatrix} \\ &= \sigma^2\begin{pmatrix} H & 0 \\ 0 & I_n - H \end{pmatrix}, \end{aligned}$$
where the last step follows from Lemma 4.1. □
Assume the Gauss–Markov model. Although the original responses and error terms are
uncorrelated between units, with $\operatorname{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$, the fitted values and the residuals
are correlated, with
$$\operatorname{cov}(\hat{y}_i, \hat{y}_j) = \sigma^2 h_{ij}, \qquad \operatorname{cov}(\hat{\varepsilon}_i, \hat{\varepsilon}_j) = -\sigma^2 h_{ij}$$
for $i \neq j$ based on Theorem 4.2.
A natural variance estimator is $\tilde{\sigma}^2 = \text{rss}/n$, where
$$\text{rss} = \sum_{i=1}^n \hat{\varepsilon}_i^2$$
is the residual sum of squares. However, Theorem 4.2 shows that $\hat{\varepsilon}_i$ has mean zero and
variance $\sigma^2(1 - h_{ii})$, which is not the same as the variance of the original $\varepsilon_i$. Consequently, rss
has mean
$$E(\text{rss}) = \sum_{i=1}^n \sigma^2(1 - h_{ii}) = \sigma^2\{n - \operatorname{trace}(H)\} = \sigma^2(n - p).$$
Theorem 4.3 implies that $\tilde{\sigma}^2$ is a biased estimator for $\sigma^2$ because $E(\tilde{\sigma}^2) = \sigma^2(n - p)/n$.
It underestimates $\sigma^2$, but with a large sample size n, the bias is small.
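A small Monte Carlo sketch in R (with a data-generating process assumed only for illustration) confirms that E(rss) = σ²(n − p), so rss/(n − p) is unbiased for σ² while σ̃² = rss/n underestimates it.

# R sketch: Monte Carlo check of E(rss) = sigma^2 * (n - p).
set.seed(7)
n <- 30; p <- 4; sigma <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta <- c(1, -1, 2, 0.5)
rss <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  sum(resid(lm(y ~ X - 1))^2)
})
mean(rss)             # close to sigma^2 * (n - p) = 4 * 26 = 104
mean(rss / (n - p))   # close to sigma^2 = 4: the unbiased estimator
mean(rss / n)         # below sigma^2: the biased estimator sigma-tilde^2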
Theorem 4.4 Under Assumption 4.1, the OLS estimator β̂ for β is the best linear unbiased
estimator (BLUE) in the sense that
$$\operatorname{cov}(\tilde{\beta}) \succeq \operatorname{cov}(\hat{\beta})$$
for any estimator $\tilde{\beta} = AY$ satisfying conditions (C1) and (C2) below.
Before proving Theorem 4.4, we need to understand its meaning and immediate impli-
cations. We do not compare the OLS estimator with any arbitrary estimators. In fact, we
restrict to the estimators that are linear and unbiased. Condition (C1) requires that β̃ is
a linear estimator. More precisely, it is a linear transformation of the response vector Y ,
where A can be any complex and possibly nonlinear function of X. Condition (C2) requires
that β̃ is an unbiased estimator for β, no matter what true value β takes.
Why do we restrict the estimator to be linear? The class of linear estimators is actually
quite large because A can be any nonlinear function of X, and the only requirement is that
the estimator is linear in Y . The unbiasedness is a natural requirement for many problems.
However, in many modern applications with many covariates, some biased estimators can
perform better than unbiased estimators if they have smaller variances. We will discuss
these estimators in Part V of this book.
We compare the estimators based on their covariances, which are natural extensions of
variances for scalar random variables. The conclusion $\operatorname{cov}(\tilde{\beta}) \succeq \operatorname{cov}(\hat{\beta})$ implies that for any
vector $c \in \mathbb{R}^p$, we have
$$c^t \operatorname{cov}(\tilde{\beta})\, c \geq c^t \operatorname{cov}(\hat{\beta})\, c,$$
which is equivalent to
$$\operatorname{var}(c^t\tilde{\beta}) \geq \operatorname{var}(c^t\hat{\beta}).$$
So any linear transformation of the OLS estimator has a variance smaller than or equal to
that of the same linear transformation of any other linear unbiased estimator. In particular, if $c = (0, \ldots, 1, \ldots, 0)^t$
with only the jth coordinate being 1, then the above inequality implies that
$$\operatorname{var}(\tilde{\beta}_j) \geq \operatorname{var}(\hat{\beta}_j), \qquad (j = 1, \ldots, p).$$
So the OLS estimator has a smaller variance than other estimators for all coordinates.
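To illustrate the comparison numerically (this is not part of the proof), here is a minimal R sketch: under homoskedastic errors, a weighted least squares estimator with arbitrary, wrongly specified weights is still linear and unbiased, but its simulated variance is no smaller than that of OLS in every coordinate. The weights and data-generating process are assumptions for illustration only.

# R sketch: OLS versus another linear unbiased estimator under Assumption 4.1.
set.seed(8)
n <- 60; p <- 3; sigma <- 1
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta <- c(1, 2, -1)
W <- diag(runif(n, 0.2, 5))                        # arbitrary (wrong) weights
A_ols <- solve(t(X) %*% X) %*% t(X)                # OLS: A-hat = (X^t X)^{-1} X^t
A_alt <- solve(t(X) %*% W %*% X) %*% t(X) %*% W    # satisfies A X = I_p, hence unbiased
sims <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)     # homoskedastic errors
  c(A_ols %*% y, A_alt %*% y)
})
apply(sims, 1, var)   # first p variances (OLS) <= last p (alternative), up to MC error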
Now we prove the theorem.
Proof of Theorem 4.4: We must verify that the OLS estimator itself satisfies (C1) and
(C2). We have β̂ = ÂY with  = (X t X)−1 X t , and it is unbiased by Theorem 4.1.
First, the unbiasedness requirement implies that
$$E(\tilde{\beta}) = \beta \;\Longrightarrow\; E(AY) = AE(Y) = AX\beta = \beta \;\Longrightarrow\; AX\beta = \beta$$
for any value of β. So
$$AX = I_p \qquad (4.1)$$
must hold. In particular, the OLS estimator satisfies $\hat{A}X = (X^t X)^{-1} X^t X = I_p$.
Second, we can decompose the covariance of β̃ as
$$\operatorname{cov}(\tilde{\beta}) = \operatorname{cov}(\hat{\beta} + \tilde{\beta} - \hat{\beta}) = \operatorname{cov}(\hat{\beta}) + \operatorname{cov}(\tilde{\beta} - \hat{\beta}) + \operatorname{cov}(\hat{\beta}, \tilde{\beta} - \hat{\beta}) + \operatorname{cov}(\tilde{\beta} - \hat{\beta}, \hat{\beta}).$$
The last two terms are in fact zero. By symmetry, we only need to show that the third term
is zero:
$$\begin{aligned} \operatorname{cov}(\hat{\beta}, \tilde{\beta} - \hat{\beta}) &= \operatorname{cov}\{\hat{A}Y, (A - \hat{A})Y\} \\ &= \hat{A}\operatorname{cov}(Y)(A - \hat{A})^t \\ &= \sigma^2\hat{A}(A - \hat{A})^t \\ &= \sigma^2(\hat{A}A^t - \hat{A}\hat{A}^t) \\ &= \sigma^2\{(X^t X)^{-1} X^t A^t - (X^t X)^{-1} X^t X(X^t X)^{-1}\} \\ &= \sigma^2\{(X^t X)^{-1} I_p - (X^t X)^{-1}\} \qquad (\text{by } (4.1)) \\ &= 0. \end{aligned}$$
Therefore, $\operatorname{cov}(\tilde{\beta}) = \operatorname{cov}(\hat{\beta}) + \operatorname{cov}(\tilde{\beta} - \hat{\beta}) \succeq \operatorname{cov}(\hat{\beta})$. □
Assume xi must be in the interval [0, 1]. We want to choose their values to minimize
var(β̂). Assume that n is an even number. Find the minimizers xi ’s.
Hint: You may find the following probability result useful. For a random variable ξ in
the interval [0, 1], we have the following inequality
var(ξ) = E(ξ 2 ) − {E(ξ)}2
≤ E(ξ) − {E(ξ)}2
= E(ξ){1 − E(ξ)}
≤ 1/4.
The first inequality becomes an equality if and only if ξ = 0 or 1; the second inequality
becomes an equality if and only if E(ξ) = 1/2.
cov(β̄) ⪰ cov(β̂).
then
$$\tilde{\beta} = \hat{\beta} + \begin{pmatrix} Y^t Q_1 Y \\ \vdots \\ Y^t Q_p Y \end{pmatrix}$$
is unbiased for β.
Remark: The above estimator β̃ is a quadratic function of Y . It is a nonlinear unbiased
estimator for β. It is not difficult to show the unbiasedness. More remarkably, Koopmann
(1982, Theorem 4.3) showed that under Assumption 4.1, any unbiased estimator for β must
have the form of β̃.