STAT 714 Linear Statistical Models: Lecture Notes
STAT 714 Linear Statistical Models: Lecture Notes
Lecture Notes
Joshua M. Tebbs
Department of Statistics
The University of South Carolina
TABLE OF CONTENTS STAT 714, J. TEBBS
Contents
2.3 Reparameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Estimability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Reparameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Distributional Theory 68
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
i
TABLE OF CONTENTS STAT 714, J. TEBBS
5.2.3 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6 Statistical Inference 95
6.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7 Appendix 118
ii
TABLE OF CONTENTS STAT 714, J. TEBBS
iii
CHAPTER 1 STAT 714, J. TEBBS
INTRODUCTION : Linear models are models that are linear in their parameters. The
general form of a linear model is given by
Y = Xβ + ,
SCOPE : Several models commonly used in statistics are examples of the general linear
model Y = Xβ + . These include, but are not limited to, linear regression models and
analysis of variance (ANOVA) models. Regression models generally refer to those for
which X is full rank, while ANOVA models refer to those for which X consists of zeros
and ones.
PAGE 1
CHAPTER 1 STAT 714, J. TEBBS
Example 1.1. One-sample problem. Suppose that Y1 , Y2 , ..., Yn is an iid sample with
mean µ and variance σ 2 > 0. If 1 , 2 , ..., n are iid with mean E(i ) = 0 and common
variance σ 2 , we can write the GM model
Y = Xβ + ,
where
Y1 1 1
Y2 1 2
Yn×1 =
..
,
Xn×1 =
..
,
β 1×1 = µ, n×1 =
..
.
. . .
Yn 1 n
Note that E() = 0 and cov() = σ 2 I.
Example 1.2. Simple linear regression. Consider the model where a response variable
Y is linearly related to an independent variable x via
Yi = β0 + β1 xi + i ,
for i = 1, 2, ..., n, where the i are uncorrelated random variables with mean 0 and
common variance σ 2 > 0. If x1 , x2 , ..., xn are fixed constants, measured without error,
then this is a GM model Y = Xβ + with
Y1 1 x1 1
Y2 1 x2 β0 2
Yn×1 = . , Xn×2 = . .
, β 2×1 = , n×1 = ..
.
.. .. ..
β1 .
Yn 1 xn n
Note that E() = 0 and cov() = σ 2 I.
Example 1.3. Multiple linear regression. Suppose that a response variable Y is linearly
related to several independent variables, say, x1 , x2 , ..., xk via
PAGE 2
CHAPTER 1 STAT 714, J. TEBBS
for i = 1, 2, ..., n, where i are uncorrelated random variables with mean 0 and common
variance σ 2 > 0. If the independent variables are fixed constants, measured without
error, then this model is a special GM model Y = Xβ + where
β0
Y 1 x11 x12 · · · x1k 1
1
β1
1 x21 x22 · · · x2k
Y2 2
Y = . , Xn×p = . . , β p×1 = , = ,
. . .
β2 ..
.. .. .. .. .. ..
..
.
.
Yn 1 xn1 xn2 · · · xnk n
βk
Yij = µ + αi + ij ,
for i = 1, 2, ..., a and j = 1, 2, ..., ni , where the random errors ij are uncorrelated random
variables with zero mean and common variance σ 2 > 0. If the a treatment effects
α1 , α2 , ..., αa are best regarded as fixed constants, then this model is a special case of the
GM model Y = Xβ + . To see this, note that with n = ai=1 ni ,
P
µ
Y11 1n1 1n1 0n1 · · · 0n1
α
1
0n2 1n2 · · · 0n2
Y12 1
Yn×1 = , Xn×p = n2 , β p×1 = α2 ,
.. .. .. .. ... ..
. . . . . ..
.
Yana 1na 0na 0na · · · 1na
αa
where p = a + 1 and n×1 = (11 , 12 , ..., ana )0 , and where 1ni is an ni × 1 vector of ones
and 0ni is an ni × 1 vector of zeros. Note that E() = 0 and cov() = σ 2 I.
NOTE : In Example 1.4, note that the first column of X is the sum of the last a columns;
i.e., there is a linear dependence in the columns of X. From results in linear algebra,
we know that X is not of full column rank. In fact, the rank of X is r = a, one less
PAGE 3
CHAPTER 1 STAT 714, J. TEBBS
Example 1.5. Two-way nested ANOVA. Consider an experiment with two factors,
where one factor, say, Factor B, is nested within Factor A. In other words, every level
of B appears with exactly one level of Factor A. A statistical model for this situation is
for i = 1, 2, ..., a, j = 1, 2, ..., bi , and k = 1, 2, ..., nij . In this model, µ denotes the overall
mean, αi represents the effect due to the ith level of A, and βij represents the effect
of the jth level of B, nested within the ith level of A. If all parameters are fixed, and
the random errors ijk are uncorrelated random variables with zero mean and constant
unknown variance σ 2 > 0, then this is a special GM model Y = Xβ + . For example,
with a = 3, b = 2, and nij = 2, we have
Y 1 1 0 0 1 0 0 0 0 0
111
Y112 1 1 0 0 1 0 0 0 0 0 µ
Y121 1 1 0 0 0 1 0 0 0 0 α1
Y122 1 1 0 0 0 1 0 0 0 0 α2
Y211 1 0 1 0 0 0 1 0 0 0 α3
Y212 1 0 1 0 0 0 1 0 0 0 β
, β = 11
Y=
, X =
,
Y221 1 0 1 0 0 0 0 1 0 0 β12
Y222 1 0 1 0 0 0 0 1 0 0 β21
Y311 1 0 0 1 0 0 0 0 1 0 β22
Y312 1 0 0 1 0 0 0 0 1 0
β31
Y321 1 0 0 1 0 0 0 0 0 1 β32
Y322 1 0 0 1 0 0 0 0 0 1
and = (111 , 112 , ..., 322 )0 . Note that E() = 0 and cov() = σ 2 I. The X matrix is not
of full column rank. The rank of X is r = 6 and there are p = 10 columns.
PAGE 4
CHAPTER 1 STAT 714, J. TEBBS
Example 1.6. Two-way crossed ANOVA with interaction. Consider an experiment with
two factors (A and B), where Factor A has a levels and Factor B has b levels. In general,
we say that factors A and B are crossed if every level of A occurs in combination with
every level of B. Consider the two-factor (crossed) ANOVA model given by
for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., nij , where the random errors ij are
uncorrelated random variables with zero mean and constant unknown variance σ 2 > 0.
If all the parameters are fixed, this is a special GM model Y = Xβ + . For example,
with a = 3, b = 2, and nij = 3,
Y111 1 1 0 0 1 0 1 0 0 0 0 0
Y112 1 1 0 0 1 0 1 0 0 0 0 0
Y113 1 1 0 0 1 0 1 0 0 0 0 0
Y121 1 1 0 0 0 1 0 1 0 0 0 0 µ
Y122 1 1 0 0 0 1 0 1 0 0 0 0 α1
Y123 1 1 0 0 0 1 0 1 0 0 0 0 α2
Y211 1 0 1 0 1 0 0 0 1 0 0 0 α3
Y212 1 0 1 0 1 0 0 0 1 0 0 0 β1
Y213 1 0 1 0 1 0 0 0 1 0 0 0 β
, β = 2
Y=
, X =
,
Y221 1 0 1 0 0 1 0 0 0 1 0 0 γ11
Y222 1 0 1 0 0 1 0 0 0 1 0 0 γ12
Y223 1 0 1 0 0 1 0 0 0 1 0 0 γ21
Y311 1 0 0 1 1 0 0 0 0 0 1 0 γ22
Y312 1 0 0 1 1 0 0 0 0 0 1 0 γ31
Y313 1 0 0 1 1 0 0 0 0 0 1 0
γ32
Y321 1 0 0 1 0 1 0 0 0 0 0 1
Y322 1 0 0 1 0 1 0 0 0 0 0 1
Y323 1 0 0 1 0 1 0 0 0 0 0 1
and = (111 , 112 , ..., 323 )0 . Note that E() = 0 and cov() = σ 2 I. The X matrix is not
of full column rank. The rank of X is r = 6 and there are p = 12 columns.
PAGE 5
CHAPTER 1 STAT 714, J. TEBBS
Yijk = µ + αi + βj + ijk ,
for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., nij , where the random errors ij are
uncorrelated random variables with zero mean and common variance σ 2 > 0. Note that
no-interaction model is a special case of the interaction model in Example 1.6 when
H0 : γ11 = γ12 = · · · = γ32 = 0 is true. That is, the no-interaction model is a reduced
version of the interaction model. With a = 3, b = 2, and nij = 3 as before, we have
Y111 1 1 0 0 1 0
Y112 1 1 0 0 1 0
Y113 1 1 0 0 1 0
Y121 1 1 0 0 0 1
Y122 1 1 0 0 0 1
Y123 1 1 0 0 0 1
Y211 1 0 1 0 1 0 µ
Y212 1 0 1 0 1 0 α1
Y213 1 0 1 0 1 0 α2
Y=
, X =
, β =
,
Y221 1 0 1 0 0 1 α3
Y222 1 0 1 0 0 1 β1
Y223 1 0 1 0 0 1 β2
Y311 1 0 0 1 1 0
Y312 1 0 0 1 1 0
Y313 1 0 0 1 1 0
Y321 1 0 0 1 0 1
Y322 1 0 0 1 0 1
Y323 1 0 0 1 0 1
and = (111 , 112 , ..., 323 )0 . Note that E() = 0 and cov() = σ 2 I. The X matrix is not
of full column rank. The rank of X is r = 4 and there are p = 6 columns. Also note that
PAGE 6
CHAPTER 1 STAT 714, J. TEBBS
the design matrix for the no-interaction model is the same as the design matrix for the
interaction model, except that the last 6 columns are removed.
for i = 1, 2, ..., a, j = 1, 2, ..., ni , where the random errors ij are uncorrelated random
variables with zero mean and common variance σ 2 > 0. In this model, µ represents the
overall mean, αi represents the (fixed) effect of receiving the ith treatment (disregarding
the covariates), and βi denotes the slope of the line that relates Y to x for the ith
treatment. Note that this model allows the treatment slopes to be different. The xij ’s
are assumed to be fixed values measured without error.
Note that E() = 0 and cov() = σ 2 I. The X matrix is not of full column rank. If there
are no linear dependencies among the last 3 columns, the rank of X is r = 6 and there
are p = 7 columns.
REDUCED MODEL: Consider the ANCOVA model in Example 1.8 which allows for
unequal slopes. If β1 = β2 = · · · = βa ; that is, all slopes are equal, then the ANCOVA
PAGE 7
CHAPTER 1 STAT 714, J. TEBBS
model reduces to
Yij = µ + αi + βxij + ij .
That is, the common-slopes ANCOVA model is a reduced version of the model that
allows for different slopes. Assuming the same error structure, this reduced ANCOVA
model is also a special GM model Y = Xβ + . With a = 3 and n1 = n2 = n3 = 3, as
before, we have
Y 1 1 0 0 x11 11
11
Y12 1 1 0 0 x12 12
Y13 1 1 0 0 x13 µ 13
Y21 1 0 1 0 x21 α1 21
Y = Y22 , X = , β = α2 , = .
1 0 1 0 x22 22
Y23 1 0 1 0 x23 α3 23
Y31 1 0 0 1 x31 β 31
Y32 1 0 0 1 x32 32
Y33 1 0 0 1 x33 33
As long as at least one of the xij ’s is different, the rank of X is r = 4 and there are p = 5
columns.
GOAL: We now provide examples of linear models of the form Y = Xβ + that are not
GM models.
Yij = µ + αi + ij ,
for i = 1, 2, ..., a and j = 1, 2, ..., ni , where the treatment effects α1 , α2 , ..., αa are best
regarded as random; e.g., the a levels of the factor of interest are drawn from a large
population of possible levels, and the random errors ij are uncorrelated random variables
PAGE 8
CHAPTER 1 STAT 714, J. TEBBS
with zero mean and common variance σ 2 > 0. For concreteness, let a = 4 and nij = 3.
The model Y = Xβ + looks like
Y11 11
Y12 12
Y13 13
Y21 21
Y22 13 03 03 03 α1 22
Y23 03 13 03 03 α2 23
Y=
= 112 µ +
+
Y31 03 03 13 03 α3 31
Y32 03 03 03 13 α4 32
| {z } | {z }
Y33 = Z1 = 1 33
Y41 41
Y42 42
Y43 43
| {z }
= 2
= Xβ + Z1 1 + 2 ,
provided that the αi ’s and the errors ij are uncorrelated. Note that cov() 6= σ 2 I.
Example 1.10. Two-factor mixed model. Consider an experiment with two factors (A
and B), where Factor A is fixed and has a levels and Factor B is random with b levels.
A statistical model for this situation is given by
Yijk = µ + αi + βj + ijk ,
for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., nij . The αi ’s are best regarded as fixed
and the βj ’s are best regarded as random. This model assumes no interaction.
PAGE 9
CHAPTER 1 STAT 714, J. TEBBS
about the large population of blocks (and not those b blocks in the experiment), then
blocks are considered as random. The treatment effects α1 , α2 , ..., αa are regarded as
fixed constants if the a treatments are the only ones of interest.
NOTE : For concreteness, suppose that a = 2, b = 4, and nij = 1. We can write the
model above as
Y 11
11
Y12 12
Y13 β 13
1
µ
Y14 14 14 04 I4
β 14
2
Y= = + + .
α1
Y21 14 04 14 I4 β3 21
α2
Y22 | {z } β4 22
= Xβ
| {z }
Y23 = Z1 1 23
Y24 24
| {z }
= 2
Y = Xβ + Z1 1 + Z2 2 + · · · + Zk k ,
where Z1 , Z2 , ..., Zk are known matrices (typically Zk = Ik ) and 1 , 2 , ..., k are uncorre-
lated random vectors with uncorrelated components.
PAGE 10
CHAPTER 1 STAT 714, J. TEBBS
Example 1.11. Time series models. When measurements are taken over time, the GM
model may not be appropriate because observations are likely correlated. A linear model
of the form Y = Xβ + , where E() = 0 and cov() = σ 2 V, V known, may be more
appropriate. The general form of V is chosen to model the correlation of the observed
responses. For example, consider the statistical model
Yt = β0 + β1 t + t ,
for t = 1, 2, ..., n, where t = ρt−1 + at , at ∼ iid N (0, σ 2 ), and |ρ| < 1 (this is a
stationarity condition). This is called a simple linear trend model where the error
process {t : t = 1, 2, ..., n} follows an autoregressive model of order 1, AR(1). It is easy
to show that E(t ) = 0, for all t, and that cov(t , s ) = σ 2 ρ|t−s| , for all t and s. Therefore,
if n = 5,
2 3 4
1 ρ ρ ρ ρ
2
ρ 1 ρ ρ ρ
3
V = σ 2 ρ2 ρ 1 ρ ρ 2 .
3 2
ρ ρ ρ 1 ρ
ρ4 ρ3 ρ2 ρ 1
Example 1.12. Random coefficient models. Suppose that t measurements are taken
(over time) on n individuals and consider the model
for i = 1, 2, ..., n and j = 1, 2, ..., t; that is, the different p × 1 regression parameters β i
are “subject-specific.” If the individuals are considered to be a random sample, then we
can treat β 1 , β 2 , ..., β n as iid random vectors with mean β and p × p covariance matrix
Σββ , say. We can write this model as
PAGE 11
CHAPTER 1 STAT 714, J. TEBBS
Yi = β0 + β1 Xi + i ,
where i ∼ iid N (0, σ2 ). The Xi ’s are not observed exactly; instead, they are measured
with non-negligible error so that
Wi = Xi + Ui ,
Yi = β0 + β1 (Wi − Ui ) + i
= β0 + β1 Wi + (i − β1 Ui ) .
| {z }
= ∗i
Because the Wi ’s are not fixed in advance, we would at least need E(∗i |Wi ) = 0 for this
to be a GM linear model. However, note that
The first term is zero if i is independent of both Xi and Ui . The second term generally
is not zero (unless β1 = 0, of course) because Ui and Xi + Ui are correlated. Therefore,
this can not be a GM model.
PAGE 12
CHAPTER 2 STAT 714, J. TEBBS
Y = Xβ + ,
Since β is unknown, all we really know is that E(Y) = Xβ ∈ C(X). To estimate E(Y),
it seems natural to take the vector in C(X) that is closest to Y.
DEFINITION : An estimate β
b is a least squares estimate of β if Xβ
b is the vector in
LEAST SQUARES : Let β = (β1 , β2 , ..., βp )0 and define the error sum of squares
the squared distance from Y to Xβ. The point where Q(β) is minimized satisfies
∂Q(β)
0
∂β1
∂Q(β)
∂Q(β) ∂β2 0
= 0, or, in other words,
.. = .. .
∂β . .
∂Q(β)
∂βp
0
PAGE 13
CHAPTER 2 STAT 714, J. TEBBS
∂a0 b ∂b0 Ab
=a and = (A + A0 )b.
∂b ∂b
∂b0 Ab
= 2Ab
∂b
if A is symmetric.
= Y0 Y − 2Y0 Xβ + β 0 X0 Xβ.
∂Q(β)
= −2X0 Y + 2X0 Xβ,
∂β
X0 Xβ = X0 Y.
These are the normal equations. If X0 X is nonsingular, then the unique least squares
estimator of β is
b = (X0 X)−1 X0 Y.
β
When X0 X is singular, which can happen in ANOVA models (see Chapter 1), there can
be multiple solutions to the normal equations. Having already proved algebraically that
the normal equations are consistent, we know that the general form of the least squares
solution is
b = (X0 X)− X0 Y + [I − (X0 X)− X0 X]z,
β
PAGE 14
CHAPTER 2 STAT 714, J. TEBBS
X0 Xβ = X0 Y,
the normal equations are consistent if X0 Y ∈ C(X0 X). Clearly, X0 Y ∈ C(X0 ). Thus, we’ll
be able to establish consistency (geometrically) if we can show that C(X0 X) = C(X0 ).
Result 2.3. Suppose that S1 and T1 are orthogonal complements, as well as S2 and T2 .
If S1 ⊆ S2 , then T2 ⊆ T1 .
Proof. See Monahan, pp 244.
CONSISTENCY : We use the previous two results to show that C(X0 X) = C(X0 ). Take
S1 = N (X0 X), T1 = C(X0 X), S2 = N (X), and T2 = C(X0 ). We know that S1 and
T1 (S2 and T2 ) are orthogonal complements. Because N (X0 X) ⊆ N (X), the last result
guarantees C(X0 ) ⊆ C(X0 X). But, C(X0 X) ⊆ C(X0 ) trivially, so we’re done. Note also
NOTE : We now state a result that characterizes all solutions to the normal equations.
= (Y − Xβ b − Xβ)0 (Y − Xβ
b + Xβ b − Xβ)
b + Xβ
b 0 (Y − Xβ)
= (Y − Xβ) b − Xβ)0 (Xβ
b + (Xβ b − Xβ),
PAGE 15
CHAPTER 2 STAT 714, J. TEBBS
b − Xβ)0 (Y − Xβ)
since the cross product term 2(Xβ b = 0; verify this using the fact that
β b + z0 z, where
b solves the normal equations. Thus, we have shown that Q(β) = Q(β)
b − Xβ. Therefore, Q(β) ≥ Q(β)
z = Xβ b for all β and, hence, β
b minimizes Q(β). (=⇒)
Thus, Q(β)
e = Q(β).
b But because Q(β) b + z0 z, where z = Xβ
e = Q(β) b − Xβ,
e it must be
b − Xβ
true that z = Xβ e = 0; that is, Xβ
b = Xβ.
e Thus,
X 0 Xβ
e = X 0 Xβ
b = X0 Y,
since β
b is a solution to the normal equations. This shows that β
e is also solution to the
normal equations.
INVARIANCE : In proving the last result, we have discovered a very important fact;
namely, if β
b and β
e both solve the normal equations, then Xβ
b = Xβ.
e In other words,
Xβ
b is invariant to the choice of β.
b
NOTE : The following result ties least squares estimation to the notion of a perpendicular
projection matrix. It also produces a general formula for the matrix.
Both terms on the right hand side are nonnegative, and the first term does not involve
β. Thus, (Y − Xβ)0 (Y − Xβ) is minimized by minimizing (MY − Xβ)0 (MY − Xβ), the
squared distance between MY and Xβ. This distance is zero if and only if MY = Xβ,
which proves the result. Now to show the above equation:
PAGE 16
CHAPTER 2 STAT 714, J. TEBBS
It suffices to show that (∗) and (∗∗) are zero. To show that (∗) is zero, note that
M = X(X0 X)− X0 .
PX = X(X0 X)− X0 .
PROPERTIES : Let PX denote the perpendicular projection matrix onto C(X). Then
(a) PX is idempotent
(d) PX is symmetric
(e) PX is unique.
We have already proven (a), (b), (d), and (e); see Matrix Algebra Review 5. Part (c) must
be true; otherwise, part (e) would not hold. However, we can prove (c) more rigorously.
1. X(X0 X)− 0 0 − 0
1 X X = X(X X)2 X X = X
2. X(X0 X)− 0 0 − 0
1 X = X(X X)2 X .
PAGE 17
CHAPTER 2 STAT 714, J. TEBBS
v0 X(X0 X)− 0 0 0 − 0 0 0 0 − 0 0 0 0
1 X X = v1 X(X X)1 X X = d X X(X X)1 X X = d X X = v X,
X(X0 X)− 0 0 − 0 0 − 0 0 − 0
1 X v = X(X X)1 X Xd = X(X X)2 X Xd = X(X X)2 X v.
b≡Y
PX Y = X β b
that is closest to Y. Now, recall that I − PX is the perpendicular projection matrix onto
N (X0 ). Note that
(I − PX )Y = Y − PX Y = Y − Y
b ≡b
e.
PAGE 18
CHAPTER 2 STAT 714, J. TEBBS
Y=Y
b +b
e.
Y0 Y = Y0 IY = Y0 (PX + I − PX )Y
= Y0 PX Y + Y0 (I − PX )Y
= Y0 PX PX Y + Y0 (I − PX )(I − PX )Y
b 0Y
= Y e0 b
b +b e,
since PX and I−PX are both symmetric and idempotent; i.e., they are both perpendicular
projection matrices (but onto orthogonal spaces). This orthogonal decomposition of Y0 Y
is often given in a tabular display called an analysis of variance (ANOVA) table.
Source df SS
Model r b 0Y
Y b = Y 0 PX Y
Residual n − r e0 b
b e = Y0 (I − PX )Y
Total n Y0 Y = Y0 IY
It is interesting to note that the sum of squares column, abbreviated “SS,” catalogues
3 quadratic forms, Y0 PX Y, Y0 (I − PX Y), and Y0 IY. The degrees of freedom column,
abbreviated “df,” catalogues the ranks of the associated quadratic form matrices; i.e.,
r(PX ) = r
r(I − PX ) = n − r
r(I) = n.
PAGE 19
CHAPTER 2 STAT 714, J. TEBBS
NOTE : The following “visualization” analogy is taken liberally from Christensen (2002).
VISUALIZATION : One can think about the geometry of least squares estimation in
three dimensions (i.e., when n = 3). Consider your kitchen table and take one corner of
the table to be the origin. Take C(X) as the two dimensional subspace determined by the
surface of the table, and let Y be any vector originating at the origin; i.e., any point in
R3 . The linear model says that E(Y) = Xβ, which just says that E(Y) is somewhere on
the table. The least squares estimate Y
b = Xβ
b = PX Y is the perpendicular projection
e = (I − PX )Y is the vector
of Y onto the surface of the table. The residual vector b
starting at the origin, perpendicular to the surface of the table, that reaches the same
height as Y. Another way to think of the residual vector is to first connect Y and
PX Y with a line segment (that is perpendicular to the surface of the table). Then,
shift the line segment along the surface (keeping it perpendicular) until the line segment
has one end at the origin. The residual vector b
e is the perpendicular projection of Y
onto C(I − PX ) = N (X0 ); that is, the projection onto the orthogonal complement of the
table surface. The orthogonal complement C(I − PX ) is the one-dimensional space in
the vertical direction that goes through the origin. Once you have these vectors in place,
sums of squares arise from Pythagorean’s Theorem.
A SIMPLE PPM : Suppose Y1 , Y2 , ..., Yn are iid with mean E(Yi ) = µ. In terms of the
general linear model, we can write Y = Xβ + , where
Y 1 1
1
Y2 1 2
Y= ..
, X = 1 = , β = µ,
.. =
..
.
. . .
Yn 1 n
P1 Y = n−1 JY = Y 1,
PAGE 20
CHAPTER 2 STAT 714, J. TEBBS
Pn
where Y = n−1 i=1 Yi . The perpendicular projection matrix P1 projects Y onto the
space
C(P1 ) = {z ∈ Rn : z = (a, a, ..., a)0 ; a ∈ R}.
the vector which contains the deviations from the mean. The perpendicular projection
matrix I − P1 projects Y onto
( n
)
X
C(I − P1 ) = z ∈ Rn : z = (a1 , a2 , ..., an )0 ; ai ∈ R, ai = 0 .
i=1
REMARK : The matrix P1 plays an important role in linear models, and here is why.
Most linear models, when written out in non-matrix notation, contain an intercept
term. For example, in simple linear regression,
Yi = β0 + β1 xi + i ,
the intercept terms are β0 and µ, respectively. In the corresponding design matrices, the
first column of X is 1. If we discard the “other” terms like β1 xi and αi + βj + γij in the
models above, then we have a reduced model of the form Yi = µ + i ; that is, a model
that relates Yi to its overall mean, or, in matrix notation Y = 1µ + . The perpendicular
projection matrix onto C(1) is P1 and
2
Y0 P1 Y = Y0 P1 P1 Y = (P1 Y)0 (P1 Y) = nY .
PAGE 21
CHAPTER 2 STAT 714, J. TEBBS
This is the model sum of squares for the model Yi = µ + i ; that is, Y0 P1 Y is the sum of
squares that arises from fitting the overall mean µ. Now, consider a general linear model
of the form Y = Xβ + , where E() = 0, and suppose that the first column of X is 1.
In general, we know that
Y0 Y = Y0 IY = Y0 PX Y + Y0 (I − PX )Y.
Y0 (I − P1 )Y = Y0 (PX − P1 )Y + Y0 (I − PX )Y.
The quantity Y0 (I−P1 )Y is called the corrected total sum of squares and the quantity
Y0 (PX − P1 )Y is called the corrected model sum of squares. The term “corrected”
is understood to mean that we have removed the effects of “fitting the mean.” This is
important because this is the sum of squares breakdown that is commonly used; i.e.,
Source df SS
Model (Corrected) r−1 Y0 (PX − P1 )Y
Residual n−r Y0 (I − PX )Y
Total (Corrected) n−1 Y0 (I − P1 )Y
In ANOVA models, the corrected model sum of squares Y0 (PX − P1 )Y is often broken
down further into smaller components which correspond to different parts; e.g., orthog-
onal contrasts, main effects, interaction terms, etc. Finally, the degrees of freedom are
simply the corresponding ranks of PX − P1 , I − PX , and I − P1 .
NOTE : In the general linear model Y = Xβ + , the residual vector from the least
e = (I − PX )Y ∈ N (X0 ), so b
squares fit b e0 X = 0; that is, the residuals in a least squares
fit are orthogonal to the columns of X, since the columns of X are in C(X). Note that if
1 ∈ C(X), which is true of all linear models with an intercept term, then
n
X
e0 1 =
b ebi = 0,
i=1
that is, the sum of the residuals from a least squares fit is zero. This is not necessarily
true of models for which 1 ∈
/ C(X).
PAGE 22
CHAPTER 2 STAT 714, J. TEBBS
= PX − PW − PW + PW = PX − PW .
v = (PX − PW )(d1 + d2 )
= (PX − PW )(Xa + d2 )
= PX Xa + PX d2 − PW Xa − PW d2
= Xa + 0 − PW Xa − 0
Thus, C(PX − PW ) ⊆ C[(I − PW )X]. Now, suppose that w ∈ C[(I − PW )X]. Then
w = (I − PW )Xc, for some c. Thus,
PAGE 23
CHAPTER 2 STAT 714, J. TEBBS
Result 2.10. If C(W) ⊂ C(X), then C(PX − PW ) = C[(I − PW )X] is the orthogonal
complement of C(PW ) with respect to C(PX ); that is,
C(PX − PW ) = C(PW )⊥
C(PX ) .
showing that the orthogonal complement of C(PW ) with respect to C(PX ) is contained
in C(PX − PW ).
REMARK : The preceding two results are important for hypothesis testing in linear
models. Consider the linear models
Y = Xβ + and Y = Wγ + ,
where C(W) ⊂ C(X). As we will learn later, the condition C(W) ⊂ C(X) implies that
Y = Wγ + is a reduced model when compared to Y = Xβ + , sometimes called
the full model. If E() = 0, then, if the full model is correct,
PAGE 24
CHAPTER 2 STAT 714, J. TEBBS
2.3 Reparameterization
IMPORTANT : Any two linear models with the same estimation space are really the
same model; the models are said to be reparameterizations of each other. Any two
such models will give the same predicted values, the same residuals, the same ANOVA
table, etc. In particular, suppose that we have two linear models:
Y = Xβ + and Y = Wγ + .
If C(X) = C(W), then PX does not depend on which of X or W is used; it depends only
on C(X) = C(W). As we will find out, the least-squares estimate of E(Y) is
Y
b = PX Y = Xβ
b = Wb
γ.
EXAMPLE : Recall the simple linear regression model from Chapter 1 given by
Yi = β0 + β1 xi + i ,
for i = 1, 2, ..., n. Although not critical for this discussion, we will assume that 1 , 2 , ..., n
are uncorrelated random variables with mean 0 and common variance σ 2 > 0. Recall
PAGE 25
CHAPTER 2 STAT 714, J. TEBBS
and P
i Yi
X0 Y = P .
x Y
i i i
For the simple linear regression model, it can be shown (verify!) that the perpendicular
projection matrix PX is given by
PX = X(X0 X)−1 X0
2
1 −x) 1 −x)(x2 −x) 1 −x)(xn −x)
1
+ P(x(x 2
1
+ (xP
2 ··· 1
+ (xP
2
n i i −x) n i (xi −x) n i (xi −x)
−x)2
n + 1 −x)(x 2 −x) 2 −x)(xn −x)
1 (xP 1 P(x2 1 (xP
+ ··· +
2 2 2
i (xi −x) n i (xi −x) n i (xi −x)
= .. .. .. ..
.
.
. . .
2
1 −x)(xn −x) 2 −x)(xn −x) P(xn −x) 2
1
+ (xP (xi −x)2
1
+ (xP
2 ··· 1
+
n i n i (xi −x) n i (x i −x)
Yi = γ0 + γ1 (xi − x) + i
or Y = Wγ + , where
Y 1 x1 − x 1
1
1 x2 − x
Y2 γ0 2
Yn×1 =
.. , Wn×2 = ..
..
,
γ 2×1 = , n×1 =
..
.
. . . γ1 .
Yn 1 xn − x n
PAGE 26
CHAPTER 2 STAT 714, J. TEBBS
To find the least-squares estimator for γ in the reparameterized model, observe that
1
n 0 0
W0 W = and (W0 W)−1 = n .
2 1
P
0 i (xi − x) 0 (xi −x)2
P
i
Note that (W0 W)−1 is diagonal; this is one of the benefits to working with this param-
eterization. The least squares estimator of γ is given by
γ
b0 Y
b = (W0 W)−1 W0 Y =
γ = P ,
(xi −x)(Yi −Y )
γ
b1 iP
(xi −x)2 i
PW = W(W0 W)−1 W0
2
1 −x) 1 −x)(x2 −x) 1 −x)(xn −x)
1
+ P(x(x −x) 2
1
+ (xP
2 ··· 1
+ (xP
2
n i i n i (xi −x) n i (xi −x)
−x)2
n + 1 −x)(x 2 −x) 2 −x)(xn −x)
1 (xP 1 P(x2 1 (xP
+ ··· +
(xi −x)2 n i (xi −x)
2 n i (xi −x)
2
=
i
.. .. ... ..
.
. . .
2
1 −x)(xn −x) 2 −x)(xn −x) P(xn −x) 2
1
+ (xP (xi −x)2
1
+ (xP
2 ··· 1
+
n i n i (xi −x) n i (x i −x)
which is the same as PX . Thus, the fitted values will be the same; i.e., Y
b = PX Y =
Xβ
b = Wb
γ = PW Y, and the analysis will be the same under both parameterizations.
Exercise: Show that the one way fixed effects ANOVA model Yij = µ + αi + ij , for
i = 1, 2, ..., a and j = 1, 2, ..., ni , and the cell means model Yij = µi + ij are reparameter-
izations of each other. Does one parameterization confer advantages over the other?
PAGE 27
CHAPTER 3 STAT 714, J. TEBBS
3.1 Introduction
REMARK : Estimability is one of the most important concepts in linear models. Consider
the general linear model
Y = Xβ + ,
where E() = 0. In our discussion that follows, the assumption cov() = σ 2 I is not
needed. Suppose that X is n × p with rank r ≤ p. If r = p (as in regression models), then
b = (X0 X)−1 X0 Y. If r < p,
estimability concerns vanish as β is estimated uniquely by β
(a common characteristic of ANOVA models), then β can not be estimated uniquely.
However, even if β is not estimable, certain functions of β may be estimable.
3.2 Estimability
DEFINITIONS :
PAGE 28
CHAPTER 3 STAT 714, J. TEBBS
Yij = µ + αi + ij ,
Note that r(X) = 3, so X is not of full rank; i.e., β is not uniquely estimable. Consider
the following parametric functions λ0 β:
PAGE 29
CHAPTER 3 STAT 714, J. TEBBS
E(Y 1+ − Y 2+ ) = (µ + α1 ) − (µ + α2 )
= α1 − α2 = λ04 β
and that Y 1+ −Y 2+ = c+a0 Y, where c = 0 and a0 = (1/2, 1/2, −1/2, −1/2, 0, 0). Finally,
Y 2+ + Y 3+ 1
E Y 1+ − = (µ + α1 ) − {(µ + α2 ) + (µ + α3 )}
2 2
1
= α1 − (α2 + α3 ) = λ05 β.
2
Note that
Y 2+ + Y 3+
Y 1+ − = c + a0 Y,
2
where c = 0 and a0 = (1/2, 1/2, −1/4, −1/4, −1/4, −1/4).
REMARKS :
2. If λ01 β, λ02 β, ..., λ0k β are estimable, then any linear combination of them; i.e.,
Pk 0
i=1 di λi β, where di ∈ R, is also estimable.
DEFINITION : Linear functions λ01 β, λ02 β, ..., λ0k β are said to be linearly independent
if λ1 , λ2 , ..., λk comprise a set of linearly independent vectors; i.e., Λ = (λ1 λ2 · · · λk )
has rank k.
PAGE 30
CHAPTER 3 STAT 714, J. TEBBS
Proof. Suppose that λ0 β is estimable. Then λ0 = a0 X, for some a. From Result 2.5,
λ0 β
b = a0 Xβ
b = a0 PX Y
λ0 β
e = a0 Xβ
e = a0 PX Y.
Example 3.2. In Example 3.1, we considered the one-way fixed effects ANOVA model
Yij = µ + αi + ij , for i = 1, 2, 3 and j = 1, 2. For this model, it is easy to show that
6 2 2 2
0
2 2 0 0
XX=
2 0 2 0
2 0 0 2
PAGE 31
CHAPTER 3 STAT 714, J. TEBBS
Note that
Y11
1 1 1 1 1 1 Y12 Y + Y12 + Y21 + Y12 + Y31 + Y32
11
0
1 1 0 0 0 0 Y21 Y11 + Y12
XY =
=
.
0 0 1 1 0 0 Y22 Y21 + Y22
0 0 0 0 1 1 Y31 Y31 + Y32
Y32
Note that
λ03 β
b = λ0 β
3
e = Y 1+ .
λ04 β
b = λ0 β
4
e = Y 1+ − Y 2+ .
e = Y 1+ − 1 (Y 2+ + Y 3+ ).
b = λ0 β
λ05 β 5
2
PAGE 32
CHAPTER 3 STAT 714, J. TEBBS
Finally, note that these three estimable functions are linearly independent since
1 0 0
1 1
1
Λ = λ3 λ4 λ5 =
0 −1 −1/2
0 0 −1/2
has rank r(Λ) = 3. Of course, more estimable functions λ0i β can be found, but we can
find no more linearly independent estimable functions because r(X) = 3.
Result 3.4. Under the model assumptions Y = Xβ + , where E() = 0, the least
squares estimator λ0 β
b of an estimable function λ0 β is a linear unbiased estimator of λ0 β.
b solves the normal equations. We know (by definition) that λ0 β
Proof. Suppose that β b is
λ0 β
b = λ0 {(X0 X)− X0 Y + [I − (X0 X)− X0 X]z}
= λ0 (X0 X)− X0 Xβ
= a0 X(X0 X)− X0 Xβ
= a0 PX Xβ = a0 Xβ = λ0 β.
PAGE 33
CHAPTER 3 STAT 714, J. TEBBS
λ0 c1 = 0
λ0 c2 = 0
..
.
λ0 cs = 0
REMARK : There are two spaces of interest: C(X0 ) = R(X) and N (X). If X is n × p
with rank r < p, then dim{C(X0 )} = r and dim{N (X)} = s = p − r. Therefore, if
c1 , c2 , ..., cs are linearly independent, then {c1 , c2 , ..., cs } must be a basis for N (X). But,
⇐⇒ λ is orthogonal to c1 , c2 , ..., cs
⇐⇒ λ0 ci = 0, i = 1, 2, ..., s.
TERMINOLOGY : A set of linear functions {λ01 β, λ02 β, ..., λ0k β} is said to be jointly
nonestimable if the only linear combination of λ01 β, λ02 β, ..., λ0k β that is estimable is
the trivial one; i.e., ≡ 0. These types of functions are useful in non-full-rank linear models
and are associated with side conditions.
PAGE 34
CHAPTER 3 STAT 714, J. TEBBS
GENERAL CASE : Consider the one-way fixed effects ANOVA model Yij = µ + αi + ij ,
for i = 1, 2, ..., a and j = 1, 2, ..., ni , where E(ij ) = 0. In matrix form, X and β are
µ
1n1 1n1 0n1 · · · 0n1
α1
1n2 0n2 1n2 · · · 0n2
Xn×p = . and β p×1 = α2 ,
. . .
.
.. .. .. .. ..
..
.
1na 0na 0na · · · 1na
αa
P
where p = a+1 and n = i ni . Note that the last a columns of X are linearly independent
and the first column is the sum of the last a columns. Hence, r(X) = r = a and
s = p − r = 1. With c1 = (1, −10a )0 , note that Xc1 = 0 so {c1 } forms a basis for N (X).
Thus, the necessary and sufficient condition for λ0 β = λ0 µ + ai=1 λi αi to be estimable
P
is
a
X
0
λ c1 = 0 =⇒ λ0 = λi .
i=1
1. µ + αi
2. αi − αk
Pa Pa
3. any contrast in the α’s; i.e., i=1 λi αi , where i=1 λi = 0.
1. µ
2. αi
Pa
3. i=1 ni α i .
There is only s = 1 jointly nonestimable function. Later we will learn that jointly non-
estimable functions can be used to “force” particular solutions to the normal equations.
PAGE 35
CHAPTER 3 STAT 714, J. TEBBS
The following are examples of sets of linearly independent estimable functions (verify!):
1. {µ + α1 , µ + α2 , ..., µ + αa }
2. {µ + α1 , α1 − α2 , ..., α1 − αa }.
LEAST SQUARES ESTIMATES : We now wish to calculate the least squares estimates
of estimable functions. Note that X0 X and one generalized inverse of X0 X is given by
n n 1 n2 · · · na 0 0 0 ··· 0
n1 n1 0 ··· 0 0 1/n1 0 ··· 0
X0 X = n 2 0 n 2 0
· · · 0 and (X X) = 0
−
1/n2 · · ·
0 0
.. .. .. .. .. .. .. ..
.. ..
. . . . . . . . . .
na 0 0 · · · na 0 0 0 · · · 1/na
REMARK : We know that this solution is not unique; had we used a different generalized
inverse above, we would have gotten a different least squares estimate of β. However, least
squares estimates of estimable functions λ0 β are invariant to the choice of generalized
inverse, so our choice of (X0 X)− above is as good as any other. From this solution, we
have the unique least squares estimates:
µ + αi Y i+
αi − αk Y i+ − Y k+
Pa Pa Pa
i=1 λi αi , where i=1 λi = 0 i=1 λi Y i+
PAGE 36
CHAPTER 3 STAT 714, J. TEBBS
GENERAL CASE : Consider the two-way fixed effects (crossed) ANOVA model
Yijk = µ + αi + βj + ijk ,
for i = 1, 2, ..., a and j = 1, 2, ..., b, and k = 1, 2, ..., nij , where E(ij ) = 0. For ease of
presentation, we take nij = 1 so there is no need for a k subscript; that is, we can rewrite
the model as Yij = µ + αi + βj + ij . In matrix form, X and β are
µ
α1
α2
1b 1b 0b · · · 0b Ib
..
.
1b 0b 1b · · · 0b Ib
Xn×p = and β p×1 = αa ,
.. .. .. . . . .
. .. ..
. . .
β1
1b 0b 0b · · · 1b Ib
β2
..
.
βb
where p = a + b + 1 and n = ab. Note that the first column is the sum of the last b
columns. The 2nd column is the sum of the last b columns minus the sum of columns 3
through a + 1. The remaining columns are linearly independent. Thus, we have s = 2
linear dependencies so that r(X) = a + b − 1. The dimension of N (X) is s = 2. Taking
1 1
c1 = −1a and c2 = 0a
0b −1b
produces Xc1 = Xc2 = 0. Since c1 and c2 are linearly independent; i.e., neither is
a multiple of the other, {c1 , c2 } is a basis for N (X). Thus, necessary and sufficient
conditions for λ0 β to be estimable are
a
X
0
λ c1 = 0 =⇒ λ0 = λi
i=1
b
X
λ0 c2 = 0 =⇒ λ0 = λa+j .
j=1
PAGE 37
CHAPTER 3 STAT 714, J. TEBBS
1. µ + αi + βj
2. αi − αk
3. βj − βk
Pa Pa
4. any contrast in the α’s; i.e., i=1 λi αi , where i=1 λi = 0
Pb Pb
5. any contrast in the β’s; i.e., j=1 λa+j βj , where j=1 λa+j = 0.
1. µ
2. αi
3. βj
Pa
4. i=1 αi
Pb
5. j=1 βj .
1. {αa , βb }
P P
2. { i αi , j βj }.
1. {µ + α1 + β1 , α1 − α2 , ..., α1 − αa , β1 − β2 , ..., β1 − βb }.
NOTE : When replication occurs; i.e., when nij > 1, for all i and j, our estimability
findings are unchanged. Replication does not change R(X). We obtain the following
least squares estimates:
PAGE 38
CHAPTER 3 STAT 714, J. TEBBS
µ + αi + βj Y ij+
αi − αl Y i++ − Y l++
βj − βl Y +j+ − Y +l+
Pa Pa Pa
i=1 ci αi , with i=1 ci = 0 i=1 ci Y i++
Pb Pb Pb
j=1 di βj , with j=1 di = 0 j=1 di Y +j+
These formulae are still technically correct when nij = 1. When some nij = 0, i.e., there
are missing cells, estimability may be affected; see Monahan, pp 46-48.
GENERAL CASE : Consider the two-way fixed effects (crossed) ANOVA model
PAGE 39
CHAPTER 3 STAT 714, J. TEBBS
There are p = 12 parameters. The last six columns of X are linearly independent, and the
other columns can be written as linear combinations of the last six columns, so r(X) = 6
and s = p − r = 6. To determine which functions λ0 β are estimable, we need to find a
basis for N (X). One basis {c1 , c2 , ..., c6 } is
−1 −1 0 0 0 −1
1 0 −1 0 0 1
1 0 0 −1 0 1
1 0 0 0 0 0
0 1 0 0 −1 1
0 1 0 0 0 0
, , , , , .
−1
0 0 1 0 1
0 0 1 0 0 0
−1
0 0 0 1 1
0
0 0
1
0
0
0 0 0 0 1 0
0 0 0 0 0 1
3.3 Reparameterization
PAGE 40
CHAPTER 3 STAT 714, J. TEBBS
1. PW = PX .
Proof.
1. PW = PX since perpendicular projection matrices are unique.
2. Note that
X0 XTb
γ = X0 Wb
γ = X0 PW Y = X0 PX Y = X0 Y.
PAGE 41
CHAPTER 3 STAT 714, J. TEBBS
3. This follows from (2), since the least squares estimate is invariant to the choice of the
solution to the normal equations.
4. If q0 ∈ R(W), then q0 = a0 W, for some a. Then, q0 S0 = a0 WS0 = a0 X ∈ R(X), so
that q0 S0 β is estimable under Model GL. From (3), we know the least squares estimate
of q0 S0 β is q0 S0 Tb
γ . But,
γ = a0 WS0 Tb
q0 S0 Tb γ = a0 XTb
γ = a0 Wb
γ = q0 γ
b.
WARNING: The converse to (4) is not true; i.e., q0 S0 β being estimable under Model GL
doesn’t necessarily imply that q0 γ is estimable under Model GL-R. See Monahan, pp 52.
TERMINOLOGY : Because C(W) = C(X) and r(X) = r, Wn×t must have at least r
columns. If W has exactly r columns; i.e., if t = r, then the reparameterization of
Model GL is called a full rank reparameterization. If, in addition, W0 W is diagonal,
the reparameterization of Model GL is called an orthogonal reparameterization; see,
e.g., the centered linear regression model in Section 2 (notes).
NOTE : A full rank reparameterization always exists; just delete the columns of X that are
linearly dependent on the others. In a full rank reparameterization, (W0 W)−1 exists, so
the normal equations W0 Wγ = W0 Y have a unique solution; i.e., γ
b = (W0 W)−1 W0 Y.
DISCUSSION : There are two (opposing) points of view concerning the utility of full rank
reparameterizations.
• Some argue that, since making inferences about q0 γ under the full rank reparam-
eterized model (Model GL-R) is equivalent to making inferences about q0 S0 β in
the possibly-less-than-full rank original model (Model GL), the inclusion of the
possibility that the design matrix has less than full column rank causes a needless
complication in linear model theory.
• The opposing argument is that, since computations required to deal with the repa-
rameterized model are essentially the same as those required to handle the original
model, we might as well allow for less-than-full rank models in the first place.
PAGE 42
CHAPTER 3 STAT 714, J. TEBBS
• I tend to favor the latter point of view; to me, there is no reason not to include
less-than-full rank models as long as you know what you can and can not estimate.
Yij = µ + αi + ij ,
for i = 1, 2, ..., a and j = 1, 2, ..., ni , where E(ij ) = 0. In matrix form, X and β are
µ
1n1 1n1 0n1 · · · 0n1
α1
1n2 0n2 1n2 · · · 0n2
Xn×p = and β p×1 = ,
.. .. .. .. .. α2
. . . . . ..
.
1na 0na 0na · · · 1na
αa
P
where p = a + 1 and n = i ni . This is not a full rank model since the first column is
the sum of the last a columns; i.e., r(X) = a.
Exercise: What are the matrices T and S associated with this reparameterization?
PAGE 43
CHAPTER 3 STAT 714, J. TEBBS
Example 3.4. Two part multiple linear regression model. Consider the linear model
Y = Xβ + , where E() = 0. Suppose that X is full rank. Write X = (X1 X2 ) and
β = (β 01 , β 02 )0 so that the model can be written as
Model GL: Y = X1 β 1 + X2 β 2 + .
Now, set W1 = X1 and W2 = (I − PX1 )X2 , where PX1 = X1 (X01 X1 )−1 X01 is the
perpendicular projection matrix onto C(X1 ). A reparameterized version of Model GL is
Model GL-R: Y = W1 γ 1 + W2 γ 2 + ,
PAGE 44
CHAPTER 3 STAT 714, J. TEBBS
so that
(X01 X1 )−1 X01 Y γ
b1
b = (W0 W)−1 W0 Y =
γ ≡ .
{X02 (I − PX1 )X2 }−1 X02 (I − PX1 )Y γ
b2
where note that (X01 X1 )−1 X01 X2 is the estimate obtained from “regressing” X2 on X1 .
Furthermore, the estimate γ
b 2 can be thought of as the estimate obtained from regressing
Y on W2 = (I − PX1 )X2 .
APPLICATION : Consider the two part full-rank regression model Y = X1 β 1 +X2 β 2 +,
where E() = 0. Suppose that X2 = x2 is n × 1 and that β 2 = β2 is a scalar. Consider
two different models:
Reduced model: Y = X1 β 1 +
Full model: Y = X1 β 1 + x2 β2 + .
We use the term “reduced model” since C(X1 ) ⊂ C(X1 , x2 ). Consider the full model
Y = X1 β 1 + x2 β2 + and premultiply by I − PX1 to obtain
= b2 (I − PX1 )x2 + ∗ ,
PAGE 45
CHAPTER 3 STAT 714, J. TEBBS
(I − PX1 )Y = Y − PX1 Y ≡ b
eY|X1 ,
ex2 |X1 + ∗ ,
eY|X1 = b2b
b
• If a linear trend exists in this plot, this suggests that x2 enters into the (full) model
linearly. This plot can also be useful for detecting outliers and high leverage points.
• On the down side, added-variable plots only look at one predictor at a time so one
can not assess multicolinearity; that is, if the predictor x2 is “close” to C(X1 ), this
may not be detected in the plot.
• The slope of the least squares regression line for the added variable plot is
βb2 = [{(I − PX1 )x2 }0 (I − PX1 )x2 ]−1 {(I − PX1 )x2 }0 (I − PX1 )Y
b = (X0 X)−1 X0 Y.
• If r = p, then a unique least squares solution exists; i.e., β
PAGE 46
CHAPTER 3 STAT 714, J. TEBBS
Example 3.5. Consider the one-way fixed effects ANOVA model Yij = µ + αi + ij , for
i = 1, 2, ..., a and j = 1, 2, ..., ni , where E(ij ) = 0. The normal equations are
P P
n n1 n2 · · · na µ Y
i j ij
P
n1 n1 0 · · · 0 α1 Y1j
j
X0 Xβ = n2 0 n2 · · · 0 α2 = = X0 Y,
P
j Y 2j
.. .. .. . . . .. .. ..
. . . . . .
P
na 0 0 · · · na αa j ajY
Pa
REMARK : Equations like µ = 0 and i=1 ni αi = 0 are used to “force” a particular so-
lution to the normal equations and are called side conditions. Different side conditions
produce different least squares solutions. We know that in the one-way ANOVA model,
the parameters µ and αi , for i = 1, 2, ..., a, are not estimable (individually). Imposing
side conditions does not change this. My feeling is that when we attach side conditions to
force a unique solution, we are doing nothing more than solving a mathematical problem
that isn’t relevant. After all, estimable functions λ0 β have least squares estimates that
do not depend on which side condition was used, and these are the only functions we
should ever be concerned with.
PAGE 47
CHAPTER 3 STAT 714, J. TEBBS
REMARK : We have seen similar results for the two-way crossed ANOVA model. In
general, what and how many conditions should we use to “force” a particular solution to
the normal equations? Mathematically, we are interested in imposing additional linear
restrictions of the form Cβ = 0 where the matrix C does not depend on Y.
These are called the augmented normal equations. When we add the constraint
Cβ = 0, we want these equations to be consistent for all Y. We now would like to find
a sufficient condition for consistency. Suppose that w ∈ R(X0 X) ∩ R(C). Note that
Thus, 0 = w − w = X0 Xv1 + C0 v2
=⇒ 0 = v10 X0 X + v20 C
0 0
X X X X
=⇒ 0 = (v10 v20 ) = v0 ,
C C
PAGE 48
CHAPTER 3 STAT 714, J. TEBBS
Thus, we need v10 X0 Y = 0, for all Y. Successively taking Y to be standard unit vectors,
for i = 1, 2, ..., n, convinces us that v10 X0 = 0 ⇐⇒ Xv1 = 0 ⇐⇒ X0 Xv1 = 0 =⇒ w = 0.
Thus, the augmented normal equations are consistent when R(X0 X)∩R(C) = {0}. Since
R(X0 X) = R(X), a sufficient condition for consistency is R(X) ∩ R(C) = {0}. Now,
consider the parametric function λ0 Cβ, for some λ. We know that λ0 Cβ is estimable if
and only if λ0 C ∈ R(X). However, clearly λ0 C ∈ R(C). Thus, λ0 Cβ is estimable if and
only if λ0 Cβ = 0. In other words, writing
c01
c02
C=
.. ,
.
0
cs
the set of functions {c01 β, c02 β, ..., c0s β} is jointly nonestimable. Therefore, we can set
a collection of jointly nonestimable functions equal to zero and augment the normal
equations so that they remain consistent. We get a unique solution if
0
XX
r = p.
C
PAGE 49
CHAPTER 3 STAT 714, J. TEBBS
SUMMARY : To augment the normal equations, we can find a set of s jointly nonestimable
functions {c01 β, c02 β, ..., c0s β} with
c01
c02
r(C) = r
.. = s.
.
c0s
Then,
0 0
XX XY
β =
C 0
is consistent and has a unique solution.
Example 3.5 (continued). Consider the one-way fixed effects ANOVA model
Yij = µ + αi + ij ,
for i = 1, 2, ..., a and j = 1, 2, ..., ni , where E(ij ) = 0. The normal equations are
P P
n n1 n2 · · · na µ Y
i j ij
P
n1 n1 0 · · · 0 α 1 Y1j
j
X0 Xβ = n2 0 n2 · · · 0 α2 = = X0 Y.
P
j Y2j
.. .. .. . . .. .. ..
. . .
. . . .
P
na 0 0 · · · na αa j Yaj
We know that r(X) = r = a < p (this system can not be solved uniquely) and that
s = p − r = (a + 1) − a = 1. Thus, to augment the normal equations, we need to find
s = 1 (jointly) nonestimable function. Take c01 = (1, 0, 0, ..., 0), which produces
µ
α1
0
c1 β = (1 0 0 · · · 0) α2 = µ.
..
.
αa
PAGE 50
CHAPTER 3 STAT 714, J. TEBBS
Solving this (now full rank) system produces the unique solution
µ
b = 0
α
bi = Y i+ i = 1, 2, ..., a.
You’ll note that this choice of c1 used to augment the normal equations corresponds to
specifying the side condition µ = 0.
Pa
Exercise. Redo this example using (a) the side condition i=1 ni αi = 0, (b) the side
condition αa = 0 (what SAS does), and (c) using another side condition.
Example 3.6. Consider the two-way fixed effects (crossed) ANOVA model
Yij = µ + αi + βj + ij ,
for i = 1, 2, ..., a and j = 1, 2, ..., b, where E(ij ) = 0. For purposes of illustration, let’s
take a = b = 3, so that n = ab = 9 and p = a + b + 1 = 7. In matrix form, X and β are
1 1 0 0 1 0 0
1 1 0 0 0 1 0 µ
1 1 0 0 0 0 1 α1
1 0 1 0 1 0 0 α2
X9×7 = 1 0 1 0 0 1 0 and β 7×1 = α3 .
1 0 1 0 0 0 1 β
1
1 0 0 1 1 0 0 β2
1 0 0 1 0 1 0 β3
1 0 0 1 0 0 1
PAGE 51
CHAPTER 3 STAT 714, J. TEBBS
This system does not have a unique solution. To augment the normal equations, we will
need a set of s = 2 linearly independent jointly nonestimable functions. From Section
P P
3.2.2, one example of such a set is { i αi , j βj }. For this choice, our matrix C is
0
c1 0 1 1 1 0 0 0
C= = .
0
c2 0 0 0 0 1 1 1
Solving this system produces the “estimates” of µ, αi and βj under the side conditions
P P
i αi = j βj = 0. These “estimates” are
µ
b = Y ++
bi = Y i+ − Y ++ , i = 1, 2, 3
α
βbj = Y +j − Y ++ , j = 1, 2, 3.
PAGE 52
CHAPTER 3 STAT 714, J. TEBBS
Exercise. Redo this example using (a) the side conditions αa = 0 and βb = 0 (what
SAS does) and (b) using another set of side conditions.
QUESTION : In general, can we give a mathematical form for the particular solution?
Note that we are now solving
0 0
XX XY
β = ,
C 0
which is equivalent to
0 0
XX XY
β =
0
CC 0
since Cβ = 0 iff C0 Cβ = 0. Thus, any solution to this system must also satisfy
(X0 X + C0 C)β = X0 Y.
But,
X X
r(X0 X + C0 C) = r (X0 C0 ) = r = p,
C C
that is, X0 X + C0 C is nonsingular. Hence, the unique solution to the augmented normal
equations must be
b = (X0 X + C0 C)−1 X0 Y.
β
b = X(X0 X + C0 C)−1 X0 Y
Xβ
= PX Y,
is the perpendicular projection matrix onto C(X). This shows that (X0 X + C0 C)−1 is a
(non-singular) generalized inverse of X0 X.
PAGE 53
CHAPTER 4 STAT 714, J. TEBBS
4.1 Introduction
E{t(Y)} = λ0 β,
for all β. We have seen this implies that c = 0 and λ0 ∈ R(X); i.e., λ0 β is estimable.
• When λ0 β is estimable, it is possible to find several estimators that are unbiased for
λ0 β. For example, in the one-way (fixed effects) ANOVA model Yij = µ + αi + ij ,
with E(ij ) = 0, Y11 , (Y11 + Y12 )/2, and Y 1+ are each unbiased estimators of
λ0 β = µ + α1 (there are others too).
E(λ0 β)
b = E(a0 PX Y) = a0 PX E(Y) = a0 PX Xβ = a0 Xβ = λ0 β.
GOAL: Among all linear unbiased estimators for λ0 β, we want to find the “best” linear
unbiased estimator in the sense that it has the smallest variance. We will show that
the least squares estimator λ0 β
b is the best linear unbiased estimator (BLUE) of λ0 β,
PAGE 54
CHAPTER 4 STAT 714, J. TEBBS
b = var[λ0 β
var(θ) b + (θb − λ0 β)]
b
= var(λ0 β)
b + var(θb − λ0 β)
b +2cov(λ0 β,
b θb − λ0 β).
b
| {z }
≥0
b θb − λ0 β)
cov(λ0 β, b = cov(a0 PX Y, a0 Y − a0 PX Y)
= cov[a0 PX Y, a0 (I − PX )Y]
= σ 2 Ia0 PX (I − PX )a = 0,
pr(θb = λ0 β)
b = 1. This establishes uniqueness.
PAGE 55
CHAPTER 4 STAT 714, J. TEBBS
where λ0i = a0i X, for some ai ; i.e., λ0i ∈ R(X), for i = 1, 2, ..., k. We say that Λ0 β is
estimable if and only if λ0i β, i = 1, 2, ..., k, are each estimable. Put another way, Λ0 β is
estimable if and only if Λ0 = A0 X, for some matrix A.
definite.
Proof. It suffices to show that x0 [cov(c + A0 Y) − cov(Λ0 β)]x
b ≥ 0, for all x. Note that
is unbiased and
Yi = β0 + β1 xi + i ,
for i = 1, 2, ..., n, where 1 , 2 , ..., n are uncorrelated random variables with mean 0 and
common variance σ 2 > 0 (these are the Gauss Markov assumptions). Recall that, in
PAGE 56
CHAPTER 4 STAT 714, J. TEBBS
matrix notation,
Y1 1 x1 1
Y2 1 x2 β0 2
Y=
..
,
X=
.. ..
,
β= , =
..
.
. . . β1 .
Yn 1 xn n
REVIEW : Consider the Gauss-Markov model Y = Xβ+, where E() = 0 and cov() =
σ 2 I. The best linear unbiased estimator (BLUE) for any estimable function λ0 β is λ0 β,
b
where β
b is any solution to the normal equations. Clearly, E(Y) = Xβ is estimable and
b = X(X0 X)−1 X0 Y = PX Y = Y,
Xβ b
the perpendicular projection of Y onto C(X); that is, the fitted values from the least
squares fit. The residuals are given by
e=Y−Y
b b = Y − PX Y = (I − PX )Y,
the perpendicular projection of Y onto N (X0 ). Recall that the residual sum of squares
is
b 0 (Y − Xβ)
b = (Y − Xβ)
Q(β) e0 b
b =b e = Y0 (I − PX )Y.
PAGE 57
CHAPTER 4 STAT 714, J. TEBBS
Result 4.3. Suppose that Z is a random vector with mean E(Z) = µ and covariance
matrix cov(Z) = Σ. Let A be nonrandom. Then
Proof. Note that Z0 AZ is a scalar random variable; hence, Z0 AZ = tr(Z0 AZ). Also, recall
that expectation E(·) and tr(·) are linear operators. Finally, recall that tr(AB) = tr(BA)
for conformable A and B. Now,
= tr[AE(ZZ0 )]
= tr[A(Σ + µµ0 )]
= tr(AΣ) + tr(Aµµ0 )
REMARK : Finding var(Z0 AZ) is more difficult; see Section 4.9 in Monahan. Consider-
able simplification results when Z follows a multivariate normal distribution.
= σ 2 [n − r(PX )] = σ 2 (n − r).
Thus,
b2 = (n − r)−1 Y0 (I − PX )Y
σ
PAGE 58
CHAPTER 4 STAT 714, J. TEBBS
Source df SS MS F
Model (Corrected) r−1 SSR = Y0 (PX − P1 )Y MSR = SSR
r−1
F = MSR
MSE
NOTES :
• The degrees of freedom associated with each SS is the rank of its appropriate
perpendicular projection matrix; that is, r(PX −P1 ) = r −1 and r(I−PX ) = n−r.
• Note that
cov(Y,
b be) = cov[PX Y, (I − PX )Y] = PX σ 2 I(I − PX ) = 0.
That is, the least squares fitted values are uncorrelated with the residuals.
= (Xβ)0 Xβ + σ 2 r(PX − P1 )
= (Xβ)0 Xβ + (r − 1)σ 2 .
Thus,
E(MSR) = (r − 1)−1 E(SSR) = σ 2 + (r − 1)−1 (Xβ)0 Xβ.
• If Xβ ∈ C(1), that is, the independent variables in X add nothing to the model,
then (Xβ)0 (PX − P1 )Xβ = 0 and MSR and MSE are both unbiased estimators
of σ 2 . If this is true, F should be close to 1. Large values of F occur when
(Xβ)0 (PX − P1 )Xβ is large, that is, when Xβ is “far away” from C(1), that is,
when the independent variables in X are more relevant in explaining E(Y).
PAGE 59
CHAPTER 4 STAT 714, J. TEBBS
Y = Xβ + Wδ + ,
where E() = 0 and cov() = σ 2 I. The vector η = Wδ includes the variables and
coefficients missing from Xβ. If the analyst uses Y = Xβ + to describe the data,
s/he is missing important variables that are in W, that is, the analyst is misspecifying
the true model by underfitting. We now examine the effect of underfitting on (a) least
squares estimates of estimable functions and (b) the estimate of the error variance σ 2 .
E(λ0 β)
b = E[λ0 (X0 X)− X0 Y] = λ0 (X0 X)− X0 E(Y)
= a0 PX Xβ + a0 PX Wδ
= λ0 β + a0 PX Wδ,
showing that λ0 β
b is no longer unbiased, in general. The amount of the bias depends on
PAGE 60
CHAPTER 4 STAT 714, J. TEBBS
CONSEQUENCES : Now, let’s turn to the estimation of σ 2 . Under the correct model,
4.4.2 Overfitting
Y = X1 β 1 + ,
Y = X1 β 1 + X2 β 2 + ,
that is, the extra variables in X2 are not needed; i.e., β 2 = 0. Set X = [X1 X2 ] and
suppose that X and X1 have full column rank (i.e., a regression setting). The least
e 1 = (X0 X1 )−1 X0 Y. We know that
squares estimator of β 1 under the true model is β 1 1
E(β
e 1) = β1
e ) = σ 2 (X0 X1 )−1 .
cov(β 1 1
On the other hand, the normal equations associated with the larger (unnecessarily large)
model are
X01 X1 X01 X2 β1 X01 Y
X0 Xβ = X0 Y ⇐⇒ =
X02 X1 X02 X2 β2 X02 Y
PAGE 61
CHAPTER 4 STAT 714, J. TEBBS
E(β
b ) = β
1 1
E(β
b ) = 0.
2
large model,
b = (X0 X1 )−1 X0 Y = β
i.e., X0 X is block diagonal, and β 1 1 1
e . This would mean that
1
using the unnecessarily large model has no effect on our estimate of β 1 . However,
the precision with which we can estimate σ 2 is affected since r(I−PX ) < r(I−PX1 );
that is, we have fewer residual degrees of freedom.
b ) 6= σ 2 (X0 X1 )−1 .
cov(β 1 1
Furthermore, as X2 gets “closer” to C(X1 ), then X02 (I − PX1 )X2 gets “smaller.”
This makes [X02 (I − PX1 )X2 ]−1 “larger.” This makes cov(β
b ) “larger.”
1
PAGE 62
CHAPTER 4 STAT 714, J. TEBBS
Y = Xβ + ,
V = QDQ0 ,
where Q is orthogonal and D is the diagonal matrix consisting of λ1 , λ2 , ..., λn , the eigen-
values of V. Because V is pd, we know that λi > 0, for each i = 1, 2, ..., n. The symmetric
square root of V is
V1/2 = QD1/2 Q0 ,
√ √ √
where D1/2 = diag( λ1 , λ2 , ..., λn ). Note that V1/2 V1/2 = V and that V−1 =
V−1/2 V−1/2 , where
V−1/2 = QD−1/2 Q0
√ √ √
and D−1/2 = diag(1/ λ1 , 1/ λ2 , ..., 1/ λn ).
⇐⇒ Y∗ = Uβ + ∗ ,
PAGE 63
CHAPTER 4 STAT 714, J. TEBBS
and
cov(∗ ) = V−1/2 cov()V−1/2 = V−1/2 σ 2 VV−1/2 = σ 2 I.
Note also that R(X) = R(U), because V−1/2 is nonsingular. This means that λ0 β is
estimable in the Aitken model if and only if λ0 β is estimable in the transformed GM
model. The covariance structure on does not affect estimability.
U0 Uβ = U0 Y∗ .
⇐⇒ X0 V−1 Xβ = X0 V−1 Y
X0 V−1 Xβ = X0 V−1 Y
are called the Aitken equations. These should be compared with the normal equations
X0 Xβ = X0 Y
and normal equations, respectively. “GLS” stands for generalized least squares. “OLS”
stands for ordinary least squares.
PAGE 64
CHAPTER 4 STAT 714, J. TEBBS
0 −1
β
b
GLS = (X V X)−1 X0 V−1 Y.
Result 4.4. Consider the Aitken model Y = Xβ+, where E() = 0 and cov() = σ 2 V,
where V is known. If λ0 β is estimable, then λ0 β
b 0
GLS is the BLUE for λ β.
REMARK : If X is full rank, then estimability concerns vanish (as in the GM model) and
b GLS = (X0 V−1 X)−1 X0 V−1 Y is unique. In this case, straightforward calculations show
β
that E(β
b ) = β and
GLS
Example 4.2. Heteroscedastic regression through the origin. Consider the regression
model Yi = βxi + i , for i = 1, 2, ..., n, where E(i ) = 0, var(i ) = σ 2 g 2 (xi ), for some real
function g(·), and cov(i , j ) = 0, for i 6= j. For this model,
Y x g 2 (x1 ) 0 ··· 0
1 1
g 2 (x2 ) ···
Y2 x 0 0
, X = 2 , and V =
Y= .. .. .. .. .. ..
.
.
. . . . .
Yn xn 0 0 ··· g 2 (xn )
PAGE 65
CHAPTER 4 STAT 714, J. TEBBS
Which one is better? Both of these estimators are unbiased, so we turn to the variances.
Straightforward calculations show that
Pn
σ2 2 2
i=1 xi g (xi )
var(βbOLS ) = 2
( ni=1 x2i )
P
2
σ
var(βbGLS ) = Pn 2 2
.
i=1 xi /g (xi )
Thus, Pn 2 2
1 i=1 xi g (xi )
Pn 2 2 ≤ 2 =⇒ var(βbGLS ) ≤ var(βbOLS ).
i=1 xi /g (xi ) ( ni=1 x2i )
P
This result should not be surprising; after all, we know that βbGLS is BLUE.
V−1/2 X[(V−1/2 X)0 V−1/2 X]− (V−1/2 X)0 V−1/2 Y = V−1/2 Xβ,
b
by Result 2.5. Multiplying through by V1/2 and simplifying gives the result.
PAGE 66
CHAPTER 4 STAT 714, J. TEBBS
Result 4.6. A = X(X0 V−1 X)− X0 V−1 is a projection matrix onto C(X).
Proof. We need to show that
(a) A is idempotent
showing that A is idempotent. To show (b), note Aw = X(X0 V−1 X)− X0 V−1 w ∈ C(X).
To show (c), it suffices to show C(A) = C(X). But, A = X(X0 V−1 X)− X0 V−1 implies
that C(A) ⊂ C(X) and AX = X implies that C(X) ⊂ C(A).
Result 4.7. In the Aitken model, if C(VX) ⊂ C(X), then the GLS and OLS estimates
will be equal; i.e., OLS estimates will be BLUE in the Aitken model.
Proof. The proof proceeds by showing that A = X(X0 V−1 X)− X0 V−1 is the perpendicular
projection matrix onto C(X) when C(VX) ⊂ C(X). We already know that A is a
projection matrix onto C(X). Thus, all we have to show is that if w⊥C(X), then Aw = 0.
If V is nonsingular, then r(VX) = r(X). The only way this and C(VX) ⊂ C(X) holds
is if C(VX) = C(X), in which case VXB1 = X and VX = XB2 , for some matrices
B1 and B2 . Multiplying through by V−1 gives XB1 = V−1 X and X = V−1 XB2 .
Thus, C(V−1 X) = C(X) and C(V−1 X)⊥ = C(X)⊥ . If w⊥C(X), then w⊥C(V−1 X); i.e.,
w ∈ N (X0 V−1 ). Since Aw = X(X0 V−1 X)−1 X0 V−1 w = 0, we are done.
PAGE 67
CHAPTER 5 STAT 714, J. TEBBS
5 Distributional Theory
5.1 Introduction
RECALL: If Z ∼ N (0, 1), then the probability density function (pdf) of Z is given by
1 2
fZ (z) = √ e−z /2 I(z ∈ R).
2π
The N (µ, σ 2 ) family is a location-scale family generated by the standard density fZ (z).
is a location-scale family generated by fZ (z); see Casella and Berger, Chapter 3. That
is, if Z ∼ fZ (z), then
1 x−µ
X = σZ + µ ∼ fX (x|µ, σ) = fZ .
σ σ
APPLICATION : With the standard normal density fZ (z), it is easy to see that
1 x−µ 1 1 2
fX (x|µ, σ) = fZ =√ e− 2σ2 (x−µ) I(x ∈ R).
σ σ 2πσ
That is, any normal random variable X ∼ N (µ, σ 2 ) may be obtained by transforming
Z ∼ N (0, 1) via X = σZ + µ.
PAGE 68
CHAPTER 5 STAT 714, J. TEBBS
STARTING POINT : Suppose that Z1 , Z2 , ..., Zp are iid standard normal random vari-
ables. The joint pdf of Z = (Z1 , Z2 , ..., Zp )0 is given by
p
Y
fZ (z) = fZ (zi )
i=1
p p
1 −
Pp 2
i=1 zi /2
Y
= √ e I(zi ∈ R)
2π i=1
If Z has pdf fZ (z), we say that Z has a standard multivariate normal distribution;
i.e., a multivariate normal distribution with mean 0p×1 and covariance matrix Ip . We
write Z ∼ Np (0, I).
Y = V1/2 Z + µ,
E(Y) = E(V1/2 Z + µ) = µ,
since cov(Z) = I. The transformation y = g(z) = V1/2 z+µ is linear in z (and hence, one-
to-one) and the pdf of Y can be found using a transformation. The inverse transformation
is z = g −1 (y) = V−1/2 (y − µ). The Jacobian of the inverse transformation is
−1
∂g (y) −1/2
∂y = |V |,
PAGE 69
CHAPTER 5 STAT 714, J. TEBBS
where |A| denotes the determinant of A. The matrix V^{-1/2} is pd; thus, its determinant is always positive. Thus, for y ∈ R^p,
f_Y(y) = f_Z{g^{-1}(y)}|V^{-1/2}| = (2π)^{-p/2}|V|^{-1/2} exp{−(y − µ)'V^{-1}(y − µ)/2}.
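To illustrate the transformation Y = V^{1/2}Z + µ numerically (not part of the notes), here is a short Python sketch; the vector µ and matrix V are made-up values, and the symmetric square root is computed from the spectral decomposition discussed later.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])

# symmetric square root V^{1/2} via the spectral decomposition V = Q D Q'
vals, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(vals)) @ Q.T

Z = rng.standard_normal((3, 100000))      # columns are iid Np(0, I) draws
Y = V_half @ Z + mu[:, None]              # Y = V^{1/2} Z + mu ~ Np(mu, V)

print(Y.mean(axis=1))                     # approximately mu
print(np.cov(Y))                          # approximately V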
TERMINOLOGY: For a random variable X, suppose that M_X(t) = E(e^{tX}) is defined for all t in an open neighborhood about zero. The function M_X(t) is called the moment generating function (mgf) of X.
Result 5.1.
1. If MX (t) exists, then E(|X|j ) < ∞, for all j ≥ 1, that is, the moment generating
function characterizes an infinite set of moments.
2. MX (0) = 1.
4. Uniqueness. If X1 ∼ MX1 (t), X2 ∼ MX2 (t), and MX1 (t) = MX2 (t) for all t in an
open neighborhood about zero, then FX1 (x) = FX2 (x) for all x.
5. If X1, X2, ..., Xn are independent random variables with mgfs M_{Xi}(t), i = 1, 2, ..., n, and Y = a_0 + Σ_{i=1}^n a_i X_i, then
M_Y(t) = e^{a_0 t} ∏_{i=1}^n M_{Xi}(a_i t).
Result 5.2.
TERMINOLOGY: Define the random vector X = (X1, X2, ..., Xp)' and let t = (t1, t2, ..., tp)'. The moment generating function of X is given by
M_X(t) = E{exp(t'X)} = ∫_{R^p} exp(t'x) dF_X(x),
provided that E{exp(t'X)} < ∞ for all ||t|| < δ, for some δ > 0.
Result 5.3.
1. If MX (t) exists, then MXi (ti ) = MX (t∗i ), where t∗i = (0, ..., 0, ti , 0, ..., 0)0 . This
implies that E(|Xi |j ) < ∞, for all j ≥ 1.
Thus,
E(X_r X_s) = ∂²M_X(t)/∂t_r ∂t_s |_{t_r = t_s = 0}.
4. Uniqueness. If X1 and X2 are random vectors with MX1 (t) = MX2 (t) for all t in
an open neighborhood about zero, then FX1 (x) = FX2 (x) for all x.
6. Let X = (X1', X2', ..., Xm')' and suppose that M_X(t) exists. Let M_{Xi}(ti) denote the mgf of Xi. Then, X1, X2, ..., Xm are independent if and only if
M_X(t) = ∏_{i=1}^m M_{Xi}(t_i)
for all t = (t1', t2', ..., tm')' in an open neighborhood about zero.
5.2.3 Properties
Suppose that Y ∼ Np(µ, V), a is p × 1, A is k × p, and b is k × 1. Then:
1. X = a'Y ∼ N(a'µ, a'Va).
2. X = AY + b ∼ Nk(Aµ + b, AVA').
Result 5.6. If Y ∼ Np (µ, V), then any r × 1 subvector of Y has an r-variate normal
distribution with the same means, variances, and covariances as the original distribution.
Proof. Partition Y = (Y1', Y2')', where Y1 is r × 1. Partition µ = (µ1', µ2')' and
V = [ V11  V12
      V21  V22 ]
WARNING: Joint normality implies marginal normality. That is, if Y1 and Y2 are
jointly normal, then they are marginally normal. However, if Y1 and Y2 are marginally
normal, this does not necessarily mean that they are jointly normal.
Y = Xβ + ε,
where ε ∼ Nn(0, σ²I). Note that E(Y) = Xβ and that V = cov(Y) = σ²I. Furthermore, because Y is a linear combination of ε, it is also normally distributed; i.e., Y ∼ Nn(Xβ, σ²I). With PX = X(X'X)^-X', we know that Ŷ = PX Y and ê = (I − PX)Y. Now,
E(Ŷ) = E(PX Y) = PX E(Y) = PX Xβ = Xβ
and
cov(Ŷ) = cov(PX Y) = PX cov(Y) PX' = σ²PX I PX = σ²PX,
so that Ŷ ∼ Nn(Xβ, σ²PX).
Exercise: Show that ê ∼ Nn{0, σ²(I − PX)}.
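As a numerical illustration (not part of the notes), the following Python sketch constructs PX for a made-up full-rank design and verifies the orthogonal decomposition of Y into fitted values and residuals; all numbers are hypothetical.

import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 1.5
X = np.column_stack([np.ones(n), rng.uniform(0, 5, n)])
beta = np.array([1.0, 2.0])

PX = X @ np.linalg.solve(X.T @ X, X.T)    # X(X'X)^{-1}X' (X is full rank here)
Y = X @ beta + sigma * rng.standard_normal(n)

Y_hat = PX @ Y                            # fitted values, N(Xb, sigma^2 PX)
e_hat = (np.eye(n) - PX) @ Y              # residuals,  N(0, sigma^2 (I - PX))
print(np.allclose(Y_hat + e_hat, Y))      # orthogonal decomposition of Y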
where Γ' = (γ1, γ2)' and r(Γ) = 1. Since r(Γ) = 1, this means that at least one of γ1 and γ2 is not equal to zero. Without loss, take γ1 ≠ 0, in which case
Y2 = (γ2/γ1) Y1.
Note that E(Y) = 0 = (0, 0)' and
cov(Y) = E(YY') = E [ γ1²Z²   γ1γ2Z²
                      γ1γ2Z²  γ2²Z²  ] = [ γ1²   γ1γ2
                                           γ1γ2  γ2²  ] = Γ'Γ = V.
Note that |V| = 0. Thus, Y (2 × 1) is a random vector with all of its probability mass located in the linear subspace {(y1, y2)' : y2 = γ2 y1/γ1}. Since r(V) = 1 < 2, Y does not have a density function.
= E(Yi − µi)E{(Yj − µj)'} = 0.
Necessity (⇐=): Suppose that Vij = 0 for all i ≠ j, and let t = (t1', t2', ..., tm')'. Note that
t'Vt = Σ_{i=1}^m ti'Vii ti   and   t'µ = Σ_{i=1}^m ti'µi.
Thus, the mgf factors into the product of the marginal mgfs, and Y1, Y2, ..., Ym are independent.
APPLICATION: Consider again the linear model
Y = Xβ + ε,
where ε ∼ Nn(0, σ²I). We have already seen that Y ∼ Nn(Xβ, σ²I). Also, note that with PX = X(X'X)^-X',
( Ŷ )   ( PX     )
( ê ) = ( I − PX ) Y,
so that Ŷ and ê are jointly normal with
cov(Ŷ, ê) = PX σ²I (I − PX)' = 0.
That is, the fitted values and residuals from the least-squares fit are independent. This
explains why residual plots that display nonrandom patterns are consistent with a vio-
lation of our model assumptions.
Y|{X = x} ∼ N(µ_{Y|X}, Σ_{Y|X}),
where
µ_{Y|X} = µ_Y + Σ_{YX} Σ_X^{-1} (x − µ_X)
and
Σ_{Y|X} = Σ_Y − Σ_{YX} Σ_X^{-1} Σ_{XY}.
Again, the conditional mean µ_{Y|X} is a linear function of x and the conditional covariance matrix Σ_{Y|X} is free of x.
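For concreteness (not part of the notes), here is a minimal Python sketch that computes the conditional mean and covariance from a made-up partitioned covariance matrix; all numerical values are hypothetical.

import numpy as np

# partition a 3-dimensional normal as (Y, X1, X2) with Y scalar
mu_Y, mu_X = np.array([1.0]), np.array([0.0, 2.0])
S_Y  = np.array([[2.0]])
S_YX = np.array([[0.8, 0.3]])
S_X  = np.array([[1.0, 0.2],
                 [0.2, 1.5]])

x = np.array([0.5, 1.0])                              # conditioning value
mu_cond = mu_Y + S_YX @ np.linalg.solve(S_X, x - mu_X)
S_cond  = S_Y - S_YX @ np.linalg.solve(S_X, S_YX.T)
print(mu_cond, S_cond)    # mean is linear in x; covariance is free of x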
RECALL: Suppose that U ∼ χ²_n; that is, U has a (central) χ² distribution with n > 0 degrees of freedom. The pdf of U is given by
f_U(u|n) = {1/(Γ(n/2) 2^{n/2})} u^{n/2 − 1} e^{−u/2} I(u > 0).
The χ²_n family of distributions is a gamma(α, β) subfamily with shape parameter α = n/2 and scale parameter β = 2. Note that E(U) = n, var(U) = 2n, and M_U(t) = (1 − 2t)^{−n/2}, for t < 1/2.
RECALL: If Z1, Z2, ..., Zn are iid N(0, 1), then U1 = Z1² ∼ χ²_1 and
Z'Z = Σ_{i=1}^n Z_i² ∼ χ²_n.
Proof. Exercise.
We write V ∼ χ2n (λ). When λ = 0, the χ2n (λ) distribution reduces to the central χ2n
distribution. In the χ2n (λ) pdf, notice that e−λ λj /j! is the jth term of a Poisson pmf with
parameter λ > 0.
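The Poisson-mixture form suggests a simple way to simulate noncentral χ² variates: draw J ∼ Poisson(λ) and then a central χ² with n + 2J degrees of freedom. The Python sketch below (not part of the notes) does this and compares the mean with scipy's noncentral χ²; note the assumption that scipy's noncentrality parameter equals 2λ in this text's parameterization.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, lam = 5, 2.0                      # df and noncentrality (lambda as in these notes)

# Poisson-mixture representation: J ~ Poisson(lam), then chi-square with n + 2J df
J = rng.poisson(lam, size=200000)
V = rng.chisquare(n + 2 * J)

# scipy's ncx2 uses nc = 2*lambda under this text's definition (assumption)
print(V.mean(), stats.ncx2.mean(n, 2 * lam))   # both approximately n + 2*lambda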
Now, combine the exponents in the integrand, square out the (y − µ)2 term, combine like
terms, complete the square, and collapse the expression to (1 − 2t)−1/2 exp{µ2 t/(1 − 2t)}
times some normal density that is integrated over R.
Result 5.12. If U1, U2, ..., Um are independent random variables, where Ui ∼ χ²_{ni}(λi), i = 1, 2, ..., m, then U = Σ_i Ui ∼ χ²_n(λ), where n = Σ_i ni and λ = Σ_i λi.
Result 5.13. Suppose that V ∼ χ2n (λ). For fixed n and c > 0, the quantity Pλ (V > c)
is a strictly increasing function of λ.
Proof. See Monahan, pp 106-108.
IMPLICATION : If V1 ∼ χ2n (λ1 ) and V2 ∼ χ2n (λ2 ), where λ2 > λ1 , then pr(V2 > c) >
pr(V1 > c). That is, V2 is (strictly) stochastically greater than V1 , written V2 >st V1 .
Note that
V2 >st V1 ⇐⇒ FV2 (v) < FV1 (v) ⇐⇒ SV2 (v) > SV1 (v),
for all v, where FVi (·) denotes the cdf of Vi and SVi (·) = 1 − FVi (·) denotes the survivor
function of Vi .
We write W ∼ Fn1 ,n2 . The moment generating function for the F distribution does not
exist in closed form.
We write W ∼ Fn1 ,n2 (λ). When λ = 0, the noncentral F distribution reduces to the
central F distribution.
and
var(W) = 2(n2/n1)² { (n1 + 2λ)²/[(n2 − 2)²(n2 − 4)] + (n1 + 4λ)/[(n2 − 2)(n2 − 4)] }.
E(W) exists only when n2 > 2 and var(W) exists only when n2 > 4. The moment generating function for the noncentral F distribution does not exist in closed form.
Result 5.14. If U1 and U2 are independent random variables with U1 ∼ χ2n1 (λ) and
U2 ∼ χ2n2 , then
U1 /n1
W = ∼ Fn1 ,n2 (λ).
U2 /n2
Proof. See Searle, pp 51-52.
Result 5.15. Suppose that W ∼ Fn1 ,n2 (λ). For fixed n1 , n2 , and c > 0, the quantity
Pλ (W > c) is a strictly increasing function of λ. That is, if W1 ∼ Fn1 ,n2 (λ1 ) and
W2 ∼ Fn1 ,n2 (λ2 ), where λ2 > λ1 , then pr(W2 > c) > pr(W1 > c); i.e., W2 >st W1 .
REMARK : The fact that the noncentral F distribution tends to be larger than the
central F distribution is the basis for many of the tests used in linear models. Typically,
test statistics are used that have a central F distribution if the null hypothesis is true
and a noncentral F distribution if the null hypothesis is not true. Since the noncentral
F distribution tends to be larger, large values of the test statistic are consistent with
the alternative hypothesis. Thus, the form of an appropriate rejection region is to reject H0 for large values of the test statistic. The power of such a test is the probability that the statistic falls in the rejection region (whose critical value is determined under H0) when the statistic actually follows the noncentral F distribution. Noncentral F distributions are available in most software packages.
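As a small illustration of this power calculation (not part of the notes), the Python sketch below evaluates the probability that a noncentral F exceeds the central F critical value for several values of λ; the degrees of freedom are made up, and it assumes scipy's noncentrality parameter equals 2λ in this text's parameterization.

import numpy as np
from scipy import stats

n1, n2, alpha = 3, 20, 0.05
c = stats.f.ppf(1 - alpha, n1, n2)            # central F critical value

for lam in [0.0, 1.0, 2.0, 5.0]:              # lambda as defined in these notes
    power = stats.ncf.sf(c, n1, n2, 2 * lam)  # assumed: scipy nc = 2*lambda
    print(lam, round(power, 3))               # power increases with lambda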
GOAL: We would like to find the distribution of Y0 AY, where Y ∼ Np (µ, V). We will
obtain this distribution by taking steps. Result 5.16 is a very small step. Result 5.17 is
a large step, and Result 5.18 is the finish line. There is no harm in assuming that A is
symmetric.
Result 5.11 says Yi² ∼ χ²_1(µi²/2), for i = 1, 2, ..., p. Thus, from Result 5.12,
Y'Y = Σ_{i=1}^p Yi² ∼ χ²_p(µ'µ/2).
Note also that r(A) = tr(A) = tr(P1 P01 ) = tr(P01 P1 ) = tr(Is ) = s. Now, to go the other
way (=⇒), suppose that A is a symmetric, idempotent matrix of rank s. The spectral
decomposition of A is given by A = QDQ0 , where D = diag(λ1 , λ2 , ..., λp ) and Q is
orthogonal. Since A is idempotent, we know that s of the eigenvalues λ1 , λ2 , ..., λp are
equal to 1 and the other p − s eigenvalues are equal to 0. Thus, we can write
A = QDQ' = (P1 P2) [ Is  0
                     0   0 ] (P1 P2)' = P1 P1'.
Thus, we have shown that (a) holds. To show that (b) holds, note that because Q is orthogonal,
Ip = Q'Q = (P1 P2)'(P1 P2) = [ P1'P1  P1'P2
                               P2'P1  P2'P2 ].
It is easy to convince yourself that P1'P1 is an identity matrix. Its dimension is s × s because tr(P1'P1) = tr(P1P1') = tr(A) = r(A), which equals s by assumption.
Write
Y'AY = Y'P1P1'Y = X'X,
where X = P1'Y. Since Y ∼ Np(µ, I), and since X = P1'Y is a linear combination of Y, we know that
X ∼ Ns(P1'µ, P1'IP1) = Ns(P1'µ, Is).
Result 5.16 says that Y'AY = X'X ∼ χ²_s{(P1'µ)'(P1'µ)/2}. But,
λ ≡ (1/2)(P1'µ)'P1'µ = (1/2)µ'P1P1'µ = (1/2)µ'Aµ.
Now,
Y0 AY = X0 V1/2 AV1/2 X = X0 BX,
where B = V1/2 AV1/2 . Recall that V1/2 is the symmetric square root of V. From Result
5.17, we know that Y0 AY = X0 BX ∼ χ2s (λ) if B is idempotent of rank s. However, note
that
r(B) = r(V1/2 AV1/2 ) = r(A) = r(AV) = s,
since AV has rank s (by assumption) and V and V1/2 are both nonsingular. Also, AV
is idempotent by assumption so that
AV = AVAV ⇒ A = AVA
⇒ B = BB.
Example 5.2. Suppose that Y = (Y1, Y2, ..., Yn)' ∼ Nn(µ1, σ²I), so that µ = µ1 and V = σ²I, where 1 is n × 1 and I is n × n. The statistic
(n − 1)S² = Σ_{i=1}^n (Yi − Ȳ)² = Y'(I − n^{-1}J)Y = Y'AY,
where A = I − n^{-1}J. Equivalently,
(n − 1)S²/σ² = Y'BY,
where B = σ^{-2}(I − n^{-1}J). Note that BV = I − n^{-1}J is idempotent with rank n − 1, so Result 5.18 says that (n − 1)S²/σ² = Y'BY ∼ χ²_{n−1}(λ), where λ = (1/2)µ'Bµ. However,
λ = (1/2)µ'Bµ = (1/2)(µ1)'σ^{-2}(I − n^{-1}J)(µ1) = 0,
because I − n^{-1}J is the ppm onto C(1)⊥ and µ1 ∈ C(1). Therefore,
(n − 1)S²/σ² = Y'BY ∼ χ²_{n−1},
a central χ² distribution.
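As a simulation check of this result (not part of the notes), the Python sketch below compares quantiles of (n − 1)S²/σ² with central χ²_{n−1} quantiles; the sample size, mean, and variance are made-up values.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, mu, sigma2 = 10, 3.0, 4.0
reps = 100000

Y = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
stat = (n - 1) * Y.var(axis=1, ddof=1) / sigma2   # (n-1)S^2 / sigma^2

for p in (0.25, 0.5, 0.9):                        # simulated vs chi-square(n-1) quantiles
    print(p, np.quantile(stat, p), stats.chi2.ppf(p, n - 1))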
Consider the general linear model
Y = Xβ + ε,
where X is n × p with rank r ≤ p and ε ∼ Nn(0, σ²I). Let PX = X(X'X)^-X' denote the perpendicular projection matrix onto C(X). We know that Y ∼ Nn(Xβ, σ²I). Consider the (uncorrected) partitioning of the sums of squares given by
Y'Y = Y'PX Y + Y'(I − PX)Y.
• We first consider the residual sum of squares Y'(I − PX)Y. Dividing this quantity by σ², we get Y'(I − PX)Y/σ² = Y'AY, where A = σ^{-2}(I − PX). Note that
AV = σ^{-2}(I − PX)σ²I = I − PX,
an idempotent matrix with rank n − r, and (Xβ)'(I − PX)Xβ = 0; hence Y'(I − PX)Y/σ² ∼ χ²_{n−r}.
• Next, consider the (uncorrected) model sum of squares Y'PX Y. Dividing by σ²,
Y'PX Y/σ² = Y'(σ^{-2}PX)Y = Y'BY,
where B = σ^{-2}PX. Note that
BV = σ^{-2}PX σ²I = PX,
an idempotent matrix with rank r(PX) = r(X) = r. Result 5.18 says that Y'PX Y/σ² ∼ χ²_r(λ), where
λ = (1/2)µ'Bµ = (1/2)(Xβ)'σ^{-2}PX Xβ = (Xβ)'Xβ/2σ².
That is, Y'PX Y/σ² has a noncentral χ² distribution with r degrees of freedom and noncentrality parameter λ = (Xβ)'Xβ/2σ².
In the last calculation, note that λ = (Xβ)'Xβ/2σ² = 0 iff Xβ = 0. In this case, both quadratic forms Y'(I − PX)Y/σ² and Y'PX Y/σ² have central χ² distributions.
GOALS: In this subsection, we consider two problems. With Y ∼ N(µ, V), we would like to establish sufficient conditions for (a) the quadratic form Y'AY and the linear form BY to be independent, and (b) the two quadratic forms Y'AY and Y'BY to be independent.
Result 5.19. Suppose that Y ∼ Np (µ, V). If BVA = 0, then Y0 AY and BY are
independent.
Proof. We may assume that A is symmetric. Write A = QDQ', where D = diag(λ1, λ2, ..., λp) and Q is orthogonal. We know that s ≤ p of the eigenvalues λ1, λ2, ..., λp are nonzero, where s = r(A). We can thus write
A = QDQ' = (P1 P2) [ D1  0
                     0   0 ] (P1 P2)' = P1 D1 P1',
Then Y'AY = Y'P1D1P1'Y = X'D1X, where X = P1'Y.
Example 5.4. Suppose that Y = (Y1, Y2, ..., Yn)' ∼ Nn(µ1, σ²I), where 1 is n × 1 and I is n × n, so that µ = µ1 and V = σ²I. Recall that Ȳ = n^{-1}1'Y = BY with B = n^{-1}1', and (n − 1)S² = Y'(I − n^{-1}J)Y = Y'AY with A = I − n^{-1}J. Result 5.19 applies because
BVA = n^{-1}1'σ²I(I − n^{-1}J) = σ²n^{-1}1'(I − n^{-1}J) = 0,
because I − n^{-1}J is the ppm onto C(1)⊥. Since functions of independent statistics are also independent, Ȳ and S² are also independent.
Result 5.20. Suppose that Y ∼ Np (µ, V). If BVA = 0, then Y0 AY and Y0 BY are
independent.
Proof. Write A and B in their spectral decompositions; that is, write
A = PDP' = (P1 P2) [ D1  0
                     0   0 ] (P1 P2)' = P1 D1 P1'
and
B = QRQ' = (Q1 Q2) [ R1  0
                     0   0 ] (Q1 Q2)' = Q1 R1 Q1',
where D1 = diag(λ1, λ2, ..., λs) with s = r(A), and R1 = diag(γ1, γ2, ..., γt) with t = r(B). Since P and Q are orthogonal, this implies that P1'P1 = Is and Q1'Q1 = It. Suppose that BVA = 0. Then, pre-multiplying by Q1' and post-multiplying by P1,
0 = Q1'(BVA)P1 = R1 Q1'VP1 D1,
so that
0 = R1^{-1}(R1 Q1'VP1 D1)D1^{-1} = Q1'VP1.
Now,
( P1'Y )      ( ( P1'µ )   [ P1'VP1   0       ] )
( Q1'Y )  ∼ N( ( Q1'µ ) ,  [ 0        Q1'VQ1  ] ).
That is, P1'Y and Q1'Y are jointly normal and uncorrelated; thus, they are independent. So are Y'P1D1P1'Y and Y'Q1R1Q1'Y. But A = P1D1P1' and B = Q1R1Q1', so we are done.
Y = Xβ + ,
where X is n × p with rank r ≤ p and ∼ Nn (0, σ 2 I). Let PX = X(X0 X)− X0 denote
the perpendicular projection matrix onto C(X). We know that Y ∼ Nn (Xβ, σ 2 I). In
Y0 (I − PX )Y/σ 2 ∼ χ2n−r ,
and that
Y0 PX Y/σ 2 ∼ χ2r (λ),
where A = σ −2 (I − PX ). Also,
Y0 PX Y/σ 2 = Y0 (σ −2 PX )Y = Y0 BY,
BVA = σ −2 PX σ 2 Iσ −2 (I − PX ) = 0,
that is, Y0 (I − PX )Y/σ 2 and Y0 PX Y/σ 2 are independent quadratic forms. Thus, the
statistic
F = [Y'PX Y/r] / [Y'(I − PX)Y/(n − r)] = [σ^{-2}Y'PX Y/r] / [σ^{-2}Y'(I − PX)Y/(n − r)] ∼ F_{r,n−r}{(Xβ)'Xβ/2σ²},
OBSERVATIONS :
• On the other hand, as the length of Xβ gets larger, so does λ. This shifts the
noncentral Fr,n−r {(Xβ)0 Xβ/2σ 2 } distribution to the right, because the noncentral
F distribution is stochastically increasing in its noncentrality parameter.
REMARK : An important general notion in linear models is that sums of squares like
Y0 PX Y and Y0 Y can be “broken down” into sums of squares of smaller pieces. We now
discuss Cochran’s Theorem (Result 5.21), which serves to explain why this is possible.
Result 5.21. Suppose that Y ∼ Nn(µ, σ²I). Suppose that A1, A2, ..., Ak are n × n symmetric and idempotent matrices, where r(Ai) = si, for i = 1, 2, ..., k. If A1 + A2 + · · · + Ak = In, then Y'A1Y/σ², Y'A2Y/σ², ..., Y'AkY/σ² follow independent χ²_{si}(λi) distributions, where λi = µ'Aiµ/2σ², for i = 1, 2, ..., k, and Σ_{i=1}^k si = n.
IMPORTANCE: We now show how Cochran's Theorem can be used to deduce the joint distribution of the sums of squares in an analysis of variance. Suppose that we partition the design matrix X and the parameter vector β in Y = Xβ + ε into k + 1 parts, so that
Y = (X0 X1 · · · Xk)(β0', β1', ..., βk')' + ε = X0β0 + X1β1 + · · · + Xkβk + ε.
Consider the sequence of nested submodels
Y = X0β0 + ε
Y = X0β0 + X1β1 + ε
...
Y = X0β0 + X1β1 + · · · + X_{k−1}β_{k−1} + ε,
and let R(β0, β1, ..., βi) denote the regression (model) sum of squares from fitting the ith submodel, for i = 0, 1, ..., k; that is,
R(β0, β1, ..., βi) = Y'(X0 X1 · · · Xi)[(X0 X1 · · · Xi)'(X0 X1 · · · Xi)]^-(X0 X1 · · · Xi)'Y = Y'P_{Xi*}Y,
where
P_{Xi*} = (X0 X1 · · · Xi)[(X0 X1 · · · Xi)'(X0 X1 · · · Xi)]^-(X0 X1 · · · Xi)'
is the perpendicular projection matrix onto C(Xi*), where Xi* = (X0 X1 · · · Xi) and i = 0, 1, ..., k. Clearly,
C(X0*) ⊂ C(X1*) ⊂ · · · ⊂ C(X*_{k−1}) ⊂ C(X).
Write
Y'Y = Y'A0Y + Y'A1Y + Y'A2Y + · · · + Y'AkY + Y'A_{k+1}Y,
where
A0 = P_{X0*}
Ai = P_{Xi*} − P_{X*_{i−1}}, for i = 1, 2, ..., k (with P_{Xk*} = PX, so that Ak = PX − P_{X*_{k−1}})
A_{k+1} = I − PX.
Note that A0 + A1 + A2 + · · · + A_{k+1} = I. Note also that the Ai matrices are symmetric, for i = 0, 1, ..., k + 1, and that
AiAi = (P_{Xi*} − P_{X*_{i−1}})(P_{Xi*} − P_{X*_{i−1}}) = P_{Xi*} − P_{X*_{i−1}} = Ai,
for i = 1, 2, ..., k, since P_{Xi*}P_{X*_{i−1}} = P_{X*_{i−1}} and P_{X*_{i−1}}P_{Xi*} = P_{X*_{i−1}}. Thus, Ai is idempotent for i = 1, 2, ..., k. However, clearly A0 = P_{X0*} = X0(X0'X0)^-X0' and A_{k+1} = I − PX are also symmetric and idempotent.
Write si = r(Ai), so that s0 = r(A0) = r(X0) and s_{k+1} = n − r(X). It is easy to see that Σ_{i=0}^{k+1} si = n.
Applying Cochran's Theorem,
(1/σ²)Y'A0Y ∼ χ²_{s0}[λ0 = (Xβ)'A0Xβ/2σ²]
(1/σ²)Y'AiY ∼ χ²_{si}[λi = (Xβ)'AiXβ/2σ²], i = 1, 2, ..., k,
(1/σ²)Y'A_{k+1}Y = (1/σ²)Y'(I − PX)Y ∼ χ²_{s_{k+1}},
where sk+1 = n − r(X). Note that the last quadratic follows a central χ2 distribution
because λk+1 = (Xβ)0 (I − PX )Xβ/2σ 2 = 0. Cochran’s Theorem also guarantees that the
quadratic forms Y0 A0 Y/σ 2 , Y0 A1 Y/σ 2 , ..., Y0 Ak Y/σ 2 , Y0 Ak+1 Y/σ 2 are independent.
ANOVA TABLE: The quadratic forms Y'AiY, for i = 0, 1, ..., k + 1, and the degrees of freedom si = r(Ai) are often presented in the following ANOVA table:

Source                     df       SS                                      Noncentrality
β0                         s0       R(β0)                                   λ0 = (Xβ)'A0Xβ/2σ²
β1 (after β0)              s1       R(β0, β1) − R(β0)                       λ1 = (Xβ)'A1Xβ/2σ²
β2 (after β0, β1)          s2       R(β0, β1, β2) − R(β0, β1)               λ2 = (Xβ)'A2Xβ/2σ²
...                        ...      ...                                     ...
βk (after β0, ..., β_{k−1})  sk       R(β0, ..., βk) − R(β0, ..., β_{k−1})     λk = (Xβ)'AkXβ/2σ²
Residual                   s_{k+1}  Y'Y − R(β0, ..., βk)                    λ_{k+1} = 0
Total                      n        Y'Y                                     (Xβ)'Xβ/2σ²

Note that if X0 = 1, then β0 = µ and R(β0) = Y'P1Y = nȲ². The R(·) notation will come in handy when we talk about hypothesis testing later. The sums of squares
R(β0), R(β0, β1) − R(β0), ..., R(β0, ..., βk) − R(β0, ..., β_{k−1}) are called the sequential sums of squares. These correspond to the Type I sums of squares printed out by SAS in the ANOVA and GLM procedures. We will also use the notation R(βi | β0, β1, ..., β_{i−1}) = R(β0, β1, ..., βi) − R(β0, β1, ..., β_{i−1}) for these differences.
Example 5.6. Consider the one-way (fixed effects) analysis of variance model
Yij = µ + αi + εij,
for i = 1, 2, ..., a and j = 1, 2, ..., ni, where the εij are iid N(0, σ²) random variables. In matrix form, Y, X, and β are
Y_{n×1} = (Y11, Y12, ..., Y_{a n_a})',
X_{n×p} = [ 1_{n1}  1_{n1}  0_{n1}  ···  0_{n1}
            1_{n2}  0_{n2}  1_{n2}  ···  0_{n2}
            ...     ...     ...     ...  ...
            1_{na}  0_{na}  0_{na}  ···  1_{na} ],
and
β_{p×1} = (µ, α1, α2, ..., αa)',
where p = a + 1 and n = Σ_i ni. Note that we can write
X = (X0 X1),
where X0 = 1,
X1 = [ 1_{n1}  0_{n1}  ···  0_{n1}
       0_{n2}  1_{n2}  ···  0_{n2}
       ...     ...     ...  ...
       0_{na}  0_{na}  ···  1_{na} ],
and β = (β0, β1')', where β0 = µ and β1 = (α1, α2, ..., αa)'. That is, we can express this model in the form
Y = X0β0 + X1β1 + ε.
The submodel is
Y = X0β0 + ε,
Take
A0 = P1,  A1 = PX − P1,  A2 = I − PX.
These matrices are clearly symmetric and idempotent, and A0 + A1 + A2 = I, so Cochran's Theorem applies. Also, note that
(1/σ²)Y'P1Y ∼ χ²_1[λ0 = (Xβ)'P1Xβ/2σ²]
(1/σ²)Y'(PX − P1)Y ∼ χ²_{a−1}[λ1 = (Xβ)'(PX − P1)Xβ/2σ²]
(1/σ²)Y'(I − PX)Y ∼ χ²_{n−a}.
Cochran’s Theorem also guarantees the quadratic forms Y0 P1 Y/σ 2 , Y0 (PX − P1 )Y/σ 2 ,
and Y'(I − PX)Y/σ² are independent. The sums of squares, using our new notation, are
Y'P1Y = R(µ)
Y'(PX − P1)Y = R(µ, α1, ..., αa) − R(µ)
Y'(I − PX)Y = Y'Y − R(µ, α1, ..., αa).
Source                   df     SS                            Noncentrality
µ                        1      R(µ)                          λ0 = (Xβ)'P1Xβ/2σ²
α1, ..., αa (after µ)    a − 1  R(µ, α1, ..., αa) − R(µ)      λ1 = (Xβ)'(PX − P1)Xβ/2σ²
Residual                 n − a  Y'Y − R(µ, α1, ..., αa)       0
Total                    n      Y'Y                           (Xβ)'Xβ/2σ²
F STATISTIC: Because
(1/σ²)Y'(PX − P1)Y ∼ χ²_{a−1}(λ1)
and
(1/σ²)Y'(I − PX)Y ∼ χ²_{n−a},
and because these two quadratic forms are independent, it follows that
F = [Y'(PX − P1)Y/(a − 1)] / [Y'(I − PX)Y/(n − a)] ∼ F_{a−1,n−a}(λ1).
This is the usual F statistic to test H0 : α1 = α2 = · · · = αa = 0.
• If H0 is true, then Xβ ∈ C(1) and λ1 = (Xβ)'(PX − P1)Xβ/2σ² = 0, since PX − P1 is the ppm onto C(1)⊥_{C(X)}. In this case, F ∼ F_{a−1,n−a}, a central F distribution.
• As usual, σ² is estimated by MSE = (n − a)^{-1}Y'(I − PX)Y. The expected value of the treatment mean square is
E{Y'(PX − P1)Y/(a − 1)} = σ² + Σ_{i=1}^a ni(αi − ᾱ+)²/(a − 1).
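To connect this with the projection-matrix formulation (not part of the notes), here is a minimal Python sketch that builds P1 and PX for made-up group sizes and computes the one-way ANOVA F statistic directly from the quadratic forms; ppm() is a hypothetical helper that forms a perpendicular projection matrix via a pseudoinverse.

import numpy as np

rng = np.random.default_rng(5)
ni = [4, 6, 5]                      # hypothetical group sizes, a = 3
a, n = len(ni), sum(ni)
X1 = np.zeros((n, a))
X1[np.arange(n), np.repeat(np.arange(a), ni)] = 1.0
X = np.column_stack([np.ones(n), X1])             # X = (1, X1), rank a

def ppm(M):                                       # perpendicular projection onto C(M)
    return M @ np.linalg.pinv(M.T @ M) @ M.T

P1, PX = ppm(np.ones((n, 1))), ppm(X)
Y = 2.0 + np.repeat([0.0, 1.0, -0.5], ni) + rng.standard_normal(n)

F = (Y @ (PX - P1) @ Y / (a - 1)) / (Y @ (np.eye(n) - PX) @ Y / (n - a))
print(F)       # compare with the F(a-1, n-a) critical value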
6 Statistical Inference
Complementary reading from Monahan: Chapter 6 (and revisit Sections 3.9 and 4.7).
6.1 Estimation
SUFFICIENCY : Under the assumptions stated above, we know that Y ∼ Nn (Xβ, σ 2 I).
Set θ = (β', σ²)'. The pdf of Y, for all y ∈ R^n, is given by
f_Y(y|θ) = (2πσ²)^{-n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²},
which can be written in exponential-family form; that is, Y has pdf in the exponential family (see Casella and Berger, Chapter 3). The
family is full rank (i.e., it is not curved), so we know that T(Y) = (Y0 Y, X0 Y) is
a complete sufficient statistic for θ. We also know that minimum variance unbiased
estimators (MVUEs) of functions of θ are unbiased functions of T(Y).
Result 6.1. Consider the general linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ∼ Nn(0, σ²I). The MVUE for an estimable function Λ'β is given by Λ'β̂,
where β̂ = (X'X)^-X'Y is any solution to the normal equations. The MVUE for σ² is
MSE = (n − r)^{-1}Y'(I − PX)Y = (n − r)^{-1}(Y'Y − β̂'X'Y).
Proof. Both estimators are unbiased, and both estimators are also functions of T(Y) = (Y'Y, X'Y), the complete sufficient statistic. Thus, each estimator is the MVUE for its expected value.
The maximum likelihood estimators of β and σ² are found by maximizing the loglikelihood
log L(β, σ²|y) = −(n/2)log(2π) − (n/2)log σ² − (y − Xβ)'(y − Xβ)/2σ²
with respect to β and σ². For every value of σ², maximizing the loglikelihood is the same as minimizing Q(β) = (y − Xβ)'(y − Xβ); that is, the least squares estimator
β̂ = (X'X)^-X'Y
is also an MLE. Now substitute (y − Xβ̂)'(y − Xβ̂) = y'(I − PX)y in for Q(β) and maximize over σ² to get
σ̂²_MLE = n^{-1}Y'(I − PX)Y.
Note that the MLE for σ 2 is biased. The MLE is rarely used in practice; MSE is the
conventional estimator for σ 2 .
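As a quick simulation comparison (not part of the notes), the Python sketch below estimates the average of MSE and of the MLE of σ² over repeated samples from a made-up full-rank model; all numbers are hypothetical.

import numpy as np

rng = np.random.default_rng(6)
n, r, sigma2, reps = 15, 2, 1.0, 50000
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])    # rank r = 2
PX = X @ np.linalg.solve(X.T @ X, X.T)
beta = np.array([1.0, 2.0])

mse, mle = [], []
for _ in range(reps):
    Y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    rss = Y @ (np.eye(n) - PX) @ Y
    mse.append(rss / (n - r))       # unbiased for sigma^2
    mle.append(rss / n)             # biased downward
print(np.mean(mse), np.mean(mle))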
INVARIANCE: Under the normal GM model, the MLE for an estimable function Λ'β is Λ'β̂, where β̂ is any solution to the normal equations. This is true because of the invariance property of maximum likelihood estimators (see, e.g., Casella and Berger, Chapter 7). If Λ'β is estimable, recall that Λ'β̂ is unique even if β̂ is not.
PREVIEW : We now provide a general discussion on testing reduced versus full models
within a Gauss Markov linear model framework. Assuming normality will allow us to
derive the sampling distribution of the resulting test statistic.
Y = Xβ + ε,
where r(X) = r ≤ p, E(ε) = 0, and cov(ε) = σ²I. Note that these are our usual GM model assumptions. For the purposes of this discussion, we assume that this model (the full model) is a "correct" model for the data. Consider also the linear model
Y = Wγ + ε,
where E(ε) = 0, cov(ε) = σ²I, and C(W) ⊂ C(X). We call this a reduced model because
the estimation space is smaller than in the full model. Our goal is to test whether or not
the reduced model is also correct.
• If the reduced model is also correct, there is no reason not to use it. Smaller
models are easier to interpret and fewer degrees of freedom are spent in estimating
σ 2 . Thus, there are practical and statistical advantages to using the reduced model
if it is also correct.
RECALL: Let PW and PX denote the perpendicular projection matrices onto C(W) and C(X), respectively. Because C(W) ⊂ C(X), we know that PX − PW is the ppm onto
C(PX − PW) = C(W)⊥_{C(X)}.
If the reduced model is also correct, then E(Y) = Wγ ∈ C(W) ⊂ C(X). Geometri-
cally, performing a reduced-versus-full model test therefore requires the analyst to decide
whether E(Y) is more likely to be in C(W) or C(X) − C(W). Under the full model,
our estimate for E(Y) = Xβ is PX Y. Under the reduced model, our estimate for
E(Y) = Wγ is PW Y.
• If the reduced model is correct, then PX Y and PW Y are estimates of the same
thing, and PX Y − PW Y = (PX − PW )Y should be small.
• If the reduced model is not correct, then PX Y and PW Y are estimating different
things, and PX Y − PW Y = (PX − PW )Y should be large.
• The decision about reduced model adequacy therefore hinges on assessing whether (PX − PW)Y is large or small. Note that (PX − PW)Y is the perpendicular projection of Y onto C(W)⊥_{C(X)}.
However, the length of (PX − PW )Y is also related to the sizes of C(X) and C(W). We
therefore adjust for these sizes by using
Y0 (PX − PW )Y/r(PX − PW ).
We now compute the expectation of this quantity when the reduced model is/is not
correct. For notational simplicity, set r∗ = r(PX − PW ). When the reduced model is
correct, then
E{Y'(PX − PW)Y/r*} = (1/r*)[(Wγ)'(PX − PW)Wγ + tr{(PX − PW)σ²I}]
                   = (1/r*){σ² tr(PX − PW)}
                   = (1/r*)(r*σ²) = σ².
This is correct because (PX − PW)Wγ = 0 and tr(PX − PW) = r(PX − PW) = r*. Thus, if the reduced model is correct, Y'(PX − PW)Y/r* is an unbiased estimator of σ².
Thus, if the reduced model is not correct, Y'(PX − PW)Y/r* is estimating something larger than σ². Of course, σ² is unknown, so it must be estimated. Because the full model is assumed to be correct, MSE = Y'(I − PX)Y/(n − r) is an unbiased estimator of σ².
TEST STATISTIC: To test the reduced model versus the full model, we use
F = [Y'(PX − PW)Y/r*] / MSE.
Using only our GM model assumptions (i.e., not necessarily assuming normality), we can
surmise the following:
• When the reduced model is correct, the numerator and denominator of F are both
unbiased estimators of σ 2 , so F should be close to 1.
• When the reduced model is not correct, the numerator in F is estimating something
larger than σ 2 , so F should be larger than 1. Thus, values of F much larger than
1 are not consistent with the reduced model being correct.
• Values of F much smaller than 1 may mean something drastically different; see
Christensen (2003).
Note that the numerator sum of squares can be computed as Y'(PX − PW)Y = Y'PXY − Y'PWY, which is the difference in the regression (model) sum of squares, corrected or uncorrected, from fitting the two models. Also, the term
say, where r0 = r(PW ) = r(W). Thus, r∗ = r − r0 is the difference in the ranks of the
X and W matrices. This also equals the difference in the model degrees of freedom from
the two ANOVA tables.
REMARK : You will note that we have formulated a perfectly sensible strategy for testing
reduced versus full models while avoiding the question, “What is the distribution of F ?”
Our entire argument is based on first and second moment assumptions, that is, E(ε) = 0 and cov(ε) = σ²I, the GM assumptions. We now address the distributional question. To derive the sampling distribution of
F = [Y'(PX − PW)Y/r*] / MSE,
we require that ε ∼ Nn(0, σ²I), from which it follows that Y ∼ Nn(Xβ, σ²I). First,
we handle the denominator MSE = Y0 (I − PX )Y/(n − r). In Example 5.3 (notes), we
showed that
Y0 (I − PX )Y/σ 2 ∼ χ2n−r .
This distributional result holds regardless of whether or not the reduced model is correct.
Now, we turn our attention to the numerator. Take A = σ −2 (PX − PW ) and consider
the quadratic form
Y0 AY = Y0 (PX − PW )Y/σ 2 .
Note that
AV = σ^{-2}(PX − PW)σ²I = PX − PW,
which is idempotent with rank r*, so Y'(PX − PW)Y/σ² ∼ χ²_{r*}(λ) with λ = (Xβ)'(PX − PW)Xβ/2σ².
• If the reduced model is correct and Xβ ∈ C(W), then (PX − PW)Xβ = 0 because PX − PW projects onto C(W)⊥_{C(X)}. This means that the noncentrality parameter λ = 0, so that Y'(PX − PW)Y/σ² ∼ χ²_{r*}.
• Regardless of whether or not the reduced model is correct, the quadratic forms Y'(PX − PW)Y and Y'(I − PX)Y are independent, since (PX − PW)σ²I(I − PX) = 0.
Putting this together, a level α test of
H0: Y = Wγ + ε
versus
H1: Y = Xβ + ε
rejects H0 when F > F_{r*,n−r,α}, where r* = r − r0, r0 = r(W), and F_{r*,n−r,α} is the upper α quantile of the F_{r*,n−r} distribution.
Yi = β0 + β1(xi − x̄) + εi,
where ε ∼ Nn(0, σ²I). Suppose that we would like to test whether the reduced model
Yi = β0 + εi,
for i = 1, 2, ..., n, also holds. In matrix notation, the reduced model can be expressed as
Y = (Y1, Y2, ..., Yn)',  W = (1, 1, ..., 1)' = 1,  γ = β0,  ε = (ε1, ε2, ..., εn)',
where ε ∼ Nn(0, σ²I) and 1 is an n × 1 vector of ones. Note that C(W) ⊂ C(X) with r0 = 1, r = 2, and r* = r − r0 = 1. When the reduced model is correct,
F = [Y'(PX − PW)Y/r*] / MSE ∼ F_{1,n−2},
where MSE is the mean-squared error from the full model. When the reduced model is not correct, F ∼ F_{1,n−2}(λ), where
λ = (1/2σ²)(Xβ)'(PX − PW)Xβ = β1² Σ_{i=1}^n (xi − x̄)²/2σ².
Exercises: (a) Verify that this expression for the noncentrality parameter λ is correct.
(b) Suppose that n is even and the values of xi can be selected anywhere in the interval
(d1 , d2 ). How should we choose the xi values to maximize the power of a level α test?
PROBLEM : Consider our usual Gauss-Markov linear model with normal errors; i.e.,
Y = Xβ + , where X is n × p with rank r ≤ p and ∼ Nn (0, σ 2 I). We now consider
the problem of testing
H0 : K0 β = m
versus
H1 : K0 β 6= m,
Example 6.2. Consider the regression model Yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + i ,
for i = 1, 2, ..., n. Express each hypothesis in the form H0 : K0 β = m:
1. H0 : β1 = 0
2. H0 : β3 = β4 = 0
3. H0 : β1 + β3 = 1, β2 − β4 = −1
4. H0 : β2 = β3 = β4 .
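For instance (one possible choice of K and m, which are not unique), with β = (β0, β1, β2, β3, β4)', hypothesis 2 above can be written as
K' = [ 0 0 0 1 0
       0 0 0 0 1 ],   m = (0, 0)',   so that K'β = (β3, β4)' = m,
and hypothesis 3 as
K' = [ 0 1 0 1 0
       0 0 1 0 −1 ],  m = (1, −1)'.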
Example 6.3. Consider the analysis of variance model Yij = µ + αi + ij , for i = 1, 2, 3, 4
and j = 1, 2, ..., ni . Express each hypothesis in the form H0 : K0 β = m:
1. H0 : µ + α1 = 5, α3 − α4 = 1
2. H0 : α1 − α2 = α3 − α4
3. H0: α1 − 2 = (1/3)(α2 + α3 + α4).
where β̂ is a least squares estimator of β. Also,
K'β̂ = K'(X'X)^-X'Y,
a linear function of Y, so K'β̂ follows an s-variate normal distribution with mean E(K'β̂) = K'β and covariance matrix
cov(K'β̂) = K'cov(β̂)K = σ²K'(X'X)^-X'X(X'X)^-K = σ²K'(X'X)^-K = σ²H,
say, where H = K'(X'X)^-K.
NOTE : In the calculation above, note that K0 (X0 X)− X0 X(X0 X)− K = K0 (X0 X)− K only
because K0 β is estimable; i.e., K0 = A0 X for some A. It is also true that H is nonsingular.
IMPLICATION: The lemma above is important, because it convinces us that the distribution of K'β̂ is full rank. Subtracting m, we have
K'β̂ − m ∼ Ns(K'β − m, σ²H).
Consider the quadratic form
(K'β̂ − m)'(σ²H)^{-1}(K'β̂ − m).
λ = (1/2σ²)(K'β − m)'H^{-1}(K'β − m).
The resulting test statistic is
F = (K'β̂ − m)'H^{-1}(K'β̂ − m)/(s · MSE) ∼ F_{s,n−r}(λ),
where λ = (K'β − m)'H^{-1}(K'β − m)/2σ². Note that if H0: K'β = m is true, the noncentrality parameter λ = 0 and F ∼ F_{s,n−r}. Therefore, an α level rejection region for the test of H0: K'β = m versus H1: K'β ≠ m is
RR = {F : F > F_{s,n−r,α}}.
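As an illustration (not part of the notes), the Python sketch below computes this F statistic for a made-up full-rank regression and a hypothetical K and m; the data, K, and m are all invented for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # full rank, r = p = 3
beta = np.array([1.0, 0.5, 0.0])
Y = X @ beta + rng.standard_normal(n)

K = np.array([[0.0, 1.0, 0.0],          # tests H0: beta1 = 0.5, beta2 = 0
              [0.0, 0.0, 1.0]]).T
m = np.array([0.5, 0.0])
s, r = K.shape[1], np.linalg.matrix_rank(X)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
H = K.T @ XtX_inv @ K
MSE = Y @ (np.eye(n) - X @ XtX_inv @ X.T) @ Y / (n - r)

F = (K.T @ b - m) @ np.linalg.solve(H, K.T @ b - m) / (s * MSE)
print(F, stats.f.ppf(0.95, s, n - r))   # reject H0 when F exceeds the critical value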
TEST PROCEDURE : Consider the Gauss-Markov linear model with normal errors; i.e.,
Y = Xβ + , where X is n × p with rank r ≤ p and ∼ Nn (0, σ 2 I). Suppose that k0 β
is estimable; i.e., k0 ∈ R(X), and that our goal is to test
H0 : k0 β = m
versus
H1 : k0 β 6= m
RR = {T : T ≥ tn−r,α/2 }.
One-sided tests use rejection regions that are suitably adjusted. When H0 is not true, T ∼ t_{n−r}(µ), a noncentral t distribution with noncentrality parameter
µ = (k'β − m)/√(σ²k'(X'X)^-k).
This distribution is of interest for power and sample size calculations.
H0: Y = Xβ + ε and K'β = 0.
We now find a reduced model that corresponds to this hypothesis. Note that K'β = 0 holds if and only if β⊥C(K). To identify the reduced model, pick a matrix U such that C(U) = C(K)⊥. We then have
β = Uγ,
for some vector γ. Substituting β = Uγ into the linear model Y = Xβ + ε gives the reduced model Y = XUγ + ε, or, letting W = XU, our hypothesis above can be written
H0: Y = Wγ + ε.
Writing K' = D'X (which is possible because K'β is estimable), so that the hypothesis is K'β = D'Xβ = 0, we now show that the ppm onto C(PXD) is also the ppm onto C(W)⊥_{C(X)}; i.e.,
Suppose first that v ∈ C(X) and v⊥C(W). Because C(U) = C(K)⊥, we know that X'v ∈ C(K) = C(X'D), since K' = D'X. Thus, X'v = X'Dd for some d, so that v = PXv = X(X'X)^-X'Dd = PXDd ∈ C(PXD). Conversely, suppose that v ∈ C(PXD). Clearly, v ∈ C(X). Also, v = PXDd, for some d, and
v'XU = d'D'PXXU = d'D'XU = d'K'U = 0,
so that v⊥C(W).
IMPLICATION: It follows immediately that the numerator sum of squares for testing the reduced model Y = Wγ + ε versus the full model Y = Xβ + ε is Y'M_{PXD}Y, where
M_{PXD} = PXD[(PXD)'(PXD)]^-(PXD)'
is the ppm onto C(PXD). If ε ∼ Nn(0, σ²I), the resulting test statistic
F = [Y'M_{PXD}Y/r(M_{PXD})] / [Y'(I − PX)Y/r(I − PX)] ∼ F_{r(M_{PXD}), r(I−PX)}(λ),
where
λ = (1/2σ²)(Xβ)'M_{PXD}Xβ.
GOAL: Our goal now is to show that the F statistic above is the same F statistic we derived in Section 6.3 with m = 0, that is,
F = [(K'β̂)'H^{-1}K'β̂/s] / [Y'(I − PX)Y/(n − r)].
Recall that this statistic was derived for the testable hypothesis H0 : K0 β = 0. First,
we show that r(MPX D ) = s, where, recall, s = r(K). To do this, it suffices to show
that r(K) = r(PX D). Because K0 β is estimable, we know that K0 = D0 X, for some D.
Writing K = X0 D, we see that for any vector a,
X0 Da = 0 ⇐⇒ Da⊥C(X),
= (K'β̂)'[D'X(X'X)^-X'D]^-K'β̂
= (K'β̂)'[K'(X'X)^-K]^-K'β̂.
Thus, the two statistics
F = [Y'M_{PXD}Y/r(M_{PXD})] / [Y'(I − PX)Y/r(I − PX)]
and
F = [(K'β̂)'H^{-1}K'β̂/s] / [Y'(I − PX)Y/(n − r)]
are equal. We already showed that r(MPX D ) = s, and because r(I − PX ) = n − r, we
are done.
REMARK : In the linear model Y = Xβ + , where E() = 0 (note the minimal assump-
tions), we have, up until now, allowed the p × 1 parameter vector β to take on any value
in Rp , that is, we have made no restrictions on the parameters in β. We now consider
the case where β is restricted to the subspace of Rp consisting of values of β that satisfy
PROBLEM: In the linear model Y = Xβ + ε, where E(ε) = 0, we would like to minimize Q(β) = (Y − Xβ)'(Y − Xβ) subject to the constraint that P'β = δ. Essentially, this requires us to find the minimum value of Q(β) over the linear subspace {β ∈ R^p : P'β = δ}. This is a restricted minimization problem and standard Lagrangian methods apply; see Appendix B in Monahan. The Lagrangian a(β, θ) is a function of β and the Lagrange multipliers in θ and can be written as
a(β, θ) = (Y − Xβ)'(Y − Xβ) + 2θ'(P'β − δ).
Differentiating,
∂a(β, θ)/∂β = −2X'Y + 2X'Xβ + 2Pθ
∂a(β, θ)/∂θ = 2(P'β − δ).
Setting these equal to zero leads to the restricted normal equations (RNEs), that is,
[ X'X  P ] ( β )   ( X'Y )
[ P'   0 ] ( θ ) = ( δ   ).
Denote by β̂_H and θ̂_H the solutions to the RNEs, respectively. The solution β̂_H is called the restricted least squares estimator.
DISCUSSION : We now present some facts regarding this restricted linear model and its
(restricted) least squares estimator. We have proven all of these facts for the unrestricted
model; restricted versions of the proofs are all in Monahan.
1. The restricted normal equations are consistent; see Result 3.8, Monahan (pp 62-63).
Y = Xβ + ε,  E(ε) = 0,  P'β = δ,
4. If λ'β is estimable in the unrestricted model (i.e., the model without the linear restriction), then λ'β is estimable in the restricted model. The converse is not true; see Monahan (pp 89-90).
RECALL: A likelihood ratio testing procedure is intuitive. One simply compares the
maximized likelihood over the restricted parameter space (that is, the space under H0 )
to the maximized likelihood over the entire parameter space. If the former is small when
compared to the latter, then there a large amount of evidence against H0 .
DERIVATION: Under our model assumptions, we know that Y ∼ Nn(Xβ, σ²I). The likelihood function for θ = (β', σ²)' is
L(θ|y) = (2πσ²)^{-n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²},
and the unrestricted parameter space is
Θ = {θ : β ∈ R^p, σ² ∈ R+}.
The restricted parameter space, that is, the parameter space under H0: K'β = m, is
Θ0 = {θ : β ∈ R^p, K'β = m, σ² ∈ R+}.
The likelihood ratio statistic is
λ ≡ λ(Y) = sup_{Θ0} L(θ|Y) / sup_Θ L(θ|Y).
We reject the null hypothesis H0 for small values of λ = λ(Y). Thus, to perform a level
α test, reject H0 when λ < c, where c ∈ (0, 1) is chosen to satisfy PH0 {λ(Y) ≤ c} = α.
We have seen (Section 6.1) that the unrestricted MLEs of β and σ² are β̂ = (X'X)^-X'Y and σ̂² = Q(β̂)/n. Under H0, the restricted MLEs are β̂_H and σ̂²_H = Q(β̂_H)/n, where β̂_H is any solution to the restricted normal equations
[ X'X  K ] ( β )   ( X'Y )
[ K'   0 ] ( θ ) = ( m   ),
where s = r(K) and c∗ = s−1 (n − r)(c−2/n − 1). Furthermore, Monahan’s Theorem 6.1
(pp 139-140) shows that when K0 β is estimable,
Q(β̂_H) − Q(β̂) = (K'β̂ − m)'H^{-1}(K'β̂ − m),
where H = K'(X'X)^-K. Applying this result, and noting that Q(β̂)/(n − r) = MSE, we see that
{Q(β̂_H) − Q(β̂)}/s / [Q(β̂)/(n − r)] > c*   ⇐⇒   F = (K'β̂ − m)'H^{-1}(K'β̂ − m)/(s · MSE) > c*.
That is, the LRT specifies that we reject H0 when F is large. Choosing c* = F_{s,n−r,α} provides a level α test. Therefore, under the Gauss-Markov model with normal errors, the LRT for H0: K'β = m is the same test as that in Section 6.3.
λ'β̂ ∼ N{λ'β, σ²λ'(X'X)^-λ}
and, hence,
Z = (λ'β̂ − λ'β)/√(σ²λ'(X'X)^-λ) ∼ N(0, 1).
If σ² were known, our work would be done, as Z is a pivot. More likely, this is not the case, so we must estimate it. An obvious point estimator for σ² is MSE, where
To verify that T ∼ t_{n−r}, it remains only to show that Z and Y'(I − PX)Y/σ² are independent, or equivalently, that λ'β̂ and Y'(I − PX)Y are, since Z is a function of λ'β̂. Because λ'β is estimable, λ' = a'X for some a, so
λ'β̂ = a'X(X'X)^-X'Y = a'PXY,
a function of PXY, which is independent of (I − PX)Y; this shows the required independence. It follows that
λ'β̂ ± t_{n−r,α/2} √(MSE λ'(X'X)^-λ)
is a 100(1 − α) percent confidence interval for λ'β.
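As a small illustration (not part of the notes), the Python sketch below computes this confidence interval for a made-up simple linear regression and the hypothetical choice λ = (1, 5)', which targets E(Y | x = 5); all numbers are invented.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 25
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 2.0 + 0.7 * x + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
MSE = np.sum((Y - X @ b) ** 2) / (n - 2)

lam = np.array([1.0, 5.0])                 # estimable function beta0 + 5*beta1
se = np.sqrt(MSE * lam @ XtX_inv @ lam)
t = stats.t.ppf(0.975, n - 2)
print(lam @ b - t * se, lam @ b + t * se)  # 95% confidence interval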
Yi = β0 + β1xi + εi,
for i = 1, 2, ..., n, where ε1, ε2, ..., εn are iid N(0, σ²). Recall also that the least squares estimator of β = (β0, β1)' is
β̂ = (X'X)^{-1}X'Y = ( β̂0 )   ( Ȳ − β̂1x̄                          )
                     ( β̂1 ) = ( Σi(xi − x̄)(Yi − Ȳ)/Σi(xi − x̄)² ).
We now consider the problem of writing a 100(1 − α) percent confidence interval for
E(Y|x = x0) = β0 + β1x0,
the mean response at a fixed value x0. Here, λ = (1, x0)', so that
λ'β̂ = β̂0 + β̂1x0
and
λ'(X'X)^{-1}λ = 1/n + (x0 − x̄)²/Σi(xi − x̄)².
PROBLEM: Consider the Gauss-Markov linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ∼ Nn(0, σ²I). We now consider the problem of writing simultaneous confidence intervals for the k estimable functions λ1'β, λ2'β, ..., λk'β. Let the p × k matrix Λ = (λ1 λ2 · · · λk), so that
τ = Λ'β = (λ1'β, λ2'β, ..., λk'β)'.
The least squares estimator satisfies
τ̂ = Λ'β̂ ∼ Nk(Λ'β, σ²H),
where H = Λ'(X'X)^-Λ. For each j, we have that
λj'β̂ ∼ N(λj'β, σ²hjj),
where hjj is the jth diagonal element of H. Using our previous results, we know that
λj'β̂ ± t_{n−r,α/2}√(σ̂²hjj)
is a 100(1 − α) percent confidence interval for λj'β. For the set of k intervals
{λj'β̂ ± t_{n−r,α/2}√(σ̂²hjj), j = 1, 2, ..., k},
let Ej denote the event that interval j contains λj'β, so that pr(Ej) = 1 − α, for j = 1, 2, ..., k. The probability that all k intervals include their targets λj'β is
pr(∩_{j=1}^k Ej) = 1 − pr(∪_{j=1}^k Ej^c) ≥ 1 − Σ_{j=1}^k pr(Ej^c) = 1 − kα,
by Boole's Inequality. Thus, the probability that each interval contains its intended target satisfies
pr(∩_{j=1}^k Ej) ≥ 1 − kα.
Obviously, this lower bound 1 − kα can be quite a bit lower than 1 − α; that is, the simultaneous coverage probability of the set of intervals
{λj'β̂ ± t_{n−r,α/2}√(σ̂²hjj), j = 1, 2, ..., k}
can be much smaller than the nominal single-interval level 1 − α.
BONFERRONI: From the argument on the last page, it is clear that if one takes d = t_{n−r,α/2k}, then
pr(∩_{j=1}^k Ej) ≥ 1 − k(α/k) = 1 − α.
Thus, 100(1 − α) percent simultaneous confidence intervals for λ1'β, λ2'β, ..., λk'β are
λj'β̂ ± t_{n−r,α/2k}√(σ̂²hjj),
for j = 1, 2, ..., k.
SCHEFFÉ: The idea behind Scheffé's approach is to consider an arbitrary linear combination of τ = Λ'β, say, u'τ = u'Λ'β, and construct a confidence interval
C(u, d) = (u'τ̂ − d√(σ̂²u'Hu), u'τ̂ + d√(σ̂²u'Hu)),
where d is chosen so that the intervals C(u, d) cover their targets simultaneously for all u with probability 1 − α. Since d is chosen in this way, one guarantees the necessary simultaneous coverage probability for all possible linear combinations of τ = Λ'β (an infinite number of combinations). Clearly, the desired simultaneous coverage is then conferred for the k functions of interest τj = λj'β, j = 1, 2, ..., k; these functions result from taking u to be the standard unit vectors. The argument in Monahan (pp 144) shows that d = (kF_{k,n−r,α})^{1/2}.
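As a quick comparison (not part of the notes), the Python sketch below evaluates the Bonferroni and Scheffé multipliers for made-up values of n, r, k, and α; the smaller multiplier yields the shorter simultaneous intervals.

import numpy as np
from scipy import stats

n, r, k, alpha = 30, 3, 4, 0.05

d_bonf = stats.t.ppf(1 - alpha / (2 * k), n - r)            # Bonferroni multiplier
d_scheffe = np.sqrt(k * stats.f.ppf(1 - alpha, k, n - r))   # Scheffe multiplier
print(d_bonf, d_scheffe)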
7 Appendix
The (i, j)th element of A is denoted by aij . The dimensions of A are m (the number of
rows) by n (the number of columns). If m = n, A is square. If we want to emphasize
the dimension of A, we can write Am×n .
Result MAR1.1.
(a) (A0 )0 = A
(c) A = 0 iff A0 A = 0
(d) (AB)0 = B0 A0
(e) (A + B)0 = A0 + B0
that is, aij = 1 for i = j, and aij = 0 when i ≠ j. The n × n matrix of ones is
J = Jn = [ 1 1 ··· 1
           1 1 ··· 1
           ... ... ...
           1 1 ··· 1 ]_{n×n};
that is, aij = 1 for all i and j. Note that J = 11', where 1 = 1n is an n × 1 (column) vector of ones. The n × n matrix where aij = 0, for all i and j, is called the null matrix, or the zero matrix, and is denoted by 0.
AC = CA = I,
Result MAR1.2.
and at least one of the ci's is not zero; that is, it is possible to express at least one vector as a nontrivial linear combination of the others. If
Σ_{i=1}^n ci ai = 0 =⇒ c1 = c2 = · · · = cn = 0,
then a1, a2, ..., an are linearly independent.
The number of linearly independent rows of any matrix is always equal to the number of
linearly independent columns. Alternate notation for r(A) is rank(A).
• If r(A) < min{n, p}, we say that A is less than full rank or rank deficient.
Result MAR2.1.
(f) For any matrix An×n and vector bn×1 , r(A, b) ≥ r(A); i.e., the inclusion of a
column vector cannot decrease the rank of a matrix.
Y = Xβ + ,
X'Xβ = X'Y.
If r(X) = p, the normal equations have the solution
β̂ = (X'X)^{-1}X'Y.
This is the unique solution to the normal equations (since inverses are unique). Note that if r(X) = r < p, then a unique solution to the normal equations does not exist.
TERMINOLOGY : We say that two vectors a and b are orthogonal, and write a⊥b, if
their inner product is zero; i.e.,
a0 b = 0.
Vectors a1 , a2 , ..., an are mutually orthogonal if and only if a0i aj = 0 for all i 6= j. If
a1 , a2 , ..., an are mutually orthogonal, then they are also linearly independent (verify!).
The converse is not necessarily true.
If a1, a2, ..., an are mutually orthogonal, then the vectors
ci = ai/||ai||,
where ||ai|| = (ai'ai)^{1/2}, i = 1, 2, ..., n, are orthonormal. The quantity ||ai|| is the length of ai. If a1, a2, ..., an are the columns of A, then A'A is diagonal; similarly, if c1, c2, ..., cn are the columns of C, then C'C = I.
||Ax|| = ||x||. Geometrically, this means that multiplication of x by A only rotates the
vector x (since the length remains unchanged).
(i) x1 ∈ V, x2 ∈ V ⇒ x1 + x2 ∈ V, and
(ii) x ∈ V ⇒ cx ∈ V for c ∈ R.
TERMINOLOGY : We say that subspaces S1 and S2 are orthogonal, and write S1 ⊥S2 ,
if x01 x2 = 0, for all x1 ∈ S1 and for all x2 ∈ S2 .
It is easy to see that S1 and S2 are orthogonal. That S1 is a subspace is argued as follows. Clearly, S1 ⊆ V. Now, suppose that x1 ∈ S1 and x2 ∈ S1; i.e.,
x1 = (0, 0, z1)'  and  x2 = (0, 0, z2)',
for z1, z2 ∈ R. Then,
x1 + x2 = (0, 0, z1 + z2)' ∈ S1
and
cx1 = (0, 0, cz1)' ∈ S1,
for all c ∈ R. Thus, S1 is a subspace. That S2 is a subspace follows similarly.
TERMINOLOGY: Suppose that V is a vector space and that x1, x2, ..., xn ∈ V. The set of all linear combinations of x1, x2, ..., xn; i.e.,
S = {x ∈ V : x = Σ_{i=1}^n ci xi; ci ∈ R},
is itself a subspace of V, called the span of x1, x2, ..., xn.
Result MAR3.1. Suppose that S and T are vector spaces. If S ⊆ T , and dim(S) =
dim(T ), then S = T .
Proof. See pp 244-5 in Monahan.
||y||² = y'y = (y1 + y2)'(y1 + y2) = y1'y1 + 2y1'y2 + y2'y2 = ||y1||² + ||y2||².
This is simply the Pythagorean Theorem. The cross product term is zero since y1 and y2 are orthogonal.
is the set of all m × 1 vectors spanned by the columns of A; that is, C(A) is the set of all
vectors that can be written as a linear combination of the columns of A. The dimension
of C(A) is the column rank of A.
TERMINOLOGY: Let
A_{m×n} = (b1 b2 · · · bm)',
where bi' is the ith row of A (so each bi is n × 1). Denote
R(A) = {x ∈ R^n : x = Σ_{i=1}^m di bi; di ∈ R} = {x ∈ R^n : x' = d'A; d ∈ R^m}.
We call R(A) the row space of A. It is the set of all n × 1 vectors spanned by the rows of A; that is, the set of all vectors that can be written as a linear combination of the rows of A. The dimension of R(A) is the row rank of A.
Result MAR3.3.
(f) For any A and B, C(AB) ⊆ C(A). If B is nonsingular, then C(AB) = C(A).
Example. Define
A = [ 1 1 2
      1 0 3
      1 0 3 ]   and   c = (3, −1, −1)'.
The column space of A is the set of all linear combinations of the columns of A; i.e., the set of vectors of the form
c1a1 + c2a2 + c3a3 = (c1 + c2 + 2c3, c1 + 3c3, c1 + 3c3)',
where c1, c2, c3 ∈ R. Thus, the column space C(A) is the set of all 3 × 1 vectors of the form (a, b, b)', where a, b ∈ R. Any two vectors of {a1, a2, a3} span this space. In addition, any two of {a1, a2, a3} are linearly independent, and hence form a basis for C(A). The set {a1, a2, a3} is not linearly independent since Ac = 0. The dimension of C(A); i.e., the rank of A, is r = 2. The dimension of N(A) is 1, and c forms a basis for this space.
Result MAR3.6. For an m×n matrix A, N (A0 ) and C(A) are orthogonal complements
in Rm .
Proof. Both N (A0 ) and C(A) are vector spaces with vectors in Rm . From the last
result, we know that dim{C(A)} = rank(A) = r, say, and dim{N (A0 )} = m − r, since
A0 x = A0 Ac = 0 =⇒ c0 A0 Ac = 0 =⇒ (Ac)0 Ac = 0 =⇒ Ac = x = 0.
To finish the proof, we need to show that N (A0 ) and C(A) are orthogonal spaces. Suppose
that x1 ∈ C(A) and x2 ∈ N (A0 ). It suffices to show that x01 x2 = 0. But, note that
x1 ∈ C(A) =⇒ x1 = Ac, for some c. Also, x2 ∈ N (A0 ) =⇒ A0 x2 = 0. Since x01 x2 =
(Ac)0 x2 = c0 A0 x2 = c0 0 = 0, the result follows.
Result MAR3.7. Suppose that S1 and T1 are orthogonal complements. Suppose that
S2 and T2 are orthogonal complements. If S1 ⊆ S2 , then T2 ⊆ T1 .
Proof. See pp 244 in Monahan.
AGA = A =⇒ AA− A = A.
If A is square and nonsingular, then the generalized inverse of A is A−1 since AA− A =
AA−1 A = A.
NOTES :
• If A is m × n, then A− is n × m.
Note that r(A) = 2 because −a1 + 6a2 − a3 = 0. Thus A−1 does not exist. However, it
is easy to show that AGA = A; thus, G is a generalized inverse of A.
is a generalized inverse of A. This result essentially shows that every matrix has a
generalized inverse (see Results A.10 and A.11, Monahan). Also, it gives a method to
compute it.
Result MAR4.3. Let Am×n , xn×1 , cm×1 , and In×n be matrices, and suppose that
Ax = c is consistent. Then, x∗ is a solution to Ax = c if and only if
x∗ = A− c + (I − A− A)z,
for some z ∈ Rn . Thus, we can generate all solutions by just knowing one of them; i.e.,
by knowing A− c.
Proof. (⇐=) We know that x* = A^-c is a solution (Result MAR4.1). Suppose that x* = A^-c + (I − A^-A)z, for some z ∈ R^n. Then,
Ax* = AA^-c + A(I − A^-A)z = AA^-c = c,
because the system is consistent; so x* is a solution. (=⇒) Suppose that x* is a solution to Ax = c. Write
x* = A^-c + x* − A^-c = A^-c + x* − A^-Ax* = A^-c + (I − A^-A)x*,
which is of the required form with z = x*.
COMPARE: Suppose that X1, X2, ..., Xn is an iid sample from fX(x; θ) and let X = (X1, X2, ..., Xn)'. Suppose also that θ̂1 = θ̂1(X) is an unbiased estimator of θ; that is, Eθ[θ̂1(X)] = θ for all θ ∈ Θ, say. The general form of an unbiased estimator for θ is
θ̃ = θ̂1 + T,
where T has zero expectation.
Analogously, in the linear model Y = Xβ + ε, the normal equations are
X'Xβ = X'Y.
The normal equations are consistent (see below). Thus, by Result MAR4.3, the general form of the least squares estimator is given by
β̂ = (X'X)^-X'Y + [I − (X'X)^-X'X]z,
where z ∈ R^p. Of course, if r(X) = p, then (X'X)^{-1} exists, and the unique solution becomes
β̂ = (X'X)^{-1}X'Y.
X'XA = X'XB ⇐⇒ XA = XB.
Proof. The necessity part (⇐=) is obvious. For the sufficiency part (=⇒), note that
X'XA = X'XB =⇒ X'X(A − B) = 0 =⇒ (A − B)'X'X(A − B) = [X(A − B)]'[X(A − B)] = 0.
This can only be true if XA − XB = 0. Thus, the lemma is proven. Now, let (X'X)^- denote a generalized inverse of X'X so that X'X(X'X)^-X'X = X'X. Taking A' = X'X(X'X)^- and B' = I in the lemma, we have
X'X(X'X)^-X' = X' =⇒ X'X(X'X)^-X'Y = X'Y,
which shows that β̂ = (X'X)^-X'Y solves the normal equations; i.e., the normal equations are consistent.
Yij = µ + αi + ij ,
1. P is idempotent
2. Px ∈ S, for any x
3. z ∈ S =⇒ Pz = z (projection).
NOTE : In general, projection matrices are not unique. However, if we add the require-
ment that Pz = 0, for any z⊥S, then P is called a perpendicular projection matrix,
which is unique. These matrices are important in linear models.
(I − A^-A)(I − A^-A) = I − 2A^-A + A^-AA^-A = I − A^-A,
since AA^-A = A. For any x, note that (I − A^-A)x ∈ N(A) because A(I − A^-A)x = 0. Finally, if z ∈ N(A), then Az = 0. Thus, (I − A^-A)z = z − A^-Az = z.
and take
P = [ 0.8 0.4
      0.4 0.2 ].
Result MAR5.3. The matrix M is a perpendicular projection matrix onto C(M) if and
only if M is symmetric and idempotent.
Proof. (=⇒) Suppose that M is a perpendicular projection matrix onto C(M) and write
v = v1 + v2 , where v1 ∈ C(M) and v2 ⊥C(M). Also, let w = w1 + w2 , where w1 ∈ C(M)
and w2 ⊥C(M). Since (I − M)v = (I − M)v2 and Mw = Mw1 = w1 , we get
TERMINOLOGY: The sum of the diagonal elements of a square matrix A is called the trace of A, written tr(A); that is, for A_{n×n} = (aij),
tr(A) = Σ_{i=1}^n aii.
Result MAR6.1.
2. tr(cA) = ctr(A)
3. tr(A0 ) = tr(A)
4. tr(AB) = tr(BA)
5. tr(A'A) = Σ_{i=1}^n Σ_{j=1}^n a²ij.
Result MAR6.2.
1. |A0 | = |A|
2. |AB| = |BA|
3. |A−1 | = |A|−1
REVIEW : The table below summarizes equivalent conditions for the existence of an
inverse matrix A−1 (where A has dimension n × n).
If u ≠ 0, then A − λI must be singular (see the last table). Thus, the values of λ which satisfy Au = λu are those values where
|A − λI| = 0.
Aui = λi ui ,
From our discussion on systems of equations and consistency, we know a general solution
for ui is given by ui = [I − (A − λi I)− (A − λi I)]z, for z ∈ Rn .
Result MAR6.4. If A is a symmetric matrix, then there exists a basis for C(A) con-
sisting of eigenvectors of nonzero eigenvalues. If λ is a nonzero eigenvalue of multiplicity
m, then the basis will contain m eigenvectors for λ. Furthermore, N (A) consists of the
eigenvectors associated with λ = 0 (along with 0).
1. |A| = ∏_{i=1}^n λi
2. tr(A) = Σ_{i=1}^n λi.
Proof (in the symmetric case). Write A in its spectral decomposition A = QDQ'. Note that |A| = |QDQ'| = |DQ'Q| = |D| = ∏_{i=1}^n λi. Also, tr(A) = tr(QDQ') = tr(DQ'Q) = tr(D) = Σ_{i=1}^n λi.
Result MAR6.6. Suppose that A is symmetric. The rank of A equals the number of
nonzero eigenvalues of A.
Proof. Write A in its spectral decomposition A = QDQ0 . Because r(D) = r(A) and
because the only nonzero elements in D are the nonzero eigenvalues, the rank of D must
be the number of nonzero eigenvalues of A.
n = dim[C(A)] + dim[N(A)] = r + dim[N(A)],
showing that dim[N(A)] = n − r. Since A has n eigenvalues, all are accounted for by λ = 1 (with multiplicity r) and λ = 0 (with multiplicity n − r). Now tr(A) = Σi λi = r, the multiplicity of λ = 1. But r(A) = dim[C(A)] = r as well.
Note that
x'Ax = (1/2)x'Ax + (1/2)x'A'x = x'{(1/2)(A + A')}x = x'Bx,
where B = (1/2)(A + A') is symmetric.
UPSHOT: In working with quadratic forms, we can, without loss of generality, assume that the matrix of the quadratic form is symmetric.
1. A pd =⇒ |A| > 0
2. A nnd =⇒ |A| ≥ 0.
RESULT : A square matrix A is pd iff there exists a nonsingular lower triangular matrix
L such that A = LL0 . This is called the Choleski Factorization of A. Monahan
proves this result (see pp 258), provides an algorithm on how to find L, and includes an
example.
RESULT: Suppose that A is symmetric and pd. Writing A in its spectral decomposition, we have A = QDQ'. Because A is pd, the eigenvalues λ1, λ2, ..., λn of A are positive. If we define A^{1/2} = QD^{1/2}Q', where D^{1/2} = diag(√λ1, √λ2, ..., √λn), then A^{1/2} is symmetric and
A^{1/2}A^{1/2} = QD^{1/2}Q'QD^{1/2}Q' = QDQ' = A.
The matrix A^{1/2} is called the symmetric square root of A. See Monahan (pp 259-60) for an example.
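As a small numerical illustration of the last two results (not part of the notes), the Python sketch below computes the Cholesky factor and the symmetric square root of a made-up pd matrix and verifies both factorizations.

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # symmetric and pd (made-up)

L = np.linalg.cholesky(A)                  # lower triangular with A = L L'
vals, Q = np.linalg.eigh(A)                # spectral decomposition A = Q D Q'
A_half = Q @ np.diag(np.sqrt(vals)) @ Q.T  # symmetric square root

print(np.allclose(L @ L.T, A), np.allclose(A_half @ A_half, A))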
DEFINITION: Suppose that E(Yi) = µi, var(Yi) = σi², for i = 1, 2, ..., n, and cov(Yi, Yj) = σij, for i ≠ j. The mean of Y is
µ = E(Y) = (E(Y1), E(Y2), ..., E(Yn))' = (µ1, µ2, ..., µn)'.
The variance-covariance matrix of Y is
Σ = cov(Y) = [ σ1²  σ12  ···  σ1n
               σ21  σ2²  ···  σ2n
               ...  ...  ...  ...
               σn1  σn2  ···  σn² ].
NOTE: Σ contains the variances σ1², σ2², ..., σn² on the diagonal and the (n choose 2) covariance terms cov(Yi, Yj), for i < j, as the elements strictly above the diagonal. Since cov(Yi, Yj) = cov(Yj, Yi), it follows that Σ is symmetric.
EXAMPLE : Suppose that Y1 , Y2 , ..., Yn is an iid sample with mean E(Yi ) = µ and
variance var(Yi ) = σ 2 and let Y = (Y1 , Y2 , ..., Yn )0 . Then µ = E(Y) = µ1n and
Σ = cov(Y) = σ 2 In .
EXAMPLE: Consider the GM linear model Y = Xβ + ε. In this model, the random errors ε1, ε2, ..., εn are uncorrelated random variables with zero mean and constant variance σ². We have E(ε) = 0_{n×1} and cov(ε) = σ²In.
TERMINOLOGY: Suppose that Z11, Z12, ..., Znp are random variables. We call
Z_{n×p} = (Zij), for i = 1, 2, ..., n and j = 1, 2, ..., p,
a random matrix; that is, a matrix whose entries are random variables.
Proof. That cov(Y) = E[(Y − µ)(Y − µ)'] follows straightforwardly from the definition of variance and covariance in the scalar case. Showing this equals E(YY') − µµ' is simple algebra.
DEFINITION: Suppose that Y_{p×1} and X_{q×1} are random vectors with means µY and µX, respectively. The covariance between Y and X is the p × q matrix defined by
cov(Y, X) = E[(Y − µY)(X − µX)'] = (σij),
where
σij = E[{Yi − E(Yi)}{Xj − E(Xj)}] = cov(Yi, Xj).
DEFINITION : Random vectors Yp×1 and Xq×1 are uncorrelated if cov(Y, X) = 0p×q .
Result RV2. If cov(Y, X) = 0, then cov(Y, a + BX) = 0, for all nonrandom con-
formable a and B. That is, Y is uncorrelated with any linear function of X.
TERMINOLOGY: Suppose that var(Yi) = σi², for i = 1, 2, ..., n, and cov(Yi, Yj) = σij, for i ≠ j. The correlation matrix of Y is the n × n matrix
R = (ρij) = [ 1    ρ12  ···  ρ1n
              ρ21  1    ···  ρ2n
              ...  ...  ...  ...
              ρn1  ρn2  ···  1   ],
where
ρij = σij/(σiσj),
for i, j = 1, 2, ..., n.
TERMINOLOGY: Suppose that Y1, Y2, ..., Yn are random variables and that a1, a2, ..., an are constants. Define a = (a1, a2, ..., an)' and Y = (Y1, Y2, ..., Yn)'. The random variable
X = a'Y = Σ_{i=1}^n aiYi
is a linear combination of Y1, Y2, ..., Yn. If µ = E(Y), then
E(a'Y) = a'µ.
Result RV4. Suppose that Y = (Y1 , Y2 , ..., Yn )0 is a random vector with mean µ =
E(Y), let Z be a random matrix, and let A and B (a and b) be nonrandom conformable
matrices (vectors). Then
1. E(AY) = Aµ
3. E(AZB) = AE(Z)B.
Result RV5. Suppose that Y is a random vector with mean µ = E(Y) and covariance matrix Σ = cov(Y), and let a be a conformable vector of constants. Then
var(a'Y) = a'Σa.
Proof. The quantity a'Y is a scalar random variable, and its variance is given by
var(a'Y) = E[{a'Y − E(a'Y)}²] = E[{a'(Y − µ)}²].
But, note that a'(Y − µ) is a scalar, and hence equals (Y − µ)'a. Using this fact, we can rewrite the last expectation to get
var(a'Y) = E[a'(Y − µ)(Y − µ)'a] = a'E[(Y − µ)(Y − µ)']a = a'Σa.
Result RV6. Suppose that Y = (Y1 , Y2 , ..., Yn )0 is a random vector with covariance
matrix Σ = cov(Y), and let a and b be conformable vectors of constants. Then
cov(a0 Y, b0 Y) = a0 Σb.
Result RV7. Suppose that Y = (Y1 , Y2 , ..., Yn )0 is a random vector with mean µ = E(Y)
and covariance matrix Σ = cov(Y). Let b, A, and B denote nonrandom conformable
vectors/matrices. Then
1. E(AY + b) = Aµ + b
2. cov(AY + b) = AΣA0
Result RV9. If Y = (Y1, Y2, ..., Yn)' is a random vector with mean µ = E(Y) and covariance matrix Σ, then P{(Y − µ) ∈ C(Σ)} = 1.
Proof. Without loss, take µ = 0, and let MΣ be the perpendicular projection matrix onto C(Σ). We know that Y = MΣY + (I − MΣ)Y and that
cov{(I − MΣ)Y} = (I − MΣ)Σ(I − MΣ)' = 0,
since MΣΣ = Σ. Because (I − MΣ)Y has mean 0 and zero covariance, (I − MΣ)Y = 0 with probability one; that is, Y = MΣY ∈ C(Σ) almost surely.
IMPLICATION : Result RV9 says that there exists a subset C(Σ) ⊆ Rn that contains Y
with probability one (i.e., almost surely). If Σ is positive semidefinite (psd), then Σ is
singular and C(Σ) is concentrated in a subspace of Rn , where the subspace has dimension
r = r(Σ), r < n. In this situation, the pdf of Y may not exist.
Result RV10. Suppose that X, Y, and Z are n × 1 vectors and that X = Y + Z. Then