UMVUE Statmat 2 2022

This document discusses applications of statistical inference methods, focusing on estimation in linear models and sampling from finite populations. It describes linear models where a response variable is a linear combination of covariates and a random error. Least squares estimation is introduced as a common approach to fitting linear models by minimizing the sum of squared errors. The properties of least squares estimators are then discussed both mechanically and when the data follow a probability model.

Uploaded by

Maria Risnawati
Copyright
© © All Rights Reserved


5.5 Applications
Many methods of statistical inference rely on samples of identically distributed
random variables. Two major areas of application of the methods are in anal-
ysis of linear models and sampling of finite populations.

5.5.1 Estimation in Linear Models

In a simple variation on the requirement of identical distributions, we assume
a model with two components, one “systematic” and one random, and the
distributions of the observable random variables depend on the systematic
component.

Systematic and Random Components

The most common form of linear model is one in which a random variable $Y$
is the sum of a systematic component that determines its expected value and
a random component that is the value of an underlying unobservable random
variable that has an expected value of 0. The systematic component may be
a function of some additional variables $x$ and parameters $\theta$. If we represent
the underlying unobservable random variable with expectation 0 as $\epsilon$, we have

$$Y = f(x, \theta) + \epsilon. \tag{5.63}$$

In this setup the mean of the random variable Y is determined by the param-
eter θ and the values of the x variables, which are covariates (also called re-
gressors, carriers, or independent variables). We generally treat the covariates
as fixed variables, that is, whether or not we could also model the covariates
as random variables, in the simplest cases, we will use their observed values
without regard to their origin.

Regression Models

The model above is a regression model. In the simplest variation, the
observable random variables are independent, and have distributions in the same
location family: $\mathcal{P} = \{P_{f(x,\theta),P_\epsilon}\}$. The family $\mathcal{P}_\epsilon$ of distributions $P_\epsilon$ of the
random component may be a parametric family, such as $\mathrm{N}(0, \sigma^2)$, or it may
be a nonparametric family. Whatever other assumptions we make on $P_\epsilon$, we assume
$\mathrm{E}(\epsilon) = 0$.

Linear Models

Often we assume that the systematic component is a linear combination of
the covariates. This setup is called a linear model, and is usually written in
the form

c
Theory of Statistics 2000–2020 James E. Gentle
424 5 Unbiased Point Estimation

$$Y = x^T\beta + E, \tag{5.64}$$
where $Y$ is the observable random variable, $x$ is an observable $p$-vector of
covariates, $\beta$ is an unknown and unobservable $p$-vector of parameters, and $E$
is an unobservable random variable with $\mathrm{E}(E) = 0$ and $\mathrm{V}(E) = \sigma^2$
(for the $n$-vector $E$ in the vector-matrix form (5.66) below, $\mathrm{V}(E) = \sigma^2 I$). The
parameter space for $\beta$ is $B \subseteq \mathbb{R}^p$.
An item of a random sample from this model may be denoted

$$Y_i = x_i^T\beta + E_i, \tag{5.65}$$

and a random sample may be written in the vector-matrix form

$$Y = X\beta + E, \tag{5.66}$$

where $Y$ and $E$ are $n$-vectors, $X$ is an $n \times p$ matrix whose rows are the $x_i^T$,
and $\beta$ is the $p$-vector above. A sample of realizations may be written in the
vector-matrix form
$$y = X\beta + \epsilon, \tag{5.67}$$
where $y$ and $\epsilon$ are $n$-vectors. This is the most commonly used notation.

Inference in a Linear Model

For estimation in a linear model, rather than formulating a decision problem
and seeking a minimum risk estimator, we usually begin with a different
approach. Estimation in a linear model is most commonly developed based on
two simple heuristics: least squares and unbiasedness.
The degree of $\beta$ is $p$, meaning that the minimum number of observations
required for unbiased estimation of $\beta$ is $p$. Inferences about characteristics
of the distribution of $\epsilon$ require additional observations, however, and so we
assume $n > p$ in the following.
In statistical inference, we can think of β either as an unobservable random
variable or as an unknown constant. If we think of it as an unknown constant
and we want to determine a value of it that optimizes some objective function
(such as a likelihood or a sum of squares), then we first must substitute a
variable for the constant. Although we often skip over this step, it is important
conceptually.

Least Squares Solutions of Overdetermined Linear Systems

Having substituted the variable $b$ in place of the unknown model parameter
$\beta$, we have an overdetermined linear system

$$y \approx Xb, \tag{5.68}$$

where $y$ and $X$ are given, $b$ is unknown, and $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $b \in \mathbb{R}^p$.
Solving for $b$ in this system is a common problem in linear algebra. It is
one aspect of the statistical problem of fitting the model (5.66), in which we
assume that $y$ is a realization of a random variable $Y$ with $\mathrm{E}(Y) = X\beta$, but
for the time being we will just consider the algebraic issues in solving, or
“fitting”, the overdetermined system.
Fitting an overdetermined system $y \approx Xb$ involves a choice of a criterion
for the goodness of the approximation. A common choice is the squared error;
that is, a solution is a vector $b$ that minimizes $\|y - Xb\|^2$. This follows the
approach to statistical inference discussed in Section 3.2.3. The solution to
the linear algebra problem (5.68) is often called an “estimator” even though
there is no underlying probability distribution.
We define a least squares estimator (LSE) of $b$ or of $\beta$ in equation (5.66)
as
$$b^* = \arg\min_{b \in B} \|y - Xb\|^2, \tag{5.69}$$
where $\|c\| = \|c\|_2 = \sqrt{c^Tc} = \sqrt{\sum_{i=1}^p c_i^2}$ for the $p$-vector $c$.
A least squares estimator of $\beta$ may or may not be unique. Whether or not
$b^*$ is unique,
$$\|y - Xb^*\|^2 \tag{5.70}$$
is unique. This is because the objective function is convex and bounded below.
The least squares estimator is obtained by direct minimization of
$$s(b) = \|y - Xb\|^2 = y^Ty - 2b^TX^Ty + b^TX^TXb.$$
First of all, we note that $s(b)$ is differentiable, and
$$\frac{\partial^2}{\partial b^2}s(b) = 2X^TX$$
is nonnegative definite. We therefore know that at the minimum, we have
the estimating equation
$$\partial s(b)/\partial b = 0. \tag{5.71}$$
The estimating equation leads to the normal equations:
$$X^TXb = X^Ty. \tag{5.72}$$
The coefficient matrix in these equations has a special form; it is a Gramian
matrix. We may use $b^*$ to denote any solution to the normal equations formed
from the linear system $y = Xb$, that is
$$b^* = (X^TX)^-X^Ty, \tag{5.73}$$
where $(X^TX)^-$ denotes any generalized inverse of $X^TX$.
Notice that if $X$ is not of full rank, $b^*$ is not unique.
A unique solution to these equations is
$$\hat\beta = (X^TX)^+X^Ty; \tag{5.74}$$
that is, the solution arising from the Moore-Penrose inverse (see page 784).
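
As a small numerical illustration of the normal equations (5.72) and the Moore-Penrose solution (5.74), the following sketch uses NumPy on a made-up rank-deficient design. The matrix `X`, the data `y`, the null-space vector, and the cutoff `1e-10` passed to `pinv` are all assumptions chosen for the example, not values from the text.

```python
import numpy as np

# Hypothetical rank-deficient design: the third column is the sum of the first two.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

G = X.T @ X  # Gramian coefficient matrix of the normal equations (5.72)

# Moore-Penrose solution (5.74). The explicit cutoff 1e-10 (an assumption)
# makes the numerically zero singular value of G count as exactly zero.
beta_hat = np.linalg.pinv(G, 1e-10) @ X.T @ y

# Any vector in the null space of X can be added without changing X @ b.
null_vec = np.array([1.0, 1.0, -1.0])   # X @ null_vec == 0 for this X
b_star = beta_hat + 0.5 * null_vec      # a second least squares solution


def residual_ss(b):
    """Return the residual sum of squares ||y - Xb||^2."""
    r = y - X @ b
    return float(r @ r)


both_solve_normal_eqs = bool(np.allclose(G @ beta_hat, X.T @ y)
                             and np.allclose(G @ b_star, X.T @ y))
same_residual = bool(np.isclose(residual_ss(beta_hat), residual_ss(b_star)))
print(both_solve_normal_eqs, same_residual)
```

Both vectors solve the normal equations and give the same residual sum of squares, which is the sense in which (5.70) is unique even when $b^*$ is not.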


LSE in a Probability Model

The mechanical aspects of least squares fitting do not rely on any probability
distributions.
An LSE of $\beta$ yields LSEs of other quantities. In general, for an estimand
$\theta$ that can be expressed as
$$\theta = \mathrm{E}g(Y, \beta), \tag{5.75}$$
we call $\hat\theta = g(y, \hat\beta)$ the LSE of $\theta$. Notice that this definition preserves
unbiasedness if the relationships are linear.
If the quantities in the equations correspond to $n$ observations that follow
the model (5.64), then we form an LSE of $l^T\beta$, for given $l \in \mathbb{R}^p$, as
$$l^T\hat\beta. \tag{5.76}$$
While this quantity may not be unique, the quantity
$$\|Y - X\hat\beta\|^2/(n - p) \tag{5.77}$$
is unique; it is the LSE of $\mathrm{V}(\epsilon) = \sigma^2$; and furthermore, it is unbiased for $\sigma^2$
(exercise).

Linear U-Estimability

One of the most important questions for statistical inference involves
estimating or testing some linear combination of the elements of the parameter
$\beta$; for example, we may wish to estimate $\beta_1 - \beta_2$ or to test the hypothesis
that $\beta_1 - \beta_2 = c_1$ for some constant $c_1$. In general, we will consider the linear
combination $l^T\beta$. Whether or not it makes sense to estimate such a linear
combination depends on whether there is a function of the observable random
variable $Y$ such that
$$g(\mathrm{E}(Y)) = l^T\beta. \tag{5.78}$$
We generally restrict our attention to linear functions of $\mathrm{E}(Y)$ and formally
define a linear combination $l^T\beta$ to be (linearly) U-estimable if and only if there
exists a vector $t$ such that
$$t^T\mathrm{E}(Y) = l^T\beta \tag{5.79}$$
for any $\beta$.
It is clear that if $X$ is of full column rank, then $l^T\beta$ is linearly estimable
for any $l$. More generally, it is easy to see that $l^T\beta$ is linearly estimable for
any $l \in \mathrm{span}(X^T)$. (The $t$ vector in equation (5.79) is just the vector of
coefficients expressing $l$ in terms of the rows of $X$, that is, the columns of $X^T$.)
Estimability depends only on the simplest distributional assumption about
the model; that is, that $\mathrm{E}(\epsilon) = 0$.
Theorem 5.6
Let $Y = X\beta + \epsilon$ where $\mathrm{E}(\epsilon) = 0$. Let $l^T\beta$ be a linearly estimable function and
let $\hat\beta = (X^TX)^+X^TY$. Then $l^T\hat\beta$ is unbiased for $l^T\beta$.


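Proof.
Because $l \in \mathrm{span}(X^T) = \mathrm{span}(X^TX)$, we can write

$$l = X^TX\tilde t, \tag{5.80}$$

for some vector $\tilde t$. Now, we have

$$\begin{aligned}
\mathrm{E}(l^T\hat\beta) &= \mathrm{E}(l^T(X^TX)^+X^TY)\\
&= \tilde t^TX^TX(X^TX)^+X^TX\beta\\
&= \tilde t^TX^TX\beta\\
&= l^T\beta.
\end{aligned} \tag{5.81}$$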

Although we have been taking $\hat\beta$ to be $(X^TX)^+X^TY$, the equations above
follow for other least squares fits, $b^* = (X^TX)^-X^TY$, for any generalized
inverse. In fact, the estimator of $l^T\beta$ is invariant to the choice of the generalized
inverse.
Theorem 5.7
Let $Y = X\beta + \epsilon$ where $\mathrm{E}(\epsilon) = 0$. Let $l^T\beta$ be a linearly estimable function, let
$\hat\beta = (X^TX)^+X^TY$ and let $b^* = (X^TX)^-X^TY$. Then $l^Tb^* = l^T\hat\beta$.
Proof.
If $b^* = (X^TX)^-X^TY$, we have $X^TXb^* = X^TY$, and so, with $l = X^TX\tilde t$ as in equation (5.80),

$$l^T\hat\beta - l^Tb^* = \tilde t^TX^TX(\hat\beta - b^*) = \tilde t^T(X^TY - X^TY) = 0.$$
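
The estimability condition $l \in \mathrm{span}(X^T)$ and the invariance stated in Theorem 5.7 can be checked numerically. The sketch below is only an illustration: the design `X`, the data `y`, the helper name `is_estimable`, and the tolerances are assumptions made for the example. It uses the fact that $X^+X$ is the orthogonal projector onto the row space of $X$, and that $X^+y = (X^TX)^+X^Ty$.

```python
import numpy as np

# Hypothetical rank-deficient design (third column = first + second).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

X_pinv = np.linalg.pinv(X, 1e-10)   # explicit cutoff is an assumption
row_projector = X_pinv @ X          # orthogonal projector onto span(X^T)


def is_estimable(l, tol=1e-8):
    """Return True if l lies in span(X^T), i.e. l^T beta is linearly estimable."""
    return bool(np.linalg.norm(row_projector @ l - l) < tol)


beta_hat = X_pinv @ y                                  # Moore-Penrose fit
b_star = beta_hat + 0.5 * np.array([1.0, 1.0, -1.0])   # another least squares solution

l_est = np.array([1.0, 1.0, 2.0])   # a row of X, hence estimable
l_non = np.array([1.0, 0.0, 0.0])   # not in the row space of X

print(is_estimable(l_est), bool(np.isclose(l_est @ beta_hat, l_est @ b_star)))
print(is_estimable(l_non), bool(np.isclose(l_non @ beta_hat, l_non @ b_star)))
```

For the estimable `l_est` the two fits agree; for `l_non` they differ, which is why only estimable functions have a well-defined least squares estimate when $X$ is rank deficient.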

Gauss-Markov Theorem

The Gauss-Markov theorem provides a restricted optimality property for
estimators of estimable functions of $\beta$ under the condition that $\mathrm{E}(\epsilon) = 0$ and
$\mathrm{V}(\epsilon) = \sigma^2I$; that is, in addition to the assumption of zero expectation, which
we have used above, we also assume that the elements of $\epsilon$ have constant
variance and that their covariances are zero. Note that we do not assume
independence or normality.
The Gauss-Markov theorem states that $l^T\hat\beta$ is the unique best linear
unbiased estimator (BLUE) of the estimable function $l^T\beta$. (Recall that
Theorem 5.7 tells us that the inner product is invariant to the choice of the
generalized inverse; that is, $l^Tb^* = l^T\hat\beta$, where $b^*$ and $\hat\beta$ are given in equations (5.73)
and (5.74) respectively.) “Linear” estimator in this context means a linear
combination of the elements of $Y$; that is, an estimator in the form $a^TY$. It is clear that $l^T\hat\beta$
is linear, and we have already seen that it is unbiased for $l^T\beta$. “Best” in this
context means that its variance is no greater than that of any other estimator that
fits the requirements.


Theorem 5.8 (Gauss-Markov theorem)
Let $Y = X\beta + \epsilon$ where $\mathrm{E}(\epsilon) = 0$ and $\mathrm{V}(\epsilon) = \sigma^2I$, and assume $l^T\beta$ is linearly
estimable. Let $\hat\beta = (X^TX)^+X^TY$. Then $l^T\hat\beta$ is the a.s. unique BLUE of $l^T\beta$.
Proof.
Let $a^TY$ be any unbiased estimator of $l^T\beta$, and write $l = X^TX\tilde t$ as in
equation (5.80) above. Because $a^TY$ is unbiased for any $\beta$, as we saw above, it
must be the case that $a^TX = l^T$. Recalling that $X^TX\hat\beta = X^TY$, we have

$$\begin{aligned}
\mathrm{V}(a^TY) &= \mathrm{V}(a^TY - l^T\hat\beta + l^T\hat\beta)\\
&= \mathrm{V}(a^TY - \tilde t^TX^TY + l^T\hat\beta)\\
&= \mathrm{V}(a^TY - \tilde t^TX^TY) + \mathrm{V}(l^T\hat\beta) + 2\,\mathrm{Cov}(a^TY - \tilde t^TX^TY,\ \tilde t^TX^TY).
\end{aligned}$$

Now, under the assumptions on the variance-covariance matrix of $\epsilon$, which is
also the (conditional, given $X$) variance-covariance matrix of $Y$, we have

$$\begin{aligned}
\mathrm{Cov}(a^TY - \tilde t^TX^TY,\ l^T\hat\beta) &= (a^T - \tilde t^TX^T)\sigma^2IX\tilde t\\
&= (a^TX - \tilde t^TX^TX)\sigma^2I\tilde t\\
&= (l^T - l^T)\sigma^2I\tilde t\\
&= 0;
\end{aligned}$$

that is,
$$\mathrm{V}(a^TY) = \mathrm{V}(a^TY - \tilde t^TX^TY) + \mathrm{V}(l^T\hat\beta).$$
This implies that
$$\mathrm{V}(a^TY) \ge \mathrm{V}(l^T\hat\beta);$$
that is, $l^T\hat\beta$ has minimum variance among the linear unbiased estimators of
$l^T\beta$.
To see that it is unique, we consider the case in which $\mathrm{V}(a^TY) = \mathrm{V}(l^T\hat\beta)$;
that is, $\mathrm{V}(a^TY - \tilde t^TX^TY) = 0$. For this variance to equal 0, it must be the
case that $a^T - \tilde t^TX^T = 0$ or $a^TY = \tilde t^TX^TY = l^T\hat\beta$ a.s.; that is, $l^T\hat\beta$ is the
a.s. unique linear unbiased estimator that achieves the minimum variance.
If we assume further that $\epsilon \sim \mathrm{N}_n(0, \sigma^2I)$, we see that $l^T\hat\beta$ is the uniformly
minimum variance unbiased estimator (UMVUE) for $l^T\beta$. This is because
$(X^TY, (Y - X\hat\beta)^T(Y - X\hat\beta))$ is complete and sufficient for $(\beta, \sigma^2)$. This line
of reasoning also implies that $(Y - X\hat\beta)^T(Y - X\hat\beta)/(n - r)$, where $r = \mathrm{rank}(X)$,
is UMVUE for $\sigma^2$.
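
The variance inequality in the Gauss-Markov theorem can be illustrated without simulation, because under $\mathrm{V}(\epsilon) = \sigma^2I$ the variance of any linear estimator $a^TY$ is $\sigma^2a^Ta$. The following sketch assumes a small full-rank design and a few arbitrary vectors `z` (all made up for the example); it builds competing linear unbiased estimators $a = t + d$ with $X^Td = 0$ and compares $a^Ta$ with $t^Tt$, where $l^T\hat\beta = t^TY$.

```python
import numpy as np

# Hypothetical full-rank design: intercept and one covariate.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
l = np.array([0.0, 1.0])                    # l^T beta is the slope

X_pinv = np.linalg.pinv(X)
t_blue = X_pinv.T @ l                       # l^T beta_hat = t_blue^T Y
residual_maker = np.eye(4) - X @ X_pinv     # maps any z to a d with X^T d = 0

var_blue = float(t_blue @ t_blue)           # V(l^T beta_hat) / sigma^2
var_others = []
for z in ([1.0, 0.0, 0.0, 0.0], [0.0, 2.0, -1.0, 0.5], [3.0, 1.0, 4.0, 1.0]):
    a = t_blue + residual_maker @ np.array(z)   # another linear estimator
    assert np.allclose(a @ X, l)                # a^T X = l^T, so a^T Y is unbiased
    var_others.append(float(a @ a))             # V(a^T Y) / sigma^2

print(var_blue, var_others)
```

Each competitor has a strictly larger variance factor than `var_blue`, in line with the decomposition $\mathrm{V}(a^TY) = \mathrm{V}(a^TY - \tilde t^TX^TY) + \mathrm{V}(l^T\hat\beta)$ in the proof.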
****** biased estimator with smaller MSE
Example 5.27 Inadmissibility of the LSE in the Linear Model
inadmissible under squared-error loss regularization; see page 252


Optimal Properties of the Moore-Penrose Inverse


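The solution corresponding to the Moore-Penrose inverse is unique because
that generalized inverse is unique. That solution is interesting for another
reason.
Theorem 5.9
Let $b^*$ be any solution to the normal equations (5.72), that is,
$$b^* = (X^TX)^-X^TY,$$
and let
$$\hat\beta = (X^TX)^+X^TY;$$
then
$$\|\hat\beta\|^2 \le \|b^*\|^2.$$
Proof.
To see that this solution has minimum norm, first factor $X$ as
$$X = QRU^T,$$
where $Q$ and $U$ are orthogonal and
$R = \begin{bmatrix} R_1 & 0\\ 0 & 0 \end{bmatrix}$ with $R_1$ an $r \times r$ nonsingular matrix and $r = \mathrm{rank}(X)$,
and form the Moore-Penrose inverse as
$$X^+ = U\begin{bmatrix} R_1^{-1} & 0\\ 0 & 0 \end{bmatrix}Q^T.$$
Now let
$$\hat\beta = X^+Y.$$
This is a least squares solution (that is, we have chosen a specific least squares
solution).
Now, let
$$Q^TY = \begin{pmatrix} c_1\\ c_2 \end{pmatrix},$$
where $c_1$ has exactly $r$ elements and $c_2$ has $n - r$ elements, and let
$$U^Tb = \begin{pmatrix} t_1\\ t_2 \end{pmatrix},$$
where $b$ is the variable in the norm $\|Y - Xb\|^2$ that we seek to minimize, and
where $t_1$ has $r$ elements.
Because multiplication by an orthogonal matrix does not change the norm,
we have
$$\begin{aligned}
\|Y - Xb\|^2 &= \|Q^T(Y - XUU^Tb)\|^2\\
&= \left\|\begin{pmatrix} c_1\\ c_2 \end{pmatrix} - \begin{bmatrix} R_1 & 0\\ 0 & 0 \end{bmatrix}\begin{pmatrix} t_1\\ t_2 \end{pmatrix}\right\|^2\\
&= \left\|\begin{pmatrix} c_1 - R_1t_1\\ c_2 \end{pmatrix}\right\|^2.
\end{aligned}$$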


The residual norm is minimized for $t_1 = R_1^{-1}c_1$ and $t_2$ arbitrary. However, if
$t_2 = 0$, then $\|t\|^2$ is also minimized. Because $U^Tb = t$ and $U$ is orthogonal,
$\|b\|^2 = \|t\|^2 = \|t_1\|^2 + \|t_2\|^2$, and so with $t_2 = 0$, that is, with $b = \hat\beta$, $\|\hat\beta\|^2$ is
the minimum among the norms of all least squares solutions, $\|b^*\|^2$.
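
The minimum-norm property of Theorem 5.9 can be seen numerically. The sketch below does not use the $QRU^T$ factorization of the proof; it relies instead on NumPy's SVD-based `pinv`, and it generates other least squares solutions as $\hat\beta + (I - X^+X)z$, where $I - X^+X$ projects onto the null space of $X$. The design, data, cutoff, and the vectors `z` are assumptions made for the illustration.

```python
import numpy as np

# Hypothetical rank-deficient design (third column = first + second).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

X_pinv = np.linalg.pinv(X, 1e-10)            # explicit cutoff is an assumption
beta_hat = X_pinv @ y                        # Moore-Penrose least squares solution
null_projector = np.eye(3) - X_pinv @ X      # projector onto the null space of X

norm_min = float(beta_hat @ beta_hat)        # ||beta_hat||^2
other_norms = []
for z in ([1.0, 0.0, 0.0], [0.0, -2.0, 1.0], [5.0, 5.0, 5.0]):
    b_star = beta_hat + null_projector @ np.array(z)   # another least squares solution
    assert np.allclose(X @ b_star, X @ beta_hat)       # same fitted values and residuals
    other_norms.append(float(b_star @ b_star))

print(norm_min, other_norms)
```

Every alternative solution has the same fit but a larger squared norm, since the added null-space component is orthogonal to $\hat\beta$.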

Quadratic Forms

Quadratic forms in nonnegative definite or positive definite matrices arise
often in statistical applications, especially in the analysis of linear models. The
analysis often involves the decomposition of a quadratic form in the positive
definite matrix $A$, $y^TAy$, into a sum, $y^TA_1y + y^TA_2y$, where $A_1 + A_2 = A$
and $A_1$ and $A_2$ are nonnegative definite matrices.

Cochran’s Theorems

There are various facts that are sometimes called Cochran’s theorem. The
simplest one concerns $k$ symmetric idempotent $n \times n$ matrices, $A_1, \ldots, A_k$,
that sum to the identity matrix.
Theorem 5.10 (Cochran’s theorem I)
Let $A_1, \ldots, A_k$ be symmetric idempotent $n \times n$ matrices such that

$$I_n = A_1 + \cdots + A_k.$$

Then
$$A_iA_j = 0 \text{ for all } i \ne j.$$
Proof.
For an arbitrary $j$, for some orthogonal matrix $V$, we have

$$V^TA_jV = \mathrm{diag}(I_r, 0),$$

where $r = \mathrm{rank}(A_j)$. Now

$$\begin{aligned}
I_n &= V^TI_nV\\
&= \sum_{i=1}^k V^TA_iV\\
&= \mathrm{diag}(I_r, 0) + \sum_{i \ne j} V^TA_iV,
\end{aligned}$$

which implies
$$\sum_{i \ne j} V^TA_iV = \mathrm{diag}(0, I_{n-r}).$$
Now for each $i$, $V^TA_iV$ is idempotent, and the diagonal elements of
a symmetric idempotent matrix are all nonnegative; hence the equation above
implies that for each $i \ne j$, the first $r$ diagonal elements of $V^TA_iV$ are 0.
Furthermore, since these diagonal elements are 0, all elements in the first $r$ rows
and columns are 0. We have, therefore, for each $i \ne j$,

$$V^TA_iV = \mathrm{diag}(0, B_i)$$

for some $(n - r) \times (n - r)$ symmetric idempotent matrix $B_i$. Now, for any
$i \ne j$, consider $A_iA_j$ and form $V^TA_iA_jV$. We have

$$\begin{aligned}
V^TA_iA_jV &= (V^TA_iV)(V^TA_jV)\\
&= \mathrm{diag}(0, B_i)\,\mathrm{diag}(I_r, 0)\\
&= 0.
\end{aligned}$$

Because $V$ is nonsingular, this implies the desired conclusion; that is, that
$A_iA_j = 0$ for any $i \ne j$.
We can now extend this result to an idempotent matrix in place of $I$; that
is, for an idempotent matrix $A$ with $A = A_1 + \cdots + A_k$.
Theorem 5.11 (Cochran’s theorem II)
Let $A_1, \ldots, A_k$ be $n \times n$ symmetric matrices and let

$$A = A_1 + \cdots + A_k.$$

Then any two of the following conditions imply the third one:
(a). $A$ is idempotent.
(b). $A_i$ is idempotent for $i = 1, \ldots, k$.
(c). $A_iA_j = 0$ for all $i \ne j$.
(The theorem also applies to nonsymmetric matrices if condition (c) is
augmented with the requirement that $\mathrm{rank}(A_i^2) = \mathrm{rank}(A_i)$ for all $i$. We will
restrict our attention to symmetric matrices, however, because in most
applications of these results, the matrices are symmetric.)
Proof.
First, if we assume properties (a) and (b), we can show that property (c)
follows by an argument similar to the one used in the proof of Theorem 5.10
for the special case $A = I$.
Now, let us assume properties (b) and (c) and show that property (a)
holds. With properties (b) and (c), we have

$$\begin{aligned}
AA &= (A_1 + \cdots + A_k)(A_1 + \cdots + A_k)\\
&= \sum_{i=1}^k A_iA_i + \sum_{i \ne j}\sum_{j=1}^k A_iA_j\\
&= \sum_{i=1}^k A_i\\
&= A.
\end{aligned}$$


Hence, we have property (a); that is, $A$ is idempotent.
Finally, let us assume properties (a) and (c). Property (b) follows
immediately from
$$A_i^2 = A_iA_i = A_iA = A_iAA = A_i^2A = A_i^3$$
and the fact that $A^{p+1} = A^p \Longrightarrow A$ is idempotent.
Theorem 5.12 (Cochran’s theorem IIa)
Any two of the properties (a) through (c) also imply a fourth property:
(d). $\mathrm{rank}(A) = \mathrm{rank}(A_1) + \cdots + \mathrm{rank}(A_k)$.
Proof.
We first note that any two of properties (a) through (c) imply the third one,
so we will just use properties (a) and (b). Property (a) gives

$$\mathrm{rank}(A) = \mathrm{tr}(A) = \mathrm{tr}(A_1 + \cdots + A_k) = \mathrm{tr}(A_1) + \cdots + \mathrm{tr}(A_k),$$

and property (b) states that the latter expression is $\mathrm{rank}(A_1) + \cdots + \mathrm{rank}(A_k)$,
thus yielding property (d).
There is also a partial converse: properties (a) and (d) imply the other
properties.
One of the most important special cases of Cochran’s theorem is when
$A = I$ in the sum:
$$I_n = A_1 + \cdots + A_k.$$
The identity matrix is idempotent, so if $\mathrm{rank}(A_1) + \cdots + \mathrm{rank}(A_k) = n$, all
the properties above hold. (See Gentle (2007), pages 283–285.)
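
A concrete instance of the special case $I_n = A_1 + A_2$ may help. In the sketch below (an illustration only; the choice $n = 5$, the helper name, and the rank tolerance are assumptions), $A_1 = \frac{1}{n}1_n1_n^T$ and $A_2 = I_n - A_1$, and the code checks properties (b), (c), and (d) numerically.

```python
import numpy as np

n = 5
A1 = np.ones((n, n)) / n      # projection onto the constant vector
A2 = np.eye(n) - A1           # centering matrix, so that A1 + A2 = I_n
parts = [A1, A2]


def is_symmetric_idempotent(M):
    """Check M = M^T and M M = M up to floating-point tolerance."""
    return bool(np.allclose(M, M.T) and np.allclose(M @ M, M))


all_idempotent = all(is_symmetric_idempotent(M) for M in parts)      # property (b)
products_vanish = bool(np.allclose(A1 @ A2, 0.0))                    # property (c)
ranks = [int(np.linalg.matrix_rank(M, tol=1e-10)) for M in parts]    # tol is an assumption
ranks_add_up = sum(ranks) == n                                       # property (d)

print(all_idempotent, products_vanish, ranks, ranks_add_up)
```

The same two matrices reappear in the sum-of-squares decomposition, where $A_2$ is the matrix of the quadratic form $\sum(y_i - \bar y)^2$.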
In applications of linear models, a quadratic form involving Y is often
partitioned into a sum of quadratic forms. The most important statistical
application of Cochran’s theorem is for the distribution of quadratic forms of
normally distributed random vectors.
Theorem 5.13 (Cochran’s theorem III)
Assume that $Y$ is distributed as $\mathrm{N}_d(\mu, I_d)$, and for $i = 1, \ldots, k$, let $A_i$ be a $d \times d$
symmetric matrix with rank $r_i$ such that $\sum_i A_i = I_d$. This yields a partition
of the total sum of squares $Y^TY$ into $k$ components:

$$Y^TY = Y^TA_1Y + \cdots + Y^TA_kY.$$

Then the $Y^TA_iY$ have independent noncentral chi-squared distributions $\chi^2_{r_i}(\delta_i)$
with $\delta_i = \mu^TA_i\mu$ if and only if $\sum_i r_i = d$.
Proof.
This follows from the results above and the multivariate normal distribution.
(See Gentle (2007), pages 324–325.)


The “Sum of Squares” Quadratic Form


In statistical analysis, we often compare the variability within various
subsamples with the overall variability of the full sample. This is the basic idea in
the common method called analysis of variance (AOV). The variability within
any sample is usually measured by the sum of squares of the deviations of the
elements in the sample from their overall mean, $\sum(y_i - \bar y)^2$.
This sum of squares can be expressed as a quadratic form in an idempotent
matrix. We can develop this matrix by use of the expressions for recursive
computation of the variance. The basic matrix is the Helmert matrix (see
Gentle (2007), page 308):
$$H_n = \begin{bmatrix}
1/\sqrt{n} & 1/\sqrt{n} & 1/\sqrt{n} & \cdots & 1/\sqrt{n}\\
1/\sqrt{2} & -1/\sqrt{2} & 0 & \cdots & 0\\
1/\sqrt{6} & 1/\sqrt{6} & -2/\sqrt{6} & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \cdots & -\frac{(n-1)}{\sqrt{n(n-1)}}
\end{bmatrix} \tag{5.82}$$

Note that the Helmert matrix is orthogonal:
$$H_n^TH_n = H_nH_n^T = I_n.$$
The $(n-1) \times n$ matrix below the first row of the Helmert matrix is of particular
interest. Let
$$H_n = \begin{bmatrix} (1/\sqrt{n})\,1_n^T\\ K_{n-1} \end{bmatrix}. \tag{5.83}$$
First note that the two partitions are orthogonal to each other:
$$(1/\sqrt{n})\,1_n^TK_{n-1}^T = 0. \tag{5.84}$$
(This also follows from the orthogonality of $H_n$ of course.)
Now let
$$A = K_{n-1}^TK_{n-1}, \tag{5.85}$$
that is,
$$A = \begin{bmatrix}
\frac{n-1}{n} & \frac{-1}{n} & \cdots & \frac{-1}{n}\\
\frac{-1}{n} & \frac{n-1}{n} & \cdots & \frac{-1}{n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{-1}{n} & \frac{-1}{n} & \cdots & \frac{n-1}{n}
\end{bmatrix}. \tag{5.86}$$
Note that, for a sample of size $n$, $A$ is the matrix of the quadratic form
that yields $\sum(y_i - \bar y)^2$:
$$y^TAy = \sum(y_i - \bar y)^2. \tag{5.87}$$

We can form similar matrices for subsamples so as to decompose a sum of
squares.
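
The construction in equations (5.82) through (5.87) can be reproduced directly. The sketch below builds the Helmert matrix row by row from the pattern shown in (5.82); the function name `helmert` and the sample vector `y` are assumptions introduced for the illustration.

```python
import numpy as np


def helmert(n):
    """Build the n x n Helmert matrix following the row pattern of (5.82)."""
    H = np.zeros((n, n))
    H[0, :] = 1.0 / np.sqrt(n)
    for k in range(1, n):
        # Row k (0-based): k copies of 1/sqrt(k(k+1)), then -k/sqrt(k(k+1)), then zeros.
        scale = np.sqrt(k * (k + 1))
        H[k, :k] = 1.0 / scale
        H[k, k] = -k / scale
    return H


n = 4
H = helmert(n)
K = H[1:, :]                    # the (n-1) x n block below the first row, as in (5.83)
A = K.T @ K                     # equation (5.85)

y = np.array([2.0, 4.0, 4.0, 6.0])             # hypothetical sample
quad_form = float(y @ A @ y)                   # y^T A y
sum_sq_dev = float(np.sum((y - y.mean()) ** 2))

print(np.allclose(H @ H.T, np.eye(n)), quad_form, sum_sq_dev)
```

Here `A` coincides with $I_n - \frac{1}{n}1_n1_n^T$, the matrix displayed in (5.86), and the quadratic form reproduces the sum of squared deviations in (5.87).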


Example 5.28 one-way fixed-effects AOV model


Consider the linear model

$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n, \tag{5.88}$$

where we assume that $\mathrm{E}(\epsilon_{ij}) = 0$ and $\mathrm{V}(\epsilon_{ij}) = \sigma^2$ for all $i, j$, and
$\mathrm{Cov}(\epsilon_{ij}, \epsilon_{i'j'}) = 0$ if $i \ne i'$ or $j \ne j'$. This can be expressed in the form of the linear
model (5.66), $Y = X\beta + E$, where $\beta = (\mu, \alpha_1, \ldots, \alpha_m)$ and
$$X = \begin{bmatrix}
1 & 1 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 1 & 0 & \cdots & 0\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 1
\end{bmatrix}. \tag{5.89}$$

Letting
$$\bar Y_i = \sum_{j=1}^n Y_{ij}/n \tag{5.90}$$

and
$$\bar Y = \sum_{i=1}^m \bar Y_i/m, \tag{5.91}$$

we may form two sums of squares

$$\mathrm{SSA} = n\sum_{i=1}^m (\bar Y_i - \bar Y)^2 \tag{5.92}$$

and
$$\mathrm{SSE} = \sum_{i=1}^m\sum_{j=1}^n (Y_{ij} - \bar Y_i)^2, \tag{5.93}$$

which have the property that

$$\sum_{i=1}^m\sum_{j=1}^n (Y_{ij} - \bar Y)^2 = \mathrm{SSA} + \mathrm{SSE}. \tag{5.94}$$

Both SSA and SSE can be expressed as quadratic forms in matrices similar
to $K_{n-1}^TK_{n-1}$, where $K_{n-1}$ is given in equation (5.83). This is what you are
asked to do in Exercise 5.7.
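
The decomposition (5.94) can be verified on a small balanced layout. The data below are invented for the illustration ($m = 3$ groups with $n = 4$ observations each); the sketch computes SSA and SSE from their definitions in (5.92) and (5.93) rather than as quadratic forms, so it does not anticipate the exercise.

```python
import numpy as np

# Hypothetical balanced data: row i holds the n observations Y_i1, ..., Y_in.
Y = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [5.0, 5.0, 7.0, 7.0]])
m, n = Y.shape

group_means = Y.mean(axis=1)          # the group means, equation (5.90)
grand_mean = group_means.mean()       # the overall mean, equation (5.91)

ssa = float(n * np.sum((group_means - grand_mean) ** 2))      # equation (5.92)
sse = float(np.sum((Y - group_means[:, None]) ** 2))          # equation (5.93)
total = float(np.sum((Y - grand_mean) ** 2))                  # left side of (5.94)

print(ssa, sse, total)
```

For these numbers the adjusted total sum of squares splits exactly into the between-group and within-group parts.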


The question of interest in a model such as this is whether the $\alpha_i$ are
different from one another; that is, whether or not it is meaningful to group
the $Y_{ij}$ based on the $i$.
Example 5.29 estimating the effects in a one-way fixed-effects AOV
model (continuation of Example 5.28)
The individual $\alpha_i$ are not U-estimable. We can see this because in this case,
$l = (0, \ldots, 1, \ldots)$ and so $l$ is not in the row space of $X$ in (5.89). (This argument
follows from the condition in equation (5.79).) For $\mu + \alpha_i$, we see that $l = (1, \ldots, 1, \ldots)$
and so $l$ is in the row space of $X$; hence $\mu + \alpha_i$ is estimable, and its
UMVUE is $\bar Y_i$. Also, $\alpha_i - \alpha_j$ for $i \ne j$ is estimable because it corresponds to
an $l$ with first element 0, and all other elements 0 except for two, one of which
is 1 and the other is $-1$. Such vectors are called contrasts.
For any linear combination of $\beta = (\mu, \alpha_1, \ldots, \alpha_m)$ that is estimable, say
$l^T\beta$, we see that the a.s. unique UMVUE is $l^T\hat\beta$, where $\hat\beta = (X^TX)^+X^TY$
(equation (5.74)).
Although the form of the AOV model (5.88) is the one that is commonly
used, we see that a closely related model could be formed by restricting this
model so that $\sum_i \alpha_i = 0$. This related model is $Y_{ij} = \theta_i + \epsilon_{ij}$. The $\theta_i$ in this
restricted model are U-estimable.
Notice that so far we have not assumed any specific family of distributions
for the AOV model. We have unique UMVUEs. To answer the question posed
above of whether the αi are actually different from one another, however,
we need a basis for a statistical test. We might attempt some kind of non-
parametric test based on rankings, but in the next example, we will make the
common assumption that the random components have a normal distribution.
Note that the previous assumption of 0 covariances gives independence if we
assume normality. Cochran’s theorem tells us what the distributions are.
Example 5.30 distribution of the sums of squares in a one-way fixed-
effects AOV model (continuation of Example 5.28)
If we assume that $\epsilon_{ij} \sim \mathrm{N}(0, \sigma^2)$, we know the distributions of functions of SSA
and SSE, and on that basis we can assess the significance of the $\alpha_i$. We have
$$\frac{1}{\sigma^2}\left(\mathrm{SSA} - \frac{n}{m-1}\sum_{i=1}^m\left(\alpha_i - \sum_{i=1}^m \alpha_i/m\right)^2\right) \sim \chi^2_{m-1} \tag{5.95}$$

and
$$\frac{1}{\sigma^2}\mathrm{SSE} \sim \chi^2_{m(n-1)}. \tag{5.96}$$
(Exercise 5.8.)
The UMVUE of $\sigma^2$ is $\mathrm{SSE}/(m(n-1))$. Note that the UMVUE of $\sigma^2$ is
the same as the general result given in equation (5.77). (Exercise 5.9.) The
UMVUE is consistent in $n$ for $m$ fixed, and is consistent in $m$ for $n$ fixed.


You are to show this in Exercise 5.10. Compare this with the MLE of σ 2 in
Example 6.27 in Chapter 6.
The model in equation (5.88) is called the one-way AOV model. If the αi
in this model are assumed to be constants, it is called a “fixed-effects model”.
A fixed-effects model is also sometimes called “model I”. Now let’s consider a
variant of this called a “random-effects model” or “model II”, because the αi
in this model are assumed to be iid random variables.
Example 5.31 UMVUEs of the variances in the one-way random-
effects AOV model
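Consider the linear model

$$Y_{ij} = \mu + \delta_i + \epsilon_{ij}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n, \tag{5.97}$$

where the $\delta_i$ are identically distributed with $\mathrm{E}(\delta_i) = 0$, $\mathrm{V}(\delta_i) = \sigma_\delta^2$, and
$\mathrm{Cov}(\delta_i, \delta_{\tilde i}) = 0$ for $i \ne \tilde i$, and the $\epsilon_{ij}$ are independent of the $\delta_i$ and are
identically distributed with $\mathrm{E}(\epsilon_{ij}) = 0$, $\mathrm{V}(\epsilon_{ij}) = \sigma_\epsilon^2$, and $\mathrm{Cov}(\epsilon_{ij}, \epsilon_{\tilde i\tilde j}) = 0$
for either $i \ne \tilde i$ or $j \ne \tilde j$.
An important difference between the random-effects model and the fixed-effects
model is that in the random-effects model, we do not have independence of
the observables. We have
$$\mathrm{Cov}(Y_{ij}, Y_{\tilde i\tilde j}) = \begin{cases}
\sigma_\delta^2 + \sigma_\epsilon^2 & \text{for } i = \tilde i,\ j = \tilde j,\\
\sigma_\delta^2 & \text{for } i = \tilde i,\ j \ne \tilde j,\\
0 & \text{for } i \ne \tilde i.
\end{cases} \tag{5.98}$$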

A model such as this may be appropriate when there are a large number
of possible treatments and $m$ of them are chosen randomly and applied to
experimental units whose responses $Y_{ij}$ are observed. While in the fixed-effects
model (5.88), we are interested in whether $\alpha_1 = \cdots = \alpha_m = 0$, in the
random-effects model, we are interested in whether $\sigma_\delta^2 = 0$, which would result in a
similar practical decision about the treatments.
In the model (5.97) the variance of each $Y_{ij}$ is $\sigma_\delta^2 + \sigma_\epsilon^2$, and our interest in
using the model is to make inference on the relative sizes of the components of
the variance $\sigma_\delta^2$ and $\sigma_\epsilon^2$. The model is sometimes called a “variance components
model”.
Let us suppose now that $\delta_i \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_\delta^2)$, where $\sigma_\delta^2 \ge 0$, and $\epsilon_{ij} \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_\epsilon^2)$,
where as usual $\sigma_\epsilon^2 > 0$. This will allow us to determine exact sampling
distributions of the relevant statistics.
We transform the model using Helmert matrices $H_m$ and $H_n$ as in
equation (5.82).
Let
$$Y = \begin{bmatrix} Y_{11} & \cdots & Y_{1n}\\ \vdots & & \vdots\\ Y_{m1} & \cdots & Y_{mn} \end{bmatrix};\quad
\delta = \begin{pmatrix} \delta_1\\ \vdots\\ \delta_m \end{pmatrix};\quad \text{and}\quad
\epsilon = \begin{bmatrix} \epsilon_{11} & \cdots & \epsilon_{1n}\\ \vdots & & \vdots\\ \epsilon_{m1} & \cdots & \epsilon_{mn} \end{bmatrix}.$$


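We now write the original model (with the mean term $\mu$ omitted) as

$$Y = \delta 1_n^T + \epsilon.$$

Now, for the transformations. Let

$$Z = H_mYH_n^T,$$
$$\tilde\delta = H_m\delta,$$
and
$$\tilde\epsilon = H_m\epsilon H_n^T.$$
We first of all note that the transformations are all nonsingular and

$$Z = \tilde\delta 1_n^TH_n^T + \tilde\epsilon.$$

Next, we see because of the orthonormality of the Helmert matrices that
the distributions of $\tilde\delta$ and $\tilde\epsilon$ are the same as those of $\delta$ and $\epsilon$ and they are still
independent. Furthermore, because $H_n1_n = (\sqrt{n}, 0, \ldots, 0)^T$, only the first
column of $Z$ involves $\tilde\delta$; the $Z_{ij}$ are independent, and we have
$$Z_{i1} \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, n\sigma_\delta^2 + \sigma_\epsilon^2), \quad \text{for } i = 1, \ldots, m$$

and
$$Z_{ij} \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_\epsilon^2), \quad \text{for } i = 1, \ldots, m;\ j = 2, \ldots, n.$$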
To continue with the analysis, we follow the same steps as in Example 5.28,
and get the same decomposition of the “adjusted total sum of squares” as in
equation (5.94):
$$\sum_{i=1}^m\sum_{j=1}^n (Z_{ij} - \bar Z)^2 = \mathrm{SSA} + \mathrm{SSE}. \tag{5.99}$$

Again, we get chi-squared distributions, but the distribution involving SSA
is not the same as in expression (5.95) for the fixed-effects model.
Forming
$$\mathrm{MSA} = \mathrm{SSA}/(m - 1)$$
and
$$\mathrm{MSE} = \mathrm{SSE}/(m(n - 1)),$$
we see that
$$\mathrm{E}(\mathrm{MSA}) = n\sigma_\delta^2 + \sigma_\epsilon^2$$
and
$$\mathrm{E}(\mathrm{MSE}) = \sigma_\epsilon^2.$$
Unbiased estimators of $\sigma_\delta^2$ and $\sigma_\epsilon^2$ are therefore

$$s_\delta^2 = (\mathrm{MSA} - \mathrm{MSE})/n \tag{5.100}$$


and
$$s_\epsilon^2 = \mathrm{MSE}, \tag{5.101}$$
and we can also see that these are UMVUEs.
Now we note something that might at first glance be surprising: $s_\delta^2$ in
equation (5.100) may be negative. This occurs if $\mathrm{MSA} < \mathrm{MSE}$.
This will be the case if the variation among $Y_{ij}$ for a fixed $i$ is relatively large
compared to the variation among $\bar Y_i$ (or similarly, if the variation among $Z_{ij}$
for a fixed $i$ is relatively large compared to the variation among $\bar Z_i$).
Compare this with the MLEs in Example 6.29 in Chapter 6.
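
The possibility of a negative estimate can be seen in a small computation. From (5.100), $s_\delta^2 = (\mathrm{MSA} - \mathrm{MSE})/n$ is negative exactly when MSA is smaller than MSE. The sketch below uses only the Python standard library; the function name `variance_components` and the data set `noisy` (nearly equal group means, large within-group spread) are assumptions invented for the illustration.

```python
def variance_components(groups):
    """Return (s2_delta, s2_eps) as in equations (5.100) and (5.101).

    `groups` is a list of m lists, each holding n observations (balanced layout).
    """
    m, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / m
    ssa = n * sum((mean_i - grand) ** 2 for mean_i in means)
    sse = sum((y_ij - mean_i) ** 2
              for g, mean_i in zip(groups, means) for y_ij in g)
    msa = ssa / (m - 1)
    mse = sse / (m * (n - 1))
    return (msa - mse) / n, mse


# Hypothetical data: group means are close, within-group variation is large.
noisy = [[0.0, 10.0, 0.0, 10.0],
         [1.0, 9.0, 1.0, 11.0],
         [0.0, 10.0, 2.0, 8.0]]
s2_delta, s2_eps = variance_components(noisy)
print(s2_delta, s2_eps)
```

For these data the unbiased estimate of $\sigma_\delta^2$ is below zero even though the parameter itself cannot be, which is the price of unbiasedness here.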

Predictions in the Linear Model

Given a vector $x_0$, using $\hat\beta$ in place of $\beta$ in equation (5.64), with $E$ set to
$\mathrm{E}(E) = 0$, we have the predicted value of $Y$ given $x_0$:

$$\hat Y_0 = \hat\beta^Tx_0 = ((X^TX)^+X^Ty)^Tx_0. \tag{5.102}$$

If $x_0 \in \mathrm{span}(X^T)$, then from Theorem 5.7, $(b^*)^Tx_0 = \hat\beta^Tx_0$, so in this case the
predicted value of $Y$ is invariant to choice of the generalized inverse.
In the model (5.66) corresponding to a set of $n$ observations on the
model (5.64), we have predicted values of the response $Y$ at all rows within
$X$:

$$\hat Y = X\hat\beta = X(X^TX)^+X^TY. \tag{5.103}$$

From equation (3.42), we see that this has the minimum MSE of any function
of $X$.
The idempotent projection matrix $X(X^TX)^+X^T$ is called the “hat
matrix” because given $Y$, it provides $\hat Y$. (See page 795 for properties of projection
matrices.)
We see from Definition 1.46 (page 116) that $\hat Y$ is the projection of $Y$ onto
the column space of $X$. (This is a slightly different meaning of the word
“projection”, but obviously the meanings are related.) From Theorem 1.64
we see that the “residual vector” $Y - \hat Y$ is orthogonal to the columns of $X$;
that is, $\mathrm{Cov}(Y - \hat Y, x) = 0$ for any column $x$ of $X$, and since $\hat Y$ is a linear
combination of the columns of $X$, $\mathrm{Cov}(Y - \hat Y, \hat Y) = 0$. If we assume a normal
distribution for $\epsilon$, then 0 covariance implies independence.
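
The algebraic properties of the hat matrix are easy to check on a small example. The sketch below is an illustration under assumptions: the design `X` and response `y` are invented, and `X @ pinv(X)` is used as a numerically convenient way of forming $X(X^TX)^+X^T$. The orthogonality checked here is the geometric one for a realized sample, $X^T(y - \hat y) = 0$.

```python
import numpy as np

# Hypothetical full-rank design and response.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

hat = X @ np.linalg.pinv(X)     # equals X (X^T X)^+ X^T
y_fit = hat @ y                 # predicted values at the rows of X, as in (5.103)
resid = y - y_fit               # residual vector

is_idempotent = bool(np.allclose(hat @ hat, hat))
is_symmetric = bool(np.allclose(hat, hat.T))
resid_orthogonal_to_X = bool(np.allclose(X.T @ resid, 0.0))
resid_orthogonal_to_fit = bool(np.isclose(resid @ y_fit, 0.0))
trace_equals_rank = bool(np.isclose(np.trace(hat), 2.0))   # rank(X) = 2 here

print(is_idempotent, is_symmetric, resid_orthogonal_to_X,
      resid_orthogonal_to_fit, trace_equals_rank)
```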

5.5.2 Estimation in Survey Samples of Finite Populations

A substantial proportion of all applications of statistics deals with sample
surveys in finite populations. Some aspects of this kind of application distinguish
it from other areas of applied statistics. Särndal et al. (1997) provide a general
coverage of the theory and methods. Valliant et al. (2000) provide a different
perspective on some of the particular issues of inference in finite populations.

Finite Populations

We think of a finite population as being a finite set P = {(1, y1 ), . . . , (N, yN )}.


Our interest will be in making inferences about the population using a sample
S = {(L1 , X1 ), . . . , (Ln , Xn )}. We will also refer to X = {X1 , . . . , Xn } as
the “sample”. In discussions of sampling it is common to use n to denote
the size of the sample and N to denote the size of the population. Another
common
PN notation used in sampling is Y to denote the population total, Y =
i=1 yi . Estimation of the total is one of the most basic objectives in sampling
applications.
The parameter that characterizes the population is θ = (y1 , . . . , yN ). The
parameter space, Θ, is the subspace of IRN containing all possible values of
the yi .
There are two approaches to the analysis of the problem. In one, which
is the more common and which we will follow, P is essentially the sample
space. In another approach P or θ is thought of as some random sample from
a sample space or parameter space, called a “superpopulation”.
The sample is completely determined by the set $L_S = \{i_1, \ldots, i_n\}$ of
indexes of $P$ that correspond to elements in $X$. For analysis of sampling
methods, we define an indicator
\[
  I_i = \begin{cases} 1 & \text{if } i \in L_S \\ 0 & \text{otherwise.} \end{cases}
\]
“Sampling” can be thought of as selecting the elements of $L_S$, that is, the
labels of the population elements.
Probability-based inferences about P are determined by the method of
selection of S. This determines the probability of getting any particular S,
which we will denote by p(S). If p(S) is constant for all S, we call the selected
sample a simple random sample.
A sample may be collected without replacement or with replacement. (The
meanings of these are just what the words mean. In sampling without replace-
ment, the elements of S are distinct.) Sampling with replacement is generally
easier to analyze, because it is the same as taking a random sample from a
discrete uniform distribution. Sampling without replacement is more common
and it is what we will assume throughout.
There are many variations on the method of collecting a sample. Both a
general knowledge of the population and some consideration of the mechani-
cal aspects of collecting the sample may lead to the use of stratified sampling,
cluster sampling, multi-stage sampling, systematic sampling, or other varia-
tions.


Estimation

We are interested in “good” estimators, specifically UMVUEs, of estimable
functions of $\theta$. An interesting estimable function of $\theta$ is
$Y = \sum_{i=1}^N \theta_i$.
One of the most important results is the following theorem.
Theorem 5.14
(i) if p(S) > 0 for all S, then the set of order statistics X(1) , . . . , X(n) is
complete for all θ ∈ Θ.
and
(ii) if p(S) is constant for all S, then the order statistics X(1) , . . . , X(n) are
sufficient for all θ ∈ Θ.
This theorem is somewhat similar to Corollary 3.1.1, which applied to the
family of distributions dominated by Lebesgue measure. The sufficiency is
generally straightforward, and we expect it to hold in any iid case.
The completeness is a little more complicated, and the proof of Theorem
3.13 in MS2 is worth looking at. The set of order statistics may be complete
in some family, such as the family of distributions dominated by Lebesgue
measure, but may not be complete in some subfamily, such as the family of
normal distributions with mean 0.
After we have (i) and (ii), we have
(iii): For any estimable function of θ, its unique UMVUE is the unbiased
estimator T (X1 , . . . , Xn ) that is symmetric in its arguments. (The symmetry
makes the connection to the order statistics.)
Example 5.32 UMVUE of population total using simple random sample
Consider estimation of $Y = g(\theta) = \sum_{i=1}^N y_i$ from the simple random
sample $X_1, \ldots, X_n$. We first note that
\[
  \widehat{Y} = \frac{N}{n} \sum_{i \in L_S} y_i
              = \frac{N}{n} \sum_{i=1}^N I_i y_i
\]
is unbiased for $Y$:
\[
  \mathrm{E}(\widehat{Y}) = \frac{N}{n} \sum_{i=1}^N y_i \mathrm{E}(I_i)
                          = \sum_{i=1}^N y_i .
\]
From Theorem 5.14, we can see easily that $\widehat{Y} = N\bar{y}$ is the UMVUE
of $Y$.


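Now we consider the variance of $\widehat{Y}$. First, note that
\[
  \mathrm{V}(I_i) = \frac{n}{N}\left(1 - \frac{n}{N}\right)
\]
(it’s Bernoulli), and for $i \neq j$,
\[
  \mathrm{Cov}(I_i, I_j) = \mathrm{E}(I_i I_j) - \mathrm{E}(I_i)\mathrm{E}(I_j)
                         = \frac{n(n-1)}{N(N-1)} - \frac{n^2}{N^2}.
\]
Hence,
\begin{align*}
  \mathrm{V}(\widehat{Y})
    &= \frac{N^2}{n^2} \mathrm{V}\left(\sum_{i=1}^N I_i y_i\right) \\
    &= \frac{N^2}{n^2} \left(\sum_{i=1}^N y_i^2 \mathrm{V}(I_i)
         + 2 \sum_{1 \le i < j \le N} y_i y_j \mathrm{Cov}(I_i, I_j)\right) \\
    &= \frac{N}{n}\left(1 - \frac{n}{N}\right)
       \left(\sum_{i=1}^N y_i^2
         - \frac{2}{N-1} \sum_{1 \le i < j \le N} y_i y_j\right) \\
    &= \frac{N^2}{n}\left(1 - \frac{n}{N}\right)
       \frac{1}{N-1} \sum_{i=1}^N \left(y_i - \frac{Y}{N}\right)^2.
       \tag{5.104}
\end{align*}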

We see that the variance of $\widehat{Y}$ is composed of three factors, an
expansion factor $N^2/n$, a finite population correction factor $(1 - n/N)$,
and the variance of a selection from a finite population,
\[
  \sigma^2 = \frac{1}{N-1} \sum_{i=1}^N \left(y_i - \frac{Y}{N}\right)^2.
  \tag{5.105}
\]
The sample variance $S^2$ is unbiased for $\sigma^2$, and so from this we have
immediately the UMVUE of $\mathrm{V}(\widehat{Y})$ (Exercise 5.11).
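The unbiasedness of the estimator of the total and the variance expression
(5.104) can be checked by brute force on a small population. The following is
a minimal sketch, not part of the original development: the population values
and the function names are hypothetical, and exact rational arithmetic is used
so that the comparison is not blurred by rounding. It enumerates every simple
random sample of size n without replacement, each with the same probability.

```python
from fractions import Fraction
from itertools import combinations


def srs_total_moments(y, n):
    """Exact mean and variance of (N/n) * (sample sum) over all equally
    likely samples of size n drawn without replacement from y."""
    N = len(y)
    estimates = [Fraction(N, n) * sum(s) for s in combinations(y, n)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean, var


def variance_formula(y, n):
    """Right-hand side of equation (5.104)."""
    N = len(y)
    Y = sum(y)
    sigma2 = sum((Fraction(yi) - Fraction(Y, N)) ** 2 for yi in y) / (N - 1)
    return Fraction(N * N, n) * (1 - Fraction(n, N)) * sigma2


y = [3, 7, 8, 12, 20]  # hypothetical population values
n = 2
mean, var = srs_total_moments(y, n)
print(mean == sum(y), var == variance_formula(y, n))
```

Enumeration is feasible only for very small N, of course; the point is merely
to see the three factors of (5.104) reproduce the exact design variance.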

Horvitz-Thompson Estimation

The properties of any statistic derived from a sample $X_1, \ldots, X_n$ depend
on the sampling design; that is, on how the items in the sample were selected.
The two main properties of the design are the probability that a specific
population item, say $y_i$, is selected, and the probability that two specific
population items, say $y_i$ and $y_j$, are both selected. Probabilities of
combinations of larger sets may also be of interest, but we can work out simple
expectations and variances just based on these two kinds of probabilities.
Let $\pi_i$ be the probability that $y_i$ is included in the sample, and let
$\pi_{ij}$ be the probability that both $y_i$ and $y_j$ are included.

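If $\pi_i > 0$ for all $i$, the Horvitz-Thompson estimator of the population
total is
\[
  \widehat{Y}_{\mathrm{HT}} = \sum_{i \in L_S} \frac{y_i}{\pi_i}. \tag{5.106}
\]
It is easy to see that $\widehat{Y}_{\mathrm{HT}}$ is unbiased for $Y$:
\begin{align*}
  \mathrm{E}\left(\widehat{Y}_{\mathrm{HT}}\right)
    &= \mathrm{E}\left(\sum_{i \in L_S} \frac{y_i}{\pi_i}\right) \\
    &= \sum_{i=1}^N \left(\frac{y_i}{\pi_i}\right) \pi_i \\
    &= \sum_{i=1}^N y_i .
\end{align*}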
The variance of the Horvitz-Thompson estimator depends on the $\pi_{ij}$ as
well as the $\pi_i$:
\[
  \mathrm{V}\left(\widehat{Y}_{\mathrm{HT}}\right)
    = \sum_{i=1}^N \sum_{j=i+1}^N (\pi_i \pi_j - \pi_{ij})
      \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2
  \tag{5.107}
\]
(Exercise 5.12). Expressions for other sampling estimators are often shown in
a similar manner.
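As an illustration of equations (5.106) and (5.107), the sketch below evaluates
the Horvitz-Thompson estimator over every possible sample of a small
population. It assumes simple random sampling without replacement, for which
$\pi_i = n/N$ and $\pi_{ij} = n(n-1)/(N(N-1))$; the population values and the
function names are hypothetical. Under that design the estimator reduces to
the estimator of Example 5.32, so its variance should agree with (5.104).

```python
from fractions import Fraction
from itertools import combinations


def horvitz_thompson(sample_idx, y, pi):
    """Horvitz-Thompson estimate of the total, equation (5.106)."""
    return sum(Fraction(y[i]) / pi[i] for i in sample_idx)


def ht_variance(y, pi, pi2):
    """Equation (5.107); pi2[i][j] is the joint inclusion probability."""
    N = len(y)
    return sum(
        (pi[i] * pi[j] - pi2[i][j])
        * (Fraction(y[i]) / pi[i] - Fraction(y[j]) / pi[j]) ** 2
        for i in range(N)
        for j in range(i + 1, N)
    )


# Hypothetical population; inclusion probabilities for simple random sampling.
y = [3, 7, 8, 12, 20]
N, n = len(y), 2
pi = [Fraction(n, N)] * N
pi2 = [[Fraction(n * (n - 1), N * (N - 1))] * N for _ in range(N)]

# Every sample of size n is equally likely under this design.
estimates = [horvitz_thompson(s, y, pi) for s in combinations(range(N), n)]
mean = sum(estimates) / len(estimates)
var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
print(mean == sum(y), var == ht_variance(y, pi, pi2))
```

For an unequal-probability design only `pi` and `pi2` would change, and the
enumeration would have to weight each sample by its probability p(S).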
An important approximation for working out variances of more compli-
cated sampling estimators is linearization, especially when the estimator in-
volves a ratio.

Notes and Further Reading


Most of the material in this chapter is covered in MS2 Chapter 3 and Section
4.5, and in TPE2 Chapter 2.

Unbiasedness
The property of unbiasedness for point estimators was given a solid but pre-
liminary treatment by Halmos (1946).
Unbiasedness has a heuristic appeal, although people will sometimes ques-
tion its relevance by pointing out that it provides no guarantee of the goodness
of an estimator in a single set of data. That argument, however, could apply to
most measures of the quality of an estimator. Similar types of arguments could
bring into question any consideration of asymptotic properties of statistical
procedures.
Unbiasedness is particularly useful when the loss is squared-error, because
in that case unbiasedness may lead to uniformly minimum risk estimators. For
absolute-error loss functions, a corresponding approach would be to require
median unbiasedness.


Fisher Efficient Estimators and Exponential Families

Fisher efficient estimators occur only in exponential families, and there is
always one in an exponential family. This fact had been known for some time,
but the first rigorous proof was given by Wijsman (1973).

U-Statistics

The fundamental paper by Hoeffding (1948) considered the asymptotic normality
of certain unbiased point estimators and introduced the class of estimators
that he named U-statistics. Serfling (1980) provides an extensive discussion
of U-statistics, as well as V-statistics. The statement and proof of
Theorem 5.3 and the use of the conditional kernels $h_k$ as in equation (5.48)
follow Serfling. Kowalski and Tu (2008) consider several applications of
U-statistics in a variety of settings.

Exercises
5.1. Show that the estimator (5.3) in Example 5.1 is the UMVUE of π. (Note
that there are three things to show: (t − 1)/(N − 1) is unbiased, it has
minimum variance among all unbiased estimators, and it is unique —
“the” implies uniqueness.)
5.2. Consider the problem of using a sample of size 1 for estimating
$g(\theta) = e^{-3\theta}$ where $\theta$ is the parameter in a Poisson
distribution.
a) Show that $T(X) = (-2)^X$ is unbiased for $g(\theta)$.
b) Show that $T(X) = (-2)^X$ is a UMVUE of $g(\theta)$.
c) What is wrong with this estimator?
5.3. Show that the estimators (5.11) and (5.7) are the same.
5.4. Show that the h(T )s in Example 5.6 are unbiased for the g(θ)s given.
5.5. Define an alternative kernel for U-statistic that is unbiased for the covari-
ance in Example 5.19; that is, instead of the kernel in equation (5.37),
give a kernel similar to that in equation (5.46). Show that the resulting
U-statistic is unbiased for the covariance.
5.6. In the setup of model (5.64), show that the LSE
$\|Y - X\widehat{\beta}\|^2/(n - p)$ is unbiased for $\sigma^2$.
5.7. Let $X_{ij} = \mu + \alpha_i + \epsilon_{ij}$, $i = 1, \ldots, m$,
$j = 1, \ldots, n$, where the $\alpha_i$’s and $\epsilon_{ij}$’s are
independent random variables, $\alpha_i \sim \mathrm{N}(0, \sigma_\alpha^2)$,
$\epsilon_{ij} \sim \mathrm{N}(0, \sigma_\epsilon^2)$, and $\mu$,
$\sigma_\alpha^2$, and $\sigma_\epsilon^2$ are unknown parameters. Let
\[
  \overline{X}_i = \sum_{j=1}^n X_{ij}/n,
\]
\[
  \overline{X} = \sum_{i=1}^m \overline{X}_i/m,
\]
\[
  \mathrm{MSA} = n \sum_{i=1}^m (\overline{X}_i - \overline{X})^2/(m - 1),
\]
and
\[
  \mathrm{MSE} = \sum_{i=1}^m \sum_{j=1}^n (X_{ij} - \overline{X}_i)^2/(m(n - 1)).
\]
Express MSA and MSE as quadratic forms using parts of Helmert matrices
and use Cochran’s theorem to show that they are independent.
5.8. Show that the quantities in expressions (5.95) and (5.96) have the chi-
squared distributions claimed.
5.9. Show that the UMVUE of σ 2 , SSE/(m(n − 1)), given in Example 5.28
is the same as the UMVUE of σ 2 for the general linear model given in
equation (5.77).
Hint: Write the model given in equation (5.88) in the form of the general
linear model in equation (5.67).
5.10. Suppose $X_{ij} \overset{\mathrm{iid}}{\sim} \mathrm{N}(\mu_i, \sigma^2)$
for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. (Compare the
one-way AOV model of Examples 5.28, 5.29, and 5.30.)
a) Determine the UMVUE Tmn (X) of σ 2 .
b) Show that Tmn (X) is consistent in mean squared error for σ 2 as m →
∞ and n remains fixed.
c) Show that Tmn (X) is consistent in mean squared error for σ 2 as n →
∞ and m remains fixed.
5.11. Show that the sample variance $S^2$ is the UMVUE of $\sigma^2$ in
equation (5.105) of Example 5.32. Hence, determine the UMVUE of
$\mathrm{V}(\widehat{Y})$.
5.12. Show that the variance of the Horvitz-Thompson estimator is as shown in
equation (5.107), for given πi and πij . This is tedious, but it requires very
little other than “advanced arithmetic” and simple properties of variances
of sums.

6
Statistical Inference Based on Likelihood

The concepts of probability theory can be applied to statistical analyses in a
very straightforward manner: we assume that observed events are governed
by some data-generating process that depends on a probability distribution
P , and our observations of those events can be used to make inferences about
the probability distribution. The various ways that we use the observations
to make those inferences constitute the main body of statistical theory. One
of the general approaches that I outlined in Section 3.2 involves the use of a
likelihood function. We considered this approach briefly in Section 3.2.1. In
this chapter, we will explore the use of likelihood in statistical inference more
fully. In this chapter, the emphasis will be on estimation, and in Chapter 7,
we will consider use of likelihood in testing statistical hypotheses.
Although methods based on the likelihood may not have the logical ap-
peal of methods based on a decision-theoretic approach, they do have an intu-
itive appeal. More importantly, estimators and tests based on this approach
have a number of desirable mathematical properties, especially asymptotic
properties. Methods based on maximizing the likelihood are grounded on the
likelihood principle.
We begin with some general definitions and notation for methods based
on the likelihood principle, and then look at specific applications.

6.1 The Likelihood Function and Its Use in Statistical Inference

Definition 6.1 (likelihood function)
Given a sample $x_1, \ldots, x_n$ from distributions with probability densities
$p_i(x)$ with respect to a common $\sigma$-finite measure, the likelihood
function is defined as
\[
  L_n(p_i ; x) = c \prod_{i=1}^n p_i(x_i), \tag{6.1}
\]
where $c \in \mathbb{R}_+$ is any constant independent of the $p_i$.
It is common to speak of $L_n(p_i ; X)$ with $c = 1$ as “the” likelihood
function, and in the following, we will not write the $c$.
Methods based on the likelihood function are often chosen because of their
asymptotic properties, and so it is common to use the n subscript as in equa-
tion (6.1); in the following, however, we will usually find it convenient to drop
the n.
As we generally do in discussing methods of statistical inference, in some
cases, we will view the sample x1 , . . . , xn as a set of constants. In cases when
we want to consider the probabilistic or statistical properties of the statistical
methods, we will view the observations as a vector of random variables.
In equation (6.1), the domain of the likelihood function is some class of
distributions specified by their probability densities, P = {pi (x)}, where all
PDFs are dominated by a common σ-finite measure. In applications, often
the PDFs are of a common parametric form, so equivalently, we can think of
the domain of the likelihood function as being a parameter space, say Θ. In
that case, the family of densities can be written as P = {pθ (x)} where θ ∈ Θ,
the known parameter space. It is usually more convenient to write pθ (x) as
p(x ; θ), and we often write the likelihood function (6.1) as
\[
  L(\theta ; x) = \prod_{i=1}^n p(x_i ; \theta). \tag{6.2}
\]
Although in equation (6.2), we have written $L(\theta ; x)$, the expression
$L(p_\theta ; x)$ may be more appropriate because it reminds us of an essential
ingredient in the likelihood, namely a PDF.

What Likelihood Is Not

The differences between a likelihood and a PDF are illustrated clearly in
Example 1.5 on page 20. A likelihood is neither a probability nor a probability
density. Notice, for example, that while the definite integrals over
$\mathbb{R}_+$ of both PDFs in Example 1.5 are 1, the definite integrals over
$\mathbb{R}_+$ of the likelihood (1.21) in Example 1.5 are not the same, as we
can easily see from the plots on the right side of Figure 1.2.
It is not appropriate to refer to the “likelihood of an observation”. We use
the term “likelihood” in the sense of the likelihood of a model or the likelihood
of a distribution given observations.

The Log-Likelihood Function

The log-likelihood function,
\[
  l_L(\theta ; x) = \log L(\theta ; x), \tag{6.3}
\]
is a sum rather than a product. We often denote the log-likelihood without
the “L” subscript. The notation for the likelihood and the log-likelihood varies
with authors. My own choice of an uppercase “L” for the likelihood and a
lowercase “l” for the log-likelihood is long-standing, and not based on any
notational optimality consideration. Because of the variation in the notation
for the log-likelihood, I will often use the “$l_L$” notation because this
expression is suggestive of the meaning.
We will often work with either the likelihood or the log-likelihood as if
there is only one observation.
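One practical reason for preferring the sum to the product shows up as soon as
the likelihood is computed numerically. The sketch below is an illustration
under stated assumptions: it uses the exponential density with mean θ from
Example 1.5, the data and the function names are hypothetical, and
`math.prod` requires Python 3.8 or later. For a small sample the two
computations agree; for a large sample the product underflows to zero in
floating point while the sum remains finite.

```python
import math


def exp_density(x, theta):
    """PDF of the exponential distribution with mean theta."""
    return math.exp(-x / theta) / theta


def likelihood(theta, xs):
    """Likelihood as a product of densities, as in equation (6.2)."""
    return math.prod(exp_density(x, theta) for x in xs)


def log_likelihood(theta, xs):
    """Log-likelihood as a sum of log densities, as in equation (6.3)."""
    return sum(-math.log(theta) - x / theta for x in xs)


xs = [0.5, 1.2, 2.0, 0.7, 3.1]  # hypothetical observations
theta = 1.5
print(math.log(likelihood(theta, xs)), log_likelihood(theta, xs))

big = xs * 400  # 2000 observations
print(likelihood(theta, big), log_likelihood(theta, big))
```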

Likelihood Principle

According to the likelihood principle in statistical inference, all of the
information that the data provide concerning the relative merits of two
hypotheses is contained in the likelihood ratio of those hypotheses and the
data; that is, if for $x$ and $y$,
\[
  \frac{L(\theta ; x)}{L(\theta ; y)} = c(x, y) \quad \forall \theta, \tag{6.4}
\]
where $c(x, y)$ is constant for given $x$ and $y$, then any inference about
$\theta$ based on $x$ should be in agreement with any inference about $\theta$
based on $y$.
Although at first glance, we may think that the likelihood principle is so
obviously the right way to make decisions, Example 6.1 may cause us to think
more critically about this principle.
The likelihood principle asserts that for making inferences about a proba-
bility distribution, the overall data-generating process need not be considered;
only the observed data are relevant.
Example 6.1 The likelihood principle in sampling from a Bernoulli
distribution
In Example 3.12 we considered the problem of making inferences on the pa-
rameter π in a family of Bernoulli distributions.
One approach was to take a random sample of size $n$, $X_1, \ldots, X_n$ from
the Bernoulli($\pi$), and then use $T = \sum X_i$, which has a binomial
distribution with parameters $n$ and $\pi$.
Another approach was to take a sequential sample, X1 , X2 , . . ., until a
fixed number t of 1’s have occurred. The size of the sample N is random and
the random variable N has a negative binomial distribution with parameters
t and π.
Now, suppose we take the first approach with $n = n_0$ and we observe
$T = t_0$; and then we take the second approach with $t = t_0$ and we observe
$N = n_0$. Using the PDFs in equations 3.43 and 3.44 we get the likelihoods
\[
  L_{\mathrm{B}}(\pi) = \binom{n_0}{t_0} \pi^{t_0} (1 - \pi)^{n_0 - t_0}
  \tag{6.5}
\]
and
\[
  L_{\mathrm{NB}}(\pi) = \binom{n_0 - 1}{t_0 - 1} \pi^{t_0} (1 - \pi)^{n_0 - t_0}.
  \tag{6.6}
\]
Because LB (π)/LNB (π) does not involve π, the maxima of the likelihoods
will occur at the same point. A maximum likelihood estimator of π based on
a binomial observation of t0 out of n0 is the same as a maximum likelihood
estimator of π based on a negative binomial observation of n0 for t0 1’s because
the maximum of the likelihood occurs at the same place, t0 /n0 . The estimators
conform to the likelihood principle. Recall that the UMVU estimators are
different. (Example 5.1 and follow-up in Example 5.5 and Exercise 5.1.)
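The claims in Example 6.1 are easy to check numerically. The sketch below is
only an illustration: the values $n_0 = 12$ and $t_0 = 3$ are the ones used in
the further comments on this example, and the function names and the grid are
hypothetical choices. It evaluates both likelihoods on a grid of values of π,
and reports the range of their ratio and the grid points at which each is
largest.

```python
from math import comb


def lik_binomial(p, n0, t0):
    """Equation (6.5): t0 ones in a fixed number n0 of trials."""
    return comb(n0, t0) * p**t0 * (1 - p) ** (n0 - t0)


def lik_negbinomial(p, n0, t0):
    """Equation (6.6): n0 trials needed to reach a fixed number t0 of ones."""
    return comb(n0 - 1, t0 - 1) * p**t0 * (1 - p) ** (n0 - t0)


n0, t0 = 12, 3
grid = [k / 1000 for k in range(1, 1000)]  # values of pi in ]0, 1[

# The ratio should not involve pi; it is the ratio of the two binomial
# coefficients.
ratios = [lik_binomial(p, n0, t0) / lik_negbinomial(p, n0, t0) for p in grid]
argmax_b = max(grid, key=lambda p: lik_binomial(p, n0, t0))
argmax_nb = max(grid, key=lambda p: lik_negbinomial(p, n0, t0))
print(min(ratios), max(ratios), argmax_b, argmax_nb)
```

A grid search is used here only because it makes no assumption about the shape
of the likelihood; in this example the maximizer is of course available in
closed form as $t_0/n_0$.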

Further comments on Example 6.1


We see that the likelihood principle allows the likelihood function to be
defined as any member of an equivalence class $\{cL : c \in \mathbb{R}_+\}$, as
in the definition (3.45).
The likelihood principle, however, is stronger than just the requirement
that the estimator be invariant. It says that because LB (π)/LNB (π) does not
involve π, any decision about π based on a binomial observation of 3 out of
12 should be the same as any decision about π based on a negative binomial
observation of 12 for 3 1’s. Because the variance of π̂ does depend on whether a
binomial distribution or a negative binomial distribution is assumed, the fact
that the estimators are the same does not imply that the inference follows the
likelihood principle. See Example 6.9.
We will revisit this example in Example 7.12 on page 539, where we
wish to test a statistical hypothesis concerning $\pi$. We get different
conclusions in a significance test.

6.2 Maximum Likelihood Parametric Estimation

Let us assume a parametric model; that is, a family of densities P = {p(x ; θ)}
where θ ∈ Θ, a known parameter space.
For a sample X1 , . . . , Xn from a distribution with probability density
p(x ; θ), we write the likelihood function as a function of a variable in place
of the parameter:
\[
  L(t ; x) = \prod_{i=1}^n p(x_i ; t). \tag{6.7}
\]

Note the reversal in roles of variables and parameters. While I really like to
write the likelihood as a function of a variable other than the parameter,
which I think of as fixed, I usually write it like everyone else; that is, I
write
\[
  L(\theta ; x) = \prod_{i=1}^n p(x_i ; \theta).
\]
In the likelihood function the data, that is, the realizations of the
variables in the density function, are considered as fixed, and the parameters
are considered as variables of the optimization problem,
\[
  \max_{\theta} L(\theta ; x). \tag{6.8}
\]

For given $x$, the relative values of $L(\theta ; x)$ are important. For given
$x_1$ and $x_2$, the relative values of $L(\theta ; x_1)$ and $L(\theta ; x_2)$
are not relevant. Notice in Example 1.5, while $L(\theta ; 5) \le L(\theta ; 1)$
for all $\theta$, $\max L(\theta ; 5)$ occurs at $\theta = 5$, and
$\max L(\theta ; 1)$ occurs at $\theta = 1$. Notice also in Example 6.1, while
$L_{\mathrm{B}}(\pi)$ in equation (6.5) is uniformly greater than
$L_{\mathrm{NB}}(\pi)$ in equation (6.6), by the factor $n_0/t_0$, they both
achieve their maximum at the same point, $\pi = t_0/n_0$, which is $1/4$ for an
observation of 3 out of 12.

Closure of the Parameter Space


It is important to specify the domain of the likelihood function. If Θ is the
domain of L in equation (6.7), we want to maximize L for t ∈ Θ; that is,
maximum likelihood often involves a constrained optimization problem.
There may be difficulties with this maximization problem (6.8), however,
because of open sets. The first kind of problem is because the parameter space
may be open. We address that problem in our definition of the optimal estimator
below. See Example 6.4. The second kind of open set may be the region over
which the likelihood function is positive. This problem may arise because the
support of the distribution is open and is dependent on the parameter to be
estimated. We address that problem by adding a zero-probability set to the
support (see Example 6.5 below).
For certain properties of statistics that are derived from a likelihood
approach, it is necessary to consider the parameter space $\Theta$ to be closed
(see, for example, Wald (1949)). Often in a given probability model, such as
the exponential or the binomial, we do not assume $\Theta$ to be closed. If
$\Theta$ is not a closed set, however, the maximum in (6.8) may not exist, so
we consider the closure of $\Theta$, $\overline{\Theta}$. (If $\Theta$ is
closed, $\overline{\Theta}$ is the same set, so we can always just consider
$\overline{\Theta}$.)

6.2.1 Definition and Examples


Definition 6.2 (maximum likelihood estimate; estimator)
Let $L(\theta ; x)$ be the likelihood of $\theta \in \Theta$ for the
observations $x$ from a distribution with PDF with respect to a
$\sigma$-finite measure $\nu$. A maximum likelihood estimate, or MLE, of
$\theta$, written $\widehat{\theta}$, is defined as
\[
  \widehat{\theta} = \arg\max_{\theta \in \overline{\Theta}} L(\theta ; x),
  \tag{6.9}
\]
if it exists. There may be more than one solution; any one is an MLE. If $x$
is viewed as a random variable, then $\widehat{\theta}$ is called a maximum
likelihood estimator of $\theta$.


While I like to use the “hat” notation to mean an MLE, I also sometimes use
it to mean any estimate or estimator.
The estimate (or estimator) $\widehat{\theta}$ is a Borel function of the
observations or of the random variables.
We use “MLE” to denote either a maximum likelihood estimate or estimator,
or to denote the method of maximum likelihood estimation. The proper
meaning can be determined from the context. If the term MLE is used in a
statement about a maximum likelihood estimate or estimator, the statement
can be assumed to apply to both the estimate and the estimator.
If $\widehat{\theta}$ in (6.9) exists, we also have
\[
  \widehat{\theta} = \arg\max_{\theta \in \overline{\Theta}} l_L(\theta ; x),
  \tag{6.10}
\]
that is, the MLE can be identified either from the likelihood function or from
the log-likelihood.

The Likelihood Equations

Notice that finding an MLE means to solve a constrained optimization problem.
In simple cases, the constraints may not be active. In even simpler cases,
the likelihood is differentiable, and the MLE occurs at a stationary point in
the interior of the constraint space. In these happy cases, the MLE can be
identified by differentiation.
If the likelihood function or the log-likelihood function is differentiable
within $\Theta^\circ$, we call
\[
  \nabla L(\theta ; x) = 0 \tag{6.11}
\]
or
\[
  \nabla l_L(\theta ; x) = 0 \tag{6.12}
\]
the likelihood equations.
If $\theta_r \in \Theta^\circ$ is a root of the likelihood equations and if the
Hessian $H_L(\theta_r)$ evaluated at $\theta_r$ is negative definite, then
$\theta_r \in \Theta^\circ$ is a local optimizer of $L$ (and of $l_L$). (See
Theorem 0.0.13.)
If the maximum occurs within $\Theta^\circ$, then every MLE is a root of the
likelihood equations. There may be other roots within $\Theta^\circ$, of
course. Any such root of the likelihood equation, called an RLE, may be of
interest.
Example 6.2 MLE in the exponential family (continuation of Example 1.5)
In the exponential family of Example 1.5, with a sample $x_1, \ldots, x_n$, the
likelihood in equation (1.21) becomes
\[
  L(\theta ; x) = \theta^{-n} e^{-\sum_{i=1}^n x_i/\theta}
                  \mathrm{I}_{\mathbb{R}_+}(\theta),
\]
whose derivative wrt $\theta$ is
\[
  \left(-n\theta^{-n-1} e^{-\sum_{i=1}^n x_i/\theta}
        + \theta^{-n-2} \sum_{i=1}^n x_i \, e^{-\sum_{i=1}^n x_i/\theta}\right)
  \mathrm{I}_{\mathbb{R}_+}(\theta).
\]
Equating this to zero, we obtain
\[
  \widehat{\theta} = \sum_{i=1}^n x_i/n
\]
as a stationary point. Checking the second derivative, we find it is negative
at $\widehat{\theta}$, and so we conclude that $\widehat{\theta}$ is indeed the
MLE of $\theta$, and it is the only maximizer. Also, from the plot on the right
side of Figure 1.2, we have visual confirmation. Of course, Figure 1.2 is for a
sample of size one.
We can easily see that for a sample of size $n$ this graph would be similar,
but it would have a sharper peak; see Figure 6.1.

[Figure 6.1: two panels, “Sample of Size 1” and “Sample of Size 5”, each
plotting the likelihood against θ for θ from 0 to 10.]

Figure 6.1. Likelihood for Different Sample Sizes

The fact that the likelihood has a sharper peak is in agreement with our
expectation that the estimate should be “better” if we have a larger sample.
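The behavior described in Example 6.2 can be reproduced in a few lines. The
following is a rough sketch with hypothetical data (both samples are chosen to
have mean 2) and hypothetical function names. It locates the maximizer of the
log-likelihood on a grid and, as one convenient measure of how sharp the peak
is, evaluates the second derivative of the log-likelihood at the MLE, which
works out to $-n/\widehat{\theta}^2$ and so grows in magnitude with $n$.

```python
import math


def log_lik(theta, xs):
    """Log-likelihood for exponential data with mean theta."""
    return -len(xs) * math.log(theta) - sum(xs) / theta


def curvature_at_mle(xs):
    """Second derivative of the log-likelihood at theta_hat = mean(xs)."""
    n = len(xs)
    theta_hat = sum(xs) / n
    # d^2/dtheta^2 of (-n log(theta) - s/theta) is n/theta^2 - 2 s/theta^3.
    return n / theta_hat**2 - 2 * sum(xs) / theta_hat**3


xs1 = [2.0]                       # hypothetical sample of size 1
xs5 = [2.0, 1.0, 3.5, 0.5, 3.0]   # hypothetical sample of size 5, mean 2
grid = [k / 100 for k in range(1, 1001)]  # theta from 0.01 to 10

for xs in (xs1, xs5):
    theta_hat = sum(xs) / len(xs)
    best = max(grid, key=lambda t: log_lik(t, xs))
    print(len(xs), theta_hat, best, curvature_at_mle(xs))
```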

Example 6.3 MLE in the exponential family with right censoring
In use of the exponential family for modeling “lifetimes”, say of lightbulbs, it
is often the case that the experiment is terminated before all of the random
variables are realized; that is, we may have a potential sample
$x_1, \ldots, x_n$, but actually we only have values for the $x_i < t_c$, where
$t_c$ is some fixed and known value. It might be called the “censoring time”.
This setup yields censored data; in particular, it is right censored data,
because the larger values are censored. Suppose that $r$ observations
$t_1, \ldots, t_r$ are obtained, leaving $n - r$ unobserved values of the
potential sample. In this setup, the time $t_c$ is fixed, and so $r$ is a
random variable. We could also contemplate an experimental setup in which $r$
is chosen in advance, and so the censoring time $t_c$ is a random variable.
(These two data-generating processes are similar to the two experiments we
described for Bernoulli data in Example 3.12, and to which we have alluded in
other examples.) The first method is called “Type I censoring” (upper bound on
the observation fixed) and the other method is called “Type II censoring”
(fixed number of observed values to be taken).
Censoring is different from a situation in which the distribution is trun-
cated, as in Exercise 2.14 on page 203.
For right censored data with $n$, $r$, and $t_c$ as described above from any
distribution with PDF $f(x; \theta)$ and CDF $F(x; \theta)$, the likelihood
function is
\[
  L(\theta; x) = \prod_{i=1}^r f(t_i ; \theta) \, (1 - F(t_c; \theta))^{n-r}.
\]
We may note in passing that the likelihood is the same for type I and type
II censoring, just as we saw it to be in the binomial and negative binomial
distributions arising from Bernoulli data in Example 3.12.
Now, for the case where the distribution is exponential with parameter
$\theta$, we have the likelihood function
\[
  L(\theta; x) = \frac{1}{\theta^r} e^{-\sum_{i=1}^r t_i/\theta}
                 e^{-(n-r)t_c/\theta}.
\]
The maximum, which we can find by differentiation, occurs at
\[
  \widehat{\theta} = T/r,
\]
where $T = \sum_{i=1}^r t_i + (n - r)t_c$ is called the “total time on test”.
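A small numerical check of the “total time on test” result may be helpful.
The sketch below assumes Type I censoring of exponential lifetimes with mean
θ; the failure times, the values of $n$ and $t_c$, and the function names are
all hypothetical. It computes $T/r$ and confirms that the censored
log-likelihood is smaller at nearby values of θ.

```python
import math


def censored_exp_mle(observed, n, t_c):
    """MLE T/r for right-censored exponential data with mean theta.

    observed: the r failure times that fell below t_c
    n:        size of the potential sample
    t_c:      censoring time
    """
    r = len(observed)
    if r == 0:
        raise ValueError("no uncensored observations; T/r is undefined")
    total_time_on_test = sum(observed) + (n - r) * t_c
    return total_time_on_test / r


def censored_log_lik(theta, observed, n, t_c):
    """Log of the censored exponential likelihood given in the text."""
    r = len(observed)
    return -r * math.log(theta) - (sum(observed) + (n - r) * t_c) / theta


observed = [0.4, 1.1, 1.9, 2.6]  # hypothetical failure times
n, t_c = 6, 3.0                  # two units still running at time t_c
theta_hat = censored_exp_mle(observed, n, t_c)
print(theta_hat)
print(censored_log_lik(theta_hat, observed, n, t_c)
      > censored_log_lik(0.9 * theta_hat, observed, n, t_c))
```

Note that the censored units contribute to the numerator $T$ but not to the
divisor $r$; dividing by $n$ instead would bias the estimate downward.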

MLE in ∂Θ

If $\Theta$ is open, and if the maximizer in equation (6.9) is in
$\partial\Theta$, the distribution defined by the MLE may be degenerate, as can
be the case in the following example.
Example 6.4 MLE of Bernoulli parameter
Consider the Bernoulli family of distributions with parameter $\pi$. In the
usual definition of this family, $\pi \in \Pi = ]0, 1[$. Suppose we take a
random sample $X_1, \ldots, X_n$. The log-likelihood is
\[
  l_L(\pi ; x) = \sum_{i=1}^n x_i \log(\pi)
                 + \left(n - \sum_{i=1}^n x_i\right) \log(1 - \pi).
  \tag{6.13}
\]
This is a concave differentiable function, so we can get the maximum by
differentiating and setting the result to zero. We obtain
\[
  \widehat{\pi} = \sum_{i=1}^n x_i/n. \tag{6.14}
\]
If $\sum_{i=1}^n x_i = 0$ or if $\sum_{i=1}^n x_i = n$,
$\widehat{\pi} \notin \Pi$, but $\widehat{\pi} \in \overline{\Pi}$ so
$\widehat{\pi}$ is the MLE of $\pi$.
Note that in this case, the MLE corresponds to the Bayes estimator with
loss function (4.52) and uniform prior (see page 360) and to the UMVUE (see
page 394).
Further comments on Example 6.4
In Example 6.1 we considered the problem of making inferences on the
parameter $\pi$ in a family of Bernoulli distributions, and considered two
different approaches. One approach was to take a random sample of size $n$,
$X_1, \ldots, X_n$ from the Bernoulli($\pi$), and then use $T = \sum X_i$,
which has a binomial distribution with parameters $n$ and $\pi$. Another
approach was to take a sequential sample, $X_1, X_2, \ldots$, until a fixed
number $t$ of 1’s have occurred. The likelihood principle tells us that if the
data are the same, we should reach the same conclusions. In Example 6.1 we
wrote the likelihood functions based on these two different approaches. One
was the same as in equation (6.13) and so the MLE under that setup would be
that given in equation (6.14). After canceling constants, the other
log-likelihood in Example 6.1 was also
\[
  \sum_{i=1}^n x_i \log(\pi) + \left(n - \sum_{i=1}^n x_i\right) \log(1 - \pi),
\]
so that sampling scheme yields the same MLE, if $n$ and $\sum_{i=1}^n x_i$ are
the same.
Of course making inferences about a parameter involves more than just
obtaining a good estimate of it. We will consider the problem again in Exam-
ples 6.9 and 7.12.
Allowing an MLE to be in $\overline{\Theta} - \Theta$ is preferable to saying
that an MLE does not exist. It does, however, ignore the question of continuity
of $L(\theta ; x)$ over $\overline{\Theta}$, and it allows an estimated PDF
that is degenerate.
We have encountered this situation before in the case of UMVUEs; see
Example 5.5.
While the open parameter space in Example 6.4 would lead to a problem
with existence of the MLE if it were defined as a maximum over the parameter
space instead of its closure, an open support can likewise lead to a
problem. Consider a distribution with Lebesgue PDF
\[
  p_X(x) = h(x, \theta) \mathrm{I}_{S(\theta)}(x) \tag{6.15}
\]
where $S(\theta)$ is open. In this case, the likelihood has the form
\[
  L(\theta ; x) = h(x, \theta) \mathrm{I}_{R(x)}(\theta), \tag{6.16}
\]
where $R(x)$ is open. It is quite possible that $\sup L(\theta ; x)$ will occur
on $\overline{R(x)} - R(x)$.
Example 6.5 MLE in U(0, θ); closed support
Consider $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} \mathrm{U}(0, \theta)$,
with $\theta \in \Theta = \mathbb{R}_+$. The PDF is
\[
  p_X(x) = \frac{1}{\theta} \mathrm{I}_{[0,\theta]}(x). \tag{6.17}
\]
The likelihood is
\[
  L(\theta ; x) = \frac{1}{\theta^n} \mathrm{I}_{[x_{(n)},\infty[}(\theta).
  \tag{6.18}
\]
The MLE is easily seen to be $\widehat{\theta} = x_{(n)}$. In Example 5.8, we
saw that the UMVUE of $\theta$ is $(1 + 1/n)x_{(n)}$.
Suppose we take the support to be the open interval $]0, \theta[$. (Despite
Definition 1.12, such a support is often assumed.) The likelihood function then
is
\[
  L(\theta ; x) = \frac{1}{\theta^n} \mathrm{I}_{]x_{(n)},\infty[}(\theta).
\]
This is discontinuous and it does not have a maximum, as we see in Figure 6.2.

[Figure 6.2: plot of the likelihood L against θ, with the point x_(n) marked
on the θ axis.]

Figure 6.2. Discontinuous Likelihood with No Maximum


In this case the maximum of the likelihood does not exist, but the supremum
of the likelihood occurs at $x_{(n)}$ and it is finite. We would like to call
$x_{(n)}$ the MLE of $\theta$.
We can reasonably do this by modifying the definition of the family of
distributions by adding a zero-probability set to the support. We redefine the
family in equation (6.15) to have the Lebesgue PDF
\[
  p_X(x) = \frac{1}{\theta} \mathrm{I}_{[0,\theta]}(x). \tag{6.19}
\]
Now, the open interval $]x_{(n)}, \infty[$ where the likelihood was positive
before becomes a half-closed interval $[x_{(n)}, \infty[$, and the maximum of
the likelihood occurs at $x_{(n)}$.
This is one reason why we define the support to be closed.
This approach is cleaner than solving the logical problem by defining the
MLE in terms of the sup rather than the max. A definition in terms of the sup
may not address problems that could arise due to various types of discontinuity
of L(θ ; x) at the boundary of S(θ).
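The effect of closing the support can be seen directly by evaluating the
likelihood at and just above the largest observation. The sketch below is an
illustration only: it takes the likelihood of $n$ iid U(0, θ) observations to
be $\theta^{-n}$ wherever it is positive, and the data, the function name, and
the `closed` flag are hypothetical.

```python
def uniform_likelihood(theta, xs, closed=True):
    """Likelihood of theta for iid U(0, theta) observations xs.

    With closed support [0, theta] the likelihood is positive for
    theta >= max(xs); with open support ]0, theta[ only for theta > max(xs).
    """
    n, x_max = len(xs), max(xs)
    inside = theta >= x_max if closed else theta > x_max
    return theta ** (-n) if inside else 0.0


xs = [0.8, 2.3, 1.7, 2.9, 0.2]  # hypothetical observations
x_max = max(xs)

mle = x_max                        # maximizer with closed support
umvue = (1 + 1 / len(xs)) * x_max  # UMVUE cited from Example 5.8

# Closed support: the supremum is attained at x_max.
print(uniform_likelihood(x_max, xs))
# Open support: the likelihood is 0 at x_max but nearly the supremum just above.
print(uniform_likelihood(x_max, xs, closed=False),
      uniform_likelihood(x_max + 1e-9, xs, closed=False))
```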

MLE of More than One Parameter

It is usually more difficult to determine the MLE of more than one parameter.
The likelihood equation in that case is a system of equations. Also, of course,
the likelihood equation, whether a single equation or a system, may not be
easy to solve, as the following example shows.
Example 6.6 MLE of the parameters in a gamma distribution
Consider the gamma family of distributions with parameters α and β. Given
a random sample x1 , . . . , xn , the log-likelihood of α and β is
X 1X
lL (α, β ; x) = −nα log(β)−n log(Γ(α)))+(α−1) log(xi )− xi . (6.20)
β
This yields the likelihood equations
Γ′ (α) X
− n log(β) − n + log(xi ) = 0 (6.21)
Γ(α)
and
nα 1 X
− + 2 xi = 0. (6.22)
β β
Checking the Hessian (at any point in the domain), we see that a root of the
likelihood equations is a local minimizer.
At the solution we have
X
βb = xi /(nb
α) (6.23)

and

c
Theory of Statistics 2000–2020 James E. Gentle
456 6 Statistical Inference Based on Likelihood

\[ \log(\hat\alpha) - \frac{\Gamma'(\hat\alpha)}{\Gamma(\hat\alpha)} + \sum \log(x_i)/n - \log\left(\sum x_i / n\right) = 0. \tag{6.24} \]
There is no closed form solution. A numerical method must be used; see
Example 6.14.
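As a minimal sketch of one such numerical method (not necessarily the one used in Example 6.14), equation (6.24) can be solved for the shape parameter by bisection, after which equation (6.23) gives the scale parameter. The function names `digamma` and `gamma_mle` are hypothetical, and the digamma function Γ′(α)/Γ(α) is approximated here by its standard recurrence and asymptotic series rather than taken from a library.

```python
import math

def digamma(x):
    """Approximate psi(x) = Gamma'(x)/Gamma(x) for x > 0."""
    shift = 0.0
    while x < 6.0:                 # recurrence psi(x) = psi(x + 1) - 1/x
        shift -= 1.0 / x
        x += 1.0
    r = 1.0 / (x * x)
    # asymptotic series: log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return shift + math.log(x) - 0.5 / x - r * (1.0 / 12 - r * (1.0 / 120 - r / 252))

def gamma_mle(xs, rel_tol=1e-12):
    """Solve (6.24) for alpha by bisection, then obtain beta from (6.23)."""
    n = len(xs)
    mean = sum(xs) / n
    c = math.log(mean) - sum(math.log(x) for x in xs) / n
    if c <= 0.0:
        raise ValueError("need at least two distinct positive observations")

    def f(alpha):                  # left-hand side of (6.24)
        return math.log(alpha) - digamma(alpha) - c

    lo, hi = 1e-8, 1.0
    while f(hi) > 0.0:             # f is decreasing; expand until the root is bracketed
        hi *= 2.0
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return alpha, mean / alpha

alpha_hat, beta_hat = gamma_mle([1.0, 2.0, 3.0, 4.0, 5.0])
```

Bisection is chosen only because log(α) − Γ′(α)/Γ(α) is monotone decreasing in α, so a bracket always exists when the observations are not all equal; a Newton iteration would converge faster but needs the trigamma function.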
Sometimes in multiple-parameter models, the parameters are functionally
independent and the optimization can be performed on different parts of the
separable likelihood function. This is the case for a normal distribution, as we
see in Example 6.25.
Example 6.7 MLE in the exponential family with range dependency

Consider the two-parameter exponential family, that is, a shifted version of the
exponential family of distributions. This family is the subject of Example 5.9
on page 397. The Lebesgue PDF is

\[ \theta^{-1} e^{-(x-\alpha)/\theta} I_{]\alpha,\infty[}(x). \]

Suppose we have observations X1 , X2 , . . . , Xn . The likelihood function is

\[ L(\alpha, \theta; X) = \theta^{-n} \exp\left(-\sum (X_i - \alpha)/\theta\right) I_{]0,X_{(1)}]}(\alpha)\, I_{]0,\infty[}(\theta). \]

This is 0 when α > X(1) , but it is increasing in α on ]0, X(1)] independently
of θ.
Hence, the MLE of α is X(1) .
Now, we substitute this back into L(α, θ; X) and maximize wrt θ, that is,
we solve
\[ \max_{\theta}\; \theta^{-n} \exp\left(-\sum (X_i - X_{(1)})/\theta\right). \]

We do this by forming and solving the likelihood equation, noting that it
yields a maximum within the parameter space. We get
\[ \hat\theta = \frac{1}{n} \sum (X_i - X_{(1)}). \]
In Example 5.9, we found the UMVUEs:
\[ T_\alpha = X_{(1)} - \frac{1}{n(n-1)} \sum (X_i - X_{(1)}) \]

and
\[ T_\theta = \frac{1}{n-1} \sum (X_i - X_{(1)}). \]
(Recall that we find a complete sufficient statistic and then manipulate it to
be unbiased.) Notice the similarity of these to the MLEs, which are biased.
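The similarity of the two pairs of estimators is easy to see side by side. The following sketch simply evaluates the formulas of this example on a sample; the function name and the dictionary keys are hypothetical.

```python
def shifted_exponential_estimates(xs):
    """MLEs and UMVUEs of (alpha, theta) in the two-parameter exponential family."""
    n = len(xs)
    if n < 2:
        raise ValueError("the UMVUEs need at least two observations")
    x_min = min(xs)                          # the first order statistic X_(1)
    s = sum(x - x_min for x in xs)           # sum of (X_i - X_(1))
    return {
        "alpha_mle": x_min,
        "theta_mle": s / n,
        "alpha_umvue": x_min - s / (n * (n - 1)),
        "theta_umvue": s / (n - 1),
    }

estimates = shifted_exponential_estimates([2.0, 3.0, 5.0, 6.0])
```

For this small sample the MLE of θ is 2 while the UMVUE is 8/3; the two differ only by the factor n/(n − 1).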

6.2.2 Finite Sample Properties of MLEs

Any approach to estimation may occasionally yield very poor or meaningless
estimators. In addition to the possibly negative UMVUEs for variance components in Example 5.31, we have seen in Exercise 5.2 that a UMVUE of
$g(\theta) = e^{-3\theta}$ in a Poisson distribution is not a very good estimator. While in
some cases the MLE is more reasonable (see Exercise 6.1), in other cases the
MLE may be very poor.
As we have mentioned, MLEs have a nice intuitive property. In Section 6.3
we will see that they also often have good asymptotic properties.
We now consider some other properties; some useful and some less desir-
able.

Relation to Sufficient Statistics

Theorem 6.1
If there is a sufficient statistic and an MLE exists, then an MLE is a function
of the sufficient statistic.
Proof.
This follows directly from the factorization theorem.

Relation to Efficient Statistics


Given the three Fisher information regularity conditions (see page 168) we
have defined “Fisher efficient estimators” as unbiased estimators that achieve
the lower bound on their variance.
Theorem 6.2

Assume the FI regularity conditions for a family of distributions {Pθ } with
the additional Le Cam-type requirement that the Fisher information matrix
I(θ) is positive definite for all θ. Let T (X) be a Fisher efficient estimator of
θ. Then T (X) is an MLE of θ.
Proof.
Let pθ (x) be the PDF. We have
\[ \frac{\partial}{\partial\theta} \log(p_\theta(x)) = I(\theta)(T(x) - \theta) \]
for any θ and x. Clearly, for θ = T (x), this expression is 0 (hence, T (X) is an
RLE). Because I(θ), which is the negative of the expected Hessian of the
log-likelihood, is positive definite for all θ, the log-likelihood is concave in θ
and T (x) maximizes the likelihood.
Notice that without the additional requirement of a positive definite in-
formation matrix, Theorem 6.2 would yield only the conclusion that T (X) is
an RLE.

Equivariance of MLEs

If $\hat\theta$ is a good estimator of θ, it would seem to be reasonable that $g(\hat\theta)$ is a good
estimator of g(θ), where g is a Borel function. “Good”, of course, is relative
to some criterion. In a decision-theoretic approach, we seek L-invariance; that
is, invariance of the loss function (see page 266). Even if the loss function is
invariant, other properties may not be preserved. If the criterion is UMVU,
then the estimator in general will not have this equivariance property; that
is, if $\hat\theta$ is a UMVUE of θ, then $g(\hat\theta)$ may not be a UMVUE of g(θ). (It is not
even unbiased in general.)
We now consider the problem of determining the MLE of g(θ) when we
have an MLE $\hat\theta$ of θ. Following the definition of an MLE, the MLE of g(θ)
should be the maximizer of the likelihood function of g(θ). If the function g
is not one-to-one, the likelihood function of g(θ) may not be well-defined. We
therefore introduce the induced likelihood.
Definition 6.3 (induced likelihood)
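Let {pθ : θ ∈ Θ} with $\Theta \subseteq \mathrm{I\!R}^d$ be a family of PDFs wrt a common σ-finite measure, and let L(θ) be the likelihood associated with this family,
given observations. Now let g be a Borel function from Θ to $\Lambda \subseteq \mathrm{I\!R}^{d_1}$ where
$1 \le d_1 \le d$. Then
\[ \widetilde{L}(\lambda) = \sup_{\{\theta \,:\, \theta\in\Theta \text{ and } g(\theta)=\lambda\}} L(\theta) \tag{6.25} \]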

is called the induced likelihood function for the transformed parameter.


The induced likelihood provides an appropriate MLE for g(θ) in the sense of
the following theorem.
Theorem 6.3
Suppose {pθ : θ ∈ Θ} with $\Theta \subseteq \mathrm{I\!R}^d$ is a family of PDFs wrt a common
σ-finite measure with associated likelihood L(θ). Let $\hat\theta$ be an MLE of θ. Now
let g be a Borel function from Θ to $\Lambda \subseteq \mathrm{I\!R}^{d_1}$ where $1 \le d_1 \le d$ and let $\widetilde{L}(\lambda)$
be the resulting induced likelihood. Then $g(\hat\theta)$ maximizes $\widetilde{L}(\lambda)$.
Proof.
Follows directly from definitions, but it is an exercise to fill in the details.
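A small numerical illustration of Theorem 6.3 may help, under assumptions that are not part of the text: a Bernoulli sample with t ones in n trials, and the transformation g(π) = π(1 − π), which is not one-to-one on [0, 1]. The function names and the grid are hypothetical choices for the sketch.

```python
import math

def bernoulli_likelihood(p, t, n):
    """Likelihood of p given t ones in n Bernoulli trials."""
    return p ** t * (1.0 - p) ** (n - t)

def induced_likelihood(lam, t, n):
    """Sup of L(p) over {p in [0, 1] : p(1 - p) = lam}, for 0 <= lam <= 1/4."""
    root = math.sqrt(max(0.0, 1.0 - 4.0 * lam))
    preimage = ((1.0 - root) / 2.0, (1.0 + root) / 2.0)   # both solutions of p(1-p) = lam
    return max(bernoulli_likelihood(p, t, n) for p in preimage)

t, n = 7, 10
p_hat = t / n                               # MLE of p
lam_hat = p_hat * (1.0 - p_hat)             # g applied to the MLE
grid = [k / 4000 for k in range(1001)]      # lambda values from 0 to 1/4
best = max(grid, key=lambda lam: induced_likelihood(lam, t, n))
```

On this grid the maximizer `best` of the induced likelihood coincides with `lam_hat`, as the theorem asserts.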
Usually we consider reparametrizations, as in Section 2.6, with one-to-one
functions. This provides a clean approach to the question of the MLE
of g(θ) without having to introduce an induced likelihood.
Given the distribution Pθ for the random variable X, suppose we seek
an MLE of g̃(θ). If g̃ is not one-to-one, then g̃(θ) does not provide enough
information to define the distribution Pg̃(θ) for X. Therefore, we cannot define
the likelihood for g̃(θ).
If g̃ is one-to-one, let g(θ) = g̃(θ); otherwise, define

g(θ) = (g̃(θ), h(θ))

in such a way that g(θ) is one-to-one. The function h is not unique, but $g^{-1}$
is unique; the likelihood is well-defined; $g(\hat\theta)$ is an MLE of g(θ); and so $\tilde{g}(\hat\theta)$
is an MLE of g̃(θ). Compare this with the results of Theorem 6.3 above.
Example 6.8 MLE of the variance in a Bernoulli distribution
Consider the Bernoulli family of distributions with parameter π. The vari-
ance of a Bernoulli distribution is g(π) = π(1 − π). Given a random sample
x1 , . . . , xn , the MLE of π is
\[ \hat\pi = \sum_{i=1}^n x_i / n, \]
as we saw in Example 6.4, hence the MLE of the variance is
\[ \frac{1}{n} \sum x_i \left(1 - \sum x_i / n\right). \tag{6.26} \]
Note that this estimator is biased and that it is the same estimator as that of
the variance in a normal distribution from Example 3.13:
\[ \sum (x_i - \bar{x})^2 / n. \]

As we saw in Example 5.7, the UMVUE of the variance in a Bernoulli
distribution is, as in equation (5.11),
\[ \frac{1}{n-1} \sum x_i \left(1 - \sum x_i / n\right). \]
The difference in the MLE and the UMVUE of the variance in the Bernoulli
distribution is the same as the difference in the estimators of the variance in
the normal distribution that we encountered in Example 3.13 and Exam-
ple 5.6. How do the MSEs of the estimators of the variance in a Bernoulli
distribution compare? (Exercise 6.6.)
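The bias of the MLE and the unbiasedness of the UMVUE can be checked exactly, because both estimators depend on the sample only through T = Σ x_i, which is binomial. The sketch below computes the two expectations by summing over the binomial distribution; the function names and the values n = 10 and π = 0.3 are arbitrary illustrative choices, and the MSE comparison of Exercise 6.6 is not carried out here.

```python
from math import comb

def var_mle(t, n):
    """MLE of the Bernoulli variance, equation (6.26), written in terms of t = sum of x_i."""
    return (t / n) * (1.0 - t / n)

def var_umvue(t, n):
    """UMVUE of the Bernoulli variance, written in terms of t = sum of x_i."""
    return (t / (n - 1)) * (1.0 - t / n)

def exact_mean(estimator, n, p):
    """Exact expectation of estimator(T, n) when T is binomial(n, p)."""
    return sum(comb(n, t) * p ** t * (1.0 - p) ** (n - t) * estimator(t, n)
               for t in range(n + 1))

n, p = 10, 0.3
mean_mle = exact_mean(var_mle, n, p)        # (n - 1)/n times p(1 - p)
mean_umvue = exact_mean(var_umvue, n, p)    # p(1 - p)
```

With these values the true variance is 0.21; the expectation of the UMVUE equals it, while the expectation of the MLE is smaller by the factor (n − 1)/n.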
Whenever the variance of a distribution can be expressed as a function
of other parameters g(θ), as in the case of the Bernoulli distribution, the
MLE of the variance is $g(\hat\theta)$, where $\hat\theta$ is an MLE of θ. The MLE of the
variance of the gamma distribution, for example, is $\hat\alpha\hat\beta^2$, where $\hat\alpha$ and $\hat\beta$ are
the MLEs in Example 6.6. The plug-in estimator of the variance of the gamma
distribution, given the sample, X1 , X2 , . . . , Xn , as always, is
\[ \frac{1}{n} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2. \]

Example 6.9 The likelihood principle in sampling from a Bernoulli distribution

In Example 6.1 we considered the problem of making inferences on the parameter π in a family of Bernoulli distributions by either taking a random sample
of size n and using $T = \sum X_i$, which has a binomial distribution, or by taking
a sample, X1 , X2 , . . ., until a fixed number t of 1’s have occurred and observing
the size of the sample N , which has a negative binomial distribution. Given
T = t or N = n, either way, we get the MLE

π̂ = t/n.

To make inferences on π using π̂ we need the variance V(π̂). Under the bino-
mial distribution, we need the variance of T /n, which is π(1 − π)/n, whose
MLE as in Example 6.8 is π̂(1 − π̂)/n. Under the negative binomial distribution, we need the variance of t/N . The variance of N is $t(1-\pi)/\pi^2$ and
its MLE is the same with π̂ in place of π. The variance of t/N cannot be
expressed in closed form. (See Stephan (1945).)
Although we have seen in equations (6.5) and (6.6) that the ratio of the
likelihoods does not involve π and the MLEs based on the two data-generating
processes conform to the likelihood principle, the variances of the MLEs are
different.
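The two data-generating processes can be compared numerically. In the sketch below, the negative binomial PMF for N is taken to be C(n − 1, t − 1) π^t (1 − π)^(n−t), which is the standard form for sampling until t ones occur; the function names and the values t = 3 and n = 12 are illustrative assumptions.

```python
from math import comb

def binomial_likelihood(p, t, n):
    """P(T = t) for a fixed sample size n."""
    return comb(n, t) * p ** t * (1.0 - p) ** (n - t)

def negative_binomial_likelihood(p, t, n):
    """P(N = n) when sampling continues until t ones have occurred."""
    return comb(n - 1, t - 1) * p ** t * (1.0 - p) ** (n - t)

t, n = 3, 12
# The ratio of the two likelihoods is the same for every value of p.
ratios = [binomial_likelihood(p, t, n) / negative_binomial_likelihood(p, t, n)
          for p in (0.1, 0.25, 0.5, 0.9)]

p_hat = t / n                                    # the MLE under either process
var_hat_binomial = p_hat * (1.0 - p_hat) / n     # MLE of V(T/n) in the binomial case
var_hat_N = t * (1.0 - p_hat) / p_hat ** 2       # MLE of V(N) in the negative binomial case
```

The constant ratio (here n/t = 4) is what makes the MLE the same in both cases, while the variance quantities attached to the two processes are different objects.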

Other Properties of MLEs

Some properties of MLEs are not always desirable.
First of all, we note that an MLE may be biased. The most familiar example of this is the MLE of the variance, as seen in Examples 6.8 and 3.13.
Another example is the MLE of the location parameter in the uniform distribution in Example 6.5.
Although the MLE approach is usually an intuitively logical one, it is not
based on a formal decision theory, so it is not surprising that MLEs may not
possess certain desirable properties that are formulated from that perspective.
An example of a likelihood function that is not very useful without some
modification is in nonparametric probability density estimation. Suppose we
assume that a sample comes from a distribution with continuous PDF p(x).
The likelihood is $\prod_{i=1}^n p(x_i)$. Even under the assumption of continuity, there
is no solution. We will discuss this problem in Chapter 8.
C. R. Rao cites another example in which the likelihood function is not
very meaningful.
Example 6.10 a meaningless MLE
Consider an urn containing N balls labeled 1, . . . , N and also labeled with
distinct real numbers θ1 , . . . , θN (with N known). For a sample without replacement of size n < N where we observe (xi , yi ) = (label, θlabel ), what is the
likelihood function? It is either 0, if the label and θlabel for at least one observation is inconsistent, or $\binom{N}{n}^{-1}$, otherwise; and, of course, we don’t know!

This likelihood function is not informative, and could not be used, for example, for estimating θ = θ1 + · · · + θN . (There is a pretty good estimator of θ;
it is $N\left(\sum y_i\right)/n$.)
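One sense in which that estimator is "pretty good" is that it is unbiased under simple random sampling without replacement. The sketch below verifies this by enumerating every possible sample for a small urn; the values in `thetas` are hypothetical and the function name is an assumption.

```python
from itertools import combinations

def total_estimate(ys, N):
    """The estimator N * (sum of observed y_i) / n of theta_1 + ... + theta_N."""
    return N * sum(ys) / len(ys)

thetas = [2.0, -1.0, 4.5, 0.5, 3.0]     # hypothetical urn contents
N, n = len(thetas), 2

# Every sample of size n is equally likely, so the plain average is the expectation.
estimates = [total_estimate(sample, N) for sample in combinations(thetas, n)]
mean_estimate = sum(estimates) / len(estimates)
```

Averaged over all equally likely samples, the estimates reproduce the true total exactly.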
There are other interesting examples in which MLEs do not have desirable
(or expected) properties.
• An MLE may be discontinuous in the data. This is obviously the case for
a discrete distribution, but it can also occur in a contaminated continuous
distribution as, for example, in the case of an ε-mixture distribution family
with CDF
\[ P_{x_c,\epsilon}(x) = (1-\epsilon)P(x) + \epsilon I_{[x_c,\infty[}(x), \tag{6.27} \]
where 0 ≤ ε ≤ 1.
• An MLE may not be a function of a sufficient statistic (if the MLE is not
unique).
• An MLE may not satisfy the likelihood equation as, for example, when the
likelihood function is not differentiable at its maximum, as in Example 6.5.
• The likelihood equation may have a unique root, yet no MLE exists. While
there are examples in which the roots of the likelihood equations occur
at minima of the likelihood, this situation does not arise in any realistic
distribution (that I am aware of). Romano and Siegel (1986) construct a
location family of distributions with support on

\[ \mathrm{I\!R} - \{x_1 + \theta,\; x_2 + \theta \,:\, x_1 < x_2\}, \]

where x1 and x2 are known but θ is unknown, with a Lebesgue density
p(x) that rises as x ↗ x1 to a singularity at x1 and rises as x ↘ x2 to a
singularity at x2 and that is continuous and strictly convex over ]x1 , x2 [
and singular at both x1 and x2 . With a single observation, the likelihood
equation has a root at the minimum of the convex portion of the density
between x1 + θ and x2 + θ, but the likelihood increases without bound at
both x1 + θ and x2 + θ.
• An MLE may differ from an MME; in particular an MLE of the population
mean may not be the sample mean.
Note that Theorem 6.1 hints at two other issues: nonuniqueness of an MLE
and existence of an MLE. We now consider these.

Nonuniqueness

There are many cases in which the MLEs are not unique (and I’m not just
referring to RLEs). The following examples illustrate this.
Example 6.11 likelihood in a Cauchy family
Consider the Cauchy distribution with location parameter θ. The likelihood
equation is

\[ \sum_{i=1}^n \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0. \]

This may have multiple roots (depending on the sample), and so the one
yielding the maximum would be the MLE. Depending on the sample, however,
multiple roots can yield the same value of the likelihood function.
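A two-point sample makes both phenomena visible. In the sketch below, the sample {−3, 3} is a hypothetical choice; the roots of the likelihood equation are located by a grid scan followed by bisection, and the function names are assumptions.

```python
import math

def cauchy_score(theta, xs):
    """Left-hand side of the Cauchy likelihood equation."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def cauchy_loglik(theta, xs):
    """Log-likelihood for the Cauchy location family with unit scale."""
    return -sum(math.log(math.pi * (1.0 + (x - theta) ** 2)) for x in xs)

def score_roots(xs, lo, hi, cells=2000):
    """Roots of the score in [lo, hi], found by scanning for sign changes and bisecting."""
    roots = []
    a, fa = lo, cauchy_score(lo, xs)
    for k in range(1, cells + 1):
        b = lo + (hi - lo) * k / cells
        fb = cauchy_score(b, xs)
        if fb == 0.0:                       # a grid point that is exactly a root
            roots.append(b)
        elif fa * fb < 0.0:                 # sign change: refine by bisection
            left, right, f_left = a, b, fa
            for _ in range(80):
                mid = 0.5 * (left + right)
                f_mid = cauchy_score(mid, xs)
                if f_left * f_mid <= 0.0:
                    right = mid
                else:
                    left, f_left = mid, f_mid
            roots.append(0.5 * (left + right))
        a, fa = b, fb
    return roots

xs = [-3.0, 3.0]
roots = score_roots(xs, -10.0, 10.0)
logliks = [cauchy_loglik(r, xs) for r in roots]
```

For this sample the likelihood equation has three roots: 0, which is a local minimum of the likelihood, and ±√8, which give the same likelihood value, so the MLE is not unique.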
Another example in which the MLE is not unique is U(θ − 1/2, θ + 1/2).
Example 6.12 likelihood in a uniform family with fixed range
Given the sample x1 , . . . , xn, the likelihood function for U(θ − 1/2, θ + 1/2)
is
\[ I_{[x_{(n)} - 1/2,\; x_{(1)} + 1/2]}(\theta). \]
It is maximized at any value between x(n) − 1/2 and x(1) + 1/2.

Nonexistence and Other Properties

We have already mentioned situations in which the likelihood approach does
not seem to be the logical way, and have seen that sometimes in nonparametric
problems, the MLE does not exist. This often happens when there are more
“things to estimate” than there are observations. This can also happen in
parametric problems. It may happen that the maximum does not exist because
the likelihood is unbounded from above. In this case the argmax does not exist,
and the maximum likelihood estimate does not exist.
Example 6.13 nonexistence of MLE
Consider the normal family of distributions with parameters µ and σ². Suppose we have one observation x. The log-likelihood is

\[ l_L(\mu, \sigma^2; x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}, \]
which is unbounded when µ = x and σ² approaches zero. It is therefore clear
that no MLE of σ² exists. Strictly speaking, we could also say that no MLE
of µ exists either; however, for any fixed value of σ² in the (open) parameter
space, µ = x maximizes the likelihood, so it is reasonable to call x the MLE
of µ.
Recall from Example 5.14 that the degree of the variance functional is 2.

In this case, some people prefer to say that the likelihood function does not
exist; that is, they suggest that the definition of a likelihood function include
boundedness.
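The unboundedness in Example 6.13 is easy to observe numerically. The sketch below evaluates the one-observation normal log-likelihood at µ = x for a decreasing sequence of variances; the observation value and the function name are hypothetical.

```python
import math

def normal_loglik(mu, sigma2, x):
    """Log-likelihood of (mu, sigma^2) for a single normal observation x."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)

x = 1.7                                          # hypothetical single observation
# With mu = x the quadratic term vanishes and the log-likelihood grows as sigma^2 -> 0.
values = [normal_loglik(x, 10.0 ** (-k), x) for k in range(1, 9)]
```

Each tenfold reduction of σ² adds log(10)/2 to the log-likelihood, so no finite maximum is reached.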

6.2.3 The Score Function and the Likelihood Equations

In several of the preceding examples, we found the MLEs by differentiating
the likelihood and equating the derivative to zero. In many cases, of course,
we cannot find an MLE by just differentiating the likelihood; Example 6.5
is such a case. We will discuss methods of finding an MLE in Section 6.2.4
beginning on page 465.
In the following we will generally consider only the log-likelihood, and we
will assume that it is differentiable within Θ◦ .
The derivative of the log-likelihood is the score function sn (θ ; x) (equa-
tion (3.57) on page 244). The score function is important in computations for
determining an MLE, as we see in Section 6.2.4, but it is also important in
studying properties of roots of the likelihood equation, especially asymptotic
properties, as we see in Section 6.3.
The score function is an estimating function and leads to the likelihood
equation ∇lL (θ ; x) = 0 or
sn (θ ; x) = 0, (6.28)
which is an estimating equation, similar to the estimating equation (5.71) for
least squares estimators. Generalizations of these equations are called “gener-
alized estimating equations”, or GEEs; see Section 3.2.5.
Any root of the likelihood equations, which is called an RLE, may be an
MLE. A theorem from functional analysis, usually proved in the context of
numerical optimization, states that if θ∗ is an RLE and HlL (θ∗ ; x) is negative
definite, then there is a local maximum at θ∗ . This may allow us to determine
that an RLE is an MLE. There are, of course, other ways of determining
whether an RLE is an MLE. In MLE, the determination that an RLE is
actually an MLE is an important step in the process.

The Log-Likelihood Function and the Score Function in Regular Families

In the regular case satisfying the three Fisher information regularity conditions
(see page 168), the likelihood function and consequently the log-likelihood
are twice differentiable within Θ◦ , and the operations of differentiation and
integration can be interchanged. In this case, the score estimating function is
unbiased (see Definition 3.7):
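\[
\begin{aligned}
E_\theta(s_n(\theta; X)) &= \int_{\mathcal{X}} \frac{\partial}{\partial\theta} l_L(\theta; x)\, p(x;\theta)\, dx \\
&= \int_{\mathcal{X}} \frac{\partial}{\partial\theta} p(x;\theta)\, dx \\
&= \frac{\partial}{\partial\theta} \int_{\mathcal{X}} p(x;\theta)\, dx \\
&= 0.
\end{aligned} \tag{6.29}
\]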

The derivatives of the log-likelihood function relate directly to useful concepts in statistical inference. If it exists, the derivative of the log-likelihood
is the relative rate of change, with respect to the parameter placeholder θ, of
the probability density function at a fixed observation. If θ is a scalar, some
positive function of the derivative, such as its square or its absolute value, is
obviously a measure of the effect of change in the parameter, or of change
in the estimate of the parameter. More generally, an outer product of the
derivative with itself is a useful measure of the changes in the components of
the parameter:
\[ \nabla l_L\left(\theta^{(k)}; x\right) \left(\nabla l_L\left(\theta^{(k)}; x\right)\right)^T. \]

Notice that the average of this quantity with respect to the probability density
of the random variable X,
\[ I(\theta_1; X) = E_{\theta_1}\left( \nabla l_L\left(\theta^{(k)}; X\right) \left(\nabla l_L\left(\theta^{(k)}; X\right)\right)^T \right), \tag{6.30} \]
is the information matrix for an observation on X about the parameter θ.


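If θ is a scalar, the expected value of the square of the first derivative is the
negative of the expected value of the second derivative,
\[ E_\theta\left( \left(\frac{\partial}{\partial\theta} l_L(\theta; X)\right)^2 \right) = -E_\theta\left( \frac{\partial^2}{\partial\theta^2} l_L(\theta; X) \right), \]

or, in general,
\[ E_\theta\left( \nabla l_L(\theta; X) \left(\nabla l_L(\theta; X)\right)^T \right) = -E_\theta\left( H_{l_L}(\theta; X) \right). \tag{6.31} \]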
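Both the unbiasedness of the score in (6.29) and the relation between the squared score and the second derivative can be checked exactly in a simple regular family. The sketch below uses n Bernoulli trials with T = Σ x_i and sums over the binomial distribution of T; the check is made in expectation, which is the sense in which the identity holds under the regularity conditions. The function names and the values π = 0.3 and n = 8 are illustrative assumptions.

```python
from math import comb

def score(p, t, n):
    """Derivative of the Bernoulli log-likelihood t*log(p) + (n - t)*log(1 - p)."""
    return t / p - (n - t) / (1.0 - p)

def second_derivative(p, t, n):
    """Second derivative of the same log-likelihood with respect to p."""
    return -t / p ** 2 - (n - t) / (1.0 - p) ** 2

def expect(f, p, n):
    """Exact expectation of f(p, T, n) when T is binomial(n, p)."""
    return sum(comb(n, t) * p ** t * (1.0 - p) ** (n - t) * f(p, t, n)
               for t in range(n + 1))

p, n = 0.3, 8
mean_score = expect(score, p, n)                                    # 0
mean_sq_score = expect(lambda p, t, n: score(p, t, n) ** 2, p, n)   # n / (p(1 - p))
neg_mean_hessian = -expect(second_derivative, p, n)                 # n / (p(1 - p))
```

The last two quantities agree with the Fisher information n/(π(1 − π)), whereas for an individual value of t the squared score and the negative second derivative generally differ.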

MLEs in Exponential Families

If X has a distribution in the exponential class and we write its density in the
natural or canonical form, the likelihood has the form

\[ L(\eta; x) = \exp\left(\eta^T T(x) - \zeta(\eta)\right) h(x). \tag{6.32} \]

The log-likelihood equation is particularly simple:

\[ T(x) - \frac{\partial \zeta(\eta)}{\partial \eta} = 0. \tag{6.33} \]
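Because the Hessian of the log-likelihood is $-\partial^2\zeta(\eta)/\partial\eta(\partial\eta)^T$, Newton’s method for solving the likelihood equation is
\[ \eta^{(k)} = \eta^{(k-1)} + \left( \left.\frac{\partial^2 \zeta(\eta)}{\partial\eta(\partial\eta)^T}\right|_{\eta=\eta^{(k-1)}} \right)^{-1} \left( T(x) - \left.\frac{\partial \zeta(\eta)}{\partial \eta}\right|_{\eta=\eta^{(k-1)}} \right). \]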

Note that the second term includes the Fisher information matrix for η.
(The expectation is constant.) (Note that the FI matrix is not for a distribu-
tion; it is for a parametrization of a distribution.)
We have
\[ V(T(X)) = \left.\frac{\partial^2 \zeta(\eta)}{\partial\eta(\partial\eta)^T}\right|_{\eta=\eta}. \]
Note that the variance is evaluated at the true η (even though in an expression
such as ∂η it must be a variable).
If we have a full-rank member of the exponential class then V is positive
definite, and hence there is a unique maximum.
If we write
\[ \mu(\eta) = \frac{\partial \zeta(\eta)}{\partial \eta}, \]
in the full-rank case, $\mu^{-1}$ exists and so we have the solution to the likelihood
equation:
\[ \hat\eta = \mu^{-1}(T(x)). \tag{6.34} \]
So maximum likelihood estimation is very nice for the exponential class.
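As a minimal sketch of these formulas in one concrete case, consider a Poisson sample written in natural form, with η = log(λ), T(x) = Σ x_i, and ζ(η) = n e^η for a sample of size n. These identifications, the function name, and the data are assumptions made for the illustration. Since the Hessian of the log-likelihood is −∂²ζ/∂η², the Newton update adds (∂²ζ/∂η²)⁻¹(T(x) − ∂ζ/∂η) to the current iterate.

```python
import math

def poisson_natural_mle(xs, eta0=0.0, tol=1e-12, max_iter=100):
    """Newton's method for the natural parameter eta = log(lambda) of a Poisson sample."""
    n, t = len(xs), sum(xs)
    if t <= 0:
        raise ValueError("the MLE of eta does not exist when all counts are zero")
    eta = eta0
    for _ in range(max_iter):
        grad_zeta = n * math.exp(eta)        # d zeta / d eta
        hess_zeta = n * math.exp(eta)        # d^2 zeta / d eta^2 (variance of T)
        step = (t - grad_zeta) / hess_zeta   # Newton step for the likelihood equation (6.33)
        eta += step
        if abs(step) < tol:
            break
    return eta

xs = [2, 0, 3, 5, 1, 4]
eta_hat = poisson_natural_mle(xs)
closed_form = math.log(sum(xs) / len(xs))    # mu^{-1}(T(x)) from equation (6.34)
```

In this family µ(η) = n e^η has an explicit inverse, so the iteration is unnecessary in practice; it is shown only to make the update formula concrete, and it agrees with the closed form from (6.34).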

6.2.4 Finding an MLE

Notice that the problem of obtaining an MLE is a constrained optimization
problem; that is, an objective function is to be optimized subject to the constraints that the solution be within the closure of the parameter space.
In some cases the MLE occurs at a stationary point, which can be identified
by differentiation. That is not always the case, however. A standard example
in which the MLE does not occur at a stationary point is a distribution in
which the range depends on the parameter, and the simplest such distribution
is the uniform U(0, θ), which was the subject of Example 6.5.
In this section, we will discuss some standard methods of maximizing a
likelihood function and also some methods that are useful in more complicated
situations.

Computations

If the log-likelihood is twice differentiable and if the range does not depend
on the parameter, Equation (6.31) is interesting because the second deriva-
tive, or an approximation of it, is used in a Newton-like method to solve the
maximization problem (6.10). Newton’s equation

\[ H_{l_L}\left(\theta^{(k-1)}; x\right) d^{(k)} = \nabla l_L\left(\theta^{(k-1)}; x\right) \tag{6.35} \]

is used to determine the step direction in the kth iteration. A quasi-Newton
method uses a matrix $\widetilde{H}_{l_L}(\theta^{(k-1)})$ in place of the Hessian $H_{l_L}(\theta^{(k-1)})$. (See
notes on optimization in Appendix 0.4.)
