5.5 Applications
Many methods of statistical inference rely on samples of identically distributed
random variables. Two major areas of application of the methods are in anal-
ysis of linear models and sampling of finite populations.
The most common form of linear model is one in which a random variable Y
is the sum of a systematic component that determines its expected value and a
random component that is the value of an underlying unobservable random
variable that has an expected value of 0. The systematic component may be
a function of some additional variables x and parameters θ. If we represent
the underlying unobservable random variable with expectation 0 as ǫ, we have

Y = f(x, θ) + ǫ.    (5.63)
In this setup the mean of the random variable Y is determined by the param-
eter θ and the values of the x variables, which are covariates (also called re-
gressors, carriers, or independent variables). We generally treat the covariates
as fixed variables; that is, whether or not we could also model the covariates
as random variables, in the simplest cases we will use their observed values
without regard to their origin.
Regression Models
The model above is a regression model. In the simplest variation, the observable
random variables are independent and have distributions in the same
location family: P = {P_{f(x,θ),P_ǫ}}. The family P_ǫ of distributions of the
random component may be a parametric family, such as N(0, σ²), or it may
be a nonparametric family. Whatever the other assumptions on P_ǫ, we assume
E(ǫ) = 0.
Linear Models
Y = x^T β + E,    (5.64)

where Y is the observable random variable, x is an observable p-vector of
covariates, β is an unknown and unobservable p-vector of parameters, and E
is an unobservable random variable with E(E) = 0 and V(E) = σ²I. The
parameter space for β is B ⊆ IR^p.
An item of a random sample from this model may be denoted

Y_i = x_i^T β + E_i,    (5.65)
Y = Xβ + E, (5.66)
y ≈ Xb, (5.68)
where y and X are given, b is unknown, and y ∈ IR^n, X ∈ IR^{n×p}, and b ∈ IR^p.
Solving for b in this system is a common problem in linear algebra. It is
one aspect of the statistical problem of fitting the model (5.66), in which we
assume that y is a realization of a random variable Y with E(Y ) = Xβ, but
for the time being we will just consider the algebraic issues in solving, or
“fitting”, the overdetermined system.
Fitting an overdetermined system y ≈ Xb involves a choice of a criterion
for the goodness of the approximation. A common choice is the squared error;
that is, a solution is a vector b that minimizes ||y − Xb||². This follows the
approach to statistical inference discussed in Section 3.2.3. The solution to
the linear algebra problem (5.68) is often called an “estimator” even though
there is no underlying probability distribution.
We define a least squares estimator (LSE) of b or of β in equation (5.66)
as
b* = arg min_{b∈B} ||y − Xb||²,    (5.69)

where ||c|| = ||c||₂ = √(c^T c) = (Σ_{i=1}^p c_i²)^{1/2} for the p-vector c.
A least squares estimator of β may or may not be unique. Whether or not
b∗ is unique,
||y − Xb*||²    (5.70)
is unique. This is because the objective function is convex and bounded below.
The least squares estimator is obtained by direct minimization of
s(b) = ||y − Xb||²
     = y^T y − 2b^T X^T y + b^T X^T Xb.

First of all, we note that s(b) is differentiable, and

∂²s(b)/∂b² = X^T X

is nonnegative definite. We therefore know that at the minimum, we have
the estimating equation

∂s(b)/∂b = 0.    (5.71)
The estimating equation leads to the normal equations:

X^T Xb = X^T y.    (5.72)

The coefficient matrix in these equations has a special form; it is a Gramian
matrix. We may use b* to denote any solution to the normal equations formed
from the linear system y = Xb, that is,

b* = (X^T X)^− X^T y.    (5.73)

Notice that if X is not of full rank, b* is not unique.
A unique solution to these equations is

β̂ = (X^T X)^+ X^T y;    (5.74)

that is, the solution arising from the Moore-Penrose inverse (see page 784).
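The following sketch (not from the text; it uses NumPy and hypothetical simulated data) illustrates equations (5.72)–(5.74) numerically: when X is rank deficient the normal equations have many solutions, but every solution gives the same fitted values and the same minimum value of (5.70).

import numpy as np

rng = np.random.default_rng(0)

# A rank-deficient design: the third column is the sum of the first two,
# so X'X is singular and the normal equations (5.72) have many solutions.
X = rng.normal(size=(20, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])
y = rng.normal(size=20)

# The solution (5.74) based on the Moore-Penrose inverse.
beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y

# Another solution of the normal equations: add an element of the null space of X.
v = np.array([1.0, 1.0, -1.0])                  # X @ v = 0
b_star = beta_hat + 2.5 * v

print(np.allclose(X.T @ X @ b_star, X.T @ y))   # b_star also solves (5.72)
print(np.allclose(X @ b_star, X @ beta_hat))    # same fitted values
print(np.linalg.norm(y - X @ beta_hat))         # the unique minimum of (5.70)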
The mechanical aspects of least squares fitting do not rely on any probability
distributions.
An LSE of β yields LSEs of other quantities. In general, for an estimand
θ that can be expressed as
θ = E(g(Y, β)),    (5.75)

we call θ̂ = g(y, β̂) the LSE of θ. Notice that this definition preserves unbiasedness
if the relationships are linear.
If the quantities in the equations correspond to n observations that follow
the model (5.64), then we form an LSE of l^T β, for given l ∈ IR^p, as

l^T β̂.    (5.76)

While this quantity may not be unique, the quantity

||Y − Xβ̂||² / (n − p)    (5.77)

is unique; it is the LSE of V(ǫ) = σ²; and furthermore, it is unbiased for σ²
(exercise).
Linear U-Estimability
One of the most important questions for statistical inference involves esti-
mating or testing some linear combination of the elements of the parameter
β; for example, we may wish to estimate β1 − β2 or to test the hypothesis
that β1 − β2 = c1 for some constant c1 . In general, we will consider the linear
combination l^T β. Whether or not it makes sense to estimate such a linear
combination depends on whether there is a function of the observable random
variable Y such that

g(E(Y)) = l^T β.    (5.78)

We generally restrict our attention to linear functions of E(Y) and formally
define a linear combination l^T β to be (linearly) U-estimable if and only if there
exists a vector t such that

t^T E(Y) = l^T β    (5.79)

for any β.
It is clear that if X is of full column rank, then l^T β is linearly estimable
for any l. More generally, it is easy to see that l^T β is linearly estimable for
any l ∈ span(X^T). (The t vector in equation (5.79) is just the vector of
coefficients expressing l in terms of the columns of X^T.)
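As a rough numerical illustration (not from the text; the design matrix and the vectors l are hypothetical), one can check whether l lies in span(X^T) by projecting it onto the row space of X, and verify that l^T b does not depend on which solution of the normal equations is used when l^T β is estimable.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])     # rank 2: not every l'beta is estimable
y = rng.normal(size=15)

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y
b_other = beta_hat + np.array([1.0, 1.0, -1.0]) # another solution of the normal equations

def estimable(l, X, tol=1e-8):
    # l'beta is U-estimable iff l lies in span(X'), i.e., projecting l onto
    # the row space of X leaves it unchanged.
    P_row = np.linalg.pinv(X) @ X
    return np.allclose(P_row @ l, l, atol=tol)

l_est = X.T @ np.ones(15)           # in span(X') by construction
l_not = np.array([1.0, 0.0, 0.0])   # not in span(X') for this X

print(estimable(l_est, X), l_est @ beta_hat - l_est @ b_other)  # True, essentially 0
print(estimable(l_not, X), l_not @ beta_hat - l_not @ b_other)  # False, nonzero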
Estimability depends only on the simplest distributional assumption about
the model; that is, that E(ǫ) = 0.
Theorem 5.6
Let Y = Xβ + ǫ, where E(ǫ) = 0. Let l^T β be a linearly estimable function and
let β̂ = (X^T X)^+ X^T Y. Then l^T β̂ is unbiased for l^T β.
Proof.
Because l ∈ span(X^T) = span(X^T X), we can write

l = X^T X t̃,    (5.80)
Gauss-Markov Theorem
V(a^T Y) = V(a^T Y − l^T β̂ + l^T β̂)
         = V(a^T Y − t̃^T X^T Y + l^T β̂)
         = V(a^T Y − t̃^T X^T Y) + V(l^T β̂) + 2Cov(a^T Y − t̃^T X^T Y, t̃^T X^T Y).

Cov(a^T Y − t̃^T X^T Y, l^T β̂) = (a^T − t̃^T X^T) σ²I X t̃
                               = (a^T X − t̃^T X^T X) σ²I t̃
                               = (l^T − l^T) σ²I t̃
                               = 0;

that is,

V(a^T Y) = V(a^T Y − t̃^T X^T Y) + V(l^T β̂).

This implies that

V(a^T Y) ≥ V(l^T β̂);

that is, l^T β̂ has minimum variance among the linear unbiased estimators of
l^T β.
To see that it is unique, we consider the case in which V(a^T Y) = V(l^T β̂);
that is, V(a^T Y − t̃^T X^T Y) = 0. For this variance to equal 0, it must be the
case that a^T − t̃^T X^T = 0 or a^T Y = t̃^T X^T Y = l^T β̂ a.s.; that is, l^T β̂ is the
a.s. unique linear unbiased estimator that achieves the minimum variance.
If we assume further that ǫ ∼ N_n(0, σ²I), we see that l^T β̂ is the uniformly
minimum variance unbiased estimator (UMVUE) for l^T β. This is because
(X^T Y, (Y − Xβ̂)^T(Y − Xβ̂)) is complete and sufficient for (β, σ²). This line
of reasoning also implies that (Y − Xβ̂)^T(Y − Xβ̂)/(n − r), where r = rank(X),
is UMVUE for σ².
Example 5.27 Inadmissibility of the LSE in the Linear Model
The LSE can be inadmissible under squared-error loss: a biased estimator obtained by
regularization can have smaller MSE (see page 252).
Quadratic Forms
Cochran’s Theorems
There are various facts that are sometimes called Cochran’s theorem. The
simplest one concerns k symmetric idempotent n × n matrices, A1 , . . . , Ak
that sum to the identity matrix.
Theorem 5.10 (Cochran’s theorem I)
Let A_1, . . . , A_k be symmetric idempotent n × n matrices such that

I_n = A_1 + · · · + A_k.

Then

A_i A_j = 0 for all i ≠ j.
Proof.
For an arbitrary j, for some orthogonal matrix V, we have

V^T A_j V = diag(I_r, 0),

and so

I_n = V^T I_n V
    = Σ_{i=1}^k V^T A_i V
    = diag(I_r, 0) + Σ_{i≠j} V^T A_i V,

which implies

Σ_{i≠j} V^T A_i V = diag(0, I_{n−r}).

Now for each i, V^T A_i V is idempotent, and because the diagonal elements of
a symmetric idempotent matrix are all nonnegative, this equation implies
that for each i ≠ j, the first r diagonal elements of V^T A_i V are 0. Furthermore,
since these diagonal elements are 0, all elements in the first r rows
and columns are 0. We have, therefore, for each i ≠ j,

V^T A_i V = diag(0, B_i)

and hence

V^T A_i A_j V = (V^T A_i V)(V^T A_j V)
             = diag(0, B_i) diag(I_r, 0)
             = 0.
Because V is nonsingular, this implies the desired conclusion; that is, that
A_i A_j = 0 for any i ≠ j.
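A small numerical illustration of Theorem 5.10 (a sketch, not from the text): symmetric idempotent matrices built from disjoint column blocks of a random orthogonal matrix sum to the identity, and their pairwise products vanish, as the theorem asserts.

import numpy as np

rng = np.random.default_rng(2)
n = 6
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))    # a random orthogonal matrix

# Symmetric idempotent matrices A_i = Q_i Q_i' from disjoint column blocks of Q.
blocks = [Q[:, :1], Q[:, 1:3], Q[:, 3:]]
A = [B @ B.T for B in blocks]

print(np.allclose(sum(A), np.eye(n)))                        # A_1 + A_2 + A_3 = I_n
print(all(np.allclose(Ai @ Ai, Ai) for Ai in A))             # each A_i is idempotent
print(all(np.allclose(A[i] @ A[j], 0.0)
          for i in range(3) for j in range(3) if i != j))    # A_i A_j = 0 for i != j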
We can now extend this result to an idempotent matrix in place of I; that
is, for an idempotent matrix A with A = A1 + · · · + Ak .
Theorem 5.11 (Cochran’s theorem II)
Let A1 , . . . , Ak be n × n symmetric matrices and let
A = A1 + · · · + Ak .
Then any two of the following conditions imply the third one:
(a). A is idempotent.
(b). Ai is idempotent for i = 1, . . . , k.
(c). A_i A_j = 0 for all i ≠ j.
(The theorem also applies to nonsymmetric matrices if condition (c) is augmented
with the requirement that rank(A_i²) = rank(A_i) for all i. We will
restrict our attention to symmetric matrices, however, because in most applications
of these results, the matrices are symmetric.)
Proof.
First, if we assume properties (a) and (b), we can show that property (c)
follows for the special case A = I.
Now, let us assume properties (b) and (c) and show that property (a)
holds. With properties (b) and (c), we have
AA = (A_1 + · · · + A_k)(A_1 + · · · + A_k)
   = Σ_{i=1}^k A_i A_i + Σ_{i≠j} A_i A_j
   = Σ_{i=1}^k A_i
   = A.
and property (b) states that the latter expression is rank(A1 )+· · ·+rank(Ak ),
thus yielding property (d).
There is also a partial converse: properties (a) and (d) imply the other
properties.
One of the most important special cases of Cochran’s theorem is when
A = I in the sum:
In = A1 + · · · + Ak .
The identity matrix is idempotent, so if rank(A1 ) + · · · + rank(Ak ) = n, all
the properties above hold. (See Gentle (2007), pages 283–285.)
In applications of linear models, a quadratic form involving Y is often
partitioned into a sum of quadratic forms. The most important statistical
application of Cochran’s theorem is for the distribution of quadratic forms of
normally distributed random vectors.
Theorem 5.13 (Cochran’s theorem III)
Assume that Y is distributed as N_d(µ, I_d), and for i = 1, . . . , k, let A_i be a d × d
symmetric matrix with rank r_i such that Σ_i A_i = I_d. This yields a partition
of the total sum of squares Y^T Y into k components:

Y^T Y = Y^T A_1 Y + · · · + Y^T A_k Y.
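As a Monte Carlo illustration of the partition (a sketch, not from the text; it takes µ = 0 for simplicity and uses the usual two-term partition), let A_1 = (1/d)J, the projection onto the constant vector, and A_2 = I_d − A_1; by Cochran's theorem the two quadratic forms are independent chi-squared variables with 1 and d − 1 degrees of freedom.

import numpy as np

rng = np.random.default_rng(3)
d, nsim = 5, 200_000
A1 = np.ones((d, d)) / d            # projection onto the constant vector, rank 1
A2 = np.eye(d) - A1                 # rank d - 1; A1 + A2 = I_d

Y = rng.standard_normal((nsim, d))  # mu = 0 assumed for this illustration
q1 = np.einsum('ij,jk,ik->i', Y, A1, Y)    # Y'A1 Y for each simulated Y
q2 = np.einsum('ij,jk,ik->i', Y, A2, Y)    # Y'A2 Y

print(q1.mean(), q2.mean())                # approx 1 and d - 1 (chi-squared means)
print(np.corrcoef(q1, q2)[0, 1])           # approx 0, consistent with independence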
where we assume that E(ǫ_ij) = 0 and V(ǫ_ij) = σ² for all i, j, and Cov(ǫ_ij, ǫ_i′j′) = 0
if i ≠ i′ or j ≠ j′. This can be expressed in the form of the linear
model (5.66), Y = Xβ + E, where β = (µ, α_1, . . . , α_m) and

    ⎡ 1  1  0  · · ·  0 ⎤
    ⎢ ⋮  ⋮  ⋮          ⋮ ⎥
    ⎢ 1  1  0  · · ·  0 ⎥
    ⎢ 1  0  1  · · ·  0 ⎥
    ⎢ ⋮  ⋮  ⋮          ⋮ ⎥
X = ⎢ 1  0  1  · · ·  0 ⎥    (5.89)
    ⎢ ⋮  ⋮  ⋮          ⋮ ⎥
    ⎢ 1  0  0  · · ·  1 ⎥
    ⎢ ⋮  ⋮  ⋮          ⋮ ⎥
    ⎣ 1  0  0  · · ·  1 ⎦
Letting

Ȳ_i = Σ_{j=1}^n Y_ij / n    (5.90)

and

Ȳ = Σ_{i=1}^m Ȳ_i / m,    (5.91)

we form the group sum of squares

SSA = n Σ_{i=1}^m (Ȳ_i − Ȳ)²    (5.92)

and the residual sum of squares

SSE = Σ_{i=1}^m Σ_{j=1}^n (Y_ij − Ȳ_i)².    (5.93)
Both SSA and SSE can be expressed as quadratic forms in matrices similar
to K_{n−1}^T K_{n−1}, where K_{n−1} is given in equation (5.83). This is what you are
asked to do in Exercise 5.7.
and

(1/σ²) SSE ∼ χ²_{m(n−1)}.    (5.96)
(Exercise 5.8.)
The UMVUE of σ² is SSE/(m(n − 1)). Note that the UMVUE of σ² is
the same as the general result given in equation (5.77). (Exercise 5.9.) The
UMVUE is consistent in n for m fixed, and is consistent in m for n fixed.
You are to show this in Exercise 5.10. Compare this with the MLE of σ² in
Example 6.27 in Chapter 6.
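A minimal simulation sketch of the fixed-effects computation (not from the text; the parameter values are hypothetical): generate data from model (5.88), form the group means as in (5.90), SSE as in (5.93), and the UMVUE SSE/(m(n − 1)).

import numpy as np

rng = np.random.default_rng(4)
m, n, sigma = 4, 6, 2.0
alpha = rng.normal(size=m)                   # fixed group effects for this illustration
Y = 10.0 + alpha[:, None] + rng.normal(scale=sigma, size=(m, n))   # model (5.88)

Ybar_i = Y.mean(axis=1)                      # group means, equation (5.90)
SSE = ((Y - Ybar_i[:, None]) ** 2).sum()     # equation (5.93)
print(SSE / (m * (n - 1)))                   # unbiased estimate of sigma^2 = 4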
The model in equation (5.88) is called the one-way AOV model. If the αi
in this model are assumed to be constants, it is called a “fixed-effects model”.
A fixed-effects model is also sometimes called “model I”. Now let’s consider a
variant of this called a “random-effects model” or “model II”, because the αi
in this model are assumed to be iid random variables.
Example 5.31 UMVUEs of the variances in the one-way random-
effects AOV model
Consider the linear model

Y_ij = µ + δ_i + ǫ_ij,  i = 1, . . . , m, j = 1, . . . , n,    (5.97)

where the δ_i are identically distributed with E(δ_i) = 0, V(δ_i) = σ_δ², and
Cov(δ_i, δ_ĩ) = 0 for i ≠ ĩ, and the ǫ_ij are independent of the δ_i and are
identically distributed with E(ǫ_ij) = 0, V(ǫ_ij) = σ_ǫ², and Cov(ǫ_ij, ǫ_ĩj̃) = 0
for either i ≠ ĩ or j ≠ j̃.
An important difference between the random-effects model and the fixed-effects
model is that in the random-effects model, we do not have independence of
the observables. We have
Cov(Y_ij, Y_ĩj̃) = σ_δ² + σ_ǫ²   for i = ĩ, j = j̃,
               = σ_δ²          for i = ĩ, j ≠ j̃,    (5.98)
               = 0             for i ≠ ĩ.
A model such as this may be appropriate when there are a large number
of possible treatments and m of them are chosen randomly and applied to
experimental units whose responses Yij are observed. While in the fixed-effects
model (5.88), we are interested in whether α_1 = · · · = α_m = 0, in the random-effects
model, we are interested in whether σ_δ² = 0, which would result in a
similar practical decision about the treatments.
In the model (5.97) the variance of each Y_ij is σ_δ² + σ_ǫ², and our interest in
using the model is to make inference on the relative sizes of the components of
the variance σ_δ² and σ_ǫ². The model is sometimes called a “variance components
model”.
Let us suppose now that δ_i ∼ iid N(0, σ_δ²), where σ_δ² ≥ 0, and ǫ_ij ∼ iid N(0, σ_ǫ²),
where as usual σ_ǫ² > 0. This will allow us to determine exact sampling distributions
of the relevant statistics.
We transform the model using Helmert matrices Hm and Hn as in equa-
tion (5.82).
Let Y denote the m × n matrix (Y_ij) of observations, let δ = (δ_1, . . . , δ_m)^T,
and let ǫ denote the m × n matrix (ǫ_ij).
Y = δ1_n^T + ǫ.
Z = H_m Y H_n^T,

δ̃ = H_m δ,

and

ǫ̃ = H_m ǫ H_n^T.
We first of all note that the transformations are all nonsingular and

Z = H_m δ1_n^T H_n^T + ǫ̃,

and

Z_ij ∼ iid N(0, σ_ǫ²),  for i = 1, . . . , m;  j = 2, . . . , n.
To continue with the analysis, we follow the same steps as in Example 5.28,
and get the same decomposition of the “adjusted total sum of squares” as in
equation (5.94):
Σ_{i=1}^m Σ_{j=1}^n (Z_ij − Z̄)² = SSA + SSE.    (5.99)
and

s_ǫ² = MSE,    (5.101)

and we can also see that these are UMVUEs.
Now we note something that might at first glance be surprising: s_δ² in
equation (5.100) may be negative. This occurs if (m − 1)MSA/m < MSE.
This will be the case if the variation among the Y_ij for a fixed i is relatively large
compared to the variation among the Ȳ_i (or similarly, if the variation among the Z_ij
for a fixed i is relatively large compared to the variation among the Z̄_i).
Compare this with the MLEs in Example 6.29 in Chapter 6.
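The possibility of a negative estimate is easy to reproduce numerically. The sketch below is not from the text; because equation (5.100) is not reproduced above, it assumes the form s_δ² = ((m − 1)MSA/m − MSE)/n, which matches the negativity condition just stated, and it uses hypothetical variance components chosen to make a negative value likely.

import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 4
sigma_delta, sigma_eps = 0.1, 3.0         # tiny between-group variance (hypothetical)
delta = rng.normal(scale=sigma_delta, size=m)
Y = 10.0 + delta[:, None] + rng.normal(scale=sigma_eps, size=(m, n))

Ybar_i = Y.mean(axis=1)
Ybar = Ybar_i.mean()
MSA = n * ((Ybar_i - Ybar) ** 2).sum() / (m - 1)
MSE = ((Y - Ybar_i[:, None]) ** 2).sum() / (m * (n - 1))

s2_eps = MSE                                    # equation (5.101)
s2_delta = ((m - 1) * MSA / m - MSE) / n        # assumed form of (5.100); can be negative
print(s2_eps, s2_delta)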
Ŷ_0 = β̂^T x_0
    = ((X^T X)^+ X^T y)^T x_0.    (5.102)

If x_0 ∈ span(X^T), then from Theorem 5.7, (b*)^T x_0 = β̂^T x_0, so in this case the
predicted value of Y is invariant to the choice of the generalized inverse.
In the model (5.66) corresponding to a set of n observations on the
model (5.64), we have predicted values of the response Y at each of the rows of
X:

Ŷ = Xβ̂
  = X(X^T X)^+ X^T Y.    (5.103)

From equation (3.42), we see that this has the minimum MSE of any function
of X.
The idempotent projection matrix X(X^T X)^+ X^T is called the “hat matrix”
because given Y, it provides Ŷ. (See page 795 for properties of projection
matrices.)
We see from Definition 1.46 page 116 that Ŷ is the projection of Y onto
the column space of X. (This is a slightly different meaning of the word
“projection”, but obviously the meanings are related.) From Theorem 1.64
we see that the “residual vector” Y − Ŷ is orthogonal to the columns of X;
that is, Cov(Y − Ŷ, x) = 0 for any column x of X, and since Ŷ is a linear
combination of the columns of X, Cov(Y − Ŷ, Ŷ) = 0. If we assume a normal
distribution for ǫ, then zero covariance implies independence.
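A brief numerical check of these projection properties (a sketch with simulated data, not from the text):

import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.pinv(X.T @ X) @ X.T     # the hat matrix of (5.103)
Y_hat = H @ Y
resid = Y - Y_hat

print(np.allclose(H @ H, H))              # idempotent projection matrix
print(np.max(np.abs(X.T @ resid)))        # residuals orthogonal to the columns of X
print(abs(resid @ Y_hat))                 # hence orthogonal to the fitted values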
it from other areas of applied statistics. Särndal et al. (1997) provide a general
coverage of the theory and methods. Valliant et al. (2000) provide a different
perspective on some of the particular issues of inference in finite populations.
Finite Populations
Estimation
is unbiased for Y :
E(Ŷ) = (N/n) Σ_{i=1}^N y_i E(I_i)
     = Σ_{i=1}^N y_i.
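A simulation sketch of this unbiasedness under simple random sampling without replacement (not from the text; the population values and sample size are hypothetical):

import numpy as np

rng = np.random.default_rng(7)
N, n = 100, 10
y = rng.gamma(2.0, 5.0, size=N)               # a fixed finite population
Y_total = y.sum()

nsim = 20_000
est = np.empty(nsim)
for k in range(nsim):
    s = rng.choice(N, size=n, replace=False)  # simple random sample without replacement
    est[k] = (N / n) * y[s].sum()             # the expansion estimator

print(Y_total, est.mean())                    # Monte Carlo mean is close to the total
print(est.var(ddof=1))                        # its sampling variance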
V(I_i) = (n/N)(1 − n/N)

(it's Bernoulli), and for i ≠ j,
Horvitz-Thompson Estimation
(Exercise 5.12). Expressions for other sampling estimators are often shown in
a similar manner.
An important approximation for working out variances of more compli-
cated sampling estimators is linearization, especially when the estimator in-
volves a ratio.
Unbiasedness
The property of unbiasedness for point estimators was given a solid but pre-
liminary treatment by Halmos (1946).
Unbiasedness has a heuristic appeal, although people will sometimes ques-
tion its relevance by pointing out that it provides no guarantee of the goodness
of an estimator in a single set of data. That argument, however, could apply to
most measures of the quality of an estimator. Similar types of arguments could
bring into question any consideration of asymptotic properties of statistical
procedures.
Unbiasedness is particularly useful when the loss is squared-error, because
in that case unbiasedness may lead to uniformly minimum risk estimators. For
absolute-error loss functions, a corresponding approach would be to require
median unbiasedness.
U-Statistics
Exercises
5.1. Show that the estimator (5.3) in Example 5.1 is the UMVUE of π. (Note
that there are three things to show: (t − 1)/(N − 1) is unbiased, it has
minimum variance among all unbiased estimators, and it is unique —
“the” implies uniqueness.)
5.2. Consider the problem of using a sample of size 1 for estimating g(θ) = e^{−3θ},
where θ is the parameter in a Poisson distribution.
a) Show that T(X) = (−2)^X is unbiased for g(θ).
b) Show that T(X) = (−2)^X is a UMVUE of g(θ).
c) What is wrong with this estimator?
5.3. Show that the estimators (5.11) and (5.7) are the same.
5.4. Show that the h(T )s in Example 5.6 are unbiased for the g(θ)s given.
5.5. Define an alternative kernel for a U-statistic that is unbiased for the covari-
ance in Example 5.19; that is, instead of the kernel in equation (5.37),
give a kernel similar to that in equation (5.46). Show that the resulting
U-statistic is unbiased for the covariance.
5.6. In the setup of model (5.64), show that the LSE ||Y − Xβ̂||²/(n − p) is
unbiased for σ².
5.7. Let X_ij = µ + α_i + ǫ_ij, i = 1, . . . , m, j = 1, . . . , n, where the α_i's and ǫ_ij's
are independent random variables, α_i ∼ N(0, σ_α²), ǫ_ij ∼ N(0, σ_ǫ²), and µ,
σ_α², and σ_ǫ² are unknown parameters. Let

X̄_i = Σ_{j=1}^n X_ij / n,

X̄ = Σ_{i=1}^m X̄_i / m,

MSA = n Σ_{i=1}^m (X̄_i − X̄)² / (m − 1),

and

MSE = Σ_{i=1}^m Σ_{j=1}^n (X_ij − X̄_i)² / (m(n − 1)).

Express MSA and MSE as quadratic forms using parts of Helmert matrices
and use Cochran's theorem to show that they are independent.
5.8. Show that the quantities in expressions (5.95) and (5.96) have the chi-squared
distributions claimed.
5.9. Show that the UMVUE of σ², SSE/(m(n − 1)), given in Example 5.28
is the same as the UMVUE of σ² for the general linear model given in
equation (5.77).
Hint: Write the model given in equation (5.88) in the form of the general
linear model in equation (5.67).
5.10. Suppose X_ij ∼ iid N(µ_i, σ²) for i = 1, . . . , m and j = 1, . . . , n. (Compare the
one-way AOV model of Examples 5.28, 5.29, and 5.30.)
a) Determine the UMVUE T_mn(X) of σ².
b) Show that T_mn(X) is consistent in mean squared error for σ² as m → ∞
and n remains fixed.
c) Show that T_mn(X) is consistent in mean squared error for σ² as n → ∞
and m remains fixed.
5.11. Show that the sample variance S² is the UMVUE of σ² in equation (5.105)
of Example 5.32. Hence, determine the UMVUE of V(Ŷ).
5.12. Show that the variance of the Horvitz-Thompson estimator is as shown in
equation (5.107), for given πi and πij . This is tedious, but it requires very
little other than “advanced arithmetic” and simple properties of variances
of sums.
6 Statistical Inference Based on Likelihood
The differences between a likelihood and a PDF are illustrated clearly in Example 1.5
on page 20. A likelihood is neither a probability nor a probability density.
Notice, for example, that while the definite integrals over IR+ of both PDFs
in Example 1.5 are 1, the definite integrals over IR+ of the likelihood (1.21)
in Example 1.5 are not the same, as we can easily see from the plots on the
right side of Figure 1.2.
It is not appropriate to refer to the “likelihood of an observation”. We use
the term “likelihood” in the sense of the likelihood of a model or the likelihood
of a distribution given observations.
Likelihood Principle
L_NB(π) = \binom{n_0 − 1}{t_0 − 1} π^{t_0} (1 − π)^9.    (6.6)
Because LB (π)/LNB (π) does not involve π, the maxima of the likelihoods
will occur at the same point. A maximum likelihood estimator of π based on
a binomial observation of t0 out of n0 is the same as a maximum likelihood
estimator of π based on a negative binomial observation of n0 for t0 1’s because
the maximum of the likelihood occurs at the same place, t0 /n0 . The estimators
conform to the likelihood principle. Recall that the UMVU estimators are
different. (Example 5.1 and follow-up in Example 5.5 and Exercise 5.1.)
Let us assume a parametric model; that is, a family of densities P = {p(x ; θ)}
where θ ∈ Θ, a known parameter space.
For a sample X1 , . . . , Xn from a distribution with probability density
p(x ; θ), we write the likelihood function as a function of a variable in place
of the parameter:
L(t ; x) = ∏_{i=1}^n p(x_i ; t).    (6.7)
Note the reversal in roles of variables and parameters. While I really like to
write the likelihood as a function of a variable other than the parameter, which
I think of as fixed, I usually write it like everyone else; that is, I write
L(θ ; x) = ∏_{i=1}^n p(x_i ; θ).
In the likelihood function the data, that is, the realizations of the vari-
ables in the density function, are considered as fixed, and the parameters are
considered as variables of the optimization problem,
max_θ L(θ ; x).    (6.8)
For given x, the relative values of L(θ ; x) are important. For given x1 and
x2 , the relative values of L(θ ; x1 ) and L(θ ; x2 ) are not relevant. Notice in
Example 1.5, while L(θ ; 5) ≤ L(θ ; 1) for all θ, max L(θ ; 5) occurs at θ = 5,
and max L(θ ; 1) occurs at θ = 1. Notice also in Example 6.1, while LB (π)
in equation (6.5) is uniformly less than LNB (π) in equation (6.6), they both
achieve their maximum at the same point, π = 1/4.
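A quick numerical sketch (not from the text; it assumes the illustrative values t_0 = 3 and n_0 = 12, which are consistent with the maximum at π = 1/4 mentioned above) shows that the two likelihoods differ only by a constant factor and are maximized at the same point.

import numpy as np
from scipy.special import comb
from scipy.optimize import minimize_scalar

t0, n0 = 3, 12          # hypothetical counts with t0/n0 = 1/4

LB  = lambda p: comb(n0, t0) * p**t0 * (1 - p)**(n0 - t0)            # binomial
LNB = lambda p: comb(n0 - 1, t0 - 1) * p**t0 * (1 - p)**(n0 - t0)    # negative binomial

pb  = minimize_scalar(lambda p: -LB(p),  bounds=(1e-6, 1 - 1e-6), method='bounded').x
pnb = minimize_scalar(lambda p: -LNB(p), bounds=(1e-6, 1 - 1e-6), method='bounded').x
print(pb, pnb)                                  # both approximately 0.25
print(LB(0.3) / LNB(0.3), LB(0.7) / LNB(0.7))   # the ratio does not depend on p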
While I like to use the “hat” notation to mean an MLE, I also sometimes use
it to mean any estimate or estimator.
The estimate (or estimator) θb is a Borel function of the observations or of
the random variables.
We use “MLE” to denote either a maximum likelihood estimate or estima-
tor, or to denote the method of maximum likelihood estimation. The proper
meaning can be determined from the context. If the term MLE is used in a
statement about a maximum likelihood estimate or estimator, the statement
can be assumed to apply to both the estimate and the estimator.
If θ̂ in (6.9) exists, we also have
that is, the MLE can be identified either from the likelihood function or from
the log-likelihood.
( −nθ^{−n−1} e^{−Σ_{i=1}^n x_i/θ} + θ^{−n−2} (Σ_{i=1}^n x_i) e^{−Σ_{i=1}^n x_i/θ} ) I_{IR+}(θ).
[Figure: the likelihood L(θ ; x) plotted against θ for each of the two samples.]
The fact that the likelihood has a sharper peak is in agreement with our
expectation that the estimate should be “better” if we have a larger sample.
but actually we only have values for the x_i < t_c, where t_c is some fixed
and known value. It might be called the “censoring time”. This setup yields
censored data; in particular, it is right censored data, because the larger values
are censored. Suppose that r observations, t_1, . . . , t_r, are obtained, leaving n − r
unobserved values of the potential sample. In this setup, the time t_c is fixed,
and so r is a random variable. We could also contemplate an experimental
setup in which r is chosen in advance, and so the censoring time tc is a
random variable. (These two data-generating processes are similar to the two
experiments we described for Bernoulli data in Example 3.12, and to which we
have alluded in other examples.) The first method is called “Type I censoring”
(upper bound on the observation fixed) and the other method is called “Type
II censoring” (fixed number of observed values to be taken).
Censoring is different from a situation in which the distribution is trun-
cated, as in Exercise 2.14 on page 203.
For right censored data with n, r, and t_c as described above from any
distribution with PDF f(x ; θ) and CDF F(x ; θ), the likelihood function is

L(θ ; x) = ∏_{i=1}^r f(t_i ; θ) (1 − F(t_c ; θ))^{n−r}.
We may note in passing that the likelihood is the same for type I and type
II censoring, just as we saw it to be in the binomial and negative binomial
distributions arising from Bernoulli data in Example 3.12.
Now, for the case where the distribution is exponential with parameter θ,
we have the likelihood function

L(θ ; x) = (1/θ^r) e^{−Σ_{i=1}^r t_i/θ} e^{−(n−r)t_c/θ}.

The maximum, which we can find by differentiation, occurs at

θ̂ = T/r,

where T = Σ_{i=1}^r t_i + (n − r)t_c is called the “total time on test”.
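A minimal simulation sketch of Type I censoring (not from the text; the values of θ, n, and t_c are hypothetical):

import numpy as np

rng = np.random.default_rng(8)
theta_true, n, tc = 5.0, 50, 4.0            # Type I censoring at time tc
x = rng.exponential(theta_true, size=n)

t = x[x < tc]                               # the r observed (uncensored) values
r = t.size
T = t.sum() + (n - r) * tc                  # the total time on test
print(r, T / r)                             # the MLE theta_hat = T / r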
MLE in ∂Θ
where S(θ) is open. In this case, the likelihood has the form
where R(x) is open. It is quite possible that sup L(θ ; x) will occur on R̄(x) − R(x).
Example 6.5 MLE in U(0, θ); closed support
Consider X_1, . . . , X_n ∼ iid U(0, θ), with θ ∈ Θ = IR+. The PDF is

p_X(x) = (1/θ) I_{[0,θ]}(x).    (6.17)

The likelihood is

L(θ ; x) = (1/θ^n) I_{[x_{(n)},∞[}(θ).    (6.18)
The MLE is easily seen to be θ̂ = x_{(n)}. In Example 5.8, we saw that the
UMVUE of θ is (1 + 1/n)x_{(n)}.
Suppose we take the support to be the open interval ]0, θ[. (Despite Def-
inition 1.12, such a support is often assumed.) The likelihood function then
is
L(θ ; x) = (1/θ^n) I_{]x_{(n)},∞[}(θ).
This is discontinuous and it does not have a maximum, as we see in Figure 6.2.
[Figure 6.2: the likelihood L(θ ; x) as a function of θ; it is positive only on the open interval ]x_{(n)}, ∞[.]
In this case the maximum of the likelihood does not exist, but the supremum
of the likelihood occurs at x_{(n)} and it is finite. We would like to call x_{(n)}
the MLE of θ.
We can reasonably do this by modifying the definition of the family of
distributions by adding a zero-probability set to the support. We redefine the
family in equation (6.15) to have the Lebesgue PDF
p_X(x) = (1/θ) I_{[0,θ]}(x).    (6.19)
Now, the open interval ]x_{(n)}, ∞[ where the likelihood was positive before becomes
a half-closed interval [x_{(n)}, ∞[, and the maximum of the likelihood
occurs at x_{(n)}.
This is one reason why we define the support to be closed.
This approach is cleaner than solving the logical problem by defining the
MLE in terms of the sup rather than the max. A definition in terms of the sup
may not address problems that could arise due to various types of discontinuity
of L(θ ; x) at the boundary of S(θ).
It is usually more difficult to determine the MLE of more than one parameter.
The likelihood equation in that case is a system of equations. Also, of course,
the likelihood equation, whether a single equation or a system, may not be
easy to solve, as the following example shows.
Example 6.6 MLE of the parameters in a gamma distribution
Consider the gamma family of distributions with parameters α and β. Given
a random sample x1 , . . . , xn , the log-likelihood of α and β is
l_L(α, β ; x) = −nα log(β) − n log(Γ(α)) + (α − 1) Σ log(x_i) − (1/β) Σ x_i.    (6.20)
This yields the likelihood equations
−n log(β) − n Γ′(α)/Γ(α) + Σ log(x_i) = 0    (6.21)

and

−nα/β + (1/β²) Σ x_i = 0.    (6.22)
Checking the Hessian at a root of the likelihood equations, we see that the
root is a local maximizer.
At the solution we have
β̂ = Σ x_i / (nα̂)    (6.23)
and
log(α̂) − Γ′(α̂)/Γ(α̂) + Σ log(x_i)/n − log(Σ x_i/n) = 0.    (6.24)
There is no closed form solution. A numerical method must be used; see
Example 6.14.
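One possible numerical approach (a sketch, not the method of Example 6.14, which is not shown here) is to solve equation (6.24) for α̂ by a one-dimensional root finder and then obtain β̂ from (6.23). The data below are simulated for illustration.

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(9)
x = rng.gamma(shape=3.0, scale=2.0, size=500)     # hypothetical data

c = np.mean(np.log(x)) - np.log(np.mean(x))       # a negative constant, by Jensen's inequality
g = lambda a: np.log(a) - digamma(a) + c          # the left-hand side of (6.24)

alpha_hat = brentq(g, 1e-6, 1e6)                  # g decreases from +inf to c < 0
beta_hat = np.mean(x) / alpha_hat                 # equation (6.23)
print(alpha_hat, beta_hat)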
Sometimes in multiple-parameter models, the parameters are functionally
independent and the optimization can be performed on different parts of the
separable likelihood function. This is the case for a normal distribution, as we
see in Example 6.25.
Example 6.7 MLE in the exponential family with range dependency
Consider the two-parameter exponential family, that is, a shifted version of the
exponential family of distributions. This family is the subject of Example 5.9
on page 397. The Lebesgue PDF is
and
T_θ = Σ (X_i − X_{(1)}) / (n − 1).
(Recall that we find a complete sufficient statistic and then manipulate it to
be unbiased.) Notice the similarity of these to the MLEs, which are biased.
Theorem 6.1
If there is a sufficient statistic and an MLE exists, then an MLE is a function
of the sufficient statistic.
Proof.
This follows directly from the factorization theorem.
Equivariance of MLEs
in such a way that g(θ) is one-to-one. The function h is not unique, but g^{−1}
is unique; the likelihood is well-defined; g(θ̂) is an MLE of g(θ); and so g̃(θ̂)
is an MLE of g̃(θ). Compare this with the results of Theorem 6.3 above.
Example 6.8 MLE of the variance in a Bernoulli distribution
Consider the Bernoulli family of distributions with parameter π. The vari-
ance of a Bernoulli distribution is g(π) = π(1 − π). Given a random sample
x1 , . . . , xn, the MLE of π is
π̂ = Σ_{i=1}^n x_i / n,
π̂ = t/n.
To make inferences on π using π̂ we need the variance V(π̂). Under the bino-
mial distribution, we need the variance of T /n, which is π(1 − π)/n, whose
MLE as in Example 6.8 is π̂(1 − π̂)/n. Under the negative binomial distri-
bution, we need the variance of t/N. The variance of N is t(1 − π)/π² and
its MLE is the same with π̂ in place of π. The variance of t/N cannot be
expressed in closed form. (See Stephan (1945).)
Although we have seen in equations (6.5) and (6.6) that the ratio of the
likelihoods does not involve π and the MLEs based on the two data-generating
processes conform to the likelihood principle, the variances of the MLEs are
different.
This likelihood function is not informative, and could not be used, for example,
for estimating θ = θ_1 + · · · + θ_N. (There is a pretty good estimator of θ;
it is N(Σ y_i)/n.)
There are other interesting examples in which MLEs do not have desirable
(or expected) properties.
• An MLE may be discontinuous in the data. This is obviously the case for
a discrete distribution, but it can also occur in a contaminated continuous
distribution as, for example, in the case of the ǫ-mixture distribution family
with CDF

P_{x_c,ǫ}(x) = (1 − ǫ)P(x) + ǫ I_{[x_c,∞[}(x),    (6.27)

where 0 ≤ ǫ ≤ 1.
• An MLE may not be a function of a sufficient statistic (if the MLE is not
unique).
• An MLE may not satisfy the likelihood equation as, for example, when the
likelihood function is not differentiable at its maximum, as in Example 6.5.
• The likelihood equation may have a unique root, yet no MLE exists. While
there are examples in which the roots of the likelihood equations occur
at minima of the likelihood, this situation does not arise in any realistic
distribution (that I am aware of). Romano and Siegel (1986) construct a
location family of distributions with support on
IR − {x1 + θ, x2 + θ : x1 < x2 },
Nonuniqueness
There are many cases in which the MLEs are not unique (and I’m not just
referring to RLEs). The following examples illustrate this.
Example 6.11 likelihood in a Cauchy family
Consider the Cauchy distribution with location parameter θ. The likelihood
equation is
Σ_{i=1}^n 2(x_i − θ) / (1 + (x_i − θ)²) = 0.
This may have multiple roots (depending on the sample), and so the one
yielding the maximum would be the MLE. Depending on the sample, however,
multiple roots can yield the same value of the likelihood function.
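A sketch of the multiple-root phenomenon (not from the text; the sample values are hypothetical and widely separated): scan the score function over a grid, locate its sign changes, and pick the root with the largest log-likelihood.

import numpy as np

x = np.array([-5.0, 0.0, 10.0])                 # hypothetical, widely separated sample

theta = np.linspace(-15.0, 20.0, 70001)
score = np.sum(2 * (x[:, None] - theta) / (1 + (x[:, None] - theta) ** 2), axis=0)

# Roots of the likelihood equation correspond to sign changes of the score.
roots = theta[:-1][np.sign(score[:-1]) != np.sign(score[1:])]
loglik = lambda t: -np.sum(np.log(1 + (x - t) ** 2))
print(roots)                                        # several roots (local maxima and minima)
print(roots[np.argmax([loglik(r) for r in roots])]) # the root giving the largest likelihood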
Another example in which the MLE is not unique is U(θ − 1/2, θ + 1/2).
Example 6.12 likelihood in a uniform family with fixed range
Given the sample x1 , . . . , xn, the likelihood function for U(θ − 1/2, θ + 1/2)
is
I_{[x_{(n)}−1/2, x_{(1)}+1/2]}(θ).

It is maximized at any value between x_{(n)} − 1/2 and x_{(1)} + 1/2.
l_L(µ, σ² ; x) = −(1/2) log(2πσ²) − (x − µ)²/(2σ²),
which is unbounded when µ = x and σ² approaches zero. It is therefore clear
that no MLE of σ² exists. Strictly speaking, we could also say that no MLE
of µ exists either; however, for any fixed value of σ² in the (open) parameter
space, µ = x maximizes the likelihood, so it is reasonable to call x the MLE
of µ.
Recall from Example 5.14 that the degree of the variance functional is 2.
In this case, some people prefer to say that the likelihood function does not
exist; that is, they suggest that the definition of a likelihood function include
boundedness.
In the regular case satisfying the three Fisher information regularity conditions
(see page 168), the likelihood function and consequently the log-likelihood
are twice differentiable within Θ◦ , and the operations of differentiation and
integration can be interchanged. In this case, the score estimating function is
unbiased (see Definition 3.7):
E_θ(s_n(θ ; X)) = ∫_X (∂/∂θ) l_L(θ ; x) p(x ; θ) dx
                = ∫_X (∂/∂θ) p(x ; θ) dx
                = (∂/∂θ) ∫_X p(x ; θ) dx
                = 0.    (6.29)
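A Monte Carlo check of (6.29) (a sketch, not from the text), using the exponential distribution with mean θ, for which the score of a sample is s_n(θ ; x) = −n/θ + Σ x_i/θ²:

import numpy as np

rng = np.random.default_rng(11)
theta, n, nsim = 2.0, 10, 200_000             # hypothetical values

x = rng.exponential(theta, size=(nsim, n))
scores = -n / theta + x.sum(axis=1) / theta**2    # score evaluated at the true theta

print(scores.mean())     # approximately 0, as in (6.29)
print(scores.var())      # approximately the Fisher information n / theta^2 = 2.5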
Notice that the average of this quantity with respect to the probability density
of the random variable X,

I(θ_1 ; X) = E_{θ_1}( ∇l_L(θ^{(k)} ; X) (∇l_L(θ^{(k)} ; X))^T ),    (6.30)

or, in general,

∇l_L(θ^{(k)} ; x) (∇l_L(θ^{(k)} ; x))^T ≈ −H_{l_L}(θ^{(k)} ; x).    (6.31)
If X has a distribution in the exponential class and we write its density in the
natural or canonical form, the likelihood equation is

T(x) − ∂ζ(η)/∂η = 0.    (6.33)

Newton's method for solving the likelihood equation is

η^{(k)} = η^{(k−1)} + ( ∂²ζ(η)/∂η(∂η)^T |_{η=η^{(k−1)}} )^{−1} ( T(x) − ∂ζ(η)/∂η |_{η=η^{(k−1)}} ).
Note that the second term includes the Fisher information matrix for η.
(The expectation is constant.) (Note that the FI matrix is not for a distribu-
tion; it is for a parametrization of a distribution.)
We have

V(T(X)) = ∂²ζ(η) / ∂η(∂η)^T,

evaluated at the true value of η.
Note that the variance is evaluated at the true η (even though in an expression
such as ∂η it must be a variable).
If we have a full-rank member of the exponential class then V is positive
definite, and hence there is a unique maximum.
If we write

µ(η) = ∂ζ(η)/∂η,

then in the full-rank case µ^{−1} exists and so we have the solution to the likelihood
equation:

η̂ = µ^{−1}(T(x)).    (6.34)
So maximum likelihood estimation is very nice for the exponential class.
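As a sketch of these formulas (not from the text), consider the Poisson distribution in canonical form, with η = log λ, T(x) = Σ x_i, and ζ(η) = n e^η for a sample of size n; Newton's iteration converges to the closed-form solution η̂ = µ^{−1}(T(x)) = log x̄.

import numpy as np

rng = np.random.default_rng(10)
x = rng.poisson(4.0, size=200)                # hypothetical data

n, T = x.size, x.sum()
zeta_prime  = lambda eta: n * np.exp(eta)     # d zeta / d eta
zeta_second = lambda eta: n * np.exp(eta)     # d^2 zeta / d eta^2

eta = 0.0                                     # starting value
for _ in range(25):
    eta += (T - zeta_prime(eta)) / zeta_second(eta)   # Newton step for (6.33)

print(eta, np.log(x.mean()))                  # agrees with mu^{-1}(T(x)) = log(xbar)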
Computations
If the log-likelihood is twice differentiable and if the range does not depend
on the parameter, Equation (6.31) is interesting because the second deriva-
tive, or an approximation of it, is used in a Newton-like method to solve the
maximization problem (6.10). Newton’s equation