Unesco - Eolss Sample Chapters: Statistical Parameter Estimation
Contents
1. Fundamental Concepts
1.1. Parameter and Estimator
1.2. Mean Squared Error
1.3. Loss and Risk
1.4. Sufficient Statistic
1.5. Likelihood Function
1.6. Distributional Classes
1.6.1. (Log-) Location-Scale-Families
1.6.2. Exponential Family
2. Optimality Properties
2.1. Unbiasedness
2.2. Consistency
2.3. Admissibility
2.4. Minimum Variance Bound
Summary
The estimation of the parameters of a statistical model is one of the fundamental issues in statistics. Choosing an appropriate estimator, that is, one that is ‘best’ in one respect or another, is an important task; hence several optimality criteria are considered first. In practice, however, constructive methods of parameter estimation are needed. Some of the most frequently used methods are considered: the method of moments, linear estimation methods, and, in some detail, the most important one, the method of maximum likelihood. Finally, the closely related problem of interval estimation is considered.
1. Fundamental Concepts
All estimation procedures are based on a random sample, X1,…,Xn, from a random variable X. Let f(x | θ) denote the probability mass function (pmf) if X is discrete, or the probability density function (pdf) if X is continuous, where the form of the pmf or pdf is known but the parameter vector (parameter for short) θ = (θ1,…,θk) is unknown. We call the set of possible values for the parameter θ the parameter space Θ, a subset of ℝ^k.
Remark: As there are formally only slight (and quite obvious) differences between the
discrete and the continuous cases, we focus on the latter for simplicity.
A statistic is a function of the sample alone and does not depend on any (unknown) parameters. The sample mean X̄ and the sample variance S²,

\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i , \qquad
S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \bigl(X_i - \bar{X}\bigr)^2 , \qquad (1)
\]
are well known examples of statistics. Our objective is to find statistics which will serve as estimators for the unknown parameter θ, or more generally for certain functions τ(θ) of the parameters. Thus a very broad definition of an ‘estimator’ is the following: any statistic T(X1,…,Xn) whose values are used to estimate τ(θ) is called an estimator of τ(θ).
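As a minimal numerical illustration of these notions, the following Python sketch computes the two statistics of Eq. (1) for a simulated normal sample and uses them as estimators of the unknown mean and variance; the sample size and the ‘true’ parameter values are arbitrary choices made only for this example.

# Minimal sketch: the sample mean and sample variance of Eq. (1) used as
# estimators of the unknown parameters of a normal model.  The values
# mu_true = 2.0, sigma2_true = 4.0 and n = 50 are arbitrary choices.
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma2_true, n = 2.0, 4.0, 50

x = rng.normal(mu_true, np.sqrt(sigma2_true), size=n)  # random sample X_1,...,X_n

x_bar = x.sum() / n                                     # sample mean, Eq. (1)
s2 = ((x - x_bar) ** 2).sum() / (n - 1)                 # sample variance, Eq. (1)

print("estimate of the mean:    ", x_bar)
print("estimate of the variance:", s2)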
Instead of the mean squared error (MSE) we could also consider an average measure of closeness like the mean absolute deviation (MAD).
The MSE summarizes two properties of an estimator, its ‘precision’ and its ‘accuracy’,
two important concepts in practical applications. By some simple transformations the
MSE of an estimator T can be written as follows,
\[
\mathrm{MSE}_T(\theta) = \operatorname{Var}\bigl(T(\mathbf{X})\bigr)
+ \bigl[\,\tau(\theta) - E_\theta\bigl(T(\mathbf{X})\bigr)\bigr]^2 , \qquad (5)
\]
where Var(T(X)) denotes the variance of T(X). The standard deviation, √Var(T), is a measure of the precision of an estimator (the smaller the variance, the greater the precision), that is, a measure of its performance; the square root of the second term, |τ(θ) − E_θ(T(X))| (not to be confused with the MAD), is a measure of how accurate the estimator is, that is, how large on average the error systematically introduced by using T is.
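To make the decomposition in Eq. (5) concrete, the following sketch estimates variance, squared bias, and MSE by simulation for two estimators of the variance in a normal model: the sample variance with divisor n − 1 and its counterpart with divisor n; all numerical settings are arbitrary illustrative choices.

# Monte Carlo check of the decomposition MSE = variance + bias^2, Eq. (5),
# for two estimators of sigma^2; the parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, n_rep = 0.0, 1.0, 10, 100_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(n_rep, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # divisor n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divisor n

for name, est in [("divisor n-1", s2_unbiased), ("divisor n", s2_biased)]:
    variance = est.var()                    # Var(T)
    bias2 = (est.mean() - sigma2) ** 2      # squared bias
    mse = ((est - sigma2) ** 2).mean()      # direct MSE estimate
    print(f"{name}: Var={variance:.4f}  bias^2={bias2:.4f}  "
          f"Var+bias^2={variance + bias2:.4f}  MSE={mse:.4f}")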
Though not stated explicitly, associated with all these measures is a certain concept of
‘loss’; the MSE, for instance, penalizes the deviations of an estimator from its objective
function quadratically.
When decisions are made in the presence of uncertainty, a certain kind of loss, L(θ, T(x)), will be incurred, namely the ‘loss’ incurred when the actual ‘state of nature’ is θ but T(x) is taken as the estimate of θ. Frequently it will be difficult to determine the actual loss function L over a whole region of interest (there are some rational procedures, however), so it is customary to analyse the decision problem using some ‘standard’ loss functions. For estimation problems usually two loss functions are considered, the squared error loss and the linear loss.
The squared error loss is defined as (with Q a known positive definite k × k matrix)

\[
L(\theta, a) = (\theta - a)\, Q\, (\theta - a)^{\mathsf{T}} , \qquad a \in \mathbb{R}^k . \qquad (6)
\]
Frequently it will not be unreasonable to assume that the loss function is approximately linear (at least piecewise); for a one-dimensional parameter the linear loss can be written as (K0 and K1 are two known constants)

\[
L(\theta, a) =
\begin{cases}
K_0\,(\theta - a) & \text{if } a \le \theta ,\\
K_1\,(a - \theta) & \text{if } a > \theta .
\end{cases}
\qquad (8)
\]
If one regards over- and underestimation as being of equal (relative) importance, the loss function reduces to

\[
L(\theta, a) = c\,\lvert \theta - a \rvert . \qquad (9)
\]
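The standard loss functions are easy to code; the following sketch implements the one-dimensional squared error loss, the asymmetric linear loss of Eq. (8), and the symmetric absolute loss of Eq. (9). The constants K0, K1 and c are arbitrary example values.

# Sketch of the standard loss functions: one-dimensional squared error
# loss, the asymmetric linear loss of Eq. (8), and the absolute loss of
# Eq. (9).  The constants k0, k1 and c are arbitrary example values.
import numpy as np

def squared_error_loss(theta, a):
    return (theta - a) ** 2

def linear_loss(theta, a, k0=1.0, k1=3.0):
    # Eq. (8): underestimation (a <= theta) weighted by k0,
    # overestimation (a > theta) weighted by k1
    return np.where(a <= theta, k0 * (theta - a), k1 * (a - theta))

def absolute_loss(theta, a, c=1.0):
    # Eq. (9): over- and underestimation treated symmetrically
    return c * np.abs(theta - a)

theta = 2.0
for a in (1.0, 2.0, 3.0):
    print(a, squared_error_loss(theta, a),
          float(linear_loss(theta, a)), absolute_loss(theta, a))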
Because the true ‘state of nature’ is not known (otherwise no decision would be
required) the actual loss incurred will be unknown too. The usual way to handle this
problem is to consider the ‘average’ or ‘expected’ loss incurred. Averaging over X
alone leads to the classical (frequentist) notion of a ‘risk’ associated with a decision
rule.
Definition: The risk function for the estimator T is defined as the expected value of the loss function,

\[
R_T(\theta) = E_\theta\bigl[L\bigl(\theta, T(\mathbf{X})\bigr)\bigr]
= \int L\bigl(\theta, T(\mathbf{x})\bigr)\, f(\mathbf{x} \mid \theta)\, d\mathbf{x} .
\]
Note that the mean squared error of an estimator, MSE_T(θ), is the risk of the estimator with respect to a quadratic loss function.
Averaging over both X and θ leads to the Bayes risk. This approach requires the existence of a prior distribution for the parameter θ.
Definition: The Bayes risk for the estimator T, with respect to the prior distribution π over the parameter space Θ, is defined as

\[
r_\pi(T) = E_\pi\bigl[R_T(\theta)\bigr] = \int_\Theta R_T(\theta)\, \pi(\theta)\, d\theta .
\]
The expectation is to be understood with respect to the prior distribution π . Note that
the Bayes risk is a number, not a function of θ . (Compare Bayesian Statistics.)
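As a small worked illustration of these two notions, the sketch below approximates the risk function of the estimator T = X/n for a binomial success probability p under squared error loss (whose exact value is p(1 − p)/n) and averages it over a uniform prior to obtain the Bayes risk (exactly 1/(6n)); the sample size, the grid and the number of replications are arbitrary choices.

# Sketch: risk function and Bayes risk of T = X/n for a binomial
# proportion under squared error loss; n and all simulation settings are
# arbitrary illustrative choices.
import numpy as np

n = 20
rng = np.random.default_rng(3)
p_grid = np.linspace(0.0, 1.0, 201)

def risk(p, n_rep=50_000):
    # Monte Carlo estimate of R_T(p) = E_p[(X/n - p)^2] = p(1-p)/n
    x = rng.binomial(n, p, size=n_rep)
    return np.mean((x / n - p) ** 2)

risk_values = np.array([risk(p) for p in p_grid])
bayes_risk = risk_values.mean()   # average risk under a uniform prior on [0, 1]

print("risk at p = 0.5: simulated", risk(0.5), " exact", 0.25 / n)
print("Bayes risk:      numeric  ", bayes_risk, " exact", 1 / (6 * n))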
Intuitively, a sufficient statistic condenses the sample without losing any information about the parameter, the appropriate degree of condensation depending on the problem at hand. This intuitive concept can be formalized as follows.
Definition: A statistic S(X1,…,Xn) is called a sufficient statistic for a parameter θ if the conditional distribution of (X1,…,Xn) given S = s does not depend on θ (for any value of s). S can also be a vector of statistics, S = (S1(X1,…,Xn),…,Sk(X1,…,Xn)). In this case we say that Si, i = 1,…,k, are jointly sufficient for θ.
Though being quite intuitive, the definition is not easy to work with. With the help of the factorization criterion of Neyman and Fisher, sufficiency can be checked far more conveniently: S is sufficient for θ if and only if the joint pdf (or pmf) of the sample can be factored as

\[
f(x_1,\ldots,x_n \mid \theta) = g\bigl(S(x_1,\ldots,x_n), \theta\bigr)\, h(x_1,\ldots,x_n) , \qquad (12)
\]

where the function h is nonnegative and does not depend on θ and the function g depends on the observations only through the value of S.
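For instance, for a Bernoulli sample the sum S = X1 + … + Xn is sufficient for the success probability p: given S = s, every arrangement of the s successes is equally likely, whatever p is. The following sketch checks this by simulation; the sample size, the two values of p and the conditioning value s are arbitrary choices.

# Simulation sketch of sufficiency: for Bernoulli(p) data the conditional
# distribution of the sample given S = sum(X_i) should not depend on p.
# Here we estimate P(X_1 = 1 | S = s) for two values of p; both estimates
# should be close to s/n.  All numerical settings are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, s, n_rep = 5, 2, 400_000

for p in (0.2, 0.7):
    x = rng.random((n_rep, n)) < p        # Bernoulli(p) samples, shape (n_rep, n)
    keep = x.sum(axis=1) == s             # condition on S = s
    cond_prob = x[keep, 0].mean()         # estimate of P(X_1 = 1 | S = s)
    print(f"p = {p}: P(X_1 = 1 | S = {s}) is about {cond_prob:.3f} (theory: {s/n})")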
For a random sample X1,…,Xn the joint density of the observations is

\[
f(x_1,\ldots,x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) . \qquad (13)
\]
Read in the usual way, x1,…,xn are mathematical variables and θ is a fixed (but unknown) parameter value, which gave rise to the observations at hand. Turned the other way around, given the observations x = (x1,…,xn), the same expression regarded as a function of θ is called the likelihood function,

\[
l(\theta \mid \mathbf{x}) = \prod_{i=1}^{n} f(x_i \mid \theta) . \qquad (14)
\]
The (log-) likelihood function is used to compare the plausibility of various parameter
values, given the observations, x1,…, xn , at hand. The most plausible value, the
maximum likelihood value, plays a prominent role in parameter estimation (cf. Section
3.2).
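A minimal sketch of this idea, for an exponential model f(x | θ) = θ exp(−θx): the code below evaluates the log-likelihood on a simulated sample and maximizes it numerically, so that the result can be compared with the closed-form maximum likelihood value 1/x̄. The true rate and the sample size are arbitrary choices.

# Sketch: log-likelihood of an exponential model and its numerical
# maximization; theta_true and n are arbitrary illustrative choices.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
theta_true, n = 1.5, 100
x = rng.exponential(scale=1 / theta_true, size=n)

def neg_log_likelihood(theta):
    # -log l(theta | x) for f(x | theta) = theta * exp(-theta * x), theta > 0
    return -(n * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print("numerical maximum likelihood value:", res.x)
print("closed form 1 / x_bar:             ", 1 / x.mean())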
What makes the likelihood function so important in parameter estimation is the fact that it ‘adjusts itself’ even to rather complex observational situations. Consider for example the situation where fixed portions of the sample space χ are excluded from observation (called ‘Type-I censoring’), a situation quite often encountered in reliability or survival analysis. Only failures in the interval [a, b], for instance, are observed; failures smaller than a or larger than b are not observed, though we know their number, r and s, respectively. In this case the likelihood function is given by
\[
l_1(\theta \mid \mathbf{x}) \propto
\left[\int_{-\infty}^{a} f(x \mid \theta)\, dx\right]^{r}
\cdot \prod_{i=r+1}^{n-s} f\bigl(x_{(i)} \mid \theta\bigr)
\cdot \left[\int_{b}^{\infty} f(x \mid \theta)\, dx\right]^{s} , \qquad (15)
\]
where x(i) denotes the i-th order statistic, that is, the i-th smallest observation. A similar situation arises when fixed portions of the sample are excluded from observation (called ‘Type-II censoring’). If the smallest r and the largest s observations are excluded, the likelihood function is given by
\[
l_2(\theta \mid \mathbf{x}) \propto
\left[\int_{-\infty}^{x_{(r)}} f(x \mid \theta)\, dx\right]^{r}
\cdot \prod_{i=r+1}^{n-s} f\bigl(x_{(i)} \mid \theta\bigr)
\cdot \left[\int_{x_{(n-s)}}^{\infty} f(x \mid \theta)\, dx\right]^{s} . \qquad (16)
\]
Note that there is a fundamental difference. In the second case r and s are
predetermined values, whereas in the first case they are to be observed as well (but they
enter the likelihood function as if they were given in advance). In a certain sense the
likelihood function adjusts itself to the different observational situations.
Apart from the difference mentioned above, the two likelihood functions are quite similar in appearance with respect to the parameter θ, being a parameter of the underlying stochastic model f(x | θ). So we could expect the conclusions drawn to be quite similar too. If, by chance, r and s coincide for the two cases, and the censoring limits a and b agree with the order statistics x(r) and x(n−s), the two likelihood functions coincide up to a constant of proportionality and hence lead to the same conclusions about θ.
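A small sketch of how such a censored likelihood is handled in practice: the code below builds the Type-I censored log-likelihood of Eq. (15) for an exponential model observed only inside an interval [a, b] and maximizes it numerically. The true rate, the censoring limits and the sample size are arbitrary choices.

# Sketch: Type-I censored log-likelihood, Eq. (15), for exponential data
# observed only inside [a, b]; all numerical settings are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import expon

rng = np.random.default_rng(6)
theta_true, n = 0.8, 200
a, b = 0.3, 3.0                            # fixed censoring limits (Type-I)

x = rng.exponential(scale=1 / theta_true, size=n)
observed = x[(x >= a) & (x <= b)]          # failures actually observed
r = int(np.sum(x < a))                     # number of failures below a
s = int(np.sum(x > b))                     # number of failures above b

def neg_log_lik(theta):
    dist = expon(scale=1 / theta)
    return -(r * np.log(dist.cdf(a))                 # r failures before a
             + np.log(dist.pdf(observed)).sum()      # fully observed failures
             + s * np.log(dist.sf(b)))               # s failures after b

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20.0), method="bounded")
print("censored maximum likelihood estimate of theta:", res.x)
print("true value:", theta_true)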
1.6. Distributional Classes
Estimation procedures for distributions sharing some structural properties turn out to be quite similar. Moreover, finding ‘optimal’ estimators (and demonstrating their optimality) becomes easier if we can rely on certain properties of the underlying distribution. This is the purpose of the following definitions, which cover a wide range of practically important distributions. The estimation problem for distributions not covered by any of these classes will usually be more difficult.
1.6.1. (Log-) Location-Scale-Families
A family of distributions is called a location-scale family (LSF) if its cdf can be written in the form

\[
F(x; \mu, \sigma) = F_0\!\left(\frac{x - \mu}{\sigma}\right) , \qquad (18)
\]

where F0 is a fixed base (or reduced) cdf, μ is a location parameter (not necessarily the mean of X) and σ > 0 is a scale parameter (not necessarily the standard deviation of X). The most important members of this class are the normal distributions, where the base is the cdf of the standard normal distribution, Φ. The corresponding density is

\[
f(x; \mu, \sigma) = \frac{1}{\sigma}\, f_0\!\left(\frac{x - \mu}{\sigma}\right) . \qquad (19)
\]
Similarly, a family is called a log-location-scale family (LLSF) if its cdf can be written as

\[
F(x; \mu, \sigma) = F_0\!\left(\frac{\ln(x) - \mu}{\sigma}\right) , \qquad x > 0 , \qquad (20)
\]

where F0 again is the base (or reduced) cdf, and μ, σ are location and scale parameters, respectively. Now these terms are related to ln(X) instead of X. Important members of this class are the lognormal and the Weibull distributions.
The corresponding density is

\[
f(x; \mu, \sigma) = \frac{1}{x\sigma}\, f_0\!\left(\frac{\ln(x) - \mu}{\sigma}\right) , \qquad (21)
\]

where f0 is the density function corresponding to F0.
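As a small numerical illustration of Eq. (20): the Weibull distribution with shape k and scale λ is a log-location-scale family member with base cdf F0(z) = 1 − exp(−exp(z)), μ = ln(λ) and σ = 1/k. The sketch below checks this identity on a grid of points; the shape and scale values are arbitrary choices.

# Sketch: the Weibull cdf written in the log-location-scale form of
# Eq. (20), with base cdf F0(z) = 1 - exp(-exp(z)), mu = ln(scale) and
# sigma = 1/shape.  The shape and scale values are arbitrary choices.
import numpy as np
from scipy.stats import weibull_min

shape, scale = 2.5, 3.0
mu, sigma = np.log(scale), 1.0 / shape

def F0(z):
    # base (reduced) cdf of the smallest-extreme-value distribution
    return 1.0 - np.exp(-np.exp(z))

x = np.linspace(0.1, 10.0, 50)
direct = weibull_min.cdf(x, c=shape, scale=scale)     # Weibull cdf directly
via_llsf = F0((np.log(x) - mu) / sigma)               # Eq. (20) form

print("forms agree:", np.allclose(direct, via_llsf))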
There are, however, practically important distributions not belonging to these classes;
the gamma distributions, for instance, are neither LSF nor LLSF.
Biographical Sketches
Werner Gurker Born March 18, 1953, at Mauthen in Carinthia, Austria. Studies in engineering
mathematics at the Technische Hochschule Wien. Receiving a Dipl.-Ing. degree in engineering
mathematics in 1981. Dissertation in mathematics and Doctor of engineering science degree in 1988.
Assistant professor at the Technische Hochschule Wien since 1995. Main interest and publications in
statistical calibration and reliability theory.
Reinhard Viertl Born March 25, 1946, at Hall in Tyrol, Austria. Studies in civil engineering and
engineering mathematics at the Technische Hochschule Wien. Receiving a Dipl.-Ing. degree in
engineering mathematics in 1972. Dissertation in mathematics and Doctor of engineering science degree
in 1974. Appointed assistant at the Technische Hochschule Wien and promotion to University Docent in
1979. Research fellow and visiting lecturer at the University of California, Berkeley, from 1980 to 1981,
and visiting Docent at the University of Klagenfurt, Austria in winter 1981 - 1982. Since 1982 full
professor of applied statistics at the Department of Statistics, Vienna University of Technology. Visiting
professor at the Department of Statistics, University of Innsbruck, Austria from 1991 to 1993. He is a
fellow of the Royal Statistical Society, London, held the Max Kade fellowship in 1980, is founder of the Austrian Bayes Society and a member of the International Statistical Institute, and was president of the Austrian Statistical Society from 1987 to 1995. Invitation to membership in the New York Academy of Sciences in
1998. Author of the books Statistical Methods in Accelerated Life Testing (1988), Introduction to
Stochastics in German language (1990), Statistical Methods for Non-Precise Data (1996). Editor of the
books Probability and Bayesian Statistics (1987), Contributions to Environmental Statistics in German
language (1992). Co-editor of a book titled Mathematical and Statistical Methods in Artificial
Intelligence (1995), and co-editor of two special volumes of journals. Author of over 70 scientific papers
in algebra, probability theory, accelerated life testing, regional statistics, and statistics with non-precise
data. Editor of the publication series of the Vienna University of Technology, member of the editorial