Robust Estimators of Scale: Finite-Sample Performance in Long-Tailed Symmetric Distributions
Author(s): David A. Lax
Source: Journal of the American Statistical Association, Sep. 1985, Vol. 80, No. 391, pp. 736-741
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
Stable URL: https://www.jstor.org/stable/2288493
* David A. Lax is Assistant Professor, Business Administration, Harvard Business School, Boston, MA 02163. John Tukey, Gary Simon, and David Pasta made major contributions to this research. The author also thanks David Donoho, Paul Velleman, David Hoaglin, Frederick Mosteller, and James Sebenius. He gratefully acknowledges financial support from Army Research Office Grant DAHC04-74-0178 and National Science Foundation Grant SOC-75-15702.

This article thus considers scale estimators that satisfy Equation (2.1).
3. DESIGN OF A STUDY OF FINITE-SAMPLE PERFORMANCE

3.1 Evaluation Criteria

To assess an estimator's robustness to long-tailed symmetric noise, Tukey (see Hoaglin et al. 1982) proposed evaluating an estimator's performance for three distributions, which he calls "the three corners," that in his opinion "span" the space of distributions of concern. In addition to the unit-normal distribution, he proposed one distribution with consistently long tails and one with potentially erratic tail behavior. This article examines the performance of scale estimators in these three distributions (a sampling sketch follows the list):

1. normal: observations follow a normal distribution with mean 0 and variance 1.

2. slash: observations follow the same distribution as N/U, where N ~ N(0, 1) and U ~ U(0, 1), where U is independent of N and U(0, 1) is a uniform density on (0, 1); this distribution has the consistently long tails of a Cauchy distribution.

3. one-wild: in a sample of size 20, 19 points will be drawn from N(0, 1) and one point will be drawn from N(0, 100); thus all samples of size 20 from this distribution have one potentially wild point in 20.
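A sketch of samplers for the three corners (illustrative code, not from the paper; the seed and function names are mine):

    import numpy as np

    rng = np.random.default_rng(0)  # illustrative seed

    def sample_normal(n=20):
        # Corner 1: N(0, 1) observations.
        return rng.standard_normal(n)

    def sample_slash(n=20):
        # Corner 2: N/U with N ~ N(0, 1) and U ~ U(0, 1) independent.
        return rng.standard_normal(n) / rng.uniform(0.0, 1.0, n)

    def sample_one_wild(n=20):
        # Corner 3: n - 1 points from N(0, 1) plus one point from N(0, 100).
        return np.append(rng.standard_normal(n - 1), 10.0 * rng.standard_normal())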
A reliable scale estimator should give similar estimates over repeated samples from a distribution; that is, the estimator's sampling distribution should have small variation and display smooth tail behavior. For estimators whose sampling distributions have smooth tail behavior, I prefer those with the smallest variability. The variance of the scale estimator's sampling distribution is, itself, an inappropriate measure of variability. A scale estimator S provides the same ordering of samples as the scale estimator kS, where k is an arbitrary positive constant; yet the variance of kS is k²var(S). Because var[ln(S)] is unaffected by the scaling of the estimator, estimators are compared using the variance of the log of the estimate. Because the distribution of a scale estimate is likely skewed to the right, the log transformation will also have a symmetrizing influence. It is worth noting that other scale-free measures of variation were also used by Lax (1975b) and produced the same ranking of estimates as var[ln(S)]. For the purposes of brevity, the remainder of the article refers to the variance of the log of the estimate as the variance of the estimator or of its sampling distribution.
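A quick numerical illustration of this scale-invariance point (a sketch, not the paper's computation): multiplying a scale estimate by k multiplies its variance by k² but leaves the variance of its logarithm unchanged.

    import numpy as np

    rng = np.random.default_rng(1)
    # Sampling distribution of the sample standard deviation in N(0, 1) samples.
    S = np.array([np.std(rng.standard_normal(20), ddof=1) for _ in range(2000)])
    k = 5.0
    print(np.var(k * S) / np.var(S))                  # close to k**2 = 25
    print(np.var(np.log(k * S)), np.var(np.log(S)))   # equal up to rounding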
Let Vmin be the smallest known variance of (the log of) a scale estimator in repeated samples from a distribution. Then an estimator with variance V has conditional relative variance efficiency

E = 100 × Vmin/V.  (3.1)

For some distributions, the minimum possible variance is known. For other distributions, the minimum possible variance may not be known. Bounds like the Cramer-Rao lower bound are seldom sharp, and we only know the minimum variance from among the estimators we have considered.

To protect against the nonrobustness of the variance as a measure of dispersion (which is after all the purpose of this study) and to examine the tail behavior of the sampling distributions of the logs of scale estimators, I also examine several pseudovariances of the sampling distributions. The 100p% pseudovariance is defined as the variance of the normal distribution that has the same 100(1 − 2p)% interquantile distance. Formally, if Fn is the empirical cumulative distribution of a sample of size n, and Φ is the standardized normal distribution function, the 100p% pseudovariance Vp* is defined to be

Vp* = [ (Fn^-1(1 − p) − Fn^-1(p)) / (Φ^-1(1 − p) − Φ^-1(p)) ]².  (3.2)

The pseudovariances provide two kinds of information: (a) whether an estimator's sampling distribution is longer tailed than a normal distribution, and (b) whether the tails have a smooth shape or whether there is erratic tail behavior (that might yield a few wild values in a large sample). (See Andrews et al. 1972.) For a normal distribution, the pseudovariances for selected p will be constant and equal to the variance. The 100p% pseudovariances of a long-tailed distribution will increase as p decreases; the pseudovariances of a short-tailed distribution will decrease as p decreases.

Because the pseudovariances ignore the most extreme values, they distinguish between smooth and erratic tail behavior. Andrews et al. (1972) noted that when the tails of a symmetric distribution are smooth, the 4.2% pseudovariance will approximately equal the variance. When the tail behavior is erratic and a large sample contains a few extreme points, the variance will be inflated much more than the pseudovariances. Thus, for long-tailed sampling distributions, one might infer that the tail behavior is smooth if the variance is between, say, the 1% and 10% pseudovariances and that the tail behavior is erratic if the variance exceeds the 1% pseudovariance.
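A small sketch of the 100p% pseudovariance in (3.2) (my implementation, using the standard-library inverse normal distribution function):

    import numpy as np
    from statistics import NormalDist

    def pseudovariance(sample, p):
        # Variance of the normal distribution whose 100(1 - 2p)% interquantile
        # distance matches that of the empirical distribution of `sample`.
        lo, hi = np.quantile(sample, [p, 1.0 - p])
        z = NormalDist().inv_cdf(1.0 - p) - NormalDist().inv_cdf(p)
        return ((hi - lo) / z) ** 2

    rng = np.random.default_rng(2)
    x = rng.standard_normal(100_000)
    print([round(pseudovariance(x, p), 3) for p in (0.001, 0.01, 0.042, 0.10, 0.25)])

For a large normal sample the printed pseudovariances all sit near 1, the variance, as the text above says.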
The performance of an estimator across the three "corner" distributions is measured by its worst-case performance. If EN, ES, and EW represent the efficiencies of the log of an estimator under the normal, slash, and one-wild densities, the estimator's worst-case performance or, following Tukey, the triefficiency, is the minimum of the efficiencies over the three distributions, min{EN, ES, EW}. Estimators with smooth sampling distributions are ranked according to their triefficiencies.

Some estimators will dominate others. That is, estimator A dominates estimator B if estimator A is more efficient in all three distributions than estimator B. A dominated estimator can be discarded unless it has other advantages.
3.2 Monte Carlo Calculations

The finite sample behavior of scale estimators under a given distribution for the data is estimated by approximating the sampling distribution of the (log of the) estimator using Monte Carlo methods. A random sample of size 20 is drawn from the data distribution, and the scale estimate is computed for each estimator. One thousand draws from the normal distribution and 640 draws each from the slash distribution and the one-wild distribution provide a good approximation of the sampling distributions of each estimator for each of the three distributions. Because the Monte Carlo calculations use a swindle or variance-reduction technique, the estimated variances and pseudovariances ...
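The design just described can be sketched as a plain Monte Carlo loop. The version below is mine: it omits the swindle/variance-reduction step, uses the replication counts quoted above, and takes the "smallest known variance" in (3.1) to be the smallest among the estimators actually included, here just two stand-ins.

    import numpy as np

    rng = np.random.default_rng(3)

    def sd(x):
        return np.std(x, ddof=1)

    def mad(x):
        return np.median(np.abs(x - np.median(x)))

    def var_log(estimator, sampler, n_reps, n=20):
        # Variance of the log of the estimate over repeated samples of size 20.
        return np.var(np.log([estimator(sampler(n)) for _ in range(n_reps)]))

    samplers = {"normal":   lambda n: rng.standard_normal(n),
                "slash":    lambda n: rng.standard_normal(n) / rng.uniform(0, 1, n),
                "one-wild": lambda n: np.append(rng.standard_normal(n - 1),
                                                10.0 * rng.standard_normal())}
    reps = {"normal": 1000, "slash": 640, "one-wild": 640}
    estimators = {"sd": sd, "mad": mad}        # stand-ins for the Section 4 list

    V = {(e, d): var_log(est, samplers[d], reps[d])
         for e, est in estimators.items() for d in samplers}
    V_min = {d: min(V[e, d] for e in estimators) for d in samplers}
    eff = {(e, d): 100.0 * V_min[d] / V[e, d] for (e, d) in V}
    trieff = {e: min(eff[e, d] for d in samplers) for e in estimators}
    print(trieff)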
For a vector of sample observations X = {x1, x2, ..., xn} with average x̄, let di = xi − x̄ be the deviations of xi from the mean. The sample standard deviation equals

S = [ Σ_i di² / (n − 1) ]^1/2.  (4.1)
A two-sided p% trimmed mean, M2,p(·), is obtained by sorting the observations, temporarily setting aside the [pn/2] smallest observations and the [pn/2] largest observations (where [q] means the greatest integer part of q), and computing the arithmetic average of the remaining observations. A one-sided r% trimmed mean, M1,r(·), is obtained by sorting the observations, temporarily setting aside the [rn] largest observations, and computing the arithmetic average of the remaining observations. Let M2,p(X) be the p% two-sided trimmed mean of the vector of sample observations X. Let di = xi − M2,p(X) be the deviations of the xi from M2,p(X). By analogy to (4.1), a trimmed standard deviation is defined as

Strim = [ M1,r(d1², d2², ..., dn²) ]^1/2.  (4.2)

The trimmed standard deviation evaluated in the study uses a 20% two-sided trimmed mean and a 20% one-sided trimmed mean; that is, p = r = 20.
4.3 The MAD
The median absolute deviation from the median, called the MAD, is a common resistant measure of scale (see Mosteller and Tukey 1977). Let m be the median of {x1, x2, ..., xn} and let di = |xi − m| be the absolute deviation of xi from the median. Then, the MAD is defined by the median of the absolute deviations di from the median,

MAD = med{d1, d2, ..., dn}.  (4.3)
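Equations (4.1)-(4.3) translate directly into code; the sketch below is mine, with the 20% trimming and the greatest-integer truncation taken from the text.

    import numpy as np

    def sample_sd(x):
        # (4.1): root mean squared deviation from the mean, divisor n - 1.
        d = x - np.mean(x)
        return np.sqrt(np.sum(d ** 2) / (len(x) - 1))

    def two_sided_trimmed_mean(x, p=0.20):
        # Set aside the [pn/2] smallest and [pn/2] largest values, then average.
        x = np.sort(x)
        k = int(p * len(x) / 2)          # [q] = greatest integer part of q
        return np.mean(x[k:len(x) - k])

    def one_sided_trimmed_mean(x, r=0.20):
        # Set aside the [rn] largest values, then average.
        x = np.sort(x)
        return np.mean(x[:len(x) - int(r * len(x))])

    def trimmed_sd(x, p=0.20, r=0.20):
        # (4.2): one-sided trimmed mean of the squared deviations from the
        # two-sided trimmed mean, then a square root.
        d2 = (x - two_sided_trimmed_mean(x, p)) ** 2
        return np.sqrt(one_sided_trimmed_mean(d2, r))

    def mad(x):
        # (4.3): median absolute deviation from the median.
        return np.median(np.abs(x - np.median(x)))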
4.4 The Gaussian Skip

Let x(i) be the ith order statistic in a sample of n. Let M2(X) be a measure of location evaluated at the sample X, and let d(i) ...
The biweight estimate of location is

Mbi(X) = Σ_{i=1}^n xi wbi(ui) / Σ_{i=1}^n wbi(ui),  (4.6)

with biweight weighting function

wbi(u) = (1 − u²)²,  |u| ≤ 1,
       = 0,  otherwise.  (4.7)

Let T be an estimate of the center of a sample X, and ψ be some function. Let ui = (xi − T)/S be the normalized observations. Huber (1964) suggested solving the following equation for S:

[1/(n − 1)] Σ_{i=1}^n ψ²(ui) = E[ψ²(Z) | Z ~ N(0, 1)].  (4.8)

When ψ(u) = u and T = x̄, the right-hand side of (4.8) equals 1 and the sample standard deviation is the solution to (4.8). In other words, ψ(u) = u implies that all the squared deviations receive equal weight no matter how large they are. Huber suggests a function ψH that limits the influence of points far from the estimated center T of the sample. The Huber ψH function is

ψH(u) = −b,  u < −b
      = u,   −b ≤ u ≤ b
      = b,   u > b,  (4.9)

for some b > 0. The function ψH limits the squared deviations of observations that are far from T; that is, observations whose normalized deviations are bigger in magnitude than b are not included fully in the sum in (4.8); their influence is limited because their contributions are set to b² rather than ui².

When ψ is monotone, a solution to (4.8) will be unique if it exists. For nonmonotone ψ, if one solution to (4.8) exists, there will usually be two positive solutions. A good starting guess S0 should lead, in the case of the Huber estimator, to the appropriate solution. Nonetheless, Equations (4.8) and (4.9) may have negative solutions. Thus when one intends to use a scale estimator in an automatic fashion as part of a larger algorithm, the Huber scale estimator may be an unsuitable choice.
Equation (4.8) can be solved iteratively. If we replace the right-hand side of (4.8) by 1, take T to be the median, take S0 to be the MAD, and, for the kth iteration, define u_ik = (xi − T)/S_{k-1}, Newton-Raphson iteration gives

S_k = S_{k-1} [ Σ_i ψ²(u_ik) / (n − 1) ]^1/2.  (4.10)
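A sketch of this iterative solution (my paraphrase of the update, not the paper's listing), with T the median and S0 the MAD:

    import numpy as np

    def psi_huber(u, b):
        # (4.9): clamp u to the interval [-b, b].
        return np.clip(u, -b, b)

    def huber_scale(x, b=1.4, n_iter=20):
        # Iterate toward the S that solves sum(psi(u_i)**2)/(n - 1) = 1,
        # i.e., (4.8) with its right-hand side replaced by 1.
        x = np.asarray(x, dtype=float)
        n = len(x)
        T = np.median(x)
        S = np.median(np.abs(x - T))            # S0 = MAD
        for _ in range(n_iter):
            u = (x - T) / S
            S = S * np.sqrt(np.sum(psi_huber(u, b) ** 2) / (n - 1))
        return S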
As n → ∞,

(n var(x̄))^1/2 → σ.  (4.11)

Thus the variance of the location estimator x̄ can serve as a scale estimator.

An A estimate of scale is defined, analogously to (4.11), from the asymptotic variance of a robust estimator of location. The robust M estimate of location, given a scale estimate S, a positive constant c, and some function ψ, is defined to be the solution Tn of the following equation (Huber 1964):

Σ_{i=1}^n ψ((xi − Tn)/S) = 0.  (4.12)

By analogy to (4.11), under appropriate regularity conditions, as n → ∞,

(n var(Tn))^1/2 → (Aψ(T, F))^1/2,  (4.13)

where Aψ(T, F) is the asymptotic variance of the M estimator based on the function ψ and with the data following distribution F.

Setting T to be an estimate of the location of the sample, S0 to be an estimate of the scale of the sample, c to be a positive constant, and ui = (xi − T)/cS0, it is not difficult to derive the following finite sample approximation to Aψ(T, F) (for example, see Gross 1976):

S_{ψ,c}² = n(cS0)² Σ_i ψ²(ui) / [ (n − 1) ( Σ_i ψ′(ui) )² ].  (4.14)
4.6.1 The Biweight A Estimator. The bisquare ψ function is ψbi(u) = u wbi(u), where wbi(u) is the biweight weighting function given by (4.7). Substituting ψbi into (4.14) and manipulating yields

S_{bi,c} = n^1/2 [ Σ_{|ui|<1} (xi − T)² (1 − ui²)^4 ]^1/2 / [ (n − 1)^1/2 | Σ_{|ui|<1} (1 − ui²)(1 − 5ui²) | ].  (4.15)

4.6.2 The Modified Biweight A Estimator. I simplified the sum in the denominator by assuming that w′bi(ui) ui ≈ 0. Thus the modified biweight estimator is a weighted average of the squared deviations from T,

S_{mbi,c} = n^1/2 [ Σ_{|ui|<1} (xi − T)² wbi²(ui) ]^1/2 / [ (n − 1)^1/2 Σ_{|ui|<1} wbi(ui) ].  (4.16)
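Read off the reconstruction of (4.15) above, the biweight A estimator can be sketched as follows; taking T as the sample median and S0 as the MAD is an assumption on my part (the paper states those choices explicitly only for the sine A estimator below).

    import numpy as np

    def biweight_a_scale(x, c=9.0):
        # (4.15): u_i = (x_i - T)/(c*S0); sums run over |u_i| < 1.
        x = np.asarray(x, dtype=float)
        n = len(x)
        T = np.median(x)                       # assumed location estimate
        S0 = np.median(np.abs(x - T))          # assumed starting scale (MAD)
        u = (x - T) / (c * S0)
        m = np.abs(u) < 1.0
        num = np.sqrt(n * np.sum((x[m] - T) ** 2 * (1 - u[m] ** 2) ** 4))
        den = np.sqrt(n - 1) * abs(np.sum((1 - u[m] ** 2) * (1 - 5 * u[m] ** 2)))
        return num / den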
4.6.3 The Sine A Estimator. Gross (1976) used an A estimator of scale with c set to 2.1, T chosen to be the median, S0 chosen to be the MAD, and

ψ(u) = sin(u),  |u| ≤ π
     = 0,  otherwise.  (4.18)

Thus the sine A estimator is

S_{sin,c} = n^1/2 cS0 [ Σ_{|ui|≤π} sin²(ui) ]^1/2 / [ (n − 1)^1/2 | Σ_{|ui|≤π} cos(ui) | ].  (4.19)
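Equation (4.14) also gives a generic recipe: supply any ψ, its derivative, and its rejection point. The sketch below does that and specializes it to the sine ψ of (4.18) with c = 2.1, T the median, and S0 the MAD as in the text; the helper names are mine, and the (n − 1) factor follows the reconstruction of (4.14) above.

    import numpy as np

    def a_scale(x, psi, dpsi, c, cutoff):
        # (4.14): S^2 = n (c S0)^2 sum psi(u_i)^2 / ((n - 1) (sum psi'(u_i))^2),
        # with u_i = (x_i - T)/(c S0) and the sums restricted to |u_i| < cutoff.
        x = np.asarray(x, dtype=float)
        n = len(x)
        T = np.median(x)
        S0 = np.median(np.abs(x - T))
        u = (x - T) / (c * S0)
        m = np.abs(u) < cutoff
        num = n * (c * S0) ** 2 * np.sum(psi(u[m]) ** 2)
        den = (n - 1) * np.sum(dpsi(u[m])) ** 2
        return np.sqrt(num / den)

    def sine_a_scale(x, c=2.1):
        # (4.18)-(4.19): psi(u) = sin(u) for |u| <= pi, zero outside.
        return a_scale(x, np.sin, np.cos, c, cutoff=np.pi)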
4.6.4 The Modified Sine A Estimator. I modified the sine A estimator by inserting an arctangent transformation in an attempt to symmetrize the ratio in (4.19). The modified sine estimator is ...
... whereas with nondecreasing functions, outlying observations always have some influence on the estimate.

Do we want to ignore outlying observations altogether? In the exploratory Monte Carlo study, A-estimators with redescending ψ functions outperformed A-estimators with nondecreasing ψ functions. This dominance suggests that we may indeed wish to ignore extreme outliers completely. The comparison in the next section between nondecreasing M-estimators and redescending A-estimators in the high-precision Monte Carlo study also speaks to this question.

5. RESULTS

Table 1 presents the efficiencies of the scale estimators described in Section 4. The efficiencies described in Section 3.1 are defined in terms of the variance of the log of an estimator in repeated samples of size 20 from a distribution.

Table 1. Variance Efficiencies for Selected Estimators: Monte Carlo Estimates in Samples of Size 20

                                           Efficiency
Estimator                        Normal  One-Wild(a)  Slash(b)  Triefficiency(c)
A-Estimators (ψ function)
  Bisquare (c = 6)                 65.2      77.1        90.1       65.2
  Bisquare (c = 7)                 74.8      82.9        89.3       74.8
  Bisquare (c = 8)                 81.8      85.4        87.6       81.8
  Bisquare (c = 9)                 86.7      85.8        86.1       85.8(d)
  Bisquare (c = 10)                90.0      84.8        84.6       84.6(d)
  Modified bisquare (c = 6)        47.5      56.8        96.8       47.5(d)
  Sine (c = 2.1)                   77.5      83.7        88.4       77.5
  Modified sine (c = 2.1)          82.1      89.6        94.5       82.1(d)
M-Estimators (Huber ψ function)
  b = 1.4 (iterated)               48.1      56.8       100.0       48.1(d)
  b = 1.7 (iterated)               72.3      83.8        83.8       72.3
  b = 1.4 (one-step)               55.2      68.1        86.8       55.2
  b = 1.7 (one-step)               60.5      71.8        83.1       60.5
  b = 2.0 (one-step)               69.8      76.1        75.9       69.8
Sample Standard Deviation         100.0      10.9         -(e)       -(e)
Trimmed Standard Deviation         89.9     100.0        28.1       28.1(d)
MAD                                35.3      41.5        91.8       35.3
Gaussian Skip                      54.7      59.3        90.1       54.7

(a) In samples of size 20 from a one-wild distribution, 19 data points are drawn from N(0, 1) and the remaining point is drawn from N(0, 100).
(b) The random variable Z = X/Y follows the slash distribution if X ~ N(0, 1) and Y ~ U(0, 1).
(c) An estimator's triefficiency is the smallest of its efficiencies over the three distributions.
(d) The estimate is undominated.
(e) The slash distribution has Cauchy tails. Thus the variance of the standard deviation should be infinite and the efficiency should be zero.
NOTE: The variance of an estimator used to compute efficiency is the variance of the sampling distribution of the log of the estimator. The variance efficiency of an estimator is the ratio of the smallest known variance in a distribution over the estimator's variance in that distribution. MAD is median absolute deviation from the median.
Seven estimators are undominated: the biweight A-estimators with c = 9 and c = 10, the modified biweight A estimator, the modified sine A estimator, the iterated Huber M estimator with b = 1.4, the sample standard deviation, and the trimmed standard deviation. Of these, only three estimators performed well across the three distributions; the triefficiencies of the biweight A-estimators with c = 9 and c = 10 and the modified sine A estimator exceed 82%, whereas the triefficiencies of the other four undominated estimators fall below 50%.
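The undominated flags in Table 1 (footnote d) follow mechanically from the three efficiency columns. A small check of the dominance definition, with a few rows of Table 1 typed in (my sketch):

    effs = {  # (normal, one-wild, slash) efficiencies from Table 1
        "Bisquare (c = 9)":        (86.7, 85.8, 86.1),
        "Bisquare (c = 10)":       (90.0, 84.8, 84.6),
        "Modified sine (c = 2.1)": (82.1, 89.6, 94.5),
        "MAD":                     (35.3, 41.5, 91.8),
        "Gaussian Skip":           (54.7, 59.3, 90.1),
    }

    def dominated(name):
        # A dominates B if A is more efficient in all three distributions.
        return any(all(a > b for a, b in zip(effs[other], effs[name]))
                   for other in effs if other != name)

    for name in effs:
        print(name, "triefficiency:", min(effs[name]), "dominated:", dominated(name))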
The biweight with c = 9 has the largest triefficiency (85.8%). The modified sine A estimator, which dominates the sine A estimator, can probably reach the same or better levels of performance by raising the scaling constant c above 2.1. Raising the scaling constant gives positive weight to more of the observations and thus should improve the estimator's performance in the normal and one-wild distributions while hurting performance in the consistently long-tailed distribution.

The sample standard deviation performed quite poorly in both long-tailed distributions. The trimmed standard deviation successfully protects against the one potentially wild value in 20 and loses little (90% efficiency) in the normal distribution, but its performance deteriorates to 28% efficiency in the Cauchy-tailed slash distribution. The MAD and Gaussian skip protect successfully against the Cauchy-tailed slash distribution, but they are relatively inefficient in the other two distributions. The biweight estimator with c = 9 is more than twice as efficient (in terms of triefficiency) as the MAD.
The redescending biweight and sine A estimators outperform M-estimators that use the nondecreasing Huber ψ function. Only the iterated Huber M estimator with b = 1.4 was undominated; the highest triefficiency among the Huber M estimators (iterated, b = 1.7) was 72.3%. As Section 4.6 mentions, redescending A-estimators also dominated A-estimators with Huber functions.

The superiority of redescending ψ functions over nondecreasing functions suggests that robust scale estimators should completely ignore outlying observations. Many robust location estimators, including the biweight location estimator, also give extreme outliers zero influence.

How far away from the center of the sample must an observation be before we call it an outlier and give it zero weight?
Should scale estimators ignore fewer or more outlying observations than location estimators?

Both the results of this Monte Carlo study and theoretical calculations suggest that scale estimators should ignore fewer points than location estimators. The biweight location estimator achieves the best balance across distributions when c = 6 (see Mosteller and Tukey 1977). Table 1 shows that the biweight scale estimator achieves the highest triefficiency with c = 9. A higher scaling constant c means that more points influence the estimate; in other words, the best biweight scale estimator uses more of the sample than the best biweight location estimator.
A simple calculation supports this conclusion. The Fisher information about a parameter p contained in the distribution g(x | p) can be written as

Ip = E[hp²(x)],  (5.1)

where hp(x) is the score for p, and

hp(x) = (∂/∂p) ln g(x | p).  (5.2)

If x follows the distribution f(x | μ, σ) = σ^-1 f0((x − μ)/σ), then the score for location hμ is

hμ(x) = (∂/∂μ) ln f(x | μ, σ).  (5.3)

One can easily show that the score for scale is

hσ(x) = (∂/∂σ) ln f(x | μ, σ) = [(x − μ)/σ] hμ(x) − 1/σ.  (5.4)

Comparing Iμ = E[hμ²(x)] and Iσ = E[hσ²(x)], Equation (5.4) shows that the information about scale includes a term in ((x − μ)/σ)²,

E[((x − μ)/σ)² hμ²(x)].

Thus the extreme observations contribute substantial information about scale, and relatively more about scale than about location. We therefore expect that robust scale estimators should ignore less of the sample to attain efficiency. This intuition is consistent with the Monte Carlo results.
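As an illustrative check of (5.3)-(5.4) (worked out here, not taken from the paper), take f0 to be the standard normal density, so that hμ(x) = (x − μ)/σ²; using E[(x − μ)²] = σ² and E[(x − μ)⁴] = 3σ⁴ gives

    \[
      h_\sigma(x) = \frac{x-\mu}{\sigma}\,h_\mu(x) - \frac{1}{\sigma}
                  = \frac{(x-\mu)^2}{\sigma^3} - \frac{1}{\sigma},
    \]
    \[
      I_\mu = E\big[h_\mu^2(x)\big] = \frac{1}{\sigma^2}, \qquad
      I_\sigma = E\big[h_\sigma^2(x)\big]
               = \frac{3\sigma^4 - 2\sigma^4 + \sigma^4}{\sigma^6}
               = \frac{2}{\sigma^2}.
    \]

The extra factor ((x − μ)/σ)² in the scale score is what places the additional information in the tails, which is the point of the comparison above.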
As Section 3.1 mentions, the pseudovariances allow us to examine the shape of the sampling distribution of the estimators. Table 2 presents the pseudovariances of the log estimator divided by the variance of the log (estimator) for the biweight A-estimators with c = 9 and c = 10, for the modified sine A estimator, and, as a comparison, for the MAD. Because the 100p% pseudovariances increase as p decreases for all four estimators in all three distributions, we can infer that the sampling distribution of (the log of) each estimator is long-tailed. This "long-tailedness" is not problematic; because the variance of each estimate is not substantially greater than its pseudovariances in all three distributions, we can infer that the sampling distribution of the estimate is consistently rather than erratically long-tailed.

Table 2. Selected Pseudovariances Divided by the Variance of the Logarithm of the Estimator

                                        Distribution
                               Normal                     One-Wild(a)                 Slash(b)
Estimator                .1%   1%  4.2%  10%  25%    .1%   1%  4.2%  10%  25%    .1%   1%  4.2%  10%  25%
Biweight A estimator
  (c = 9)               1.12 1.07  1.04 1.02 1.01   1.11 1.07  1.05 1.03 1.02   1.06 1.03  1.00  .96  .93
Biweight A estimator
  (c = 10)              1.10 1.07  1.04 1.03 1.02   1.10 1.07  1.05 1.03 1.02   1.08 1.04   .99  .96  .93
Modified sine
  A estimator           1.10 1.07  1.04 1.02 1.01   1.08 1.06  1.05 1.04 1.03   1.02 1.00   .98  .97  .95
MAD                     1.05 1.04  1.01 1.00  .98   1.00 1.03  1.02 1.01 1.00   1.04 1.00   .98  .98  .96

(a) In samples of size 20 from the one-wild distribution, 19 points are drawn from N(0, 1) and one point is drawn from N(0, 100).
(b) The random variable Z = X/Y follows the slash distribution if X ~ N(0, 1) and Y ~ U(0, 1).
NOTE: MAD is median absolute deviation from the median.

In summary, A-estimators of scale are more robust in symmetric long-tailed distributions than the other estimators studied. A-estimators can be used without monitoring (unlike M-estimators) because they always yield positive values. Finally, redescending ψ functions such as the biweight and sine outperform nonredescending ψ functions such as the Huber ψ function.

[Received April 1982. Revised February 1985.]

REFERENCES

Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H., and Tukey, J. W. (1972), Robust Estimates of Location: Survey and Advances, Princeton, NJ: Princeton University Press.
Bickel, P. J., and Lehmann, E. L. (1976), "Descriptive Statistics for Nonparametric Models: III. Dispersion," The Annals of Statistics, 4, 1139-1158.
Bickel, P. J., and Lehmann, E. L. (1979), "Descriptive Statistics for Nonparametric Models: IV. Spread," in Contributions to Statistics, Jaroslav Hajek Memorial Volume, ed. Jana Jureckova, Prague: Academia, pp. 33-40.
De Wet, T., and van Wyk, J. W. J. (1979), "Efficiency and Robustness of Hogg's Adaptive Trimmed Means," Communications in Statistics, Part A, Theory and Methods, 8, 117-128.
Gross, A. M. (1976), "Confidence Interval Robustness With Long-Tailed Symmetric Distributions," Journal of the American Statistical Association, 71, 409-416.
Harter, H. L., Moore, A. H., and Curry, T. F. (1979), "Adaptive Robust Estimation of Location and Scale Parameters of Symmetric Populations," Communications in Statistics, Part A, Theory and Methods, 8, 1473-1491.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. (eds.) (1982), Understanding Robust and Exploratory Data Analysis, New York: John Wiley.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73-101.
Iglewicz, B. (1982), "Robust Scale Estimates," in Understanding Robust and Exploratory Data Analysis, eds. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, New York: John Wiley.
Lax, D. A. (1975a), "An Interim Report of a Monte Carlo Study of Robust Estimates of Width," Technical Report 93 (Ser. 2), Princeton University, Dept. of Statistics.
Lax, D. A. (1975b), Robust Estimators of Widths in Long-Tailed Symmetric Distributions: Performance in Small Samples, unpublished A.B. thesis, Princeton University, Dept. of Statistics.
Lemmer, H. (1979), "A Robust Estimate of Spread," South African Statistical Journal, 13, 121-126.
Mosteller, F., and Tukey, J. W. (1977), Data Analysis and Regression: A Second Course in Statistics, Reading, MA: Addison-Wesley.
Oja, H. (1981), "On Location, Scale, Skewness, and Kurtosis of Univariate Distributions," Scandinavian Journal of Statistics, 8, 154-168.
Rothschild, M., and Stiglitz, J. (1970), "Increasing Risk I: A Definition," Journal of Economic Theory, 2, 225-243.
Simon, G. (1976), "Computer Simulation Swindles, With Applications to Estimates of Location and Dispersion," Applied Statistics, 25, 266-274.
Tukey, J. W. (1960), "A Survey of Sampling From Contaminated Distributions," in Contributions to Probability and Statistics, eds. I. Olkin, S. Ghurye, W. Hoeffding, W. Madow, and H. Mann, Stanford, CA: Stanford University Press.