An Intuitive Geometric Approach To The Gauss Markov Theorem
Leandro da Silva Pereira, Lucas Monteiro Chaves & Devanil Jaques de Souza
The American Statistician (2016), DOI: 10.1080/00031305.2016.1209127
Abstract
Algebraic proofs of the Gauss Markov theorem are very disappointing from an intuitive point
of view. An alternative is to use geometry, which emphasizes the essential statistical ideas behind
the result. A truly geometric, intuitive approach to the theorem is presented, based only on elementary concepts: matrices viewed as linear transformations and orthogonal projections onto linear subspaces.
1 UTFPR – Federal Technological University of Parana, DAMAT, Apucarana-PR, Brazil, 86812-460. E-mail: leandropereira@utfpr.edu.br
2 UFLA – Federal University of Lavras, DEX, Lavras-MG, Brazil, 37200-000. E-mail: lucas@dex.ufla.br
3 UFLA – Federal University of Lavras, DEX, Lavras-MG, Brazil, 37200-000. E-mail: devaniljaques@dex.ufla.br
1 – Introduction
There are few general results in statistics. Often, particular and very restrictive
assumptions are necessary, such as normality. One of these few general results is the
Gauss Markov theorem, which has highly practical consequences. The strength of the Gauss Markov
theorem lies in the fact that it holds regardless of the distribution of the random variable
considered. This result gives statistical meaning to a pure mathematical fact: the least squares
method. Since the theorem is a basic result, it is taught in beginning statistics courses. However,
the proofs presented in most of the textbooks (Rencher 2008, Rao 1999, Casella and Berger
2002) are based only on algebraic properties of positive definite matrices. The experience in
class seems to lead us to conclude that this kind of demonstration does not improve student
comprehension of the result. Proving the Gauss Markov theorem by algebraic methods seems to
be at most innocuous. There are demonstrations which adopt a geometric flavor, but they usually
rely on some algebraic results (Ruud 2000, Gruber 1998, Saville and Wood 1991). First of all,
we have to be clear about what is meant by a geometric demonstration. In general, this will not
be a totally rigorous proof from a mathematical point of view; it requires some degree of
intuition. The tricky part of our geometric approach is to interpret matrices as linear
transformations: they map vectors to vectors and linear subspaces to linear subspaces. In this
sense, a matrix is a very concrete geometrical object. Another basic geometric concept is the
projection of vectors onto linear subspaces, in particular the orthogonal projection. The purpose of this
article is to provide an intuitive geometric demonstration of the Gauss Markov theorem, using only these two concepts.
2 – The linear model and the least squares method
Consider Y a random vector with unknown mean vector $\mu = E[Y]$. For reasons related to the
random experiment that generates Y, we can suppose some linear relations among the
components of the unknown mean vector μ and, therefore, assume that the vector μ belongs to
some known linear subspace W; this, in essence, characterizes a linear model. The vector Y
lives in the data space, in general the n-dimensional Euclidean space $\mathbb{R}^n$. Since the dimension
of the data space is higher than the dimension of W, that is, there are more data than characteristics
to be estimated, it is convenient to describe the subspace W explicitly.
Such a procedure is called parameterization and can be done in the following way: consider W to
be the image of a linear transformation X defined on another vector space, which will be called
the parameter space. The transformation X can be considered injective, that is, for each vector w in W there exists a unique vector β in the
parameter space such that $w = X\beta$. In practical situations, the experiment defines the matrix X,
the design matrix, and the column space of X defines W. Then it is possible to make clear the relations among the data space, the parameter space and the subspace W, as represented in Figure 1.
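As a concrete numerical illustration (ours, not part of the original article), the following minimal sketch builds a hypothetical design matrix for a straight-line model, checks that X is injective (full column rank), and exhibits a mean vector μ = Xβ inside W:

```python
import numpy as np

# Hypothetical design matrix X for n = 5 observations and p = 2 parameters
# (a simple straight-line model); any full-column-rank X would do.
X = np.column_stack([np.ones(5), np.arange(5.0)])

# X is injective on the parameter space exactly when its columns are
# linearly independent, i.e. rank(X) equals the number of parameters.
assert np.linalg.matrix_rank(X) == X.shape[1]

beta = np.array([1.0, 2.0])   # illustrative "true" parameter vector
mu = X @ beta                 # mu = X beta is a point of the subspace W
print(mu)                     # [1. 3. 5. 7. 9.]
```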
The greatest advantage of representing the linear model as in Figure 1 is the description
of the estimation process. What is the estimation process? It is a very simple decision: after a
data vector y is observed, we have only to choose a vector in W which we believe to be a good
representative of $E[Y]$. If y belongs to the subspace W, since this is the space to which the
mean of the random vector Y is restricted, there is no reason not to estimate $E[Y]$ by the
observed vector y itself. But this seldom occurs, since the observed vector y is affected by random
errors. Therefore, almost surely, the vector y does not belong to the subspace W, and a natural
procedure for estimating the vector $E[Y]$ is to take some kind of projection of y onto W. This
process, when specific restrictions on the projection are considered, explains more
sophisticated estimation methods, such as ridge regression or elastic net regression (Hoerl and
Kennard 1970, Zou and Hastie 2005). Let us consider the simplest idea: to choose the vector in W
closest to y. This estimation procedure is called the least squares method. How can this be done? The
answer is to use orthogonal linear projections. If $P_W$ is the orthogonal linear projection onto W,
the chosen vector is $P_W(y)$ (Rencher 2008, pp. 43, 228). Since the linear transformation X is
injective, there is only one β̂ such that $P_W(y) = X\hat{\beta}$. This equation, in its algebraic form, is
called the normal equation, and β̂ is the least squares estimate of the parameter vector β.
To express β̂ in terms of the data vector y it is necessary to write the projection
$P_W$ as a matrix. This can be done with a little linear algebra, giving $P_W = X(X'X)^{-1}X'$ and
$\hat{\beta} = (X'X)^{-1}X'y$, where $X'$ denotes the transpose of X. But that linear algebra is not necessary to follow the geometric argument.
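The following minimal numerical sketch (ours, using the same toy design matrix as above) verifies the two expressions and the normal equation; numpy's own least squares routine is used only as an independent check:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), np.arange(5.0)])       # toy design matrix
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(5)   # observed data vector

# Orthogonal projection onto W = col(X):  P_W = X (X'X)^{-1} X'
P_W = X @ np.linalg.inv(X.T @ X) @ X.T

# Least squares estimate:  beta_hat = (X'X)^{-1} X' y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# P_W(y) = X beta_hat, and it agrees with numpy's least squares routine.
assert np.allclose(P_W @ y, X @ beta_hat)
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```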
3 – The Gauss Markov theorem
The statistical properties of this estimation method are established by the Gauss Markov
theorem: if the dispersion matrix of Y is $D(Y) = E\big[(Y - E[Y])(Y - E[Y])'\big] = \sigma^2 I$, then the least
squares estimator β̂ has the smallest total variance, where the total variance is the sum of the
variances of each component, among all linear unbiased estimators of β. The matrix D(Y) is also
known as the "covariance matrix". If a lot of values of the random vector Y are observed, they form a cloud of points in
the data space. We will call this cloud the dispersion cloud. The matrix D(Y) tells us about the
shape of this cloud. This can be seen in the following way. Consider a unit vector x. The
inner product $\langle Y, x \rangle = \|Y\| \cos(\theta)$, where θ is the angle between x and Y, defines a
one-dimensional random variable in the direction of x, centered on $\langle \mu, x \rangle = \|\mu\| \cos(\alpha)$,
where α is the angle between μ and x (Figure 2).
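To make the "centered on" claim explicit, here is a one-line computation (ours, not in the original text) that uses only the linearity of expectation and the fact that x is a unit vector:

$$E\big[\langle Y, x \rangle\big] = E[x'Y] = x'E[Y] = x'\mu = \langle \mu, x \rangle = \|\mu\|\,\|x\|\cos(\alpha) = \|\mu\|\cos(\alpha).$$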
The variance of this random variable gives a good idea of the width of the dispersion cloud in the
direction of x: $\mathrm{var}(\langle x, Y \rangle) = x' D(Y)\, x = \sigma^2 x'x = \sigma^2$. That is, the variability does not depend on the direction
of x, so all directions in the data space show the same variability. This means that the width of the
dispersion cloud must be almost the same in any direction, that is, the dispersion cloud has
approximate spherical symmetry. To delimit the cloud, we can take a sphere that contains, say,
approximately 75% of the cloud points. Furthermore, with high probability, such a cloud should be
closely centered at μ. Observe that, since $\mu \in W$, this spherical cloud must be almost
symmetric in relation to W. If the observed data vectors in the cloud are orthogonally projected
onto the subspace W, the projected cloud will also have almost spherical symmetry, centered close to μ.
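A small simulation (ours; the normal errors and the particular σ are illustrative assumptions, only D(Y) = σ²I matters) makes the spherical shape tangible: the projection of the cloud onto any unit direction x has mean ⟨μ, x⟩ and variance close to σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 5, 2.0
X = np.column_stack([np.ones(n), np.arange(float(n))])
mu = X @ np.array([1.0, 2.0])                      # mu = X beta lies in W

# Cloud of observed data vectors: Y = mu + error, with D(Y) = sigma^2 I.
Y = mu + sigma * rng.standard_normal((10_000, n))

# Project the cloud onto a few random unit directions x.
for _ in range(3):
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                         # unit vector
    proj = Y @ x                                   # <x, Y> for each cloud point
    print(np.mean(proj) - mu @ x, np.var(proj))    # ~0 and ~sigma^2 = 4
```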
The task now is to visualize the dispersion cloud of the estimator β̂. The dispersion cloud
of the least squares estimator β̂ is obtained by taking, in the parameter space, the pre-image,
under the transformation X, of the dispersion cloud projected onto W. Since, in general, the
transformation X does not preserve distances, the pre-image will no longer have spherical
symmetry. With basic linear algebra it is possible to prove that this pre-image has
elliptical symmetry. Then the dispersion cloud of the least squares estimator β̂ is, with
high probability, approximately an ellipsoid centered at the true vector β, as shown in Figure 3.
Indeed, the pre-image of a sphere of constant radius centered at Xβ in W satisfies the equation
of an ellipsoid centered at β: $(X\hat{\beta} - X\beta)'(X\hat{\beta} - X\beta) = (\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta) = \text{const}$.
The symmetry of the dispersion cloud of β̂ gives an intuitive notion that the least squares
estimator β̂ is unbiased, as in fact it is. Students should take away from this development that
unbiasedness has a simple geometric meaning: the dispersion cloud of an unbiased estimator is symmetric around the true parameter vector β.
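Continuing the same toy model (our sketch, not from the article), a simulation shows both facts at once: the cloud of β̂ values is centered at β, and its sample covariance is close to σ²(X′X)⁻¹, a matrix that is not a multiple of the identity, hence the elliptical rather than spherical symmetry:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 5, 2.0
X = np.column_stack([np.ones(n), np.arange(float(n))])
beta = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Simulate many data vectors and collect the least squares estimates.
Y = X @ beta + sigma * rng.standard_normal((20_000, n))
beta_hat = Y @ (XtX_inv @ X.T).T                  # each row is one beta_hat

# Centered at beta (unbiasedness), with elliptical (not spherical) shape:
print(beta_hat.mean(axis=0))                      # ~ beta
print(np.cov(beta_hat, rowvar=False))             # ~ sigma^2 * (X'X)^{-1}
print(sigma**2 * XtX_inv)
```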
Now let us analyze the behavior of other linear unbiased estimators. Such estimators
have the form $\tilde{\beta} = LY$ for some matrix L, and unbiasedness requires $E[LY] = LX\beta = \beta$.
Since β is unknown, the equality must hold for every β; hence $LX = I$. It follows that the matrix XL is idempotent and is therefore
a projection onto W (Gentle 2007, p. 286). So, any other linear unbiased estimator is obtained in
the same way as the least squares estimator; that is, it is obtained through a linear projection onto W.
The distinction between them is that the projection is no longer orthogonal, as in the least squares
case. A non-orthogonal projection is called oblique. We will not give its precise
mathematical definition, but it is possible to get a good idea of how it works by looking at Figure 4.
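A quick numerical check (ours; the alternative left inverse L below, built from an arbitrary positive definite matrix A, is purely illustrative) confirms that any L with LX = I yields an idempotent matrix XL, hence a projection, which is in general not symmetric, that is, oblique:

```python
import numpy as np

X = np.column_stack([np.ones(5), np.arange(5.0)])

# One arbitrary alternative left inverse of X (illustrative choice only):
# L = (X' A X)^{-1} X' A for some positive definite A different from I.
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
L = np.linalg.inv(X.T @ A @ X) @ X.T @ A

assert np.allclose(L @ X, np.eye(2))   # LX = I, so L Y is unbiased for beta
P = X @ L                              # XL maps the data space into W
assert np.allclose(P @ P, P)           # idempotent: P is a projection
print(np.allclose(P, P.T))             # False: not symmetric -> oblique
```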
Remembering once more that the dispersion cloud of Y is a sphere approximately centered at
the vector Xβ in W, the oblique projection of this sphere onto W is no longer spherical but elliptical.
The following analogy helps to see this. The shadow of a sphere projected by the sun at noon is
circular, with the same radius as the sphere. As the sun sets, the shadow becomes an ellipse
of increasing eccentricity. Another good idea is to perform an experiment using a flashlight to
make the projection and a styrofoam sphere to represent the dispersion cloud (Figure 5).
To compare the total variance of the least squares estimator with the total variance of other linear
unbiased estimators, consider the dispersion clouds obtained by the orthogonal and the oblique
projections onto the subspace W. The spherical cloud related to the least squares estimator is
entirely contained in the elliptical cloud obtained by the oblique projection.
Taking the pre-image under X, the same occurs for the dispersion clouds in the parameter space.
This shows, intuitively, that the total variance of the least squares estimator is lower than
the total variance of any other linear unbiased estimator. Therefore, we have an intuitive geometric demonstration of the Gauss Markov theorem.
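Continuing the toy example (our sketch), the total variances, that is, the traces of the covariance matrices σ²LL′ of the two linear unbiased estimators, can be compared directly, and the least squares estimator indeed attains the smaller value:

```python
import numpy as np

X = np.column_stack([np.ones(5), np.arange(5.0)])
sigma2 = 4.0

# Least squares: beta_hat = L0 Y with L0 = (X'X)^{-1} X'.
L0 = np.linalg.inv(X.T @ X) @ X.T

# Alternative linear unbiased estimator from the oblique projection above.
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
L1 = np.linalg.inv(X.T @ A @ X) @ X.T @ A

# With D(Y) = sigma^2 I, the covariance of L Y is sigma^2 L L', and the
# total variance is its trace (sum of the component variances).
tv0 = sigma2 * np.trace(L0 @ L0.T)
tv1 = sigma2 * np.trace(L1 @ L1.T)
print(tv0, tv1, tv0 <= tv1)            # least squares wins: True
```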
4 – Conclusions
Although well known, geometry is not often used, even though it is the natural context for problems
related to least squares methods. The use of geometrical arguments is intuitive and illuminates the
statistical concepts. We believe that the geometrical approach to visualizing the Gauss Markov
theorem presented here improves students' comprehension of this fundamental result.
References:
Hoerl, A. E., and Kennard, R. W. (1970), "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12, 55–67.
Rao, C. R., and Toutenburg, H. (1999), Linear Models: Least Squares and Alternatives (2nd ed.), New York: Springer.
Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.