
The American Statistician

ISSN: 0003-1305 (Print) 1537-2731 (Online) Journal homepage: http://www.tandfonline.com/loi/utas20

An intuitive geometric approach to the Gauss Markov theorem

Leandro da Silva Pereira, Lucas Monteiro Chaves & Devanil Jaques de Souza

To cite this article: Leandro da Silva Pereira, Lucas Monteiro Chaves & Devanil Jaques de Souza (2016): An intuitive geometric approach to the Gauss Markov theorem, The American Statistician, DOI: 10.1080/00031305.2016.1209127

To link to this article: http://dx.doi.org/10.1080/00031305.2016.1209127

Accepted author version posted online: 21 Jul 2016.
Published online: 21 Jul 2016.


An intuitive geometric approach to the Gauss Markov theorem

Leandro da Silva Pereira¹
Lucas Monteiro Chaves²
Devanil Jaques de Souza³

Abstract

Algebraic proofs of the Gauss Markov theorem are very disappointing from an intuitive point of view. An alternative is to use geometry, which emphasizes the essential statistical ideas behind the result. A truly geometric, intuitive approach to the theorem is presented, based only on simple geometric concepts such as linear subspaces and orthogonal projections.

Keywords

Orthogonal projection, dispersion cloud of points, Gauss Markov estimator.

¹ UTFPR – Federal Technological University of Parana, DAMAT, Apucarana-PR, Brazil. Zip code: 86812-460. E-mail: leandropereira@utfpr.edu.br
² UFLA – Federal University of Lavras, DEX, Lavras-MG, Brazil. Zip code: 37200-000. E-mail: lucas@dex.ufla.br
³ UFLA – Federal University of Lavras, DEX, Lavras-MG, Brazil. Zip code: 37200-000. E-mail: devaniljaques@dex.ufla.br


1 – Introduction

There are few general results in statistics. Often, particular and very restrictive assumptions are necessary, such as normality. One of these few general results is the Gauss Markov theorem, which has highly practical consequences. The majesty of the Gauss Markov theorem lies in the fact that it holds regardless of the distribution of the random variable considered. This result gives statistical meaning to a purely mathematical fact: the least squares method. Since the theorem is a basic result, it is taught in beginning statistics courses. However, the proofs presented in most textbooks (Rencher and Schaalje 2008, Rao and Toutenburg 1999, Casella and Berger 2002) are based only on algebraic properties of positive definite matrices. Experience in class seems to lead us to conclude that this kind of demonstration does not improve student comprehension of the result. Proving the Gauss Markov theorem by algebraic methods seems to be at most innocuous. There are demonstrations that adopt a geometric flavor, but they usually rely on some algebraic results (Ruud 2000, Gruber 1998, Saville and Wood 1991). First of all, we have to be clear about what is meant by a geometric demonstration. In general this will not be, from a mathematical point of view, a totally rigorous proof, and it will require some degree of intuition. The tricky part of our geometric approach is to interpret matrices as linear transformations. Linear transformations can be viewed as geometric objects because, as functions, they transform vectors into vectors and linear subspaces into linear subspaces. In that sense they are very concrete geometrical objects. Another basic geometric concept is the projection of vectors onto linear subspaces, particularly the orthogonal projection. The purpose of this article is to provide an intuitive geometric demonstration of the Gauss Markov theorem, using only the concepts of linear subspaces, linear transformations and projections.


2 – The Linear Model

Consider a random vector Y with unknown mean vector μ = E[Y]. For reasons related to the random experiment that generates Y, we can suppose some linear relations among the components of the unknown mean vector μ and, therefore, assume that the vector μ belongs to some known linear subspace W, which, in essence, characterizes a linear model. The vector Y lies in the data space, in general the n-dimensional Euclidean space ℝⁿ. Since the dimension of the data space is higher than the dimension of W, that is, there are more data than characteristics to be estimated, it is plausible to use a smaller number of variables to describe the subspace W. Such a procedure is called parameterization and can be done in the following way: consider W to be the image of a linear transformation X, defined on another vector space, which will be called the parameter space, W = Im(X). To avoid technical difficulties the linear transformation X will be considered injective, that is, for each vector w in W there exists a unique vector β in the parameter space such that w = Xβ. In practical situations, the experiment defines the matrix X, the design matrix, and the column space of X defines W. Then it is possible to make the linear model assumptions clear: Y is a random vector in the data space, μ = E[Y] = Xβ is a vector in W, where β is an unknown vector in the parameter space, and Y = Xβ + ε, where ε is the vector of errors.

All of this can be described geometrically by Figure 1.
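To make the setup concrete, here is a minimal NumPy sketch of such a model. The design matrix, parameter values and noise level are arbitrary choices made only for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix X with n = 30 observations and p = 2 parameters
# (intercept and slope). Its column space is the subspace W.
n, p = 30, 2
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])

beta = np.array([1.0, 2.0])      # unknown parameter vector (chosen for the example)
mu = X @ beta                    # mean vector mu = E[Y] = X beta, a vector in W
sigma = 0.5
eps = rng.normal(0.0, sigma, n)  # errors, so that D(Y) = sigma^2 I
y = mu + eps                     # observed data vector in the n-dimensional data space
```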


The greatest advantage of representing the linear model as in Figure 1 is the description of the estimation process. What is the estimation process? It is a very simple decision: after a data vector y is observed, we have only to choose a vector in W which we believe to be a good representative of E[Y]. If y belongs to the space W, as this is the space to which the mean of the random vector Y is restricted, there is no reason not to estimate E[Y] by the observed vector y. But this seldom occurs, since the observed vector y is affected by random errors. Therefore, almost surely, the vector y does not belong to the subspace W, and a natural procedure to estimate the vector E[Y] is to take some kind of projection of y onto W. That process, when some specific restrictions on the projection are considered, explains more sophisticated estimation methods, like ridge regression or elastic net regression (Hoerl and Kennard 1970, Zou and Hastie 2005). Let us go to the simplest idea: to choose the vector in W closest to y. This estimation procedure is called the least squares method. How to do this? The answer is to use linear orthogonal projections. If P_W is the linear orthogonal projection onto W, the chosen vector is P_W(y) (Rencher and Schaalje 2008, pp. 43, 228). Since the linear transformation X is injective, there is only one β̂ such that P_W(y) = Xβ̂. This equation, in its algebraic form, is called the normal equation, and β̂ is the least squares estimate of the parameter vector β. To express β̂ in terms of the data vector y it is necessary to have an expression for the projection P_W as a matrix. This can be done with a little linear algebra, giving P_W = X(X'X)⁻¹X' and β̂ = (X'X)⁻¹X'y, where X' is the transpose of X. But that linear algebra is not necessary to understand what follows.
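Continuing the illustrative sketch above, these formulas can be checked directly. Forming (X'X)⁻¹ explicitly is fine for a small example, though np.linalg.lstsq is numerically preferable in practice.

```python
# Orthogonal projection matrix onto W = Im(X) and the least squares estimate.
XtX_inv = np.linalg.inv(X.T @ X)
P_W = X @ XtX_inv @ X.T          # P_W = X (X'X)^{-1} X'
beta_hat = XtX_inv @ X.T @ y     # beta_hat = (X'X)^{-1} X' y

# P_W(y) and X beta_hat coincide: both are the point of W closest to y.
assert np.allclose(P_W @ y, X @ beta_hat)

# A numerically safer route to the same estimate:
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_hat_lstsq)
```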


3 – The Geometry of the Gauss Markov Theorem

The statistical properties of this estimation method are established by the Gauss Markov theorem. Recall that, given a random vector Y with covariance matrix D(Y) = E[(Y − E[Y])(Y − E[Y])'], the total variance is the sum of the variances of the components of Y, that is, the trace of D(Y).

THEOREM (Gauss Markov): If Y = Xβ + ε with covariance matrix D(Y) = σ²I, then the least squares estimator β̂ = (X'X)⁻¹X'Y has minimum total variance among all linear unbiased estimators of β.

First of all, it is necessary to have an intuitive idea of the meaning of "covariance matrix". If a lot of values of the random vector Y are observed, they form a cloud of points in the data space. We will call this cloud the dispersion cloud. The matrix D(Y) tells us about the shape of this cloud. This can be seen in the following way. Consider a unit vector x. The orthogonal projection of the random vector Y onto the direction of x defines a one-dimensional random variable given by ⟨Y, x⟩ = ‖Y‖ cos(θ), where ⟨Y, x⟩ is the inner product and θ is the angle between x and Y. In this way, we have a one-dimensional random variable in the direction of x, centered on ⟨μ, x⟩ = ‖μ‖ cos(θ′), where θ′ is the angle between μ and x (Figure 2).


The variance of this random variable gives a good idea of the width of the dispersion cloud in the x direction. As we are supposing D(Y) = σ²I, this variance is var(⟨Y, x⟩) = x'D(Y)x = σ²x'x = σ². That is, the variability does not depend on the direction of x, so the dispersion is the same in every direction of the data space. This means that the width of the dispersion cloud must be almost the same in any direction, that is, the dispersion cloud has approximate spherical symmetry. To delimit the cloud, we can take a sphere that contains, say, approximately 75% of the cloud points. Furthermore, with high probability, such a cloud should be closely centered at μ. Observe that, since μ ∈ W, this spherical cloud must be almost symmetric in relation to W. If the observed data vectors in the cloud are orthogonally projected onto the subspace W, the projected cloud will again have almost spherical symmetry, with the same radius as the original dispersion cloud.
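Under the same illustrative assumptions, a quick simulation shows both facts: the spread of the cloud is about σ² in any direction, and the orthogonal projection onto W preserves that spread along directions inside W.

```python
# Simulate many realizations of Y; each row of `cloud` is one observed data vector.
m = 5000
cloud = mu + rng.normal(0.0, sigma, size=(m, n))

# The spread of the cloud in an arbitrary direction x is about sigma^2 ...
x_dir = rng.normal(size=n)
x_dir /= np.linalg.norm(x_dir)
print(np.var(cloud @ x_dir, ddof=1), sigma**2)

# ... and the orthogonal projection onto W keeps that spread along directions inside W.
w_dir = X[:, 1] - X[:, 1].mean()        # a vector lying in W = Im(X)
w_dir /= np.linalg.norm(w_dir)
projected = cloud @ P_W                  # P_W is symmetric, so this projects every row
print(np.var(projected @ w_dir, ddof=1), sigma**2)
```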

The task now is to visualize the dispersion cloud of the estimator β̂. The dispersion cloud of the least squares estimator β̂ is obtained by taking, in the parameter space, the pre-image under the transformation X of the dispersion cloud projected onto W. Since, in general, the transformation X does not preserve distances, the pre-image will no longer have spherical symmetry. With basic linear algebra it is possible to prove that this pre-image has elliptical symmetry. Then the dispersion cloud of the least squares estimator β̂ is, with high probability, approximately an ellipsoid centered at the true vector β, as shown in Figure 3.


In other words, a sphere in W, (w − μ)'(w − μ) = const., is the image, under the transformation X, of an ellipsoid centered at β: (Xβ̂ − Xβ)'(Xβ̂ − Xβ) = (β̂ − β)'X'X(β̂ − β) = const.

The symmetry of the dispersion cloud of β̂ gives an intuitive notion that the least squares

estimator β̂ is unbiased, as in fact it is. Students should take away from this development that

the concepts of unbiasedness and geometrical symmetry are closely related.
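A numerical illustration of this picture, still under the assumptions of the running sketch: the least squares estimates computed from the simulated cloud average out to approximately the true β, and their sample covariance is close to σ²(X'X)⁻¹, which describes the shape of the ellipsoid.

```python
# One least squares estimate per simulated realization of Y (rows of `cloud`).
beta_hats = cloud @ X @ XtX_inv          # each row is (X'X)^{-1} X' y for one y

print(beta_hats.mean(axis=0), beta)      # sample mean close to the true beta: unbiasedness
print(np.cov(beta_hats, rowvar=False))   # close to sigma^2 (X'X)^{-1}: the ellipsoid's shape
print(sigma**2 * XtX_inv)
```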

Now let us analyze the behavior of other linear unbiased estimators. Such estimators have the form β̃ = LY, where L is a p × n matrix, and unbiasedness means β = E[β̃] = E[LY] = L E[Y] = LXβ. Since β is unknown, this equality must hold for every β, and therefore LX = I. It follows from this equality that (XL)² = (XL)(XL) = X(LX)L = X I L = XL. This property implies that XL is a projection onto W (Gentle 2007, p. 286). So any other linear unbiased estimator is obtained in the same way as the least squares estimator; that is, it is obtained from a linear projection onto W.

The distinction between them is that the projection is no longer orthogonal, as it is in the least squares case. A non-orthogonal projection is called oblique. We will not give its precise mathematical definition, but it is possible to get a good idea of how it works by looking only at the two-dimensional case (Figure 4).
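One concrete way to produce such an estimator, purely for illustration, is the weighted form L = (X'AX)⁻¹X'A with an arbitrary positive definite matrix A different from the identity; the choice of A below is hypothetical and not prescribed by the paper. The sketch checks that LX = I and that XL is an idempotent, non-symmetric (hence oblique) projection onto W.

```python
# A hypothetical alternative linear unbiased estimator: L = (X'AX)^{-1} X'A.
A = np.diag(np.linspace(1.0, 4.0, n))     # arbitrary positive definite weights, A != I
L = np.linalg.inv(X.T @ A @ X) @ X.T @ A  # p x n matrix

assert np.allclose(L @ X, np.eye(p))      # L X = I: the unbiasedness condition
Q = X @ L                                 # Q = X L, a projection onto W
assert np.allclose(Q @ Q, Q)              # idempotent, hence a projection
assert not np.allclose(Q, Q.T)            # not symmetric, hence oblique, not orthogonal
```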


Remembering once more that the dispersion cloud of Y is approximately a sphere centered at the vector Xβ in W, the oblique projection of this sphere onto W is no longer spherical but elliptical. The following analogy helps to see this. The shadow of a sphere projected by the sun at noon is circular, with unchanged radius. As the sun sets, the projected image becomes an ellipse of increasing eccentricity. Another good idea is to perform an experiment using a flashlight to make the projection and a styrofoam sphere to represent the dispersion cloud (Figure 5).

To compare the total variance of the least squares estimator with the total variance of other linear unbiased estimators, it is enough, heuristically, to compare the dispersion clouds obtained by orthogonal and oblique projections onto the subspace W. The spherical cloud related to the least squares estimator is entirely contained in the elliptical cloud obtained by the oblique projection. Taking the pre-image under X, the same occurs for the dispersion clouds in the parameter space. This demonstrates, intuitively, that the total variance of the least squares estimator is lower than the total variance of any other linear unbiased estimator. Therefore we have an intuitive geometric demonstration of the Gauss Markov theorem.
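Closing the loop numerically, under the same illustrative setup: with D(Y) = σ²I the covariance of β̂ is σ²(X'X)⁻¹ and that of β̃ = LY is σ²LL', so the total variances can be compared directly through their traces.

```python
# Total variance (trace of the covariance matrix) of each estimator when D(Y) = sigma^2 I.
total_var_ols = sigma**2 * np.trace(XtX_inv)   # least squares: sigma^2 tr[(X'X)^{-1}]
total_var_alt = sigma**2 * np.trace(L @ L.T)   # alternative:   sigma^2 tr[L L']

print(total_var_ols, total_var_alt)
assert total_var_ols <= total_var_alt + 1e-12  # Gauss Markov: least squares has the smaller trace
```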

4 – Conclusions

Although it is well known that geometry is the natural context for problems related to least squares methods, this fact is not often used. Geometric arguments are intuitive and illuminate the statistical concepts involved. We believe that this geometric approach to visualizing the Gauss Markov theorem has considerable pedagogical value.


References:

Casella, G. and Berger, R. L. (2002), Statistical Inference, second edition, Pacific Grove, USA: Duxbury.

Gentle, J. E. (2007), Matrix Algebra: Theory, Computations, and Applications in Statistics, New York, USA: Springer.

Gruber, M. H. J. (1998), Improving Efficiency by Shrinkage: The James-Stein and Ridge Regression Estimators, New York, USA: Marcel Dekker.

Hoerl, A. E. and Kennard, R. W. (1970), "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, 12, 55-67.

Rao, C. R. and Toutenburg, H. (1999), Linear Models: Least Squares and Alternatives, second edition, New York, USA: Springer-Verlag.

Rencher, A. C. and Schaalje, G. B. (2008), Linear Models in Statistics, New Jersey, USA: John Wiley & Sons.

Ruud, P. A. (2000), An Introduction to Classical Econometric Theory, New York, USA: Oxford University Press.

Saville, D. J. and Wood, G. R. (1991), Statistical Methods: The Geometric Approach, New York, USA: Springer-Verlag.

Zou, H. and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net", Journal of the Royal Statistical Society, Series B, 67, 301-320.


Figure 1: Geometrical characterization of a linear model


Figure 2: Random variable in the x vector direction.


Figure 3: Dispersion clouds of the random vectors Y and β̂ .


Figure 4: Oblique projection.


Figure 5: Oblique projection of the dispersion cloud

