
The Multivariate Normal Distribution

Sampling from a Multivariate Normal Distribution


MLE, Sampling Distribution of X̄ and S

Santiago Alférez
February 2021
Análisis Estadístico de Datos
MACC

Universidad del Rosario

Contents

The Multivariate Normal Likelihood

Maximum Likelihood Estimation of µ and Σ

The Sampling Distribution of X̄ and S

Large-Sample Behavior of X̄ and S

The Multivariate Normal Likelihood


• Let us assume that the p × 1 vectors X1 , X2 , . . . , Xn represent a
random sample from a multivariate normal population with mean
vector µ and covariance matrix Σ.
• Since X1 , X2 , . . . , Xn are mutually independent and each has
distribution Np (µ, Σ), the joint density function of all the
observations is the product of the marginal normal densities:
{Joint density of X_1, X_2, ..., X_n} = \prod_{j=1}^{n} \frac{1}{(2π)^{p/2} |Σ|^{1/2}} e^{-(x_j - µ)′ Σ^{-1} (x_j - µ)/2}

                                      = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} e^{-\sum_{j=1}^{n} (x_j - µ)′ Σ^{-1} (x_j - µ)/2}

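As a quick numerical illustration (not part of the original slides), the joint density above can be evaluated in two equivalent ways. The sketch below is a minimal check assuming NumPy/SciPy and an invented mean vector, covariance matrix, and simulated data; it verifies that the product of the marginal densities matches the single-exponent form.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p, n = 3, 10
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=n)      # n rows, each a p-vector x_j

# Product of the marginal normal densities (sum of log-densities)
log_joint_product = multivariate_normal(mu, Sigma).logpdf(X).sum()

# Single-exponent form: -np/2 log(2*pi) - n/2 log|Sigma| - (1/2) sum_j (x_j - mu)' Sigma^{-1} (x_j - mu)
diff = X - mu
quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
log_joint_exponent = (-0.5 * n * p * np.log(2 * np.pi)
                      - 0.5 * n * np.log(np.linalg.det(Sigma))
                      - 0.5 * quad.sum())

assert np.isclose(log_joint_product, log_joint_exponent)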
• When the numerical values of the observations become available,
they may be substituted for the xj in the above Equation.
• The resulting expression, now considered as a function of µ and Σ
for the fixed set of observations x1 , x2 , . . . , xn , is called the
likelihood.
• Many good statistical procedures employ values for the population
parameters that "best" explain the observed data.
• One meaning of best is to select the parameter values that
maximize the joint density evaluated at the observations. This
technique is called maximum likelihood estimation, and the
maximizing parameter values are called maximum likelihood
estimates.

• We shall consider maximum likelihood estimation of the
parameters µ and Σ for a multivariate normal population.
• To do so, we take the observations x_1, x_2, ..., x_n as fixed and
consider the joint density equation evaluated at these values.
The result is the likelihood function.

The trace of a symmetric matrix


Let A be a k × k symmetric matrix and x be a k × 1 vector. Then

(a) x′Ax = tr(x′Ax) = tr(Axx′)

(b) tr(A) = \sum_{i=1}^{k} λ_i, where the λ_i are the eigenvalues of A

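A small numerical check of these two trace properties (not from the slides; a sketch assuming NumPy and an arbitrary symmetric matrix):

import numpy as np

rng = np.random.default_rng(1)
k = 4
M = rng.standard_normal((k, k))
A = (M + M.T) / 2                 # arbitrary k x k symmetric matrix
x = rng.standard_normal(k)        # k x 1 vector

# (a) x'Ax = tr(x'Ax) = tr(A x x')
quad = x @ A @ x
assert np.isclose(quad, np.trace(A @ np.outer(x, x)))

# (b) tr(A) equals the sum of the eigenvalues of A
assert np.isclose(np.trace(A), np.linalg.eigvalsh(A).sum())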

{Joint density of X_1, X_2, ..., X_n} = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} e^{-\sum_{j=1}^{n} (x_j - µ)′ Σ^{-1} (x_j - µ)/2}

Applying the trace of a symmetric matrix


(x_j - µ)′ Σ^{-1} (x_j - µ) = tr[(x_j - µ)′ Σ^{-1} (x_j - µ)] = tr[Σ^{-1} (x_j - µ)(x_j - µ)′]

Since the trace of a sum of matrices is equal to the sum of the traces of the matrices:

\sum_{j=1}^{n} (x_j - µ)′ Σ^{-1} (x_j - µ) = \sum_{j=1}^{n} tr[(x_j - µ)′ Σ^{-1} (x_j - µ)]
                                           = \sum_{j=1}^{n} tr[Σ^{-1} (x_j - µ)(x_j - µ)′]
                                           = tr[Σ^{-1} \sum_{j=1}^{n} (x_j - µ)(x_j - µ)′]

\sum_{j=1}^{n} (x_j - µ)′ Σ^{-1} (x_j - µ) = tr[Σ^{-1} \sum_{j=1}^{n} (x_j - µ)(x_j - µ)′]

Adding and subtracting x̄ = (1/n) \sum_{j=1}^{n} x_j in each term (x_j - µ), and noting that the
cross-product terms \sum_{j=1}^{n} (x_j - x̄)(x̄ - µ)′ and \sum_{j=1}^{n} (x̄ - µ)(x_j - x̄)′ are both matrices of zeros:

\sum_{j=1}^{n} (x_j - x̄ + x̄ - µ)(x_j - x̄ + x̄ - µ)′
    = \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′ + \sum_{j=1}^{n} (x̄ - µ)(x̄ - µ)′
    = \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′ + n(x̄ - µ)(x̄ - µ)′

\sum_{j=1}^{n} (x_j - µ)′ Σ^{-1} (x_j - µ) = tr[Σ^{-1} \sum_{j=1}^{n} (x_j - µ)(x_j - µ)′]
                                           = tr[Σ^{-1} (\sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′ + n(x̄ - µ)(x̄ - µ)′)]

Joint density of a random sample from a multivariate normal population

{Joint density of X_1, X_2, ..., X_n} = (2π)^{-np/2} |Σ|^{-n/2}
    × exp{- tr[Σ^{-1} (\sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′ + n(x̄ - µ)(x̄ - µ)′)]/2}


Likelihood function from the joint density

L(µ, Σ) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} e^{- tr[Σ^{-1} (\sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′ + n(x̄ - µ)(x̄ - µ)′)]/2}

Different ways of writing the exponent in the likelihood function

tr[Σ^{-1} (\sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′ + n(x̄ - µ)(x̄ - µ)′)]
    = tr[Σ^{-1} \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′] + n tr[Σ^{-1} (x̄ - µ)(x̄ - µ)′]
    = tr[Σ^{-1} \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′] + n(x̄ - µ)′ Σ^{-1} (x̄ - µ)
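Purely as an illustration (not in the slides), the sketch below verifies numerically, with NumPy and made-up data and parameter values, that the exponent written with the trace decomposition equals the original sum of quadratic forms.

import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 50
mu = np.array([1.0, -2.0, 0.5])              # hypothetical mean used only for the check
Sigma = np.diag([2.0, 1.0, 0.5])
Sigma_inv = np.linalg.inv(Sigma)
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
xbar = X.mean(axis=0)

# Direct form: sum_j (x_j - mu)' Sigma^{-1} (x_j - mu)
d = X - mu
direct = np.einsum('ij,jk,ik->', d, Sigma_inv, d)

# Decomposed form: tr[Sigma^{-1} sum_j (x_j - xbar)(x_j - xbar)'] + n (xbar - mu)' Sigma^{-1} (xbar - mu)
c = X - xbar
scatter = c.T @ c                             # sum_j (x_j - xbar)(x_j - xbar)'
decomposed = np.trace(Sigma_inv @ scatter) + n * (xbar - mu) @ Sigma_inv @ (xbar - mu)

assert np.isclose(direct, decomposed)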
Maximum Likelihood Estimation of µ and Σ

Inequality to obtain the MLE of µ and Σ


Given a p × p symmetric positive definite matrix B and a scalar b > 0, it follows that

\frac{1}{|Σ|^{b}} e^{- tr(Σ^{-1} B)/2} ≤ \frac{1}{|B|^{b}} (2b)^{pb} e^{-bp}

for all positive definite p × p matrices Σ, with equality holding only for Σ = (1/2b)B.


MLE of µ and Σ
Let X1 , X2 , . . . , Xn be a random sample from a normal population
with mean µ and covariance Σ. Then
µ̂ = X̄   and   Σ̂ = \frac{1}{n} \sum_{j=1}^{n} (X_j - X̄)(X_j - X̄)′ = \frac{n-1}{n} S

are the maximum likelihood estimators of µ and Σ, respectively.

Their observed values, x̄ and (1/n) \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′, are
called the maximum likelihood estimates of µ and Σ.

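A minimal sketch of these estimators in code (not from the slides; assumes NumPy and arbitrary simulated data), contrasting the ML estimate Σ̂ = ((n−1)/n)S with the unbiased sample covariance S:

import numpy as np

rng = np.random.default_rng(3)
p, n = 3, 200
mu_true = np.array([0.0, 1.0, -1.0])
Sigma_true = np.array([[1.0, 0.4, 0.0],
                       [0.4, 2.0, 0.3],
                       [0.0, 0.3, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)   # n x p data matrix

mu_hat = X.mean(axis=0)                    # MLE of mu: the sample mean vector
S = np.cov(X, rowvar=False)                # unbiased sample covariance (divisor n-1)
Sigma_hat = (n - 1) / n * S                # MLE of Sigma: divisor n instead of n-1

# Equivalent direct computation of Sigma_hat
c = X - mu_hat
assert np.allclose(Sigma_hat, c.T @ c / n)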

The exponent in the likelihood function, without the factor -1/2

tr[Σ^{-1} \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′] + n(x̄ - µ)′ Σ^{-1} (x̄ - µ)

Proof of the MLE of µ and Σ

Since Σ^{-1} is positive definite, the distance (x̄ - µ)′ Σ^{-1} (x̄ - µ) > 0
unless µ = x̄. Thus, the likelihood is maximized with respect to µ at µ̂ = x̄.
Then we need only maximize

L(µ̂, Σ) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} e^{- tr[Σ^{-1} \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′]/2}

with respect to Σ.


Recall that, for a p × p symmetric positive definite matrix B and a scalar b > 0,

\frac{1}{|Σ|^{b}} e^{- tr(Σ^{-1} B)/2} ≤ \frac{1}{|B|^{b}} (2b)^{pb} e^{-bp}

for all positive definite p × p matrices Σ, with equality holding only for Σ = (1/2b)B, and that

L(µ̂, Σ) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} e^{- tr[Σ^{-1} \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′]/2}

Proof of the MLE of µ and Σ (continued)

Using the inequality above with b = n/2 and B = \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′,
the maximum occurs at Σ̂ = (1/n) \sum_{j=1}^{n} (x_j - x̄)(x_j - x̄)′.
• The maximum likelihood estimators are random quantities. They are obtained by
replacing the observations x_1, x_2, ..., x_n in the expressions for µ̂ and Σ̂ with
the corresponding random vectors, X_1, X_2, ..., X_n.


Relation between likelihood and generalized variance

• The maximum likelihood estimator X̄ is a random vector and the
maximum likelihood estimator Σ̂ is a random matrix. The maximum
likelihood estimates are their particular values for the given data set.
• The maximum of the likelihood is

L(µ̂, Σ̂) = \frac{1}{(2π)^{np/2}} e^{-np/2} \frac{1}{|Σ̂|^{n/2}}

or, since |Σ̂| = [(n-1)/n]^{p} |S|,

L(µ̂, Σ̂) = constant × (generalized variance)^{-n/2}

• The generalized variance determines the peakedness of the likelihood


function and, consequently, is a natural measure of variability when the
parent population is multivariate normal.

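To illustrate (a sketch only, not from the slides; assumes NumPy/SciPy and simulated data), the maximized log-likelihood can be computed directly and via the |Σ̂|^{-n/2} expression above; the two agree:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
p, n = 2, 30
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

mu_hat = X.mean(axis=0)
c = X - mu_hat
Sigma_hat = c.T @ c / n                       # MLE of Sigma

# Direct evaluation of the log-likelihood at (mu_hat, Sigma_hat)
loglik_direct = multivariate_normal(mu_hat, Sigma_hat).logpdf(X).sum()

# Closed form: -np/2 log(2*pi) - np/2 - (n/2) log|Sigma_hat|
gen_var = np.linalg.det(Sigma_hat)            # |Sigma_hat|, proportional to the generalized variance |S|
loglik_closed = -0.5 * n * p * np.log(2 * np.pi) - 0.5 * n * p - 0.5 * n * np.log(gen_var)

assert np.isclose(loglik_direct, loglik_closed)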

Invariance property of MLE


Let θ̂ be the MLE of θ, and consider estimating the parameter h(θ), which is a
function of θ. Then the MLE of h(θ) (a function of θ) is given by h(θ̂) (the same function of θ̂).

1. The maximum likelihood estimator of µ′Σ^{-1}µ is µ̂′Σ̂^{-1}µ̂, where µ̂ = X̄ and
Σ̂ = ((n-1)/n)S are the maximum likelihood estimators of µ and Σ,
respectively.

2. The maximum likelihood estimator of \sqrt{σ_{ii}} is \sqrt{σ̂_{ii}}, where

σ̂_{ii} = \frac{1}{n} \sum_{j=1}^{n} (X_{ij} - X̄_i)^2

is the maximum likelihood estimator of σ_{ii} = Var(X_i).
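A short sketch of the invariance property in code (illustrative only; assumes NumPy, simulated data, and the two parameter functions listed above):

import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 100
X = rng.multivariate_normal(np.zeros(p), np.diag([1.0, 4.0, 9.0]), size=n)

mu_hat = X.mean(axis=0)
c = X - mu_hat
Sigma_hat = c.T @ c / n                                   # MLE of Sigma, i.e. ((n-1)/n) S

# 1. MLE of mu' Sigma^{-1} mu is mu_hat' Sigma_hat^{-1} mu_hat
mle_quad_form = mu_hat @ np.linalg.inv(Sigma_hat) @ mu_hat

# 2. MLE of sqrt(sigma_ii) is sqrt(sigma_hat_ii), the square root of the
#    diagonal entries of Sigma_hat (divisor n, not n-1)
mle_sd = np.sqrt(np.diag(Sigma_hat))

print(mle_quad_form, mle_sd)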

Sufficient Statistics
• The sample estimates X and S are sufficient statistics
• This means that all of the information contained in the data can
be summarized by these two statistics alone
• This is only true if the data follow a multivariate normal
distribution; if they do not, other summaries are needed (e.g., a
skewness array, a kurtosis array, etc.)
• Some statistical methods only use one or both of these matrices in
their analysis procedures and not the actual data

The Sampling Distribution of X̄ and S

Some considerations
• With p = 1, we know that X̄ is normal with mean µ (the population
mean) and variance (1/n)σ² = (population variance)/(sample size).
• The result for the multivariate case (p ≥ 2) is analogous in that X̄ has a
normal distribution with mean µ and covariance matrix (1/n)Σ.
• For the sample variance, recall that (n-1)s² = \sum_{j=1}^{n} (X_j - X̄)² is
distributed as σ² times a chi-square variable having n-1 degrees of freedom.
• This chi-square is the distribution of a sum of squares of independent
standard normal random variables. That is, (n-1)s² is distributed as
σ²(Z_1² + ··· + Z_{n-1}²) = (σZ_1)² + ··· + (σZ_{n-1})².
• The individual terms σZ_i are independently distributed as N(0, σ²). It is
this latter form that is suitably generalized to the basic sampling
distribution for the sample covariance matrix.
MLE of µ and Σ

The maximum likelihood estimator of µ is µ̂ = X̄ = \frac{1}{n} \sum_{j=1}^{n} X_j, and
the ML estimator of Σ is

Σ̂ = \frac{n-1}{n} S = S_n = \frac{1}{n} \sum_{j=1}^{n} (X_j - µ̂)(X_j - µ̂)′

Sampling distribution of µ̂

The estimator is a linear combination of normal random vectors, each
from N_p(µ, Σ) and i.i.d.:

µ̂ = X̄ = \frac{1}{n} X_1 + \frac{1}{n} X_2 + ··· + \frac{1}{n} X_n

So µ̂ = X̄ also has a normal distribution, N_p(µ, (1/n)Σ).
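As an informal check (not part of the slides; a sketch assuming NumPy and simulated data), repeatedly drawing samples of size n and averaging them gives sample mean vectors whose empirical covariance is close to (1/n)Σ:

import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 2, 25, 20000
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# For each replication, draw a sample of size n and record its mean vector
means = np.array([
    rng.multivariate_normal(mu, Sigma, size=n).mean(axis=0)
    for _ in range(reps)
])

print(means.mean(axis=0))          # close to mu
print(np.cov(means, rowvar=False)) # close to Sigma / n
print(Sigma / n)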
Σ̂ = \frac{n-1}{n} S

Sampling distribution of Σ̂

The matrix

(n-1)S = \sum_{j=1}^{n} (X_j - X̄)(X_j - X̄)′

is distributed as a Wishart random matrix with (n-1) degrees of freedom.

Wishart distribution
• A multivariate analogue of the chi-square distribution.
• It is defined as

W_m(· | Σ) = Wishart distribution with m degrees of freedom
           = the distribution of \sum_{j=1}^{m} Z_j Z_j′

where the Z_j ∼ N_p(0, Σ) are independent.

Note: X̄ and S are independent.


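A small simulation sketch (not from the slides; assumes NumPy) comparing (n−1)S with the Wishart construction \sum_{j=1}^{m} Z_j Z_j′ using m = n−1 independent N_p(0, Σ) vectors; averaged over many replications, both are near (n−1)Σ:

import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 2, 10, 20000
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

scatter_avg = np.zeros((p, p))   # average of (n-1) S over replications
wishart_avg = np.zeros((p, p))   # average of sum_{j=1}^{n-1} Z_j Z_j'

for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    c = X - X.mean(axis=0)
    scatter_avg += c.T @ c / reps

    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n - 1)
    wishart_avg += Z.T @ Z / reps

print(scatter_avg)        # both are close to (n-1) * Sigma
print(wishart_avg)
print((n - 1) * Sigma)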
Large-Sample Behavior of X̄ and S
Law of Large Numbers

Data are not always (multivariate) normal

The law of large numbers for multivariate data
• Let X_1, X_2, ..., X_n be independent observations from a
population with mean E(X) = µ.
• Then X̄ = (1/n) \sum_{j=1}^{n} X_j converges in probability to µ as n gets
large; that is,

X̄ → µ for large samples

• Similarly, S (or S_n) approaches Σ for large samples.
• These results hold regardless of the true distribution of the X_j's.

Central Limit Theorem

The central limit theorem

• Let X_1, X_2, ..., X_n be independent observations from a
population with mean E(X) = µ and finite (non-singular, full
rank) covariance matrix Σ.
• Then \sqrt{n}(X̄ - µ) has an approximate N_p(0, Σ) distribution when
n ≫ p (i.e., n is "much larger than" p).
• So, for "large" n,

X̄ = sample mean vector ≈ N_p(µ, \frac{1}{n}Σ)

regardless of the underlying distribution of the X_j's.


What if Σ is unknown?

If n is large "enough", S will be close to Σ, so

\sqrt{n}(X̄ - µ) ≈ N_p(0, S)   or   X̄ ≈ N_p(µ, \frac{1}{n}S)

and, since n(X̄ - µ)′ Σ^{-1} (X̄ - µ) ∼ χ²_p,

n(X̄ - µ)′ S^{-1} (X̄ - µ) ≈ χ²_p

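A sketch of this large-sample chi-square approximation (illustrative only; assumes NumPy/SciPy and deliberately non-normal simulated data): the empirical 95th percentile of n(X̄ - µ)′ S^{-1} (X̄ - µ) should be close to the χ²_p quantile.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
p, n, reps = 3, 200, 5000
mu = np.full(p, 1.0)                       # exponential(1) marginals have mean 1

stats = np.empty(reps)
for r in range(reps):
    X = rng.exponential(scale=1.0, size=(n, p))   # non-normal data
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    d = xbar - mu
    stats[r] = n * d @ np.linalg.inv(S) @ d       # n (xbar - mu)' S^{-1} (xbar - mu)

print(np.quantile(stats, 0.95))            # close to the chi-square quantile below
print(chi2.ppf(0.95, df=p))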

Some additional considerations


• Using S instead of Σ does not seriously affect the approximation.
• n must be large relative to p; that is, (n - p) must be large.
• The probability contours for X̄ are tighter than those for the individual X_j,
since X̄ has covariance (1/n)Σ rather than Σ.

Comparison of Probability Contours

Below are contours for 99%, 95%, 90%, 75%, 50% and 20% for an
example with n = 20:

[Figure: two contour plots, left panel "Contours for X_j", right panel "Contours for X̄".]
