
Missing Data and the EM algorithm

MSc Further Statistical Methods


Lecture 4 and 5
Hilary Term 2007
Steffen Lauritzen, University of Oxford; January 31, 2007
Missing data problems

case   A    B    C    D    E    F
1      a1   b1   ∗    d1   e1   ∗
2      a2   ∗    c2   d2   e2   ∗
...    ...  ...  ...  ...  ...  ...
n      an   bn   cn   ∗    ∗    ∗

∗ or NA denotes values that are missing, i.e. non-observed.


Examples of missingness

• non-reply in surveys;
• non-reply to specific questions: "missing" ∼ don't
know, essentially an additional state for the variable
in question;
• recording error;
• variable out of range;
• just not recorded (e.g. too expensive).

Different types of missingness demand different treatment.


Notation for missingness

Data matrix Y , missing data matrix M = {Mij }:

    Mij = 1 if Yij is missing,  Mij = 0 if Yij is observed.

Convenient to introduce the notation Y = (Yobs , Ymis ),


where Ymis are conceptual and denote the data that were
not observed.
This notation follows Little and Rubin (2002).
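
To make the notation concrete, here is a small Python/NumPy sketch (not part of the original slides) that builds the missingness matrix M from a data matrix Y; the array values are invented purely for illustration.

import numpy as np

# Hypothetical data matrix Y with np.nan playing the role of NA
# (values invented purely for illustration).
Y = np.array([[1.2, np.nan, 0.7],
              [np.nan, 2.3, 1.1],
              [0.4, 0.9, np.nan]])

# Missingness matrix M: M[i, j] = 1 if Y[i, j] is missing, 0 if observed.
M = np.isnan(Y).astype(int)

# Y_obs collects the observed entries; Y_mis is only conceptual (unknown).
Y_obs = Y[M == 0]

print(M)
print(Y_obs)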
Patterns of missingness

Little and Rubin (2002) classify patterns of missingness into the following technical categories.
We shall illustrate with a case of cross-classification of Sex,
Race, Admission and Department, S, R, A, D.

Univariate: Mij = 0 unless j = j ∗ , e.g. an unmeasured


response. Example: R unobserved for some, but data
otherwise complete.
Multivariate: Mij = 0 unless j ∈ J ⊂ V , as above, just
with multivariate response, e.g. in surveys. Example:
For some subjects, both R and S unobserved.
Monotone: There is an ordering of V so Mik = 0 implies
Mij = 0 for j < k, e.g. drop-out in longitudinal
studies. Example: For some, A is unobserved, others
neither A nor R, but data otherwise complete.
Disjoint: Two subsets of variables never observed
together. Controversial. Appears in Rubin’s causal
model. Example: S and R never both observed.
General: none of the above. Haphazardly scattered
missing values. Example: R unobserved for some, A
unobserved for others, S, D for some.
Latent: A certain variable is never observed. Maybe it is
even unobservable. Example: S never observed, but
believed to be important for explaining the data.
Methods for dealing with missing data

Complete case analysis: analyse only cases where all


variables are observed. Can be adequate if most cases
are present, but will generally give serious biases in
the analysis. In surveys, for example, this
corresponds to making inference about the population
of responders, not the full population;
Weighting methods: For example, if a population mean
µ = E(Y ) is to be estimated and unit i has been
selected with probability πi , a standard method is the
Horvitz–Thompson estimator

    µ̂ = (Σi Yi /πi ) / (Σi 1/πi ).

To correct for non-response, one could let ρi be the
response probability, estimate it in some way as ρ̂i ,
and then let

    µ̃ = {Σi Yi /(πi ρ̂i )} / {Σi 1/(πi ρ̂i )}

(a small numerical sketch of these weighted estimators
is given after this list of methods).

Imputation methods: Find ways of estimating the
unobserved values as Ŷmis , then proceed as if
there were complete data. Without care, this can give
misleading results, in particular because the "sample
size" can be grossly overestimated.
Model-based likelihood methods: Model the missing data
mechanism and then proceed to make a proper
likelihood-based analysis, either via the method of
maximum likelihood or using Bayesian methods. This
appears to be the most sensible way.
Typically this approach was not computationally
feasible in the past, but modern algorithms and
computers have changed things completely. Ironically,
the efficient algorithms are themselves based upon
imputation of missing values, but with proper
corrections applied.
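
Below is a small Python sketch (not part of the original slides) of the two weighted estimators above; the values, selection probabilities and estimated response probabilities are invented, and the non-response-corrected estimator is computed over the responding units only.

import numpy as np

# Invented example: y-values, selection probabilities pi_i and estimated
# response probabilities rho_hat_i (assumptions, for illustration only).
y    = np.array([4.0, 7.5, 3.2, 6.1, 5.0])
pi   = np.array([0.2, 0.5, 0.1, 0.4, 0.3])        # selection probabilities
rho  = np.array([0.9, 0.8, 0.7, 0.95, 0.85])      # estimated response probabilities
resp = np.array([True, True, False, True, True])  # which units responded

# Horvitz-Thompson type ratio estimator of mu using all selected units:
mu_hat = np.sum(y / pi) / np.sum(1.0 / pi)

# Correction for non-response: restrict to responders, weight by 1/(pi * rho).
w = 1.0 / (pi[resp] * rho[resp])
mu_tilde = np.sum(w * y[resp]) / np.sum(w)

print(mu_hat, mu_tilde)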
Mechanisms of missingness

The data are missing completely at random, MCAR, if

    f (M | Y, θ) = f (M | θ),   i.e.   M ⊥⊥ Y | θ.

Heuristically, the values of Y have themselves no
influence on the missingness. Examples are recording
error, latent variables, and variables that are missing
by design (e.g. measuring certain values only for the
first m out of n cases). Beware: it may be
counterintuitive that missing by design is MCAR.
The data are missing at random, MAR, if

f (M | Y, θ) = f (M | Yobs , θ), i.e. M ⊥⊥ Ymis | (Yobs , θ).


Heuristically, only the observed values of Y have
influence on the missingness. This can happen by design,
e.g. if individuals with certain characteristics of Yobs
are not included in part of the study (where Ymis is measured).
The data are not missing at random, NMAR, in all other
cases.
For example, if certain values of Y cannot be
recorded when they are out of range, e.g. in survival
analysis.

The classifications above of the mechanism of missingness


lead again to increasingly complex analyses.
It is not clear that the notion of MCAR is helpful, but MAR
is. Note that if data are MCAR, they are also MAR.
Likelihood-based methods

The most convincing treatment of missing data problems


seems to be via modelling the missing data mechanism, i.e.
by considering the missing data matrix M as an explicit
part of the data.
The likelihood function then takes the form

    L(θ | M, yobs ) ∝ ∫ f (M, yobs , ymis | θ) dymis
                    = ∫ Cmis (θ | M, yobs , ymis ) f (yobs , ymis | θ) dymis ,   (1)

where the factor Cmis (θ | M, y) = f (M | yobs , ymis , θ) is


based on an explicit model for the missing data mechanism.
Ignoring the missing data mechanism
The likelihood function ignoring the missing data
mechanism is
    Lign (θ | yobs ) ∝ f (yobs | θ) = ∫ f (yobs , ymis | θ) dymis .   (2)

When is L ∝ Lign so the missing data mechanism can be


ignored for further analysis? This is true if:

1. The data are MAR;


2. The parameters η governing the missingness are
separate from the parameters of interest ψ, i.e. the
parameters vary in a product region, so that
information about the value of one does not restrict
the other.
Ignorable missingness

If data are MAR and the missingness parameter is separate
from the parameter of interest, we have θ = (η, ψ) and

    Cmis (θ) = f (M | yobs , ymis , η) = f (M | yobs , η).

Hence the correction factor Cmis does not depend on ymis and
can be taken outside the integral in (1), so that

    L(θ | M, yobs ) ∝ Cmis (η) Lign (θ | yobs ),

and since

    f (yobs , ymis | θ) = f (yobs , ymis | ψ)

we get

    L(θ | M, yobs ) ∝ Cmis (η) Lign (ψ | yobs ),

which shows that the missingness mechanism can be
ignored when concerned with likelihood inference about ψ.
For a Bayesian analysis the parameters must in addition be
independent w.r.t. the prior:

f (η, ψ) = f (η)f (ψ).

If the data are NMAR or the parameters are not separate,


then the missing data mechanism cannot be ignored.
Care must then be taken to model the mechanism
f (M | yobs , ymis , θ) and the corresponding likelihood term
must be properly included in the analysis.
Note: Ymis is MAR if the data are (M, Y ), i.e. if M is considered
part of the data, since then M ⊥⊥ Ymis | (M, Yobs , θ).
The EM algorithm

The EM algorithm is an alternative to Newton–Raphson or
the method of scoring for computing the MLE in cases where
the complications in calculating the MLE are due to
incomplete observation and the data are MAR (missing at
random), with separate parameters for the observation model
and the missing data mechanism, so that the missing data
mechanism can be ignored.
Data (X, Y ) are the complete data whereas only
incomplete data Y = y are observed. (Rubin uses Y = Yobs
and X = Ymis ).
The complete data log-likelihood is:

l(θ) = log L(θ; x, y) = log f (x, y; θ).


The marginal log-likelihood or incomplete data
log-likelihood is based on y alone and is equal to
ly (θ) = log L(θ; y) = log f (y; θ).
We wish to maximize ly in θ, but ly is typically quite
unpleasant:

    ly (θ) = log ∫ f (x, y; θ) dx.

The EM algorithm is a method of maximizing the latter


iteratively and alternates between two steps, one known as
the E-step and one as the M-step, to be detailed below.
We let θ∗ be an arbitrary but fixed value, typically the
value of θ at the current iteration.
The E-step calculates the expected complete data
log-likelihood ratio q(θ | θ∗ ):
 
    q(θ | θ∗ ) = Eθ∗ { log [ f (X, y; θ) / f (X, y; θ∗ ) ] | Y = y }
               = ∫ log [ f (x, y; θ) / f (x, y; θ∗ ) ] f (x | y; θ∗ ) dx.

The M-step maximizes q(θ | θ∗ ) in θ for fixed θ∗ , i.e.
calculates

    θ∗∗ = argmaxθ q(θ | θ∗ ).

After an E-step and subsequent M-step, the likelihood
function never decreases.
The picture on the next overhead should show it all.
Expected and complete data likelihood

[Figure: ly (θ) − ly (θ∗ ) and q(θ | θ∗ ) − q(θ∗ | θ∗ ) plotted against θ; the two curves touch at θ = θ∗ with common tangent of slope ∇ly (θ∗ ), and the gap between them is KL(f_θ∗^y : f_θ^y ) ≥ 0.]

    ly (θ) − ly (θ∗ ) = q(θ | θ∗ ) + KL(f_θ∗^y : f_θ^y )

    ∇ly (θ∗ ) = ∂/∂θ ly (θ) |θ=θ∗ = ∂/∂θ q(θ | θ∗ ) |θ=θ∗ .
Kullback-Leibler divergence

The KL divergence between f and g is


    KL(f : g) = ∫ f (x) log [ f (x) / g(x) ] dx.
Also known as relative entropy of g with respect to f .
Since − log x is a convex function, Jensen’s inequality gives
KL(f : g) ≥ 0 and KL(f : g) = 0 if and only if f = g,
since
    KL(f : g) = ∫ f (x) log [ f (x) / g(x) ] dx ≥ − log ∫ f (x) [ g(x) / f (x) ] dx = 0,
so KL divergence defines an (asymmetric) distance measure
between probability distributions.
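
A small numerical illustration in Python (not from the slides), assuming discrete distributions with strictly positive probabilities:

import numpy as np

def kl_divergence(f, g):
    """KL(f : g) = sum_x f(x) log(f(x)/g(x)) for strictly positive discrete distributions."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.sum(f * np.log(f / g)))

# Two invented distributions on three points.
f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]

print(kl_divergence(f, g))   # >= 0
print(kl_divergence(g, f))   # generally different: KL is asymmetric
print(kl_divergence(f, f))   # 0 when the distributions coincide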
Expected and marginal log-likelihood

Since f (x | y; θ) = f (x, y; θ)/f (y; θ) we have

    q(θ | θ∗ ) = ∫ log [ f (y; θ) f (x | y; θ) / { f (y; θ∗ ) f (x | y; θ∗ ) } ] f (x | y; θ∗ ) dx
               = log f (y; θ) − log f (y; θ∗ ) + ∫ log [ f (x | y; θ) / f (x | y; θ∗ ) ] f (x | y; θ∗ ) dx
               = ly (θ) − ly (θ∗ ) − KL(f_θ∗^y : f_θ^y ),

where f_θ^y denotes the conditional density f (· | y; θ).

Since the KL-divergence is minimized for θ = θ∗ ,


differentiation of the above expression yields
    ∂/∂θ q(θ | θ∗ ) |θ=θ∗ = ∂/∂θ ly (θ) |θ=θ∗ .
Let now θ0 = θ∗ and define the iteration

    θn+1 = argmaxθ q(θ | θn ).

Then

    ly (θn+1 ) = ly (θn ) + q(θn+1 | θn ) + KL(f_θn^y : f_θn+1^y )
               ≥ ly (θn ) + 0 + 0.

So the log-likelihood never decreases after a combined


E-step and M-step.
It follows that any limit point must be a saddle point or a
local maximum of the likelihood function.
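
A minimal sketch in Python of the generic iteration just described; e_step and m_step are problem-specific placeholders (the names are illustrative, not from the slides), and concrete versions of these computations appear in the mixture example below.

def em(y, theta0, e_step, m_step, n_iter=100):
    """Generic EM loop: theta_{n+1} = argmax_theta q(theta | theta_n).

    e_step(y, theta) should return whatever expected complete-data quantities
    q(. | theta) depends on; m_step(y, stats) should return the maximizer of q.
    """
    theta = theta0
    for _ in range(n_iter):
        stats = e_step(y, theta)   # E-step: expectations under the current theta
        theta = m_step(y, stats)   # M-step: maximize q(theta | current theta)
    return theta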
Mixtures

Consider a sample Y = (Y1 , . . . , Yn ) from individual


densities

f (y; α, µ) = {αφ(y − µ) + (1 − α)φ(y)}

where φ is the normal density


    φ(y) = (1/√(2π)) e^(−y²/2)

and α and µ are both unknown, 0 < α < 1.
This corresponds to a fraction α of the observations being
contaminated, or originating from a different population.
Incomplete observation

The likelihood function

    Ly (α, µ) = ∏i {αφ(yi − µ) + (1 − α)φ(yi )}

is quite unpleasant, although both Newton–Raphson and
the method of scoring can be used.
But suppose we knew which observations came from which
population?
In other words, let X = (X1 , . . . , Xn ) be i.i.d. with
P (Xi = 1) = α and suppose that the conditional
distribution of Yi given Xi = 1 was N (µ, 1) whereas given
Xi = 0 it was N (0, 1), i.e. that Xi was indicating whether
Yi was contaminated or not.
Then the marginal distribution of Y is precisely the mixture
distribution and the ‘complete data likelihood’ is
    Lx,y (α, µ) = ∏i α^xi φ(yi − µ)^xi (1 − α)^(1−xi ) φ(yi )^(1−xi )
                ∝ α^(Σ xi ) (1 − α)^(n − Σ xi ) ∏i φ(yi − µ)^xi

so taking logarithms we get (ignoring a constant) that


    lx,y (α, µ) = (Σ xi ) log α + (n − Σ xi ) log(1 − α) − Σi xi (yi − µ)²/2.

If we did not know how to maximize this explicitly,


differentiation easily leads to:
    α̂ = Σ xi / n,   µ̂ = Σ xi yi / Σ xi .

Thus, when complete data are available the frequency of


contaminated observations is estimated by the observed
frequency and the mean µ of these is estimated by the
average among the contaminated observations.
E-step and M-step

By taking expectations, we get the E-step as

    q(α, µ | α∗ , µ∗ ) = Eα∗ ,µ∗ {lX,y (α, µ) | Y = y}
                       = (Σ x∗i ) log α + (n − Σ x∗i ) log(1 − α) − Σi x∗i (yi − µ)²/2

where

x∗i = Eα∗ ,µ∗ (Xi | Yi = yi ) = Pα∗ ,µ∗ (Xi = 1 | Yi = yi ).

Since this has the same form as the complete data


likelihood, just with x∗i replacing xi , the M-step simply
becomes
    α∗∗ = Σ x∗i / n,   µ∗∗ = Σ x∗i yi / Σ x∗i ,

i.e. here the mean of the contaminated observations is


estimated by a weighted average of all the observations, the
weight being proportional to the probability that this
observation is contaminated. In effect, x∗i act as imputed
values of xi .
The imputed values x∗i needed in the E-step are calculated
as follows:

    x∗i = E(Xi | Yi = yi ) = P (Xi = 1 | Yi = yi )
        = α∗ φ(yi − µ∗ ) / { α∗ φ(yi − µ∗ ) + (1 − α∗ )φ(yi ) }.
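
The following Python sketch (not part of the original slides) implements these E- and M-steps for the contamination mixture; the simulated data, starting values and fixed iteration count are assumptions made for illustration.

import numpy as np

def phi(y):
    """Standard normal density."""
    return np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)

def em_mixture(y, alpha=0.5, mu=1.0, n_iter=200):
    """EM for the contamination mixture alpha*phi(y - mu) + (1 - alpha)*phi(y)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    for _ in range(n_iter):
        # E-step: x_i* = P(X_i = 1 | Y_i = y_i) under the current (alpha, mu)
        num = alpha * phi(y - mu)
        x_star = num / (num + (1.0 - alpha) * phi(y))
        # M-step: complete-data MLEs with x_i* in place of x_i
        alpha = x_star.sum() / n
        mu = (x_star * y).sum() / x_star.sum()
    return alpha, mu

# Simulated data with invented true values alpha = 0.3, mu = 2 (illustration only).
rng = np.random.default_rng(0)
x = rng.random(1000) < 0.3
y = rng.normal(loc=np.where(x, 2.0, 0.0), scale=1.0)
print(em_mixture(y))   # estimates should be roughly (0.3, 2.0)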
Incomplete two-way tables

As another example, let us consider a 2 × 2 table with
n1 = {n1_ij } complete observations of two binary variables I
and J, n2 = {n2_i+ } observations where only I was observed,
and n3 = {n3_+j } observations where only J was observed,
and let us assume that the mechanism of missingness can
be ignored.
The complete data log-likelihood is
    log L(p) = Σij (n1_ij + n2_ij + n3_ij ) log pij

and the E-step needs

    n∗_ij = n1_ij + n2∗_ij + n3∗_ij

where

    n2∗_ij = E(N2_ij | p, n2_i+ ) = p_(j|i) n2_i+

and

    n3∗_ij = E(N3_ij | p, n3_+j ) = p_(i|j) n3_+j .

We thus get
    n2∗_ij = pij / (p_i0 + p_i1 ) · n2_i+ ,    n3∗_ij = pij / (p_0j + p_1j ) · n3_+j .   (3)
The M-step now maximizes log L(p) = Σij n∗_ij log pij by
letting

    pij = (n1_ij + n2∗_ij + n3∗_ij )/n   (4)
where n is the total number of observations.
The EM algorithm alternates between (3) and (4) until
convergence.
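
A Python sketch (not from the slides) of this alternation between (3) and (4); the counts, starting values and fixed number of iterations are invented for illustration.

import numpy as np

def em_two_way(n1, n2_row, n3_col, n_iter=200):
    """EM for a 2x2 table of probabilities p[i, j] with partially classified counts.

    n1:     2x2 array of fully classified counts n1[i, j]
    n2_row: counts n2[i] with only I observed (row totals)
    n3_col: counts n3[j] with only J observed (column totals)
    """
    n1 = np.asarray(n1, dtype=float)
    n2_row = np.asarray(n2_row, dtype=float)
    n3_col = np.asarray(n3_col, dtype=float)
    n = n1.sum() + n2_row.sum() + n3_col.sum()

    p = np.full((2, 2), 0.25)  # arbitrary uniform starting value
    for _ in range(n_iter):
        # E-step, eq. (3): spread the partially classified counts over the cells
        n2_star = n2_row[:, None] * p / p.sum(axis=1, keepdims=True)
        n3_star = n3_col[None, :] * p / p.sum(axis=0, keepdims=True)
        # M-step, eq. (4): complete-data MLE from the completed table
        p = (n1 + n2_star + n3_star) / n
    return p

# Invented counts, for illustration only.
print(em_two_way(n1=[[20, 10], [15, 25]], n2_row=[8, 12], n3_col=[5, 9]))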
