
EM algorithm and variants: an informal tutorial

Alexis Roche∗
Service Hospitalier Frédéric Joliot, CEA, F-91401 Orsay, France
Spring 2003 (revised: September 2012)

∗ alexis.roche@gmail.com

1. Introduction
The expectation-maximization (EM) algorithm introduced by Dempster et al [12] in 1977 is a
very general method to solve maximum likelihood estimation problems. In this informal report,
we review the theory behind EM as well as a number of EM variants, suggesting that beyond
the current state of the art lies an even wider territory still to be discovered.

2. EM background
Let Y be a random variable with probability density function (pdf) p(y|θ), where θ is an unknown
parameter vector. Given an outcome y of Y , we aim at maximizing the likelihood function
L(θ) ≡ p(y|θ) wrt θ over a given search space Θ. This is the very principle of maximum
likelihood (ML) estimation. Unfortunately, except in not very exciting situations such as
estimating the mean and variance of a Gaussian population, an ML estimation problem generally
has no closed-form solution. Numerical routines are then needed to approximate it.

2.1. EM as a likelihood maximizer


The EM algorithm is a class of optimizers specifically tailored to ML problems, which makes
it both general and not so general. Perhaps the most salient feature of EM is that it works
iteratively by maximizing successive local approximations of the likelihood function. Therefore,
each iteration consists of two steps: one that performs the approximation (the E-step) and one
that maximizes it (the M-step). But, let’s make it clear, not any two-step iterative scheme is an
EM algorithm. For instance, Newton and quasi-Newton methods [27] work in a similar iterative
fashion but do not have much to do with EM. What essentially defines an EM algorithm is the
philosophy underlying the local approximation scheme – which, in particular, doesn’t rely on
differential calculus.
The key idea underlying EM is to introduce a latent variable Z whose pdf depends on θ with
the property that maximizing p(z|θ) is easy or, say, easier than maximizing p(y|θ). Loosely


speaking, we somehow enhance the incomplete data by guessing some useful additional infor-
mation. Technically, Z can be any variable such that θ → Z → Y is a Markov chain¹, i.e. we
assume that p(y|z, θ) is independent from θ, yielding a Chapman-Kolmogorov equation:

p(z, y|θ) = p(z|θ)p(y|z) (1)

Reasons for that definition will arise soon. Conceptually, Z is a complete-data space in the
sense that, if it were fully observed, then estimating θ would be an easy game. We will emphasize
that the convergence speed of EM is highly dependent upon the complete-data specification,
which is largely arbitrary even though some estimation problems may have seemingly “natural”
hidden variables. But, for the time being, we assume that the complete-data specification step
has been accomplished.

2.2. EM as a consequence of Jensen’s inequality


Quite surprisingly, the original EM formulation stems from a very simple variational argument.
Under almost no assumption regarding the complete variable Z, except that its pdf doesn’t vanish
to zero, we can bound the variation of the log-likelihood function L(θ) ≡ log p(y|θ) as follows:

$$
\begin{aligned}
L(\theta) - L(\theta') &= \log \frac{p(y|\theta)}{p(y|\theta')} \\[4pt]
&= \log \int \frac{p(z, y|\theta)}{p(y|\theta')}\, dz \\[4pt]
&= \log \int \frac{p(z, y|\theta)}{p(z, y|\theta')}\, p(z|y, \theta')\, dz \\[4pt]
&= \log \int \frac{p(z|\theta)}{p(z|\theta')}\, p(z|y, \theta')\, dz \qquad (2) \\[4pt]
&\geq \underbrace{\int \log \frac{p(z|\theta)}{p(z|\theta')}\, p(z|y, \theta')\, dz}_{\text{call this } Q(\theta, \theta')} \qquad (3)
\end{aligned}
$$

Step (2) results from the fact that p(y|z, θ) is independent from θ owing to (1). Step (3)
follows from Jensen’s inequality (see [9] and appendix A.2) along with the well-known con-
cavity property of the logarithm function. Therefore, Q(θ, θ′ ) is an auxiliary function for the
log-likelihood, in the sense that: (i) the likelihood variation from θ′ to θ is always greater than
Q(θ, θ′ ), and (ii) we have Q(θ′ , θ′ ) = 0. Hence, starting from an initial guess θ′ , we are guaran-
teed to increase the likelihood value if we can find a θ such that Q(θ, θ′ ) > 0. Iterating such a
process defines an EM algorithm.
There is no general convergence theorem for EM, but thanks to the above mentioned mono-
tonicity property, convergence results may be proved under mild regularity conditions. Typi-
cally, convergence towards a non-global likelihood maximizer, or a saddle point, is a worst-case
scenario. Still, the only trick behind EM is to exploit the concavity of the logarithm function!
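To make the variational argument tangible, here is a small numerical check of inequality (3) on a toy model that is not taken from the report: Z is Bernoulli with parameter θ, and Y given Z = z is Gaussian with a fixed mean m_z that does not depend on θ, so that the Markov requirement (1) holds by construction.

```python
# Numerical check of inequality (3) on a toy model (an illustration, not from the report):
# Z ~ Bernoulli(theta), Y | Z = z ~ N(m_z, 1), so p(y|z) does not depend on theta as in (1).
import numpy as np
from scipy.stats import norm

m = np.array([-1.0, 2.0])   # fixed component means m_0 and m_1
y = 0.3                     # a single observed outcome of Y

def L(theta):
    # incomplete-data log-likelihood: log p(y|theta) = log sum_z p(z|theta) p(y|z)
    return np.log((1 - theta) * norm.pdf(y, m[0]) + theta * norm.pdf(y, m[1]))

def posterior(theta):
    # p(z|y, theta) for z = 0, 1
    w = np.array([1 - theta, theta]) * norm.pdf(y, m)
    return w / w.sum()

def Q(theta, theta_p):
    # Q(theta, theta') = sum_z p(z|y, theta') log[ p(z|theta) / p(z|theta') ]
    return np.sum(posterior(theta_p) *
                  np.log(np.array([1 - theta, theta]) /
                         np.array([1 - theta_p, theta_p])))

rng = np.random.default_rng(0)
for theta, theta_p in rng.uniform(0.01, 0.99, size=(1000, 2)):
    assert L(theta) - L(theta_p) >= Q(theta, theta_p) - 1e-12
print("inequality (3) holds on 1000 random (theta, theta') pairs")
```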

¹In many presentations of EM, Z is an aggregate variable (X, Y ), where X is some “missing” data, which corresponds to the special case where the transition Z → Y is deterministic. We believe this restriction, although important in practice, is not useful to the global understanding of EM. By the way, further generalizations will be considered later in this report.

2.3. EM as expectation-maximization
Let’s introduce some notation. Developing the logarithm in the right-hand side of (3), we may
interpret our auxiliary function as a difference: Q(θ, θ′) = Q(θ|θ′) − Q(θ′|θ′), with:

$$ Q(\theta|\theta') \equiv \int \log p(z|\theta)\, p(z|y, \theta')\, dz \qquad (4) $$

Clearly, for a fixed θ′, maximizing Q(θ, θ′) wrt θ is equivalent to maximizing Q(θ|θ′). If we
consider the residual function R(θ|θ′) ≡ L(θ) − Q(θ|θ′), the incomplete-data log-likelihood may
be written as:

$$ L(\theta) = Q(\theta|\theta') + R(\theta|\theta') $$
The EM algorithm’s basic principle is to replace the maximization of L(θ) with that of Q(θ|θ′ ),
hopefully easier to deal with. We can ignore R(θ|θ′ ) because inequality (3) implies that R(θ|θ′ ) ≥
R(θ′ |θ′ ). In other words, EM works because the auxiliary function Q(θ|θ′ ) always deteriorates
as a likelihood approximation when θ departs from θ′ . In an ideal world, the approximation
error would be constant; then, maximizing Q would, not only increase, but truly maximize the
likelihood. Unfortunately, this won’t be the case in general. Therefore, unless we decide to
give up on maximizing the likelihood, we have to iterate – which gives rise to quite a popular
statistical learning algorithm.
Given a current parameter estimate θn :

• E-step: form the auxiliary function Q(θ|θn ) as defined in (4), which involves computing
the posterior distribution of the unobserved variable, p(z|y, θn ). The “E” in E-step stands
for “expectation” for reasons that will arise in section 2.4.

• M-step: update the parameter estimate by maximizing the auxiliary function:

$$ \theta_{n+1} = \arg\max_{\theta} Q(\theta|\theta_n) $$

An obvious but important generalization of the M-step is to replace the maximization
with a mere increase of Q(θ|θn). Since, anyway, the likelihood won’t be maximized in
one iteration, increasing the auxiliary function is enough to ensure that the likelihood will
increase in turn, thus preserving the monotonicity property of EM. This defines generalized
EM (GEM) algorithms. More on this later.
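To make the E-step / M-step alternation concrete, here is a minimal sketch for the textbook case of a one-dimensional two-component Gaussian mixture; the model, the initialization and the fixed iteration count are illustrative choices, not prescriptions from the report.

```python
# Minimal EM sketch for a 1-D two-component Gaussian mixture (illustrative assumptions:
# unknown mixing weight, means and standard deviations; crude data-driven initialization).
import numpy as np
from scipy.stats import norm

def em_gmm2(y, n_iter=100):
    w, mu, sig = 0.5, np.array([y.min(), y.max()]), np.array([y.std(), y.std()])
    for _ in range(n_iter):
        # E-step: posterior membership probabilities r_i = p(z_i = 1 | y_i, theta_n)
        p0 = (1 - w) * norm.pdf(y, mu[0], sig[0])
        p1 = w * norm.pdf(y, mu[1], sig[1])
        r = p1 / (p0 + p1)
        # M-step: closed-form maximization of the auxiliary function Q(theta | theta_n)
        w = r.mean()
        for k, rk in enumerate([1 - r, r]):
            mu[k] = np.sum(rk * y) / rk.sum()
            sig[k] = np.sqrt(np.sum(rk * (y - mu[k]) ** 2) / rk.sum())
    return w, mu, sig

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 700)])
print(em_gmm2(y))   # fitted (w, mu, sig)
```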

2.4. Some probabilistic interpretations here...


For those familiar with probability theory, Q(θ|θ′) as defined in (4) is nothing but the conditional
expectation of the complete-data log-likelihood given the observed variable, taken at Y = y,
and assuming the true parameter value is θ′:

Q(θ|θ′ ) ≡ E[ log p(Z|θ)|y, θ′ ] (5)

This remark explains the “E” in E-step, but also yields some probabilistic insight on the
auxiliary function. For all θ, Q(θ|θ′) is an estimate of the complete-data log-likelihood that
is built upon the knowledge of the incomplete data and under the working assumption that the
true parameter value is known. In some way, it is not far from being the “best” estimate that
we can possibly make without knowing Z, because conditional expectation is, by definition, the
estimator that minimizes the conditional mean squared error².

²For all θ, we have: $Q(\theta|\theta') = \arg\min_{\mu} \int \left[\log p(z|\theta) - \mu\right]^2 p(z|y, \theta')\, dz$.
Having said that, we might still be a bit suspicious. While we can grant that Q(θ|θ′ ) is
a reasonable estimate of the complete-data log-likelihood, recall that our initial problem is to
maximize the incomplete-data (log) likelihood. How good a fit is Q(θ|θ′ ) for L(θ)? To answer
that, let’s see a bit more how the residual R(θ|θ′ ) may be interpreted. We have:

$$
\begin{aligned}
R(\theta|\theta') &= \log p(y|\theta) - \int \log p(z|\theta)\, p(z|y, \theta')\, dz \\[4pt]
&= \int \log \frac{p(y|\theta)}{p(z|\theta)}\, p(z|y, \theta')\, dz \\[4pt]
&= \int \log \frac{p(y|z, \theta)}{p(z|y, \theta)}\, p(z|y, \theta')\, dz, \qquad (6)
\end{aligned}
$$

where the last step relies on Bayes’ law. Now, p(y|z, θ) = p(y|z) is independent from θ by the
Markov property (1). Therefore, using the simplified notations qθ(z) ≡ p(z|y, θ) and qθ′(z) ≡
p(z|y, θ′), we get:

$$ R(\theta|\theta') - R(\theta'|\theta') = \underbrace{\int \log \frac{q_{\theta'}(z)}{q_{\theta}(z)}\, q_{\theta'}(z)\, dz}_{\text{call this } D(q_{\theta'} \| q_{\theta})} \qquad (7) $$

In the language of information theory, this quantity D(qθ′ ∥qθ ) is known as the Kullback-
Leibler distance, a general tool to assess the deviation between two pdfs [9]. Although it is not,
strictly speaking, a genuine mathematical distance, it is always positive and vanishes iff the
pdfs are equal which, again and not surprinsingly, comes as a direct consequence of Jensen’s
inequality.
What does that mean in our case? We noticed earlier that the likelihood approxima-
tion Q(θ|θ′ ) cannot get any better as θ deviates from θ′ . We now realize from equation (7)
that this property reflects an implicit strategy of ignoring the variations of p(z|y, θ) wrt θ.
Hence, a perfect approximation would be one for which p(z|y, θ) is independent from θ. In
other words, we would like θ → Y → Z to define a Markov chain... But, look, we already
assumed that θ → Z → Y is a Markov chain. Does the Markov property still hold when
permuting the roles of Y and Z?
From the fundamental data processing inequality [9], the answer is no in general. Details
are unnecessary here. Just remember that the validity domain of Q(θ|θ′ ) as a local likelihood
approximation is controlled by the amount of information that both y and θ contain about the
complete data. We are now going to study this aspect more carefully.

2.5. EM as a fixed point algorithm and local convergence


Quite clearly, EM is a fixed point algorithm:

$$ \theta_{n+1} = \Phi(\theta_n) \quad \text{with} \quad \Phi(\theta') = \arg\max_{\theta \in \Theta} Q(\theta, \theta') $$

Assume the sequence θn converges towards some value θ̂ – hopefully, the maximum likelihood
estimate but possibly some other local maximum or saddle point. Under the assumption that
Φ is continuous, θ̂ must be a fixed point for Φ, i.e. θ̂ = Φ(θ̂). Furthermore, we can approximate
the sequence’s asymptotic behavior using a first-order Taylor expansion of Φ around θ̂, which
leads to:

$$ \theta_{n+1} \approx S\,\hat{\theta} + (I - S)\,\theta_n \qquad \text{with} \qquad S = I - \left.\frac{\partial \Phi}{\partial \theta}\right|_{\hat{\theta}} $$
This expression shows that the rate of convergence is controlled by S, a square matrix that is
constant across iterations. Hence, S is called the speed matrix, and its spectral radius³ defines
the global speed. Unless the global speed is one, the local convergence of EM is only linear. We
may relate S to the likelihood function by exploiting the fact that, under sufficient smoothness
assumptions, the maximization of Q is characterized by:
$$ \frac{\partial Q}{\partial \theta^t}(\theta_{n+1}, \theta_n) = 0 $$
From the implicit function theorem, we get the gradient of Φ:

$$ \frac{\partial \Phi}{\partial \theta} = -\left(\frac{\partial^2 Q}{\partial \theta \partial \theta^t}\right)^{-1} \frac{\partial^2 Q}{\partial \theta' \partial \theta^t} \qquad \Rightarrow \qquad S = \left(\frac{\partial^2 Q}{\partial \theta \partial \theta^t}\right)^{-1} \left[\frac{\partial^2 Q}{\partial \theta \partial \theta^t} + \frac{\partial^2 Q}{\partial \theta' \partial \theta^t}\right] $$
where, after some manipulations:

$$ \left.\frac{\partial^2 Q}{\partial \theta \partial \theta^t}\right|_{(\hat{\theta}, \hat{\theta})} = \underbrace{\int p(z|y, \hat{\theta})\, \left.\frac{\partial^2 \log p(z|\theta)}{\partial \theta \partial \theta^t}\right|_{\hat{\theta}} dz}_{\text{call this } -\bar{J}_z(\hat{\theta})} $$

$$ \left.\frac{\partial^2 Q}{\partial \theta' \partial \theta^t}\right|_{(\hat{\theta}, \hat{\theta})} = \underbrace{\left.\frac{\partial^2 \log p(y|\theta)}{\partial \theta \partial \theta^t}\right|_{\hat{\theta}}}_{\text{call this } -J_y(\hat{\theta})} - \left.\frac{\partial^2 Q}{\partial \theta \partial \theta^t}\right|_{(\hat{\theta}, \hat{\theta})} $$

The two quantities Jy(θ̂) and Jz(θ̂) turn out to be respectively the observed-data information
matrix and the complete-data information matrix. The speed matrix is thus given by:

$$ S = \bar{J}_z(\hat{\theta})^{-1} J_y(\hat{\theta}) \quad\text{with}\quad \bar{J}_z(\hat{\theta}) \equiv E[J_z(\hat{\theta})\,|\,y, \hat{\theta}] \qquad (8) $$

We easily check that: J̄z(θ̂) = Jy(θ̂) + Fz|y(θ̂), where Fz|y(θ̂) is the Fisher information matrix
corresponding to the posterior pdf p(z|y, θ̂), which is always symmetric positive. Therefore, we
have the alternative expression:

$$ S = [J_y(\hat{\theta}) + F_{z|y}(\hat{\theta})]^{-1} J_y(\hat{\theta}) $$

For fast convergence, we want S close to identity, so we better have the posterior Fisher
matrix as “small” as possible. To interpret this result, let’s imagine that Z is drawn from
p(z|y, θ̂), which is not exactly true since θ̂ may be at least slightly different from the actual
parameter value. The Fisher information matrix represents the average information that the
complete data contains about θ̂ conditional to the observed data. In this context, Fz|y(θ̂) is
a measure of missing information, and the complement of the speed matrix, I − S, may be
interpreted as the fraction of missing data. The conclusion is that the rate of convergence of
EM is governed by the fraction of missing data.
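As an illustration that is not worked out in the report, consider n i.i.d. draws from N(μ, σ²) with σ² known, of which n_obs are observed and n_mis are missing completely at random, the complete data being the full sample. The E-step replaces each missing value by the current estimate μn, and the M-step averages:

$$ \mu_{n+1} = \frac{n_{\mathrm{obs}}\,\bar{y}_{\mathrm{obs}} + n_{\mathrm{mis}}\,\mu_n}{n}, \qquad \text{hence} \qquad \mu_{n+1} - \bar{y}_{\mathrm{obs}} = \frac{n_{\mathrm{mis}}}{n}\,(\mu_n - \bar{y}_{\mathrm{obs}}). $$

The iteration converges linearly to the ML estimate ȳ_obs at rate n_mis/n, which is exactly the fraction of missing data, in agreement with (8) since Jy(θ̂) = n_obs/σ², J̄z(θ̂) = n/σ² and therefore S = n_obs/n.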

³Let (λ1, λ2, . . . , λm) be the complex eigenvalues of S. The spectral radius is ρ(S) = min_i |λi|.
2.6. EM as a proximal point algorithm
Chrétien & Hero [7] note that EM may also be interpreted as a proximal point algorithm, i.e.
an iterative scheme of the form:

$$ \theta_{n+1} = \arg\max_{\theta}\, \big[L(\theta) - \lambda_n \Psi(\theta, \theta_n)\big], \qquad (9) $$

where Ψ is some positive regularization function and λn is a sequence of positive numbers.


Let us see where this result comes from. In section 2.4, we have established the fundamental
log-likelihood decomposition underlying EM, L(θ) = Q(θ|θ′ )+R(θ|θ′ ), and related the variation
of R(θ|θ′ ) to a Kullback distance (7). Thus, for some current estimate θn , we can write:

Q(θ|θn ) = L(θ) − D(qθn ∥qθ ) − R(θn |θn ),

where qθ (z) ≡ p(z|y, θ) and qθn (z) ≡ p(z|y, θn ) are the posterior pdfs of the complete data,
under θ and θn , respectively. From this equation, it becomes clear that maximizing Q(θ|θn ) is
equivalent to an update rule of the form (9) with:

Ψ(θ, θn ) = D(qθn ∥qθ ), λn ≡ 1

The proximal interpretation of EM is very useful to derive general convergence results [7].
In particular, the convergence rate may be superlinear if the sequence λn is chosen so as to
converge towards zero. Unfortunately, such generalizations are usually intractable because the
objective function may no longer simplify as soon as λn ≠ 1.

2.7. EM as maximization-maximization
Another powerful way of conceptualizing EM is to reinterpret the E-step as another maxi-
mization. This idea, which was formalized only recently by Neal & Hinton [26], appears as a
breakthrough in the general understanding of EM-type procedures. Let us consider the following
function:
L(θ, q) ≡ Eq [ log p(Z, y|θ)] + H(q) (10)

where q(z) is some pdf (yes, any pdf), and H(q) is its entropy [9], i.e. H(q) ≡ −∫ log q(z) q(z) dz.
We easily obtain an equivalent expression that involves a Kullback-Leibler distance:

L(θ, q) = L(θ) − D(q∥qθ ),

where we still define qθ (z) ≡ p(z|y, θ) for notational convenience. The last equation reminds
us immediately of the proximal interpretation of EM which was briefly discussed in section 2.6.
The main difference here is that we don’t impose q(z) = p(z|y, θ) for some θ. Equality holds
for any distribution!
Assume we have an initial guess θn and try to find q that maximizes L(θn , q). From the
above discussed properties of the Kullback-Leibler distance, the answer is q(z) = qθn (z). Now,
substitute qθn in (10), and maximize over θ: this is the same as performing a standard M-step⁴!
Hence, the conventional EM algorithm boils down to an alternate maximization of L(θ, q) over
a search space Θ × Q, where Q is a suitable set of pdfs, i.e. Q must include all pdfs from the
set {q(z) = p(z|y, θ), θ ∈ Θ}. It is easy to check that any global maximizer (θ̂, q̂) of L(θ, q) is
such that θ̂ is also a global maximizer of L(θ). By the way, this is also true for local maximizers
under weak assumptions [26].

⁴To see that, just remember that p(z, y|θ) = p(z|θ)p(y|z) where p(y|z) is independent from θ due to the Markov property (1).
The key observation of Neal & Hinton is that the alternate scheme underlying EM may
be replaced with other maximization strategies without hampering the simplicity of EM. In
the conventional EM setting, the auxiliary pdf qn (z) is always constrained to a specific form.
This is to say that EM authorizes only specific pathways in the expanded search space Θ × Q,
yielding some kind of “labyrinth” motion. Easier techniques to find its way in a labyrinth
include breaking the walls or escaping through the roof. Similarly, one may consider relaxing
the maximization constraint in the E-step. This leads for instance to incremental and sparse
EM variants (see section 3).

3. Deterministic EM variants
We first present deterministic EM variants as opposed to stochastic variants. Most of these
deterministic variants attempt to speed up the algorithm, either by simplifying computations,
or by increasing the rate of convergence (see section 2.5).

3.1. CEM
Classification EM [6]. The whole EM story is about introducing a latent variable Z and per-
forming some inference about its posterior pdf. We might wonder: why not simply estimate Z?
This is actually the idea underlying the CEM algorithm, which is a simple alternate maximiza-
tion of the functional p(z, y|θ) wrt both θ and z. Given a current parameter estimate θn , this
leads to:
• Classification step: find zn = arg max_z p(z|y, θn).

• Maximization step: find θn+1 = arg max_θ p(zn|θ).

Notice that a special instantiation of CEM is the well-known k-means algorithm. In prac-
tice, CEM has several advantages over EM, like being easier to implement and typically faster
to converge. However, CEM doesn’t maximize the incomplete-data likelihood and, therefore,
the monotonicity property of EM is lost. While CEM estimates the complete data explicitly,
EM estimates only sufficient statistics for the complete data. In this regard, EM may be under-
stood as a fuzzy classifier that avoids the statistical efficiency problems inherent to the CEM
approach. Yet, CEM is often useful in practice.
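For comparison with the EM sketch of section 2.3, here is the same two-component Gaussian mixture handled by CEM; the hard classification step is what gives the algorithm its k-means flavour. This is an illustrative sketch that assumes both classes remain non-empty across iterations.

```python
# Minimal CEM sketch for the 1-D two-component Gaussian mixture (illustrative; assumes
# both classes stay non-empty and keep a positive spread throughout the iterations).
import numpy as np
from scipy.stats import norm

def cem_gmm2(y, n_iter=50):
    w, mu, sig = 0.5, np.array([y.min(), y.max()]), np.array([y.std(), y.std()])
    for _ in range(n_iter):
        # Classification step: z_i = arg max_z p(z | y_i, theta_n)
        p0 = (1 - w) * norm.pdf(y, mu[0], sig[0])
        p1 = w * norm.pdf(y, mu[1], sig[1])
        z = p1 > p0
        # Maximization step: complete-data ML estimate given the classified sample
        w = z.mean()
        for k, mask in enumerate([~z, z]):
            mu[k], sig[k] = y[mask].mean(), y[mask].std()
    return w, mu, sig
```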

3.2. Aitken’s acceleration


An early EM extension [12, 21, 22]. Aitken’s acceleration is a general purpose technique to
speed up the convergence of a fixed point recursion with asymptotic linear behavior. Section 2.5
established that, under appropriate smoothness assumptions, EM may be approximated by a
recursion of the form:
θn+1 ≈ S θ̂ + (I − S)θn ,
where θ̂ is the unknown limit and S is the speed matrix given by (8) which depends on this
limit. Aitken’s acceleration stems from the remark that, if S were known, then the limit could
be computed explicitly in a single iteration, namely: θ̂ ≈ θ0 + S⁻¹(θ1 − θ0) for some starting
value θ0. Although S is unknown and the sequence is not strictly linear, we are tempted to
consider the following modified EM scheme. Given a current parameter estimate θn ,

• E-step: compute Q(θ|θn) and approximate the inverse speed matrix: Sn⁻¹ = Jy(θn)⁻¹ J̄z(θn).

• M-step: unchanged, get an intermediate value θ∗ = arg max_θ Q(θ|θn).

• Acceleration step: update the parameter using θn+1 = θn + Sn⁻¹(θ∗ − θn).


It turns out that this scheme is nothing but the Newton-Raphson method to find a zero of
θ ↦ Φ(θ) − θ, where Φ is the map defined by the EM sequence, i.e. Φ(θ′) = arg max_θ Q(θ|θ′).
Since the standard EM sets Sn = I on each iteration, it may be viewed as a first-order approach
to the same zero-crossing problem, hence avoiding the expense of computing Sn. Besides this
important implementation issue, convergence is problematic using Aitken’s acceleration as the
monotonicity property of EM is generally lost.
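The scheme can be sketched generically as Newton-Raphson on Φ(θ) − θ = 0, assuming a user-supplied callable em_map implementing one ordinary EM update (a hypothetical interface, not defined in the report); the Jacobian of Φ, hence the speed matrix, is approximated by finite differences.

```python
# Aitken-accelerated EM as Newton-Raphson on Phi(theta) - theta = 0 (a sketch; `em_map`
# is a hypothetical callable returning one ordinary EM update of a parameter vector).
import numpy as np

def aitken_em(em_map, theta0, n_iter=20, eps=1e-5):
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    for _ in range(n_iter):
        theta_star = em_map(theta)          # ordinary EM update, Phi(theta_n)
        J = np.empty((d, d))                # finite-difference Jacobian of Phi
        for j in range(d):
            e = np.zeros(d)
            e[j] = eps
            J[:, j] = (em_map(theta + e) - theta_star) / eps
        S = np.eye(d) - J                   # speed matrix S = I - dPhi/dtheta
        theta = theta + np.linalg.solve(S, theta_star - theta)   # acceleration step
    return theta
```

Each accelerated iteration thus costs d extra EM updates to build the Jacobian, which illustrates the implementation expense mentioned above.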

3.3. AEM
Accelerated EM [15]. To trade off between EM and its Aitken’s accelerated version (see sec-
tion 3.2), Jamshidian and Jennrich propose a conjugate gradient approach. Don’t be misled:
this is not a traditional gradient-based method (otherwise there would be no point to talk
about it in this report). No gradient computation is actually involved here. The “gradient”
is the function Φ(θ) − θ, which may be viewed as a generalized gradient for the incomplete-
data log-likelihood, hence justifying the use of the generalized conjugate gradient method (see
e.g. [27]). Compared to Aitken’s accelerated EM, the resulting AEM algorithm doesn’t
require computing the speed matrix. Instead, the parameter update rule is of the form:

θn+1 = θn + λn dn ,

where dn is a direction composed from the current direction Φ(θn ) − θn and the previous di-
rections (the essence of conjugate gradient), and λn is a step size typically computed from a
line maximization of the complete-data likelihood (which may or may not be cumbersome). As
an advantage of line maximizations, the monotonicity property of EM is safe. Also, from this
generalized gradient perspective, it is straightforward to devise EM extensions that make use of
other gradient descent techniques such as the steepest descent or quasi-Newton methods [27].

3.4. ECM
Expectation Conditional Maximization [23]. This variant (not to be confused with CEM, see
above) was introduced to cope with situations where the standard M-step is intractable. It is
the first on a list of coordinate ascent-based EM extensions.
In ECM, the M-step is replaced with a number of lower dimensional maximization problems
called CM-steps. This implies decomposing the parameter space as a sum of subspaces, which,
up to some possible reparameterization, is the same as splitting the parameter vector into
several blocks, θ = (t1 , t2 , . . . , ts ). Starting from a current estimate θn , the CM-steps update
one coordinate block after another by partially maximizing the auxiliary Q-function, yielding a
scheme similar in essence to Powell’s multidimensional optimization method [27]. This produces
a sequence θn = θn,0 → θn,1 → θn,2 → . . . → θn,s−1 → θn,s = θn+1 , such that:

Q(θn |θn ) ≤ Q(θn,1 |θn ) ≤ Q(θn,2 |θn ) ≤ . . . ≤ Q(θn,s−1 |θn ) ≤ Q(θn+1 |θn )

Therefore, the auxiliary function is guaranteed to increase on each CM-step, hence globally
in the M-step, and so does the incomplete-data likelihood. Hence, ECM is a special case of
GEM (see section 2.3).
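A generic sketch of one ECM iteration follows, assuming a callable Q(theta, theta_n) returning the auxiliary function and a list of index blocks partitioning the parameter vector; both are hypothetical interfaces introduced only for illustration.

```python
# One ECM iteration as a sequence of CM-steps (a sketch; `Q` and `blocks` are hypothetical:
# Q(theta, theta_n) returns the auxiliary function value, and `blocks` is a list of integer
# index arrays partitioning the parameter vector).
import numpy as np
from scipy.optimize import minimize

def ecm_iteration(Q, theta_n, blocks):
    theta = np.array(theta_n, dtype=float)
    for idx in blocks:
        # CM-step: maximize Q(.|theta_n) over one coordinate block, the others held fixed
        def neg_Q(block_values, idx=idx):
            trial = theta.copy()
            trial[idx] = block_values
            return -Q(trial, theta_n)
        theta[idx] = minimize(neg_Q, theta[idx]).x
    return theta   # theta_{n+1}; Q(theta_{n+1}|theta_n) >= Q(theta_n|theta_n)
```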

3.5. ECME
ECM either [18]. This is an extension of ECM where some CM-steps are replaced with steps
that maximize, or increase, the incomplete-data log-likelihood L(θ) rather than the auxiliary
Q-function. To make sure that the likelihood function increases globally in the M-step, the
only requirement is that the CM-steps that act on the actual log-likelihood be performed after
the usual Q-maximizations. This is because increasing the Q-function only increases likelihood
from the starting point, namely θn , which is held fixed during the M-step (at least, this is what
we assume)⁵.
Starting with Q-maximizations is guaranteed to increase the likelihood, and of course sub-
sequent likelihood maximizations can only improve the situation. With the correct setting,
ECME is even more general than GEM as defined in section 2.3 while inheriting its funda-
mental monotonicity property. An example application of ECME is in mixture models, where
typically mixing proportions are updated using a one-step Newton-Raphson gradient descent on
the incomplete-data likelihood, leading to a simple additive correction to the usual EM update
rule [18]. At least in this case, ECME has proved to converge faster than standard EM.

⁵For example, if one chooses θ∗ such that L(θ∗) ≥ L(θn) and, then, θn+1 such that Q(θn+1|θn) ≥ Q(θ∗|θn), the only conclusion is that the likelihood increases from θn to θ∗, but may actually decrease from θ∗ to θn+1 because θ∗ is not the starting point of Q. Permuting the L-maximization and the Q-maximization, we have Q(θ∗|θn) ≥ Q(θn|θn), thus L(θ∗) ≥ L(θn), and therefore L(θn+1) ≥ L(θn) since we have assumed L(θn+1) ≥ L(θ∗). This argument generalizes easily to any intermediate sequence using the same cascade of inequalities as in the derivation of ECM (see section 3.4).

3.6. SAGE
Space-Alternating Generalized EM [13, 14]. In the continuity of ECM and ECME (see sec-
tions 3.4 and 3.5), one can imagine defining an auxiliary function specific to each coordinate
block of the parameter vector. More technically, using a block decomposition θ = (t1 , t2 , . . . , ts ),
we assume that, for each block i = 1, . . . , s, there exists a function Qi (θ|θ′ ) such that, for
all θ and θ′ with identical block coordinates except (maybe) for the i-th block, we have:
L(θ) − L(θ′ ) ≥ Qi (θ|θ′ ) − Qi (θ′ |θ′ ).
This idea has two important implications. First, the usual ECM scheme needs to be rephrased,
because changing the auxiliary function across CM-steps may well result in decreasing the
likelihood, a problem worked around in ECME with an appropriate ordering of the CM-steps. In
this more general framework, though, there may be no such fix to save the day. In order to ensure
monotonicity, at least some CM-steps should start with “reinitializing” their corresponding
auxiliary function, which means... performing an E-step. It is important to realize that, because
the auxiliary function is coordinate-specific, so is the E-step. Hence, each “CM-step” becomes
an EM algorithm in itself which is sometimes called a “cycle”. We end up with a nested
algorithm where cycles are embedded in a higher-level iterative scheme.
Furthermore, how to define the Qi ’s? From section 2, we know that the standard EM auxiliary
function Q(θ|θ′ ) is built from the complete-data space Z; see in particular equation (5). Fessler
& Hero introduce hidden-data spaces, a concept that generalizes complete-data spaces in the
sense that hidden-data spaces may be coordinate-specific, i.e. there is a hidden variable Zi for
each block ti . Formally, given a block decomposition θ = (t1 , t2 , . . . , ts ), Zi is a hidden-data
space for ti if:
$$ p(z_i, y|\theta) = p(y|z_i, \{t_{j \neq i}\})\, p(z_i|\theta) $$
This definition’s main feature is that the conditional probability of Y knowing Zi is allowed
to be dependent on every parameter block but ti. Let us check that the resulting auxiliary
function fulfils the monotonicity condition. We define:

Qi (θ|θ′ ) ≡ E[ log p(Zi |θ)|y, θ′ ]

Then, applying Jensen’s inequality (3) like in section 2.2, we get:



$$ L(\theta) - L(\theta') \geq Q_i(\theta|\theta') - Q_i(\theta'|\theta') + \int \log \frac{p(y|z_i, \theta)}{p(y|z_i, \theta')}\, p(z_i|y, \theta')\, dz_i $$

When θ and θ′ differ only by ti , the integral vanishes because the conditional pdf p(y|zi , θ)
is independent from ti by the above definition. Consequently, maximizing Qi (θ|θ′ ) with respect
to ti only (the other parameters being held fixed) is guaranteed to increase the incomplete-data
likelihood. Specific applications of SAGE to the Poisson imaging model or penalized least-
squares regression were reported to converge much faster than standard EM.

3.7. CEMM
Component-wise EM for Mixtures [4]. Celeux et al extend the SAGE methodology to the case
of constrained likelihood maximization, which arises typically in mixture problems where the
sum of mixing proportions should equal one. Using a Lagrangian dualization approach, they
recast the initial problem into unconstrained maximization by defining an appropriate penalized
log-likelihood function. The CEMM algorithm they derive is a natural coordinatewise variant
of EM whose convergence to a stationary point of the likelihood is established under mild
regularity conditions.

3.8. AECM
Alternating ECM [24, 25]. In an attempt to summarize earlier contributions, Meng & van
Dyk propose to cast a number of EM extensions into a unified framework, the so-called AECM
algorithm. Essentially, AECM is a SAGE algorithm (itself a generalization of both ECM and
ECME) that includes another data augmentation trick. The idea is to consider a family of
complete-data spaces indexed by a working parameter α. More formally, we define a joint pdf
q(z, y|θ, α) as depending on both θ and α, yet imposing the constraint that the corresponding
marginal incomplete-data pdf be preserved:

$$ p(y|\theta) = \int q(z, y|\theta, \alpha)\, dz, $$

and thus independent from α. In other words, α is identifiable only given the complete data.
A simple way of achieving such data augmentation is to define Z = fθ,α (Z0 ), where Z0 is some
reference complete-data space and fθ,α is a one-to-one mapping for any (θ, α). Interestingly, it
can be seen that α has no effect if fθ,α is insensitive to θ. In AECM, α is tuned beforehand so
as to minimize the fraction of missing data (8), thereby maximizing the algorithm’s global
speed. In general, however, this initial minimization cannot be performed exactly since the
global speed may depend on the unknown maximum likelihood parameter.

3.9. PX-EM
Parameter-Expanded EM [19, 17]. Liu et al revisit the working parameter method suggested
by Meng and van Dyk [24] (see section 3.8) from a slightly different angle. In their strategy, the
joint pdf q(z, y|θ, α) is defined so as to meet the two following requirements. First, the baseline

model is embedded in the expanded model in the sense that q(z, y|θ, α0 ) = p(z, y|θ) for some
null value α0 . Second, which is the main difference with AECM, the observed-data marginals
are consistent up to a many-to-one reduction function r(θ, α),

$$ p(y|r(\theta, \alpha)) = \int q(z, y|\theta, \alpha)\, dz, $$

for all (θ, α). From there, the trick is to “pretend” to estimate α iteratively rather than
pre-processing its value.
The PX-EM algorithm is simply an EM on the expanded model with additional instructions
after the M-step to apply the reduction function and reset α to its null value. Thus, given
a current estimate θn , the E-step forms the auxiliary function corresponding to the expanded
model from (θn , α0 ), which amounts to the standard E-step because α = α0 . The M-step then
provides (θ∗ , α∗ ) such that q(y|θ∗ , α∗ ) ≥ q(y|θn , α0 ), and the additional reduction step updates
θn according to θn+1 = r(θ∗ , α∗ ), implying p(y|θn+1 ) = q(y|θ∗ , α∗ ). Because q(y|θn , α0 ) =
p(y|θn ) by construction of the expanded model, we conclude that p(y|θn+1 ) ≥ p(y|θn ), which
shows that PX-EM preserves the monotonicity property of EM.
In some way, PX-EM capitalizes on the fact that a large deviation between the estimate of α
and its known value α0 is an indication that the parameter of interest θ is poorly estimated.
Hence, PX-EM adjusts the M-step for this deviation via the reduction function. A variety of
examples where PX-EM converges much faster than EM is reported in [19]. Possible variants
of PX-EM include the coordinatewise extensions underlying SAGE.

3.10. Incremental EM
Following the maximization-maximization approach discussed in section 2.7, Neal & Hinton [26]
address the common case where observations are i.i.d. Then, we have $p(y|\theta) = \prod_i p(y_i|\theta)$ and,
similarly, the global EM objective function (10) reduces to:

$$ L(\theta, q) = \sum_i \Big\{ E_{q_i}[\log p(Z_i, y_i|\theta)] + H(q_i) \Big\}, $$

where we can search for q under the factored form $q(z) = \prod_i q_i(z_i)$. Therefore, for a given θ,
maximizing L(θ, q) wrt q is equivalent to maximizing the contribution of each data item wrt qi ,
hence splitting the global maximization problem into a number of simpler maximizations. Incre-
mental EM variants are justified from this remark, the general idea being to update θ by visiting
the data items sequentially rather than from a global E-step. Neal & Hinton demonstrate an
incremental EM variant for mixtures that converges twice as fast as standard EM.
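Here is a minimal sketch of the incremental variant for the two-component Gaussian mixture, simplified to known unit variances so that the sufficient statistics reduce to weighted counts and sums; the model and the sweep schedule are illustrative assumptions, not the exact algorithm of [26].

```python
# Incremental EM sketch for a 1-D two-component Gaussian mixture with known unit variances:
# sufficient statistics are updated one data item at a time, and the parameters are
# refreshed after each visit (illustrative assumptions throughout).
import numpy as np
from scipy.stats import norm

def incremental_em_gmm2(y, mu_init=(-1.0, 1.0), n_sweeps=10):
    n = y.size
    mu = np.array(mu_init, dtype=float)
    w = 0.5
    r = np.full(n, 0.5)                          # current responsibilities q_i(z_i = 1)
    s1, sy1, sy0 = r.sum(), (r * y).sum(), ((1 - r) * y).sum()
    for _ in range(n_sweeps):
        for i in range(n):
            # E-step restricted to item i: update q_i only
            p1 = w * norm.pdf(y[i], mu[1], 1.0)
            p0 = (1 - w) * norm.pdf(y[i], mu[0], 1.0)
            r_new = p1 / (p0 + p1)
            # incremental update of the sufficient statistics
            s1 += r_new - r[i]
            sy1 += (r_new - r[i]) * y[i]
            sy0 -= (r_new - r[i]) * y[i]
            r[i] = r_new
            # M-step from the current statistics
            w = s1 / n
            mu = np.array([sy0 / (n - s1), sy1 / s1])
    return w, mu
```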

3.11. Sparse EM
Another suggestion of Neal & Hinton [26] is to track the auxiliary distribution q(z) in a subspace
of the original search space Q (at least for a certain number of iterations). This general strategy
includes sparse EM variants where q is updated only at pre-defined plausible unobserved values.
Alternatively, “winner-take-all” EM variants such as the CEM algorithm [6] (see section 3.1)
may be seen in this light. Such procedures may have strong computational advantages but,
in return, are prone to estimation bias. In the maximization-maximization interpretation
of EM, this comes as no surprise since these approaches “project” the estimate on a reduced
search space that may not contain the maximum likelihood solution.

4. Stochastic EM variants
While deterministic EM variants were mainly motivated by convergence speed considerations,
stochastic variants are more concerned with other limitations of standard EM. One is that the
EM auxiliary function (4) involves computing an integral that may not be tractable in some
situations. The idea is then to replace this tedious computation with a stochastic simulation.
As a typical side effect of such an approach, the modified algorithm inherits a lesser tendency
to get trapped in local maxima, yielding improved global convergence properties.

4.1. SEM
Stochastic EM [5]. As noted in section 2.4, the standard EM auxiliary function is the best
estimate of the complete-data log-likelihood in the sense of the conditional mean squared error.
The idea underlying SEM, like other stochastic EM variants, is that there might be no need to
ask for such a “good” estimate. Therefore, SEM replaces the standard auxiliary function with:

Q̂(θ|θ′) = log p(z′|θ),

where z′ is a random sample drawn from the posterior distribution of the unobserved variable⁶,
p(z|y, θ′ ). This leads to the following modified iteration; given a current estimate θn :

• Simulation step: compute p(z|y, θn ) and draw an unobserved sample zn from p(z|y, θn ).

• Maximization step: find θn+1 = arg max_θ p(zn|θ).

By construction, the resulting sequence θn is a homogeneous Markov chain⁷ which, under
mild regularity conditions, converges to a stationary pdf. This means in particular that θn
doesn’t converge to a unique value! Various schemes can be used to derive a pointwise limit,
such as averaging the estimates over iterations once stationarity has been reached (see also
SAEM regarding this issue). It was established in some specific cases that the stationary pdf
concentrates around the likelihood maximizer with a variance inversely proportional to the sam-
ple size. However, in cases where several local maximizers exist, one may expect a multimodal
behavior.

⁶Notice that when Z is defined as Z = (X, Y ), this simulation reduces to a random draw of the missing data X.
⁷The draws need to be mutually independent conditional to (θ1, θ2, . . . , θn), i.e. p(z1, z2, . . . , zn |θ1, θ2, . . . , θn) = ∏i p(zi |θi).
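Here is a minimal SEM sketch on the same two-component Gaussian mixture: compared with the EM sketch of section 2.3, the posterior membership probabilities are simply replaced by a random draw of the labels. As before, this is an illustrative sketch assuming both simulated classes stay non-empty.

```python
# Minimal SEM sketch for the 1-D two-component Gaussian mixture (illustrative assumptions).
import numpy as np
from scipy.stats import norm

def sem_gmm2(y, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    w, mu, sig = 0.5, np.array([y.min(), y.max()]), np.array([y.std(), y.std()])
    history = []
    for _ in range(n_iter):
        # Simulation step: draw z_i ~ p(z | y_i, theta_n)
        p0 = (1 - w) * norm.pdf(y, mu[0], sig[0])
        p1 = w * norm.pdf(y, mu[1], sig[1])
        z = rng.random(y.size) < p1 / (p0 + p1)
        # Maximization step: complete-data ML estimate given the simulated labels
        w = z.mean()
        for k, mask in enumerate([~z, z]):
            mu[k], sig[k] = y[mask].mean(), y[mask].std()
        history.append((w, mu.copy(), sig.copy()))
    return history   # a Markov chain of estimates; average after a burn-in period
```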

4.2. DA
Data Augmentation algorithm [28]. Going further into the world of random samples, one may
consider replacing the M-step in SEM with yet another random draw. In a Bayesian context,
maximizing p(zn |θ) wrt θ may be thought of as computing the mode of the posterior distribution
p(θ|zn ), given by:
$$ p(\theta|z_n) = \frac{p(z_n|\theta)\, p(\theta)}{\int p(z_n|\theta')\, p(\theta')\, d\theta'} $$
where we can assume a flat (or non-informative) prior distribution for θ. In DA, this maximiza-
tion is replaced with a random draw θn+1 ∼ p(θ|zn ). From equation (1), we easily check that
p(θ|zn ) = p(θ|zn , y). Therefore, DA alternates conditional draws zn |(θn , y) and θn+1 |(zn , y),
which is the very principle of a Gibbs sampler. Results from Gibbs sampling theory apply, and

it is shown under general conditions that the sequence θn is a Markov chain that converges
in distribution towards p(θ|y). Once the sequence has reached stationarity, averaging θn over
iterations yields a random variable that converges to the conditional mean E(θ|y), which is an
estimator of θ generally different from the maximum likelihood but not necessarily worse.
Interestingly enough, several variants of DA have been proposed recently [20, 17] following the
parameter expansion strategy underlying the PX-EM algorithm described in section 3.9.
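The alternation can be sketched generically as a two-block Gibbs sampler, assuming two user-supplied conditional samplers draw_z and draw_theta (hypothetical callables; for mixture models they would typically rely on conjugate priors).

```python
# Generic Data Augmentation sketch (a two-block Gibbs sampler); `draw_z` and `draw_theta`
# are hypothetical callables sampling z ~ p(z|y, theta) and theta ~ p(theta|z, y).
import numpy as np

def data_augmentation(draw_z, draw_theta, theta0, y, n_iter=5000, burn_in=1000):
    theta = theta0
    samples = []
    for it in range(n_iter):
        z = draw_z(theta, y)          # conditional draw z_n | (theta_n, y)
        theta = draw_theta(z, y)      # conditional draw theta_{n+1} | (z_n, y)
        if it >= burn_in:
            samples.append(theta)
    # averaging after burn-in approximates the posterior mean E(theta | y)
    return np.mean(samples, axis=0), samples
```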

4.3. SAEM
Stochastic Approximation type EM [3]. The SAEM algorithm is a simple hybridization of EM
and SEM that provides pointwise convergence as opposed to the erratic behavior of SEM.
Given a current estimate θn , SAEM performs a standard EM iteration in addition to the SEM
iteration. The parameter is then updated as a weighted mean of both contributions, yielding:
$$ \theta_{n+1} = (1 - \gamma_{n+1})\, \theta_{n+1}^{\mathrm{EM}} + \gamma_{n+1}\, \theta_{n+1}^{\mathrm{SEM}}, $$
where 0 ≤ γn ≤ 1. Of course, to apply SAEM, the standard EM needs to be tractable.
The sequence γn is typically chosen so as to decrease from 1 to 0, in such a way that the
algorithm is equivalent to SEM in the early iterations, and then becomes more similar to EM.
It is established that SAEM converges almost surely towards a local likelihood maximizer (thus
avoiding saddle points) under the assumption that γn decreases to 0 with limn→∞ (γn/γn+1) = 1
and ∑n γn → ∞.
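As a concrete illustration not given in [3], the schedule γn = 1/n meets all three requirements: γn decreases to 0, γn/γn+1 = (n + 1)/n → 1, and ∑n 1/n diverges; by contrast, γn = 1/n² would fail the last condition since its series converges.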

4.4. MCEM
Monte Carlo EM [30]. At least formally, MCEM turns out to be a generalization of SEM. In
the SEM simulation step, draw m independent samples $z_n^{(1)}, z_n^{(2)}, \ldots, z_n^{(m)}$ instead of just one,
and then maximize the following function:
$$ \hat{Q}(\theta|\theta_n) = \frac{1}{m} \sum_{j=1}^{m} \log p(z_n^{(j)}|\theta), $$

which, in general, converges almost surely to the standard EM auxiliary function thanks to the
law of large numbers.
Choosing a large value for m justifies calling this Monte Carlo something. In this case,
Q̂ may be seen as an empirical approximation of the standard EM auxiliary function, and the
algorithm is expected to behave similarly to EM. On the other hand, choosing a small value
for m is not forbidden, if not advised (in particular, for computational reasons). We notice that,
for m = 1, MCEM reduces to SEM. A possible strategy consists of increasing progressively the
parameter m, yielding a “simulated annealing” MCEM which is close in spirit to SAEM.

4.5. SAEM2
Stochastic Approximation EM [11]. Delyon et al propose a generalization of MCEM called
SAEM, not to be confused with the earlier SAEM algorithm presented in section 4.3, although
both algorithms promote a similar simulated annealing philosophy. In this version, the auxiliary
function is defined recursively by averaging a Monte Carlo approximation with the auxiliary
function computed in the previous step:
$$ \hat{Q}_n(\theta) = (1 - \gamma_n)\, \hat{Q}_{n-1}(\theta) + \frac{\gamma_n}{m_n} \sum_{j=1}^{m_n} \log p(z_n^{(j)}|\theta), $$

where $z_n^{(1)}, z_n^{(2)}, \ldots, z_n^{(m_n)}$ are drawn independently from p(z|y, θn). The weights γn are typically
decreased across iterations in such a way that Q̂n (θ) eventually stabilizes at some point. One
may either increase the number of random draws mn , or set a constant value mn ≡ 1 when sim-
ulations have heavy computational cost compared to the maximization step. The convergence
of SAEM2 towards a local likelihood maximizer is proved in [11] under quite general conditions.
Kuhn et al [16] further extend the technique to make it possible to perform the simulation
under a distribution Πθn (z) simpler to deal with than the posterior pdf p(z|y, θn ). Such a
distribution may be defined as the transition probability of a Markov chain generated by a
Metropolis-Hastings algorithm. If Πθ (z) is such that its associated Markov chain converges
to p(z|y, θ), then the convergence properties of SAEM2 generalize under mild additional as-
sumptions.

5. Conclusion
This report’s primary goal is to give a flavor of the current state of the art on EM-type statistical
learning procedures. We also hope it will help researchers and developers in finding the literature
relevant to their current existential questions. For a more comprehensive overview, we recommend
some good tutorials that can be found on the internet [8, 2, 1, 10, 29, 17].

A. Appendix
A.1. Maximum likelihood quickie
Let Y be a random variable with pdf p(y|θ), where θ is an unknown parameter vector. Given an
outcome y of Y , maximum likelihood estimation consists of finding the value of θ that maximizes
the probability p(y|θ) over a given search space Θ. In this context, p(y|θ) is seen as a function
of θ and called the likelihood function. Since it is often more convenient to manipulate the
logarithm of this expression, we will focus on the equivalent problem of maximizing the log-
likelihood function:
$$ \hat{\theta}(y) = \arg\max_{\theta \in \Theta} L(y, \theta) $$

where the log-likelihood L(y, θ) ≡ log p(y|θ) is denoted L(y, θ) to emphasize the dependence on
y, contrary to the notation L(θ) usually employed throughout this report. Whenever the log-
likelihood is differentiable wrt θ, we also define the score function as the log-likelihood gradient:
$$ S(y, \theta) = \frac{\partial L}{\partial \theta}(y, \theta) $$
In this case, a fundamental result is that, for any vector U(y, θ), we have:

$$ E(S U^t) = \frac{\partial}{\partial \theta} E(U^t) - E\!\left(\frac{\partial U^t}{\partial \theta}\right) $$
where the expectation is taken wrt the distribution p(y|θ). This equality is easily obtained from
the logarithm differentiation formula and some additional manipulations. Assigning the “true”
value of θ in this expression leads to the following:

• E(S) = 0
• Cov(S, S) = −E(∂S^t/∂θ)  (Fisher information)

• If U (y) is an unbiased estimator of θ, then Cov(S, U ) = Id.

• In the case of a single parameter, the above result implies Var(U) ≥ Var(S)⁻¹ from the
Cauchy-Schwarz inequality, i.e. the inverse Fisher information is a lower bound for the variance
of U. Equality occurs iff U is an affine function of S, which imposes a specific form to
p(y|θ) (Darmois theorem).

A.2. Jensen’s inequality


For any random variable X and any real continuous concave function f , we have:

f[E(X)] ≥ E[f(X)].

If f is strictly concave, equality occurs iff X is deterministic.

References
[1] A. Berger. Convexity, Maximum Likelihood and All That. Tutorial published on the web, 1998.

[2] J. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, University of Berkeley, Apr. 1998.

[3] G. Celeux, D. Chauveau, and J. Diebolt. On Stochastic Versions of the EM Algorithm. Technical Report 2514, INRIA, Mar. 1995.

[4] G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri. A Component-wise EM Algorithm for Mixtures. Journal of Computational and Graphical Statistics, 10:699–712, 2001.

[5] G. Celeux and J. Diebolt. The SEM Algorithm: a Probabilistic Teacher Algorithm Derived from the EM Algorithm for the Mixture Problem. Computational Statistics Quarterly, 2:73–82, 1985.

[6] G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14:315–332, 1992.

[7] S. Chrétien and A. O. Hero. Kullback Proximal Algorithms for Maximum Likelihood Estimation. IEEE Transactions on Information Theory, 46(5):1800–1810, 2000.

[8] C. Couvreur. The EM Algorithm: A Guided Tour. In Proc. 2d IEEE European Workshop on Computationally Intensive Methods in Control and Signal Processing, Prague, Czech Republic, Aug. 1996.

[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991.

[10] F. Dellaert. The Expectation Maximization Algorithm. Tutorial published on the web, 2002.

[11] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27(1):94–128, 1999.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39:1–38, 1977.

[13] J. A. Fessler and A. O. Hero. Space-Alternating Generalized Expectation-Maximization Algorithm. IEEE Transactions on Signal Processing, 42(10):2664–2677, 1994.

[14] J. A. Fessler and A. O. Hero. Penalized Maximum-Likelihood Image Reconstruction Using Space-Alternating Generalized EM Algorithms. IEEE Transactions on Image Processing, 4(10):1417–1429, 1995.

[15] M. Jamshidian and R. I. Jennrich. Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association, 88:221–228, 1993.

[16] E. Kuhn and M. Lavielle. Coupling a stochastic approximation version of EM with a MCMC procedure. Submitted, 2002.

[17] C. Liu. An Example of Algorithm Mining: Covariance Adjustment to Accelerate EM and Gibbs. To appear in Development of Modern Statistics and Related Topics, 2003.

[18] C. Liu and D. B. Rubin. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, 81:633–648, 1994.

[19] C. Liu, D. B. Rubin, and Y. N. Wu. Parameter Expansion to Accelerate EM: The PX-EM Algorithm. Biometrika, 85:755–770, 1998.

[20] J. S. Liu and Y. N. Wu. Parameter Expansion for Data Augmentation. Journal of the American Statistical Association, 94:1264–1274, 1999.

[21] T. A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B, 44:226–233, 1982.

[22] I. Meilijson. A fast improvement of the EM algorithm on its own terms. Journal of the Royal Statistical Society: Series B, 51:127–138, 1989.

[23] X. L. Meng and D. B. Rubin. Maximum likelihood via the ECM algorithm: a general framework. Biometrika, 80:267–278, 1993.

[24] X. L. Meng and D. A. van Dyk. The EM Algorithm - An Old Folk Song Sung To A Fast New Tune. Journal of the Royal Statistical Society: Series B, 59:511–567, 1997.

[25] X. L. Meng and D. A. van Dyk. Seeking efficient data augmentation schemes via conditional and marginal data augmentation. Biometrika, 86(2):301–320, 1999.

[26] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.

[27] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, 2nd edition, 1992.

[28] M. A. Tanner and W. H. Wong. The Calculation of Posterior Distributions by Data Augmentation. Journal of the American Statistical Association, 82:528–550, 1987.

[29] D. A. van Dyk and X. L. Meng. Algorithms based on data augmentation: A graphical representation and comparison. In K. Berk and M. Pourahmadi, editors, Computing Science and Statistics: Proceedings of the 31st Symposium on the Interface, pages 230–239. Interface Foundation of North America, 2000.

[30] G. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm. Journal of the American Statistical Association, 85:699–704, 1990.
