Expectation Maximization Algorithm
Introduction
• Presented by Dempster, Laird and Rubin in 1977
• Essentially the same principle had already been proposed by earlier authors for specific applications
• The EM algorithm is an iterative estimation algorithm that can derive maximum likelihood (ML) estimates in the presence of missing/hidden data ("incomplete data")
Many-to-one mapping
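One standard way to make the "many-to-one" idea precise (following the incomplete-data formulation of Dempster, Laird and Rubin; the symbols here are conventional, not taken from the slide):

```latex
% Complete data x map many-to-one onto the observed ("incomplete") data y = y(x),
% so the observed-data likelihood marginalizes over all x consistent with y.
g(y \mid \theta) = \int_{\mathcal{X}(y)} f(x \mid \theta)\, dx,
\qquad \mathcal{X}(y) = \{\, x : y(x) = y \,\}
```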
Expectation Step (E-Step)
• The basic functioning of the EM algorithm can be divided into two steps (the parameter to be estimated is θ; both steps are summarized below):
• Expectation step (E-step)
• Maximization step (M-step)
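In standard notation (z denoting the hidden/missing data; this is the usual textbook formulation rather than anything specific to the worked example that follows), the two steps are:

```latex
% E-step: expected complete-data log-likelihood under the current estimate \theta^{(t)}
Q\bigl(\theta \mid \theta^{(t)}\bigr)
  = \mathbb{E}_{z \sim p(z \mid y,\, \theta^{(t)})}\bigl[\log p(y, z \mid \theta)\bigr]

% M-step: choose the next estimate by maximizing Q
\theta^{(t+1)} = \arg\max_{\theta}\; Q\bigl(\theta \mid \theta^{(t)}\bigr)
```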
Worked example: two coins A and B with unknown head probabilities; each case below is one set of 10 tosses drawn from one of the two coins, but which coin was used is hidden.
Case 1:
• 10 tosses with 5 heads and 5 tails
• Initially assumed values: θ̂_A^(0) = 0.6 and θ̂_B^(0) = 0.5
• Coin A
• 10C5 (0.6)^5 (0.4)^5 = 252 × 0.07776 × 0.01024 ≈ 0.201
• Coin B
• 10C5 (0.5)^5 (0.5)^5 = 252 × 0.03125 × 0.03125 ≈ 0.246
• Normalization Factor = 1/(0.201 + 0.246) ≈ 2.237
• Normalized value for A: 0.201 × 2.237 ≈ 0.45
• Normalized value for B: 0.246 × 2.237 ≈ 0.55
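As a minimal Python sketch of this E-step computation (function and variable names are my own, not from the slides), the Case 1 responsibilities can be reproduced as follows:

```python
from math import comb

def responsibilities(heads, tails, theta_a, theta_b):
    """E-step for one toss sequence: posterior probability that it came
    from coin A vs. coin B, assuming both coins are a priori equally likely."""
    like_a = comb(heads + tails, heads) * theta_a**heads * (1 - theta_a)**tails
    like_b = comb(heads + tails, heads) * theta_b**heads * (1 - theta_b)**tails
    total = like_a + like_b
    return like_a / total, like_b / total

# Case 1: 5 heads, 5 tails with the initial guesses 0.6 and 0.5
print(responsibilities(5, 5, 0.6, 0.5))   # -> approximately (0.45, 0.55)
```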
Case 2:
• HHHHTHHHHH
• Initially assumed values: θ̂_A^(0) = 0.6 and θ̂_B^(0) = 0.5
• Coin A
• 10C9 (0.6)^9 (0.4)^1 ≈ 0.0403
• Coin B
• 10C9 (0.5)^9 (0.5)^1 ≈ 0.0098
• Normalization Factor = 1/(0.0403 + 0.0098) ≈ 19.96
• Normalized value for A: 0.0403 × 19.96 ≈ 0.80
• Normalized value for B: 0.0098 × 19.96 ≈ 0.20
Case 3:
• HTHHHHHTHH
• Initially assumed values: θ̂_A^(0) = 0.6 and θ̂_B^(0) = 0.5
• Coin A
• 10C8 (0.6)^8 (0.4)^2 ≈ 0.121
• Coin B
• 10C8 (0.5)^8 (0.5)^2 ≈ 0.044
• Normalization Factor = 1/(0.121 + 0.044) ≈ 6.061
• Normalized value for A: 0.121 × 6.061 ≈ 0.73
• Normalized value for B: 0.044 × 6.061 ≈ 0.27
Case 4:
• HTHTTTTHHT
• Initially assumed values: θ̂_A^(0) = 0.6 and θ̂_B^(0) = 0.5
• Coin A
• 10C4 (0.6)^4 (0.4)^6 ≈ 0.1115
• Coin B
• 10C4 (0.5)^4 (0.5)^6 ≈ 0.2051
• Normalization Factor = 1/(0.1115 + 0.2051) ≈ 3.1586
• Normalized value for A: 0.1115 × 3.1586 ≈ 0.35
• Normalized value for B: 0.2051 × 3.1586 ≈ 0.65
Case 5:
• THHHTHHHTH
• Initially assumed values: θ̂_A^(0) = 0.6 and θ̂_B^(0) = 0.5
• Coin A
• 10C7 (0.6)^7 (0.4)^3 ≈ 0.215
• Coin B
• 10C7 (0.5)^7 (0.5)^3 ≈ 0.1172
• Normalization Factor = 1/(0.215 + 0.1172) ≈ 3.01
• Normalized value for A: 0.215 × 3.01 ≈ 0.65
• Normalized value for B: 0.1172 × 3.01 ≈ 0.35
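The same calculation repeats for all five cases. A short sketch that loops over them, reusing the hypothetical responsibilities helper defined after Case 1 (each case is represented only by its head/tail counts):

```python
# (heads, tails) for Cases 1-5
cases = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

for i, (h, t) in enumerate(cases, start=1):
    p_a, p_b = responsibilities(h, t, 0.6, 0.5)
    print(f"Case {i}: P(A) = {p_a:.2f}, P(B) = {p_b:.2f}")
# Case 1: P(A) = 0.45, P(B) = 0.55
# Case 2: P(A) = 0.80, P(B) = 0.20
# Case 3: P(A) = 0.73, P(B) = 0.27
# Case 4: P(A) = 0.35, P(B) = 0.65
# Case 5: P(A) = 0.65, P(B) = 0.35
```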
Calculations (expected head/tail counts from the E-step weights)

Case   Coin A weight × (H, T)   Coin B weight × (H, T)   Coin A counts    Coin B counts
1      0.45 × (5, 5)            0.55 × (5, 5)            2.2 H, 2.2 T     2.8 H, 2.8 T
2      0.80 × (9, 1)            0.20 × (9, 1)            7.2 H, 0.8 T     1.8 H, 0.2 T
3      0.73 × (8, 2)            0.27 × (8, 2)            5.8 H, 1.5 T     2.2 H, 0.5 T
4      0.35 × (4, 6)            0.65 × (4, 6)            1.4 H, 2.1 T     2.6 H, 3.9 T
5      0.65 × (7, 3)            0.35 × (7, 3)            4.6 H, 2.0 T     2.4 H, 1.0 T
Total                                                    21.2 H, 8.6 T    11.8 H, 8.4 T
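These expected counts can also be accumulated directly from the unrounded weights (continuing the sketch above). The table rounds the weights to two decimals before multiplying, which is why its coin B total reads 11.8 H, while unrounded arithmetic, and the M-step below, give about 11.7 H:

```python
heads_a = tails_a = heads_b = tails_b = 0.0

for h, t in cases:
    p_a, p_b = responsibilities(h, t, 0.6, 0.5)
    heads_a += p_a * h; tails_a += p_a * t   # expected counts credited to coin A
    heads_b += p_b * h; tails_b += p_b * t   # expected counts credited to coin B

print(round(heads_a, 1), round(tails_a, 1))  # ~21.3 H, ~8.6 T for coin A
print(round(heads_b, 1), round(tails_b, 1))  # ~11.7 H, ~8.4 T for coin B
```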
M-Step: Calculation for the next iteration

θ̂_A^(1) = 21.2 / (21.2 + 8.6) ≈ 0.71
θ̂_B^(1) = 11.7 / (11.7 + 8.4) ≈ 0.58
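Putting the E-step and M-step together, here is a self-contained sketch of the full iteration for this two-coin example (my own code, written to match the numbers above, not the slides' original implementation). One iteration reproduces θ̂_A^(1) ≈ 0.71 and θ̂_B^(1) ≈ 0.58, and repeated iterations move toward roughly 0.80 and 0.52, the converged values reported in Do & Batzoglou (2008):

```python
from math import comb

# (heads, tails) per toss sequence, Cases 1-5
cases = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

def em_two_coins(cases, theta_a=0.6, theta_b=0.5, iterations=10):
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        # E-step: weight each sequence by the posterior probability of each coin
        for h, t in cases:
            like_a = comb(h + t, h) * theta_a**h * (1 - theta_a)**t
            like_b = comb(h + t, h) * theta_b**h * (1 - theta_b)**t
            p_a = like_a / (like_a + like_b)
            p_b = 1.0 - p_a
            heads_a += p_a * h; tails_a += p_a * t
            heads_b += p_b * h; tails_b += p_b * t
        # M-step: re-estimate each coin's bias from its expected counts
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b

print(em_two_coins(cases, iterations=1))   # ~(0.71, 0.58)
print(em_two_coins(cases, iterations=10))  # ~(0.80, 0.52)
```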
Advantages
• The EM algorithm naturally produces valid parameters for a mixture distribution (e.g. mixing weights that stay non-negative and sum to one).
• The EM algorithm doesn't require the gradient.
• From an implementation standpoint, the EM algorithm is often described as very simple to code, although for some problems simply handing the likelihood to a standard optimization solver can be even simpler.
Disadvantages
• Because of the uniform distribution, it seems like your
objective function might have some nasty behavior.
• For example, imagine sliding the uniform distribution to
the side. The likelihood won't change at all (the derivative
will be zero w.r.t. the location parameter), until a data
point falls into or out of the support of the distribution.
• The likelihood will then abruptly jump to a new value, and
the function will be non differentiable at this point.
• This kind of behavior doesn't play nicely with gradient-
based optimization algorithms. I don't know how it would
affect the EM algorithm.
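A small numerical sketch of the behaviour described above (entirely hypothetical data and parameter values, chosen only for illustration): the log-likelihood of a two-component mixture with a uniform component is flat in the uniform's location parameter until a data point crosses the edge of its support, where it jumps:

```python
import numpy as np

# Made-up data and a 50/50 mixture of Uniform(loc, loc + 1) and N(0.5, 0.2^2)
data = np.array([0.05, 0.3, 0.55, 0.8, 1.1])

def mixture_loglik(loc, width=1.0):
    uniform_pdf = np.where((data >= loc) & (data <= loc + width), 1.0 / width, 0.0)
    normal_pdf = np.exp(-0.5 * ((data - 0.5) / 0.2) ** 2) / (0.2 * np.sqrt(2 * np.pi))
    return np.log(0.5 * uniform_pdf + 0.5 * normal_pdf).sum()

for loc in np.linspace(-0.2, 0.2, 9):
    print(f"loc = {loc:+.2f}   log-likelihood = {mixture_loglik(loc):.4f}")
# The printed value is constant over ranges of loc (zero derivative) and jumps
# discontinuously whenever a data point enters or leaves the uniform support.
```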
References
• Andrew Ng. CS229 Lecture Notes: The EM Algorithm.
• Chuong B. Do & Serafim Batzoglou. What is the expectation maximization algorithm? Nature Biotechnology 26, 897-899 (2008).