Sampling
Lecture 2: Sampling
MING GAO
DaSE@ECNU
(for course-related communications)
mgao@dase.ecnu.edu.cn
MING GAO (DaSE@ECNU) Algorithm Foundations of Data Science Mar. 14, 2018 2 / 45
Monte Carlo Method
Example I
Algorithm:
Step i: randomly and uniformly generate a point P_i inside the sample space Ω = {(x, y) | 0 ≤ x, y ≤ 1}.
Let S = {(x, y) : x² + y² ≤ 1 ∧ x, y ≥ 0} be the quarter-circle region. For each point P_i, define the indicators I_S(P_i) and I_{Ω−S}(P_i); then
$$\frac{\pi}{4} \approx \frac{\sum_{i=1}^{n} I_S(P_i)}{\sum_{i=1}^{n} I_S(P_i) + \sum_{i=1}^{n} I_{\Omega-S}(P_i)}.$$
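The quarter-circle estimator can be run directly as a short sketch; the sample size and seed below are illustrative choices, not part of the slides:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def estimate_pi(n: int) -> float:
    """Monte Carlo estimate of pi from n uniform points in the unit square."""
    hits = 0  # number of points with I_S(P_i) = 1, i.e. inside the quarter circle
    for _ in range(n):
        x, y = random.random(), random.random()  # P_i uniform in Omega
        if x * x + y * y <= 1.0:
            hits += 1
    # hits / n approximates pi / 4, the area ratio |S| / |Omega|
    return 4.0 * hits / n

pi_est = estimate_pi(100_000)
```

With 100,000 points the estimate typically lands within a few hundredths of π, consistent with the convergence behavior discussed later.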
The term following the last equals sign is the sum over all x of a
function of x, weighted by the marginal probabilities P(X = x).
Clearly this is an expectation, and therefore may be approximated by
Monte Carlo, giving us
$$P(Y = y) \approx \frac{1}{n} \sum_{i=1}^{n} P(Y = y \mid X = x_i).$$
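A minimal sketch of this estimator on a hypothetical two-state model (all probabilities below are invented for illustration):

```python
import random

random.seed(1)

# Hypothetical discrete model: X in {0, 1} with known marginal and conditional.
p_x = {0: 0.6, 1: 0.4}           # P(X = x)
p_y1_given_x = {0: 0.2, 1: 0.7}  # P(Y = 1 | X = x)

n = 50_000
xs = random.choices([0, 1], weights=[p_x[0], p_x[1]], k=n)  # x_i ~ P(X)

# P(Y = 1) ~ (1/n) * sum_i P(Y = 1 | X = x_i)
p_y1_mc = sum(p_y1_given_x[x] for x in xs) / n

# Exact marginal for comparison: 0.6 * 0.2 + 0.4 * 0.7 = 0.40
p_y1_exact = sum(p_x[x] * p_y1_given_x[x] for x in (0, 1))
```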
Example III: approximating the integral $\int_0^1 x^2\,dx$
For an integral $\int_a^b f(x)\,dx$, it is hard to find a rectangle to bound the value of f(x), especially for a high-dimensional function. Alternatively, we compute $\int_a^b \frac{f(x)}{p(x)}\,p(x)\,dx$.
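For Example III itself the simplest choice is p(x) = U(0, 1), so the integral becomes a plain average of f(x_i) = x_i² (a sketch; the sample size is arbitrary):

```python
import random

random.seed(2)

# With p(x) = U(0, 1), the integral of x^2 over [0, 1] equals E_p[x^2],
# so we average f(x_i) = x_i^2 over draws x_i ~ U(0, 1). Exact value: 1/3.
n = 200_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n
```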
3 The rate of convergence is proportional to $1/\sqrt{N}$;
4 Major issues:
The proportionality constant increases exponentially with the dimension of the integral.
Another problem is that sampling from complex distributions is not as easy as sampling from the uniform distribution.
Rejection sampling: approximating $\int f(x)p(x)\,dx$
$\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ is difficult to compute since it is hard to draw from p(x).
1: i ← 0;
2: while i ≠ N do
3:   x^(i) ∼ q(x);
4:   u ∼ U(0, 1);
5:   if u < p(x^(i)) / (k q(x^(i))) then
6:     accept x^(i);
7:     i ← i + 1;
8:   else
9:     reject x^(i);
10:  end if
11: end while
Here q(x) is a density (e.g., Gaussian) that can be sampled from directly.
What is the average acceptance ratio? (It is 1/k.) However, it is hard to find a reasonable q(x) and the value of k.
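A runnable sketch of the loop above, with an invented target p(x) = 6x(1 − x) (a Beta(2, 2) density on [0, 1]), proposal q = U(0, 1), and envelope constant k = 1.5 chosen so that p(x) ≤ k·q(x):

```python
import random

random.seed(3)

def p(x: float) -> float:
    """Illustrative target density: Beta(2, 2), i.e. 6x(1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x)

k = 1.5       # p(x) <= k * q(x) for q = U(0, 1), since max p = 1.5
N = 20_000
samples = []
while len(samples) != N:
    x = random.random()      # 3: x ~ q(x)
    u = random.random()      # 4: u ~ U(0, 1)
    if u < p(x) / k:         # 5: accept with probability p(x) / (k q(x))
        samples.append(x)    # 6-7: accept and advance i
    # 9: otherwise reject x and redraw

mean = sum(samples) / N      # E[x] under Beta(2, 2) is 0.5
```

On average a fraction 1/k of proposals is accepted, which is why a loose envelope constant makes the method slow.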
Importance sampling: approximating $I(f) = \int f(x)p(x)\,dx$
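A sketch of the importance-sampling estimator: draw from a tractable q and reweight by p/q. The densities below (standard normal target, wider normal proposal, f(x) = x²) are illustrative choices:

```python
import math
import random

random.seed(4)

def p(x: float) -> float:
    """Target density: standard normal N(0, 1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def q(x: float) -> float:
    """Proposal density: N(0, 4), deliberately wider than the target."""
    return math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

# I(f) = E_p[x^2] = 1 is estimated as (1/N) sum f(x_i) p(x_i)/q(x_i), x_i ~ q.
N = 100_000
total = 0.0
for _ in range(N):
    x = random.gauss(0.0, 2.0)        # draw from q
    total += (x * x) * p(x) / q(x)    # importance weight p/q times f
I_hat = total / N
```

Choosing q wider than p keeps the weights p/q bounded, which keeps the estimator's variance under control.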
Going back to Example II with the discrete sum over latent variables X, it is clear that the optimal importance sampling function would be the conditional distribution of X given Y, i.e.,
$$P(Y = y) = \sum_{x \in \mathcal{X}} \frac{P(Y = y, X = x)}{P(X = x \mid Y = y)}\, P(X = x \mid Y = y).$$
Markov Chain Monte Carlo
We cannot sample directly from the target distribution p(x) in the integral $\int f(x)p(x)\,dx$.
Create a Markov chain whose transition matrix does not depend on the normalization term.
Make sure the chain has a stationary distribution and it is equal to the target distribution.
After a sufficient number of iterations, the chain will converge to the stationary distribution.
Stationary Distribution
Theorem
Let X_0, X_1, · · · be an irreducible and aperiodic Markov chain with transition matrix P. Then $\lim_{n\to\infty} P^n_{ij}$ exists and is independent of i, denoted as $\lim_{n\to\infty} P^n_{ij} = \pi(j)$. We also have
$$\lim_{n\to\infty} P^n = \begin{pmatrix} \pi(1) & \pi(2) & \cdots & \pi(j) & \cdots \\ \pi(1) & \pi(2) & \cdots & \pi(j) & \cdots \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ \pi(1) & \pi(2) & \cdots & \pi(j) & \cdots \\ \cdots & \cdots & \cdots & \cdots & \cdots \end{pmatrix} \qquad (2)$$
$\pi(j) = \sum_{i=1}^{\infty} \pi(i)P_{ij}$, and $\sum_{i=1}^{\infty} \pi(i) = 1$.
π is the unique, non-negative solution of the equation πP = π.
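The theorem can be checked numerically on a small hypothetical 3-state chain: raising P to a high power yields a matrix whose rows all equal π, and that π solves πP = π (the matrix entries below are invented):

```python
# A 3-state irreducible, aperiodic transition matrix (rows sum to 1).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pn = P
for _ in range(60):      # compute P^61, far past mixing for this chain
    Pn = mat_mul(Pn, P)

pi = Pn[0]               # every row of lim P^n is the same vector pi
piP = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
```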
MCMC Sampling Algorithm
How to choose α(i, j) such that π(i)P_{ij} α(i, j) = π(j)P_{ji} α(j, i)?
In terms of symmetry, we simply choose α(i, j) = π(j)P_{ji}, α(j, i) = π(i)P_{ij}.
Therefore, we have $\underbrace{\pi(i)\,P_{ij}\,\alpha(i,j)}_{Q_{ij}} = \underbrace{\pi(j)\,P_{ji}\,\alpha(j,i)}_{Q_{ji}}$.
The transition matrix:
$$Q_{ij} = P_{ij}\,\alpha(i, j), \quad \text{if } j \neq i; \qquad Q_{ii} = P_{ii} + \sum_{k \neq i} P_{ik}\bigl(1 - \alpha(i, k)\bigr), \quad \text{otherwise.}$$
Metropolis-Hastings algorithm
Observation
Let P(x | x^(i)) be a proposal distribution. If we let
$$\alpha = \min\Bigl\{1,\ \frac{\pi(x)\,P(x^{(i)} \mid x)}{\pi(x^{(i)})\,P(x \mid x^{(i)})}\Bigr\},$$
we can get a high acceptance ratio, and further improve the algorithm efficiency.
0: initialize x^(0);
1: for i = 0 to max do
2:   sample u ∼ U[0, 1];
3:   sample x ∼ P(x | x^(i));
4:   if u < min{1, π(x)P(x^(i) | x) / (π(x^(i))P(x | x^(i)))}
5:     then x^(i+1) = x;
6:   else reject x, and x^(i+1) = x^(i);
7: end if
8: end for
9: output last N samples;
However, for high-dimensional P, the Metropolis-Hastings algorithm may be inefficient because of α < 1. Is there a way to find a transition matrix with acceptance ratio α = 1?
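A sketch of the loop for a continuous target: with a symmetric random-walk proposal the ratio P(x^(i) | x)/P(x | x^(i)) cancels, leaving α = min{1, π(x)/π(x^(i))}. The unnormalized target exp(−x²/2) and the run lengths below are illustrative:

```python
import math
import random

random.seed(5)

def target(x: float) -> float:
    """Unnormalized target pi(x) proportional to exp(-x^2 / 2), i.e. N(0, 1)."""
    return math.exp(-x * x / 2.0)

x = 0.0
samples = []
for i in range(60_000):
    prop = x + random.gauss(0.0, 1.0)   # symmetric proposal P(x | x^(i))
    # alpha = min{1, pi(prop) / pi(x)} since the proposal ratio cancels
    if random.random() < min(1.0, target(prop) / target(x)):
        x = prop                         # accept
    # else: reject and keep x^(i+1) = x^(i)
    if i >= 10_000:                      # discard burn-in draws
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note the target never needs its normalization constant, which is exactly the point of the method.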
Properties of MCMC
Observation
For example:
Sample: x^(t+1) | x^(t) ∼ N(0.5 x^(t), 1.0);
Convergence: x^(t) | x^(0) ∼ N(0, 1.33), as t → +∞.
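This example is easy to verify by simulation: iterating x^(t+1) = 0.5 x^(t) + ε with ε ∼ N(0, 1) gives a stationary variance of 1/(1 − 0.5²) = 4/3 ≈ 1.33, whatever the starting point (run lengths below are arbitrary):

```python
import random

random.seed(6)

x = 10.0                 # start deliberately far from the stationary mean
draws = []
for t in range(200_000):
    x = 0.5 * x + random.gauss(0.0, 1.0)  # x^(t+1) | x^(t) ~ N(0.5 x^(t), 1)
    if t >= 1_000:                         # after the chain has converged
        draws.append(x)

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)  # should be near 4/3
```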
Gibbs Sampling
Intuition
Intuition Cont’d
Transition matrix
The transition probabilities between two points A and B are given by T(A → B):
$$T(A \to B) = \begin{cases} p(y_B \mid x_1), & \text{if } x_A = x_B = x_1; \\ p(x_B \mid y_1), & \text{if } y_A = y_B = y_1; \\ 0, & \text{otherwise.} \end{cases}$$
Multivariate case
Let $P(x_i \mid x_{-i}) = P(x_i \mid x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n)$. The transition probabilities are given by $T(x \to x') = P(x'_i \mid x_{-i})$. Then, we have:
$$T(x \to x')\,p(x) = P(x'_i \mid x_{-i})\,P(x_i \mid x_{-i})\,P(x_{-i})$$
$$T(x' \to x)\,p(x') = P(x_i \mid x'_{-i})\,P(x'_i \mid x'_{-i})\,P(x'_{-i})$$
Note that $x_{-i} = x'_{-i}$. That is,
$$T(x \to x')\,p(x) = T(x' \to x)\,p(x'). \qquad (8)$$
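A sketch of a two-variable Gibbs sampler for a standard bivariate Gaussian with correlation ρ = 0.8, whose full conditionals are x | y ∼ N(ρy, 1 − ρ²) and y | x ∼ N(ρx, 1 − ρ²) (the correlation and run lengths are illustrative choices):

```python
import math
import random

random.seed(7)

rho = 0.8
sd = math.sqrt(1.0 - rho * rho)    # conditional standard deviation

x, y = 0.0, 0.0
xs, ys = [], []
for t in range(100_000):
    x = random.gauss(rho * y, sd)  # sample from the full conditional p(x | y)
    y = random.gauss(rho * x, sd)  # sample from the full conditional p(y | x)
    if t >= 1_000:                 # discard burn-in
        xs.append(x)
        ys.append(y)

n = len(xs)
var_x = sum(a * a for a in xs) / n             # should approach 1
corr = sum(a * b for a, b in zip(xs, ys)) / n  # should approach rho
```

Every move is accepted, which is the α = 1 property the previous slide asked for.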
Latent Dirichlet Allocation
symbol   meaning
M        the number of documents
N_m      the number of words in document m
K        the number of topics
w_{m,n}  the index of word n in document m
z_{m,n}  the topic k assigned to word w_{m,n}
α, β     fixed hyper-parameters
θ        topic distribution for each document
φ        word distribution for each topic
Properties of Dirichlet
$$Dir(\theta \mid \alpha) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \equiv \frac{1}{\Delta(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$
$$Mult(m_1, \cdots, m_K \mid \theta, N) = \binom{N}{m_1\, m_2\, \cdots\, m_K} \prod_{k=1}^{K} \theta_k^{m_k}$$
$$Dir(\theta \mid D, \alpha) = Dir(\theta \mid \alpha + m) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k + N)}{\prod_{k=1}^{K} \Gamma(\alpha_k + m_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k + m_k - 1}$$
The expectation of the Dirichlet is $E(\theta) = (\frac{\alpha_1}{\alpha_0}, \frac{\alpha_2}{\alpha_0}, \cdots, \frac{\alpha_K}{\alpha_0})$, where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$.
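The conjugacy and expectation facts above can be checked with plain arithmetic; the prior pseudo-counts and observed counts below are invented for illustration:

```python
# Prior Dir(alpha) plus multinomial counts m gives posterior Dir(alpha + m),
# so the posterior mean of theta_k is (alpha_k + m_k) / (alpha_0 + N).
alpha = [2.0, 3.0, 5.0]   # hypothetical prior parameters, alpha_0 = 10
m = [10, 20, 30]          # hypothetical observed counts, N = 60

alpha0 = sum(alpha)
N = sum(m)

prior_mean = [a / alpha0 for a in alpha]        # E(theta)_k = alpha_k / alpha_0
post_alpha = [a + c for a, c in zip(alpha, m)]  # posterior is Dir(alpha + m)
post_mean = [a / (alpha0 + N) for a in post_alpha]
```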
1: for k = 1 to K do
2:   φ^(k) ∼ Dirichlet(β);
3: for each document m ∈ D
4:   θ_m ∼ Dirichlet(α);
5:   for each word w_{m,n} ∈ m
6:     z_{m,n} ∼ Mult(θ_m);
7:     w_{m,n} ∼ Mult(φ^(z_{m,n}));
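The generative process above can be sketched end-to-end with standard-library sampling (Dirichlet draws via normalized Gamma variates); all sizes and hyper-parameters below are toy values:

```python
import random

random.seed(8)

def dirichlet(alphas):
    """Draw from Dirichlet(alphas) by normalizing Gamma variates."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [v / s for v in g]

def categorical(probs):
    """Draw an index with the given probabilities (a Mult(1) draw)."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

K, V, M, N_m = 3, 8, 4, 20        # topics, vocabulary, documents, words/doc
alpha, beta = [0.5] * K, [0.1] * V

phi = [dirichlet(beta) for _ in range(K)]   # lines 1-2: phi^(k) ~ Dirichlet(beta)
docs, topics = [], []
for m in range(M):                          # line 3
    theta_m = dirichlet(alpha)              # line 4: theta_m ~ Dirichlet(alpha)
    z_m = [categorical(theta_m) for _ in range(N_m)]  # lines 5-6
    w_m = [categorical(phi[z]) for z in z_m]          # line 7
    topics.append(z_m)
    docs.append(w_m)
```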
The joint probabilities of observing a word w_{m,n} and the corpus are, in other words,
$$p(w, z, \phi, \theta \mid \alpha, \beta) = \prod_{m=1}^{M} \prod_{n=1}^{N_m} p(w_{m,n}, z_{m,n}, \phi, \theta_m \mid \alpha, \beta) = p(\phi \mid \beta) \prod_{m=1}^{M} p(\theta_m \mid \alpha) \prod_{n=1}^{N_m} p(w_{m,n} \mid z_{m,n}, \phi)\, p(z_{m,n} \mid \theta_m).$$
LDA Model II
1: for k = 1 to K do
2:   φ^(k) ∼ Dirichlet(β);
3: for each document m ∈ D
4:   θ_m ∼ Dirichlet(α);
5:   for each word w_{m,n} ∈ m
6:     z_{m,n} ∼ Mult(θ_m);
7: for each topic k ∈ [1, K]
8:   for each z_{m,n} = k
9:     w_{m,n} ∼ Mult(φ^(k));
$$\underbrace{\alpha \to \theta_m}_{Dirichlet} \underbrace{\to z_m}_{Multinomial}, \quad \text{and} \quad \underbrace{\beta \to \phi_k}_{Dirichlet} \underbrace{\to w_k}_{Multinomial} \qquad (9)$$
Suppose we have a die of K sides. We toss the die and the probability of landing on side k is p(t = k | f) = f_k. We throw the die N times and obtain a set of results s = {s_1, s_2, · · · , s_N}. The joint probability is
$$p(s \mid f) = \prod_{n=1}^{N} p(s_n \mid f) = f_1^{n_1} f_2^{n_2} \cdots f_K^{n_K} = \prod_{i=1}^{K} f_i^{n_i} \qquad (10)$$
Example Cont’d
The posterior over f is again a Dirichlet:
$$= \frac{\Gamma\bigl(\sum_{k=1}^{K} (n_k + \alpha_k)\bigr)}{\prod_{k=1}^{K} \Gamma(n_k + \alpha_k)} \prod_{k=1}^{K} f_k^{n_k + \alpha_k - 1}$$
Obtaining the counts n_k from the N trials is a simple counting procedure.
Estimating f_i
Likelihood of Observing s_i
Parameter Inference
We integrate out θ and φ to obtain the following:
$$p(z, w \mid \alpha, \beta) = p(w \mid z, \beta)\,p(z \mid \alpha)$$
For $\underbrace{\beta \to \phi_k}_{Dirichlet} \underbrace{\to w_k}_{Multinomial}$:
$$p(w \mid z, \beta) = \prod_{k=1}^{K} p(w_k \mid z_k, \beta) = \prod_{k=1}^{K} \frac{\Delta(n_k + \beta)}{\Delta(\beta)},$$
where $n_k = (n_k^{(1)}, n_k^{(2)}, \cdots, n_k^{(V)})$, and $n_k^{(v)}$ is the number of words generated by topic k.
For $\underbrace{\alpha \to \theta_m}_{Dirichlet} \underbrace{\to z_m}_{Multinomial}$:
$$p(z \mid \alpha) = \prod_{m=1}^{M} p(z_m \mid \alpha) = \prod_{m=1}^{M} \frac{\Delta(n_m + \alpha)}{\Delta(\alpha)},$$
where $n_m = (n_m^{(1)}, n_m^{(2)}, \cdots, n_m^{(K)})$, and $n_m^{(k)}$ is the number of words with topic k in the m-th document.
$$p(z \mid \alpha) = \int p(z, \theta \mid \alpha)\,d\theta = \int p(z \mid \theta, \alpha)\,p(\theta \mid \alpha)\,d\theta$$
$$= \int p(z \mid \theta)\,p(\theta \mid \alpha)\,d\theta = \int \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_{m,k}^{n_{m,k}} \cdot \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{m,k}^{\alpha_k - 1}\,d\theta$$
$$= \int \prod_{m=1}^{M} \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{m,k}^{n_{m,k} + \alpha_k - 1}\,d\theta$$
$$= \prod_{m=1}^{M} \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \int \prod_{k=1}^{K} \theta_{m,k}^{n_{m,k} + \alpha_k - 1}\,d\theta_m$$
$$= \prod_{m=1}^{M} \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \cdot \frac{\prod_{k=1}^{K} \Gamma(\alpha_k + n_{m,k})}{\Gamma(\sum_{k=1}^{K} (\alpha_k + n_{m,k}))} = \prod_{m=1}^{M} \frac{\Delta(n_m + \alpha)}{\Delta(\alpha)}$$
$$p(w \mid z, \beta) = \int p(w, \phi \mid z, \beta)\,d\phi = \int p(w \mid z, \beta, \phi)\,p(\phi \mid z, \beta)\,d\phi$$
$$= \int p(w \mid z, \phi)\,p(\phi \mid \beta)\,d\phi$$
$$= \int \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{n_{k,v}} \cdot \frac{\Gamma(\sum_{v=1}^{V} \beta_v)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1}\,d\phi$$
$$= \int \prod_{k=1}^{K} \frac{\Gamma(\sum_{v=1}^{V} \beta_v)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \prod_{v=1}^{V} \phi_{k,v}^{n_{k,v} + \beta_v - 1}\,d\phi$$
$$= \prod_{k=1}^{K} \frac{\Gamma(\sum_{v=1}^{V} \beta_v)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \cdot \frac{\prod_{v=1}^{V} \Gamma(\beta_v + n_{k,v})}{\Gamma(\sum_{v=1}^{V} (\beta_v + n_{k,v}))} = \prod_{k=1}^{K} \frac{\Delta(n_k + \beta)}{\Delta(\beta)}$$
Gibbs Sampling
Analysis
For simplicity, the topic of the i-th word in the corpus is denoted z_i, where i = (m, n). In terms of Gibbs sampling, we need to compute the conditional probability p(z_i = k | z_{−i}, w):
$$p(z_i = k \mid z_{-i}, w) = p(z_i = k \mid z_{-i}, w_{-i}, w_i = t) = \frac{p(z_i = k, w_i = t \mid z_{-i}, w_{-i})}{p(w_i = t \mid z_{-i}, w_{-i})} \propto p(z_i = k, w_i = t \mid z_{-i}, w_{-i}).$$
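A sketch of collapsed Gibbs sweeps built on this conditional. The update weight used below, (n_{k,t} + β)/(n_k + Vβ) · (n_{m,k} + α), is the standard LDA form that the proportionality above leads to (not derived on this slide); the corpus and hyper-parameters are toy values:

```python
import random

random.seed(9)

K, V = 2, 5
alpha, beta = 0.5, 0.1
docs = [[0, 1, 1, 2], [2, 3, 4, 4]]   # toy corpus: word ids per document

# Initialize topics uniformly and build the count tables.
z = [[random.randrange(K) for _ in d] for d in docs]
n_kt = [[0] * V for _ in range(K)]    # topic-word counts
n_mk = [[0] * K for _ in docs]        # document-topic counts
n_k = [0] * K                         # total words per topic
for m, d in enumerate(docs):
    for i, t in enumerate(d):
        k = z[m][i]
        n_kt[k][t] += 1; n_mk[m][k] += 1; n_k[k] += 1

for _ in range(50):                   # Gibbs sweeps
    for m, d in enumerate(docs):
        for i, t in enumerate(d):
            k = z[m][i]               # remove word i's current assignment
            n_kt[k][t] -= 1; n_mk[m][k] -= 1; n_k[k] -= 1
            weights = [(n_kt[j][t] + beta) / (n_k[j] + V * beta)
                       * (n_mk[m][j] + alpha)
                       for j in range(K)]   # p(z_i = j | z_-i, w), unnormalized
            k = random.choices(range(K), weights=weights, k=1)[0]
            z[m][i] = k               # reassign and restore the counts
            n_kt[k][t] += 1; n_mk[m][k] += 1; n_k[k] += 1
```

Because the conditional only needs counts with word i removed, each update is O(K) and the chain never touches the integrals derived above directly.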
Take-home messages