
Algorithm Foundations of Data Science

Lecture 2: Sampling

MING GAO

DaSE@ECNU
mgao@dase.ecnu.edu.cn (for course-related communications)

Mar. 14, 2018


Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Monte Carlo Method

MC methods are a class of computational algorithms that rely on repeated random sampling to obtain numerical results.
1 An early variant can be seen in Buffon's needle experiment;
2 It was central to the simulations required for the Manhattan Project;
3 The founders of the MC method were Stanislaw Marcin Ulam, Enrico Fermi, John von Neumann, and Nicholas Metropolis.

Major components of MC methods
1 Define a domain of possible inputs;
2 Generate inputs randomly from a pdf over the domain;
3 Perform a deterministic computation on the inputs;
4 Aggregate the results.

Example I

Algorithm:
Step i: Randomly and uniformly generate a point P_i inside the sample space Ω = {(x, y) | 0 ≤ x, y ≤ 1}.
Let S = {(x, y) : x² + y² ≤ 1 ∧ x, y ≥ 0} be the quarter-circle region. For each P_i, define the indicators I_S(P_i) and I_{Ω−S}(P_i). Then

    π/4 ≈ Σ_{i=1}^n I_S(P_i) / (Σ_{i=1}^n I_S(P_i) + Σ_{i=1}^n I_{Ω−S}(P_i)).

Question: How accurate is this probabilistic algorithm?

We cannot answer the question at this moment; we will be able to once we learn the expectation of r.v.s (coming soon).
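
A minimal sketch of this estimator in Python (the function name and sample count are our own illustrative choices, not from the slides):

    import random

    def estimate_pi(n=1_000_000):
        # Count points falling inside the quarter circle x^2 + y^2 <= 1.
        inside = 0
        for _ in range(n):
            x, y = random.random(), random.random()  # uniform on [0, 1]^2
            if x * x + y * y <= 1.0:
                inside += 1
        # inside / n estimates pi/4, so scale by 4.
        return 4.0 * inside / n

    print(estimate_pi())  # e.g., about 3.141 for n = 10^6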

Sample with discrete distribution

How do we sample from the discrete distribution (0.1, 0.2, 0.3, 0.4)?

CDF sample: draw u ∼ U(0, 1) and locate the first index whose cumulative probability exceeds u.
Alias sample: preprocess the distribution into an alias table; each draw then needs only one uniform draw and one table lookup.

The per-sample cost is O(log n) for CDF sampling (binary search over the CDF), and O(1) for alias sampling.
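
A sketch of CDF sampling in Python (names are our own; the alias method additionally needs a short table-building preprocessing step not shown here):

    import bisect, itertools, random

    probs = [0.1, 0.2, 0.3, 0.4]
    cdf = list(itertools.accumulate(probs))  # [0.1, 0.3, 0.6, 1.0]

    def sample_cdf():
        # Binary search for the first index with cdf[i] > u: O(log n) per draw.
        u = random.random()
        return bisect.bisect_right(cdf, u)

    counts = [0] * 4
    for _ in range(100_000):
        counts[sample_cdf()] += 1
    print(counts)  # roughly proportional to [0.1, 0.2, 0.3, 0.4]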

Example II: approximating probabilities

In many applications, the probability P(Y) of an observed event Y must be computed as the sum, over very many latent variables X, of the joint probability P(Y, X). That is,

    P(Y = y) = Σ_{x∈X} P(Y = y, X = x) = Σ_{x∈X} P(Y = y|X = x) P(X = x).

The term following the last equals sign is the sum over all x of a function of x, weighted by the marginal probabilities P(X = x). Clearly this is an expectation, and therefore may be approximated by Monte Carlo: drawing x_1, ⋯, x_n from P(X) gives us

    P(Y = y) ≈ (1/n) Σ_{i=1}^n P(Y = y|X = x_i).

Example III: approximating the integral ∫₀¹ x² dx

1 Draw a square, then inscribe a parabola within it;
2 Uniformly scatter objects of uniform size over the square;
3 Count the number of objects inside the parabola and the total number of objects;
4 The ratio of the two counts (here 0.3328) is an estimate of ∫₀¹ x² dx.

For a general integral ∫_a^b f(x)dx, it is hard to find a rectangle that bounds the value of f(x), especially for a high-dimensional function. Alternatively, we compute ∫_a^b (f(x)/p(x)) p(x) dx.

Example IV: approximating the expectation of f(x)

Computing approximate integrals of the form ∫ f(x)p(x)dx, i.e., computing the expectation of f(x) under the density p(x).
1 Let {x_i} be an i.i.d. random sample drawn from p(x);
2 The strong law of large numbers says:

    (1/N) Σ_{i=1}^N f(x_i) −→ ∫ f(x)p(x)dx (a.s.).    (1)

3 The error of the estimate decreases at a rate proportional to 1/√N;
4 Major issues:
  The proportionality constant increases exponentially with the dimension of the integral.
  Another problem is that sampling from complex distributions is not as easy as sampling from the uniform distribution.
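
A minimal numpy sketch of this estimator, using E[x²] under a standard Gaussian (true value 1) as a test case; the example is our own:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000
    x = rng.standard_normal(N)   # i.i.d. draws from p(x) = N(0, 1)
    estimate = np.mean(x ** 2)   # (1/N) * sum f(x_i) with f(x) = x^2
    print(estimate)              # close to the true value E[x^2] = 1
    # The error shrinks like 1/sqrt(N): quadrupling N halves the typical error.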

Rejection sampling: approximating ∫ f(x)p(x)dx

(1/N) Σ_{i=1}^N f(x_i) is difficult to compute since it is hard to draw from p(x). Instead, draw from a density q(x) (e.g., a Gaussian) which can be sampled directly, with k chosen so that k·q(x) ≥ p(x):

1: i ← 0;
2: while i ≠ N do
3:   x^(i) ∼ q(x);
4:   u ∼ U(0, 1);
5:   if u < p(x^(i)) / (k·q(x^(i))) then
6:     accept x^(i);
7:     i ← i + 1;
8:   else
9:     reject x^(i);
10:  end if
11: end while

What is the average acceptance ratio? However, it is hard to find a reasonable q(x) and the value of k.
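
A sketch in Python, using a Beta(2, 5)-shaped target (known only up to normalization) with a uniform proposal; the target, proposal, and k are our own illustrative choices:

    import random

    def p_unnorm(x):
        # Unnormalized target: shape of a Beta(2, 5) density on [0, 1].
        return x * (1 - x) ** 4

    def rejection_sample(n, k=0.1):
        # Proposal q(x) = Uniform(0, 1); k must satisfy k * q(x) >= p_unnorm(x).
        samples = []
        while len(samples) < n:
            x = random.random()          # x ~ q
            u = random.random()          # u ~ U(0, 1)
            if u < p_unnorm(x) / k:      # accept with prob p(x) / (k * q(x))
                samples.append(x)
        return samples

    print(sum(rejection_sample(10_000)) / 10_000)  # near the Beta(2, 5) mean 2/7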

Importance sampling: approximating I(f) = ∫ f(x)p(x)dx

If we have a density q(x) (the proposal distribution) which is easy to sample from, we can sample x^(i) ∼ q(x). We define the importance weight as

    w(x^(i)) = p(x^(i)) / q(x^(i)).

Consider the weighted Monte Carlo sum:

    (1/N) Σ_{i=1}^N f(x^(i)) w(x^(i)) = (1/N) Σ_{i=1}^N f(x^(i)) p(x^(i))/q(x^(i))
        −→ ∫ f(x) (p(x)/q(x)) q(x)dx = ∫ f(x)p(x)dx (a.s.).
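
A numpy sketch: estimating the mean of a N(3, 1) target while sampling from a N(0, 2²) proposal; the densities and parameters are our own illustrative choices:

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        # Gaussian density, used for both the target p and the proposal q.
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(0)
    N = 100_000
    x = rng.normal(0.0, 2.0, N)                           # x^(i) ~ q = N(0, 2^2)
    w = gauss_pdf(x, 3.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # w(x^(i)) = p / q
    print(np.mean(x * w))   # (1/N) sum f(x_i) w(x_i) with f(x) = x; near 3.0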

Approximating probabilities Cont'd

Going back to Example II with the discrete sum over latent variables X, it is clear that the optimal importance sampling function would be the conditional distribution of X given Y, i.e.,

    P(Y = y) = Σ_{x∈X} [P(Y = y, X = x) / P(X = x|Y = y)] · P(X = x|Y = y).

Note that the right side is a conditional expectation of a function of X. As before, P(X|Y) is not computable.
So one must turn to finding some other distribution, i.e., P*(X), that is close to P(X|Y) but which is more easily sampled from and computed.

Analysis of importance sampling

How to pick q(x)

We can sample from any distribution q(x). In practice, we would like to choose q(x) as close as possible to (proportional to) |f(x)|p(x) to reduce the variance of our estimator.

We have Var_{q(x)}[f(x)w(x)] = E_{q(x)}[f(x)²w(x)²] − I(f)². Furthermore, we have

    E_{q(x)}[f(x)²w(x)²] ≥ (E_{q(x)}[|f(x)|w(x)])² = (∫ |f(x)|p(x)dx)².

The term I(f)² is independent of q(x). So the best q*(x), which makes the variance minimal, is given by

    q*(x) = |f(x)|p(x) / ∫ |f(x)|p(x)dx.

The Main Idea of MCMC

We cannot sample directly from the target distribution p(x) in the integral ∫ f(x)p(x)dx.
  Create a Markov chain whose transition matrix does not depend on the normalization term.
  Make sure the chain has a stationary distribution and that it is equal to the target distribution.
  After a sufficient number of iterations, the chain will converge to the stationary distribution.

Markov Chain Monte Carlo

Overview
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its stationary distribution.
  The algorithm was proposed in 1953 and is listed among the top-10 most important algorithms of the 20th century.
  MCMC works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution, π(i).
  That is, the Markov chain has stationary distribution π(i) associated with the transition probability matrix P.
  The Markov chain converges to the stationary distribution π(i) for an arbitrary initial state x₀.

Stationary Distribution

Theorem
Let X₀, X₁, ⋯ be an irreducible and aperiodic Markov chain with transition matrix P. Then lim_{n→∞} P^n_{ij} exists and is independent of i; denote it lim_{n→∞} P^n_{ij} = π(j). We also have

    lim_{n→∞} P^n =
        ⎡ π(1) π(2) ⋯ π(j) ⋯ ⎤
        ⎢ π(1) π(2) ⋯ π(j) ⋯ ⎥
        ⎢  ⋯    ⋯   ⋯   ⋯   ⋯ ⎥    (2)
        ⎢ π(1) π(2) ⋯ π(j) ⋯ ⎥
        ⎣  ⋯    ⋯   ⋯   ⋯   ⋯ ⎦

i.e., every row of lim_{n→∞} P^n equals (π(1), π(2), ⋯, π(j), ⋯).

    π(j) = Σ_{i=1}^∞ π(i)P_{ij}, and Σ_{i=1}^∞ π(i) = 1.

π is the unique non-negative solution of the equation πP = π.

Detailed Balance Condition

Theorem
Let X₀, X₁, ⋯ be an aperiodic Markov chain with transition matrix P and distribution π. If the following condition holds,

    π(i)P_{ij} = π(j)P_{ji}, for all i, j    (3)

then π(x) is the stationary distribution of the Markov chain. The above equation is called the detailed balance condition.

Proof: Σ_{i=1}^∞ π(i)P_{ij} = Σ_{i=1}^∞ π(j)P_{ji} = π(j) Σ_{i=1}^∞ P_{ji} = π(j) ⇒ πP = π.

In general, π(i)P_{ij} ≠ π(j)P_{ji}. That is, π(i) may not be the stationary distribution.
The natural question is how to revise the Markov chain so that π becomes a stationary distribution. For example, we introduce a function α(i, j) s.t. π(i)P_{ij}α(i, j) = π(j)P_{ji}α(j, i).

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Revising the Markov Chain

Choosing a reasonable parameter

How to choose α(i, j) such that π(i)P_{ij}α(i, j) = π(j)P_{ji}α(j, i)?
In terms of symmetry, we simply choose α(i, j) = π(j)P_{ji} and α(j, i) = π(i)P_{ij}.
Therefore, we have π(i)Q_{ij} = π(j)Q_{ji}, where Q_{ij} = P_{ij}α(i, j) and Q_{ji} = P_{ji}α(j, i).
The transition matrix:

    Q_{ij} = P_{ij}α(i, j), if j ≠ i;
    Q_{ii} = P_{ii} + Σ_{k≠i} P_{ik}(1 − α(i, k)), otherwise.

Accept a proposed move with probability α(i, j); otherwise stay in the current location.

MCMC Sampling Algorithm

Let P(x₂|x₁) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to N − 1 do
2:   sample u ∼ U[0, 1];
3:   sample x ∼ P(x|x^(i));
4:   if u < α(x, x^(i)) = π(x)P(x^(i)|x),
5:   then x^(i+1) = x;
6:   else reject x, and x^(i+1) = x^(i);
7:   endif
8: endfor
9: output last N samples;

Observation
Let α(i, j) = 0.1 and α(j, i) = 0.2 satisfy the detailed balance condition; thus we have

    π(i)P_{ij} · 0.1 = π(j)P_{ji} · 0.2.

The small value of α(i, j) results in a high rejection ratio. Scaling both sides by the same factor preserves detailed balance while reducing rejections, so we modify the equation as follows:

    π(i)P_{ij} · 0.5 = π(j)P_{ji} · 1.

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Metropolis-Hastings Algorithm

Let P(x₂|x₁) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to max do
2:   sample u ∼ U[0, 1];
3:   sample x ∼ P(x|x^(i));
4:   if u < min{1, π(x)P(x^(i)|x) / (π(x^(i))P(x|x^(i)))},
5:   then x^(i+1) = x;
6:   else reject x, and x^(i+1) = x^(i);
7:   endif
8: endfor
9: output last N samples;

Observation
If we let

    α(i, j) = min{1, π(x)P(x^(i)|x) / (π(x^(i))P(x|x^(i)))},

we can get a high acceptance ratio, and further improve the algorithm's efficiency.
However, for a high-dimensional P, the Metropolis-Hastings algorithm may be inefficient because α < 1. Is there a way to find a transition matrix with acceptance ratio α = 1?
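
A minimal sketch of Metropolis-Hastings in Python with a symmetric Gaussian random-walk proposal (so the proposal terms cancel); the target and step size are our own illustrative choices:

    import random, math

    def target(x):
        # Unnormalized target pi(x): a mixture of two Gaussian bumps.
        return math.exp(-0.5 * (x - 2) ** 2) + math.exp(-0.5 * (x + 2) ** 2)

    def metropolis_hastings(n_steps, step=1.0):
        x = 0.0                                   # initialize x^(0)
        samples = []
        for _ in range(n_steps):
            x_new = x + random.gauss(0.0, step)   # symmetric proposal
            # Accept with prob min{1, pi(x_new)/pi(x)}; proposal terms cancel.
            if random.random() < min(1.0, target(x_new) / target(x)):
                x = x_new                         # accept
            samples.append(x)                     # else keep the current state
        return samples

    s = metropolis_hastings(50_000)
    print(sum(s) / len(s))   # near 0 for this symmetric target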

Properties of MCMC

Trade-off between mixing rate and acceptance ratio, where
  Acceptance ratio = E[α(x_i, x)];
  Mixing rate = the rate at which the chain moves around the distribution.
We can have multiple transition matrices P_i (i.e., proposal distributions), and apply them in turn.

Observation
For example:
  Sample: x^(t+1) | x^(t) ∼ N(0.5x^(t), 1.0);
  Convergence: x^(t) | x^(0) ∼ N(0, 1.33), t → +∞.
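
A quick numerical check of this observation (the stationary variance of the chain x^(t+1) = 0.5x^(t) + ε, ε ∼ N(0, 1), is 1/(1 − 0.5²) = 4/3 ≈ 1.33); the script is our own:

    import random, statistics

    x, xs = 0.0, []
    for t in range(100_000):
        x = 0.5 * x + random.gauss(0.0, 1.0)   # x^(t+1) ~ N(0.5 x^(t), 1.0)
        if t > 1_000:                          # discard burn-in
            xs.append(x)
    print(statistics.mean(xs), statistics.variance(xs))  # approx 0.0 and 1.33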

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Intuition

Example: two-dimensional case

Let P(x, y) be a two-dimensional probability distribution, and consider two points A(x₁, y₁) and B(x₁, y₂). We have

    P(x₁, y₁)P(y₂|x₁) = P(x₁)P(y₁|x₁)P(y₂|x₁)    (4)
    P(x₁, y₂)P(y₁|x₁) = P(x₁)P(y₂|x₁)P(y₁|x₁)    (5)

That is,

    P(x₁, y₁)P(y₂|x₁) = P(x₁, y₂)P(y₁|x₁)    (6)

i.e.,

    P(A)P(y₂|x₁) = P(B)P(y₁|x₁)    (7)

Intuition Cont'd

If p(y|x_i) is considered as the transition probability between two points whose x-coordinates both equal x_i, then transitions between these two points satisfy the detailed balance condition, i.e., P(A)P(y₂|x₁) = P(B)P(y₁|x₁) holds.

Transition matrix
The transition probabilities between two points A and B are given by T(A → B):

    T(A → B) = p(y_B|x₁), if x_A = x_B = x₁;
               p(x_B|y₁), if y_A = y_B = y₁;
               0, otherwise.

It is easy to confirm that the detailed balance condition holds, i.e., p(A)T(A → B) = p(B)T(B → A).

Multivariate case

For the multivariate case
Let P(x_i|x_{−i}) = P(x_i|x₁, ⋯, x_{i−1}, x_{i+1}, ⋯, x_n). The transition probabilities are given by T(x → x′) = P(x′_i|x_{−i}). Then we have:

    T(x → x′)p(x) = P(x′_i|x_{−i})P(x_i|x_{−i})P(x_{−i})
    T(x′ → x)p(x′) = P(x_i|x′_{−i})P(x′_i|x′_{−i})P(x′_{−i})

Note that x_{−i} = x′_{−i}. That is,

    T(x → x′)p(x) = T(x′ → x)p(x′).    (8)

Therefore, the detailed balance condition also holds.
Gibbs sampling is feasible if it is easy to sample from the conditional probability distributions.

Gibbs sampling algorithm (proposal distribution P(x_i|x_{−i}))

0: initialize x₁, ⋯, x_n;
1: for τ = 0 to max do
2:   sample x₁^{τ+1} ∼ P(x₁|x₂^τ, x₃^τ, ⋯, x_n^τ);
3:   ⋯;
4:   sample x_j^{τ+1} ∼ P(x_j|x₁^{τ+1}, ⋯, x_{j−1}^{τ+1}, x_{j+1}^τ, ⋯, x_n^τ);
5:   ⋯;
6:   sample x_n^{τ+1} ∼ P(x_n|x₁^{τ+1}, x₂^{τ+1}, ⋯, x_{n−1}^{τ+1});
7: output last N samples;

Gibbs sampling is a type of random walk through parameter space, and can be considered as a Metropolis-Hastings algorithm with a special proposal distribution.
At each iteration, we draw from conditional posterior probabilities. This means that the proposal move is always accepted. Hence, if we can draw samples from the conditional distributions, Gibbs sampling can be much more efficient than regular Metropolis-Hastings.
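
A sketch of Gibbs sampling for a bivariate Gaussian with correlation ρ, where both conditionals are one-dimensional Gaussians; the target and ρ are our own illustrative choices:

    import random

    def gibbs_bivariate_gaussian(n_steps, rho=0.8):
        # Target: (x, y) ~ N(0, [[1, rho], [rho, 1]]).
        # Conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y.
        x, y = 0.0, 0.0
        sd = (1 - rho ** 2) ** 0.5
        samples = []
        for _ in range(n_steps):
            x = random.gauss(rho * y, sd)   # sample x^{tau+1} ~ P(x | y^tau)
            y = random.gauss(rho * x, sd)   # sample y^{tau+1} ~ P(y | x^{tau+1})
            samples.append((x, y))
        return samples

    s = gibbs_bivariate_gaussian(100_000)
    print(sum(a * b for a, b in s) / len(s))   # sample E[xy] is close to rho = 0.8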

Properties of Gibbs Sampling

Properties
  No need to tune the proposal distribution;
  Good trade-off between acceptance and mixing: the acceptance ratio is always 1.
  Need to be able to derive the conditional probability distributions.

Acceleration of Gibbs sampling (given p(a, b, c), draw samples of a and c):
  Blocked Gibbs:
  (1) Draw (a, b) given c;
  (2) Draw c given (a, b);
  Collapsed Gibbs (with b marginalized out):
  (1) Draw a given c;
  (2) Draw c given a;
  Marginalize whenever you can.

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Topic modeling for text

Latent Dirichlet Allocation (LDA)

An example article from a corpus: each color codes a different topic.
  There are many models which can be used to represent text data, such as LSA, PLSA, LDA, word2vec, etc.
  LDA models text in a simple and reasonable manner;
  LDA can be applied to many complex applications, such as images, graphs, locations, etc.

Notations for LDA

symbol     meaning
M          the number of documents
N_m        the number of words in document m
K          the number of topics
w_{m,n}    the index of the n-th word in document m
z_{m,n}    the topic assigned to word w_{m,n}
α, β       fixed hyper-parameters
θ          topic distribution for each document
φ          word distribution for each topic

Properties of Dirichlet

    Dir(θ|α) = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K θ_k^{α_k−1} ≡ (1/Δ(α)) Π_{k=1}^K θ_k^{α_k−1}

    Mult(m₁, ⋯, m_K|θ, N) = (N choose m₁ m₂ ⋯ m_K) Π_{k=1}^K θ_k^{m_k}

    Dir(θ|D, α) = Dir(θ|α + m) = [Γ(Σ_{k=1}^K α_k + N) / Π_{k=1}^K Γ(α_k + m_k)] Π_{k=1}^K θ_k^{α_k+m_k−1}

The expectation of the Dirichlet is E(θ) = (α₁/α₀, α₂/α₀, ⋯, α_K/α₀), where α₀ = Σ_{k=1}^K α_k.

LDA: Latent Dirichlet Allocation

LDA assumes the following generative process for each document m in a corpus D:

1: for k = 1 to K do
2:   φ^(k) ∼ Dirichlet(β);
3: for each document m ∈ D
4:   θ_m ∼ Dirichlet(α);
5:   for each word w_{m,n} ∈ m
6:     z_{m,n} ∼ Mult(θ_m);
7:     w_{m,n} ∼ Mult(φ^{(z_{m,n})});

where φ^(k) ∈ R^{|V|} and θ_m ∈ R^K.
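
A sketch of this generative process with numpy (vocabulary size, document lengths, and hyper-parameters are our own toy choices):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, M, N_m = 3, 20, 5, 50          # topics, vocab size, docs, words/doc
    alpha, beta = 0.5, 0.1               # symmetric hyper-parameters

    phi = rng.dirichlet(np.full(V, beta), size=K)      # phi^(k): topic-word dists
    docs = []
    for m in range(M):
        theta_m = rng.dirichlet(np.full(K, alpha))     # theta_m: doc-topic dist
        z = rng.choice(K, size=N_m, p=theta_m)         # z_{m,n} ~ Mult(theta_m)
        w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_{m,n} ~ Mult(phi^(z))
        docs.append(w)
    print(docs[0][:10])                  # first 10 word indices of document 0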

Joint probability of LDA model

The joint probability of observing a word w_{m,n} is

    p(w_{m,n}, z_{m,n}, φ, θ_m|α, β) = p(w_{m,n}|z_{m,n}, φ)p(z_{m,n}|θ_m)p(φ|β)p(θ_m|α).

In other words, for the corpus,

    p(w, z, φ, θ|α, β) = Π_{m=1}^M Π_{n=1}^{N_m} p(w_{m,n}, z_{m,n}, φ, θ_m|α, β)
        = p(φ|β) Π_{m=1}^M p(θ_m|α) Π_{n=1}^{N_m} p(w_{m,n}|z_{m,n}, φ)p(z_{m,n}|θ_m).

LDA Model II

1: for k = 1 to K do
2:   φ^(k) ∼ Dirichlet(β);
3: for each document m ∈ D
4:   θ_m ∼ Dirichlet(α);
5:   for each word w_{m,n} ∈ m
6:     z_{m,n} ∼ Mult(θ_m);
7: for each topic k ∈ [1, K]
8:   for each z_{m,n} = k
9:     w_{m,n} ∼ Mult(φ^(k));

We put the words with the same topic together. We have

    z = (z₁, z₂, ⋯, z_K),  w = (w₁, w₂, ⋯, w_K),

where w_k is the set of words generated by the k-th topic, and z_k is a vector whose terms are the IDs (k) of the word topics.
Now we have two conjugate Dirichlet-Multinomial structures:

    α −→ θ_m −→ z_m, and β −→ φ_k −→ w_k,    (9)

where α → θ_m and β → φ_k are Dirichlet steps, and θ_m → z_m and φ_k → w_k are Multinomial steps.

Dice Toss Toy Example

Suppose we have a die of K sides. We toss the die, and the probability of landing on side k is p(t = k|f) = f_k. We throw the die N times and obtain a set of results s = {s₁, s₂, ⋯, s_N}. The joint probability is

    p(s|f) = Π_{n=1}^N p(s_n|f) = f₁^{n₁} f₂^{n₂} ⋯ f_K^{n_K} = Π_{i=1}^K f_i^{n_i}    (10)

Suppose that f follows a Dirichlet distribution with hyper-parameter α. Then we express the probability of f as

    Dir(f|α) = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K f_k^{α_k−1}    (11)

Example Cont'd

If we want to estimate the parameter f based on the observation of s, then we can express f in the following manner:

    p(f|s, α) = p(s|f, α)p(f|α) / ∫ p(s|f, α)p(f|α)df
              = Π_{k=1}^K f_k^{n_k+α_k−1} / ∫ Π_{k=1}^K f_k^{n_k+α_k−1} df    (the Γ normalizers cancel)
              = [Γ(Σ_{k=1}^K (n_k + α_k)) / Π_{k=1}^K Γ(n_k + α_k)] Π_{k=1}^K f_k^{n_k+α_k−1}

Notice that after estimating f based on the observations s, f still follows a Dirichlet distribution, now with parameter α + n, where n = (n₁, n₂, ⋯, n_K). This property is known as conjugate priors. Based on this property, estimating the parameters f_i after observing N trials is a simple counting procedure.

Estimating f_i

Suppose we want to obtain f_i from f = (f₁, f₂, ⋯, f_{i−1}, f_i, f_{i+1}, ⋯, f_K):

    E(f_i|s, α) = ∫ f_i p(f|s, α)df
        = [Γ(Σ_{k=1}^K (n_k + α_k)) / Π_{k=1}^K Γ(n_k + α_k)] ∫ f_i Π_{k=1}^K f_k^{n_k+α_k−1} df
        = [Γ(Σ_{k=1}^K (n_k + α_k)) / Π_{k=1}^K Γ(n_k + α_k)] · [Γ(n_i + α_i + 1) Π_{k≠i} Γ(n_k + α_k) / Γ(1 + Σ_{k=1}^K (n_k + α_k))]
        = (n_i + α_i) / Σ_{k=1}^K (n_k + α_k)
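
A quick numpy check of this counting rule, comparing the closed-form posterior mean (n_i + α_i)/Σ_k(n_k + α_k) with the empirical mean of posterior draws; the toy numbers are our own:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([1.0, 1.0, 1.0])       # Dirichlet prior over K = 3 sides
    n = np.array([12, 30, 58])              # observed counts
    posterior_mean = (n + alpha) / (n + alpha).sum()   # (n_i + a_i) / sum_k(n_k + a_k)
    draws = rng.dirichlet(n + alpha, size=100_000)     # posterior is Dir(alpha + n)
    print(posterior_mean)                   # e.g., [0.126, 0.301, 0.573]
    print(draws.mean(axis=0))               # matches the closed form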

Likelihood of Observing s

Suppose we want to obtain the likelihood of the observations s, i.e., p(s|α):

    p(s|α) = ∫ p(s, f|α)df = ∫ p(s|f)p(f|α)df
           = ∫ [Π_{i=1}^K f_i^{n_i}] · [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K f_k^{α_k−1} df
           = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] ∫ Π_{k=1}^K f_k^{n_k+α_k−1} df
           = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] · [Π_{k=1}^K Γ(n_k + α_k) / Γ(Σ_{k=1}^K (n_k + α_k))]
           = Δ(n + α) / Δ(α),

where Δ(α) = Π_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k).

Parameter Inference

We integrate out θ and φ to obtain the following:

    p(z, w|α, β) = p(w|z, β)p(z|α)

For the conjugate structure β → φ_k → w_k:

    p(w|z, β) = Π_{k=1}^K p(w_k|z_k, β) = Π_{k=1}^K Δ(n_k + β)/Δ(β),

where n_k = (n_k^(1), n_k^(2), ⋯, n_k^(V)), and n_k^(v) is the number of times word v is generated by topic k.

For the conjugate structure α → θ_m → z_m:

    p(z|α) = Π_{m=1}^M p(z_m|α) = Π_{m=1}^M Δ(n_m + α)/Δ(α),

where n_m = (n_m^(1), n_m^(2), ⋯, n_m^(K)), and n_m^(k) is the number of words with topic k in the m-th document.

Parameter Inference Cont'd

p(z|α)

    p(z|α) = ∫ p(z, θ|α)dθ = ∫ p(z|θ, α)p(θ|α)dθ = ∫ p(z|θ)p(θ|α)dθ
        = ∫ Π_{m=1}^M Π_{k=1}^K θ_{m,k}^{n_{m,k}} · [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K θ_{m,k}^{α_k−1} dθ
        = Π_{m=1}^M [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] ∫ Π_{k=1}^K θ_{m,k}^{n_{m,k}+α_k−1} dθ_m
        = Π_{m=1}^M [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] · [Π_{k=1}^K Γ(α_k + n_{m,k}) / Γ(Σ_{k=1}^K (α_k + n_{m,k}))]
        = Π_{m=1}^M Δ(n_m + α)/Δ(α)

Parameter Inference Cont'd

p(w|z, β)

    p(w|z, β) = ∫ p(w, φ|z, β)dφ = ∫ p(w|z, β, φ)p(φ|z, β)dφ = ∫ p(w|z, φ)p(φ|β)dφ
        = ∫ Π_{k=1}^K Π_{v=1}^V φ_{k,v}^{n_{k,v}} · [Γ(Σ_{v=1}^V β_v) / Π_{v=1}^V Γ(β_v)] Π_{v=1}^V φ_{k,v}^{β_v−1} dφ
        = Π_{k=1}^K [Γ(Σ_{v=1}^V β_v) / Π_{v=1}^V Γ(β_v)] ∫ Π_{v=1}^V φ_{k,v}^{n_{k,v}+β_v−1} dφ_k
        = Π_{k=1}^K [Γ(Σ_{v=1}^V β_v) / Π_{v=1}^V Γ(β_v)] · [Π_{v=1}^V Γ(β_v + n_{k,v}) / Γ(Σ_{v=1}^V (β_v + n_{k,v}))]
        = Π_{k=1}^K Δ(n_k + β)/Δ(β)

Gibbs Sampling

Analysis

For simplicity, denote the topic of the i-th word in the corpus by z_i, where i = (m, n). In terms of Gibbs sampling, we need to compute the conditional probability p(z_i = k|z_{−i}, w):

    p(z_i = k|z_{−i}, w) = p(z_i = k|z_{−i}, w_{−i}, w_i = t)
        = p(z_i = k, w_i = t|z_{−i}, w_{−i}) / p(w_i = t|z_{−i}, w_{−i}) ∝ p(z_i = k, w_i = t|z_{−i}, w_{−i}).

Notice that z_i = k, w_i = t only involves the m-th document and the k-th topic, which are related to two Dirichlet-Multinomial (DM) structures, and is independent of the other M + K − 2 DM structures:

    p(θ_m|z_{−i}, w_{−i}) = Dir(θ_m|n_{m,−i} + α)
    p(φ_k|z_{−i}, w_{−i}) = Dir(φ_k|n_{k,−i} + β)

Deriving the Transition Probability

Transition probability

    p(z_i = k, w_i = t|z_{−i}, w_{−i}) = ∫ p(z_i = k, w_i = t, θ_m, φ_k|z_{−i}, w_{−i}) dθ_m dφ_k
        = ∫ p(z_i = k, θ_m|z_{−i}, w_{−i}) p(w_i = t, φ_k|z_{−i}, w_{−i}) dθ_m dφ_k
        = ∫ p(z_i = k|θ_m) Dir(θ_m|n_{m,−i} + α) dθ_m · ∫ p(w_i = t|φ_k) Dir(φ_k|n_{k,−i} + β) dφ_k
        = ∫ θ_{mk} Dir(θ_m|n_{m,−i} + α) dθ_m · ∫ φ_{kt} Dir(φ_k|n_{k,−i} + β) dφ_k
        = E(θ_{mk}) E(φ_{kt}) = θ̂_{mk} φ̂_{kt},

where

    θ̂_{mk} = (n_{m,−i}^(k) + α_k) / Σ_{k=1}^K (n_{m,−i}^(k) + α_k), and
    φ̂_{kt} = (n_{k,−i}^(t) + β_t) / Σ_{t=1}^V (n_{k,−i}^(t) + β_t).
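
A compact sketch of collapsed Gibbs sampling for LDA built directly on this transition probability (the toy corpus, hyper-parameters, and variable names are our own choices):

    import numpy as np

    rng = np.random.default_rng(0)
    docs = [[0, 1, 2, 1, 0], [3, 4, 3, 4, 5], [0, 1, 5, 3, 2]]  # toy word IDs
    K, V = 2, 6
    alpha, beta = 0.5, 0.1

    # Count matrices: n_mk (doc-topic), n_kv (topic-word), n_k (topic totals).
    n_mk = np.zeros((len(docs), K))
    n_kv = np.zeros((K, V))
    n_k = np.zeros(K)
    z = []  # current topic assignment of every word
    for m, doc in enumerate(docs):
        zm = rng.integers(K, size=len(doc))  # random initialization
        z.append(zm)
        for w, k in zip(doc, zm):
            n_mk[m, k] += 1; n_kv[k, w] += 1; n_k[k] += 1

    for it in range(200):                    # Gibbs sweeps
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]
                # Remove word i = (m, n) from the counts: the "-i" statistics.
                n_mk[m, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # p(z_i = k|z_-i, w) ∝ (n_mk + alpha) * (n_kv + beta)/(n_k + V*beta);
                # the denominator of theta-hat is constant in k and cancels.
                probs = (n_mk[m] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=probs / probs.sum())
                z[m][n] = k
                n_mk[m, k] += 1; n_kv[k, w] += 1; n_k[k] += 1

    print(z)  # per-word topic assignments after sampling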

Take-home messages

Monte Carlo method


Markov Chain Monte Carlo
MCMC sampling algorithm
Metropolis-Hastings algorithm
Gibbs sampling
Latent Dirichlet Allocation

