
Algorithm Foundations of Data Science

Lecture 2: Sampling

MING GAO

DaSE@ECNU
mgao@dase.ecnu.edu.cn (for course-related communications)

Mar. 14, 2018


Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Monte Carlo Method

MC methods are a class of computational algorithms that rely on repeated random sampling to obtain numerical results.
1 An early variant can be seen in Buffon's needle experiment;
2 It was central to the simulations required for the Manhattan Project;
3 The founders of the MC method were Stanislaw Marcin Ulam, Enrico Fermi, John von Neumann, and Nicholas Metropolis.

Major components of MC methods
1 Define a domain of possible inputs;
2 Generate inputs randomly from a pdf over the domain;
3 Perform a deterministic computation on the inputs;
4 Aggregate the results.

Example I

Algorithm:
Step i: Randomly and uniformly generate a point P_i inside the sample space Ω = {(x, y) | 0 ≤ x, y ≤ 1}.
Let S = {(x, y) : x² + y² ≤ 1 ∧ x, y ≥ 0} be the quarter-circle region. For each P_i, define the indicators I_S(P_i) and I_{Ω−S}(P_i). Then

    π/4 ≈ Σ_{i=1}^n I_S(P_i) / (Σ_{i=1}^n I_S(P_i) + Σ_{i=1}^n I_{Ω−S}(P_i)).

Question: How accurate is this probabilistic algorithm?

We cannot answer the question at this moment; we will be able to once we learn the expectation of r.v.s (coming soon).
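
A minimal sketch of this estimator in Python (the function name and sample count are our own illustrative choices, not from the slides):

    import random

    def estimate_pi(n=1_000_000):
        # Count points falling inside the quarter circle x^2 + y^2 <= 1.
        inside = 0
        for _ in range(n):
            x, y = random.random(), random.random()  # uniform on [0, 1]^2
            if x * x + y * y <= 1.0:
                inside += 1
        # inside / n estimates pi/4, so scale by 4.
        return 4.0 * inside / n

    print(estimate_pi())  # e.g., about 3.141 for n = 10^6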

Sample with discrete distribution

How do we sample from the discrete distribution (0.1, 0.2, 0.3, 0.4)?

CDF sample: draw u ∼ U(0, 1) and locate the first index whose cumulative probability exceeds u.
Alias sample: preprocess the distribution into an alias table; each draw then needs only one uniform draw and one table lookup.

The per-sample cost is O(log n) for CDF sampling (binary search over the CDF), and O(1) for alias sampling.
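
A sketch of CDF sampling in Python (names are our own; the alias method additionally needs a short table-building preprocessing step not shown here):

    import bisect, itertools, random

    probs = [0.1, 0.2, 0.3, 0.4]
    cdf = list(itertools.accumulate(probs))  # [0.1, 0.3, 0.6, 1.0]

    def sample_cdf():
        # Binary search for the first index with cdf[i] > u: O(log n) per draw.
        u = random.random()
        return bisect.bisect_right(cdf, u)

    counts = [0] * 4
    for _ in range(100_000):
        counts[sample_cdf()] += 1
    print(counts)  # roughly proportional to [0.1, 0.2, 0.3, 0.4]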

Example II: approximating probabilities

In many applications, the probability P(Y) of an observed event Y must be computed as the sum, over very many latent variables X, of the joint probability P(Y, X). That is,

    P(Y = y) = Σ_{x∈X} P(Y = y, X = x) = Σ_{x∈X} P(Y = y|X = x) P(X = x).

The term following the last equals sign is the sum over all x of a function of x, weighted by the marginal probabilities P(X = x). Clearly this is an expectation, and therefore may be approximated by Monte Carlo: drawing x_1, ⋯, x_n from P(X) gives us

    P(Y = y) ≈ (1/n) Σ_{i=1}^n P(Y = y|X = x_i).

Example III: approximating the integral ∫₀¹ x² dx

1 Draw a square, then inscribe a parabola within it;
2 Uniformly scatter objects of uniform size over the square;
3 Count the number of objects inside the parabola and the total number of objects;
4 The ratio of the two counts (here 0.3328) is an estimate of ∫₀¹ x² dx.

For a general integral ∫_a^b f(x)dx, it is hard to find a rectangle that bounds the value of f(x), especially for a high-dimensional function. Alternatively, we compute ∫_a^b (f(x)/p(x)) p(x) dx.

Example IV: approximating the expectation of f(x)

Computing approximate integrals of the form ∫ f(x)p(x)dx, i.e., computing the expectation of f(x) under the density p(x).
1 Let {x_i} be an i.i.d. random sample drawn from p(x);
2 The strong law of large numbers says:

    (1/N) Σ_{i=1}^N f(x_i) −→ ∫ f(x)p(x)dx (a.s.).    (1)

3 The error of the estimate decreases at a rate proportional to 1/√N;
4 Major issues:
  The proportionality constant increases exponentially with the dimension of the integral.
  Another problem is that sampling from complex distributions is not as easy as sampling from the uniform distribution.
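
A minimal numpy sketch of this estimator, using E[x²] under a standard Gaussian (true value 1) as a test case; the example is our own:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000
    x = rng.standard_normal(N)   # i.i.d. draws from p(x) = N(0, 1)
    estimate = np.mean(x ** 2)   # (1/N) * sum f(x_i) with f(x) = x^2
    print(estimate)              # close to the true value E[x^2] = 1
    # The error shrinks like 1/sqrt(N): quadrupling N halves the typical error.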

Rejection sampling: approximating ∫ f(x)p(x)dx

(1/N) Σ_{i=1}^N f(x_i) is difficult to compute since it is hard to draw from p(x). Instead, draw from a density q(x) (e.g., a Gaussian) which can be sampled directly, with k chosen so that k·q(x) ≥ p(x):

1: i ← 0;
2: while i ≠ N do
3:   x^(i) ∼ q(x);
4:   u ∼ U(0, 1);
5:   if u < p(x^(i)) / (k·q(x^(i))) then
6:     accept x^(i);
7:     i ← i + 1;
8:   else
9:     reject x^(i);
10:  end if
11: end while

What is the average acceptance ratio? However, it is hard to find a reasonable q(x) and the value of k.
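
A sketch in Python, using a Beta(2, 5)-shaped target (known only up to normalization) with a uniform proposal; the target, proposal, and k are our own illustrative choices:

    import random

    def p_unnorm(x):
        # Unnormalized target: shape of a Beta(2, 5) density on [0, 1].
        return x * (1 - x) ** 4

    def rejection_sample(n, k=0.1):
        # Proposal q(x) = Uniform(0, 1); k must satisfy k * q(x) >= p_unnorm(x).
        samples = []
        while len(samples) < n:
            x = random.random()          # x ~ q
            u = random.random()          # u ~ U(0, 1)
            if u < p_unnorm(x) / k:      # accept with prob p(x) / (k * q(x))
                samples.append(x)
        return samples

    print(sum(rejection_sample(10_000)) / 10_000)  # near the Beta(2, 5) mean 2/7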

Importance sampling: approximating I(f) = ∫ f(x)p(x)dx

If we have a density q(x) (the proposal distribution) which is easy to sample from, we can sample x^(i) ∼ q(x). We define the importance weight as

    w(x^(i)) = p(x^(i)) / q(x^(i)).

Consider the weighted Monte Carlo sum:

    (1/N) Σ_{i=1}^N f(x^(i)) w(x^(i)) = (1/N) Σ_{i=1}^N f(x^(i)) p(x^(i))/q(x^(i))
        −→ ∫ f(x) (p(x)/q(x)) q(x)dx = ∫ f(x)p(x)dx (a.s.).
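
A numpy sketch: estimating the mean of a N(3, 1) target while sampling from a N(0, 2²) proposal; the densities and parameters are our own illustrative choices:

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        # Gaussian density, used for both the target p and the proposal q.
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(0)
    N = 100_000
    x = rng.normal(0.0, 2.0, N)                           # x^(i) ~ q = N(0, 2^2)
    w = gauss_pdf(x, 3.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # w(x^(i)) = p / q
    print(np.mean(x * w))   # (1/N) sum f(x_i) w(x_i) with f(x) = x; near 3.0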

Approximating probabilities Cont'd

Going back to Example II with the discrete sum over latent variables X, it is clear that the optimal importance sampling function would be the conditional distribution of X given Y, i.e.,

    P(Y = y) = Σ_{x∈X} [P(Y = y, X = x) / P(X = x|Y = y)] · P(X = x|Y = y).

Note that the right side is a conditional expectation of a function of X. As before, P(X|Y) is not computable.
So one must turn to finding some other distribution, i.e., P*(X), that is close to P(X|Y) but which is more easily sampled from and computed.

Analysis of importance sampling

How to pick q(x)

We can sample from any distribution q(x). In practice, we would like to choose q(x) as close as possible to (proportional to) |f(x)|p(x) to reduce the variance of our estimator.

We have Var_{q(x)}[f(x)w(x)] = E_{q(x)}[f(x)²w(x)²] − I(f)². Furthermore, we have

    E_{q(x)}[f(x)²w(x)²] ≥ (E_{q(x)}[|f(x)|w(x)])² = (∫ |f(x)|p(x)dx)².

The term I(f)² is independent of q(x). So the best q*(x), which makes the variance minimal, is given by

    q*(x) = |f(x)|p(x) / ∫ |f(x)|p(x)dx.

The Main Idea of MCMC

We cannot sample directly from the target distribution p(x) in the integral ∫ f(x)p(x)dx.
  Create a Markov chain whose transition matrix does not depend on the normalization term.
  Make sure the chain has a stationary distribution and that it is equal to the target distribution.
  After a sufficient number of iterations, the chain will converge to the stationary distribution.

Markov Chain Monte Carlo

Overview
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its stationary distribution.
  The algorithm was proposed in 1953 and is listed among the top-10 most important algorithms of the 20th century.
  MCMC works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution, π(i).
  That is, the Markov chain has stationary distribution π(i) associated with the transition probability matrix P.
  The Markov chain converges to the stationary distribution π(i) for an arbitrary initial state x₀.

Stationary Distribution

Theorem
Let X₀, X₁, ⋯ be an irreducible and aperiodic Markov chain with transition matrix P. Then lim_{n→∞} P^n_{ij} exists and is independent of i; denote it lim_{n→∞} P^n_{ij} = π(j). We also have

    lim_{n→∞} P^n =
        ⎡ π(1) π(2) ⋯ π(j) ⋯ ⎤
        ⎢ π(1) π(2) ⋯ π(j) ⋯ ⎥
        ⎢  ⋯    ⋯   ⋯   ⋯   ⋯ ⎥    (2)
        ⎢ π(1) π(2) ⋯ π(j) ⋯ ⎥
        ⎣  ⋯    ⋯   ⋯   ⋯   ⋯ ⎦

i.e., every row of lim_{n→∞} P^n equals (π(1), π(2), ⋯, π(j), ⋯).

    π(j) = Σ_{i=1}^∞ π(i)P_{ij}, and Σ_{i=1}^∞ π(i) = 1.

π is the unique non-negative solution of the equation πP = π.

Detailed Balance Condition

Theorem
Let X₀, X₁, ⋯ be an aperiodic Markov chain with transition matrix P and distribution π. If the following condition holds,

    π(i)P_{ij} = π(j)P_{ji}, for all i, j    (3)

then π(x) is the stationary distribution of the Markov chain. The above equation is called the detailed balance condition.

Proof: Σ_{i=1}^∞ π(i)P_{ij} = Σ_{i=1}^∞ π(j)P_{ji} = π(j) Σ_{i=1}^∞ P_{ji} = π(j) ⇒ πP = π.

In general, π(i)P_{ij} ≠ π(j)P_{ji}. That is, π(i) may not be the stationary distribution.
The natural question is how to revise the Markov chain so that π becomes a stationary distribution. For example, we introduce a function α(i, j) s.t. π(i)P_{ij}α(i, j) = π(j)P_{ji}α(j, i).

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Revising the Markov Chain

Choosing a reasonable parameter

How to choose α(i, j) such that π(i)P_{ij}α(i, j) = π(j)P_{ji}α(j, i)?
In terms of symmetry, we simply choose α(i, j) = π(j)P_{ji} and α(j, i) = π(i)P_{ij}.
Therefore, we have π(i)Q_{ij} = π(j)Q_{ji}, where Q_{ij} = P_{ij}α(i, j) and Q_{ji} = P_{ji}α(j, i).
The transition matrix:

    Q_{ij} = P_{ij}α(i, j), if j ≠ i;
    Q_{ii} = P_{ii} + Σ_{k≠i} P_{ik}(1 − α(i, k)), otherwise.

Accept a proposed move with probability α(i, j); otherwise stay in the current location.

MCMC Sampling Algorithm

Let P(x₂|x₁) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to N − 1 do
2:   sample u ∼ U[0, 1];
3:   sample x ∼ P(x|x^(i));
4:   if u < α(x, x^(i)) = π(x)P(x^(i)|x),
5:   then x^(i+1) = x;
6:   else reject x, and x^(i+1) = x^(i);
7:   endif
8: endfor
9: output last N samples;

Observation
Let α(i, j) = 0.1 and α(j, i) = 0.2 satisfy the detailed balance condition; thus we have

    π(i)P_{ij} · 0.1 = π(j)P_{ji} · 0.2.

The small value of α(i, j) results in a high rejection ratio. Scaling both sides by the same factor preserves detailed balance while reducing rejections, so we modify the equation as follows:

    π(i)P_{ij} · 0.5 = π(j)P_{ji} · 1.

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Metropolis-Hastings Algorithm

Let P(x₂|x₁) be a proposal distribution.

0: initialize x^(0);
1: for i = 0 to max do
2:   sample u ∼ U[0, 1];
3:   sample x ∼ P(x|x^(i));
4:   if u < min{1, π(x)P(x^(i)|x) / (π(x^(i))P(x|x^(i)))},
5:   then x^(i+1) = x;
6:   else reject x, and x^(i+1) = x^(i);
7:   endif
8: endfor
9: output last N samples;

Observation
If we let

    α(i, j) = min{1, π(x)P(x^(i)|x) / (π(x^(i))P(x|x^(i)))},

we can get a high acceptance ratio, and further improve the algorithm's efficiency.
However, for a high-dimensional P, the Metropolis-Hastings algorithm may be inefficient because α < 1. Is there a way to find a transition matrix with acceptance ratio α = 1?
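
A minimal sketch of Metropolis-Hastings in Python with a symmetric Gaussian random-walk proposal (so the proposal terms cancel); the target and step size are our own illustrative choices:

    import random, math

    def target(x):
        # Unnormalized target pi(x): a mixture of two Gaussian bumps.
        return math.exp(-0.5 * (x - 2) ** 2) + math.exp(-0.5 * (x + 2) ** 2)

    def metropolis_hastings(n_steps, step=1.0):
        x = 0.0                                   # initialize x^(0)
        samples = []
        for _ in range(n_steps):
            x_new = x + random.gauss(0.0, step)   # symmetric proposal
            # Accept with prob min{1, pi(x_new)/pi(x)}; proposal terms cancel.
            if random.random() < min(1.0, target(x_new) / target(x)):
                x = x_new                         # accept
            samples.append(x)                     # else keep the current state
        return samples

    s = metropolis_hastings(50_000)
    print(sum(s) / len(s))   # near 0 for this symmetric target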

Properties of MCMC

Trade-off between mixing rate and acceptance ratio, where
  Acceptance ratio = E[α(x_i, x)];
  Mixing rate = the rate at which the chain moves around the distribution.
We can have multiple transition matrices P_i (i.e., proposal distributions), and apply them in turn.

Observation
For example:
  Sample: x^(t+1) | x^(t) ∼ N(0.5x^(t), 1.0);
  Convergence: x^(t) | x^(0) ∼ N(0, 1.33), t → +∞.
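
A quick numerical check of this observation (the stationary variance of the chain x^(t+1) = 0.5x^(t) + ε, ε ∼ N(0, 1), is 1/(1 − 0.5²) = 4/3 ≈ 1.33); the script is our own:

    import random, statistics

    x, xs = 0.0, []
    for t in range(100_000):
        x = 0.5 * x + random.gauss(0.0, 1.0)   # x^(t+1) ~ N(0.5 x^(t), 1.0)
        if t > 1_000:                          # discard burn-in
            xs.append(x)
    print(statistics.mean(xs), statistics.variance(xs))  # approx 0.0 and 1.33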

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Intuition

Example: two-dimensional case

Let P(x, y) be a two-dimensional probability distribution, and consider two points A(x₁, y₁) and B(x₁, y₂). We have

    P(x₁, y₁)P(y₂|x₁) = P(x₁)P(y₁|x₁)P(y₂|x₁)    (4)
    P(x₁, y₂)P(y₁|x₁) = P(x₁)P(y₂|x₁)P(y₁|x₁)    (5)

That is,

    P(x₁, y₁)P(y₂|x₁) = P(x₁, y₂)P(y₁|x₁)    (6)

i.e.,

    P(A)P(y₂|x₁) = P(B)P(y₁|x₁)    (7)

Intuition Cont'd

If p(y|x_i) is considered as the transition probability between two points whose x-coordinates both equal x_i, then transitions between these two points satisfy the detailed balance condition, i.e., P(A)P(y₂|x₁) = P(B)P(y₁|x₁) holds.

Transition matrix
The transition probabilities between two points A and B are given by T(A → B):

    T(A → B) = p(y_B|x₁), if x_A = x_B = x₁;
               p(x_B|y₁), if y_A = y_B = y₁;
               0, otherwise.

It is easy to confirm that the detailed balance condition holds, i.e., p(A)T(A → B) = p(B)T(B → A).

Multivariate case

For the multivariate case
Let P(x_i|x_{−i}) = P(x_i|x₁, ⋯, x_{i−1}, x_{i+1}, ⋯, x_n). The transition probabilities are given by T(x → x′) = P(x′_i|x_{−i}). Then we have:

    T(x → x′)p(x) = P(x′_i|x_{−i})P(x_i|x_{−i})P(x_{−i})
    T(x′ → x)p(x′) = P(x_i|x′_{−i})P(x′_i|x′_{−i})P(x′_{−i})

Note that x_{−i} = x′_{−i}. That is,

    T(x → x′)p(x) = T(x′ → x)p(x′).    (8)

Therefore, the detailed balance condition also holds.
Gibbs sampling is feasible if it is easy to sample from the conditional probability distributions.

Gibbs sampling algorithm (proposal distribution P(x_i|x_{−i}))

0: initialize x₁, ⋯, x_n;
1: for τ = 0 to max do
2:   sample x₁^{τ+1} ∼ P(x₁|x₂^τ, x₃^τ, ⋯, x_n^τ);
3:   ⋯;
4:   sample x_j^{τ+1} ∼ P(x_j|x₁^{τ+1}, ⋯, x_{j−1}^{τ+1}, x_{j+1}^τ, ⋯, x_n^τ);
5:   ⋯;
6:   sample x_n^{τ+1} ∼ P(x_n|x₁^{τ+1}, x₂^{τ+1}, ⋯, x_{n−1}^{τ+1});
7: output last N samples;

Gibbs sampling is a type of random walk through parameter space, and can be considered as a Metropolis-Hastings algorithm with a special proposal distribution.
At each iteration, we draw from conditional posterior probabilities. This means that the proposal move is always accepted. Hence, if we can draw samples from the conditional distributions, Gibbs sampling can be much more efficient than regular Metropolis-Hastings.
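
A sketch of Gibbs sampling for a bivariate Gaussian with correlation ρ, where both conditionals are one-dimensional Gaussians; the target and ρ are our own illustrative choices:

    import random

    def gibbs_bivariate_gaussian(n_steps, rho=0.8):
        # Target: (x, y) ~ N(0, [[1, rho], [rho, 1]]).
        # Conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y.
        x, y = 0.0, 0.0
        sd = (1 - rho ** 2) ** 0.5
        samples = []
        for _ in range(n_steps):
            x = random.gauss(rho * y, sd)   # sample x^{tau+1} ~ P(x | y^tau)
            y = random.gauss(rho * x, sd)   # sample y^{tau+1} ~ P(y | x^{tau+1})
            samples.append((x, y))
        return samples

    s = gibbs_bivariate_gaussian(100_000)
    print(sum(a * b for a, b in s) / len(s))   # sample E[xy] is close to rho = 0.8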

Properties of Gibbs Sampling

Properties
  No need to tune the proposal distribution;
  Good trade-off between acceptance and mixing: the acceptance ratio is always 1.
  Need to be able to derive the conditional probability distributions.

Acceleration of Gibbs sampling (given p(a, b, c), draw samples of a and c):
  Blocked Gibbs:
  (1) Draw (a, b) given c;
  (2) Draw c given (a, b);
  Collapsed Gibbs (with b marginalized out):
  (1) Draw a given c;
  (2) Draw c given a;
  Marginalize whenever you can.

Outline

1 Monte Carlo Method

2 Markov Chain Monte Carlo


MCMC Sampling Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
Latent Dirichlet Allocation

Topic modeling for text

Latent Dirichlet Allocation (LDA)

An example article from a corpus: each color codes a different topic.
  There are many models which can be used to represent text data, such as LSA, PLSA, LDA, word2vec, etc.
  LDA models text in a simple and reasonable manner;
  LDA can be applied to many complex applications, such as images, graphs, locations, etc.

Notations for LDA

symbol     meaning
M          the number of documents
N_m        the number of words in document m
K          the number of topics
w_{m,n}    the index of the n-th word in document m
z_{m,n}    the topic assigned to word w_{m,n}
α, β       fixed hyper-parameters
θ          topic distribution for each document
φ          word distribution for each topic

Properties of Dirichlet

    Dir(θ|α) = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K θ_k^{α_k−1} ≡ (1/Δ(α)) Π_{k=1}^K θ_k^{α_k−1}

    Mult(m₁, ⋯, m_K|θ, N) = (N choose m₁ m₂ ⋯ m_K) Π_{k=1}^K θ_k^{m_k}

    Dir(θ|D, α) = Dir(θ|α + m) = [Γ(Σ_{k=1}^K α_k + N) / Π_{k=1}^K Γ(α_k + m_k)] Π_{k=1}^K θ_k^{α_k+m_k−1}

The expectation of the Dirichlet is E(θ) = (α₁/α₀, α₂/α₀, ⋯, α_K/α₀), where α₀ = Σ_{k=1}^K α_k.

LDA: Latent Dirichlet Allocation

LDA assumes the following generative process for each document m in a corpus D:

1: for k = 1 to K do
2:   φ^(k) ∼ Dirichlet(β);
3: for each document m ∈ D
4:   θ_m ∼ Dirichlet(α);
5:   for each word w_{m,n} ∈ m
6:     z_{m,n} ∼ Mult(θ_m);
7:     w_{m,n} ∼ Mult(φ^{(z_{m,n})});

where φ^(k) ∈ R^{|V|} and θ_m ∈ R^K.
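
A sketch of this generative process with numpy (vocabulary size, document lengths, and hyper-parameters are our own toy choices):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, M, N_m = 3, 20, 5, 50          # topics, vocab size, docs, words/doc
    alpha, beta = 0.5, 0.1               # symmetric hyper-parameters

    phi = rng.dirichlet(np.full(V, beta), size=K)      # phi^(k): topic-word dists
    docs = []
    for m in range(M):
        theta_m = rng.dirichlet(np.full(K, alpha))     # theta_m: doc-topic dist
        z = rng.choice(K, size=N_m, p=theta_m)         # z_{m,n} ~ Mult(theta_m)
        w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_{m,n} ~ Mult(phi^(z))
        docs.append(w)
    print(docs[0][:10])                  # first 10 word indices of document 0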

Joint probability of LDA model

The joint probability of observing a word w_{m,n} is

    p(w_{m,n}, z_{m,n}, φ, θ_m|α, β) = p(w_{m,n}|z_{m,n}, φ)p(z_{m,n}|θ_m)p(φ|β)p(θ_m|α).

In other words, for the corpus,

    p(w, z, φ, θ|α, β) = Π_{m=1}^M Π_{n=1}^{N_m} p(w_{m,n}, z_{m,n}, φ, θ_m|α, β)
        = p(φ|β) Π_{m=1}^M p(θ_m|α) Π_{n=1}^{N_m} p(w_{m,n}|z_{m,n}, φ)p(z_{m,n}|θ_m).

LDA Model II

1: for k = 1 to K do
2:   φ^(k) ∼ Dirichlet(β);
3: for each document m ∈ D
4:   θ_m ∼ Dirichlet(α);
5:   for each word w_{m,n} ∈ m
6:     z_{m,n} ∼ Mult(θ_m);
7: for each topic k ∈ [1, K]
8:   for each z_{m,n} = k
9:     w_{m,n} ∼ Mult(φ^(k));

We put the words with the same topic together. We have

    z = (z₁, z₂, ⋯, z_K),  w = (w₁, w₂, ⋯, w_K),

where w_k is the set of words generated by the k-th topic, and z_k is a vector whose terms are the IDs (k) of the word topics.
Now we have two conjugate Dirichlet-Multinomial structures:

    α −→ θ_m −→ z_m, and β −→ φ_k −→ w_k,    (9)

where α → θ_m and β → φ_k are Dirichlet steps, and θ_m → z_m and φ_k → w_k are Multinomial steps.

Dice Toss Toy Example

Suppose we have a die of K sides. We toss the die, and the probability of landing on side k is p(t = k|f) = f_k. We throw the die N times and obtain a set of results s = {s₁, s₂, ⋯, s_N}. The joint probability is

    p(s|f) = Π_{n=1}^N p(s_n|f) = f₁^{n₁} f₂^{n₂} ⋯ f_K^{n_K} = Π_{i=1}^K f_i^{n_i}    (10)

Suppose that f follows a Dirichlet distribution with hyper-parameter α. Then we express the probability of f as

    Dir(f|α) = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K f_k^{α_k−1}    (11)

Example Cont'd

If we want to estimate the parameter f based on the observation of s, then we can express f in the following manner:

    p(f|s, α) = p(s|f, α)p(f|α) / ∫ p(s|f, α)p(f|α)df
              = Π_{k=1}^K f_k^{n_k+α_k−1} / ∫ Π_{k=1}^K f_k^{n_k+α_k−1} df    (the Γ normalizers cancel)
              = [Γ(Σ_{k=1}^K (n_k + α_k)) / Π_{k=1}^K Γ(n_k + α_k)] Π_{k=1}^K f_k^{n_k+α_k−1}

Notice that after estimating f based on the observations s, f still follows a Dirichlet distribution, now with parameter α + n, where n = (n₁, n₂, ⋯, n_K). This property is known as conjugate priors. Based on this property, estimating the parameters f_i after observing N trials is a simple counting procedure.

Estimating f_i

Suppose we want to obtain f_i from f = (f₁, f₂, ⋯, f_{i−1}, f_i, f_{i+1}, ⋯, f_K):

    E(f_i|s, α) = ∫ f_i p(f|s, α)df
        = [Γ(Σ_{k=1}^K (n_k + α_k)) / Π_{k=1}^K Γ(n_k + α_k)] ∫ f_i Π_{k=1}^K f_k^{n_k+α_k−1} df
        = [Γ(Σ_{k=1}^K (n_k + α_k)) / Π_{k=1}^K Γ(n_k + α_k)] · [Γ(n_i + α_i + 1) Π_{k≠i} Γ(n_k + α_k) / Γ(1 + Σ_{k=1}^K (n_k + α_k))]
        = (n_i + α_i) / Σ_{k=1}^K (n_k + α_k)
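
A quick numpy check of this counting rule, comparing the closed-form posterior mean (n_i + α_i)/Σ_k(n_k + α_k) with the empirical mean of posterior draws; the toy numbers are our own:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([1.0, 1.0, 1.0])       # Dirichlet prior over K = 3 sides
    n = np.array([12, 30, 58])              # observed counts
    posterior_mean = (n + alpha) / (n + alpha).sum()   # (n_i + a_i) / sum_k(n_k + a_k)
    draws = rng.dirichlet(n + alpha, size=100_000)     # posterior is Dir(alpha + n)
    print(posterior_mean)                   # e.g., [0.126, 0.301, 0.573]
    print(draws.mean(axis=0))               # matches the closed form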

Likelihood of Observing s

Suppose we want to obtain the likelihood of the observations s, i.e., p(s|α):

    p(s|α) = ∫ p(s, f|α)df = ∫ p(s|f)p(f|α)df
           = ∫ [Π_{i=1}^K f_i^{n_i}] · [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K f_k^{α_k−1} df
           = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] ∫ Π_{k=1}^K f_k^{n_k+α_k−1} df
           = [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] · [Π_{k=1}^K Γ(n_k + α_k) / Γ(Σ_{k=1}^K (n_k + α_k))]
           = Δ(n + α) / Δ(α),

where Δ(α) = Π_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k).

Parameter Inference

We integrate out θ and φ to obtain the following:

    p(z, w|α, β) = p(w|z, β)p(z|α)

For the conjugate structure β → φ_k → w_k:

    p(w|z, β) = Π_{k=1}^K p(w_k|z_k, β) = Π_{k=1}^K Δ(n_k + β)/Δ(β),

where n_k = (n_k^(1), n_k^(2), ⋯, n_k^(V)), and n_k^(v) is the number of times word v is generated by topic k.

For the conjugate structure α → θ_m → z_m:

    p(z|α) = Π_{m=1}^M p(z_m|α) = Π_{m=1}^M Δ(n_m + α)/Δ(α),

where n_m = (n_m^(1), n_m^(2), ⋯, n_m^(K)), and n_m^(k) is the number of words with topic k in the m-th document.

Parameter Inference Cont'd

p(z|α)

    p(z|α) = ∫ p(z, θ|α)dθ = ∫ p(z|θ, α)p(θ|α)dθ = ∫ p(z|θ)p(θ|α)dθ
        = ∫ Π_{m=1}^M Π_{k=1}^K θ_{m,k}^{n_{m,k}} · [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] Π_{k=1}^K θ_{m,k}^{α_k−1} dθ
        = Π_{m=1}^M [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] ∫ Π_{k=1}^K θ_{m,k}^{n_{m,k}+α_k−1} dθ_m
        = Π_{m=1}^M [Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k)] · [Π_{k=1}^K Γ(α_k + n_{m,k}) / Γ(Σ_{k=1}^K (α_k + n_{m,k}))]
        = Π_{m=1}^M Δ(n_m + α)/Δ(α)

Parameter Inference Cont'd

p(w|z, β)

    p(w|z, β) = ∫ p(w, φ|z, β)dφ = ∫ p(w|z, β, φ)p(φ|z, β)dφ = ∫ p(w|z, φ)p(φ|β)dφ
        = ∫ Π_{k=1}^K Π_{v=1}^V φ_{k,v}^{n_{k,v}} · [Γ(Σ_{v=1}^V β_v) / Π_{v=1}^V Γ(β_v)] Π_{v=1}^V φ_{k,v}^{β_v−1} dφ
        = Π_{k=1}^K [Γ(Σ_{v=1}^V β_v) / Π_{v=1}^V Γ(β_v)] ∫ Π_{v=1}^V φ_{k,v}^{n_{k,v}+β_v−1} dφ_k
        = Π_{k=1}^K [Γ(Σ_{v=1}^V β_v) / Π_{v=1}^V Γ(β_v)] · [Π_{v=1}^V Γ(β_v + n_{k,v}) / Γ(Σ_{v=1}^V (β_v + n_{k,v}))]
        = Π_{k=1}^K Δ(n_k + β)/Δ(β)

Gibbs Sampling

Analysis

For simplicity, denote the topic of the i-th word in the corpus by z_i, where i = (m, n). In terms of Gibbs sampling, we need to compute the conditional probability p(z_i = k|z_{−i}, w):

    p(z_i = k|z_{−i}, w) = p(z_i = k|z_{−i}, w_{−i}, w_i = t)
        = p(z_i = k, w_i = t|z_{−i}, w_{−i}) / p(w_i = t|z_{−i}, w_{−i}) ∝ p(z_i = k, w_i = t|z_{−i}, w_{−i}).

Notice that z_i = k, w_i = t only involves the m-th document and the k-th topic, which are related to two Dirichlet-Multinomial (DM) structures, and is independent of the other M + K − 2 DM structures:

    p(θ_m|z_{−i}, w_{−i}) = Dir(θ_m|n_{m,−i} + α)
    p(φ_k|z_{−i}, w_{−i}) = Dir(φ_k|n_{k,−i} + β)

Deriving the Transition Probability

Transition probability

    p(z_i = k, w_i = t|z_{−i}, w_{−i}) = ∫ p(z_i = k, w_i = t, θ_m, φ_k|z_{−i}, w_{−i}) dθ_m dφ_k
        = ∫ p(z_i = k, θ_m|z_{−i}, w_{−i}) p(w_i = t, φ_k|z_{−i}, w_{−i}) dθ_m dφ_k
        = ∫ p(z_i = k|θ_m) Dir(θ_m|n_{m,−i} + α) dθ_m · ∫ p(w_i = t|φ_k) Dir(φ_k|n_{k,−i} + β) dφ_k
        = ∫ θ_{mk} Dir(θ_m|n_{m,−i} + α) dθ_m · ∫ φ_{kt} Dir(φ_k|n_{k,−i} + β) dφ_k
        = E(θ_{mk}) E(φ_{kt}) = θ̂_{mk} φ̂_{kt},

where

    θ̂_{mk} = (n_{m,−i}^(k) + α_k) / Σ_{k=1}^K (n_{m,−i}^(k) + α_k), and
    φ̂_{kt} = (n_{k,−i}^(t) + β_t) / Σ_{t=1}^V (n_{k,−i}^(t) + β_t).
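
A compact sketch of collapsed Gibbs sampling for LDA built directly on this transition probability (the toy corpus, hyper-parameters, and variable names are our own choices):

    import numpy as np

    rng = np.random.default_rng(0)
    docs = [[0, 1, 2, 1, 0], [3, 4, 3, 4, 5], [0, 1, 5, 3, 2]]  # toy word IDs
    K, V = 2, 6
    alpha, beta = 0.5, 0.1

    # Count matrices: n_mk (doc-topic), n_kv (topic-word), n_k (topic totals).
    n_mk = np.zeros((len(docs), K))
    n_kv = np.zeros((K, V))
    n_k = np.zeros(K)
    z = []  # current topic assignment of every word
    for m, doc in enumerate(docs):
        zm = rng.integers(K, size=len(doc))  # random initialization
        z.append(zm)
        for w, k in zip(doc, zm):
            n_mk[m, k] += 1; n_kv[k, w] += 1; n_k[k] += 1

    for it in range(200):                    # Gibbs sweeps
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]
                # Remove word i = (m, n) from the counts: the "-i" statistics.
                n_mk[m, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # p(z_i = k|z_-i, w) ∝ (n_mk + alpha) * (n_kv + beta)/(n_k + V*beta);
                # the denominator of theta-hat is constant in k and cancels.
                probs = (n_mk[m] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=probs / probs.sum())
                z[m][n] = k
                n_mk[m, k] += 1; n_kv[k, w] += 1; n_k[k] += 1

    print(z)  # per-word topic assignments after sampling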

Take-home messages

Monte Carlo method


Markov Chain Monte Carlo
MCMC sampling algorithm
Metropolis-Hastings algorithm
Gibbs sampling
Latent Dirichlet Allocation

