
Variational Inference

Hady W. Lauw

IS712 Machine Learning


Cross Entropy
• Cost of modeling some distribution $p(x)$ with a different distribution $q(x)$:
$$H(p, q) = \int -p(x) \log q(x) \, dx$$

• Kullback-Leibler divergence or KL-divergence is the additional cost as compared to using the original distribution $p(x)$:
$$D_{KL}(p||q) = \int -p(x) \log q(x) \, dx - \int -p(x) \log p(x) \, dx$$
$$D_{KL}(p||q) = -\int p(x) \log \frac{q(x)}{p(x)} \, dx$$
– Always non-negative, i.e., $D_{KL}(p||q) \geq 0$
– Minimized when $p(x)$ and $q(x)$ are identical distributions
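As a quick illustration (not from the slides), the sketch below computes cross-entropy and KL-divergence for two small discrete distributions with NumPy; the distributions themselves are hypothetical.

```python
import numpy as np

# Two discrete distributions over the same support (hypothetical values).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Cross-entropy H(p, q) = -sum_x p(x) log q(x)
cross_entropy = -np.sum(p * np.log(q))

# Entropy H(p) = -sum_x p(x) log p(x)
entropy_p = -np.sum(p * np.log(p))

# KL-divergence is the additional cost of modeling p with q: D_KL(p||q) = H(p, q) - H(p)
kl_pq = cross_entropy - entropy_p

print(cross_entropy, entropy_p, kl_pq)                # kl_pq >= 0
print(np.isclose(kl_pq, -np.sum(p * np.log(q / p))))  # same as -sum_x p(x) log q(x)/p(x)
```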
Kullback-Leibler Divergence (KL-Divergence)

Figure: https://en.wikipedia.org/wiki/File:KL-Gauss-Example.png
Latent Variable Model

• Suppose that we have a probabilistic model with observed variables $\boldsymbol{x}$ and hidden variables $\boldsymbol{z}$, with its joint distribution parameterized by $\theta$
– Our goal is to maximize the likelihood function given by:
$$p(\boldsymbol{x}|\theta) = \sum_{\boldsymbol{z}} p(\boldsymbol{x}, \boldsymbol{z}|\theta)$$
– Suppose direct optimization of $p(\boldsymbol{x}|\theta)$ is difficult, but that of $p(\boldsymbol{x}, \boldsymbol{z}|\theta)$ is easier

• Let $q(\boldsymbol{z})$ be a distribution over the latent variables. For any choice of distribution $q(\boldsymbol{z})$, the following decomposition holds:
$$\ln p(\boldsymbol{x}|\theta) = \mathcal{L}(q, \theta) + D_{KL}(q||p)$$
$$\mathcal{L}(q, \theta) = \sum_{\boldsymbol{z}} q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{x}, \boldsymbol{z}|\theta)}{q(\boldsymbol{z})}$$
$$D_{KL}(q||p) = -\sum_{\boldsymbol{z}} q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{z}|\boldsymbol{x}, \theta)}{q(\boldsymbol{z})}$$
Decomposition

• We observe that:
$$p(\boldsymbol{x}, \boldsymbol{z}|\theta) = p(\boldsymbol{z}|\boldsymbol{x}, \theta) \, p(\boldsymbol{x}|\theta)$$

• Substitute this into $\mathcal{L}(q, \theta)$, and we get:
$$\begin{aligned}
\mathcal{L}(q, \theta) &= \sum_{\boldsymbol{z}} q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{x}, \boldsymbol{z}|\theta)}{q(\boldsymbol{z})} \\
&= \sum_{\boldsymbol{z}} q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{z}|\boldsymbol{x}, \theta) \, p(\boldsymbol{x}|\theta)}{q(\boldsymbol{z})} \\
&= \sum_{\boldsymbol{z}} q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{z}|\boldsymbol{x}, \theta)}{q(\boldsymbol{z})} + \sum_{\boldsymbol{z}} q(\boldsymbol{z}) \ln p(\boldsymbol{x}|\theta) \\
&= -D_{KL}(q||p) + \ln p(\boldsymbol{x}|\theta)
\end{aligned}$$

• We get back the original decomposition:
$$\ln p(\boldsymbol{x}|\theta) = \mathcal{L}(q, \theta) + D_{KL}(q||p)$$
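A minimal numerical check of this decomposition, assuming a toy model with one observation and a discrete latent variable; the probabilities and the choice of $q(\boldsymbol{z})$ are made up for illustration.

```python
import numpy as np

# Toy latent-variable model for a single observed x and a discrete z (hypothetical numbers).
prior = np.array([0.6, 0.3, 0.1])        # p(z)
lik   = np.array([0.2, 0.5, 0.9])        # p(x | z) evaluated at the observed x
joint = prior * lik                      # p(x, z) = p(z) p(x | z)
evidence = joint.sum()                   # p(x) = sum_z p(x, z)
posterior = joint / evidence             # p(z | x)

q = np.array([0.3, 0.4, 0.3])            # any distribution over z

elbo = np.sum(q * np.log(joint / q))     # L(q) = sum_z q(z) ln p(x, z)/q(z)
kl   = np.sum(q * np.log(q / posterior)) # D_KL(q || p(z|x))

print(np.isclose(np.log(evidence), elbo + kl))  # ln p(x) = L(q) + D_KL(q||p)
print(elbo <= np.log(evidence))                 # L(q) is a lower bound on the log-evidence
```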
Illustration of the Decomposition

• Because the KL-divergence is non-negative, $\mathcal{L}(q, \theta)$ is effectively a lower bound on the log-likelihood $\ln p(\boldsymbol{x}|\theta)$
Revisiting EM: E-Step

• When $q = p$, we get back EM
• In the E-step, holding the old parameters $\theta^{old}$ fixed, we maximize $\mathcal{L}(q, \theta^{old})$ w.r.t. $q(\boldsymbol{z})$
– This happens when the KL-divergence is 0, and thus $q(\boldsymbol{z}) = p(\boldsymbol{z}|\boldsymbol{x}, \theta^{old})$
Revisiting EM: M-Step

• In the M-step, holding $q(\boldsymbol{z})$ fixed, we maximize $\mathcal{L}(q, \theta)$ w.r.t. the parameters $\theta$
– This increases $\mathcal{L}(q, \theta)$, which is a lower bound on the log-likelihood
– The new posterior $p(\boldsymbol{z}|\boldsymbol{x}, \theta^{new}) \neq p(\boldsymbol{z}|\boldsymbol{x}, \theta^{old})$, so it creates a non-zero KL-divergence
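The slides describe EM abstractly; as an illustrative sketch (not the lecture's code), the NumPy loop below runs EM for a two-component 1-D Gaussian mixture. The E-step sets $q(\boldsymbol{z})$ to the posterior responsibilities under $\theta^{old}$, and the M-step re-estimates $\theta$; the data and initialization are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (hypothetical parameters).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# Initial parameters theta = (mixing weights pi, means mu, variances var)
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: q(z) = p(z | x, theta_old), i.e. the per-point responsibilities
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize L(q, theta) w.r.t. theta while holding q fixed
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)  # should approach the true mixing weights and component parameters
```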
Variational Inference

• For EM, we assume that the evaluation of the posterior $p(\boldsymbol{z}|\boldsymbol{x}, \theta)$ is tractable
• For many models, this evaluation may not be tractable
– E.g., when we introduce a prior that requires complex normalization, or when the dimensionality is too high
• Approximate inference by using a different distribution $q \neq p$ that is simpler and more tractable
– The previous parameters $\theta$ are now absorbed into $\boldsymbol{z}$
$$\ln p(\boldsymbol{x}) = \mathcal{L}(q) + D_{KL}(q||p)$$
$$\mathcal{L}(q) = \int q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})} \, d\boldsymbol{z}$$
$$D_{KL}(q||p) = -\int q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{z}|\boldsymbol{x})}{q(\boldsymbol{z})} \, d\boldsymbol{z}$$
Jensen's Inequality
• For a convex function (any minimum is a global minimum): $f(\mathrm{E}[x]) \leq \mathrm{E}[f(x)]$
• For a concave function: $\mathrm{E}[f(x)] \leq f(\mathrm{E}[x])$
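A quick numerical illustration of both directions of the inequality, using a hypothetical positive random sample.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # any positive random variable

# log is concave, so E[log x] <= log E[x]
print(np.mean(np.log(x)), np.log(np.mean(x)))

# x**2 is convex, so (E[x])**2 <= E[x**2]
print(np.mean(x) ** 2, np.mean(x ** 2))
```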
Evidence Lower Bound (ELBO)
• Another view of $\mathcal{L}(q)$
• Evidence or likelihood:
$$\ln p(\boldsymbol{x}) = \ln \int p(\boldsymbol{x}, \boldsymbol{z}) \, d\boldsymbol{z}
= \ln \int q(\boldsymbol{z}) \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})} \, d\boldsymbol{z}
= \ln \mathrm{E}_q\!\left[\frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})}\right]$$

• Based on Jensen's inequality, the evidence lower bound or ELBO:
$$\mathcal{L}(q) = \int q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})} \, d\boldsymbol{z}
= \mathrm{E}_q\!\left[\ln \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})}\right]
\leq \ln \mathrm{E}_q\!\left[\frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})}\right]$$
Evidence Lower Bound (ELBO)
• To maximize the likelihood, we maximize the ELBO
• Interpretation of $\mathcal{L}(q)$:
$$\mathcal{L}(q) = \mathrm{E}_q\!\left[\ln \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})}\right] = \mathrm{E}_q[\ln p(\boldsymbol{x}, \boldsymbol{z})] - \mathrm{E}_q[\ln q(\boldsymbol{z})]$$
– $\mathrm{E}_q[\ln p(\boldsymbol{x}, \boldsymbol{z})]$ is the expectation (under $q$) of the log of the joint probability
– $H(q) = -\mathrm{E}_q[\ln q(\boldsymbol{z})]$ is the entropy of the variational distribution

• As previously shown, maximizing the ELBO is equivalent to minimizing $D_{KL}(q||p)$
Mean Field Variational Inference

• The variational distribution $q$ should be simple and tractable
• A frequent assumption is that the latent variables are independent
• The variational distribution factorizes:
$$q(\boldsymbol{z}) = q(z_1, z_2, \ldots, z_M) = \prod_{i=1}^{M} q(z_i)$$
• Also possible to group some (dependent) variables together to form partitions
Mean Field Variational Inference
• With the factorized variational distribution:
$$\mathcal{L}(q) = \int q(\boldsymbol{z}) \ln \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})} \, d\boldsymbol{z}
= \int \prod_i q(z_i) \left[ \ln p(\boldsymbol{x}, \boldsymbol{z}) - \sum_i \ln q(z_i) \right] d\boldsymbol{z}$$

• Dissect the dependence on one factor:
$$\begin{aligned}
\mathcal{L}(q) &= \int q(z_j) \left[ \int \ln p(\boldsymbol{x}, \boldsymbol{z}) \prod_{i \neq j} q(z_i) \, d\boldsymbol{z}_{-j} \right] dz_j - \int q(z_j) \ln q(z_j) \, dz_j + \text{const} \\
&= \int q(z_j) \, \mathrm{E}_{i \neq j}[\ln p(\boldsymbol{x}, \boldsymbol{z})] \, dz_j - \int q(z_j) \ln q(z_j) \, dz_j + \text{const}
\end{aligned}$$

• Holding $\{q(z_i)\}_{i \neq j}$ fixed, maximize $\mathcal{L}(q)$ w.r.t. each $q(z_j)$ in turn
General Steps

• Identify what distribution $q$ should be, e.g., Gaussian, Dirichlet
• Derive the ELBO
• Optimize the ELBO via gradient ascent for each $q(z_j)$ in turn
• Repeat till convergence (see the sketch below)
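A minimal sketch of these steps for a toy model with two discrete latent variables, using the standard closed-form mean-field coordinate update $q(z_j) \propto \exp(\mathrm{E}_{i \neq j}[\ln p(\boldsymbol{x}, \boldsymbol{z})])$ in place of generic gradient ascent; the joint table is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
# Unnormalized joint p(x, z1, z2) for the observed x, as a K x K table (hypothetical values).
joint = rng.random((K, K)) + 0.1
log_joint = np.log(joint)

q1 = np.full(K, 1.0 / K)   # q(z1), initialized uniform
q2 = np.full(K, 1.0 / K)   # q(z2), initialized uniform

def elbo(q1, q2):
    q = np.outer(q1, q2)   # factorized q(z1, z2) = q(z1) q(z2)
    return np.sum(q * (log_joint - np.log(q)))

for _ in range(100):
    # Maximizing L(q) w.r.t. q(z1) with q(z2) fixed gives q(z1) ∝ exp(E_{q(z2)}[ln p(x, z1, z2)])
    q1 = np.exp(log_joint @ q2); q1 /= q1.sum()
    # ... and symmetrically for q(z2)
    q2 = np.exp(q1 @ log_joint); q2 /= q2.sum()

# The ELBO never exceeds the log-evidence ln p(x) = ln sum_{z1, z2} p(x, z1, z2)
print(elbo(q1, q2), np.log(joint.sum()))
```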
VARIATIONAL INFERENCE ON LDA

Latent Dirichlet Allocation
• LDA's generative process for each document $d_i$:
– pick a topic distribution $\theta_i$ from a Dirichlet prior: $\theta_i \sim Dir(\alpha)$
– for each of the $N_i$ words in $d_i$:
• pick a latent class $z_n$ with probability $p(z_n|\theta_i)$
• pick a word $w_n$ with probability $p(w_n|z_n, \beta)$

• Document probability:
$$p(\boldsymbol{w}_i|\alpha, \beta) = \int p(\theta_i|\alpha) \prod_{n=1}^{N_i} \sum_{z_n} p(w_n|z_n, \beta) \, p(z_n|\theta_i) \, d\theta_i$$
– $\boldsymbol{w}_i$ are the observed words in document $d_i$
– $N_i$ is the number of words in $d_i$
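A small sketch of this generative process in NumPy; the number of topics, vocabulary size, document length, and hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
K, V, N = 3, 10, 20                       # topics, vocabulary size, words per document (hypothetical)
alpha = np.full(K, 0.5)                   # Dirichlet hyperparameter over topics
beta = rng.dirichlet(np.full(V, 0.1), K)  # beta[k, v] = p(word v | topic k)

def generate_document():
    theta = rng.dirichlet(alpha)          # topic distribution theta_i ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)        # latent class z_n ~ p(z | theta_i)
        w = rng.choice(V, p=beta[z])      # word w_n ~ p(w | z_n, beta)
        words.append(w)
    return theta, words

theta, words = generate_document()
print(theta, words)
```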
Variational Inference
• Posterior distribution of the hidden variables:
$$p(\theta, \boldsymbol{z}|\boldsymbol{w}, \alpha, \beta) = \frac{p(\theta, \boldsymbol{z}, \boldsymbol{w}|\alpha, \beta)}{p(\boldsymbol{w}|\alpha, \beta)}$$
– drop the index $i$ for a specific document
– intractable to compute because of the coupling of $\theta$ and $\beta$

• Variational inference
– Simplified graphical model with fewer dependencies:
$$q(\theta, \boldsymbol{z}|\gamma, \boldsymbol{\phi}) = q(\theta|\gamma) \prod_{n=1}^{N} q(z_n|\phi_n)$$
– $q(\theta|\gamma)$ for every document is Dirichlet over $K$ topics
– $q(z_n|\phi_n)$ for every token is Multinomial over $K$ topics

• Optimization objective:
$$(\gamma^*, \boldsymbol{\phi}^*) = \operatorname{argmin}_{(\gamma, \boldsymbol{\phi})} D_{KL}\big(q(\theta, \boldsymbol{z}|\gamma, \boldsymbol{\phi}) \,||\, p(\theta, \boldsymbol{z}|\boldsymbol{w}, \alpha, \beta)\big)$$
Optimization
• KL minimization is equivalent to ELBO maximization:
$$\mathrm{ELBO} = \mathrm{E}_q[\ln p(\theta, \boldsymbol{z}, \boldsymbol{w}|\alpha, \beta)] + H(q)$$

• The expectation of the log of the joint probability:
$$\begin{aligned}
\mathrm{E}_q[\ln p(\theta, \boldsymbol{z}, \boldsymbol{w}|\alpha, \beta)] &= \mathrm{E}_q[\ln p(\theta|\alpha)] + \mathrm{E}_q[\ln p(\boldsymbol{z}|\theta)] + \mathrm{E}_q[\ln p(\boldsymbol{w}|\boldsymbol{z}, \beta)] \\
&= \log \Gamma\Big(\sum_{k=1}^{K} \alpha_k\Big) - \sum_{k=1}^{K} \log \Gamma(\alpha_k) + \sum_{k=1}^{K} (\alpha_k - 1) \Big(\Psi(\gamma_k) - \Psi\Big(\sum_{k'=1}^{K} \gamma_{k'}\Big)\Big) \\
&\quad + \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{nk} \Big(\Psi(\gamma_k) - \Psi\Big(\sum_{k'=1}^{K} \gamma_{k'}\Big)\Big) + \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} \phi_{nk} \, w_{nv} \log \beta_{kv}
\end{aligned}$$

• The entropy of the variational distribution:
$$\begin{aligned}
H(q) &= H(\gamma) + \sum_{n=1}^{N} \sum_{k=1}^{K} H(\phi_{nk}) \\
&= -\log \Gamma\Big(\sum_{k=1}^{K} \gamma_k\Big) + \sum_{k=1}^{K} \log \Gamma(\gamma_k) - \sum_{k=1}^{K} (\gamma_k - 1) \Big(\Psi(\gamma_k) - \Psi\Big(\sum_{k'=1}^{K} \gamma_{k'}\Big)\Big) - \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{nk} \log \phi_{nk}
\end{aligned}$$

Legend:
• $K$: no. of topics
• $N$: no. of tokens in the document
• $V$: no. of unique words in the vocabulary
• $\alpha_k$: Dirichlet hyperparameter for topic $k$
• $\gamma_k$: variational parameter for topic $k$ in this document
• $\phi_{nk}$: variational parameter for the assignment of token $n$ to topic $k$
• $w_{nv}$: token $n$ having word form $v$
• $\beta_{kv}$: probability of topic $k$ generating word form $v$
• $\Gamma$: Gamma function
• $\Psi$: Digamma function

For the full derivation refer to https://youtu.be/2pEkWk-LHmU
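A minimal sketch of evaluating these ELBO terms for one document with scipy.special; the function name `lda_elbo` and its argument layout are assumptions for illustration.

```python
import numpy as np
from scipy.special import digamma, gammaln

def lda_elbo(words, alpha, beta, gamma, phi):
    """Per-document ELBO for LDA, following the terms on this slide.
    words: token word ids (length N); alpha: length-K prior; beta: K x V topic-word matrix;
    gamma: length-K Dirichlet parameters; phi: N x K token-topic weights (strictly positive)."""
    dig = digamma(gamma) - digamma(gamma.sum())  # E_q[ln theta_k] = Psi(gamma_k) - Psi(sum_k gamma_k)

    e_log_p_theta = gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1) * dig).sum()
    e_log_p_z     = (phi * dig).sum()
    e_log_p_w     = (phi * np.log(beta[:, words].T)).sum()

    h_q_theta = -gammaln(gamma.sum()) + gammaln(gamma).sum() - ((gamma - 1) * dig).sum()
    h_q_z     = -(phi * np.log(phi)).sum()

    return e_log_p_theta + e_log_p_z + e_log_p_w + h_q_theta + h_q_z
```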
Variational Inference Algorithm for LDA

• Randomly initialize the variational parameters
• For each iteration:
– For each document $i$ (index dropped), update $\gamma$. For each token in the document, update $\phi$:
$$\gamma_k = \alpha_k + \sum_{n=1}^{N} \phi_{nk}$$
$$\phi_{nk} \propto \beta_{kv} \exp\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{k'=1}^{K} \gamma_{k'}\Big)\Big) \quad \text{normalize so that } \textstyle\sum_{k=1}^{K} \phi_{nk} = 1$$
– For the corpus, update $\beta$:
$$\beta_{kv} \propto \sum_{i} \sum_{n=1}^{N_i} \phi_{ink} \cdot w_{inv} \quad \text{normalize so that } \textstyle\sum_{v=1}^{V} \beta_{kv} = 1$$
– Compute $\mathcal{L}$ to assess convergence
• Return the expectation of the variational parameters for the solution to the latent variables
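A NumPy sketch of these updates; the helper names `update_document` and `update_beta`, the initialization, and the inner iteration count are assumptions, not the lecture's code.

```python
import numpy as np
from scipy.special import digamma

def update_document(words, alpha, beta, n_iters=50):
    """Per-document variational updates for LDA.
    words: token word ids; alpha: length-K Dirichlet prior; beta: K x V topic-word matrix."""
    N, K = len(words), len(alpha)
    phi = np.full((N, K), 1.0 / K)      # q(z_n | phi_n), uniform initialization
    gamma = alpha + N / K               # q(theta | gamma) initialization
    for _ in range(n_iters):
        # phi_{nk} ∝ beta_{k, w_n} * exp(Psi(gamma_k) - Psi(sum_k' gamma_k'))
        phi = beta[:, words].T * np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi /= phi.sum(axis=1, keepdims=True)   # normalize over topics
        # gamma_k = alpha_k + sum_n phi_{nk}
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

def update_beta(docs, phis, K, V):
    """Corpus-level update: beta_{kv} ∝ sum_i sum_n phi_{ink} * [w_{in} = v]."""
    beta = np.zeros((K, V))
    for words, phi in zip(docs, phis):
        # accumulate phi_{nk} into the rows for the observed words:
        # equivalent to: for n, w in enumerate(words): beta[:, w] += phi[n]
        np.add.at(beta.T, words, phi)
    beta /= beta.sum(axis=1, keepdims=True)     # normalize over the vocabulary
    return beta
```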
VARIATIONAL AUTOENCODER (VAE)

Variational Autoencoder (VAE)

• Designed not only for content encoding, but also for content generation

• Key idea:
– regularize the latent space of the encoding so that nearby points in the latent space will
generate similar decodings

https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
Generative Model

• Let us consider a dataset $\{\boldsymbol{x}_i\}_{i=1}^{N}$
– Data is generated by a random process involving latent variables $\boldsymbol{z}$

• Generative process:
– For each data point $i = 1, \ldots, N$:
• Sample latent variable $\boldsymbol{z}_i$ from a parameterized distribution $p(\boldsymbol{z}|\theta)$
• Sample observation $\boldsymbol{x}_i$ from a parameterized distribution $p(\boldsymbol{x}|\boldsymbol{z}, \theta)$

• Suppose the integral of the marginal likelihood $p(\boldsymbol{x}|\theta) = \int p(\boldsymbol{z}|\theta) \, p(\boldsymbol{x}|\boldsymbol{z}, \theta) \, d\boldsymbol{z}$ is intractable; correspondingly, so is the true posterior
$$p(\boldsymbol{z}|\boldsymbol{x}, \theta) = \frac{p(\boldsymbol{x}|\boldsymbol{z}, \theta) \, p(\boldsymbol{z}|\theta)}{p(\boldsymbol{x}|\theta)}$$
Variational Inference

• Introduce $q(\boldsymbol{z}|\boldsymbol{x}, \phi)$ as an approximation to $p(\boldsymbol{z}|\boldsymbol{x}, \theta)$

• Interpretation:
– $\boldsymbol{z}$ is the encoding of input $\boldsymbol{x}$
– $q(\boldsymbol{z}|\boldsymbol{x}, \phi)$ is the encoder
– $p(\boldsymbol{x}|\boldsymbol{z}, \theta)$ is the decoder

• Objective is to maximize the ELBO:
$$\mathcal{L}(q) = \mathrm{E}_q\!\left[\ln \frac{p(\boldsymbol{x}, \boldsymbol{z}|\theta)}{q(\boldsymbol{z}|\boldsymbol{x}, \phi)}\right]$$

• Equivalently, to minimize the following KL-divergence:
$$D_{KL}\big(q(\boldsymbol{z}|\boldsymbol{x}, \phi) \,||\, p(\boldsymbol{z}|\boldsymbol{x}, \theta)\big)$$
Decoder and Encoder

Decoder
• Let $p(\boldsymbol{z}|\theta)$ be a factorized multivariate Gaussian:
$$\ln p(\boldsymbol{z}|\theta) = \ln \mathcal{N}(\boldsymbol{z}; \boldsymbol{0}, \mathbf{I})$$
• Let $p(\boldsymbol{x}|\boldsymbol{z}, \theta)$ be a factorized multivariate Gaussian:
$$\ln p(\boldsymbol{x}|\boldsymbol{z}, \theta) = \ln \mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\sigma}^2 \mathbf{I})$$
– $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ are outputs of multi-layer perceptrons parameterized by $\theta$ (network weights):
$$\boldsymbol{h} = \tanh(\mathbf{W}_1 \boldsymbol{z} + \boldsymbol{b}_1), \quad \boldsymbol{\mu} = \mathbf{W}_2 \boldsymbol{h} + \boldsymbol{b}_2, \quad \boldsymbol{\sigma}^2 = \mathbf{W}_3 \boldsymbol{h} + \boldsymbol{b}_3$$
– the latent representation $\boldsymbol{h}$ is shared by $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$

Encoder
• Let $q(\boldsymbol{z}|\boldsymbol{x}, \phi)$ be a factorized multivariate Gaussian:
$$\ln q(\boldsymbol{z}|\boldsymbol{x}, \phi) = \ln \mathcal{N}(\boldsymbol{z}; \boldsymbol{\rho}, \boldsymbol{\omega}^2 \mathbf{I})$$
– $\boldsymbol{\rho}$ and $\boldsymbol{\omega}^2$ are outputs of multi-layer perceptrons parameterized by $\phi$ (network weights) on input $\boldsymbol{x}$
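A minimal PyTorch sketch of such an encoder and decoder. Following common practice (and slightly deviating from the slide's $\boldsymbol{\sigma}^2$ and $\boldsymbol{\omega}^2$ notation), the networks below output log-variances for numerical stability; the class names and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q(z | x, phi): outputs mean rho and log-variance log(omega^2) of a diagonal Gaussian."""
    def __init__(self, x_dim, h_dim, z_dim):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.rho = nn.Linear(h_dim, z_dim)
        self.log_omega2 = nn.Linear(h_dim, z_dim)  # log-variance instead of variance

    def forward(self, x):
        h = self.hidden(x)                         # shared hidden representation
        return self.rho(h), self.log_omega2(h)

class Decoder(nn.Module):
    """p(x | z, theta): outputs mean mu and log-variance log(sigma^2) of a diagonal Gaussian."""
    def __init__(self, z_dim, h_dim, x_dim):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh())  # h = tanh(W1 z + b1)
        self.mu = nn.Linear(h_dim, x_dim)          # mu = W2 h + b2
        self.log_sigma2 = nn.Linear(h_dim, x_dim)  # shares h with mu

    def forward(self, z):
        h = self.hidden(z)
        return self.mu(h), self.log_sigma2(h)
```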
Illustration
• Encoded distributions are Normal distributions (parameterized by mean and variance)

https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
https://www.cl.cam.ac.uk/~pv273/slides/UCLSlides.pdf
Optimization (for Gaussian)
• Objective is to maximize the ELBO:
$$\mathcal{L}(q) = \mathrm{E}_q\!\left[\ln \frac{p(\boldsymbol{x}, \boldsymbol{z}|\theta)}{q(\boldsymbol{z}|\boldsymbol{x}, \phi)}\right]
= \mathrm{E}_q\!\left[\ln \frac{p(\boldsymbol{x}|\boldsymbol{z}, \theta) \, p(\boldsymbol{z}|\theta)}{q(\boldsymbol{z}|\boldsymbol{x}, \phi)}\right]
= \mathrm{E}_q[\ln p(\boldsymbol{x}|\boldsymbol{z}, \theta)] - D_{KL}\big(q(\boldsymbol{z}|\boldsymbol{x}, \phi) \,||\, p(\boldsymbol{z}|\theta)\big)$$

• The first term $\mathrm{E}_q[\ln p(\boldsymbol{x}|\boldsymbol{z}, \theta)]$ is the expected (negative) reconstruction error
– for each point, estimated by taking $L$ samples from $q(\boldsymbol{z}|\boldsymbol{x}, \phi)$ via Monte Carlo estimation:
$$\mathrm{E}_q[\ln p(\boldsymbol{x}|\boldsymbol{z}, \theta)] \approx \frac{1}{L} \sum_{l=1}^{L} \ln p(\boldsymbol{x}|\boldsymbol{z}^{(l)})$$
– requires the "reparameterization trick" to make backpropagation and gradient descent possible

• The second term $-D_{KL}\big(q(\boldsymbol{z}|\boldsymbol{x}, \phi) \,||\, p(\boldsymbol{z}|\theta)\big)$ regularizes $q(\boldsymbol{z}|\boldsymbol{x}, \phi)$ towards the prior $\mathcal{N}(\boldsymbol{0}, \mathbf{I})$
– Can be derived analytically (no sampling needed):
$$-D_{KL}\big(q(\boldsymbol{z}|\boldsymbol{x}, \phi) \,||\, p(\boldsymbol{z}|\theta)\big) = \frac{1}{2} \sum_{m=1}^{M} \big(1 + \ln(\omega_m^2) - \rho_m^2 - \omega_m^2\big)$$
– Here $M$ is the dimensionality of $\boldsymbol{z}$ (the size of the latent variables)
– $\omega_m$ and $\rho_m$ are neural network functions of $\boldsymbol{x}$ and $\phi$
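A PyTorch sketch of this objective, assuming the hypothetical `Encoder`/`Decoder` modules above: the KL term uses the analytic expression, and the reconstruction term uses $L$ reparameterized samples.

```python
import math
import torch

def vae_loss(x, encoder, decoder, L=1):
    """Negative ELBO for a Gaussian VAE: the reconstruction term is a Monte Carlo estimate
    using the reparameterization trick; the KL term is computed analytically against N(0, I)."""
    rho, log_omega2 = encoder(x)                      # parameters of q(z | x, phi)

    # -D_KL(q || N(0, I)) = 0.5 * sum_m (1 + ln(omega_m^2) - rho_m^2 - omega_m^2)
    neg_kl = 0.5 * torch.sum(1 + log_omega2 - rho.pow(2) - log_omega2.exp(), dim=-1)

    # E_q[ln p(x | z, theta)] ≈ (1/L) sum_l ln p(x | z^(l)), with z = rho + omega * eps
    recon = 0.0
    for _ in range(L):
        eps = torch.randn_like(rho)                   # eps ~ N(0, I)
        z = rho + torch.exp(0.5 * log_omega2) * eps   # reparameterization trick
        mu, log_sigma2 = decoder(z)                   # parameters of p(x | z, theta)
        log_px = -0.5 * torch.sum(
            math.log(2 * math.pi) + log_sigma2 + (x - mu).pow(2) / log_sigma2.exp(), dim=-1)
        recon = recon + log_px / L

    return -(recon + neg_kl).mean()                   # minimize the negative ELBO
```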
Visualization of Learned Data Manifold

Different Dimensionalities of Latent Space

Conclusion
• Variational Inference:
– Approximates inference in a latent variable model with simpler, more tractable inference
– Maximizes the evidence lower bound (ELBO), or equivalently minimizes the KL-divergence between the variational distribution and the posterior distribution
– Mean field inference is one simplification, where the variational distribution factorizes
• Latent Dirichlet Allocation:
– Variational inference is an alternative to Gibbs sampling
– Uses a variational distribution that decouples some dependent variables
– The variational distribution factorizes, thus the learning algorithm is parallelizable
• Variational Auto-Encoder:
– An auto-encoder that is also a generative model
– Encoding is not deterministic, but instead is a distribution
– Lends itself naturally to variational inference
– Relies on neural networks to approximate functions
– Widely applicable, e.g., collaborative filtering, topic modeling
References
• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
– Chapter 10.1 (Variational Inference)

• Blei, D. Variational Inference.
– https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf

• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
– https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

• Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).
– https://arxiv.org/pdf/1312.6114.pdf
