Mod4_Slides
Model families:
Autoregressive Models: $p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$
Variational Autoencoders: $p_\theta(x) = \int p_\theta(x, z)\, dz$
Autoregressive models provide tractable likelihoods but no direct
mechanism for learning features
Variational autoencoders can learn feature representations (via latent
variables z) but have intractable marginal likelihoods
Key question: Can we design a latent variable model with tractable
likelihoods? Yes!
Simple Prior to Complex Data Distributions
Desirable properties of any model distribution pθ (x):
Easy-to-evaluate, closed form density (useful for training)
Easy-to-sample (useful for generation)
Many simple distributions satisfy the above properties, e.g., Gaussian and uniform distributions
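As a quick illustration of both properties, here is a minimal sketch (not from the slides, assuming NumPy and SciPy are available): a standard Gaussian has a closed-form density we can evaluate pointwise and can be sampled with a single call.

```python
import numpy as np
from scipy.stats import norm

# Easy-to-evaluate: closed-form (log-)density of a standard Gaussian at arbitrary points.
xs = np.array([-1.0, 0.0, 2.5])
log_densities = norm.logpdf(xs)                    # log N(x; 0, 1) in closed form

# Easy-to-sample: draw i.i.d. samples directly.
samples = norm.rvs(size=10_000, random_state=0)
print(log_densities, samples.mean(), samples.std())
```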
Figure: The matrix $A = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$ maps a unit square to a parallelogram
$z \sim p_Z(z), \qquad x = f_\theta(z), \qquad z = f_\theta^{-1}(x)$
Simple prior pZ (z) that allows for efficient sampling and tractable
likelihood evaluation. E.g., isotropic Gaussian
Invertible transformations with tractable evaluation:
Likelihood evaluation requires efficient evaluation of the $x \mapsto z$ mapping
Sampling requires efficient evaluation of the $z \mapsto x$ mapping
Computing likelihoods also requires the evaluation of determinants of $n \times n$ Jacobian matrices, where $n$ is the data dimensionality
Computing the determinant of an $n \times n$ matrix is $O(n^3)$: prohibitively expensive within a learning loop!
Key idea: Choose transformations so that the resulting Jacobian matrix has special structure. For example, the determinant of a triangular matrix is the product of the diagonal entries, i.e., an $O(n)$ operation
$$J = \frac{\partial f}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial z_1} & \cdots & \frac{\partial f_n}{\partial z_n} \end{pmatrix}$$
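To make the $O(n)$ claim concrete, a minimal numerical sketch (my example, not from the slides): for a triangular Jacobian, the determinant equals the product of the diagonal entries, so no cubic-time determinant routine is needed.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)

# A lower-triangular Jacobian with positive diagonal, as produced by an
# autoregressive-style transformation x_i = f_i(z_1, ..., z_i).
J = np.tril(rng.normal(size=(n, n)))
J[np.diag_indices(n)] = np.abs(J[np.diag_indices(n)]) + 0.1

det_general = np.linalg.det(J)       # generic O(n^3) determinant
det_diagonal = np.prod(np.diag(J))   # O(n) product of diagonal entries
print(np.allclose(det_general, det_diagonal))  # True: same value, far cheaper
```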
So far
Transform simple to complex distributions via sequence of invertible
transformations
Directed latent variable models with marginal likelihood given by the
change of variables formula
Triangular Jacobian permits efficient evaluation of log-likelihoods
Plan for today
Invertible transformations with diagonal Jacobians (NICE, Real-NVP)
Autoregressive Models as Normalizing Flow Models
Invertible CNNs (MintNet)
Gaussianization flows
Case Study: Probability density distillation for efficient learning and
inference in Parallel Wavenet
$\det(J) = 1$

$\det(J) = \prod_{i=1}^{n} s_i$
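In NICE, the first determinant corresponds to additive coupling layers (volume preserving) and the second to the final diagonal rescaling layer. A minimal sketch of both, with a hypothetical shift function standing in for the learned coupling network:

```python
import numpy as np

def additive_coupling(z, shift_net):
    """NICE-style additive coupling: z1 passes through, z2 is shifted by m(z1).

    The Jacobian is triangular with unit diagonal, so det(J) = 1 (volume preserving).
    """
    d = len(z) // 2
    z1, z2 = z[:d], z[d:]
    x = np.concatenate([z1, z2 + shift_net(z1)])
    return x, 0.0                          # log|det J| = 0

def diagonal_rescaling(x, s):
    """Elementwise rescaling; det(J) = prod_i s_i."""
    return s * x, np.sum(np.log(np.abs(s)))

# Hypothetical stand-in for the learned shift network m(.).
rng = np.random.default_rng(0)
z = rng.normal(size=6)
shift_net = lambda z1: np.tanh(z1)
s = np.array([0.5, 2.0, 1.5, 0.8, 1.2, 0.9])

x, logdet_coupling = additive_coupling(z, shift_net)
x, logdet_scale = diagonal_rescaling(x, s)
print(logdet_coupling + logdet_scale)      # total log|det J| = sum(log s_i)
```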
Figure: Samples generated via NICE
$$\det(J) = \prod_{i=d+1}^{n} \exp\big(\alpha_\theta(z_{1:d})_i\big) = \exp\left(\sum_{i=d+1}^{n} \alpha_\theta(z_{1:d})_i\right)$$
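A minimal sketch of a Real-NVP-style affine coupling layer whose log-determinant is exactly the sum above; alpha_net and mu_net below are hypothetical stand-ins for the learned networks $\alpha_\theta$ and $\mu_\theta$:

```python
import numpy as np

def affine_coupling(z, d, alpha_net, mu_net):
    """Real-NVP-style affine coupling layer (sketch).

    x_{1:d}   = z_{1:d}
    x_{d+1:n} = z_{d+1:n} * exp(alpha(z_{1:d})) + mu(z_{1:d})
    log|det J| = sum of alpha(z_{1:d}) over the transformed coordinates.
    """
    z1, z2 = z[:d], z[d:]
    alpha, mu = alpha_net(z1), mu_net(z1)
    x = np.concatenate([z1, z2 * np.exp(alpha) + mu])
    return x, np.sum(alpha)

# Hypothetical stand-ins for the learned networks alpha_theta and mu_theta.
rng = np.random.default_rng(0)
z = rng.normal(size=6)
d = 3
alpha_net = lambda z1: 0.5 * np.tanh(z1)
mu_net = lambda z1: z1 ** 2

x, log_det = affine_coupling(z, d, alpha_net, mu_net)
print(x.shape, log_det)
```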
Using four validation examples $z^{(1)}, z^{(2)}, z^{(3)}, z^{(4)}$, define the interpolated $z$.
Consider a Gaussian autoregressive model such that $p(x_i \mid x_{<i}) = \mathcal{N}\big(\mu_i(x_1, \cdots, x_{i-1}),\ \exp(\alpha_i(x_1, \cdots, x_{i-1}))^2\big)$.
Here, $\mu_i(\cdot)$ and $\alpha_i(\cdot)$ are neural networks for $i > 1$ and constants for $i = 1$.
Sampler for this model (see the sketch after the flow interpretation below):
Sample $z_i \sim \mathcal{N}(0, 1)$ for $i = 1, \cdots, n$
Let $x_1 = \exp(\alpha_1) z_1 + \mu_1$. Compute $\mu_2(x_1), \alpha_2(x_1)$
Let $x_2 = \exp(\alpha_2) z_2 + \mu_2$. Compute $\mu_3(x_1, x_2), \alpha_3(x_1, x_2)$
Let $x_3 = \exp(\alpha_3) z_3 + \mu_3$. ...
Flow interpretation: transforms samples from the standard Gaussian
(z1 , z2 , . . . , zn ) to those generated from the model (x1 , x2 , . . . , xn ) via
invertible transformations (parameterized by µi (·), αi (·))
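Here is the sketch referenced above: the sequential sampler in code, with hypothetical stand-ins for the networks $\mu_i(\cdot)$ and $\alpha_i(\cdot)$ (constants for the first coordinate). Note that the loop is inherently sequential, which is why sampling from such models is slow.

```python
import numpy as np

def sample_autoregressive_gaussian(n, mu_net, alpha_net, rng):
    """Sequential sampler: z_i ~ N(0, 1), then x_i = exp(alpha_i(x_<i)) * z_i + mu_i(x_<i)."""
    z = rng.normal(size=n)
    x = np.zeros(n)
    for i in range(n):
        mu_i = mu_net(x[:i])        # mu_i(x_1, ..., x_{i-1}); a constant when i = 0
        alpha_i = alpha_net(x[:i])
        x[i] = np.exp(alpha_i) * z[i] + mu_i
    return x

# Hypothetical stand-ins for the learned networks (constants for the first coordinate).
mu_net = lambda x_prev: 0.0 if x_prev.size == 0 else float(np.tanh(x_prev).mean())
alpha_net = lambda x_prev: -1.0 if x_prev.size == 0 else float(0.1 * np.cos(x_prev).sum())

print(sample_autoregressive_gaussian(5, mu_net, alpha_net, np.random.default_rng(0)))
```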
Masked Autoregressive Flow (MAF)
Figure: Inverse pass of MAF (left) vs. Forward pass of IAF (right)
Computational tradeoffs
MAF: Fast likelihood evaluation, slow sampling
IAF: Fast sampling, slow likelihood evaluation
MAF is better suited for maximum-likelihood training and density estimation
IAF is better suited for real-time generation
Can we get the best of both worlds?
Training
Step 1: Train teacher model (MAF) via MLE
Step 2: Train student model (IAF) to minimize KL divergence with
teacher
Test-time: Use student model for testing
Improves sampling efficiency over original Wavenet (vanilla
autoregressive model) by 1000x!
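The distillation step can be sketched in one dimension (a toy construction assuming PyTorch, not the actual Parallel Wavenet architecture): the student is an IAF-like affine transform of noise with a tractable log-density, and we minimize a Monte Carlo estimate of KL(student || teacher) using the teacher's tractable log-likelihood.

```python
import torch

# Teacher: a fixed density with tractable log-prob (stands in for the trained teacher model).
teacher = torch.distributions.Normal(loc=2.0, scale=0.5)

# Student: an IAF-like affine transform of noise, x = mu + exp(log_s) * z, z ~ N(0, 1).
mu = torch.zeros(1, requires_grad=True)
log_s = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_s], lr=1e-2)

for step in range(2000):
    z = torch.randn(256)
    x = mu + torch.exp(log_s) * z                      # fast, parallel sampling
    # log p_student(x) via change of variables: log N(z; 0, 1) - log_s
    log_p_student = torch.distributions.Normal(0.0, 1.0).log_prob(z) - log_s
    # Monte Carlo estimate of KL(student || teacher)
    kl = (log_p_student - teacher.log_prob(x)).mean()
    opt.zero_grad()
    kl.backward()
    opt.step()

print(mu.item(), torch.exp(log_s).item())  # should approach (2.0, 0.5)
```

Because the samples x are generated by the student itself, their log-density under the student is free (we already know the noise z that produced each x), which is exactly the regime where IAF shines.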
Let's start with a 1D example. Let the data $\tilde{X}$ have density $p_{\text{data}}$ and cumulative distribution function (CDF) $F_{\text{data}}(a) = \int_{-\infty}^{a} p_{\text{data}}(x)\, dx$.
Inverse CDF trick: If $F_{\text{data}}$ is known, we can sample from $p_{\text{data}}$ via $\tilde{X} = F_{\text{data}}^{-1}(U)$, where $U$ is a uniform random variable on $[0, 1]$.
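A minimal sketch with a CDF that can be inverted in closed form (an Exponential(1) distribution, chosen purely for illustration):

```python
import numpy as np

# Inverse CDF trick for Exponential(1): F(a) = 1 - exp(-a), so F^{-1}(u) = -log(1 - u).
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)   # U ~ Uniform[0, 1)
x = -np.log1p(-u)               # X = F^{-1}(U) is distributed according to p_data

print(x.mean(), x.var())        # both approach 1 for Exponential(1)
```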
Model families
Autoregressive Models: $p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$
Variational Autoencoders: $p_\theta(x) = \int p_\theta(x, z)\, dz$
Normalizing Flow Models: $p_X(x; \theta) = p_Z\big(f_\theta^{-1}(x)\big) \left| \det\left( \dfrac{\partial f_\theta^{-1}(x)}{\partial x} \right) \right|$
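As a sanity check of the change-of-variables formula (my example, not from the slides), take the simple invertible map $x = f(z) = \exp(z)$ with $z \sim \mathcal{N}(0, 1)$; then $f^{-1}(x) = \log x$ and $|\det(\partial f^{-1}/\partial x)| = 1/x$, which recovers the log-normal density.

```python
import numpy as np
from scipy.stats import norm, lognorm

# Change of variables for x = exp(z), z ~ N(0, 1):
#   p_X(x) = p_Z(log x) * |d(log x)/dx| = p_Z(log x) / x
x = np.array([0.5, 1.0, 2.0, 3.0])
p_flow = norm.pdf(np.log(x)) / x
p_reference = lognorm.pdf(x, s=1.0)      # SciPy's log-normal with sigma = 1
print(np.allclose(p_flow, p_reference))  # True
```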
$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{M} \log p_\theta(x_i), \qquad x_1, x_2, \cdots, x_M \sim p_{\text{data}}(x)$$
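As a concrete toy instance of this objective (not from the slides): fitting the mean and scale of a 1D Gaussian by numerically maximizing the summed log-likelihood recovers parameters close to those of the data-generating distribution.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=5_000)   # x_1, ..., x_M ~ p_data

# Maximize sum_i log p_theta(x_i) over theta = (mu, log_sigma).
def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

theta_hat = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0])).x
print(theta_hat[0], np.exp(theta_hat[1]))           # close to 3.0 and 2.0
```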
Case 1: An optimal generative model will give the best sample quality and the highest test log-likelihood
For imperfect models, achieving high log-likelihoods might not always imply good sample quality, and vice versa (Theis et al., 2016)