Mod4_Slides

The document discusses likelihood-based learning, focusing on model families such as autoregressive models and variational autoencoders, highlighting their strengths and weaknesses. It introduces flow models, which use invertible transformations to map simple distributions to complex data distributions, allowing for tractable likelihoods. The document also covers the change of variables formula, the geometry of transformations, and the importance of triangular Jacobians for efficient computation in normalizing flow models.


Recap of likelihood-based learning so far:

Model families:
Autoregressive Models: pθ(x) = ∏_{i=1}^n pθ(xi | x<i)
Variational Autoencoders: pθ(x) = ∫ pθ(x, z) dz
Autoregressive models provide tractable likelihoods but no direct
mechanism for learning features
Variational autoencoders can learn feature representations (via latent
variables z) but have intractable marginal likelihoods
Key question: Can we design a latent variable model with tractable
likelihoods? Yes!
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 2 / 19
Simple Prior to Complex Data Distributions
Desirable properties of any model distribution pθ (x):
Easy-to-evaluate, closed form density (useful for training)
Easy-to-sample (useful for generation)
Many simple distributions satisfy the above properties e.g., Gaussian,
uniform distributions

Unfortunately, data distributions are more complex (multi-modal)

Key idea behind flow models: Map simple distributions (easy to sample and evaluate densities) to complex distributions through an invertible transformation.
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 3 / 19
Variational Autoencoder

A flow model is similar to a variational autoencoder (VAE):


1 Start from a simple prior: z ∼ N (0, I ) = p(z)

2 Transform via p(x | z) = N(µθ(z), Σθ(z))
3 Even though p(z) is simple, the marginal pθ(x) is very complex/flexible. However, pθ(x) = ∫ pθ(x, z) dz is expensive to compute: need to enumerate all z that could have generated x
4 What if we could easily "invert" p(x | z) and compute p(z | x) by design? How? Make x = fθ(z) a deterministic and invertible function of z, so for any x there is a unique corresponding z (no enumeration)
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 4 / 19
Continuous random variables refresher

Let X be a continuous random variable


The cumulative distribution function (CDF) of X is FX(a) = P(X ≤ a)
The probability density function (pdf) of X is pX(a) = F′X(a) = dFX(a)/da
Typically consider parameterized densities:
Gaussian: X ∼ N(µ, σ) if pX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
Uniform: X ∼ U(a, b) if pX(x) = (1/(b−a)) 1[a ≤ x ≤ b]
Etc.
If X is a continuous random vector, we can usually represent it using
its joint probability density function:
Gaussian: if pX(x) = (1/√((2π)^n |Σ|)) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 5 / 19


Change of Variables formula

Let Z be a uniform random variable U[0, 2] with density pZ. What is pZ(1)? It is 1/2.
As a sanity check, ∫_0^2 (1/2) dz = 1
Let X = 4Z , and let pX be its density. What is pX (4)?
pX (4) = p(X = 4) = p(4Z = 4) = p(Z = 1) = pZ (1) = 1/2 Wrong!
Clearly, X is uniform in [0, 8], so pX (4) = 1/8
To get correct result, need to use change of variables formula

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 6 / 19


Change of Variables formula

Change of variables (1D case): If X = f(Z) and f(·) is monotone with inverse Z = f⁻¹(X) = h(X), then:

pX(x) = pZ(h(x)) |h′(x)|

Previous example: If X = f(Z) = 4Z and Z ∼ U[0, 2], what is pX(4)?
Note that h(X) = X/4
pX(4) = pZ(1) |h′(4)| = 1/2 × |1/4| = 1/8
More interesting example: If X = f (Z ) = exp(Z ) and Z ∼ U[0, 2],
what is pX (x)?
Note that h(X ) = ln(X )
pX(x) = pZ(ln(x)) |h′(x)| = 1/(2x) for x ∈ [exp(0), exp(2)]
Note that the "shape" of pX(x) is different (more complex) from that of the prior pZ(z); a numerical sanity check of this example is sketched below.
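The following is a small numerical check of the exp(Z) example above (my own NumPy sketch, not part of the lecture): it compares the density 1/(2x) predicted by the change of variables formula against a histogram of transformed samples.

```python
# Sketch: numerically check pX(x) = 1/(2x) for X = exp(Z), Z ~ U[0, 2].
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 2.0, size=1_000_000)
x = np.exp(z)                      # forward transformation f(z) = exp(z)

# Analytical density from the change-of-variables formula on [1, e^2]
xs = np.linspace(1.1, np.exp(2) - 0.1, 5)
analytic = 1.0 / (2.0 * xs)

# Empirical density from a histogram of the transformed samples
hist, edges = np.histogram(x, bins=200, range=(1.0, np.exp(2)), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = np.interp(xs, centers, hist)

print(np.round(analytic, 4))
print(np.round(empirical, 4))      # should closely match the analytic values
```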
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 7 / 19
Change of Variables formula
Change of variables (1D case): If X = f (Z ) and f (·) is monotone
with inverse Z = f −1 (X ) = h(X ), then:
pX (x) = pZ (h(x))|h′ (x)|

Proof sketch: Assume f (·) is monotonically increasing


FX (x) = p[X ≤ x] = p[f (Z ) ≤ x] = p[Z ≤ h(x)] = FZ (h(x))
Taking derivatives on both sides:
pX(x) = dFX(x)/dx = dFZ(h(x))/dx = pZ(h(x)) h′(x)

Recall from basic calculus that h′(x) = [f⁻¹]′(x) = 1 / f′(f⁻¹(x)). So letting z = h(x) = f⁻¹(x) we can also write

pX(x) = pZ(z) · 1/f′(z)
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 8 / 19
Geometry: Determinants and volumes
Let Z be a uniform random vector in [0, 1]n
Let X = AZ for a square invertible matrix A, with inverse W = A−1 .
How is X distributed?
Geometrically, the matrix A maps the unit hypercube [0, 1]n to a
parallelotope
Hypercube and parallelotope are generalizations of square/cube and parallelogram/parallelepiped to higher dimensions

 
Figure: The matrix A = [a c; b d] maps a unit square to a parallelogram

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 9 / 19


Geometry: Determinants and volumes
The volume of the parallelotope is equal to the absolute value of the
determinant of the matrix A
 
det(A) = det [a c; b d] = ad − bc

Let X = AZ for a square invertible matrix A, with inverse W = A−1 .


X is uniformly distributed over the parallelotope of area |det(A)|.
Hence, we have
pX(x) = pZ(Wx) / |det(A)| = pZ(Wx) |det(W)|
because if W = A⁻¹, det(W) = 1/det(A). Note similarity with 1D case formula.
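As an illustration (my own sketch; the 2×2 matrix A below is an arbitrary choice, not from the slides), the formula pX(x) = pZ(Wx) |det(W)| can be checked directly for a linear transformation of a uniform random vector.

```python
# Sketch: X = A Z with Z ~ Uniform([0,1]^2) has constant density 1/|det(A)|
# on the image parallelogram, i.e. pX(x) = pZ(W x) |det(W)| with W = A^{-1}.
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
W = np.linalg.inv(A)

def p_X(x):
    z = W @ x                                   # map back to the unit square
    inside = np.all((z >= 0.0) & (z <= 1.0))
    return abs(np.linalg.det(W)) if inside else 0.0   # pZ = 1 on [0,1]^2

print(p_X(np.array([1.5, 1.5])))   # 1/|det(A)| = 1/6 ≈ 0.1667 (inside the parallelogram)
print(p_X(np.array([5.0, 5.0])))   # 0.0 (outside the parallelogram)
```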
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 10 / 19
Generalized change of variables
For linear transformations specified via A, change in volume is given
by the determinant of A
For non-linear transformations f(·), the linearized change in volume is
given by the determinant of the Jacobian of f(·).
Change of variables (General case): The mapping between Z and X, given by f : Rn → Rn, is invertible such that X = f(Z) and Z = f⁻¹(X).

pX(x) = pZ(f⁻¹(x)) |det( ∂f⁻¹(x)/∂x )|

Note 0: generalizes the previous 1D case pX (x) = pZ (h(x))|h′ (x)|


Note 1: unlike VAEs, x, z need to be continuous and have the same
dimension. For example, if x ∈ Rn then z ∈ Rn
Note 2: For any invertible matrix A, det(A−1 ) = det(A)−1
pX(x) = pZ(z) |det( ∂f(z)/∂z )|⁻¹
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 11 / 19
Two Dimensional Example

Let Z1 and Z2 be continuous random variables with joint density


pZ1 ,Z2 .
Let u : R2 → R2 be an invertible transformation. Two inputs and two
outputs, denoted u = (u1 , u2 )
Let v = (v1 , v2 ) be its inverse transformation
Let X1 = u1 (Z1 , Z2 ) and X2 = u2 (Z1 , Z2 ) Then, Z1 = v1 (X1 , X2 ) and
Z2 = v2 (X1 , X2 )

pX1,X2(x1, x2)
= pZ1,Z2(v1(x1, x2), v2(x1, x2)) |det [ ∂v1(x1,x2)/∂x1  ∂v1(x1,x2)/∂x2 ; ∂v2(x1,x2)/∂x1  ∂v2(x1,x2)/∂x2 ]|   (inverse)
= pZ1,Z2(z1, z2) |det [ ∂u1(z1,z2)/∂z1  ∂u1(z1,z2)/∂z2 ; ∂u2(z1,z2)/∂z1  ∂u2(z1,z2)/∂z2 ]|⁻¹   (forward)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 12 / 19


Normalizing flow models
Consider a directed, latent-variable model over observed variables X
and latent variables Z
In a normalizing flow model, the mapping between Z and X , given
by fθ : Rn → Rn, is deterministic and invertible such that X = fθ (Z )
and Z = fθ−1 (X )

Using change of variables, the marginal likelihood p(x) is given by


pX(x; θ) = pZ(fθ⁻¹(x)) |det( ∂fθ⁻¹(x)/∂x )|

Note: x, z need to be continuous and have the same dimension.


Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 13 / 19
A Flow of Transformations
Normalizing: Change of variables gives a normalized density after
applying an invertible transformation
Flow: Invertible transformations can be composed with each other

zM = fθM ◦ · · · ◦ fθ1 (z0) = fθM(fθM−1(· · · (fθ1(z0)))) ≜ fθ(z0)

Start with a simple distribution for z0 (e.g., Gaussian)


Apply a sequence of M invertible transformations to finally obtain
x = zM
By change of variables
pX(x; θ) = pZ(fθ⁻¹(x)) ∏_{m=1}^M |det( ∂(fθm)⁻¹(zm)/∂zm )|

(Note: determinant of a product equals the product of determinants; see the sketch below)
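A minimal sketch of this composition rule (my own toy example, with two elementwise invertible maps standing in for learned layers): log-determinants are simply accumulated across the flow.

```python
# Sketch: log pX(x) = log pZ(z0) - sum_m log|det J_{f_m}| for x = f_2(f_1(z0)).
import numpy as np

def affine_fwd(z, a, b):
    # f1(z) = a*z + b elementwise (a != 0); log|det J| = sum(log|a|)
    return a * z + b, np.sum(np.log(np.abs(a)))

def softplus_fwd(z):
    # f2(z) = log(1 + e^z), strictly monotone; d/dz = sigmoid(z)
    x = np.logaddexp(0.0, z)
    return x, np.sum(z - x)        # log sigmoid(z) = z - softplus(z)

rng = np.random.default_rng(0)
z0 = rng.standard_normal(4)                            # z0 ~ N(0, I)
log_pz = -0.5 * np.sum(z0**2) - 0.5 * len(z0) * np.log(2 * np.pi)

z1, ld1 = affine_fwd(z0, a=np.array([2.0, 0.5, 1.5, 3.0]), b=np.zeros(4))
x,  ld2 = softplus_fwd(z1)                             # x = f2(f1(z0))

# change of variables for the composition
print(x, log_pz - (ld1 + ld2))
```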


Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 14 / 19
Planar flows (Rezende & Mohamed, 2016)

Base distribution: Gaussian

Base distribution: Uniform

10 planar transformations can transform simple distributions into a more complex one

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 15 / 19


Learning and Inference
Learning via maximum likelihood over the dataset D
max_θ log pX(D; θ) = Σ_{x∈D} [ log pZ(fθ⁻¹(x)) + log |det( ∂fθ⁻¹(x)/∂x )| ]

Exact likelihood evaluation via the inverse transformation x ↦ z and the change of variables formula
Sampling via forward transformation z ↦ x

z ∼ pZ(z),   x = fθ(z)

Latent representations inferred via inverse transformation (no inference network required!)

z = fθ−1 (x)
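To make the three operations above concrete, here is a hedged sketch using a toy elementwise affine flow x = exp(a) ⊙ z + b (my own example, not a model from the lecture): exact likelihood evaluation via x ↦ z, sampling via z ↦ x, and latent inference via the inverse.

```python
import numpy as np

a = np.array([0.3, -0.2])          # theta = (a, b); exp(a) keeps the scale positive
b = np.array([1.0, -1.0])

def f(z):                          # forward transformation z -> x
    return np.exp(a) * z + b

def f_inv(x):                      # inverse transformation x -> z
    return (x - b) * np.exp(-a)

def log_px(x):                     # exact log-likelihood via change of variables
    z = f_inv(x)
    log_pz = -0.5 * np.sum(z**2, axis=-1) - 0.5 * x.shape[-1] * np.log(2 * np.pi)
    return log_pz + np.sum(-a)     # log|det d f^{-1}/dx| = -sum(a)

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 2))    # sampling: z ~ p_Z, then x = f_theta(z)
x = f(z)
print(log_px(x))                   # training would maximize these over theta
print(np.allclose(f_inv(x), z))    # latent representations recovered exactly
```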

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 16 / 19


Desiderata for flow models

Simple prior pZ (z) that allows for efficient sampling and tractable
likelihood evaluation. E.g., isotropic Gaussian
Invertible transformations with tractable evaluation:
Likelihood evaluation requires efficient evaluation of the x ↦ z mapping
Sampling requires efficient evaluation of the z ↦ x mapping
Computing likelihoods also requires the evaluation of determinants of
n × n Jacobian matrices, where n is the data dimensionality
Computing the determinant for an n × n matrix is O(n³): prohibitively
expensive within a learning loop!
Key idea: Choose transformations so that the resulting Jacobian matrix
has special structure. For example, the determinant of a triangular
matrix is the product of the diagonal entries, i.e., an O(n) operation

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 17 / 19


Triangular Jacobian

x = (x1 , · · · , xn ) = f(z) = (f1 (z), · · · , fn (z))

J = ∂f/∂z = [ ∂f1/∂z1 · · · ∂f1/∂zn ; · · · · · · · · · ; ∂fn/∂z1 · · · ∂fn/∂zn ]

Suppose xi = fi (z) only depends on z≤i . Then


J = ∂f/∂z = [ ∂f1/∂z1  0 · · · 0 ; · · · · · · · · · ; ∂fn/∂z1 · · · ∂fn/∂zn ]

has lower triangular structure, so the determinant can be computed in linear time. Similarly, the Jacobian is upper triangular if xi only depends on z≥i. A small numerical check follows below.
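The check below uses a hand-picked autoregressive map (my own example) and a finite-difference Jacobian to confirm the triangular structure and the O(n) log-determinant.

```python
import numpy as np

def f(z):
    # toy autoregressive map: each output x_i depends only on z_1..z_i
    x1 = 2.0 * z[0]
    x2 = np.tanh(z[0]) + 0.5 * z[1]
    x3 = z[0] * z[1] + np.exp(0.3 * z[2])
    return np.array([x1, x2, x3])

def jacobian(fn, z, eps=1e-6):
    # numerical Jacobian, one column per input dimension
    n = len(z)
    J = np.zeros((n, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        J[:, j] = (fn(z + dz) - fn(z - dz)) / (2 * eps)
    return J

z = np.array([0.4, -1.2, 0.7])
J = jacobian(f, z)
print(np.round(J, 4))                       # entries above the diagonal are (numerically) zero
print(np.sum(np.log(np.abs(np.diag(J)))))   # O(n): sum of log|diagonal entries|
print(np.linalg.slogdet(J)[1])              # matches the full O(n^3) computation
```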

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 18 / 19


Recap of normalizing flow models

So far
Transform simple to complex distributions via sequence of invertible
transformations
Directed latent variable models with marginal likelihood given by the
change of variables formula
Triangular Jacobian permits efficient evaluation of log-likelihoods
Plan for today
Invertible transformations with diagonal Jacobians (NICE, Real-NVP)
Autoregressive Models as Normalizing Flow Models
Invertible CNNs (MintNet)
Gaussianization flows
Case Study: Probability density distillation for efficient learning and
inference in Parallel Wavenet

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 8/1


Designing invertible transformations

NICE or Nonlinear Independent Components Estimation (Dinh et al., 2014) composes two kinds of invertible transformations: additive coupling layers and rescaling layers
Real-NVP (Dinh et al., 2017)
Inverse Autoregressive Flow (Kingma et al., 2016)
Masked Autoregressive Flow (Papamakarios et al., 2017)
i-ResNet (Behrmann et al., 2018)
Glow (Kingma et al., 2018)
MintNet (Song et al., 2019)
And many more

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 9/1


NICE - Additive coupling layers
Partition the variables z into two disjoint subsets, say z1:d and zd+1:n for
any 1 ≤ d < n
Forward mapping z ↦ x:
x1:d = z1:d (identity transformation)
xd+1:n = zd+1:n + mθ (z1:d ) (mθ (·) is a neural network with parameters
θ, d input units, and n − d output units)
Inverse mapping x ↦ z:
z1:d = x1:d (identity transformation)
zd+1:n = xd+1:n − mθ (x1:d )
Jacobian of forward mapping:
J = ∂x/∂z = [ Id  0 ; ∂xd+1:n/∂z1:d  In−d ]

det(J) = 1

Volume preserving transformation since determinant is 1.
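A minimal NumPy sketch of an additive coupling layer (a fixed random MLP stands in for mθ; this is an illustration, not the NICE reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
# small fixed MLP standing in for m_theta: d inputs -> n-d outputs
W1, b1 = rng.standard_normal((8, d)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((n - d, 8)), rng.standard_normal(n - d)

def m_theta(z1d):
    return W2 @ np.tanh(W1 @ z1d + b1) + b2

def coupling_forward(z):               # z -> x
    x = z.copy()
    x[d:] = z[d:] + m_theta(z[:d])     # x_{1:d} = z_{1:d} is kept as the identity part
    return x

def coupling_inverse(x):               # x -> z
    z = x.copy()
    z[d:] = x[d:] - m_theta(x[:d])
    return z

z = rng.standard_normal(n)
x = coupling_forward(z)
print(np.allclose(coupling_inverse(x), z))   # True: exactly invertible
# log|det J| = 0 for this layer (volume preserving), regardless of m_theta
```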


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 10 / 1
NICE - Rescaling layers
Additive coupling layers are composed together (with arbitrary
partitions of variables in each layer)
Final layer of NICE applies a rescaling transformation
Forward mapping z ↦ x:
xi = si zi
where si > 0 is the scaling factor for the i-th dimension.
Inverse mapping x ↦ z:

zi = xi / si

Jacobian of forward mapping:


J = diag(s)

det(J) = ∏_{i=1}^n si
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 11 / 1
Samples generated via NICE

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 12 / 1


Samples generated via NICE

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 13 / 1


Real-NVP: Non-volume preserving extension of NICE
Forward mapping z ↦ x:
x1:d = z1:d (identity transformation)
xd+1:n = zd+1:n ⊙ exp(αθ (z1:d )) + µθ (z1:d )
µθ (·) and αθ (·) are both neural networks with parameters θ, d input
units, and n − d output units [⊙ denotes elementwise product]
Inverse mapping x ↦ z:
z1:d = x1:d (identity transformation)
zd+1:n = (xd+1:n − µθ (x1:d )) ⊙ (exp(−αθ (x1:d )))
Jacobian of forward mapping:
 
J = ∂x/∂z = [ Id  0 ; ∂xd+1:n/∂z1:d  diag(exp(αθ(z1:d))) ]

det(J) = ∏_{i=d+1}^n exp(αθ(z1:d)i) = exp( Σ_{i=d+1}^n αθ(z1:d)i )

Non-volume preserving transformation in general, since the determinant can be less than or greater than 1
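A corresponding sketch of a Real-NVP affine coupling layer (again with fixed random networks standing in for µθ and αθ; my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
# fixed random networks standing in for mu_theta and alpha_theta: d inputs -> n-d outputs
Wm, bm = rng.standard_normal((n - d, d)), rng.standard_normal(n - d)
Wa, ba = rng.standard_normal((n - d, d)), rng.standard_normal(n - d)

def mu(z1d):
    return Wm @ np.tanh(z1d) + bm

def alpha(z1d):
    return Wa @ np.tanh(z1d) + ba

def forward(z):                          # z -> x, also returns log|det J|
    x = z.copy()
    a = alpha(z[:d])
    x[d:] = z[d:] * np.exp(a) + mu(z[:d])
    return x, np.sum(a)                  # log det J = sum_i alpha_theta(z_{1:d})_i

def inverse(x):                          # x -> z
    z = x.copy()
    z[d:] = (x[d:] - mu(x[:d])) * np.exp(-alpha(x[:d]))
    return z

z = rng.standard_normal(n)
x, logdet = forward(z)
print(np.allclose(inverse(x), z))        # True: exactly invertible
print(logdet)                            # generally nonzero: not volume preserving
```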
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 14 / 1
Samples generated via Real-NVP

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 15 / 1


Latent space interpolations via Real-NVP

Using four validation examples z(1), z(2), z(3), z(4), define the interpolated z as:

z = cosϕ(z(1) cosϕ′ + z(2) sinϕ′ ) + sinϕ(z(3) cosϕ′ + z(4) sinϕ′ )

with manifold parameterized by ϕ and ϕ′ .


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 16 / 1
Continuous Autoregressive models as flow models
Consider a Gaussian autoregressive model:
p(x) = ∏_{i=1}^n p(xi | x<i)

such that p(xi | x<i ) = N (µi (x1 , · · · , xi−1 ), exp(αi (x1 , · · · , xi−1 ))2 ).
Here, µi (·) and αi (·) are neural networks for i > 1 and constants for
i = 1.
Sampler for this model:
Sample zi ∼ N (0, 1) for i = 1, · · · , n
Let x1 = exp(α1 )z1 + µ1 . Compute µ2 (x1 ), α2 (x1 )
Let x2 = exp(α2 )z2 + µ2 . Compute µ3 (x1 , x2 ), α3 (x1 , x2 )
Let x3 = exp(α3 )z3 + µ3 . ...
Flow interpretation: transforms samples from the standard Gaussian
(z1 , z2 , . . . , zn ) to those generated from the model (x1 , x2 , . . . , xn ) via
invertible transformations (parameterized by µi (·), αi (·))
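A hedged sketch of this interpretation (toy µi and αi defined inline; my own example): the sequential sampler is an invertible map from z ∼ N(0, I) to x, and the inverse recovers z exactly.

```python
import numpy as np

def mu_alpha(i, x_prev):
    # stand-ins for the networks mu_i(.), alpha_i(.); constants for the first dimension
    if i == 0:
        return 0.0, 0.0
    return 0.5 * np.sum(np.tanh(x_prev)), 0.1 * np.cos(np.sum(x_prev))

def sample(z):                       # forward z -> x (sequential, like the sampler above)
    x = np.zeros_like(z)
    for i in range(len(z)):
        m, a = mu_alpha(i, x[:i])
        x[i] = np.exp(a) * z[i] + m
    return x

def invert(x):                       # inverse x -> z (each mu_i, alpha_i depends only on observed x)
    z = np.zeros_like(x)
    for i in range(len(x)):
        m, a = mu_alpha(i, x[:i])
        z[i] = (x[i] - m) * np.exp(-a)
    return z

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
x = sample(z)
print(np.allclose(invert(x), z))     # True: the sampler is an invertible flow z -> x
```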
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 17 / 1
Masked Autoregressive Flow (MAF)

Forward mapping from z ↦ x:


Let x1 = exp(α1 )z1 + µ1 . Compute µ2 (x1 ), α2 (x1 )
Let x2 = exp(α2 )z2 + µ2 . Compute µ3 (x1 , x2 ), α3 (x1 , x2 )
Sampling is sequential and slow (like autoregressive): O(n) time

Figure adapted from Eric Jang’s blog


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 18 / 1
Masked Autoregressive Flow (MAF)

Inverse mapping from x ↦ z:


Compute all µi , αi (can be done in parallel using e.g., MADE)
Let z1 = (x1 − µ1 )/ exp(α1 ) (scale and shift)
Let z2 = (x2 − µ2 )/ exp(α2 )
Let z3 = (x3 − µ3 )/ exp(α3 ) ...
Jacobian is lower triangular, hence efficient determinant computation
Likelihood evaluation is easy and parallelizable (like MADE)
Layers with different variable orderings can be stacked

Figure adapted from Eric Jang’s blog


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 19 / 1
Inverse Autoregressive Flow (IAF)

Forward mapping from z ↦ x (parallel):


Sample zi ∼ N (0, 1) for i = 1, · · · , n
Compute all µi , αi (can be done in parallel)
Let x1 = exp(α1 )z1 + µ1
Let x2 = exp(α2 )z2 + µ2 ...
Inverse mapping from x ↦ z (sequential):
Let z1 = (x1 − µ1 )/ exp(α1 ). Compute µ2 (z1 ), α2 (z1 )
Let z2 = (x2 − µ2 )/ exp(α2 ). Compute µ3 (z1 , z2 ), α3 (z1 , z2 )
Fast to sample from, slow to evaluate likelihoods of data points (train)
Note: Fast to evaluate likelihoods of a generated point (cache z1 , z2 , . . . , zn )
Figure adapted from Eric Jang’s blog
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 20 / 1
IAF is inverse of MAF

Figure: Inverse pass of MAF (left) vs. Forward pass of IAF (right)

Interchanging z and x in the inverse transformation of MAF gives the forward transformation of IAF
Similarly, forward transformation of MAF is inverse transformation of
IAF
Figure adapted from Eric Jang’s blog
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 21 / 1
IAF vs. MAF

Computational tradeoffs
MAF: Fast likelihood evaluation, slow sampling
IAF: Fast sampling, slow likelihood evaluation
MAF more suited for training based on MLE, density estimation
IAF more suited for real-time generation
Can we get the best of both worlds?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 22 / 1


Parallel Wavenet

Two part training with a teacher and student model


Teacher is parameterized by MAF. Teacher can be efficiently trained
via MLE
Once teacher is trained, initialize a student model parameterized by
IAF. Student model cannot efficiently evaluate density for external
datapoints but allows for efficient sampling
Key observation: IAF can also efficiently evaluate densities of its
own generations (via caching the noise variates z1 , z2 , . . . , zn )

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 23 / 1


Parallel Wavenet

Probability density distillation: Student distribution is trained to minimize the KL divergence between student (s) and teacher (t)

DKL (s, t) = Ex∼s [log s(x) − log t(x)]

Evaluating and optimizing Monte Carlo estimates of this objective requires:
Samples x from student model (IAF)
Density of x assigned by student model
Density of x assigned by teacher model (MAF)
All operations above can be implemented efficiently
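A sketch of the Monte Carlo estimator (the student/teacher interfaces below are hypothetical placeholders; the 1D Gaussians are only a toy check of the estimator, not a WaveNet):

```python
import numpy as np

def kl_estimate(student_sample_with_logp, teacher_logp, num_samples=1000):
    # Monte Carlo estimate of E_{x~s}[ log s(x) - log t(x) ]
    vals = []
    for _ in range(num_samples):
        x, log_s = student_sample_with_logp()   # sample + its own density (cheap for IAF)
        vals.append(log_s - teacher_logp(x))    # teacher density (cheap for MAF, in parallel)
    return float(np.mean(vals))

# toy 1D check: student = N(0, 2^2), teacher = N(0, 1); closed-form KL ≈ 0.807
rng = np.random.default_rng(0)

def student():
    x = 2.0 * rng.standard_normal()
    return x, -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))

def teacher(x):
    return -0.5 * x ** 2 - np.log(np.sqrt(2 * np.pi))

print(kl_estimate(student, teacher, num_samples=100_000))   # ≈ 0.81
```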

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 24 / 1


Parallel Wavenet: Overall algorithm

Training
Step 1: Train teacher model (MAF) via MLE
Step 2: Train student model (IAF) to minimize KL divergence with
teacher
Test-time: Use student model for testing
Improves sampling efficiency over original Wavenet (vanilla
autoregressive model) by 1000x!

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 25 / 1


MintNet (Song et al., 2019)

MintNet: Building invertible neural networks with masked convolutions.
A regular convolutional neural network is powerful, but it is not
invertible and its Jacobian determinant is expensive.
We can instead use masked convolutions like in autoregressive models
to enforce ordering (like PixelCNN)
Because of the ordering, the Jacobian matrix is triangular and the
determinant is efficient to compute.
If all the diagonal elements of the Jacobian matrix are (strictly)
positive, the transformation is invertible.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 26 / 1


MintNet (Song et al., 2019)

Illustration of a masked convolution with 3 filters and kernel size 3 × 3.

Solid checkerboard cubes inside each filter represent unmasked weights, while the transparent blue blocks represent the weights that have been masked out.
The receptive field of each filter on the input feature maps is
indicated by regions shaded with the pattern (the colored square)
below the corresponding filter.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 27 / 1


MintNet (Song et al., 2019)

Uncurated samples on MNIST, CIFAR-10, and ImageNet 32x32 datasets

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 28 / 1


Gaussianization Flows (Meng et al., 2020)

Let X = fθ(Z) be a flow model with Gaussian prior Z ∼ N(0, I) = pZ, and let X̃ ∼ pdata be a random vector distributed according to the true data distribution.
Flow models are trained with maximum likelihood to minimize the KL divergence DKL(pdata ∥ pθ(x)) = DKL(pX̃ ∥ pX). Gaussian samples transformed through fθ should be distributed as the data.
It can be shown that DKL(pX̃ ∥ pX) = DKL(pfθ⁻¹(X̃) ∥ pfθ⁻¹(X)) = DKL(pfθ⁻¹(X̃) ∥ pZ). Data samples transformed through fθ⁻¹ should be distributed as Gaussian.
How can we achieve this?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 29 / 1


Gaussianization Flows (Meng et al., 2020)

Let's start with a 1D example. Let the data X̃ have density pdata and cumulative distribution function (CDF) Fdata(a) = ∫_{−∞}^a pdata.
Inverse CDF trick: If Fdata is known, we can sample from pdata via X̃ = Fdata⁻¹(U) where U ∈ [0, 1] is a uniform random variable.

This means that U = Fdata(X̃) is uniform. We can transform U into a Gaussian using the inverse CDF trick: Φ⁻¹(U) = Φ⁻¹(Fdata(X̃)).
The invertible transformation Φ−1 ◦ Fdata Gaussianizes the data!
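A 1D sketch of this Gaussianization step (my own example; the empirical CDF stands in for the true Fdata, and SciPy is assumed available for Φ⁻¹):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)   # clearly non-Gaussian 1D "data"

# empirical CDF evaluated at the samples (mid-ranks), kept strictly inside (0, 1)
ranks = np.argsort(np.argsort(x))
u = (ranks + 0.5) / len(x)                    # u ≈ F_data(x), approximately uniform

g = norm.ppf(u)                               # Phi^{-1}(F_data(x)): Gaussianized data
print(kstest(g, "norm"))                      # tiny KS statistic: g looks standard normal
```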

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 30 / 1


Gaussianization Flows (Meng et al., 2020)

Step 1: Dimension-wise Gaussianization (Jacobian is a diagonal matrix and is tractable)

Input data Dimension-wise Gaussianization

Note: Even though each dimension is marginally Gaussian, they are not jointly Gaussian. Aside: Approximating this with a Gaussian prior is a shallow flow model known as a copula model (Sklar, 1959).

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 31 / 1


Gaussianization Flows (Meng et al., 2020)

Step 2: apply a rotation matrix to the transformed data (Jacobian is an orthogonal matrix and is tractable)

Input After rotation

Note: N (0, I) is rotationally invariant

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 32 / 1


Gaussianization Flows (Meng et al., 2020)

Gaussianization flow: repeat Step 1 and Step 2 (stacking learnable Gaussian copulas) to transform the data into a normal distribution.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 33 / 1


Recap

Model families
Autoregressive Models: pθ(x) = ∏_{i=1}^n pθ(xi | x<i)
Variational Autoencoders: pθ(x) = ∫ pθ(x, z) dz
Normalizing Flow Models: pX(x; θ) = pZ(fθ⁻¹(x)) |det( ∂fθ⁻¹(x)/∂x )|

All the above families are trained by minimizing KL divergence


DKL (pdata ∥pθ ), or equivalently maximizing likelihoods (or
approximations)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 2/1


Why maximum likelihood?

θ̂ = argmax_θ Σ_{i=1}^M log pθ(xi),   x1, x2, · · · , xM ∼ pdata(x)

Optimal statistical efficiency.


Assume sufficient model capacity, such that there exists a unique
θ∗ ∈ M that satisfies pθ∗ = pdata .
The convergence of θ̂ to θ∗ when M → ∞ is the “fastest” among all
statistical methods when using maximum likelihood training.
Higher likelihood = better lossless compression.
Is the likelihood a good indicator of the quality of samples generated
by the model?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 3/1


Towards likelihood-free learning

Case 1: Optimal generative model will give best sample quality and
highest test log-likelihood
For imperfect models, achieving high log-likelihoods might not always
imply good sample quality, and vice-versa (Theis et al., 2016)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 4/1


Towards likelihood-free learning
Case 2: Great test log-likelihoods, poor samples. E.g., For a discrete
noise mixture model pθ (x) = 0.01pdata (x) + 0.99pnoise (x)
99% of the samples are just noise (most samples are poor)
Taking logs, we get a lower bound
log pθ (x) = log[0.01pdata (x) + 0.99pnoise (x)]
≥ log 0.01pdata (x) = log pdata (x) − log 100

For expected log-likelihoods, we know that


Lower bound
Epdata [log pθ (x)] ≥ Epdata [log pdata (x)] − log 100

Upper bound (via non-negativity of DKL (pdata ∥pθ ) ≥ 0)


Epdata [log pdata (x)] ≥ Epdata [log pθ (x)]
As we increase the dimension n of x = (x1, · · · , xn), the absolute value of log pdata(x) = Σ_{i=1}^n log pdata(xi | x<i) increases proportionally to n but log 100 remains constant. Hence, likelihoods are great: Epdata [log pθ (x)] ≈ Epdata [log pdata (x)] in very high dimensions
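A back-of-the-envelope illustration of this point (the 3 bits/dim figure below is an assumed typical magnitude, not a measurement): the log 100 penalty per dimension becomes negligible as n grows.

```python
import numpy as np

bits_per_dim_data = 3.0                 # assumed typical magnitude of log2 p_data per dimension
for n in [10, 100, 1000, 3072]:         # e.g. 3072 dimensions for a 32x32x3 image
    gap_per_dim = np.log2(100) / n      # extra cost of the 1%/99% mixture, in bits per dimension
    print(n, round(100 * gap_per_dim / bits_per_dim_data, 3), "% relative gap")
```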
Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 5/1
Towards likelihood-free learning

Case 3: Great samples, poor test log-likelihoods. E.g., memorizing the training set
Samples look exactly like the training set (cannot do better!)
Test set will have zero probability assigned (cannot do worse!)
The above cases suggest that it might be useful to disentangle
likelihoods and sample quality
Likelihood-free learning considers alternative training objectives that do not depend directly on a likelihood function

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 6/1


Recap

Model families
Autoregressive Models: pθ(x) = ∏_{i=1}^n pθ(xi | x<i)
Variational Autoencoders: pθ(x) = ∫ pθ(x, z) dz
Normalizing Flow Models: pX(x; θ) = pZ(fθ⁻¹(x)) |det( ∂fθ⁻¹(x)/∂x )|

All the above families are trained by minimizing KL divergence


dKL (pdata ∥pθ ), or equivalently maximizing likelihoods (or
approximations)
Today: alternative choices for d(pdata ∥pθ )

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 7/1
