Mod4_Slides

The document discusses likelihood-based learning, focusing on model families such as autoregressive models and variational autoencoders, highlighting their strengths and weaknesses. It introduces flow models, which use invertible transformations to map simple distributions to complex data distributions, allowing for tractable likelihoods. The document also covers the change of variables formula, the geometry of transformations, and the importance of triangular Jacobians for efficient computation in normalizing flow models.


Recap of likelihood-based learning so far:

Model families:
Autoregressive Models: pθ(x) = ∏_{i=1}^n pθ(xi | x<i)
Variational Autoencoders: pθ(x) = ∫ pθ(x, z) dz
Autoregressive models provide tractable likelihoods but no direct
mechanism for learning features
Variational autoencoders can learn feature representations (via latent
variables z) but have intractable marginal likelihoods
Key question: Can we design a latent variable model with tractable
likelihoods? Yes!
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 2 / 19
Simple Prior to Complex Data Distributions
Desirable properties of any model distribution pθ (x):
Easy-to-evaluate, closed form density (useful for training)
Easy-to-sample (useful for generation)
Many simple distributions satisfy the above properties e.g., Gaussian,
uniform distributions

Unfortunately, data distributions are more complex (multi-modal)

Key idea behind flow models: Map simple distributions (easy to sample and evaluate densities) to complex distributions through an invertible transformation.
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 3 / 19
Variational Autoencoder

A flow model is similar to a variational autoencoder (VAE):


1 Start from a simple prior: z ∼ N (0, I ) = p(z)

2 Transform via p(x | z) = N(µθ(z), Σθ(z))
3 Even though p(z) is simple, the marginal pθ(x) is very complex/flexible. However, pθ(x) = ∫ pθ(x, z) dz is expensive to compute: need to enumerate all z that could have generated x
4 What if we could easily "invert" p(x | z) and compute p(z | x) by design? How? Make x = fθ(z) a deterministic and invertible function of z, so for any x there is a unique corresponding z (no enumeration)
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 4 / 19
Continuous random variables refresher

Let X be a continuous random variable


The cumulative distribution function (CDF) of X is FX(a) = P(X ≤ a)
The probability density function (pdf) of X is pX(a) = F′X(a) = dFX(a)/da
Typically consider parameterized densities:
Gaussian: X ∼ N(µ, σ) if pX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
Uniform: X ∼ U(a, b) if pX(x) = (1/(b−a)) 1[a ≤ x ≤ b]
Etc.
If X is a continuous random vector, we can usually represent it using
its joint probability density function:
Gaussian: if pX(x) = (1/√((2π)^n |Σ|)) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 5 / 19


Change of Variables formula

Let Z be a uniform random variable U[0, 2] with density pZ. What is pZ(1)? It is 1/2.
As a sanity check, ∫_0^2 (1/2) dz = 1
Let X = 4Z , and let pX be its density. What is pX (4)?
pX (4) = p(X = 4) = p(4Z = 4) = p(Z = 1) = pZ (1) = 1/2 Wrong!
Clearly, X is uniform in [0, 8], so pX (4) = 1/8
To get correct result, need to use change of variables formula

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 6 / 19


Change of Variables formula

Change of variables (1D case): If X = f(Z) and f(·) is monotone with inverse Z = f⁻¹(X) = h(X), then:

pX(x) = pZ(h(x)) |h′(x)|

Previous example: If X = f(Z) = 4Z and Z ∼ U[0, 2], what is pX(4)?
Note that h(X) = X/4
pX(4) = pZ(1) |h′(4)| = 1/2 × |1/4| = 1/8
More interesting example: If X = f (Z ) = exp(Z ) and Z ∼ U[0, 2],
what is pX (x)?
Note that h(X ) = ln(X )
pX(x) = pZ(ln(x)) |h′(x)| = 1/(2x) for x ∈ [exp(0), exp(2)]
Note that the "shape" of pX(x) is different (more complex) from that of the prior pZ(z); a numerical sanity check of this example is sketched below.
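The following is a small numerical check of the exp(Z) example above (my own NumPy sketch, not part of the lecture): it compares the density 1/(2x) predicted by the change of variables formula against a histogram of transformed samples.

```python
# Sketch: numerically check pX(x) = 1/(2x) for X = exp(Z), Z ~ U[0, 2].
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 2.0, size=1_000_000)
x = np.exp(z)                      # forward transformation f(z) = exp(z)

# Analytical density from the change-of-variables formula on [1, e^2]
xs = np.linspace(1.1, np.exp(2) - 0.1, 5)
analytic = 1.0 / (2.0 * xs)

# Empirical density from a histogram of the transformed samples
hist, edges = np.histogram(x, bins=200, range=(1.0, np.exp(2)), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = np.interp(xs, centers, hist)

print(np.round(analytic, 4))
print(np.round(empirical, 4))      # should closely match the analytic values
```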
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 7 / 19
Change of Variables formula
Change of variables (1D case): If X = f (Z ) and f (·) is monotone
with inverse Z = f −1 (X ) = h(X ), then:
pX (x) = pZ (h(x))|h′ (x)|

Proof sketch: Assume f (·) is monotonically increasing


FX (x) = p[X ≤ x] = p[f (Z ) ≤ x] = p[Z ≤ h(x)] = FZ (h(x))
Taking derivatives on both sides:
pX(x) = dFX(x)/dx = dFZ(h(x))/dx = pZ(h(x)) h′(x)

Recall from basic calculus that h′(x) = [f⁻¹]′(x) = 1 / f′(f⁻¹(x)). So letting z = h(x) = f⁻¹(x) we can also write

pX(x) = pZ(z) · 1/f′(z)
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 8 / 19
Geometry: Determinants and volumes
Let Z be a uniform random vector in [0, 1]n
Let X = AZ for a square invertible matrix A, with inverse W = A−1 .
How is X distributed?
Geometrically, the matrix A maps the unit hypercube [0, 1]n to a
parallelotope
Hypercube and parallelotope are generalizations of square/cube and parallelogram/parallelepiped to higher dimensions

 
Figure: The matrix A = [a c; b d] maps a unit square to a parallelogram

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 9 / 19


Geometry: Determinants and volumes
The volume of the parallelotope is equal to the absolute value of the
determinant of the matrix A
 
det(A) = det [a c; b d] = ad − bc

Let X = AZ for a square invertible matrix A, with inverse W = A−1 .


X is uniformly distributed over the parallelotope of area |det(A)|.
Hence, we have
pX(x) = pZ(Wx) / |det(A)| = pZ(Wx) |det(W)|
because if W = A⁻¹, det(W) = 1/det(A). Note similarity with 1D case formula.
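As an illustration (my own sketch; the 2×2 matrix A below is an arbitrary choice, not from the slides), the formula pX(x) = pZ(Wx) |det(W)| can be checked directly for a linear transformation of a uniform random vector.

```python
# Sketch: X = A Z with Z ~ Uniform([0,1]^2) has constant density 1/|det(A)|
# on the image parallelogram, i.e. pX(x) = pZ(W x) |det(W)| with W = A^{-1}.
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
W = np.linalg.inv(A)

def p_X(x):
    z = W @ x                                   # map back to the unit square
    inside = np.all((z >= 0.0) & (z <= 1.0))
    return abs(np.linalg.det(W)) if inside else 0.0   # pZ = 1 on [0,1]^2

print(p_X(np.array([1.5, 1.5])))   # 1/|det(A)| = 1/6 ≈ 0.1667 (inside the parallelogram)
print(p_X(np.array([5.0, 5.0])))   # 0.0 (outside the parallelogram)
```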
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 10 / 19
Generalized change of variables
For linear transformations specified via A, change in volume is given
by the determinant of A
For non-linear transformations f(·), the linearized change in volume is
given by the determinant of the Jacobian of f(·).
Change of variables (General case): The mapping between Z and X, given by f : Rn → Rn, is invertible such that X = f(Z) and Z = f⁻¹(X).

pX(x) = pZ(f⁻¹(x)) |det( ∂f⁻¹(x)/∂x )|

Note 0: generalizes the previous 1D case pX (x) = pZ (h(x))|h′ (x)|


Note 1: unlike VAEs, x, z need to be continuous and have the same
dimension. For example, if x ∈ Rn then z ∈ Rn
Note 2: For any invertible matrix A, det(A−1 ) = det(A)−1
pX(x) = pZ(z) |det( ∂f(z)/∂z )|⁻¹
Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 11 / 19
Two Dimensional Example

Let Z1 and Z2 be continuous random variables with joint density


pZ1 ,Z2 .
Let u : R2 → R2 be an invertible transformation. Two inputs and two
outputs, denoted u = (u1 , u2 )
Let v = (v1 , v2 ) be its inverse transformation
Let X1 = u1 (Z1 , Z2 ) and X2 = u2 (Z1 , Z2 ) Then, Z1 = v1 (X1 , X2 ) and
Z2 = v2 (X1 , X2 )

pX1,X2(x1, x2)
= pZ1,Z2(v1(x1, x2), v2(x1, x2)) |det [ ∂v1(x1,x2)/∂x1  ∂v1(x1,x2)/∂x2 ; ∂v2(x1,x2)/∂x1  ∂v2(x1,x2)/∂x2 ]|   (inverse)
= pZ1,Z2(z1, z2) |det [ ∂u1(z1,z2)/∂z1  ∂u1(z1,z2)/∂z2 ; ∂u2(z1,z2)/∂z1  ∂u2(z1,z2)/∂z2 ]|⁻¹   (forward)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 12 / 19


Normalizing flow models
Consider a directed, latent-variable model over observed variables X
and latent variables Z
In a normalizing flow model, the mapping between Z and X , given
by fθ : Rn → Rn, is deterministic and invertible such that X = fθ (Z )
and Z = fθ−1 (X )

Using change of variables, the marginal likelihood p(x) is given by


pX(x; θ) = pZ(fθ⁻¹(x)) |det( ∂fθ⁻¹(x)/∂x )|

Note: x, z need to be continuous and have the same dimension.


Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 13 / 19
A Flow of Transformations
Normalizing: Change of variables gives a normalized density after
applying an invertible transformation
Flow: Invertible transformations can be composed with each other

zM = fθM ◦ · · · ◦ fθ1 (z0) = fθM(fθM−1(· · · (fθ1(z0)))) ≜ fθ(z0)

Start with a simple distribution for z0 (e.g., Gaussian)


Apply a sequence of M invertible transformations to finally obtain
x = zM
By change of variables
pX(x; θ) = pZ(fθ⁻¹(x)) ∏_{m=1}^M |det( ∂(fθm)⁻¹(zm)/∂zm )|

(Note: determinant of a product equals the product of determinants; see the sketch below)
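A minimal sketch of this composition rule (my own toy example, with two elementwise invertible maps standing in for learned layers): log-determinants are simply accumulated across the flow.

```python
# Sketch: log pX(x) = log pZ(z0) - sum_m log|det J_{f_m}| for x = f_2(f_1(z0)).
import numpy as np

def affine_fwd(z, a, b):
    # f1(z) = a*z + b elementwise (a != 0); log|det J| = sum(log|a|)
    return a * z + b, np.sum(np.log(np.abs(a)))

def softplus_fwd(z):
    # f2(z) = log(1 + e^z), strictly monotone; d/dz = sigmoid(z)
    x = np.logaddexp(0.0, z)
    return x, np.sum(z - x)        # log sigmoid(z) = z - softplus(z)

rng = np.random.default_rng(0)
z0 = rng.standard_normal(4)                            # z0 ~ N(0, I)
log_pz = -0.5 * np.sum(z0**2) - 0.5 * len(z0) * np.log(2 * np.pi)

z1, ld1 = affine_fwd(z0, a=np.array([2.0, 0.5, 1.5, 3.0]), b=np.zeros(4))
x,  ld2 = softplus_fwd(z1)                             # x = f2(f1(z0))

# change of variables for the composition
print(x, log_pz - (ld1 + ld2))
```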


Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 14 / 19
Planar flows (Rezende & Mohamed, 2016)

Base distribution: Gaussian

Base distribution: Uniform

10 planar transformations can transform simple distributions into a more complex one

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 15 / 19


Learning and Inference
Learning via maximum likelihood over the dataset D
max_θ log pX(D; θ) = Σ_{x∈D} [ log pZ(fθ⁻¹(x)) + log |det( ∂fθ⁻¹(x)/∂x )| ]

Exact likelihood evaluation via the inverse transformation x ↦ z and the change of variables formula
Sampling via forward transformation z ↦ x

z ∼ pZ(z),   x = fθ(z)

Latent representations inferred via inverse transformation (no inference network required!)

z = fθ−1 (x)
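To make the three operations above concrete, here is a hedged sketch using a toy elementwise affine flow x = exp(a) ⊙ z + b (my own example, not a model from the lecture): exact likelihood evaluation via x ↦ z, sampling via z ↦ x, and latent inference via the inverse.

```python
import numpy as np

a = np.array([0.3, -0.2])          # theta = (a, b); exp(a) keeps the scale positive
b = np.array([1.0, -1.0])

def f(z):                          # forward transformation z -> x
    return np.exp(a) * z + b

def f_inv(x):                      # inverse transformation x -> z
    return (x - b) * np.exp(-a)

def log_px(x):                     # exact log-likelihood via change of variables
    z = f_inv(x)
    log_pz = -0.5 * np.sum(z**2, axis=-1) - 0.5 * x.shape[-1] * np.log(2 * np.pi)
    return log_pz + np.sum(-a)     # log|det d f^{-1}/dx| = -sum(a)

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 2))    # sampling: z ~ p_Z, then x = f_theta(z)
x = f(z)
print(log_px(x))                   # training would maximize these over theta
print(np.allclose(f_inv(x), z))    # latent representations recovered exactly
```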

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 16 / 19


Desiderata for flow models

Simple prior pZ (z) that allows for efficient sampling and tractable
likelihood evaluation. E.g., isotropic Gaussian
Invertible transformations with tractable evaluation:
Likelihood evaluation requires efficient evaluation of the x ↦ z mapping
Sampling requires efficient evaluation of the z ↦ x mapping
Computing likelihoods also requires the evaluation of determinants of
n × n Jacobian matrices, where n is the data dimensionality
Computing the determinant for an n × n matrix is O(n³): prohibitively
expensive within a learning loop!
Key idea: Choose transformations so that the resulting Jacobian matrix
has special structure. For example, the determinant of a triangular
matrix is the product of the diagonal entries, i.e., an O(n) operation

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 17 / 19


Triangular Jacobian

x = (x1 , · · · , xn ) = f(z) = (f1 (z), · · · , fn (z))

J = ∂f/∂z = [ ∂f1/∂z1 · · · ∂f1/∂zn ; · · · · · · · · · ; ∂fn/∂z1 · · · ∂fn/∂zn ]

Suppose xi = fi (z) only depends on z≤i . Then


J = ∂f/∂z = [ ∂f1/∂z1  0 · · · 0 ; · · · · · · · · · ; ∂fn/∂z1 · · · ∂fn/∂zn ]

has lower triangular structure, so the determinant can be computed in linear time. Similarly, the Jacobian is upper triangular if xi only depends on z≥i. A small numerical check follows below.
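The check below uses a hand-picked autoregressive map (my own example) and a finite-difference Jacobian to confirm the triangular structure and the O(n) log-determinant.

```python
import numpy as np

def f(z):
    # toy autoregressive map: each output x_i depends only on z_1..z_i
    x1 = 2.0 * z[0]
    x2 = np.tanh(z[0]) + 0.5 * z[1]
    x3 = z[0] * z[1] + np.exp(0.3 * z[2])
    return np.array([x1, x2, x3])

def jacobian(fn, z, eps=1e-6):
    # numerical Jacobian, one column per input dimension
    n = len(z)
    J = np.zeros((n, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        J[:, j] = (fn(z + dz) - fn(z - dz)) / (2 * eps)
    return J

z = np.array([0.4, -1.2, 0.7])
J = jacobian(f, z)
print(np.round(J, 4))                       # entries above the diagonal are (numerically) zero
print(np.sum(np.log(np.abs(np.diag(J)))))   # O(n): sum of log|diagonal entries|
print(np.linalg.slogdet(J)[1])              # matches the full O(n^3) computation
```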

Stefano Ermon (AI Lab) Deep Generative Models Lecture 7 18 / 19


Recap of normalizing flow models

So far
Transform simple to complex distributions via sequence of invertible
transformations
Directed latent variable models with marginal likelihood given by the
change of variables formula
Triangular Jacobian permits efficient evaluation of log-likelihoods
Plan for today
Invertible transformations with diagonal Jacobians (NICE, Real-NVP)
Autoregressive Models as Normalizing Flow Models
Invertible CNNs (MintNet)
Gaussianization flows
Case Study: Probability density distillation for efficient learning and
inference in Parallel Wavenet

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 8/1


Designing invertible transformations

NICE or Nonlinear Independent Components Estimation (Dinh et al., 2014) composes two kinds of invertible transformations: additive coupling layers and rescaling layers
Real-NVP (Dinh et al., 2017)
Inverse Autoregressive Flow (Kingma et al., 2016)
Masked Autoregressive Flow (Papamakarios et al., 2017)
i-ResNet (Behrmann et al., 2018)
Glow (Kingma et al., 2018)
MintNet (Song et al., 2019)
And many more

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 9/1


NICE - Additive coupling layers
Partition the variables z into two disjoint subsets, say z1:d and zd+1:n for
any 1 ≤ d < n
Forward mapping z ↦ x:
x1:d = z1:d (identity transformation)
xd+1:n = zd+1:n + mθ (z1:d ) (mθ (·) is a neural network with parameters
θ, d input units, and n − d output units)
Inverse mapping x ↦ z:
z1:d = x1:d (identity transformation)
zd+1:n = xd+1:n − mθ (x1:d )
Jacobian of forward mapping:
J = ∂x/∂z = [ Id  0 ; ∂xd+1:n/∂z1:d  In−d ]

det(J) = 1

Volume preserving transformation since determinant is 1.
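A minimal NumPy sketch of an additive coupling layer (a fixed random MLP stands in for mθ; this is an illustration, not the NICE reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
# small fixed MLP standing in for m_theta: d inputs -> n-d outputs
W1, b1 = rng.standard_normal((8, d)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((n - d, 8)), rng.standard_normal(n - d)

def m_theta(z1d):
    return W2 @ np.tanh(W1 @ z1d + b1) + b2

def coupling_forward(z):               # z -> x
    x = z.copy()
    x[d:] = z[d:] + m_theta(z[:d])     # x_{1:d} = z_{1:d} is kept as the identity part
    return x

def coupling_inverse(x):               # x -> z
    z = x.copy()
    z[d:] = x[d:] - m_theta(x[:d])
    return z

z = rng.standard_normal(n)
x = coupling_forward(z)
print(np.allclose(coupling_inverse(x), z))   # True: exactly invertible
# log|det J| = 0 for this layer (volume preserving), regardless of m_theta
```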


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 10 / 1
NICE - Rescaling layers
Additive coupling layers are composed together (with arbitrary
partitions of variables in each layer)
Final layer of NICE applies a rescaling transformation
Forward mapping z ↦ x:
xi = si zi
where si > 0 is the scaling factor for the i-th dimension.
Inverse mapping x ↦ z:

zi = xi / si

Jacobian of forward mapping:


J = diag(s)

det(J) = ∏_{i=1}^n si
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 11 / 1
Samples generated via NICE

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 12 / 1


Samples generated via NICE

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 13 / 1


Real-NVP: Non-volume preserving extension of NICE
Forward mapping z ↦ x:
x1:d = z1:d (identity transformation)
xd+1:n = zd+1:n ⊙ exp(αθ (z1:d )) + µθ (z1:d )
µθ (·) and αθ (·) are both neural networks with parameters θ, d input
units, and n − d output units [⊙ denotes elementwise product]
Inverse mapping x ↦ z:
z1:d = x1:d (identity transformation)
zd+1:n = (xd+1:n − µθ (x1:d )) ⊙ (exp(−αθ (x1:d )))
Jacobian of forward mapping:
 
J = ∂x/∂z = [ Id  0 ; ∂xd+1:n/∂z1:d  diag(exp(αθ(z1:d))) ]

det(J) = ∏_{i=d+1}^n exp(αθ(z1:d)i) = exp( Σ_{i=d+1}^n αθ(z1:d)i )

Non-volume preserving transformation in general, since the determinant can be less than or greater than 1
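A corresponding sketch of a Real-NVP affine coupling layer (again with fixed random networks standing in for µθ and αθ; my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
# fixed random networks standing in for mu_theta and alpha_theta: d inputs -> n-d outputs
Wm, bm = rng.standard_normal((n - d, d)), rng.standard_normal(n - d)
Wa, ba = rng.standard_normal((n - d, d)), rng.standard_normal(n - d)

def mu(z1d):
    return Wm @ np.tanh(z1d) + bm

def alpha(z1d):
    return Wa @ np.tanh(z1d) + ba

def forward(z):                          # z -> x, also returns log|det J|
    x = z.copy()
    a = alpha(z[:d])
    x[d:] = z[d:] * np.exp(a) + mu(z[:d])
    return x, np.sum(a)                  # log det J = sum_i alpha_theta(z_{1:d})_i

def inverse(x):                          # x -> z
    z = x.copy()
    z[d:] = (x[d:] - mu(x[:d])) * np.exp(-alpha(x[:d]))
    return z

z = rng.standard_normal(n)
x, logdet = forward(z)
print(np.allclose(inverse(x), z))        # True: exactly invertible
print(logdet)                            # generally nonzero: not volume preserving
```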
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 14 / 1
Samples generated via Real-NVP

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 15 / 1


Latent space interpolations via Real-NVP

Using four validation examples z(1), z(2), z(3), z(4), define the interpolated z as:

z = cosϕ(z(1) cosϕ′ + z(2) sinϕ′ ) + sinϕ(z(3) cosϕ′ + z(4) sinϕ′ )

with manifold parameterized by ϕ and ϕ′ .


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 16 / 1
Continuous Autoregressive models as flow models
Consider a Gaussian autoregressive model:
p(x) = ∏_{i=1}^n p(xi | x<i)

such that p(xi | x<i ) = N (µi (x1 , · · · , xi−1 ), exp(αi (x1 , · · · , xi−1 ))2 ).
Here, µi (·) and αi (·) are neural networks for i > 1 and constants for
i = 1.
Sampler for this model:
Sample zi ∼ N (0, 1) for i = 1, · · · , n
Let x1 = exp(α1 )z1 + µ1 . Compute µ2 (x1 ), α2 (x1 )
Let x2 = exp(α2 )z2 + µ2 . Compute µ3 (x1 , x2 ), α3 (x1 , x2 )
Let x3 = exp(α3 )z3 + µ3 . ...
Flow interpretation: transforms samples from the standard Gaussian
(z1 , z2 , . . . , zn ) to those generated from the model (x1 , x2 , . . . , xn ) via
invertible transformations (parameterized by µi (·), αi (·))
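A hedged sketch of this interpretation (toy µi and αi defined inline; my own example): the sequential sampler is an invertible map from z ∼ N(0, I) to x, and the inverse recovers z exactly.

```python
import numpy as np

def mu_alpha(i, x_prev):
    # stand-ins for the networks mu_i(.), alpha_i(.); constants for the first dimension
    if i == 0:
        return 0.0, 0.0
    return 0.5 * np.sum(np.tanh(x_prev)), 0.1 * np.cos(np.sum(x_prev))

def sample(z):                       # forward z -> x (sequential, like the sampler above)
    x = np.zeros_like(z)
    for i in range(len(z)):
        m, a = mu_alpha(i, x[:i])
        x[i] = np.exp(a) * z[i] + m
    return x

def invert(x):                       # inverse x -> z (each mu_i, alpha_i depends only on observed x)
    z = np.zeros_like(x)
    for i in range(len(x)):
        m, a = mu_alpha(i, x[:i])
        z[i] = (x[i] - m) * np.exp(-a)
    return z

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
x = sample(z)
print(np.allclose(invert(x), z))     # True: the sampler is an invertible flow z -> x
```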
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 17 / 1
Masked Autoregressive Flow (MAF)

Forward mapping from z ↦ x:


Let x1 = exp(α1 )z1 + µ1 . Compute µ2 (x1 ), α2 (x1 )
Let x2 = exp(α2 )z2 + µ2 . Compute µ3 (x1 , x2 ), α3 (x1 , x2 )
Sampling is sequential and slow (like autoregressive): O(n) time

Figure adapted from Eric Jang’s blog


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 18 / 1
Masked Autoregressive Flow (MAF)

Inverse mapping from x ↦ z:


Compute all µi , αi (can be done in parallel using e.g., MADE)
Let z1 = (x1 − µ1 )/ exp(α1 ) (scale and shift)
Let z2 = (x2 − µ2 )/ exp(α2 )
Let z3 = (x3 − µ3 )/ exp(α3 ) ...
Jacobian is lower triangular, hence efficient determinant computation
Likelihood evaluation is easy and parallelizable (like MADE)
Layers with different variable orderings can be stacked

Figure adapted from Eric Jang’s blog


Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 19 / 1
Inverse Autoregressive Flow (IAF)

Forward mapping from z ↦ x (parallel):


Sample zi ∼ N (0, 1) for i = 1, · · · , n
Compute all µi , αi (can be done in parallel)
Let x1 = exp(α1 )z1 + µ1
Let x2 = exp(α2 )z2 + µ2 ...
Inverse mapping from x ↦ z (sequential):
Let z1 = (x1 − µ1 )/ exp(α1 ). Compute µ2 (z1 ), α2 (z1 )
Let z2 = (x2 − µ2 )/ exp(α2 ). Compute µ3 (z1 , z2 ), α3 (z1 , z2 )
Fast to sample from, slow to evaluate likelihoods of data points (train)
Note: Fast to evaluate likelihoods of a generated point (cache z1 , z2 , . . . , zn )
Figure adapted from Eric Jang’s blog
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 20 / 1
IAF is inverse of MAF

Figure: Inverse pass of MAF (left) vs. Forward pass of IAF (right)

Interchanging z and x in the inverse transformation of MAF gives the forward transformation of IAF
Similarly, forward transformation of MAF is inverse transformation of
IAF
Figure adapted from Eric Jang’s blog
Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 21 / 1
IAF vs. MAF

Computational tradeoffs
MAF: Fast likelihood evaluation, slow sampling
IAF: Fast sampling, slow likelihood evaluation
MAF more suited for training based on MLE, density estimation
IAF more suited for real-time generation
Can we get the best of both worlds?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 22 / 1


Parallel Wavenet

Two part training with a teacher and student model


Teacher is parameterized by MAF. Teacher can be efficiently trained
via MLE
Once teacher is trained, initialize a student model parameterized by
IAF. Student model cannot efficiently evaluate density for external
datapoints but allows for efficient sampling
Key observation: IAF can also efficiently evaluate densities of its
own generations (via caching the noise variates z1 , z2 , . . . , zn )

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 23 / 1


Parallel Wavenet

Probability density distillation: Student distribution is trained to minimize the KL divergence between student (s) and teacher (t)

DKL (s, t) = Ex∼s [log s(x) − log t(x)]

Evaluating and optimizing Monte Carlo estimates of this objective requires:
Samples x from student model (IAF)
Density of x assigned by student model
Density of x assigned by teacher model (MAF)
All operations above can be implemented efficiently
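A sketch of the Monte Carlo estimator (the student/teacher interfaces below are hypothetical placeholders; the 1D Gaussians are only a toy check of the estimator, not a WaveNet):

```python
import numpy as np

def kl_estimate(student_sample_with_logp, teacher_logp, num_samples=1000):
    # Monte Carlo estimate of E_{x~s}[ log s(x) - log t(x) ]
    vals = []
    for _ in range(num_samples):
        x, log_s = student_sample_with_logp()   # sample + its own density (cheap for IAF)
        vals.append(log_s - teacher_logp(x))    # teacher density (cheap for MAF, in parallel)
    return float(np.mean(vals))

# toy 1D check: student = N(0, 2^2), teacher = N(0, 1); closed-form KL ≈ 0.807
rng = np.random.default_rng(0)

def student():
    x = 2.0 * rng.standard_normal()
    return x, -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))

def teacher(x):
    return -0.5 * x ** 2 - np.log(np.sqrt(2 * np.pi))

print(kl_estimate(student, teacher, num_samples=100_000))   # ≈ 0.81
```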

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 24 / 1


Parallel Wavenet: Overall algorithm

Training
Step 1: Train teacher model (MAF) via MLE
Step 2: Train student model (IAF) to minimize KL divergence with
teacher
Test-time: Use student model for testing
Improves sampling efficiency over original Wavenet (vanilla
autoregressive model) by 1000x!

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 25 / 1


MintNet (Song et al., 2019)

MintNet: Building invertible neural networks with masked convolutions.
A regular convolutional neural network is powerful, but it is not
invertible and its Jacobian determinant is expensive.
We can instead use masked convolutions like in autoregressive models
to enforce ordering (like PixelCNN)
Because of the ordering, the Jacobian matrix is triangular and the
determinant is efficient to compute.
If all the diagonal elements of the Jacobian matrix are (strictly)
positive, the transformation is invertible.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 26 / 1


MintNet (Song et al., 2019)

Illustration of a masked convolution with 3 filters and kernel size 3 × 3.

Solid checkerboard cubes inside each filter represent unmasked weights, while the transparent blue blocks represent the weights that have been masked out.
The receptive field of each filter on the input feature maps is
indicated by regions shaded with the pattern (the colored square)
below the corresponding filter.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 27 / 1


MintNet (Song et al., 2019)

Uncurated samples on MNIST, CIFAR-10, and ImageNet 32x32 datasets

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 28 / 1


Gaussianization Flows (Meng et al., 2020)

Let X = fθ(Z) be a flow model with Gaussian prior Z ∼ N(0, I) = pZ, and let X̃ ∼ pdata be a random vector distributed according to the true data distribution.
Flow models are trained with maximum likelihood to minimize the KL divergence DKL(pdata ∥ pθ(x)) = DKL(pX̃ ∥ pX). Gaussian samples transformed through fθ should be distributed as the data.
It can be shown that DKL(pX̃ ∥ pX) = DKL(pfθ⁻¹(X̃) ∥ pfθ⁻¹(X)) = DKL(pfθ⁻¹(X̃) ∥ pZ). Data samples transformed through fθ⁻¹ should be distributed as Gaussian.
How can we achieve this?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 29 / 1


Gaussianization Flows (Meng et al., 2020)

Let's start with a 1D example. Let the data X̃ have density pdata and cumulative distribution function (CDF) Fdata(a) = ∫_{−∞}^a pdata.
Inverse CDF trick: If Fdata is known, we can sample from pdata via X̃ = Fdata⁻¹(U) where U ∈ [0, 1] is a uniform random variable.

This means that U = Fdata(X̃) is uniform. We can transform U into a Gaussian using the inverse CDF trick: Φ⁻¹(U) = Φ⁻¹(Fdata(X̃)).
The invertible transformation Φ−1 ◦ Fdata Gaussianizes the data!
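A 1D sketch of this Gaussianization step (my own example; the empirical CDF stands in for the true Fdata, and SciPy is assumed available for Φ⁻¹):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)   # clearly non-Gaussian 1D "data"

# empirical CDF evaluated at the samples (mid-ranks), kept strictly inside (0, 1)
ranks = np.argsort(np.argsort(x))
u = (ranks + 0.5) / len(x)                    # u ≈ F_data(x), approximately uniform

g = norm.ppf(u)                               # Phi^{-1}(F_data(x)): Gaussianized data
print(kstest(g, "norm"))                      # tiny KS statistic: g looks standard normal
```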

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 30 / 1


Gaussianization Flows (Meng et al., 2020)

Step 1: Dimension-wise Gaussianization (Jacobian is a diagonal matrix and is tractable)

Input data Dimension-wise Gaussianization

Note: Even though each dimension is marginally Gaussian, they are not jointly Gaussian. Aside: Approximating this with a Gaussian prior is a shallow flow model known as a copula model (Sklar, 1959).

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 31 / 1


Gaussianization Flows (Meng et al., 2020)

Step 2: apply a rotation matrix to the transformed data (Jacobian is an orthogonal matrix and is tractable)

Input After rotation

Note: N (0, I) is rotationally invariant

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 32 / 1


Gaussianization Flows (Meng et al., 2020)

Gaussianization flow: repeat Step 1 and Step 2 (stacking learnable Gaussian copulas) to transform the data into a normal distribution.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 8 33 / 1


Recap

Model families
Autoregressive Models: pθ(x) = ∏_{i=1}^n pθ(xi | x<i)
Variational Autoencoders: pθ(x) = ∫ pθ(x, z) dz
Normalizing Flow Models: pX(x; θ) = pZ(fθ⁻¹(x)) |det( ∂fθ⁻¹(x)/∂x )|

All the above families are trained by minimizing KL divergence


DKL (pdata ∥pθ ), or equivalently maximizing likelihoods (or
approximations)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 2/1


Why maximum likelihood?

θ̂ = argmax_θ Σ_{i=1}^M log pθ(xi),   x1, x2, · · · , xM ∼ pdata(x)

Optimal statistical efficiency.


Assume sufficient model capacity, such that there exists a unique
θ∗ ∈ M that satisfies pθ∗ = pdata .
The convergence of θ̂ to θ∗ when M → ∞ is the “fastest” among all
statistical methods when using maximum likelihood training.
Higher likelihood = better lossless compression.
Is the likelihood a good indicator of the quality of samples generated
by the model?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 3/1


Towards likelihood-free learning

Case 1: Optimal generative model will give best sample quality and
highest test log-likelihood
For imperfect models, achieving high log-likelihoods might not always
imply good sample quality, and vice-versa (Theis et al., 2016)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 4/1


Towards likelihood-free learning
Case 2: Great test log-likelihoods, poor samples. E.g., For a discrete
noise mixture model pθ (x) = 0.01pdata (x) + 0.99pnoise (x)
99% of the samples are just noise (most samples are poor)
Taking logs, we get a lower bound
log pθ (x) = log[0.01pdata (x) + 0.99pnoise (x)]
≥ log 0.01pdata (x) = log pdata (x) − log 100

For expected log-likelihoods, we know that


Lower bound
Epdata [log pθ (x)] ≥ Epdata [log pdata (x)] − log 100

Upper bound (via non-negativity of DKL (pdata ∥pθ ) ≥ 0)


Epdata [log pdata (x)] ≥ Epdata [log pθ (x)]
As we increase the dimension n of x = (x1, · · · , xn), the absolute value of log pdata(x) = Σ_{i=1}^n log pdata(xi | x<i) increases proportionally to n but log 100 remains constant. Hence, likelihoods are great: Epdata [log pθ (x)] ≈ Epdata [log pdata (x)] in very high dimensions
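A back-of-the-envelope illustration of this point (the 3 bits/dim figure below is an assumed typical magnitude, not a measurement): the log 100 penalty per dimension becomes negligible as n grows.

```python
import numpy as np

bits_per_dim_data = 3.0                 # assumed typical magnitude of log2 p_data per dimension
for n in [10, 100, 1000, 3072]:         # e.g. 3072 dimensions for a 32x32x3 image
    gap_per_dim = np.log2(100) / n      # extra cost of the 1%/99% mixture, in bits per dimension
    print(n, round(100 * gap_per_dim / bits_per_dim_data, 3), "% relative gap")
```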
Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 5/1
Towards likelihood-free learning

Case 3: Great samples, poor test log-likelihoods. E.g., memorizing the training set
Samples look exactly like the training set (cannot do better!)
Test set will have zero probability assigned (cannot do worse!)
The above cases suggest that it might be useful to disentangle
likelihoods and sample quality
Likelihood-free learning considers alternative training objectives that do not depend directly on a likelihood function

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 6/1


Recap

Model families
Autoregressive Models: pθ(x) = ∏_{i=1}^n pθ(xi | x<i)
Variational Autoencoders: pθ(x) = ∫ pθ(x, z) dz
Normalizing Flow Models: pX(x; θ) = pZ(fθ⁻¹(x)) |det( ∂fθ⁻¹(x)/∂x )|

All the above families are trained by minimizing KL divergence


dKL (pdata ∥pθ ), or equivalently maximizing likelihoods (or
approximations)
Today: alternative choices for d(pdata ∥pθ )

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 7/1
