
Deep Generative Models
Lecture 14
Roman Isachenko
Moscow Institute of Physics and Technology
2022 – 2023
Recap of previous lecture
Adjoint functions:
a_z(t) = ∂L/∂z(t);   a_θ(t) = ∂L/∂θ(t).

Theorem (Pontryagin)
da_z(t)/dt = −a_z(t)^T · ∂f(z(t), t, θ)/∂z;   da_θ(t)/dt = −a_z(t)^T · ∂f(z(t), t, θ)/∂θ.

Forward pass
z(t_1) = ∫_{t_0}^{t_1} f(z(t), t, θ) dt + z_0   ⇒ ODE Solver

Backward pass (all three integrals are computed jointly by a single ODE Solver)
∂L/∂θ(t_0) = a_θ(t_0) = −∫_{t_1}^{t_0} a_z(t)^T · ∂f(z(t), t, θ)/∂θ(t) dt + 0;
∂L/∂z(t_0) = a_z(t_0) = −∫_{t_1}^{t_0} a_z(t)^T · ∂f(z(t), t, θ)/∂z(t) dt + ∂L/∂z(t_1);
z(t_0) = −∫_{t_1}^{t_0} f(z(t), t, θ) dt + z_1.

Chen R. T. Q. et al. Neural Ordinary Differential Equations, 2018


Recap of previous lecture
Continuous-in-time normalizing flows
dz(t)/dt = f(z(t), t, θ);   d log p(z(t), t)/dt = −tr( ∂f(z(t), t, θ)/∂z(t) ).

Theorem (Picard)
If f is uniformly Lipschitz continuous in z and continuous in t, then the ODE has a unique solution.

Forward transform + log-density
[x; log p(x|θ)] = [z; log p(z)] + ∫_{t_0}^{t_1} [ f(z(t), t, θ); −tr( ∂f(z(t), t, θ)/∂z(t) ) ] dt.

Hutchinson’s trace estimator
log p(z(t_1)) = log p(z(t_0)) − ∫_{t_0}^{t_1} E_{p(ϵ)} [ ϵ^T (∂f/∂z) ϵ ] dt.

Grathwohl W. et al. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models, 2018
Recap of previous lecture
SDE basics
Let us define a stochastic process x(t) with initial condition x(0) ∼ p_0(x):
dx = f(x, t) dt + g(t) dw,
where w(t) is the standard Wiener process (Brownian motion):
w(t) − w(s) ∼ N(0, (t − s) · I),   dw = ϵ · √dt, where ϵ ∼ N(0, I).

Langevin dynamics
Let x_0 be a random vector. Then, under mild regularity conditions and for small enough η, samples from the dynamics
x_{t+1} = x_t + (η/2) · ∇_{x_t} log p(x_t|θ) + √η · ϵ,   ϵ ∼ N(0, I),
will come from p(x|θ).
The density p(x|θ) is a stationary distribution for the Langevin SDE.

Welling M. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011
Stochastic differential equation (SDE)

Statement
Let x_0 be a random vector. Then samples from the dynamics
x_{t+1} = x_t + (η/2) · ∇_{x_t} log p(x_t|θ) + √η · ϵ,   ϵ ∼ N(0, I),
will come from p(x|θ) under mild regularity conditions, for small enough η and large enough t.

The density p(x|θ) is a stationary distribution for this SDE.

Song Y. Generative Modeling by Estimating Gradients of the Data Distribution, blog post, 2021
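To make the update above concrete, here is a minimal NumPy sketch of Langevin dynamics sampling. It assumes some score estimator score_fn(x) ≈ ∇_x log p(x|θ) is available; the function name and hyperparameters are illustrative, not part of the lecture.

import numpy as np

def langevin_sampling(score_fn, x0, eta=1e-2, n_steps=1000):
    """Iterate x <- x + (eta/2) * score(x) + sqrt(eta) * eps, with eps ~ N(0, I)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        eps = np.random.randn(*x.shape)
        x = x + 0.5 * eta * score_fn(x) + np.sqrt(eta) * eps
    return x

# Sanity check: for a standard Gaussian the score is -x, so the chain
# should produce approximately standard normal samples.
sample = langevin_sampling(lambda x: -x, x0=np.zeros(2))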
Outline

1. Score matching

2. Noise conditioned score network

3. Gaussian diffusion process

Generative models zoo
(taxonomy diagram)
▶ Likelihood-based models
  ▶ Tractable density: autoregressive models, normalizing flows
  ▶ Approximate density: VAEs, diffusion models
▶ Implicit density models: GANs
Score matching
We could sample from the model using Langevin dynamics if we had access to ∇_x log p(x|θ).

Fisher divergence
D_F(π, p) = (1/2) · E_π ∥∇_x log p(x|θ) − ∇_x log π(x)∥_2^2 → min_θ

Let us introduce the score function s(x, θ) = ∇_x log p(x|θ).

Problem: we do not know ∇_x log π(x).

Song Y. Generative Modeling by Estimating Gradients of the Data Distribution, blog post, 2021
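In practice the score function is approximated by a neural network whose output has the same dimensionality as its input. A minimal PyTorch sketch is given below; the architecture (a small MLP for 2D toy data) is purely illustrative and not specified in the lecture.

import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """A toy score model s(x, θ) ≈ ∇_x log p(x|θ) for 2D data."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),   # output lives in the same space as x
        )

    def forward(self, x):
        return self.net(x)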
Score matching
Theorem (implicit score matching)
Under some regularity conditions, it holds that
(1/2) · E_π ∥s(x, θ) − ∇_x log π(x)∥_2^2 = E_π [ (1/2) · ∥s(x, θ)∥_2^2 + tr(∇_x s(x, θ)) ] + const.

Proof (only for 1D)
E_π [s(x) − ∇_x log π(x)]^2 = E_π [ s(x)^2 + (∇_x log π(x))^2 − 2 · s(x) · ∇_x log π(x) ]

E_π [s(x) · ∇_x log π(x)] = ∫ π(x) · ∇_x log p(x) · ∇_x log π(x) dx
= ∫ ∇_x log p(x) · ∇_x π(x) dx
= [π(x) · ∇_x log p(x)]_{−∞}^{+∞} − ∫ ∇_x^2 log p(x) · π(x) dx
= −E_π [∇_x^2 log p(x)] = −E_π [∇_x s(x)],
where the boundary term vanishes under the regularity conditions.

Hence
(1/2) · E_π [s(x) − ∇_x log π(x)]^2 = E_π [ (1/2) · s(x)^2 + ∇_x s(x) ] + const.

Hyvarinen A. Estimation of non-normalized statistical models by score matching, 2005


Score matching
Theorem (implicit score matching)
(1/2) · E_π ∥s(x, θ) − ∇_x log π(x)∥_2^2 = E_π [ (1/2) · ∥s(x, θ)∥_2^2 + tr(∇_x s(x, θ)) ] + const

Here ∇_x s(x, θ) = ∇_x^2 log p(x|θ) is a Hessian matrix.
1. The left-hand side is intractable due to the unknown π(x) – addressed by denoising score matching.
2. The right-hand side is expensive due to the Hessian matrix – addressed by sliced score matching.

Sliced score matching (Hutchinson’s trace estimation)
tr(∇_x s(x, θ)) = E_{p(ϵ)} [ ϵ^T · ∇_x s(x, θ) · ϵ ]

Song Y. Sliced Score Matching: A Scalable Approach to Density and Score Estimation, 2019
Song Y. Generative Modeling by Estimating Gradients of the Data Distribution, blog post, 2021
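The trace term can be estimated with vector–Jacobian products, so the Hessian is never materialized. Below is a minimal PyTorch sketch of the sliced score matching objective; it assumes a score model such as the ScoreNet above, and the function name and defaults are illustrative.

import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    """(1/2)·||s(x, θ)||² + ϵᵀ ∇_x s(x, θ) ϵ, averaged over the batch."""
    x = x.requires_grad_(True)
    s = score_net(x)                                   # (batch, dim)
    loss = 0.5 * (s ** 2).sum(dim=1)
    for _ in range(n_projections):
        eps = torch.randn_like(x)                      # random projection ϵ ~ N(0, I)
        # ϵᵀ ∇_x s(x, θ) ϵ via a vector–Jacobian product (no explicit Hessian)
        vjp = torch.autograd.grad((s * eps).sum(), x, create_graph=True)[0]
        loss = loss + (vjp * eps).sum(dim=1) / n_projections
    return loss.mean()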
Denoising score matching

Let us perturb the original data x ∼ π(x) with Gaussian noise:
x′ = x + σ · ϵ,   ϵ ∼ N(0, I),   p(x′|x, σ) = N(x′|x, σ^2 · I),
π(x′|σ) = ∫ π(x) · p(x′|x, σ) dx.

Then the solution of
(1/2) · E_{π(x′|σ)} ∥s(x′, θ, σ) − ∇_{x′} log π(x′|σ)∥_2^2 → min_θ
satisfies s(x′, θ, σ) ≈ s(x′, θ, 0) = s(x, θ) if σ is small enough.

Vincent P. A Connection Between Score Matching and Denoising Autoencoders, 2010


Denoising score matching

Theorem
E_{π(x′|σ)} ∥s(x′, θ, σ) − ∇_{x′} log π(x′|σ)∥_2^2 =
= E_{π(x)} E_{p(x′|x,σ)} ∥s(x′, θ, σ) − ∇_{x′} log p(x′|x, σ)∥_2^2 + const(θ)

Gradient of the noise kernel
∇_{x′} log p(x′|x, σ) = ∇_{x′} log N(x′|x, σ^2 · I) = −(x′ − x)/σ^2

▶ The RHS does not require computing ∇_{x′} log π(x′|σ), or even ∇_{x′} log π(x′).
▶ s(x′, θ, σ) tries to denoise a corrupted sample x′.
▶ The score function s(x′, θ, σ) is parametrized by σ. How do we build it?

Vincent P. A Connection Between Score Matching and Denoising Autoencoders, 2010
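Putting the theorem and the noise-kernel gradient together gives a simple regression objective. Here is a minimal PyTorch sketch for a single noise level; it assumes a score model score_net(x, sigma) conditioned on σ (a hypothetical signature — how σ-conditioning is handled is the topic of the next section).

import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching for one noise level σ."""
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps                     # x′ = x + σ·ϵ
    target = -(x_noisy - x) / sigma ** 2          # ∇_{x′} log p(x′|x, σ) = −(x′ − x)/σ²
    s = score_net(x_noisy, sigma)
    return 0.5 * ((s - target) ** 2).sum(dim=1).mean()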


Denoising score matching
▶ If σ is small, the score estimate is inaccurate in low-density regions and Langevin dynamics will probably fail to jump between modes.
▶ If σ is large, the perturbation covers low-density regions and helps with multimodal distributions, but we learn a too heavily corrupted distribution.

Song Y. Generative Modeling by Estimating Gradients of the Data Distribution, blog post, 2021
Noise conditioned score network
▶ Define a sequence of noise levels: σ_1 > σ_2 > · · · > σ_L.
▶ Perturb the original data with the different noise levels to get π(x′|σ_1), . . . , π(x′|σ_L).
▶ Train a denoising score function s(x′, θ, σ_l) for each noise level:
Σ_{l=1}^{L} σ_l^2 · E_{π(x)} E_{p(x′|x,σ_l)} ∥s(x′, θ, σ_l) − ∇_{x′} log p(x′|x, σ_l)∥_2^2 → min_θ
▶ Sample with annealed Langevin dynamics (for l = 1, . . . , L).

Song Y. et al. Generative Modeling by Estimating Gradients of the Data Distribution, 2019
Noise conditioned score network
Training: loss function
Σ_{l=1}^{L} σ_l^2 · E_{π(x)} E_ϵ ∥ s_l + ϵ/σ_l ∥_2^2,
where
▶ s_l = s(x + σ_l · ϵ, θ, σ_l);
▶ ∇_{x′} log p(x′|x, σ_l) = −(x′ − x)/σ_l^2 = −ϵ/σ_l.

Inference: annealed Langevin dynamics, running the Langevin updates for l = 1, . . . , L, from the largest noise level to the smallest (a sketch follows below).
(figure: generated samples)

Song Y. et al. Improved Techniques for Training Score-Based Generative Models, 2020
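A minimal PyTorch sketch of annealed Langevin dynamics is given below. It assumes a trained score_net(x, sigma); the step-size schedule step_l = η · σ_l² / σ_L² is a common choice (an assumption here, not stated on the slide), and all names and constants are illustrative.

import torch

@torch.no_grad()
def annealed_langevin(score_net, shape, sigmas, eta=2e-5, steps_per_level=100):
    """sigmas must be ordered from the largest noise level to the smallest."""
    x = torch.randn(shape)                              # start from pure noise
    for sigma in sigmas:                                # σ_1 > σ_2 > ... > σ_L
        step = eta * (sigma / sigmas[-1]) ** 2          # assumed step-size schedule
        for _ in range(steps_per_level):
            noise = torch.randn_like(x)
            x = x + 0.5 * step * score_net(x, sigma) + step ** 0.5 * noise
    return x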
Forward Gaussian diffusion process
Let x_0 = x ∼ π(x), β ∈ (0, 1). Define the Markov chain
x_t = √(1 − β) · x_{t−1} + √β · ϵ,   where ϵ ∼ N(0, I);
q(x_t|x_{t−1}) = N(x_t | √(1 − β) · x_{t−1}, β · I).

Statement 1
Applying the Markov chain to samples from any π(x), we get x_∞ ∼ p_∞(x) = N(0, I). Here p_∞(x) is a stationary distribution:
p_∞(x) = ∫ q(x|x′) · p_∞(x′) dx′.

Statement 2
Denote ᾱ_t = Π_{s=1}^{t} (1 − β_s). Then
x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ϵ,   where ϵ ∼ N(0, I),
q(x_t|x_0) = N(x_t | √ᾱ_t · x_0, (1 − ᾱ_t) · I).
We can sample at any timestamp using only x_0!

Sohl-Dickstein J. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015
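Statement 2 is what makes training practical: x_t can be sampled in closed form from x_0. A minimal PyTorch sketch is below; the linear β-schedule and the tensor shapes are assumptions made only for illustration.

import torch

def sample_xt(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) = N(√ᾱ_t · x_0, (1 − ᾱ_t) · I)."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]    # ᾱ_t = Π_{s≤t} (1 − β_s)
    eps = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps

betas = torch.linspace(1e-4, 0.02, 1000)                # assumed linear schedule
x0 = torch.randn(16, 3, 32, 32)                         # stand-in for a data batch
xt = sample_xt(x0, t=500, betas=betas)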
Forward Gaussian diffusion process
Diffusion refers to the flow of particles from high-density regions towards low-density regions.

1. x_0 = x ∼ π(x);
2. x_t = √(1 − β) · x_{t−1} + √β · ϵ, where ϵ ∼ N(0, I), t ≥ 1;
3. x_T ∼ p_∞(x) = N(0, I), where T ≫ 1.

If we are able to invert this process, we obtain a way to sample x ∼ π(x) starting from noise p_∞(x) = N(0, I).
Now our goal is to revert this process.

Das A. An introduction to Diffusion Probabilistic Models, blog post, 2021
Reverse Gaussian diffusion process

Let us define the reverse process
p(x_{t−1}|x_t, θ) = N(x_{t−1} | μ(x_t, θ, t), σ^2(x_t, θ, t)).

Forward process:
1. x_0 = x ∼ π(x);
2. x_t = √(1 − β) · x_{t−1} + √β · ϵ, where ϵ ∼ N(0, I), t ≥ 1;
3. x_T ∼ p_∞(x) = N(0, I).

Reverse process:
1. x_T ∼ p_∞(x) = N(0, I);
2. x_{t−1} = μ(x_t, θ, t) + σ(x_t, θ, t) · ϵ;
3. x_0 = x ∼ π(x).

Note: the forward process does not have any learnable parameters!

Weng L. What are Diffusion Models?, blog post, 2021
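Given learned mean and scale functions, ancestral sampling from the reverse process is a simple loop. The sketch below assumes hypothetical callables mu(x, t) and sigma(x, t); dropping the noise at the final step is a common convention, not something stated on the slide.

import torch

@torch.no_grad()
def reverse_sampling(mu, sigma, shape, T):
    """Run x_T → x_{T−1} → ... → x_0 using x_{t−1} = μ(x_t, t) + σ(x_t, t)·ϵ."""
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mu(x, t) + sigma(x, t) * eps
    return x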
Gaussian diffusion model as VAE

▶ Let us treat z = (x_1, . . . , x_T) as a latent variable (note: each x_t has the same size as x).
▶ Variational posterior distribution (note: it has no learnable parameters):
q(z|x) = q(x_1, . . . , x_T|x_0) = Π_{t=1}^{T} q(x_t|x_{t−1}).
▶ Probabilistic model:
p(x, z|θ) = p(x|z, θ) · p(z|θ).
▶ Generative distribution and prior:
p(x|z, θ) = p(x_0|x_1, θ);   p(z|θ) = Π_{t=2}^{T} p(x_{t−1}|x_t, θ) · p(x_T).

Das A. An introduction to Diffusion Probabilistic Models, blog post, 2021
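With this VAE view, the standard evidence lower bound applies directly to the definitions above (this bound is not written on the slide, but it is the usual one for any latent-variable model):
log p(x|θ) ≥ E_{q(z|x)} [ log p(x|z, θ) ] − KL( q(z|x) ∥ p(z|θ) ),
and training the diffusion model then amounts to maximizing this bound over θ.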
Summary
▶ Score matching proposes to minimize the Fisher divergence in order to learn the score function.
▶ Sliced score matching and denoising score matching are two techniques that make fitting the Fisher divergence scalable.
▶ A noise conditioned score network uses multiple noise levels and annealed Langevin dynamics to fit the score function.
▶ The Gaussian diffusion process is a Markov chain that injects a special form of Gaussian noise into the samples.
▶ The reverse process allows us to sample from the real distribution π(x) using samples from noise.
