Deep Generative Models
Lecture 14
Roman Isachenko
2022 – 2023
Recap of previous lecture
Adjoint functions:
$$a_z(t) = \frac{\partial L}{\partial z(t)}, \qquad a_\theta(t) = \frac{\partial L}{\partial \theta(t)}.$$
Theorem (Pontryagin)
$$\frac{d a_z(t)}{dt} = -a_z(t)^T \cdot \frac{\partial f(z(t), t, \theta)}{\partial z}; \qquad \frac{d a_\theta(t)}{dt} = -a_z(t)^T \cdot \frac{\partial f(z(t), t, \theta)}{\partial \theta}.$$
Forward pass
$$z(t_1) = \int_{t_0}^{t_1} f(z(t), t, \theta)\, dt + z_0 \quad \Rightarrow \quad \text{ODE Solver}$$
Backward pass
$$\frac{\partial L}{\partial \theta(t_0)} = a_\theta(t_0) = -\int_{t_1}^{t_0} a_z(t)^T \frac{\partial f(z(t), t, \theta)}{\partial \theta(t)}\, dt + 0$$
$$\frac{\partial L}{\partial z(t_0)} = a_z(t_0) = -\int_{t_1}^{t_0} a_z(t)^T \frac{\partial f(z(t), t, \theta)}{\partial z(t)}\, dt + \frac{\partial L}{\partial z(t_1)} \quad \Rightarrow \quad \text{ODE Solver}$$
$$z(t_0) = -\int_{t_1}^{t_0} f(z(t), t, \theta)\, dt + z_1.$$
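A minimal numerical sketch of the forward and backward passes above (not part of the original slides): the dynamics f(z, t, θ) = θz is a toy linear choice, a fixed-step Euler loop stands in for the ODE solver, and the function names (euler_forward, euler_backward) are illustrative.

```python
# Sketch of the adjoint method for a toy linear ODE dz/dt = f(z, t, theta) = theta @ z.
# A fixed-step Euler integrator stands in for the ODE solver; all names are illustrative.
import numpy as np

def f(z, t, theta):
    return theta @ z                                  # toy linear dynamics

def euler_forward(z0, t0, t1, theta, n_steps=1000):
    # Forward pass: z(t1) = z0 + int_{t0}^{t1} f(z, t, theta) dt
    h = (t1 - t0) / n_steps
    traj = [z0.copy()]
    for k in range(n_steps):
        traj.append(traj[-1] + h * f(traj[-1], t0 + k * h, theta))
    return traj                                       # traj[-1] approximates z(t1)

def euler_backward(traj, t0, t1, theta, dL_dz1):
    # Backward pass: integrate the adjoint ODEs from t1 back to t0.
    #   d a_z / dt     = -a_z^T df/dz      (here df/dz = theta)
    #   d a_theta / dt = -a_z^T df/dtheta  (here row i of df/dtheta holds z in columns of theta's row i)
    n_steps = len(traj) - 1
    h = (t1 - t0) / n_steps
    a_z = dL_dz1.copy()                               # a_z(t1) = dL/dz(t1)
    a_theta = np.zeros(theta.size)                    # a_theta(t1) = 0
    for k in range(n_steps, 0, -1):
        z = traj[k]
        df_dtheta = np.kron(np.eye(z.size), z[None, :])   # Jacobian of theta @ z w.r.t. vec(theta)
        a_theta += h * a_z @ df_dtheta
        a_z = a_z + h * a_z @ theta
    return a_z, a_theta.reshape(theta.shape)          # dL/dz(t0), dL/dtheta

theta = np.array([[0.0, 1.0], [-1.0, 0.0]])           # rotation dynamics
traj = euler_forward(np.array([1.0, 0.0]), 0.0, 1.0, theta)
dL_dz0, dL_dtheta = euler_backward(traj, 0.0, 1.0, theta, dL_dz1=np.ones(2))
```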
Langevin dynamics
Let x_0 be a random vector. Then, under mild regularity conditions and for small enough η, samples from the dynamics
$$x_{t+1} = x_t + \frac{\eta}{2} \nabla_{x_t} \log p(x_t | \theta) + \sqrt{\eta} \cdot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
will come from p(x|θ). The density p(x|θ) is a stationary distribution of the Langevin SDE.

Welling M. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011
Stochastic differential equation (SDE)
Statement
Let x_0 be a random vector. Then samples from the dynamics
$$x_{t+1} = x_t + \frac{\eta}{2} \nabla_{x_t} \log p(x_t | \theta) + \sqrt{\eta} \cdot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
will come from p(x|θ) under mild regularity conditions, for small enough η and large enough t.
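A minimal sketch of this sampler, assuming a target whose score is known in closed form (a standard Gaussian, where ∇x log p(x) = −x); names and step counts are illustrative.

```python
# Minimal sketch of Langevin dynamics sampling. The target here is a standard
# Gaussian, whose score is known in closed form: grad_x log p(x) = -x.
import numpy as np

def langevin_sample(score_fn, x0, eta=1e-2, n_steps=10_000, rng=None):
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)
        x = x + 0.5 * eta * score_fn(x) + np.sqrt(eta) * eps
    return x

# Usage: start far from the target; after enough steps the samples look like N(0, I).
samples = langevin_sample(lambda x: -x, x0=np.full((1000, 2), 5.0))
print(samples.mean(axis=0), samples.std(axis=0))      # ≈ [0, 0] and ≈ [1, 1]
```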
Outline
1. Score matching
Generative models zoo
▶ Tractable density: Autoregressive models, Normalizing Flows
▶ Approximate density: VAEs, Diffusion models
▶ GANs
Score matching
We could sample from the model using Langevin dynamics if we
have ∇x log p(x|θ).
Fisher divergence
$$D_F(\pi, p) = \frac{1}{2} E_\pi \bigl\| \nabla_x \log p(x|\theta) - \nabla_x \log \pi(x) \bigr\|_2^2 \rightarrow \min_\theta$$
Let us introduce the score function s(x, θ) = ∇x log p(x|θ). Then
$$\frac{1}{2} E_\pi \bigl\| s(x, \theta) - \nabla_x \log \pi(x) \bigr\|_2^2 = E_\pi \Bigl[ \frac{1}{2} \bigl\| s(x, \theta) \bigr\|_2^2 + \mathrm{tr}\bigl(\nabla_x s(x, \theta)\bigr) \Bigr] + \text{const}.$$
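This identity can be checked numerically in a toy setting. The sketch below (an illustration, not from the lecture) takes π = N(0, I_d) and a linear model score s(x, θ) = −θx with scalar θ, so that tr(∇x s) = −θd; the two objectives should then differ by a constant (here d/2) for every θ.

```python
# Numerical check of the score matching identity for pi = N(0, I_d) and a
# linear model score s(x, theta) = -theta * x (theta is a scalar here).
# For this choice, tr(grad_x s) = -theta * d and grad_x log pi(x) = -x.
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 200_000
x = rng.standard_normal((n, d))                        # samples from pi

def explicit_obj(theta):                               # needs grad_x log pi (usually unknown)
    return 0.5 * np.mean(np.sum((-theta * x - (-x)) ** 2, axis=1))

def implicit_obj(theta):                               # needs only the model score
    return np.mean(0.5 * np.sum((-theta * x) ** 2, axis=1) - theta * d)

for theta in (0.5, 1.0, 2.0):
    print(theta, explicit_obj(theta) - implicit_obj(theta))   # ≈ d / 2 for every theta
```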
Theorem
$$E_{\pi(x'|\sigma)} \bigl\| s(x', \theta, \sigma) - \nabla_{x'} \log \pi(x'|\sigma) \bigr\|_2^2 = E_{\pi(x)} E_{p(x'|x,\sigma)} \bigl\| s(x', \theta, \sigma) - \nabla_{x'} \log p(x'|x, \sigma) \bigr\|_2^2 + \text{const}(\theta)$$
▶ The RHS does not require computing ∇x′ log π(x′|σ), or even ∇x′ log π(x′).
▶ s(x′, θ, σ) learns to denoise a corrupted sample x′ (see the sketch below).
▶ The score function s(x′, θ, σ) is parametrized by σ. How do we incorporate σ into the model?
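Assuming the standard Gaussian corruption kernel p(x′|x, σ) = N(x′ | x, σ²I), the conditional score is available in closed form: ∇x′ log p(x′|x, σ) = −(x′ − x)/σ². A minimal sketch of the resulting denoising objective follows; the score model here is a placeholder lambda rather than a neural network.

```python
# Sketch of the denoising score matching objective with Gaussian corruption
# p(x'|x, sigma) = N(x'| x, sigma^2 I), so grad_{x'} log p(x'|x, sigma) = -(x' - x) / sigma^2.
import numpy as np

def dsm_loss(s, x_batch, sigma, rng):
    eps = rng.standard_normal(x_batch.shape)
    x_corrupt = x_batch + sigma * eps                  # x' ~ p(x'|x, sigma)
    target = -(x_corrupt - x_batch) / sigma**2         # conditional score, equals -eps / sigma
    diff = s(x_corrupt, sigma) - target
    return 0.5 * np.mean(np.sum(diff**2, axis=-1))

# Usage with a toy "score model" for pi = N(0, I): s(x', sigma) = -x' / (1 + sigma^2).
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 2))
loss = dsm_loss(lambda x_, sig: -x_ / (1 + sig**2), x, sigma=0.1, rng=rng)
```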
Denoising score matching
▶ If σ is small, the score function is not accurate and Langevin
dynamics will probably fail to jump between modes.
Song Y. et al. Improved Techniques for Training Score-Based Generative Models, 2020
Forward Gaussian diffusion process
Let x_0 = x ∼ π(x), β_t ∈ (0, 1). Define the Markov chain
$$x_t = \sqrt{1 - \beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I);$$
$$q(x_t | x_{t-1}) = \mathcal{N}\bigl(x_t \,\big|\, \sqrt{1 - \beta_t} \cdot x_{t-1},\; \beta_t \cdot I\bigr).$$
Statement 1
Applying this Markov chain to samples from any π(x), we get x_∞ ∼ p_∞(x) = N(0, I). Here p_∞(x) is a stationary distribution:
$$p_\infty(x) = \int q(x | x')\, p_\infty(x')\, dx'.$$
Statement 2
Denote $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$. Then
$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I);$$
$$q(x_t | x_0) = \mathcal{N}\bigl(x_t \,\big|\, \sqrt{\bar{\alpha}_t} \cdot x_0,\; (1 - \bar{\alpha}_t) \cdot I\bigr).$$
We can sample x_t at any timestep using only x_0 (sketched below)!

Sohl-Dickstein J. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015
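Statement 2 is what makes training practical: x_t can be drawn in a single step from x_0. A minimal sketch, where the linear β schedule is an illustrative assumption:

```python
# Sample x_t directly from x_0 via q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).
# The linear beta schedule is only an illustrative choice.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)                     # beta_1, ..., beta_T
alpha_bars = np.cumprod(1.0 - betas)                   # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    # One-shot sample from q(x_t | x_0); t is 1-based.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 2)) * 3.0 + 5.0          # toy data from some pi(x)
x_half = q_sample(x0, t=T // 2, rng=rng)               # partially noised
x_T = q_sample(x0, t=T, rng=rng)                       # ≈ N(0, I), since alpha_bar_T ≈ 0
```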
Forward Gaussian diffusion process
Diffusion refers to the flow of particles from high-density regions towards low-density regions.
1. x_0 = x ∼ π(x);
2. x_t = √(1 − β_t) · x_{t−1} + √β_t · ε, where ε ∼ N(0, I), t ≥ 1;
3. x_T ∼ p_∞(x) = N(0, I) for T ≫ 1 (see the sketch below).
If we are able to invert this process, we obtain a way to sample x ∼ π(x) starting from noise samples p_∞(x) = N(0, I). Our goal now is to revert this process.

Das A. An Introduction to Diffusion Probabilistic Models, blog post, 2021
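The same process written as the step-by-step chain from items 1–3: running it for T ≫ 1 steps drives any starting distribution towards N(0, I). A minimal sketch with an illustrative β schedule:

```python
# Step-by-step forward diffusion: repeatedly apply x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps.
# After many steps the samples are indistinguishable from N(0, I), whatever pi(x) was.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)                     # illustrative noise schedule

x = rng.standard_normal((10_000, 2)) * 3.0 + 5.0       # x_0 from a toy pi(x)
for beta_t in betas:
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps

print(x.mean(axis=0), np.cov(x.T))                     # ≈ zero mean, ≈ identity covariance
```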
Reverse Gaussian diffusion process