
Tutorial on Diusion Models for Imaging and Vision

Stanley Chan1

March 28, 2024

arXiv:2403.18103v1 [cs.LG] 26 Mar 2024

Abstract. The astonishing growth of generative tools in recent years has empowered many exciting applications in text-to-image generation and text-to-video generation. The underlying principle behind these generative tools is the concept of diffusion, a particular sampling mechanism that has overcome some shortcomings that were deemed difficult in the previous approaches. The goal of this tutorial is to discuss the essential ideas underlying the diffusion models. The target audience of this tutorial includes undergraduate and graduate students who are interested in doing research on diffusion models or applying these models to solve other problems.

Contents

1  The Basics: Variational Auto-Encoder (VAE)  2
   1.1  VAE Setting  2
   1.2  Evidence Lower Bound  4
   1.3  Training VAE  7
   1.4  Loss Function  9
   1.5  Inference with VAE  9

2  Denoising Diffusion Probabilistic Model (DDPM)  10
   2.1  Building Blocks  10
   2.2  The magical scalars αt and 1 − αt  13
   2.3  Distribution qϕ(xt | x0)  14
   2.4  Evidence Lower Bound  15
   2.5  Rewrite the Consistency Term  18
   2.6  Derivation of qϕ(xt−1 | xt, x0)  20
   2.7  Training and Inference  23
   2.8  Derivation based on Noise Vector  25
   2.9  Inversion by Direct Denoising (InDI)  27

3  Score-Matching Langevin Dynamics (SMLD)  30
   3.1  Langevin Dynamics  30
   3.2  (Stein’s) Score Function  33
   3.3  Score Matching Techniques  35

4  Stochastic Differential Equation (SDE)  39
   4.1  Motivating Examples  39
   4.2  Forward and Backward Iterations in SDE  41
   4.3  Stochastic Differential Equation for DDPM  43
   4.4  Stochastic Differential Equation for SMLD  45
   4.5  Solving SDE  46

5  Conclusion  49

1 School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907.

Email: stanchan@purdue.edu.



1 The Basics: Variational Auto-Encoder (VAE)
1.1 VAE Setting
A long time ago, in a galaxy far far away, we want to build a generator that generates images from a latent
code. The simplest (and perhaps one of the most classical) approach is to consider an encoder-decoder pair
shown below. This is called a variational autoencoder (VAE) [1, 2, 3].

The autoencoder has an input variable x and a latent variable z. For the sake of understanding the
subject, we treat x as a beautiful image and z as some kind of vector living in some high dimensional space.

Example. Getting a latent representation of an image is not an alien thing. Back in the time of JPEG compression (which is arguably a dinosaur), we use the discrete cosine transform (DCT) basis φn to encode the underlying image / patches of an image. The coefficient vector z = [z1, . . . , zN]^T is obtained by projecting the patch x onto the space spanned by the basis: zn = ⟨φn, x⟩. So, if you give us an image x, we will return you a coefficient vector z. From z we can do the inverse transform to recover (i.e., decode) the image. Therefore, the coefficient vector z is the latent code. The encoder is the DCT transform, and the decoder is the inverse DCT transform.
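The encode/decode round trip described above can be sketched in a few lines of Python. This is only an illustration added here, not part of the tutorial; it assumes a hypothetical 8×8 grayscale patch and uses SciPy's orthonormal DCT so that the inverse transform recovers the patch exactly.

```python
import numpy as np
from scipy.fft import dctn, idctn

# A hypothetical 8x8 grayscale patch x with values in [0, 1].
rng = np.random.default_rng(0)
x = rng.random((8, 8))

# Encoder: project the patch onto the DCT basis, z_n = <phi_n, x>.
z = dctn(x, norm="ortho")        # the latent code (DCT coefficients)

# Decoder: inverse DCT maps the latent code back to the patch.
x_hat = idctn(z, norm="ortho")

print(np.allclose(x, x_hat))     # True: the DCT pair is a lossless encoder/decoder
```

(Real JPEG additionally quantizes the coefficients, which is where the compression and the loss come from; the sketch above skips that step.)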

The name “variational” comes from the fact that we use probability distributions to describe x and z. Instead of resorting to a deterministic procedure of converting x to z, we are more interested in ensuring that the distribution p(x) can be mapped to a desired distribution p(z), and go backwards to p(x). Because of the distributional setting, we need to consider a few distributions.
• p(x): The distribution of x. It is never known. If we knew it, we would have become a billionaire. The whole galaxy of diffusion models is to find ways to draw samples from p(x).
• p(z): The distribution of the latent variable. Because we are all lazy, let’s just make it a zero-mean
unit-variance Gaussian p(z) = N (0, I).
• p(z|x): The conditional distribution associated with the encoder, which tells us the likelihood of z
when given x. We have no access to it. p(z|x) itself is not the encoder, but the encoder has to do
something so that it will behave consistently with p(z|x).
• p(x|z): The conditional distribution associated with the decoder, which tells us the posterior proba-
bility of getting x given z. Again, we have no access to it.
The four distributions above are not too mysterious. Here is a somewhat trivial but educational example
that can illustrate the idea.

Example. Consider a random variable X distributed according to a Gaussian mixture model with a latent variable z ∈ {1, . . . , K} denoting the cluster identity such that pZ(k) = P[Z = k] = πk for k = 1, . . . , K. We assume ∑_{k=1}^{K} πk = 1. Then, if we are told that we need to look at the k-th cluster only, the conditional distribution of X given Z is

pX|Z(x|k) = N(x | µk, σk² I).



The marginal distribution of x can be found using the law of total probability, giving us
pX(x) = ∑_{k=1}^{K} pX|Z(x|k) pZ(k) = ∑_{k=1}^{K} πk N(x | µk, σk² I).     (1)

Therefore, if we start with pX(x), the design question for the encoder is to build a magical encoder such that for every sample x ∼ pX(x), the latent code will be z ∈ {1, . . . , K} with distribution z ∼ pZ(k).
To illustrate how the encoder and decoder work, let's assume that the mean and variance are known and are fixed. Otherwise we will need to estimate the mean and variance through an EM algorithm. It is doable, but the tedious equations will defeat the purpose of this illustration.
Encoder: How do we obtain z from x? This is easy because at the encoder, we know pX(x) and pZ(k). Imagine that you only have two classes z ∈ {1, 2}. Effectively you are just making a binary decision of which class the sample x belongs to. There are many ways you can make the binary decision. If you like maximum-a-posteriori, you can check

pZ|X(1|x)  ≷  pZ|X(2|x)      (decide class 1 if the left-hand side is larger, and class 2 otherwise),

and this will return you a simple decision rule. You give us x, we tell you z ∈ {1, 2}.
Decoder: On the decoder side, if we are given a latent code z ∈ {1, . . . , K}, the magical decoder
just needs to return us a sample x which is drawn from pX|Z(x|k) = N(x | µk, σk² I). A different z will
give us one of the K mixture components. If we have enough samples, the overall distribution will
follow the Gaussian mixture.
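To make the two steps concrete, here is a minimal sketch (not from the tutorial) of the MAP encoder and the sampling decoder for a two-component mixture in one dimension. The mixture parameters πk, µk, σk are made-up numbers chosen purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# Made-up parameters of a 1-D Gaussian mixture with K = 2 components.
pi    = np.array([0.3, 0.7])     # p_Z(k)
mu    = np.array([-2.0, 3.0])    # mu_k
sigma = np.array([1.0, 0.5])     # sigma_k

def encode(x):
    """MAP encoder: pick k maximizing p_{Z|X}(k|x), proportional to pi_k * N(x | mu_k, sigma_k^2)."""
    posterior = pi * norm.pdf(x, loc=mu, scale=sigma)
    return int(np.argmax(posterior))            # latent code z, here indexed 0..K-1

def decode(k, rng):
    """Decoder: draw a sample from the k-th component p_{X|Z}(x|k) = N(x | mu_k, sigma_k^2)."""
    return rng.normal(mu[k], sigma[k])

rng = np.random.default_rng(0)
x = 2.8                  # an observed sample
z = encode(x)            # -> 1, the component centered at 3.0
x_new = decode(z, rng)   # a fresh sample drawn from that component
```

If we repeatedly draw z from pZ(k) and decode, the histogram of the generated samples follows the Gaussian mixture, exactly as described above.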
Smart readers like you will certainly complain: “Your example is so trivially unreal.” No worries. We
understand. Life is of course a lot harder than a Gaussian mixture model with known means and known
variance. But one thing we realize is that if we want to nd the magical encoder and decoder, we must have
a way to nd the two conditional distributions. However, they are both high-dimensional creatures. So,
in order for us to say something more meaningful, we need to impose additional structures so that we can
generalize the concept to harder problems.
In the literature of VAE, people come up with an idea to consider the following two proxy distributions:
• qϕ (z|x): The proxy for p(z|x). We will make it a Gaussian. Why Gaussian? No particular good
reason. Perhaps we are just ordinary (aka lazy) human beings.
• pθ (x|z): The proxy for p(x|z). Believe it or not, we will make it a Gaussian too. But the role of this
Gaussian is slightly different from the Gaussian qϕ (z|x). While we will need to estimate the mean
and variance for the Gaussian qϕ (z|x), we do not need to estimate anything for the Gaussian pθ (x|z).
Instead, we will need a decoder neural network to turn z into x. The Gaussian pθ (x|z) will be used to
inform us how good our generated image x is.
The relationship between the input x and the latent z, as well as the conditional distributions, are
summarized in Figure 1. There are two nodes x and z. The “forward” relationship is specied by p(z|x)
(and approximated by qϕ (z|x)), whereas the “reverse” relationship is specied by p(x|z) (and approximated
by pθ (x|z)).

Figure 1: In a variational autoencoder, the variables x and z are connected by the conditional distributions
p(x|z) and p(z|x). To make things work, we introduce two proxy distributions pθ(x|z) and qϕ(z|x), respectively.



Example. It’s time to consider another trivial example. Suppose that we have a random variable x
and a latent variable z such that

x ∼ N(x | µ, σ²),
z ∼ N(z | 0, 1).

Our goal is to construct a VAE. (What?! This problem has a trivial solution where z = (x − µ)/σ and
x = µ + σz. You are absolutely correct. But please follow our derivation to see if the VAE framework
makes sense.)

By constructing a VAE, we mean that we want to build two mappings “encode” and “decode”. For
simplicity, let’s assume that both mappings are affine transformations:

z = encode(x) = ax + b, so that ϕ = [a, b],


x = decode(z) = cz + d, so that θ = [c, d].

We are too lazy to find out the joint distribution p(x, z), nor the conditional distributions p(x|z)
and p(z|x). But we can construct the proxy distributions qϕ (z|x) and pθ (x|z). Since we have the
freedom to choose what qϕ and pθ should look like, how about we consider the following two Gaussians

qϕ (z|x) = N (z | ax + b, 1),
pθ (x|z) = N (x | cz + d, c).

The choice of these two Gaussians is not mysterious. For qϕ (z|x): if we are given x, of course we want
the encoder to encode the distribution according to the structure we have chosen. Since the encoder
structure is ax + b, the natural choice for qϕ (z|x) is to have the mean ax + b. The variance is chosen
as 1 because we know that the encoded sample z should be unit-variance. Similarly, for pθ (x|z): if we
are given z, the decoder must take the form of cz + d because this is how we set up the decoder. The
variance is c, which is a parameter we need to figure out.
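As a quick numerical sanity check (not from the tutorial), the sketch below plugs in the closed-form solution mentioned at the start of this example, a = 1/σ, b = −µ/σ, c = σ, d = µ, using made-up values of µ and σ, and confirms that the encoded samples are approximately N(0, 1) while the decoded samples are approximately N(µ, σ²).

```python
import numpy as np

mu, sigma = 2.0, 0.5                       # made-up ground-truth parameters of p(x)
a, b = 1.0 / sigma, -mu / sigma            # encoder z = a*x + b, i.e. z = (x - mu)/sigma
c, d = sigma, mu                           # decoder x = c*z + d, i.e. x = mu + sigma*z

rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=100_000)    # samples from p(x) = N(mu, sigma^2)

z = a * x + b                              # encode: the mean of q_phi(z|x)
x_hat = c * rng.standard_normal(100_000) + d   # decode fresh latent codes z ~ N(0, 1)

print(z.mean(), z.std())                   # ~0 and ~1:     matches p(z) = N(0, 1)
print(x_hat.mean(), x_hat.std())           # ~2.0 and ~0.5: matches p(x) = N(mu, sigma^2)
```

In the VAE framework, of course, a, b, c, d are not known in advance; they are the parameters ϕ and θ that the loss function of the next subsection is designed to learn.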
We will pause for a moment before continuing this example. We want to introduce a mathematical
tool.

1.2 Evidence Lower Bound


How do we use these two proxy distributions to achieve our goal of determining the encoder and the decoder?
If we treat ϕ and θ as optimization variables, then we need an objective function (or the loss function) so
that we can optimize ϕ and θ through training samples. To this end, we need to set up a loss function in
terms of ϕ and θ. The loss function we use here is called the Evidence Lower BOund (ELBO) [1]:
 
ELBO(x) := E_{qϕ(z|x)} [ log ( p(x, z) / qϕ(z|x) ) ].     (2)

You are certainly puzzled how on the Earth people can come up with this loss function!? Let’s see what
ELBO means and how it is derived.
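Even before the derivation, equation (2) suggests a simple recipe: draw samples z ∼ qϕ(z|x) and average log p(x, z) − log qϕ(z|x). The true joint p(x, z) is unknown, so the sketch below (an illustration, not part of the tutorial) substitutes p(z) pθ(x|z) for it, using the affine encoder/decoder and the made-up parameters from the previous example.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log-density of a one-dimensional Gaussian N(v | mean, var)."""
    return -0.5 * ((v - mean) ** 2 / var + np.log(2 * np.pi * var))

def elbo_estimate(x, phi, theta, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of ELBO(x) = E_{q_phi(z|x)}[log p(x, z) - log q_phi(z|x)],
    with the unknown joint approximated by p(z) * p_theta(x|z)."""
    a, b = phi                       # q_phi(z|x)   = N(z | a*x + b, 1)
    c, d = theta                     # p_theta(x|z) = N(x | c*z + d, c), as chosen above
    rng = np.random.default_rng(seed)
    z = rng.normal(a * x + b, 1.0, size=n_samples)          # z ~ q_phi(z|x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, c * z + d, c)
    log_q = log_normal(z, a * x + b, 1.0)
    return float(np.mean(log_joint - log_q))

# Usage with the made-up affine parameters from the earlier example.
mu, sigma = 2.0, 0.5
print(elbo_estimate(x=2.3, phi=(1 / sigma, -mu / sigma), theta=(sigma, mu)))
```

The name comes from the fact, derived next, that ELBO(x) in (2) is a lower bound on the log-evidence log p(x), which is why maximizing it over ϕ and θ is a sensible training objective.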
