Tutorial on diffusion models
Stanley Chan1
Contents
1 The Basics: Variational Auto-Encoder (VAE)
1.1 VAE Setting
1.2 Evidence Lower Bound
1.3 Training VAE
1.4 Loss Function
1.5 Inference with VAE
5 Conclusion
1 School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907.
Email: stanchan@purdue.edu.
The autoencoder has an input variable x and a latent variable z. For the sake of understanding the
subject, we treat x as a beautiful image and z as some kind of vector living in some high-dimensional space.
Example. Getting a latent representation of an image is not an alien thing. Back in the time of JPEG
compression (which is arguably a dinosaur), we used the discrete cosine transform (DCT) basis φn to encode
the underlying image or patches of the image. The coefficient vector z = [z1, . . . , zN]^T is obtained by
projecting the patch x onto the space spanned by the basis: zn = ⟨φn, x⟩. So, if you give us an image
x, we will return you a coefficient vector z. From z we can apply the inverse transform to recover (i.e., decode)
the image. Therefore, the coefficient vector z is the latent code. The encoder is the DCT transform,
and the decoder is the inverse DCT transform.
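To make the analogy concrete, here is a minimal Python sketch of this encoder/decoder pair. It uses scipy's orthonormal DCT routines on a toy 8 × 8 patch; the random patch is purely illustrative data.

# Minimal sketch: DCT as the "encoder", inverse DCT as the "decoder" of one patch.
import numpy as np
from scipy.fft import dctn, idctn

x = np.random.rand(8, 8)            # a toy 8x8 image patch (illustrative data)

z = dctn(x, norm="ortho")           # encode: coefficients z_n = <phi_n, x>
x_hat = idctn(z, norm="ortho")      # decode: inverse transform recovers the patch

print(np.allclose(x, x_hat))        # True: the DCT / inverse-DCT pair is lossless

Of course, real JPEG additionally quantizes the coefficients; the point here is only that z plays the role of a latent code.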
The name “variational” comes from the fact that we use probability distributions to describe x and
z. Instead of resorting to a deterministic procedure of converting x to z, we are more interested in ensuring
that the distribution p(x) can be mapped to a desired distribution p(z), and in going backward from p(z) to p(x).
Because of the distributional setting, we need to consider a few distributions.
• p(x): The distribution of x. It is never known. If we knew it, we would have become a billionaire. The
whole galaxy of diffusion models is about finding ways to draw samples from p(x).
• p(z): The distribution of the latent variable. Because we are all lazy, let’s just make it a zero-mean
unit-variance Gaussian p(z) = N (0, I).
• p(z|x): The conditional distribution associated with the encoder, which tells us the likelihood of z
when given x. We have no access to it. p(z|x) itself is not the encoder, but the encoder has to do
something so that it will behave consistently with p(z|x).
• p(x|z): The conditional distribution associated with the decoder, which tells us the posterior probability of getting x given z. Again, we have no access to it.
The four distributions above are not too mysterious. Here is a somewhat trivial but educational example
that can illustrate the idea.
Example. Consider a random variable X distributed according to a Gaussian mixture model with
a latent variable z ∈ {1, . . . , K} denoting the cluster identity, such that pZ(k) = P[Z = k] = πk for
k = 1, . . . , K. We assume that the mixture weights sum to one: π1 + · · · + πK = 1. Then, if we are told that we need to look at the k-th cluster
only, the conditional distribution of X given Z is

pX|Z(x | k) = N(x | µk, σk² I).
Therefore, if we start with pX(x), the design question for the encoder is to build a magical encoder such
that for every sample x ∼ pX(x), the latent code will be z ∈ {1, . . . , K} with a distribution z ∼ pZ(k).
To illustrate how the encoder and decoder work, let's assume that the mean and variance are known
and fixed. Otherwise we would need to estimate the mean and variance through an EM algorithm. It
is doable, but the tedious equations will defeat the purpose of this illustration.
Encoder: How do we obtain z from x? This is easy because at the encoder, we know pX(x) and
pZ(k). Imagine that you only have two classes z ∈ {1, 2}. Effectively you are just making a binary
decision of which class the sample x should belong to. There are many ways you can make the binary decision.
If you like maximum-a-posteriori, you can check whether

pZ|X(1 | x) ≥ pZ|X(2 | x),

and this will return you a simple decision rule. You give us x, we tell you z ∈ {1, 2}.
Decoder: On the decoder side, if we are given a latent code z ∈ {1, . . . , K}, the magical decoder
just needs to return us a sample x which is drawn from pX|Z(x | k) = N(x | µk, σk² I). A different z will
give us one of the K mixture components. If we have enough samples, the overall distribution will
follow the Gaussian mixture.
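As a concrete illustration, here is a small Python sketch of this encoder/decoder pair for K = 2. The mixture weights, means, and standard deviations below are made-up numbers, since the example assumes they are known and fixed.

# Toy mixture "autoencoder": MAP encoder picks a cluster, decoder samples from it.
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])          # mixture weights pi_k (illustrative values)
mu = np.array([-2.0, 2.0])         # cluster means mu_k
sigma = np.array([0.5, 1.0])       # cluster standard deviations sigma_k

def encode(x):
    # MAP rule: pick the k maximizing p(Z = k | x), proportional to pi_k N(x | mu_k, sigma_k^2).
    posterior = pi * norm.pdf(x, loc=mu, scale=sigma)
    return int(np.argmax(posterior)) + 1       # latent code z in {1, 2}

def decode(z, rng=np.random.default_rng(0)):
    # Return a fresh sample drawn from the z-th component N(mu_z, sigma_z^2).
    return rng.normal(mu[z - 1], sigma[z - 1])

z = encode(1.7)        # 1.7 sits near the second cluster, so z = 2
x_new = decode(z)      # a new sample from that cluster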
Smart readers like you will certainly complain: “Your example is so trivially unreal.” No worries. We
understand. Life is of course a lot harder than a Gaussian mixture model with known means and known
variances. But one thing we realize is that if we want to find the magical encoder and decoder, we must have
a way to find the two conditional distributions. However, they are both high-dimensional creatures. So,
in order for us to say something more meaningful, we need to impose additional structures so that we can
generalize the concept to harder problems.
In the VAE literature, people came up with the idea of considering the following two proxy distributions:
• qϕ (z|x): The proxy for p(z|x). We will make it a Gaussian. Why Gaussian? No particular good
reason. Perhaps we are just ordinary (aka lazy) human beings.
• pθ (x|z): The proxy for p(x|z). Believe it or not, we will make it a Gaussian too. But the role of this
Gaussian is slightly different from that of the Gaussian qϕ (z|x). While we will need to estimate the mean
and variance for the Gaussian qϕ (z|x), we do not need to estimate anything for the Gaussian pθ (x|z).
Instead, we will need a decoder neural network to turn z into x. The Gaussian pθ (x|z) will be used to
inform us how good our generated image x is.
The relationship between the input x and the latent z, as well as the conditional distributions, is
summarized in Figure 1. There are two nodes x and z. The “forward” relationship is specified by p(z|x)
(and approximated by qϕ (z|x)), whereas the “reverse” relationship is specified by p(x|z) (and approximated
by pθ (x|z)).
Figure 1: In a variational autoencoder, the variables x and z are connected by the conditional distributions
p(x|z) and p(z|x). To make things work, we introduce two proxy distributions pθ (x|z) and qϕ (z|x), respectively.
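To make the two proxies a bit more tangible, here is a minimal numpy sketch. The tiny affine “networks” and their weights are purely illustrative stand-ins for real encoder/decoder networks; only the structure follows the text: qϕ (z|x) is a Gaussian whose mean and (log-)variance come from the encoder, and pθ (x|z) is a Gaussian centered at the decoder output, used to score the reconstruction.

# Sketch of the two proxy Gaussians, with toy affine maps standing in for networks.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # Stand-in for an encoder network: returns the mean and log-variance of q_phi(z|x).
    return 0.9 * x - 0.1, np.zeros_like(x)    # illustrative weights, not learned

def decoder(z):
    # Stand-in for a decoder network: returns the mean of p_theta(x|z).
    return 1.1 * z + 0.1                      # illustrative weights, not learned

x = rng.normal(size=3)                        # a toy "image" with 3 pixels

# Sample z ~ q_phi(z|x) = N(mean, exp(logvar)) by reparameterization.
mean_z, logvar_z = encoder(x)
z = mean_z + np.exp(0.5 * logvar_z) * rng.normal(size=3)

# p_theta(x|z) = N(x | decoder(z), I): its log-density tells us how good the
# generated/reconstructed x is (unit variance assumed here for simplicity).
x_hat = decoder(z)
log_p_x_given_z = -0.5 * np.sum((x - x_hat) ** 2 + np.log(2 * np.pi))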
Example. Suppose that we have a random variable x and a latent variable z such that

x ∼ N(x | µ, σ²),
z ∼ N(z | 0, 1).

Our goal is to construct a VAE. (What?! This problem has a trivial solution where z = (x − µ)/σ and
x = µ + σz. You are absolutely correct. But please follow our derivation to see if the VAE framework
makes sense.)
By constructing a VAE, we mean that we want to build two mappings “encode” and “decode”. For
simplicity, let's assume that both mappings are affine transformations:

encode(x) = ax + b,        decode(z) = cz + d,

for some parameters (a, b, c, d) to be determined.
We are too lazy to find out the joint distribution p(x, z) or the conditional distributions p(x|z)
and p(z|x). But we can construct the proxy distributions qϕ (z|x) and pθ (x|z). Since we have the
freedom to choose what qϕ and pθ should look like, how about we consider the following two Gaussians:

qϕ (z|x) = N(z | ax + b, 1),
pθ (x|z) = N(x | cz + d, c).
The choice of these two Gaussians is not mysterious. For qϕ (z|x): if we are given x, of course we want
the encoder to encode the distribution according to the structure we have chosen. Since the encoder
structure is ax + b, the natural choice for qϕ (z|x) is to have mean ax + b. The variance is chosen
as 1 because we know that the encoded sample z should have unit variance. Similarly, for pθ (x|z): if we
are given z, the decoder must take the form of cz + d because this is how we set up the decoder. The
variance is c, which is a parameter we need to figure out.
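As a quick numerical sanity check of the trivial solution mentioned above (a = 1/σ, b = −µ/σ for the encoder and c = σ, d = µ for the decoder), consider the following Python snippet; the particular values of µ and σ are arbitrary.

# Sanity check: with a = 1/sigma, b = -mu/sigma, c = sigma, d = mu, the encoder maps
# x ~ N(mu, sigma^2) to roughly zero-mean unit-variance z, and the decoder maps it back.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0                        # arbitrary illustrative values

a, b = 1.0 / sigma, -mu / sigma             # encoder parameters
c, d = sigma, mu                            # decoder parameters

x = rng.normal(mu, sigma, size=100_000)
z = a * x + b                               # encode
x_hat = c * z + d                           # decode

print(z.mean(), z.var())                    # approximately 0 and 1
print(x_hat.mean(), x_hat.var())            # approximately mu and sigma^2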
We will pause for a moment before continuing this example. We want to introduce a mathematical
tool: the Evidence Lower Bound (ELBO), which is defined as

ELBO(x) := E_{qϕ(z|x)} [ log ( p(x, z) / qϕ(z|x) ) ].

You are certainly puzzled: how on earth did people come up with this loss function?! Let's see what
the ELBO means and how it is derived.