
Diffusion Models - A beginner’s perspective

Shashwat Gupta [ORCID: 0009-0003-8037-2348]

IIT Kanpur (Indian Institute of Technology - Kanpur), India


guptashashwatme@gmail.com

Abstract. Diffusion models are a robust class of generative models inspired by non-equilibrium thermodynamics, offering solutions to limitations in GANs, VAEs, and flow-based models by iteratively adding and removing noise in a Markov chain framework. These models achieve high-quality, diverse outputs and stable training, supported by advancements like positional embeddings, variance schedules, and classifier-free guidance. Conditional generation is facilitated through innovations such as Langevin dynamics, classifier guidance, and ControlNets. Improvements like latent-space diffusion, deterministic sampling (DDIM), and efficient noise scheduling further enhance their speed and scalability, making them a state-of-the-art approach for generative tasks across various domains.

Keywords: DDPM · Diffusion Models

1 Diffusion Models

Fig. 1: Summary of Generative Models

There are several types of generative models popular now (as shown in Figure
1), but none is without its flaws:

1. Generative Adversarial Networks (GANs): suffer from unstable training and limited diversity (mode collapse).
2. Variational Autoencoders (VAEs) [8,9,23]: rely on a surrogate loss.
3. Flow-based models: need specialized architectures to construct reversible transforms.
4. Diffusion Models: inspired by non-equilibrium thermodynamics. Despite being slow at sampling, diffusion models often outperform the other model families in sample quality, and they largely avoid the issues listed above (unstable training, mode collapse, and the need for specialized reversible architectures).

Below, we explain the common perspectives for understanding diffusion models, specifically the ones needed to understand our architectures.

Markov Chain Perspective. We touch upon the necessary mathematical details of diffusion models without going deep into the proofs (a more detailed treatment can be found in [1,2,3,4]). Our approach mostly follows the Denoising Diffusion Probabilistic Model (DDPM) [5,6,7], with some improvements suggested in later papers from OpenAI [11,12].
Diffusion models are latent variable models that add noise to a sample step by step as a Markov chain and then denoise the noisy sample using a neural network. During training, noise is added (according to a variance schedule), and a model learns to denoise the image over multiple steps. During inference, denoising is applied starting from a sample of isotropic Gaussian noise. Noising and denoising in many small steps, as opposed to the single step used by GANs, leads to more tractable computations [10].
The forward process is defined as follows:

$$x_0 \sim q(x)$$
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

As t → ∞, xt approaches an isotropic Gaussian.


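For the forward process, $x_t$ can be computed in closed form from $x_0$ by using the reparametrization trick together with the fact that a sum of independent Gaussians is again Gaussian. Defining two new variables:

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$$

we obtain

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$

Since $\beta_t$ is small, the reverse conditional $q(x_{t-1} \mid x_t)$ is also approximately Gaussian. However, estimating this quantity would require using the entire dataset, so we learn a model $p_\theta$ to approximate the conditional probabilities.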
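The closed form $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)$ means that a noisy sample at any step can be drawn in one shot, without simulating the chain. The following is a minimal NumPy sketch of this idea. The function names (`linear_beta_schedule`, `q_sample`), the toy data, and the choice of $T = 1000$ with the linear schedule are assumptions made for illustration, not the author's code.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1, ..., beta_T (assumed schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form for a 1-indexed step t.

    Returns the noisy sample and the noise eps that produced it.
    """
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

T = 1000
betas = linear_beta_schedule(T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # alpha_bar[t-1] = prod_{i<=t} alpha_i

rng = np.random.default_rng(0)
x0 = np.full(20000, 3.0)            # toy "data": every sample equals 3
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)
x_T, _ = q_sample(x0, T, alpha_bar, rng)

# At t = T almost all signal is gone: mean near 0, standard deviation near 1.
print(alpha_bar[-1], x_T.mean(), x_T.std())
```

With this schedule $\bar{\alpha}_T$ is very close to zero, so `x_T` is nearly indistinguishable from standard Gaussian noise, while `x_mid` still carries a scaled copy of the data.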

We run the reverse diffusion process:


$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$


We use the negative log-likelihood as the loss: $-\log p_\theta(x_0)$. As with VAEs, we use the variational bound to upper-bound this objective [8,9]. After simplification, additional conditioning on $x_0$ (which makes the reverse posterior tractable), and dropping the terms that depend only on $q$ (since they have no learnable parameters), we arrive at the following objective:

$$L_{\text{reduced}} = \sum_{t=2}^{T} D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1)$$

where

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$$
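For reference, the posterior parameters $\tilde{\mu}_t$ and $\tilde{\beta}_t$ have closed forms. The expressions below are restated from the DDPM paper [5] rather than derived here, and use the $\alpha_t$, $\bar{\alpha}_t$ notation of the forward process. The last line substitutes $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t)/\sqrt{\bar{\alpha}_t}$, which shows why predicting the noise is enough to predict the mean.

```latex
\tilde{\mu}_t(x_t, x_0)
  = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

\tilde{\mu}_t
  = \frac{1}{\sqrt{\alpha_t}}
    \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t \right)
```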


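Because $\beta_t$ is fixed by the schedule, the variance of the reverse step can also be fixed, and the KL terms reduce to the MSE between $\tilde{\mu}_t(x_t, x_0)$ and $\mu_\theta(x_t, t)$. After reparametrization, this becomes the MSE between the noise $\epsilon_t$ added at time $t$ and the noise predicted by the model, multiplied by a time-dependent scaling term. Ho et al. [5] found that dropping this scaling term improves sample quality, which gives the simplified objective:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon_t}\left[\left\| \epsilon_t - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t,\ t\right) \right\|^2\right]$$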
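A Monte Carlo estimate of the simplified noise-prediction objective can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `simple_loss`, `zero_model`, and `oracle_model` are hypothetical names, the "models" are fixed functions rather than trained networks, and the data is a toy standard Gaussian, for which the best possible noise prediction is known analytically to be $\sqrt{1-\bar{\alpha}_t}\, x_t$.

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bar, rng):
    """Monte Carlo estimate of the noise-prediction loss on a batch x0 (batch, dim)."""
    batch = x0.shape[0]
    t = rng.integers(1, len(alpha_bar) + 1, size=batch)  # uniform t in {1, ..., T}
    eps = rng.standard_normal(x0.shape)                  # noise to be predicted
    a = alpha_bar[t - 1][:, None]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps       # closed-form forward sample
    eps_hat = eps_model(x_t, t)
    return np.mean(np.sum((eps - eps_hat) ** 2, axis=1))

betas = np.linspace(1e-4, 0.02, 1000)                    # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Exact E[eps | x_t] when x0 ~ N(0, I); valid only for this toy data."""
    return np.sqrt(1.0 - alpha_bar[t - 1])[:, None] * x_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4096, 2))
loss_zero = simple_loss(zero_model, x0, alpha_bar, rng)
loss_oracle = simple_loss(oracle_model, x0, alpha_bar, rng)
print(loss_zero, loss_oracle)
```

The zero predictor pays the full noise energy (about the data dimension, here 2), while the oracle pays less. A trained network would replace `oracle_model` and be updated by gradient descent on this loss.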

Fig. 2: Diffusion Models demystified

Langevin Dynamics Perspective (Noise-conditioned score networks). This perspective helps us understand conditional image generation. Again, we only touch upon the results (more detail can be found in [13,19,20,21,24]). Stochastic Gradient Langevin Dynamics [26] can generate samples from a probability density $p(x)$ using only the gradients $\nabla_x \log p(x)$ in a Markov chain of updates:

$$x_t = x_{t-1} + \frac{\delta}{2} \nabla_x \log p(x_{t-1}) + \sqrt{\delta}\, \epsilon_t, \quad \text{where } \epsilon_t \sim \mathcal{N}(0, I)$$

Here, $\delta$ represents the step size. As $T \to \infty$ and $\delta \to 0$, the distribution of $x_T$ converges to the true probability density $p(x)$.
Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
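The Langevin update can be tried on a density whose score is known exactly. The sketch below assumes a one-dimensional Gaussian target with mean 2 and standard deviation 0.5, so that $\nabla_x \log p(x) = -(x - \mu)/\sigma^2$. The names `langevin_sample` and `gaussian_score`, the step size, and the number of steps are illustrative choices, not values from the cited papers.

```python
import numpy as np

def langevin_sample(score, x_init, step_size, num_steps, rng):
    """Run x <- x + (delta / 2) * score(x) + sqrt(delta) * eps for num_steps steps."""
    x = np.array(x_init, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score(x) + np.sqrt(step_size) * noise
    return x

MU, SIGMA = 2.0, 0.5   # assumed toy target N(MU, SIGMA^2)

def gaussian_score(x):
    """Exact score (gradient of the log-density) of the toy Gaussian target."""
    return -(x - MU) / SIGMA**2

rng = np.random.default_rng(0)
x_init = rng.uniform(-10.0, 10.0, size=5000)   # start far from the target
samples = langevin_sample(gaussian_score, x_init,
                          step_size=0.01, num_steps=1000, rng=rng)
print(samples.mean(), samples.std())
```

The chain forgets its initialisation and the samples settle near the target mean and spread. Because the step size is finite, a small discretisation bias in the variance remains, which is why the convergence statement requires $\delta \to 0$.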
Song and Ermon (2019) [13] proposed a score-based generative modelling method in which samples are produced via Langevin dynamics, using gradients of the data distribution (the Stein score) estimated with score matching.
To scale to high-dimensional data, they perturb the data with pre-specified levels of noise and estimate the score of the perturbed data with score matching. According to the manifold hypothesis, most data is expected to lie on a low-dimensional manifold, even though the ambient space is high-dimensional. Thus, the data does not cover the entire space, and score estimation is unreliable in sparse regions. Adding noise at several levels, so that the perturbed data covers the entire space, makes training more stable.

 
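In DDPM notation, the score network and the noise predictor are related by:

$$s_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t) = \mathbb{E}_{q(x_0)}\left[\nabla_{x_t} \log q(x_t \mid x_0)\right] = \mathbb{E}_{q(x_0)}\left[-\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}\right] = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$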

Architecture and Algorithm. The original implementation of DDPM used a U-Net architecture consisting of Wide ResNet blocks, group normalisation, and self-attention blocks. The diffusion time step $t$ is specified by adding a sinusoidal position embedding into each residual block. Various other approaches and architectures are covered in [15,18].
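A sinusoidal time step embedding of the kind described above can be sketched as follows. The function name `timestep_embedding`, the `max_period` value of 10000 (borrowed from the Transformer convention), and the exact frequency spacing are assumptions; implementations differ in these details, and the sketch assumes an even embedding dimension.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer time steps to vectors of size dim (dim assumed even).

    The first half of each vector holds sines and the second half cosines,
    evaluated at geometrically spaced frequencies.
    """
    t = np.asarray(t, dtype=float).reshape(-1)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]          # shape (batch, half)
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding([0, 1, 500], dim=8)
print(emb.shape)
```

In a U-Net, such a vector would typically be passed through a small MLP and added to the feature maps of each residual block, so that the same network can denoise at every noise level.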
The training and sampling algorithms are shown in Figure 4.
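The sampling procedure of DDPM [5] can be sketched as a loop that starts from Gaussian noise and applies the learned reverse step $T$ times. The code below is a minimal sketch under stated assumptions: `ddpm_sample` and `oracle_eps` are hypothetical names, the reverse variance is fixed to $\sigma_t^2 = \beta_t$, and the trained network is replaced by the analytically optimal noise predictor for toy data $x_0 \sim \mathcal{N}(0, I)$.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: x_T ~ N(0, I), then reverse steps t = T, ..., 1."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):
        i = t - 1                                   # 0-based index of step t
        eps_hat = eps_model(x, t)
        # Mean of the reverse step, written in terms of the predicted noise.
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        # No noise is added at the final step (t = 1).
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z            # sigma_t^2 = beta_t (assumed)
    return x

betas = np.linspace(1e-4, 0.02, 1000)               # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def oracle_eps(x_t, t):
    """Stand-in for a trained network: exact E[eps | x_t] for x0 ~ N(0, I)."""
    return np.sqrt(1.0 - alpha_bar[t - 1]) * x_t

rng = np.random.default_rng(0)
samples = ddpm_sample(oracle_eps, (10000, 2), betas, rng)
print(samples.mean(), samples.std())
```

With the oracle predictor the generated samples have approximately zero mean and unit standard deviation, matching the toy data distribution. In practice `eps_model` would be the trained U-Net.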

1.1 Conditioned Generation

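To turn a diffusion model into a conditional model [22], we can add conditioning information $y$ at each step, weighted by a guidance scale $s$:

$$p_\theta(x_{0:T} \mid y) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, y)$$
$$\nabla_{x_t} \log p_\theta(x_t \mid y) = \nabla_{x_t} \log p_\theta(x_t) + s\, \nabla_{x_t} \log p_\theta(y \mid x_t)$$

Using $\nabla_{x_t} \log q(x_t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)$, the guided noise prediction becomes

$$\bar{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1-\bar{\alpha}_t}\; s\, \nabla_{x_t} \log p_\theta(y \mid x_t)$$

The score-based formulation eliminates the term $p_\theta(y)$ that appears in Bayes' rule, since its gradient with respect to $x_t$ is zero; computing $p_\theta(y)$ itself would need knowledge of all data points.
The following are the popular ways to condition the diffusion model: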

(a) Without Noise, predictions of sparse regions are inaccurate

(b) Adding noise increases the base of predictions to sparse regions closer to the low-
dimensional manifold.

Fig. 3: Role of Noise in Score-matching approach

Fig. 4: Training and Sampling algorithms for DDPM



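Classifier Guided Diffusion. The score of $y$ with respect to $x_t$ can be estimated using a classifier $f_\phi(y \mid x_t)$ trained on noisy images [12]. Setting $\nabla_{x_t} \log q(y \mid x_t) \approx \nabla_{x_t} \log f_\phi(y \mid x_t)$ and using a guidance weight $w$:

$$\bar{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1-\bar{\alpha}_t}\; w\, \nabla_{x_t} \log f_\phi(y \mid x_t)$$

The resulting ablated diffusion model (ADM) and the variant with additional classifier guidance (ADM-G) can achieve better results than state-of-the-art generative models (e.g., BigGAN).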
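The classifier-guided correction of the noise prediction can be illustrated with a toy classifier whose gradient is available in closed form. Everything named here is an assumption for illustration: `log_sigmoid_grad` stands in for backpropagation through a real noise-aware classifier $f_\phi$, the classifier is a fixed logistic model that ignores $t$, and the numbers are arbitrary.

```python
import numpy as np

def log_sigmoid_grad(x, w_cls):
    """Gradient w.r.t. x of log sigmoid(w_cls . x), i.e. of log f(y=1 | x)
    for a toy logistic classifier. Equals (1 - sigmoid(w_cls . x)) * w_cls."""
    logits = x @ w_cls
    sigmoid = 1.0 / (1.0 + np.exp(-logits))
    return (1.0 - sigmoid)[:, None] * w_cls[None, :]

def classifier_guided_eps(eps_hat, grad_log_f, alpha_bar_t, scale):
    """eps_bar = eps_hat - sqrt(1 - alpha_bar_t) * scale * grad log f(y | x_t)."""
    return eps_hat - np.sqrt(1.0 - alpha_bar_t) * scale * grad_log_f

x_t = np.array([[0.0, 0.0], [1.0, -1.0]])
w_cls = np.array([2.0, 0.0])          # toy classifier weights (assumed)
eps_hat = np.zeros_like(x_t)          # placeholder for the network's prediction

grad = log_sigmoid_grad(x_t, w_cls)
eps_bar = classifier_guided_eps(eps_hat, grad, alpha_bar_t=0.75, scale=2.0)
print(eps_bar)
```

The guided prediction is shifted against the classifier gradient, which moves the denoised sample toward regions the classifier assigns to class $y$. Setting `scale=0` recovers the unguided prediction.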

Classifier-free guidance. Conditioning is also possible without a classifier [17]. Let the unconditional denoising diffusion model $p_\theta(x)$ be parameterized through a score estimator $\epsilon_\theta(x_t, t)$ and the conditional model $p_\theta(x \mid y)$ through $\epsilon_\theta(x_t, t, y)$. These two models can be learned via a single neural network. Precisely, a conditional diffusion model $p_\theta(x \mid y)$ is trained on paired data $(x, y)$, where the conditioning information $y$ is discarded at random with some probability, so that the model also learns to generate images unconditionally, i.e. $\epsilon_\theta(x_t, t) = \epsilon_\theta(x_t, t, y = \varnothing)$.
The gradient of an implicit classifier can be represented with the conditional and unconditional score estimators. Once plugged into the classifier-guided modified score, the result has no dependency on a separate classifier:

$$\nabla_{x_t} \log p(y \mid x_t) = \nabla_{x_t} \log p(x_t \mid y) - \nabla_{x_t} \log p(x_t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t)\right)$$

i.e.

$$\bar{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, y) - \sqrt{1-\bar{\alpha}_t}\; w\, \nabla_{x_t} \log p(y \mid x_t) = \epsilon_\theta(x_t, t, y) + w \left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t)\right) = (w+1)\, \epsilon_\theta(x_t, t, y) - w\, \epsilon_\theta(x_t, t)$$
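The two ingredients of classifier-free guidance, randomly dropping the condition during training and combining the two noise predictions at sampling time, can be sketched as below. The names `drop_condition` and `classifier_free_eps`, the use of `-1` as the null label, and the drop probability are assumptions for illustration and are not taken from [17].

```python
import numpy as np

def drop_condition(y, p_uncond, null_token, rng):
    """Training-time helper: replace each label by the null token with
    probability p_uncond, so one network learns both the conditional and
    the unconditional noise prediction."""
    mask = rng.random(y.shape[0]) < p_uncond
    return np.where(mask, null_token, y)

def classifier_free_eps(eps_cond, eps_uncond, w):
    """Sampling-time combination: (w + 1) * eps_cond - w * eps_uncond."""
    return (w + 1.0) * eps_cond - w * eps_uncond

rng = np.random.default_rng(0)
labels = np.arange(10)
dropped = drop_condition(labels, p_uncond=0.2, null_token=-1, rng=rng)

eps_cond = np.array([1.0, 2.0])      # placeholder conditional prediction
eps_uncond = np.array([0.5, 0.0])    # placeholder unconditional prediction
eps_bar = classifier_free_eps(eps_cond, eps_uncond, w=1.0)
print(dropped, eps_bar)
```

With `w = 0` the combination reduces to the plain conditional prediction; larger `w` pushes the prediction further along the direction that separates the conditional from the unconditional estimate.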

ControlNets. Zhang et al. (2023) [25] developed ControlNet, a separate module that can be added to a pretrained diffusion model for conditional image generation.

1.2 Improvements to Diffusion Model

We now discuss some popular improvements to the diffusion models:

1. Ho et al. (2020) [5] used a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. Nichol and Dhariwal (2021) [11] proposed a cosine-based variance schedule (any schedule works as long as $\bar{\alpha}_t$ drops near-linearly in the middle of the process and changes only slightly around $t = 0$ and $t = T$).
2. The DDPM paper [5] also used a positional time step embedding, where half of the dimensions encode a sine embedding and the other half encode a cosine embedding.

3. Nichol and Dhariwal (2021) [11] also proposed learning the reverse process variance $\Sigma_\theta$ as an interpolation between $\beta_t$ and $\tilde{\beta}_t$ in the log domain, where the model outputs a mixing vector $v$:

$$\Sigma_\theta(x_t, t) = \exp\left(v \log \beta_t + (1 - v) \log \tilde{\beta}_t\right)$$

4. Song et al. (2021) [28] proposed deterministic sampling (Denoising Diffusion Implicit Models, DDIM), which has the same marginal noise distribution but deterministically maps noise back to the original data samples. Compared to DDPM, DDIM gives higher sample quality when the number of sampling steps is small, and samples conditioned on the same latent variable share high-level features (consistency), which makes the latent variable a semantically meaningful representation.
5. Nichol and Dhariwal (2021) [11] also proposed speeding up the sampling process by strided sampling.
6. Latent Diffusion [27] runs the diffusion process in latent space instead of pixel space, which lowers training cost and speeds up inference. The encoder downsamples the image to latent space, and the decoder recovers the generated image from the latent.
7. Cold Diffusion [14] generalises the notion of noise by applying various transformations to the image, and uses a modified sampling algorithm to make the degradation function independent of the restoration operator up to first-order terms.
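Two of the improvements listed above, the variance schedules and strided sampling, are easy to make concrete. The sketch below compares the linear schedule of [5] with the cosine schedule of [11] (offset $s = 0.008$ and clipping at 0.999, as in that paper) and shows how the variances can be recomputed for a subsequence of time steps. The function names and the choice of 50 sampling steps are assumptions for illustration.

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1, ..., beta_T."""
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Cosine schedule: alpha_bar_t = f(t) / f(0) with
    f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

def strided_timesteps(T, num_sample_steps):
    """Evenly spaced, increasing subsequence of the steps 1, ..., T."""
    return np.unique(np.round(np.linspace(1, T, num_sample_steps)).astype(int))

def respaced_betas(alpha_bar, timesteps):
    """Variances for the shortened chain, chosen so that its cumulative
    products match alpha_bar at the selected time steps."""
    ab = alpha_bar[timesteps - 1]
    prev = np.concatenate([[1.0], ab[:-1]])
    return 1.0 - ab / prev

T = 1000
ab_linear = np.cumprod(1.0 - linear_betas(T))
ab_cosine = np.cumprod(1.0 - cosine_betas(T))

ts = strided_timesteps(T, 50)
short_betas = respaced_betas(ab_linear, ts)
print(ab_linear[T // 2 - 1], ab_cosine[T // 2 - 1], len(short_betas))
```

Halfway through the process the cosine schedule retains far more signal than the linear one, which destroys information early. The respaced variances let a sampler take 50 steps instead of 1000 while visiting the same marginal noise levels.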

References
1. Blog: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
2. Blog: https://theaisummer.com/diffusion-models/
3. Blog: https://towardsdatascience.com/diffusion-models-made-easy-8414298ce4da (a simplistic explanation)
4. Video: https://www.youtube.com/watch?v=HoKDTa5jHvg&t=1284s
5. Paper: Denoising Diffusion Probabilistic Models (DDPM), Ho et al., 2020: https://arxiv.org/pdf/2006.11239.pdf
6. Video explanation: https://www.youtube.com/watch?v=W-O7AZNzbzQ
7. Annotated code: https://huggingface.co/blog/annotated-diffusion
8. Blog: Variational Autoencoders: https://lilianweng.github.io/posts/2018-08-12-vae/
9. Blog: Latent Variable Models: https://theaisummer.com/latent-variable-models/
10. Paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Sohl-Dickstein et al., 2015: https://arxiv.org/pdf/1503.03585.pdf
11. Paper: Improved Denoising Diffusion Probabilistic Models, Nichol and Dhariwal, 2021: https://arxiv.org/pdf/2102.09672.pdf
12. Paper: Diffusion Models Beat GANs on Image Synthesis, Dhariwal and Nichol, 2021: https://arxiv.org/pdf/2105.05233.pdf
13. Paper: Generative Modeling by Estimating Gradients of the Data Distribution (noise-conditioned score network), Song and Ermon, 2019: https://arxiv.org/abs/1907.05600
14. Paper: Cold Diffusion: https://arxiv.org/pdf/2208.09392.pdf
15. Paper: Understanding Diffusion Models: A Unified Perspective, Calvin Luo, 2022: https://arxiv.org/pdf/2208.11970.pdf

16. Paper: Fast Sampling of Diffusion Models with Exponential Integrator, Zhang et al., 2022: https://arxiv.org/abs/2204.13902
17. Paper: Classifier-Free Diffusion Guidance, Ho et al., 2021: https://openreview.net/pdf?id=qw8AKxfYbI
18. Paper: Diffusion Models: A Comprehensive Survey of Methods and Applications, Yang et al., 2022: https://arxiv.org/pdf/2209.00796.pdf
19. Video: Diffusion and Score-based generative models: https://www.youtube.com/watch?v=wMmqCMwuM2Q
20. Blog: https://yang-song.net/blog/2021/score/
21. Blog: Autoregressive models, normalizing flows, energy-based models, VAEs. Score papers: https://scorebasedgenerativemodeling.github.io/
22. Blog: Guiding Diffusion Process: https://sander.ai/2022/05/26/guidance.html
23. Blog: Diffusion as autoencoders: https://sander.ai/2022/01/31/diffusion.html
24. Video: Langevin Dynamics end to end: https://www.youtube.com/watch?v=3-KzIjoFJy4&t=2379s
25. Paper: Adding Conditional Control to Text-to-Image Diffusion Models, Zhang et al., 2023: https://arxiv.org/abs/2302.05543
26. Paper: Bayesian Learning via Stochastic Gradient Langevin Dynamics, Welling and Teh, 2011: https://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf
27. Paper: High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022: https://arxiv.org/abs/2112.10752
28. Paper: Denoising Diffusion Implicit Models, Song et al., 2021: https://arxiv.org/abs/2010.02502
