Tutorial on diffusion models
Stanley Chan1
Contents
1 The Basics: Variational Auto-Encoder (VAE)
1.1 VAE Setting
1.2 Evidence Lower Bound
1.3 Training VAE
1.4 Loss Function
1.5 Inference with VAE
5 Conclusion
1 School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907.
Email: stanchan@purdue.edu.
The autoencoder has an input variable x and a latent variable z. For the sake of understanding the
subject, we treat x as a beautiful image and z as some kind of vector living in some high-dimensional space.
Example. Getting a latent representation of an image is not an alien thing. Back in the time of JPEG
compression (which is arguably a dinosaur), we used the discrete cosine transform (DCT) basis φn to encode
the underlying image or patches of the image. The coefficient vector z = [z1, . . . , zN]^T is obtained by
projecting the patch x onto the space spanned by the basis: zn = ⟨φn, x⟩. So, if you give us an image
x, we will return you a coefficient vector z. From z we can apply the inverse transform to recover (i.e., decode)
the image. Therefore, the coefficient vector z is the latent code. The encoder is the DCT transform,
and the decoder is the inverse DCT transform.
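To make the analogy concrete, here is a minimal Python sketch of this encoder/decoder pair. It uses scipy's orthonormal DCT routines on a toy 8 × 8 patch; the random patch is purely illustrative data.

# Minimal sketch: DCT as the "encoder", inverse DCT as the "decoder" of one patch.
import numpy as np
from scipy.fft import dctn, idctn

x = np.random.rand(8, 8)            # a toy 8x8 image patch (illustrative data)

z = dctn(x, norm="ortho")           # encode: coefficients z_n = <phi_n, x>
x_hat = idctn(z, norm="ortho")      # decode: inverse transform recovers the patch

print(np.allclose(x, x_hat))        # True: the DCT / inverse-DCT pair is lossless

Of course, real JPEG additionally quantizes the coefficients; the point here is only that z plays the role of a latent code.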
The name “variational” comes from the fact that we use probability distributions to describe x and
z. Instead of resorting to a deterministic procedure of converting x to z, we are more interested in ensuring
that the distribution p(x) can be mapped to a desired distribution p(z), and in going backward from p(z) to p(x).
Because of the distributional setting, we need to consider a few distributions.
• p(x): The distribution of x. It is never known. If we knew it, we would have become a billionaire. The
whole galaxy of diffusion models is about finding ways to draw samples from p(x).
• p(z): The distribution of the latent variable. Because we are all lazy, let’s just make it a zero-mean
unit-variance Gaussian p(z) = N (0, I).
• p(z|x): The conditional distribution associated with the encoder, which tells us the likelihood of z
when given x. We have no access to it. p(z|x) itself is not the encoder, but the encoder has to do
something so that it will behave consistently with p(z|x).
• p(x|z): The conditional distribution associated with the decoder, which tells us the posterior probability of getting x given z. Again, we have no access to it.
The four distributions above are not too mysterious. Here is a somewhat trivial but educational example
that can illustrate the idea.
Example. Consider a random variable X distributed according to a Gaussian mixture model with
a latent variable z ∈ {1, . . . , K} denoting the cluster identity, such that pZ(k) = P[Z = k] = πk for
k = 1, . . . , K. We assume that the mixture weights sum to one: π1 + · · · + πK = 1. Then, if we are told that we need to look at the k-th cluster
only, the conditional distribution of X given Z is

pX|Z(x | k) = N(x | µk, σk² I).
Therefore, if we start with pX(x), the design question for the encoder is to build a magical encoder such
that for every sample x ∼ pX(x), the latent code will be z ∈ {1, . . . , K} with a distribution z ∼ pZ(k).
To illustrate how the encoder and decoder work, let's assume that the mean and variance are known
and fixed. Otherwise we would need to estimate the mean and variance through an EM algorithm. It
is doable, but the tedious equations will defeat the purpose of this illustration.
Encoder: How do we obtain z from x? This is easy because at the encoder, we know pX(x) and
pZ(k). Imagine that you only have two classes z ∈ {1, 2}. Effectively you are just making a binary
decision of which class the sample x should belong to. There are many ways you can make the binary decision.
If you like maximum-a-posteriori, you can check whether

pZ|X(1 | x) ≥ pZ|X(2 | x),

and this will return you a simple decision rule. You give us x, we tell you z ∈ {1, 2}.
Decoder: On the decoder side, if we are given a latent code z ∈ {1, . . . , K}, the magical decoder
just needs to return us a sample x which is drawn from pX|Z(x | k) = N(x | µk, σk² I). A different z will
give us one of the K mixture components. If we have enough samples, the overall distribution will
follow the Gaussian mixture.
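As a concrete illustration, here is a small Python sketch of this encoder/decoder pair for K = 2. The mixture weights, means, and standard deviations below are made-up numbers, since the example assumes they are known and fixed.

# Toy mixture "autoencoder": MAP encoder picks a cluster, decoder samples from it.
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])          # mixture weights pi_k (illustrative values)
mu = np.array([-2.0, 2.0])         # cluster means mu_k
sigma = np.array([0.5, 1.0])       # cluster standard deviations sigma_k

def encode(x):
    # MAP rule: pick the k maximizing p(Z = k | x), proportional to pi_k N(x | mu_k, sigma_k^2).
    posterior = pi * norm.pdf(x, loc=mu, scale=sigma)
    return int(np.argmax(posterior)) + 1       # latent code z in {1, 2}

def decode(z, rng=np.random.default_rng(0)):
    # Return a fresh sample drawn from the z-th component N(mu_z, sigma_z^2).
    return rng.normal(mu[z - 1], sigma[z - 1])

z = encode(1.7)        # 1.7 sits near the second cluster, so z = 2
x_new = decode(z)      # a new sample from that cluster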
Smart readers like you will certainly complain: “Your example is so trivially unreal.” No worries. We
understand. Life is of course a lot harder than a Gaussian mixture model with known means and known
variances. But one thing we realize is that if we want to find the magical encoder and decoder, we must have
a way to find the two conditional distributions. However, they are both high-dimensional creatures. So,
in order for us to say something more meaningful, we need to impose additional structures so that we can
generalize the concept to harder problems.
In the VAE literature, people came up with the idea of considering the following two proxy distributions:
• qϕ (z|x): The proxy for p(z|x). We will make it a Gaussian. Why Gaussian? No particular good
reason. Perhaps we are just ordinary (aka lazy) human beings.
• pθ (x|z): The proxy for p(x|z). Believe it or not, we will make it a Gaussian too. But the role of this
Gaussian is slightly different from that of the Gaussian qϕ (z|x). While we will need to estimate the mean
and variance for the Gaussian qϕ (z|x), we do not need to estimate anything for the Gaussian pθ (x|z).
Instead, we will need a decoder neural network to turn z into x. The Gaussian pθ (x|z) will be used to
inform us how good our generated image x is.
The relationship between the input x and the latent z, as well as the conditional distributions, is
summarized in Figure 1. There are two nodes x and z. The “forward” relationship is specified by p(z|x)
(and approximated by qϕ (z|x)), whereas the “reverse” relationship is specified by p(x|z) (and approximated
by pθ (x|z)).
Figure 1: In a variational autoencoder, the variables x and z are connected by the conditional distributions
p(x|z) and p(z|x). To make things work, we introduce two proxy distributions pθ (x|z) and qϕ (z|x), respectively.
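To make the two proxies a bit more tangible, here is a minimal numpy sketch. The tiny affine “networks” and their weights are purely illustrative stand-ins for real encoder/decoder networks; only the structure follows the text: qϕ (z|x) is a Gaussian whose mean and (log-)variance come from the encoder, and pθ (x|z) is a Gaussian centered at the decoder output, used to score the reconstruction.

# Sketch of the two proxy Gaussians, with toy affine maps standing in for networks.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # Stand-in for an encoder network: returns the mean and log-variance of q_phi(z|x).
    return 0.9 * x - 0.1, np.zeros_like(x)    # illustrative weights, not learned

def decoder(z):
    # Stand-in for a decoder network: returns the mean of p_theta(x|z).
    return 1.1 * z + 0.1                      # illustrative weights, not learned

x = rng.normal(size=3)                        # a toy "image" with 3 pixels

# Sample z ~ q_phi(z|x) = N(mean, exp(logvar)) by reparameterization.
mean_z, logvar_z = encoder(x)
z = mean_z + np.exp(0.5 * logvar_z) * rng.normal(size=3)

# p_theta(x|z) = N(x | decoder(z), I): its log-density tells us how good the
# generated/reconstructed x is (unit variance assumed here for simplicity).
x_hat = decoder(z)
log_p_x_given_z = -0.5 * np.sum((x - x_hat) ** 2 + np.log(2 * np.pi))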
Example. Suppose that we have a random variable x and a latent variable z such that

x ∼ N(x | µ, σ²),
z ∼ N(z | 0, 1).

Our goal is to construct a VAE. (What?! This problem has a trivial solution where z = (x − µ)/σ and
x = µ + σz. You are absolutely correct. But please follow our derivation to see if the VAE framework
makes sense.)
By constructing a VAE, we mean that we want to build two mappings “encode” and “decode”. For
simplicity, let's assume that both mappings are affine transformations:

encode(x) = ax + b,        decode(z) = cz + d,

for some parameters (a, b, c, d) to be determined.
We are too lazy to find out the joint distribution p(x, z) or the conditional distributions p(x|z)
and p(z|x). But we can construct the proxy distributions qϕ (z|x) and pθ (x|z). Since we have the
freedom to choose what qϕ and pθ should look like, how about we consider the following two Gaussians:

qϕ (z|x) = N(z | ax + b, 1),
pθ (x|z) = N(x | cz + d, c).
The choice of these two Gaussians is not mysterious. For qϕ (z|x): if we are given x, of course we want
the encoder to encode the distribution according to the structure we have chosen. Since the encoder
structure is ax + b, the natural choice for qϕ (z|x) is to have mean ax + b. The variance is chosen
as 1 because we know that the encoded sample z should have unit variance. Similarly, for pθ (x|z): if we
are given z, the decoder must take the form of cz + d because this is how we set up the decoder. The
variance is c, which is a parameter we need to figure out.
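As a quick numerical sanity check of the trivial solution mentioned above (a = 1/σ, b = −µ/σ for the encoder and c = σ, d = µ for the decoder), consider the following Python snippet; the particular values of µ and σ are arbitrary.

# Sanity check: with a = 1/sigma, b = -mu/sigma, c = sigma, d = mu, the encoder maps
# x ~ N(mu, sigma^2) to roughly zero-mean unit-variance z, and the decoder maps it back.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0                        # arbitrary illustrative values

a, b = 1.0 / sigma, -mu / sigma             # encoder parameters
c, d = sigma, mu                            # decoder parameters

x = rng.normal(mu, sigma, size=100_000)
z = a * x + b                               # encode
x_hat = c * z + d                           # decode

print(z.mean(), z.var())                    # approximately 0 and 1
print(x_hat.mean(), x_hat.var())            # approximately mu and sigma^2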
We will pause for a moment before continuing this example. We want to introduce a mathematical
tool: the Evidence Lower Bound (ELBO), which is defined as

ELBO(x) := E_{qϕ(z|x)} [ log ( p(x, z) / qϕ(z|x) ) ].

You are certainly puzzled: how on earth did people come up with this loss function?! Let's see what
the ELBO means and how it is derived.