A Score-Based Density Formula, with Applications in

Diffusion Generative Models


Gen Li∗† Yuling Yan∗‡
arXiv:2408.16765v1 [cs.LG] 29 Aug 2024

August 30, 2024

Abstract
Score-based generative models (SGMs) have revolutionized the field of generative modeling, achieving
unprecedented success in generating realistic and diverse content. Despite empirical advances, the
theoretical basis for why optimizing the evidence lower bound (ELBO) on the log-likelihood is effective
for training diffusion generative models, such as DDPMs, remains largely unexplored. In this paper, we
address this question by establishing a density formula for a continuous-time diffusion process, which
can be viewed as the continuous-time limit of the forward process in an SGM. This formula reveals the
connection between the target density and the score function associated with each step of the forward
process. Building on this, we demonstrate that the minimizer of the optimization objective for training
DDPMs nearly coincides with that of the true objective, providing a theoretical foundation for optimizing
DDPMs using the ELBO. Furthermore, we offer new insights into the role of score-matching regularization
in training GANs, the use of ELBO in diffusion classifiers, and the recently proposed diffusion
loss.

Keywords: score-based density formula, score-based generative model, evidence lower bound, denoising
diffusion probabilistic model

∗ The authors contributed equally.
† Department of Statistics, The Chinese University of Hong Kong, Hong Kong; Email: genli@cuhk.edu.hk.
‡ Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; Email: yuling.yan@wisc.edu.

Contents
1 Introduction
2 Problem set-up
2.1 Denoising diffusion probabilistic model
2.2 A continuous-time SDE for the forward process
3 The score-based density formula
3.1 Main results
3.2 From continuous time to discrete time
3.3 Comparison with other results
4 Implications
4.1 Certifying the validity of optimizing ELBO in DDPM
4.2 Understanding the role of regularization in GAN
4.3 Confirming the use of ELBO in diffusion classifier
4.4 Demystifying the diffusion loss in autoregressive models
5 Proof of Theorem 1
6 Discussion
A Proof of Proposition 1
B Proof of Proposition 2
C More discussions on the density formulas
D Technical details in Section 4
D.1 Technical details in Section 4.1
D.2 Technical details in Section 4.2

1 Introduction
Score-based generative models (SGMs) represent a groundbreaking advancement in the realm of generative
models, significantly impacting machine learning and artificial intelligence by their ability to synthesize
high-fidelity data instances, including images, audio, and text (Dhariwal and Nichol, 2021; Ho et al., 2020;
Sohl-Dickstein et al., 2015; Song et al., 2021a; Song and Ermon, 2019; Song et al., 2021b). These models
operate by progressively refining noisy data into samples that resemble the target distribution. Due to their
innovative approach, SGMs have achieved unprecedented success, setting new standards in generative AI and
demonstrating extraordinary proficiency in generating realistic and diverse content across various domains,
from image synthesis and super-resolution to audio generation and molecular design (Croitoru et al., 2023;
Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022; Yang et al., 2023).
The foundation of SGMs is rooted in the principles of stochastic processes, especially stochastic differential
equations (SDEs). These models utilize a forward process, which involves the gradual corruption of an initial
data sample with Gaussian noise over several time steps. This forward process can be described as:
\[ X_0 \xrightarrow{\text{add noise}} X_1 \xrightarrow{\text{add noise}} \cdots \xrightarrow{\text{add noise}} X_T, \tag{1.1} \]

where X0 ∼ pdata is the original data sample, and XT is a sample close to pure Gaussian noise. The
ingenuity of SGMs lies in constructing a reverse denoising process that iteratively removes the noise, thereby
reconstructing the data distribution. This reverse process starts from a Gaussian sample YT and moves
backward as:
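\[ Y_T \xrightarrow{\text{denoise}} Y_{T-1} \xrightarrow{\text{denoise}} \cdots \xrightarrow{\text{denoise}} Y_0, \tag{1.2} \]
ensuring that $Y_t \overset{\mathrm{d}}{\approx} X_t$ at each step t. The final output Y0 is a new sample that closely mimics the distribution
of the initial data pdata .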
Inspired by the classical results on time-reversal of SDEs (Anderson, 1982; Haussmann and Pardoux,
1986), SGMs construct the reverse process guided by score functions ∇ log pXt associated with each step of
the forward process. Although these score functions are unknown, they are approximated by neural
networks trained through score-matching techniques (Hyvärinen, 2005, 2007; Song and Ermon, 2019; Vincent,
2011). This leads to two popular models: denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020;
Nichol and Dhariwal, 2021) and denoising diffusion implicit models (DDIMs) (Song et al., 2021a). While the
theoretical results in this paper do not depend on the specific construction of the reverse process, we will
use the DDPM framework to discuss their implications for diffusion generative models.
However, despite empirical advances, there remains a lack of theoretical understanding for diffusion
generative models. For instance, the optimization target of DDPM is derived from a variational lower
bound on the log-likelihood (Ho et al., 2020), which is also referred to as the evidence lower bound (ELBO)
(Luo, 2022). It is not yet clear, from a theoretical standpoint, why optimizing a lower bound of the true
objective is still a valid approach. More surprisingly, recent research suggests incorporating the ELBO of
a pre-trained DDPM into other generative or learning frameworks to leverage the strengths of multiple
architectures, effectively using it as a proxy for the negative log-likelihood of the data distribution. This
approach has shown empirical success in areas such as GAN training, classification, and inverse problems
(Graikos et al., 2022; Li et al., 2023a; Mardani et al., 2024; Xia et al., 2023). While it is conceivable that
the ELBO is a reasonable optimization target for training DDPMs (as a similar idea is utilized in, e.g., the
majorize-minimization algorithm), it is more mysterious why it serves as a good proxy for the negative
log-likelihood in these applications.
In this paper, we take a step towards addressing the aforementioned question. On the theoretical side,
we establish a density formula for a diffusion process (Xt )0≤t<1 defined by the following SDE:
\[ \mathrm{d}X_t = -\frac{1}{2(1-t)} X_t \,\mathrm{d}t + \frac{1}{\sqrt{1-t}} \,\mathrm{d}B_t \quad (0 \le t < 1), \qquad X_0 \sim p_{\mathsf{data}}, \]
which can be viewed as a continuous-time limit of the forward process (1.1). Under some regularity conditions,
this formula expresses the density of X0 with the score function along this process, having the form
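\[ \log p_{X_0}(x) = -\frac{1+\log(2\pi)}{2}\, d - \int_0^1 \left( \frac{1}{2(1-t)}\, \mathbb{E}\left[ \left\| \frac{X_t - \sqrt{1-t}\,X_0}{t} + \nabla \log p_{X_t}(X_t) \right\|_2^2 \,\Big|\, X_0 = x \right] - \frac{d}{2t} \right) \mathrm{d}t, \]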
where pXt (·) is the density of Xt . By time-discretization, this reveals the connection between the target
density pdata and the score function associated with each step of the forward process (1.1). These theoretical
results will be presented in Section 3.
Finally, using this density formula, we demonstrate that the minimizer of the optimization target for
training DDPMs (derived from the ELBO) also nearly minimizes the true target—the KL divergence between
the target distribution and the generator distribution. This finding provides a theoretical foundation for
optimizing DDPMs using the ELBO. Additionally, we use this formula to offer new insights into the role of
score-matching regularization in training GANs (Xia et al., 2023), the use of ELBO in diffusion classifiers
(Li et al., 2023a), and the recently proposed diffusion loss (Li et al., 2024). These implications will be
discussed in Section 4.

2 Problem set-up
In this section, we formally introduce the Denoising Diffusion Probabilistic Model (DDPM) and the stochastic
differential equation (SDE) that describes the continuous-time limit of the forward process of DDPM.

2.1 Denoising diffusion probabilistic model


Consider the following forward Markov process in discrete time:
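\[ X_t = \sqrt{1-\beta_t}\, X_{t-1} + \sqrt{\beta_t}\, W_t \quad (t = 1, \ldots, T), \qquad X_0 \sim p_{\mathsf{data}}, \tag{2.1} \]
where $W_1, \ldots, W_T \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$ and the learning rates $\beta_t \in (0, 1)$. Since our main results do not depend on
the specific choice of $\beta_t$, we will specify them as needed in later discussions. For each $t \in [T]$, let $q_t$ be the
law or density function of $X_t$, and let $\alpha_t := 1 - \beta_t$ and $\overline{\alpha}_t := \prod_{i=1}^{t} \alpha_i$. A simple calculation shows that:
\[ X_t = \sqrt{\overline{\alpha}_t}\, X_0 + \sqrt{1-\overline{\alpha}_t}\, \overline{W}_t \quad \text{where} \quad \overline{W}_t \sim \mathcal{N}(0, I_d). \tag{2.2} \]
We will choose the learning rates $\beta_t$ to ensure that $\overline{\alpha}_T$ is sufficiently small, such that $q_T$ is close to the
standard Gaussian distribution.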
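As a quick sanity check of (2.2), the following Python sketch propagates the coefficient of X0 and the accumulated noise variance through the recursion (2.1) and compares them with the closed form. The linear schedule from 1e-4 to 0.02 with T = 1000 is an assumption made only for illustration; the text deliberately leaves the choice of βt open.

```python
import numpy as np

# Hypothetical linear schedule (an assumption; the text leaves beta_t unspecified).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # alpha_bar[t-1] = prod_{i<=t} alpha_i

# Propagate X_t = sqrt(alpha_t) X_{t-1} + sqrt(beta_t) W_t step by step,
# tracking the coefficient of X_0 and the variance of the accumulated noise.
coef, var = 1.0, 0.0
coefs, variances = np.empty(T), np.empty(T)
for t in range(T):
    coef = np.sqrt(alphas[t]) * coef
    var = alphas[t] * var + betas[t]
    coefs[t], variances[t] = coef, var

# Closed form (2.2): coefficient sqrt(alpha_bar_t), noise variance 1 - alpha_bar_t.
max_coef_gap = np.max(np.abs(coefs - np.sqrt(alpha_bar)))
max_var_gap = np.max(np.abs(variances - (1.0 - alpha_bar)))
print(max_coef_gap, max_var_gap, alpha_bar[-1])
```

Under this particular schedule the last cumulative product is small, which is the property the text asks for so that qT is close to the standard Gaussian.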
The key components for constructing the reverse process in the context of DDPM are the score functions
$s_t^\star : \mathbb{R}^d \to \mathbb{R}^d$ associated with each $q_t$, defined as the gradient of their log density:
\[ s_t^\star(x) := \nabla \log q_t(x) \quad (t = 1, \ldots, T). \]
While these score functions are not explicitly known, in practice, noise-prediction networks $\varepsilon_t(x)$ are trained
to predict
\[ \varepsilon_t^\star(x) := -\sqrt{1-\overline{\alpha}_t}\, s_t^\star(x), \]
which are often referred to as epsilon predictors. To construct the reverse process, we use:
\[ Y_{t-1} = \frac{1}{\sqrt{\alpha_t}} \big( Y_t + \eta_t s_t(Y_t) + \sigma_t Z_t \big) \quad (t = T, \ldots, 1), \qquad Y_T \sim \mathcal{N}(0, I_d), \tag{2.3} \]
where $Z_1, \ldots, Z_T \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$, and $s_t(\cdot) := -\varepsilon_t(\cdot)/\sqrt{1-\overline{\alpha}_t}$ is the estimate of the score function $s_t^\star(\cdot)$. Here
$\eta_t, \sigma_t > 0$ are the coefficients that influence the performance of the DDPM sampler, and we will specify them
as needed in later discussion. For each $t \in [T]$, we use $p_t$ to denote the law or density of $Y_t$.

2.2 A continuous-time SDE for the forward process
In this paper, we build our theoretical results on the continuous-time limit of the aforementioned forward
process, described by the diffusion process:
\[ \mathrm{d}X_t = -\frac{1}{2(1-t)} X_t \,\mathrm{d}t + \frac{1}{\sqrt{1-t}} \,\mathrm{d}B_t \quad (0 \le t < 1), \qquad X_0 \sim p_{\mathsf{data}}, \tag{2.4} \]

where (Bt )t≥0 is a standard Brownian motion. The solution to this stochastic differential equation (SDE)
has the closed-form expression:
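\[ X_t = \sqrt{1-t}\, X_0 + \sqrt{t}\, \overline{Z}_t \quad \text{where} \quad \overline{Z}_t = \sqrt{\frac{1-t}{t}} \int_0^t \frac{1}{1-s}\, \mathrm{d}B_s \sim \mathcal{N}(0, I_d). \tag{2.5} \]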

It is important to note that the process Xt is not defined at t = 1, although it is straightforward to see from
the above equation that Xt converges to a Gaussian variable as t → 1.
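The closed form (2.5) can be checked numerically. The sketch below is a plain Euler–Maruyama simulation of (2.4) in one dimension, started from a fixed point x0; the step count, path count, seed, and the values x0 = 2 and t = 0.5 are illustrative assumptions. According to (2.5), the samples at time t should have mean √(1 − t)·x0 and variance t, up to discretization and Monte Carlo error.

```python
import numpy as np

def simulate_forward_sde(x0, t_end, n_steps=500, n_paths=20000, seed=0):
    """Euler-Maruyama paths of dX = -X / (2(1-t)) dt + dB / sqrt(1-t), X_0 = x0 (scalar case)."""
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    x = np.full(n_paths, float(x0))
    for k in range(n_steps):
        t = k * dt
        noise = rng.standard_normal(n_paths)
        # Drift step plus Gaussian increment with variance dt / (1 - t).
        x = x - x / (2.0 * (1.0 - t)) * dt + np.sqrt(dt / (1.0 - t)) * noise
    return x

x0, t_end = 2.0, 0.5
samples = simulate_forward_sde(x0, t_end)
# (2.5) predicts mean sqrt(1 - t) * x0 and variance t at time t.
print(samples.mean(), np.sqrt(1.0 - t_end) * x0)
print(samples.var(), t_end)
```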
To demonstrate the connection between this diffusion process and the forward process (2.1) of the diffusion
model, we evaluate the diffusion process at times $t_i = 1 - \overline{\alpha}_i$ for 1 ≤ i ≤ T . It is straightforward to check
that the marginal distribution of the resulting discrete-time process {Xti : 1 ≤ i ≤ T } is identical to that of
the forward process (2.1). Therefore the diffusion process (2.4) can be viewed as a continuous-time limit of
the forward process. In the next section, we will establish theoretical results for the diffusion process (2.4).
Through time discretization, our theory will provide insights for the DDPM.
We use the notation Xt for both the discrete-time process {Xt : t ∈ [T ]} in (2.1) and the continuous-time
diffusion process (Xt )0≤t<1 in (2.4) to maintain consistency with standard literature. The context will clarify
which process is being referred to.

3 The score-based density formula


3.1 Main results
Our main results are based on the continuous-time diffusion process (Xt )0≤t<1 defined in (2.4). While X0
might not have a density, for any t ∈ (0, 1), the random variable Xt has a smooth density, denoted by ρt (·).
Our main result characterizes the evolution of the conditional mean of log ρt (Xt ) given X0 , as stated below.
Theorem 1. Consider the diffusion process (Xt )0≤t<1 defined in (2.4), and let ρt be the density of Xt . For
any 0 < t1 < t2 < 1, we have
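\[ \mathbb{E}\big[\log \rho_{t_2}(X_{t_2}) - \log \rho_{t_1}(X_{t_1}) \mid X_0\big] = \int_{t_1}^{t_2} \left( \frac{1}{2(1-t)}\, \mathbb{E}\left[ \left\| \frac{X_t - \sqrt{1-t}\,X_0}{t} + \nabla \log \rho_t(X_t) \right\|_2^2 \,\Big|\, X_0 \right] - \frac{d}{2t} \right) \mathrm{d}t. \]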

The proof of this theorem is deferred to Section 5. A few remarks are as follows. First, it is worth
mentioning that this formula does not describe the evolution of the (conditional) differential entropy of the
process, because ρt (·) represents the unconditional density of Xt , while the expectation is taken conditional
on X0 . Second, without further assumptions, we cannot set t1 = 0 or t2 = 1 because X0 might not have a
density (hence ρ0 is not well-defined), and Xt is only defined for t < 1. By assuming that X0 has a finite
second moment, the following proposition characterizes the limit of E[log ρt (Xt ) | X0 ] as t → 1.
Proposition 1. Suppose that $\mathbb{E}[\|X_0\|_2^2] < \infty$. Then for any $x_0 \in \mathbb{R}^d$, we have
\[ \lim_{t \to 1^-} \mathbb{E}\big[\log \rho_t(X_t) \mid X_0 = x_0\big] = -\frac{1+\log(2\pi)}{2}\, d. \]
The proof of this proposition is deferred to Appendix A. This result is not surprising, as it can be seen
from (2.5) that Xt converges to a standard Gaussian variable as t → 1 regardless of x0 , and we can check

\[ \mathbb{E}[\log \phi(Z)] = -\frac{1+\log(2\pi)}{2}\, d, \]
where $Z \sim \mathcal{N}(0, I_d)$ and $\phi(\cdot)$ is its density (we will use this notation throughout this section). The proof of
Proposition 1 formalizes this intuitive analysis.
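For completeness, the Gaussian identity used in this heuristic can be written out in one line. This is a routine calculation and relies only on the fact that the squared norm of a standard Gaussian vector in d dimensions has expectation d:

```latex
\mathbb{E}[\log \phi(Z)]
  = \mathbb{E}\Big[ -\frac{d}{2}\log(2\pi) - \frac{1}{2}\|Z\|_2^2 \Big]
  = -\frac{d}{2}\log(2\pi) - \frac{d}{2}
  = -\frac{1+\log(2\pi)}{2}\, d .
```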
When X0 has a smooth density ρ0 (·) with a Lipschitz continuous score function, we can show that
$\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] \to \log \rho_0(x_0)$ as t → 0, as presented in the next proposition.
Proposition 2. Suppose that $X_0$ has density $\rho_0(\cdot)$ and $\sup_x \|\nabla^2 \log \rho_0(x)\| < \infty$. Then for any $x_0 \in \mathbb{R}^d$,
we have
\[ \lim_{t \to 0^+} \mathbb{E}\big[\log \rho_t(X_t) \mid X_0 = x_0\big] = \log \rho_0(x_0). \]

The proof of this proposition can be found in Appendix B. With Propositions 1 and 2 in place, we can
take t1 → 0 and t2 → 1 in Theorem 1 to show that for any given point x0 ,
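\[ \log \rho_0(x_0) = -\frac{1+\log(2\pi)}{2}\, d - \int_0^1 D(t, x_0)\, \mathrm{d}t, \tag{3.1a} \]
where the function $D(t, x)$ is defined as
\[ D(t, x) := \frac{1}{2(1-t)}\, \mathbb{E}\left[ \left\| \frac{X_t - \sqrt{1-t}\,X_0}{t} + \nabla \log \rho_t(X_t) \right\|_2^2 \,\Big|\, X_0 = x \right] - \frac{d}{2t}. \tag{3.1b} \]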

In practice, we might not want to make smoothness assumptions on X0 as in Proposition 2. In that case,
we can fix some sufficiently small δ > 0 and obtain a density formula
\[ \mathbb{E}\big[\log \rho_\delta(X_\delta) \mid X_0 = x_0\big] = -\frac{1+\log(2\pi)}{2}\, d - \int_\delta^1 D(t, x_0)\, \mathrm{d}t \tag{3.1c} \]
for a smoothed approximation of log ρ0 (x0 ). This kind of approximation is often used to circumvent non-smooth
target distributions in the diffusion model literature (e.g., Benton et al. (2023); Chen et al. (2023b,
2022); Li et al. (2023b)). We leave some more discussions to Appendix C.
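The formula (3.1a) can be verified by hand in a Gaussian toy case, which may help build intuition. The sketch below assumes X0 ∼ N(0, σ²·I_d) (an illustrative choice, not a setting discussed in the text), so that ρt is the N(0, ((1 − t)σ² + t)·I_d) density and the conditional expectation in (3.1b) is available in closed form. The integral over (0, 1) is approximated by the midpoint rule; the function names and grid size are hypothetical.

```python
import numpy as np

def D_gaussian(t, x0, sigma2):
    """D(t, x0) from (3.1b) when X_0 ~ N(0, sigma2 * I_d), using closed-form Gaussian moments."""
    d = x0.shape[0]
    v = (1.0 - t) * sigma2 + t  # per-coordinate variance of X_t
    # Given X_0 = x0 and X_t = sqrt(1-t) x0 + sqrt(t) eps with eps ~ N(0, I_d):
    # (X_t - sqrt(1-t) x0)/t + grad log rho_t(X_t)
    #   = (1/sqrt(t) - sqrt(t)/v) * eps - sqrt(1-t) * x0 / v.
    second_moment = (d * (1.0 / np.sqrt(t) - np.sqrt(t) / v) ** 2
                     + (1.0 - t) * np.dot(x0, x0) / v**2)
    return second_moment / (2.0 * (1.0 - t)) - d / (2.0 * t)

def log_density_via_formula(x0, sigma2, n_grid=200_000):
    """Right-hand side of (3.1a); the integral over (0, 1) uses the midpoint rule."""
    d = x0.shape[0]
    t = (np.arange(n_grid) + 0.5) / n_grid  # midpoints avoid t = 0 and t = 1
    integral = np.mean(D_gaussian(t, x0, sigma2))
    return -0.5 * (1.0 + np.log(2.0 * np.pi)) * d - integral

x0 = np.array([1.0, -0.5])
sigma2 = 4.0
estimate = log_density_via_formula(x0, sigma2)
exact = -0.5 * x0.shape[0] * np.log(2.0 * np.pi * sigma2) - np.dot(x0, x0) / (2.0 * sigma2)
print(estimate, exact)
```

In this example the two singular pieces of D(t, x0) near t = 0 cancel, so the integrand is bounded and a simple quadrature rule suffices.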

3.2 From continuous time to discrete time


In this section, to avoid ambiguity, we will use $(X_t^{\mathsf{sde}})_{0 \le t < 1}$ to denote the continuous-time diffusion process
(2.4) studied in the previous section, while continuing to use {Xt : 1 ≤ t ≤ T } to denote the forward process (2.1).
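The density formula (3.1) is not readily implementable because of its continuous-time nature. Consider time
discretization over the grid
\[ 0 < t_1 < t_2 < \cdots < t_T < t_{T+1} = 1 \quad \text{where} \quad t_i := 1 - \overline{\alpha}_i \quad (1 \le i \le T). \]
Recall that the forward process $X_1, \ldots, X_T$ has the same marginal distributions as the snapshots
$X_{t_1}^{\mathsf{sde}}, \ldots, X_{t_T}^{\mathsf{sde}}$ of the diffusion process (2.4). This gives the following approximation of the density formula (3.1a):
\begin{align*}
\log \rho_0(x_0) &\overset{\text{(i)}}{\approx} \mathbb{E}\big[\log \rho_{t_1}(X_{t_1}^{\mathsf{sde}}) \mid X_0^{\mathsf{sde}} = x_0\big] \\
&\overset{\text{(ii)}}{\approx} -\frac{1+\log(2\pi t_1)}{2}\, d - \sum_{i=1}^{T} \frac{t_{i+1}-t_i}{2(1-t_i)}\, \mathbb{E}\left[ \left\| \frac{X_{t_i}^{\mathsf{sde}} - \sqrt{1-t_i}\,X_0^{\mathsf{sde}}}{t_i} + \nabla \log \rho_{t_i}(X_{t_i}^{\mathsf{sde}}) \right\|_2^2 \,\Big|\, X_0^{\mathsf{sde}} = x_0 \right] \\
&\overset{\text{(iii)}}{\approx} -\frac{1+\log(2\pi t_1)}{2}\, d - \sum_{i=1}^{T} \frac{t_{i+1}-t_i}{2 t_i (1-t_i)}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \widehat{\varepsilon}_i\big(\sqrt{1-t_i}\,x_0 + \sqrt{t_i}\,\varepsilon\big) \big\|_2^2 \Big].
\end{align*}
In step (i) we approximate log ρ0 (x0 ) with a smoothed proxy (see the discussion around (3.1c) for details);
step (ii) applies (3.1c), where we compute the integral $\int_{t_1}^{1} \frac{d}{2t}\, \mathrm{d}t = -\frac{d}{2} \log t_1$ in closed form and
approximate the integral
\[ \int_{t_1}^{1} \frac{1}{2(1-t)}\, \mathbb{E}\left[ \left\| \frac{X_t^{\mathsf{sde}} - \sqrt{1-t}\,X_0^{\mathsf{sde}}}{t} + \nabla \log \rho_t(X_t^{\mathsf{sde}}) \right\|_2^2 \,\Big|\, X_0^{\mathsf{sde}} = x_0 \right] \mathrm{d}t \]
by its Riemann sum over the grid;
step (iii) follows from $X_{t_i}^{\mathsf{sde}} \overset{\mathrm{d}}{=} \sqrt{1-t_i}\,x_0 + \sqrt{t_i}\,\varepsilon$ for $\varepsilon \sim \mathcal{N}(0, I_d)$ conditional on $X_0^{\mathsf{sde}} = x_0$, and the relation
\[ \nabla \log \rho_{t_i}(x) = \nabla \log q_i(x) = s_i^\star(x) = -\frac{1}{\sqrt{t_i}}\, \varepsilon_i^\star(x) \approx -\frac{1}{\sqrt{t_i}}\, \widehat{\varepsilon}_i(x). \]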

In practice, we need to choose the learning rates {βt : 1 ≤ t ≤ T } such that the grid becomes finer as T
becomes large. More specifically, we require

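\[ t_{i+1} - t_i = \overline{\alpha}_i - \overline{\alpha}_{i+1} = \overline{\alpha}_i \beta_{i+1} \le \beta_{i+1} \quad (1 \le i \le T-1) \]
to be small (roughly of order O(1/T )), and $t_1 = \beta_1$ and $1 - t_T = \overline{\alpha}_T$ to be vanishingly small (of order $T^{-c}$
for some sufficiently large constant c > 0); see e.g., Benton et al. (2023); Li et al. (2023b) for learning rate
schedules satisfying these properties. Finally, we replace the time steps {ti : 1 ≤ i ≤ T } with the learning
rates for the forward process to obtain¹
\[ \log \rho_0(x_0) \approx -\frac{1+\log(2\pi\beta_1)}{2}\, d - \sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \widehat{\varepsilon}_t\big(\sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \big\|_2^2 \Big]. \tag{3.2} \]
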

The density approximation (3.2) can be evaluated with the trained epsilon predictors.
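To illustrate how (3.2) might be evaluated, here is a minimal Monte Carlo sketch. Everything outside the formula itself is an assumption made for the example: the function names, the geometric time grid, the sample size, and the toy target N(0, σ²·I_d), which is chosen because its optimal epsilon predictor is known in closed form and can stand in for a trained network.

```python
import numpy as np

def log_density_estimate(x0, eps_hat, alpha_bar, n_mc=256, seed=0):
    """Monte Carlo evaluation of the right-hand side of (3.2).

    eps_hat(t, x) returns the epsilon prediction at 0-based step index t for a
    batch x of shape (n_mc, d); alpha_bar[t] is the cumulative product at step t + 1.
    """
    rng = np.random.default_rng(seed)
    d, T = x0.shape[0], alpha_bar.shape[0]
    alpha = alpha_bar / np.concatenate(([1.0], alpha_bar[:-1]))
    alpha_next = np.concatenate((alpha[1:], [0.0]))  # alpha_{t+1}, with alpha_{T+1} := 0
    beta_1 = 1.0 - alpha[0]
    total = -0.5 * (1.0 + np.log(2.0 * np.pi * beta_1)) * d
    for t in range(T):
        eps = rng.standard_normal((n_mc, d))
        x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
        sq_err = np.mean(np.sum((eps - eps_hat(t, x_t)) ** 2, axis=1))
        total -= (1.0 - alpha_next[t]) / (2.0 * (1.0 - alpha_bar[t])) * sq_err
    return total

# Toy check: data ~ N(0, sigma2 * I_d), whose optimal predictor is known exactly.
T, sigma2 = 2000, 4.0
t_grid = np.geomspace(1e-5, 1.0, T + 1)  # hypothetical grid t_1 < ... < t_{T+1} = 1
alpha_bar = 1.0 - t_grid[:-1]            # cumulative products via t_i = 1 - alpha_bar_i

def eps_star(t, x):
    a = alpha_bar[t]
    return np.sqrt(1.0 - a) * x / (a * sigma2 + 1.0 - a)

x0 = np.array([1.0, -0.5])
estimate = log_density_estimate(x0, eps_star, alpha_bar)
exact = -0.5 * x0.shape[0] * np.log(2.0 * np.pi * sigma2) - np.dot(x0, x0) / (2.0 * sigma2)
print(estimate, exact)
```

The geometric grid keeps the relative step (t_{i+1} − t_i)/t_i small near t = 0, in the spirit of the schedule requirements stated above; with a coarser grid near zero the discretization error of the sum can become noticeable.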

3.3 Comparison with other results


The density formulas (3.1) express the density of X0 using the score function along the continuous-time
limit of the forward process of the diffusion model. Other forms of score-based density formulas can be
derived using normalizing flows. Notice that the probability flow ODE of the SDE (2.4) is
\[ \dot{x}_t = v_t(x_t) \quad \text{where} \quad v_t(x) = -\frac{x + \nabla \log \rho_t(x)}{2(1-t)}; \tag{3.3} \]
namely, if we draw a particle x0 ∼ ρ0 and evolve it according to the ODE (3.3) to get the trajectory t → xt
for t ∈ [0, 1), then xt ∼ ρt . See e.g., Song et al. (2021b, Appendix D.1) for the derivation of this result.
Under some smoothness condition, we can use the results developed in Albergo et al. (2023); Grathwohl et al.
(2019) to show that for any given x0
\[ \log \rho_t(x_t) - \log \rho_0(x_0) = -\int_0^t \mathrm{Tr}\Big( \frac{\partial}{\partial x} v_s(x_s) \Big)\, \mathrm{d}s = \int_0^t \frac{d + \mathrm{tr}\big(\nabla^2 \log \rho_s(x_s)\big)}{2(1-s)}\, \mathrm{d}s. \tag{3.4} \]
Here t → xt is the solution to the ODE (3.3) with initial condition x0 . Since the ODE system (3.3) is based
on the score functions (hence xt can be numerically solved), and the integral in (3.4) is based on the Jacobian
of the score functions, we may take t → 1 and use the fact that ρt (·) → φ(·) to obtain a score-based density
formula
\[ \log \rho_0(x_0) = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \|x_1\|_2^2 - \int_0^1 \frac{d + \mathrm{tr}\big(\nabla^2 \log \rho_s(x_s)\big)}{2(1-s)}\, \mathrm{d}s. \tag{3.5} \]
However, numerically, this formula is more difficult to compute than our formula (3.1) for the following
reasons. First, (3.5) involves the Jacobian of the score functions, which are more challenging to estimate
than the score functions themselves. In fact, existing convergence guarantees for DDPM do not depend on
the accurate estimation of the Jacobian of the score functions (Benton et al., 2023; Chen et al., 2023a, 2022;
Li and Yan, 2024). Second, using this density formula requires solving the ODE (3.3) accurately to obtain
x1 , which might not be numerically stable, especially when the score function is not accurately estimated at
early stages, due to error propagation. In contrast, computing (3.1) only requires evaluating a few Gaussian
integrals (which can be efficiently approximated by the Monte Carlo method) and is more stable to score
estimation error.
1 Here we define $\alpha_{T+1} = 0$ to accommodate the last term in the summation.

4 Implications
In the previous section, we established a density formula

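\[ \log q_0(x) \approx \underbrace{-\frac{1+\log(2\pi\beta_1)}{2}\, d}_{=:\, C_0^\star} - \sum_{t=1}^{T} \underbrace{\frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \varepsilon_t^\star\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \big\|_2^2 \Big]}_{=:\, L_{t-1}^\star(x)} \tag{4.1} \]
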
up to discretization error (which vanishes as T becomes large) and score estimation error. In this section,
we will discuss the implications of this formula in various generative and learning frameworks.

4.1 Certifying the validity of optimizing ELBO in DDPM


The seminal work (Ho et al., 2020) established the variational lower bound (VLB), also known as the evidence
lower bound (ELBO), of the log-likelihood
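\begin{align*}
\log p_0(x) \ge{} & -\sum_{t=2}^{T} \underbrace{\mathbb{E}_{x_t \sim p_{X_t | X_0}(\cdot \mid x)}\, \mathsf{KL}\big( p_{X_{t-1} | X_t, X_0}(\cdot \mid x_t, x) \,\|\, p_{Y_{t-1} | Y_t}(\cdot \mid x_t) \big)}_{=:\, L_{t-1}(x)} \\
& - \underbrace{\mathsf{KL}\big( p_{Y_T}(\cdot) \,\|\, p_{X_T | X_0}(\cdot \mid x) \big)}_{=:\, L_T(x)} + \underbrace{\mathbb{E}_{x_1 \sim p_{X_1 | X_0}(\cdot \mid x)}\big[ \log p_{Y_0 | Y_1}(x \mid x_1) \big]}_{=:\, C_0(x)}, \tag{4.2}
\end{align*}
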

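where the reverse process $(Y_t)_{0 \le t \le T}$ was defined in Section 2.1, and p0 is the density of Y0 . Under the coefficient
design recommended by Li and Yan (2024) (other reasonable designs also lead to similar conclusions)
\[ \eta_t = 1 - \alpha_t \quad \text{and} \quad \sigma_t^2 = \frac{(1-\alpha_t)(\alpha_t - \overline{\alpha}_t)}{1 - \overline{\alpha}_t}, \tag{4.3} \]
it can be computed that for each 2 ≤ t ≤ T :
\[ L_{t-1}(x) = \frac{1-\alpha_t}{2(\alpha_t - \overline{\alpha}_t)}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \varepsilon_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \big\|_2^2 \Big]. \]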

We can verify that (i) for each 2 ≤ t ≤ T , the coefficients in $L_{t-1}$ from (4.2) and $L_{t-1}^\star$ from (4.1) are identical
up to higher-order error; (ii) when T is large, $L_T$ becomes vanishingly small; and (iii) the function
\[ C_0(x) = -\frac{1+\log(2\pi\beta_1)}{2}\, d + O(\beta_1) = C_0^\star + O(\beta_1) \]
is nearly a constant. See Appendix D.1 for details. It is worth highlighting that, as far as we know, the existing
literature has not pointed out that C0 (x) is nearly a constant. For instance, Ho et al. (2020) discretize
this term to obtain a discrete log-likelihood (see Section 3.3 therein), which is unnecessary in view of our
observation. Additionally, some later works incorrectly claim that C0 (x) is negligible, as we will discuss in the
following sections.
Now we discuss the validity of optimizing the variational bound for training DDPMs. Our discussion
shows that

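\[ \mathsf{KL}(q_0 \,\|\, p_0) = \underbrace{-\mathbb{E}_{x \sim q_0}[\log p_0(x)] - H(q_0)}_{=:\, \mathcal{L}(\varepsilon_1, \ldots, \varepsilon_T)} \le \underbrace{\mathbb{E}_{x \sim q_0}[L(x)] - C_0^\star - H(q_0)}_{=:\, \mathcal{L}_{\mathsf{vb}}(\varepsilon_1, \ldots, \varepsilon_T)} + o(1), \tag{4.4} \]
where $H(q_0) = -\int \log q_0(x)\, \mathrm{d}q_0$ is the entropy of q0 , and L(x) denotes the widely used (negative) ELBO²
\[ L(x) := \sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \varepsilon_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \big\|_2^2 \Big]. \]
2 We follow the convention in existing literature to remove the last two terms LT (x) and C0 (x) from (4.2) in the ELBO.

The true objective of DDPM is to learn the epsilon predictors ε1 , . . . , εT that minimize $\mathcal{L}$ in (4.4), while in
practice, the optimization target is the variational bound $\mathcal{L}_{\mathsf{vb}}$. It is known that the global minimizer for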

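\[ \mathbb{E}_{x \sim q_0}[L(x)] = \sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\, \mathbb{E}_{x \sim q_0,\, \varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \varepsilon_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \big\|_2^2 \Big] \tag{4.5} \]
is exactly $\widehat{\varepsilon}_t(\cdot) \equiv \varepsilon_t^\star(\cdot)$ for each 1 ≤ t ≤ T (see Appendix D.1). Although in practice the optimization is
based on samples from the target distribution q0 (instead of the population-level expectation over q0 ) and
may not find the exact global minimizer, we consider the ideal scenario where the learned epsilon predictors
$\widehat{\varepsilon}_t$ equal $\varepsilon_t^\star$ to facilitate discussion. When $\varepsilon_t = \varepsilon_t^\star$ for each t, according to (4.1), we have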

L(x) ≈ − log q0 (x) + C0⋆ . (4.6)

Taking (4.4) and (4.6) together gives

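\[ 0 \le \mathcal{L}(\widehat{\varepsilon}_1, \ldots, \widehat{\varepsilon}_T) \le \mathcal{L}_{\mathsf{vb}}(\widehat{\varepsilon}_1, \ldots, \widehat{\varepsilon}_T) \approx -\mathbb{E}_{x \sim q_0}[\log q_0(x)] + C_0^\star - C_0^\star - H(q_0) = 0, \tag{4.7} \]
namely the minimizer for $\mathcal{L}_{\mathsf{vb}}$ approximately minimizes $\mathcal{L}$, and the optimal value is asymptotically zero when
the number of steps T becomes large. This suggests that by minimizing the variational bound $\mathcal{L}_{\mathsf{vb}}$, the
resulting generator distribution p0 is guaranteed to be close to the target distribution q0 in KL divergence.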
Some experimental evidence suggests that using reweighted coefficients can marginally improve empirical
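performance. For example, Ho et al. (2020) suggest that in practice, it might be better to use uniform
coefficients in the ELBO
\[ L_{\mathsf{simple}}(x) := \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \widehat{\varepsilon}_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \big\|_2^2 \Big] \tag{4.8} \]
when training DDPM to improve sampling quality.³ This strategy has been adopted by many later works.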
In the following sections, we will discuss the role of using the ELBO in different applications. While the
original literature might use the modified ELBO (4.8), in our discussion we will stick to the original ELBO
(4.6) to gain intuition from our theoretical findings.

4.2 Understanding the role of regularization in GAN


Generative Adversarial Networks (GANs) are a powerful and flexible framework for learning the unknown
probability distribution pdata that generates a collection of training data (Goodfellow et al., 2014). GANs
operate on a game between a generator G and a discriminator D, typically implemented using neural networks.
The generator G takes a random noise vector z sampled from a simple distribution pnoise (e.g., Gaussian)
and maps it to a data sample resembling the training data, aiming for the distribution of G(z) to be close to
pdata . Meanwhile, the discriminator D determines whether a sample x is real (i.e., drawn from pdata ) or fake
(i.e., produced by the generator), outputting the probability D(x) of the former. The two networks engage
in a zero-sum game:

\[ \min_{G} \max_{D}\; V(G, D) := \mathbb{E}_{x \sim p_{\mathsf{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_{\mathsf{noise}}}[\log(1 - D(G(z)))], \]

with the generator striving to produce realistic data while the discriminator tries to distinguish real data
from fake. The generator and discriminator are trained iteratively⁴
\[ D \leftarrow \arg\min_{D}\; -\mathbb{E}_{x \sim p_{\mathsf{data}}}[\log D(x)] - \mathbb{E}_{z \sim p_{\mathsf{noise}}}[\log(1 - D(G(z)))], \]
\[ G \leftarrow \arg\min_{G}\; -\mathbb{E}_{z \sim p_{\mathsf{noise}}}[\log D(G(z))] \]
to approach the Nash equilibrium (G⋆ , D⋆ ), where the distribution of G⋆ (z) with z ∼ pnoise matches the
target distribution pdata , and D⋆ (x) = 1/2 for all x.

3 Note that the optimal epsilon predictors $\widehat{\varepsilon}_t$ for L and Lsimple are the same, but in practice, we may not find the optimal
predictors. This practical strategy is beyond the scope of our theoretical result, and implies that the influence of terms from
different steps needs more careful investigation. We conjecture that this is mainly because the estimation error for terms when
t is close to zero is larger, hence smaller coefficients for these terms can improve performance.
4 While the most natural update rule for the generator is $G \leftarrow \arg\min_{G} \mathbb{E}_{z \sim p_{\mathsf{noise}}}[\log(1 - D(G(z)))]$, both schemes are used in
practice and have similar performance. Our choice is for consistency with Xia et al. (2023), and our analysis can be extended
to the other choice.

It is believed that adding a regularization term to make the generated samples fit the VLB can improve
the sampling quality of the generative model. For example, Xia et al. (2023) proposed adding the VLB
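L(x) as a regularization term to the objective function, where $\{\widehat{\varepsilon}_t(\cdot) : 1 \le t \le T\}$ are the learned epsilon
predictors for pdata . The training procedure then becomes
\[ D \leftarrow \arg\min_{D}\; -\mathbb{E}_{x \sim p_{\mathsf{data}}}[\log D(x)] - \mathbb{E}_{z \sim p_{\mathsf{noise}}}[\log(1 - D(G(z)))], \]
\[ G \leftarrow \arg\min_{G}\; -\mathbb{E}_{z \sim p_{\mathsf{noise}}}[\log D(G(z))] + \lambda\, \mathbb{E}_{z \sim p_{\mathsf{noise}}}[L(G(z))], \]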

where λ > 0 is some tuning parameter. However, it remains unclear what exactly is optimized through the
above objective. According to our theory, L(x) ≈ − log pdata (x) + C0⋆ . Assuming that this approximation is
exact for intuitive understanding, the unique Nash equilibrium (Gλ , Dλ ) satisfies

\[ p_{G_\lambda}(x) = \big( z\, p_{\mathsf{data}}(x)^{\lambda} - 1 \big)_{+}\; p_{\mathsf{data}}(x) \]
for some normalizing factor z > 0, where $(a)_+ := \max\{a, 0\}$ and $p_{G_\lambda}$ is the density of Gλ (z) with z ∼ pnoise . See Appendix D.2 for
details. This can be viewed as amplifying the density pdata wherever it is not too small, while zeroing out
the density where pdata is vanishingly small (which is difficult to estimate accurately), thus improving the
sampling quality.
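The shape of this equilibrium can be seen in a small discrete example. The sketch below is only an analogue: it replaces the density pdata by a hypothetical probability vector on five points, takes λ = 1, and finds the normalizing factor z by bisection so that the resulting generator distribution sums to one. None of these choices come from the text.

```python
import numpy as np

def regularized_generator(p_data, lam, n_iter=200):
    """Return p_G = (z * p_data**lam - 1)_+ * p_data, with z chosen so that p_G sums to one."""
    p_data = np.asarray(p_data, dtype=float)

    def total_mass(z):
        return np.sum(np.maximum(z * p_data**lam - 1.0, 0.0) * p_data)

    lo, hi = 0.0, 1.0
    while total_mass(hi) < 1.0:  # total_mass is nondecreasing in z; bracket the root
        hi *= 2.0
    for _ in range(n_iter):      # bisection on z, keeping total_mass(hi) >= 1
        mid = 0.5 * (lo + hi)
        if total_mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    z = hi
    return np.maximum(z * p_data**lam - 1.0, 0.0) * p_data, z

# Hypothetical discrete stand-in for p_data (illustrative numbers only).
p_data = np.array([0.5, 0.3, 0.15, 0.049, 0.001])
p_gen, z = regularized_generator(p_data, lam=1.0)
print(z, p_gen)
```

In this toy example the mass concentrates on the two most likely points and the low-probability points receive exactly zero mass, matching the qualitative description above.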

4.3 Confirming the use of ELBO in diffusion classifier


Motivated by applications like image classification and text-to-image diffusion models, we consider a joint
underlying distribution p0 (x, c), where typically x is the image data and the latent variable c is the class
index or text embedding, taking values in a finite set C. For each c ∈ C, we train a diffusion model for the
conditional data distribution p0 (x | c), which provides a set of epsilon predictors $\{\widehat{\varepsilon}_t(x; c) : 1 \le t \le T,\, c \in \mathcal{C}\}$.
Assuming a uniform prior over C, we can use Bayes’ formula to obtain:
\[ p_0(c \mid x) = \frac{p_0(c)\, p_0(x \mid c)}{\sum_{c_j \in \mathcal{C}} p_0(c_j)\, p_0(x \mid c_j)} = \frac{p_0(x \mid c)}{\sum_{c_j \in \mathcal{C}} p_0(x \mid c_j)} \]

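for each c ∈ C. Recent work (Li et al., 2023a) proposed to use the ELBO⁵
\[ -L(x; c) := -\sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \widehat{\varepsilon}_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon;\, c\big) \big\|_2^2 \Big] \]
as an approximate class-conditional log-likelihood log p0 (x | c) for each c ∈ C, which allows them to obtain a
posterior distribution
\[ \widehat{p}_0(c \mid x) = \frac{\exp(-L(x; c))}{\sum_{c_j \in \mathcal{C}} \exp(-L(x; c_j))}. \tag{4.9} \]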

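Our theory suggests that $-L(x; c) \approx \log p_0(x \mid c) - C_0^\star$, where $C_0^\star = -[1 + \log(2\pi\beta_1)]\, d/2$ is a universal constant
that does not depend on p0 and c. This implies that
\[ \widehat{p}_0(c \mid x) \approx \frac{\exp\big(\log p_0(x \mid c) - C_0^\star\big)}{\sum_{c_j \in \mathcal{C}} \exp\big(\log p_0(x \mid c_j) - C_0^\star\big)} = \frac{p_0(x \mid c)}{\sum_{c_j \in \mathcal{C}} p_0(x \mid c_j)} = p_0(c \mid x), \]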

providing theoretical justification for using the computed posterior pb0 in classification tasks.
It is worth mentioning that, although this framework was proposed in the literature (Li et al., 2023a), it
remained a heuristic method prior to our work. For example, in general, replacing the intractable log-likelihood
with a lower bound does not guarantee good performance, as they might not be close. Additionally, recall
that there is a term C0 (x) in the ELBO (4.2). Li et al. (2023a) claimed that “Since T = 1000 is large and
log pθ (x0 | x1 , c) is typically small, we choose to drop this term”. However, this argument is not correct, as we
already computed in Section 4.1 that this term
\[ C_0(x) = -\frac{1+\log(2\pi\beta_1)}{2}\, d + O(\beta_1) \]
can be very large since β1 is typically very close to 0. In view of our results, the reason why this term can
be dropped is that it equals a universal constant that does not depend on the image data x and the class
index c, thus it does not affect the posterior (4.9).
5 The original paper adopted uniform coefficients; see the last paragraph of Section 4.1 for discussion.
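The invariance argument can be made concrete with a few lines of Python. The sketch below computes the posterior (4.9) as a softmax over the classes and checks that shifting every L(x; c) by the same constant leaves the result unchanged. The function name and the three numerical values of L(x; c) are hypothetical and serve only as an illustration.

```python
import numpy as np

def diffusion_classifier_posterior(neg_elbo):
    """Posterior (4.9): softmax of -L(x; c) over the classes c, computed stably."""
    scores = -np.asarray(neg_elbo, dtype=float)
    scores = scores - scores.max()  # stabilizes exp; does not change the result
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical values of L(x; c) for three classes (illustrative numbers only).
L_values = np.array([1520.3, 1518.9, 1525.0])
posterior = diffusion_classifier_posterior(L_values)

# Shifting every class by the same constant (for example, restoring a dropped
# class-independent term) leaves the posterior unchanged.
shifted = diffusion_classifier_posterior(L_values + 3000.0)
print(posterior, shifted)
```

The same cancellation would not hold for a term that varies with c, which is why the class-independence of C0 (x) matters here rather than its magnitude.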

4.4 Demystifying the diffusion loss in autoregressive models


Finally, we use our results to study a class of diffusion loss recently introduced in Li et al. (2024), in the
context of autoregressive image generation. Let $x^k$ denote the next token to be predicted, and z be the
condition parameterized by an autoregressive network $z = f(x^1, \ldots, x^{k-1})$ based on previous tokens as
input. The goal is to train the network z = f (·) together with a diffusion model $\{\varepsilon_t(\cdot\,; z) : 1 \le t \le T\}$ such
that $\widehat{p}(x \mid z)$ (induced by the diffusion model) with $z = f(x^1, \ldots, x^{k-1})$ can predict the next token $x^k$.
The diffusion loss is defined as follows: for some weights $w_t \ge 0$, let
\[ L(z, x) = \sum_{t=1}^{T} w_t\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[ \big\| \varepsilon - \varepsilon_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon;\, z\big) \big\|_2^2 \Big]. \tag{4.10} \]
With training data $\{(x_i^1, \ldots, x_i^k) : 1 \le i \le n\}$, we can train the autoregressive network f (·) and the diffusion
model by minimizing the following empirical risk:
\[ \arg\min_{f,\, \varepsilon_1, \ldots, \varepsilon_T}\; \frac{1}{n} \sum_{i=1}^{n} L\big( f(x_i^1, \ldots, x_i^{k-1}),\, x_i^k \big). \tag{4.11} \]

To gain intuition from our theoretical results, we take the weights in the diffusion loss (4.10) to be the coefficients in the ELBO (4.6), and for each $z$, suppose that the learned diffusion model for $p(x^k \mid z)$ is already good enough, returning the set of epsilon predictors $\{\hat{\varepsilon}_t(\cdot\,; z) : 1 \le t \le T\}$ for the probability distribution of $x^k$ conditioned on $z$. In this special case, our approximation result (4.6) shows that
\[
L(z, x) \approx -\log p(x \mid z) + C_0^\star,
\]
which suggests that the training objective for the network $f$ in (4.11) can be viewed as approximate MLE, as the loss function
\[
\frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i^1, \ldots, x_i^{k-1}), x_i^k\big) \approx -\frac{1}{n} \sum_{i=1}^{n} \log p\big(x_i^k \mid f(x_i^1, \ldots, x_i^{k-1})\big) + C_0^\star
\]
represents the negative log-likelihood function (up to an additive constant) of the observed $x_1^k, \ldots, x_n^k$ in terms of $f$.
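The loss (4.10) can be estimated by Monte Carlo, as in the following minimal sketch. The callable `eps_predictor(t, x_t, z)`, the schedule `alpha_bar`, and the weights are all hypothetical stand-ins for a trained model and its coefficients; the toy check uses a predictor that always returns zero, for which every term equals $\mathbb{E}\|\varepsilon\|_2^2 = d$.

```python
import numpy as np

def diffusion_loss(x, z, eps_predictor, alpha_bar, weights, n_samples=20000, seed=0):
    """Monte Carlo estimate of the diffusion loss (4.10).

    eps_predictor(t, x_t, z) is a hypothetical callable returning the predicted
    noise for a batch x_t of shape (n_samples, d); alpha_bar[t-1] and
    weights[t-1] hold the cumulative alpha and w_t for t = 1, ..., T.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    total = 0.0
    for t, (a_bar, w) in enumerate(zip(alpha_bar, weights), start=1):
        eps = rng.standard_normal((n_samples, x.size))
        x_t = np.sqrt(a_bar) * x + np.sqrt(1.0 - a_bar) * eps   # noised input
        residual = eps - eps_predictor(t, x_t, z)
        total += w * np.mean(np.sum(residual**2, axis=1))
    return total

# Toy check with a predictor that always outputs zero.
d, T = 4, 5
alpha_bar = np.linspace(0.9, 0.1, T)      # illustrative schedule
weights = np.full(T, 0.5)                 # illustrative weights
zero_predictor = lambda t, x_t, z: np.zeros_like(x_t)
loss = diffusion_loss(np.ones(d), None, zero_predictor, alpha_bar, weights)
```

With these toy inputs the estimate should be close to $d \sum_t w_t = 10$; a trained predictor would replace `zero_predictor`.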

5 Proof of Theorem 1
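Recall the definition of the stochastic process $(X_t)_{0 \le t \le 1}$:
\[
dX_t = -\frac{1}{2(1-t)} X_t\, dt + \frac{1}{\sqrt{1-t}}\, dB_t.
\]
Define $Y_t := X_t / \sqrt{1-t}$ for any $0 \le t < 1$, and let $f(t, x) = x / \sqrt{1-t}$. We can use Itô's formula to show that
\begin{align*}
dY_t = df(t, X_t) &= \frac{\partial f}{\partial t}(t, X_t)\, dt + \nabla_x f(t, X_t)^\top dX_t + \frac{1}{2}\, dX_t^\top \nabla_x^2 f(t, X_t)\, dX_t \\
&= \frac{X_t}{2(1-t)^{3/2}}\, dt + \frac{1}{\sqrt{1-t}} \Big( -\frac{1}{2(1-t)} X_t\, dt + \frac{1}{\sqrt{1-t}}\, dB_t \Big) = \frac{dB_t}{1-t}. \tag{5.1}
\end{align*}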

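Therefore the Itô process $Y_t$ is a martingale, which is easier to handle. Let $g(t, y) = \log \rho_t(\sqrt{1-t}\, y)$, so that we can express $\log \rho_t(x) = g(t, x / \sqrt{1-t})$. In view of Itô's formula, we have
\begin{align*}
d \log \rho_t(X_t) = dg(t, Y_t) &\overset{\text{(i)}}{=} \frac{\partial g}{\partial t}(t, Y_t)\, dt + \nabla_y g(t, Y_t)^\top dY_t + \frac{1}{2}\, dY_t^\top \nabla_y^2 g(t, Y_t)\, dY_t \\
&\overset{\text{(ii)}}{=} \frac{\partial g}{\partial t}(t, Y_t)\, dt + \frac{1}{1-t} \nabla_y g(t, Y_t)^\top dB_t + \frac{1}{2(1-t)^2}\, dB_t^\top \nabla_y^2 g(t, Y_t)\, dB_t \\
&\overset{\text{(iii)}}{=} \frac{\partial g}{\partial t}(t, Y_t)\, dt + \frac{1}{1-t} \nabla_y g(t, Y_t)^\top dB_t + \frac{1}{2(1-t)^2} \mathrm{tr}\big( \nabla_y^2 g(t, Y_t) \big)\, dt. \tag{5.2}
\end{align*}
Here step (i) follows from the Itô rule, step (ii) utilizes (5.1), while step (iii) follows from the Itô calculus rule $dB_t\, dB_t^\top = I_d\, dt$. Then we investigate the three terms above. Notice that
\[
\nabla_y g(t, y) \big|_{y = Y_t} = \frac{\nabla_y \rho_t(\sqrt{1-t}\, y)}{\rho_t(\sqrt{1-t}\, Y_t)} \bigg|_{y = Y_t} = \frac{\sqrt{1-t}\, \nabla_x \rho_t(X_t)}{\rho_t(X_t)} = \sqrt{1-t}\, \nabla \log \rho_t(X_t), \tag{5.3}
\]
and similarly, we have
\[
\nabla_y^2 g(t, y) \big|_{y = Y_t} = (1-t)\, \nabla^2 \log \rho_t(X_t). \tag{5.4}
\]
Substituting (5.3) and (5.4) back into (5.2) gives
\[
d \log \rho_t(X_t) = \frac{\partial g}{\partial t}(t, Y_t)\, dt + \frac{1}{\sqrt{1-t}} \nabla \log \rho_t(X_t)^\top dB_t + \frac{1}{2(1-t)} \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big)\, dt,
\]
or equivalently, for any given $0 < t_1 < t_2 < 1$, we have
\[
\log \rho_t(X_t) \Big|_{t_1}^{t_2} = \int_{t_1}^{t_2} \Big[ \frac{\partial g}{\partial t}(t, Y_t) + \frac{\mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big)}{2(1-t)} \Big] dt + \int_{t_1}^{t_2} \frac{1}{\sqrt{1-t}} \nabla \log \rho_t(X_t)^\top dB_t. \tag{5.5}
\]
Conditional on $X_0$, we take expectation on both sides of (5.5) to achieve
\[
\mathbb{E}\big[ \log \rho_{t_2}(X_{t_2}) - \log \rho_{t_1}(X_{t_1}) \mid X_0 \big] = \mathbb{E}\Big[ \int_{t_1}^{t_2} \Big( \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)} \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) \Big) dt \,\Big|\, X_0 \Big]. \tag{5.6}
\]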
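We need the following claims, whose proofs can be found at the end of this section.
Claim 1. For any $0 < t < 1$ and any $y \in \mathbb{R}^d$, we have
\[
\frac{\partial g}{\partial t}(t, y) = -\frac{d}{2t} + \frac{1}{2t^2} \int_{x_0} \rho_{X_0 \mid X_t}\big( x_0 \mid \sqrt{1-t}\, y \big) \|y - x_0\|_2^2\, dx_0.
\]

Claim 2. For any $0 < t < 1$ and any $x \in \mathbb{R}^d$, we have
\[
\mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big) = -\frac{d}{t} - \big\| \nabla \log \rho_t(x) \big\|_2^2 + \frac{1}{t^2} \int_{x_0} \big\| x - \sqrt{1-t}\, x_0 \big\|_2^2\, \rho_{X_0 \mid X_t}(x_0 \mid x)\, dx_0.
\]
It also admits the lower bound
\[
\mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big) \ge -\frac{d}{t}.
\]

Therefore for any $x$ and $y = x / \sqrt{1-t}$, we know that
\[
\frac{\partial g}{\partial t}(t, y) + \frac{1}{2(1-t)} \mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big) \ge -\frac{d}{2t} - \frac{d}{2(1-t)t} \ge -\frac{d}{(1-t)t}. \tag{5.7}
\]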
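Hence we have
\begin{align*}
\mathbb{E}\big[ \log \rho_{t_2}(X_{t_2}) - \log \rho_{t_1}(X_{t_1}) \mid X_0 \big]
&\overset{\text{(i)}}{=} \mathbb{E}\Big[ \int_{t_1}^{t_2} \Big( \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)} \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) + \frac{d}{(1-t)t} \Big) dt \,\Big|\, X_0 \Big] - \int_{t_1}^{t_2} \frac{d}{(1-t)t}\, dt \\
&\overset{\text{(ii)}}{=} \int_{t_1}^{t_2} \mathbb{E}\Big[ \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)} \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) + \frac{d}{(1-t)t} \,\Big|\, X_0 \Big] dt - \int_{t_1}^{t_2} \frac{d}{(1-t)t}\, dt \\
&= \int_{t_1}^{t_2} \mathbb{E}\Big[ \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)} \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) \,\Big|\, X_0 \Big] dt. \tag{5.8}
\end{align*}
Here step (i) follows from (5.6), and its validity is guaranteed by
\[
\int_{t_1}^{t_2} \frac{d}{t(1-t)}\, dt = d \log \frac{t_2 (1 - t_1)}{t_1 (1 - t_2)} < +\infty,
\]
while step (ii) utilizes Tonelli's theorem, and the nonnegativity of the integrand is ensured by (5.7). Taking Claims 1 and 2 collectively, we know that for any $x$ and $y = x / \sqrt{1-t}$,
\begin{align*}
\frac{\partial g}{\partial t}(t, y) - \frac{\mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big)}{2(1-t)}
&= \frac{d + \| \nabla \log \rho_t(x) \|_2^2}{2(1-t)} + \frac{1}{2t^2} \int_{x_0} \rho_{X_0 \mid X_t}\big( x_0 \mid \sqrt{1-t}\, y \big) \|y - x_0\|_2^2\, dx_0 \\
&\quad - \frac{1}{2(1-t)} \frac{1}{t^2} \int_{x_0} \big\| x - \sqrt{1-t}\, x_0 \big\|_2^2\, \rho_{X_0 \mid X_t}(x_0 \mid x)\, dx_0 \\
&= \frac{d + \| \nabla \log \rho_t(x) \|_2^2}{2(1-t)}. \tag{5.9}
\end{align*}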

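Putting (5.8) and (5.9) together, we arrive at
\[
\mathbb{E}\big[ \log \rho_{t_2}(X_{t_2}) - \log \rho_{t_1}(X_{t_1}) \mid X_0 \big] = \int_{t_1}^{t_2} \mathbb{E}\Big[ \frac{d + \| \nabla \log \rho_t(X_t) \|_2^2}{2(1-t)} + \frac{1}{1-t} \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) \,\Big|\, X_0 \Big] dt. \tag{5.10}
\]
Notice that conditional on $X_0$, we have $X_t \sim \mathcal{N}(\sqrt{1-t}\, X_0, t I_d)$. Then we have
\begin{align*}
\mathbb{E}\big[ \log \rho_{t_2}(X_{t_2}) - \log \rho_{t_1}(X_{t_1}) \mid X_0 \big]
&\overset{\text{(i)}}{=} \int_{t_1}^{t_2} \mathbb{E}\Big[ \frac{d + \| \nabla \log \rho_t(X_t) \|_2^2}{2(1-t)} + \frac{1}{1-t} \nabla \log \rho_t(X_t)^\top \frac{X_t - \sqrt{1-t}\, X_0}{t} \,\Big|\, X_0 \Big] dt \\
&\overset{\text{(ii)}}{=} \int_{t_1}^{t_2} \Big( \frac{1}{2(1-t)} \mathbb{E}\Big[ \Big\| \frac{X_t - \sqrt{1-t}\, X_0}{t} + \nabla \log \rho_t(X_t) \Big\|_2^2 \,\Big|\, X_0 \Big] - \frac{d}{2t} \Big) dt.
\end{align*}
Here step (i) follows from (5.10) and an application of Stein's lemma
\[
\mathbb{E}\Big[ \nabla \log \rho_t(X_t)^\top \big( X_t - \sqrt{1-t}\, X_0 \big) \,\Big|\, X_0 \Big] = t\, \mathbb{E}\Big[ \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) \,\Big|\, X_0 \Big],
\]
while step (ii) holds since
\[
\mathbb{E}\Big[ \Big\| \frac{X_t - \sqrt{1-t}\, X_0}{t} \Big\|_2^2 \,\Big|\, X_0 \Big] = \frac{d}{t}.
\]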
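The proof uses the fact that, conditional on $X_0$, the process satisfies $X_t \sim \mathcal{N}(\sqrt{1-t}\, X_0, t I_d)$. As a sanity check, the following sketch simulates the SDE with a plain Euler-Maruyama scheme in one dimension and compares the empirical mean and variance with these targets. The step count, path count, starting point and terminal time are arbitrary illustrative choices.

```python
import numpy as np

def simulate_forward(x0, t_end, n_steps=1000, n_paths=20000, seed=0):
    """Euler-Maruyama paths of dX = -X / (2(1-t)) dt + dB / sqrt(1-t), started at x0 (scalar case)."""
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    x = np.full(n_paths, float(x0))
    for k in range(n_steps):
        t = k * dt
        noise = rng.standard_normal(n_paths) * np.sqrt(dt)   # Brownian increment
        x = x - x / (2.0 * (1.0 - t)) * dt + noise / np.sqrt(1.0 - t)
    return x

x0, t_end = 2.0, 0.5
samples = simulate_forward(x0, t_end)
empirical_mean, empirical_var = samples.mean(), samples.var()
target_mean, target_var = np.sqrt(1.0 - t_end) * x0, t_end
```

Up to discretization and sampling error, `empirical_mean` and `empirical_var` should match `target_mean` and `target_var`.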
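Proof of Claim 1. For any $t \in (0, 1)$, since $X_t = \sqrt{1-t}\, X_0 + \sqrt{t}\, Z$, we have
\[
\rho_t(\sqrt{1-t}\, y) = (2\pi t)^{-d/2} \int_{x_0} \exp\Big( -\frac{(1-t) \|y - x_0\|_2^2}{2t} \Big) \rho_0(dx_0). \tag{5.11}
\]
Note that here $\rho_0(\cdot)$ stands for the law of $X_0$. Hence we have
\begin{align*}
\frac{\partial g}{\partial t}(t, y) &= \frac{\partial}{\partial t} \log \rho_t(\sqrt{1-t}\, y) = \frac{1}{\rho_t(\sqrt{1-t}\, y)} \frac{\partial}{\partial t} \rho_t(\sqrt{1-t}\, y) \\
&= \frac{1}{\rho_t(\sqrt{1-t}\, y)} \int_{x_0} (2\pi)^{-d/2} \Big[ -\frac{d}{2} t^{-d/2-1} \exp\Big( -\frac{(1-t) \|y - x_0\|_2^2}{2t} \Big) \\
&\qquad\qquad + t^{-d/2} \exp\Big( -\frac{(1-t) \|y - x_0\|_2^2}{2t} \Big) \frac{\|y - x_0\|_2^2}{2t^2} \Big] \rho_0(dx_0) \\
&= \frac{1}{\rho_t(\sqrt{1-t}\, y)} \int_{x_0} \rho_{X_t \mid X_0}\big( \sqrt{1-t}\, y \mid x_0 \big) \Big( -\frac{d}{2t} + \frac{\|y - x_0\|_2^2}{2t^2} \Big) \rho_0(dx_0) \\
&= \int_{x_0} \Big( -\frac{d}{2t} + \frac{\|y - x_0\|_2^2}{2t^2} \Big) \rho_{X_0 \mid X_t}\big( dx_0 \mid \sqrt{1-t}\, y \big)
\end{align*}
as claimed.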

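Proof of Claim 2. Notice that we can express
\[
\nabla \log \rho_t(x) = -\frac{1}{t} \mathbb{E}\big[ X_t - \sqrt{1-t}\, X_0 \mid X_t = x \big] = -\frac{1}{t} \int_{x_0} \big( x - \sqrt{1-t}\, x_0 \big) \rho_{X_0 \mid X_t}(dx_0 \mid x);
\]
see Chen et al. (2022) for the proof of this relationship. Then we can compute
\begin{align*}
\nabla^2 \log \rho_t(x) &= \frac{1}{t} \Big\{ -I_d - \frac{1}{t} \mathbb{E}\big[ X_t - \sqrt{1-t}\, X_0 \mid X_t = x \big] \mathbb{E}\big[ X_t - \sqrt{1-t}\, X_0 \mid X_t = x \big]^\top \\
&\qquad + \frac{1}{t} \mathbb{E}\Big[ \big( X_t - \sqrt{1-t}\, X_0 \big) \big( X_t - \sqrt{1-t}\, X_0 \big)^\top \,\Big|\, X_t = x \Big] \Big\} \\
&= \frac{1}{t} \Big\{ -I_d - \frac{1}{t} \Big[ \int_{x_0} \big( x - \sqrt{1-t}\, x_0 \big) \rho_{X_0 \mid X_t}(dx_0 \mid x) \Big] \Big[ \int_{x_0} \big( x - \sqrt{1-t}\, x_0 \big) \rho_{X_0 \mid X_t}(dx_0 \mid x) \Big]^\top \\
&\qquad + \frac{1}{t} \int_{x_0} \big( x - \sqrt{1-t}\, x_0 \big) \big( x - \sqrt{1-t}\, x_0 \big)^\top \rho_{X_0 \mid X_t}(dx_0 \mid x) \Big\}.
\end{align*}
Hence we have
\begin{align*}
\mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big) &= \frac{1}{t} \Big( -d - \frac{1}{t} \Big\| \int_{x_0} \big( x - \sqrt{1-t}\, x_0 \big) \rho_{X_0 \mid X_t}(dx_0 \mid x) \Big\|_2^2 + \frac{1}{t} \int_{x_0} \big\| x - \sqrt{1-t}\, x_0 \big\|_2^2\, \rho_{X_0 \mid X_t}(dx_0 \mid x) \Big) \\
&= -\frac{d}{t} - \big\| \nabla \log \rho_t(x) \big\|_2^2 + \frac{1}{t^2} \int_{x_0} \big\| x - \sqrt{1-t}\, x_0 \big\|_2^2\, \rho_{X_0 \mid X_t}(x_0 \mid x)\, dx_0.
\end{align*}
By Jensen's inequality, we know that
\[
\mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big) \ge -\frac{d}{t}.
\]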
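Claim 2 can be checked in closed form in a toy case. The sketch below assumes $d = 1$ and $X_0 \sim \mathcal{N}(0, \sigma^2)$, so that $\rho_t = \mathcal{N}(0, s)$ with $s = (1-t)\sigma^2 + t$ and the posterior of $X_0$ given $X_t = x$ is Gaussian. This Gaussian setting and the specific numbers are assumptions made for illustration only.

```python
import numpy as np

def claim2_sides(x, t, sigma2):
    """Both sides of Claim 2 when d = 1 and X_0 ~ N(0, sigma2) (an assumed toy case)."""
    s = (1.0 - t) * sigma2 + t                        # variance of X_t
    lhs = -1.0 / s                                    # second derivative of log N(x; 0, s)
    score = -x / s                                    # derivative of log rho_t at x
    post_mean = np.sqrt(1.0 - t) * sigma2 * x / s     # E[X_0 | X_t = x]
    post_var = sigma2 * t / s                         # Var(X_0 | X_t = x)
    # E[(x - sqrt(1-t) X_0)^2 | X_t = x] = squared bias + scaled posterior variance
    second_moment = (x - np.sqrt(1.0 - t) * post_mean) ** 2 + (1.0 - t) * post_var
    rhs = -1.0 / t - score**2 + second_moment / t**2
    return lhs, rhs

lhs, rhs = claim2_sides(x=0.7, t=0.3, sigma2=4.0)
```

In this case both sides equal $-1/s$, which is at least $-1/t$ because $s \ge t$, in line with the lower bound.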

6 Discussion
This paper develops a score-based density formula that expresses the density function of a target distribution using the score function along a continuous-time diffusion process that bridges this distribution and the standard Gaussian. By connecting this diffusion process with the forward process of score-based diffusion models, our results provide theoretical support for training DDPMs by optimizing the ELBO, and offer novel insights into several applications of diffusion models, including GAN training and diffusion classifiers.
Our work opens several directions for future research. First, our theoretical results are established for the continuous-time diffusion process. It is crucial to carefully analyze the error induced by time discretization, which could inform the number of steps required for the results in this paper to be valid in practice. Additionally, while our results provide theoretical justification for using the ELBO (4.6) as a proxy for the negative log-likelihood of the target distribution, they do not cover other practical variants of the ELBO with modified weights (e.g., the simplified ELBO (4.8)). Extending our analysis to other diffusion processes might yield new density formulas incorporating these modified weights. Lastly, further investigation is needed into other applications of this score-based density formula, including density estimation and inverse problems.

Acknowledgements
G. Li is supported in part by the Chinese University of Hong Kong Direct Grant for Research. Y. Yan was
supported in part by a Norbert Wiener Postdoctoral Fellowship from MIT.

A Proof of Proposition 1
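We establish the desired result by sandwiching $\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0]$ and finding its limit as $t \to 1$. We first record that the density of $X_t$ can be expressed as
\[
\rho_t(x) = \mathbb{E}_{X_0}\Big[ (2\pi t)^{-d/2} \exp\Big( -\frac{\| x - \sqrt{1-t}\, X_0 \|_2^2}{2t} \Big) \Big], \tag{A.1}
\]
since $X_t \overset{\mathrm{d}}{=} \sqrt{1-t}\, X_0 + \sqrt{t}\, Z$ for an independent variable $Z \sim \mathcal{N}(0, I_d)$.

Lower bounding $\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0]$. Starting from (A.1), for any $x \in \mathbb{R}^d$ and any $0 < t < 1$,
\begin{align*}
\log \rho_t(x) &= \log \mathbb{E}_{X_0}\Big[ (2\pi t)^{-d/2} \exp\Big( -\frac{\| x - \sqrt{1-t}\, X_0 \|_2^2}{2t} \Big) \Big] \\
&\overset{\text{(i)}}{\ge} \log \Big[ (2\pi t)^{-d/2} \exp\Big( -\mathbb{E}_{X_0}\Big[ \frac{\| x - \sqrt{1-t}\, X_0 \|_2^2}{2t} \Big] \Big) \Big] \\
&= -\frac{d}{2} \log(2\pi t) - \mathbb{E}_{X_0}\Big[ \frac{\| x - \sqrt{1-t}\, X_0 \|_2^2}{2t} \Big] \\
&= -\frac{d}{2} \log(2\pi t) - \frac{\|x\|_2^2}{2t} - \frac{1-t}{2t} \mathbb{E}\big[ \|X_0\|_2^2 \big] + \frac{\sqrt{1-t}}{t} \mathbb{E}\big[ x^\top X_0 \big] \\
&\overset{\text{(ii)}}{=} -\frac{d}{2} \log(2\pi t) - \big( 1 + O(\sqrt{1-t}) \big) \frac{\|x\|_2^2}{2t} + O(\sqrt{1-t})\, \mathbb{E}\big[ \|X_0\|_2^2 \big].
\end{align*}
Here step (i) follows from Jensen's inequality and the fact that $e^{-x}$ is a convex function, while step (ii) follows from the elementary inequalities
\[
\big| \mathbb{E}[x^\top X_0] \big| \le \mathbb{E}\big[ \|x\|_2 \|X_0\|_2 \big] \le \frac{1}{2} \mathbb{E}\big[ \|x\|_2^2 + \|X_0\|_2^2 \big].
\]
This immediately gives, for any given $x_0 \in \mathbb{R}^d$ and any $0 < t < 1$,
\[
\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] \ge \underbrace{ -\frac{d}{2} \log(2\pi t) - \frac{1 + O(\sqrt{1-t})}{2t} \mathbb{E}\big[ \|X_t\|_2^2 \mid X_0 = x_0 \big] + O(\sqrt{1-t})\, \mathbb{E}\big[ \|X_0\|_2^2 \big] }_{=: f_{x_0}(t)}. \tag{A.2a}
\]
Since $\mathbb{E}[\|X_0\|_2^2] < \infty$, it is straightforward to check that
\begin{align*}
\lim_{t \to 1-} f_{x_0}(t) &= -\frac{d}{2} \log(2\pi) - \lim_{t \to 1-} \frac{1}{2} \mathbb{E}\Big[ \big\| \sqrt{1-t}\, x_0 + \sqrt{t}\, Z \big\|_2^2 \Big] \quad \text{for } Z \sim \mathcal{N}(0, I_d) \\
&= -\frac{d}{2} \log(2\pi) - \frac{d}{2}. \tag{A.2b}
\end{align*}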

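Upper bounding $\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0]$. Towards that, we need to obtain a point-wise upper bound for $\log \rho_t(x)$. Since the desired result only depends on the limiting behavior when $t \to 1$, from now on we only consider $t > 0.9$, under which
\[
(1-t)^{1/4} < \frac{1}{2} \sqrt{\log \frac{1}{1-t}}
\]
holds. It would be helpful to develop the upper bound for the following two cases separately.

• For any $(1-t)^{1/4} < \|x\|_2 < 0.5 \sqrt{\log(1/(1-t))}$, we have
\begin{align*}
\log \rho_t(x) &\overset{\text{(a)}}{\le} \log \mathbb{E}_{X_0}\Big[ (2\pi t)^{-d/2} \exp\Big( -\frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} \Big) + \mathbb{1}\big\{ \|X_0\|_2 > (1-t)^{-1/4} \big\} \Big] \\
&\overset{\text{(b)}}{\le} -\frac{d}{2} \log(2\pi t) - \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} + \exp\Big( \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} \Big) \mathbb{P}\big( \|X_0\|_2 > (1-t)^{-1/4} \big) \\
&\overset{\text{(c)}}{\le} -\frac{d}{2} \log(2\pi t) - \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} + \exp\Big( \frac{\|x\|_2^2}{2t} \Big) \mathbb{E}\big[ \|X_0\|_2^2 \big] (1-t)^{1/2} \\
&\overset{\text{(d)}}{\le} -\frac{d}{2} \log(2\pi t) - \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} + \mathbb{E}\big[ \|X_0\|_2^2 \big] (1-t)^{1/4}. \tag{A.3}
\end{align*}
Here step (a) follows from (A.1); step (b) holds since $\log(x + y) \le \log x + y/x$ holds for any $x > 0$ and $y \ge 0$; step (c) follows from $\|x\|_2 > (1-t)^{1/4}$ and Chebyshev's inequality; while step (d) holds since $\|x\|_2 < 0.5 \sqrt{\log(1/(1-t))}$.

• For $\|x\|_2 \ge 0.5 \sqrt{\log(1/(1-t))}$ or $\|x\|_2 \le (1-t)^{1/4}$, we will use the naive upper bound
\[
\log \rho_t(x) \le -\frac{d}{2} \log(2\pi t) < 0, \tag{A.4}
\]
where the first relation simply follows from (A.1) and the second relation holds when $t > 0.9$.

Then we have
\begin{align*}
\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] &\overset{\text{(i)}}{\le} \mathbb{E}\Big[ \log \rho_t(X_t)\, \mathbb{1}\Big\{ (1-t)^{1/4} < \|X_t\|_2 < 0.5 \sqrt{\log(1/(1-t))} \Big\} \,\Big|\, X_0 = x_0 \Big] \\
&\overset{\text{(ii)}}{\le} \mathbb{E}\Big[ \Big( -\frac{d}{2} \log(2\pi t) - \frac{(\|X_t\|_2 - (1-t)^{1/4})^2}{2t} + \mathbb{E}\big[ \|X_0\|_2^2 \big] (1-t)^{1/4} \Big) \\
&\qquad\quad \cdot \mathbb{1}\Big\{ (1-t)^{1/4} < \|X_t\|_2 < 0.5 \sqrt{\log(1/(1-t))} \Big\} \,\Big|\, X_0 = x_0 \Big] \\
&= \underbrace{ \Big( -\frac{d}{2} \log(2\pi t) + \mathbb{E}\big[ \|X_0\|_2^2 \big] (1-t)^{1/4} \Big) \mathbb{P}\Big( (1-t)^{1/4} < \|X_t\|_2 < 0.5 \sqrt{\log(1/(1-t))} \Big) }_{=: \overline{g}_{x_0}(t)} \\
&\quad - \underbrace{ \mathbb{E}\Big[ \frac{(\|X_t\|_2 - (1-t)^{1/4})^2}{2t} \mathbb{1}\Big\{ (1-t)^{1/4} < \|X_t\|_2 < 0.5 \sqrt{\log(1/(1-t))} \Big\} \,\Big|\, X_0 = x_0 \Big] }_{=: \widetilde{g}_{x_0}(t)}.
\end{align*}
Here step (i) follows from (A.4), while step (ii) utilizes (A.3). Since $X_t$ is a continuous random variable for any $t \in (0, 1)$, we have
\[
\lim_{t \to 1-} \mathbb{P}\Big( (1-t)^{1/4} < \|X_t\|_2 < 0.5 \sqrt{\log(1/(1-t))} \Big) = 1.
\]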

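Therefore we know that
\[
\lim_{t \to 1-} \overline{g}_{x_0}(t) = -\frac{d}{2} \log(2\pi).
\]
Recalling that $X_t \overset{\mathrm{d}}{=} \sqrt{1-t}\, X_0 + \sqrt{t}\, Z$ for a Gaussian variable $Z \sim \mathcal{N}(0, I_d)$ independent of $X_0$, we can express
\begin{align*}
\widetilde{g}_{x_0}(t) &= \mathbb{E}\Big[ \frac{(\| \sqrt{t}\, Z + \sqrt{1-t}\, x_0 \|_2 - (1-t)^{1/4})^2}{2t} \mathbb{1}\Big\{ (1-t)^{1/4} < \| \sqrt{t}\, Z + \sqrt{1-t}\, x_0 \|_2 < \frac{1}{2} \sqrt{\log \frac{1}{1-t}} \Big\} \Big] \\
&= \int \underbrace{ \frac{(\| \sqrt{t}\, z + \sqrt{1-t}\, x_0 \|_2 - (1-t)^{1/4})^2}{2t} \mathbb{1}\Big\{ (1-t)^{1/4} < \| \sqrt{t}\, z + \sqrt{1-t}\, x_0 \|_2 < \frac{1}{2} \sqrt{\log \frac{1}{1-t}} \Big\} \phi(z) }_{=: h_t(z)}\, dz,
\end{align*}
where $\phi(z) = (2\pi)^{-d/2} \exp(-\|z\|_2^2 / 2)$ is the density function of $\mathcal{N}(0, I_d)$. For any $t \in (0.9, 1)$, we have
\[
h_t(z) \le \| \sqrt{t}\, z + \sqrt{1-t}\, x_0 \|_2^2\, \phi(z) \le 2 \big( \|z\|_2^2 + \|x_0\|_2^2 \big) \phi(z) =: h(z),
\]
and it is straightforward to check that
\[
\int h(z)\, dz = 2d + 2 \|x_0\|_2^2 < \infty.
\]
By the dominated convergence theorem, we know that
\[
\lim_{t \to 1-} \widetilde{g}_{x_0}(t) = \lim_{t \to 1-} \int h_t(z)\, dz = \int \lim_{t \to 1-} h_t(z)\, dz = \int \frac{\|z\|_2^2}{2} \phi(z)\, dz = \frac{d}{2}.
\]
Therefore we have
\[
\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] \le g_{x_0}(t) \quad \text{where} \quad g_{x_0}(t) := \overline{g}_{x_0}(t) - \widetilde{g}_{x_0}(t), \tag{A.5a}
\]
such that
\[
\lim_{t \to 1-} g_{x_0}(t) = \lim_{t \to 1-} \overline{g}_{x_0}(t) - \lim_{t \to 1-} \widetilde{g}_{x_0}(t) = -\frac{d}{2} \log(2\pi) - \frac{d}{2}. \tag{A.5b}
\]

Conclusion. By putting together (A.2) and (A.5), we know that for any $t \in (0.9, 1)$
\[
f_{x_0}(t) \le \mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] \le g_{x_0}(t) \quad \text{and} \quad \lim_{t \to 1-} f_{x_0}(t) = \lim_{t \to 1-} g_{x_0}(t) = -\frac{d}{2} \log(2\pi) - \frac{d}{2}.
\]
By the sandwich theorem, we arrive at the desired result
\[
\lim_{t \to 1-} \mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] = -\frac{d}{2} \log(2\pi) - \frac{d}{2}.
\]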

B Proof of Proposition 2
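Suppose that $L := \sup_x \| \nabla^2 \log \rho_0(x) \|$. The following claim, whose proof is deferred to the end of this section, will be useful in establishing the proposition.
Claim 3. There exists some $t_0 > 0$ such that
\[
\sup_x \| \nabla^2 \log \rho_t(x) \| \le 4L \tag{B.1}
\]
holds for any $0 \le t \le t_0$.

Equipped with Claim 3, we know that for any $t \le t_0$,
\begin{align*}
\mathbb{E}\big[ \log \rho_t(X_t) \mid X_0 = x_0 \big] &= \mathbb{E}\big[ \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, Z) \big] \\
&\overset{\text{(i)}}{=} \mathbb{E}\big[ \log \rho_t(\sqrt{1-t}\, x_0) + \sqrt{t}\, Z^\top \nabla \log \rho_t(\sqrt{1-t}\, x_0) + O(Lt) \|Z\|_2^2 \big] \\
&\overset{\text{(ii)}}{=} \log \rho_t(\sqrt{1-t}\, x_0) + O(Ldt) \\
&\overset{\text{(iii)}}{=} \log \int_x \rho_0(x) (2\pi t)^{-d/2} \exp\Big( -\frac{(1-t) \|x - x_0\|_2^2}{2t} \Big) dx + O(Ldt) \\
&= \log \Big[ (1-t)^{-d/2} \int_x \rho_0(x) \Big( \frac{2\pi t}{1-t} \Big)^{-d/2} \exp\Big( -\frac{(1-t) \|x - x_0\|_2^2}{2t} \Big) dx \Big] + O(Ldt), \tag{B.2}
\end{align*}
where $Z \sim \mathcal{N}(0, I_d)$. Here step (i) follows from (B.1) in Claim 3; step (ii) holds since $\mathbb{E}[Z] = 0$ and $\mathbb{E}[\|Z\|_2^2] = d$; while step (iii) follows from (5.11). It is straightforward to check that
\[
\int_x \rho_0(x) \Big( \frac{2\pi t}{1-t} \Big)^{-d/2} \exp\Big( -\frac{(1-t) \|x - x_0\|_2^2}{2t} \Big) dx
\]
is the density of $\rho_0 * \mathcal{N}(0, \tfrac{t}{1-t} I_d)$ evaluated at $x_0$, which taken collectively with the assumption that $\rho_0(\cdot)$ is continuous yields
\[
\lim_{t \to 0+} \int_x \rho_0(x) \Big( \frac{2\pi t}{1-t} \Big)^{-d/2} \exp\Big( -\frac{(1-t) \|x - x_0\|_2^2}{2t} \Big) dx = \rho_0(x_0).
\]
Therefore we can take $t \to 0+$ in (B.2) to achieve
\[
\lim_{t \to 0+} \mathbb{E}\big[ \log \rho_t(X_t) \mid X_0 = x_0 \big] = \log \rho_0(x_0)
\]
as claimed.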
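Both boundary limits (Propositions 1 and 2) can be checked in closed form in a toy case. The sketch below assumes $d = 1$ and $X_0 \sim \mathcal{N}(0, \sigma^2)$, for which $\rho_t = \mathcal{N}(0, s)$ with $s = (1-t)\sigma^2 + t$ and $\mathbb{E}[X_t^2 \mid X_0 = x_0] = (1-t) x_0^2 + t$. The Gaussian target and the numerical values are assumptions made purely for illustration.

```python
import math

def expected_log_density(t, x0, sigma2):
    """E[log rho_t(X_t) | X_0 = x0] when d = 1 and X_0 ~ N(0, sigma2) (an assumed toy case)."""
    s = (1.0 - t) * sigma2 + t                  # variance of X_t
    second_moment = (1.0 - t) * x0**2 + t       # E[X_t^2 | X_0 = x0]
    return -0.5 * math.log(2.0 * math.pi * s) - second_moment / (2.0 * s)

x0, sigma2 = 1.3, 2.5
log_rho0 = -0.5 * math.log(2.0 * math.pi * sigma2) - x0**2 / (2.0 * sigma2)
near_zero = expected_log_density(1e-8, x0, sigma2)        # t -> 0+ (Proposition 2)
near_one = expected_log_density(1.0 - 1e-8, x0, sigma2)   # t -> 1- (Proposition 1)
gaussian_limit = -0.5 * math.log(2.0 * math.pi) - 0.5
```

Here `near_zero` should approach `log_rho0` and `near_one` should approach $-\frac{1}{2}\log(2\pi) - \frac{1}{2}$.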

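Proof of Claim 3. The conditional density of $X_0$ given $X_t = x$ is
\[
p_{X_0 \mid X_t}(x_0 \mid x) = \frac{p_{X_0}(x_0)\, p_{X_t \mid X_0}(x \mid x_0)}{p_{X_t}(x)} = \frac{\rho_0(x_0)}{\rho_t(x)} (2\pi t)^{-d/2} \exp\Big( -\frac{\| x - \sqrt{1-t}\, x_0 \|_2^2}{2t} \Big), \tag{B.3}
\]
which leads to
\begin{align*}
-\nabla_{x_0}^2 \log p_{X_0 \mid X_t}(x_0 \mid x) &= -\nabla_{x_0}^2 \log \rho_0(x_0) + \frac{1}{2t} \nabla_{x_0}^2 \| x - \sqrt{1-t}\, x_0 \|_2^2 \\
&= -\nabla_{x_0}^2 \log \rho_0(x_0) + \frac{1-t}{t} I_d \succeq \Big( \frac{1-t}{t} - L \Big) I_d.
\end{align*}
Therefore we know that
\[
-\nabla_{x_0}^2 \log p_{X_0 \mid X_t}(x_0 \mid x) \succeq \frac{1}{2t} I_d \quad \text{for} \quad t \le \frac{1}{2(L+1)}, \tag{B.4}
\]
namely the conditional distribution of $X_0$ given $X_t = x$ is $1/(2t)$-strongly log-concave for any $x$, when $t \le 1/(2(L+1))$. By writing
\[
\rho_t(x) = p_{X_t}(x) = \int \phi(z)\, p_{\sqrt{1-t} X_0}\big( x - \sqrt{t}\, z \big)\, dz = (1-t)^{-d/2} \int \phi(z)\, \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) dz, \tag{B.5}
\]
we can express the score function of $\rho_t$ as
\begin{align*}
\nabla \log \rho_t(x) = \frac{\nabla \rho_t(x)}{\rho_t(x)} &= (1-t)^{-\frac{d+1}{2}} \frac{1}{\rho_t(x)} \int \phi(z)\, \nabla \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) dz \\
&= (1-t)^{-\frac{d+1}{2}} \frac{1}{\rho_t(x)} \int \phi(z)\, \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) \nabla \log \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) dz \tag{B.6} \\
&\overset{\text{(i)}}{=} (1-t)^{-\frac{d+1}{2}} \Big( \frac{1-t}{t} \Big)^{d/2} \frac{1}{\rho_t(x)} \int \phi\Big( \frac{x - \sqrt{1-t}\, x_0}{\sqrt{t}} \Big) \rho_0(x_0)\, \nabla \log \rho_0(x_0)\, dx_0 \\
&\overset{\text{(ii)}}{=} \frac{1}{\sqrt{1-t}} \int p_{X_0 \mid X_t}(x_0 \mid x)\, \nabla \log \rho_0(x_0)\, dx_0 = \frac{1}{\sqrt{1-t}} \mathbb{E}\big[ \nabla \log \rho_0(X_0) \mid X_t = x \big]. \tag{B.7}
\end{align*}
Here step (i) uses the change of variable $x_0 = (x - \sqrt{t}\, z) / \sqrt{1-t}$, while step (ii) follows from (B.3). Starting from (B.6), we take the derivative to achieve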
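\begin{align*}
\nabla^2 \log \rho_t(x) &= \underbrace{ (1-t)^{-\frac{d}{2}-1} \frac{1}{\rho_t(x)} \int \phi(z)\, \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) \nabla \log \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) \Big[ \nabla \log \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) \Big]^\top dz }_{=: H_1(x)} \\
&\quad + \underbrace{ (1-t)^{-\frac{d}{2}-1} \frac{1}{\rho_t(x)} \int \phi(z)\, \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) \nabla^2 \log \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) dz }_{=: H_2(x)} \\
&\quad \underbrace{ - (1-t)^{-\frac{d+1}{2}} \frac{1}{\rho_t^2(x)} \int \phi(z)\, \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) \nabla \log \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) dz\, [\nabla \rho_t(x)]^\top }_{=: H_3(x)}. \tag{B.8}
\end{align*}
Then we investigate $H_1(x)$, $H_2(x)$ and $H_3(x)$ respectively. Regarding $H_1(x)$, we have
\begin{align*}
H_1(x) &\overset{\text{(a1)}}{=} (1-t)^{-\frac{d}{2}-1} \Big( \frac{1-t}{t} \Big)^{d/2} \frac{1}{\rho_t(x)} \int \phi\Big( \frac{x - \sqrt{1-t}\, x_0}{\sqrt{t}} \Big) \rho_0(x_0)\, \nabla \log \rho_0(x_0) [\nabla \log \rho_0(x_0)]^\top dx_0 \\
&\overset{\text{(b1)}}{=} \frac{1}{1-t} \int p_{X_0 \mid X_t}(x_0 \mid x)\, \nabla \log \rho_0(x_0) [\nabla \log \rho_0(x_0)]^\top dx_0 \\
&= \frac{1}{1-t} \mathbb{E}\big[ \nabla \log \rho_0(X_0) [\nabla \log \rho_0(X_0)]^\top \mid X_t = x \big]; \tag{B.9a}
\end{align*}
for $H_2(x)$, we have
\begin{align*}
H_2(x) &\overset{\text{(a2)}}{=} (1-t)^{-\frac{d}{2}-1} \Big( \frac{1-t}{t} \Big)^{d/2} \frac{1}{\rho_t(x)} \int \phi\Big( \frac{x - \sqrt{1-t}\, x_0}{\sqrt{t}} \Big) \rho_0(x_0)\, \nabla^2 \log \rho_0(x_0)\, dx_0 \\
&\overset{\text{(b2)}}{=} \frac{1}{1-t} \int p_{X_0 \mid X_t}(x_0 \mid x)\, \nabla^2 \log \rho_0(x_0)\, dx_0 = \frac{1}{1-t} \mathbb{E}\big[ \nabla^2 \log \rho_0(X_0) \mid X_t = x \big]; \tag{B.9b}
\end{align*}
for the final term $H_3(x)$, we have
\begin{align*}
H_3(x) &\overset{\text{(c)}}{=} -(1-t)^{-\frac{d+1}{2}} \Big( \frac{1}{\rho_t(x)} \int \phi(z)\, \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) \nabla \log \rho_0\Big( \frac{x - \sqrt{t}\, z}{\sqrt{1-t}} \Big) dz \Big) [\nabla \log \rho_t(x)]^\top \\
&\overset{\text{(a3)}}{=} -(1-t)^{-\frac{d+1}{2}} \Big( \frac{1-t}{t} \Big)^{d/2} \Big( \frac{1}{\rho_t(x)} \int \phi\Big( \frac{x - \sqrt{1-t}\, x_0}{\sqrt{t}} \Big) \rho_0(x_0)\, \nabla \log \rho_0(x_0)\, dx_0 \Big) [\nabla \log \rho_t(x)]^\top \\
&\overset{\text{(b3)}}{=} -\frac{1}{\sqrt{1-t}} \Big( \int p_{X_0 \mid X_t}(x_0 \mid x)\, \nabla \log \rho_0(x_0)\, dx_0 \Big) [\nabla \log \rho_t(x)]^\top \\
&\overset{\text{(d)}}{=} -\frac{1}{1-t} \mathbb{E}\big[ \nabla \log \rho_0(X_0) \mid X_t = x \big] \mathbb{E}\big[ \nabla \log \rho_0(X_0) \mid X_t = x \big]^\top. \tag{B.9c}
\end{align*}
Here steps (a1), (a2) and (a3) follow from the change of variable $x_0 = (x - \sqrt{t}\, z) / \sqrt{1-t}$; steps (b1), (b2) and (b3) utilize (B.3); step (c) follows from $\nabla \log \rho_t(x) = \nabla \rho_t(x) / \rho_t(x)$; while step (d) follows from (B.7). Substituting (B.9) back into (B.8), we have
\[
\nabla^2 \log \rho_t(x) = \frac{1}{1-t} \mathbb{E}\big[ \nabla^2 \log \rho_0(X_0) \mid X_t = x \big] + \frac{1}{1-t} \mathrm{cov}\big( \nabla \log \rho_0(X_0) \mid X_t = x \big). \tag{B.10}
\]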
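Notice that for any $t \le 1/(2(L+1))$, we have
\begin{align*}
\big\| \mathrm{cov}\big( \nabla \log \rho_0(X_0) \mid X_t = x \big) \big\| &= \sup_{u \in \mathbb{S}^{d-1}} \mathbb{E}\Big[ \big( u^\top ( \nabla \log \rho_0(X_0) - \mathbb{E}[\nabla \log \rho_0(X_0) \mid X_t = x] ) \big)^2 \,\Big|\, X_t = x \Big] \\
&\overset{\text{(i)}}{\le} \sup_{u \in \mathbb{S}^{d-1}} \mathbb{E}\Big[ \big( u^\top ( \nabla \log \rho_0(X_0) - \nabla \log \rho_0(\mathbb{E}[X_0 \mid X_t = x]) ) \big)^2 \,\Big|\, X_t = x \Big] \\
&\le \mathbb{E}\Big[ \big\| \nabla \log \rho_0(X_0) - \nabla \log \rho_0(\mathbb{E}[X_0 \mid X_t = x]) \big\|_2^2 \,\Big|\, X_t = x \Big] \\
&\overset{\text{(ii)}}{\le} L^2\, \mathbb{E}\Big[ \big\| X_0 - \mathbb{E}[X_0 \mid X_t = x] \big\|_2^2 \,\Big|\, X_t = x \Big] \\
&\overset{\text{(iii)}}{\le} 2tL^2 d. \tag{B.11}
\end{align*}
Here step (i) holds since for any random variable $X$, $\mathbb{E}[(X - c)^2]$ is minimized at $c = \mathbb{E}[X]$; step (ii) holds since the score function $\nabla \log \rho_0(\cdot)$ is $L$-Lipschitz; step (iii) follows from the Poincaré inequality for log-concave distributions, and the fact that the conditional distribution of $X_0$ given $X_t = x$ is $1/(2t)$-strongly log-concave (cf. (B.4)). We conclude that
\[
\big\| \nabla^2 \log \rho_t(x) \big\| \overset{\text{(a)}}{\le} \frac{1}{1-t} L + \frac{2tL^2 d}{1-t} \overset{\text{(b)}}{\le} 4L.
\]
Here step (a) follows from (B.10), (B.11), and the assumption that $\sup_x \| \nabla^2 \log \rho_0(x) \| \le L$, while step (b) holds provided that $t \le \min\{1/2, 1/(2Ld)\}$.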

C More discussions on the density formulas


Although the density formulas (3.1a) have been rigorously established, it is helpful to inspect the limiting
behavior of the integrand D(t, x0 ) at the boundary to understand why the integral converges. Throughout
the discussion, we let ε ∼ N (0, Id ).
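• As $t \to 0$, we can compute
\begin{align*}
D(t, x_0) &\asymp \frac{\mathbb{E}\big[ \| \varepsilon + \sqrt{t}\, \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \|_2^2 \big] - d}{t} \\
&\overset{\text{(i)}}{\asymp} \mathbb{E}\big[ \| \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \|_2^2 \big] + \frac{1}{\sqrt{t}} \mathbb{E}\big[ \varepsilon^\top \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \big] \\
&\overset{\text{(ii)}}{\asymp} \mathbb{E}\big[ \| \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \|_2^2 \big] + \mathbb{E}\big[ \mathrm{tr}\big( \nabla^2 \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \big) \big].
\end{align*}
Here step (i) holds since $\mathbb{E}[\|\varepsilon\|_2^2] = d$, while step (ii) follows from Stein's lemma. Therefore, when the score functions are reasonably smooth as $t \to 0$, one may expect that the integrand $D(t, x_0)$ is of constant order, allowing the integral to converge at $t = 0$.
• As $t \to 1$, we can compute
\begin{align*}
D(t, x_0) &= \frac{1}{2(1-t)t} \mathbb{E}\big[ \| \varepsilon + \sqrt{t}\, \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \|_2^2 \big] - \frac{d}{2t} \\
&\asymp \frac{1}{2(1-t)} \mathbb{E}\big[ \| \varepsilon + \sqrt{t}\, \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \|_2^2 \big] - \frac{d}{2}.
\end{align*}
Since $\rho_t$ converges to $\phi$ as $t \to 1$ and $\nabla \log \phi(x) = -x$, we have
\[
\lim_{t \to 1} \varepsilon + \sqrt{t}\, \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) = 0.
\]
Hence one may expect that $\mathbb{E}\big[ \| \varepsilon + \sqrt{t}\, \nabla \log \rho_t(\sqrt{1-t}\, x_0 + \sqrt{t}\, \varepsilon) \|_2^2 \big]$ converges to zero quickly, allowing the integral to converge at $t = 1$.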
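The boundary behavior can be seen explicitly in a toy case. Combining the identity at the end of the proof of Theorem 1 with the limits in Propositions 1 and 2 gives $\log \rho_0(x_0) = -\frac{1 + \log(2\pi)}{2} d - \int_0^1 D(t, x_0)\, dt$. The sketch below checks this numerically for $d = 1$ and $X_0 \sim \mathcal{N}(0, \sigma^2)$, where the score of $\rho_t$ is $-x/s$ with $s = (1-t)\sigma^2 + t$. The Gaussian target, the midpoint quadrature rule and the grid size are illustrative assumptions.

```python
import numpy as np

def integrand_D(t, x0, sigma2):
    """D(t, x0) for d = 1 and X_0 ~ N(0, sigma2), where the score of rho_t is -x / s."""
    s = (1.0 - t) * sigma2 + t
    # Closed form of E|eps + sqrt(t) * score(sqrt(1-t) x0 + sqrt(t) eps)|^2
    second_moment = (1.0 - t / s) ** 2 + t * (1.0 - t) * x0**2 / s**2
    return second_moment / (2.0 * (1.0 - t) * t) - 1.0 / (2.0 * t)

x0, sigma2, n = 0.8, 2.0, 200000
t_mid = (np.arange(n) + 0.5) / n                 # midpoint rule avoids t = 0 and t = 1
integral = integrand_D(t_mid, x0, sigma2).mean()
log_rho0_formula = -(1.0 + np.log(2.0 * np.pi)) / 2.0 - integral
log_rho0_exact = -0.5 * np.log(2.0 * np.pi * sigma2) - x0**2 / (2.0 * sigma2)
```

In this example the integrand stays bounded on $(0, 1)$, and the two values of $\log \rho_0(x_0)$ agree up to quadrature error.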

D Technical details in Section 4


D.1 Technical details in Section 4.1
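Computing $L_{t-1}(x_0)$. Conditional on $X_t = x_t$ and $X_0 = x_0$, we have
\[
X_{t-1} \mid X_t = x_t, X_0 = x_0 \sim \mathcal{N}\Big( \frac{\sqrt{\overline{\alpha}_{t-1}}\, \beta_t}{1 - \overline{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} (1 - \overline{\alpha}_{t-1})}{1 - \overline{\alpha}_t} x_t,\ \frac{1 - \overline{\alpha}_{t-1}}{1 - \overline{\alpha}_t} \beta_t I_d \Big),
\]
and conditional on $Y_t = x_t$, we have
\[
Y_{t-1} \mid Y_t = x_t \sim \mathcal{N}\Big( \frac{x_t + \eta_t s_t(x_t)}{\sqrt{\alpha_t}},\ \frac{\sigma_t^2}{\alpha_t} I_d \Big).
\]
Recall that the KL divergence between two $d$-dimensional Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ admits the following closed-form expression:
\[
\mathsf{KL}\big( \mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2) \big) = \frac{1}{2} \Big[ \mathrm{tr}\big( \Sigma_2^{-1} \Sigma_1 \big) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - d + \log \det \Sigma_2 - \log \det \Sigma_1 \Big].
\]
Then we can check that for $2 \le t \le T$,
\[
\mathsf{KL}\big( p_{X_{t-1} \mid X_t, X_0}(\cdot \mid x_t, x_0) \,\|\, p_{Y_{t-1} \mid Y_t}(\cdot \mid x_t) \big) = \frac{\alpha_t}{2\sigma_t^2} \Big\| \frac{\sqrt{\overline{\alpha}_{t-1}}\, \beta_t}{1 - \overline{\alpha}_t} x_0 + \frac{\alpha_t - 1}{\sqrt{\alpha_t} (1 - \overline{\alpha}_t)} x_t - \frac{\eta_t s_t(x_t)}{\sqrt{\alpha_t}} \Big\|_2^2,
\]
where we use the coefficient design (4.3). This immediately gives
\begin{align*}
L_{t-1}(x_0) &= \frac{\alpha_t}{2\sigma_t^2} \mathbb{E}_{x_t \sim p_{X_t \mid X_0}(\cdot \mid x_0)} \Big[ \Big\| \frac{\sqrt{\overline{\alpha}_{t-1}}\, \beta_t}{1 - \overline{\alpha}_t} x_0 + \frac{\alpha_t - 1}{\sqrt{\alpha_t} (1 - \overline{\alpha}_t)} x_t - \frac{\eta_t s_t(x_t)}{\sqrt{\alpha_t}} \Big\|_2^2 \Big] \\
&\overset{\text{(i)}}{=} \frac{\alpha_t}{2\sigma_t^2} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \Big\| \frac{\alpha_t - 1}{\sqrt{\alpha_t (1 - \overline{\alpha}_t)}} \varepsilon - \frac{1 - \alpha_t}{\sqrt{\alpha_t}} s_t\big( \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1 - \overline{\alpha}_t}\, \varepsilon \big) \Big\|_2^2 \Big] \\
&\overset{\text{(ii)}}{=} \frac{1 - \alpha_t}{2(\alpha_t - \overline{\alpha}_t)} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \big\| \varepsilon - \varepsilon_t\big( \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1 - \overline{\alpha}_t}\, \varepsilon \big) \big\|_2^2 \Big].
\end{align*}
Here in step (i), we utilize the coefficient design (4.3) and replace $x_t$ with $\sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1 - \overline{\alpha}_t}\, \varepsilon$, which has the same distribution; while in step (ii), we replace the score function $s_t(\cdot)$ with the epsilon predictor $\varepsilon_t(\cdot) := -\sqrt{1 - \overline{\alpha}_t}\, s_t(\cdot)$. Comparing the coefficients in $L_{t-1}^\star$ and $L_{t-1}$, we decompose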

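\[
\bigg| \frac{1 - \alpha_{t+1}}{2(1 - \overline{\alpha}_t)} - \frac{1 - \alpha_t}{2(\alpha_t - \overline{\alpha}_t)} \bigg| \le \underbrace{ \bigg| \frac{1 - \alpha_{t+1}}{2(1 - \overline{\alpha}_t)} - \frac{1 - \alpha_{t+1}}{2(\alpha_t - \overline{\alpha}_t)} \bigg| }_{=: \gamma_1} + \underbrace{ \bigg| \frac{1 - \alpha_{t+1}}{2(\alpha_t - \overline{\alpha}_t)} - \frac{1 - \alpha_t}{2(\alpha_t - \overline{\alpha}_t)} \bigg| }_{=: \gamma_2}.
\]
Consider the learning rate schedule in Li et al. (2023b); Li and Yan (2024):
\[
\beta_1 = \frac{1}{T^{c_0}}, \qquad \beta_{t+1} = \frac{c_1 \log T}{T} \min\Big\{ \beta_1 \Big( 1 + \frac{c_1 \log T}{T} \Big)^t, 1 \Big\} \quad (t = 1, \ldots, T-1) \tag{D.1}
\]
for sufficiently large constants $c_0, c_1 > 0$. Then using the properties in e.g., Li and Yan (2024, Lemma 8), we can check that
\[
\gamma_1 = \frac{(1 - \alpha_{t+1})(1 - \alpha_t)}{2(1 - \overline{\alpha}_t)(\alpha_t - \overline{\alpha}_t)} \le \frac{8 c_1 \log T}{T} \frac{1 - \alpha_{t+1}}{2(1 - \overline{\alpha}_t)},
\]
and
\[
\gamma_2 = \frac{|\alpha_t - \alpha_{t+1}|}{2(\alpha_t - \overline{\alpha}_t)} = \frac{|\beta_t - \beta_{t+1}|}{2(\alpha_t - \overline{\alpha}_t)} \le \Big| 1 - \frac{\beta_t}{\beta_{t+1}} \Big| \Big( 1 + \frac{1 - \alpha_t}{\alpha_t - \overline{\alpha}_t} \Big) \frac{1 - \alpha_{t+1}}{2(1 - \overline{\alpha}_t)} \le \frac{8 c_1 \log T}{T} \frac{1 - \alpha_{t+1}}{2(1 - \overline{\alpha}_t)}.
\]
Hence the coefficients in $L_{t-1}^\star$ and $L_{t-1}$ are identical up to higher-order error:
\[
\bigg| \frac{1 - \alpha_{t+1}}{2(1 - \overline{\alpha}_t)} - \frac{1 - \alpha_t}{2(\alpha_t - \overline{\alpha}_t)} \bigg| \le \frac{16 c_1 \log T}{T} \frac{1 - \alpha_{t+1}}{2(1 - \overline{\alpha}_t)}.
\]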
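To illustrate how close the two sets of coefficients are, the following sketch builds the schedule (D.1) with $\alpha_t = 1 - \beta_t$ and $\overline{\alpha}_t = \prod_{i \le t} \alpha_i$, then compares the two coefficients for $t = 2, \ldots, T-1$. The constants $c_0 = 2$, $c_1 = 5$ and $T = 1000$ are arbitrary illustrative choices (the analysis only requires them to be sufficiently large), and the function name `li_schedule` is hypothetical.

```python
import numpy as np

def li_schedule(T, c0=2.0, c1=5.0):
    """Step sizes beta_1, ..., beta_T from (D.1); c0 and c1 are illustrative choices."""
    r = c1 * np.log(T) / T
    beta = np.empty(T)
    beta[0] = T ** (-c0)
    for t in range(1, T):                        # fills beta_{t+1} for t = 1, ..., T-1
        beta[t] = r * min(beta[0] * (1.0 + r) ** t, 1.0)
    return beta, r

T = 1000
beta, r = li_schedule(T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

# Compare the two coefficients for t = 2, ..., T-1 (array index t-1 holds step t).
idx = np.arange(1, T - 1)
coef_star = (1.0 - alpha[idx + 1]) / (2.0 * (1.0 - alpha_bar[idx]))
coef_elbo = (1.0 - alpha[idx]) / (2.0 * (alpha[idx] - alpha_bar[idx]))
max_rel_gap = np.max(np.abs(coef_star - coef_elbo) / coef_star)
```

For these constants, `max_rel_gap` can be compared against $16 c_1 \log T / T$ (stored as `16 * r`) to see the relative size of the discrepancy.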

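Computing $L_0(x_0)$. By taking $\eta_1 = \sigma_1^2 = 1 - \alpha_1$ (notice that (4.3) does not cover the case $t = 1$), we have
\begin{align*}
p_{Y_0 \mid Y_1}(x_0 \mid x_1) &= \Big( \frac{2\pi \sigma_1^2}{\alpha_1} \Big)^{-d/2} \exp\Big( -\frac{\alpha_1}{2\sigma_1^2} \Big\| x_0 - \frac{x_1 + \eta_1 s_1(x_1)}{\sqrt{\alpha_1}} \Big\|_2^2 \Big) \\
&= \Big( \frac{2\pi \beta_1}{\alpha_1} \Big)^{-d/2} \exp\Big( -\frac{\alpha_1}{2\beta_1} \Big\| x_0 - \frac{x_1 + \beta_1 s_1(x_1)}{\sqrt{\alpha_1}} \Big\|_2^2 \Big),
\end{align*}
and therefore
\begin{align*}
C_0(x_0) &= \mathbb{E}_{x_1 \sim p_{X_1 \mid X_0}(\cdot \mid x_0)} \Big[ -\frac{d}{2} \log \frac{2\pi \beta_1}{\alpha_1} - \frac{\alpha_1}{2\beta_1} \Big\| x_0 - \frac{x_1 + \beta_1 s_1(x_1)}{\sqrt{\alpha_1}} \Big\|_2^2 \Big] \\
&\overset{\text{(i)}}{=} -\frac{d}{2} \log \frac{2\pi \beta_1}{\alpha_1} - \frac{1}{2} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \big\| \varepsilon + \sqrt{\beta_1}\, s_1\big( \sqrt{1 - \beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon \big) \big\|_2^2 \Big] \\
&\overset{\text{(ii)}}{=} -\frac{1 + \log(2\pi \beta_1)}{2} d + \frac{d}{2} \log(1 - \beta_1) - \frac{1}{2} \beta_1 \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \big\| s_1\big( \sqrt{1 - \beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon \big) \big\|_2^2 \Big] \\
&\qquad - \sqrt{\beta_1}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \varepsilon^\top s_1\big( \sqrt{1 - \beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon \big) \Big]. \tag{D.2}
\end{align*}
Here in step (i), we replace $x_1$ with $\sqrt{1 - \beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon$, which has the same distribution; step (ii) uses the fact that $\mathbb{E}[\|\varepsilon\|_2^2] = d$ for $\varepsilon \sim \mathcal{N}(0, I_d)$. Using similar analysis as in Proposition 2, we can show that $\sup_x \| \nabla^2 \log q_1(x) \| \le O(L)$ when $\beta_1$ is sufficiently small, as long as $\sup_x \| \nabla^2 \log q_0(x) \| \le L$. Hence we have
\begin{align*}
\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \big\| s_1\big( \sqrt{1 - \beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon \big) \big\|_2^2 \Big] &\le \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \Big( \| s_1(x_0) \|_2 + O(L) \big\| x_0 - \sqrt{1 - \beta_1}\, x_0 - \sqrt{\beta_1}\, \varepsilon \big\|_2 \Big)^2 \Big] \\
&\le 2 \| s_1(x_0) \|_2^2 + O(L^2)\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \big\| x_0 - \sqrt{1 - \beta_1}\, x_0 - \sqrt{\beta_1}\, \varepsilon \big\|_2^2 \Big] \\
&= 2 \| s_1(x_0) \|_2^2 + O(L^2 \beta_1). \tag{D.3}
\end{align*}
By Stein's lemma, we can show that
\begin{align*}
\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \varepsilon^\top s_1\big( \sqrt{1 - \beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon \big) \Big] &= \sqrt{\beta_1}\, \mathbb{E}\Big[ \mathrm{tr}\Big( \nabla^2 \log q_1\big( \sqrt{1 - \beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon \big) \Big) \Big] \\
&\le O\big( \sqrt{\beta_1}\, L d \big). \tag{D.4}
\end{align*}
Substituting the bounds (D.3) and (D.4) back into (D.2), we have
\[
C_0(x_0) = -\frac{1 + \log(2\pi \beta_1)}{2} d + O(\beta_1)
\]
as claimed.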

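Negligibility of $L_T(x)$. Since
\[
Y_T \sim \mathcal{N}(0, I_d), \quad \text{and} \quad X_T \mid X_0 = x_0 \sim \mathcal{N}\big( \sqrt{\overline{\alpha}_T}\, x_0, (1 - \overline{\alpha}_T) I_d \big),
\]
we can compute
\[
\mathsf{KL}\big( p_{Y_T}(\cdot) \,\|\, p_{X_T \mid X_0}(\cdot \mid x_0) \big) = \frac{1}{2} \frac{\overline{\alpha}_T}{1 - \overline{\alpha}_T} \big( d + \|x_0\|_2^2 \big) + \frac{d}{2} \log(1 - \overline{\alpha}_T) \le \frac{1}{2} \frac{\overline{\alpha}_T}{1 - \overline{\alpha}_T} \big( d + \|x_0\|_2^2 \big).
\]
Using the learning rate schedule in (D.1), we can check that $\overline{\alpha}_T \le T^{-c_2}$ for some large universal constant $c_2 > 0$; see e.g., Li et al. (2023b, Section 5.1) for the proof. Therefore when $T \ge 2$, we have
\[
\mathsf{KL}\big( p_{Y_T}(\cdot) \,\|\, p_{X_T \mid X_0}(\cdot \mid x_0) \big) \le \frac{d + \|x_0\|_2^2}{4T^{c_2}},
\]
which is negligible when $T$ is sufficiently large.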
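The closed-form expression for this KL divergence can be cross-checked against the general Gaussian KL formula recalled earlier in this section. The sketch below does so for an assumed dimension $d = 3$, an assumed value $\overline{\alpha}_T = 10^{-3}$, and an arbitrary point $x_0$; all three are illustrative choices.

```python
import numpy as np

def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL( N(mu1, cov1) || N(mu2, cov2) ) via the closed-form Gaussian expression."""
    d = mu1.size
    cov2_inv = np.linalg.inv(cov2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (np.trace(cov2_inv @ cov1) + diff @ cov2_inv @ diff - d + logdet2 - logdet1)

d, alpha_bar_T = 3, 1e-3                         # illustrative values
x0 = np.array([1.0, -2.0, 0.5])
kl_general = gaussian_kl(np.zeros(d), np.eye(d),
                         np.sqrt(alpha_bar_T) * x0, (1.0 - alpha_bar_T) * np.eye(d))
ratio = alpha_bar_T / (1.0 - alpha_bar_T)
kl_closed = 0.5 * ratio * (d + x0 @ x0) + 0.5 * d * np.log(1.0 - alpha_bar_T)
kl_upper = 0.5 * ratio * (d + x0 @ x0)
```

The two evaluations should coincide, and both are bounded by `kl_upper`, which shrinks linearly in $\overline{\alpha}_T$.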

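Optimal solution for (4.5). It is known that for each $1 \le t \le T$, the score function $s_t^\star(\cdot)$ associated with $q_t$ satisfies
\[
s_t^\star(\cdot) = \arg\min_{s(\cdot): \mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x \sim q_0, \varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \Big\| s\big( \sqrt{\overline{\alpha}_t}\, x + \sqrt{1 - \overline{\alpha}_t}\, \varepsilon \big) + \frac{1}{\sqrt{1 - \overline{\alpha}_t}} \varepsilon \Big\|_2^2 \Big].
\]
See e.g., Chen et al. (2022, Appendix A) for the proof. Recalling that $\varepsilon_t^\star(\cdot) = -\sqrt{1 - \overline{\alpha}_t}\, s_t^\star(\cdot)$, we have
\[
\varepsilon_t^\star(\cdot) = \arg\min_{\varepsilon(\cdot): \mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x \sim q_0, \varepsilon \sim \mathcal{N}(0, I_d)} \Big[ \big\| \varepsilon - \varepsilon\big( \sqrt{\overline{\alpha}_t}\, x + \sqrt{1 - \overline{\alpha}_t}\, \varepsilon \big) \big\|_2^2 \Big].
\]
Therefore the global minimizer for (4.5) is $\hat{\varepsilon}_t(\cdot) \equiv \varepsilon_t^\star(\cdot)$ for each $1 \le t \le T$.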

D.2 Technical details in Section 4.2
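By checking the optimality condition, we know that $(D_\lambda, G_\lambda)$ is a Nash equilibrium if and only if
\[
D_\lambda(x) = \frac{p_{\mathsf{data}}(x)}{p_{\mathsf{data}}(x) + p_{G_\lambda}(x)}, \qquad \text{(optimality condition for } D_\lambda \text{)} \tag{D.5}
\]
where $p_{G_\lambda} = (G_\lambda)_\# p_{\mathsf{noise}}$, and there exists some constant $c$ such that
\[
\begin{cases}
-\log D_\lambda(x) + \lambda L(x) = c, & \text{when } x \in \mathrm{supp}(p_{G_\lambda}), \\
-\log D_\lambda(x) + \lambda L(x) \ge c, & \text{otherwise}.
\end{cases}
\qquad \text{(optimality condition for } G_\lambda \text{)} \tag{D.6}
\]
Taking the approximation $L(x) \approx -\log p_{\mathsf{data}}(x) + C_0^\star$ as exact, we have
\[
D_\lambda(x) =
\begin{cases}
e^{\lambda C_0^\star - c}\, p_{\mathsf{data}}^{-\lambda}(x), & \text{for } x \in \mathrm{supp}(p_{G_\lambda}), \\
1, & \text{for } x \notin \mathrm{supp}(p_{G_\lambda}),
\end{cases} \tag{D.7}
\]
where the first and second cases follow from (D.6) and (D.5) respectively. Then we derive a closed-form expression for $p_{G_\lambda}$.
• For any $x \in \mathrm{supp}(p_{G_\lambda})$, by putting (D.5) and (D.7) together, we have
\[
e^{\lambda C_0^\star - c}\, p_{\mathsf{data}}^{-\lambda}(x) = \frac{p_{\mathsf{data}}(x)}{p_{\mathsf{data}}(x) + p_{G_\lambda}(x)},
\]
which further gives
\[
p_{G_\lambda}(x) = p_{\mathsf{data}}(x) \big( e^{-\lambda C_0^\star + c}\, p_{\mathsf{data}}^{\lambda}(x) - 1 \big). \tag{D.8}
\]
• For any $x \notin \mathrm{supp}(p_{G_\lambda})$, we have
\[
-\log D_\lambda(x) + \lambda L(x) \overset{\text{(i)}}{=} \lambda L(x) \overset{\text{(ii)}}{=} -\lambda \log p_{\mathsf{data}}(x) + \lambda C_0^\star \overset{\text{(iii)}}{\ge} c,
\]
where step (i) follows from $D_\lambda(x) = 1$, which follows from (D.7); step (ii) holds when we take the approximation $L(x) \approx -\log p_{\mathsf{data}}(x) + C_0^\star$ as exact; and step (iii) follows from (D.6). This immediately gives
\[
e^{-\lambda C_0^\star + c}\, p_{\mathsf{data}}^{\lambda}(x) - 1 = \exp\big( -\lambda C_0^\star + c + \lambda \log p_{\mathsf{data}}(x) \big) - 1 \le 0. \tag{D.9}
\]
Taking (D.8) and (D.9) collectively, we can write
\[
p_{G_\lambda}(x) = p_{\mathsf{data}}(x) \big( e^{-\lambda C_0^\star + c}\, p_{\mathsf{data}}^{\lambda}(x) - 1 \big)_+. \tag{D.10}
\]
On the other hand, we can check that (D.7) and (D.10) satisfy the optimality conditions (D.5) and (D.6), which establishes the desired result.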

References
Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). Stochastic interpolants: A unifying framework
for flows and diffusions. arXiv preprint arXiv:2303.08797.
Anderson, B. D. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications,
12(3):313–326.
Benton, J., De Bortoli, V., Doucet, A., and Deligiannidis, G. (2023). Linear convergence bounds for diffusion
models via stochastic localization. arXiv preprint arXiv:2308.03686.
Chen, H., Lee, H., and Lu, J. (2023a). Improved analysis of score-based generative modeling: User-friendly
bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages
4735–4763. PMLR.

Chen, S., Chewi, S., Lee, H., Li, Y., Lu, J., and Salim, A. (2023b). The probability flow ode is provably fast.
arXiv preprint arXiv:2305.11798.
Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. R. (2022). Sampling is as easy as learning the
score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215.
Croitoru, F.-A., Hondru, V., Ionescu, R. T., and Shah, M. (2023). Diffusion models in vision: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence.
Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural
Information Processing Systems, 34:8780–8794.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio,
Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
Graikos, A., Malkin, N., Jojic, N., and Samaras, D. (2022). Diffusion models as plug-and-play priors.
Advances in Neural Information Processing Systems, 35:14715–14728.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. (2019). Scalable reversible generative
models with free-form continuous dynamics. In International Conference on Learning Representations.
Haussmann, U. G. and Pardoux, E. (1986). Time reversal of diffusions. The Annals of Probability, pages
1188–1205.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 33:6840–6851.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of
Machine Learning Research, 6(4).
Hyvärinen, A. (2007). Some extensions of score matching. Computational statistics & data analysis,
51(5):2499–2512.
Li, A. C., Prabhudesai, M., Duggal, S., Brown, E., and Pathak, D. (2023a). Your diffusion model is secretly
a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 2206–2217.
Li, G., Wei, Y., Chen, Y., and Chi, Y. (2023b). Towards non-asymptotic convergence for diffusion-based
generative models. In The Twelfth International Conference on Learning Representations.
Li, G. and Yan, Y. (2024). Adapting to unknown low-dimensional structures in score-based diffusion models.
arXiv preprint arXiv:2405.14861.
Li, T., Tian, Y., Li, H., Deng, M., and He, K. (2024). Autoregressive image generation without vector
quantization. arXiv preprint arXiv:2406.11838.
Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.
Mardani, M., Song, J., Kautz, J., and Vahdat, A. (2024). A variational perspective on solving inverse
problems with diffusion models. In The Twelfth International Conference on Learning Representations.
Nichol, A. Q. and Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. In International
Conference on Machine Learning, pages 8162–8171.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional image
generation with CLIP latents. arXiv preprint arXiv:2204.06125.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 10684–10695.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R.,
Karagol Ayan, B., Salimans, T., et al. (2022). Photorealistic text-to-image diffusion models with deep
language understanding. Advances in Neural Information Processing Systems, 35:36479–36494.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–
2265.
Song, J., Meng, C., and Ermon, S. (2021a). Denoising diffusion implicit models. In International Conference
on Learning Representations.
Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution.
Advances in neural information processing systems, 32.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021b). Score-based
generative modeling through stochastic differential equations. In International Conference on Learning
Representations.
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural computation,
23(7):1661–1674.
Xia, M., Shen, Y., Yang, C., Yi, R., Wang, W., and Liu, Y.-j. (2023). Smart: Improving gans with score
matching regularity. In Forty-first International Conference on Machine Learning.
Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H. (2023).
Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–
39.

