A Score-Based Density Formula, With Applications in
Abstract
Score-based generative models (SGMs) have revolutionized the field of generative modeling, achieving
unprecedented success in generating realistic and diverse content. Despite empirical advances, the
theoretical basis for why optimizing the evidence lower bound (ELBO) on the log-likelihood is effective
for training diffusion generative models, such as DDPMs, remains largely unexplored. In this paper, we
address this question by establishing a density formula for a continuous-time diffusion process, which
can be viewed as the continuous-time limit of the forward process in an SGM. This formula reveals the
connection between the target density and the score function associated with each step of the forward
process. Building on this, we demonstrate that the minimizer of the optimization objective for training
DDPMs nearly coincides with that of the true objective, providing a theoretical foundation for optimizing
DDPMs using the ELBO. Furthermore, we offer new insights into the role of score-matching regularization
in training GANs, the use of ELBO in diffusion classifiers, and the recently proposed diffusion
loss.
Keywords: score-based density formula, score-based generative model, evidence lower bound, denoising
diffusion probabilistic model
Contents
1 Introduction
2 Problem set-up
2.1 Denoising diffusion probabilistic model
2.2 A continuous-time SDE for the forward process
4 Implications
4.1 Certifying the validity of optimizing ELBO in DDPM
4.2 Understanding the role of regularization in GAN
4.3 Confirming the use of ELBO in diffusion classifier
4.4 Demystifying the diffusion loss in autoregressive models
5 Proof of Theorem 1
6 Discussion
∗ The authors contributed equally.
† Department of Statistics, The Chinese University of Hong Kong, Hong Kong; Email: genli@cuhk.edu.hk.
‡ Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; Email: yuling.yan@wisc.edu.
A Proof of Proposition 1
B Proof of Proposition 2
1 Introduction
Score-based generative models (SGMs) represent a groundbreaking advancement in the realm of generative
models, significantly impacting machine learning and artificial intelligence by their ability to synthesize
high-fidelity data instances, including images, audio, and text (Dhariwal and Nichol, 2021; Ho et al., 2020;
Sohl-Dickstein et al., 2015; Song et al., 2021a; Song and Ermon, 2019; Song et al., 2021b). These models
operate by progressively refining noisy data into samples that resemble the target distribution. Due to their
innovative approach, SGMs have achieved unprecedented success, setting new standards in generative AI and
demonstrating extraordinary proficiency in generating realistic and diverse content across various domains,
from image synthesis and super-resolution to audio generation and molecular design (Croitoru et al., 2023;
Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022; Yang et al., 2023).
The foundation of SGMs is rooted in the principles of stochastic processes, especially stochastic differential
equations (SDEs). These models utilize a forward process, which involves the gradual corruption of an initial
data sample with Gaussian noise over several time steps. This forward process can be described as:
\[
X_0 \xrightarrow{\text{add noise}} X_1 \xrightarrow{\text{add noise}} \cdots \xrightarrow{\text{add noise}} X_T, \tag{1.1}
\]
where X0 ∼ pdata is the original data sample, and XT is a sample close to pure Gaussian noise. The
ingenuity of SGMs lies in constructing a reverse denoising process that iteratively removes the noise, thereby
reconstructing the data distribution. This reverse process starts from a Gaussian sample YT and moves
backward as:
\[
Y_T \xrightarrow{\text{denoise}} Y_{T-1} \xrightarrow{\text{denoise}} \cdots \xrightarrow{\text{denoise}} Y_0, \tag{1.2}
\]
ensuring that $Y_t \overset{d}{\approx} X_t$ at each step t. The final output Y0 is a new sample that closely mimics the distribution
of the initial data pdata.
Inspired by the classical results on time-reversal of SDEs (Anderson, 1982; Haussmann and Pardoux,
1986), SGMs construct the reverse process guided by score functions ∇ log pXt associated with each step of
the forward process. Although these score functions are unknown, they are approximated by neural net-
works trained through score-matching techniques (Hyvärinen, 2005, 2007; Song and Ermon, 2019; Vincent,
2011). This leads to two popular models: denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020;
Nichol and Dhariwal, 2021) and denoising diffusion implicit models (DDIMs) (Song et al., 2021a). While the
theoretical results in this paper do not depend on the specific construction of the reverse process, we will
use the DDPM framework to discuss their implications for diffusion generative models.
However, despite empirical advances, there remains a lack of theoretical understanding for diffusion
generative models. For instance, the optimization target of DDPM is derived from a variational lower
bound on the log-likelihood (Ho et al., 2020), which is also referred to as the evidence lower bound (ELBO)
(Luo, 2022). It is not yet clear, from a theoretical standpoint, why optimizing a lower bound of the true
objective is still a valid approach. More surprisingly, recent research suggests incorporating the ELBO of
a pre-trained DDPM into other generative or learning frameworks to leverage the strengths of multiple
architectures, effectively using it as a proxy for the negative log-likelihood of the data distribution. This
approach has shown empirical success in areas such as GAN training, classification, and inverse problems
(Graikos et al., 2022; Li et al., 2023a; Mardani et al., 2024; Xia et al., 2023). While it is conceivable that
the ELBO is a reasonable optimization target for training DDPMs (a similar idea is used in, e.g., the
majorize-minimization algorithm), it is more mysterious why it serves as a good proxy for the negative
log-likelihood in these applications.
In this paper, we take a step towards addressing the aforementioned question. On the theoretical side,
we establish a density formula for a diffusion process (Xt )0≤t<1 defined by the following SDE:
\[
\mathrm{d}X_t = -\frac{1}{2(1-t)}\,X_t\,\mathrm{d}t + \frac{1}{\sqrt{1-t}}\,\mathrm{d}B_t \quad (0 \le t < 1), \qquad X_0 \sim p_{\mathrm{data}},
\]
which can be viewed as a continuous-time limit of the forward process (1.1). Under some regularity conditions,
this formula expresses the density of X0 with the score function along this process, having the form
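\[
\log p_{X_0}(x) = -\frac{1+\log(2\pi)}{2}\,d - \int_0^1 \left( \frac{1}{2(1-t)}\,\mathbb{E}\left[ \left\| \frac{X_t - \sqrt{1-t}\,X_0}{t} + \nabla \log p_{X_t}(X_t) \right\|_2^2 \,\Big|\, X_0 = x \right] - \frac{d}{2t} \right) \mathrm{d}t,
\]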
where pXt (·) is the density of Xt . By time-discretization, this reveals the connection between the target
density pdata and the score function associated with each step of the forward process (1.1). These theoretical
results will be presented in Section 3.
Finally, using this density formula, we demonstrate that the minimizer of the optimization target for
training DDPMs (derived from the ELBO) also nearly minimizes the true target—the KL divergence between
the target distribution and the generator distribution. This finding provides a theoretical foundation for
optimizing DDPMs using the ELBO. Additionally, we use this formula to offer new insights into the role of
score-matching regularization in training GANs (Xia et al., 2023), the use of ELBO in diffusion classifiers
(Li et al., 2023a), and the recently proposed diffusion loss (Li et al., 2024). These implications will be
discussed in Section 4.
2 Problem set-up
In this section, we formally introduce the Denoising Diffusion Probabilistic Model (DDPM) and the stochastic
differential equation (SDE) that describes the continuous-time limit of the forward process of DDPM.
2.2 A continuous-time SDE for the forward process
In this paper, we build our theoretical results on the continuous-time limit of the aforementioned forward
process, described by the diffusion process:
\[
\mathrm{d}X_t = -\frac{1}{2(1-t)}\,X_t\,\mathrm{d}t + \frac{1}{\sqrt{1-t}}\,\mathrm{d}B_t \quad (0 \le t < 1), \qquad X_0 \sim p_{\mathrm{data}}, \tag{2.4}
\]
where (Bt )t≥0 is a standard Brownian motion. The solution to this stochastic differential equation (SDE)
has the closed-form expression:
\[
X_t = \sqrt{1-t}\,X_0 + \sqrt{t}\,Z_t \quad \text{where} \quad Z_t = \sqrt{\frac{1-t}{t}} \int_0^t \frac{1}{1-s}\,\mathrm{d}B_s \sim N(0, I_d). \tag{2.5}
\]
It is important to note that the process Xt is not defined at t = 1, although it is straightforward to see from
the above equation that Xt converges to a Gaussian variable as t → 1.
To demonstrate the connection between this diffusion process and the forward process (2.1) of the diffusion
model, we evaluate the diffusion process at times $t_i = 1 - \overline{\alpha}_i$ for 1 ≤ i ≤ T. It is straightforward to check
that the marginal distribution of the resulting discrete-time process {Xti : 1 ≤ i ≤ T } is identical to that of
the forward process (2.1). Therefore the diffusion process (2.4) can be viewed as a continuous-time limit of
the forward process. In the next section, we will establish theoretical results for the diffusion process (2.4).
Through time discretization, our theory will provide insights for the DDPM.
We use the notation Xt for both the discrete-time process {Xt : t ∈ [T ]} in (2.1) and the continuous-time
diffusion process (Xt )0≤t<1 in (2.4) to maintain consistency with standard literature. The context will clarify
which process is being referred to.
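As a rough numerical illustration of the closed form (2.5), the sketch below simulates the SDE (2.4) with a plain Euler-Maruyama scheme and compares the empirical mean and variance of X_t with those of N(√(1−t) x0, t I_d). The function name `simulate_forward_sde`, the starting point, the step count, and the number of paths are illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate_forward_sde(x0, t_end, n_steps, n_paths, rng):
    """Euler-Maruyama for dX = -X / (2(1-t)) dt + dB / sqrt(1-t), started at x0."""
    h = t_end / n_steps
    x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    for k in range(n_steps):
        t = k * h
        noise = rng.standard_normal(x.shape)
        # drift step plus diffusion step with variance h / (1 - t)
        x = x - x / (2.0 * (1.0 - t)) * h + np.sqrt(h / (1.0 - t)) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.array([1.0, -2.0])          # assumed starting point
t_end = 0.5
samples = simulate_forward_sde(x0, t_end, n_steps=1000, n_paths=20000, rng=rng)

# Closed form (2.5): conditional on X_0 = x0, X_t ~ N(sqrt(1-t) x0, t I_d).
mean_gap = np.abs(samples.mean(axis=0) - np.sqrt(1.0 - t_end) * x0).max()
var_gap = np.abs(samples.var(axis=0) - t_end).max()
print(mean_gap, var_gap)
```

Both gaps should be small, up to Monte Carlo noise and the O(step size) bias of the Euler-Maruyama scheme.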
The proof of this theorem is deferred to Section 5. A few remarks are as follows. First, it is worth
mentioning that this formula does not describe the evolution of the (conditional) differential entropy of the
process, because ρt (·) represents the unconditional density of Xt , while the expectation is taken conditional
on X0 . Second, without further assumptions, we cannot set t1 = 0 or t2 = 1 because X0 might not have a
density (hence ρ0 is not well-defined), and Xt is only defined for t < 1. By assuming that X0 has a finite
second moment, the following proposition characterizes the limit of E[log ρt (Xt ) | X0 ] as t → 1.
Proposition 1. Suppose that $\mathbb{E}[\|X_0\|_2^2] < \infty$. Then for any $x_0 \in \mathbb{R}^d$, we have
\[
\lim_{t \to 1^-} \mathbb{E}\left[\log \rho_t(X_t) \mid X_0 = x_0\right] = -\frac{1+\log(2\pi)}{2}\,d.
\]
The proof of this proposition is deferred to Appendix A. This result is not surprising, as it can be seen
from (2.5) that Xt converges to a standard Gaussian variable as t → 1 regardless of x0 , and we can check
\[
\mathbb{E}[\log \phi(Z)] = -\frac{1+\log(2\pi)}{2}\,d
\]
where Z ∼ N(0, Id) and φ(·) is its density (we will use this notation throughout this section). The proof of
Proposition 1 formalizes this intuitive analysis.
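The identity for E[log φ(Z)] is easy to check by simulation. The following minimal sketch (the dimension d = 3 and the sample size are arbitrary choices) compares a Monte Carlo average of log φ(Z) with the closed-form value −(1 + log(2π)) d / 2.

```python
import numpy as np

d = 3                                   # arbitrary dimension for the check
rng = np.random.default_rng(0)
z = rng.standard_normal((200_000, d))

# log-density of N(0, I_d) evaluated at each sample
log_phi = -0.5 * d * np.log(2.0 * np.pi) - 0.5 * np.sum(z**2, axis=1)

mc_estimate = log_phi.mean()
closed_form = -0.5 * (1.0 + np.log(2.0 * np.pi)) * d
print(mc_estimate, closed_form)
```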
When X0 has a smooth density ρ0(·) with Lipschitz continuous score function, we can show that
E[log ρt(Xt) | X0 = x0] → log ρ0(x0) as t → 0, as presented in the next proposition.
Proposition 2. Suppose that X0 has density ρ0(·) and $\sup_x \|\nabla^2 \log \rho_0(x)\| < \infty$. Then for any $x_0 \in \mathbb{R}^d$,
we have
\[
\lim_{t \to 0^+} \mathbb{E}\left[\log \rho_t(X_t) \mid X_0 = x_0\right] = \log \rho_0(x_0).
\]
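Both limits can be sanity-checked in a case where everything is available in closed form. The sketch below assumes a Gaussian target X0 ∼ N(0, σ² I_d) (an illustrative choice, not from the paper), for which ρ_t = N(0, v_t I_d) with v_t = (1 − t)σ² + t, and evaluates E[log ρ_t(X_t) | X0 = x0] near t = 0 and near t = 1. The function name `expected_log_density` and the test point are hypothetical.

```python
import numpy as np

def expected_log_density(t, x0, sigma2):
    """E[log rho_t(X_t) | X_0 = x0] when X_0 ~ N(0, sigma2 * I_d) (assumed target)."""
    x0 = np.asarray(x0, dtype=float)
    d = x0.size
    v = (1.0 - t) * sigma2 + t                        # rho_t = N(0, v I_d)
    second_moment = (1.0 - t) * (x0 @ x0) + t * d     # E[||X_t||^2 | X_0 = x0]
    return -0.5 * d * np.log(2.0 * np.pi * v) - second_moment / (2.0 * v)

x0 = np.array([0.5, -1.0, 2.0])
sigma2 = 4.0
d = x0.size

limit_t1 = -0.5 * (1.0 + np.log(2.0 * np.pi)) * d                                  # Proposition 1
limit_t0 = -0.5 * d * np.log(2.0 * np.pi * sigma2) - (x0 @ x0) / (2.0 * sigma2)    # log rho_0(x0)

near_one = expected_log_density(1.0 - 1e-8, x0, sigma2)
near_zero = expected_log_density(1e-8, x0, sigma2)
print(near_one - limit_t1, near_zero - limit_t0)
```

The Gaussian target satisfies the assumptions of both propositions (finite second moment, bounded Hessian of log ρ0), so both differences should be close to zero.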
The proof of this proposition can be found in Appendix B. With Propositions 1 and 2 in place, we can
take t1 → 0 and t2 → 1 in Theorem 1 to show that for any given point x0 ,
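\[
\log \rho_0(x_0) = -\frac{1+\log(2\pi)}{2}\,d - \int_0^1 D(t, x_0)\,\mathrm{d}t, \tag{3.1a}
\]
where
\[
D(t, x) := \frac{1}{2(1-t)}\,\mathbb{E}\left[ \left\| \frac{X_t - \sqrt{1-t}\,X_0}{t} + \nabla \log \rho_t(X_t) \right\|_2^2 \,\Big|\, X_0 = x \right] - \frac{d}{2t}. \tag{3.1b}
\]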
In practice, we might not want to make smoothness assumptions on X0 as in Proposition 2. In that case,
we can fix some sufficiently small δ > 0 and obtain a density formula
\[
\mathbb{E}\left[\log \rho_\delta(X_\delta) \mid X_0 = x_0\right] = -\frac{1+\log(2\pi)}{2}\,d - \int_\delta^1 D(t, x_0)\,\mathrm{d}t \tag{3.1c}
\]
for a smoothed approximation of log ρ0(x0). This kind of approximation is often used to circumvent
non-smooth target distributions in the diffusion model literature (e.g., Benton et al. (2023); Chen et al. (2023b,
2022); Li et al. (2023b)). We leave some more discussions to Appendix C.
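The density formula (3.1a) can be checked numerically when the score is known. The sketch below assumes a Gaussian target X0 ∼ N(0, σ² I_d) (an illustrative choice), so that ∇ log ρ_t(x) = −x / v_t with v_t = (1 − t)σ² + t. Under this assumption the conditional expectation in (3.1b) can be worked out by hand, and the 1/t singularities cancel; the closed form inside `D_gaussian` is our own calculation for this special case, not a formula from the paper.

```python
import numpy as np

def D_gaussian(t, x0, sigma2):
    """D(t, x0) from (3.1b) for X_0 ~ N(0, sigma2 * I_d), where grad log rho_t(x) = -x / v_t."""
    d = x0.size
    v = (1.0 - t) * sigma2 + t
    # Hand-derived value of the conditional expectation term minus d / (2t).
    return ((x0 @ x0) / (2.0 * v**2)
            + d * ((1.0 - t) * sigma2**2 - 2.0 * (1.0 - t) * sigma2 - t) / (2.0 * v**2))

x0 = np.array([0.5, -1.0, 2.0])     # assumed evaluation point
sigma2 = 4.0
d = x0.size

# Midpoint rule for the integral of D(t, x0) over [0, 1].
n = 200_000
t_mid = (np.arange(n) + 0.5) / n
integral = D_gaussian(t_mid, x0, sigma2).mean()

formula_value = -0.5 * (1.0 + np.log(2.0 * np.pi)) * d - integral                     # right-hand side of (3.1a)
exact_log_density = -0.5 * d * np.log(2.0 * np.pi * sigma2) - (x0 @ x0) / (2.0 * sigma2)
print(formula_value, exact_log_density)
```

The two printed values should agree up to quadrature error, which is a quick consistency check of the formula in this toy setting.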
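Recall that the forward process $X_1, \ldots, X_T$ has the same marginal distribution as $X^{\mathrm{sde}}_{t_1}, \ldots, X^{\mathrm{sde}}_{t_T}$ snapshotted
from the diffusion process (2.4). This gives the following approximation of the density formula (3.1a):
\begin{align*}
\log \rho_0(x_0) &\overset{(i)}{\approx} \mathbb{E}\left[\log \rho_{t_1}(X^{\mathrm{sde}}_{t_1}) \mid X^{\mathrm{sde}}_0 = x_0\right] \\
&\overset{(ii)}{\approx} -\frac{1+\log(2\pi t_1)}{2}\,d - \sum_{i=1}^{T} \frac{t_{i+1}-t_i}{2(1-t_i)}\,\mathbb{E}\left[ \left\| \frac{X^{\mathrm{sde}}_{t_i} - \sqrt{1-t_i}\,X^{\mathrm{sde}}_0}{t_i} + \nabla \log \rho_{t_i}(X^{\mathrm{sde}}_{t_i}) \right\|_2^2 \,\Big|\, X^{\mathrm{sde}}_0 = x_0 \right]
\end{align*}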
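In step (i) we approximate log ρ0(x0) with a smoothed proxy; see the discussion around (3.1c) for details;
step (ii) applies (3.1c), where we compute the integral $\int_{t_1}^1 d/(2t)\,\mathrm{d}t = -(d/2)\log t_1$ in closed form and
approximate the integral
\[
\int_{t_1}^1 \frac{1}{2(1-t)}\,\mathbb{E}\left[ \left\| \frac{X^{\mathrm{sde}}_t - \sqrt{1-t}\,X^{\mathrm{sde}}_0}{t} + \nabla \log \rho_t(X^{\mathrm{sde}}_t) \right\|_2^2 \,\Big|\, X^{\mathrm{sde}}_0 = x_0 \right] \mathrm{d}t;
\]
step (iii) follows from $X^{\mathrm{sde}}_{t_i} \overset{d}{=} \sqrt{1-t_i}\,x_0 + \sqrt{t_i}\,\varepsilon$ for ε ∼ N(0, Id) conditional on $X^{\mathrm{sde}}_0 = x_0$, and the relation
\[
\nabla \log \rho_{t_i} = \nabla \log q_i = s_i^\star(x) = -\frac{1}{\sqrt{t_i}}\,\varepsilon_i^\star(x) \approx -\frac{1}{\sqrt{t_i}}\,\widehat{\varepsilon}_i(x).
\]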
In practice, we need to choose the learning rates {βt : 1 ≤ t ≤ T } such that the grid becomes finer as T
becomes large. More specifically, we require
to be small (roughly of order O(1/T)), and $t_1 = \beta_1$ and $1 - t_T = \overline{\alpha}_T$ to be vanishingly small (of order $T^{-c}$
for some sufficiently large constant c > 0); see e.g., Benton et al. (2023); Li et al. (2023b) for learning rate
schedules satisfying these properties. Finally, we replace the time steps {ti : 1 ≤ i ≤ T} with the learning
rates for the forward process to achieve¹
\[
\log \rho_0(x_0) \approx -\frac{1+\log(2\pi\beta_1)}{2}\,d - \sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\,\mathbb{E}_{\varepsilon \sim N(0, I_d)}\left[ \left\| \varepsilon - \widehat{\varepsilon}_t\big(\sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \right\|_2^2 \right]. \tag{3.2}
\]
The density approximation (3.2) can be evaluated with the trained epsilon predictors.
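A minimal sketch of how the right-hand side of (3.2) might be evaluated by Monte Carlo is given below. It assumes the usual DDPM conventions α_t = 1 − β_t and ᾱ_t = α_1 ⋯ α_t (Section 2.1 is not reproduced here), and it treats the epsilon predictor as a user-supplied callable `eps_hat(t, x)`. The function name `density_estimate`, its signature, the linear schedule, and the zero predictor in the usage lines are all hypothetical placeholders, not a trained model or the authors' code.

```python
import numpy as np

def density_estimate(x0, eps_hat, betas, eps):
    """Monte Carlo evaluation of the right-hand side of (3.2).

    eps_hat(t, x): hypothetical epsilon predictor, t in 1..T, x of shape (n, d).
    betas: array (beta_1, ..., beta_T); eps: (n, d) array of N(0, I_d) draws.
    """
    x0 = np.asarray(x0, dtype=float)
    d = x0.size
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)                 # assumed convention for bar-alpha_t
    alphas_next = np.append(alphas[1:], 0.0)        # alpha_{T+1} := 0, as in footnote 1
    total = 0.0
    for i in range(len(betas)):                     # i = t - 1
        x_t = np.sqrt(alpha_bars[i]) * x0 + np.sqrt(1.0 - alpha_bars[i]) * eps
        err = np.sum((eps - eps_hat(i + 1, x_t)) ** 2, axis=1).mean()
        total += (1.0 - alphas_next[i]) / (2.0 * (1.0 - alpha_bars[i])) * err
    c0 = -0.5 * (1.0 + np.log(2.0 * np.pi * betas[0])) * d
    return c0 - total

# Usage with a placeholder predictor (always zero), only to exercise the code path.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)               # illustrative schedule only
eps = rng.standard_normal((4096, 2))
zero_predictor = lambda t, x: np.zeros_like(x)
estimate = density_estimate(np.array([0.3, -0.7]), zero_predictor, betas, eps)
print(estimate)
```

The same noise draws are reused across time steps here to keep the sketch short; fresh draws per step would work equally well. The linear schedule is only for illustration and is not claimed to satisfy the grid conditions discussed above.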
\[
\dot{x}_t = v_t(x_t) \quad \text{where} \quad v_t(x) = -\frac{x + \nabla \log \rho_t(x)}{2(1-t)}; \tag{3.3}
\]
namely, if we draw a particle x0 ∼ ρ0 and evolve it according to the ODE (3.3) to get the trajectory t → xt
for t ∈ [0, 1), then xt ∼ ρt. See e.g., Song et al. (2021b, Appendix D.1) for the derivation of this result.
Under some smoothness condition, we can use the results developed in Albergo et al. (2023); Grathwohl et al.
(2019) to show that for any given x0
\[
\log \rho_t(x_t) - \log \rho_0(x_0) = -\int_0^t \mathrm{Tr}\Big( \frac{\partial}{\partial x} v_s(x_s) \Big)\,\mathrm{d}s = \int_0^t \frac{d + \mathrm{tr}\big(\nabla^2 \log \rho_s(x_s)\big)}{2(1-s)}\,\mathrm{d}s. \tag{3.4}
\]
Here t → xt is the solution to the ODE (3.3) with initial condition x0. Since the ODE system (3.3) is based
on the score functions (hence xt can be numerically solved), and the integral in (3.4) is based on the Jacobian
of the score functions, we may take t → 1 and use the fact that ρt(·) → φ(·) to obtain a score-based density
formula
\[
\log \rho_0(x_0) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\|x_1\|_2^2 - \int_0^1 \frac{d + \mathrm{tr}\big(\nabla^2 \log \rho_s(x_s)\big)}{2(1-s)}\,\mathrm{d}s. \tag{3.5}
\]
However, numerically, this formula is more difficult to compute than our formula (3.1) for the following
reasons. First, (3.5) involves the Jacobian of the score functions, which are more challenging to estimate
than the score functions themselves. In fact, existing convergence guarantees for DDPM do not depend on
the accurate estimation of the Jacobian of the score functions (Benton et al., 2023; Chen et al., 2023a, 2022;
Li and Yan, 2024). Second, using this density formula requires solving the ODE (3.3) accurately to obtain
x1 , which might not be numerically stable, especially when the score function is not accurately estimated at
early stages, due to error propagation. In contrast, computing (3.1) only requires evaluating a few Gaussian
integrals (which can be efficiently approximated by the Monte Carlo method) and is more stable to score
estimation error.
1 Here we define αT +1 = 0 to accommodate the last term in the summation.
4 Implications
In the previous section, we established a density formula
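\[
\log q_0(x) \approx \underbrace{-\frac{1+\log(2\pi\beta_1)}{2}\,d}_{=:C_0^\star} - \sum_{t=1}^{T} \underbrace{\frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\,\mathbb{E}_{\varepsilon \sim N(0, I_d)}\left[ \left\| \varepsilon - \varepsilon_t^\star\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \right\|_2^2 \right]}_{=:L_{t-1}^\star(x)} \tag{4.1}
\]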
up to discretization error (which vanishes as T becomes large) and score estimation error. In this section,
we will discuss the implications of this formula in various generative and learning frameworks.
where the reverse process (Yt )0≤t≤T was defined in Section 2.1, and p0 is the density of Y0 . Under the coef-
ficient design recommended by Li and Yan (2024) (other reasonable designs also lead to similar conclusions)
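\[
\eta_t = 1 - \alpha_t \quad \text{and} \quad \sigma_t^2 = \frac{(1-\alpha_t)(\alpha_t - \overline{\alpha}_t)}{1-\overline{\alpha}_t}, \tag{4.3}
\]
it can be computed that for each 2 ≤ t ≤ T:
\[
L_{t-1}(x) = \frac{1-\alpha_t}{2(\alpha_t - \overline{\alpha}_t)}\,\mathbb{E}_{\varepsilon \sim N(0, I_d)}\left[ \left\| \varepsilon - \varepsilon_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \right\|_2^2 \right].
\]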
We can verify that (i) for each 2 ≤ t ≤ T, the coefficients in $L_{t-1}$ from (4.2) and $L^\star_{t-1}$ from (4.1) are identical
up to higher-order error; (ii) when T is large, $L_T$ becomes vanishingly small; and (iii) the function
\[
C_0(x) = -\frac{1+\log(2\pi\beta_1)}{2}\,d + O(\beta_1) = C_0^\star + O(\beta_1)
\]
is nearly a constant. See Appendix D.1 for details. It is worth highlighting that, as far as we know, the existing
literature has not pointed out that C0(x) is nearly a constant. For instance, Ho et al. (2020) discretize
this term to obtain a discrete log-likelihood (see Section 3.3 therein), which is unnecessary in view of our
observation. Additionally, some later works incorrectly claim that C0(x) is negligible, as we will discuss in the
following sections.
Now we discuss the validity of optimizing the variational bound for training DDPMs. Our discussion
shows that
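\[
\mathrm{KL}(q_0 \,\|\, p_0) = \underbrace{-\mathbb{E}_{x \sim q_0}[\log p_0(x)] - H(q_0)}_{=:L(\varepsilon_1, \ldots, \varepsilon_T)} \le \underbrace{\mathbb{E}_{x \sim q_0}[L(x)] - C_0^\star - H(q_0)}_{=:L_{\mathrm{vb}}(\varepsilon_1, \ldots, \varepsilon_T)} + o(1), \tag{4.4}
\]
where $H(q_0) = -\int \log q_0(x)\,\mathrm{d}q_0$ is the entropy of q0, and L(x) denotes the widely used (negative) ELBO²
\[
L(x) := \sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\,\mathbb{E}_{\varepsilon \sim N(0, I_d)}\left[ \left\| \varepsilon - \varepsilon_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \right\|_2^2 \right].
\]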
2 We follow the convention in existing literature to remove the last two terms LT (x) and C0 (x) from (4.2) in the ELBO.
The true objective of DDPM is to learn the epsilon predictors ε1, . . . , εT that minimize L in (4.4), while in
practice, the optimization target is the variational bound Lvb. It is known that the global minimizer for
\[
\mathbb{E}_{x \sim q_0}[L(x)] = \sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\,\mathbb{E}_{x \sim q_0,\, \varepsilon \sim N(0, I_d)}\left[ \left\| \varepsilon - \varepsilon_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \right\|_2^2 \right] \tag{4.5}
\]
is exactly $\widehat{\varepsilon}_t(\cdot) \equiv \varepsilon_t^\star(\cdot)$ for each 1 ≤ t ≤ T (see Appendix D.1). Although in practice the optimization is
based on samples from the target distribution q0 (instead of the population level expectation over q0) and
may not find the exact global minimizer, we consider the ideal scenario where the learned epsilon predictors
$\widehat{\varepsilon}_t$ equal $\varepsilon_t^\star$ to facilitate discussion. When $\varepsilon_t = \varepsilon_t^\star$ for each t, according to (4.1), we have
namely the minimizer for Lvb approximately minimizes L, and the optimal value is asymptotically zero when
the number of steps T becomes large. This suggests that by minimizing the variational bound Lvb , the
resulting generator distribution p0 is guaranteed to be close to the target distribution q0 in KL divergence.
Some experimental evidence suggests that using reweighted coefficients can marginally improve empirical
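performance. For example, Ho et al. (2020) suggest that in practice, it might be better to use uniform
coefficients in the ELBO
\[
L_{\mathrm{simple}}(x) := \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\varepsilon \sim N(0, I_d)}\left[ \left\| \varepsilon - \widehat{\varepsilon}_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon\big) \right\|_2^2 \right] \tag{4.8}
\]
when training DDPMs to improve sampling quality.³ This strategy has been adopted by many later works.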
In the following sections, we will discuss the role of using the ELBO in different applications. While the
original literature might use the modified ELBO (4.8), in our discussion we will stick to the original ELBO
(4.6) to gain intuition from our theoretical findings.
with the generator striving to produce realistic data while the discriminator tries to distinguish real data
from fake. The generator and discriminator are trained iteratively4
G ← arg min −Ez∼pnoise [log D(G(z))]
to approach the Nash equilibrium (G⋆ , D⋆ ), where the distribution of G⋆ (z) with z ∼ pnoise matches the
target distribution pdata , and D(x) = 1/2 for all x.
It is believed that adding a regularization term to make the generated samples fit the VLB can improve
the sampling quality of the generative model. For example, Xia et al. (2023) proposed adding the VLB
L(x) as a regularization term to the objective function, where $\{\widehat{\varepsilon}_t(\cdot) : 1 \le t \le T\}$ are the learned epsilon
predictors for pdata. The training procedure then becomes
where λ > 0 is some tuning parameter. However, it remains unclear what exactly is optimized through the
above objective. According to our theory, $L(x) \approx -\log p_{\mathrm{data}}(x) + C_0^\star$. Assuming that this approximation is
exact for intuitive understanding, the unique Nash equilibrium (Gλ, Dλ) satisfies
\[
p_{G_\lambda}(x) = \big( z\,p_{\mathrm{data}}(x)^{\lambda} - 1 \big)_+\, p_{\mathrm{data}}(x)
\]
for some normalizing factor z > 0, where $p_{G_\lambda}$ is the density of Gλ(z) with z ∼ pnoise. See Appendix D.2 for
details. This can be viewed as amplifying the density pdata wherever it is not too small, while zeroing out
the density where pdata is vanishingly small (which is difficult to estimate accurately), thus improving the
sampling quality.
\[
p_0(c \mid x) = \frac{p_0(c)\,p_0(x \mid c)}{\sum_{c_j \in C} p_0(c_j)\,p_0(x \mid c_j)} = \frac{p_0(x \mid c)}{\sum_{c_j \in C} p_0(x \mid c_j)}
\]
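for each c ∈ C. Recent work (Li et al., 2023a) proposed to use the ELBO⁵
\[
-L(x; c) := -\sum_{t=1}^{T} \frac{1-\alpha_{t+1}}{2(1-\overline{\alpha}_t)}\,\mathbb{E}_{\varepsilon \sim N(0, I_d)}\left[ \left\| \varepsilon - \widehat{\varepsilon}_t\big(\sqrt{\overline{\alpha}_t}\,x + \sqrt{1-\overline{\alpha}_t}\,\varepsilon;\, c\big) \right\|_2^2 \right]
\]
as an approximate class-conditional log-likelihood log p0(x | c) for each c ∈ C, which allows them to obtain a
posterior distribution
\[
\widehat{p}_0(c \mid x) = \frac{\exp\big(-L(x; c)\big)}{\sum_{c_j \in C} \exp\big(-L(x; c_j)\big)}. \tag{4.9}
\]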
Our theory suggests that −L(x; c) ≈ log p0 (x | c)−C0⋆ , where C0⋆ = −[1+log(2πβ1 )]d/2 is a universal constant
that does not depend on p0 and c. This implies that
providing theoretical justification for using the computed posterior pb0 in classification tasks.
It is worth mentioning that, although this framework was proposed in the literature (Li et al., 2023a), it
remained a heuristic method before our work. For example, in general, replacing the intractable log-likelihood
with a lower bound does not guarantee good performance, as they might not be close. Additionally, recall
that there is a term C0(x) in the ELBO (4.2). Li et al. (2023a) claimed that “Since T = 1000 is large and
log pθ(x0 | x1, c) is typically small, we choose to drop this term”. However, this argument is not correct, as we
already computed in Section 4.1 that this term
\[
C_0(x) = -\frac{1+\log(2\pi\beta_1)}{2}\,d + O(\beta_1)
\]
can be very large since β1 is typically very close to 0. In view of our results, the reason why this term can
be dropped is that it equals a universal constant that does not depend on the image data x and the class
index c, thus it does not affect the posterior (4.9).
5 The original paper adopted uniform coefficients; see the last paragraph of Section 4.1 for discussion.
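The point that a class-independent additive constant leaves the posterior (4.9) unchanged can be illustrated in a few lines. In the sketch below, the values of L(x; c), the dimension d = 3072, and β1 = 1e-4 are made-up numbers for illustration, and `posterior_from_elbo` is a hypothetical helper name.

```python
import numpy as np

def posterior_from_elbo(neg_elbo):
    """Posterior (4.9): softmax of -L(x; c) over classes, computed stably."""
    scores = -np.asarray(neg_elbo, dtype=float)
    scores = scores - scores.max()        # subtracting a constant leaves the softmax unchanged
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical values of L(x; c) for three classes (illustrative numbers only).
L_values = np.array([812.4, 809.1, 815.0])

# C_0^* = -(1 + log(2 pi beta_1)) d / 2 with assumed beta_1 = 1e-4 and d = 3072.
c0_star = -0.5 * (1.0 + np.log(2.0 * np.pi * 1e-4)) * 3072

p_without = posterior_from_elbo(L_values)             # constant dropped
p_with = posterior_from_elbo(L_values - c0_star)      # same constant kept for every class
print(p_without, abs(c0_star))
```

Although the constant is large in magnitude for these assumed values, the two posteriors coincide, since the shift is the same for every class.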
With training data $\{(x_i^1, \ldots, x_i^k) : 1 \le i \le n\}$, we can train the autoregressive network f(·) and the diffusion
model by minimizing the following empirical risk:
\[
\arg\min_{f,\, \varepsilon_1, \ldots, \varepsilon_T}\; \frac{1}{n} \sum_{i=1}^{n} L\big( f(x_i^1, \ldots, x_i^{k-1}),\, x_i^k \big). \tag{4.11}
\]
To gain intuition from our theoretical results, we take the weights in the diffusion loss (4.10) to be the
coefficients in the ELBO (4.6), and for each z, suppose that the learned diffusion model for $p(x^k \mid z)$ is
already good enough, which returns the set of epsilon predictors $\{\widehat{\varepsilon}_t(\cdot\,; z) : 1 \le t \le T\}$ for the probability
distribution of $x^k$ conditioned on z. Under this special case, our approximation result (4.6) shows that
which suggests that the training objective for the network f in (4.11) can be viewed as approximate MLE,
as the loss function
\[
\frac{1}{n} \sum_{i=1}^{n} L\big( f(x_i^1, \ldots, x_i^{k-1}),\, x_i^k \big) \approx -\frac{1}{n} \sum_{i=1}^{n} \log p\big( x_i^k \mid f(x_i^1, \ldots, x_i^{k-1}) \big) + C_0^\star
\]
represents the negative log-likelihood function (up to an additive constant) of the observed $x_1^k, \ldots, x_n^k$ in
terms of f.
5 Proof of Theorem 1
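Recall the definition of the stochastic process $(X_t)_{0 \le t < 1}$:
\[
\mathrm{d}X_t = -\frac{1}{2(1-t)}\,X_t\,\mathrm{d}t + \frac{1}{\sqrt{1-t}}\,\mathrm{d}B_t.
\]
Define $Y_t := X_t/\sqrt{1-t}$ for any 0 ≤ t < 1, and let $f(t, x) = x/\sqrt{1-t}$. We can use Itô’s formula to show that
\begin{align}
\mathrm{d}Y_t = \mathrm{d}f(t, X_t) &= \frac{\partial f}{\partial t}(t, X_t)\,\mathrm{d}t + \nabla_x f(t, X_t)^\top \mathrm{d}X_t + \frac{1}{2}\,\mathrm{d}X_t^\top \nabla_x^2 f(t, X_t)\,\mathrm{d}X_t \notag \\
&= \frac{X_t}{2(1-t)^{3/2}}\,\mathrm{d}t + \frac{1}{\sqrt{1-t}} \left( -\frac{1}{2(1-t)}\,X_t\,\mathrm{d}t + \frac{1}{\sqrt{1-t}}\,\mathrm{d}B_t \right) = \frac{\mathrm{d}B_t}{1-t}. \tag{5.1}
\end{align}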
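Therefore the Itô process Yt is a martingale, which is easier to handle. Let $g(t, y) = \log \rho_t(\sqrt{1-t}\,y)$, and we
can express $\log \rho_t(x) = g(t, x/\sqrt{1-t})$. In view of Itô’s formula, we have
\begin{align}
\mathrm{d}\log \rho_t(X_t) = \mathrm{d}g(t, Y_t) &\overset{(i)}{=} \frac{\partial g}{\partial t}(t, Y_t)\,\mathrm{d}t + \nabla_y g(t, Y_t)^\top \mathrm{d}Y_t + \frac{1}{2}\,\mathrm{d}Y_t^\top \nabla_y^2 g(t, Y_t)\,\mathrm{d}Y_t \notag \\
&\overset{(ii)}{=} \frac{\partial g}{\partial t}(t, Y_t)\,\mathrm{d}t + \frac{1}{1-t}\,\nabla_y g(t, Y_t)^\top \mathrm{d}B_t + \frac{1}{2(1-t)^2}\,\mathrm{d}B_t^\top \nabla_y^2 g(t, Y_t)\,\mathrm{d}B_t \notag \\
&\overset{(iii)}{=} \frac{\partial g}{\partial t}(t, Y_t)\,\mathrm{d}t + \frac{1}{1-t}\,\nabla_y g(t, Y_t)^\top \mathrm{d}B_t + \frac{1}{2(1-t)^2}\,\mathrm{tr}\big( \nabla_y^2 g(t, Y_t) \big)\,\mathrm{d}t. \tag{5.2}
\end{align}
Here step (i) follows from the Itô rule, step (ii) utilizes (5.1), while step (iii) can be derived from the Itô
calculus. Then we investigate the three terms above. Notice that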
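\[
\nabla_y g(t, y)\,\big|_{y=Y_t} = \frac{\nabla_y \big[ \rho_t(\sqrt{1-t}\,y) \big]}{\rho_t(\sqrt{1-t}\,y)}\,\Big|_{y=Y_t} = \frac{\sqrt{1-t}\,\nabla_x \rho_t(X_t)}{\rho_t(X_t)} = \sqrt{1-t}\,\nabla \log \rho_t(X_t), \tag{5.3}
\]
and similarly, we have
\[
\nabla_y^2 g(t, y)\,\big|_{y=Y_t} = (1-t)\,\nabla^2 \log \rho_t(X_t). \tag{5.4}
\]
Substituting (5.3) and (5.4) back into (5.2) gives
\[
\mathrm{d}\log \rho_t(X_t) = \frac{\partial g}{\partial t}(t, Y_t)\,\mathrm{d}t + \frac{1}{\sqrt{1-t}}\,\nabla \log \rho_t(X_t)^\top \mathrm{d}B_t + \frac{1}{2(1-t)}\,\mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big)\,\mathrm{d}t,
\]
or equivalently, for any given 0 < t1 < t2 < 1, we have
\[
\log \rho_t(X_t)\,\Big|_{t_1}^{t_2} = \int_{t_1}^{t_2} \left[ \frac{\partial g}{\partial t}(t, Y_t) + \frac{\mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big)}{2(1-t)} \right] \mathrm{d}t + \int_{t_1}^{t_2} \frac{1}{\sqrt{1-t}}\,\nabla \log \rho_t(X_t)^\top \mathrm{d}B_t. \tag{5.5}
\]
Conditional on X0, we take expectation on both sides of (5.5) to achieve
\[
\mathbb{E}\left[ \log \rho_{t_2}(X_{t_2}) - \log \rho_{t_1}(X_{t_1}) \mid X_0 \right] = \mathbb{E}\left[ \int_{t_1}^{t_2} \left( \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)}\,\mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) \right) \mathrm{d}t \,\Big|\, X_0 \right]. \tag{5.6}
\]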
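We need the following claims, whose proofs can be found at the end of this section.
Claim 1. For any 0 < t < 1 and any $y \in \mathbb{R}^d$, we have
\[
\frac{\partial g}{\partial t}(t, y) = -\frac{d}{2t} + \frac{1}{2t^2} \int_{x_0} \rho_{X_0 \mid X_t}\big( x_0 \mid \sqrt{1-t}\,y \big)\,\|y - x_0\|_2^2\,\mathrm{d}x_0.
\]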
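\begin{align}
&\overset{(i)}{=} \mathbb{E}\left[ \int_{t_1}^{t_2} \left( \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)}\,\mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) + \frac{d}{(1-t)\,t} \right) \mathrm{d}t \,\Big|\, X_0 \right] - \int_{t_1}^{t_2} \frac{d}{(1-t)\,t}\,\mathrm{d}t \notag \\
&\overset{(ii)}{=} \int_{t_1}^{t_2} \mathbb{E}\left[ \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)}\,\mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) + \frac{d}{(1-t)\,t} \,\Big|\, X_0 \right] \mathrm{d}t - \int_{t_1}^{t_2} \frac{d}{(1-t)\,t}\,\mathrm{d}t \notag \\
&= \int_{t_1}^{t_2} \mathbb{E}\left[ \frac{\partial g}{\partial t}(t, Y_t) + \frac{1}{2(1-t)}\,\mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) \,\Big|\, X_0 \right] \mathrm{d}t. \tag{5.8}
\end{align}
Here step (i) follows from (5.6), and its validity is guaranteed by
\[
\int_{t_1}^{t_2} \frac{d}{t\,(1-t)}\,\mathrm{d}t = d \log \frac{t_2 (1-t_1)}{t_1 (1-t_2)} < +\infty,
\]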
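Here step (i) follows from (5.10) and an application of Stein’s lemma
\[
\mathbb{E}\left[ \nabla \log \rho_t(X_t)^\top \big( X_t - \sqrt{1-t}\,X_0 \big) \mid X_0 \right] = t\,\mathbb{E}\left[ \mathrm{tr}\big( \nabla^2 \log \rho_t(X_t) \big) \mid X_0 \right],
\]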
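Note that here ρ0(·) stands for the law of X0. Hence we have
\begin{align*}
\frac{\partial g}{\partial t}(t, y) &= \frac{\partial}{\partial t} \log \rho_t(\sqrt{1-t}\,y) = \frac{1}{\rho_t(\sqrt{1-t}\,y)}\,\frac{\partial}{\partial t}\,\rho_t(\sqrt{1-t}\,y) \\
&= \frac{1}{\rho_t(\sqrt{1-t}\,y)} \int_{x_0} (2\pi)^{-d/2} \left[ -\frac{d}{2}\,t^{-d/2-1} \exp\left( -\frac{(1-t)\|y - x_0\|_2^2}{2t} \right) + t^{-d/2} \exp\left( -\frac{(1-t)\|y - x_0\|_2^2}{2t} \right) \frac{\|y - x_0\|_2^2}{2t^2} \right] \rho_0(\mathrm{d}x_0) \\
&= \frac{1}{\rho_t(\sqrt{1-t}\,y)} \int_{x_0} \rho_{X_t \mid X_0}\big( \sqrt{1-t}\,y \mid x_0 \big) \left( -\frac{d}{2t} + \frac{\|y - x_0\|_2^2}{2t^2} \right) \rho_0(\mathrm{d}x_0) \\
&= \int_{x_0} \left( -\frac{d}{2t} + \frac{\|y - x_0\|_2^2}{2t^2} \right) \rho_{X_0 \mid X_t}\big( \mathrm{d}x_0 \mid \sqrt{1-t}\,y \big)
\end{align*}
as claimed.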
see Chen et al. (2022) for the proof of this relationship. Then we can compute
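\begin{align*}
\nabla^2 \log \rho_t(x) &= -\frac{1}{t} \Big\{ I_d + \frac{1}{t}\,\mathbb{E}\big[ X_t - \sqrt{1-t}\,X_0 \mid X_t = x \big]\,\mathbb{E}\big[ X_t - \sqrt{1-t}\,X_0 \mid X_t = x \big]^\top \\
&\qquad\quad - \frac{1}{t}\,\mathbb{E}\big[ \big( X_t - \sqrt{1-t}\,X_0 \big)\big( X_t - \sqrt{1-t}\,X_0 \big)^\top \mid X_t = x \big] \Big\} \\
&= -\frac{1}{t} \Big\{ I_d + \frac{1}{t} \Big[ \int \big( x - \sqrt{1-t}\,x_0 \big)\,\rho_{X_0 \mid X_t}(\mathrm{d}x_0 \mid x) \Big] \Big[ \int \big( x - \sqrt{1-t}\,x_0 \big)\,\rho_{X_0 \mid X_t}(\mathrm{d}x_0 \mid x) \Big]^\top \\
&\qquad\quad - \frac{1}{t} \int \big( x - \sqrt{1-t}\,x_0 \big)\big( x - \sqrt{1-t}\,x_0 \big)^\top \rho_{X_0 \mid X_t}(\mathrm{d}x_0 \mid x) \Big\}.
\end{align*}
Hence we have
\begin{align*}
\mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big) &= -\frac{1}{t} \Big\{ d + \frac{1}{t} \Big\| \int \big( x - \sqrt{1-t}\,x_0 \big)\,\rho_{X_0 \mid X_t}(\mathrm{d}x_0 \mid x) \Big\|_2^2 - \frac{1}{t} \int \big\| x - \sqrt{1-t}\,x_0 \big\|_2^2\,\rho_{X_0 \mid X_t}(\mathrm{d}x_0 \mid x) \Big\} \\
&= -\frac{d}{t} - \big\| \nabla \log \rho_t(x) \big\|_2^2 + \frac{1}{t^2} \int \big\| x - \sqrt{1-t}\,x_0 \big\|_2^2\,\rho_{X_0 \mid X_t}(x_0 \mid x)\,\mathrm{d}x_0.
\end{align*}
By Jensen’s inequality, we know that
\[
\mathrm{tr}\big( \nabla^2 \log \rho_t(x) \big) \ge -\frac{d}{t}.
\]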
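The lower bound tr(∇² log ρ_t(x)) ≥ −d/t can be illustrated numerically on a toy example. The sketch below assumes a one-dimensional target with X0 uniform on {−a, +a} (our choice for illustration), so that ρ_t is a two-component Gaussian mixture, and estimates the second derivative of log ρ_t by central finite differences. The function name `log_rho_t` and all numerical values are hypothetical.

```python
import numpy as np

def log_rho_t(x, t, a):
    """Log-density of X_t when X_0 is uniform on {-a, +a} (assumed 1-D toy target)."""
    m = np.sqrt(1.0 - t) * a
    log_comp = [-(x - s * m) ** 2 / (2.0 * t) for s in (-1.0, 1.0)]
    return np.logaddexp(log_comp[0], log_comp[1]) - np.log(2.0) - 0.5 * np.log(2.0 * np.pi * t)

t, a, h = 0.3, 1.0, 1e-3
x = np.linspace(-4.0, 4.0, 2001)

# Central finite difference for the second derivative (d = 1, so this equals the trace).
second_derivative = (log_rho_t(x + h, t, a) - 2.0 * log_rho_t(x, t, a) + log_rho_t(x - h, t, a)) / h**2

slack = second_derivative + 1.0 / t     # should be nonnegative up to discretization error
print(slack.min(), slack.max())
```

The slack is close to zero far from the two modes and strictly positive between them, consistent with the bound being driven by the conditional variance of X0 given X_t.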
6 Discussion
This paper develops a score-based density formula that expresses the density function of a target distribution
using the score function along a continuous-time diffusion process that bridges this distribution and standard
Gaussian. By connecting this diffusion process with the forward process of score-based diffusion models, our
results provide theoretical support for training DDPMs by optimizing the ELBO, and offer novel insights
into several applications of diffusion models, including GAN training and diffusion classifiers.
Our work opens several directions for future research. First, our theoretical results are established for
the continuous-time diffusion process. It is crucial to carefully analyze the error induced by time discretiza-
tion, which could inform the number of steps required for the results in this paper to be valid in practice.
Additionally, while our results provide theoretical justification for using the ELBO (4.6) as a proxy for the
negative log-likelihood of the target distribution, they do not cover other practical variants of ELBO with
modified weights (e.g., the simplified ELBO (4.8)). Extending our analysis to other diffusion processes might
yield new density formulas incorporating these modified weights. Lastly, further investigation is needed into
other applications of this score-based density formula, including density estimation and inverse problems.
Acknowledgements
G. Li is supported in part by the Chinese University of Hong Kong Direct Grant for Research. Y. Yan was
supported in part by a Norbert Wiener Postdoctoral Fellowship from MIT.
A Proof of Proposition 1
We establish the desired result by sandwiching E[log ρt(Xt) | X0 = x0] and finding its limit as t → 1. We first
record that the density of Xt can be expressed as
\[
\rho_t(x) = \mathbb{E}_{X_0}\left[ (2\pi t)^{-d/2} \exp\left( -\frac{\|x - \sqrt{1-t}\,X_0\|_2^2}{2t} \right) \right], \tag{A.1}
\]
since $X_t \overset{d}{=} \sqrt{1-t}\,X_0 + \sqrt{t}\,Z$ for an independent variable Z ∼ N(0, Id).
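Lower bounding E[log ρt(Xt) | X0 = x0]. Starting from (A.1), for any $x \in \mathbb{R}^d$ and any 0 < t < 1,
\begin{align*}
\log \rho_t(x) &= \log \mathbb{E}_{X_0}\left[ (2\pi t)^{-d/2} \exp\left( -\frac{\|x - \sqrt{1-t}\,X_0\|_2^2}{2t} \right) \right] \\
&\overset{(i)}{\ge} \log \left[ (2\pi t)^{-d/2} \exp\left( -\mathbb{E}_{X_0}\left[ \frac{\|x - \sqrt{1-t}\,X_0\|_2^2}{2t} \right] \right) \right] \\
&= -\frac{d}{2}\log(2\pi t) - \mathbb{E}_{X_0}\left[ \frac{\|x - \sqrt{1-t}\,X_0\|_2^2}{2t} \right] \\
&= -\frac{d}{2}\log(2\pi t) - \frac{\|x\|_2^2}{2t} - \frac{1-t}{2t}\,\mathbb{E}[\|X_0\|_2^2] + \frac{\sqrt{1-t}}{t}\,\mathbb{E}[x^\top X_0] \\
&\overset{(ii)}{=} -\frac{d}{2}\log(2\pi t) - \big( 1 + O(\sqrt{1-t}) \big)\,\frac{\|x\|_2^2}{2t} + O(\sqrt{1-t})\,\mathbb{E}[\|X_0\|_2^2].
\end{align*}
Here step (i) follows from Jensen’s inequality and the fact that $e^{-x}$ is a convex function, while step (ii)
follows from elementary inequalities
\[
\mathbb{E}[x^\top X_0] \le \mathbb{E}\big[ \|x\|_2 \|X_0\|_2 \big] \le \frac{1}{2}\,\mathbb{E}\big[ \|x\|_2^2 + \|X_0\|_2^2 \big].
\]
This immediately gives, for any given $x_0 \in \mathbb{R}^d$ and any 0 < t < 1,
\[
\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] \ge \underbrace{-\frac{d}{2}\log(2\pi t) - \frac{1 + O(\sqrt{1-t})}{2t}\,\mathbb{E}\big[ \|X_t\|_2^2 \mid X_0 = x_0 \big] + O(\sqrt{1-t})\,\mathbb{E}[\|X_0\|_2^2]}_{=:f_{x_0}(t)}. \tag{A.2a}
\]
Taking t → 1−, we have
\begin{align}
\lim_{t \to 1^-} f_{x_0}(t) &= -\frac{d}{2}\log(2\pi) - \lim_{t \to 1^-} \frac{1}{2}\,\mathbb{E}\left[ \big\| \sqrt{1-t}\,x_0 + \sqrt{t}\,Z \big\|_2^2 \right] \quad \text{for } Z \sim N(0, I_d) \notag \\
&= -\frac{d}{2}\log(2\pi) - \frac{d}{2}. \tag{A.2b}
\end{align}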
Upper bounding E[log ρt(Xt) | X0 = x0]. Towards that, we need to obtain a pointwise upper bound for
log ρt(x). Since the desired result only depends on the limiting behavior when t → 1, from now on we only
consider t > 0.9, under which
\[
(1-t)^{1/4} < \frac{1}{2} \sqrt{\log \frac{1}{1-t}}
\]
holds. It would be helpful to develop the upper bound for the following two cases separately.
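• For any $(1-t)^{1/4} < \|x\|_2 < 0.5\sqrt{\log 1/(1-t)}$, we have
\begin{align}
\log \rho_t(x) &\overset{(a)}{\le} \log \mathbb{E}_{X_0}\left[ (2\pi t)^{-d/2} \left( \exp\left( -\frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} \right) + \mathbb{1}\big\{ \|X_0\|_2 > (1-t)^{-1/4} \big\} \right) \right] \notag \\
&\overset{(b)}{\le} -\frac{d}{2}\log(2\pi t) - \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} + \exp\left( \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} \right) \mathbb{P}\big( \|X_0\|_2 > (1-t)^{-1/4} \big) \notag \\
&\overset{(c)}{\le} -\frac{d}{2}\log(2\pi t) - \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} + \exp\left( \frac{\|x\|_2^2}{2t} \right) \mathbb{E}[\|X_0\|_2^2]\,(1-t)^{1/2} \notag \\
&\overset{(d)}{\le} -\frac{d}{2}\log(2\pi t) - \frac{(\|x\|_2 - (1-t)^{1/4})^2}{2t} + \mathbb{E}[\|X_0\|_2^2]\,(1-t)^{1/4}. \tag{A.3}
\end{align}
Here step (a) follows from (A.1); step (b) holds since log(x + y) ≤ log x + y/x holds for any x > 0 and
y ≥ 0; step (c) follows from $\|x\|_2 > (1-t)^{1/4}$ and Chebyshev’s inequality; while step (d) holds since
$\|x\|_2 < 0.5\sqrt{\log 1/(1-t)}$.
• For $\|x\|_2 \ge 0.5\sqrt{\log 1/(1-t)}$ or $\|x\|_2 \le (1-t)^{1/4}$, we will use the naive upper bound
\[
\log \rho_t(x) \le -\frac{d}{2}\log(2\pi t) < 0, \tag{A.4}
\]
where the first relation simply follows from (A.1) and the second relation holds when t > 0.9.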
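Then we have
\begin{align*}
\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] &\overset{(i)}{\le} \mathbb{E}\left[ \log \rho_t(X_t)\,\mathbb{1}\Big\{ (1-t)^{1/4} < \|X_t\|_2 < 0.5\sqrt{\log 1/(1-t)} \Big\} \,\Big|\, X_0 = x_0 \right] \\
&\overset{(ii)}{\le} \mathbb{E}\left[ \left( -\frac{d}{2}\log(2\pi t) - \frac{(\|X_t\|_2 - (1-t)^{1/4})^2}{2t} + \mathbb{E}[\|X_0\|_2^2]\,(1-t)^{1/4} \right) \right. \\
&\qquad\quad \left. \cdot\, \mathbb{1}\Big\{ (1-t)^{1/4} < \|X_t\|_2 < 0.5\sqrt{\log 1/(1-t)} \Big\} \,\Big|\, X_0 = x_0 \right] \\
&= \underbrace{\left( -\frac{d}{2}\log(2\pi t) + \mathbb{E}[\|X_0\|_2^2]\,(1-t)^{1/4} \right) \mathbb{P}\Big( (1-t)^{1/4} < \|X_t\|_2 < 0.5\sqrt{\log 1/(1-t)} \Big)}_{=:\overline{g}_{x_0}(t)} \\
&\quad - \underbrace{\mathbb{E}\left[ \frac{(\|X_t\|_2 - (1-t)^{1/4})^2}{2t}\,\mathbb{1}\Big\{ (1-t)^{1/4} < \|X_t\|_2 < 0.5\sqrt{\log 1/(1-t)} \Big\} \,\Big|\, X_0 = x_0 \right]}_{=:\widetilde{g}_{x_0}(t)}.
\end{align*}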
Here step (i) follows from (A.4), while step (ii) utilizes (A.3). Since $X_t$ is a continuous random variable for any $t \in (0, 1)$, we have
\[
\lim_{t\to 1^-} \mathbb{P}\Big((1-t)^{1/4} < \|X_t\|_2 < 0.5\sqrt{\log(1/(1-t))}\Big) = 1.
\]
where $\phi(z) = (2\pi)^{-d/2}\exp(-\|z\|_2^2/2)$ is the density function of $\mathcal{N}(0, I_d)$. For any $t \in (0.9, 1)$, we have
\[
h_t(z) \le \|\sqrt{t}\,z + \sqrt{1-t}\,x_0\|_2^2\,\phi(z) \le 2\big(\|z\|_2^2 + \|x_0\|_2^2\big)\,\phi(z) =: h(z),
\]
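\[
\mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] \le g_{x_0}(t) \quad \text{where} \quad g_{x_0}(t) := \overline{g}_{x_0}(t) - \widetilde{g}_{x_0}(t), \tag{A.5a}
\]
such that
\[
\lim_{t\to 1^-} g_{x_0}(t) = \lim_{t\to 1^-} \overline{g}_{x_0}(t) - \lim_{t\to 1^-} \widetilde{g}_{x_0}(t) = -\frac{d}{2}\log(2\pi) - \frac{d}{2}. \tag{A.5b}
\]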
Conclusion. By putting together (A.2) and (A.5), we know that for any $t \in (0.9, 1)$
\[
f_{x_0}(t) \le \mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] \le g_{x_0}(t) \quad \text{and} \quad \lim_{t\to 1^-} f_{x_0}(t) = \lim_{t\to 1^-} g_{x_0}(t) = -\frac{d}{2}\log(2\pi) - \frac{d}{2}.
\]
By the sandwich theorem, we arrive at the desired result
\[
\lim_{t\to 1^-} \mathbb{E}[\log \rho_t(X_t) \mid X_0 = x_0] = -\frac{d}{2}\log(2\pi) - \frac{d}{2}.
\]
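As a sanity check on this limit, consider the special case $X_0 \sim \mathcal{N}(0, I_d)$ (an assumption made purely for illustration; it is not part of the proof). Then $\rho_t$ is the $\mathcal{N}(0, I_d)$ density for every $t$, and since $X_t = \sqrt{1-t}\,x_0 + \sqrt{t}\,Z$ given $X_0 = x_0$, the conditional expectation has the closed form $-\frac{d}{2}\log(2\pi) - \frac{1}{2}\big((1-t)\|x_0\|_2^2 + td\big)$. The function names below are hypothetical.

```python
import math


def cond_expected_log_density(x0, t):
    """E[log rho_t(X_t) | X_0 = x0] under the illustrative assumption rho_0 = N(0, I_d).

    With X_t = sqrt(1-t) x0 + sqrt(t) Z, we have rho_t = N(0, I_d) and
    E[||X_t||^2 | X_0 = x0] = (1-t) ||x0||^2 + t d.
    """
    d = len(x0)
    sq_norm = sum(v * v for v in x0)
    second_moment = (1.0 - t) * sq_norm + t * d
    return -0.5 * d * math.log(2.0 * math.pi) - 0.5 * second_moment


def claimed_limit(d):
    """The limit -(d/2) log(2 pi) - d/2 established above."""
    return -0.5 * d * math.log(2.0 * math.pi) - 0.5 * d


x0 = [1.0, -2.0, 0.5]
for t in (0.9, 0.99, 0.999999):
    gap = cond_expected_log_density(x0, t) - claimed_limit(len(x0))
    print(f"t = {t}: gap to the limit = {gap:.6e}")
```

In this special case the gap equals $-\frac{1}{2}(1-t)(\|x_0\|_2^2 - d)$, so it vanishes linearly as $t \to 1$.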
B Proof of Proposition 2
Suppose that $L := \sup_x \|\nabla^2 \log \rho_0(x)\|$. The following claim will be useful in establishing the proposition; its proof is deferred to the end of this section.
Claim 3. There exists some $t_0 > 0$ such that
where $Z \sim \mathcal{N}(0, I_d)$. Here step (i) follows from (B.1) in Claim 3; step (ii) holds since $\mathbb{E}[Z] = 0$ and $\mathbb{E}[\|Z\|_2^2] = d$; while step (iii) follows from (5.11). It is straightforward to check that
\[
\int_x \rho_0(x)\,\Big(\frac{2\pi t}{1-t}\Big)^{-d/2}\exp\Big(-\frac{(1-t)\|x - x_0\|_2^2}{2t}\Big)\,\mathrm{d}x
\]
is the density of $\rho_0 * \mathcal{N}(0, t/(1-t))$ evaluated at $x_0$, which taken collectively with the assumption that $\rho_0(\cdot)$ is continuous yields
\[
\lim_{t\to 0^+} \int_x \rho_0(x)\,\Big(\frac{2\pi t}{1-t}\Big)^{-d/2}\exp\Big(-\frac{(1-t)\|x - x_0\|_2^2}{2t}\Big)\,\mathrm{d}x = \rho_0(x_0).
\]
as claimed.
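namely the conditional distribution of $X_0$ given $X_t = x$ is $1/(2t)$-strongly log-concave for any $x$, when $t \le 1/(2(L+1))$. By writing
\[
\rho_t(x) = p_{X_t}(x) = \int \phi(z)\,p_{\sqrt{1-t}X_0}\big(x - \sqrt{t}\,z\big)\,\mathrm{d}z = (1-t)^{-d/2}\int \phi(z)\,\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\mathrm{d}z, \tag{B.5}
\]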
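we can express the score function of $\rho_t$ as
\[
\begin{aligned}
\nabla \log \rho_t(x) = \frac{\nabla \rho_t(x)}{\rho_t(x)} &= (1-t)^{-\frac{d+1}{2}}\frac{1}{\rho_t(x)}\int \phi(z)\,\nabla\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\mathrm{d}z \\
&= (1-t)^{-\frac{d+1}{2}}\frac{1}{\rho_t(x)}\int \phi(z)\,\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\nabla\log\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\mathrm{d}z
\end{aligned} \tag{B.6}
\]
\[
\begin{aligned}
&\overset{\text{(i)}}{=} (1-t)^{-\frac{d+1}{2}}\Big(\frac{1-t}{t}\Big)^{d/2}\frac{1}{\rho_t(x)}\int \phi\Big(\frac{x - \sqrt{1-t}\,x_0}{\sqrt{t}}\Big)\,\rho_0(x_0)\,\nabla\log\rho_0(x_0)\,\mathrm{d}x_0 \\
&\overset{\text{(ii)}}{=} \frac{1}{\sqrt{1-t}}\int p_{X_0|X_t}(x_0 \mid x)\,\nabla\log\rho_0(x_0)\,\mathrm{d}x_0 = \frac{1}{\sqrt{1-t}}\,\mathbb{E}\big[\nabla\log\rho_0(X_0) \mid X_t = x\big].
\end{aligned} \tag{B.7}
\]
Here step (i) uses the change of variable $x_0 = (x - \sqrt{t}\,z)/\sqrt{1-t}$, while step (ii) follows from (B.3). Starting from (B.6), we take the derivative to achieve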
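\[
\begin{aligned}
\nabla^2 \log \rho_t(x) &= \underbrace{(1-t)^{-\frac{d}{2}-1}\frac{1}{\rho_t(x)}\int \phi(z)\,\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\nabla\log\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\Big[\nabla\log\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\Big]^\top \mathrm{d}z}_{=: H_1(x)} \\
&\quad + \underbrace{(1-t)^{-\frac{d}{2}-1}\frac{1}{\rho_t(x)}\int \phi(z)\,\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\nabla^2\log\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\mathrm{d}z}_{=: H_2(x)} \\
&\quad - \underbrace{(1-t)^{-\frac{d+1}{2}}\frac{1}{\rho_t^2(x)}\int \phi(z)\,\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\nabla\log\rho_0\Big(\frac{x - \sqrt{t}\,z}{\sqrt{1-t}}\Big)\,\mathrm{d}z\,\big[\nabla\rho_t(x)\big]^\top}_{=: H_3(x)}.
\end{aligned} \tag{B.8}
\]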
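Then we investigate $H_1(x)$, $H_2(x)$ and $H_3(x)$ respectively. Regarding $H_1(x)$, we have
\[
\begin{aligned}
H_1(x) &\overset{\text{(a1)}}{=} (1-t)^{-\frac{d}{2}-1}\Big(\frac{1-t}{t}\Big)^{d/2}\frac{1}{\rho_t(x)}\int \phi\Big(\frac{x - \sqrt{1-t}\,x_0}{\sqrt{t}}\Big)\,\rho_0(x_0)\,\nabla\log\rho_0(x_0)\,\big[\nabla\log\rho_0(x_0)\big]^\top \mathrm{d}x_0 \\
&\overset{\text{(b1)}}{=} \frac{1}{1-t}\int p_{X_0|X_t}(x_0 \mid x)\,\nabla\log\rho_0(x_0)\,\big[\nabla\log\rho_0(x_0)\big]^\top \mathrm{d}x_0 \\
&= \frac{1}{1-t}\,\mathbb{E}\Big[\nabla\log\rho_0(X_0)\,\big[\nabla\log\rho_0(X_0)\big]^\top \,\Big|\, X_t = x\Big];
\end{aligned} \tag{B.9a}
\]
for $H_2(x)$, we have
\[
\begin{aligned}
H_2(x) &\overset{\text{(a2)}}{=} (1-t)^{-\frac{d}{2}-1}\Big(\frac{1-t}{t}\Big)^{d/2}\frac{1}{\rho_t(x)}\int \phi\Big(\frac{x - \sqrt{1-t}\,x_0}{\sqrt{t}}\Big)\,\rho_0(x_0)\,\nabla^2\log\rho_0(x_0)\,\mathrm{d}x_0 \\
&\overset{\text{(b2)}}{=} \frac{1}{1-t}\int p_{X_0|X_t}(x_0 \mid x)\,\nabla^2\log\rho_0(x_0)\,\mathrm{d}x_0 = \frac{1}{1-t}\,\mathbb{E}\big[\nabla^2\log\rho_0(X_0) \mid X_t = x\big];
\end{aligned} \tag{B.9b}
\]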
Here step (i) holds since for any random variable $X$, $\mathbb{E}[(X - c)^2]$ is minimized at $c = \mathbb{E}[X]$; step (ii) holds since the score function $\nabla\log\rho_0(\cdot)$ is $L$-Lipschitz; step (iii) follows from the Poincaré inequality for log-concave distributions, and the fact that the conditional distribution of $X_0$ given $X_t = x$ is $1/(2t)$-strongly log-concave (cf. (B.4)). We conclude that
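\[
\big\|\nabla^2 \log \rho_t(x)\big\| \overset{\text{(a)}}{\le} \frac{1}{1-t}L + \frac{2tL^2 d}{1-t} \overset{\text{(b)}}{\le} 4L.
\]
Here step (a) follows from (B.10), (B.11), and the assumption that $\sup_x \|\nabla^2 \log \rho_0(x)\| \le L$, while step (b) holds provided that $t \le \min\{1/2, 1/(2Ld)\}$.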
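To make the two inequalities concrete, the following sketch checks them in the special case $\rho_0 = \mathcal{N}(0, \sigma^2 I_d)$, which is an illustrative assumption and not the general setting of the proposition. In that case $L = 1/\sigma^2$, $\rho_t = \mathcal{N}(0, ((1-t)\sigma^2 + t)I_d)$, and $\|\nabla^2\log\rho_t(x)\| = 1/((1-t)\sigma^2 + t)$ for every $x$. All function names are hypothetical.

```python
def hessian_norm_gaussian(sigma2, t):
    """||Hessian of log rho_t|| when rho_0 = N(0, sigma2 * I_d) (illustrative case)."""
    return 1.0 / ((1.0 - t) * sigma2 + t)


def intermediate_bound(L, d, t):
    """Right-hand side of step (a): L/(1-t) + 2 t L^2 d / (1-t)."""
    return L / (1.0 - t) + 2.0 * t * L**2 * d / (1.0 - t)


def check_bounds(sigma2, d, num_points=50):
    """Verify exact <= step-(a) bound <= 4L on a grid of t in (0, min{1/2, 1/(2Ld)}]."""
    L = 1.0 / sigma2
    t_max = min(0.5, 1.0 / (2.0 * L * d))
    for k in range(1, num_points + 1):
        t = t_max * k / num_points
        exact = hessian_norm_gaussian(sigma2, t)
        bound_a = intermediate_bound(L, d, t)
        # Small relative slack guards against floating-point rounding when (b) is tight.
        if not (exact <= bound_a <= 4.0 * L * (1.0 + 1e-12)):
            return False
    return True


for sigma2, d in [(1.0, 1), (0.25, 3), (4.0, 10)]:
    print(f"sigma^2 = {sigma2}, d = {d}: bounds hold = {check_bounds(sigma2, d)}")
```

The case $\sigma^2 = 1$, $d = 1$ has $Ld = 1$, so both constraints on $t$ coincide at $t = 1/2$ and step (b) holds with equality there.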
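Here step (i) holds since $\mathbb{E}[\|\varepsilon\|_2^2] = d$, while step (ii) follows from Stein’s lemma. Therefore, when the score functions are reasonably smooth as $t \to 0$, one may expect the integrand $D(t, x_0)$ to be of constant order, allowing the integral to converge at $t = 0$.
• As $t \to 1$, we can compute
\[
\begin{aligned}
D(t, x_0) &= \frac{1}{2(1-t)t}\,\mathbb{E}\Big[\big\|\varepsilon + \sqrt{t}\,\nabla\log\rho_t\big(\sqrt{1-t}\,x_0 + \sqrt{t}\,\varepsilon\big)\big\|_2^2\Big] - \frac{d}{2t} \\
&\asymp \frac{1}{2(1-t)}\,\mathbb{E}\Big[\big\|\varepsilon + \sqrt{t}\,\nabla\log\rho_t\big(\sqrt{1-t}\,x_0 + \sqrt{t}\,\varepsilon\big)\big\|_2^2\Big] - \frac{d}{2}.
\end{aligned}
\]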
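Recall that the KL divergence between two $d$-dimensional Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ admits the following closed-form expression:
\[
\mathsf{KL}\big(\mathcal{N}(\mu_1, \Sigma_1)\,\|\,\mathcal{N}(\mu_2, \Sigma_2)\big) = \frac{1}{2}\Big[\mathsf{tr}\big(\Sigma_2^{-1}\Sigma_1\big) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1}(\mu_2 - \mu_1) - d + \log\det\Sigma_2 - \log\det\Sigma_1\Big].
\]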
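The closed form is easy to evaluate numerically. The sketch below restricts to diagonal covariance matrices (an assumption made to keep the example dependency-free; the formula itself holds for general positive definite $\Sigma_1, \Sigma_2$). The function name is hypothetical. When the two covariances coincide, the trace, dimension, and log-determinant terms cancel and only the quadratic term in the means survives.

```python
import math


def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ) via the closed-form expression.

    All arguments are equal-length lists; var1 and var2 hold the (positive)
    diagonal entries of Sigma_1 and Sigma_2.
    """
    d = len(mu1)
    trace_term = sum(v1 / v2 for v1, v2 in zip(var1, var2))
    quad_term = sum((m2 - m1) ** 2 / v2 for m1, m2, v2 in zip(mu1, mu2, var2))
    logdet_term = sum(math.log(v2) - math.log(v1) for v1, v2 in zip(var1, var2))
    return 0.5 * (trace_term + quad_term - d + logdet_term)


# Equal isotropic covariances: KL reduces to ||mu2 - mu1||^2 / (2 * variance).
same_cov = kl_diag_gaussians([0.0, 0.0], [0.5, 0.5], [1.0, 2.0], [0.5, 0.5])
print("equal covariances:", same_cov)

# Equal means, different variances in one dimension.
diff_var = kl_diag_gaussians([0.0], [1.0], [0.0], [4.0])
print("different variances:", diff_var)
```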
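Then we can check that for $2 \le t \le T$,
\[
\mathsf{KL}\big(p_{X_{t-1}|X_t, X_0}(\cdot \mid x_t, x_0)\,\|\,p_{Y_{t-1}|Y_t}(\cdot \mid x_t)\big) = \frac{\alpha_t}{2\sigma_t^2}\bigg\|\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\alpha_t - 1}{\sqrt{\alpha_t}\,(1 - \bar\alpha_t)}\,x_t - \frac{\eta_t s_t(x_t)}{\sqrt{\alpha_t}}\bigg\|_2^2,
\]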
Consider the learning rate schedule in Li et al. (2023b); Li and Yan (2024):
\[
\beta_1 = \frac{1}{T^{c_0}}, \qquad \beta_{t+1} = \frac{c_1\log T}{T}\min\bigg\{\beta_1\Big(1 + \frac{c_1\log T}{T}\Big)^{t},\, 1\bigg\} \qquad (t = 1, \ldots, T-1) \tag{D.1}
\]
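for sufficiently large constants $c_0, c_1 > 0$. Then using the properties in, e.g., Li and Yan (2024, Lemma 8), we can check that
\[
\gamma_1 = \frac{(1 - \alpha_{t+1})(\alpha_t - 1)}{2(1 - \bar\alpha_t)(\alpha_t - \bar\alpha_t)} \le \frac{8c_1\log T}{T}\cdot\frac{1 - \alpha_{t+1}}{2(1 - \bar\alpha_t)},
\]
and
\[
\gamma_2 = \frac{\alpha_t - \alpha_{t+1}}{2(\alpha_t - \bar\alpha_t)} = \frac{\beta_{t+1} - \beta_t}{2(\alpha_t - \bar\alpha_t)} \le \Big(1 - \frac{\beta_t}{\beta_{t+1}}\Big)\Big(1 + \frac{1 - \alpha_t}{\alpha_t - \bar\alpha_t}\Big)\frac{1 - \alpha_{t+1}}{2(1 - \bar\alpha_t)} \le \frac{8c_1\log T}{T}\cdot\frac{1 - \alpha_{t+1}}{2(1 - \bar\alpha_t)}.
\]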
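For intuition about the schedule (D.1), the following sketch tabulates $\beta_t$ and $\bar\alpha_t = \prod_{i \le t}(1 - \beta_i)$. The values $c_0 = 2$ and $c_1 = 4$ are arbitrary illustrative choices (the analysis only requires them to be sufficiently large), and the function names are hypothetical. After the first step the schedule grows geometrically by a factor of at most $1 + c_1\log T/T$ until it saturates at $c_1\log T/T$, which is what keeps ratios such as $\beta_t/\beta_{t+1}$ close to one.

```python
import math


def beta_schedule(T, c0=2.0, c1=4.0):
    """Learning rates beta_1, ..., beta_T from (D.1); c0 and c1 are illustrative values."""
    rate = c1 * math.log(T) / T
    beta1 = T ** (-c0)
    betas = [beta1]
    for t in range(1, T):  # betas[t] stores beta_{t+1}
        betas.append(rate * min(beta1 * (1.0 + rate) ** t, 1.0))
    return betas


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{i <= t} (1 - beta_i)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


T = 1000
rate = 4.0 * math.log(T) / T
betas = beta_schedule(T)
alpha_bars = cumulative_alphas(betas)

# Growth factors beta_{t+1} / beta_t for t >= 2 (the first step is excluded).
growth = [betas[i + 1] / betas[i] for i in range(1, T - 1)]
print("largest beta:", max(betas), "saturation level:", rate)
print("max growth factor:", max(growth), "allowed:", 1.0 + rate)
print("alpha_bar_T:", alpha_bars[-1])
```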
Hence the coefficients in $L^\star_{t-1}$ and $L_{t-1}$ are identical up to higher-order error:
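Computing $L_0(x_0)$. By taking $\eta_1 = \sigma_1^2 = 1 - \alpha_1$ (notice that (4.3) does not cover the case $t = 1$), we have
\[
\begin{aligned}
p_{Y_0|Y_1}(x_0 \mid x_1) &= \Big(\frac{2\pi\sigma_1^2}{\alpha_1}\Big)^{-d/2}\exp\bigg(-\frac{\alpha_1}{2\sigma_1^2}\Big\|x_0 - \frac{x_1 + \eta_1 s_1(x_1)}{\sqrt{\alpha_1}}\Big\|_2^2\bigg) \\
&= \Big(\frac{2\pi\beta_1}{\alpha_1}\Big)^{-d/2}\exp\bigg(-\frac{\alpha_1}{2\beta_1}\Big\|x_0 - \frac{x_1 + \beta_1 s_1(x_1)}{\sqrt{\alpha_1}}\Big\|_2^2\bigg),
\end{aligned}
\]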
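and therefore
\[
\begin{aligned}
C_0(x_0) &= \mathbb{E}_{x_1 \sim p_{X_1|X_0}(\cdot \mid x_0)}\bigg[-\frac{d}{2}\log\frac{2\pi\beta_1}{\alpha_1} - \frac{\alpha_1}{2\beta_1}\Big\|x_0 - \frac{x_1 + \beta_1 s_1(x_1)}{\sqrt{\alpha_1}}\Big\|_2^2\bigg] \\
&\overset{\text{(i)}}{=} -\frac{d}{2}\log\frac{2\pi\beta_1}{\alpha_1} - \frac{1}{2}\,\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\big\|\varepsilon + \sqrt{\beta_1}\,s_1\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon\big)\big\|_2^2\Big] \\
&\overset{\text{(ii)}}{=} -\frac{1 + \log(2\pi\beta_1)}{2}\,d + \frac{d}{2}\log(1 - \beta_1) - \frac{1}{2}\beta_1\,\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\big\|s_1\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon\big)\big\|_2^2\Big] \\
&\qquad - \sqrt{\beta_1}\,\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\varepsilon^\top s_1\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon\big)\Big].
\end{aligned} \tag{D.2}
\]
Here in step (i), we replace $x_1$ with $\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon$, which has the same distribution; step (ii) uses the fact that $\mathbb{E}[\|\varepsilon\|_2^2] = d$ for $\varepsilon \sim \mathcal{N}(0, I_d)$ together with $\alpha_1 = 1 - \beta_1$. Using similar analysis as in Proposition 2, we can show that $\sup_x \|\nabla^2\log q_1(x)\| \le O(L)$ when $\beta_1$ is sufficiently small, as long as $\sup_x \|\nabla^2\log q_0(x)\| \le L$. Hence we have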
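\[
\begin{aligned}
\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\big\|s_1\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon\big)\big\|_2^2\Big] &\le \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\Big(\|s_1(x_0)\|_2 + O(L)\,\big\|x_0 - \sqrt{1-\beta_1}\,x_0 - \sqrt{\beta_1}\,\varepsilon\big\|_2\Big)^2\Big] \\
&\le 2\|s_1(x_0)\|_2^2 + O(L^2)\,\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\big\|x_0 - \sqrt{1-\beta_1}\,x_0 - \sqrt{\beta_1}\,\varepsilon\big\|_2^2\Big] \\
&= 2\|s_1(x_0)\|_2^2 + O(L^2\beta_1).
\end{aligned} \tag{D.3}
\]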
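By Stein’s lemma, we can show that
\[
\begin{aligned}
\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\varepsilon^\top s_1\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon\big)\Big] &= \sqrt{\beta_1}\,\mathbb{E}\Big[\mathsf{tr}\Big(\nabla^2\log q_1\big(\sqrt{1-\beta_1}\,x_0 + \sqrt{\beta_1}\,\varepsilon\big)\Big)\Big] \\
&\le O\big(\sqrt{\beta_1}\,Ld\big).
\end{aligned} \tag{D.4}
\]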
Substituting the bounds (D.3) and (D.4) back into (D.2), we have
\[
C_0(x_0) = -\frac{1 + \log(2\pi\beta_1)}{2}\,d + O(\beta_1)
\]
as claimed.
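The $O(\beta_1)$ remainder can be seen explicitly in the special case $q_0 = \mathcal{N}(0, I_d)$, which is an illustrative assumption rather than the general setting. Then $q_1 = \mathcal{N}(0, I_d)$ and $s_1(x) = -x$, so the expectation in step (i) of (D.2) equals $(1-\beta_1)^2 d + \beta_1(1-\beta_1)\|x_0\|_2^2$. The sketch below compares the resulting exact value with the leading term; function names are hypothetical.

```python
import math


def c0_exact_gaussian(x0, beta1):
    """C_0(x0) from step (i) of (D.2) when q_0 = N(0, I_d), so that s_1(x) = -x.

    In this case eps + sqrt(beta1) * s_1(...) = (1 - beta1) * eps - sqrt(beta1 (1 - beta1)) * x0.
    """
    d = len(x0)
    sq_norm = sum(v * v for v in x0)
    alpha1 = 1.0 - beta1
    expected_sq = alpha1**2 * d + beta1 * alpha1 * sq_norm
    return -0.5 * d * math.log(2.0 * math.pi * beta1 / alpha1) - 0.5 * expected_sq


def c0_leading_term(d, beta1):
    """Leading term -(1 + log(2 pi beta1)) d / 2 from the final display."""
    return -0.5 * (1.0 + math.log(2.0 * math.pi * beta1)) * d


x0 = [0.5, -1.0, 2.0]
ratios = {}
for beta1 in (1e-2, 1e-3, 1e-4):
    gap = c0_exact_gaussian(x0, beta1) - c0_leading_term(len(x0), beta1)
    ratios[beta1] = gap / beta1
    print(f"beta1 = {beta1}: gap = {gap:.6e}, gap / beta1 = {ratios[beta1]:.4f}")
```

In this example the ratio gap$/\beta_1$ approaches $(d - \|x_0\|_2^2)/2$, confirming that the remainder is linear in $\beta_1$ with a constant depending on $d$ and $x_0$.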
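Optimal solution for (4.5). It is known that for each $1 \le t \le T$, the score function $s_t^\star(\cdot)$ associated with $q_t$ satisfies
\[
s_t^\star(\cdot) = \arg\min_{s(\cdot):\,\mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x \sim q_0,\,\varepsilon \sim \mathcal{N}(0, I_d)}\bigg[\Big\|s\big(\sqrt{\bar\alpha_t}\,x + \sqrt{1 - \bar\alpha_t}\,\varepsilon\big) + \frac{1}{\sqrt{1 - \bar\alpha_t}}\,\varepsilon\Big\|_2^2\bigg].
\]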
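See, e.g., Chen et al. (2022, Appendix A) for the proof. Recall that $\varepsilon_t^\star(\cdot) = -\sqrt{1 - \bar\alpha_t}\,s_t^\star(\cdot)$; substituting $s(\cdot) = -\varepsilon(\cdot)/\sqrt{1 - \bar\alpha_t}$ into the display above and dropping the constant factor $1/(1 - \bar\alpha_t)$, we have
\[
\varepsilon_t^\star(\cdot) = \arg\min_{\varepsilon(\cdot):\,\mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x \sim q_0,\,\varepsilon \sim \mathcal{N}(0, I_d)}\Big[\big\|\varepsilon - \varepsilon\big(\sqrt{\bar\alpha_t}\,x + \sqrt{1 - \bar\alpha_t}\,\varepsilon\big)\big\|_2^2\Big].
\]
Therefore the global minimizer for (4.5) is $\widehat{\varepsilon}_t(\cdot) \equiv \varepsilon_t^\star(\cdot)$ for each $1 \le t \le T$.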
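The variational characterization of the score can be illustrated in one dimension. Assume, purely for illustration, that $q_0 = \mathcal{N}(0, 1)$, so that $q_t = \mathcal{N}(0, 1)$ and the true score is $s_t^\star(y) = -y$. Restricting the minimization to linear functions $s(y) = a\,y$ (a simplification that loses nothing here because the true score is linear), the empirical least-squares solution should be close to $a = -1$. The function name and sample size are hypothetical choices.

```python
import math
import random


def fit_linear_score(alpha_bar, num_samples=20000, seed=0):
    """Least-squares slope a minimizing the empirical average of (a*y + eps/sqrt(1-alpha_bar))^2.

    Here y = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps with x, eps ~ N(0, 1)
    independent (illustrative assumption q_0 = N(0, 1)).
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - alpha_bar)
    cross = 0.0   # accumulates y * eps / scale
    energy = 0.0  # accumulates y^2
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        y = math.sqrt(alpha_bar) * x + scale * eps
        cross += y * eps / scale
        energy += y * y
    # Setting the derivative in a to zero gives a = -sum(y * eps / scale) / sum(y^2).
    return -cross / energy


slopes = {ab: fit_linear_score(ab) for ab in (0.2, 0.5)}
for ab, a in slopes.items():
    print(f"alpha_bar = {ab}: fitted slope = {a:.4f} (true score slope is -1)")
```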
D.2 Technical details in Section 4.2
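By checking the optimality condition, we know that $(D_\lambda, G_\lambda)$ is a Nash equilibrium if and only if
\[
D_\lambda(x) = \frac{p_{\mathsf{data}}(x)}{p_{\mathsf{data}}(x) + p_{G_\lambda}(x)}, \qquad \text{(optimality condition for } D_\lambda\text{)} \tag{D.5}
\]
where $p_{G_\lambda} = (G_\lambda)_{\#}\,p_{\mathsf{noise}}$, and there exists some constant $c$ such that
\[
\begin{cases}
-\log D_\lambda(x) + \lambda L(x) = c, & \text{when } x \in \mathsf{supp}(p_{G_\lambda}), \\
-\log D_\lambda(x) + \lambda L(x) \ge c, & \text{otherwise}.
\end{cases} \qquad \text{(optimality condition for } G_\lambda\text{)} \tag{D.6}
\]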
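Taking the approximation $L(x) \approx -\log p_{\mathsf{data}}(x) + C_0^\star$ as exact, we have
\[
D_\lambda(x) = \begin{cases}
e^{\lambda C_0^\star - c}\,p_{\mathsf{data}}^{-\lambda}(x), & \text{for } x \in \mathsf{supp}(p_{G_\lambda}), \\
1, & \text{for } x \notin \mathsf{supp}(p_{G_\lambda}),
\end{cases} \tag{D.7}
\]
where the first and second cases follow from (D.6) and (D.5) respectively. Then we derive a closed-form expression for $p_{G_\lambda}$.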
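• For any $x \in \mathsf{supp}(p_{G_\lambda})$, by putting (D.5) and (D.7) together, we have
\[
e^{\lambda C_0^\star - c}\,p_{\mathsf{data}}^{-\lambda}(x) = \frac{p_{\mathsf{data}}(x)}{p_{\mathsf{data}}(x) + p_{G_\lambda}(x)},
\]
which further gives
\[
p_{G_\lambda}(x) = p_{\mathsf{data}}(x)\Big(e^{-\lambda C_0^\star + c}\,p_{\mathsf{data}}^{\lambda}(x) - 1\Big). \tag{D.8}
\]
• For any $x \notin \mathsf{supp}(p_{G_\lambda})$, we have
where step (i) follows from $D_\lambda(x) = 1$, which follows from (D.7); step (ii) holds when we take the approximation $L(x) \approx -\log p_{\mathsf{data}}(x) + C_0^\star$ as exact; and step (iii) follows from (D.6). This immediately gives
\[
e^{-\lambda C_0^\star + c}\,p_{\mathsf{data}}^{\lambda}(x) - 1 = \exp\big(-\lambda C_0^\star + c + \lambda\log p_{\mathsf{data}}(x)\big) - 1 \le 0. \tag{D.9}
\]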
On the other hand, we can check that (D.7) and (D.10) satisfy the optimality conditions (D.5) and (D.6), which establishes the desired result.
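To see what (D.8) and (D.9) imply together, note that they give $p_{G_\lambda}(x) = p_{\mathsf{data}}(x)\,\max\{\kappa\,p_{\mathsf{data}}^{\lambda}(x) - 1,\, 0\}$ with $\kappa := e^{-\lambda C_0^\star + c}$, where $\kappa$ must make $p_{G_\lambda}$ integrate to one. The sketch below works this out for a toy probability vector on four points; the discrete setting, the symbol $\kappa$, the bisection solver, and all function names are assumptions introduced here for illustration only.

```python
def generator_mass(p_data, lam, kappa):
    """p_G(x) = p_data(x) * max(kappa * p_data(x)^lam - 1, 0), combining (D.8) and (D.9)."""
    return [p * max(kappa * p**lam - 1.0, 0.0) for p in p_data]


def solve_generator(p_data, lam, rel_tol=1e-12):
    """Find kappa by bisection so that the generator mass sums to one."""
    lo, hi = 0.0, 1.0
    # The total mass is nondecreasing in kappa, so double hi until it reaches one.
    while sum(generator_mass(p_data, lam, hi)) < 1.0:
        hi *= 2.0
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if sum(generator_mass(p_data, lam, mid)) < 1.0:
            lo = mid
        else:
            hi = mid
    return hi, generator_mass(p_data, lam, hi)


p_data = [0.5, 0.3, 0.15, 0.05]
kappa, p_gen = solve_generator(p_data, lam=1.0)
print("kappa:", kappa)
print("p_G:", p_gen)
```

In this toy example the two least likely points receive zero generator mass, while the most likely point is up-weighted relative to $p_{\mathsf{data}}$: the regularized generator concentrates on high-density regions.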
References
Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). Stochastic interpolants: A unifying framework
for flows and diffusions. arXiv preprint arXiv:2303.08797.
Anderson, B. D. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications,
12(3):313–326.
Benton, J., De Bortoli, V., Doucet, A., and Deligiannidis, G. (2023). Linear convergence bounds for diffusion
models via stochastic localization. arXiv preprint arXiv:2308.03686.
Chen, H., Lee, H., and Lu, J. (2023a). Improved analysis of score-based generative modeling: User-friendly
bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages
4735–4763. PMLR.
Chen, S., Chewi, S., Lee, H., Li, Y., Lu, J., and Salim, A. (2023b). The probability flow ODE is provably fast.
arXiv preprint arXiv:2305.11798.
Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. R. (2022). Sampling is as easy as learning the
score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215.
Croitoru, F.-A., Hondru, V., Ionescu, R. T., and Shah, M. (2023). Diffusion models in vision: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence.
Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural
Information Processing Systems, 34:8780–8794.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio,
Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
Graikos, A., Malkin, N., Jojic, N., and Samaras, D. (2022). Diffusion models as plug-and-play priors.
Advances in Neural Information Processing Systems, 35:14715–14728.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. (2019). Scalable reversible generative
models with free-form continuous dynamics. In International Conference on Learning Representations.
Haussmann, U. G. and Pardoux, E. (1986). Time reversal of diffusions. The Annals of Probability, pages
1188–1205.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 33:6840–6851.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of
Machine Learning Research, 6(4).
Hyvärinen, A. (2007). Some extensions of score matching. Computational Statistics & Data Analysis,
51(5):2499–2512.
Li, A. C., Prabhudesai, M., Duggal, S., Brown, E., and Pathak, D. (2023a). Your diffusion model is secretly
a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 2206–2217.
Li, G., Wei, Y., Chen, Y., and Chi, Y. (2023b). Towards non-asymptotic convergence for diffusion-based
generative models. In The Twelfth International Conference on Learning Representations.
Li, G. and Yan, Y. (2024). Adapting to unknown low-dimensional structures in score-based diffusion models.
arXiv preprint arXiv:2405.14861.
Li, T., Tian, Y., Li, H., Deng, M., and He, K. (2024). Autoregressive image generation without vector
quantization. arXiv preprint arXiv:2406.11838.
Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.
Mardani, M., Song, J., Kautz, J., and Vahdat, A. (2024). A variational perspective on solving inverse
problems with diffusion models. In The Twelfth International Conference on Learning Representations.
Nichol, A. Q. and Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. In International
Conference on Machine Learning, pages 8162–8171.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional image
generation with CLIP latents. arXiv preprint arXiv:2204.06125.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 10684–10695.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R.,
Karagol Ayan, B., Salimans, T., et al. (2022). Photorealistic text-to-image diffusion models with deep
language understanding. Advances in Neural Information Processing Systems, 35:36479–36494.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–
2265.
Song, J., Meng, C., and Ermon, S. (2021a). Denoising diffusion implicit models. In International Conference
on Learning Representations.
Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution.
Advances in Neural Information Processing Systems, 32.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021b). Score-based
generative modeling through stochastic differential equations. In International Conference on Learning
Representations.
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation,
23(7):1661–1674.
Xia, M., Shen, Y., Yang, C., Yi, R., Wang, W., and Liu, Y.-j. (2023). SMaRt: Improving GANs with score
matching regularity. In Forty-first International Conference on Machine Learning.
Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H. (2023).
Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–
39.