Deep Equilibrium Approaches To Diffusion Models
Zico Kolter
arXiv:2210.12867v1 [cs.LG] 23 Oct 2022
Abstract
Diffusion-based generative models are extremely effective in generating high-quality
images, with generated samples often surpassing the quality of those produced
by other models under several metrics. One distinguishing feature of these
models, however, is that they typically require long sampling chains to produce
high-fidelity images. This presents a challenge not only in terms of sampling
time, but also because of the inherent difficulty of backpropagating through these
chains in order to accomplish tasks such as model inversion, i.e., approximately
finding latent states that generate known images. In this paper, we look at diffusion
models through a different perspective, that of a (deep) equilibrium (DEQ) fixed
point model. Specifically, we extend the recent denoising diffusion implicit model
(DDIM) [68], and model the entire sampling chain as a joint, multivariate fixed
point system. This setup provides an elegant unification of diffusion and equilibrium
models, and shows benefits in 1) single image sampling, as it replaces the
fully-serial typical sampling process with a parallel one; and 2) model inversion,
where we can leverage fast gradients in the DEQ setting to much more quickly find
the noise that generates a given image. The approach is also orthogonal and thus
complementary to other methods used to reduce the sampling time, or improve
model inversion. We demonstrate our method’s strong performance across several
datasets, including CIFAR10, CelebA, and LSUN Bedrooms and Churches.1
1 Introduction
Diffusion models have emerged as a promising class of generative models that can generate high
quality images [69, 68, 57], outperforming GANs on perceptual quality metrics [19], and likelihood-
based models on density estimation [42]. One of the limitations of these models, however, is the
fact that they require a long diffusion chain (many repeated applications of a denoising process), in
order to generate high-fidelity samples. Several recent papers have focused on tackling this limitation,
e.g., by shortening the length of the diffusion process through an alternative parameterization [68, 44], or
through progressive distillation of a sampler with a long diffusion chain into a shorter one [54, 65].
However, all of these methods still rely on a fundamentally sequential sampling process, which makes it
difficult to accelerate sampling and to support other applications such as differentiating through the
entire generation process.
In this paper, we propose an alternative approach that also begins to address such challenges from a
different perspective. Specifically, we propose to model the generative process of a specific class of
1 Code is available at https://github.com/locuslab/deq-ddim
2 Preliminaries
Diffusion Models Denoising diffusion probabilistic models (DDPM) [67, 33] are generative models
that can convert the data distribution to a simple distribution (e.g., a standard Gaussian, N (0, I))
through a diffusion process. Specifically, given samples from a target distribution x0 ∼ q(x0 ), the
diffusion process is a Markov chain that adds Gaussian noise to the data to generate latent states
x1 , ..., xT in the same sample space as x0 . The inference distribution of the diffusion process is given by:
\[ q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{1} \]
To learn the parameters θ that characterize a distribution $p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$ as an
approximation of q(x0 ), a surrogate variational lower bound [67] was proposed to train this model:
\[ L = \mathbb{E}_q \Big[ -\log p_\theta(x_0 \mid x_1) + \sum_t D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) + D_{KL}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) \Big] \tag{2} \]
After training, samples can be generated by a reverse Markov chain, i.e., first sampling xT ∼ p(xT ),
and then repeatedly sampling xt−1 until we reach x0 .
As noted in [67, 68], the length T of a diffusion process is usually large (e.g., T = 1000 [33]) as it
contributes to a better approximation of Gaussian conditional distributions in the generative process.
However, because of the large value of T , sampling from diffusion models can be noticeably slower
than sampling from other deep generative models like GANs [29].
One feasible acceleration is to rewrite the forward process into a non-Markovian one that leads to a
“shorter” and deterministic generative process, i.e., denoising diffusion implicit model [68] (DDIM).
DDIM can be trained similarly to DDPM, using the variational lower bound shown in Eq. (2).
Essentially, DDIM constructs a nearly non-stochastic scheme that can quickly sample from the
learned data distribution without introducing additional noise. Specifically, the scheme to generate a
sample xt−1 given xt is:
\[ x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}} \right) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_\theta^{(t)}(x_t) + \sigma_t \epsilon_t \tag{3} \]
where $\alpha_1, \dots, \alpha_T \in (0, 1]$, $\epsilon_t \sim N(0, I)$, and $\epsilon_\theta^{(t)}(x_t)$ is an estimator trained to predict the
noise given a noisy state $x_t$. For a variance schedule $\beta_1, \dots, \beta_T$, we use the notation
$\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$. Different values of $\sigma_t$ define different generative processes. When
$\sigma_t = \sqrt{(1 - \alpha_{t-1})/(1 - \alpha_t)} \sqrt{1 - \alpha_t/\alpha_{t-1}}$ for all t, the generative process represents a DDPM. Setting
$\sigma_t = 0$ for all t gives rise to a DDIM, which results in a deterministic generating process except for the
initial sampling xT ∼ p(xT ).
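The update in Eq. (3) can be sketched in a few lines of Python. This is only an illustration on scalar states: the convention alphas[0] = 1, the function names, and the linear toy noise predictor used in the usage note are assumptions made for the example, not part of the trained models discussed in this paper.

```python
import math
import random

def make_alphas(betas):
    """alphas[t] = prod_{s<=t} (1 - beta_s); alphas[0] = 1 is an assumed convention."""
    alphas = [1.0]
    for beta in betas:
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def ddpm_sigma(alphas, t):
    """The sigma_t for which Eq. (3) represents a DDPM."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    return math.sqrt((1.0 - a_prev) / (1.0 - a_t)) * math.sqrt(1.0 - a_t / a_prev)

def generalized_step(x_t, t, alphas, eps_model, sigma_t=0.0, rng=random):
    """One application of Eq. (3) to a scalar state; sigma_t = 0 gives the DDIM step."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    eps = eps_model(x_t, t)
    # Predicted clean sample, rescaled to the noise level of step t - 1.
    x0_pred = (x_t - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
    # Deterministic direction term; max() guards against round-off below zero.
    direction = math.sqrt(max(0.0, 1.0 - a_prev - sigma_t ** 2)) * eps
    # Fresh Gaussian noise is only drawn for the stochastic variants.
    noise = sigma_t * rng.gauss(0.0, 1.0) if sigma_t > 0.0 else 0.0
    return math.sqrt(a_prev) * x0_pred + direction + noise
```

For instance, with `alphas = make_alphas([0.01 * (i + 1) for i in range(10)])` and a hypothetical predictor `lambda x, t: 0.5 * x`, calling `generalized_step(1.3, 5, alphas, predictor)` performs one deterministic DDIM step, while passing `sigma_t=ddpm_sigma(alphas, 5)` performs the corresponding DDPM step.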
Deep Equilibrium Models Deep equilibrium models are a recently-proposed class of deep net-
works that, in their forward pass, seek to find a fixed point of a single layer applied repeatedly to a
hidden state. Specifically, consider a deep feedforward model with L layers:
\[ z^{[i+1]} = f_\theta^{[i]}\big(z^{[i]}; x\big) \quad \text{for } i = 0, \dots, L - 1 \tag{4} \]
where x is the input injection, $z^{[i]}$ is the hidden state of the ith layer, and $f_\theta^{[i]}$ is a layer that defines the
feature transformation. Assuming the above model is weight-tied, i.e., $f_\theta^{[i]} = f_\theta, \forall i$, then in the limit
of infinite depth, the output $z^{[i]}$ of this network converges to a fixed point $z^*$:
\[ \lim_{i \to \infty} f_\theta\big(z^{[i]}; x\big) = z^* \tag{5} \]
Inspired by this convergence phenomenon, deep equilibrium (DEQ) models [6] were proposed
to directly compute this fixed point $z^*$ as the output, i.e.,
\[ f_\theta(z^*; x) = z^* \tag{6} \]
The equilibrium state $z^*$ can be computed by black-box solvers like Broyden’s method [13] or Anderson
acceleration [5]. To train this fixed-point system, Bai et al. [6] leverage implicit differentiation to
directly backpropagate through the equilibrium state $z^*$ using O(1) memory complexity. DEQ is
known as a principled framework for characterizing convergence and energy minimization in deep
learning. We defer a detailed discussion to Sec. 6.
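The relation between the unrolled weight-tied network of Eq. (4) and the fixed point of Eq. (6) can be illustrated with a small sketch. The scalar layer below is a hypothetical contraction chosen so that convergence is guaranteed, and plain fixed-point iteration stands in for the Broyden or Anderson solvers mentioned above.

```python
import math

def layer(z, x, w=0.5):
    """Hypothetical weight-tied layer f_theta(z; x); |w| < 1 makes it a contraction in z."""
    return math.tanh(w * z + x)

def unrolled_forward(x, depth):
    """Eq. (4) with tied weights: apply the same layer `depth` times, starting from z = 0."""
    z = 0.0
    for _ in range(depth):
        z = layer(z, x)
    return z

def solve_fixed_point(x, tol=1e-12, max_iter=1000):
    """Naive fixed-point iteration for Eq. (6); returns (z_star, iterations used)."""
    z = 0.0
    for i in range(1, max_iter + 1):
        z_next = layer(z, x)
        if abs(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iter
```

Under these assumptions, `solve_fixed_point(0.3)` returns a state that a sufficiently deep unrolled network, e.g. `unrolled_forward(0.3, 200)`, also reaches, which is the sense in which a DEQ outputs the limit of an infinitely deep weight-tied model.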
we can exactly capture the typical diffusion inference chain; 2) we can create a more expressive
reverse process where the state xt is updated based upon all previous states xt+1:T , improving the
inference process; 3) we can execute all steps of the inference chain in parallel rather than solely in
sequence as is typically required in diffusion models; and 4) we can use common DEQ acceleration
methods, such as the Anderson solver [5] to find the fixed point, which makes the sampling process
converge faster. A downside of this formulation is that we need to store all DEQ states simultaneously
(i.e., only the images, not the intermediate network states).
This process also lets us generate a sample using a subset of latent states {xτ1 , . . . , xτS }, where
{τ1 , . . . , τS } ⊆ T . While this helps in accelerating the overall generative process, there is a tradeoff
between sampling quality and computational efficiency. As noted in Song et al. [68], larger T values
lead to lower FID scores of the generated images but need more compute time; smaller T are faster
to sample from, but the resulting images have worse FID scores.
Reformulating this sampling process as a DEQ addresses multiple concerns raised above. We can
define a DEQ, with a sequence of latent states x1:T as its internal state, that simultaneously solves for
the equilibrium points at all the timesteps. The global convergence of this process is upper bounded
by T steps, by definition. To derive the DEQ formulation of the generative process, first we rearrange
the terms in Eq. (7):
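\[ x_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, x_t + \left( \sqrt{1 - \alpha_{t-1}} - \sqrt{\frac{\alpha_{t-1} (1 - \alpha_t)}{\alpha_t}} \right) \epsilon_\theta^{(t)}(x_t) \tag{8} \]
Let $c_1^{(t)} = \sqrt{1 - \alpha_{t-1}} - \sqrt{\frac{\alpha_{t-1} (1 - \alpha_t)}{\alpha_t}}$. Then we can write
\[ x_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, x_t + c_1^{(t)} \epsilon_\theta^{(t)}(x_t) \tag{9} \]
By induction, we can rewrite the above equation as:
\[ x_{T-k} = \sqrt{\frac{\alpha_{T-k}}{\alpha_T}}\, x_T + \sum_{t=T-k}^{T-1} \sqrt{\frac{\alpha_{T-k}}{\alpha_t}}\, c_1^{(t+1)} \epsilon_\theta^{(t+1)}(x_{t+1}), \quad k \in [0, .., T] \tag{10} \]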
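As a check on the induction step, Eq. (9) can be unrolled by hand for the first two steps starting from xT. The worked instance below uses only the symbols already defined and is meant as an illustration of the pattern, not as a full proof.

```latex
\begin{align*}
x_{T-1} &= \sqrt{\tfrac{\alpha_{T-1}}{\alpha_T}}\, x_T + c_1^{(T)} \epsilon_\theta^{(T)}(x_T), \\
x_{T-2} &= \sqrt{\tfrac{\alpha_{T-2}}{\alpha_{T-1}}}\, x_{T-1} + c_1^{(T-1)} \epsilon_\theta^{(T-1)}(x_{T-1}) \\
        &= \sqrt{\tfrac{\alpha_{T-2}}{\alpha_T}}\, x_T
           + \sqrt{\tfrac{\alpha_{T-2}}{\alpha_{T-1}}}\, c_1^{(T)} \epsilon_\theta^{(T)}(x_T)
           + c_1^{(T-1)} \epsilon_\theta^{(T-1)}(x_{T-1}).
\end{align*}
```

The last line is the k = 2 case of Eq. (10): the t = T − 1 term of the sum carries the factor $\sqrt{\alpha_{T-2}/\alpha_{T-1}}$, and the t = T − 2 term carries $\sqrt{\alpha_{T-2}/\alpha_{T-2}} = 1$.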
This defines a “fully-lower-triangular” inference process, where the update of xt depends on the
noise prediction network εθ applied to all subsequent states xt+1:T , in contrast to the traditional
diffusion process, which updates xt based only on xt+1 . Specifically, let h(·) represent the function
that performs the operations in Eq. (10) for a latent xt at timestep t, and let h̃(·) represent
the function that performs the same set of operations across all the timesteps simultaneously. We can
write the above set of equations as a fixed point system:
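\[ \begin{bmatrix} x_{T-1} \\ x_{T-2} \\ \vdots \\ x_0 \end{bmatrix} = \begin{bmatrix} h(x_T) \\ h(x_{T-1:T}) \\ \vdots \\ h(x_{1:T}) \end{bmatrix} \]
or,
\[ x_{0:T-1} = \tilde{h}(x_{0:T-1}; x_T) \tag{11} \]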
The above system of equations represents a DEQ with xT ∼ N (0, I) as input injection. We can
simultaneously solve for the roots of this system of equations through black-box solvers like Anderson
acceleration [5]. Let g(x0:T −1 ; xT ) = h̃(x0:T −1 ; xT ) − x0:T −1 , then we have
\[ x^*_{0:T} = \mathrm{RootSolver}(g(x_{0:T-1}; x_T)) \tag{12} \]
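The joint system in Eqs. (10) to (12) can be illustrated on scalar states. The sketch below is a toy under stated assumptions: the noise predictor is a hypothetical closed-form function rather than a trained network, alphas[0] = 1 is an assumed convention, and plain fixed-point iteration replaces the Anderson solver. It updates all timesteps at once from the current iterate and reproduces the sequential chain after at most T updates, plus one more pass to confirm a zero residual.

```python
import math

def make_alphas(T, beta_min=1e-4, beta_max=0.02):
    """alphas[t] = prod_{s<=t} (1 - beta_s) on a linear schedule; alphas[0] = 1 (assumed)."""
    alphas = [1.0]
    for s in range(T):
        beta = beta_min + (beta_max - beta_min) * s / max(T - 1, 1)
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def eps_model(x, t):
    """Hypothetical stand-in for the trained noise predictor."""
    return math.tanh(x + 0.1 * t)

def c1(alphas, t):
    """Coefficient c_1^{(t)} from Eq. (9)."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    return math.sqrt(1.0 - a_prev) - math.sqrt(a_prev * (1.0 - a_t) / a_t)

def sequential_ddim(x_T, alphas):
    """Eq. (9) applied one timestep at a time; returns [x_0, ..., x_T]."""
    T = len(alphas) - 1
    xs = [0.0] * T + [x_T]
    for t in range(T, 0, -1):
        xs[t - 1] = math.sqrt(alphas[t - 1] / alphas[t]) * xs[t] + c1(alphas, t) * eps_model(xs[t], t)
    return xs

def h_tilde(xs, alphas):
    """Eq. (10) for every timestep at once, reading only the current iterate xs."""
    T = len(alphas) - 1
    # These T predictions are independent of each other, so a real model could batch them.
    eps = [None] + [eps_model(xs[t], t) for t in range(1, T + 1)]
    new = list(xs)  # x_T is the input injection and stays fixed
    for j in range(T):  # j plays the role of T - k
        acc = math.sqrt(alphas[j] / alphas[T]) * xs[T]
        for t in range(j, T):
            acc += math.sqrt(alphas[j] / alphas[t]) * c1(alphas, t + 1) * eps[t + 1]
        new[j] = acc
    return new

def deq_ddim(x_T, alphas, tol=1e-12):
    """Iterate x <- h_tilde(x) until the residual ||h_tilde(x) - x|| falls below tol."""
    T = len(alphas) - 1
    xs = [0.0] * T + [x_T]
    for it in range(1, T + 2):
        new = h_tilde(xs, alphas)
        residual = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, xs)))
        xs = new
        if residual < tol:
            return xs, it
    return xs, T + 1
```

With `alphas = make_alphas(10)`, the states returned by `deq_ddim(0.7, alphas)` agree with `sequential_ddim(0.7, alphas)` up to floating-point error. A stronger solver such as Anderson acceleration would typically need fewer iterations than this plain iteration.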
This DEQ formulation has multiple benefits. Solving for all the equilibria simultaneously leads to a
better estimation of the intermediate latent states xt in fewer steps (i.e., ≤ t steps for xt ).
This leads to faster convergence of the sampling process, as the final sample x0 , which depends
on the latent states of all the previous time steps, has a better estimate of these intermediate latent
states. Note that by the same reasoning, the intermediate latent states xt converge faster too. Thus,
we can get images with perceptual quality comparable to DDIM in significantly fewer
steps. Of course, we also note that the computational requirements of each individual step have
significantly increased, but this is at least largely offset by the fact that the steps can be executed as
a mini-batch in parallel over each state. Empirically, in fact, we often notice significant speedup
using this approach on tasks like single image generation.
This DEQ formulation of DDIM can be extended to the stochastic generative processes of DDIM
with η > 0, including that of DDPM (referred to as DEQ-sDDIM). The key idea is to sample noise
for all the time steps along the sampling chain and treat this noise as an input injection to DEQ, in
addition to xT :
\[ x^*_{0:T} = \mathrm{RootSolver}(g(x_{0:T-1}; x_T, \epsilon_{1:T})) \tag{13} \]
where RootSolver(·) is any black-box fixed point solver, and $\epsilon_{1:T} \sim N(0, I)$ represents the input
injected noises. We discuss this formulation in more detail in Appendix D.
Algorithm 1 A naive algorithm to invert DDIM
Input: A target image x0 ∼ D, εθ (xt , t) a trained denoising diffusion model, N the total
number of epochs
▷ f denotes the sampling process in Eq. (7)
Initialize x̂T ∼ N (0, I)
for epochs from 1 to N do
    for t = T, ..., 1 do
        Sample x̂t−1 = f (x̂t ; εθ (x̂t , t))
    end for
    Take a gradient descent step on ∇x̂T ‖x̂0 − x0 ‖²F
end for
Output: x̂T

Algorithm 2 Inverting DDIM with DEQ
Input: A target image x0 ∼ D, εθ (xt , t) a trained denoising diffusion model, N the total
number of epochs
▷ g is the function in Eq. (12)
Initialize x̂0:T ∼ N (0, I)
for epochs from 1 to N do
    ▷ Disable gradient computation
    x∗0:T = RootSolver(g(x0:T −1 ; xT ))
    ▷ Enable gradient computation
    Compute Loss L(x0 , x∗0 )
    Use the 1-step grad to compute ∂L/∂xT
    Take a gradient descent step using above
end for
Output: x∗T
Given an arbitrary image x0 ∼ D, and a denoising diffusion model εθ (xt , t) trained on a dataset D,
model inversion seeks to determine the latent x̂T ∼ N (0, I) that can generate an image x̂0 identical
to the original image x0 through the generative process for DDIM described in Eq. (7). For an input
image x0 , and a generated image x̂0 , this task needs to minimize the squared-Frobenius distance
between these images:
\[ L(x_0, \hat{x}_0) = \| x_0 - \hat{x}_0 \|_F^2 \tag{14} \]
A relatively straightforward way to invert DDIM is to randomly sample xT ∼ N (0, I), and update it
via gradient descent by first estimating x0 using the generative process in Eq. (7) and backpropagating
through this process after computing the loss objective in (14). The overall process has been
summarized in Algorithm 1. This process has a large computational overhead. Every training epoch
requires a sequential sampling for all T timesteps. Optimizing through this generative process
would require the creation of a large computational graph for storing relevant intermediate variables
necessary for the backward pass. Sequential sampling further slows down the entire process.
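The cost structure of the naive inversion in Algorithm 1 can be seen in a scalar sketch. Everything below is illustrative: the toy noise predictor and its hand-written derivative are assumptions, alphas[0] = 1 is an assumed convention, x̂T starts from a fixed value instead of a Gaussian draw to keep the run reproducible, and the chain rule is written out by hand in place of an autograd package. Note that every intermediate state of the chain must be kept for the backward pass.

```python
import math

def make_alphas(T, beta_min=1e-4, beta_max=0.02):
    """alphas[t] = prod_{s<=t} (1 - beta_s) on a linear schedule; alphas[0] = 1 (assumed)."""
    alphas = [1.0]
    for s in range(T):
        beta = beta_min + (beta_max - beta_min) * s / max(T - 1, 1)
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def c1(alphas, t):
    """Coefficient multiplying the noise prediction in the deterministic DDIM update."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    return math.sqrt(1.0 - a_prev) - math.sqrt(a_prev * (1.0 - a_t) / a_t)

def eps_model(x, t):
    """Hypothetical toy noise predictor (not a trained network)."""
    return math.tanh(x)

def eps_model_grad(x, t):
    """Derivative of the toy predictor with respect to x."""
    return 1.0 - math.tanh(x) ** 2

def sample_chain(x_T, alphas):
    """Sequential sampling; all T + 1 states are stored for the backward pass."""
    T = len(alphas) - 1
    xs = [0.0] * T + [x_T]
    for t in range(T, 0, -1):
        scale = math.sqrt(alphas[t - 1] / alphas[t])
        xs[t - 1] = scale * xs[t] + c1(alphas, t) * eps_model(xs[t], t)
    return xs

def loss_grad(xs, target, alphas):
    """Chain rule through all T steps: d (x_0 - target)^2 / d x_T."""
    T = len(alphas) - 1
    grad = 2.0 * (xs[0] - target)
    for t in range(1, T + 1):
        scale = math.sqrt(alphas[t - 1] / alphas[t])
        grad *= scale + c1(alphas, t) * eps_model_grad(xs[t], t)  # d x_{t-1} / d x_t
    return grad

def invert_naive(target, alphas, x_T=0.0, lr=0.1, epochs=200):
    """Scalar version of Algorithm 1: gradient descent on x_T through the full chain."""
    losses = []
    for _ in range(epochs):
        xs = sample_chain(x_T, alphas)            # T sequential model evaluations
        losses.append((xs[0] - target) ** 2)
        x_T -= lr * loss_grad(xs, target, alphas)  # T-step backward pass
    return x_T, losses
```

Each epoch in this sketch costs T sequential forward evaluations and a T-step backward pass over stored states, which is the overhead described above.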
Alternatively, we can use the DEQ formulation to develop a much more efficient inversion method.
We provide a high-level overview of this approach in Algorithm 2. We can apply implicit function
theorem (IFT) to the fixed point, i.e., (12) to compute gradients of the loss L(x0 , x∗0 ) in (14) w.r.t. (·):
\[ \frac{\partial L}{\partial (\cdot)} = - \frac{\partial L}{\partial x^*_{0:T}} \left( J_{g_\theta}^{-1} \Big|_{x^*_{0:T}} \right) \frac{\partial \tilde{h}(x^*_{0:T-1}; x_T)}{\partial (\cdot)} \tag{15} \]
where (·) could be any of the latent states x1 , ..., xT , and $J_{g_\theta}^{-1} |_{x^*_{0:T}}$ is the inverse Jacobian of
g(x0:T −1 ; xT ) evaluated at x∗0:T . Refer to [6] for a detailed proof. Computing the inverse of the
Jacobian matrix can become computationally intractable, especially when the latent states xt are
high dimensional. Further, prior works [6, 8, 28] have reported growing instability of DEQs during
training due to the ill-conditioning of the Jacobian. Recent works [27, 26, 28, 9] suggest that we do not
need an exact gradient to train DEQs. We can instead use an approximation to Eq. (15), i.e.,
\[ \frac{\partial L}{\partial (\cdot)} = - \frac{\partial L}{\partial x^*_{0:T}}\, M\, \frac{\partial \tilde{h}(x^*_{0:T-1}; x_T)}{\partial (\cdot)} \tag{16} \]
where M is an approximation of $J_{g_\theta}^{-1} |_{x^*_{0:T}}$. For example, [27, 26, 28] show that setting M = I,
i.e., 1-step gradient, works well. In this work, we follow Geng et al. [28] to further add a damping
factor to the 1-step gradient. The forward pass is given by:
\[ x^*_{0:T} = \mathrm{RootSolver}(g(x_{0:T-1}; x_T)) \tag{17} \]
\[ x^*_{0:T} = \tau \cdot \tilde{h}(x^*_{0:T-1}; x^*_T) + (1 - \tau) \cdot x^*_{0:T} \tag{18} \]
The gradients for the backward pass can be computed through standard autograd packages. We
provide the PyTorch-style pseudocode of our approach in Appendix B. Using inexact gradients
for the backward pass has several benefits: 1) it markedly improves the training stability of DEQs;
2) our backward pass consists of a single step and is very cheap to compute, which reduces the total
training time by a significant amount. It is easy to extend the strategy used in Algorithm 2 and use
DEQs to invert DDIMs with a stochastic generative process (referred to as DEQ-sDDIM). We provide
the key steps of this approach in Algorithm 4.
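To make the difference between the exact implicit gradient and the 1-step gradient concrete, the sketch below compares them on a scalar fixed-point problem z∗ = f(z∗; x). The function f, the constant W, and the squared-error loss are hypothetical choices for illustration; this is not the pseudocode of Appendix B. For a scalar system, the implicit function theorem gives dz∗/dx = (1 − ∂f/∂z)⁻¹ ∂f/∂x, the 1-step gradient replaces the inverse factor by the identity, and the damped update of Eq. (18) with the solved state treated as a constant scales that estimate by τ.

```python
import math

W = 0.5  # hypothetical weight; |W| < 1 keeps the toy layer contractive

def f(z, x):
    """Toy fixed-point map standing in for h_tilde; not the diffusion update itself."""
    return math.tanh(W * z + x)

def solve(x, iters=200):
    """Forward pass: plain fixed-point iteration, no gradient bookkeeping needed."""
    z = 0.0
    for _ in range(iters):
        z = f(z, x)
    return z

def gradients(x, target, tau=0.1):
    """Return (exact, one_step, damped) estimates of dL/dx for L = (z* - target)^2."""
    z = solve(x)
    s = 1.0 - math.tanh(W * z + x) ** 2   # tanh derivative at the fixed point
    df_dz, df_dx = W * s, s
    dL_dz = 2.0 * (z - target)
    exact = dL_dz * df_dx / (1.0 - df_dz)  # implicit function theorem
    one_step = dL_dz * df_dx               # inverse factor replaced by the identity
    damped = tau * one_step                # one damped step, z* held constant
    return exact, one_step, damped
```

In this toy setting the 1-step estimate has the same sign as the exact gradient but a smaller magnitude, so a descent step still moves in a useful direction while avoiding any Jacobian inversion.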
5 Experiments
We consider four datasets that have images of different resolutions for our experiments: CIFAR10
(32×32) [46], CelebA (64×64) [52], LSUN Bedroom (256×256) and LSUN Outdoor Church
(256×256) [76]. For all the experiments, we use Anderson acceleration as the default fixed point
solver. We use the pretrained denoising diffusion models from Ho et al. [33] for CIFAR10, LSUN
Bedroom, and LSUN Outdoor Church, and from Song et al. [68] for CelebA. While training DEQs
for model inversion, we use the 1-step gradient Eq. (18) to compute the backward pass. The damping
factor τ for 1-step gradient is set to 0.1. All the experiments have been performed on NVIDIA RTX
A6000 GPUs. We provide additional experimental details in Appendix A. While the primary
focus in this section will be on the DDIM with a deterministic generative process, i.e., η = 0, we
also include a few key results on the stochastic version of DDIM (DEQ-sDDIM) here. More extensive
experiments can be found in Appendix D.
We verify that DEQ-DDIM converges to a fixed point by plotting the values of ‖h̃(x0:T ) − x0:T ‖2
over Anderson solver steps. As seen in Figure 1, DEQ-DDIM converges to a fixed point for generative
processes of different lengths. It is easier to reach simultaneous equilibria on smaller sequence lengths
than larger sequence lengths. However, this does not affect the quality of images generated. We
visualize the latent states of DEQ-DDIM in Figure 2. Our experiments demonstrate that DEQ-DDIM
is able to generate high-quality images in as few as 15 Anderson solver steps on diffusion chains
that were trained on a much larger number of steps T . One might note that DEQ-DDIM converges
to a limit cycle for diffusion processes with larger sequence lengths. This is not a limitation as we
only want the latent states at the last few timesteps to converge well, which happens in practice as
demonstrated in Fig. 2. Further, these residuals can be driven down by using more powerful solvers
like quasi-Newton methods, e.g., Broyden’s method.
Figure 1: DEQ-DDIM finds an equilibrium point. We plot the absolute fixed-point convergence
‖h̃(x0:T ) − x0:T ‖2 during a forward pass of DEQ for CIFAR-10 (left) and CelebA (right) for different
numbers of steps T . The shaded region indicates the maximum and minimum value encountered
during any of the 25 runs.
Figure 2: Visualization of intermediate latents xt of DEQ-DDIM after 15 forward steps with Anderson
solver for CIFAR-10 (first row, T = 500), CelebA (second row, T = 500), LSUN Bedroom (third
row, T = 50), and LSUN Outdoor Church (fourth row, T = 50). For T = 500, we visualize every
50th latent, and for T = 50, we visualize every 5th latent. In addition, we also visualize x0:4 in the
last 5 columns.
We verify that DEQ-DDIM can generate images of comparable quality to DDIM by reporting Fréchet
Inception Distance (FID) [32] in Table 1. For the forward pass of DEQ-DDIM, we run Anderson
solver for a maximum of 15 steps for each image. We report FID scores on 50,000 images, and
average time to generate an image (including GPU time) on 500 images. We note significant gains in
wall-clock time on single-shot image generation with DEQ-DDIM on images with smaller resolutions.
Specifically, DEQ-DDIM can generate images almost 2× faster than the sequential sampling of
DDIM on CIFAR-10 (32×32) and CelebA (64×64). We note that these gains vanish on sequences of
shorter lengths. This is because the number of fixed point solver iterations needed for convergence
becomes comparable to the length of the diffusion chain for small values of T . Thus, the lightweight updates
performed on short diffusion chains for sequential sampling are faster than the compute-heavy
updates in DEQ-DDIM.
We also report FID scores on DEQ-sDDIM for CIFAR10 in Table 2. We run Anderson solver for a
maximum of 50 steps for each image. We observe that while DEQ-sDDIM is slower than DDIM,
it always generates images with comparable or better FID scores. For higher levels of stochasticity,
i.e., for larger values of η, DEQ-sDDIM needs more Anderson solver iterations to converge to a
fixed point, which increases image generation wall-clock time. We include additional results in
Appendix D.2. Finally, we also find that on full-batch inference with larger batches, sequential
sampling might outperform DEQ-DDIM, as the latter would have larger memory requirements in this
case, i.e., processing smaller batches of size B might be faster than processing larger batches of size
BT .
Table 1: FID scores and time for single image generation for DDPM, DDIM and DEQ-DDIM.
Table 2: FID scores for single image generation for DDIM and DEQ-sDDIM on CIFAR10. Note
that DDPM [33] with a larger variance achieves FID scores of 133.37∗ and 32.72∗ respectively for
T = 20 and T = 50, where ∗ indicates numbers reported from Song et al. [68].
We report the minimum values of squared Frobenius norm between the recovered and target images
averaged from 100 different runs in Table 3. We report results for DEQ with η = 0 (i.e., DEQ-DDIM)
in this table, and additional results for η > 0 (i.e., DEQ-sDDIM) are reported in Figure 17. DEQ
outperforms the baseline method on all the datasets by a significant margin. We also plot the training
loss curves of DEQ-DDIM and the baseline in Figure 3. We observe that DEQ-DDIM converges faster
and has much lower loss values than the baseline method induced by DDIM. We also visualize the
images generated with the recovered latent states for DEQ-DDIM in Figure 4 and with DEQ-sDDIM
in Figure 5. It is worth noting that images generated with DEQs capture more vivid details of the
original images, like textures of foliage, crevices, and other finer details than the baseline. We include
additional results of model inversion with DEQ-sDDIM on different datasets in Appendix D.3.
                    Baseline                                   DEQ-DDIM
Dataset    T        Min loss ↓           Avg Time (mins) ↓     Min loss ↓        Avg Time (mins) ↓
CIFAR10    100      15.74 ± 8.7          49.07 ± 1.76          0.76 ± 0.35       12.99 ± 0.97
CIFAR10    10       2.59 ± 3.67          14.36 ± 0.26          0.68 ± 0.32       2.54 ± 0.41
CelebA     20       14.13 ± 5.04         30.09 ± 0.57          1.03 ± 0.37       28.09 ± 1.76
Bedroom    10       1114.49 ± 795.86     26.41 ± 0.17          36.37 ± 22.86     33.7 ± 1.05
Church     10       1674.68 ± 1432.54    29.7 ± 0.75           47.94 ± 24.78     33.54 ± 3.02
Table 3: Comparison of minimum loss and average time required to generate an image. All the
results have been reported on 100 images. See Appendix A for detailed training settings.
Figure 3: Training loss for CelebA and LSUN Bedroom over epochs. DEQ-DDIM converges in fewer
epochs, and achieves lower values of loss compared to the baseline. The shaded region indicates the
maximum and minimum value of loss encountered during any of the 100 runs.
6 Related Work
Implicit Deep Learning Implicit deep learning is an emerging field that introduces structured
methods to construct modern neural networks. Different from prior explicit counterparts defined
by hierarchy or layer stacking, implicit models take advantage of dynamical systems [43, 23, 3],
e.g., optimization [4, 72, 20, 27, 21], differential equation [16, 22, 70, 30], or fixed-point system [6,
7, 31]. For instance, Neural ODE [16] describes a continuous time-dependent system, while the Deep
Equilibrium (DEQ) model [6], which is actually path-independent, is a new type of implicit model
that outputs the equilibrium states of the underlying system, e.g., z∗ from z∗ = fθ (z∗ , x) given the
input x. This fixed-point system can be solved by black-box solvers [5, 13], and further accelerated
by the neural solver [10] at inference time. An active topic is the stability [8, 28, 9] of such a system, as
it gradually deteriorates during training despite strong performance [16, 6, 8]. DEQ has achieved
SOTA results on a wide-range of tasks like language modeling [6], semantic segmentation [7], graph
modeling [31, 51, 59, 15], object detection [73], optical flow estimation [9], robustness [74, 48], and
generative models like normalizing flow [53], with theoretical guarantees [75, 38, 25, 50].
Figure 4: Model inversion on CIFAR10, CelebA, LSUN Bedrooms and Churches, respectively. Each
triplet has the original image (left), DDIM’s inversion (middle), and DEQ-DDIM’s inversion (right).
Diffusion Models Diffusion models [67, 33, 68], or score-based generative models [69, 71], are
newly developed generative models that utilize an iterative denoising process to progressively sample
from a learned data distribution, which actually is the reverse of a forward diffusion process. They
have demonstrated impressive fidelity for text-conditioned image generation [62] and outperformed
state-of-the-art GANs on ImageNet [19]. Despite the superior practical results, diffusion models
suffer from a slow sampling speed, e.g., taking hours to generate 50k CIFAR-sized images [68]. To
accelerate the sampling from diffusion models, researchers propose to skip a part of the sampling
steps by reframing the reverse chain [68, 45, 44], or distill the trained diffusion model into a faster
one [54, 65]. In addition, the forward and backward processes in diffusion models can be formulated as
stochastic differential equations [71], bridging diffusion models with Neural ODEs [16] in implicit
deep learning. However, the community still lacks insights into the connection between DEQ and
diffusion models, which is the gap this work investigates.
Figure 5: Model inversion of DEQ-sDDIM on CIFAR10, CelebA, LSUN Bedrooms and Churches,
respectively. Each triplet displays the original image (left), and images obtained through inversion
with DEQ-sDDIM for η = 0.5 (middle), and η = 1 (right).
Model inversion Model inversion gives insights into the latent space of a generative model, as an
inability of a generative model to correctly reconstruct an image from its latent code is indicative of
its inability to model all the attributes of image correctly. Further, the ability to manipulate the latent
codes to edit high-level attributes of images finds applications in many tasks like semantic image
manipulation [77, 2], super resolution [14, 47], in-painting [18], compressed sensing [12], etc. For
generative models like GANs [29], inversion is non-trivial and requires alternate methods like learning
the mapping from an image to the latent code [11, 61, 78], and optimizing the latent code through
optimizers, e.g., both gradient-based [1] and gradient-free [34]. For diffusion models like DDPM [33],
the generative process is stochastic, which can make model inversion very challenging. Many existing
works based on diffusion models [55, 18, 71, 36, 40] edit images or solve inverse problems without
requiring full model inversion. Instead, they do so by utilizing existing understanding of diffusion
models as presented in some recent works [71, 37, 39]. Diffusion models have been widely applied
to conditional image generation [17, 18, 57, 64, 36, 66, 40, 56]. Chung et al. [18] propose a method
to reduce the number of steps in reverse conditional diffusion process through better initialization,
based on the idea of contraction theory of stochastic differential equations. Our proposed method is
orthogonal to this work; we explicitly model DDIM as a joint, multivariate fixed point system and
leverage black-box root solvers to solve for the fixed point and also allow for efficient differentiation.
7 Conclusion
We propose an approach to elegantly unify diffusion models and deep equilibrium (DEQ) models.
We model the entire sampling chain of the denoising diffusion implicit model (DDIM) as a joint,
multivariate (deep) equilibrium model. This setup replaces the traditional sequential sampling process
with a parallel one, thereby enabling us to enjoy speedup obtained from multiple GPUs. Further,
we can leverage inexact gradients to optimize the entire sampling chain quickly, which results in
significant gains in model inversion. We demonstrate the benefits of this approach on 1) single-shot
image generation, where we were able to obtain FID scores on par with or slightly better than those
of DDIM; and 2) model inversion, where we achieved much faster convergence. We also propose an
easy way to extend the DEQ formulation for deterministic DDIM to its stochastic variants. It is possible
to further speed up the sampling process by training a DEQ model to predict the noise at a particular
timestep of the diffusion chain. We can jointly optimize the noise prediction network, and the latent
variables of the diffusion chain, which we leave as future work.
8 Acknowledgements
Ashwini Pokle is supported by a grant from the Bosch Center for Artificial Intelligence.
References
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the
stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 4432–4441, 2019. (Cited on 10)
[2] Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. Styleflow: Attribute-conditioned
exploration of stylegan-generated images using conditional continuous normalizing flows. ACM
Transactions on Graphics (TOG), 40(3):1–21, 2021. (Cited on 2, 10)
[3] Brandon Amos. Tutorial on amortized optimization for learning to optimize over continuous
domains. arXiv preprint arXiv:2202.00665, 2022. (Cited on 9)
[4] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural
networks. In International Conference on Machine Learning (ICML), 2017. (Cited on 9)
[5] Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM
(JACM), 1965. (Cited on 3, 4, 9, 16, 20)
[6] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Neural Information
Processing Systems (NeurIPS), 2019. (Cited on 2, 3, 6, 9, 18)
[7] Shaojie Bai, Vladlen Koltun, and J Zico Kolter. Multiscale deep equilibrium models. Neural
Information Processing Systems (NeurIPS), 2020. (Cited on 9)
[8] Shaojie Bai, Vladlen Koltun, and J Zico Kolter. Stabilizing equilibrium models by jacobian
regularization. arXiv preprint arXiv:2106.14342, 2021. (Cited on 6, 9)
[9] Shaojie Bai, Zhengyang Geng, Yash Savani, and J Zico Kolter. Deep equilibrium optical flow
estimation. arXiv preprint arXiv:2204.08442, 2022. (Cited on 6, 9)
[10] Shaojie Bai, Vladlen Koltun, and J Zico Kolter. Neural deep equilibrium solvers. In International
Conference on Learning Representations, 2022. (Cited on 9)
[11] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and
Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint
arXiv:2005.07727, 2020. (Cited on 10)
[12] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using
generative models. In International Conference on Machine Learning, pages 537–546. PMLR,
2017. (Cited on 10)
[13] Charles G Broyden. A class of methods for solving nonlinear simultaneous equations. Mathe-
matics of computation, 1965. (Cited on 3, 9, 18)
[14] Kelvin CK Chan, Xintao Wang, Xiangyu Xu, Jinwei Gu, and Chen Change Loy. Glean:
Generative latent bank for large-factor image super-resolution. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 14245–14254, 2021. (Cited on
10)
[15] Qi Chen, Yifei Wang, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Optimization-induced
graph implicit nonlinear diffusion. In International Conference on Machine Learning, pages
3648–3661. PMLR, 2022. (Cited on 9)
[16] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary
differential equations. In Neural Information Processing Systems (NeurIPS), 2018. (Cited on 9)
[17] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon.
Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv preprint
arXiv:2108.02938, 2021. (Cited on 10)
[18] Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Come-closer-diffuse-faster: Accelerating
conditional diffusion models for inverse problems through stochastic contraction. arXiv preprint
arXiv:2112.05146, 2021. (Cited on 10)
[19] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.
Advances in Neural Information Processing Systems, 34, 2021. (Cited on 1, 9)
[20] Josip Djolonga and Andreas Krause. Differentiable learning of submodular models. Advances
in Neural Information Processing Systems, 30, 2017. (Cited on 9)
[21] Priya L. Donti, David Rolnick, and J Zico Kolter. DC3: A learning method for optimization
with hard constraints. In International Conference on Learning Representations (ICLR), 2021.
(Cited on 9)
[22] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In Neural
Information Processing Systems (NeurIPS), 2019. (Cited on 9)
[23] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, and Armin Askari. Implicit deep learning.
arXiv:1908.06315, 2019. (Cited on 9)
[24] Thorsten Falk, Dominic Mai, Robert Bensch, Özgün Çiçek, Ahmed Abdulkadir, Yassine
Marrakchi, Anton Böhm, Jan Deubner, Zoe Jäckel, Katharina Seiwald, et al. U-net: deep
learning for cell counting, detection, and morphometry. Nature methods, 16(1):67–70, 2019.
(Cited on 16)
[25] Zhili Feng and J Zico Kolter. On the neural tangent kernel of equilibrium models, 2021. (Cited
on 9)
[26] Samy Wu Fung, Howard Heaton, Qiuwei Li, Daniel McKenzie, Stanley Osher, and Wotao Yin.
Fixed point networks: Implicit depth models with jacobian-free backprop. arXiv e-prints, pages
arXiv–2103, 2021. (Cited on 6)
[27] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention
better than matrix decomposition? In International Conference on Learning Representations
(ICLR), 2021. (Cited on 6, 9)
[28] Zhengyang Geng, Xin-Yu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training
implicit models. Neural Information Processing Systems (NeurIPS), 2021. (Cited on 6, 9, 16)
[29] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural
information processing systems, 27, 2014. (Cited on 3, 10)
[30] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured
state spaces. In International Conference on Learning Representations (ICLR), 2022. (Cited on
9)
[31] Fangda Gu, Heng Chang, Wenwu Zhu, Somayeh Sojoudi, and Laurent El Ghaoui. Implicit
Graph Neural Networks. In Neural Information Processing Systems (NeurIPS), pages 11984–
11995, 2020. (Cited on 9)
[32] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems, 30, 2017. (Cited on 7)
[33] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Neural
Information Processing Systems (NeurIPS), 2020. (Cited on 2, 3, 6, 8, 9, 10, 16, 19)
[34] Minyoung Huh, Richard Zhang, Jun-Yan Zhu, Sylvain Paris, and Aaron Hertzmann. Transforming
and projecting images into class-conditional generative networks. In European Conference
on Computer Vision, pages 17–34. Springer, 2020. (Cited on 10)
[35] Thibaut Issenhuth, Ugo Tanielian, Jérémie Mary, and David Picard. Edibert, a generative model
for image editing. arXiv preprint arXiv:2111.15264, 2021. (Cited on 2)
[36] Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jon Tamir.
Robust compressed sensing mri with deep generative priors. Advances in Neural Information
Processing Systems, 34:14938–14954, 2021. (Cited on 10)
[37] Zahra Kadkhodaie and Eero P Simoncelli. Solving linear inverse problems using the prior
implicit in a denoiser. arXiv preprint arXiv:2007.13640, 2020. (Cited on 10)
[38] Kenji Kawaguchi. On the Theory of Implicit Deep Learning: Global Convergence with Implicit
Layers. In International Conference on Learning Representations (ICLR), 2020. (Cited on 9)
[39] Bahjat Kawar, Gregory Vaksman, and Michael Elad. Snips: Solving noisy inverse problems
stochastically. Advances in Neural Information Processing Systems, 34:21757–21769, 2021.
(Cited on 10)
[40] Gwanghyun Kim and Jong Chul Ye. Diffusionclip: Text-guided image manipulation using
diffusion models. arXiv preprint arXiv:2110.02711, 2021. (Cited on 10)
[41] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014. (Cited on 16)
[42] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.
arXiv preprint arXiv:2107.00630, 2021. (Cited on 1)
[43] J. Zico Kolter, David Duvenaud, and Matthew Johnson. Deep implicit layers tutorial - neural
ODEs, deep equilibrium models, and beyond. Neural Information Processing Systems Tutorial,
2020. (Cited on 9, 16)
[44] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. arXiv preprint
arXiv:2106.00132, 2021. (Cited on 1, 9)
[45] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile
diffusion model for audio synthesis. In International Conference on Learning Representations
(ICLR), 2021. (Cited on 9)
[46] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report,
Citeseer, 2009. (Cited on 2, 6)
[47] Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and
Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models.
Neurocomputing, 2022. (Cited on 10)
[48] Mingjie Li, Yisen Wang, and Zhouchen Lin. Cerdeq: Certifiable deep equilibrium model. In
International Conference on Machine Learning, 2022. (Cited on 9)
[49] Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, and Sanja Fidler.
Editgan: High-precision semantic image editing. Advances in Neural Information Processing
Systems, 34:16331–16345, 2021. (Cited on 2)
[50] Zenan Ling, Xingyu Xie, Qiuhao Wang, Zongpeng Zhang, and Zhouchen Lin. Global convergence
of over-parameterized deep equilibrium models. arXiv preprint arXiv:2205.13814, 2022.
(Cited on 9)
[51] Juncheng Liu, Kenji Kawaguchi, Bryan Hooi, Yiwei Wang, and Xiaokui Xiao. EIGNN: Efficient
infinite-depth graph neural networks. In Neural Information Processing Systems (NeurIPS),
2021. (Cited on 9)
[52] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the
wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
(Cited on 2, 6)
[53] Cheng Lu, Jianfei Chen, Chongxuan Li, Qiuhao Wang, and Jun Zhu. Implicit normalizing flows.
In International Conference on Learning Representations (ICLR), 2021. (Cited on 9)
[54] Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for
improved sampling speed. ArXiv, abs/2101.02388, 2021. (Cited on 1, 9)
[55] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon.
Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint
arXiv:2108.01073, 2021. (Cited on 2, 10)
[56] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew,
Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing
with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. (Cited on 10)
[57] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic
models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
(Cited on 1, 10)
[58] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar.
Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022. (Cited
on 2)
[59] Junyoung Park, Jinhyun Choo, and Jinkyoo Park. Convergent graph solvers. arXiv preprint
arXiv:2106.01680, 2021. (Cited on 9)
[60] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
pytorch. 2017. (Cited on 16)
[61] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M Álvarez. Invertible
conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016. (Cited on 10)
[62] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-
conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. (Cited
on 9)
[63] Ardavan Saeedi, Matthew Hoffman, Stephen DiVerdi, Asma Ghandeharioun, Matthew Johnson,
and Ryan Adams. Multimodal prediction and personalization of photo edits with deep generative
models. In International Conference on Artificial Intelligence and Statistics, pages 1309–1317.
PMLR, 2018. (Cited on 2)
[64] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad
Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636,
2021. (Cited on 10)
[65] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.
In International Conference on Learning Representations (ICLR), 2022. (Cited on 1, 9)
[66] Hiroshi Sasaki, Chris G Willcocks, and Toby P Breckon. Unit-ddpm: Unpaired image translation
with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358, 2021. (Cited
on 10)
[67] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. In International Conference on Machine
Learning (ICML), 2015. (Cited on 2, 3, 9, 19)
[68] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv
preprint arXiv:2010.02502, 2020. (Cited on 1, 2, 3, 4, 6, 8, 9, 16, 19, 22)
[69] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. Advances in Neural Information Processing Systems, 32, 2019. (Cited on 1, 9)
[70] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations (ICLR), 2021. (Cited on 9)
[71] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations (ICLR), 2021. (Cited on 9, 10)
[72] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. Satnet: Bridging deep learning and
logical reasoning using a differentiable satisfiability solver. In International Conference on
Machine Learning (ICML), 2019. (Cited on 9)
[73] Tiancai Wang, Xiangyu Zhang, and Jian Sun. Implicit Feature Pyramid Network for Object
Detection. arXiv preprint arXiv:2012.13563, 2020. (Cited on 9)
[74] Colin Wei and J Zico Kolter. Certified robustness for deep equilibrium models via interval bound
propagation. In International Conference on Learning Representations, 2022. (Cited on 9)
[75] Ezra Winston and J. Zico Kolter. Monotone operator equilibrium networks. In Neural Information
Processing Systems (NeurIPS), 2020. (Cited on 9)
[76] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: construction of a
large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365,
2015. (Cited on 6)
[77] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image
editing. In European conference on computer vision, pages 592–608. Springer, 2020. (Cited on
2, 10)
[78] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual
manipulation on the natural image manifold. In European conference on computer vision, pages
597–613. Springer, 2016. (Cited on 10)
A Experimental Details
In this section, we present detailed settings for all the experiments in Section 5.
Architecture We use exactly the same U-Net [24] architecture for $\epsilon_\theta(x_t, t)$ as the one previously
used by Ho et al. [33] and Song et al. [68]. We use pretrained models from Ho et al. [33] for CIFAR10,
LSUN Bedrooms and Outdoor Churches, and from Song et al. [68] for CelebA.
General setting We follow the linear selection procedure to select a subsequence of timesteps
$\tau_S \subset T$ for all the datasets except CIFAR10, i.e., we select timesteps such that $\tau_i = \lfloor ci \rfloor$ for some
constant $c$. For CIFAR10, we select timesteps such that $\tau_i = \lfloor ci^2 \rfloor$ for some $c$. The constant $c$ is selected
so that the last timestep $\tau_{-1}$ is close to $T$. We use Anderson acceleration [5, 43] as our fixed-point solver for all the
experiments. We set the solver's exit tolerance on the equilibrium error to 1e-3 and the history length to 5.
We allow a maximum of 15 solver forward steps in all the experiments with DEQ-DDIM, and a
maximum of 50 solver forward steps for DEQ-sDDIM. Finally, we use PyTorch's built-in DataParallel
module to handle parallelization. We use up to 4 NVIDIA Quadro RTX 8000 or RTX A6000 GPUs
for all our experiments.
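The two selection rules can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the experiments: the function name select_timesteps and the specific choice c = (T − 1)/(S − 1) (or c = (T − 1)/(S − 1)² for the quadratic rule), which makes the last selected timestep equal to T − 1, are assumptions made here for concreteness.

```python
def select_timesteps(T, S, schedule="linear"):
    """Pick S timesteps out of {0, ..., T-1}.

    linear:    tau_i = floor(c * i)     with c = (T - 1) / (S - 1)
    quadratic: tau_i = floor(c * i**2)  with c = (T - 1) / (S - 1)**2
    Integer floor division is used so that the floor is computed exactly.
    """
    if S < 2:
        raise ValueError("need at least two timesteps")
    if schedule == "linear":
        return [(T - 1) * i // (S - 1) for i in range(S)]
    if schedule == "quadratic":
        return [(T - 1) * i * i // (S - 1) ** 2 for i in range(S)]
    raise ValueError("unknown schedule: " + schedule)


linear_steps = select_timesteps(1000, 10, "linear")
quadratic_steps = select_timesteps(1000, 10, "quadratic")
print(linear_steps)
print(quadratic_steps)
```

The quadratic rule places more of the selected timesteps near t = 0. For very short chains it can produce repeated indices, which would need to be de-duplicated before use.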
Training details for model inversion We implement and test the code in PyTorch version 1.11.0.
We use the Adam [41] optimizer with a learning rate of 0.01. We train DEQs for 400 epochs on
CIFAR10 and CelebA, and for 500 epochs on LSUN Bedroom and LSUN Outdoor Church. The
baseline is trained for 1000 epochs on CIFAR10 with T = 100, for 3000 epochs on CIFAR10 with
T = 10, for 2500 epochs on CelebA, and for 2000 epochs on LSUN Bedroom and LSUN Outdoor
Church. At the beginning of the inversion procedure with DEQ-DDIM, we sample $x_T \sim \mathcal{N}(0, I)$ and
initialize the latents at all (or the subsequence of) timesteps to this value. For inversion with DEQ-sDDIM,
we also sample $\epsilon_{1:T} \sim \mathcal{N}(0, I)$. We stop the training as soon as the loss falls below 0.5 for
CIFAR10, and below 2 for other datasets.
B Pseudocode
We provide PyTorch-style [60] pseudocode to invert DDIM with the DEQ approach in Algorithm 3.
Note that we use the phantom gradient [28] to compute inexact gradients.
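As a rough illustration of how a damped, unrolled gradient estimate such as the phantom gradient [28] relates to the exact implicit gradient, consider a scalar linear fixed-point map f(z, x) = a z + x with |a| < 1, whose fixed point is z* = x / (1 − a). The toy map and the helper name phantom_grad are assumptions made for this sketch and are not part of the inversion pseudocode; the sketch only shows that differentiating through k damped steps started at a detached z* yields a truncated series that approaches the exact gradient 1 / (1 − a).

```python
def phantom_grad(a, tau, k):
    """Derivative dz_k/dx after k damped steps z <- (1 - tau) * z + tau * (a * z + x).

    The starting point z_0 = z* is treated as a constant (detached), so the
    derivative starts at 0 and each step adds tau times the direct dependence on x.
    """
    d = 0.0
    for _ in range(k):
        d = (1.0 - tau + tau * a) * d + tau
    return d


a = 0.5
exact = 1.0 / (1.0 - a)                         # implicit-function-theorem gradient dz*/dx
jacobian_free = phantom_grad(a, tau=1.0, k=1)   # a single undamped step
few_steps = phantom_grad(a, tau=0.8, k=5)
many_steps = phantom_grad(a, tau=0.8, k=50)
print(exact, jacobian_free, few_steps, many_steps)
```

With one undamped step the estimate reduces to the Jacobian-free gradient, and with more steps it moves toward the exact value, so the number of unrolled steps and the damping factor trade accuracy against backward-pass cost.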
Effect of length of subsequence τS during training We study the effect of the length of the
diffusion chain on the convergence rate of optimization for model inversion in Fig. 6. We note that for
sequential sampling, loss decreases slightly faster for the smaller diffusion chain (τS = 10 and 20)
than the longer one (τS = 100) for the baseline. However, for DEQ-DDIM, the length of diffusion
chain doesn’t seem to have an effect on the rate of convergence as the loss curves for τS = 100 and
τS = 10 and 20 are nearly identical.
Effect of length of subsequence τS during sampling All the images in Fig. 4 are sampled with a
subsequence of timesteps τS ⊂ T , i.e., the number of latents in the diffusion chain used for training
and the number of timesteps used for sampling an image from the recovered x̂T were equal. We
investigate if sampling with τS = T = 1000 results in images with a better perceptual quality for the
baseline. We display the recovered images for LSUN Bedrooms in Fig. 7. The length of diffusion
chain during training time is τS = 10. We note that using more sampling steps does not result in
inverted images that are closer to the original image. In some cases, samples generated with more
sampling steps have some additional artifacts that are not present in the original image.
Algorithm 3 PyTorch-style pseudocode for inversion with DEQ-DDIM
# x0: a target image for inversion
# all_xt: all the latents in the diffusion chain or its subsequence; xT is at index 0, x0 is at the last index
# func: a function that performs the required operations on the fixed-point system for a single timestep
# solver: a fixed-point solver like Anderson acceleration
# optimizer: an optimization algorithm like Adam
# tau: the damping factor τ for phantom gradient
# num_epochs: the max number of epochs
Figure 6: Effect of length of diffusion chain on optimization process for model inversion with DEQ-DDIM.
We display training loss curves for (left) baseline and (right) DEQ for CIFAR10 (top row)
and CelebA (bottom row). It is slightly easier to optimize a smaller diffusion chain for the baseline.
The error bar indicates the maximum and minimum value of loss encountered during any of the 100
runs for CIFAR10, and 25 runs for CelebA.
Figure 7: Model inversion on LSUN Bedrooms (Baseline): Using more sampling steps does not
generate images that are closer to the ground truth image. Each triplet has the original image (left),
images sampled with T = 10 (middle) and T = 1000 (right). The length of diffusion chain at
training time was 10.
Comparing exact vs inexact gradients for backward pass of DEQ-DDIM The choice of gradient
calculation for the backward pass of DEQ affects both the training stability and convergence of
DEQ-DDIM. Here, we compare the performance of exact and inexact gradients. Computing
the inverse of the Jacobian in Eq. (15) is difficult because the Jacobian can be prohibitively large.
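We follow Bai et al. [6] to compute exact gradients using the following linear system
$$\left( J_{g_\theta}^{\top} \Big|_{x^*_{0:T}} \right) v^{\top} + \left( \frac{\partial \mathcal{L}}{\partial x^*_{0:T}} \right)^{\top} = 0 \tag{19}$$
so that the vector-Jacobian product $v = -\frac{\partial \mathcal{L}}{\partial x^*_{0:T}} J_{g_\theta}^{-1}$ is obtained without forming the inverse explicitly.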
We use Broyden's method [13] to efficiently solve for $v^\top$ in this linear system. We compare it against
inexact gradients, i.e., the Jacobian-free gradient used in Algorithm 2. We observe that training DEQ-DDIM
with exact gradients becomes increasingly unstable as the training proceeds, especially for
larger learning rates like 0.005. In contrast, with inexact gradients we can converge faster using larger
learning rates like 0.01.
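The difference between the two gradient choices can be made concrete on a tiny linear fixed-point system. The sketch below is an illustrative assumption, not the experimental setup: it uses a hypothetical 2×2 map h(x) = A x + b, the loss L = ½‖x* − y‖², and plain fixed-point iteration in place of Broyden's method. With g = h − x, the exact gradient with respect to b is the vector v satisfying (A − I)ᵀ v + u = 0, where u = ∂L/∂x*, while the Jacobian-free gradient simply uses u.

```python
A = [[0.3, 0.2], [0.1, 0.4]]   # Jacobian of the toy map h(x) = A x + b (contractive)
b = [1.0, 2.0]
y = [0.0, 0.0]                 # target for the loss L = 0.5 * ||x* - y||^2


def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def transpose(M):
    return [list(row) for row in zip(*M)]


# Forward pass: solve x* = A x* + b by plain fixed-point iteration.
x = [0.0, 0.0]
for _ in range(200):
    x = [hx + bi for hx, bi in zip(matvec(A, x), b)]

u = [xi - yi for xi, yi in zip(x, y)]          # dL/dx*

# Backward pass: (A - I)^T v + u = 0, rewritten as the fixed point v = A^T v + u.
At = transpose(A)
v = [0.0, 0.0]
for _ in range(200):
    v = [avi + ui for avi, ui in zip(matvec(At, v), u)]

exact_grad_b = v           # exact dL/db
jacobian_free_grad_b = u   # inexact gradient: inverse Jacobian replaced by identity
alignment = sum(p * q for p, q in zip(exact_grad_b, jacobian_free_grad_b))
print(exact_grad_b, jacobian_free_grad_b, alignment)
```

In this toy case the two gradients differ in magnitude and direction but have a positive inner product, so the inexact gradient is still a descent direction while avoiding the second linear solve.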
Figure 8: Training DEQ-DDIM becomes increasingly unstable with exact gradients for larger learning
rates like 0.005 (left, blue curve). It is still possible to train DEQ-DDIM with exact gradients
for a smaller learning rate like 0.001 (left, orange curve). However, inexact gradients (right, orange
curve) converge faster than exact gradients (right, blue curve) on larger learning rates like 0.01. We
report results on 10 runs for the CelebA dataset with diffusion chains of length 20.
of the intermediate states of the diffusion chain at different solver steps for the two initialization
schemes as observed in Figure 10.
Figure 9: Choice of initialization is critical in DEQs: Initializing DEQs with xT results in much
faster convergence compared to zero initialization. We report the convergence results on CIFAR10
using 5 runs on diffusion chains of length 50.
Figure 10: Visualization of intermediate latents xt for CelebA for different choices of initialization
for T = 50: (first row) initialization with xT after 10 Anderson solver steps, (second row) zero
initialization after 10 Anderson solver steps, (third row) zero initialization after 30 Anderson solver
steps. We visualize every 5th latent in x0:T −1 at the given solver step (10 or 30). We also visualize
x0:4 in the last 5 columns. Initialization with xT produces visually appealing images much faster.
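A more general and highly stochastic generative process is to sample and integrate noise at every time
step, given by [68]:
$$x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}} \right) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_\theta^{(t)}(x_t) + \sigma_t \epsilon_t \tag{20}$$
where $\epsilon_t \sim \mathcal{N}(0, I)$. Here, $\sigma_t = 0$ corresponds to a deterministic DDIM while
$\sigma_t = \sqrt{\frac{1 - \alpha_{t-1}}{1 - \alpha_t}} \sqrt{1 - \frac{\alpha_t}{\alpha_{t-1}}}$ corresponds to a stochastic DDIM. Empirically, this is parameterized as
$\sigma_t(\eta) = \eta \sqrt{\frac{1 - \alpha_{t-1}}{1 - \alpha_t}} \sqrt{1 - \frac{\alpha_t}{\alpha_{t-1}}}$, where $\eta$ is a hyperparameter to control stochasticity. Note that
$\eta = 1$ corresponds to a DDPM [67, 33].
Rearranging the terms in Eq. (20), we get
$$x_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, x_t + \left( \sqrt{1 - \alpha_{t-1} - \sigma_t^2} - \sqrt{\frac{\alpha_{t-1} (1 - \alpha_t)}{\alpha_t}} \right) \epsilon_\theta^{(t)}(x_t) + \sigma_t \epsilon_t \tag{21}$$
Let $c_1^{(t)} = \sqrt{1 - \alpha_{t-1} - \sigma_t^2} - \sqrt{\frac{\alpha_{t-1} (1 - \alpha_t)}{\alpha_t}}$. Then we can write
$$x_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, x_t + c_1^{(t)} \epsilon_\theta^{(t)}(x_t) + \sigma_t \epsilon_t \tag{22}$$
By induction, we can rewrite the above equation as:
$$x_{T-k} = \sqrt{\frac{\alpha_{T-k}}{\alpha_T}}\, x_T + \sum_{t=T-k}^{T-1} \sqrt{\frac{\alpha_{T-k}}{\alpha_t}} \left( c_1^{(t+1)} \epsilon_\theta^{(t+1)}(x_{t+1}) + \sigma_{t+1} \epsilon_{t+1} \right), \quad k \in [0, .., T] \tag{23}$$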
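The rearrangement from Eq. (20) to Eq. (22) can be checked numerically for a single update. The sketch below works on scalars and treats the output of the noise-prediction network as a given number; the function names and the sample values of α_t, α_{t−1}, x_t, and the noise are assumptions chosen purely for illustration.

```python
import math


def sigma(eta, a_t, a_prev):
    """sigma_t(eta) = eta * sqrt((1 - a_{t-1}) / (1 - a_t)) * sqrt(1 - a_t / a_{t-1})."""
    return eta * math.sqrt((1 - a_prev) / (1 - a_t)) * math.sqrt(1 - a_t / a_prev)


def step_eq20(x_t, eps_pred, noise, a_t, a_prev, s):
    """One update in the form of Eq. (20): predicted x_0 plus direction plus noise."""
    x0_pred = (x_t - math.sqrt(1 - a_t) * eps_pred) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0_pred + math.sqrt(1 - a_prev - s * s) * eps_pred + s * noise


def step_eq22(x_t, eps_pred, noise, a_t, a_prev, s):
    """The same update in the rearranged form of Eq. (22)."""
    c1 = math.sqrt(1 - a_prev - s * s) - math.sqrt(a_prev * (1 - a_t) / a_t)
    return math.sqrt(a_prev / a_t) * x_t + c1 * eps_pred + s * noise


a_t, a_prev = 0.5, 0.8              # illustrative values with a_t < a_prev
x_t, eps_pred, noise = 0.3, -0.7, 1.1

gaps = []
for eta in (0.0, 0.5, 1.0):
    s = sigma(eta, a_t, a_prev)
    gaps.append(abs(step_eq20(x_t, eps_pred, noise, a_t, a_prev, s)
                    - step_eq22(x_t, eps_pred, noise, a_t, a_prev, s)))
print(gaps)
```

The two forms agree up to floating-point rounding for every η, and setting η = 0 gives σ_t = 0, which recovers the deterministic update.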
This again defines a “fully-lower-triangular” inference process, where the update of $x_t$ depends on
the noise prediction network $\epsilon_\theta$ applied to all subsequent states $x_{t+1:T}$.
Following the notation used in the main paper, let $h(\cdot)$ represent the function that performs the
operations in Eq. (23) for a latent $x_t$ at timestep $t$, let $\tilde{h}(\cdot)$ represent the function that
performs the same set of operations across all the timesteps simultaneously, and let $\epsilon_{1:T}$ represent the
noise injected into the diffusion process at every timestep. We can write the above set of equations as
a fixed point system:
$$\begin{bmatrix} x_{T-1} \\ x_{T-2} \\ \vdots \\ x_0 \end{bmatrix} = \begin{bmatrix} h(x_T; \epsilon_T) \\ h(x_{T-1:T}; \epsilon_{T-1:T}) \\ \vdots \\ h(x_{1:T}; \epsilon_{1:T}) \end{bmatrix}$$
or,
$$x_{0:T-1} = \tilde{h}(x_{0:T-1}; x_T, \epsilon_{1:T}) \tag{24}$$
The above system of equations represents a DEQ with $x_T$ and $\epsilon_{1:T}$ as input injections. We refer to this
formulation of DEQ for stochastic DDIM with $\eta > 0$ as DEQ-sDDIM.
A major difference from DEQ-DDIM is that DEQ-sDDIM can exploit the noises $\epsilon_{1:T}$ sampled prior
to fixed point solving as additional input injections. The insight here is that the noises along the
sampling chain are independent of each other, thus allowing us to sample all the noises simultaneously
and convert a highly stochastic autoregressive sampling process into a deterministic
“fully-lower-triangular” DEQ.
We can now use black-box solvers like Anderson acceleration [5] to solve for the roots of this
system of equations, as in the previous case. Let $g(x_{0:T-1}; x_T, \epsilon_{1:T}) = \tilde{h}(x_{0:T-1}; x_T, \epsilon_{1:T}) - x_{0:T-1}$,
then we have
$$x^*_{0:T} = \mathrm{RootSolver}(g(x_{0:T-1}); x_T, \epsilon_{1:T}) \tag{25}$$
where RootSolver(·) is any black-box fixed point solver.
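To make the joint system concrete, the sketch below solves it on a scalar toy chain and compares the result with ordinary sequential sampling. Everything specific here is an assumption for illustration: the α schedule, the stand-in eps_theta for the noise-prediction network, η = 0.5, and the use of plain fixed-point iteration instead of Anderson acceleration. The noises are sampled once up front, so the joint map is deterministic.

```python
import math
import random

T, ETA = 8, 0.5
alpha = [0.99 * 0.8 ** t for t in range(T + 1)]   # toy schedule, decreasing in t


def eps_theta(x, t):
    """Hypothetical stand-in for the noise-prediction network."""
    return math.tanh(x) / (1.0 + 0.1 * t)


def sigma(t):
    a_t, a_prev = alpha[t], alpha[t - 1]
    return ETA * math.sqrt((1 - a_prev) / (1 - a_t)) * math.sqrt(1 - a_t / a_prev)


def c1(t):
    a_t, a_prev = alpha[t], alpha[t - 1]
    return math.sqrt(max(1 - a_prev - sigma(t) ** 2, 0.0)) - math.sqrt(a_prev * (1 - a_t) / a_t)


def sequential(x_T, noise):
    """Serial sampling with the one-step update, from t = T down to t = 1."""
    x = [0.0] * T + [x_T]
    for t in range(T, 0, -1):
        x[t - 1] = (math.sqrt(alpha[t - 1] / alpha[t]) * x[t]
                    + c1(t) * eps_theta(x[t], t) + sigma(t) * noise[t])
    return x


def h_tilde(x, x_T, noise):
    """Unrolled update applied to every timestep at once, reading only the old iterate."""
    full = x + [x_T]
    new = []
    for s in range(T):                       # s plays the role of T - k
        acc = math.sqrt(alpha[s] / alpha[T]) * x_T
        for t in range(s, T):
            acc += math.sqrt(alpha[s] / alpha[t]) * (
                c1(t + 1) * eps_theta(full[t + 1], t + 1) + sigma(t + 1) * noise[t + 1])
        new.append(acc)
    return new


def solve(x_T, noise, max_iter=50, tol=1e-12):
    x = [x_T] * T                            # initialise every latent with x_T
    for n in range(1, max_iter + 1):
        new = h_tilde(x, x_T, noise)
        res = math.sqrt(sum((p - q) ** 2 for p, q in zip(new, x)))
        x = new
        if res < tol:
            break
    return x, n


rng = random.Random(0)
x_T = rng.gauss(0.0, 1.0)
noise = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(T)]   # noise[1..T], sampled up front
x_seq = sequential(x_T, noise)
x_fp, n_iter = solve(x_T, noise)
max_gap = max(abs(p - q) for p, q in zip(x_seq[:T], x_fp))
print(n_iter, max_gap)
```

Because the system is lower triangular, plain iteration fixes one more latent per sweep and reaches the sequential result after at most T sweeps (plus one sweep to detect convergence). An accelerated solver aims to reach the same fixed point in fewer sweeps, each of which evaluates all timesteps in parallel.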
We verify that our DEQ version for stochastic DDIM converges to a fixed point by plotting values of
$\|\tilde{h}(x_{0:T}) - x_{0:T}\|_2$ over Anderson solver steps in Figure 11. As one would expect, we need more
solver steps to solve for the fixed point given higher values of $\eta$.
We verify that DEQ-sDDIM can generate images that are on par with the original DDIM by computing
FID scores on 50,000 sampled images. We report our results in Table 2 and Table 4. We observe that
our FID scores are comparable to or slightly better than those from sequential DDIM. We also visualize
images generated from the same latent xT at different levels of stochasticity controlled through η.
We display our generated images in Figure 12 and Figure 13.
We report the minimum values of squared Frobenius norm between the recovered and target images
averaged from 25 different runs in Figure 17. We use the same hyperparameters as the ones used for
training DEQ models for DDIM in these experiments. DEQ-sDDIM is able to achieve low values of
the reconstruction loss even for large values of η like 1 as noted in Figure 17. We also plot training
loss curves for different values of η in Figure 16 on CIFAR10. We note that it indeed takes longer
to invert DEQ-sDDIM for higher values of η. This is primarily because the fixed point solver
needs more iterations to converge. Despite this, we obtain impressive model inversion
results on CIFAR10 and CelebA. We visualize images generated with the recovered latent states in
Figure 14 and Figure 15.
Figure 11: DEQ-sDDIM finds an equilibrium point. We plot the absolute fixed-point convergence
$\|\tilde{h}(x_{0:T}) - x_{0:T}\|_2$ during a forward pass of DEQ for CelebA (left) and LSUN Bedrooms (right)
for different number of steps T . The shaded region indicates the maximum and minimum value
encountered during any of the 10 runs.
Figure 12: CelebA samples from DEQ-sDDIM for T = 500. (left to right) We display images from
the same xT for different levels of stochasticity with η = 0, 0.25, 0.5, 0.75, and 1.
Figure 13: LSUN Bedrooms samples from DEQ-sDDIM for T = 25. (left to right) We display
images from the same xT for different levels of stochasticity with η = 0, 0.25, 0.5, 0.75, and 1.
               FID Scores            Time (in seconds)
η      T       DDIM     DEQ-sDDIM    DDIM     DEQ-sDDIM
0.2    20      13.85    13.52        0.42     1.53
0.5    20      15.67    15.27        0.42     2.17
1      20      25.85    25.31        0.42     2.35
0.2    50      9.33     8.66         1.05     4.17
0.5    50      10.75    9.73         1.05     5.18
1      50      18.22    15.57        1.05     8.59
Table 4: FID scores for single image generation using stochastic DDIM and DEQ-sDDIM on CelebA.
Note that DDPM with a larger variance achieves FID scores of 183.83∗ and 71.71∗ on T = 20 and
T = 50, respectively, where ∗ indicates the numbers from Song et al. [68].
Figure 14: Results of model inversion of DEQ-sDDIM on CIFAR10: For every set of four images
from left to right we display the original image, and images obtained through inversion for η = 0,
η = 0.5, and η = 1.
Figure 15: Results of model inversion of DEQ-sDDIM on CelebA: For every set of four images
from left to right we display the original image, and images obtained through inversion for η = 0,
η = 0.5, and η = 1.
Figure 18: Visualization of intermediate latents xt after 15 forward steps with Anderson solver for
CIFAR-10. Each set of three consecutive rows displays intermediate latents for different number of
diffusion steps: (first row) T = 1000, (second row) T = 500, and (third row) T = 50. For T = 1000,
we visualize every 100th latent; for T = 500, every 50th latent; and for T = 50,
every 5th latent. In addition, we also visualize x0:4 in the last 5 columns.
Figure 19: Model inversion on CIFAR10. Each triplet has the original image (left), DDIM’s inversion
(middle), and DEQ’s inversion (right). The number of sampling steps for all the images is T = 100.
Figure 20: Model inversion on CelebA. Finer details like hair, background, and texture of skin are
better captured by DEQ. Each triplet has the original image (left), DDIM’s inversion (middle), and
DEQ’s inversion (right). The number of sampling steps for all the generated images is T = 10.
Figure 21: Model inversion on LSUN Bedrooms. Each triplet has the original image (left), DDIM’s
inversion (middle), and DEQ-DDIM’s inversion (right). The number of sampling steps for all the
generated images is T = 10.
Figure 22: Model inversion on LSUN Outdoor Churches. Each triplet has the original image (left),
DDIM’s inversion (middle), and DEQ-DDIM’s inversion (right). The number of sampling steps for
all the generated images is T = 10.
Figure 23: Visualization of model inversion on CIFAR10. The first column is the original image and
the last column is the final generated image after 1000 steps for the baseline, and 500 steps for DEQ-
DDIM. The 6 columns in between contain images sampled from xT after n training updates where n
is 0 (initialization), 1, 50, 100, 150, and 200 for DEQ-DDIM (first row) and baseline (second row).
Figure 24: Visualization of model inversion on LSUN Bedrooms. The first column is the original
image and the last column is the final generated image after 2000 steps for the baseline, and 500
steps for DEQ-DDIM. The 6 columns in between contain images sampled from xT after n training
updates where n is 0 (initialization), 1, 50, 100, 150, and 200 for DEQ-DDIM (first row) and baseline
(second row).
Figure 25: Visualization of model inversion on LSUN Churches. The first column is the original
image and the last column is the final generated image after 2000 steps for the baseline, and 500
steps for DEQ-DDIM. The 6 columns in between contain images sampled from xT after n training
updates where n is 0 (initialization), 1, 50, 100, 150, and 200 for DEQ-DDIM (first row) and baseline
(second row).