Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency
Bowen Song1∗, Soo Min Kwon1∗, Zecheng Zhang2, Xinyu Hu3, Qing Qu1, Liyue Shen1
1 University of Michigan, 2 Kumo.AI, 3 Microsoft

ABSTRACT

1 INTRODUCTION
Inverse problems arise from a wide range of applications across many domains, including compu-
tational imaging (Beck & Teboulle, 2009; Afonso et al., 2011), medical imaging (Suetens, 2017;
Ravishankar et al., 2019), and remote sensing (Liu et al., 2021; 2022), to name a few. When solv-
ing these inverse problems, the goal is to reconstruct an unknown signal x∗ ∈ Rn given observed
measurements y ∈ Rm of the form
y = A(x∗ ) + η,
where A(·) : Rn → Rm denotes some forward measurement operator (can be linear or nonlinear)
and η ∈ Rm is additive noise. Usually, we are interested in the case when m < n, which follows
many real-world scenarios. When m < n, the problem is ill-posed and some kind of regularizer (or
prior) is necessary to obtain a meaningful solution.
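As a concrete illustration (a sketch of our own, not part of the experimental setup in this paper), the snippet below builds a simple random-subsampling operator A and simulates noisy measurements y = A(x∗) + η with m < n; all names, shapes, and values are hypothetical.

```python
import torch

# Hypothetical example: a random-subsampling (inpainting-style) forward operator A: R^n -> R^m.
n = 256 * 256 * 3                                        # dimension of the unknown signal x*
keep = torch.rand(n) < 0.3                               # keep roughly 30% of the entries, so m < n
A = lambda x: x[keep]                                    # linear measurement operator

x_star = torch.rand(n)                                   # placeholder for the ground-truth signal
sigma_y = 0.01                                           # measurement noise level
y = A(x_star) + sigma_y * torch.randn(int(keep.sum()))   # y = A(x*) + eta
```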
In the literature, the traditional approach of using hand-crafted priors (e.g. sparsity) is slowly being
replaced by rich, learned priors such as deep generative models. Recently, there has been a lot of
interest in using diffusion models as structural priors due to their state-of-the-art performance in
image generation (Dhariwal & Nichol, 2021a; Karras et al., 2022; Song et al., 2023b; Lou & Ermon,
2023a). Compared to generative adversarial networks (GANs), diffusion models are generally easier
and more stable to train, making them a generative prior that is more readily accessible (Dhariwal
& Nichol, 2021b). The most common approach for using diffusion models as priors is to resort to
posterior sampling, which has been extensively explored in the literature (Song et al., 2022; Chung
et al., 2023a; 2022; Kawar et al., 2022; Song et al., 2023a; Chung et al., 2023b; Meng & Kabashima,
2022; Zhang & Zhou, 2023). However, despite their remarkable success, these techniques exhibit
several limitations. The primary challenge is that the majority of existing works train these models
directly in the pixel space, which requires substantial computational resources and a large volume
of training data (Rombach et al., 2022).
Latent diffusion models (LDMs), which embed data in order to operate in a lower-dimensional
space, present a potential solution to this challenge, along with considerable improvements in com-
putational efficiency (Rombach et al., 2022; Vahdat et al., 2021) by training diffusion models in a
compressed latent space.

∗ Equal Contribution; Corresponding authors: {bowenbw, kwonsm}@umich.edu

Figure 1: Example reconstructions of our algorithm (ReSample) on two noisy inverse problems,
nonlinear deblurring and CT reconstruction, on natural and medical images, respectively.

They can also provide a great amount of flexibility, as they can enable
one to transfer and generalize these models to different domains by fine-tuning on small amounts
of training data (Ruiz et al., 2023). Nevertheless, using LDMs to solve inverse problems poses a
significant challenge. The main difficulty arises from the inherent nonlinearity and nonconvexity
of the decoder, making it challenging to directly apply existing solvers designed for pixel space.
To address this issue, a concurrent work by Rout et al. (2023) recently introduced a posterior sam-
pling algorithm operating in the latent space (PSLD), designed to solve linear inverse problems with
provable guarantees. However, we observe that PSLD may reconstruct images with artifacts in the
presence of measurement noise. Therefore, developing an efficient algorithm capable of addressing
these challenges remains an open research question.
In this work, we introduce a novel algorithm named ReSample, which effectively employs LDMs
as priors for solving general inverse problems. Our algorithm can be viewed as a two-stage process
that incorporates data consistency by (1) solving a hard-constrained optimization problem, ensuring
we obtain the correct latent variable that is consistent with the observed measurements, and (2)
employing a carefully designed resampling scheme to map the measurement-consistent sample back
onto the correct noisy data manifold. As a result, we show that our algorithm can achieve state-of-
the-art performance on various inverse problem tasks and different datasets, compared to existing
algorithms. Notably, owing to the use of latent diffusion models as generative priors, our algorithm
achieves a reduction in memory complexity. Below, we highlight some of our key contributions.
• We propose a novel algorithm that enables us to leverage latent diffusion models for solving
general inverse problems (linear and nonlinear) through hard data consistency.
• Particularly, we carefully design a stochastic resampling scheme that can reliably map the
measurement-consistent samples back onto the noisy data manifold to continue the reverse sam-
pling process. We provide a theoretical analysis to further demonstrate the superiority of the
proposed stochastic resampling technique.
• With extensive experiments on multiple tasks and various datasets, encompassing both natural
and medical images, our proposed algorithm achieves state-of-the-art performance on a variety of
linear and nonlinear inverse problems.
2 BACKGROUND
Denoising Diffusion Probabilistic Models. We first briefly review the fundamentals of dif-
fusion models, namely the denoising diffusion probabilistic model (DDPM) formulation (Ho et al.,
2020). Let x0 ∼ pdata (x) denote samples from the data distribution. Diffusion models start by
progressively perturbing data to noise via Gaussian kernels, which can be written as the variance-
preserving stochastic differential equation (VP-SDE) (Song et al., 2021) of the form
\[ \mathrm{d}x = -\frac{\beta_t}{2}\, x\, \mathrm{d}t + \sqrt{\beta_t}\, \mathrm{d}w, \tag{1} \]
where βt ∈ (0, 1) is the noise schedule, a monotonically increasing function of t, and w is the
standard Wiener process. This is generally defined such that we obtain the data distribution when
t = 0 and obtain a Gaussian distribution when t = T , i.e. xT ∼ N (0, I). The objective of diffusion
models is to learn the corresponding reverse SDE of Equation (1), which is of the form
\[ \mathrm{d}x = \left[ -\frac{\beta_t}{2}\, x - \beta_t \nabla_{x_t} \log p(x_t) \right] \mathrm{d}t + \sqrt{\beta_t}\, \mathrm{d}\bar{w}, \tag{2} \]
where dw̄ is the standard Wiener process running backward in time and ∇xt log p(xt ) is the (Stein)
score function. In practice, we approximate the score function using a neural network sθ parameter-
ized by θ, which can be trained via denoising score matching (Vincent, 2011):
\[ \hat{\theta} = \arg\min_{\theta}\; \mathbb{E}\left[ \left\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|_2^2 \right], \tag{3} \]
where t is uniformly sampled from [0, T ] and the expectation is taken over t, xt ∼ p(xt |x0 ), and
x0 ∼ pdata (x). Once we have access to the parameterized score function sθ , we can use it to
approximate the reverse-time SDE and simulate it using numerical solvers (e.g. Euler-Maruyama).

Figure 2: Overview of our ReSample algorithm during the reverse sampling process, conditioned
on the data constraints from the measurements. The entire sampling process is conducted in
the latent space upon passing the sample through the encoder. The proposed algorithm performs
hard data consistency at some time steps t via a skipped-step mechanism.
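To illustrate how the learned score is used, here is a minimal sketch of one Euler–Maruyama step of the reverse-time SDE in Equation (2); `score_model` and `beta` are placeholders for the trained network sθ and the noise schedule, and the discretization is our own simplification.

```python
import torch

@torch.no_grad()
def reverse_sde_step(x, t, dt, beta, score_model):
    # One Euler-Maruyama step of dx = [-(beta_t / 2) x - beta_t * score] dt + sqrt(beta_t) dw_bar,
    # integrated backward in time with step size dt > 0.
    beta_t = beta(t)                                   # noise schedule beta_t in (0, 1)
    score = score_model(x, t)                          # approximates grad_x log p_t(x)
    drift = -0.5 * beta_t * x - beta_t * score
    diffusion = (beta_t * dt) ** 0.5 * torch.randn_like(x)
    return x - drift * dt + diffusion                  # step from time t to t - dt
```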
Denoising Diffusion Implicit Models. As the DDPM formulation is known to have a slow sam-
pling process, Song et al. (2020) proposed denoising diffusion implicit models (DDIMs), which define
the diffusion process as a non-Markovian process to remedy this (Ho et al., 2020; Song et al., 2020;
2023b; Lu et al., 2022). This enables a faster sampling process with the sampling steps given by
\[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0(x_t) + \sqrt{1 - \bar{\alpha}_{t-1} - \eta\delta_t^2}\; s_\theta(x_t, t) + \eta\delta_t \epsilon, \quad t = T, \ldots, 0, \tag{4} \]
where αt = 1 − βt , ᾱt = ∏_{i=1}^{t} αi , ϵ ∼ N (0, I), η is the temperature parameter, δt controls the
stochasticity of the update step, and x̂0 (xt ) denotes the predicted x0 from xt , which takes the form
\[ \hat{x}_0(x_t) = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t + (1 - \bar{\alpha}_t)\, s_\theta(x_t, t) \right), \tag{5} \]
which is an application of Tweedie’s formula. Here, sθ is usually trained using the epsilon-matching
score objective (Song et al., 2020). We use DDIM as the backbone of our algorithm and show how
we can leverage these update steps for solving inverse problems.
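For concreteness, a sketch of one DDIM reverse update following Equations (4)–(5) is given below; `s_theta` stands in for the trained network and `alpha_bar` for the cumulative products ᾱ (both placeholders), and the implementation simply mirrors the formulas above.

```python
import torch

@torch.no_grad()
def ddim_step(x_t, t, s_theta, alpha_bar, eta=1.0, delta_t=0.0):
    a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
    # Eq. (5): predict x_0 from x_t (Tweedie-style estimate)
    x0_hat = (x_t + (1.0 - a_t) * s_theta(x_t, t)) / a_t ** 0.5
    # Eq. (4): move from x_t to x_{t-1}
    dir_coeff = (1.0 - a_prev - eta * delta_t ** 2).clamp(min=0.0) ** 0.5
    return a_prev ** 0.5 * x0_hat + dir_coeff * s_theta(x_t, t) + eta * delta_t * torch.randn_like(x_t)
```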
Solving Inverse Problems with Diffusion Models. Given measurements y ∈ Rm from some
forward measurement operator A(·), we can use diffusion models to solve inverse problems by
replacing the score function in Equation (2) with the conditional score function ∇xt log p(xt |y).
Then by Bayes rule, notice that we can write the conditional score as
∇xt log p(xt |y) = ∇xt log p(xt ) + ∇xt log p(y|xt ).
This results in the reverse SDE of the form
\[ \mathrm{d}x = \left[ -\frac{\beta_t}{2}\, x - \beta_t \left( \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t) \right) \right] \mathrm{d}t + \sqrt{\beta_t}\, \mathrm{d}\bar{w}. \]
In the literature, solving this reverse SDE is referred to as posterior sampling. However, the issue
with posterior sampling is that there does not exist an analytical formulation for the likelihood term
∇xt log p(y|xt ). To resolve this, there exist two lines of work: (1) to resort to alternating pro-
jections onto the measurement subspace to avoid using the likelihood directly (Chung et al., 2022;
Kawar et al., 2022; Wang et al., 2022) and (2) to estimate the likelihood under some mild assump-
tions (Chung et al., 2023a; Song et al., 2023a). For example, Chung et al. (2023a) proposed diffusion
posterior sampling (DPS) that uses a Laplacian approximation of the likelihood, which results in the
discrete update steps
\[ x'_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0(x_t) + \sqrt{1 - \bar{\alpha}_{t-1} - \eta\delta_t^2}\; s_\theta(x_t, t) + \eta\delta_t \epsilon, \tag{6} \]
\[ x_{t-1} = x'_{t-1} - \zeta \nabla_{x_t} \left\| y - \mathcal{A}(\hat{x}_0(x_t)) \right\|_2^2, \tag{7} \]
where ζ ∈ R can be viewed as a tunable step-size. However, as previously mentioned, these tech-
niques have limited applicability for real-world problems as they are all built on the pixel space.
Solving Inverse Problems with Latent Diffusion Models. The limited applicability of pixel-
based diffusion models can be tackled by alternatively utilizing more efficient LDMs as generative
priors. The setup for LDMs is the following: given an image x ∈ Rn , we have an encoder E : Rn →
Rk and a decoder D : Rk → Rn where k ≪ n. Let z = E(x) ∈ Rk denote the embedded samples
in the latent space. One way of incorporating LDMs to solve inverse problems would be to replace
the update steps in Equations (6) and (7) with
\[ z'_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{z}_0(z_t) + \sqrt{1 - \bar{\alpha}_{t-1} - \eta\delta_t^2}\; s_\theta(z_t, t) + \eta\delta_t \epsilon, \tag{8} \]
\[ z_{t-1} = z'_{t-1} - \zeta \nabla_{z_t} \left\| y - \mathcal{A}(\mathcal{D}(\hat{z}_0(z_t))) \right\|_2^2. \tag{9} \]
After incorporating LDMs, this can be viewed as a non-linear inverse problem due to the non-
linearity of the decoder D(·). As this builds upon the idea behind DPS, we refer to this algorithm as
Latent-DPS. While this formulation seems to work, we empirically observe that Latent-DPS often
produces reconstructions that are noisy or blurry and inconsistent with the measurements. We
conjecture that since the forward operator involving the decoder is highly nonconvex, the gradient
update may lead towards a local minimum. We provide more insights in Appendix Section D.
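To make the Latent-DPS baseline concrete, a sketch of one update following Equations (8)–(9) might look as follows; `decoder`, `forward_op`, and `s_theta` are placeholders for the pretrained decoder D, the measurement operator A, and the latent score network, and the step is our own simplified rendering.

```python
import torch

def latent_dps_step(z_t, t, y, s_theta, decoder, forward_op, alpha_bar, zeta, eta=1.0, delta_t=0.0):
    z_t = z_t.detach().requires_grad_(True)
    a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
    # Tweedie-style estimate of z_0 from z_t
    z0_hat = (z_t + (1.0 - a_t) * s_theta(z_t, t)) / a_t ** 0.5
    # Eq. (8): unconditional DDIM update in the latent space
    dir_coeff = (1.0 - a_prev - eta * delta_t ** 2).clamp(min=0.0) ** 0.5
    z_prime = a_prev ** 0.5 * z0_hat + dir_coeff * s_theta(z_t, t) + eta * delta_t * torch.randn_like(z_t)
    # Eq. (9): gradient of the measurement residual, taken w.r.t. z_t through the decoder
    residual = torch.sum((y - forward_op(decoder(z0_hat))) ** 2)
    grad = torch.autograd.grad(residual, z_t)[0]
    return (z_prime - zeta * grad).detach()
```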
Hard Data Consistency. Similar to Latent-DPS, our algorithm involves incorporating data con-
sistency into the reverse sampling process of LDMs. However, rather than a gradient update as
shown in Equation (9), we propose to solve an optimization problem on some time steps t:
\[ \hat{z}_0(y) \in \arg\min_{z} \frac{1}{2} \left\| y - \mathcal{A}(\mathcal{D}(z)) \right\|_2^2, \tag{10} \]
where we denote ẑ0 (y) as the sample consistent with the measurements y ∈ Rm . This optimization
problem has been previously explored in other works that use GANs for solving inverse problems,
and can be efficiently solved using iterative solvers such as gradient descent (Bora et al., 2017;
Jalal et al., 2021; Shah et al., 2021; Lempitsky et al., 2018). However, it is well known that solv-
ing this problem starting from a random initial point may lead to unfavorable local minima (Bora
et al., 2017). To address this, we solve Equation (10) starting from an initial point ẑ0 (zt+1 ), where
ẑ0 (zt+1 ) is the estimate of ground-truth latent vector at time 0 based on the sample at time t + 1.
The intuition behind this initialization is that we want to start the optimization process within local
proximity of the global solution of Equation (10), to avoid ending up in a poor local minimum. We term
this overall concept as hard data consistency, as we strictly enforce the measurements via optimiza-
tion, rather than a “soft” approach through gradient update like Latent-DPS. To obtain ẑ0 (zt+1 ), we
use Tweedie’s formula (Efron, 2011) that gives us an approximation of the posterior mean which
takes the following form:
\[ \hat{z}_0(z_t) = \mathbb{E}[z_0 \mid z_t] = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( z_t + (1 - \bar{\alpha}_t) \nabla_{z_t} \log p(z_t) \right). \tag{11} \]
However, we would like to note that performing hard data consistency on every reverse sampling
iteration t may be very costly. To address this, we first observe that as we approach t = T , the
estimated ẑ0 (zt+1 ) can deviate significantly from the ground truth. In this regime, we find that
hard data consistency provides only marginal benefits. Additionally, in the literature, existing works
point out the existence of a three-stage phenomenon (Yu et al., 2023), where they demonstrate that
data consistency is primarily beneficial for the semantic and refinement stages (the latter two stages
when t is closer to 0) of the sampling process. Following this reasoning, we divide T into three
sub-intervals and only apply the optimization in the latter two intervals. This approach provides
both computational efficiency and accurate estimates of ẑ0 (zt+1 ).
Furthermore, even during these two intervals, we observe that we do not need to solve the opti-
mization problem on every iteration t. Because of the continuity of the sampling process, after each
data-consistency optimization step, the samples in the following steps can retain similar semantic or
structural information to some extent. Thus, we “reinforce” the data consistency constraint during
the sampling process via a skipped-step mechanism. Empirically, we see that it is sufficient to per-
form this on every 10 (or so) iterations of t. One can think of hard data consistency as guiding the
sampling process towards the ground truth signal x∗ (or respectively z ∗ ) such that it is consistent
with the given measurements. Lastly, in the presence of measurement noise, minimizing Equa-
tion (10) to zero loss can lead to overfitting the noise. To remedy this, we perform early stopping,
where we only minimize up to a threshold τ based on the noise level. We will discuss the details of
the optimization process in the Appendix. We also observe that an additional Latent-DPS step after
unconditional sampling can (sometimes) marginally increase the overall performance. We perform
an ablation study on the performance of including Latent-DPS in the Appendix.
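As a rough sketch of the hard data consistency step in Equation (10), one can run a gradient-based solver (Adam is used here purely for illustration) initialized at the Tweedie estimate ẑ0(zt+1) and stop early once the residual falls below the threshold τ; `decoder` and `forward_op` are placeholders for D and A.

```python
import torch

def hard_data_consistency(z0_init, y, decoder, forward_op, tau=1e-4, max_iters=500, lr=1e-2):
    # Solve Eq. (10): min_z 0.5 * ||y - A(D(z))||_2^2, starting from the Tweedie estimate z0_init,
    # stopping early once the loss drops below tau to avoid overfitting the measurement noise.
    z = z0_init.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(max_iters):
        optimizer.zero_grad()
        loss = 0.5 * torch.sum((y - forward_op(decoder(z))) ** 2)
        if loss.item() < tau:
            break
        loss.backward()
        optimizer.step()
    return z.detach()          # measurement-consistent latent, i.e., z_hat_0(y)
```

Under the skipped-step mechanism described above, such a routine would only be invoked on roughly every tenth reverse sampling iteration within the latter two sub-intervals.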
Remapping Back to zt . Following the flowchart in Figure 2, the next step is to map the
measurement-consistent sample ẑ0 (y) back onto the data manifold defined by the noisy samples
at time t to continue the reverse sampling process. Doing so would be equivalent to computing
the posterior distribution p(zt |y). To incorporate ẑ0 (y) into the posterior, we propose to construct
an auxiliary distribution p(ẑt |ẑ0 (y), y) to replace p(zt |y). Here, ẑt denotes the remapped sam-
ple and zt′ denotes the unconditional sample before remapping. One simple way of computing this
distribution to obtain ẑt is shown in Proposition 1.
Proposition 1 (Stochastic Encoding). Since the sample ẑt is conditionally independent of the
measurement y given ẑ0 (y), we have that
\[ p(\hat{z}_t \mid \hat{z}_0(y), y) = p(\hat{z}_t \mid \hat{z}_0(y)) = \mathcal{N}\!\left( \sqrt{\bar{\alpha}_t}\, \hat{z}_0(y),\; (1 - \bar{\alpha}_t) I \right). \tag{12} \]
We defer all of the proofs to the Appendix. Proposition 1 provides us with a way of computing ẑt , which
we refer to as stochastic encoding. However, we observe that using stochastic encoding can incur
a high variance when t is farther away from t = 0, where the ground truth signal exists. This large
variance can often lead to noisy image reconstructions. To address this issue, we propose a posterior
sampling technique that reduces the variance by additionally conditioning on zt′ , the unconditional
sample at time t. Here, the intuition is that by using the information in zt′ , we can get closer to the
ground truth zt , which effectively reduces the variance. In Proposition 2, under some mild assumptions,
we show that this new distribution p(ẑt |zt′ , ẑ0 (y), y) is a tractable Gaussian distribution.

Table 1: Quantitative results of super resolution and inpainting on the CelebA-HQ dataset.
Input images have an additive Gaussian noise with σy = 0.01. Best results are in bold and second
best results are underlined.
Proposition 2 (Stochastic Resampling). Suppose that p(zt′ |ẑt , ẑ0 (y), y) is normally distributed
such that p(zt′ |ẑt , ẑ0 (y), y) = N (µt , σt2 ). If we let p(ẑt |ẑ0 (y), y) be a prior for µt , then the
posterior distribution p(ẑt |zt′ , ẑ0 (y), y) is given by
\[ p(\hat{z}_t \mid z'_t, \hat{z}_0(y), y) = \mathcal{N}\!\left( \frac{\sigma_t^2 \sqrt{\bar{\alpha}_t}\, \hat{z}_0(y) + (1 - \bar{\alpha}_t)\, z'_t}{\sigma_t^2 + (1 - \bar{\alpha}_t)},\; \frac{\sigma_t^2 (1 - \bar{\alpha}_t)}{\sigma_t^2 + (1 - \bar{\alpha}_t)}\, I \right). \tag{13} \]
We refer to this new mapping technique as stochastic resampling. Since we do not have access to
σt2 , it serves as a hyperparameter that we tune in our algorithm. The choice of σt2 plays a role of
controlling the tradeoff between prior consistency and data consistency. If σt2 → 0, then we recover
unconditional sampling, and if σt2 → ∞, we recover stochastic encoding. We observe that this new
technique also has several desirable properties, which we rigorously prove in the next section.
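The remapping step can be summarized in a few lines; the sketch below implements both stochastic encoding (Equation 12) and stochastic resampling (Equation 13), where `alpha_bar_t` is the scalar ᾱt and `sigma_t_sq` is the tunable variance hyperparameter discussed above.

```python
import torch

def stochastic_encode(z0_y, alpha_bar_t):
    # Eq. (12): z_hat_t ~ N(sqrt(a_bar_t) * z0_y, (1 - a_bar_t) I)
    return alpha_bar_t ** 0.5 * z0_y + (1.0 - alpha_bar_t) ** 0.5 * torch.randn_like(z0_y)

def stochastic_resample(z0_y, z_t_prime, alpha_bar_t, sigma_t_sq):
    # Eq. (13): posterior that mixes the measurement-consistent z0_y with the unconditional z_t_prime;
    # sigma_t_sq -> 0 recovers unconditional sampling, sigma_t_sq -> inf recovers stochastic encoding.
    var_prior = 1.0 - alpha_bar_t
    denom = sigma_t_sq + var_prior
    mean = (sigma_t_sq * alpha_bar_t ** 0.5 * z0_y + var_prior * z_t_prime) / denom
    var = sigma_t_sq * var_prior / denom
    return mean + var ** 0.5 * torch.randn_like(z0_y)
```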
In Section 3.1, we discussed that stochastic resampling induces less variance than stochastic encod-
ing. Here, we aim to rigorously prove the validity of this statement.
Lemma 1. Let z̃t and ẑt denote the stochastically encoded and resampled image of ẑ0 (y), respec-
tively. If VAR(zt′ ) > 0, then we have that VAR(ẑt ) < VAR(z̃t ).
Theorem 1. If ẑ0 (y) is measurement-consistent such that y = A(D(ẑ0 (y))), i.e. ẑ0 = ẑ0 (zt+1 ) =
ẑ0 (y), then stochastic resampling is unbiased in the sense that E[ẑt |y] = E[zt′ ].
These two results, Lemma 1 and Theorem 1, prove the benefits of stochastic resampling. At a
high-level, these proofs rely on the fact that the posterior distributions of both stochastic encoding and
resampling are Gaussian and compare their respective means and variances. In the following result,
we characterize the variance induced by stochastic resampling, and show that as t → 0, the variance
decreases, giving us a reconstructed image that is of better quality.
Theorem 2. Let z0 denote a sample from the data distribution and zt be a sample from the noisy
perturbed distribution at time t. Then,
\[ \mathrm{Cov}(z_0 \mid z_t) = \frac{(1 - \bar{\alpha}_t)^2}{\bar{\alpha}_t}\, \nabla^2_{z_t} \log p_{z_t}(z_t) + \frac{1 - \bar{\alpha}_t}{\bar{\alpha}_t}\, I. \]
By Theorem 2, notice that since ᾱt is a sequence that increases to 1 as t decreases,
the variance between the ground truth z0 and the estimated ẑ0 decreases to 0 as t → 0, assuming
that ∇²zt log pzt (zt ) < ∞. Following our theory, we empirically show that stochastic resampling
can reconstruct signals that are less noisy than stochastic encoding, as shown in the next section.
4 EXPERIMENTS
We conduct experiments to solve both linear and nonlinear inverse problems on natural and medical
images. We compare our algorithm to several state-of-the-art methods that directly apply the diffu-
sion models that are trained in the pixel space: DPS (Chung et al., 2023a), Manifold Constrained
Gradients (MCG) (Chung et al., 2022), Denoising Diffusion Restoration Models (DDRM) (Kawar
et al., 2022), and Diffusion Model Posterior Sampling (DMPS) (Meng & Kabashima, 2022). We also
compare against an algorithm that uses a plug-and-play approach, applying a pretrained deep denoiser
to inverse problems (ADMM-PnP) (Ahmad et al., 2019), as well as Latent-DPS and Posterior Sampling
with Latent Diffusion (PSLD) (Rout et al., 2023), a concurrent work that also tackles inverse problems
with latent diffusion models. Various quantitative metrics are used for evaluation, including the Learned
Perceptual Image Patch Similarity (LPIPS) distance, peak signal-to-noise ratio (PSNR), and the
structural similarity index (SSIM). Lastly, we conduct ablation studies to compare the performance of
stochastic encoding and our proposed stochastic resampling technique as mentioned in Section 3.1,
and also demonstrate the memory efficiency gained by leveraging LDMs.

Table 2: Quantitative results of Gaussian and nonlinear deblurring on the CelebA-HQ dataset.
Input images have an additive Gaussian noise with σy = 0.01. Best results are in bold and second
best results are underlined. For nonlinear deblurring, some baselines are omitted, as they can only
solve linear inverse problems.

Table 3: Quantitative results of CT reconstruction on the LDCT dataset. Best results are in bold
and second best results are underlined.

                                   Abdominal                    Head                         Chest
Method                             PSNR↑         SSIM↑          PSNR↑         SSIM↑          PSNR↑         SSIM↑
Latent-DPS                         26.80 ±1.09   0.870 ±0.026   28.64 ±5.38   0.893 ±0.058   25.67 ±1.14   0.822 ±0.033
MCG (Chung et al., 2022)           29.41 ±3.14   0.857 ±0.041   28.28 ±3.08   0.795 ±0.116   27.92 ±2.48   0.842 ±0.036
DPS (Chung et al., 2023a)          27.33 ±2.68   0.715 ±0.031   24.51 ±2.77   0.665 ±0.058   24.73 ±1.84   0.682 ±0.113
PnP-UNet (Gilton et al., 2021)     32.84 ±1.29   0.942 ±0.008   33.45 ±3.25   0.945 ±0.023   29.67 ±1.14   0.891 ±0.011
FBP                                26.29 ±1.24   0.727 ±0.036   26.71 ±5.02   0.725 ±0.106   24.12 ±1.14   0.655 ±0.033
FBP-UNet (Jin et al., 2017)        32.77 ±1.21   0.937 ±0.013   31.95 ±3.32   0.917 ±0.048   29.78 ±1.12   0.885 ±0.016
ReSample (Ours)                    35.91 ±1.22   0.965 ±0.007   37.82 ±5.31   0.978 ±0.014   31.72 ±0.912  0.922 ±0.011
Experiments on Natural Images. For the experiments on natural images, we use datasets
FFHQ (Kazemi & Sullivan, 2014), CelebA-HQ (Liu et al., 2015), and LSUN-Bedroom (Yu et al.,
2016), with an image resolution of 256×256×3. We take the pre-trained latent diffusion models LDM-
VQ4 trained on FFHQ and CelebA-HQ provided by Rombach et al. (2022), whose autoencoders
map images to latents of size 64 × 64 × 3, together with DDPMs (Ho et al., 2020) also trained on the
FFHQ and CelebA-HQ training sets. We then sample 100 images from both the FFHQ and CelebA-HQ
validation sets for evaluation. For computing quantitative results, all images are normalized to the range
[0, 1]. All experiments use Gaussian measurement noise with standard deviation σy = 0.01. Due to
limited space, we defer the results on FFHQ and the details of the hyperparameters to the Appendix.
For linear inverse problems, we consider the following tasks: (1) Gaussian deblurring, (2) inpainting
(with a random mask), and (3) super resolution. For Gaussian deblurring, we use a kernel with size
61 × 61 with standard deviation 3.0. For super resolution, we use bicubic downsampling, and for
inpainting, we use a random mask with varying levels of missing pixels. For nonlinear inverse problems,
we consider nonlinear deblurring as proposed by (Chung et al., 2023a). The quantitative results are
displayed in Tables 1 and 2, with qualitative results in Figure 3. In Tables 1 and 2, we can see that
ReSample significantly outperforms all of the baselines across all three metrics on the CelebA-HQ
dataset. We also observe that ReSample performs better than or comparable to all baselines on the
FFHQ dataset as shown in the Appendix. Remarkably, our method excels in handling nonlinear
inverse problems, further demonstrating the flexibility of our algorithm. We further demonstrate the
superiority of ReSample for handling nonlinear inverse problems in Figure 6a, where we show that
we can consistently outperform DPS.
*We have updated the baseline results for PSLD. More details are provided in the Appendix (Section A.4).
Figure 3: Qualitative results of multiple tasks on the LSUN-Bedroom and CelebA-HQ datasets.
All inverse problems have Gaussian measurement noise with variance σy = 0.01.
Figure 4: Qualitative results of CT reconstruction on the LDCT dataset. We annotate the critical
image structures in a red box, and zoom in below the image.
Effectiveness of the Resampling Technique. Here, we validate our theoretical results by con-
ducting ablation studies on stochastic resampling. Specifically, we perform experiments on the
LSUN-Bedroom and CelebA-HQ datasets with tasks of Gaussian deblurring and super-resolution.
As shown in Figure 5, we observe that stochastic resampling reconstructs smoother images with
higher PSNRs compared to stochastic encoding, corroborating our theory.
8
Published as a conference paper at ICLR 2024
Figure 6: Left: Additional results on nonlinear deblurring highlighting the performance of ReSam-
ple. Right: Ablation study on the ReSample frequency on the performance.
Effectiveness of Hard Data Consistency. In Figure 6b, we perform an ablation study on the
ReSample frequency for CT reconstruction. The results are in line with what we expect intuitively:
more ReSample time steps (i.e., more data consistency) lead to more accurate reconstructions.
Memory Efficiency. To demonstrate memory efficiency, we use the command nvidia-smi to
monitor the memory consumption while solving an inverse problem. We present the memory usage
for Gaussian deblurring on the FFHQ dataset in Table 4. Although the entire LDM models occupy
more memory due to the autoencoders, our algorithm itself exhibits memory efficiency, resulting in
lower overall memory usage. This highlights its potential in domains like medical imaging, where
memory plays a crucial role in feasibility.

Table 4: Memory usage of different methods for Gaussian deblurring on the FFHQ dataset.

Model   Algorithm   Model Only   Memory Increment    Total
DDPM    DPS         1953MB       +3416MB (175%)      5369MB
DDPM    MCG         1953MB       +3421MB (175%)      5374MB
DDPM    DMPS        1953MB       +5215MB (267%)      7168MB
DDPM    DDRM        1953MB       +18833MB (964%)     20786MB
LDM     PSLD        3969MB       +5516MB (140%)      9485MB
LDM     ReSample    3969MB       +1040MB (26.2%)     5009MB
5 CONCLUSION
In this paper, we propose ReSample, an algorithm that can effectively leverage LDMs to solve gen-
eral inverse problems. We demonstrated that our algorithm can reconstruct high-quality images
compared to many baselines, including those in the pixel space. One limitation of our method lies
in the computational overhead of hard data consistency, which we leave as a significant challenge
for future work to address and improve upon.
6 REPRODUCIBILITY STATEMENT
To ensure the reproducibility of our results, we thoroughly detail the hyperparameters employed
in our algorithm in the Appendix. Additionally, we provide a comprehensive explanation of the
configuration of all baselines used in our experiments. As we use pre-trained diffusion models
throughout our experiments, they are readily accessible online. Lastly, our code is available at
https://github.com/soominkwon/resample.
7 ACKNOWLEDGEMENTS
BS and LS acknowledge support from U-M MIDAS PODS Grant and U-M MICDE Catalyst Grant,
and computing resource support from NSF ACCESS Program and Google Cloud Research Credits
Program. This work used NCSA Delta GPU through allocation CIS230133 and ELE230011 from
the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) pro-
gram, which is supported by National Science Foundation grants #2138259, #2138286, #2138307,
#2137603, and #2138296. SMK and QQ acknowledge support from U-M START & PODS grants,
NSF CAREER CCF-2143904, NSF CCF-2212066, NSF CCF-2212326, ONR N00014-22-1-2529,
AWS AI Award, and a gift grant from KLA.
REFERENCES
Manya V. Afonso, José M. Bioucas-Dias, and Mário A. T. Figueiredo. An augmented lagrangian
approach to the constrained optimization formulation of imaging inverse problems. IEEE Trans-
actions on Image Processing, 20(3):681–695, 2011. doi: 10.1109/TIP.2010.2076294.
Rizwan Ahmad, Charles A Bouman, Gregery T Buzzard, Stanley H Chan, Edward T Reehorst, and
Philip Schniter. Plug and play methods for magnetic resonance imaging. 2019.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009. doi: 10.1137/080716542.
URL https://doi.org/10.1137/080716542.
Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using genera-
tive models. In International Conference on Machine Learning, pp. 537–546. PMLR, 2017.
Paul A. Bromiley. Products and convolutions of gaussian probability density functions. 2013.
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for
inverse problems using manifold constraints. arXiv preprint arXiv:2206.00941, 2022.
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul
Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh Interna-
tional Conference on Learning Representations, 2023a. URL https://openreview.net/
forum?id=OnD9zGAGT0k.
Hyungjin Chung, Suhyeon Lee, and Jong Chul Ye. Fast diffusion sampler for inverse problems by
geometric decomposition. arXiv preprint arXiv:2303.05754, 2023b.
Hyungjin Chung, Jong Chul Ye, Peyman Milanfar, and Mauricio Delbracio. Prompt-tuning latent
diffusion models for inverse problems. arXiv preprint arXiv:2310.01110, 2023c.
Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint
arXiv:2105.05233, 2021a.
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances
in Neural Information Processing Systems, 34:8780–8794, 2021b.
Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Asso-
ciation, 106(496):1602–1614, 2011. doi: 10.1198/jasa.2011.tm11181. URL https://doi.
org/10.1198/jasa.2011.tm11181.
Nic Fishman, Leo Klarner, Valentin De Bortoli, Emile Mathieu, and Michael Hutchinson. Diffusion
models for constrained domains. arXiv preprint arXiv:2304.05364, 2023.
Davis Gilton, Gregory Ongie, and Rebecca Willett. Model adaptation for inverse problems in imag-
ing. IEEE Transactions on Computational Imaging, 7:661–674, 2021.
Harshit Gupta, Kyong Hwan Jin, Ha Q Nguyen, Michael T McCann, and Michael Unser. Cnn-based
projected gradient descent for consistent ct image reconstruction. IEEE transactions on medical
imaging, 37(6):1440–1453, 2018.
Yoseob Han and Jong Chul Ye. Framing u-net via deep convolutional framelets: Application to
sparse-view ct. IEEE transactions on medical imaging, 37(6):1418–1429, 2018.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in
Neural Information Processing Systems, 33:6840–6851, 2020.
Shady Abu Hussein, Tom Tirer, and Raja Giryes. Image-adaptive gan based reconstruction. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3121–3129, 2020.
Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jon
Tamir. Robust compressed sensing mri with deep generative priors. In M. Ranzato,
A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neu-
ral Information Processing Systems, volume 34, pp. 14938–14954. Curran Associates, Inc.,
2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/
file/7d6044e95a16761171b130dcb476a43e-Paper.pdf.
Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional
neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):
4509–4522, 2017.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-
based generative models. arXiv preprint arXiv:2206.00364, 2022.
Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration
models. arXiv preprint arXiv:2201.11793, 2022.
Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regres-
sion trees. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867–
1874, 2014. doi: 10.1109/CVPR.2014.241.
Victor Lempitsky, Andrea Vedaldi, and Dmitry Ulyanov. Deep image prior. In 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018. doi: 10.1109/
CVPR.2018.00984.
Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir:
Image restoration using swin transformer. In Proceedings of the IEEE/CVF international confer-
ence on computer vision, pp. 1833–1844, 2021.
Guan-Horng Liu, Tianrong Chen, Evangelos A Theodorou, and Molei Tao. Mirror diffusion models
for constrained and watermarked generation. arXiv preprint arXiv:2310.01236, 2023.
Wei Liu, Xin Xia, Lu Xiong, Yishi Lu, Letian Gao, and Zhuoping Yu. Automated vehicle sideslip
angle estimation considering signal measurement characteristic. IEEE Sensors Journal, 21(19):
21675–21687, 2021.
Wei Liu, Karoll Quijano, and Melba M Crawford. Yolov5-tassel: detecting tassels in rgb uav im-
agery with improved yolov5 based on transfer learning. IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, 15:8085–8094, 2022.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild.
In Proceedings of International Conference on Computer Vision (ICCV), 2015.
Aaron Lou and Stefano Ermon. Reflected diffusion models. arXiv preprint arXiv:2304.04740,
2023a.
Aaron Lou and Stefano Ermon. Reflected diffusion models. arXiv preprint arXiv:2304.04740,
2023b.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A
fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint
arXiv:2206.00927, 2022.
Xiangming Meng and Yoshiyuki Kabashima. Diffusion model based posterior sampling for noisy
linear inverse problems. arXiv preprint arXiv:2211.12343, 2022.
Taylor R Moen, Baiyu Chen, David R Holmes III, Xinhui Duan, Zhicong Yu, Lifeng Yu, Shuai
Leng, Joel G Fletcher, and Cynthia H McCollough. Low-dose ct image and projection dataset.
Medical physics, 48(2):902–911, 2021.
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models.
In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
Saiprasad Ravishankar, Jong Chul Ye, and Jeffrey A Fessler. Image reconstruction: From sparsity
to data-adaptive methods and machine learning. Proceedings of the IEEE, 108(1):86–109, 2019.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
Matteo Ronchetti. Torchradon: Fast differentiable routines for computed tomography. arXiv preprint
arXiv:2009.14788, 2020.
Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alexandros G. Dimakis, and Sanjay
Shakkottai. Solving linear inverse problems provably via posterior sampling with latent diffusion
models. arXiv preprint arXiv:2307.00619, 2023.
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman.
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500–
22510, 2023.
Viraj Shah, Rakib Hyder, M. Salman Asif, and Chinmay Hegde. Provably convergent algorithms
for solving inverse problems using generative models. arXiv preprint arXiv:2105.06371, 2021.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv
preprint arXiv:2010.02502, 2020.
Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion
models for inverse problems. In International Conference on Learning Representations, 2023a.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben
Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint
arXiv:2011.13456, 2021.
Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imag-
ing with score-based generative models. International Conference on Learning Representations,
2022. URL https://openreview.net/forum?id=vaRCHVj0uGI.
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint
arXiv:2303.01469, 2023b.
Paul Suetens. Fundamentals of medical imaging. Cambridge university press, 2017.
Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.),
Advances in Neural Information Processing Systems, volume 34, pp. 11287–11302. Curran
Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/
paper/2021/file/5dca4c6b9e244d24a30b4c45601d9720-Paper.pdf.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Compu-
tation, 23(7):1661–1674, 2011. doi: 10.1162/NECO a 00142.
Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion
null-space model. arXiv preprint arXiv:2212.00490, 2022.
Haoyu Wei, Florian Schiffers, Tobias Würfl, Daming Shen, Daniel Kim, Aggelos K Katsaggelos,
and Oliver Cossairt. 2-step sparse-view ct reconstruction with a domain-specific perceptual net-
work. arXiv preprint arXiv:2012.04743, 2020.
Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun:
Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv
preprint arXiv:1506.03365, 2016.
Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, and Jian Zhang. Freedom: Training-free
energy-guided conditional diffusion model. arXiv preprint arXiv:2303.09833, 2023.
Dan Zhang and Fangfang Zhou. Self-supervised image denoising for real-world images with
context-aware transformer. IEEE Access, 11:14340–14349, 2023.
Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser:
Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26
(7):3142–3155, 2017.
Bo Zhu, Jeremiah Z Liu, Stephen F Cauley, Bruce R Rosen, and Matthew S Rosen. Image recon-
struction by domain-transform manifold learning. Nature, 555(7697):487–492, 2018.
Appendix
In this section, we present additional experimental results to supplement those presented in the
main paper. In Section A, we present additional qualitative and quantitative results to highlight the
performance of ReSample. In Section B, we briefly discuss some of the implementation details
regarding hard data consistency. In Section C, we outline all of the hyperparameters used to produce
our results. In Section D, we discuss further reasons why Latent-DPS fails to consistently and
accurately recover the underlying image. Lastly, in Section F, we present our deferred proofs.
A ADDITIONAL RESULTS
Here, we provide additional quantitative results on the FFHQ dataset. Similar to the results in the
main text on the CelebA-HQ dataset, ReSample outperforms the baselines across many different
tasks.
Table 5: Comparison of quantitative results on inverse problems on the FFHQ dataset. Input images
have an additive Gaussian noise with σy = 0.01. Best results are in bold and second best results are
underlined.
Table 6: Comparison of quantitative results on inverse problems on the FFHQ dataset. Input images
have an additive Gaussian noise with σy = 0.01. Best results are in bold and second best results are
underlined.
We observe that ReSample outperforms all baselines in nonlinear deblurring and super resolution,
while demonstrating comparable performance to DPS (Chung et al., 2023a) in Gaussian deblurring
and inpainting. An interesting observation is that ReSample exhibits the largest performance gap
from DPS in nonlinear deblurring and the smallest performance gap in random inpainting, mirroring
the pattern we observed in the results of the CelebA-HQ dataset. This may suggest that ReSample
performs better than baselines when the forward operator is more complex.
In this section, we present both qualitative and quantitative results on a more challenging image
inpainting setting. More specifically, we consider the “box” inpainting setting, where the goal is to
recover a ground truth image in which large regions of pixels are missing. We present our results
in Figure 7 and Table 7, where we compare the performance of our algorithm to PSLD (Rout et al.,
2023) (latent-space algorithm) and DPS (Chung et al., 2023a) (pixel-space algorithm). Even in
this challenging setting, we observe that ReSample can outperform the baselines, highlighting the
effectiveness of our method.
Table 7: Comparison of quantitative results for box inpainting on the CelebA-HQ dataset. Input
images have an additive Gaussian noise with σy = 0.01. Best results are in bold and second best
results are underlined.

Method             LPIPS↓   PSNR↑   SSIM↑
DPS                0.127    22.85   0.861
DDRM               0.120    24.33   0.860
DMPS               0.233    22.01   0.803
Latent-DPS         0.199    23.14   0.784
PSLD               0.201    23.99   0.787
ReSample (Ours)    0.093    24.67   0.892
In this section, we present techniques for extending our algorithm to address inverse problems
with high-resolution images. One straightforward method involves using arbitrary-resolution ran-
dom noise as the initial input for the LDM. The convolutional architecture shared by the UNet in
both the LDM and the autoencoder enables the generation of images at arbitrary resolutions, as
demonstrated by Rombach et al. (2022). To illustrate the validity of this approach, we conducted a
random-inpainting experiment with images of dimensions (512 × 512 × 3) and report the results in
Figure 8. In Figure 8, we observe that our method can achieve excellent performance even in this
high-resolution setting. Other possible methods include obtaining an accurate pair of encoders and
decoders that can effectively map higher-resolution images to the low-dimensional space, and then
running our algorithm using the latent diffusion model. We believe that these preliminary results
represent an important step towards using LDMs as generative priors for solving inverse problems
for high-resolution images.
Figure 7: Qualitative results on box inpainting with measurement noise σy = 0.01. These results
highlight the effectiveness of ReSample on more difficult inverse problem tasks.
To better reflect the performance of PSLD (Rout et al., 2023), we re-ran several experiments with
PSLD while collaborating with the original authors to fine-tune hyperparameters.

Figure 8: Additional results on 70% random inpainting with high-resolution images (512 × 512 × 3)
with measurement noise σy = 0.01.

Furthermore,
as the work by Rout et al. (2023) initially employed the stable diffusion (SD) model rather than the
LDM model, we have relabeled the baselines as “PSLD-LDM”. We report the hyperparameters used
to generate the new results, as well as the ones used to generate the previous results in Table 8. The
previous hyperparameters were chosen based on the implementation details provided by Rout et al.
(2023) in Section B.1, where they stated that they used γ = 0.1 and that the hyperparameters were
available in the codebase. However, it has been brought to our attention that this was not the optimal
hyperparameter. Therefore, we additionally tuned the parameter so that each baseline could obtain
the best results. We have also conducted additional experiments directly comparing to PSLD-LDM
for inpainting tasks on the FFHQ 1K validation set and present the results in Table 9. Throughout
these results, we still observe that our algorithm largely outperforms the baselines, including PSLD.
Table 8: Different hyperparameters used to generate PSLD results. SR refers to super resolution
(4×), “previous” and “new” refer to the hyperparameter γ used to generate the old and new results,
respectively.
Table 9: Comparison of quantitative results for different inpainting tasks on the FFHQ 1k validation
set. Best results are in bold and second best results are underlined.
Figure 9: Additional results on super resolution (4×) on the CelebA-HQ dataset with Gaussian
measurement noise of variance σy = 0.01.
Figure 10: Additional results on Gaussian deblurring on the CelebA-HQ dataset with Gaussian
measurement noise of variance σy = 0.01.
Figure 11: Additional results on inpainting with a random mask (70%) on the LSUN-Bedroom
dataset with Gaussian measurement noise of variance σy = 0.01.
Figure 12: Comparison of algorithms on Gaussian deblurring on the FFHQ dataset with Gaussian
measurement noise of variance σy = 0.01.
Figure 13: Additional results on Gaussian deblurring with additive Gaussian noise σy = 0.05.
Figure 14: Additional results on super resolution 4× with additive Gaussian noise σy = 0.05.
Figure 15: Additional results on CT reconstruction with additive Gaussian noise σy = 0.01.
Here, we present some ablation studies regarding computational efficiency and stochastic resam-
pling, amongst others.
Figure 16: An ablation study on a few of the hyperparameters associated with ReSample. Left:
Performance of different values of γ (hyperparameter in stochastic resampling) on the CelebA-HQ
dataset. Right: CT reconstruction performance as a function of the number of timesteps to perform
optimization.
Effectiveness of Stochastic Resampling. Recall that in stochastic resampling, we have one hy-
perparameter σt2 . In Section C, we discuss that our choice for this hyperparameter is
\[ \sigma_t^2 = \gamma\, \frac{1 - \bar{\alpha}_{t-1}}{\bar{\alpha}_t} \left( 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}} \right), \]
where ᾱt is determined by the noise schedule of the diffusion process. Thus, the only param-
eter that we need to choose here is γ. To this end, we perform a study on how γ affects the image
reconstruction quality. The γ term can be interpreted as a parameter that balances the prior consis-
tency with the measurement consistency. In Figure 16 (left), we observe that performance increases
substantially when γ increases from a small value, but plateaus afterwards. We also observe that a larger γ
gives finer details in the images, but may introduce additional noise. In Figure 17, we provide
visual representations corresponding to the choices of γ in order.
Effect of the Skip Step Size. In this section, we conduct an ablation study to investigate the im-
pact of the skip step size in our algorithm. The skip step size denotes the frequency at which we
’skip’ before applying the next hard data consistency. For instance, a skip step size of 10 implies
that we apply hard data consistency every 10 iterations of the reverse sampling process. Intuitively,
one might expect that a smaller skip step size would lead to better reconstructions, as it implies more
frequent hard data consistency steps. To validate this intuition, we conducted an experiment on the effect of
the skip step size on CT reconstruction, and the results are displayed in Figure 18, with correspond-
ing inference times in Table 10. We observe that for skip step sizes ranging from 1 to 10, the results
are very similar. However, considering the significantly reduced time required for a skip step size
of 10 as shown in Table 10, we chose a skip step size of 10 in our experimental setting in order to
balance the trade-off between reconstruction quality and inference time. Lastly, we would like to
note that in Figure 18, a skip step size of 4 exhibits a (very) slight improvement over a skip step
of 1. Since the reverse sampling process is initialized with a random noise vector, there is minor
variability in the results, with the average outcomes for skip step sizes of 1 and 4 being very similar.
Table 10: Inference times and corresponding performance for performing hard data consistency
with varying skip step sizes for chest CT reconstruction.

Figure 18: Observing the effect of the skip step size of hard data consistency optimization on CT
reconstruction. The numbers above the images correspond to the number of hard data consistency
steps per iteration of the reverse sampling process (e.g., 1 refers to hard data consistency on every
step), and the red boxes outline the differences in reconstructions. (Per-image annotations: PSNR
27.63 / 29.95 / 31.30 / 31.73 / 31.80 / 31.79; SSIM 0.869 / 0.897 / 0.918 / 0.922 / 0.923 / 0.922.)
Discussion on Latent-DPS. In the main text, we briefly discussed how adding the Latent-DPS
gradient term into our ReSample algorithm can improve the overall reconstruction quality. Generally,
we observe that adding the Latent-DPS gradient causes a very marginal boost in PSNR, but only
when the learning rate scale is chosen "correctly". Interestingly, we observe that the scaling by ᾱt is
critical to the performance of Latent-DPS. To this end, we perform a brief ablation study on the
learning rate for Latent-DPS, where we choose the learning rate to be kᾱt for some k > 0. We vary k
and test the performance on 50 chest CT images and display the results in Table 13. In Table 13, we
observe that k = 0.5 returns the best results, and should be chosen if one were to adopt the method
of adding Latent-DPS into ReSample.

Table 12: Memory usage of different algorithms with different pretrained models for Gaussian
deblurring on the FFHQ256 and ImageNet512 datasets.

Pretrained Model    Algorithm         Model Only   Memory Increment    Total Memory
DDPM (FFHQ)         DPS               1953MB       +3416MB (175%)      5369MB
DDPM (FFHQ)         MCG               1953MB       +3421MB (175%)      5374MB
DDPM (FFHQ)         DMPS              1953MB       +5215MB (267%)      7168MB
DDPM (FFHQ)         DDRM              1953MB       +18833MB (964%)     20786MB
LDM (FFHQ)          PSLD              3969MB       +5516MB (140%)      9485MB
LDM (FFHQ)          ReSample (Ours)   3969MB       +1040MB (26.2%)     5009MB
DDPM (ImageNet)     DPS               4394MB       +6637MB (151%)      11031MB
DDPM (ImageNet)     MCG               4394MB       +6637MB (151%)      11031MB
DDPM (ImageNet)     DMPS              4394MB       +8731MB (199%)      13125MB
DDPM (ImageNet)     DDRM              4394MB       +4530MB (103%)      8924MB
LDM (ImageNet)      PSLD              5669MB       +5943MB (105%)      11612MB
LDM (ImageNet)      ReSample (Ours)   5669MB       +1322MB (30.1%)     7002MB
Table 13: The effect of the Latent-DPS gradient scale (learning rate) as a function of k > 0. We
study the PSNR and measurement loss (objective function) changes for varying k for chest CT
reconstruction on 50 test samples. The best results are in bold. Note that k = 0 here refers to no
Latent-DPS and only using ReSample.
Training Efficiency of LDMs. To underscore the significance of our work, we conducted an abla-
tion study focusing on the training efficiency of LDMs. This study demonstrates that LDMs require
significantly fewer computational resources for training, a trait particularly beneficial for down-
stream applications like medical imaging. To support this assertion, we present the training time
and performance of LDMs, comparing them to DDPMs (Ho et al., 2020) in the context of medical
image synthesis. For LDMs, we utilized the pretrained autoencoder and LDM architecture provided
by Rombach et al. (2022), training them on 9000 2D CT slices from three organs across 40 patients
sourced from the LDCT training set Moen et al. (2021). In the case of DDPMs, we employed the
codebase provided by Nichol & Dhariwal (2021) to train pixel-based diffusion models on the same
set of 2D CT training slices. The results are presented in comparison with 300 2D slices from 10
patients in the validation set, detailed in Table 14. Observing the results in Table 14, we note that by
utilizing LDMs, we can significantly reduce training time and memory usage while achieving better
generation quality, as measured by the FID score.
Table 14: Comparison of training efficiency and performance of LDMs and DDPMs on CT images,
computed on a V100 GPU.
Recall that the hard data consistency step involves solving the following optimization problem:
\[ \hat{z}_0(y) \in \arg\min_{z} \frac{1}{2} \left\| y - \mathcal{A}(\mathcal{D}(z)) \right\|_2^2, \tag{14} \]
where we initialize using ẑ0 (zt ). Instead of solving for the latent variable z directly, notice that one
possible technique would be to instead solve for vanilla least squares:
\[ \hat{x}_0(y) \in \arg\min_{x} \frac{1}{2} \left\| y - \mathcal{A}(x) \right\|_2^2, \tag{15} \]
where we instead initialize using D(ẑ0 (zt )), where D(·) denotes the decoder. Similarly, our goal
here is to find the x̂0 (y) that is close to D(ẑ0 (zt )) and satisfies the measurement consistency: ∥y −
A(x)∥22 < σ 2 , where σ 2 is an estimated noise level. Throughout the rest of the Appendix, we refer
to the former optimization process as latent optimization and the latter as pixel optimization.
In our experiments, we actually found that these two different formulations yield different results,
in the sense that performing latent optimization gives reconstructions that are “noisy”, yet much
sharper with fine details, whereas pixel optimization gives results that are “smoother” yet blurry
with high-level semantic information. Here, the intuition is that pixel optimization does not directly
change the latent variable, whereas latent optimization directly optimizes over the latent variable.
Moreover, the encoder E(·) can add additional errors to the estimated ẑ0 (y), perhaps throwing the
sample off the data manifold, yielding images that are blurry as a result.
There is also a significant difference in time complexity between these two methods. Since latent
optimization needs to backpropagate through the whole network of the decoder D(·) on every gradi-
ent step, it takes much longer to obtain a local minimum (or converge). Empirically, to balance the
trade-off between reconstruction speed and image quality, we see that using both of these formula-
tions for hard data consistency can not only yield the best results, but also speed up the optimization
process. Since pixel optimization can more easily reach a global optimum, we use it first during the reverse
sampling process and then use latent optimization when we are closer to t = 0 to refine the images
with the finer details.
Lastly, we would like to remark that for pixel optimization, there is a closed-form solution that could
be leveraged under specific settings (Wang et al., 2022). If the forward operator A is linear and can
take the matrix form A and the measurements are noiseless (i.e., y = A(x̂0 (y))), then we can pose
the optimization problem as
\[ \hat{x}_0(y) \in \arg\min_{x} \frac{1}{2} \left\| \mathcal{D}(\hat{z}_0(z_t)) - x \right\|_2^2, \quad \text{s.t. } Ax = y. \tag{16} \]
Then, the solution to this optimization problem is given by
\[ \hat{x}_0(y) = \mathcal{D}(\hat{z}_0(z_t)) - \left( A^{+} A\, \mathcal{D}(\hat{z}_0(z_t)) - A^{+} y \right), \tag{17} \]
where by employing the encoder, we obtain
\[ \hat{z}_0(y) = \mathcal{E}(\hat{x}_0(y)) = \mathcal{E}\!\left( \mathcal{D}(\hat{z}_0(z_t)) - \left( A^{+} A\, \mathcal{D}(\hat{z}_0(z_t)) - A^{+} y \right) \right). \tag{18} \]
This optimization technique does not require iterative solvers and offers great computational effi-
ciency. In the following section, we provide ways in which we can compute a closed-form solution
in the case in which the measurements may be noisy.
Notice that since pixel optimization directly operates in the pixel space, we can use solvers such
as conjugate gradients least squares for linear inverse problems. Let A be the matrix form of the
linear operator A(·). Then, as discussed in the previous subsection, the solution to the optimization
problem in the noiseless setting is given by
\[ \hat{x} = x_0 - \left( A^{+} A x_0 - A^{+} y \right), \tag{19} \]
where A+ = A⊤ (AA⊤ )−1 and (AA⊤ )−1 can be implemented by conjugate gradients. In the
presence of measurement noise, we can relax this solution to
\[ \hat{x} = x_0 - \kappa \left( A^{+} A x_0 - A^{+} y \right), \tag{20} \]
where κ ∈ (0, 1) is a hyperparameter that reduces the impact of the noisy component of the
measurements on x0 (the initial image before optimization, for which we use D(ẑ0 (zt ))).
We use this technique for CT reconstruction, where the forward operator A is the Radon transform
and A⊤ is the unfiltered back projection. However, we can use this conjugate gradient technique
for any linear inverse problem where the matrix A is available.
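For completeness, here is a sketch of the relaxed projection in Equation (20) with the (AA⊤)⁻¹ term applied via a textbook conjugate gradient routine; the dense matrix A is purely illustrative (in practice the operator and its adjoint, e.g., the Radon transform and back projection, would be applied implicitly).

```python
import torch

def conjugate_gradient(apply_M, b, n_iters=50, eps=1e-10):
    # Solve M v = b for a symmetric positive-definite M given as a mat-vec callable.
    v = torch.zeros_like(b)
    r = b - apply_M(v)
    p = r.clone()
    rs_old = torch.dot(r, r)
    for _ in range(n_iters):
        Mp = apply_M(p)
        alpha = rs_old / (torch.dot(p, Mp) + eps)
        v = v + alpha * p
        r = r - alpha * Mp
        rs_new = torch.dot(r, r)
        if rs_new ** 0.5 < eps:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return v

def relaxed_projection(x0, y, A, kappa=0.9, n_iters=50):
    # Eq. (20): x_hat = x0 - kappa * (A^+ A x0 - A^+ y), with A^+ = A^T (A A^T)^{-1}
    # and (A A^T)^{-1} applied through conjugate gradients.
    apply_AAt = lambda v: A @ (A.T @ v)
    pinv = lambda u: A.T @ conjugate_gradient(apply_AAt, u, n_iters)   # computes A^+ u
    return x0 - kappa * (pinv(A @ x0) - pinv(y))
```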
C IMPLEMENTATION DETAILS
In this section, we discuss the choices of the hyperparameters used in all of our experiments for our
algorithm. All experiments are implemented in PyTorch on NVIDIA GPUs (A100 and A40).
For organizational purposes, we tabulate all of the hyperparameters associated with ReSample in Ta-
ble 15. The parameter for the number of times to perform hard data consistency is not included in
the table, as we did not have any explicit notation for it. For experiments across all natural images
Notation   Definition
τ          Hyperparameter for early stopping in hard data consistency
σt         Variance scheduling for the resampling technique
T          Number of DDIM or DDPM steps

Table 15: Summary of the hyperparameters with their respective notations for ReSample.
For experiments across all natural image datasets (LSUN-Bedroom, FFHQ, CelebA-HQ), we used the same hyperparameters, as they empirically gave the best results across all of them.
For T, we used T = 500 DDIM steps. For hard data consistency, we first split T into three even sub-intervals, where the first stage refers to the sub-interval closest to t = T and the third stage refers to the interval closest to t = 0. During the second stage, we performed pixel optimization, whereas in the third stage we performed latent optimization for hard data consistency, as described in Section B. We performed this optimization every 10 iterations of t.
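The following is an illustrative sketch of this three-stage schedule and the skip of 10 steps; the helper name and return convention are ours and are not taken from the released code.

```python
# T = 500 DDIM steps, split into three even stages.
T = 500

def hard_consistency_mode(t: int) -> str:
    """Which hard data consistency (if any) is applied at reverse step t."""
    if t > 2 * T // 3:                          # first stage (closest to t = T): none
        return "none"
    mode = "pixel" if t > T // 3 else "latent"  # second vs. third stage
    return mode if t % 10 == 0 else "none"      # applied every 10 iterations of t
```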
We set τ = 10−4, which seemed to give the best results for noisy inverse problems, with a maximum of 2000 iterations for pixel optimization and 500 for latent optimization (whichever convergence criterion was met first). For the variance hyperparameter σt in the stochastic resample step,
we chose an adaptive schedule of
σt² = γ · (1 − ᾱt−1)/ᾱt · (1 − ᾱt/ᾱt−1),
as discussed in Section A. Generally, we see that γ = 40 returns the best results for experiments on
natural images.
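A minimal sketch of this adaptive schedule is given below, assuming alpha_bar is a 1-D tensor of cumulative products ᾱt indexed by timestep; the indexing convention and function name are illustrative.

```python
import torch

def sigma_t_squared(alpha_bar: torch.Tensor, t: int, gamma: float = 40.0) -> torch.Tensor:
    # σ_t² = γ · (1 − ᾱ_{t−1})/ᾱ_t · (1 − ᾱ_t/ᾱ_{t−1})
    return gamma * (1 - alpha_bar[t - 1]) / alpha_bar[t] * (1 - alpha_bar[t] / alpha_bar[t - 1])
```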
Since LDMs for medical images are not as readily available as those for natural images, we largely had to fine-tune existing models, which we discuss in more detail below.
Backbone Models. For the backbone latent diffusion model, we use the pre-trained models from latent diffusion (Rombach et al., 2022). We select the VQ-4 autoencoder and the FFHQ-LDM and CelebA-LDM as our backbone LDMs. Starting from the pre-trained checkpoints provided by Rombach et al. (2022), we fine-tuned the models on 2000 CT images for 100K iterations with a learning rate of 10−5.
Inference. For T, we used a total of T = 1000 DDIM steps. For hard data consistency, we split T into three sub-intervals: t > 750, 300 < t ≤ 750, and t ≤ 300. During the second stage, we performed pixel optimization using conjugate gradients as discussed previously, with 50 iterations and κ = 0.9. In the third stage, we performed latent optimization with τ set to the estimated noise level, τ = 10−4. We set the skip step size to 10 and γ = 40, with σt the same as in the experiments on natural images.
Latent-DPS. For the Latent-DPS baseline, we use T = 1000 DDIM steps for CT reconstruction and T = 500 DDIM steps for the natural image experiments. Let ζt denote the learning rate. We use ζt = 2.5ᾱt for medical images and ζt = 0.5ᾱt for natural images. Empirically, we observe that our proposed ζt step size schedule gives the best performance and is robust to changes in scale, as previously discussed.
DPS and MCG. For DPS, we use the original DPS codebase provided by Chung et al. (2023a)
and pre-trained models trained on CelebA and FFHQ training sets for natural images. For medical
images, we use the pretrained checkpoint from Chung et al. (2022) on the dataset provided by
Moen et al. (2021). For MCG (Chung et al., 2022), we modified the MCG codebase by deleting
the projection term and tuning the gradient term for running DPS experiments on CT reconstruction.
Otherwise, we directly used the codes provided by Chung et al. (2022) for both natural and medical
image experiments.
DDRM. For DDRM, we follow the original code provided by Kawar et al. (2022) with DDPM models trained on the FFHQ and CelebA training sets, adopted from the repository provided by Dhariwal & Nichol (2021b). We use the default parameters as described by Kawar et al. (2022).
DMPS. We follow the original code from the repository of Meng & Kabashima (2022) with DDPM models trained on the FFHQ and CelebA training sets, adopted from the repository of Dhariwal & Nichol (2021b). We use the default parameters as described by Meng & Kabashima (2022).
PSLD. We follow the original code from the repository of Rout et al. (2023) with the pre-trained LDMs on the CelebA and FFHQ datasets provided by Rombach et al. (2022). We use the default hyperparameters specified by Rout et al. (2023).
ADMM-PnP and Other (Supervised) Baselines. For ADMM-PnP, we use 10 iterations with τ tuned for each inverse problem: τ = 5 for CT reconstruction, τ = 0.1 for linear inverse problems, and τ = 0.075 for nonlinear deblurring. We use the pre-trained model from the original DnCNN repository provided by Zhang et al. (2017). We observe that ADMM-PnP tends to over-smooth the images with more iterations, which causes performance degradation. For FBP-UNet, we trained a UNet that maps FBP images to ground truth images; the network architecture is the same as the one described by Jin et al. (2017).
D ADDITIONAL ANALYSIS ON LATENT-DPS
In this section, we provide further explanation of why Latent-DPS often fails to give accurate reconstructions.
Previously, we claimed that by using Latent-DPS, it is likely that we converge to a local minimum
of the function ∥y − A(D(ẑ0 ))∥22 and hence cannot achieve accurate measurement consistency.
Here, we validate this claim by comparing the measurement consistency loss between Latent-DPS
and ReSample. We observe that ReSample is able to achieve better measurement consistency than
Latent-DPS. This observation validates our motivation of using hard data consistency to improve
reconstruction quality.
Table 16: Comparison of average measurement consistency loss for CT reconstruction between ReSample and Latent-DPS.
We hypothesize that one reason Latent-DPS fails to give accurate reconstructions could be the nonlinearity of the decoder D(·). More specifically, the derivation of DPS provided by Chung et al. (2023a) relies on a linear manifold assumption. In our case, since the forward model can be viewed as A(D(·)), where the decoder D(·) is a highly nonlinear neural network, this linear manifold assumption fails to hold. For example, if two latent vectors z(1) and z(2) lie on the clean data manifold M at t = 0, a linear combination az(1) + bz(2) for some constants a and b may not belong to M, since D(az(1) + bz(2)) may not yield a realistic image.
Thus, in practice, we observe that DPS reconstructions tend to be more blurry, which implies that
the reverse sampling path falls out of this data manifold. We demonstrate this in Figure 19, where
we show that the average of two latent vectors gives a blurry and unrealistic image.
We would like to point out that this reasoning may also explain why our algorithm outperforms DPS
on nonlinear inverse tasks.
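A hedged sketch of the kind of experiment behind Figure 19 is given below: decoding a convex combination of two latent vectors that each decode to realistic images. Here decoder, z1, and z2 are assumed handles for the LDM decoder and two latents, and the weighting is illustrative.

```python
import torch

def decode_average(decoder, z1: torch.Tensor, z2: torch.Tensor, w: float = 0.5) -> torch.Tensor:
    # Decode a convex combination of two latents; with a nonlinear decoder the
    # result is typically blurry or unrealistic, illustrating why the linear
    # manifold assumption breaks down.
    with torch.no_grad():
        return decoder(w * z1 + (1 - w) * z2)
```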
In this section, we discuss how inaccurate estimates of ẑ0(zt) (i.e., the posterior mean via Tweedie's formula) can be one of the reasons why Latent-DPS returns noisy image reconstructions. This is mainly because, for values of t closer to t = T, the estimate of ẑ0(zt) may be inaccurate, leading to images that are noisy at t = 0. Generally, we find that the estimate of ẑ0(zt) is
inaccurate in the early timesteps (e.g., t > 0.5T) and varies considerably as t decreases. This implies that the gradient update may not point in a consistent direction when t is large. We further demonstrate this observation in Figure 20.
Figure 20: Comparison of the prediction of the ground truth signal ẑ0 (zt ) for different values of t.
Left: ẑ0 (zt ) when t = 0.5T . Right: ẑ0 (zt ) when t = 0. This serves to show the estimation error of
the posterior mean for large values of t (i.e. when t is closer to pure noise).
E RELATED WORKS
Deep neural networks have been extensively employed as priors for solving inverse problems (Han &
Ye, 2018; Bora et al., 2017; Zhu et al., 2018; Gupta et al., 2018). Numerous works focus on learning
the mapping between measurements and clean images, which we term supervised methods (Han
& Ye, 2018; Zhu et al., 2018; Wei et al., 2020; Liang et al., 2021). Supervised methods necessitate
training on pairs of measurements and clean images, requiring model retraining for each new task.
On the other hand, another line of research aims at learning the prior distribution of ground truth
images, solving inverse problems at inference time without retraining by using the pre-trained prior
distributions (Bora et al., 2017; Jalal et al., 2021; Hussein et al., 2020; Lempitsky et al., 2018). We
categorize this as unsupervised methods.
Until the advent of diffusion models, unsupervised methods struggled to achieve satisfactory re-
construction quality compared to supervised methods (Jalal et al., 2021). These methods rely on optimization within a constrained space (Bora et al., 2017), and constructing a space that accurately encodes the ground truth data distribution is itself challenging.
However, with the accurate approximation of the prior distribution now provided by diffusion mod-
els, unsupervised methods can outperform supervised methods for solving inverse problems more
efficiently without retraining the model (Song et al., 2021; Jalal et al., 2021).
For unsupervised approaches using diffusion models as priors, the plug-and-play approach has been
widely applied for solving linear inverse problems (Song et al., 2021; Kawar et al., 2022; Wang et al.,
2022). These methods inject the measurements onto the noisy manifold or incorporate an estimate of
the ground truth image into the reverse sampling procedure. We refer to these methods as hard
data consistency approaches, as they directly inject the measurements into the reverse sampling pro-
cess. However, to the best of our knowledge, only a few of these methods extend to nonlinear inverse
problems or even latent diffusion models (Rombach et al., 2022). Other approaches focus on approx-
imating the conditional score (Dhariwal & Nichol, 2021b; Chung et al., 2023a; 2022) under some
mild assumptions using gradient methods. While these methods can achieve excellent performance
and be extended to nonlinear inverse problems, measurement consistency may be compromised, as
demonstrated in our paper. Since these methods apply data consistency only through a gradient in
the reverse sampling process, we refer to these methods as soft data consistency approaches.
Recently, there has been growing interest in modeling the data distribution with diffusion models in a latent space or a constrained domain (Liu et al., 2023; Rombach et al., 2022; Lou & Ermon, 2023b; Fishman et al., 2023). There has also been growing interest in solving inverse problems
using latent diffusion models. In particular, Rout et al. (2023) proposed PSLD, an unsupervised soft
approach that adds a gradient term at each reverse sampling step to solve linear inverse problems
with LDMs. However, some works observe that this approach may suffer from instability due to the estimation of the gradient term (Chung et al., 2023c). Our work proposes an unsupervised hard
approach that involves “resampling” and enforcing hard data consistency for solving general inverse
problems (both linear and nonlinear) using latent diffusion models. By using a hard data consistency
approach, we can obtain much better reconstructions, highlighting the effectiveness of our algorithm.
F PROOFS
Notation. We denote scalars with lower-case letters (e.g., α) and vectors with bold lower-case letters (e.g., x). Recall that in the main body of the paper, z ∈ Rk denotes a sample in the latent space, zt′ denotes an unconditional sample at time step t, ẑ0(zt) denotes a prediction of the ground truth signal z0 at time step t, ẑ0(y) denotes the measurement-consistent sample of ẑ0(zt) obtained via hard data consistency, and ẑt denotes the re-mapped sample from ẑ0(y) onto the data manifold at time step t. We use ẑt as the next sample to resume the reverse diffusion process.
Proposition 1 (Stochastic Encoding). Since the sample ẑt is conditionally independent of y given ẑ0(y), we have that

p(ẑt | ẑ0(y), y) = p(ẑt | ẑ0(y)) = N(√ᾱt ẑ0(y), (1 − ᾱt)I). (21)
Proof. By Tweedie's formula, we have that ẑ0(y) is the estimated mean of the ground truth signal z0. By the forward process of the DDPM formulation (Ho et al., 2020), we also have that

p(ẑt | ẑ0(y)) = N(√ᾱt ẑ0(y), (1 − ᾱt)I).
Then, since y is a measurement of z0 at t = 0, we have p(y | ẑ0(y), ẑt) = p(y | ẑ0(y)). Finally, we get

p(ẑt | ẑ0(y), y) = p(y | ẑt, ẑ0(y)) p(ẑt | ẑ0(y)) p(ẑ0(y)) / p(y, ẑ0(y)) (22)
                 = p(y, ẑ0(y)) p(ẑt | ẑ0(y)) / p(y, ẑ0(y)) (23)
                 = p(ẑt | ẑ0(y)). (24)
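For reference, a one-line sketch of drawing the stochastically encoded sample according to Eq. (21) could look as follows, where z0_y denotes ẑ0(y) and alpha_bar is an assumed tensor of ᾱt values indexed by timestep.

```python
import torch

def stochastic_encode(z0_y: torch.Tensor, alpha_bar: torch.Tensor, t: int) -> torch.Tensor:
    # z̃_t ~ N(√ᾱ_t ẑ_0(y), (1 − ᾱ_t) I), cf. Eq. (21)
    noise = torch.randn_like(z0_y)
    return alpha_bar[t] ** 0.5 * z0_y + (1 - alpha_bar[t]) ** 0.5 * noise
```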
Proposition 2 (Stochastic Resampling). Suppose that p(zt′ | ẑt, ẑ0(y), y) is normally distributed such that p(zt′ | ẑt, ẑ0(y), y) = N(µt, σt²). If we let p(ẑt | ẑ0(y), y) be a prior for µt, then the posterior distribution p(ẑt | zt′, ẑ0(y), y) is given by

p(ẑt | zt′, ẑ0(y), y) = N( (σt² √ᾱt ẑ0(y) + (1 − ᾱt) zt′) / (σt² + (1 − ᾱt)), (σt² (1 − ᾱt)) / (σt² + (1 − ᾱt)) I ). (25)

This is a Gaussian distribution, which can be easily shown using moment-generating functions (Bromiley, 2013).
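A minimal sketch of the corresponding stochastic resampling step, drawing ẑt from the posterior in Eq. (25), could look as follows; the argument names are illustrative and all inputs are assumed to be scalars or tensors of matching shape.

```python
import torch

def stochastic_resample(z0_y: torch.Tensor, zt_prime: torch.Tensor,
                        alpha_bar_t: float, sigma_t_sq: float) -> torch.Tensor:
    # Posterior of Eq. (25): combine the measurement-consistent prediction ẑ_0(y)
    # with the unconditional sample z_t′.
    var_t = 1 - alpha_bar_t
    mean = (sigma_t_sq * alpha_bar_t ** 0.5 * z0_y + var_t * zt_prime) / (sigma_t_sq + var_t)
    var = sigma_t_sq * var_t / (sigma_t_sq + var_t)
    return mean + var ** 0.5 * torch.randn_like(z0_y)
```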
Theorem 1. If ẑ0(y) is measurement-consistent such that y = A(D(ẑ0(y))), i.e., ẑ0 = ẑ0(zt+1) = ẑ0(y), then stochastic resampling is unbiased in the sense that E[ẑt | y] = E[zt′].
Proof. We have that E[ẑt |y] = Ezt′ [Eẑ0 (y) [Eẑt [ẑt |zt′ , ẑ0 (y), y]]].
Let γ = σt² / (σt² + 1 − ᾱt). By using Proposition 2, we have E[ẑt | y] = E_{zt′}[E_{ẑ0}[(γ√ᾱt ẑ0 + (1 − γ) zt′) | ẑ0, zt′, y]].
Since ẑ0 is measurement-consistent such that y = A(D(ẑ0(y))), letting k = −1, we have

ẑ0(y) = ẑ0 = (1/√ᾱt−k) (zt−k′ + (1 − ᾱt−k) ∇ log p(zt−k′)). (30)

Then, we have that

E[ẑt | y] = E_{zt′}[E_{ẑ0}[(γ√ᾱt ẑ0 + (1 − γ) zt′) | ẑ0, zt′, y]] (31)
          = γ √(ᾱt/ᾱt−k) E_{zt′}[E[zt−k′ + (1 − ᾱt−k) ∇ log p(zt−k′) | zt′]] + (1 − γ) E[zt′], (32)

as both zt′ and zt−k′ are unconditional samples and independent of y. Now, we have

E_{zt′}[E[zt−k′ + (1 − ᾱt−k) ∇ log p(zt−k′) | zt′]] = E_{zt−k′}[zt−k′ + (1 − ᾱt−k) ∇ log p(zt−k′)]. (33)

Since zt′ is the unconditional reverse sample of zt−k′, we have E[zt′] = √(ᾱt/ᾱt−k) E[zt−k′], and then

E_{zt−k′}[∇ log p(zt−k′)] = ∫ ∇ log p(zt−k′) p(zt−k′) dzt−k′ (34)
                          = ∫ (∇p(zt−k′) / p(zt−k′)) p(zt−k′) dzt−k′ (35)
                          = ∂(1)/∂zt−k′ = 0. (36)
Finally, we have E[ẑt |y] = γE[zt′ ] + (1 − γ)E[zt′ ] = E[zt′ ].
Lemma 1. Let z̃t and ẑt denote the stochastically encoded and resampled image of ẑ0 (y), respec-
tively. If VAR(zt′ ) > 0, then we have that VAR(ẑt ) < VAR(z̃t ).
Proof. Recall that ẑt and z̃t are both normally distributed, with

VAR(z̃t) = 1 − ᾱt, (37)
VAR(ẑt) = 1 / (1/(1 − ᾱt) + 1/σt²) (38)
        = σt²(1 − ᾱt) / (σt² + (1 − ᾱt)). (39)

For all σt² ≥ 0, we have

σt²(1 − ᾱt) / (σt² + (1 − ᾱt)) < 1 − ᾱt (40)
=⇒ VAR(ẑt) < VAR(z̃t). (41)
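As a quick sanity check of Lemma 1, the following snippet verifies the variance inequality numerically for a few illustrative values of σt² and ᾱt.

```python
# Pure-Python check of Lemma 1 for a few illustrative values.
for sigma_sq in [0.1, 1.0, 10.0, 100.0]:
    for alpha_bar in [0.1, 0.5, 0.9]:
        var_encode = 1 - alpha_bar                                                # VAR(z̃_t)
        var_resample = sigma_sq * (1 - alpha_bar) / (sigma_sq + (1 - alpha_bar))  # VAR(ẑ_t)
        assert var_resample < var_encode
```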
Theorem 2. Let z0 denote a sample from the data distribution and zt be a sample from the noisy perturbed distribution at time t. Given that the score function ∇zt log pzt(zt) is bounded, we have

Cov(z0 | zt) = ((1 − ᾱt)²/ᾱt) ∇²zt log pzt(zt) + ((1 − ᾱt)/ᾱt) I,

where αt ∈ (0, 1) is a decreasing sequence in t.
Proof. Define the scaled variable z̄t = (√ᾱt/(1 − ᾱt)) zt. We want to separate the term involving only z̄t from the term involving only z0 in order to apply Tweedie's formula. Hence, letting

p0(z̄t) = (2π · ᾱt/(1 − ᾱt))^{−d/2} exp( −∥z̄t∥² / (2 ᾱt/(1 − ᾱt)) ),

we obtain

p(z̄t | z0) = p0(z̄t) exp( z̄t⊤ z0 − (ᾱt/(2(1 − ᾱt))) ∥z0∥² ),

which separates the interaction term z̄t⊤ z0 from the ∥z0∥² term.
Let λ(z̄t) = log p(z̄t) − log p0(z̄t). Then, by Tweedie's formula, we have E[z0 | z̄t] = ∇λ(z̄t) and Cov(z0 | z̄t) = ∇²λ(z̄t). Since p0(z̄t) is a Gaussian distribution with mean 0 and variance equal to ᾱt/(1 − ᾱt), we obtain

∇λ(z̄t) = ∇ log p(z̄t) + ((1 − ᾱt)/ᾱt) z̄t.
We observe that since z̄t is a scaled version of zt, we can obtain the distribution of z̄t as

p(z̄t) = ((1 − ᾱt)/√ᾱt)^d · p(((1 − ᾱt)/√ᾱt) · z̄t).

Then, we can apply the chain rule to first compute the gradient with respect to zt and then account for the scaling of z̄t. As a result, we get

∇z̄t log p(z̄t) = ((1 − ᾱt)/√ᾱt) ∇zt log p(zt),

and then

∇λ(z̄t) = E[z0 | zt] = ((1 − ᾱt)/√ᾱt) ∇ log p(zt) + (1/√ᾱt) zt,
which is consistent with the score function used by Chung et al. (2023a). Afterwards, we take the gradient again with respect to z̄t and apply the chain rule once more, which gives

∇²λ(z̄t) = Cov(z0 | zt) = ((1 − ᾱt)²/ᾱt) ∇² log p(zt) + ((1 − ᾱt)/ᾱt) I.

This gives the desired result.