Diffusion Models, Image Super-Resolution and Everything: A Survey

This survey explores the impact of Diffusion Models (DMs) on the field of image Super-Resolution (SR), highlighting their ability to generate high-quality images that align closely with human perceptual preferences. It discusses the challenges posed by DMs, such as computational demands and lack of explainability, while providing a comprehensive overview of their theoretical foundations, methodologies, and current research trends. The document aims to demystify DMs and inspire future innovations in image SR by analyzing existing strengths and weaknesses in the context of emerging technologies.

Diffusion Models, Image Super-Resolution And Everything: A Survey

Brian B. Moser¹,², Arundhati S. Shanbhag¹,², Federico Raue¹, Stanislav Frolov¹,², Sebastian Palacio¹, Andreas Dengel¹,²
¹ German Research Center for Artificial Intelligence (DFKI), Germany
² Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau, Germany

first.second@dfki.de
arXiv:2401.00736v3 [cs.CV] 23 Jun 2024

Abstract: Diffusion Models (DMs) have disrupted the image Super-Resolution (SR) field and further closed the gap between image quality and human perceptual preferences. They are easy to train and can produce very high-quality samples that exceed the realism of those produced by previous generative methods. Despite their promising results, they also come with new challenges that need further research: high computational demands, comparability, lack of explainability, color shifts, and more. Unfortunately, entry into this field is overwhelming because of the abundance of publications. To address this, we provide a unified recount of the theoretical foundations underlying DMs applied to image SR and offer a detailed analysis that underscores the unique characteristics and methodologies within this domain, distinct from broader existing reviews in the field. This survey articulates a cohesive understanding of DM principles and explores current research avenues, including alternative input domains, conditioning techniques, guidance mechanisms, corruption spaces, and zero-shot learning approaches. By offering a detailed examination of the evolution and current trends in image SR through the lens of DMs, this survey sheds light on the existing challenges and charts potential future directions, aiming to inspire further innovation in this rapidly advancing area.

Index Terms: Super-Resolution, Diffusion Models, Survey.

I. INTRODUCTION

In the ever-evolving field of computer vision, the task of image Super-Resolution (SR) – enhancing Low-Resolution (LR) images into High-Resolution (HR) counterparts – stands as a longstanding challenge due to its ill-posed nature: Multiple HR images are plausible for any given LR image, differing in aspects such as brightness and color [1]. Its applications span a broad spectrum, from everyday photography to refining satellite [2] and medical images [3]. Despite notable achievements of prior generative SR models, each comes with its own limitations. For example, the computational demands of Autoregressive models often outweigh their utility, while Normalizing Flows or Variational Autoencoders struggle to match quality expectations [4]–[6]. Although powerful, Generative Adversarial Networks (GANs) need careful regularization and optimization strategies to overcome instability issues [7].

The advent of Diffusion Models (DMs) marks a significant shift in image generation tasks, including SR, challenging the long-standing dominance of Generative Adversarial Networks (GANs) [8]–[11]. Applications like Dall-E and Stable Diffusion demonstrate that DMs have surpassed GANs in various aspects [12]–[14]. Their capability to generate high-quality images from LR inputs has shown immense promise in SR by closely aligning with the qualitative judgments of human evaluators [15]. In other words, human raters perceive SR images generated by DMs as more realistic than those produced by other generative models like GANs.

However, as the volume of publications expands, staying updated on the latest developments is becoming more challenging, particularly for those new to the field. DMs diverge fundamentally from prior generative models and pose new challenges while addressing the limitations of earlier models. Identifying coherent trends and potential research directions is challenging despite this rapid expansion. This survey aims to demystify DMs, offers a comprehensive overview that bridges foundational concepts with the forefront of image SR and critically analyzes current strengths and weaknesses.

The presented survey builds upon the previous work Hitchhiker’s Guide to Super-Resolution [16], which gives a broad overview of the image SR field in general. Similar in spirit is the survey of Li et al., which reviews diffusion models on the more general image restoration tasks like inpainting and dehazing [17]. Both have overlapping topics, such as the foundations and types of DMs, namely DDPMs [8], SGMs [11], and SDEs [10]. Moreover, both surveys highlight the introduction of conditioning strategies and zero-shot diffusion as well as show potential research directions. However, our survey covers all topics related to image SR and is, therefore, more detailed regarding recent developments specifically developed for image SR. Moreover, we explain SR-related challenges, like color shifting and cascaded image SR. We also highlight the relationship of DMs with other generative SR models, namely Variational Autoencoders, GANs, and Flow-based methods. In addition, we review frequency-based DMs, alternative corruption spaces, and diffusion-based image SR applications.

Concluding with a discussion on emerging trends and their potential for reshaping SR and DM development, this survey sets the stage for future research. By offering clarity and direction in the rapidly evolving domain of DMs, we aim to inspire and inform the next wave of research, fostering advancements that continue to push the boundaries of what is possible in image SR with DMs.

The structure of this paper is organized as follows:
Section II - Super-Resolution Basics: This section provides fundamental definitions and introduces standard datasets, methods, and metrics for assessing image quality commonly utilized in image SR publications.
Section III - Diffusion Models Basics: Introduces the principles and various formulations of DMs, including Denoising Diffusion Probabilistic Models (DDPMs), Score-based Generative Models (SGMs), and Stochastic Differential Equations (SDEs). This section also explores how DMs relate to other generative models.
Section IV - Improvements for Diffusion Models: Common practices for enhancing DMs, focusing on efficient sampling techniques and improved likelihood estimation.
Section V - Diffusion Models for Image SR: Presents concrete realizations of DMs in SR, explores alternative domains (latent space and wavelet domain), discusses architectural designs and multiple tasks with Null-Space Models, and examines alternative corruption spaces.
Section VII - Domain-Specific Applications: DM-based SR applications, namely medical imaging, blind face restoration, atmospheric turbulence in face SR, and remote sensing.
Section VIII - Discussion and Future Work: Common problems of DMs for image SR and noteworthy research avenues for DMs specific to image SR.
Section IX - Conclusion: Summarizes the survey.

II. IMAGE SUPER-RESOLUTION

The goal of image Super-Resolution (SR) is to transform one or more Low-Resolution (LR) images into High-Resolution (HR) images. The domain can be broadly categorized into two areas [16]: Single Image Super-Resolution (SISR) and Multi-Image Super-Resolution (MISR). In SISR, a single LR image leads to a single HR image. In contrast, MISR methods use multiple LR images to produce one or many HR outputs. This section focuses on SISR and explores relevant datasets, established SR models, and techniques to assess image quality.

Given a LR image $x \in \mathbb{R}^{\bar{w} \times \bar{h} \times c}$, the goal is to generate a HR counterpart $y \in \mathbb{R}^{w \times h \times c}$ with $\bar{w} < w$ and $\bar{h} < h$. The relationship is represented by a degradation mapping

$$x = \mathcal{D}(y; \Theta) = \left((y \otimes k) \downarrow_s + n\right)_{\mathrm{JPEG}_q} \tag{1}$$

where $\mathcal{D}$ is a degradation map $\mathcal{D} : \mathbb{R}^{w \times h \times c} \to \mathbb{R}^{\bar{w} \times \bar{h} \times c}$ and $\Theta$ contains degradation parameters, including aspects like blur $k$, noise $n$, scaling $s$, and compression quality $q$ [18]. The degradation is typically unknown, posing the main challenge in determining the inverse mapping of $\mathcal{D}$ with parameters $\theta$, usually embodied as an SR model. It leads to an optimization task aimed at minimizing the difference between the predicted SR image $\hat{y}$ and the original HR image $y$:

$$\theta^* = \arg\min_{\theta} \mathcal{L}(\hat{y}, y) + \lambda \phi(\theta), \tag{2}$$

where $\mathcal{L}$ represents the loss between the predicted SR image and the actual HR image. Here, $\lambda$ is a balancing parameter, while $\phi(\theta)$ is introduced as a regularization term.

The inherent complexity arises from the ill-posed nature of predicting $\theta$, as several SR images can be valid for any given LR image, i.e., they can have similar loss values compared to the ground-truth image but are subjectively perceived differently due to many aspects like brightness and coloring [1], [15], [19]. Traditional regression techniques, like standard CNNs, are often adequate for lower magnifications but struggle to replicate high-frequency details required at higher magnifications (e.g., $s > 4$). To address this, SR models must hallucinate realistic details beyond interpolation, which typically falls under the umbrella topic of generative models, where diffusion models are now at the forefront.

A. Datasets

Several datasets offer a variety of images, resolutions, and content types. Typically, these datasets consist of LR and HR image pairs. However, some datasets contain only HR images, with LR images created by bicubic downsampling with anti-aliasing - a default setting for imresize in MATLAB [20]. One famous general SR train set is the Diverse 2K resolution (DIV2K) dataset [21], which includes various realistic images at different resolutions designed specifically for image SR. Classical test datasets for SR models trained on DIV2K are Set5 [22], Set14 [23], BSDS100 [24], Urban100 [25], and Manga109 [26], which cover a variety of scenes and image contents like buildings and manga paintings. Flickr2K [27] and Flickr-Faces-HQ (FFHQ) [28] offer diverse sets of scene-centric and human-centric images from Flickr, respectively. While FFHQ is commonly employed for training models for face SR tasks, Flickr2K is usually used as a train data extension in combination with DIV2K. Another dataset for face SR is CelebA-HQ [29], which provides high-quality celebrity images and is typically used to evaluate FFHQ-trained SR models. For broader applications in CV, datasets like ImageNet [30] and Visual Object Classes (VOC2012) [31] are favored. ImageNet offers an extensive range of images that help train models on various object classes, whereas VOC2012 is vital for object detection and segmentation. Both are valuable for multi-task learning involving SR. More datasets can be found in the Hitchhiker’s Guide to Super-Resolution [16].

B. SR Models

The primary objective is to design a SR model $\mathcal{M} : \mathbb{R}^{\bar{w} \times \bar{h} \times c} \to \mathbb{R}^{w \times h \times c}$, such that it inverts Equation 1:

$$\hat{y} = \mathcal{M}(x; \theta), \tag{3}$$

where $\hat{y}$ is the predicted HR approximation of the LR image $x$ and $\theta$ the parameters of $\mathcal{M}$. The parameters $\theta$ are optimized using Equation 2, i.e., minimizing the loss function $\mathcal{L}$ between the estimation $\hat{y}$ and the ground-truth HR image $y$. The following section focuses on standard methods for designing an SR model, especially deep learning methods, before we examine how diffusion models fulfill this role in detail.

Traditional Methods: Traditional methods for image SR define a range of methodologies, such as statistical [32], edge-based [33], [34], patch-based [35], [36], prediction-based [37], [38], and sparse representation techniques [39]. They fundamentally rely on image statistics and the information inherent in existing pixels to generate HR images. Despite
their utility, a noteworthy drawback of these methods is the potential introduction of noise, blur, and visual artifacts [16].

Regression-based Deep Learning: Image SR significantly evolved with advancements in deep learning and computational power. Typically, these methods employ a Convolutional Neural Network (CNN) for end-to-end mapping from LR to HR. Initial models, such as SRCNN [40], FSRCNN [41], and ESPCN [42], utilized simple CNNs of diverse depth and feature map sizes. Later models adapted concepts from the broader CV domain into SR models, e.g., ResNet led to SRResNet, where residual information was propagated to successive network layers [43]. Likewise, DenseNet [44] was adapted with SRDenseNet [45]. They employ dense blocks, where each layer additionally receives the features generated in all preceding layers. Recursive CNNs that recursively use the same module to learn representations were also inspired by other CV methods for regression-based SR methods in DRCN [46], DRRN [47], and CARN [48]. More recently, attention mechanisms have been incorporated to focus on regions of interest in images, predominantly via the channel and spatial attention mechanisms [16], [49]–[51]. All those methods have in common that they are regression-based. Commonly used loss functions are the L1 and L2 losses. As mentioned, they often produce satisfying results for lower magnifications but struggle to replicate the high-frequency details required at higher magnifications (e.g., $s > 4$). These limitations arise because these models primarily learn an averaged mapping (due to L1 and L2 losses) from LR to HR images, which tends to produce overly smooth textures lacking detail, especially noticeable in larger upscaling factors [16]. To address this, SR models must hallucinate realistic details beyond simple interpolation, a challenge typically tackled by generative models.

Generative Adversarial Networks (GANs): One of the most prominent generative models is the Generative Adversarial Network (GAN). It uses two CNNs: A generator G and a discriminator D, which are trained simultaneously. The generator aims to produce HR samples that are close enough to the original to fool the discriminator, which tries to distinguish between generated and real samples. This framework, e.g., in SRGAN [52] or ESRGAN [53], is optimized using a combination of adversarial loss and content loss to produce less-smoothed images. The resultant images of state-of-the-art GANs are sharper and more detailed. Due to their capability to generate high-quality and diverse images, they have received much attention lately. However, they are susceptible to mode collapse, have a sizeable computational footprint, sometimes fail to converge, and suffer from stabilization issues [7].

Flow-based Methods: Flow-based methods employ optical flow algorithms to generate SR images [54]. They were introduced in an attempt to counter the ill-posed nature of image SR by learning the conditional distribution of plausible HR images given a LR input. They introduce a conditional normalizing flow architecture that aligns LR and HR images by calculating the displacement field between them and then uses this information to recover SR images. They employ a fully invertible encoder capable of mapping any input HR image to the latent flow space and ensuring exact reconstruction. This framework enables the SR model to learn rich distributions using exact log-likelihood-based training [54]. This allows flow-based methods to circumvent training instability but incurs a substantial computational cost.

C. Image Quality Assessment (IQA)

Image quality is a multifaceted concept that addresses properties like sharpness, contrast, and absence of noise. Hence, a fair evaluation of SR models based on produced image quality forms a non-trivial task. This section presents the essential methods, especially for diffusion models, to assess image quality in the context of image SR, which fall under the umbrella term Image Quality Assessment (IQA)¹. At its core, IQA refers to any metric that resembles the perceptual evaluations from human observers, specifically, the level of realism perceived in an image after the application of SR techniques. During this section, we will use the following notation: $N_x = w \cdot h \cdot c$, which defines the number of pixels of an image $x \in \mathbb{R}^{w \times h \times c}$, and $\Omega_x = \{(i, j, k) \in \mathbb{N}_1^3 \mid i \le h, j \le w, k \le c\}$, which defines the set of all valid positions in $x$.

Peak Signal-to-Noise Ratio (PSNR): The Peak Signal-to-Noise Ratio (PSNR) is one of the most widely used techniques to evaluate SISR reconstruction quality. It represents the ratio between the maximum pixel value $L$ and the Mean Squared Error (MSE) between the SR image $\hat{y}$ and the HR image $y$.

$$\mathrm{PSNR}(y, \hat{y}) = 10 \cdot \log_{10}\left(\frac{L^2}{\frac{1}{N} \sum_{i=1}^{N} [y - \hat{y}]^2}\right) \tag{4}$$

Despite being one of the most popular IQA methods, it does not accurately match human perception [15]. It focuses on pixel differences, which can often be inconsistent with the subjectively perceived quality: the slightest shift in pixels can result in worse PSNR values while not affecting human perceptual quality. Due to its pixel-level calculation, models trained with correlated pixel-based loss tend to achieve high PSNR values [16], whereas generative models tend to produce lower PSNR values [15].

Structural Similarity Index (SSIM): The SSIM, like the PSNR, is a popular evaluation method that focuses on the differences in structural features between images. It independently captures the structural similarity by comparing luminance, contrast, and structures. For an image $y$, SSIM estimates the luminance $\mu_y$ as the mean of the intensity and the contrast $\sigma_y$ as its standard deviation:

$$\mu_y = \frac{1}{N_y} \sum_{p \in \Omega_y} y_p, \tag{5}$$

$$\sigma_y = \sqrt{\frac{1}{N_y - 1} \sum_{p \in \Omega_y} [y_p - \mu_y]^2} \tag{6}$$

To capture the similarity between the computed entities, the authors introduced a comparison function $S$:

$$S(x, y, c) = \frac{2 \cdot x \cdot y + c}{x^2 + y^2 + c}, \tag{7}$$

where $x$ and $y$ are the scalar variables being compared, and $c = (k \cdot L)^2$, $0 < k \ll 1$, is a constant for numerical stability.

¹ More SR-related IQA methods can be found in Moser et al. [16].
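The quantities defined so far (Eq. 4 to Eq. 7) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation of any cited work: the function names (`psnr`, `luminance`, `contrast`, `compare`, `stability_constant`) and the default `max_val=255.0` are hypothetical choices made here, and contrast is computed as the sample standard deviation, as described in the text.

```python
import numpy as np


def psnr(y, y_hat, max_val=255.0):
    """PSNR in the spirit of Eq. 4: 10 * log10(L^2 / MSE).

    max_val plays the role of the maximum pixel value L (assumed 255 here).
    """
    y = np.asarray(y, dtype=np.float64)
    y_hat = np.asarray(y_hat, dtype=np.float64)
    mse = np.mean((y - y_hat) ** 2)
    if mse == 0.0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def luminance(y):
    """Mean intensity over all positions (Eq. 5)."""
    return float(np.mean(np.asarray(y, dtype=np.float64)))


def contrast(y):
    """Sample standard deviation with the 1 / (N - 1) normalization (Eq. 6)."""
    return float(np.std(np.asarray(y, dtype=np.float64), ddof=1))


def stability_constant(k, max_val=255.0):
    """c = (k * L)^2 with 0 < k << 1."""
    return (k * max_val) ** 2


def compare(x, y, c):
    """Comparison function S(x, y, c) from Eq. 7 for two scalars."""
    return (2.0 * x * y + c) / (x ** 2 + y ** 2 + c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hr = rng.integers(0, 256, size=(8, 8, 3)).astype(np.float64)
    sr = np.clip(hr + rng.normal(0.0, 5.0, size=hr.shape), 0.0, 255.0)
    c1 = stability_constant(0.01)  # k = 0.01 is an assumed value
    print("PSNR:", psnr(hr, sr))
    print("luminance comparison:", compare(luminance(hr), luminance(sr), c1))
```

Note that `compare` returns exactly 1 whenever its two scalar arguments are equal, which is why identical images score 1 on each SSIM component.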
For a HR image $y$ and its approximation $\hat{y}$, the luminance ($C_l$) and contrast ($C_c$) comparisons are computed using $C_l(y, \hat{y}) = S(\mu_y, \mu_{\hat{y}}, c_1)$ and $C_c(y, \hat{y}) = S(\sigma_y, \sigma_{\hat{y}}, c_2)$, where $c_1, c_2 > 0$. The empirical covariance

$$\sigma_{y,\hat{y}} = \frac{1}{N_y - 1} \sum_{p \in \Omega_y} (y_p - \mu_y) \cdot (\hat{y}_p - \mu_{\hat{y}}), \tag{8}$$

defines the structure comparison ($C_s$), which is the correlation coefficient between $y$ and $\hat{y}$:

$$C_s(y, \hat{y}) = \frac{\sigma_{y,\hat{y}} + c_3}{\sigma_y \cdot \sigma_{\hat{y}} + c_3}, \tag{9}$$

where $c_3 > 0$. Finally, the SSIM is defined as:

$$\mathrm{SSIM}(y, \hat{y}) = [C_l(y, \hat{y})]^{\alpha} \cdot [C_c(y, \hat{y})]^{\beta} \cdot [C_s(y, \hat{y})]^{\gamma} \tag{10}$$

where $\alpha > 0$, $\beta > 0$, and $\gamma > 0$ are parameters that can be adjusted to tune the relative importance of the components.

Mean Opinion Score (MOS): The MOS is a subjective measure that leverages human perceptual quality for the evaluation of the generated SR images. Human viewers are shown SR images and asked to rate them with quality scores that are then mapped to numerical values and later averaged. Typically, these range from 1 (bad) to 5 (good) but may vary [15]. While this method is a direct evaluation of human perception, it is more time-consuming and cumbersome to conduct compared to objective metrics. Moreover, due to the highly subjective nature of this metric, it is susceptible to bias.

Consistency: Consistency measures the degree of stability of non-deterministic SR methods, such as generative models like GANs or DMs. Like flow-based methods, generative approaches are intentionally designed to generate a spectrum of plausible outputs for the same input. However, low consistency is not desirable. Minor variations lessen the influence of a relatively consistent method in the input. Nevertheless, consistency can vary depending on the requirements. One commonly employed metric to quantify consistency is the Mean Squared Error.

Learned Perceptual Image Patch Similarity (LPIPS): Contrary to the pixel-based evaluation of PSNR and SSIM, the Learned Perceptual Image Patch Similarity (LPIPS) utilizes a pre-trained CNN $\varphi$, e.g., VGG [55] or AlexNet [56], generates $L$ feature maps from the SR and HR image, and subsequently calculates the similarity between them. Given $h_l$ and $w_l$ as the height and width of the $l$-th feature map, respectively, and a scaling vector $\alpha_l \in \mathbb{R}^{C_l}$, the LPIPS metric is formulated as follows:

$$\mathrm{LPIPS}(y, \hat{y}) = \sum_{l=1}^{L} \sum_{p} \frac{\left\| \alpha_l \odot \left( \varphi_l(\hat{y}) - \varphi_l(y) \right)_p \right\|_2^2}{h_l \cdot w_l} \tag{11}$$

LPIPS operates by projecting images into a perceptual feature space through $\varphi$ and evaluating the difference between corresponding patches in SR and HR images, scaled by $\alpha_l$. This methodology allows for a more human-centric evaluation, given that it is better aligned with human perception than traditional metrics such as PSNR and SSIM [16].

No-Reference Metrics: All IQA metrics discussed so far require a reference (ground-truth) image. However, there are cases where no reference images are available, e.g., in unsupervised settings. Fortunately, we can assess an image by measuring the distance of statistical features from those obtained from a collection of high-quality images of a similar domain, i.e., natural images. This can be opinion- and distortion-aware like BRISQUE [57] or opinion- and distortion-unaware like NIQE [58]. Another intriguing way to assess no-reference image quality is to exploit the visual-language pre-trained CLIP model [59]. One example is CLIP-IQA, which calculates the cosine similarity of the encoded image with two prompts of opposing meaning, i.e., "good photo" and "bad photo" [60]. The resulting relative similarity metric for one or the other prompt determines the image quality. CLIP-IQA shows results comparable to those of BRISQUE without the hand-crafted features and surpasses other no-reference IQA methods like NIQE. Another way to exploit deep learning models is to train them to predict subjective scores using IQA datasets like TID2013 [61]. Examples are DeepQA [62], NIMA [63], or MUSIQ [64]. Others can be found in the learning-based perceptual quality section of the Hitchhiker’s Guide to Super-Resolution [16].

III. DIFFUSION MODELS BASICS

Diffusion Models (DMs) have profoundly impacted the realm of generative AI, and many approaches that fall under the umbrella term DM have emerged. What sets DMs apart from earlier generative models is their execution over iterative time steps, both forward and backward in time and denoted by $t$, as depicted in Figure 1. The forward and backward diffusion processes are distinguished by:
Forward $q$ - degrade input data using noise iteratively, forward in time (i.e., $t$ increases).
Backward $p$ - denoise the degraded data, thereby reversing the noise iteratively, backward in time (i.e., $t$ decreases).

The time step $t$ increases during forward diffusion, whereas it propagates towards 0 during backward diffusion. Let $D = \{x_i, y_i\}_{i=1}^{N}$ be a dataset of LR-HR image pairs. For each time step $t$, the random variable $z_t$ describes the current state, a state between the image and corruption space. In the literature, there is no clear distinction between $z_t$ in the forward and $z_t$ in the backward diffusion. During forward diffusion, we assume $z_t \sim q(z_t \mid z_{t-1})$. Conversely, in the backward diffusion, we assume $z_{t-1} \sim p(z_{t-1} \mid z_t)$. We denote by $T$, with $0 < t \le T$, the maximal time step for finite cases. The initial data distribution ($t = 0$) is represented by $z_0 \sim q(x)$, which is then slowly injected with noise (additive). Vice versa, DMs remove noise therein by running a parameterized model $p_\theta(z_{t-1} \mid z_t)$ in the reverse time direction that approximates the ideal (but unattainable) denoised distribution $p(z_{t-1} \mid z_t)$.

The explicit implementation of the forward diffusion $q$ and backward diffusion $p$, approximated by $p_\theta$, is defined by the specific DM in use. There are three types: Two discrete forms, namely Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), and the continuous form by Stochastic Differential Equations (SDEs) [65]. Each of these types is discussed next and shown in Figure 1.
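The time indexing of the two processes can be sketched as two plain loops. This is only an illustration of the direction in which $t$ runs: the helper names (`forward_diffusion`, `backward_diffusion`, `toy_q_step`, `toy_p_step`) and both toy transition kernels are assumptions made for this example and do not correspond to any specific DM. In an actual DM, the forward kernel is fixed by the chosen formulation and the backward step is a learned model $p_\theta$.

```python
import numpy as np


def forward_diffusion(z0, q_step, T, rng):
    """Forward chain z_0 -> z_1 -> ... -> z_T (t increases).

    q_step(z_prev, t, rng) draws z_t given z_{t-1}.
    Returns the whole trajectory [z_0, ..., z_T].
    """
    trajectory = [z0]
    for t in range(1, T + 1):
        trajectory.append(q_step(trajectory[-1], t, rng))
    return trajectory


def backward_diffusion(zT, p_step, T, rng):
    """Backward chain z_T -> z_{T-1} -> ... -> z_0 (t decreases).

    p_step(z_t, t, rng) produces z_{t-1} given z_t.
    """
    z = zT
    for t in range(T, 0, -1):
        z = p_step(z, t, rng)
    return z


def toy_q_step(z_prev, t, rng, noise_std=0.1):
    """Toy forward kernel (assumption): add a little Gaussian noise."""
    return z_prev + noise_std * rng.standard_normal(z_prev.shape)


def toy_p_step(z_t, t, rng):
    """Toy backward step (assumption): a stand-in for a learned denoiser.

    It only shrinks the state and does not use rng; a real p_theta would be
    a trained network that may also sample noise.
    """
    return 0.9 * z_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = np.zeros((4, 4))
    states = forward_diffusion(z0, toy_q_step, T=10, rng=rng)
    restored = backward_diffusion(states[-1], toy_p_step, T=10, rng=rng)
    print(len(states), restored.shape)
```

Passing the transition kernels as arguments mirrors the statement above that the concrete form of $q$ and $p_\theta$ depends on the specific DM in use.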
Fig. 1: Principle of DMs. The forward diffusion adds noise iteratively (red), which translates an image from the image space to the corruption space. The backward diffusion, the iterative refinement process, reverts the process (blue) back to the image space. Shown are three different implementations of DMs, namely Denoising Diffusion Probabilistic Models (DDPMs), Score-based Generative Models (SGMs), and Stochastic Differential Equations (SDEs) with their respective formulation of the forward and backward diffusion.

A. Denoising Diffusion Probabilistic Models (DDPMs)

Denoising Diffusion Probabilistic Models (DDPMs) [8] use two Markov chains to enact the forward and backward diffusion across a finite number of discrete time steps.

Forward Diffusion: It transforms the data distribution into a prior distribution, typically designed manually (e.g., Gaussian), given by:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t \mid \sqrt{1 - \alpha_t}\, z_{t-1}, \alpha_t I\right), \tag{12}$$

where the hyper-parameters $0 < \alpha_{1:T} < 1$ represent the variance of noise incorporated at each time step. While the Gaussian kernel is commonly adopted, alternative kernel types can also be employed. This formulation can be condensed to a single-step calculation, as shown by:

$$q(z_t \mid z_0) = \mathcal{N}\left(z_t \mid \sqrt{\gamma_t}\, z_0, (1 - \gamma_t) I\right), \tag{13}$$

where $\gamma_t = \prod_{i=1}^{t} (1 - \alpha_i)$ [66]. Consequently, $z_t$ can be directly sampled, regardless of what happens at previous time steps, by

$$z_t = \sqrt{\gamma_t} \cdot z_0 + \sqrt{1 - \gamma_t} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \tag{14}$$

Backward Diffusion: The goal is to directly learn the inverse of the forward diffusion and generate a distribution that resembles the prior $z_0$, usually the HR image in SR. In practice, we use a CNN to learn a parameterized form of $p$. Since the forward process approximates $q(z_T) \approx \mathcal{N}(0, I)$, the formulation of the learnable transition kernel becomes:

$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\left(z_{t-1} \mid \mu_\theta(z_t, \gamma_t), \Sigma_\theta(z_t, \gamma_t)\right), \tag{15}$$

where $\mu_\theta$ and $\Sigma_\theta$ are learnable. Similarly, the conditional formulation $p_\theta(z_{t-1} \mid z_t, x)$ conditioned on $x$ (e.g., a LR image) uses $\mu_\theta(z_t, x, \gamma_t)$ and $\Sigma_\theta(z_t, x, \gamma_t)$ instead.

Optimization: To guide the backward diffusion in learning the forward process, we minimize the Kullback-Leibler (KL) divergence of the joint distribution of the forward and reverse sequences

$$p_\theta(z_0, \dots, z_T) = p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t), \text{ and} \tag{16}$$

$$q(z_0, \dots, z_T) = q(z_0) \prod_{t=1}^{T} q(z_t \mid z_{t-1}), \tag{17}$$

which leads to minimizing

$$
\begin{aligned}
\mathrm{KL}\left(q(z_0, \dots, z_T) \,\|\, p_\theta(z_0, \dots, z_T)\right)
&= -\mathbb{E}_{q(z_0, \dots, z_T)}\left[\log p_\theta(z_0, \dots, z_T)\right] + c \\
&\overset{(i)}{=} \mathbb{E}_{q(z_0, \dots, z_T)}\left[-\log p(z_T) - \sum_{t=1}^{T} \log \frac{p_\theta(z_{t-1} \mid z_t)}{q(z_t \mid z_{t-1})}\right] + c \\
&\overset{(ii)}{\ge} \mathbb{E}\left[-\log p_\theta(z_0)\right] + c,
\end{aligned} \tag{18}
$$

where (i) is possible because both terms are products of distributions and (ii) follows from Jensen’s inequality. The constant $c$ is unaffected and, therefore, irrelevant in optimizing $\theta$. Note that Equation 18 without $c$ is the Variational Lower Bound (VLB) of the log-likelihood of the data $z_0$, which is commonly maximized by DDPMs.

B. Score-based Generative Models (SGMs)

Score-based Generative Models (SGMs), much like DDPMs, utilize discrete diffusion processes but employ an alternative mathematical foundation. Instead of using the probability density function $p(z)$ directly, Song et al. [11] propose to work with its (Stein) score function, which is defined as the gradient of the log probability density $\nabla_z \log p(z)$. Mathematically, the score function preserves all information about the density function, but computationally, it is easier to work with. Furthermore, the decoupling of model training from the sampling procedure grants greater flexibility in defining sampling methods and training objectives.

Forward Diffusion: Let $0 < \sigma_1 < \dots < \sigma_T$ be a finite sequence of noise levels. Like DDPMs, the forward diffusion, typically assigned to a Gaussian noise distribution, is

$$q(z_t \mid z_0) = \mathcal{N}\left(z_t \mid z_0, \sigma_t^2 I\right). \tag{19}$$

This equation results in a sequence of noisy data densities $q(z_1), \dots, q(z_T)$ with $q(z_t) = \int q(z_t \mid z_0)\, q(z_0)\, dz_0$. Consequently, the intermediate step $z_t = z_0 + \sigma_t \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ can be sampled in a single step, independently of previous time steps.

Backward Diffusion: To revert the noise during the backward diffusion, we need to approximate $\nabla_{z_t} \log q(z_t)$ and choose a method for estimating the intermediate states $z_t$ from
that approximation. For the gradient approximation at each time step $t$, we use a trained predictor, denoted as $s_\theta$ and called Noise-Conditional Score Network (NCSN), such that $s_\theta(z_t, t) \approx \nabla_{z_t} \log q(z_t)$ [11].

The training of the NCSN will be covered in the next section; for now, we focus on the sampling process using NCSN. Sampling with NCSN involves generating the intermediate states $z_t$ through an iterative approach, using $s_\theta(z_t, t)$. Note that this iterative process is different from the iterations done during the diffusion, as it addresses solely the generation of $z_t$. This is a key difference to DDPMs, as $z_t$ needs to be sampled iteratively, whereas DDPMs directly predict $z_t$ from $z_{t+1}$.

There are various ways to perform this iterative generation, but we will concentrate on a specific method known as Annealed Langevin Dynamics (ALD), introduced by Song et al. [10]. Let $N$ be the number of estimation iterations for $z_t$ at time step $t$ and $\alpha_t > 0$ the corresponding step size, which determines how much the estimation moves from one estimate $z_{t-1}^{(i)}$ towards $z_{t-1}^{(i+1)}$. The initial state is $z_T^{(N)} \sim \mathcal{N}(0, I)$. For each $0 < t \le T$, we initialize $z_{t-1}^{(0)} = z_t^{(N)} \approx z_t$, which is the latest estimation of the previous intermediate state. In order to get $z_{t-1}^{(N)} \approx z_{t-1}$ iteratively, ALD uses the following update rules for $i = 0, \dots, N - 1$:

$$\epsilon^{(i)} \leftarrow \mathcal{N}(0, I) \tag{20}$$

$$z_{t-1}^{(i+1)} \leftarrow z_{t-1}^{(i)} + \frac{1}{2} \alpha_{t-1}\, s_\theta\left(z_{t-1}^{(i)}, t - 1\right) + \sqrt{\alpha_{t-1}}\, \epsilon^{(i)} \tag{21}$$

This update rule guarantees that $z_0^{(N)}$ converges to $q(z_0)$ for $\alpha_t \to 0$ and $N \to \infty$ [67].

Similar to DDPMs, we can turn SGMs into conditional SGMs by integrating the condition $x$, e.g., a LR image, into $s_\theta(z_t, x, t) \approx \nabla_{z_t} \log q(z_t \mid x)$.

Optimization: Without specifically formulating the backward diffusion, we can train a NCSN such that $s_\theta(z_t, t) \approx \nabla_{z_t} \log q(z_t)$. Estimating the score can be done by using the denoising score matching method [68]:

$$
\begin{aligned}
&\mathbb{E}_{t \sim U(1,T),\, z_0 \sim q(z_0),\, z_t \sim q(z_t \mid z_0)}\left[\lambda(t)\, \sigma_t^2\, \left\|\nabla_{z_t} \log q(z_t) - s_\theta(z_t, t)\right\|^2\right] \\
&\overset{(i)}{=} \mathbb{E}_{t \sim U(1,T),\, z_0 \sim q(z_0),\, z_t \sim q(z_t \mid z_0)}\left[\lambda(t)\, \sigma_t^2\, \left\|\nabla_{z_t} \log q(z_t \mid z_0) - s_\theta(z_t, t)\right\|^2\right] + c \\
&\overset{(ii)}{=} \mathbb{E}_{t \sim U(1,T),\, z_0 \sim q(z_0),\, z_t \sim q(z_t \mid z_0)}\left[\lambda(t)\, \left\|-\frac{z_t - z_0}{\sigma_t} - \sigma_t\, s_\theta(z_t, t)\right\|^2\right] + c \\
&\overset{(iii)}{=} \mathbb{E}_{t \sim U(1,T),\, z_0 \sim q(z_0),\, \epsilon \sim \mathcal{N}(0, I)}\left[\lambda(t)\, \left\|\epsilon + \sigma_t\, s_\theta(z_t, t)\right\|^2\right] + c
\end{aligned} \tag{22}
$$

C. Stochastic Differential Equations (SDEs)

So far, we have discussed DMs that deal with finite time steps. A generalization to infinite continuous time steps is made by formulating these as solutions to Stochastic Differential Equations (SDEs), also known as Score SDEs [10]. In fact, we can view SGMs and DDPMs as discretizations of a continuous-time SDE. SDEs are not entirely bound to DMs, as they are a mathematical concept describing stochastic processes. As such, they fit perfectly to describe the processes we want to simulate in DMs. As before, data is perturbed in a general diffusion process, but generalized to an infinite number of noise scales.

Forward Diffusion: We can represent the forward diffusion by the following SDE:

$$dz = f(z, t)\, dt + g(t)\, dw, \tag{23}$$

where $f$ and $g$ are the drift and diffusion functions, respectively, and $w$ is the standard Wiener process (also known as Brownian motion). This generalized formulation allows uniform representation of both DDPMs and SGMs. The SDE for DDPMs is given by:

$$dz = -\frac{1}{2} \alpha(t)\, z\, dt + \sqrt{\alpha(t)}\, dw, \tag{24}$$

with $\alpha\left(\frac{t}{T}\right) = T \alpha_t$ for $T \to \infty$. For SGMs, the SDE is

$$dz = \sqrt{\frac{d\left[\sigma(t)^2\right]}{dt}}\, dw, \tag{25}$$

with $\sigma\left(\frac{t}{T}\right) = \sigma_t$ for $T \to \infty$. From now on, we denote with $q_t(z)$ the distribution of $z_t$ in the diffusion process.

Backward Diffusion: The reverse-time SDE is formulated by Anderson et al. [71] as:

$$dz = \left[f(z, t) - g(t)^2\, \nabla_z \log q_t(z)\right] dt + g(t)\, d\tilde{w}, \tag{26}$$

where $\tilde{w}$ is the standard Wiener process when time flows backwards and $dt$ an infinitesimal negative time step. Solutions to Equation 26 can be viewed as diffusion processes that gradually convert noise to data. The existence of a corresponding probability flow Ordinary Differential Equation (ODE), whose trajectories possess the same marginals as the reverse-time SDE, was proven by Song et al. [11] and is

$$dz = \left[f(z, t) - \frac{1}{2} g(t)^2\, \nabla_z \log q_t(z)\right] dt. \tag{27}$$

Thus, the reverse-time SDE and the probability flow ODE enable sampling from the same data distribution.

Optimization: Similar to the approach in SGMs, we define a score model such that $s_\theta(z_t, t) \approx \nabla_z \log q_t(z)$. Additionally,
we extend Equation 22 to continuous time as follows:
where λ(t) > 0 is a weighting function, σt the noise level
added at time step t, (i) derived by Vincent et al. [68], (ii) E

λ(t)∥sθ (zt , t) − ∇zt log qt (zt | z0 )∥2 , (28)

from Equation 19, (iii) from zt = z0 + σt ϵ and with c again a t∼U (0,T )
z0 ∼q(z0 )
constant unaffected in the optimization of θ. Note that there are zt ∼q(zt |z0 )
other ways to estimate the score, e.g., based on score matching
[69] or sliced score matching [70]. where λ(t) > 0 is a weighting function.
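As a rough illustration of Equations 26 and 27, the following sketch integrates both the reverse-time SDE (Euler-Maruyama) and the probability flow ODE (explicit Euler) on a one-dimensional toy problem. Everything specific in it is an assumption made for illustration and is not taken from the surveyed works: the data distribution N(mu, s^2), the SGM-style choice f = 0 with sigma(t) = sigma_max * t on t in [0, 1], the function name `sample_toy_ve`, and the use of the closed-form score in place of a learned score model.

```python
import numpy as np


def sample_toy_ve(n, mu=2.0, s=0.5, sigma_max=10.0, steps=1000,
                  stochastic=True, seed=0):
    """Integrate Eq. 26 (stochastic=True) or Eq. 27 (stochastic=False)
    backwards from t = 1 to t = 0 and return n samples.

    Toy assumptions: data z_0 ~ N(mu, s^2), f(z, t) = 0 and
    sigma(t) = sigma_max * t, so q_t = N(mu, s^2 + sigma(t)^2) and the
    score is available in closed form.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps
    # Start from the prior N(0, sigma_max^2), which only approximates q_1.
    z = rng.normal(0.0, sigma_max, size=n)
    for k in range(steps, 0, -1):
        t = k * dt
        # g(t)^2 = d[sigma(t)^2]/dt for the SDE in Eq. 25.
        g2 = 2.0 * sigma_max**2 * t
        # Analytic score of q_t, standing in for a trained s_theta(z_t, t).
        score = -(z - mu) / (s**2 + (sigma_max * t) ** 2)
        if stochastic:
            # Reverse-time SDE: one Euler-Maruyama step backwards in time.
            z = z + g2 * score * dt + np.sqrt(g2 * dt) * rng.normal(size=n)
        else:
            # Probability flow ODE: one explicit Euler step backwards in time.
            z = z + 0.5 * g2 * score * dt
    return z
```

Calling `sample_toy_ve(20000, stochastic=True)` and `sample_toy_ve(20000, stochastic=False)` should both give samples whose mean and standard deviation are close to `mu` and `s`. The deterministic variant keeps a slightly larger bias in the mean because it starts from N(0, sigma_max^2) rather than the exact q_1.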

Fig. 2: Conceptual overview of generative models (GANs, VAEs, NFs, and DMs).
[Figure 2 panels: Generative Adversarial Networks, Variational Autoencoders, Normalizing Flows, Diffusion Models.]

D. Relation between Diffusion Models

As highlighted in the SDE section, we can describe both variations, namely SGMs and DDPMs, with SDEs. We can also showcase this close relationship by reformulating the optimization targets. For DDPMs, we saw in Equation 18 that

$$\mathrm{KL}\left( q(z_0, ..., z_T) \,\|\, p_\theta(z_0, ..., z_T) \right) \overset{(ii)}{\geq} \mathbb{E}\left[ -\log p_\theta(z_0) \right] + c$$

is minimized. By reweighting the VLB, as Ho et al. [8] recommend for improved sample quality, we can further derive:

$$\mathbb{E}_{t \sim \mathcal{U}(1,T),\; z_0 \sim q(z_0),\; \epsilon \sim \mathcal{N}(0, I)} \left[ \lambda(t) \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|^2 \right],$$

where $\lambda(t) > 0$ is a weighting function. If we now take the optimization target in Equation 22 of SGMs, which was

$$\mathbb{E}_{t \sim \mathcal{U}(1,T),\; z_0 \sim q(z_0),\; \epsilon \sim \mathcal{N}(0, I)} \left[ \lambda(t) \left\| \epsilon + \sigma_t s_\theta(z_t, t) \right\|^2 \right] + c,$$

the connection between DDPMs and SGMs becomes clear once we set $\epsilon_\theta(z_t, t) = -\sigma_t s_\theta(z_t, t)$. As the constant $c$ is irrelevant for the optimization, we can see once again that there is a mathematical connection between DDPMs and SGMs.

E. Relation to other Image SR Generative Models

Generative models in image SR differ primarily in how they approach the task of generating HR images from LR inputs and are illustrated in Figure 2. These differences stem from the underlying architecture and training objectives. While they offer significant advantages, they come with an individual set of challenges, like training stability and computational costs.

GAN: One prominent category of generative models is Generative Adversarial Networks (GANs) [72], which have demonstrated state-of-the-art performance in various vision-related tasks, including text-to-image synthesis [7] and image super-resolution (SR) [52]. GANs are known for their adversarial training, where a generator competes against a discriminator. Although DMs do not employ a discriminator, they utilize a similar adversarial training strategy by iteratively adding and removing noise to enable realistic data generation. However, approaches with GANs often suffer from non-convergence, training instability, and high computational costs. They require careful hyperparameter tuning due to the interplay between the generator and the discriminator.

VAE: Variational Autoencoders (VAEs) [73] are designed as autoencoders with a variational latent space, which is especially interesting in addressing the ill-posedness of image SR. The core objective of a VAE centers around establishing the variational lower bound of the log data likelihood, akin to the fundamental principle underlying DMs. In a comparative context, one can consider DMs as a variation of VAEs but with a fixed VAE encoder responsible for perturbing the input data, while the VAE decoder resembles the backward diffusion process in DMs. Still, unlike VAEs, which compress the input into smaller dimensions in the latent space, DMs often maintain the same spatial size.

ARM: Autoregressive Models (ARMs) treat images as sequences of pixels and generate each pixel based on the values of previously generated pixels in a sequential manner [6]. The probability of the entire image is given as the product of conditional probability distributions for each individual pixel. This makes ARMs computationally expensive for HR image generation. Conversely, DMs generate data by gradually diffusing noise into an initial data sample and then reversing this process. Noise is diffused across the entire image simultaneously rather than sequentially.

NF: Normalizing Flows (NFs) [74] are a distinct category of generative models renowned for their capacity to represent data as intricate and complex distributions. Like DMs and VAEs, these models are optimized based on the log-likelihood of the data they generate. However, what sets NFs apart is their unique ability to learn an invertible parameterized transformation. Importantly, this transformation possesses a tractable Jacobian determinant, making it feasible to compute. DiffFlow [75] is an algorithm that combines the principles of DMs with those of NFs. This combination offers the promise of enhanced generative modeling capabilities. Yet, while promising, NFs are often considered challenging to train and can be computationally demanding [76].

IV. IMPROVEMENTS FOR DIFFUSION MODELS

In the broader research community, there are several ways to improve DMs for image generation, as presented, for example, by Karras et al. [77]. This section, however, focuses on enhancements particularly interesting for image SR: Efficient sampling and enhanced likelihood estimation.

A. Efficient Sampling

Efficient sampling refers to strategies that generate samples from noise more quickly, i.e., in fewer time steps, without compromising the quality of the produced image significantly. For instance, a DDPM takes about 20 hours to sample 50,000 32x32 images, in contrast to a GAN's less than one minute on an Nvidia 2080 Ti GPU; for larger 256x256 images, this extends to nearly 1,000 hours [78]. Fortunately, the independence between training and inference schedules is often leveraged in image SR. For example, a model may undergo training with 1,000 time steps, but the subsequent inference phase may require only a fraction, e.g., 200 [15], [79]. However, the broader community of DM research has made further attempts focusing on either training-based or training-free sampling.

Training-based sampling methods speed up data generation using a trained sampler that approximates the backward diffusion process instead of a traditional numerical solver. This process may be complete or partial. For example, Watson et al. [80] developed a dynamic programming algorithm that identifies optimal inference paths using a fixed number of refinement steps, significantly reducing the computation required. Diffusion Sampler Search [81] offers another approach, optimizing fast samplers for pre-trained DMs by optimizing the Kernel Inception Distance. Another technique is truncated diffusion, which improves speed by prematurely ending the forward diffusion process [82], [83]. This early termination results in outputs that are not purely Gaussian noise, presenting computational challenges. These challenges are addressed using proxy distributions from pre-trained VAEs or GANs, which match the diffused data distribution and facilitate efficient backward diffusion. Lastly, knowledge distillation is also used to accelerate sampling. It involves transferring knowledge from a complex, slower sampler (the teacher model) to simpler, faster models (student models) [84], [85]. As demonstrated by Salimans et al. [86], this method progressively reduces the number of sampling steps, trading off a slight decrease in sample quality for increased speed. Similarly, Xiao et al. [87] addressed the slow sampling issue associated with the Gaussian assumption in denoising steps, which is usually only effective for small step sizes. They proposed Denoising Diffusion GANs that use conditional GANs for the denoising steps, allowing for larger step sizes and faster sampling. For image SR, an application for exploiting knowledge distillation can be found in AddSR [88]. Similarly, YONOS-SR [89] uses knowledge distillation, but instead of training faster samplers, they transfer different scaling task knowledge and use the training-free DDIMs for efficient sampling, which is presented in the next section.

Training-free sampling methods aim to speed up sampling by minimizing the number of discretization steps while solving the Stochastic Differential Equation (SDE) or Probability Flow Ordinary Differential Equation (ODE) [90], [91]. Denoising Diffusion Implicit Models (DDIMs) [90], introduced by Song et al., generalize the Markovian forward diffusion of DDPMs into non-Markovian ones. This generalization allows the DDIMs to learn a Markov chain to reverse the non-Markovian forward diffusion, resulting in higher sampling speeds with minimal loss in sample quality. Jolicoeur-Martineau et al. [91] have devised an efficient SDE solver with adaptive step sizes for the accelerated generation of score-based models. This method has been found to generate samples more rapidly than the Euler-Maruyama method without compromising sample quality. Building upon DDIM and Jolicoeur-Martineau et al., the DPM-solver [92], inspired by the AnalyticalDPM [93], approximates the error prediction via Taylor expansion and thereby achieves efficient sampling by analytically resolving the linear component of the ODE solution instead of relying on generic black-box ODE solvers. This method significantly reduces the sampling steps to 10 to 20. In a later work, the authors introduced an improved version with DPM-solver++ that essentially approximates the predicted image instead of the error [94]. Lately, a more general formulation and extension of the DPM-solver++ was introduced by UniPC [95].

B. Improved Likelihood

Log-likelihood improvement is directly coupled with enhancing the performance of various applications and methods, including but not limited to compression [96], semi-supervised learning [97], and image SR. Given that DMs do not directly optimize the log-likelihood, e.g., SGMs utilize a weighted combination of score-matching losses, an objective that forms an upper bound on the negative log-likelihood needs to be optimized. Song et al. [98] proposed a method called likelihood weighting to address this need. This method minimizes the weighted combination of score-matching losses for score-based DMs. A carefully chosen weighting function sets an upper bound on the negative log-likelihood in the weighted score-matching objective. Upon minimization, this results in an elevation of the log-likelihood. Kingma et al. [99] explored methods that simultaneously train the noise schedule and diffusion parameters to maximize the variational lower bound within Variational Diffusion Models. Additionally, the Improved Denoising Diffusion Probabilistic Models (iDDPMs) proposed by Nichol and Dhariwal [100] implement a cosine noise schedule. This gradually introduces noise into the input, contrasting with the linear schedules that tend to degrade the information quicker. Using the cosine noise schedule leads to better log-likelihoods and facilitates faster sampling.

V. DIFFUSION MODELS FOR IMAGE SR

So far, we introduced the theoretical framework of DMs. This section reviews practical applications and recent advances in image SR. We will discuss concrete realizations of DMs, which are predominantly DDPMs. We then discuss guidance strategies to enhance conditioning usage, represent conditioning information in alternative state domains for DDPMs, and incorporate various conditioning methods. Additionally, we explore SR-specific research areas, including corruption spaces, color-shifting, and architectural designs. Figure 3 provides a topological overview of this section.

A. Concrete Realization of Diffusion Models

While SGMs provide considerable design flexibility, the image SR trend leans towards DDPMs. DDPMs benefit from

a straightforward implementation, which reduces the entry barrier. It is a significant advantage, as it allows quicker development cycles and replication of results. In addition, while the flexibility of SGMs is advantageous in creating customized solutions, it introduces design complexity due to the multitude of design variables that need to be considered. This poses a challenge in research settings, where rigorously evaluating the impact of each variable (e.g., different sampling algorithms) becomes cumbersome. Moreover, the growing DDPM literature contributes to their popularity. As more studies adopt DDPMs, a virtuous cycle is created, where familiarity and proven effectiveness encourage further adoption.

Fig. 3: Topology of this work. Conditioning (subsection V-D) leads the backward diffusion, whereas guidance (subsection V-B) is a training strategy to improve the incorporation of conditioning into DMs. The state domain (subsection V-C) describes the representation of states $z_t$. The corruption space (subsection V-E) describes the target of the forward diffusion process or the start of the backward diffusion.
[Figure 3 panels: Conditioning, "What information guides sampling?" (LR reference, SR reference, feature reference, Text-to-Image information); Guidance, "Improve conditioning influencing training" (classifier guidance, classifier-free guidance); Corruption Space, "What is the target of the forward diffusion? / What is the start of the backward diffusion?" (Gaussian noise, Cold Diffusion, I2SB, InDi); State Domain, "How are states represented?" (pixel space, latent space, frequency space, residual space).]

Among the pioneering DM efforts is SR3 [15], which concretely realizes DDPMs for image SR. As is typical for DDPMs, it adds Gaussian noise to the HR image until $z_T \sim \mathcal{N}(0, I)$ and generates a target HR image $z_0$ iteratively in $T$ refinement steps. SR3 employs the denoising model to predict the noise $\epsilon_t$. The denoising model, $\varphi_\theta(x, z_t, \gamma_t)$, takes the LR image $x$, the noise variance $\gamma_t$, and the noisy target image $z_t$ as inputs. With the prediction of $\epsilon_t$ provided by $\varphi_\theta$, we can reformulate Equation 14 to approximate $z_0$ as follows:

$$
\begin{aligned}
z_t &= \sqrt{\gamma_t} \cdot \hat{z}_0 + \sqrt{1 - \gamma_t} \cdot \varphi_\theta(x, z_t, \gamma_t) \\
\iff \hat{z}_0 &= \frac{1}{\sqrt{\gamma_t}} \cdot \left( z_t - \sqrt{1 - \gamma_t} \cdot \varphi_\theta(x, z_t, \gamma_t) \right)
\end{aligned}
\tag{29}
$$

The substitution of $\hat{z}_0$ into the posterior distribution to parameterize the mean of $p_\theta(z_{t-1} | z_t, x)$ leads to:

$$\mu_\theta(x, z_t, \gamma_t) = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{1 - \alpha_t}{\sqrt{1 - \gamma_t}} \cdot \varphi_\theta(x, z_t, \gamma_t) \right) \tag{30}$$

In SR3, the authors simplified the variance $\Sigma_\theta$ to $(1 - \alpha_t)$ for ease of computation. Consequently, each refinement step with $\epsilon_t \sim \mathcal{N}(0, I)$ can be represented as:

$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{1 - \alpha_t}{\sqrt{1 - \gamma_t}} \cdot \varphi_\theta(x, z_t, \gamma_t) \right) + \sqrt{1 - \alpha_t} \cdot \epsilon_t \tag{31}$$

Concurrent work focused on a similar implementation of SR3 but shows different variations implementing the denoising model $\varphi_\theta(x, z_t, \gamma_t)$, which we will discuss later. A notable mention is SRDiff [79], which was published around the same time and follows a close realization of SR3. The main distinction between SRDiff and SR3 is that SR3 predicts the HR image directly, whereas SRDiff predicts the residual information between the LR and HR image, i.e., the difference. Thus, it has an alternative state domain, which will be discussed next.

B. Guidance in Training

The backbone of diffusion-based image SR is the learning of conditional distributions [15], [101]. As such, the condition $x$, e.g., the LR image, is integrated into the backward diffusion, i.e., $p_\theta(z_{t-1} | z_t, x)$ for DDPMs or in $s_\theta(z_t, x, t)$ for SGMs/SDEs. However, this simple formulation can result in a model that overlooks the conditioning. A principle known as guidance can mitigate this issue by controlling the weighting of the conditioning information at the expense of sample diversity. It can be categorized into classifier and classifier-free guidance. To our knowledge, while effectively used for improving DMs, they have not been applied to image SR.

Classifier Guidance: Classifier guidance employs a classifier to guide the diffusion process by merging the score estimate of the DM with the gradients of the classifier during sampling [12]. This process is similar to low temperature or truncated sampling in BigGANs [102] and facilitates a trade-off between mode coverage and sample fidelity. The classifier is trained concurrently with the DM to predict the conditional information $x$ from $z_t$. For weighting of the conditioning information, the score function becomes:

$$\nabla_{z_t} \log q(z_t | x) = \nabla_{z_t} \log q(z_t) + \lambda \nabla_{z_t} \log q(x | z_t), \tag{32}$$

where $\lambda \in \mathbb{R}^+$ is a hyper-parameter for controlling the weighting. The downside of this approach is its dependence on a learned classifier that can handle arbitrarily noisy inputs, a capability most existing pre-trained image classification models lack.

Classifier-Free Guidance: Classifier-free guidance aims to achieve similar results without a classifier [103]. It modifies Equation 32 into

$$\nabla_{z_t} \log q(z_t | x) = (1 - \lambda) \nabla_{z_t} \log q(z_t) + \lambda \nabla_{z_t} \log q(z_t | x). \tag{33}$$

As a result, we have a standard unconditional DM and a conditional DM that has the score estimate $\nabla_{z_t} \log q(z_t | x)$. The unconditional DM remains when $\lambda = 0$, and for $\lambda = 1$, it aligns with the vanilla formulation of the conditional DM. The interesting scenario arises when $\lambda > 1$, where the DM prioritizes conditional information and moves away from the unconditional score function, thus reducing the likelihood of generating samples disregarding conditioning information. However, the major downside of this approach

is its computational cost for training two separate DMs. This can be mitigated by training a single conditional model and substituting the conditioning information with a null value in the unconditional score function [104].
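The weighting in Equation 33, together with the single-model trick of passing a null condition, can be sketched as follows. This is a minimal toy under stated assumptions rather than any published implementation: `toy_score` is a hypothetical stand-in for a trained score model, the unconditional density is assumed to be N(0, 2^2) and the conditional one N(x, 1), and `None` plays the role of the null value.

```python
import numpy as np


def toy_score(z, x=None):
    """Hypothetical stand-in for one score model s_theta(z_t, x, t).

    x=None acts as the null condition and returns the unconditional score
    of N(0, 2^2); otherwise the conditional score of N(x, 1) is returned.
    """
    z = np.asarray(z, dtype=float)
    if x is None:
        return -z / 4.0
    return -(z - x)


def guided_score(z, x, lam):
    """Weighted score of Eq. 33: (1 - lam) * unconditional + lam * conditional."""
    return (1.0 - lam) * toy_score(z, None) + lam * toy_score(z, x)
```

With `lam = 0` the function returns the unconditional score, with `lam = 1` the plain conditional score, and with `lam > 1` it extrapolates away from the unconditional score, which mirrors the three cases discussed above.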

C. State Domains

So far, we have discussed methods that operate directly on the pixel space. This section introduces different methods that map the input into alternative state domains: latent, frequency, and residual space. Apart from particular challenges arising from the alternative state domain, these methods incur an additional step that maps the pixel domain into their own, as illustrated in Figure 4.

Fig. 4: Overview of state domains. The green bar shows the vanilla DM operating in pixel space. The blue bar shows the use of the latent space domain via autoencoders. The red bar shows the application of DMs in the wavelet domain.
[Figure 4 rows: Pixel-based Diffusion (corruption space, clean image); Latent Space Diffusion (corruption space, latent representation, clean image); Frequency-based Diffusion (corruption space, wavelet representation, clean image).]

Latent Space: Models like SR3 [15] and SRDiff [79] have achieved high-quality SR results by operating in the pixel domain. However, these models are computationally intensive due to their iterative nature and the high-dimensional calculations in RGB space. To reduce computational demands, one can move the diffusion process into the latent space of an autoencoder [105]. The first of this kind was the Latent Score-based Generative Model (LSGM) by Vahdat et al. [106]. It is a regular SGM that operates in the latent space of a VAE and, by pre-training the VAE, achieves even faster sampling speeds. It yields comparable or better results than DMs operating in the pixel domain while being faster. Building upon LSGMs, Rombach et al. introduced the Latent Diffusion Model (LDM) [13], [107], which also performs diffusion in a low-dimensional latent space of an autoencoder. In contrast to LSGM, LDM utilizes a DDPM and an autoencoder that is pre-trained, like the VQ-GAN [5], and is not jointly trained with the denoising network. This approach significantly lowers resource requirements without compromising performance. Due to the decoupled training, it requires very little regularization of the latent space and allows the reuse of latent representations across multiple models. Improving upon LDMs is REFUSION (image REstoration with difFUSION models) [108] by Luo et al., which differs in two aspects: First, it uses a U-Net that contains skip connections from the encoder to the decoder, which provides the decoder with additional details. Moreover, it introduces Nonlinear Activation-Free blocks (NAFBlocks) [109], replacing all nonlinear activations with an element-wise operation that splits feature channels into two parts and multiplies them to produce one output. Secondly, they train their U-Net with a latent-replacing training strategy, which partially replaces the latent representation with either the encoded LR or HR image for reconstruction training. Similarly, Chen et al. [110] improve the architectural aspects of LDMs and propose a two-stage strategy called the Hierarchical Integration Diffusion Model (HI-Diff). In the first stage, an encoder compresses the ground truth image to a highly compact latent space representation, which has a much higher compression ratio than LDM. As a result, the computational burden of the DM, which refines multi-scale latent representations, is reduced considerably. The second stage is a vision transformer-based autoencoder, which incorporates the latent representations of the first stage during the downsampling process via Hierarchical Integration Modules (HIM), a cross-attention fusion module.

Frequency Space: Wavelets provide a novel outlook on SR [16], [111]. The conversion from the spatial to the wavelet domain is lossless and offers significant advantages as the spatial size of an image can be downsized by a factor of four, thereby allowing faster diffusion during the training and inference stages. Moreover, the conversion segregates high-frequency details into distinct channels, facilitating a more concentrated and intentional focus on high-frequency information, offering a higher degree of control [112]. Besides, it can be conveniently incorporated into existing DMs as a plug-in feature. The diffusion process can interact directly with all wavelet bands as proposed in DiWa [113] or specifically target certain bands while the remaining bands are predicted via standard CNNs. For instance, WaveDM [114] modifies the low-frequency band, whereas WSGM [115] or ResDiff [116] conditions the high-frequency bands relative to the low-resolution image. Altogether, the wavelet domain presents a promising avenue for future research. It provides potential for significant performance acceleration while maintaining, if not enhancing, the quality of SR results.

Residual Space: SRDiff [79] was the first work that advocated for shifting the generation process into the residual space, i.e., the difference between the upsampled LR and the HR image. This enables the DM to focus on residual details, speeds up convergence, and stabilizes the training [16], [111]. Whang et al. [117] also employ residual predictions as a fundamental component of their predict-and-refine approach for image deblurring. However, unlike SRDiff, they provide an SR prediction with a CNN instead of the bilinearly upsampled LR image and predict the residuals between the SR prediction and the HR ground truth with their DM. An improvement is presented by ResDiff [116], which additionally incorporates the SR prediction and its high-frequency information during the backward diffusion for better guidance. In a different vein, Yue et al. [118] present ResShift. This technique constructs a Markov chain of transformations between HR and LR images by manipulating the residual between them. Thus, instead of just adding Gaussian noise with zero mean in the forward process, the residual is also added as the mean of the noise sampling during training. This novel approach substantially enhances sampling efficiency, i.e., only 15 sampling steps.

D. Conditioning Diffusion Models

DMs depend on conditioning information to guide the sampling process toward a reasonable HR prediction. One common strategy is to use the LR image during the backward diffusion. This section reviews various alternative methods for integrating conditioning information into backward diffusion.

Low Resolution Reference: High-quality SR predictions can be achieved through a straightforward channel concatenation [119]. The LR image is concatenated with the denoised result from time step $t-1$ and serves as the conditioning input for noise prediction at time step $t$. In contrast, Iterative Latent Variable Refinement (ILVR) by Choi et al. [120] conditions the generative process of an unconditional LDM [13]. This approach offers the advantage of shorter training times, as it leverages a pre-trained DM. To integrate conditioning information, the low-frequency components of the denoised output are replaced with their corresponding counterparts from the LR image. Thus, the latent variable is aligned with a provided reference image at each generation process stage, ensuring precise control and adaptation during generation.

Super-Resolved Reference: An alternative to conditioning the denoising on the LR image involves learned priors from pre-trained SR models to predict a reference image. E.g., CDPMSR [121] conditions the denoising process with a predicted SR reference image obtained using existing and standard SR models. ResDiff [116], on the other hand, leverages a pre-trained CNN to predict a low-frequency, content-rich image that includes partial high-frequency components. This image guides the noise towards the residual space, offering an alternative means of conditioning the generative process.

Pandey et al. [128] introduced the idea of varying predicted conditions with DiffuseVAE, as illustrated in Figure 5. This approach integrates the stochastic predictions generated by a VAE as conditioning information for the DM, capitalizing on the advantages offered by both models. They use a two-stage approach called the generator-refiner framework. In the first stage, a VAE is trained on the training data. In the subsequent stage, the DM is conditioned using varying, often blurred, reconstructions generated by the VAE. The essential advantage of this method lies in the diversity in the generated samples, which is defined within the lower-dimensional latent space of the VAE. This characteristic creates a more favorable balance between sampling speed and sample quality. It is advantageous in scenarios where multiple predictions are required, similar to the use cases for Normalizing Flows.

Fig. 5: Overview of DiffuseVAE. The two-stage approach employs a VAE (first stage), which generates variational prediction as a condition for the DM (second stage).
[Figure 5 panels: Stage 1: VAE Training (Encoder + Decoder); Stage 2: Diffusion Training.]

Feature Reference: Another avenue for conditioning involves relevant features extracted from pre-trained networks. SRDiff [79] leverages a pre-trained encoder to encode LR image features at each step of the backward diffusion. These features serve as guidance, aiding in the generation of higher-resolution outputs. Implicit DMs (IDMs) [100] take a different approach by conditioning their denoising network with a neural representation, which enables the learning of a continuous representation at various scales. They encode the image as a function within continuous space and seamlessly integrate it into the DM. These extracted features are adapted to multiple scales and are used across multiple layers within the DMs. To comprehensively understand the performance differences between these approaches, comparisons can be found in Table II and Table I. Recently, DeeDSR was introduced [129], which incorporates degradation-aware features extracted from the LR image to guide the diffusion process of an LDM [107].

TABLE I: Results for 4× SR of general images on DIV2K val. Note that EDSR, FxSR-PD, CAR, and RRDB are regression-based methods that generally produce better PSNR and SSIM scores than generative approaches [15].

Methods            PSNR ↑   SSIM ↑   LPIPS ↓
Bicubic            26.70    0.77     0.409
EDSR [122]         28.98    0.83     0.270
FxSR-PD [123]      29.24    0.84     0.239
RRDB [53]          29.44    0.84     0.253
CAR [1]            32.82    0.88     -
RankSRGAN [124]    26.55    0.75     0.128
ESRGAN [53]        26.22    0.75     0.124
SRFlow [125]       27.09    0.76     0.120
SRDiff [79]        27.41    0.79     0.136
IDM [100]          27.59    0.78     -
DiWa [113]         28.09    0.78     0.104

TABLE II: PSNR and SSIM comparison on CelebA-HQ face SR 16×16 → 128×128. Consistency measures MSE (×10^-5) between LR inputs and the downsampled SR outputs.

Methods                  PSNR ↑   SSIM ↑   Consistency ↓
PULSE [126]              16.88    0.44     161.1
FSRGAN [127]             23.01    0.62     33.8
SR3 (regression) [15]    23.96    0.69     2.71
SR3 (diffusion) [15]     23.04    0.65     2.68
DiWa [113]               23.34    0.67     -
IDM [100]                24.01    0.71     2.14

Text-to-Image Information: By incorporating conditioning information that goes beyond the LR image (e.g., its SR prediction, direct concatenation of the LR image, or its feature representation), one can add Text-To-Image (T2I) information. The incorporation of T2I information proves advantageous as it allows the usage of pre-trained T2I models. These models can be fine-tuned by adding specific layers or encoders tailored

TABLE III: Results for 4× SR of general images on resized DIV2K val (128 × 128 → 512 × 512).

Methods               PSNR ↑   SSIM ↑   LPIPS ↓
BSRGAN [134]          23.41    0.61     0.426
Real-ESRGAN [135]     23.15    0.62     0.403
LDL [136]             22.74    0.62     0.416
FeMaSR [137]          21.86    0.54     0.410
SwinIR-GAN [49]       22.65    0.61     0.406
LDM [13]              21.48    0.56     0.450
SD Upscaler [13]      21.21    0.55     0.430
StableSR [107]        20.88    0.53     0.438
PASD [131]            21.85    0.52     0.403

Fig. 6: Comparison of the standard corruption space and I2SB. Instead of injecting noise to the clean image (initial state $z_0$), the final state $z_T$ is the degraded image.
[Figure 6 panels: Standard (forward diffusion, backward diffusion); Image-to-Image Schrödinger Bridge.]
to the SR task, facilitating the integration of textual descriptions into the image generation process. This approach enables a richer source of guidance, potentially improving image synthesis and interpretation in SR tasks. Wang et al. have put this concept into practice with StableSR [107]. Central to StableSR is a time-aware encoder trained in tandem with a frozen Stable DM, essentially an LDM. This setup seamlessly integrates trainable spatial feature transform layers, enabling conditioning based on the input image. To further augment the flexibility of StableSR and achieve a delicate balance between realism and fidelity, they introduce an optional controllable feature wrapping module. This module accommodates user preferences, allowing for fine-tuned adjustments based on individual requirements. The inspiration for this feature comes from the methodology introduced in CodeFormer [130], which enhances the versatility of StableSR in catering to diverse user needs and preferences. Likewise, Yang et al. introduce a method known as Pixel-Aware Stable Diffusion (PASD) [131]. PASD takes conditioning a step further by incorporating text embeddings of the LR input using a CLIP text encoder [59] and its feature representation. This approach augments the model's ability to generate images by incorporating textual information, thus allowing for more precise and context-aware image synthesis. Comparisons between PASD and other approaches can be found in Table III, demonstrating the impact of this text-based conditioning on image SR results. A similar concurrent work can be found with SeeSR [132]. XPSR [133] extends this idea by fusing different levels of semantic text encodings (high-level: the content of the image; low-level: the perception of overall quality, sharpness, noise level, and other distortions about the LR image).

E. Corruption Space
Karras et al. [77] identified three pillars of DMs: the noise schedule, the network parameterization, and the sampling algorithm. Recently, many authors have argued for also considering different types of corruption in place of the pure Gaussian noise used during forward diffusion, i.e., the starting point for backward diffusion or the target of the forward diffusion $z_T$. One example is Soft Score Matching [138], which directly incorporates the filtering process within the SGM, training the model to predict a clean image that, once corrupted, aligns with the diffused observation. Note that $z_T$ may be represented differently due to alternative state domains (e.g., latent, frequency, or residual). Cold Diffusion [139] presents another ingenious way of modifying the corruption space for DDPMs. It shows that the generative capability is not strongly dependent on the choice of image degradation. It reveals that new experimental types of diffusion besides Gaussian noise can be used effectively, such as animorphosis (i.e., human faces iteratively degrading to animal faces). The Image-to-Image Schrödinger Bridge (I²SB) goes in a similar direction but does not impose any assumptions on the underlying prior distributions [140]. In its diffusion process, the clean image represents the initial state, while the degraded image is the final state in both forward and backward diffusion processes. This is notable for its ability to provide a transparent and traceable path from a degraded image to its clean version, as illustrated in Figure 6. Consequently, it provides enhanced interpretability since the process between degraded and clean images is directly addressed, which is not commonly present in many DMs. Another benefit is its higher efficiency in backward diffusion since it requires fewer steps (often between 2 and 10) to achieve comparable performance. Its conditionality, however, limits its use specifically to paired data during training, which is unsuitable for unsupervised SR. While Cold Diffusion and I²SB show promising results for image restoration, an extensive and more detailed quantitative analysis of different corruption types for image SR remains an exciting and open research avenue. Another avenue for alternative corruption spaces is presented by Inversion by Direct Iteration (InDI) [141]. InDI delineates a direct mapping strategy, efficiently bridging the gap between the two quality spaces without the iterative refinement typically required by conventional diffusion processes. The intrinsic flexibility and the direct mapping capability of InDI offer intriguing possibilities for enhancing image quality, suggesting a potent avenue for research exploration. The potential integration of InDI's principles with those of conditional DMs could offer substantial advancements in the field of image SR. A detailed examination and discussion of InDI within the broader scope of diffusion-based image enhancement could yield valuable insights and contribute significantly to the ongoing development of generative models in image processing.
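To make the contrast drawn in Figure 6 concrete, the following is a minimal NumPy sketch and not the procedure of any cited method. It compares the usual Gaussian forward marginal of a DDPM with a path whose final state is the degraded image. The linear interpolation used for the second path is an assumption chosen purely for illustration (I²SB, for instance, additionally involves stochasticity), and the names `gaussian_forward` and `bridge_forward`, as well as the toy degradation, are hypothetical.

```python
import numpy as np

def gaussian_forward(z0, alpha_bar_t, rng):
    """Standard corruption: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def bridge_forward(z0, x_degraded, t):
    """Illustrative alternative (assumption): linear path from the clean
    image z_0 at t = 0 to the degraded image at t = 1."""
    return (1.0 - t) * z0 + t * x_degraded

rng = np.random.default_rng(0)
z0 = rng.uniform(size=(8, 8))                # toy "clean" image
x_degraded = np.full_like(z0, z0.mean())     # toy degradation: fully blurred

# Final states of the two corruption spaces.
z_T_standard = gaussian_forward(z0, alpha_bar_t=0.0, rng=rng)  # pure noise
z_T_bridge = bridge_forward(z0, x_degraded, t=1.0)             # degraded image
```

In the first case, backward diffusion has to start from noise that carries no information about the input; in the second, it starts from the degraded image itself, which is consistent with the smaller number of steps reported above.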

Fig. 7: Example of color shifting produced by vanilla SR3 in a 64 × 64 → 256 × 256 setting when trained with reduced batch size (8 instead of 256).

F. Color Shifting
As a result of high computational costs, DMs can occasionally suffer from color shifting when limited hardware necessitates smaller batch sizes or shorter learning periods [142]. An example with SR3 is shown in Figure 7. As presented by StableSR, a straightforward modification can address this issue: color normalization, which adjusts the mean and variance of the generated image to those of the LR image [107]. Mathematically, it gives the following equation:

$$\hat{z}_0^c = \frac{z_0^c - \mu_{z_0}^c}{\sigma_{z_0}^c} \cdot \sigma_x^c + \mu_x^c, \quad (34)$$

where $c \in \{r, g, b\}$ denotes the color channel, and $\mu_{z_0}^c$ and $\sigma_{z_0}^c$ (or $\mu_x^c$ and $\sigma_x^c$) are the mean and standard deviation of the $c$-th channel of the predicted image $z_0$ (or the input image $x$), respectively. You Only Diffuse Areas (YODA) [143], which targets diffusion on important image areas more frequently through time-dependent masks generated with DINO [144], also mitigates the color shift effect for image SR. This suggests that properly defined architecture and diffusion design are crucial to avoid this effect. Further analysis of why this effect emerges is left to future work.

G. Architecture Designs for Denoising
The design of the denoising model in DMs offers a range of options. The majority of DMs adopt a U-Net, as noted in most literature [102]. SR3 [15], for instance, employs residual blocks from BigGAN [102] and re-scales skip connections by a factor of $1/\sqrt{2}$. SRDiff takes a similar approach [79], although it opts for vanilla residual blocks without the re-scaling of skip connections and uses an LR encoder to incorporate the information of the LR image during the backward diffusion. Whang et al. [117] exploit an initial predictor to combine the strengths of deterministic image SR models and DMs. It has the advantage that the DM only needs to learn the residuals that the deterministic image SR model (initial predictor) fails to predict, hence simplifying the learning target. Additionally, the removal of self-attention, positional encodings, and group normalization from the SR3 U-Net enables their model to support arbitrary resolutions. An initial predictor is also employed in the wavelet-based approach DiWa [113]. Moreover, wavelet SR models, such as DWSR [112] (a simple sequence of convolution layers of depth 10), are utilized for denoising prediction in the wavelet domain. In WaveDM [114], a deterministic U-Net predictor is used for the high-frequency band, while diffusion is applied in the low-frequency band.
Latent Diffusion Models proposed by Rombach et al. [13] use a VQ-GAN [5] autoencoder to operate in the latent space. For DiffIR [145], multiple variations of state-of-the-art Vision Transformers are employed [49], [146], [147]. Another common practice is pre-training deterministic components, as seen in models like SRDiff [79] or DiffIR [145]. Overall, the potential ways to design a denoising network are virtually unlimited, generally drawing inspiration from advancements made in general computer vision. The optimal denoising networks will vary based on the task, and the development of new models is anticipated.

VI. DIFFUSION-BASED ZERO-SHOT SR
Zero-shot image SR aims to develop methods that do not depend on prior image examples or training [16], [153]. Typically, these methods harness the inherent redundancy within a single image for improvement. They often leverage pre-trained DMs for generation, incorporating LR images as conditions during the sampling process, in contrast to other conditioning methods discussed earlier [154]. Additionally, they differ from guidance-based methods, where conditioning information is used to weight the training of a DM from scratch. A recent study by Li et al. [17] categorizes diffusion-based methods into projection-based, decomposition-based, and posterior estimation, which are introduced in this section. The discussed methods are compared in Table IV.

A. Projection-Based
Projection-based methods aim to extract inherent structures or textures from LR images to complement the generated images at each step and to ensure data consistency. An illustrative example of a projection-based method in the realm of inpainting tasks is RePaint [155]. In RePaint, the diffusion process is selectively applied to the specific area requiring inpainting, leaving the remaining image portions unaltered. Taking inspiration from this concept, YODA [143] applies a similar technique, but for image SR. YODA incorporates importance masks derived from DINO [144] to define the areas for diffusion during each time step, but it is not a zero-shot approach.
One zero-shot method is ILVR [120], which projects the low-frequency information from the LR image to the HR image, ensuring data consistency and establishing an improved DM condition. A more sophisticated method is Come-Closer-Diffuse-Faster (CCDF) [156], which adapts the unified projection method to SR as follows:

$$\hat{z}_{t-1} = f(z_t, t) + g(z_t, t) \cdot \varepsilon_t \quad (35)$$
$$z_{t-1} = (I - P) \cdot \hat{z}_{t-1} + \hat{x}, \quad \hat{x} \sim q(z_t \mid z_0 = x), \quad (36)$$

where $f$ and $g$ depend on the type of DM, $P$ is the degradation process of the LR image, and $\hat{x}$ is the LR image with added, time-dependent noise.
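The projection in Equation 36 can be sketched in a few lines of NumPy. This is an illustrative, ILVR-style reading under stated assumptions rather than the implementation of ILVR or CCDF: $P$ is taken to be a low-pass operator (average pooling followed by nearest-neighbor upsampling), and $\hat{x}$ is the low-pass content of the upsampled LR image after forward noising. The names `low_pass` and `project`, the scale factor, and the value of $\bar{\alpha}_t$ are hypothetical.

```python
import numpy as np

def low_pass(img, n):
    """Assumed P: average-pool by n, then nearest-neighbor upsample back."""
    h, w = img.shape
    pooled = img.reshape(h // n, n, w // n, n).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, n, axis=0), n, axis=1)

def project(z_hat_prev, x_lr, alpha_bar, n, rng):
    """Eq. 36: z_{t-1} = (I - P) z_hat_{t-1} + x_hat."""
    # Bring the LR image to HR size and noise it to the current level.
    x_up = np.repeat(np.repeat(x_lr, n, axis=0), n, axis=1)
    noise = rng.standard_normal(x_up.shape)
    x_noisy = np.sqrt(alpha_bar) * x_up + np.sqrt(1.0 - alpha_bar) * noise
    x_hat = low_pass(x_noisy, n)  # keep only the low-frequency part
    # Replace the low-frequency content of the sample with that of x_hat.
    z_prev = z_hat_prev - low_pass(z_hat_prev, n) + x_hat
    return z_prev, x_hat

rng = np.random.default_rng(0)
n = 4
x_lr = rng.uniform(size=(4, 4))               # toy LR image
z_hat_prev = rng.standard_normal((16, 16))    # stand-in for the output of Eq. 35
z_prev, x_hat = project(z_hat_prev, x_lr, alpha_bar=0.5, n=n, rng=rng)
```

Because the assumed $P$ is idempotent, the low-frequency content of the projected sample equals $\hat{x}$ exactly, while the high-frequency content produced by the DM step is left untouched.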

TABLE IV: Comparison of zero-shot methods. Data in bold represents the best performance. Second-best is underlined. Values derived from Li et al. [17].

                      ImageNet 1K                   CelebA 1K                     Time        Flops
Methods        PSNR ↑   SSIM ↑   LPIPS ↓     PSNR ↑   SSIM ↑   LPIPS ↓     [s/image]   [G]
Bicubic        25.36    0.643    0.27        24.26    0.628    0.34        -           -
ILVR [120]     27.40    0.871    0.21        31.59    0.878    0.22        41.3        1113.75
SNIPS [148]    24.31    0.684    0.21        27.34    0.675    0.27        31.4        -
DDRM [149]     27.38    0.869    0.22        31.64    0.946    0.19        10.1        1113.75
DPS [150]      25.88    0.814    0.15        29.65    0.878    0.18        141.2       1113.75
DDNM [151]     27.46    0.871    0.15        31.64    0.945    0.16        15.5        1113.75
GDP [152]      26.51    0.832    0.14        28.65    0.876    0.17        3.1         1113.76

B. Decomposition-Based
Decomposition-based methods view image SR tasks as a linear inverse problem similar to Equation 1:

$$x = Ay + b, \quad (37)$$

where $A$ is the degradation operator and $b$ is contaminating noise. Among the earliest decomposition-based methods, we find SNIPS [148] and its subsequent work DDRM [149]. These methods employ diffusion in the spectral domain, enhancing SR outcomes. To achieve this, they apply singular value decomposition to the degradation operator $A$, thereby facilitating a spectral-domain transformation that contributes to their improved SR results.
The Denoising Diffusion Null-space Model (DDNM) represents another decomposition-based zero-shot approach applicable to a broad range of linear IR problems beyond image SR, such as colorization, inpainting, and deblurring [151]. It leverages the range-null space decomposition methodology [157], [158] to tackle diverse IR challenges effectively. DDNM approaches the problem by reconfiguring Equation 1 as a linear inverse problem, although it is essential to note that this approach differs from SNIPS and DDRM in that it operates in a noiseless context:

$$x = Ay, \quad (38)$$

with $y \in \mathbb{R}^{D \times 1}$ as the linearized HR image and $x \in \mathbb{R}^{d \times 1}$ the linearized degraded image. Furthermore, it has to conform to the following two constraints:

$$\text{Consistency}: A\hat{y} \equiv x, \qquad \text{Realness}: \hat{y} \sim p(y), \quad (39)$$

with $p(y)$ as the distribution of ground-truth images and $\hat{y}$ the predicted image. The range-null space decomposition allows constructing a general solution for $\hat{y}$ in the form of:

$$\hat{y} = A^{\dagger} x + (I - A^{\dagger} A)\bar{y}, \quad (40)$$

with $A^{\dagger} \in \mathbb{R}^{D \times d}$ the pseudo-inverse that satisfies $A A^{\dagger} A \equiv A$. Our goal is to find a proper $\bar{y}$ that generates the null-space $(I - A^{\dagger} A)\bar{y}$ and agrees with the range-space $A^{\dagger} x$ that also fulfills realness in Equation 39.
DDNM derives clean intermediate states, denoted as $z_{0|t}$, for the range-null space decomposition, i.e., estimates of $z_0$ at time-step $t$. This is achieved through the equation:

$$z_{0|t} = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( z_t - \epsilon_\theta(z_t, t) \sqrt{1 - \bar{\alpha}_t} \right) \quad (41)$$

with $\epsilon_t = \epsilon_\theta(z_t, t)$. To produce a $z_0$ that fulfills the equation $A z_0 \equiv x$, the model leaves the null-space unaltered while setting the range-space as $A^{\dagger} x$. This generates a rectified estimation, $\hat{z}_{0|t}$, defined by:

$$\hat{z}_{0|t} = A^{\dagger} x + (I - A^{\dagger} A) z_{0|t}. \quad (42)$$

Finally, $z_{t-1}$ is derived by sampling from $p(z_{t-1} \mid z_t, \hat{z}_{0|t})$:

$$z_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \hat{z}_{0|t} + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} z_t + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad (43)$$

with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=0}^{t} \alpha_i$, illustrated in Figure 8. The term $z_{t-1}$ represents a noised version of $\hat{z}_{0|t}$. This noise effectively mitigates the dissonance between the range-space contents, represented by $A^{\dagger} x$, and the null-space contents, denoted by $(I - A^{\dagger} A) z_{0|t}$. The authors of DDNM additionally show that $\hat{z}_{0|t}$ satisfies the consistency constraint.
The last step involves defining $A$ and $A^{\dagger}$, the construction of which is contingent on the restoration task at hand. For instance, in SR tasks involving scaling by a factor of $n$, $A$ can be defined as a $1 \times n^2$ matrix, representative of an average-pooling operator. The average-pooling operator, denoted as $\left[ \frac{1}{n^2} \; \dots \; \frac{1}{n^2} \right]$, functions to average each patch into a single value. Similarly, we can construct its pseudo-inverse as $A^{\dagger} \in \mathbb{R}^{n^2 \times 1} = \left[ 1 \; \dots \; 1 \right]^{\top}$. The original work provides further examples of tasks (such as colorization, inpainting, and restoration), illustrating how these methods are applied. In addition, it describes how compound operations consisting of numerous sub-operations function in these contexts. In their research, the authors also introduced DDNM+ to support the restoration of noisy images. They utilized a technique analogous to the "back and forward" strategy implemented in RePaint [155]. This approach was leveraged to enhance the quality further.
Given this approach's novelty, only a handful of subsequent studies extend and build upon it, such as the work presented in CDPMSR [121]. This research direction promises exciting possibilities, although it calls for further investigation. For example, it should be noted that the DDNM approach introduces additional computational expenses compared to the task-specific training carried out using DDPMs. Moreover, the degradation operator $A$ is set manually, which can be challenging for certain tasks. Another potential drawback is the assumption that $A$ functions as a linear degradation operator, which may not always hold true and thus could limit the model's effectiveness in certain scenarios.
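The average-pooling operator and the rectification of Equations 41 and 42 can be checked numerically. The following NumPy sketch is a toy illustration under stated assumptions, not the DDNM implementation: $A$ and $A^{\dagger}$ are applied patch-wise to a single-channel image, a random array stands in for the network-derived estimate $z_{0|t}$, and the function names (`avg_pool`, `replicate`, `estimate_z0`, `rectify`) are hypothetical.

```python
import numpy as np

def avg_pool(y, n):
    """A: average each n x n patch of the HR image into one LR pixel."""
    h, w = y.shape
    return y.reshape(h // n, n, w // n, n).mean(axis=(1, 3))

def replicate(x, n):
    """A^dagger: copy each LR pixel into an n x n patch."""
    return np.repeat(np.repeat(x, n, axis=0), n, axis=1)

def estimate_z0(z_t, eps_pred, alpha_bar_t):
    """Eq. 41: clean estimate z_{0|t} from z_t and the predicted noise."""
    return (z_t - eps_pred * np.sqrt(1.0 - alpha_bar_t)) / np.sqrt(alpha_bar_t)

def rectify(z0_t, x, n):
    """Eq. 42: A^dagger x + (I - A^dagger A) z_{0|t}."""
    return replicate(x, n) + z0_t - replicate(avg_pool(z0_t, n), n)

n = 4
# Per-patch matrices from the text: A is 1 x n^2, A^dagger is n^2 x 1.
A_patch = np.full((1, n * n), 1.0 / n**2)
A_pinv_patch = np.ones((n * n, 1))
assert np.allclose(A_patch @ A_pinv_patch @ A_patch, A_patch)  # A A^dagger A = A

rng = np.random.default_rng(0)
x = rng.uniform(size=(4, 4))          # toy LR image
z0_t = rng.uniform(size=(16, 16))     # stand-in for the estimate from Eq. 41
z0_hat = rectify(z0_t, x, n)
assert np.allclose(avg_pool(z0_hat, n), x)  # consistency: A z0_hat = x
```

The final assertion reflects the consistency property stated above: whatever the denoiser predicts, the rectified estimate reproduces the LR image exactly under $A$, and only the null-space content is left for the DM to fill in.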

Fig. 8: Overview of DDNM [151]. It utilizes the range-null space decomposition to construct a general solution for multiple tasks, such as image SR, colorization, inpainting, and deblurring.

C. Posterior Estimation
Most projection-based methods typically address the noiseless inverse problem. However, this assumption can weaken data consistency because the projection process can deviate the sample path from the data manifold [17]. To address this and enhance data consistency, some recent works [150], [159], [160] take a different approach by aiming to estimate the posterior distribution using Bayes' theorem:

$$p(z_t \mid x) = \frac{p(x \mid z_t) \cdot p(z_t)}{p(x)}, \quad (44)$$

This Bayesian approach provides a more robust and probabilistic framework for solving inverse problems, ultimately improving results in various image processing tasks. It results in the corresponding score function:

$$\nabla_{z_t} \log p_t(z_t \mid x) = \nabla_{z_t} \log p_t(x \mid z_t) + s_\theta(z_t, t), \quad (45)$$

where $s_\theta(z_t, t)$ is extracted from a pre-trained model while $p_t(x \mid z_t)$ is intractable. Thus, the goal is to estimate $p_t(x \mid z_t)$ precisely. MCG [159] and DPS [150] approximate the posterior $p_t(x \mid z_t)$ with $p_t(x \mid \hat{z}_0(z_t))$, where $\hat{z}_0(z_t)$ is the expectation given $z_t$ as $\hat{z}_0(z_t) = \mathbb{E}[z_0 \mid z_t]$ according to Tweedie's formula [150]. While MCG also relies on projection, which can be harmful to data consistency, DPS discards the projection step and estimates the posterior as:

$$\nabla_{z_t} \log p_t(x \mid z_t) \approx \nabla_{z_t} \log p(x \mid \hat{z}_0(z_t)) \approx -\frac{1}{\sigma^2} \nabla_{z_t} \lVert x - H(\hat{z}_0(z_t)) \rVert_2^2, \quad (46)$$

where $H$ is a forward measurement operator. A further expansion of this formula to a unified form for linear, non-linear, and differentiable inverse problems with the Moore-Penrose pseudoinverse can be found in ΠGDM [160].
A different approach to estimate $p_t(x \mid z_t)$ is demonstrated by GDP [152]. The authors noted that a higher conditional probability of $p_t(x \mid z_t)$ correlates with a smaller distance between the application of the degradation model $D(z_t)$ and $x$. Thus, they propose a heuristic approximation:

$$p_t(x \mid z_t) \approx \frac{1}{Z} \exp\left(-\left[ s L(D(z_t), x) \right]\right) + \lambda Q(z_t), \quad (47)$$

where $L$ and $Q$ denote a distance and quality metric, respectively. The term $Z$ is for normalization, and $s$ is a scaling factor controlling the guidance weight. However, due to varying noise levels between $z_t$ and $x$, precisely defining the distance metric $L$ can be challenging. To overcome this challenge, GDP substitutes $z_t$ with its clean estimation $\hat{z}_0$ in the distance calculation, providing a pragmatic solution to the noise discrepancy issue.

VII. DOMAIN-SPECIFIC APPLICATIONS
SR3 [15] produces photo-realistic and perceptually state-of-the-art images on faces and natural images but may not be suitable for other tasks like remote sensing. Some models are more suited to certain tasks as they tackle issues specific to the domain [161]. This section highlights the applications of DMs to domain-specific SR tasks: medical imaging, special cases of face SR (Blind Face Restoration and Atmospheric Turbulence), and remote sensing.

A. Medical Imaging
Magnetic Resonance Imaging (MRI) scans are widely used to aid patient diagnosis but can often be of low quality and corrupted with noise. Chung et al. [162] propose a combined denoising and SR network referred to as R2D2+ (Regularized Reverse Diffusion Denoiser + SR). They perform denoising of the MRI scans, followed by an SR module. Inspired by CCDF (i.e., a zero-shot method) from Chung et al. [156], they start their backward diffusion from an initial noisy image instead of pure Gaussian noise. The reverse SDE is solved using a non-parametric, eigenvalue-based method. In addition, they restrict the stochasticity of the DMs through low-frequency regularization. In particular, they maintain low-frequency information while correcting the high-frequency information to produce sharp and super-resolved MRI scans. Mao et al. [163] address the lack of diffusion-based multi-contrast MRI SR methods. They propose a Disentangled Conditional Diffusion model (DisC-Diff) to leverage a multi-conditional fusion strategy based on representation disentanglement, enabling high-quality HR image sampling. Specifically, they employ a disentangled U-Net with multiple encoders to extract latent representations and use a novel joint disentanglement and Charbonnier loss function to learn representations across MRI contrasts. They also implement curriculum learning and improve their MRI model for varying anatomical complexity by gradually increasing the difficulty of training images. An improvement of DisC-Diff by combining the DM with a transformer was introduced by Li et al. with DiffMSR [164].

B. Blind Face Restoration
Most previously discussed SR methods are founded on a fixed degradation process during training, such as bicubic downsampling. However, when applied practically, these assumptions frequently diverge from the actual degradation process and yield subpar results. Additionally, datasets with pairs of clean and real-world distorted images are usually unavailable. This issue is particularly researched in face SR, termed Blind Face Restoration (BFR), where datasets typically contain supervised samples $(x, y)$ with unknown degradation.
A solution to BFR was proposed by Yue et al. with DifFace [165] that leverages the rich generative priors of pre-trained DMs with parameters $\theta$, which were trained to approximate

$p_\theta(z_{t-1} \mid z_t)$. In contrast to existing methods that learn direct mappings from $x$ to $y$ under several constraints [166], [167], DifFace circumvents this by generating a diffused version $z_N$ of the desired HR image $y$ with $N < T$. They predict the starting point, the posterior $q(z_N \mid x)$, via a transition distribution $p(z_N \mid x)$. The transition distribution is formulated like the regular diffusion process, a Gaussian distribution, but uses an initial predictor $\varphi(x)$ to generate the mean, named diffused estimator. As their model borrows the reverse Markov chain from a pre-trained DM, DifFace requires no full retraining for new and unknown degradations, unlike SR3.
A concurrent and better-performing approach is DiffBFR [168], which adopts a two-step approach to BFR: an Identity Restoration Module (IRM), which employs two conditional DDPMs, and a Texture Enhancement Module (TEM), which employs an unconditional DDPM. In the first step within the IRM, a conditional DDPM enriches facial details in a low-resolution space of the same size as $x$. The downsampled version of $y$ gives the target objective. Next, it resizes the output to the desired spatial size of $y$ and applies another conditional DDPM to approximate the HR image $y$. To ensure minimal deviation from the actual image, DiffBFR employs a novel truncated sampling method, which begins denoising at intermediate steps. The TEM further enhances realism through image texture and sharpened facial details. It imposes a diffusion-based facial prior with an unconditional DM trained on HR images and a backward diffusion starting from pure noise. However, it has more parameters than SR3 and requires optimization to accelerate sampling.
Another method is DR2E [169], which employs two stages: degradation removal and enhancement modules. For degradation removal, they use a pre-trained face SR DDPM to remove degradations from an LR image with severe and unknown degradations. In particular, they diffuse the degraded image $x$ in $T$ time steps to obtain $x_T = z_T$. Then, they use $x_t$ to guide the backward diffusion such that the low-frequency part of $z_t$ is replaced with that of $x_t$, which is close in distribution. Theoretically, it produces visually clean intermediate results that are degradation-invariant. In the second stage, the enhancement module $p_\theta(y \mid z_0)$, an arbitrary backbone CNN trained to map LR images to HR using a simple L2 loss, predicts the final output. DR2E can be slower than existing diffusion-based SR models for images with slight degradations and can even remove details from the input.

C. Atmospheric Turbulence in Face SR
Atmospheric Turbulence (AT) results from fluctuations in atmospheric conditions, which perceptually degrade images through geometric distortions, spatially variant blur, and noise. These alterations negatively impact downstream vision tasks, such as tracking or detection. Wang et al. [170] introduced a variational inference framework known as AT-VarDiff, which aims to correct AT in generic scenes. The distinctive feature of this approach is its reliance on a conditioning signal derived from latent task-specific prior information extracted from the input image to guide the DM. Nair et al. [171] put forth another technique to restore facial images impaired by AT using SR. The method transfers class prior information from an SR model trained on clean facial data to a model designed to counteract turbulence degradation via knowledge distillation. The final model operates within the realistic faces manifold, which allows it to generate realistic face outputs even under substantial distortions. During inference, the process begins with noise- and turbulence-degraded images to ensure that the restored images closely resemble the distorted ones.

D. Remote Sensing
Remote Sensing Super-Resolution (RSSR) addresses the HR reconstruction from one or more LR images to aid object detection and semantic segmentation tasks for satellite imagery. RSSR is limited by the absence of small targets with complex granularity in the HR images [172]. To produce finer details and texture, Liu et al. [173] present DMs with a Detail Complement mechanism (DMDC). They train their model similarly to SR3 [15] and perform a detail supplement task. To generate high-frequency information, they randomly mask several parts of the images to mimic dense objects. The SR images recover the occluded patches as the model learns small-grained information. Additionally, they introduce a novel pixel constraint loss to limit the diversity of DMDC and improve overall accuracy. Ali et al. [174] design a new architecture for RS images that integrates Vision Transformers (ViT) with DMs as a Two-stage approach for Enhancement and Super-Resolution (TESR). In the first stage (SR stage), the SwinIR [49] model is used for RSSR. In the second stage (enhancement stage), the noisy images are enhanced by employing DMs to reconstruct the finer details. Xu et al. [175] propose a blind SR framework based on Dual conditioning DDPMs for SR (DDSR). A kernel predictor conditioned on LR image encodings estimates the degradation kernel in the first stage. This is followed by an SR module consisting of a conditional DDPM in a U-Net with the predicted kernel and the LR encodings as guidance. An RRDB encoder extracts the encodings from LR images. Recently, Khanna et al. introduced DiffusionSat [176], which uses an LDM for RSSR and incorporates additional remote sensing conditioning information (e.g., longitude, latitude, and cloud cover).

VIII. DISCUSSION AND FUTURE WORK
Though relatively new, DMs are quickly becoming a promising research area, especially in image SR. There are several avenues of ongoing research in this field, aiming to enhance the efficiency of DMs, accelerate computation speeds, and minimize memory footprint, all while generating high-quality, high-fidelity images. This section introduces common problems of DMs for image SR and examines noteworthy research avenues for DMs specific to image SR.

A. Color Shifting
Often, the most practical advancements come from a solid theoretical understanding. As discussed in subsection V-F, due to the substantial computational demands, DMs may occasionally exhibit color shifts when constrained by hardware

limitations that demand smaller batch sizes or shorter training periods [142]. While well-defined diffusion methods [143] or color normalization [107] might mitigate this problem, a theoretical understanding of why it emerges is necessary.

B. Computational Costs
In a study conducted by Ganguli et al., it was observed that the computing power needed for large-scale AI experiments has surged by over 300,000 times in the last decade [177]. Regrettably, this increase in resource intensity has been accompanied by a sharp decline in the share of these results originating from academic circles. DMs are not immune to this issue; their computational demands add to the expanding gap between industry and academia. Therefore, there is a pressing need to reduce computational costs and memory footprints for practical applicability and research. One strategy to alleviate computational demands is to examine smaller spatial-sized domains, as discussed in subsection V-C. Examples of such approaches include LDMs [5], [13] and wavelet-based models [113], [115]. However, the capability of LDMs to reconstruct data with high precision and fine-grained accuracy, as required in image SR, remains in question. Therefore, further advancements in these methods are critically needed. On the other hand, wavelet-based models do not present a bottleneck regarding information preservation. This advantage suggests that they should be the subject of more intensive exploration.

C. Efficient Sampling
A benefit of DMs is the possibility of decoupling training and inference schedules [178]. This allows for substantial reductions in the time required for inference in practical applications, providing a significant efficiency edge in real-world scenarios. While reducing the number of steps taken during inference is relatively simple, a systematic method for determining inference schedules has yet to be developed [179]. As outlined in subsection IV-A, this research direction represents a promising avenue. We explored training-based sampling methods for SR with AddSR [88] and YONOS-SR [89] but also introduced efficient DMs that need fewer sampling steps, like ResShift [118] and DiffIR [145]. An alternative is given by methods that use different corruption spaces, as discussed in subsection V-E. Unlike sampling from pure Gaussian noise, notable works such as Luo et al. [108], I²SB [140], Come-Closer-Diffuse-Faster [156], or Cold Diffusion [139] define a process from the LR to the HR image directly. Additional techniques for decreasing computation time, such as knowledge distillation, alternative noise schedulers, or truncated diffusion, demand further investigation concerning image SR [84], [85], [180], [181].

D. Corruption Spaces
New approaches for corruption spaces allow a more direct approach for upsampling images from LR to HR. The significance of exploring different corruption spaces lies in addressing the inherent limitations and assumptions embedded within current DM frameworks, e.g., diversity and blurriness added during the forward diffusion process. The adaptability and efficiency demonstrated by novel approaches like InDI or I²SB, especially in handling diverse and complex corruption patterns, highlight the need for future research.

E. Comparability
Comparing DMs in SR is complex because of the varied datasets used in different studies. They vary in resolution, content diversity, color distribution, and noise levels, all of which significantly influence model performance. A model may perform well with one dataset but poorly with another, complicating the assessment of its overall effectiveness. Establishing a standard benchmark with diverse, representative datasets and uniform evaluation metrics is essential for comparability. This approach would help identify models that consistently perform well across different conditions and tasks, thereby promoting faster progress in the field. Furthermore, evaluating the quality of SR images from generative models is still problematic. Although DMs often produce more photorealistic images, they typically score lower on standard metrics like PSNR and SSIM [16]. However, these models tend to receive more favorable assessments from human evaluators [15]. LPIPS [182] reflects this perception better, but the domain of image SR has to adapt to more diverse metrics, such as predictors that reflect human ratings directly [183], [184]. For instance, datasets with subjective ratings, like TID2013 [61], and neural networks, such as DeepQA [62] or NIMA [63], can be employed to predict human-like scoring of images and should be further explored.

F. Image Manipulation
Image manipulation can be particularly useful in multi-image SR for generating HR images that blend characteristics from multiple sources, potentially improving the quality and diversity of the output (e.g., satellite imagery for SR predictions with flexible daylight). SRDiff [79] proposed two potential extensions: content fusion and latent space interpolation. Content fusion involves the combination of content from two source images. For instance, they replace the eyes in one source image with the face from another image before conducting diffusion in the image space like CutMix [185]. The backward diffusion successfully creates a smooth transition between both images. In the latent space interpolation model, the latent space of two SR predictions is linearly interpolated to generate a new image. While these extensions have yielded remarkable results, unlike other generative models such as VAEs or GANs, DMs have been found to offer less proficient latent representations [186]. Therefore, recent and ongoing research into the manipulation of latent representations in DMs is both in its early stages and greatly needed [187]–[189].

G. Cascaded Image Generation
Saharia et al. [15] presented cascaded image SR, in which multiple DDPMs are chained across different scales. This strategy was applied to unconditional and class-conditional generation, cascading a model synthesizing 64 × 64 images
with SR3 models generating 1024 × 1024 unconditional faces and 256 × 256 class-conditional natural images. The cascading approach allows several simpler models to be trained simultaneously, improving computational efficiency due to faster training times and reduced parameter counts. Furthermore, they implemented cascading for inference, using more refinement steps at lower and fewer steps at higher resolutions. They found this more efficient than generating SR images directly. Even though their approach underperforms compared to BigGAN [102] concerning cascaded generation, it still represents an exciting research opportunity.

IX. CONCLUSION

Diffusion Models (DMs) revolutionized image Super-Resolution (SR) by enhancing both technical image quality and human perceptual preferences. While traditional SR often focuses solely on pixel-level accuracy, DMs can generate HR images that are aesthetically pleasing and realistic. Unlike previous generative models, they do not suffer typical convergence issues. This survey explored the progress and diverse methods that have propelled DMs to the forefront of SR. Potential use cases, as discussed in our applications section, extend far beyond what was previously imagined. We introduced their foundational principles and compared them to other generative models. We explored conditioning strategies, from LR image guidance to text embeddings. Zero-shot SR, a particularly intriguing paradigm, was also a subject, as well as corruption spaces and image SR-specific topics like color shifting and architectural designs. In conclusion, the survey provides a comprehensive guide to the current landscape and valuable insights into trends, challenges, and future directions. As we continue to explore and refine these models, the future of image SR looks more promising than ever.

ACKNOWLEDGMENT

This work was supported by the BMBF project XAINES (Grant 01IW20005) and SustainML (Horizon Europe grant agreement No 101070408).

REFERENCES

[1] W. Sun and Z. Chen, “Learned image downscaling for upscaling using content adaptive resampler,” IEEE TIP, vol. 29, 2020.
[2] D. Valsesia and E. Magli, “Permutation invariance and uncertainty in multitemporal image super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, 2021.
[3] S. M. A. Bashir, Y. Wang, M. Khan, and Y. Niu, “A comprehensive review of deep learning-based single image super-resolution,” PeerJ Computer Science, vol. 7, 2021.
[4] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in CVPR, 2022.
[5] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, 2021.
[6] B. Guo, X. Zhang, H. Wu, Y. Wang, Y. Zhang, and Y.-F. Wang, “Lar-sr: A local autoregressive model for image super-resolution,” in CVPR, 2022.
[7] S. Frolov, T. Hinz, F. Raue, J. Hees, and A. Dengel, “Adversarial text-to-image synthesis: A review,” Neural Networks, vol. 144, 2021.
[8] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, 2020.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, 2020.
[10] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” NeurIPS, vol. 32, 2019.
[11] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv:2011.13456, 2020.
[12] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” NeurIPS, vol. 34, 2021.
[13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
[14] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv:2204.06125, 2022.
[15] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE TPAMI, vol. 45, no. 4, 2023.
[16] B. B. Moser, F. Raue, S. Frolov, S. Palacio, J. Hees, and A. Dengel, “Hitchhiker’s guide to super-resolution: Introduction and recent advances,” IEEE TPAMI, 2023.
[17] X. Li, Y. Ren, X. Jin, C. Lan, X. Wang, W. Zeng, X. Wang, and Z. Chen, “Diffusion models for image restoration and enhancement–a comprehensive survey,” arXiv:2308.09388, 2023.
[18] A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong, “Blind image super-resolution: A survey and beyond,” IEEE TPAMI, vol. 45, no. 5, 2022.
[19] S. Anwar and N. Barnes, “Densely residual laplacian super-resolution,” IEEE TPAMI, 2020.
[20] MATLAB, The Mathworks, Inc., Natick, Massachusetts, 2017.
[21] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in CVPRW, July 2017.
[22] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” 2012.
[23] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International conference on curves and surfaces. Springer, 2010.
[24] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV, vol. 2. IEEE, 2001.
[25] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in CVPR, 2015.
[26] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, vol. 76, no. 20, 2017.
[27] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in CVPRW, 2017.
[28] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019.
[29] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv:1710.10196, 2017.
[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
[31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL voc2012 Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
[32] K. I. Kim and Y. Kwon, “Single-image super-resolution using sparse regression and natural image prior,” IEEE TPAMI, vol. 32, no. 6, 2010.
[33] G. Freedman and R. Fattal, “Image and video upscaling from local self-examples,” ACM Trans. Graph., vol. 30, no. 2, apr 2011.
[34] J. Sun, Z. Xu, and H.-Y. Shum, “Image super-resolution using gradient profile prior,” in CVPR, 2008.
[35] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., vol. 1. IEEE, 2004.
[36] W. Freeman, T. Jones, and E. Pasztor, “Example-based super-resolution,” IEEE Computer Graphics and Applications, vol. 22, no. 2, 2002.
[37] R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 6, 1981.
[38] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical Models and Image Processing, vol. 53, no. 3, 1991.
[39] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE TIP, vol. 19, no. 11, 2010.
[40] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE TPAMI, vol. 38, no. 2, 2015.
[41] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV. Springer, 2016.
[42] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016.
[43] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
[44] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
[45] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in ICCV, 2017.
[46] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in CVPR, 2016.
[47] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in CVPR, 2017.
[48] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in ECCV, 2018.
[49] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in CVPR, 2021.
[50] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong, “Activating more pixels in image super-resolution transformer,” in CVPR, 2023.
[51] C.-C. Hsu, C.-M. Lee, and Y.-S. Chou, “Drct: Saving image super-resolution away from information bottleneck,” arXiv:2404.00722, 2024.
[52] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
[53] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” 2018.
[54] A. Lugmayr, M. Danelljan, L. Van Gool, and R. Timofte, “Srflow: Learning the super-resolution space with normalizing flow,” in ECCV. Springer, 2020.
[55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[56] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, 2017.
[57] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE TIP, vol. 21, no. 12, 2012.
[58] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal processing letters, vol. 20, no. 3, 2012.
[59] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021.
[60] J. Wang, K. C. Chan, and C. C. Loy, “Exploring clip for assessing the look and feel of images,” in AAAI, vol. 37, no. 2, 2023.
[61] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti et al., “Image database tid2013: Peculiarities, results and perspectives,” Signal processing: Image communication, vol. 30, 2015.
[62] J. Kim and S. Lee, “Deep learning of human visual sensitivity in image quality assessment framework,” in CVPR, 2017.
[63] H. Talebi and P. Milanfar, “Nima: Neural image assessment,” IEEE TIP, vol. 27, no. 8, 2018.
[64] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “Musiq: Multi-scale image quality transformer,” in ICCV, 2021.
[65] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, 2023.
[66] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML. PMLR, 2015.
[67] G. Parisi, “Correlation functions and computer simulations,” Nuclear Physics B, vol. 180, no. 3, 1981.
[68] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, 2011.
[69] A. Hyvärinen and P. Dayan, “Estimation of non-normalized statistical models by score matching.” Journal of Machine Learning Research, vol. 6, no. 4, 2005.
[70] Y. Song, S. Garg, J. Shi, and S. Ermon, “Sliced score matching: A scalable approach to density and score estimation,” in Uncertainty in Artificial Intelligence. PMLR, 2020.
[71] B. D. Anderson, “Reverse-time diffusion equation models,” Stochastic Processes and their Applications, vol. 12, no. 3, 1982.
[72] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” NeurIPS, vol. 27, 2014.
[73] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv:1312.6114, 2013.
[74] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in ICML. PMLR, 2015.
[75] Q. Zhang and Y. Chen, “Diffusion normalizing flow,” in NeurIPS, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021.
[76] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, “Normalizing flows for probabilistic modeling and inference,” The Journal of Machine Learning Research, vol. 22, no. 1, 2021.
[77] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” NeurIPS, vol. 35, 2022.
[78] “Oxford vggface implementation using keras functional framework v2+,” https://github.com/rcmalli/keras-vggface.
[79] H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Srdiff: Single image super-resolution with diffusion probabilistic models,” Neurocomputing, vol. 479, 2022.
[80] D. Watson, J. Ho, M. Norouzi, and W. Chan, “Learning to efficiently sample from diffusion probabilistic models,” arXiv:2106.03802, 2021.
[81] D. Watson, W. Chan, J. Ho, and M. Norouzi, “Learning fast samplers for diffusion models by differentiating through sample quality,” arXiv:2202.05830, 2022.
[82] Z. Lyu, X. Xu, C. Yang, D. Lin, and B. Dai, “Accelerating diffusion models via early stop of the diffusion process,” arXiv:2205.12524, 2022.
[83] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in CVPR, 2023.
[84] C. Meng, R. Rombach, R. Gao, D. Kingma, S. Ermon, J. Ho, and T. Salimans, “On distillation of guided diffusion models,” in CVPR, 2023.
[85] E. Luhman and T. Luhman, “Knowledge distillation in iterative generative models for improved sampling speed,” arXiv:2101.02388, 2021.
[86] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv:2202.00512, 2022.
[87] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion gans,” arXiv:2112.07804, 2021.
[88] R. Xie, Y. Tai, K. Zhang, Z. Zhang, J. Zhou, and J. Yang, “Addsr: Accelerating diffusion-based blind super-resolution with adversarial diffusion distillation,” arXiv:2404.01717, 2024.
[89] M. Noroozi, I. Hadji, B. Martinez, A. Bulat, and G. Tzimiropoulos, “You only need one step: Fast super-resolution with stable diffusion via scale distillation,” arXiv:2401.17258, 2024.
[90] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv:2010.02502, 2020.
[91] A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas, “Gotta go fast when generating data with score-based models,” arXiv:2105.14080, 2021.
[92] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
[93] F. Bao, C. Li, J. Zhu, and B. Zhang, “Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models,” arXiv:2201.06503, 2022.
[94] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models,” arXiv:2211.01095, 2022.
[95] W. Zhao, L. Bai, Y. Rao, J. Zhou, and J. Lu, “Unipc: A unified predictor-corrector framework for fast sampling of diffusion models,” NeurIPS, vol. 36, 2024.
[96] J. Ho, E. Lohn, and P. Abbeel, “Compression with flows via local bits-back coding,” NeurIPS, vol. 32, 2019.
[97] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov, “Good semi-supervised learning that requires a bad gan,” NeurIPS, vol. 30, 2017.
[98] Y. Song, C. Durkan, I. Murray, and S. Ermon, “Maximum likelihood training of score-based diffusion models,” NeurIPS, vol. 34, 2021.
[99] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational diffusion models,” NeurIPS, vol. 34, 2021.
[100] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in ICML. PMLR, 2021.
[101] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation.” J. Mach. Learn. Res., vol. 23, no. 47, 2022.
[102] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv:1809.11096, 2018.
[103] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv:2207.12598, 2022.
[104] C. Luo, “Understanding diffusion models: A unified perspective,” arXiv:2208.11970, 2022.
[105] J. Kim and T.-K. Kim, “Arbitrary-scale image generation and upsampling using latent diffusion model and implicit neural decoder,” arXiv:2403.10255, 2024.
[106] A. Vahdat, K. Kreis, and J. Kautz, “Score-based generative modeling in latent space,” NeurIPS, vol. 34, 2021.
[107] J. Wang, Z. Yue, S. Zhou, K. C. Chan, and C. C. Loy, “Exploiting diffusion prior for real-world image super-resolution,” arXiv:2305.07015, 2023.
[108] Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön, “Image restoration with mean-reverting stochastic differential equations,” arXiv:2301.11699, 2023.
[109] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in ECCV. Springer, 2022.
[110] Z. Chen, Y. Zhang, D. Liu, B. Xia, J. Gu, L. Kong, and X. Yuan, “Hierarchical integration diffusion model for realistic image deblurring,” arXiv:2305.12966, 2023.
[111] B. B. Moser, S. Frolov, F. Raue, S. Palacio, and A. Dengel, “Dwa: Differential wavelet amplifier for image super-resolution,” in Artificial Neural Networks and Machine Learning – ICANN 2023, L. Iliadis, A. Papaleonidas, P. Angelov, and C. Jayne, Eds. Cham: Springer Nature Switzerland, 2023.
[112] T. Guo, H. Seyed Mousavi, T. Huu Vu, and V. Monga, “Deep wavelet prediction for image super-resolution,” in CVPRW, 2017.
[113] B. Moser, S. Frolov, F. Raue, S. Palacio, and A. Dengel, “Waving goodbye to low-res: A diffusion-wavelet approach for image super-resolution,” 2023.
[114] Y. Huang, J. Huang, J. Liu, Y. Dong, J. Lv, and S. Chen, “Wavedm: Wavelet-based diffusion models for image restoration,” arXiv:2305.13819, 2023.
[115] F. Guth, S. Coste, V. De Bortoli, and S. Mallat, “Wavelet score-based generative modeling,” NeurIPS, vol. 35, 2022.
[116] S. Shang, Z. Shan, G. Liu, and J. Zhang, “Resdiff: Combining cnn and diffusion model for image super-resolution,” arXiv:2303.08714, 2023.
[117] J. Whang, M. Delbracio, H. Talebi, C. Saharia, A. G. Dimakis, and P. Milanfar, “Deblurring via stochastic refinement,” in CVPR, 2022.
[118] Z. Yue, J. Wang, and C. C. Loy, “Resshift: Efficient diffusion model for image super-resolution by residual shifting,” 2023.
[119] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” NeurIPS, vol. 35, 2022.
[120] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “Ilvr: Conditioning method for denoising diffusion probabilistic models,” 2021.
[121] A. Niu, K. Zhang, T. X. Pham, J. Sun, Y. Zhu, I. S. Kweon, and Y. Zhang, “Cdpmsr: Conditional diffusion probabilistic models for single image super-resolution,” 2023.
[122] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPRW, 2017.
[123] S. H. Park, Y. S. Moon, and N. I. Cho, “Flexible style image super-resolution using conditional objective,” IEEE Access, vol. 10, 2022.
[124] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, “Ranksrgan: Generative adversarial networks with ranker for image super-resolution,” in CVPR, 2019.
[125] A. Lugmayr, M. Danelljan, L. V. Gool, and R. Timofte, “Srflow: Learning the super-resolution space with normalizing flow,” in ECCV. Springer, 2020.
[126] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin, “Pulse: Self-supervised photo upsampling via latent space exploration of generative models,” in CVPR, 2020.
[127] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, “Fsrnet: End-to-end learning face super-resolution with facial priors,” in CVPR, 2018.
[128] K. Pandey, A. Mukherjee, P. Rai, and A. Kumar, “Diffusevae: Efficient, controllable and high-fidelity generation from low-dimensional latents,” arXiv:2201.00308, 2022.
[129] C. Bi, X. Luo, S. Shen, M. Zhang, H. Yue, and J. Yang, “Deedsr: Towards real-world image super-resolution via degradation-aware stable diffusion,” arXiv, 2024.
[130] S. Zhou, K. Chan, C. Li, and C. C. Loy, “Towards robust blind face restoration with codebook lookup transformer,” NeurIPS, vol. 35, 2022.
[131] T. Yang, P. Ren, X. Xie, and L. Zhang, “Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization,” arXiv:2308.14469, 2023.
[132] R. Wu, T. Yang, L. Sun, Z. Zhang, S. Li, and L. Zhang, “Seesr: Towards semantics-aware real-world image super-resolution,” arXiv:2311.16518, 2023.
[133] Y. Qu, K. Yuan, K. Zhao, Q. Xie, J. Hao, M. Sun, and C. Zhou, “Xpsr: Cross-modal priors for diffusion-based image super-resolution,” arXiv:2403.05049, 2024.
[134] K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in ICCV, 2021.
[135] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in CVPR, 2021.
[136] J. Liang, H. Zeng, and L. Zhang, “Details or artifacts: A locally discriminative learning approach to realistic image super-resolution,” in CVPR, 2022.
[137] C. Chen, X. Shi, Y. Qin, X. Li, X. Han, T. Yang, and S. Guo, “Real-world blind super-resolution via feature matching with implicit high-resolution priors,” in Proceedings of the 30th ACM International Conference on Multimedia, ser. MM ’22. New York, NY, USA: Association for Computing Machinery, 2022.
[138] G. Daras, M. Delbracio, H. Talebi, A. G. Dimakis, and P. Milanfar, “Soft diffusion: Score matching for general corruptions,” arXiv:2209.05442, 2022.
[139] A. Bansal, E. Borgnia, H.-M. Chu, J. S. Li, H. Kazemi, F. Huang, M. Goldblum, J. Geiping, and T. Goldstein, “Cold diffusion: Inverting arbitrary image transforms without noise,” arXiv:2208.09392, 2022.
[140] G.-H. Liu, A. Vahdat, D.-A. Huang, E. A. Theodorou, W. Nie, and A. Anandkumar, “I2sb: Image-to-image schrödinger bridge,” arXiv:2302.05872, 2023.
[141] M. Delbracio and P. Milanfar, “Inversion by direct iteration: An alternative to denoising diffusion for image restoration,” arXiv:2303.11435, 2023.
[142] J. Choi, J. Lee, C. Shin, S. Kim, H. Kim, and S. Yoon, “Perception prioritized training of diffusion models,” in CVPR, 2022.
[143] B. B. Moser, S. Frolov, F. Raue, S. Palacio, and A. Dengel, “Yoda: You only diffuse areas. an area-masked diffusion approach for image super-resolution,” arXiv:2308.07977, 2023.
[144] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in CVPR, 2021.
[145] B. Xia, Y. Zhang, S. Wang, Y. Wang, X. Wu, Y. Tian, W. Yang, and L. Van Gool, “Diffir: Efficient diffusion model for image restoration,” arXiv:2303.09472, 2023.
[146] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
[147] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020.
[148] B. Kawar, G. Vaksman, and M. Elad, “Snips: Solving noisy inverse problems stochastically,” NeurIPS, vol. 34, 2021.
[149] B. Kawar, M. Elad, S. Ermon, and J. Song, “Denoising diffusion restoration models,” NeurIPS, vol. 35, 2022.
[150] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems,” arXiv:2209.14687, 2022.
[151] Y. Wang, J. Yu, and J. Zhang, “Zero-shot image restoration using denoising diffusion null-space model,” arXiv:2212.00490, 2022.
[152] B. Fei, Z. Lyu, L. Pan, J. Zhang, W. Yang, T. Luo, B. Zhang, and B. Dai, “Generative diffusion prior for unified image restoration and enhancement,” in CVPR, 2023.
[153] A. Shocher, N. Cohen, and M. Irani, ““zero-shot” super-resolution using deep internal learning,” in CVPR, 2018.
[154] R. Li, X. Sheng, W. Li, and J. Zhang, “Omnissr: Zero-shot omnidirectional image super-resolution using stable diffusion model,” arXiv:2404.10312, 2024.
[155] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “Repaint: Inpainting using denoising diffusion probabilistic models,” in CVPR, 2022.
[156] H. Chung, B. Sim, and J. C. Ye, “Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction,” in CVPR, 2022.
[157] J. Schwab, S. Antholzer, and M. Haltmeier, “Deep null space learning for inverse problems: convergence analysis and rates,” Inverse Problems, vol. 35, no. 2, 2019.
[158] Y. Wang, Y. Hu, J. Yu, and J. Zhang, “Gan prior based null-space learning for consistent super-resolution,” in AAAI, vol. 37, no. 3, 2023.
[159] H. Chung, B. Sim, D. Ryu, and J. C. Ye, “Improving diffusion models for inverse problems using manifold constraints,” NeurIPS, vol. 35, 2022.
[160] J. Song, A. Vahdat, M. Mardani, and J. Kautz, “Pseudoinverse-guided diffusion models for inverse problems,” in ICLR, 2022.
[161] J. Lin, Y. Wang, Z. Tao, B. Wang, Q. Zhao, H. Wang, X. Tong, X. Mai, Y. Lin, W. Song et al., “Adaptive multi-modal fusion of spatially variant kernel refinement with diffusion model for blind image super-resolution,” arXiv, 2024.
[162] H. Chung, E. S. Lee, and J. C. Ye, “Mr image denoising and super-resolution using regularized reverse diffusion,” IEEE Transactions on Medical Imaging, vol. 42, no. 4, 2022.
[163] Y. Mao, L. Jiang, X. Chen, and C. Li, “Disc-diff: Disentangled conditional diffusion model for multi-contrast mri super-resolution,” arXiv:2303.13933, 2023.
[164] G. Li, C. Rao, J. Mo, Z. Zhang, W. Xing, and L. Zhao, “Rethinking diffusion model for multi-contrast mri super-resolution,” arXiv:2404.04785, 2024.
[165] Z. Yue and C. C. Loy, “Difface: Blind face restoration with diffused error contraction,” arXiv:2212.06512, 2022.
[166] X. Wang, Y. Li, H. Zhang, and Y. Shan, “Towards real-world blind face restoration with generative facial prior,” in CVPR, 2021.
[167] T. Yang, P. Ren, X. Xie, and L. Zhang, “Gan prior embedded network for blind face restoration in the wild,” in CVPR, 2021.
[168] X. Qiu, C. Han, Z. Zhang, B. Li, T. Guo, and X. Nie, “Diff-bfr: Bootstrapping diffusion model towards blind face restoration,” arXiv:2305.04517, 2023.
[169] Z. Wang, Z. Zhang, X. Zhang, H. Zheng, M. Zhou, Y. Zhang, and Y. Wang, “Dr2: Diffusion-based robust degradation remover for blind face restoration,” in CVPR, 2023.
[170] X. Wang, S. López-Tapia, and A. K. Katsaggelos, “Atmospheric turbulence correction via variational deep diffusion,” in 2023 IEEE 6th International Conference on MIPR. IEEE, 2023.
[171] N. G. Nair, K. Mei, and V. M. Patel, “At-ddpm: Restoring faces degraded by atmospheric turbulence using denoising diffusion probabilistic models,” in WACV, 2023.
[172] Y. Xiao, Q. Yuan, K. Jiang, J. He, X. Jin, and L. Zhang, “Ediffsr: An efficient diffusion probabilistic model for remote sensing image super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[173] J. Liu, Z. Yuan, Z. Pan, Y. Fu, L. Liu, and B. Lu, “Diffusion model with detail complement for super-resolution of remote sensing,” Remote Sensing, vol. 14, no. 19, 2022.
[174] A. M. Ali, B. Benjdira, A. Koubaa, W. Boulila, and W. El-Shafai, “Tesr: Two-stage approach for enhancement and super-resolution of remote sensing images,” Remote Sensing, vol. 15, no. 9, 2023.
[175] M. Xu, J. Ma, and Y. Zhu, “Dual-diffusion: Dual conditional denoising diffusion probabilistic models for blind super-resolution reconstruction in rsis,” arXiv:2305.12170, 2023.
[176] S. Khanna, P. Liu, L. Zhou, C. Meng, R. Rombach, M. Burke, D. Lobell, and S. Ermon, “Diffusionsat: A generative foundation model for satellite imagery,” ICLR, 2024.
[177] D. Ganguli, D. Hernandez, L. Lovitt, A. Askell, Y. Bai, A. Chen, T. Conerly, N. Dassarma, D. Drain, N. Elhage et al., “Predictability and surprise in large generative models,” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.
[178] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “Wavegrad: Estimating gradients for waveform generation,” arXiv:2009.00713, 2020.
[179] Z. Cheng, “Sampler scheduler for diffusion models,” arXiv:2311.06845, 2023.
[180] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion gans,” arXiv:2112.07804, 2021.
[181] T. Chen, “On the importance of noise scheduling for diffusion models,” arXiv:2301.10972, 2023.
[182] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.
[183] X. Liu, J. Van De Weijer, and A. D. Bagdanov, “Rankiqa: Learning from rankings for no-reference image quality assessment,” in ICCV, 2017.
[184] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao, “dipiq: Blind image quality assessment by learning-to-rank discriminable image pairs,” IEEE TIP, vol. 26, no. 8, 2017.
[185] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in CVPR, 2019.
[186] B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola, “Subspace diffusion generative models,” in ECCV. Springer, 2022.
[187] M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in ICLR, 2023.
[188] Q. Wu, Y. Liu, H. Zhao, A. Kale, T. Bui, T. Yu, Z. Lin, Y. Zhang, and S. Chang, “Uncovering the disentanglement capability in text-to-image diffusion models,” in CVPR, 2023.
[189] G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion models for robust image manipulation,” in CVPR, 2022.

Brian B. Moser is a Ph.D. student at the TU Kaiserslautern and a research assistant at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. He received the M.Sc. degree in computer science from the TU Kaiserslautern in 2021. His research interests include image super-resolution and deep learning.

Arundhati S. Shanbhag is a Master’s student at the TU Kaiserslautern and research assistant at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. Her research interests include computer vision and deep learning.

Federico Raue is a Senior Researcher at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. He received his Ph.D. at TU Kaiserslautern in 2018 and his M.Sc. in Artificial Intelligence from Katholieke Universiteit Leuven in 2005. His research interests include meta-learning and multimodal machine learning.

Stanislav Frolov is a Ph.D. student at the TU Kaiserslautern and a research assistant at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. He received the M.Sc. degree in electrical engineering from the Karlsruhe Institute of Technology in 2017. His research interests include generative models and deep learning.

Sebastian Palacio is a researcher in machine learning and head of the multimedia analysis and data mining group at the German Research Center for Artificial Intelligence (DFKI). His Ph.D. topic was about explainable AI with applications in computer vision. Other research interests include adversarial attacks, multi-task, curriculum, and self-supervised learning.

Andreas Dengel is a Professor at the Department of Computer Science at TU Kaiserslautern and Executive Director of the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern, Head of the Smart Data and Knowledge Services research area at DFKI and of the DFKI Deep Learning Competence Center. His research focuses on machine learning, pattern recognition, quantified learning, data mining, semantic technologies, and document analysis.
