Diffusion Models, Image Super-Resolution and Everything: A Survey
first.second@dfki.de
arXiv:2401.00736v3 [cs.CV] 23 Jun 2024
Abstract—Diffusion Models (DMs) have disrupted the image Super-Resolution (SR) field and further closed the gap between image quality and human perceptual preferences. They are easy to train and can produce very high-quality samples that exceed the realism of those produced by previous generative methods. Despite their promising results, they also come with new challenges that need further research: high computational demands, comparability, lack of explainability, color shifts, and more. Unfortunately, entry into this field is overwhelming because of the abundance of publications. To address this, we provide a unified recount of the theoretical foundations underlying DMs applied to image SR and offer a detailed analysis that underscores the unique characteristics and methodologies within this domain, distinct from broader existing reviews in the field. This survey articulates a cohesive understanding of DM principles and explores current research avenues, including alternative input domains, conditioning techniques, guidance mechanisms, corruption spaces, and zero-shot learning approaches. By offering a detailed examination of the evolution and current trends in image SR through the lens of DMs, this survey sheds light on the existing challenges and charts potential future directions, aiming to inspire further innovation in this rapidly advancing area.

Index Terms—Super-Resolution, Diffusion Models, Survey.

I. INTRODUCTION

various aspects [12]–[14]. Their capability to generate high-quality images from LR inputs has shown immense promise in SR by closely aligning with the qualitative judgments of human evaluators [15]. In other words, human raters perceive SR images generated by DMs as more realistic than those produced by other generative models like GANs.

However, as the volume of publications expands, staying updated on the latest developments is becoming more challenging, particularly for those new to the field. DMs diverge fundamentally from prior generative models and pose new challenges while addressing the limitations of earlier models. Identifying coherent trends and potential research directions is challenging despite this rapid expansion. This survey aims to demystify DMs, offers a comprehensive overview that bridges foundational concepts with the forefront of image SR and critically analyzes current strengths and weaknesses.

The presented survey builds upon the previous work Hitchhiker’s Guide to Super-Resolution [16], which gives a broad overview of the image SR field in general. Similar in spirit is the survey of Li et al., which reviews diffusion models on the more general image restoration tasks like inpainting and dehazing [17]. Both have overlapping topics, such as the foundations and types of DMs, namely DDPMs [8], SGMs
Section II - Super-Resolution Basics: This section provides fundamental definitions and introduces standard datasets, methods, and metrics for assessing image quality commonly utilized in image SR publications.
Section III - Diffusion Models Basics: Introduces the principles and various formulations of DMs, including Denoising Diffusion Probabilistic Models (DDPMs), Score-based Generative Models (SGMs), and Stochastic Differential Equations (SDEs). This section also explores how DMs relate to other generative models.
Section IV - Improvements for Diffusion Models: Common practices for enhancing DMs, focusing on efficient sampling techniques and improved likelihood estimation.
Section V - Diffusion Models for Image SR: Presents concrete realizations of DMs in SR, explores alternative domains (latent space and wavelet domain), discusses architectural designs and multiple tasks with Null-Space Models, and examines alternative corruption spaces.
Section VII - Domain-Specific Applications: DM-based SR applications, namely medical imaging, blind face restoration, atmospheric turbulence in face SR, and remote sensing.
Section VIII - Discussion and Future Work: Common problems of DMs for image SR and noteworthy research avenues for DMs specific to image SR.
Section IX - Conclusion: Summarizes the survey.

II. IMAGE SUPER-RESOLUTION

The goal of image Super-Resolution (SR) is to transform one or more Low-Resolution (LR) images into High-Resolution (HR) images. The domain can be broadly categorized into two areas [16]: Single Image Super-Resolution (SISR) and Multi-Image Super-Resolution (MISR). In SISR, a single LR image leads to a single HR image. In contrast, MISR methods use multiple LR images to produce one or many HR outputs. This section focuses on SISR and explores relevant datasets, established SR models, and techniques to assess image quality.

Given an LR image $x \in \mathbb{R}^{\bar{w} \times \bar{h} \times c}$, the goal is to generate an HR counterpart $y \in \mathbb{R}^{w \times h \times c}$ with $\bar{w} < w$ and $\bar{h} < h$. The relationship is represented by a degradation mapping

$$x = \mathcal{D}(y; \Theta) = ((y \otimes k) \downarrow_s + n)_{\mathrm{JPEG}_q} \tag{1}$$

where $\mathcal{D} : \mathbb{R}^{w \times h \times c} \to \mathbb{R}^{\bar{w} \times \bar{h} \times c}$ is a degradation map and Θ contains degradation parameters, including aspects like blur k, noise n, scaling s, and compression quality q [18]. The degradation is typically unknown, posing the main challenge in determining the inverse mapping of $\mathcal{D}$ with parameters θ, usually embodied as an SR model. It leads to an optimization task aimed at minimizing the difference between the predicted SR image ŷ and the original HR image y:

$$\theta^* = \operatorname{argmin}_{\theta} \, \mathcal{L}(\hat{y}, y) + \lambda \phi(\theta), \tag{2}$$

where $\mathcal{L}$ represents the loss between the predicted SR image and the actual HR image. Here, λ is a balancing parameter, while ϕ(θ) is introduced as a regularization term.

The inherent complexity arises from the ill-posed nature of predicting θ, as several SR images can be valid for any given LR image, i.e., they can have similar loss values compared to the ground-truth image but are subjectively perceived differently due to many aspects like brightness and coloring [1], [15], [19]. Traditional regression techniques, like standard CNNs, are often adequate for lower magnifications but struggle to replicate high-frequency details required at higher magnifications (e.g., s > 4). To address this, SR models must hallucinate realistic details beyond interpolation, which typically falls under the umbrella topic of generative models, where diffusion models are now at the forefront.

A. Datasets

Several datasets offer a variety of images, resolutions, and content types. Typically, these datasets consist of LR and HR image pairs. However, some datasets contain only HR images, with LR images created by bicubic downsampling with anti-aliasing, a default setting for imresize in MATLAB [20]. One well-known general SR training set is the Diverse 2K resolution (DIV2K) dataset [21], which includes various realistic images at different resolutions designed specifically for image SR. Classical test datasets for SR models trained on DIV2K are Set5 [22], Set14 [23], BSDS100 [24], Urban100 [25], and Manga109 [26], which cover a variety of scenes and image contents like buildings and manga paintings. Flickr2K [27] and Flickr-Faces-HQ (FFHQ) [28] offer diverse sets of scene-centric and human-centric images from Flickr, respectively. While FFHQ is commonly employed for training models for face SR tasks, Flickr2K is usually used as a training data extension in combination with DIV2K. Another dataset for face SR is CelebA-HQ [29], which provides high-quality celebrity images and is typically used to evaluate FFHQ-trained SR models. For broader applications in CV, datasets like ImageNet [30] and Visual Object Classes (VOC2012) [31] are favored. ImageNet offers an extensive range of images that help train models on various object classes, whereas VOC2012 is vital for object detection and segmentation. Both are valuable for multi-task learning involving SR. More datasets can be found in the Hitchhiker’s Guide to Super-Resolution [16].

B. SR Models

The primary objective is to design an SR model $M : \mathbb{R}^{\bar{w} \times \bar{h} \times c} \to \mathbb{R}^{w \times h \times c}$, such that it inverts Equation 1:

$$\hat{y} = M(x; \theta), \tag{3}$$

where ŷ is the predicted HR approximation of the LR image x and θ the parameters of M. The parameters θ are optimized using Equation 2, i.e., minimizing the loss function $\mathcal{L}$ between the estimation ŷ and the ground-truth HR image y. The following section focuses on standard methods for designing an SR model, especially deep learning methods, before we examine how diffusion models fulfill this role in detail.

Traditional Methods: Traditional methods for image SR define a range of methodologies, such as statistical [32], edge-based [33], [34], patch-based [35], [36], prediction-based [37], [38], and sparse representation techniques [39]. They fundamentally rely on image statistics and the information inherent in existing pixels to generate HR images. Despite
their utility, a noteworthy drawback of these methods is the potential introduction of noise, blur, and visual artifacts [16].

Regression-based Deep Learning: Image SR significantly evolved with advancements in deep learning and computational power. Typically, these methods employ a Convolutional Neural Network (CNN) for end-to-end mapping from LR to HR. Initial models, such as SRCNN [40], FSRCNN [41], and ESPCN [42], utilized simple CNNs of diverse depth and feature map sizes. Later models adapted concepts from the broader CV domain into SR models, e.g., ResNet led to SRResNet, where residual information was propagated to successive network layers [43]. Likewise, DenseNet [44] was adapted with SRDenseNet [45], which employs dense blocks, where each layer additionally receives the features generated in all preceding layers. Recursive CNNs that recursively use the same module to learn representations were also inspired by other CV methods for regression-based SR methods in DRCN [46], DRRN [47], and CARN [48]. More recently, attention mechanisms have been incorporated to focus on regions of interest in images, predominantly via the channel and spatial attention mechanisms [16], [49]–[51]. All those methods have in common that they are regression-based. Commonly used loss functions are the L1 and L2 losses. As mentioned, they often produce satisfying results for lower magnifications but struggle to replicate the high-frequency details required at higher magnifications (e.g., s > 4). These limitations arise because these models primarily learn an averaged mapping (due to L1 and L2 losses) from LR to HR images, which tends to produce overly smooth textures lacking detail, especially noticeable in larger upscaling factors [16]. To address this, SR models must hallucinate realistic details beyond simple interpolation, a challenge typically tackled by generative models.

Generative Adversarial Networks (GANs): One of the most prominent generative models is the Generative Adversarial Network (GAN). It uses two CNNs: a generator G and a discriminator D, which are trained simultaneously. The generator aims to produce HR samples that are close enough to the original to fool the discriminator, which tries to distinguish between generated and real samples. This framework, e.g., in SRGAN [52] or ESRGAN [53], is optimized using a combination of adversarial loss and content loss to produce less-smoothed images. The resultant images of state-of-the-art GANs are sharper and more detailed. Due to their capability to generate high-quality and diverse images, they have received much attention lately. However, they are susceptible to mode collapse, have a sizeable computational footprint, sometimes fail to converge, and suffer from stabilization issues [7].

Flow-based Methods: Flow-based methods employ optical flow algorithms to generate SR images [54]. They were introduced in an attempt to counter the ill-posed nature of image SR by learning the conditional distribution of plausible HR images given an LR input. They introduce a conditional normalizing flow architecture that aligns LR and HR images by calculating the displacement field between them and then uses this information to recover SR images. They employ a fully invertible encoder capable of mapping any input HR image to the latent flow space and ensuring exact reconstruction. This framework enables the SR model to learn rich distributions using exact log-likelihood-based training [54]. This allows flow-based methods to circumvent training instability but incurs a substantial computational cost.

C. Image Quality Assessment (IQA)

Image quality is a multifaceted concept that addresses properties like sharpness, contrast, and absence of noise. Hence, a fair evaluation of SR models based on produced image quality forms a non-trivial task. This section presents the essential methods, especially for diffusion models, to assess image quality in the context of image SR, which fall under the umbrella term Image Quality Assessment (IQA)¹. At its core, IQA refers to any metric that resembles the perceptual evaluations from human observers, specifically, the level of realism perceived in an image after the application of SR techniques. During this section, we will use the following notation: $N_x = w \cdot h \cdot c$, which defines the number of pixels of an image $x \in \mathbb{R}^{w \times h \times c}$, and $\Omega_x = \{(i, j, k) \in \mathbb{N}_1^3 \mid i \le h, j \le w, k \le c\}$, which defines the set of all valid positions in x.

¹ More SR-related IQA methods can be found in Moser et al. [16].

Peak Signal-to-Noise Ratio (PSNR): The Peak Signal-to-Noise Ratio (PSNR) is one of the most widely used techniques to evaluate SISR reconstruction quality. It represents the logarithmic ratio between the squared maximum pixel value L and the Mean Squared Error (MSE) between the SR image ŷ and the HR image y.

$$\mathrm{PSNR}(y, \hat{y}) = 10 \cdot \log_{10}\left(\frac{L^2}{\frac{1}{N}\sum_{i=1}^{N}[y_i - \hat{y}_i]^2}\right) \tag{4}$$

Despite being one of the most popular IQA methods, it does not accurately match human perception [15]. It focuses on pixel differences, which can often be inconsistent with the subjectively perceived quality: the slightest shift in pixels can result in worse PSNR values while not affecting human perceptual quality. Due to its pixel-level calculation, models trained with correlated pixel-based loss tend to achieve high PSNR values [16], whereas generative models tend to produce lower PSNR values [15].

Structural Similarity Index (SSIM): The SSIM, like the PSNR, is a popular evaluation method that focuses on the differences in structural features between images. It independently captures the structural similarity by comparing luminance, contrast, and structures. For an image y, SSIM estimates the luminance $\mu_y$ as the mean of the intensity and the contrast $\sigma_y$ as its standard deviation:

$$\mu_y = \frac{1}{N_y}\sum_{p \in \Omega_y} y_p, \tag{5}$$

$$\sigma_y = \sqrt{\frac{1}{N_y - 1}\sum_{p \in \Omega_y}[y_p - \mu_y]^2} \tag{6}$$

To capture the similarity between the computed entities, the authors introduced a comparison function S:

$$S(x, y, c) = \frac{2 \cdot x \cdot y + c}{x^2 + y^2 + c}, \tag{7}$$

where x and y are the scalar variables being compared, and $c = (k \cdot L)^2$, $0 < k \ll 1$ is a constant for numerical stability.
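As a rough illustration of Equation 4 and the comparison function S from Equation 7, the following NumPy sketch may help. The function names `psnr` and `similarity`, the default `max_val=255.0` (an assumed 8-bit intensity range for L), and the toy ramp image are assumptions made for this example and are not taken from any reference implementation.

```python
import numpy as np


def psnr(y, y_hat, max_val=255.0):
    """PSNR as in Equation 4; max_val plays the role of L (assumed 8-bit range)."""
    y = np.asarray(y, dtype=np.float64)
    y_hat = np.asarray(y_hat, dtype=np.float64)
    mse = np.mean((y - y_hat) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images have no error
    return 10.0 * np.log10(max_val ** 2 / mse)


def similarity(x, y, c):
    """Comparison function S from Equation 7 for two scalars x and y."""
    return (2.0 * x * y + c) / (x ** 2 + y ** 2 + c)


# Toy example: a horizontal intensity ramp and a copy shifted by one pixel.
hr = np.tile(np.arange(0, 256, 16, dtype=np.float64), (16, 1))
shifted = np.roll(hr, 1, axis=1)

print(psnr(hr, hr))       # identical images: infinite PSNR
print(psnr(hr, shifted))  # a one-pixel shift already lowers PSNR noticeably
```

The shifted copy contains exactly the same content as the original, yet its PSNR drops sharply, which mirrors the remark above that small pixel shifts can penalize PSNR without changing perceived quality.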
For an HR image y and its approximation ŷ, the luminance ($C_l$) and contrast ($C_c$) comparisons are computed using $C_l(y, \hat{y}) = S(\mu_y, \mu_{\hat{y}}, c_1)$ and $C_c(y, \hat{y}) = S(\sigma_y, \sigma_{\hat{y}}, c_2)$, where $c_1, c_2 > 0$. The empirical covariance

$$\sigma_{y,\hat{y}} = \frac{1}{N_y - 1}\sum_{p \in \Omega_y}(y_p - \mu_y) \cdot (\hat{y}_p - \mu_{\hat{y}}), \tag{8}$$

defines the structure comparison ($C_s$), which is the correlation coefficient between y and ŷ:

$$C_s(y, \hat{y}) = \frac{\sigma_{y,\hat{y}} + c_3}{\sigma_y \cdot \sigma_{\hat{y}} + c_3}, \tag{9}$$

where $c_3 > 0$. Finally, the SSIM is defined as:

$$\mathrm{SSIM}(y, \hat{y}) = [C_l(y, \hat{y})]^{\alpha} \cdot [C_c(y, \hat{y})]^{\beta} \cdot [C_s(y, \hat{y})]^{\gamma} \tag{10}$$

where α > 0, β > 0, and γ > 0 are parameters that can be adjusted to tune the relative importance of the components.

Mean Opinion Score (MOS): The MOS is a subjective measure that leverages human perceptual quality for the evaluation of the generated SR images. Human viewers are shown SR images and asked to rate them with quality scores that are then mapped to numerical values and later averaged. Typically, these range from 1 (bad) to 5 (good) but may vary [15]. While this method is a direct evaluation of human perception, it is more time-consuming and cumbersome to conduct compared to objective metrics. Moreover, due to the highly subjective nature of this metric, it is susceptible to bias.

Consistency: Consistency measures the degree of stability of non-deterministic SR methods, such as generative models like GANs or DMs. Like flow-based methods, generative approaches are intentionally designed to generate a spectrum of plausible outputs for the same input. However, low consistency is not desirable. A relatively consistent method is less sensitive to minor variations in the input. Nevertheless, the desired degree of consistency can vary depending on the requirements. One commonly employed metric to quantify consistency is the Mean Squared Error.

Learned Perceptual Image Patch Similarity (LPIPS): Contrary to the pixel-based evaluation of PSNR and SSIM, the Learned Perceptual Image Patch Similarity (LPIPS) utilizes a pre-trained CNN φ, e.g., VGG [55] or AlexNet [56], generates L feature maps from the SR and HR image, and subsequently calculates the similarity between them. Given $h_l$ and $w_l$ as the height and width of the l-th feature map, respectively, and a scaling vector $\alpha_l \in \mathbb{R}^{C_l}$, the LPIPS metric is formulated as follows:

$$\mathrm{LPIPS}(y, \hat{y}) = \sum_{l=1}^{L}\sum_{p}\frac{\left\|\alpha_l \odot \left(\varphi^l(\hat{y}) - \varphi^l(y)\right)_p\right\|_2^2}{h_l \cdot w_l} \tag{11}$$

LPIPS operates by projecting images into a perceptual feature space through φ and evaluating the difference between corresponding patches in SR and HR images, scaled by $\alpha_l$. This methodology allows for a more human-centric evaluation, given that it is better aligned with human perception than traditional metrics such as PSNR and SSIM [16].

No-Reference Metrics: All IQA metrics discussed so far require a reference (ground-truth) image. However, there are cases where no reference images are available, e.g., in unsupervised settings. Fortunately, we can assess an image by measuring the distance of statistical features from those obtained from a collection of high-quality images of a similar domain, i.e., natural images. This can be opinion- and distortion-aware like BRISQUE [57] or opinion- and distortion-unaware like NIQE [58]. Another intriguing way to assess no-reference image quality is to exploit the visual-language pre-trained CLIP model [59]. One example is CLIP-IQA, which calculates the cosine similarity of the encoded image with two prompts of opposing meaning, i.e., "good photo" and "bad photo" [60]. The resulting relative similarity metric for one or the other prompt determines the image quality. CLIP-IQA shows results comparable to those of BRISQUE without the hand-crafted features and surpasses other no-reference IQA methods like NIQE. Another way to exploit deep learning models is to train them to predict subjective scores using IQA datasets like TID2013 [61]. Examples are DeepQA [62], NIMA [63], or MUSIQ [64]. Others can be found in the learning-based perceptual quality section of the Hitchhiker’s Guide to Super-Resolution [16].

III. DIFFUSION MODELS BASICS

Diffusion Models (DMs) have profoundly impacted the realm of generative AI, and many approaches that fall under the umbrella term DM have emerged. What sets DMs apart from earlier generative models is their execution over iterative time steps, both forward and backward in time and denoted by t, as depicted in Figure 1. The forward and backward diffusion processes are distinguished by:
Forward q - degrade input data using noise iteratively, forward in time (i.e., t increases).
Backward p - denoise the degraded data, thereby reversing the noise iteratively, backward in time (i.e., t decreases).

The time step t increases during forward diffusion, whereas it propagates towards 0 during backward diffusion. Let $D = \{x_i, y_i\}_{i=1}^{N}$ be a dataset of LR-HR image pairs. For each time step t, the random variable $z_t$ describes the current state, a state between the image and corruption space. In the literature, there is no clear distinction between $z_t$ in the forward and $z_t$ in the backward diffusion. During forward diffusion, we assume $z_t \sim q(z_t \mid z_{t-1})$. Conversely, in the backward diffusion, we assume $z_{t-1} \sim p(z_{t-1} \mid z_t)$. We will denote T with 0 < t ≤ T as the maximal time step for finite cases. The initial data distribution (t = 0) is represented by $z_0 \sim q(x)$, which is then slowly injected with noise (additive). Vice versa, DMs remove noise therein by running a parameterized model $p_\theta(z_{t-1} \mid z_t)$ in the reverse time direction that approximates the ideal (but unattainable) denoised distribution $p(z_{t-1} \mid z_t)$.

The explicit implementation of the forward diffusion q and backward diffusion p, approximated by $p_\theta$, is defined by the specific DM in use. There are three types: two discrete forms, namely Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), and the continuous form by Stochastic Differential Equations (SDEs) [65]. Each of these types will be discussed next and is shown in Figure 1.
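To make the forward process q more tangible before the individual formulations are introduced, the sketch below uses the Gaussian kernel of the DDPM formulation (Equations 12 to 14 in Section III-A) on a toy vector. The linear schedule for the per-step variances, the helper names, and the array sizes are illustrative assumptions and are not prescribed by the survey. The sketch tracks the mean coefficient and variance obtained by composing the per-step kernel of Equation 12, so they can be compared with the closed form of Equation 13.

```python
import numpy as np


def gamma_schedule(alphas):
    """gamma_t = prod_{i<=t} (1 - alpha_i), as used in Equation 13."""
    return np.cumprod(1.0 - alphas)


def composed_moments(alphas):
    """Mean coefficient and variance of q(z_t | z_0) obtained by iterating Equation 12."""
    coef, var = 1.0, 0.0
    coefs, variances = [], []
    for a in alphas:
        coef = np.sqrt(1.0 - a) * coef  # the mean is rescaled at every step
        var = (1.0 - a) * var + a       # old variance shrinks, fresh noise adds alpha_t
        coefs.append(coef)
        variances.append(var)
    return np.array(coefs), np.array(variances)


def sample_zt(z0, t, gammas, rng):
    """Draw z_t directly from z_0 in a single step (Equation 14); t is 1-based."""
    eps = rng.standard_normal(z0.shape)
    g = gammas[t - 1]
    return np.sqrt(g) * z0 + np.sqrt(1.0 - g) * eps


alphas = np.linspace(1e-4, 0.02, 1000)  # assumed linear noise schedule, T = 1000
gammas = gamma_schedule(alphas)
coefs, variances = composed_moments(alphas)

rng = np.random.default_rng(0)
z0 = np.ones(8)                           # stand-in for a flattened HR image
z_mid = sample_zt(z0, 500, gammas, rng)   # partially corrupted state
z_end = sample_zt(z0, 1000, gammas, rng)  # close to pure Gaussian noise
```

Under this assumed schedule, `coefs` should agree with the square root of `gammas` and `variances` with `1 - gammas`, which is the content of Equation 13; at t = T almost none of the original signal remains.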
Fig. 1: Principle of DMs. The forward diffusion adds noise iteratively (red), which translates an image from the image space to the corruption space. The backward diffusion, the iterative refinement process, reverts the process (blue) back to the image space. Shown are three different implementations of DMs, namely Denoising Diffusion Probabilistic Models (DDPMs), Score-based Generative Models (SGMs), and Stochastic Differential Equations (SDEs) with their respective formulations of the forward and backward diffusion.

A. Denoising Diffusion Probabilistic Models (DDPMs)

Denoising Diffusion Probabilistic Models (DDPMs) [8] use two Markov chains to enact the forward and backward diffusion across a finite number of discrete time steps.

Forward Diffusion: It transforms the data distribution into a prior distribution, typically designed manually (e.g., Gaussian), given by:

$$q(z_t \mid z_{t-1}) = \mathcal{N}(z_t \mid \sqrt{1 - \alpha_t}\, z_{t-1}, \alpha_t I), \tag{12}$$

where the hyper-parameters $0 < \alpha_{1:T} < 1$ represent the variance of noise incorporated at each time step. While the Gaussian kernel is commonly adopted, alternative kernel types can also be employed. This formulation can be condensed to a single-step calculation, as shown by:

$$q(z_t \mid z_0) = \mathcal{N}(z_t \mid \sqrt{\gamma_t}\, z_0, (1 - \gamma_t) I), \tag{13}$$

where $\gamma_t = \prod_{i=1}^{t}(1 - \alpha_i)$ [66]. Consequently, $z_t$ can be directly sampled, regardless of what happens in previous time steps, by

$$z_t = \sqrt{\gamma_t} \cdot z_0 + \sqrt{1 - \gamma_t} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \tag{14}$$

Backward Diffusion: The goal is to directly learn the inverse of the forward diffusion and generate a distribution that resembles the distribution of the original data $z_0$, usually the HR image in SR. In practice, we use a CNN to learn a parameterized form of p. Since the forward process approximates $q(z_T) \approx \mathcal{N}(0, I)$, the formulation of the learnable transition kernel becomes:

$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}(z_{t-1} \mid \mu_\theta(z_t, \gamma_t), \Sigma_\theta(z_t, \gamma_t)), \tag{15}$$

where $\mu_\theta$ and $\Sigma_\theta$ are learnable. Similarly, the conditional formulation $p_\theta(z_{t-1} \mid z_t, x)$ conditioned on x (e.g., an LR image) uses $\mu_\theta(z_t, x, \gamma_t)$ and $\Sigma_\theta(z_t, x, \gamma_t)$ instead.

Optimization: To guide the backward diffusion in learning the forward process, we minimize the Kullback-Leibler (KL) divergence of the joint distribution of the forward and reverse sequences

$$p_\theta(z_0, ..., z_T) = p(z_T)\prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t), \text{ and} \tag{16}$$

$$q(z_0, ..., z_T) = q(z_0)\prod_{t=1}^{T} q(z_t \mid z_{t-1}), \tag{17}$$

$$\begin{aligned}
&\mathrm{KL}(q(z_0, ..., z_T) \,\|\, p_\theta(z_0, ..., z_T)) \\
&= -\mathbb{E}_{q(z_0, ..., z_T)}\left[\log p_\theta(z_0, ..., z_T)\right] + c \\
&\overset{(i)}{=} \mathbb{E}_{q(z_0, ..., z_T)}\left[-\log p(z_T) - \sum_{t=1}^{T}\log\frac{p_\theta(z_{t-1} \mid z_t)}{q(z_t \mid z_{t-1})}\right] + c \\
&\overset{(ii)}{\geq} \mathbb{E}\left[-\log p_\theta(z_0)\right] + c,
\end{aligned} \tag{18}$$

where (i) is possible because both terms are products of distributions and (ii) follows from Jensen’s inequality. The constant c is unaffected and, therefore, irrelevant in optimizing θ. Note that Equation 18 without c is the negative Variational Lower Bound (VLB) of the log-likelihood of the data $z_0$; minimizing it corresponds to maximizing the VLB, as is commonly done by DDPMs.

B. Score-based Generative Models (SGMs)

Score-based Generative Models (SGMs), much like DDPMs, utilize discrete diffusion processes but employ an alternative mathematical foundation. Instead of using the probability density function p(z) directly, Song et al. [11] propose to work with its (Stein) score function, which is defined as the gradient of the log probability density $\nabla_z \log p(z)$. Mathematically, the score function preserves all information about the density function, but computationally, it is easier to work with. Furthermore, the decoupling of model training from the sampling procedure grants greater flexibility in defining sampling methods and training objectives.

Forward Diffusion: Let $0 < \sigma_1 < ... < \sigma_T$ be a finite sequence of noise levels. Like DDPMs, the forward diffusion, typically assigned to a Gaussian noise distribution, is

$$q(z_t \mid z_0) = \mathcal{N}(z_t \mid z_0, \sigma_t^2 I). \tag{19}$$

This equation results in a sequence of noisy data densities $q(z_1), ..., q(z_T)$ with $q(z_t) = \int q(z_t \mid z_0)\, q(z_0)\, dz_0$. Consequently, the intermediate step $z_t = z_0 + \sigma_t \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ can be sampled in a single step, independently of previous time steps.

Backward Diffusion: To revert the noise during the backward diffusion, we need to approximate $\nabla_{z_t} \log q(z_t)$ and choose a method for estimating the intermediate states $z_t$ from
that approximation. For the gradient approximation at each time step t, we use a trained predictor, denoted as $s_\theta$ and called Noise-Conditional Score Network (NCSN), such that $s_\theta(z_t, t) \approx \nabla_{z_t} \log q(z_t)$ [11].

The training of the NCSN will be covered in the next section; for now, we focus on the sampling process using NCSN. Sampling with NCSN involves generating the intermediate states $z_t$ through an iterative approach, using $s_\theta(z_t, t)$. Note that this iterative process is different from the iterations done during the diffusion as it addresses solely the generation of $z_t$. This is a key difference to DDPMs as $z_t$ needs to be sampled iteratively, whereas DDPMs directly predict $z_t$ from $z_{t+1}$.

There are various ways to perform this iterative generation, but we will concentrate on a specific method known as Annealed Langevin Dynamics (ALD), introduced by Song et al. [10]. Let N be the number of estimation iterations for $z_t$ at time step t and $\alpha_t > 0$ the corresponding step size, which determines how much the estimation moves from one estimate $z_{t-1}^{(i)}$ towards $z_{t-1}^{(i+1)}$. The initial state is $z_T^{(N)} \sim \mathcal{N}(0, I)$. For each 0 < t ≤ T, we initialize $z_{t-1}^{(0)} = z_t^{(N)} \approx z_t$, which is the latest estimation of the previous intermediate state. In order to get $z_{t-1}^{(N)} \approx z_{t-1}$ iteratively, ALD uses the following update rules for i = 0, ..., N − 1:

$$\epsilon^{(i)} \leftarrow \mathcal{N}(0, I) \tag{20}$$

$$z_{t-1}^{(i+1)} \leftarrow z_{t-1}^{(i)} + \frac{1}{2}\alpha_{t-1}\, s_\theta(z_{t-1}^{(i)}, t - 1) + \sqrt{\alpha_{t-1}}\, \epsilon^{(i)} \tag{21}$$

This update rule guarantees that $z_0^{(N)}$ converges to $q(z_0)$ for $\alpha_t \to 0$ and $N \to \infty$ [67].

Similar to DDPMs, we can turn SGMs into conditional SGMs by integrating the condition x, e.g., an LR image, into $s_\theta(z_t, x, t) \approx \nabla_{z_t} \log q(z_t \mid x)$.

Optimization: Without specifically formulating the backward diffusion, we can train an NCSN such that $s_\theta(z_t, t) \approx \nabla_{z_t} \log q(z_t)$. Estimating the score can be done by using the denoising score matching method [68]:

$$\begin{aligned}
&\mathbb{E}_{t \sim \mathcal{U}(1,T),\, z_0 \sim q(z_0),\, z_t \sim q(z_t \mid z_0)}\left[\lambda(t)\,\sigma_t^2\,\|\nabla_{z_t}\log q(z_t) - s_\theta(z_t, t)\|^2\right] \\
&\overset{(i)}{=} \mathbb{E}_{t \sim \mathcal{U}(1,T),\, z_0 \sim q(z_0),\, z_t \sim q(z_t \mid z_0)}\left[\lambda(t)\,\sigma_t^2\,\|\nabla_{z_t}\log q(z_t \mid z_0) - s_\theta(z_t, t)\|^2\right] + c \\
&\overset{(ii)}{=} \mathbb{E}_{t \sim \mathcal{U}(1,T),\, z_0 \sim q(z_0),\, z_t \sim q(z_t \mid z_0)}\left[\lambda(t)\,\left\|-\frac{z_t - z_0}{\sigma_t} - \sigma_t\, s_\theta(z_t, t)\right\|^2\right] + c \\
&\overset{(iii)}{=} \mathbb{E}_{t \sim \mathcal{U}(1,T),\, z_0 \sim q(z_0),\, \epsilon \sim \mathcal{N}(0, I)}\left[\lambda(t)\,\|\epsilon + \sigma_t\, s_\theta(z_t, t)\|^2\right] + c
\end{aligned} \tag{22}$$

where λ(t) > 0 is a weighting function, $\sigma_t$ the noise level added at time step t, (i) derived by Vincent et al. [68], (ii) from Equation 19, (iii) from $z_t = z_0 + \sigma_t \epsilon$, and with c again a constant unaffected in the optimization of θ. Note that there are other ways to estimate the score, e.g., based on score matching [69] or sliced score matching [70].

C. Stochastic Differential Equations (SDEs)

So far, we have discussed DMs that deal with finite time steps. A generalization to infinite continuous time steps is made by formulating these as solutions to Stochastic Differential Equations (SDEs), also known as Score SDEs [10]. In fact, we can view SGMs and DDPMs as discretizations of a continuous-time SDE. SDEs are not exclusive to DMs, as they are a mathematical concept describing stochastic processes. As such, they fit perfectly to describe the processes we want to simulate in DMs. As before, data is perturbed in a general diffusion process, but generalized to an infinite number of noise scales.

Forward Diffusion: We can represent the forward diffusion by the following SDE:

$$dz = f(z, t)\,dt + g(t)\,dw, \tag{23}$$

where f and g are the drift and diffusion functions, respectively, and w is the standard Wiener process (also known as Brownian motion). This generalized formulation allows uniform representation of both DDPMs and SGMs. The SDE for DDPMs is given by:

$$dz = -\frac{1}{2}\alpha(t)\, z\, dt + \sqrt{\alpha(t)}\, dw, \tag{24}$$

with $\alpha(\frac{t}{T}) = T\alpha_t$ for $T \to \infty$. For SGMs, the SDE is

$$dz = \sqrt{\frac{d[\sigma(t)^2]}{dt}}\, dw, \tag{25}$$

with $\sigma(\frac{t}{T}) = \sigma_t$ for $T \to \infty$. From now on, we denote with $q_t(z)$ the distribution of $z_t$ in the diffusion process.

Backward Diffusion: The reverse-time SDE is formulated by Anderson et al. [71] as:

$$dz = \left[f(z, t) - g(t)^2\, \nabla_z \log q_t(z)\right] dt + g(t)\, d\tilde{w}, \tag{26}$$

where $\tilde{w}$ is the standard Wiener process when time flows backwards and dt an infinitesimal negative time step. Solutions to Equation 26 can be viewed as diffusion processes that gradually convert noise to data. The existence of a corresponding probability flow Ordinary Differential Equation (ODE), whose trajectories possess the same marginals as the reverse-time SDE, was proven by Song et al. [11] and is

$$dz = \left[f(z, t) - \frac{1}{2} g(t)^2\, \nabla_z \log q_t(z)\right] dt. \tag{27}$$

Thus, the reverse-time SDE and the probability flow ODE enable sampling from the same data distribution.

Optimization: Similar to the approach in SGMs, we define a score model such that $s_\theta(z_t, t) \approx \nabla_z \log q_t(z)$. Additionally, we extend Equation 22 to continuous time as follows:

$$\mathbb{E}_{t \sim \mathcal{U}(0,T),\, z_0 \sim q(z_0),\, z_t \sim q(z_t \mid z_0)}\left[\lambda(t)\,\|s_\theta(z_t, t) - \nabla_{z_t}\log q_t(z_t \mid z_0)\|^2\right], \tag{28}$$

where λ(t) > 0 is a weighting function.
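A minimal sketch of how the reverse-time SDE in Equation 26 can turn noise into data, using an Euler-Maruyama discretization of the SGM case (Equation 25, where f = 0). Several things here are assumptions made purely for illustration: the data is a one-dimensional Gaussian N(m, s²), so the score of q_t is available in closed form and no score network s_θ is trained; the noise schedule is σ(t) = σ_max · t on t ∈ [0, 1]; and the function name and default values are hypothetical.

```python
import numpy as np


def reverse_ve_sde_sample(m, s, sigma_max=5.0, n_steps=1000, n_samples=20000, seed=0):
    """Euler-Maruyama integration of Equation 26 from t = 1 down to t = 0.

    Assumes toy data z_0 ~ N(m, s^2) and sigma(t) = sigma_max * t, so that
    q_t = N(m, s^2 + sigma(t)^2) and its score is known analytically.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    # Start from the prior N(0, sigma_max^2), which only approximates q_1.
    z = sigma_max * rng.standard_normal(n_samples)
    for k in range(n_steps, 0, -1):
        t = k * dt
        g2 = 2.0 * sigma_max ** 2 * t                        # g(t)^2 = d[sigma(t)^2]/dt
        score = -(z - m) / (s ** 2 + (sigma_max * t) ** 2)   # grad_z log q_t(z)
        noise = rng.standard_normal(n_samples)
        # Moving backwards in time flips the sign of the drift term in Equation 26.
        z = z + g2 * score * dt + np.sqrt(g2 * dt) * noise
    return z


samples = reverse_ve_sde_sample(m=2.0, s=0.5)
print(samples.mean(), samples.std())  # should be close to m and s
```

With a learned score model, the closed-form `score` line would be replaced by an evaluation of s_θ(z, t). Dropping the noise term and halving the score term gives the corresponding step for the probability flow ODE of Equation 27.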
C. State Domains
So far, we have discussed methods that operate directly on the pixel space. This section introduces different methods that map the input into alternative state domains: latent, frequency, and residual space. Apart from particular challenges arising from the alternative state domain, these methods incur an additional step that maps the pixel domain into their own, as illustrated in Figure 4.

[Figure 4 panels: Pixel-based Diffusion (Corruption Space, Clean image); Latent Space Diffusion (Corruption Space, Latent Representation, Clean image); Frequency-based Diffusion (Corruption Space, Wavelet Representation, Clean image)]
Fig. 4: Overview of state domains. The green bar shows the vanilla DM operating in pixel space. The blue bar shows the use of the latent space domain via autoencoders. The red bar shows the application of DMs in the wavelet domain.

Latent Space: Models like SR3 [15] and SRDiff [79] have achieved high-quality SR results by operating in the pixel domain. However, these models are computationally intensive due to their iterative nature and the high-dimensional calculations in RGB space. To reduce computational demands, one can move the diffusion process into the latent space of an autoencoder [105]. The first of this kind was the Latent Score-based Generative Model (LSGM) by Vahdat et al. [106]. It is a regular SGM that operates in the latent space of a VAE and, by pre-training the VAE, achieves even faster sampling speeds. It yields comparable or better results than DMs operating in the pixel domain while being faster. Building upon LSGMs, Rombach et al. introduced the Latent Diffusion Model (LDM) [13], [107], which also performs diffusion in a low-dimensional latent space of an autoencoder. In contrast to LSGM, LDM utilizes a DDPM and an autoencoder that is pre-trained, like the VQ-GAN [5], and is not jointly trained with the denoising network. This approach significantly lowers resource requirements without compromising performance. Due to the decoupled training, it requires very little regularization of the latent space and allows the reuse of latent representations across multiple models. Improving upon LDMs is REFUSION (image REstoration with difFUSION models) [108] by Luo et al., which differs in two aspects: First, it uses a U-Net that contains skip connections from the encoder to the decoder, which provides the decoder with additional details. Moreover, it introduces Nonlinear Activation-Free blocks (NAFBlocks) [109], replacing all non-linear activations with an element-wise operation that splits feature channels into two parts and multiplies them to produce one output. Secondly, they train their U-Net with a latent-replacing training strategy, which partially replaces the latent representation with either the encoded LR or HR image for reconstruction training. Similarly, Chen et al. [110] improve the architectural aspects of LDMs and propose a two-stage strategy called the Hierarchical Integration Diffusion Model (HI-Diff). In the first stage, an encoder compresses the ground truth image to a highly compact latent space representation, which has a much higher compression ratio than LDM. As a result, the computational burden of the DM, which refines multi-scale latent representations, is substantially reduced. The second stage is a vision transformer-based autoencoder, which incorporates the latent representations of the first stage during the downsampling process via Hierarchical Integration Modules (HIM), a cross-attention fusion module.

Frequency Space: Wavelets provide a novel outlook on SR [16], [111]. The conversion from the spatial to the wavelet domain is lossless and offers significant advantages as the spatial size of an image can be downsized by a factor of four, thereby allowing faster diffusion during the training and inference stages. Moreover, the conversion segregates high-frequency details into distinct channels, facilitating a more concentrated and intentional focus on high-frequency information, offering a higher degree of control [112]. Besides, it can be conveniently incorporated into existing DMs as a plug-in feature. The diffusion process can interact directly with all wavelet bands as proposed in DiWa [113] or specifically target certain bands while the remaining bands are predicted via standard CNNs. For instance, WaveDM [114] modifies the low-frequency band, whereas WSGM [115] or ResDiff [116] conditions the high-frequency bands relative to the low-resolution image. Altogether, the wavelet domain presents a promising avenue for future research. It provides potential for significant performance acceleration while maintaining, if not enhancing, the quality of SR results.

Residual Space: SRDiff [79] was the first work that advocated for shifting the generation process into the residual space, i.e., the difference between the upsampled LR and the HR image. This enables the DM to focus on residual details, speeds up convergence, and stabilizes the training [16], [111]. Whang et al. [117] also employ residual predictions as a fundamental component of their predict-and-refine approach for image deblurring. However, unlike SRDiff, they provide an SR prediction with a CNN instead of the bilinear upsampled LR and predict the residuals between the SR prediction and the HR ground truth with their DM. An improvement is presented by ResDiff [116], which additionally incorporates the SR prediction and its high-frequency information during the backward diffusion for better guidance. In a different vein, Yue et al. [118] present ResShift. This technique constructs a Markov chain of transformations between HR and LR images by manipulating the residual between them. Thus, instead of just adding Gaussian noise with zero mean in the forward process, the residual is also added as the mean of the noise sampling during training. This novel approach substantially improves sampling efficiency, requiring only 15 sampling steps.
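The frequency and residual domains both wrap the diffusion model in a simple invertible mapping. The following is a minimal sketch of these two mappings, not the implementation of any cited method. It assumes a single-channel image with even height and width, a one-level orthonormal Haar transform as the wavelet, and nearest-neighbor upsampling as a stand-in for the bilinear or bicubic upsampling used in practice. All function names are hypothetical.

```python
import numpy as np


def haar_dwt2(img):
    """One-level 2D Haar transform: (H, W) -> (4, H/2, W/2).

    Band 0 is the low-frequency approximation; bands 1-3 hold the
    high-frequency details (band naming conventions vary by library).
    """
    a = img[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return np.stack([ll, lh, hl, hh])


def haar_idwt2(bands):
    """Exact inverse of haar_dwt2: (4, H/2, W/2) -> (H, W)."""
    ll, lh, hl, hh = bands
    out = np.empty((2 * ll.shape[0], 2 * ll.shape[1]), dtype=ll.dtype)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out


def upsample_nearest(lr, scale):
    """Assumed stand-in for the interpolation used to upsample the LR image."""
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)


def to_residual(hr, lr, scale):
    """Residual-space target: HR image minus the upsampled LR image."""
    return hr - upsample_nearest(lr, scale)


def from_residual(residual, lr, scale):
    """Map a generated residual back to an image in pixel space."""
    return upsample_nearest(lr, scale) + residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hr = rng.random((8, 8))
    bands = haar_dwt2(hr)
    print(bands.shape)  # four bands, each with a quarter of the pixels
    print(np.allclose(haar_idwt2(bands), hr))  # the transform is lossless
```

In this sketch a DM would be trained on `bands` or on the output of `to_residual` instead of on `hr`, and the inverse mapping would be applied once after sampling.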
Methods        PSNR ↑   SSIM ↑   LPIPS ↓
BSRGAN [134]   23.41    0.61     0.426
FeMaSR [137]   21.86    0.54     0.410

[Figure panel labels: Standard Backward Diffusion; Image-to-Image Schrödinger]

TABLE IV: Comparison of zero-shot methods. Data in bold represents the best performance. Second-best is underlined. Values derived from Li et al. [17].

Fig. 8: Overview of DDNM [151]. It utilizes the range-null space decomposition to construct a general solution for multiple tasks, such as image SR, colorization, inpainting, and deblurring.

C. Posterior Estimation
Most projection-based methods typically address the noiseless inverse problem. However, this assumption can weaken data consistency because the projection process can deviate the sample path from the data manifold [17]. To address this and enhance data consistency, some recent works [150], [159], [160] take a different approach by aiming to estimate the posterior distribution using the Bayes theorem:
$$p(z_t \mid x) = \frac{p(x \mid z_t) \cdot p(z_t)}{p(x)}, \tag{44}$$
This Bayesian approach provides a more robust and probabilistic framework for solving inverse problems, ultimately improving results in various image processing tasks. It results in the corresponding score function:
$$\nabla_{z_t} \log p_t(z_t \mid x) = \nabla_{z_t} \log p_t(x \mid z_t) + s_\theta(x, t), \tag{45}$$
where $s_\theta(x, t)$ is extracted from a pre-trained model while $p_t(x \mid z_t)$ is intractable. Thus, the goal is precisely estimating $p_t(x \mid z_t)$. MCG [159] and DPS [150] approximate the posterior $p_t(x \mid z_t)$ with $p_t(x \mid \hat{z}_0(z_t))$, where $\hat{z}_0(z_t)$ is the expectation given $z_t$ as $\hat{z}_0(z_t) = \mathbb{E}[z_0 \mid z_t]$ according to Tweedie’s formula [150]. While MCG also relies on projection, which can be harmful to data consistency, DPS discards the projection step and estimates the posterior as:
$$\begin{aligned} \nabla_{z_t} \log p_t(x \mid z_t) &\approx \nabla_{z_t} \log p(x \mid \hat{z}_0(z_t)) \\ &\approx -\frac{1}{\sigma^2} \nabla_{z_t} \lVert x - H(\hat{z}_0(z_t)) \rVert_2^2, \end{aligned} \tag{46}$$
where $H$ is a forward measurement operator. A further expansion of this formula to the unified form for the linear, non-linear, differentiable inverse problem with the Moore-Penrose pseudoinverse can be found in ΠGDM [160].
A different approach to estimate $p_t(x \mid z_t)$ is demonstrated by GDP [152]. The authors noted that a higher conditional probability of $p_t(x \mid z_t)$ correlates with a smaller distance between the application of the degradation model $\mathcal{D}(z_t)$ and $x$. Thus, they propose a heuristic approximation:
$$p_t(x \mid z_t) \approx \frac{1}{Z} \exp\left(-\left[s \mathcal{L}(\mathcal{D}(z_t), x)\right]\right) + \lambda Q(z_t), \tag{47}$$
where $\mathcal{L}$ and $Q$ denote a distance and quality metric, respectively. The term $Z$ is for normalization, and $s$ is a scaling factor controlling the guidance weight. However, due to varying noise levels between $z_t$ and $x$, precisely defining the distance metric $\mathcal{L}$ can be challenging. To overcome this

VII. DOMAIN-SPECIFIC APPLICATIONS
SR3 [15] produces photo-realistic and perceptually state-of-the-art images on faces and natural images but may not be suitable for other tasks like remote sensing. Some models are more suited to certain tasks as they tackle issues specific to the domain [161]. This section highlights the applications of DMs to domain-specific SR tasks: medical imaging, special cases of face SR (Blind Face Restoration and Atmospheric Turbulence), and remote sensing.

A. Medical Imaging
Magnetic Resonance Imaging (MRI) scans are widely used to aid patient diagnosis but can often be of low quality and corrupted with noise. Chung et al. [162] propose a combined denoising and SR network referred to as R2D2+ (Regularized Reverse Diffusion Denoiser + SR). They perform denoising of the MRI scans, followed by an SR module. Inspired by CCDF (i.e., a zero-shot method) from Chung et al. [156], they start their backward diffusion from an initial noisy image instead of pure Gaussian noise. The reverse SDE is solved using a non-parametric, eigenvalue-based method. In addition, they restrict the stochasticity of the DMs through low-frequency regularization. Particularly, they maintain low-frequency information while correcting the high-frequency ones to produce sharp and super-resolved MRI scans. Mao et al. [163] address the lack of diffusion-based multi-contrast MRI SR methods. They propose a Disentangled Conditional Diffusion model (DisC-Diff) to leverage a multi-conditional fusion strategy based on representation disentanglement, enabling high-quality HR image sampling. Specifically, they employ a disentangled U-Net with multiple encoders to extract latent representations and use a novel joint disentanglement and Charbonnier loss function to learn representations across MRI contrasts. They also implement curriculum learning and improve their MRI model for varying anatomical complexity by gradually increasing the difficulty of training images. An improvement of DisC-Diff by combining the DM with a transformer was introduced by Li et al. with DiffMSR [164].

B. Blind Face Restoration
Most previously discussed SR methods are founded on a fixed degradation process during training, such as bicubic downsampling. However, when applied practically, these assumptions frequently diverge from the actual degradation process and yield subpar results. Additionally, datasets with pairs of clean and real-world distorted images are usually unavailable. This issue is particularly researched in face SR, termed Blind Face Restoration (BFR), where datasets typically contain supervised samples $(x, y)$ with unknown degradation. A solution to BFR was proposed by Yue et al. with DifFace [165] that leverages the rich generative priors of pre-trained DMs with parameters $\theta$, which were trained to approximate
$p_\theta(z_t \mid z_{t-1})$. In contrast to existing methods that learn direct mappings from $x$ to $y$ under several constraints [166], [167], DifFace circumvents this by generating a diffused version $z_N$ of the desired HR image $y$ with $N < T$. They predict the starting point, the posterior $q(z_N \mid x)$, via a transition distribution $p(z_N \mid x)$. The transition distribution is formulated like the regular diffusion process, a Gaussian distribution, but uses an initial predictor $\varphi(x)$ to generate the mean, named diffused estimator. As their model borrows the reverse Markov chain from a pre-trained DM, DifFace requires no full retraining for new and unknown degradations, unlike SR3.
A concurrent and better performing approach is DiffBFR [168], which adopts a two-step approach to BFR: an Identity Restoration Module (IRM), which employs two conditional DDPMs, and a Texture Enhancement Module (TEM), which employs an unconditional DDPM. In the first step within the IRM, a conditional DDPM enriches facial details at a low resolution matching that of $x$. The downsampled version of $y$ gives the target objective. Next, it resizes the output to the desired spatial size of $y$ and applies another conditional DDPM to approximate the HR image $y$. To ensure minimal deviation from the actual image, DiffBFR employs a novel truncated sampling method, which begins denoising at intermediate steps. The TEM further enhances realism through image texture and sharpened facial details. It imposes a diffusion-based facial prior with an unconditional DM trained on HR images and a backward diffusion starting from pure noise. However, it has more parameters than SR3 and requires optimization to accelerate sampling.
Another method is DR2E [169], which employs two stages: degradation removal and enhancement modules. For degradation removal, they use a pre-trained face SR DDPM to remove degradations from an LR image with severe and unknown degradations. In particular, they diffuse the degraded image $x$ in $T$ time steps to obtain $x_T = z_T$. Then, they use $x_t$ to guide the backward diffusion such that the low-frequency part of $z_t$ is replaced with that of $x_t$, which is close in distribution. Theoretically, it produces visually clean intermediate results that are degradation-invariant. In the second stage, the enhancement module $p_\theta(y \mid z_0)$, an arbitrary backbone CNN trained to map LR images to HR using a simple L2 loss, predicts the final output. DR2E can be slower than existing diffusion-based SR models for images with slight degradations and can even remove details from the input.

C. Atmospheric Turbulence in Face SR
Atmospheric Turbulence (AT) results from fluctuations in atmospheric conditions, leading to perceptual degradation of images through geometric distortions, spatially variant blur, and noise. These alterations negatively impact downstream vision tasks, such as tracking or detection. Wang et al. [170] introduced a variational inference framework known as AT-VarDiff, which aims to correct AT in generic scenes. The distinctive feature of this approach is its reliance on a conditioning signal derived from latent task-specific prior information extracted from the input image to guide the DM. Nair et al. [171] put forth another technique to restore facial images impaired by AT using SR. The method transfers class prior information from an SR model trained on clean facial data to a model designed to counteract turbulence degradation via knowledge distillation. The final model operates within the realistic faces manifold, which allows it to generate realistic face outputs even under substantial distortions. During inference, the process begins with noise- and turbulence-degraded images to ensure that the restored images closely resemble the distorted ones.

D. Remote Sensing
Remote Sensing Super-Resolution (RSSR) addresses the HR reconstruction from one or more LR images to aid object detection and semantic segmentation tasks for satellite imagery. RSSR is limited by the absence of small targets with complex granularity in the HR images [172]. To produce finer details and texture, Liu et al. [173] present DMs with a Detail Complement mechanism (DMDC). They train their model similar to SR3 [15] and perform a detailed supplement task. To generate high-frequency information, they randomly mask several parts of the images to mimic dense objects. The SR images recover the occluded patches as the model learns small-grained information. Additionally, they introduce a novel pixel constraint loss to limit the diversity of DMDC and improve overall accuracy. Ali et al. [174] design a new architecture for RS images that integrates Vision Transformers (ViT) with DMs as a Two-stage approach for Enhancement and Super-Resolution (TESR). In the first stage (SR stage), the SwinIR [49] model is used for RSSR. In the second stage (enhancement stage), the noisy images are enhanced by employing DMs to reconstruct the finer details. Xu et al. [175] propose a blind SR framework based on Dual conditioning DDPMs for SR (DDSR). A kernel predictor conditioned on LR image encodings estimates the degradation kernel in the first stage. This is followed by an SR module consisting of a conditional DDPM in a U-Net with the predicted kernel and the LR encodings as guidance. An RRDB encoder extracts the encodings from LR images. Recently, Khanna et al. introduced DiffusionSat [176], which uses an LDM for RSSR and incorporates additional remote sensing conditioning information (e.g., longitude, latitude, cloud cover, etc.).

VIII. DISCUSSION AND FUTURE WORK
Though relatively new, DMs are quickly becoming a promising research area, especially in image SR. There are several avenues of ongoing research in this field, aiming to enhance the efficiency of DMs, accelerate computation speeds, and minimize memory footprint, all while generating high-quality, high-fidelity images. This section introduces common problems of DMs for image SR and examines noteworthy research avenues for DMs specific to image SR.

A. Color Shifting
Often, the most practical advancements come from a solid theoretical understanding. As discussed in subsection V-F, due to the substantial computational demands, DMs may occasionally exhibit color shifts when constrained by hardware limitations that demand smaller batch sizes or shorter training periods [142]. While well-defined diffusion methods [143] or color normalization [107] might mitigate this problem, a theoretical understanding of why it is emerging is necessary.

added during the forward diffusion process. The adaptability and efficiency demonstrated by novel approaches like InDI or I2SB, especially in handling diverse and complex corruption patterns, spotlight the urgent need for future research.
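To make the low-frequency replacement in the degradation-removal stage of DR2E (Section VII-B) more concrete, the following is a minimal sketch of a single guidance step. The text does not specify the filter, so the block-average low-pass filter used here is an assumption, as are the function names and the restriction to single-channel arrays whose sides are divisible by `factor`. It shows only the replacement rule, not the surrounding sampler.

```python
import numpy as np


def low_pass(img, factor):
    """Assumed low-pass filter: average each factor x factor block, then
    repeat the block mean back to full size. It is linear and idempotent."""
    h, w = img.shape
    blocks = img.reshape(h // factor, factor, w // factor, factor)
    means = blocks.mean(axis=(1, 3))
    return np.repeat(np.repeat(means, factor, axis=0), factor, axis=1)


def replace_low_freq(z_t, x_t, factor):
    """Keep the high-frequency part of the current sample z_t and take
    the low-frequency part from the diffused degraded image x_t."""
    return z_t - low_pass(z_t, factor) + low_pass(x_t, factor)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_t = rng.standard_normal((8, 8))  # current state of the backward diffusion
    x_t = rng.standard_normal((8, 8))  # degraded input diffused to the same step
    guided = replace_low_freq(z_t, x_t, factor=4)
    # The guided sample shares its low frequencies with x_t ...
    print(np.allclose(low_pass(guided, 4), low_pass(x_t, 4)))
    # ... and its high frequencies with z_t.
    print(np.allclose(guided - low_pass(guided, 4), z_t - low_pass(z_t, 4)))
```

A larger `factor` in this sketch hands less of the signal to the degraded input, which loosely mirrors the trade-off between fidelity to the input and removal of its degradations.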
with SR3 models generating 1024 × 1024 unconditional faces and 256 × 256 class-conditional natural images. The cascading approach allows several simpler models to be trained simultaneously, improving computational efficiency due to faster training times and reduced parameter counts. Furthermore, they implemented cascading for inference, using more refinement steps at lower and fewer steps at higher resolutions. They found this more efficient than generating SR images directly. Even though their approach underperforms compared to BigGAN [102] concerning cascaded generation, it still represents an exciting research opportunity.

IX. CONCLUSION
Diffusion Models (DMs) revolutionized image Super-Resolution (SR) by enhancing both technical image quality and human perceptual preferences. While traditional SR often focuses solely on pixel-level accuracy, DMs can generate HR images that are aesthetically pleasing and realistic. Unlike previous generative models, they do not suffer typical convergence issues. This survey explored the progress and diverse methods that have propelled DMs to the forefront of SR. Potential use cases, as discussed in our applications section, extend far beyond what was previously imagined. We introduced their foundational principles and compared them to other generative models. We explored conditioning strategies, from LR image guidance to text embeddings. Zero-shot SR, a particularly intriguing paradigm, was also a subject, as well as corruption spaces and image SR-specific topics like color shifting and architectural designs. In conclusion, the survey provides a comprehensive guide to the current landscape and valuable insights into trends, challenges, and future directions. As we continue to explore and refine these models, the future of image SR looks more promising than ever.

ACKNOWLEDGMENT
This work was supported by the BMBF project XAINES (Grant 01IW20005) and SustainML (Horizon Europe grant agreement No 101070408).

REFERENCES
[1] W. Sun and Z. Chen, “Learned image downscaling for upscaling using content adaptive resampler,” IEEE TIP, vol. 29, 2020.
[2] D. Valsesia and E. Magli, “Permutation invariance and uncertainty in multitemporal image super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, 2021.
[3] S. M. A. Bashir, Y. Wang, M. Khan, and Y. Niu, “A comprehensive review of deep learning-based single image super-resolution,” PeerJ Computer Science, vol. 7, 2021.
[4] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in CVPR, 2022.
[5] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, 2021.
[6] B. Guo, X. Zhang, H. Wu, Y. Wang, Y. Zhang, and Y.-F. Wang, “Lar-sr: A local autoregressive model for image super-resolution,” in CVPR, 2022.
[7] S. Frolov, T. Hinz, F. Raue, J. Hees, and A. Dengel, “Adversarial text-to-image synthesis: A review,” Neural Networks, vol. 144, 2021.
[8] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, 2020.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, 2020.
[10] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” NeurIPS, vol. 32, 2019.
[11] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv:2011.13456, 2020.
[12] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” NeurIPS, vol. 34, 2021.
[13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
[14] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv:2204.06125, 2022.
[15] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE TPAMI, vol. 45, no. 4, 2023.
[16] B. B. Moser, F. Raue, S. Frolov, S. Palacio, J. Hees, and A. Dengel, “Hitchhiker’s guide to super-resolution: Introduction and recent advances,” IEEE TPAMI, 2023.
[17] X. Li, Y. Ren, X. Jin, C. Lan, X. Wang, W. Zeng, X. Wang, and Z. Chen, “Diffusion models for image restoration and enhancement–a comprehensive survey,” arXiv:2308.09388, 2023.
[18] A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong, “Blind image super-resolution: A survey and beyond,” IEEE TPAMI, vol. 45, no. 5, 2022.
[19] S. Anwar and N. Barnes, “Densely residual laplacian super-resolution,” IEEE TPAMI, 2020.
[20] MATLAB, The Mathworks, Inc., Natick, Massachusetts, 2017.
[21] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in CVPRW, July 2017.
[22] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” 2012.
[23] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International conference on curves and surfaces. Springer, 2010.
[24] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV, vol. 2. IEEE, 2001.
[25] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in CVPR, 2015.
[26] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, vol. 76, no. 20, 2017.
[27] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in CVPRW, 2017.
[28] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019.
[29] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv:1710.10196, 2017.
[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
[31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL voc2012 Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
[32] K. I. Kim and Y. Kwon, “Single-image super-resolution using sparse regression and natural image prior,” IEEE TPAMI, vol. 32, no. 6, 2010.
[33] G. Freedman and R. Fattal, “Image and video upscaling from local self-examples,” ACM Trans. Graph., vol. 30, no. 2, apr 2011.
[34] J. Sun, Z. Xu, and H.-Y. Shum, “Image super-resolution using gradient profile prior,” in CVPR, 2008.
[35] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., vol. 1. IEEE, 2004.
[36] W. Freeman, T. Jones, and E. Pasztor, “Example-based super-resolution,” IEEE Computer Graphics and Applications, vol. 22, no. 2, 2002.
[37] R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 6, 1981.
[38] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical Models and Image Processing, vol. 53, no. 3, 1991.
[39] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE TIP, vol. 19, no. 11, 2010.
[40] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE TPAMI, vol. 38, no. 2, 2015.
[41] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV. Springer, 2016.
[42] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016.
[43] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
[44] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
[45] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in ICCV, 2017.
[46] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in CVPR, 2016.
[47] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in CVPR, 2017.
[48] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in ECCV, 2018.
[49] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in CVPR, 2021.
[50] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong, “Activating more pixels in image super-resolution transformer,” in CVPR, 2023.
[51] C.-C. Hsu, C.-M. Lee, and Y.-S. Chou, “Drct: Saving image super-resolution away from information bottleneck,” arXiv:2404.00722, 2024.
[52] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
[53] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” 2018.
[54] A. Lugmayr, M. Danelljan, L. Van Gool, and R. Timofte, “Srflow: Learning the super-resolution space with normalizing flow,” in ECCV. Springer, 2020.
[55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[56] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, 2017.
[57] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE TIP, vol. 21, no. 12, 2012.
[58] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal processing letters, vol. 20, no. 3, 2012.
[59] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021.
[60] J. Wang, K. C. Chan, and C. C. Loy, “Exploring clip for assessing the look and feel of images,” in AAAI, vol. 37, no. 2, 2023.
[61] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti et al., “Image database tid2013: Peculiarities, results and perspectives,” Signal processing: Image communication, vol. 30, 2015.
[62] J. Kim and S. Lee, “Deep learning of human visual sensitivity in image quality assessment framework,” in CVPR, 2017.
[63] H. Talebi and P. Milanfar, “Nima: Neural image assessment,” IEEE TIP, vol. 27, no. 8, 2018.
[64] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “Musiq: Multi-scale image quality transformer,” in ICCV, 2021.
[65] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, 2023.
[66] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML. PMLR, 2015.
[67] G. Parisi, “Correlation functions and computer simulations,” Nuclear Physics B, vol. 180, no. 3, 1981.
[68] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, 2011.
[69] A. Hyvärinen and P. Dayan, “Estimation of non-normalized statistical models by score matching.” Journal of Machine Learning Research, vol. 6, no. 4, 2005.
[70] Y. Song, S. Garg, J. Shi, and S. Ermon, “Sliced score matching: A scalable approach to density and score estimation,” in Uncertainty in Artificial Intelligence. PMLR, 2020.
[71] B. D. Anderson, “Reverse-time diffusion equation models,” Stochastic Processes and their Applications, vol. 12, no. 3, 1982.
[72] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” NeurIPS, vol. 27, 2014.
[73] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv:1312.6114, 2013.
[74] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in ICML. PMLR, 2015.
[75] Q. Zhang and Y. Chen, “Diffusion normalizing flow,” in NeurIPS, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021.
[76] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, “Normalizing flows for probabilistic modeling and inference,” The Journal of Machine Learning Research, vol. 22, no. 1, 2021.
[77] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” NeurIPS, vol. 35, 2022.
[78] “Oxford vggface implementation using keras functional framework v2+,” https://github.com/rcmalli/keras-vggface.
[79] H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Srdiff: Single image super-resolution with diffusion probabilistic models,” Neurocomputing, vol. 479, 2022.
[80] D. Watson, J. Ho, M. Norouzi, and W. Chan, “Learning to efficiently sample from diffusion probabilistic models,” arXiv:2106.03802, 2021.
[81] D. Watson, W. Chan, J. Ho, and M. Norouzi, “Learning fast samplers for diffusion models by differentiating through sample quality,” arXiv:2202.05830, 2022.
[82] Z. Lyu, X. Xu, C. Yang, D. Lin, and B. Dai, “Accelerating diffusion models via early stop of the diffusion process,” arXiv:2205.12524, 2022.
[83] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in CVPR, 2023.
[84] C. Meng, R. Rombach, R. Gao, D. Kingma, S. Ermon, J. Ho, and T. Salimans, “On distillation of guided diffusion models,” in CVPR, 2023.
[85] E. Luhman and T. Luhman, “Knowledge distillation in iterative generative models for improved sampling speed,” arXiv:2101.02388, 2021.
[86] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv:2202.00512, 2022.
[87] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion gans,” arXiv:2112.07804, 2021.
[88] R. Xie, Y. Tai, K. Zhang, Z. Zhang, J. Zhou, and J. Yang, “Addsr: Accelerating diffusion-based blind super-resolution with adversarial diffusion distillation,” arXiv:2404.01717, 2024.
[89] M. Noroozi, I. Hadji, B. Martinez, A. Bulat, and G. Tzimiropoulos, “You only need one step: Fast super-resolution with stable diffusion via scale distillation,” arXiv:2401.17258, 2024.
[90] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv:2010.02502, 2020.
[91] A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas, “Gotta go fast when generating data with score-based models,” arXiv:2105.14080, 2021.
[92] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
[93] F. Bao, C. Li, J. Zhu, and B. Zhang, “Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models,” arXiv:2201.06503, 2022.
[94] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models,” arXiv:2211.01095, 2022.
[95] W. Zhao, L. Bai, Y. Rao, J. Zhou, and J. Lu, “Unipc: A unified predictor-corrector framework for fast sampling of diffusion models,” NeurIPS, vol. 36, 2024.
[96] J. Ho, E. Lohn, and P. Abbeel, “Compression with flows via local bits-back coding,” NeurIPS, vol. 32, 2019.
[97] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov, “Good semi-supervised learning that requires a bad gan,” NeurIPS, vol. 30, 2017.
[98] Y. Song, C. Durkan, I. Murray, and S. Ermon, “Maximum likelihood training of score-based diffusion models,” NeurIPS, vol. 34, 2021.
[99] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational diffusion models,” NeurIPS, vol. 34, 2021.
[100] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in ICML. PMLR, 2021.
[101] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” J. Mach. Learn. Res., vol. 23, no. 47, 2022.
[102] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv:1809.11096, 2018.
[103] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv:2207.12598, 2022.
[104] C. Luo, “Understanding diffusion models: A unified perspective,” arXiv:2208.11970, 2022.
[105] J. Kim and T.-K. Kim, “Arbitrary-scale image generation and upsampling using latent diffusion model and implicit neural decoder,” arXiv:2403.10255, 2024.
[106] A. Vahdat, K. Kreis, and J. Kautz, “Score-based generative modeling in latent space,” NeurIPS, vol. 34, 2021.
[107] J. Wang, Z. Yue, S. Zhou, K. C. Chan, and C. C. Loy, “Exploiting diffusion prior for real-world image super-resolution,” arXiv:2305.07015, 2023.
[108] Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön, “Image restoration with mean-reverting stochastic differential equations,” arXiv:2301.11699, 2023.
[109] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in ECCV. Springer, 2022.
[110] Z. Chen, Y. Zhang, D. Liu, B. Xia, J. Gu, L. Kong, and X. Yuan, “Hierarchical integration diffusion model for realistic image deblurring,” arXiv:2305.12966, 2023.
[111] B. B. Moser, S. Frolov, F. Raue, S. Palacio, and A. Dengel, “Dwa: Differential wavelet amplifier for image super-resolution,” in Artificial Neural Networks and Machine Learning – ICANN 2023, L. Iliadis, A. Papaleonidas, P. Angelov, and C. Jayne, Eds. Cham: Springer Nature Switzerland, 2023.
[112] T. Guo, H. Seyed Mousavi, T. Huu Vu, and V. Monga, “Deep wavelet prediction for image super-resolution,” in CVPRW, 2017.
[113] B. Moser, S. Frolov, F. Raue, S. Palacio, and A. Dengel, “Waving goodbye to low-res: A diffusion-wavelet approach for image super-resolution,” 2023.
[114] Y. Huang, J. Huang, J. Liu, Y. Dong, J. Lv, and S. Chen, “Wavedm: Wavelet-based diffusion models for image restoration,” arXiv:2305.13819, 2023.
[115] F. Guth, S. Coste, V. De Bortoli, and S. Mallat, “Wavelet score-based generative modeling,” NeurIPS, vol. 35, 2022.
[116] S. Shang, Z. Shan, G. Liu, and J. Zhang, “Resdiff: Combining cnn and diffusion model for image super-resolution,” arXiv:2303.08714, 2023.
[117] J. Whang, M. Delbracio, H. Talebi, C. Saharia, A. G. Dimakis, and P. Milanfar, “Deblurring via stochastic refinement,” in CVPR, 2022.
[118] Z. Yue, J. Wang, and C. C. Loy, “Resshift: Efficient diffusion model for image super-resolution by residual shifting,” 2023.
[119] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” NeurIPS, vol. 35, 2022.
[120] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “Ilvr: Conditioning method for denoising diffusion probabilistic models,” 2021.
[121] A. Niu, K. Zhang, T. X. Pham, J. Sun, Y. Zhu, I. S. Kweon, and Y. Zhang, “Cdpmsr: Conditional diffusion probabilistic models for single image super-resolution,” 2023.
[122] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPRW, 2017.
[123] S. H. Park, Y. S. Moon, and N. I. Cho, “Flexible style image super-resolution using conditional objective,” IEEE Access, vol. 10, 2022.
[124] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, “Ranksrgan: Generative adversarial networks with ranker for image super-resolution,” in CVPR, 2019.
[125] A. Lugmayr, M. Danelljan, L. V. Gool, and R. Timofte, “Srflow: Learning the super-resolution space with normalizing flow,” in ECCV. Springer, 2020.
[126] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin, “Pulse: Self-supervised photo upsampling via latent space exploration of generative models,” in CVPR, 2020.
[127] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, “Fsrnet: End-to-end learning face super-resolution with facial priors,” in CVPR, 2018.
[128] K. Pandey, A. Mukherjee, P. Rai, and A. Kumar, “Diffusevae: Efficient, controllable and high-fidelity generation from low-dimensional latents,” arXiv:2201.00308, 2022.
[129] C. Bi, X. Luo, S. Shen, M. Zhang, H. Yue, and J. Yang, “Deedsr: Towards real-world image super-resolution via degradation-aware stable diffusion,” arXiv, 2024.
[130] S. Zhou, K. Chan, C. Li, and C. C. Loy, “Towards robust blind face restoration with codebook lookup transformer,” NeurIPS, vol. 35, 2022.
[131] T. Yang, P. Ren, X. Xie, and L. Zhang, “Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization,” arXiv:2308.14469, 2023.
[132] R. Wu, T. Yang, L. Sun, Z. Zhang, S. Li, and L. Zhang, “Seesr: Towards semantics-aware real-world image super-resolution,” arXiv:2311.16518, 2023.
[133] Y. Qu, K. Yuan, K. Zhao, Q. Xie, J. Hao, M. Sun, and C. Zhou, “Xpsr: Cross-modal priors for diffusion-based image super-resolution,” arXiv:2403.05049, 2024.
[134] K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in ICCV, 2021.
[135] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in CVPR, 2021.
[136] J. Liang, H. Zeng, and L. Zhang, “Details or artifacts: A locally discriminative learning approach to realistic image super-resolution,” in CVPR, 2022.
[137] C. Chen, X. Shi, Y. Qin, X. Li, X. Han, T. Yang, and S. Guo, “Real-world blind super-resolution via feature matching with implicit high-resolution priors,” in Proceedings of the 30th ACM International Conference on Multimedia, ser. MM ’22. New York, NY, USA: Association for Computing Machinery, 2022.
[138] G. Daras, M. Delbracio, H. Talebi, A. G. Dimakis, and P. Milanfar, “Soft diffusion: Score matching for general corruptions,” arXiv:2209.05442, 2022.
[139] A. Bansal, E. Borgnia, H.-M. Chu, J. S. Li, H. Kazemi, F. Huang, M. Goldblum, J. Geiping, and T. Goldstein, “Cold diffusion: Inverting arbitrary image transforms without noise,” arXiv:2208.09392, 2022.
[140] G.-H. Liu, A. Vahdat, D.-A. Huang, E. A. Theodorou, W. Nie, and A. Anandkumar, “I²sb: Image-to-image schrödinger bridge,” arXiv:2302.05872, 2023.
[141] M. Delbracio and P. Milanfar, “Inversion by direct iteration: An alternative to denoising diffusion for image restoration,” arXiv:2303.11435, 2023.
[142] J. Choi, J. Lee, C. Shin, S. Kim, H. Kim, and S. Yoon, “Perception prioritized training of diffusion models,” in CVPR, 2022.
[143] B. B. Moser, S. Frolov, F. Raue, S. Palacio, and A. Dengel, “Yoda: You only diffuse areas. an area-masked diffusion approach for image super-resolution,” arXiv:2308.07977, 2023.
[144] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in CVPR, 2021.
[145] B. Xia, Y. Zhang, S. Wang, Y. Wang, X. Wu, Y. Tian, W. Yang, and L. Van Gool, “Diffir: Efficient diffusion model for image restoration,” arXiv:2303.09472, 2023.
[146] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
[147] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020.
[148] B. Kawar, G. Vaksman, and M. Elad, “Snips: Solving noisy inverse problems stochastically,” NeurIPS, vol. 34, 2021.
[149] B. Kawar, M. Elad, S. Ermon, and J. Song, “Denoising diffusion restoration models,” NeurIPS, vol. 35, 2022.
[150] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems,” arXiv:2209.14687, 2022.
[151] Y. Wang, J. Yu, and J. Zhang, “Zero-shot image restoration using denoising diffusion null-space model,” arXiv:2212.00490, 2022.
[152] B. Fei, Z. Lyu, L. Pan, J. Zhang, W. Yang, T. Luo, B. Zhang, and B. Dai, “Generative diffusion prior for unified image restoration and enhancement,” in CVPR, 2023.
[153] A. Shocher, N. Cohen, and M. Irani, “‘Zero-shot’ super-resolution using deep internal learning,” in CVPR, 2018.
[154] R. Li, X. Sheng, W. Li, and J. Zhang, “Omnissr: Zero-shot omnidirectional image super-resolution using stable diffusion model,” arXiv:2404.10312, 2024.
[155] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “Repaint: Inpainting using denoising diffusion probabilistic models,” in CVPR, 2022.
[156] H. Chung, B. Sim, and J. C. Ye, “Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction,” in CVPR, 2022.
[157] J. Schwab, S. Antholzer, and M. Haltmeier, “Deep null space learning for inverse problems: convergence analysis and rates,” Inverse Problems, vol. 35, no. 2, 2019.
[158] Y. Wang, Y. Hu, J. Yu, and J. Zhang, “Gan prior based null-space learning for consistent super-resolution,” in AAAI, vol. 37, no. 3, 2023.
[159] H. Chung, B. Sim, D. Ryu, and J. C. Ye, “Improving diffusion models for inverse problems using manifold constraints,” NeurIPS, vol. 35, 2022.
[160] J. Song, A. Vahdat, M. Mardani, and J. Kautz, “Pseudoinverse-guided diffusion models for inverse problems,” in ICLR, 2022.
[161] J. Lin, Y. Wang, Z. Tao, B. Wang, Q. Zhao, H. Wang, X. Tong, X. Mai, Y. Lin, W. Song et al., “Adaptive multi-modal fusion of spatially variant kernel refinement with diffusion model for blind image super-resolution,” arXiv, 2024.
[162] H. Chung, E. S. Lee, and J. C. Ye, “Mr image denoising and super-resolution using regularized reverse diffusion,” IEEE Transactions on Medical Imaging, vol. 42, no. 4, 2022.
[163] Y. Mao, L. Jiang, X. Chen, and C. Li, “Disc-diff: Disentangled conditional diffusion model for multi-contrast mri super-resolution,” arXiv:2303.13933, 2023.
[164] G. Li, C. Rao, J. Mo, Z. Zhang, W. Xing, and L. Zhao, “Rethinking diffusion model for multi-contrast mri super-resolution,” arXiv:2404.04785, 2024.
[165] Z. Yue and C. C. Loy, “Difface: Blind face restoration with diffused error contraction,” arXiv:2212.06512, 2022.
[166] X. Wang, Y. Li, H. Zhang, and Y. Shan, “Towards real-world blind face restoration with generative facial prior,” in CVPR, 2021.
[167] T. Yang, P. Ren, X. Xie, and L. Zhang, “Gan prior embedded network for blind face restoration in the wild,” in CVPR, 2021.
[168] X. Qiu, C. Han, Z. Zhang, B. Li, T. Guo, and X. Nie, “Diffbfr: Bootstrapping diffusion model towards blind face restoration,” arXiv:2305.04517, 2023.
[169] Z. Wang, Z. Zhang, X. Zhang, H. Zheng, M. Zhou, Y. Zhang, and Y. Wang, “Dr2: Diffusion-based robust degradation remover for blind face restoration,” in CVPR, 2023.
[170] X. Wang, S. López-Tapia, and A. K. Katsaggelos, “Atmospheric turbulence correction via variational deep diffusion,” in 2023 IEEE 6th International Conference on MIPR. IEEE, 2023.
[171] N. G. Nair, K. Mei, and V. M. Patel, “At-ddpm: Restoring faces degraded by atmospheric turbulence using denoising diffusion probabilistic models,” in WACV, 2023.
[172] Y. Xiao, Q. Yuan, K. Jiang, J. He, X. Jin, and L. Zhang, “Ediffsr: An efficient diffusion probabilistic model for remote sensing image super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[173] J. Liu, Z. Yuan, Z. Pan, Y. Fu, L. Liu, and B. Lu, “Diffusion model with detail complement for super-resolution of remote sensing,” Remote Sensing, vol. 14, no. 19, 2022.
[174] A. M. Ali, B. Benjdira, A. Koubaa, W. Boulila, and W. El-Shafai, “Tesr: Two-stage approach for enhancement and super-resolution of remote sensing images,” Remote Sensing, vol. 15, no. 9, 2023.
[175] M. Xu, J. Ma, and Y. Zhu, “Dual-diffusion: Dual conditional denoising diffusion probabilistic models for blind super-resolution reconstruction in rsis,” arXiv:2305.12170, 2023.
[176] S. Khanna, P. Liu, L. Zhou, C. Meng, R. Rombach, M. Burke, D. Lobell, and S. Ermon, “Diffusionsat: A generative foundation model for satellite imagery,” ICLR, 2024.
[177] D. Ganguli, D. Hernandez, L. Lovitt, A. Askell, Y. Bai, A. Chen, T. Conerly, N. Dassarma, D. Drain, N. Elhage et al., “Predictability and surprise in large generative models,” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.
[178] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “Wavegrad: Estimating gradients for waveform generation,” arXiv:2009.00713, 2020.
[179] Z. Cheng, “Sampler scheduler for diffusion models,” arXiv:2311.06845, 2023.
[180] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion gans,” arXiv:2112.07804, 2021.
[181] T. Chen, “On the importance of noise scheduling for diffusion models,” arXiv:2301.10972, 2023.
[182] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.
[183] X. Liu, J. Van De Weijer, and A. D. Bagdanov, “Rankiqa: Learning from rankings for no-reference image quality assessment,” in ICCV, 2017.
[184] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao, “dipiq: Blind image quality assessment by learning-to-rank discriminable image pairs,” IEEE TIP, vol. 26, no. 8, 2017.
[185] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in CVPR, 2019.
[186] B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola, “Subspace diffusion generative models,” in ECCV. Springer, 2022.
[187] M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in ICLR, 2023.
[188] Q. Wu, Y. Liu, H. Zhao, A. Kale, T. Bui, T. Yu, Z. Lin, Y. Zhang, and S. Chang, “Uncovering the disentanglement capability in text-to-image diffusion models,” in CVPR, 2023.
[189] G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion models for robust image manipulation,” in CVPR, 2022.

Brian B. Moser is a Ph.D. student at the TU Kaiserslautern and a research assistant at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. He received the M.Sc. degree in computer science from the TU Kaiserslautern in 2021. His research interests include image super-resolution and deep learning.

Arundhati S. Shanbhag is a Master’s student at the TU Kaiserslautern and research assistant at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. Her research interests include computer vision and deep learning.

Federico Raue is a Senior Researcher at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. He received his Ph.D. at TU Kaiserslautern in 2018 and his M.Sc. in Artificial Intelligence from Katholieke Universiteit Leuven in 2005. His research interests include meta-learning and multimodal machine learning.

Stanislav Frolov is a Ph.D. student at the TU Kaiserslautern and a research assistant at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern. He received the M.Sc. degree in electrical engineering from the Karlsruhe Institute of Technology in 2017. His research interests include generative models and deep learning.

Sebastian Palacio is a researcher in machine learning and head of the multimedia analysis and data mining group at the German Research Center for Artificial Intelligence (DFKI). His Ph.D. topic was about explainable AI with applications in computer vision. Other research interests include adversarial attacks, multi-task, curriculum, and self-supervised learning.

Andreas Dengel is a Professor at the Department of Computer Science at TU Kaiserslautern and Executive Director of the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern, Head of the Smart Data and Knowledge Services research area at DFKI and of the DFKI Deep Learning Competence Center. His research focuses on machine learning, pattern recognition, quantified learning, data mining, semantic technologies, and document analysis.