Universal Guidance For Diffusion Models
Tom Goldstein
University of Maryland
ABSTRACT
[Figure 1 example prompts include "A headshot of a woman with a dog in winter." and "A Portrait of a woman ...".]
Figure 1: Diffusion guided by off-the-shelf networks. Top left: segmentation guidance; top right: face recognition guidance; bottom left: object detection guidance; bottom right: style transfer.
1 INTRODUCTION
Diffusion models are powerful tools for creating digital art and graphics. Much of their success stems
from our ability to carefully control their outputs, customizing results for each user’s individual needs.
Most models today are controlled through conditioning. With conditioning, the diffusion model is
built from the ground up to accept a particular modality of input from the user, be it descriptive text,
segmentation maps, class labels, etc. While conditioning is a powerful tool, it results in models that
are handcuffed to a single conditioning modality. If another modality is needed, a new model has
to be trained, often from scratch. Unfortunately, the high cost of training makes this prohibitive for
most users.
A more flexible approach to controlling model outputs is to use guidance. In this approach, the diffu-
sion model acts as a generic image generator, and is not required to understand a user’s instructions.
The user pairs this model with a guidance function that measures whether some criterion has been
met. For example, one could guide the model to minimize a CLIP-based loss between the generated image and a text description of the user's choice. During each iteration of image creation, the iterates
are nudged down the gradient of the guidance function, causing the final generated image to satisfy
the user’s criterion.
In this paper, we study guidance methods that enable any off-the-shelf model or loss function to
be used as guidance for diffusion. Because guidance functions can be used without re-training or
modification, this form of guidance is universal in that it enables a diffusion model to be adapted for
nearly any purpose.
From a user perspective, guidance is superior to conditioning, as a single diffusion network is treated
like a foundational model that provides universal coverage across many use cases, both commonplace
and bespoke. Unfortunately, it is widely believed that this approach is infeasible. While early
diffusion models relied on classifier guidance (Dhariwal & Nichol, 2021), the community quickly
turned to classifier-free schemes (Ho & Salimans, 2022) that require a model to be trained from
scratch on class labels with a particular frozen ontology that cannot be changed (Nichol et al., 2021;
Rombach et al., 2022; Bansal et al., 2022).
The difficulty of using guidance stems from the domain shift between the noisy images used by the
diffusion sampling process and the clean images on which the guidance models are trained. When this
gap is closed, guidance can be performed successfully. For example, Nichol et al. (2021) successfully
use a CLIP model as guidance, but only after re-training CLIP from scratch using noisy inputs. Noisy
retraining closes the domain gap, but at a very high financial and engineering cost. To avoid the
additional cost, we study methods for closing this gap by changing the sampling scheme, rather than
the model.
To this end, our contributions are summarized as follows:
• We propose an algorithm that enables universal guidance for diffusion models. Our proposed
sampler evaluates the guidance models only on denoised images, rather than noisy latent
states. By doing so, we close the domain gap that has plagued standard guidance methods.
This strategy provides the end-user with the flexibility to work with a wide range of guidance
modalities and even multiple modalities simultaneously. The underlying diffusion model
remains fixed and no fine-tuning of any kind is necessary.
• We demonstrate the effectiveness of our approach for a variety of different constraints such
as classifier labels, human identities, segmentation maps, annotations from object detectors,
and constraints arising from inverse linear problems.
2 BACKGROUND
We first briefly review the recent literature on the core framework behind diffusion models. Then, we
define the problem setting of controlled image generation and discuss previous related works.
Diffusion models are strong generative models that proved powerful even when first introduced
for image generation (Song & Ermon, 2019; Ho et al., 2020). The approach has been successfully
extended to a number of domains, such as audio and text generation (Kong et al., 2020; Huang et al.,
2022; Austin et al., 2021; Li et al., 2022).
We introduce (unconditional) diffusion formally, as it is helpful in describing the nuances of different
types of models. A diffusion model is defined as a combination of a T -step forward process and a
T -step reverse process. Conceptually, the forward process gradually adds Gaussian noise of different
magnitudes to a clean data point z0 , while the reverse process attempts to gradually denoise a noisy
input in hopes of recovering a clean data point. More concretely, given an array of scalars representing noise scales {αt}, t = 1, . . . , T, and an initial, clean data point z0, applying t steps of the forward process to z0 yields a noisy data point
zt = √(αt) z0 + √(1 − αt) ϵ,   ϵ ∼ N(0, I).   (1)
A diffusion model is a learned denoising network ϵθ. It is trained so that for any pair (z0, t) and any sample of ϵ,
ϵθ(zt, t) ≈ ϵ = (zt − √(αt) z0) / √(1 − αt).   (2)
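The forward process and training objective above can be summarized in a few lines of code. The following is a minimal sketch (not the authors' implementation), assuming a PyTorch denoiser eps_theta(z_t, t); all names are illustrative.

```python
import torch

def forward_diffuse(z0, alpha_t):
    """Eq. (1): apply t steps of the forward process in closed form."""
    eps = torch.randn_like(z0)
    zt = alpha_t ** 0.5 * z0 + (1 - alpha_t) ** 0.5 * eps
    return zt, eps

def denoising_loss(eps_theta, z0, alpha_t, t):
    """Eq. (2): train eps_theta(z_t, t) to predict the noise eps that produced z_t."""
    zt, eps = forward_diffuse(z0, alpha_t)
    return torch.mean((eps_theta(zt, t) - eps) ** 2)
```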
The reverse process takes the form q(zt−1|zt, z0), where q(·|·) is generally parameterized as a Gaussian distribution whose exact form varies across methods. Different works have also studied different approximations of the unknown q(zt−1|zt, z0) used to perform sampling. For example, the denoising diffusion implicit model (DDIM) (Song et al., 2021a) first computes a predicted clean data point
ẑ0 = (zt − √(1 − αt) ϵθ(zt, t)) / √(αt),   (3)
and samples zt−1 from q(zt−1|zt, ẑ0) by replacing the unknown z0 with ẑ0. While the details of individual sampling methods vary, all of them produce zt−1 from the current sample zt, the current time step t, and a predicted noise ϵ̂. To ease the notation burden, we define a function S(·, ·, ·) as an abstraction of the sampling method, where zt−1 = S(zt, ϵ̂, t).
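As a concrete illustration of Eq. (3) and of the sampling abstraction S, the sketch below implements the predicted clean data point and a deterministic DDIM update (η = 0). This is a hedged reading of the equations above, not the authors' code.

```python
import torch

def predict_clean(zt, eps_hat, alpha_t):
    """Eq. (3): predicted clean data point z0_hat from z_t and a noise estimate."""
    return (zt - (1 - alpha_t) ** 0.5 * eps_hat) / alpha_t ** 0.5

def sampling_step(zt, eps_hat, alpha_t, alpha_prev):
    """Plays the role of S(z_t, eps_hat, t): one deterministic DDIM update to z_{t-1}."""
    z0_hat = predict_clean(zt, eps_hat, alpha_t)
    return alpha_prev ** 0.5 * z0_hat + (1 - alpha_prev) ** 0.5 * eps_hat
```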
In this paper, we focus on controlled image generation with various constraints. Consider a dif-
ferentiable guidance function f , for example a CLIP feature extractor or a segmentation network.
When applied to an image z, we obtain a vector c = f(z). We also consider a function ℓ(·, ·) that
measures the closeness of two vectors c and c′ . Given a particular choice of c, which we call a prompt,
the corresponding constraint (based on c, ℓ, and f ) is formalized as ℓ(c, f (z)) ≈ 0, and we aim to
generate a sample z from the image distribution satisfying the constraint. In plain words, we want to
generate an in-distribution image that matches the prompt.
Prior work on controlled generative diffusion falls into two main categories. We refer to the first
category as conditional image generation, and the second category as guided image generation. Next,
we discuss the characteristics of each category and better situate our work among existing methods.
Conditional Image Generation. Methods from this category require training new diffusion models
that accept the prompt as an additional input (Ho & Salimans, 2022; Bansal et al., 2022; Nichol
et al., 2021; Whang et al., 2022; Wang et al., 2022a; Li et al., 2023; Zhang & Agrawala, 2023). For
example, Ho & Salimans (2022) proposed classifier-free guidance using class labels as prompts,
and trained a diffusion model by linear interpolation between unconditional and conditional outputs
of the denoising networks. Bansal et al. (2022) studied the case where the guidance function is a
known linear degradation operator, and trained a conditional model to solve linear inverse problems.
Nichol et al. (2021) further extended classifier-free guidance to text-conditional image generation
with descriptive phrases as prompts, and trained a diffusion model to enforce the similarity between
the CLIP (Radford et al., 2021) representations of the generated images and the text prompts. These
methods are successful across different types of constraints; however, the requirement to retrain the diffusion model makes them computationally intensive.
Guided Image Generation. Works in this category employ a frozen pre-trained diffusion model as a foundation model, but modify the sampling method to guide image generation with feedback
from the guidance function. Our method falls into this category. Prior work that studied guided image
generation did so with a variety of restrictions and external guidance functions (Dhariwal & Nichol,
2021; Kawar et al., 2022; Wang et al., 2022b; Chung et al., 2022a; Lugmayr et al., 2022; Chung
et al., 2022b; Graikos et al., 2022). For example, Dhariwal & Nichol (2021) proposed classifier
guidance, where they trained a classifier on images of different noise scales as the guidance function
f , and included gradients of the classifier during the sampling process. However, a classifier for noisy
images is domain-specific and generally not readily available – an issue our method circumvents.
Wang et al. (2022b) assumed the external guidance functions to be linear operators, and generated
the component of images residing in the null space of linear operators with the foundation model.
Unfortunately, extending that method to handle non-linear guidance functions is non-trivial. Chung
et al. (2022a) studied general guidance functions, and modified the sampling process with the gradient
of the guidance function calculated on the expected denoised images. Nevertheless, the authors only
presented results with simpler non-linear guidance functions such as non-linear blurring.
In this work, we study universal guidance algorithms for guided image generation with diffusion
models using any off-the-shelf guidance functions f , such as object detection or segmentation
networks.
3 UNIVERSAL GUIDANCE
We propose a guidance algorithm that augments the image sampling method of a diffusion model
to include guidance from an off-the-shelf auxiliary network. Our algorithm is motivated by an
empirical observation that the reconstructed clean image ẑ0 obtained by Eq. (3), while naturally
imperfect, is still appropriate for a generic guidance function to provide informative feedback to
guide the image generation. In Sec. 3.1, we motivate our forward universal guidance by extending
classifier guidance (Dhariwal & Nichol, 2021) to leverage this observation and handle generic guidance functions. In Sec. 3.2, we propose a supplementary backward universal guidance that helps enforce that the generated image satisfies the constraint based on the guidance function f. In Sec. 3.3, we discuss a
simple yet helpful stepwise refinement trick to empirically improve the fidelity of generated images.
To guide the generation with information from the external guidance function f and the loss function
ℓ, a natural first attempt is to extend classifier guidance (Dhariwal & Nichol, 2021) to accept
any general guidance function. Concretely, given a class prompt c, classifier guidance performs
classification-guided sampling by replacing ϵθ(zt, t) in each sampling step S(zt, ϵ̂, t) with
ϵ̂θ(zt, t) = ϵθ(zt, t) − √(1 − αt) ∇zt log p(c|zt).   (4)
Defining ℓce(·, ·) to be the cross-entropy loss and fcl to be the guidance function that outputs classification probabilities, Eq. (4) can be re-written as
ϵ̂θ(zt, t) = ϵθ(zt, t) + √(1 − αt) ∇zt ℓce(c, fcl(zt)).   (5)
However, directly replacing fcl and ℓce with any off-the-shelf guidance and loss functions does not
work in practice, as f is most likely trained on clean images and fails to provide meaningful guidance
when the input is noisy.
To address the issue, we observe that
p(c|zt) = ∫ p(c|z0, zt) p(z0|zt) dz0 = Ez0∼p(z0|zt)[p(c|z0)],   (6)
where we use that c is conditionally independent of zt given z0. Leveraging the fact that we can obtain a predicted clean image ẑ0 from ϵθ(zt, t) via Eq. (3), we approximate the expectation in Eq. (6) as Ez0∼p(z0|zt)[p(c|z0)] ≈ p(c|ẑ0). This leads to our proposed guided sampling procedure, forward universal guidance (Eq. (7)), which replaces the noisy-image term ℓce(c, fcl(zt)) in Eq. (5) with a guidance strength s(t) times ℓ(c, f(ẑ0)).
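Equation (7) itself is not reproduced in this excerpt; the sketch below shows one way the resulting forward universal guidance can be implemented, combining Eq. (3), Eq. (5), and the approximation above. The denoiser eps_theta, guidance network f, loss loss_fn, and strength s_t are stand-ins supplied by the caller, not the authors' code.

```python
import torch

def forward_universal_guidance(eps_theta, zt, t, alpha_t, f, loss_fn, c, s_t):
    """Evaluate the guidance loss on the predicted clean image z0_hat instead of the noisy z_t,
    then shift the predicted noise along the gradient (cf. Eq. (5) with l(c, f(z0_hat)))."""
    zt = zt.detach().requires_grad_(True)
    eps = eps_theta(zt, t)
    z0_hat = (zt - (1 - alpha_t) ** 0.5 * eps) / alpha_t ** 0.5   # Eq. (3)
    loss = loss_fn(c, f(z0_hat))                                  # l(c, f(z0_hat)) ~ 0 at a good match
    grad = torch.autograd.grad(loss, zt)[0]                       # backpropagates through eps_theta
    return eps.detach() + s_t * (1 - alpha_t) ** 0.5 * grad       # guided noise estimate for S(., ., .)
```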
As will be shown in Sec. 4.2, we observe that forward guidance sometimes over-prioritizes maintaining
the “realness” of the image, resulting in an unsatisfactory match with the given prompt. Simply
increasing the guidance strength s(t) is suboptimal, as this often results in instability as the image
moves off the manifold faster than the denoiser can correct it.
Compared with forward guidance, backward guidance (Eq. (10)) produces an optimized direction
for the generated image to match the given prompt, and hence prioritizes enforcing the constraint.
Furthermore, calculation of a gradient step for Eq. (8) is computationally cheaper than forward
guidance (Eq. (7)), and we can therefore afford to solve Eq. (8) with multiple gradient steps, further
improving the match with the given prompt.
We note that the names “forward” and “backward” are used analogously to the forward and backward
Euler methods.
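Since Eqs. (8)-(10) are not reproduced in this excerpt, the following is only a hedged sketch of backward guidance as described above: optimize a correction Δ to the predicted clean image with several gradient steps, then convert ẑ0 + Δ back into a noise prediction via Eq. (2). The optimizer choice and step count m are illustrative assumptions.

```python
import torch

def backward_universal_guidance(eps, zt, alpha_t, f, loss_fn, c, m=5, lr=0.1):
    """Optimize a clean-image correction delta (multiple cheap steps, no denoiser backprop),
    then re-derive the noise that maps z0_hat + delta to z_t using Eq. (2)."""
    z0_hat = ((zt - (1 - alpha_t) ** 0.5 * eps) / alpha_t ** 0.5).detach()  # Eq. (3)
    delta = torch.zeros_like(z0_hat, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(m):
        opt.zero_grad()
        loss_fn(c, f(z0_hat + delta)).backward()
        opt.step()
    # eps consistent with the corrected clean image: eps = (z_t - sqrt(a_t) z_0) / sqrt(1 - a_t)
    return (zt - alpha_t ** 0.5 * (z0_hat + delta.detach())) / (1 - alpha_t) ** 0.5
```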
4 EXPERIMENTS
In this section, we present results testing our proposed universal guidance algorithm against a
wide variety of guidance functions. Specifically, we experiment with Stable Diffusion (Rombach
et al., 2022), a diffusion model that performs text-conditional generation by accepting a text prompt as additional input, and with a purely unconditional diffusion model trained on ImageNet (Deng et al., 2009), for which we use the pre-trained model provided by OpenAI (Dhariwal & Nichol, 2021). We note that Stable Diffusion, while being a text-conditional generative model, can
also perform unconditional image generation by simply using an empty string as the text prompt.
We first present experiments on Stable Diffusion with different guidance functions in Sec. 4.1, and then present results for the ImageNet diffusion model in Sec. 4.2. Hyper-parameters for the different guidance functions, further ablation studies, and the selection procedure for a suitable guidance strength s(t) and refinement step count k can be found in the appendix.
4.1 EXPERIMENTS WITH STABLE DIFFUSION
In this section, we present the results of guided image generation using Stable Diffusion as the foundation model. The guidance functions we experiment with include a segmentation network, a face recognition network, an object detection network, and style guidance with the CLIP feature extractor (Radford et al., 2021). For experiments on Stable Diffusion, we find that applying forward guidance alone already produces high-quality images that match the given prompt, and hence set m = 0. To perform forward guidance on Stable Diffusion, we forward the predicted clean latent variable computed by Eq. (3) through the image decoder of Stable Diffusion to obtain predicted clean images.
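Concretely, the guidance networks operate on pixel images, so the predicted clean latent is first decoded. A minimal sketch is given below, assuming a diffusers-style StableDiffusionPipeline object (pipe.vae and the usual latent scaling factor); this is our illustration, not the released code.

```python
import torch

def guidance_loss_on_latent(pipe, z0_hat_latent, f, loss_fn, c, scale=0.18215):
    """Decode the predicted clean latent (Eq. (3)) with the Stable Diffusion VAE decoder,
    then evaluate the guidance loss l(c, f(.)) on the resulting image."""
    image = pipe.vae.decode(z0_hat_latent / scale).sample  # roughly in [-1, 1]
    image = (image / 2 + 0.5).clamp(0, 1)                  # map to [0, 1] for typical guidance networks
    return loss_fn(c, f(image))
```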
[Figure prompts shown above the columns include: "Walker hound, Walker foxhound under water.", "Walker hound, Walker foxhound on snow.", "Walker hound, Walker foxhound as an oil painting.", "Headshot of a person with blonde hair with space background.", "Headshot of a woman made of marble.", and "A headshot of a woman looking like Lara Croft."]
Face Recognition Guidance. For guiding generation to include a specific person’s likeness, we
propose combining face detection and facial recognition modules into one guidance function. This
setup produces a facial attribute embedding from an image of a face. We use multi-task cascaded
convolutional networks (MTCNN) (Zhang et al., 2016) for face detection, and we use facenet (Schroff
et al., 2015) for facial recognition. The guidance function f then crops out the detected face and
outputs a facial attribute embedding as a prompt, and we use an ℓ1 loss between embeddings as the loss function ℓ.
Figure 5: In addition to text prompts (above each column), these images are guided by an object detector. Each column contains examples of images generated to match the prompt and the bounding boxes used for guidance.
Figure 6: In addition to text prompts, these images are guided by a style image. Each column contains examples of images generated to match the text prompt and the style used for guidance.
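A possible implementation of this guidance function, using the facenet-pytorch package for both MTCNN and a FaceNet embedder, is sketched below. The paper specifies the components but not a library, so the package choice is an assumption, and the non-differentiable crop here is only illustrative; in practice the crop must be kept differentiable (e.g., by slicing the image tensor at the detected box).

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1  # assumed library choice

mtcnn = MTCNN(image_size=160)                               # face detection + cropping
embedder = InceptionResnetV1(pretrained='vggface2').eval()  # facial attribute embedding

def face_guidance_loss(image, target_embedding):
    """f: detect and crop the face, then embed it; l: l1 distance between embeddings."""
    face = mtcnn(image)                      # cropped face tensor, or None if no face is found
    if face is None:
        return torch.zeros(())               # no usable guidance signal this step
    emb = embedder(face.unsqueeze(0))
    return torch.nn.functional.l1_loss(emb, target_embedding)
```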
We explore different combinations of face guidance and text prompts. Similarly to the segmentation
case, we use the text prompt as a fixed additional conditioning to Stable Diffusion and guide this
text-conditional trajectory with our algorithm so that the face in the generated image looks similar
to the face prompt. In Fig. 4, we clearly see that the facial characteristics of a given face prompt
are reproduced almost perfectly on the generated images. The descriptive text of either background,
material, or style is also realized correctly and blends nicely with the generated faces. We again
summarize our quantitative evaluation in Tab. 1. We evaluate the similarity between facial attributes
of ground-truth identity and the generated faces. In general, two faces are considered to be from the
same person if the similarity is over 0.5, and our algorithm effectively guides the generated face to meet this criterion.
Object Location Guidance. For Stable Diffusion, we also present results of guided image generation with an object detection network. For this experiment, we use Faster R-CNN (Ren et al., 2015) with a ResNet-50-FPN backbone (Li et al., 2021), a publicly available pre-trained model in PyTorch, as our object detector. We use bounding boxes with class labels as our object location prompt. We construct the loss function ℓ as the sum of three individual losses, namely (1) an anchor classification loss, (2) a bounding box regression loss, and (3) a region label classification loss.
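The sketch below shows one way to assemble such a loss with torchvision's pre-trained Faster R-CNN, whose training-mode forward pass returns individual loss terms; the mapping of those terms onto the three losses named above is our assumption, not a statement of the authors' exact implementation.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # torchvision >= 0.13 weights API assumed
detector.train()                                       # training mode returns a dict of losses

def detection_guidance_loss(image, boxes, labels):
    """image: 3xHxW tensor in [0, 1]; boxes: (N, 4) xyxy tensor; labels: (N,) class indices.
    We sum the terms we take to correspond to anchor classification, box regression,
    and region label classification."""
    targets = [{"boxes": boxes, "labels": labels}]
    losses = detector([image], targets)
    return losses["loss_objectness"] + losses["loss_box_reg"] + losses["loss_classifier"]
```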
We again experiment with different combinations of text prompt and object location prompt, and
similarly use the text prompt as a fixed conditioning to Stable Diffusion. Using our proposed guidance
algorithm, we perform guided image generation that places the objects described in the text prompt at the given object locations. The results are presented in Fig. 5. We observe from
Fig. 5 that objects in the descriptive text all appear in the designated location with the appropriate
size indicated by the given bounding boxes. Each location is filled with appropriate, high-quality
generations that align with varied image content prompts, ranging from “beach” to “oil painting”. In
Tab. 1, we use mAP@50 to measure how well the generated images satisfy the constraint.
Style Guidance. Finally, we conclude our experiments on Stable Diffusion by guiding image generation toward a reference style given by a style image. To do so, we capture the reference style by passing the style image through the image feature extractor of CLIP and using the resulting image embedding as the prompt. The loss function calculates the negative cosine similarity between the embedding of the generated image and the embedding of the style image. Similar to previous experiments,
we control the content using text input as additional conditioning to the Stable Diffusion model. We experiment with combinations of different style images and different text prompts, and present the results in Fig. 6. From Fig. 6, we can see that the generated images contain content that matches the given text prompts, while exhibiting a style that matches the given style images.
Figure 7: Generation guided by object detection with the unconditional ImageNet model (left to right: object location prompt, forward guidance only, forward + backward guidance). While combining forward and backward guidance produces realistic images with the desired objects in the designated locations, forward guidance alone produces the wrong objects or the wrong locations/sizes.
Figure 8: Our guidance algorithm can incorporate feedback from multiple guidance functions. Left to right: the inpainting prompt; the classifier-guided inpainting output; images generated with both classifier and segmentation guidance, where realistic dogs are generated exactly on the mask.
4.2 EXPERIMENTS WITH THE IMAGENET DIFFUSION MODEL
In this section, we present results for guided image generation using an unconditional diffusion model trained on ImageNet. We experiment with object location guidance and a hybrid guided image generation task which we term segmentation-guided inpainting. We also include additional experiments with CLIP guidance in the appendix. We discuss the results for each guidance function separately.
Object Location Guidance. Similar to object location guidance for Stable Diffusion, we also use the same network architecture and the same pre-trained model as our object detection network, and construct an identical loss function ℓ for our guidance algorithm. However, unlike Stable Diffusion, object locations are the only prompts available for guided image generation. We experiment with different object location prompts using either (1) only forward universal guidance and (2) both forward and backward universal guidance.

Table 2: Quantitative analysis of forward guidance only versus the combination of forward and backward guidance for object detection guidance on ImageNet with bounding boxes in Fig. 7. The metric is mAP@50.
Object Location | Fwd. Only | Fwd. + Bkd.
Bounding box 1 | 0.39 | 0.90
Bounding box 2 | 0.18 | 0.36

We observe from Fig. 7 that
applying both forward and backward guidance generates images that are realistic and the objects
match the prompt nicely. On the other hand, while images generated using only forward guidance
remain realistic, they feature objects with mismatching categories and locations. The observation is
further backed by quantitative evaluation presented in Tab. 2. The evaluation is based on mAP@50
between the ground truth object locations and the predicted bounding boxes of generated images, and
clearly shows that the combination of forward and backward guidance leads to a much better match
with the constraint. The results demonstrate the effectiveness of our universal guidance algorithm,
and also validate the necessity of our backward guidance.
Segmentation-Guided Inpainting. In this experiment, we aim to explore the ability of our algo-
rithm to handle multiple guidance functions. We perform guided image generation with combined
guidance from an inpainting mask, a classifier, and a segmentation network. We first create images with masked regions to serve as the inpainting prompt. We then pick an object class c as the prompt
for classification and generate a segmentation mask where the masked regions are considered fore-
ground objects of the same class c. We use ℓ2 loss on the non-masked region as the loss function
for inpainting, and set the corresponding s(t) = 0, or equivalently only use backward guidance
for inpainting. We use the same segmentation network as described in Sec. 4.1. For classification
guidance, we use the classifier that accepts noisy inputs (Dhariwal & Nichol, 2021), and perform the original classifier guidance of Eq. (4) instead of our forward guidance. The results in Fig. 8 show
that when using both inpainting and classifier as guidance, our algorithm generates realistic images
that both match the inpainting prompt and are classified correctly to the given object class. Adding
in segmentation guidance, our algorithm further improves by making a near-perfect match to both
the segmentation map and the inpainting prompt while maintaining realism, demonstrating that our
algorithm effectively combines feedback from multiple guidance functions.
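For reference, the sketch below illustrates how such feedback can be combined in our setting: an ℓ2 inpainting loss on the non-masked region plus a segmentation term, evaluated on the predicted clean image. The cross-entropy segmentation loss and the simple summed weighting are our assumptions; in the experiment above the inpainting term is applied through backward guidance only (s(t) = 0).

```python
import torch
import torch.nn.functional as F

def inpainting_loss(z0_hat, known_image, keep_mask):
    """l2 loss on the non-masked region; keep_mask is 1 where pixels are given."""
    return torch.mean((keep_mask * (z0_hat - known_image)) ** 2)

def combined_guidance_loss(z0_hat, known_image, keep_mask, seg_net, seg_target, w_seg=1.0):
    """Combine multiple guidance functions by summing their losses (assumed weighting)."""
    seg_loss = F.cross_entropy(seg_net(z0_hat), seg_target)  # segmentation prompt as per-pixel labels
    return inpainting_loss(z0_hat, known_image, keep_mask) + w_seg * seg_loss
```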
5 CONCLUSION
We propose a universal guidance algorithm that is able to perform guided image generation with any
off-the-shelf guidance function based on a fixed foundation diffusion model. Our algorithm only
needs guidance and loss functions to be differentiable, and avoids any retraining to adapt the guidance
function or the foundation model to a specific type of prompt. We demonstrate promising results with our algorithm on complex guidance functions, including segmentation, face recognition, and object detection systems, and show that multiple guidance functions can even be used together.
6 ACKNOWLEDGEMENTS
This work was made possible by the ONR MURI program and the AFOSR MURI program. Com-
mercial support was provided by Capital One Bank, the Amazon Research Award program, and Open
Philanthropy. Further support was provided by the National Science Foundation (IIS-2212182), and
by the NSF TRAILS Institute (2229885).
REPRODUCIBILITY STATEMENT
We describe the guidance functions and foundation diffusion models for the experiments presented in Sec. 4
in the corresponding subsections. Hyperparameters for experiments described in Sec. 4 of the main
paper can be found in Sec. B in the appendix. We also include the source code used to conduct the
experiments described in the paper in our supplementary material.
REFERENCES
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising
diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–
17993, 2021.
Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas
Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. arXiv
preprint arXiv:2208.09392, 2022.
Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior
sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022a.
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse
problems using manifold constraints. arXiv preprint arXiv:2206.00941, 2022b.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE,
2009.
Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play
priors. arXiv preprint arXiv:2206.09012, 2022.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 2020.
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun
Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF
international conference on computer vision, pp. 1314–1324, 2019.
Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. Fastdiff: A fast conditional
diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934, 2022.
Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv
preprint arXiv:2201.11793, 2022.
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model
for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-lm
improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022.
Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. Benchmarking detection
transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae
Lee. Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093, 2023.
Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint:
Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever,
and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion
models. arXiv preprint arXiv:2112.10741, 2021.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen,
Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep
learning library. Advances in neural information processing systems, 32, 2019.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language
supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with
region proposal networks. Advances in neural information processing systems, 28, 2015.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion
models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–
36494, 2022.
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition
and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
815–823, 2015.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference
on Learning Representations, 2021a.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances
in Neural Information Processing Systems, 32, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations. International Conference on
Learning Representations, 2021b.
Weilun Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Dong Chen, Lu Yuan, and Houqiang Li. Semantic
image synthesis via diffusion models. arXiv preprint arXiv:2207.00050, 2022a.
Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space
model. arXiv preprint arXiv:2212.00490, 2022b.
Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar.
Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 16293–16303, 2022.
Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask
cascaded convolutional networks. IEEE signal processing letters, 23(10):1499–1503, 2016.
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv
preprint arXiv:2302.05543, 2023.
A LIMITATIONS
Generation using universal guidance is typically slower than standard conditional generation for
several reasons. Empirically, more than one iteration of denoising per noise level t is required to generate high-quality images with complex guidance functions, and the time complexity of our algorithm scales linearly with the number of refinement steps k, which slows down image generation when k is large. That being said, for the applications presented in this paper, we were
successful in generating images with smaller values of k. Also, as demonstrated in the main paper,
backward guidance is required in certain scenarios to help generate images that match the given
constraint. Computing backward guidance requires performing minimization with a multi-step
gradient descent inner loop. While proper choices of gradient-based optimization algorithms and
learning rate schedules significantly speed up the convergence of minimization, the time it takes
to compute backward guidance inevitably becomes longer when the guidance function is itself a
very large neural network. Finally, we note that, to get optimal results, sampling hyper-parameters
must be chosen individually for each guidance network.
B HYPER-PARAMETERS
In this section, we present the hyper-parameters for the different guidance functions, i.e., face, segmentation, object location, and style guidance. We present the hyper-parameters for the Stable Diffusion experiments of Sec. 4.1 in Tab. 3, where we include the coefficient s0 used to compute s(t) = s0 · √(1 − αt) and the number of universal stepwise refinement steps k. We also provide the hyper-parameters for the ImageNet experiments of Sec. 4.2 in Tab. 4.
Table 3: Hyper-parameters used in this paper for different guidance functions to reproduce the results for Stable Diffusion.
Guidance | s0 | k
Face | 20000 | 2
Object Location | 100 | 3
Style Transfer | 6 | 6
Segmentation | 400 | 10
Table 4: Hyper-parameters used in this paper for different guidance functions to reproduce the results for ImageNet.
Guidance | s0 | k
Object Location | 100 | 3
Segmentation | 200 | 10
CLIP (Radford et al., 2021) is a state-of-the-art text-to-image similarity model developed by OpenAI.
We use the image feature extractor of CLIP to do text-guided image generation with our algorithm. We
construct a loss function that calculates the negative cosine similarity between an image embedding and the CLIP text embedding produced by a given text prompt. We use s(t) = 10 · √(1 − αt) and k = 8, and use Stable Diffusion as an unconditional image generator.
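A sketch of this CLIP-based guidance loss is shown below, using the openai/CLIP package; the library choice, the bilinear resize, the omitted CLIP channel normalization, and the single-device assumption are our simplifications, not the authors' released code.

```python
import torch
import torch.nn.functional as F
import clip  # assumed: the openai/CLIP package; any CLIP with image/text encoders works

model, _ = clip.load("ViT-B/32")
model.eval()

def clip_text_guidance_loss(image, prompt):
    """Negative cosine similarity between the CLIP embedding of the generated image
    and the CLIP text embedding of the prompt (image in [0, 1], shape Bx3xHxW)."""
    image = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
    img_emb = model.encode_image(image)            # CLIP channel normalization omitted for brevity
    txt_emb = model.encode_text(clip.tokenize([prompt]))
    return -F.cosine_similarity(img_emb, txt_emb).mean()
```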
We generate images guided by a number of text prompts. To further assess our universal guidance
algorithm and compare guidance and conditioning, we also generate images using classical, text-
conditional generation by Stable Diffusion with identical prompts as inputs, and summarize the
results in Fig. 9. The results in Fig. 9 show that our algorithm can guide the generation to produce
high-quality images that match the given text description, and are comparable with images generated
by the specialized text-conditioning model. We also include qualitative results from experiments on
DrawBench (Saharia et al., 2022). DrawBench is a widely-used and diverse list of text prompts. We
randomly select 20 prompts and generate 10 images for each individual prompt. We compute CLIP
score, the cosine similarity between the CLIP features of the text prompt and the associated generated image, for both Stable Diffusion and our algorithm, and report the average in Tab. 5. As demonstrated in the table, the performance of our algorithm is quantitatively similar to that of a dedicated text-conditional generator, while requiring no additional training at all for the underlying unconditional diffusion model.
Figure 9: We compare the ability to match given text prompts between our universal guidance algorithm and a text-conditional model trained from scratch. The results demonstrate that our universal algorithm is comparable to the specialized conditional model in its ability to generate quality images that satisfy the text constraints.
Table 5: Quantitative results on DrawBench for Stable Diffusion and our algorithm.
[Figure 10 panel prompts (partially recovered) include "English foxhound by Van Gogh" and prompts in the styles of Van Gogh and Edward Hopper.]
Figure 10: We show that unconditional diffusion models trained on ImageNet can be guided with CLIP to generate high-quality images that match the text prompts, even when the matching images are expected to be out of distribution.
CLIP Guidance. We use the same construction of f and ℓ as for Stable Diffusion to perform CLIP-guided generation, and use only forward guidance for this experiment. To assess the limits of our universal guidance algorithm, we hand-craft text prompts such that the matching images are expected to be out of distribution. In particular, our text prompts either designate art styles that are far from realistic or designate objects that do not belong to any class label of ImageNet. We present the results in Fig. 10, from which we clearly see that our algorithm still successfully guides the generation to produce quality images that also match the text prompts. For all three images, we have s(t) = w · √(1 − αt), where w is 2, 5, and 2 respectively, and k is 10, 5, and 10 respectively.
Figure 11: Qualitative results for the effect of different guidance strengths s(t) and universal refinement steps k on segmentation guidance for Stable Diffusion. Here we use s(t) = c · √(1 − αt) and compare different values of c.
We also remark that the ablation study suggests a principled way to pick a suitable k and
s(t) for different guidance functions. In particular, increasing k is always beneficial to the generation
quality, and the value of k is limited only by the computational budget. Given a fixed k, there is
generally a sweet spot for s(t) that ensures both a match to the target and sufficient quality, and this
sweet spot can be found with standard parameter search, as described above.
Figure 12: Additional examples of segmentation guidance. In each subfigure, the first image is the segmentation map used to guide the image generation, and the subfigure caption is the text prompt.
Figure 13: Additional examples of face guidance. In each subfigure, the first image shows the human identity used to guide the image generation, and the subfigure caption is the text prompt.
Figure 14: Additional examples of object location guidance. In each subfigure, the first image shows the object locations used to guide the image generation, and the subfigure caption is the text prompt.
Figure 15: Additional examples of style transfer. In each subfigure, the first image is the style image used to guide the image generation, and the subfigure caption is the text prompt.