
Published as a conference paper at ICLR 2024

UNIVERSAL GUIDANCE FOR DIFFUSION MODELS


Arpit Bansal* (University of Maryland, bansal01@umd.edu), Hong-Min Chu* (University of Maryland), Avi Schwarzschild (University of Maryland), Soumyadip Sengupta (University of North Carolina), Micah Goldblum (New York University), Jonas Geiping (University of Maryland), Tom Goldstein (University of Maryland)

ABSTRACT

Typical diffusion models are trained to accept a particular form of conditioning,


most commonly text, and cannot be conditioned on other modalities without re-
training. In this work, we propose a universal guidance algorithm that enables
diffusion models to be controlled by arbitrary guidance modalities without the
need to retrain any use-specific components. We show that our algorithm success-
fully generates quality images with guidance functions including segmentation,
face recognition, object detection, and classifier signals. Code is available at
github.com/arpitbansal297/Universal-Guided-Diffusion.

(Figure 1 panels omitted. Panel prompts: Target Segmentation Map, "Walker hound, Walker foxhound in space"; Target Identity, "A headshot of a blonde woman as a sketch"; Target Object Location, "A headshot of a woman with a dog in winter."; Target Style Image, "A Portrait of a woman".)

Figure 1: Diffusion guided by off-the-shelf networks. Top left: segmentation guidance, top right: face recognition guidance, bottom left: object detection guidance, bottom right: style transfer.


1 INTRODUCTION

Diffusion models are powerful tools for creating digital art and graphics. Much of their success stems
from our ability to carefully control their outputs, customizing results for each user’s individual needs.
Most models today are controlled through conditioning. With conditioning, the diffusion model is
built from the ground up to accept a particular modality of input from the user, be it descriptive text,
segmentation maps, class labels, etc. While conditioning is a powerful tool, it results in models that
are handcuffed to a single conditioning modality. If another modality is needed, a new model has
to be trained, often from scratch. Unfortunately, the high cost of training makes this prohibitive for
most users.
A more flexible approach to controlling model outputs is to use guidance. In this approach, the diffu-
sion model acts as a generic image generator, and is not required to understand a user’s instructions.
The user pairs this model with a guidance function that measures whether some criterion has been
met. For example, one could guide the model to minimize the CLIP score between the generated
image and a text description of the user’s choice. During each iteration of image creation, the iterates
are nudged down the gradient of the guidance function, causing the final generated image to satisfy
the user’s criterion.
In this paper, we study guidance methods that enable any off-the-shelf model or loss function to
be used as guidance for diffusion. Because guidance functions can be used without re-training or
modification, this form of guidance is universal in that it enables a diffusion model to be adapted for
nearly any purpose.
From a user perspective, guidance is superior to conditioning, as a single diffusion network is treated
like a foundational model that provides universal coverage across many use cases, both commonplace
and bespoke. Unfortunately, it is widely believed that this approach is infeasible. While early
diffusion models relied on classifier guidance (Dhariwal & Nichol, 2021), the community quickly
turned to classifier-free schemes (Ho & Salimans, 2022) that require a model to be trained from
scratch on class labels with a particular frozen ontology that cannot be changed (Nichol et al., 2021;
Rombach et al., 2022; Bansal et al., 2022).
The difficulty of using guidance stems from the domain shift between the noisy images used by the
diffusion sampling process and the clean images on which the guidance models are trained. When this
gap is closed, guidance can be performed successfully. For example, Nichol et al. (2021) successfully
use a CLIP model as guidance, but only after re-training CLIP from scratch using noisy inputs. Noisy
retraining closes the domain gap, but at a very high financial and engineering cost. To avoid the
additional cost, we study methods for closing this gap by changing the sampling scheme, rather than
the model.
To this end, our contributions are summarized as follows:

• We propose an algorithm that enables universal guidance for diffusion models. Our proposed
sampler evaluates the guidance models only on denoised images, rather than noisy latent
states. By doing so, we close the domain gap that has plagued standard guidance methods.
This strategy provides the end-user with the flexibility to work with a wide range of guidance
modalities and even multiple modalities simultaneously. The underlying diffusion model
remains fixed and no fine-tuning of any kind is necessary.
• We demonstrate the effectiveness of our approach for a variety of different constraints such
as classifier labels, human identities, segmentation maps, annotations from object detectors,
and constraints arising from inverse linear problems.

2 BACKGROUND

We first briefly review the recent literature on the core framework behind diffusion models. Then, we
define the problem setting of controlled image generation and discuss previous related works.


2.1 DIFFUSION MODELS

Diffusion models are strong generative models that proved powerful even when first introduced
for image generation (Song & Ermon, 2019; Ho et al., 2020). The approach has been successfully
extended to a number of domains, such as audio and text generation (Kong et al., 2020; Huang et al.,
2022; Austin et al., 2021; Li et al., 2022).
We introduce (unconditional) diffusion formally, as it is helpful in describing the nuances of different
types of models. A diffusion model is defined as a combination of a T -step forward process and a
T -step reverse process. Conceptually, the forward process gradually adds Gaussian noise of different
magnitudes to a clean data point z0 , while the reverse process attempts to gradually denoise a noisy
input in hopes of recovering a clean data point. More concretely, given an array of scalars representing
noise scales $\{\alpha_t\}_{t=1}^{T}$ and an initial, clean data point z0, applying t steps of the forward process to z0 yields a noisy data point

$$z_t = \sqrt{\alpha_t}\, z_0 + \sqrt{1-\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{1}$$

A diffusion model is a learned denoising network ϵθ. It is trained so that, for any pair (z0, t) and any sample of ϵ,

$$\epsilon_\theta(z_t, t) \approx \epsilon = \frac{z_t - \sqrt{\alpha_t}\, z_0}{\sqrt{1-\alpha_t}}. \tag{2}$$
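The following is a minimal PyTorch sketch of the forward process and the training target in Eqs. (1)-(2). The tensor `alphas` of cumulative noise scales and the mean-squared-error objective are illustrative assumptions, not the paper's implementation.

```python
import torch

def forward_noise(z0, t, alphas):
    """Apply t steps of the forward process to a clean data point z0 (Eq. 1)."""
    eps = torch.randn_like(z0)                                   # epsilon ~ N(0, I)
    zt = alphas[t].sqrt() * z0 + (1 - alphas[t]).sqrt() * eps
    return zt, eps

def denoising_loss(eps_theta, z0, t, alphas):
    """Train the denoiser eps_theta to recover the injected noise (Eq. 2)."""
    zt, eps = forward_noise(z0, t, alphas)
    return torch.nn.functional.mse_loss(eps_theta(zt, t), eps)
```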

The reverse process takes the form q(zt−1|zt, z0), with details that vary across methods, where q(·|·) is generally parameterized as a Gaussian distribution. Different works have also studied different approximations of the unknown q(zt−1|zt, z0) used to perform sampling. For example, the denoising diffusion implicit model (DDIM) (Song et al., 2021a) first computes a predicted clean data point

$$\hat{z}_0 = \frac{z_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\alpha_t}}, \tag{3}$$

and then samples zt−1 from q(zt−1|zt, ẑ0) by replacing the unknown z0 with ẑ0. While the details of individual sampling methods vary, all sampling methods produce zt−1 based on the current sample zt, the current time step t, and a predicted noise ϵ̂. To ease the notational burden, we define a function S(·, ·, ·) as an abstraction of the sampling method, where zt−1 = S(zt, ϵ̂, t).
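As a companion sketch, Eq. (3) and the sampler abstraction S can be written as follows. The deterministic DDIM-style update inside `S` is one illustrative choice of reverse step; the abstraction itself is agnostic to which sampler is used.

```python
import torch

def predict_z0(zt, eps_hat, alpha_t):
    """Predicted clean data point z0_hat from Eq. (3)."""
    return (zt - (1 - alpha_t).sqrt() * eps_hat) / alpha_t.sqrt()

def S(zt, eps_hat, t, alphas):
    """One reverse step z_{t-1} = S(z_t, eps_hat, t); a deterministic DDIM-style update."""
    z0_hat = predict_z0(zt, eps_hat, alphas[t])
    return alphas[t - 1].sqrt() * z0_hat + (1 - alphas[t - 1]).sqrt() * eps_hat
```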

2.2 CONTROLLED IMAGE GENERATION

In this paper, we focus on controlled image generation with various constraints. Consider a dif-
ferentiable guidance function f , for example a CLIP feature extractor or a segmentation network.
When applied to an image, we obtain a vector c = f (x). We also consider a function ℓ(·, ·) that
measures the closeness of two vectors c and c′ . Given a particular choice of c, which we call a prompt,
the corresponding constraint (based on c, ℓ, and f ) is formalized as ℓ(c, f (z)) ≈ 0, and we aim to
generate a sample z from the image distribution satisfying the constraint. In plain words, we want to
generate an in-distribution image that matches the prompt.
Prior work on controlled generative diffusion falls into two main categories. We refer to the first
category as conditional image generation, and the second category as guided image generation. Next,
we discuss the characteristics of each category and better situate our work among existing methods.

Conditional Image Generation. Methods from this category require training new diffusion models
that accept the prompt as an additional input (Ho & Salimans, 2022; Bansal et al., 2022; Nichol
et al., 2021; Whang et al., 2022; Wang et al., 2022a; Li et al., 2023; Zhang & Agrawala, 2023). For
example, Ho & Salimans (2022) proposed classifier-free guidance using class labels as prompts,
and trained a diffusion model by linear interpolation between unconditional and conditional outputs
of the denoising networks. Bansal et al. (2022) studied the case where the guidance function is a
known linear degradation operator, and trained a conditional model to solve linear inverse problems.
Nichol et al. (2021) further extended classifier-free guidance to text-conditional image generation
with descriptive phrases as prompts, and trained a diffusion model to enforce the similarity between
the CLIP (Radford et al., 2021) representations of the generated images and the text prompts. These
methods are successful across different types of constraints; however, the requirement to retrain the
diffusion model makes them computationally intensive.


Guided Image Generation. Works in this category employ a frozen pre-trained diffusion model
as a foundation model, but modify the sampling method to guide the image generation with feedback
from the guidance function. Our method falls into this category. Prior work that studied guided image
generation did so with a variety of restrictions and external guidance functions (Dhariwal & Nichol,
2021; Kawar et al., 2022; Wang et al., 2022b; Chung et al., 2022a; Lugmayr et al., 2022; Chung
et al., 2022b; Graikos et al., 2022). For example, Dhariwal & Nichol (2021) proposed classifier
guidance, where they trained a classifier on images of different noise scales as the guidance function
f , and included gradients of the classifier during the sampling process. However, a classifier for noisy
images is domain-specific and generally not readily available – an issue our method circumvents.
Wang et al. (2022b) assumed the external guidance functions to be linear operators, and generated
the component of images residing in the null space of linear operators with the foundation model.
Unfortunately, extending that method to handle non-linear guidance functions is non-trivial. Chung
et al. (2022a) studied general guidance functions, and modified the sampling process with the gradient
of the guidance function calculated on the expected denoised images. Nevertheless, the authors only
presented results with simpler non-linear guidance functions such as non-linear blurring.
In this work, we study universal guidance algorithms for guided image generation with diffusion
models using any off-the-shelf guidance functions f , such as object detection or segmentation
networks.

3 UNIVERSAL GUIDANCE

We propose a guidance algorithm that augments the image sampling method of a diffusion model
to include guidance from an off-the-shelf auxiliary network. Our algorithm is motivated by an
empirical observation that the reconstructed clean image ẑ0 obtained by Eq. (3), while naturally
imperfect, is still appropriate for a generic guidance function to provide informative feedback to
guide the image generation. In Sec. 3.1, we motivate our forward universal guidance by extending
classifier guidance Dhariwal & Nichol (2021) to leverage this observation and handle generic guidance
functions. In Sec. 3.2, we propose a supplementary backward universal guidance to help enforce the
generated image to satisfy the constraint based on the guidance function f . In Sec. 3.3, we discuss a
simple yet helpful stepwise refinement trick to empirically improve the fidelity of generated images.

3.1 FORWARD UNIVERSAL GUIDANCE

To guide the generation with information from the external guidance function f and the loss function
ℓ, an immediate thought is to extend classifier guidance (Dhariwal & Nichol, 2021) to accept
any general guidance function. Concretely, given a class prompt c, classifier guidance performs
classification-guided sampling by replacing ϵθ(zt, t) in each sampling step S(zt, ϵθ(zt, t), t) with

$$\hat{\epsilon}_\theta(z_t, t) = \epsilon_\theta(z_t, t) - \sqrt{1-\alpha_t}\,\nabla_{z_t}\log p(c|z_t). \tag{4}$$

Defining ℓce(·, ·) to be the cross-entropy loss and fcl to be the guidance function that outputs classification probabilities, Eq. (4) can be re-written as

$$\hat{\epsilon}_\theta(z_t, t) = \epsilon_\theta(z_t, t) + \sqrt{1-\alpha_t}\,\nabla_{z_t}\ell_{ce}(c, f_{cl}(z_t)). \tag{5}$$

However, directly replacing fcl and ℓce with any off-the-shelf guidance and loss functions does not work in practice, as f is most likely trained on clean images and fails to provide meaningful guidance when the input is noisy.
To address the issue, we observe that

$$p(c|z_t) = \int p(c|z_0, z_t)\, p(z_0|z_t)\, dz_0 = \mathbb{E}_{z_0 \sim p(z_0|z_t)}\left[p(c|z_0)\right], \tag{6}$$

where c is conditionally independent of zt given z0. Leveraging the fact that we can obtain a predicted clean image ẑ0 via Eq. (3) with ϵθ(zt, t), we approximate the expectation in Eq. (6) as E_{z0∼p(z0|zt)}[p(c|z0)] ≈ p(c|ẑ0). This leads to our proposed guided sampling procedure


$$\hat{\epsilon}_\theta(z_t, t) = \epsilon_\theta(z_t, t) + s(t)\cdot\nabla_{z_t}\ell(c, f(\hat{z}_0)), \tag{7}$$

where s(t) controls the guidance strength for each sampling step and

$$\nabla_{z_t}\ell(c, f(\hat{z}_0)) = \nabla_{z_t}\ell\left(c, f\!\left(\frac{z_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\alpha_t}}\right)\right)$$
as in Eq. (3). We term Eq. (7) forward universal guidance, or forward guidance in short. In practice,
applying forward guidance effectively brings the generated image closer to the prompt while keeping
the generation trajectory in the data manifold. We note that a related approach is also studied in Chung
et al. (2022a), where the guidance step is computed based on E[z0 |zt ]. The approach drew inspiration
from the score-based generative framework (Song et al., 2021b), but resulted in a different update
method.
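A minimal sketch of one forward-guidance step is given below. The guidance function `f`, loss `loss`, prompt `c`, and strength `s_t` are placeholders for whichever pair (f, ℓ) is used; gradients are taken through the denoiser, as Eq. (7) requires.

```python
import torch

def forward_guidance(eps_theta, zt, t, alpha_t, f, loss, c, s_t):
    """Forward universal guidance (Eq. 7): differentiate loss(c, f(z0_hat)) w.r.t. z_t."""
    zt = zt.detach().requires_grad_(True)
    eps = eps_theta(zt, t)
    z0_hat = (zt - (1 - alpha_t).sqrt() * eps) / alpha_t.sqrt()      # Eq. (3)
    grad = torch.autograd.grad(loss(c, f(z0_hat)), zt)[0]
    return eps.detach() + s_t * grad
```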

3.2 BACKWARD UNIVERSAL GUIDANCE

As will be shown in Sec. 4.2, we observe that forward guidance sometimes over-prioritizes maintaining
the “realness” of the image, resulting in an unsatisfactory match with the given prompt. Simply
increasing the guidance strength s(t) is suboptimal, as this often results in instability as the image
moves off the manifold faster than the denoiser can correct it.

To address the issue, we propose backward universal guidance, or backward guidance for short, to supplement forward guidance and help enforce that the generated image satisfies the constraint. The key idea of backward guidance is to optimize for a clean image that best matches the prompt based on ẑ0, and to linearly translate the guided change back to the noisy image space at step t. Concretely, instead of directly calculating ∇zt ℓ(c, f(ẑ0)), we compute a guided change ∆z0 in clean data space as

$$\Delta z_0 = \arg\min_{\Delta} \ell(c, f(\hat{z}_0 + \Delta)). \tag{8}$$

Empirically, we solve Eq. (8) with m-step gradient descent, where we use ∆ = 0 as a starting point. Since ẑ0 + ∆z0 minimizes ℓ(c, f(z)) directly, ∆z0 is the change in clean data space that best enforces the constraint. Then, we translate ∆z0 back to the noisy data space of zt by calculating the guided denoising prediction ϵ̃ that satisfies

$$z_t = \sqrt{\alpha_t}\,(\hat{z}_0 + \Delta z_0) + \sqrt{1-\alpha_t}\,\tilde{\epsilon}. \tag{9}$$

Reusing Eq. (3), we can rewrite ϵ̃ as an augmentation of the original denoising prediction ϵθ(zt, t) by

$$\tilde{\epsilon} = \epsilon_\theta(z_t, t) - \sqrt{\alpha_t/(1-\alpha_t)}\,\Delta z_0. \tag{10}$$

Algorithm 1 Universal Guidance (one sampling step)
Parameters: refinement steps k, backward guidance steps m, and guidance strength s(t)
Required: the noisy vector zt at a given time step t, diffusion model ϵθ, noise scale αt, guidance function f, loss function ℓ, and prompt c
for n = 1, 2, . . . , k do
    Calculate ẑ0 as in Eq. (3).
    Calculate ϵ̂θ using forward universal guidance as in Eq. (7).
    if m > 0 then
        Calculate ∆z0 by minimizing Eq. (8) with m steps of gradient descent.
        Perform backward universal guidance by ϵ̂θ ← ϵ̂θ − √(αt/(1 − αt)) ∆z0 (see Eq. (10)).
    end if
    zt−1 ← S(zt, ϵ̂θ, t)
    ϵ′ ∼ N(0, I)
    zt ← √(αt/αt−1) zt−1 + √(1 − αt/αt−1) ϵ′
end for
Return zt−1

Compared to forward guidance, backward guidance (Eq. (10)) produces an optimized direction
for the generated image to match the given prompt, and hence prioritizes enforcing the constraint.
Furthermore, calculation of a gradient step for Eq. (8) is computationally cheaper than forward
guidance (Eq. (7)), and we can therefore afford to solve Eq. (8) with multiple gradient steps, further
improving the match with the given prompt.

We note that the names “forward” and “backward” are used analogously to the forward and backward
Euler methods.
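The sketch below illustrates backward guidance under these definitions. The use of Adam and the learning rate are illustrative assumptions, since the text only specifies m steps of gradient descent starting from ∆ = 0.

```python
import torch

def backward_guidance(eps_hat, z0_hat, alpha_t, f, loss, c, m, lr=0.05):
    """Solve Eq. (8) with m gradient steps and fold Delta z0 into eps_hat via Eq. (10)."""
    delta = torch.zeros_like(z0_hat, requires_grad=True)     # start from Delta = 0
    opt = torch.optim.Adam([delta], lr=lr)                   # optimizer and lr are illustrative
    for _ in range(m):
        opt.zero_grad()
        loss(c, f(z0_hat.detach() + delta)).backward()
        opt.step()
    # Eq. (10): map the clean-space change back to the noise prediction.
    return eps_hat - (alpha_t / (1 - alpha_t)).sqrt() * delta.detach()
```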


3.3 UNIVERSAL STEPWISE REFINEMENT

Figure 2: An example of how the Universal Stepwise Refinement (USR) helps segmentation-guided generation. The left-most figure is the given segmentation map, and the images generated with USR steps 1, 4 and 10 follow in order. Notice the increase of quality with USR steps.

Upon applying our universal guidance to standard generation pipelines, we observe that in some instances, the images generated exhibit artifacts or unusual behaviors that differentiate them from natural images. Similar findings have been reported in (Lugmayr et al., 2022; Wang et al., 2022b), where linear guidance functions were explored. Although we attempted to enhance the realism by adjusting the guidance strength s(t), finding an optimal balance that both guarantees image realism and adherence to guidance constraints proved challenging, particularly with complex guidance functions. We hypothesize that the direction of guidance offered by our universal method might not always correlate with image realism, diverting the image from a natural image sampling trajectory.
Drawing motivation from Lugmayr et al. (2022); Wang et al. (2022b), we propose a more nuanced
approach, termed Universal Stepwise Refinement (USR), to navigate these challenges. Specifically,
after sampling zt−1 = S(zt, ϵ̂θ, t), we reintroduce Gaussian noise ϵ′ ∼ N(0, I) to zt−1, deriving z′t via

$$z_t' = \sqrt{\alpha_t/\alpha_{t-1}}\; z_{t-1} + \sqrt{1 - \alpha_t/\alpha_{t-1}}\;\epsilon'. \tag{11}$$
Equation 11 ensures zt′ retains the appropriate noise scale for input at time step t. Given a refinement
step k, we repeat the Universal Stepwise Refinement k times before advancing with the sampling
for step t − 1. This refined process enables the exploration of various regions of the data manifold
at a consistent noise scale to reach a solution that aligns with both the guidance and image quality
requisites. Our empirical assessments indicate that using the USR mechanism on top of a proper guidance strength s(t) significantly enhances image realism while still maintaining adherence to the prompt, as shown in Fig. 2.
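A short sketch of the re-noising step in Eq. (11), assuming the noise scales are stored as tensors:

```python
import torch

def usr_renoise(z_prev, alpha_t, alpha_prev):
    """Re-noise z_{t-1} back to the noise level of step t (Eq. 11)."""
    eps_prime = torch.randn_like(z_prev)
    ratio = alpha_t / alpha_prev
    return ratio.sqrt() * z_prev + (1 - ratio).sqrt() * eps_prime
```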
We summarize our universal guidance algorithm composed of forward universal guidance, backward
universal guidance and universal stepwise refinement for a single sampling step in Algorithm 1. For
simplicity, the algorithm assumes only one guidance function, but can be easily adapted to handle
multiple pairs of (f, ℓ). Additionally, the objectives of the forward and backward guidance do not
have to be identical, allowing different ways to simultaneously utilize multiple guidance functions.
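Putting the pieces together, a single sampling step of Algorithm 1 might look like the sketch below, which reuses the helper functions sketched earlier (forward_guidance, backward_guidance, usr_renoise, S, predict_z0). These names and signatures are illustrative, not the released implementation.

```python
import torch

def universal_guidance_step(eps_theta, zt, t, alphas, f, loss, c, s_t, k, m):
    """One guided sampling step: k refinement passes of forward (+ optional backward) guidance."""
    for _ in range(k):
        eps_hat = forward_guidance(eps_theta, zt, t, alphas[t], f, loss, c, s_t)    # Eq. (7)
        if m > 0:
            with torch.no_grad():
                z0_hat = predict_z0(zt, eps_theta(zt, t), alphas[t])                # Eq. (3)
            eps_hat = backward_guidance(eps_hat, z0_hat, alphas[t], f, loss, c, m)  # Eqs. (8)-(10)
        z_prev = S(zt, eps_hat, t, alphas)         # z_{t-1} = S(z_t, eps_hat, t)
        zt = usr_renoise(z_prev, alphas[t], alphas[t - 1])                          # Eq. (11)
    return z_prev
```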

4 EXPERIMENTS
In this section, we present results testing our proposed universal guidance algorithm against a wide variety of guidance functions. Specifically, we experiment with Stable Diffusion (Rombach et al., 2022), a diffusion model that is able to perform text-conditional generation by accepting a text prompt as an additional input, and with a purely unconditional diffusion model trained on ImageNet (Deng et al., 2009), where we use the pre-trained model provided by OpenAI (Dhariwal & Nichol, 2021). We note that Stable Diffusion, while being a text-conditional generative model, can also perform unconditional image generation by simply using an empty string as the text prompt. We first present the experiments on Stable Diffusion for different guidance functions in Sec. 4.1, and then present the results on the ImageNet diffusion model in Sec. 4.2. Hyper-parameters used for different guidance functions, further ablation studies, and selection procedures for suitable guidance strength s(t) and refinement step k can be found in the appendix.

4.1 RESULTS FOR STABLE DIFFUSION

In this section, we present the results of guided image generation using Stable Diffusion as the
foundation model. The guidance functions we experiment with include a segmentation network,
a face recognition network, an object detection network, and style guidance with a CLIP feature extractor (Radford et al., 2021). For experiments on Stable Diffusion, we discover that applying forward guidance already produces high-quality images that match the given prompt, and hence set m = 0. To perform forward guidance on Stable Diffusion, we pass the predicted clean latent variable computed by Eq. (3) through the image decoder of Stable Diffusion to obtain predicted clean images.


(Figure panels omitted. Figure 3 column prompts: "Walker hound, Walker foxhound under water.", "Walker hound, Walker foxhound on snow.", "Walker hound, Walker foxhound as an oil painting."; Figure 4 column prompts: "Headshot of a person with blonde hair with space background.", "Headshot of a woman made of marble.", "A headshot of a woman looking like Lara Croft.")

Figure 3: In addition to matching the text prompts (above each column), these images are guided by an image segmentation pipeline. Each column contains examples of images generated to match the prompt and the segmentation map in the left-most column. The top-most row contains examples generated without guidance.

Figure 4: Each column contains examples of images generated to match the prompt and the identity of the images in the left-most column. The top-most row contains examples generated without guidance.

We discuss the qualitative results and implementation details for each guidance function in
its corresponding subsection. We summarize our quantitative evaluation in Tab. 1, where we evaluate
how well the generated images match the external guidance with task-specific metrics, and include the
similarity of CLIP embedding between text prompts and images generated with and without guidance.
For all tasks, the minimum difference between CLIP similarities indicates that our algorithm performs
guidance without sacrificing the effectiveness of the underlying diffusion model. We also include
additional quantitative and qualitative results for text-guided generation in the appendix.
Task | Metric | Value
Segmentation | mIoU | 0.898
Segmentation | CLIP similarity (0.247) | 0.249
Face | Face similarity | 0.642
Face | CLIP similarity (0.287) | 0.234
Object detection | mAP@50 | 0.634
Object detection | CLIP similarity (0.263) | 0.246

Table 1: Quantitative analysis of different guidance functions applied to Stable Diffusion. The reference value in parentheses is obtained with no external guidance.

Segmentation Map Guidance. To guide image generation using a segmentation map as the prompt, we use a MobileNetV3-Large (Howard et al., 2019) with a segmentation head, and a publicly available pre-trained model in PyTorch (Paszke et al., 2019). We use the standard per-pixel cross-entropy loss between a given prompt and the predicted segmentation of generated images as our loss function ℓ.
In our experiment, we combine segmentation
maps that depict objects of different shapes with
new text prompts. We use the text prompt as a fixed additional input to Stable Diffusion to perform
text-conditional sampling, and guide the text-conditional generated images to match the given
segmentation maps. Qualitative results are presented in Fig. 3. From Fig. 3, we see that the generated
images show a clear separation between object and background that matches the given segmentation
map nearly perfectly. The generated object and background also each match their descriptive text
(i.e. dog breed and environment description). Furthermore, the generated images are overall highly
realistic. In Tab. 1, we evaluate mIoU between the ground truth segmentation map and the predicted
segmentation of generated images to assess the match with the constraint.
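A sketch of one possible (f, ℓ) pair for segmentation guidance is shown below. The specific torchvision MobileNetV3-Large segmentation head and the omitted input normalization are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

seg_model = deeplabv3_mobilenet_v3_large(weights="DEFAULT").eval()

def f_seg(x):
    """Guidance function: per-pixel class logits for an image batch x in [0, 1].
    (ImageNet mean/std normalization is omitted here for brevity.)"""
    return seg_model(x)["out"]                     # (B, num_classes, H, W)

def l_seg(c, logits):
    """Per-pixel cross-entropy between the prompt map c (class indices, (B, H, W)) and logits."""
    return F.cross_entropy(logits, c)
```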

Face Recognition Guidance. For guiding generation to include a specific person’s likeness, we
propose combining face detection and facial recognition modules into one guidance function. This
setup produces a facial attribute embedding from an image of a face. We use multi-task cascaded
convolutional networks (MTCNN) (Zhang et al., 2016) for face detection, and we use facenet (Schroff
et al., 2015) for facial recognition. The guidance function f then crops out the detected face and

outputs a facial attribute embedding as a prompt, and we use an ℓ1 loss between embeddings as the loss function ℓ.

(Figure panels omitted. Figure 5 column prompts: "Headshot of a woman with a dog.", "Headshot of a woman with a dog on beach.", "An oil painting of a headshot of a woman with a dog."; Figure 6 style prompts: "A colorful photo of an Eiffel Tower", "A fantasy photo of volcanoes", "A portrait of a woman".)

Figure 5: In addition to text prompts (above each column), these images are guided by an object detector. Each column contains examples of images generated to match the prompt and the bounding boxes used for guidance.

Figure 6: In addition to text prompts, these images are guided by a style image. Each column contains examples of images generated to match the text prompt and the style used for guidance.
We explore different combinations of face guidance and text prompts. Similarly to the segmentation
case, we use the text prompt as a fixed additional conditioning to Stable Diffusion and guide this
text-conditional trajectory with our algorithm so that the face in the generated image looks similar
to the face prompt. In Fig. 4, we clearly see that the facial characteristics of a given face prompt
are reproduced almost perfectly on the generated images. The descriptive text of either background,
material, or style is also realized correctly and blends nicely with the generated faces. We again
summarize our quantitative evaluation in Tab. 1. We evaluate the similarity between facial attributes
of the ground-truth identity and the generated faces. In general, two faces are considered to be from the same person if the similarity is over 0.5, and our algorithm can effectively guide the generated face to meet this criterion.
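The sketch below shows one way such a face guidance pair could be assembled with the facenet_pytorch package: detect a face box with MTCNN (non-differentiably), crop that region of the generated image differentiably, and embed it with a FaceNet model. The differentiable-crop detail and the approximate input standardization are assumptions about how gradients reach the image, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN()
facenet = InceptionResnetV1(pretrained="vggface2").eval()

def f_face(x):
    """x: (1, 3, H, W) image with values in [0, 1]; returns a 512-d facial embedding."""
    with torch.no_grad():  # detection itself is not differentiated
        np_img = (x[0].permute(1, 2, 0) * 255).clamp(0, 255).byte().cpu().numpy()
        boxes, _ = mtcnn.detect(np_img)
    x1, y1, x2, y2 = [int(v) for v in boxes[0]]            # assume one face is found
    face = x[:, :, y1:y2, x1:x2]                           # differentiable crop
    face = F.interpolate(face, size=(160, 160), mode="bilinear", align_corners=False)
    return facenet(face * 2 - 1)                           # roughly FaceNet's standardization

def l_face(c, emb):
    """l1 distance between the prompt identity embedding c and the generated embedding."""
    return F.l1_loss(emb, c)
```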

Object Location Guidance For Stable Diffusion, we also present the results of guided image
generation with an object detection network. For this experiment, we use Faster-RCNN (Ren et al.,
2015) with Resnet-50-FPN backbone (Li et al., 2021), a publicly available pre-trained model in
Pytorch, as our object detector. We use bounding boxes with class labels as our object location
prompt. We construct a loss function ℓ as the sum of three individual losses, namely (1) anchor classification loss, (2) bounding box regression loss, and (3) region label classification loss.
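One way to realize this loss is to run torchvision's Faster R-CNN in training mode against the target boxes and labels and sum the loss terms it returns, as sketched below. How the returned terms map onto the three components named above is my interpretation; here f and ℓ are folded into a single function of the prompt c and the image.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.train()                  # training mode makes torchvision return its internal loss dict
for p in detector.parameters():
    p.requires_grad_(False)       # we only need gradients w.r.t. the image

def l_detect(c, x):
    """c: {'boxes': (N, 4) float tensor, 'labels': (N,) int64 tensor}; x: (1, 3, H, W) in [0, 1]."""
    loss_dict = detector([x[0]], [c])
    return sum(loss_dict.values())   # objectness, box-regression, and classification terms
```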
We again experiment with different combinations of text prompt and object location prompt, and
similarly use the text prompt as a fixed conditioning to Stable Diffusion. Using our proposed guidance
algorithm, we perform guided image generation that generates and matches the objects presented in
the text prompt to the given object locations. The results are presented in Fig. 5. We observe from
Fig. 5 that objects in the descriptive text all appear in the designated location with the appropriate
size indicated by the given bounding boxes. Each location is filled with appropriate, high-quality
generations that align with varied image content prompts, ranging from “beach” to “oil painting”. In
Tab. 1, we use mAP@50 to measure how well the generated images satisfy the constraint.

Style Guidance Finally, we conclude our experiments on Stable Diffusion by guiding the image
generation based on a reference style given by a style image. To achieve so, we capture the reference
style from the style image by the image feature extractor from CLIP, and use the resulting image
embedding as prompts. The loss function calculates the negative cosine similarity between the embedding of generated images and the embedding of the style image. Similar to previous experiments, we control the content using text input as additional conditioning to the Stable Diffusion model. We experiment with combinations of different style images and different text prompts, and present the results in Fig. 6. From Fig. 6, we can see that the generated images contain content that matches the given text prompts, while exhibiting a style that matches the given style images.

(Figure panels omitted. Figure 7 columns: Object Location, Forward Only, Forward + Backward; Figure 8 columns: Masked Image, Clf. Guided, Clf. + Seg. Guided.)

Figure 7: Generation guided by object detection with the unconditional ImageNet model. While both forward and backward guidance produce realistic images with the desired objects in the designated locations, forward guidance alone produces the wrong objects or the wrong locations/sizes.

Figure 8: Our guidance algorithm can incorporate feedback from multiple guidance functions. Left to right: the inpainting prompt; the classifier-guided inpainting output; images generated with both classifier and segmentation guidance, where realistic dogs are generated exactly on the mask.
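A minimal sketch of the style guidance pair described above, using OpenAI's clip package: embed images with CLIP's image encoder and penalize the negative cosine similarity to the style image's embedding. The manual resize and normalization (in place of CLIP's PIL-based preprocessing) are assumptions made to keep the path to the generated image differentiable.

```python
import torch
import torch.nn.functional as F
import clip

clip_model, _ = clip.load("ViT-B/32")
clip_model.eval()
_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def f_clip_image(x):
    """x: (B, 3, H, W) in [0, 1]; returns L2-normalized CLIP image embeddings."""
    x = F.interpolate(x, size=(224, 224), mode="bicubic", align_corners=False)
    x = (x - _MEAN.to(x)) / _STD.to(x)
    emb = clip_model.encode_image(x)
    return emb / emb.norm(dim=-1, keepdim=True)

def l_style(c, emb):
    """c: embedding of the style image; the loss is negative cosine similarity."""
    return -(emb * c).sum(dim=-1).mean()
```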

4.2 RESULTS FOR IMAGENET DIFFUSION

In this section, we present results for guided image generation using an unconditional diffusion
model trained on ImageNet. We experiment with object location guidance and a hybrid guided
image generation task which we term segmentation-guided inpainting. We also include additional
experiments for CLIP guidance in the appendix. We will discuss results of each guidance separately.
Object Location Guidance. Similar to object location guidance for Stable Diffusion, we also use the same network architecture and the same pre-trained model as our object detection network, and construct an identical loss function ℓ for our guidance algorithm. However, unlike Stable Diffusion, object locations are the only prompts available for guided image generation. We experiment with different object location prompts using either (1) only forward universal guidance or (2) both forward and backward universal guidance.

Object Location | Fwd. Only | Fwd. + Bkd.
Bounding box 1 | 0.39 | 0.90
Bounding box 2 | 0.18 | 0.36

Table 2: Quantitative analysis of forward guidance only versus the combination of forward and backward guidance for object detection guidance on ImageNet, with the bounding boxes shown in Fig. 7. The metric is mAP@50.

We observe from Fig. 7 that
applying both forward and backward guidance generates images that are realistic and the objects
match the prompt nicely. On the other hand, while images generated using only forward guidance
remain realistic, they feature objects with mismatching categories and locations. The observation is
further backed by quantitative evaluation presented in Tab. 2. The evaluation is based on mAP@50
between the ground truth object locations and the predicted bounding boxes of generated images, and
clearly shows that the combination of forward and backward guidance leads to a much better match
with the constraint. The results demonstrate the effectiveness of our universal guidance algorithm,
and also validate the necessity of our backward guidance.

Segmentation-Guided Inpainting. In this experiment, we aim to explore the ability of our algo-
rithm to handle multiple guidance functions. We perform guided image generation with combined
guidance from an inpainting mask, a classifier and a segmentation network. We first generate images
with masked regions as the prompt for inpainting. We then pick an object class c as the prompt
for classification and generate a segmentation mask where the masked regions are considered fore-
ground objects of the same class c. We use ℓ2 loss on the non-masked region as the loss function
for inpainting, and set the corresponding s(t) = 0, or equivalently only use backward guidance
for inpainting. We use the same segmentation network as described in Sec. 4.1. For classification


guidance, we use the classifier that accepts noisy input (Dhariwal & Nichol, 2021), and perform the
original classifier guidance as Eq. (4) instead of our forward guidance. The results in Fig. 8 show
that when using both inpainting and classifier as guidance, our algorithm generates realistic images
that both match the inpainting prompt and are classified correctly to the given object class. Adding
in segmentation guidance, our algorithm further improves by making a near-perfect match to both
the segmentation map and the inpainting prompt while maintaining realism, demonstrating that our
algorithm effectively combines feedback from multiple guidance functions.
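A brief sketch of how such multi-guidance losses could be combined: an ℓ2 inpainting loss restricted to the known region (used with backward guidance only, i.e. s(t) = 0) plus the segmentation loss from Sec. 4.1. It reuses the f_seg and l_seg functions from the earlier segmentation sketch; `known_image` and `mask` (1 on known pixels) are assumed inputs, and the equal weighting is illustrative.

```python
import torch

def l_inpaint(known_image, mask, x):
    """l2 loss restricted to the non-masked (known) region; mask is 1 on known pixels."""
    return (((x - known_image) * mask) ** 2).mean()

def combined_guidance_loss(x, known_image, mask, seg_prompt):
    """Each (f, l) pair contributes its own term to the guidance loss."""
    return l_inpaint(known_image, mask, x) + l_seg(seg_prompt, f_seg(x))
```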

5 CONCLUSION
We propose a universal guidance algorithm that is able to perform guided image generation with any
off-the-shelf guidance function based on a fixed foundation diffusion model. Our algorithm only
needs guidance and loss functions to be differentiable, and avoids any retraining to adapt the guidance
function or the foundation model to a specific type of prompt. We demonstrate promising results with
our algorithm on complex guidance including segmentation, face recognition and object detection
systems – and multiple guidance functions can even be used together.

6 ACKNOWLEDGEMENTS
This work was made possible by the ONR MURI program and the AFOSR MURI program. Com-
mercial support was provided by Capital One Bank, the Amazon Research Award program, and Open
Philanthropy. Further support was provided by the National Science Foundation (IIS-2212182), and
by the NSF TRAILS Institute (2229885).

REPRODUCIBILITY STATEMENT
We describe the guidance functions and foundation diffusion models for the experiments presented in Sec. 4
in the corresponding subsections. Hyperparameters for experiments described in Sec. 4 of the main
paper can be found in Sec. B in the appendix. We also include the source code used to conduct the
experiments described in the paper in our supplementary material.

REFERENCES
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising
diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–
17993, 2021.
Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas
Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. arXiv
preprint arXiv:2208.09392, 2022.
Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior
sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022a.
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse
problems using manifold constraints. arXiv preprint arXiv:2206.00941, 2022b.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee,
2009.
Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play
priors. arXiv preprint arXiv:2206.09012, 2022.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 32, 2020.


Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun
Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF
international conference on computer vision, pp. 1314–1324, 2019.

Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. Fastdiff: A fast conditional
diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934, 2022.

Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv
preprint arXiv:2201.11793, 2022.

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model
for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-lm
improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022.

Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. Benchmarking detection
transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae
Lee. Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093, 2023.

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint:
Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever,
and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion
models. arXiv preprint arXiv:2112.10741, 2021.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen,
Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep
learning library. Advances in neural information processing systems, 32, 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language
supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with
region proposal networks. Advances in neural information processing systems, 28, 2015.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of CVPR, 2022.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion
models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–
36494, 2022.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition
and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
815–823, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference
on Learning Representations, 2021a.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances
in Neural Information Processing Systems, 32, 2019.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations. International Conference on
Learning Representations, 2021b.

Weilun Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Dong Chen, Lu Yuan, and Houqiang Li. Semantic
image synthesis via diffusion models. arXiv preprint arXiv:2207.00050, 2022a.

Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space
model. arXiv preprint arXiv:2212.00490, 2022b.


Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar.
Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 16293–16303, 2022.
Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask
cascaded convolutional networks. IEEE signal processing letters, 23(10):1499–1503, 2016.

Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv
preprint arXiv:2302.05543, 2023.


A LIMITATIONS

Generation using universal guidance is typically slower than standard conditional generation for
several reasons. Empirically, more than one iteration of denoising is required for the given noise level
t to generate high-quality images with complex guidance functions. However, the time complexity
of our algorithm scales linearly with the number of refinement steps k, which slows down image
generation when k is large. That being said, for the applications presented in this paper, we were
successful in generating images with smaller values of k. Also, as demonstrated in the main paper,
backward guidance is required in certain scenarios to help generate images that match the given
constraint. Computing backward guidance requires performing minimization with a multi-step
gradient descent inner loop. While proper choices of gradient-based optimization algorithms and
learning rate schedules significantly speed up the convergence of minimization, the time it takes
to compute backward guidance inevitably becomes longer when the guidance function is itself a
very large neural network. Finally, we note that, to get optimal results, sampling hyper-parameters
must be chosen individually for each guidance network.

B HYPER-PARAMETERS

In this section, we present the hyper-parameters for the different guidance functions, i.e. face, segmentation, object location, and style guidance. We present the hyper-parameters for the experiments on Stable Diffusion in Sec. 4.1 in Tab. 3, where we include the coefficient $s_0$ used to compute $s(t) = s_0\sqrt{1-\alpha_t}$ and the number of Universal Stepwise Refinement steps (k). We also provide the hyper-parameters for the experiments on ImageNet in Sec. 4.2 in Tab. 4.

Guidance | s0 | k
Face | 20000 | 2
Object Location | 100 | 3
Style Transfer | 6 | 6
Segmentation | 400 | 10

Table 3: Hyper-parameters used in this paper for different guidance functions to reproduce the results for Stable Diffusion.

Guidance | s0 | k
Object Location | 100 | 3
Segmentation | 200 | 10

Table 4: Hyper-parameters used in this paper for different guidance functions to reproduce the results for ImageNet.

C CLIP GUIDANCE FOR STABLE DIFFUSION

CLIP (Radford et al., 2021) is a state-of-the-art text-to-image similarity model developed by OpenAI.
We use the image feature extractor of CLIP to do text-guided image generation with our algorithm. We
construct a loss function that calculates the negative cosine similarity between an image embedding and the CLIP text embedding produced by a given text prompt. We use $s(t) = 10\sqrt{1-\alpha_t}$ and $k = 8$, and use Stable Diffusion as an unconditional image generator.
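A short sketch of this text guidance loss with OpenAI's clip package: the text prompt is encoded once as the guidance prompt c, and the loss is the negative cosine similarity against the image embedding (e.g. produced by the f_clip_image sketch from the style-guidance section). The normalization convention is an assumption.

```python
import torch
import clip

clip_model, _ = clip.load("ViT-B/32")

def text_embedding(prompt: str):
    """L2-normalized CLIP embedding of a text prompt (used as the guidance prompt c)."""
    with torch.no_grad():
        emb = clip_model.encode_text(clip.tokenize([prompt]))
    return emb / emb.norm(dim=-1, keepdim=True)

def l_text(c_text, img_emb):
    """Negative cosine similarity between the prompt embedding and the image embedding."""
    return -(img_emb * c_text).sum(dim=-1).mean()
```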
We generate images guided by a number of text prompts. To further assess our universal guidance
algorithm and compare guidance and conditioning, we also generate images using classical, text-
conditional generation by Stable Diffusion with identical prompts as inputs, and summarize the
results in Fig. 9. The results in Fig. 9 show that our algorithm can guide the generation to produce
high-quality images that match the given text description, and are comparable with images generated
by the specialized text-conditioning model. We also include qualitative results from experiments on
DrawBench (Saharia et al., 2022). DrawBench is a widely-used and diverse list of text prompts. We
randomly select 20 prompts and generate 10 images for each individual prompt. We compute the CLIP score, the cosine similarity between the CLIP features of the text prompt and the associated generated image, on both Stable Diffusion and our algorithm, and report the average in Tab. 5. As demonstrated in the table, the performance of our algorithm is quantitatively similar to that of a dedicated text-conditional generator, while requiring no additional training at all for the underlying unconditional diffusion model.

(Figure 9 panels omitted; the columns compare conditional Stable Diffusion with guided Stable Diffusion for the prompts "A photograph of an astronaut riding a horse." and "An oil painting of a corgi wearing a party hat.")

Figure 9: We compare the ability to match given text prompts between our universal guidance algorithm and a text-conditional model trained from scratch. The results demonstrate that our universal algorithm is comparable to the specialized conditional model in its ability to generate quality images that satisfy the text constraints.

 | Stable Diffusion | Universal Guidance
CLIP Score | 0.2818 | 0.2632

Table 5: Quantitative results on DrawBench for Stable Diffusion and our algorithm.


D CLIP GUIDANCE FOR IMAGENET DIFFUSION MODEL

(Figure 10 panels omitted; the panel captions reference an English foxhound, Van Gogh style, a cake, and Edward Hopper.)

Figure 10: We show that unconditional diffusion models trained on ImageNet can be guided with
CLIP to generate high-quality images that match the text prompts, even if these generated images
should be out of distribution.

CLIP Guidance. We use the same construction of f and ℓ for Stable Diffusion to perform CLIP-
guided generation. We use only forward guidance for this experiment. To assess the limit of our
universal guidance algorithm, we hand-crafted text prompts such that the matching images are
expected to be out of distribution. In particular, our text prompts either designate art styles that are far
from realistic or designate objects that do not belong to any possible class label of ImageNet. We
present the results in Fig. 10, and from the results, we clearly see that our algorithm still successfully
guides the generation to produce quality images that also match the text prompts. For all three images, we use $s(t) = w\sqrt{1-\alpha_t}$, where $w$ is 2, 5, and 2 respectively, and $k$ is 10, 5, and 10 respectively.


Figure 11: Qualitative results for the effect of different guidance strengths s(t) and universal refinement steps k on segmentation guidance for Stable Diffusion. Here we use $s(t) = c\sqrt{1-\alpha_t}$ and compare different values of c.

E ABLATION STUDY AND PROCEDURE TO PICK HYPERPARAMETERS


In this section we present our ablation study on the effect of different parameters, namely guidance
strength s(t) and refinement step k, for our algorithm. We summarize the qualitative results in Fig. 11
on segmentation guidance for Stable Diffusion. From the figure, we observe that increasing c alone
leads to better matching between the segmentation map and the position of the generated dogs, but
the quality of images also degrades, especially when c is large (c >= 200). On the other hand,
increasing k noticeably improves the quality of images. For example, comparing the best image from
(k, c) = (4, 800) and (k, c) = (10, 800), the latter clearly exhibits significantly better realism. We also quantitatively evaluate the effectiveness of the two parameters, and present the results in Tab. 6. We calculate the mIoU between the ground truth segmentation map and the segmentation map of generated images predicted by the given segmentation network, which assesses the match between the generation and the given constraint. We also show the cosine similarity between the CLIP attribute vectors of the generated images and the given text phrase. High cosine similarity indicates that generation guided with segmentation does not stray away from the text-conditional generation of the underlying diffusion model. In Tab. 6, we observe that when the refinement step k is fixed, increasing the guidance strength c leads to a better match with the external guidance. On the other hand, the CLIP similarity score is stable for c in the range (10, 200) before showing a sudden drop when c >= 400, indicating that it is more difficult for the underlying diffusion model to control the guided generation to match the text when c is large. As for the universal refinement step k, increasing k generally improves both mIoU and CLIP similarity.

Metric | Refinement step | c=10 | c=20 | c=50 | c=100 | c=200 | c=400 | c=800
mIoU | k=2 | 0.52 | 0.52 | 0.67 | 0.79 | 0.87 | 0.89 | 0.89
mIoU | k=4 | 0.50 | 0.59 | 0.71 | 0.80 | 0.87 | 0.89 | 0.88
mIoU | k=10 | 0.57 | 0.62 | 0.76 | 0.80 | 0.89 | 0.90 | 0.92
CLIP sim. | k=2 | 0.26 | 0.26 | 0.25 | 0.24 | 0.24 | 0.20 | 0.16
CLIP sim. | k=4 | 0.26 | 0.27 | 0.26 | 0.25 | 0.25 | 0.21 | 0.14
CLIP sim. | k=10 | 0.24 | 0.25 | 0.26 | 0.26 | 0.25 | 0.22 | 0.18

Table 6: Quantitative evaluation for different combinations of parameters for segmentation guidance on Stable Diffusion.

We also want to remark that the ablation study formalizes a principled way to pick suitable k and
s(t) for different guidance functions. In particular, increasing k is always beneficial to the generation
quality, and the value of k is limited only by the computational budget. Given a fixed k, there is
generally a sweet spot for s(t) that ensures both a match to the target and sufficient quality, and this
sweet spot can be found with standard parameter search, as described above.


F MORE RESULTS ON STABLE DIFFUSION

(a) Walker hound, Walker foxhound on snow.

(b) Walker hound, Walker foxhound under water.

(c) Walker hound, Walker foxhound as an oil painting.

Figure 12: More images to show Segmentation guidance. In each subfigure, the first image is the
segmentation map used to guide the image generation with its caption as its text prompt.


(a) Headshot of a person with blonde hair with space background.

(b) A headshot of a woman looking like a lara croft.

(c) A headshot of a blonde woman as a sketch.

Figure 13: More images to show Face guidance. In each subfigure, the first image is the human
identity used to guide the image generation with its caption as its text prompt.


(a) A headshot of a woman with a dog in winter.

(b) a headshot of a woman with a dog on beach.

(c) An oil painting of a headshot of a woman with a dog.

Figure 14: More images to show Object Location guidance. In each subfigure, the first image is the
object location used to guide the image generation with its caption as its text prompt.


(a) A colorful photo of an Eiffel Tower.

(b) A fantasy photo of a lonely road.

(c) A fantasy photo of volcanoes.

Figure 15: More images to show Style Transfer. In each subfigure, the first image is the styling image
used to guide the image generation with its caption as its text prompt.
