
Diffusion-LM Improves Controllable Text Generation

Xiang Lisa Li              John Thickstun             Ishaan Gulrajani
Stanford University        Stanford University        Stanford University
xlisali@stanford.edu       jthickst@stanford.edu      igul@stanford.edu

Percy Liang                Tatsunori B. Hashimoto
Stanford University        Stanford University
pliang@cs.stanford.edu     thashim@stanford.edu

Abstract
Controlling the behavior of language models (LMs) without re-training is a major
open problem in natural language generation. While recent works have demon-
strated successes on controlling simple sentence attributes (e.g., sentiment), there
has been little progress on complex, fine-grained controls (e.g., syntactic structure).
To address this challenge, we develop a new non-autoregressive language model
based on continuous diffusions that we call Diffusion-LM. Building upon the recent
successes of diffusion models in continuous domains, Diffusion-LM iteratively
denoises a sequence of Gaussian vectors into word vectors, yielding a sequence
of intermediate latent variables. The continuous, hierarchical nature of these inter-
mediate variables enables a simple gradient-based algorithm to perform complex,
controllable generation tasks. We demonstrate successful control of Diffusion-LM
for six challenging fine-grained control tasks, significantly outperforming prior
work.¹

1 Introduction
Large autoregressive language models (LMs) are capable of generating high quality text [33, 3, 5, 45],
but in order to reliably deploy these LMs in real world applications, the text generation process needs
to be controllable: we need to generate text that satisfies desired requirements (e.g. topic, syntactic
structure). A natural approach for controlling a LM would be to fine-tune the LM using supervised
data of the form (control, text) [16]. However, updating the LM parameters for each control task
can be expensive and does not allow for compositions of multiple controls (e.g. generate text that
is both positive sentiment and non-toxic). This motivates light-weight and modular plug-and-play
approaches [6] that keep the LM frozen and steer the generation process using an external classifier
that measures how well the generated text satisfies the control. But steering a frozen autoregressive
LM has been shown to be difficult, and existing successes have been limited to simple, attribute-level
controls (e.g., sentiment or topic) [6, 22, 44].
In order to tackle more complex controls, we propose Diffusion-LM, a new language model based
on continuous diffusions. Diffusion-LM starts with a sequence of Gaussian noise vectors and
incrementally denoises them into vectors corresponding to words, as shown in Figure 1. These
gradual denoising steps produce a hierarchy of continuous latent representations. We find that
this hierarchical and continuous latent variable enables simple, gradient-based methods to perform
complex control tasks such as constraining the parse tree of a generated sequence.
Continuous diffusion models have been extremely successful in vision and audio domains [12, 21,
34, 8, 4], but they have not been applied to text because of the inherently discrete nature of text
¹ Code is available at https://github.com/XiangLi1999/Diffusion-LM.git

Preprint. Under review.


Figure 1: Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding intermediate latent variables of decreasing noise level x_T, ..., x_0. For controllable generation, we iteratively perform gradient updates on these continuous latents to optimize for fluency (parametrized by Diffusion-LM) and control satisfaction (parametrized by a classifier).

(§3). Adapting this class of models to text requires several modifications to standard diffusions:
we add an embedding step and a rounding step to the standard diffusion process, design a training
objective to learn the embedding, and propose techniques to improve rounding (§4). We control
Diffusion-LM using a gradient-based method, as shown in Figure 1. This method enables us to steer
the text generation process towards outputs that satisfy target structural and semantic controls. It
iteratively performs gradient updates on the continuous latent variables of Diffusion-LM to balance
fluency and control satisfaction (§5.1).
To demonstrate control of Diffusion-LM, we consider six control targets ranging from fine-grained
attributes (e.g., semantic content) to complex structures (e.g., parse trees). Our method almost doubles
the success rate of previous plug-and-play methods and matches or outperforms the fine-tuning oracle
on all these classifier-guided control tasks (§7.1). In addition to these individual control tasks, we
show that we can successfully compose multiple classifier-guided controls to generate sentences with
both desired semantic content and syntactic structure (§7.2). Finally, we consider span-anchored
controls, such as length control and infilling. Diffusion-LM allows us to perform these control tasks
without a classifier, and our Diffusion-LM significantly outperforms prior plug-and-play methods and
is on-par with an autoregressive LM trained from scratch for the infilling task (§7.3).

2 Related Work

Diffusion Models for Text. Diffusion models [39] have demonstrated great success in continuous
data domains [12, 27, 21, 25], producing images and audio that have state-of-the-art sample quality.
To handle discrete data, past works have studied text diffusion models on discrete state spaces, which define a corruption process on discrete data (e.g., each token has some probability of being corrupted to an absorbing or random token) [1, 14, 15]. In this paper, we focus on continuous diffusion models for text, and to the best of our knowledge, our work is the first to explore this setting. In contrast
to discrete diffusion LMs, our continuous diffusion LMs induce continuous latent representations,
which enables efficient gradient-based methods for controllable generation.

Autoregressive and Non-autoregressive LMs. Most large pre-trained LMs are left-to-right au-
toregressive (e.g., GPT-3 [3], PaLM [5]). The fixed generation order limits the models’ flexibility
in many controllable generation settings, especially those that impose controls globally on both left
and right contexts. One example is infilling, which imposes lexical control on the right contexts;
another example is syntactic structure control, which controls global properties involving both left
and right contexts. Since autoregressive LMs cannot directly condition on right contexts, prior
works have developed specialized training and decoding techniques for these tasks [38, 9, 30]. For
example, Qin et al. [31] proposed a decoding method that relaxes the discrete LM outputs to continu-
ous variables and backpropagates gradient information from the right context. Diffusion-LM can
condition on arbitrary classifiers that look at complex, global properties of the sentence. There are
other non-autoregressive LMs that have been developed for machine translation and speech-to-text
tasks [11, 37]. However, these methods are specialized for speech and translation settings, where
the entropy over valid outputs is low, and it has been shown that these approaches fail for language
modeling [35].

Plug-and-Play Controllable Generation. Plug-and-play controllable generation aims to keep the
LM frozen and steer its output using potential functions (e.g., classifiers). Given a probabilistic
potential function that measures how well the generated text satisfies the desired control, the generated
text should be optimized for both control satisfaction (measured by the potential function) and fluency
(measured by LM probabilities). There are several plug-and-play approaches based on autoregressive
LMs: FUDGE [44] reweights the LM prediction at each token with an estimate of control satisfaction
for the partial sequence; GeDi [22] and DExperts [24] reweight the LM prediction at each token with
a smaller LM finetuned/trained for the control task.
The closest work to ours is PPLM [6], which runs gradient ascent on an autoregressive LM’s hidden
activations to steer the next token to satisfy the control and maintain fluency. Because PPLM is based
on autoregressive LMs, it can only generate left-to-right. This prevents PPLM from repairing and
recovering errors made in previous generation steps. Despite their success on attribute (e.g., topic)
controls, we will show these plug-and-play methods for autoregressive LMs fail on more complex
control tasks such as controlling syntactic structure and semantic content in §7.1. We demonstrate
that Diffusion-LM is capable of plug-and-play controllable generation by applying classifier-guided
gradient updates to the continuous sequence of latent variables induced by the Diffusion-LM.

3 Problem Statement and Background

We first define controllable generation (§3.1) and then review continuous diffusion models (§3.3).

3.1 Generative Models and Controllable Generation for Text

Text generation is the task of sampling w from a trained language model plm (w), where w =
[w1 · · · wn ] is a sequence of discrete words and plm (w) is a probability distribution over sequences
of words. Controllable text generation is the task of sampling w from a conditional distribution
p(w | c), where c denotes a control variable. For syntactic control, c can be a target syntax tree
(Figure 1), while for sentiment control, c could be a desired sentiment label. The goal of controllable
generation is to generate w that satisfies the control target c.
Consider the plug-and-play controllable generation setting: we are given a language model plm (w)
trained from a large amount of unlabeled text data, and for each control task, we are given a classifier
p(c | w) trained from a smaller amount of labeled text data (e.g., for syntactic control, the classifier is a probabilistic parser). The goal is to utilize these two models to approximately sample from the posterior p(w | c) via Bayes' rule: p(w | c) ∝ plm(w) · p(c | w). Here, plm(w) encourages w to be fluent, and p(c | w) encourages w to fulfill the control.

3.2 Autoregressive Language Models

The canonical approach to language modeling factors $p_\text{lm}$ in an autoregressive, left-to-right manner: $p_\text{lm}(w) = p_\text{lm}(w_1) \prod_{i=2}^{n} p_\text{lm}(w_i \mid w_{<i})$. In this case, text generation is reduced to the task of repeatedly predicting the next token conditioned on the partial sequence generated so far. The next-token prediction $p_\text{lm}(w_i \mid w_{<i})$ is often parametrized by a Transformer architecture [42].
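
As a concrete illustration of this factorization, the sketch below samples a sequence token by token; `next_token_probs` is a hypothetical stand-in for a trained next-token distribution (e.g., a Transformer's softmax output), not any particular model.

```python
# A minimal sketch of left-to-right autoregressive sampling; `next_token_probs`
# is an assumed stand-in for a trained model's next-token softmax.
import numpy as np

def sample_sequence(next_token_probs, vocab_size, max_len, eos_id=0, rng=None):
    """Repeatedly sample w_i ~ p_lm(w_i | w_<i) until EOS or max_len."""
    rng = rng or np.random.default_rng()
    seq = []
    for _ in range(max_len):
        probs = next_token_probs(seq)          # distribution over the vocabulary given the prefix
        token = int(rng.choice(vocab_size, p=probs))
        seq.append(token)
        if token == eos_id:
            break
    return seq

# Toy usage with a uniform stand-in model.
uniform = lambda prefix: np.full(50, 1 / 50)
print(sample_sequence(uniform, vocab_size=50, max_len=10))
```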

3.3 Diffusion Models for Continuous Domains

A diffusion model [12, 27] is a latent variable model that models the data x0 ∈ Rd as a Markov
chain xT . . . x0 with each variable in Rd , and xT is a Gaussian. The diffusion model incrementally
denoises the sequence of latent variables xT :1 to approximate samples from the target data distri-
bution (Figure 2). The initial state pθ (xT ) ≈ N (0, I), and each denoising transition xt → xt−1 is
parametrized by the model pθ (xt−1 | xt ) = N (xt−1 ; µθ (xt , t), Σθ (xt , t)). For example, µθ and Σθ
may be computed by a U-Net or a Transformer.
To train the diffusion model, we define a forward process that constructs the intermediate latent variables x_{1:T}. The forward process incrementally adds Gaussian noise to the data x_0 until, at diffusion step T, samples x_T are approximately Gaussian. Each transition x_{t-1} → x_t is parametrized by $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$, where the hyperparameter β_t is the amount of noise added at diffusion step t. This parametrization of the forward process q contains no trainable parameters and allows us to define a training objective that involves generating noisy data according to a pre-defined forward process q and training a model to reverse the process and reconstruct the data.

Figure 2: A graphical model representing the forward and reverse diffusion processes. In addition to the original diffusion models [12], we add a Markov transition between x_0 and w, and propose the embedding (§4.1) and rounding (§4.2) techniques.
The diffusion model is trained to maximize the marginal likelihood of the data $\mathbb{E}_{x_0 \sim p_\text{data}}[\log p_\theta(x_0)]$, and the canonical objective is the variational lower bound of $\log p_\theta(x_0)$ [39]:

$\mathcal{L}_\text{vlb}(x_0) = \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[ \log \frac{q(x_T \mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_0, x_t)}{p_\theta(x_{t-1} \mid x_t)} - \log p_\theta(x_0 \mid x_1) \right]. \quad (1)$

However, this objective can be unstable and requires many optimization tricks to stabilize [27]. To circumvent this issue, Ho et al. [12] devised a simple surrogate objective that expands and reweights each KL-divergence term in $\mathcal{L}_\text{vlb}$ to obtain a mean-squared error loss (derivation in Appendix E), which we will refer to as

$\mathcal{L}_\text{simple}(x_0) = \sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \lVert \mu_\theta(x_t, t) - \hat{\mu}(x_t, x_0) \rVert^2,$

where $\hat{\mu}(x_t, x_0)$ is the mean of the posterior $q(x_{t-1} \mid x_0, x_t)$, which is a closed-form Gaussian, and $\mu_\theta(x_t, t)$ is the predicted mean of $p_\theta(x_{t-1} \mid x_t)$ computed by a neural network. While $\mathcal{L}_\text{simple}$ is no longer a valid lower bound, prior work has found that it empirically makes training more stable and improves sample quality.² We will make use of similar simplifications in Diffusion-LM to stabilize training and improve sample quality (§4.1).
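
To make the forward process and the mean-matching loss concrete, here is a minimal PyTorch sketch under our own assumptions (a linear β schedule and a generic `model(x_t, t)` that predicts the posterior mean); it is not the released implementation.

```python
# A minimal sketch (not the authors' code) of the forward process q(x_t | x_0)
# and the mean-prediction form of L_simple.
import torch

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    betas = torch.linspace(beta_start, beta_end, T)      # beta_t (linear schedule assumed here)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)             # \bar{alpha}_t
    return betas, alphas, alpha_bar

def q_sample(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def posterior_mean(x0, xt, t, betas, alphas, alpha_bar):
    """Closed-form mean of q(x_{t-1} | x_0, x_t): the \hat{mu}(x_t, x_0) target in L_simple."""
    ab_t = alpha_bar[t]
    ab_prev = torch.where(t > 0, alpha_bar[torch.clamp(t - 1, min=0)], torch.ones_like(ab_t))
    coef_x0 = betas[t] * ab_prev.sqrt() / (1 - ab_t)
    coef_xt = (1 - ab_prev) * alphas[t].sqrt() / (1 - ab_t)
    shape = (-1, *([1] * (x0.dim() - 1)))
    return coef_x0.view(shape) * x0 + coef_xt.view(shape) * xt

def l_simple(model, x0, betas, alphas, alpha_bar):
    """Monte-Carlo estimate of L_simple with one random diffusion step per example."""
    T = betas.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))
    xt = q_sample(x0, t, alpha_bar)
    target = posterior_mean(x0, xt, t, betas, alphas, alpha_bar)
    return ((model(xt, t) - target) ** 2).mean()
```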

4 Diffusion-LM: Continuous Diffusion Language Modeling


Constructing Diffusion-LM requires several modifications to the standard diffusion model. First, we
must define an embedding function that maps discrete text into a continuous space. To address this,
we propose an end-to-end training objective for learning embeddings (§4.1). Second, we require a
rounding method to map vectors in embedding space back to words. To address this, we propose
training and decoding time methods to facilitate rounding (§4.2).

4.1 End-to-end Training

To apply a continuous diffusion model to discrete text, we define an embedding function EMB(w_i) that maps each word to a vector in R^d. We define the embedding of a sequence w of length n to be EMB(w) = [EMB(w_1), ..., EMB(w_n)] ∈ R^{nd}.
We propose a modification of the diffusion model training objective (Equation 1) that jointly learns
the diffusion model’s parameters and word embeddings. In preliminary experiments, we explored
random Gaussian embeddings, as well as pre-trained word embeddings [29, 33]. We found that these
fixed embeddings are suboptimal for Diffusion-LM compared to end-to-end training.³
As shown in Figure 2, our approach adds a Markov transition from discrete words w to x_0 in the forward process, parametrized by $q_\phi(x_0 \mid w) = \mathcal{N}(\text{EMB}(w), \sigma_0 I)$. In the reverse process, we add a trainable rounding step, parametrized by $p_\theta(w \mid x_0) = \prod_{i=1}^{n} p_\theta(w_i \mid x_i)$, where $p_\theta(w_i \mid x_i)$ is a softmax distribution. The training objectives introduced in §3 now become

$\mathcal{L}^\text{e2e}_\text{vlb}(w) = \mathbb{E}_{q_\phi(x_0 \mid w)}\left[ \mathcal{L}_\text{vlb}(x_0) + \log q_\phi(x_0 \mid w) - \log p_\theta(w \mid x_0) \right],$
$\mathcal{L}^\text{e2e}_\text{simple}(w) = \mathbb{E}_{q_\phi(x_{0:T} \mid w)}\left[ \mathcal{L}_\text{simple}(x_0) + \lVert \text{EMB}(w) - \mu_\theta(x_1, 1) \rVert^2 - \log p_\theta(w \mid x_0) \right]. \quad (2)$

² Our definition of $\mathcal{L}_\text{simple}$ here uses a different parametrization from Ho et al. [12]. We define our squared loss in terms of $\mu_\theta(x_t, t)$ while they express it in terms of $\epsilon_\theta(x_t, t)$.
³ While trainable embeddings perform best on control and generation tasks, we found that fixed embeddings onto the vocabulary simplex were helpful when optimizing for held-out perplexity. We leave discussion of this approach and perplexity results to Appendix F, as the focus of this work is generation quality and not perplexity.
We derive $\mathcal{L}^\text{e2e}_\text{simple}(w)$ from $\mathcal{L}^\text{e2e}_\text{vlb}(w)$ following the simplification in §3.3; derivation details are in Appendix E. Since we are training the embedding function, $q_\phi$ now contains trainable parameters, and we use the reparametrization trick [36, 18] to backpropagate through this sampling step. Empirically, we find the learned embeddings cluster meaningfully: words with the same part-of-speech tag (syntactic role) tend to be clustered, as shown in Figure 3.

Figure 3: A t-SNE [41] plot of the learned word embeddings. Each word is colored by its POS tag.
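
To make the embedding and rounding terms concrete, the sketch below shows one way (our assumption, not the released code) to implement $q_\phi(x_0 \mid w)$ with the reparametrization trick and a softmax rounding distribution $p_\theta(w_i \mid x_i)$ whose logits are negative squared distances to the word embeddings; `vocab_size` and `sigma0` are illustrative placeholders.

```python
# A minimal sketch of the embedding step q_phi(x_0 | w) = N(EMB(w), sigma_0 I) and a
# distance-based softmax rounding p_theta(w | x_0); parametrization choices are assumptions.
import torch
import torch.nn.functional as F

class WordEmbedder(torch.nn.Module):
    def __init__(self, vocab_size, d, sigma0=0.1):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)   # EMB(.) is trained end to end
        self.sigma0 = sigma0                           # illustrative starting noise level

    def sample_x0(self, w):
        """Reparametrized sample x_0 ~ N(EMB(w), sigma_0 I); gradients flow into EMB."""
        mean = self.emb(w)                             # (batch, n, d)
        return mean + self.sigma0 * torch.randn_like(mean)

    def rounding_log_prob(self, w, x0):
        """log p_theta(w | x_0), with logits = -||x_0 - EMB(v)||^2 over all vocabulary items v."""
        weights = self.emb.weight.unsqueeze(0).expand(x0.size(0), -1, -1)   # (batch, vocab, d)
        dists = torch.cdist(x0, weights)                                    # (batch, n, vocab)
        log_probs = F.log_softmax(-dists ** 2, dim=-1)
        return log_probs.gather(-1, w.unsqueeze(-1)).squeeze(-1).sum(-1)    # sum over positions

# Usage: these two terms plug into L_simple^e2e alongside the diffusion loss on x_0.
embedder = WordEmbedder(vocab_size=1000, d=16)         # vocab_size is a placeholder; d=16 as for E2E
w = torch.randint(0, 1000, (4, 64))                    # a batch of token-id sequences of length n=64
x0 = embedder.sample_x0(w)
nll_round = -embedder.rounding_log_prob(w, x0).mean()
```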
4.2 Reducing Rounding Errors

The learned embeddings define a mapping from discrete text to the continuous x_0. We now describe the inverse process of rounding a predicted x_0 back to discrete text. Rounding is achieved by choosing the most probable word for each position, according to $\arg\max_w p_\theta(w \mid x_0) = \arg\max_w \prod_{i=1}^{n} p_\theta(w_i \mid x_i)$. Ideally, this argmax rounding would be sufficient to map back to discrete text, as the denoising steps should ensure that x_0 lies exactly on the embedding of some word. However, empirically, the model fails to generate x_0 that commits to a single word.
One explanation for this phenomenon is that the $\mathcal{L}_\text{simple}(x_0)$ term in our objective (Equation 2) puts insufficient emphasis on modeling the structure of x_0. Recall that we defined $\mathcal{L}_\text{simple}(x_0) = \sum_{t=1}^{T} \mathbb{E}_{x_t} \lVert \mu_\theta(x_t, t) - \hat{\mu}(x_t, x_0) \rVert^2$, where our model $\mu_\theta(x_t, t)$ directly predicts the mean of $p_\theta(x_{t-1} \mid x_t)$ for each denoising step t. In this objective, the constraint that x_0 has to commit to a single word embedding appears only in the terms with t near 0, and we found that this parametrization required careful tuning to force the objective to emphasize those terms (see Appendix H).
Our approach re-parametrizes $\mathcal{L}_\text{simple}$ to force Diffusion-LM to explicitly model x_0 in every term of the objective. Specifically, we derive an analogue to $\mathcal{L}_\text{simple}$ which is parametrized via x_0: $\mathcal{L}^\text{e2e}_{x_0\text{-simple}}(x_0) = \sum_{t=1}^{T} \mathbb{E}_{x_t} \lVert f_\theta(x_t, t) - x_0 \rVert^2$, where our model $f_\theta(x_t, t)$ predicts x_0 directly.⁴ This forces the neural network to predict x_0 in every term, and we found that models trained with this objective quickly learn that x_0 should be precisely centered at a word embedding.
We described how re-parametrization can be helpful for model training, but we also found that the same intuition could be used at decoding time in a technique that we call the clamping trick. In the standard generation approach for an x_0-parametrized model, the model denoises x_t to x_{t-1} by first computing an estimate of x_0 via $f_\theta(x_t, t)$ and then sampling x_{t-1} conditioned on this estimate: $x_{t-1} = \sqrt{\bar{\alpha}}\, f_\theta(x_t, t) + \sqrt{1-\bar{\alpha}}\,\epsilon$, where $\bar{\alpha}_t = \prod_{s=0}^{t}(1-\beta_s)$ and $\epsilon \sim \mathcal{N}(0, I)$.⁵ In the clamping trick, the model additionally maps the predicted vector $f_\theta(x_t, t)$ to its nearest word embedding sequence. Now, the sampling step becomes $x_{t-1} = \sqrt{\bar{\alpha}} \cdot \text{Clamp}(f_\theta(x_t, t)) + \sqrt{1-\bar{\alpha}}\,\epsilon$. The clamping trick forces the predicted vector to commit to a word for intermediate diffusion steps, making the vector predictions more precise and reducing rounding errors.⁶
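
The sketch below illustrates one x_0-parametrized denoising step with the clamping trick, under our own assumptions about tensor shapes; `f_theta` stands in for the trained Diffusion-LM.

```python
# A minimal sketch (not the released code) of one x_0-parametrized denoising step with
# the clamping trick: the predicted x_0 is snapped to its nearest word embedding
# before x_{t-1} is resampled.
import torch

def clamp_to_embeddings(x0_hat, emb_weight):
    """Map each position of the predicted x_0 to its nearest word embedding (the Clamp(.) step)."""
    # x0_hat: (batch, n, d); emb_weight: (vocab, d)
    dists = torch.cdist(x0_hat, emb_weight.unsqueeze(0).expand(x0_hat.size(0), -1, -1))
    nearest = dists.argmin(dim=-1)                      # (batch, n) word ids
    return emb_weight[nearest]                          # (batch, n, d) clamped vectors

def denoise_step(f_theta, x_t, t, alpha_bar, emb_weight, clamp=True):
    """x_{t-1} = sqrt(abar) * Clamp(f_theta(x_t, t)) + sqrt(1 - abar) * eps."""
    x0_hat = f_theta(x_t, t)                            # the model predicts x_0 directly
    if clamp:
        x0_hat = clamp_to_embeddings(x0_hat, emb_weight)
    ab = alpha_bar[t]
    eps = torch.randn_like(x_t)
    return ab.sqrt() * x0_hat + (1 - ab).sqrt() * eps
```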

⁴ Predicting x_0 and x_{t-1} is equivalent up to scaling constants, as the distribution of x_{t-1} can be obtained in closed form via the forward process $x_{t-1} = \sqrt{\bar{\alpha}}\, x_0 + \sqrt{1-\bar{\alpha}}\,\epsilon$; see Appendix E for further details.
⁵ This follows from the marginal distribution $q(x_t \mid x_0)$, which is a closed-form Gaussian since all the Markov transitions are Gaussian.
⁶ Intuitively, applying the clamping trick to early diffusion steps with t near T may be sub-optimal, because the model hasn't figured out which words to commit to. Empirically, applying the clamping trick for all diffusion steps doesn't hurt performance much. But to follow this intuition, one could also set the starting step of the clamping trick as a hyperparameter.

5 Decoding and Controllable Generation with Diffusion-LM

Having described the Diffusion-LM, we now consider the problem of controllable text generation
(§5.1) and decoding (§5.2).

5.1 Controllable Text Generation

We now describe a procedure that enables plug-and-play control on Diffusion-LM. Our approach to
control is inspired by the Bayesian formulation in §3.1, but instead of performing control directly on
the discrete text, we perform control on the sequence of continuous latent variables x0:T defined by
Diffusion-LM, and apply the rounding step to convert these latents into text.
Controlling x_{0:T} is equivalent to decoding from the posterior $p(x_{0:T} \mid c) = \prod_{t=1}^{T} p(x_{t-1} \mid x_t, c)$, and we decompose this joint inference problem into a sequence of control problems at each diffusion step: $p(x_{t-1} \mid x_t, c) \propto p(x_{t-1} \mid x_t) \cdot p(c \mid x_{t-1}, x_t)$. We further simplify $p(c \mid x_{t-1}, x_t) = p(c \mid x_{t-1})$ via conditional independence assumptions from prior work on controlling diffusions [40]. Consequently, for the t-th step, we run a gradient update on x_{t-1}:
$\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t, c) = \nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t) + \nabla_{x_{t-1}} \log p(c \mid x_{t-1}),$
where both log p(xt−1 | xt ) and log p(c | xt−1 ) are differentiable: the first term is parametrized by
Diffusion-LM, and the second term is parametrized by a neural network classifier.
Similar to work in the image setting [8, 40], we train the classifier on the diffusion latent variables and run gradient updates on the latent space x_{t-1} to steer it towards fulfilling the control. These image diffusion works take one gradient step on $\nabla_{x_{t-1}} \log p(c \mid x_{t-1})$ per diffusion step. To improve performance on text and speed up decoding, we introduce two key modifications: fluency regularization and multiple gradient steps.
To generate fluent text, we run gradient updates on a control objective with fluency regularization: $\lambda \log p(x_{t-1} \mid x_t) + \log p(c \mid x_{t-1})$, where λ is a hyperparameter that trades off fluency (the first term) and control (the second term). While existing controllable generation methods for diffusions do not include the $\lambda \log p(x_{t-1} \mid x_t)$ term in the objective, we found this term to be instrumental for generating fluent text. The resulting controllable generation process can be viewed as a stochastic decoding method that balances maximizing and sampling from $p(x_{t-1} \mid x_t, c)$, much like popular text generation techniques such as nucleus sampling [13] or sampling with low temperature. To improve control quality, we take multiple gradient steps per diffusion step: we run 3 steps of the Adagrad⁷ [10] update for each diffusion step. To mitigate the increased computation cost, we downsample the diffusion steps from 2000 to 200, which speeds up our controllable generation algorithm without hurting sample quality much.
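
A minimal sketch of this per-step update is given below; `diffusion_log_prob` and `classifier_log_prob` are hypothetical stand-ins for log p(x_{t-1} | x_t) under the Diffusion-LM and log p(c | x_{t-1}) under the latent-space classifier, and the default λ and learning rate are illustrative values from the ranges in Appendix B.

```python
# A minimal sketch (hypothetical function names, not the authors' implementation) of the
# classifier-guided update on the latent x_{t-1}: a few Adagrad steps on
# lambda * log p(x_{t-1} | x_t) + log p(c | x_{t-1}) at each diffusion step.
import torch

def guided_update(x_prev, x_t, t, diffusion_log_prob, classifier_log_prob, control,
                  lam=0.01, lr=0.1, n_steps=3):
    """Refine x_{t-1} so it stays fluent under Diffusion-LM and satisfies the control c."""
    x = x_prev.detach().clone().requires_grad_(True)
    opt = torch.optim.Adagrad([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        # Both terms are differentiable: the first comes from the Diffusion-LM,
        # the second from a classifier trained on the diffusion latents.
        fluency = diffusion_log_prob(x, x_t, t)           # log p(x_{t-1} | x_t)
        control_fit = classifier_log_prob(control, x, t)  # log p(c | x_{t-1})
        loss = -(lam * fluency + control_fit)             # gradient ascent on the objective
        loss.backward()
        opt.step()
    return x.detach()
```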

5.2 Minimum Bayes Risk Decoding

Many conditional text generation tasks require a single high-quality output sequence, such as machine translation or sentence infilling. In these settings, we apply Minimum Bayes Risk (MBR) decoding [23] to aggregate a set of samples S drawn from the Diffusion-LM, and select the sample that achieves the minimum expected risk under a loss function L (e.g., negative BLEU score):
$\hat{w} = \arg\min_{w \in S} \sum_{w' \in S} \frac{1}{|S|} L(w, w').$
We found that MBR decoding often returned high-quality outputs, since a low-quality sample would be dissimilar from the remaining samples and penalized by the loss function.
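
As an illustration, the sketch below performs MBR selection with negative sentence-level BLEU as the loss (using NLTK's `sentence_bleu` as one assumed choice of metric); the exact loss and sample set are task-dependent.

```python
# A minimal sketch of MBR selection over a set of samples, with L(w, w') = -BLEU(w, w').
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mbr_select(samples):
    """Pick the sample with minimum average loss against all samples in the set."""
    smooth = SmoothingFunction().method1
    def loss(w, w_ref):
        return -sentence_bleu([w_ref.split()], w.split(), smoothing_function=smooth)
    best, best_risk = None, float("inf")
    for w in samples:
        risk = sum(loss(w, w2) for w2 in samples) / len(samples)
        if risk < best_risk:
            best, best_risk = w, risk
    return best

# Usage: draw |S| samples from Diffusion-LM for the same input, then select one.
print(mbr_select(["the cat sat on the mat", "a cat sat on a mat", "cats sit on mats"]))
```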

6 Experimental Setup

With the above improvements on training (§4) and decoding (§5), we train Diffusion-LM for two
language modeling tasks. We then apply the controllable generation method to 5 classifier-guided
control tasks, and apply MBR decoding to a classifier-free control task (i.e. infilling).
⁷ We tried ablations that replaced Adagrad with SGD, but we found Adagrad to be substantially less sensitive to hyperparameter tuning.

input (Semantic Content) food : Japanese
output text Browns Cambridge is good for Japanese food and also children friendly near The Sorrento .
input (Parts-of-speech) PROPN AUX DET ADJ NOUN NOUN VERB ADP DET NOUN ADP DET NOUN PUNCT
output text Zizzi is a local coffee shop located on the outskirts of the city .
input (Syntax Tree) (TOP (S (NP (*) (*) (*)) (VP (*) (NP (NP (*) (*))))))
output text The Twenty Two has great food
input (Syntax Spans) (7, 10, VP)
output text Wildwood pub serves multicultural dishes and is ranked 3 stars
input (Length) 14
output text Browns Cambridge offers Japanese food located near The Sorrento in the city centre .
input (left context) My dog loved tennis balls.
input (right context) My dog had stolen every one and put it under there.
output text One day, I found all of my lost tennis balls underneath the bed.

Table 1: Example input control and output text for each control task.

6.1 Datasets and Hyperparameters

We train Diffusion-LM on two datasets: E2E [28] and ROCStories [26]. The E2E dataset consists
of 50K restaurant reviews labeled by 8 fields including food type, price, and customer rating. The
ROCStories dataset consists of 98K five-sentence stories, capturing a rich set of causal and temporal
commonsense relations between daily events. This dataset is more challenging to model than E2E,
because the stories contain a larger vocabulary of 11K words and more diverse semantic content.
Our Diffusion-LM is based on the Transformer [42] architecture with 80M parameters, a sequence length n = 64, diffusion steps T = 2000, and a square-root noise schedule (see Appendix A for
details). We treat the embedding dimension as a hyperparameter, setting d = 16 for E2E and d = 128
for ROCStories. See Appendix B for hyperparameter details. At decoding time, we downsample
to 200 diffusion steps for E2E and maintain 2000 steps for ROCStories. Decoding Diffusion-LM
for 200 steps is still 7x slower than decoding autoregressive LMs. For controllable generation, our
method based on Diffusion-LM is 1.5x slower than FUDGE but 60x faster than PPLM.

6.2 Control tasks

We consider 6 control tasks shown in Table 1: the first 4 tasks rely on a classifier, and the last 2 tasks are classifier-free.⁸ For each control task (e.g., semantic content), we sample 200 control targets c (e.g., rating=5 star) from the validation splits, and we generate 50 samples for each control target. To evaluate the fluency of the generated text, we follow prior work [44, 6] and feed the generated text to a teacher LM (i.e., a carefully fine-tuned GPT-2 model) and report the perplexity of the generated text under the teacher LM. We call this metric lm-score (denoted as lm): a lower lm-score indicates better sample quality.⁹ We define success metrics for each control task as follows:
Semantic Content. Given a field (e.g., rating) and value (e.g., 5 star), generate a sentence that covers
field=value, and report the success rate by exact match of ‘value’.
Parts-of-speech. Given a sequence of parts-of-speech (POS) tags (e.g., Pronoun Verb Determiner
Noun), generate a sequence of words of the same length whose POS tags (under an oracle POS tagger)
match the target (e.g., I ate an apple). We quantify success via word-level exact match.
Syntax Tree. Given a target syntactic parse tree (see Figure 1), generate text whose syntactic parse
matches the given parse. To evaluate the success, we parse the generated text by an off-the-shelf
parser [20], and report F1 scores.
Syntax Spans. Given a target (span, syntactic category) pair, generate text whose parse tree over span [i, j] matches the target syntactic category (e.g., prepositional phrase). We quantify success via the fraction of spans that match exactly.

⁸ Length is classifier-free for our Diffusion-LM based methods, but other methods still require a classifier.
⁹ Prior works [44, 6] use GPT [32] as the teacher LM, whereas we use a fine-tuned GPT-2 model because our base autoregressive LM and Diffusion-LM both generate UNK tokens, which do not exist in the pretrained vocabulary of GPT.

Length. Given a target length 10, . . . , 40, generate a sequence with a length within ±2 of the target.
In the case of Diffusion-LM, we treat this as a classifier-free control task.
Infilling. Given a left context (O1) and a right context (O2) from the aNLG dataset [2], the goal is to generate a sentence that logically connects O1 and O2. For evaluation, we report both automatic and human evaluation from the Genie leaderboard [17].

6.3 Classifier-Guided Control Baselines

For the first 5 control tasks, we compare our method with PPLM, FUDGE, and a fine-tuning
oracle. Both PPLM and FUDGE are plug-and-play controllable generation approaches based on an
autoregressive LM, which we train from scratch using the GPT-2 small architecture [33].
PPLM [6]. This method runs gradient ascent on the LM activations to increase the classifier probabilities and the language model probabilities, and has been successful on simple attribute control. We apply PPLM to control semantic content, but not the remaining 4 tasks, which require positional information that PPLM's classifier lacks.
FUDGE [44]. For each control task, FUDGE requires a future discriminator that takes in a prefix sequence and predicts whether the complete sequence would satisfy the constraint. At decoding time, FUDGE reweights the LM predictions by the discriminator scores.
FT. For each control task, we fine-tune GPT-2 on (control, text) pairs, yielding an oracle conditional language model that is not plug-and-play. We report both the sampling (with temperature 1.0) and beam search (with beam size 4) outputs of the fine-tuned models, denoted FT-sample and FT-search.

6.4 Infilling Baselines

We compare to 3 specialized baseline methods developed in past work for the infilling task.
DELOREAN [30]. This method continuously relaxes the output space of a left-to-right autoregressive LM, and iteratively performs gradient updates on the continuous space to enforce a fluent connection to the right context. This yields a continuous vector, which is rounded back to text.
COLD [31]. COLD specifies an energy-based model that includes fluency constraints (from a left-to-right and a right-to-left LM) and coherence constraints (from lexical overlap). It samples continuous vectors from this energy-based model and rounds them to text.
AR-infilling. We train an autoregressive LM from scratch to do the sentence infilling task [9]. Similar to training Diffusion-LM, we train on the ROCStories dataset, but pre-process it by reordering sentences from (O1, Omiddle, O2) to (O1, O2, Omiddle). At evaluation time, we feed in O1 and O2, and the model generates the middle sentence.

7 Main Results
We train Diffusion-LMs on the E2E and ROCStories datasets. In terms of negative log-likelihood (NLL, lower is better), we find that the variational upper bound on Diffusion-LM's NLL¹⁰ underperforms the equivalent autoregressive Transformer model (2.28 vs. 1.77 for E2E, 3.88 vs. 3.05 for ROCStories), although scaling up model and dataset size partially bridges the gap (3.88 → 3.10 on ROCStories). Our best log-likelihoods required several modifications from §4; we explain these and give detailed log-likelihood results in Appendix F. Despite worse likelihoods, controllable generation based on our Diffusion-LM results in significantly better outputs than systems based on autoregressive LMs, as we will show in §7.1, §7.2, and §7.3.

7.1 Classifier-Guided Controllable Text Generation Results

As shown in Table 2, Diffusion-LM achieves high success and fluency across all classifier-guided
control tasks. It significantly outperforms the PPLM and FUDGE baselines across all 5 tasks.
Surprisingly, our method outperforms the fine-tuning oracle on controlling syntactic parse trees and
spans, while achieving similar performance on the remaining 3 tasks.
¹⁰ Exact log-likelihoods are intractable for Diffusion-LM, so we report the lower bound $\mathcal{L}^\text{e2e}_\text{vlb}$.

               Semantic Content    Parts-of-speech    Syntax Tree      Syntax Spans     Length
               ctrl ↑   lm ↓       ctrl ↑   lm ↓      ctrl ↑   lm ↓    ctrl ↑   lm ↓    ctrl ↑   lm ↓
PPLM           9.9      5.32       -        -         -        -       -        -       -        -
FUDGE          69.9     2.83       27.0     7.96      17.9     3.39    54.2     4.03    46.9     3.11
Diffusion-LM   81.2     2.55       90.0     5.16      86.0     3.71    93.8     2.53    99.9     2.16
FT-sample      72.5     2.87       89.5     4.72      64.8     5.72    26.3     2.88    98.1     3.84
FT-search      89.9     1.78       93.0     3.31      76.4     3.24    54.4     2.19    100.0    1.83

Table 2: Diffusion-LM achieves high success rate (ctrl ↑) and good fluency (lm ↓) across all 5 control
tasks, outperforming the PPLM and FUDGE baselines. Our method even outperforms the fine-tuning
oracle (FT) on controlling syntactic parse trees and spans.

Example 1
Syntactic Parse   ( S ( S ( NP * ) ( VP * ( NP ( NP * * ) ( VP * ( NP ( ADJP * * ) * ) ) ) ) ) * ( S ( NP * * * ) ( VP * ( ADJP ( ADJP * ) ) ) ) )
FUDGE             Zizzi is a cheap restaurant . [incomplete]
Diffusion-LM      Zizzi is a pub providing family friendly Indian food Its customer rating is low
FT                Cocum is a Pub serving moderately priced meals and the customer rating is high

Example 2
Syntactic Parse   ( S ( S ( VP * ( PP * ( NP * * ) ) ) ) * ( NP * * * ) ( VP * ( NP ( NP * * ) ( SBAR ( WHNP * ) ( S ( VP * ( NP * * ) ) ) ) ) ) * )
FUDGE             In the city near The Portland Arms is a coffee and fast food place named The Cricketers which is not family - friendly with a customer rating of 5 out of 5 .
Diffusion-LM      Located on the riverside , The Rice Boat is a restaurant that serves Indian food .
FT                Located near The Sorrento, The Mill is a pub that serves Indian cuisine.
Table 3: Qualitative examples from the Syntax Tree control. The syntactic parse tree is linearized
by nested brackets representing the constituents, and we use the standard PTB syntactic categories.
Tokens within each span are represented as * . We color failing spans red and bold the spans of
interest that we discuss in §7.1.

Controlling syntactic parse trees and spans are challenging tasks for fine-tuning, because conditioning
on the parse tree requires reasoning about the nested structure of the parse tree, and conditioning on
spans requires lookahead planning to ensure the right constituent appears at the target position.
We observe that PPLM fails in semantic content controls and conjecture that this is because PPLM is
designed to control coarse-grained attributes, and may not be useful for more targeted tasks such as
enforcing that a restaurant review contains a reference to Starbucks.
FUDGE performs well on semantic content control but does not perform well on the remaining four
tasks. Controlling a structured output (Parts-of-speech and Syntax Tree) is hard for FUDGE because
making one mistake anywhere in the prefix makes the discriminator assign low probabilities to all
continuations. In other control tasks that require planning (Length and Syntax Spans), the future
discriminator is difficult to train, as it must implicitly perform lookahead planning.
The non-autoregressive nature of our Diffusion-LM allows it to easily solve all the tasks that require
precise future planning (Syntax Spans and Length). We believe that it works well for complex
controls that involve global structures (Parts-of-speech, Syntax Tree) because the coarse-to-fine
representations allow the classifier to exert control on the entire sequence (near t = T ) as well as on
individual tokens (near t = 0).

Qualitative Results. Table 3 shows samples of Syntax Tree control. Our method and fine-tuning
both provide fluent sentences that mostly satisfy controls, whereas FUDGE deviates from the
constraints after the first few words. One key difference between our method and fine-tuning is that
Diffusion-LM is able to correct for a failed span and have suffix spans match the target. In the first
example, the generated span (“Family friendly Indian food”) is wrong because it contains 1 more
word than the target. Fortunately, this error doesn’t propagate to later spans, since Diffusion-LM
adjusts by dropping the conjunction. Analogously, in the second example, the FT model generates a
failed span (“The Mill”) that contains 1 fewer word. However, the FT model fails to adjust in the
suffix, leading to many misaligned errors in the suffix.

               Semantic Content + Syntax Tree            Semantic Content + Parts-of-speech
               semantic ctrl ↑   syntax ctrl ↑   lm ↓    semantic ctrl ↑   POS ctrl ↑   lm ↓
FUDGE          61.7              15.4            3.52    64.5              24.1         3.52
Diffusion-LM   69.8              74.8            5.92    63.7              69.1         3.46
FT-PoE         61.7              29.2            2.77    29.4              10.5         2.97

Table 4: In this experiment, we compose semantic control and syntactic control: Diffusion-LM
achieves higher success rate (ctrl ↑) at some cost of fluency (lm ↓). Our method outperforms both
FUDGE and FT-PoE (product of experts of two fine-tuned models) on control success rate, especially
for the structured syntactic controls (i.e. syntactic parse tree and POS).

             Automatic Eval                                      Human Eval
             BLEU-4 ↑   ROUGE-L ↑   CIDEr ↑   BERTScore ↑
Left-only    0.9        16.3        3.5       38.5               n/a
DELOREAN     1.6        19.1        7.9       41.7               n/a
COLD         1.8        19.5        10.7      42.7               n/a
Diffusion    7.1        28.3        30.7      89.0               0.37 (+0.03 / −0.02)
AR           6.7        27.0        26.9      89.0               0.39 (+0.02 / −0.03)

Table 5: For sentence infilling, Diffusion-LM significantly outperforms prior work COLD [31] and Delorean [30] (numbers taken from paper), and matches the performance of an autoregressive LM (AR) trained from scratch to do infilling.

7.2 Composition of Controls

One unique capability of plug-and-play controllable generation is its modularity. Given classifiers for
multiple independent tasks, gradient guided control makes it simple to generate from the intersection
of multiple controls by taking gradients on the sum of the classifier log-probabilities.
We evaluate this setting on the combination of Semantic Content + Syntax Tree control and Semantic
Content + Parts-of-speech control. As shown in Table 4, our Diffusion-LM achieves a high success
rate for both of the two components, whereas FUDGE gives up on the more global syntactic control.
This is expected because FUDGE fails to control syntax on its own.
Fine-tuned models are good at POS and semantic content control individually but do not compose
these two controls well by product of experts (PoE), leading to a large drop in success rates for both
constraints.

7.3 Infilling Results

As shown in Table 5, our Diffusion-LM significantly outperforms continuous-relaxation-based methods for infilling (COLD and DELOREAN). Moreover, our method achieves comparable performance to fine-tuning a specialized model for this task: our method has slightly better automatic evaluation scores, and the human evaluation found no statistically significant improvement for either method. These results suggest that Diffusion-LM can solve many types of controllable generation tasks that depend on generation order or lexical constraints (such as infilling) without specialized training.

7.4 Ablation Studies

We verify the importance of our proposed design choices in §4 through two ablation studies. We measure the sample quality of Diffusion-LM using the lm-score (§6.2) on 500 samples.
Learned vs. Random Embeddings (§4.1). Learned embeddings outperform random embeddings on ROCStories, which is the harder language modeling task. The same trend holds for the E2E dataset, but with a smaller margin.
Objective Parametrization (§4.2). We propose to let the diffusion model predict x_0 directly. Here, we compare this with the standard parametrization in image generation, which parametrizes the objective by the noise term ε. Figure 4 (right) shows that parametrizing by x_0 consistently attains good performance across embedding dimensions, whereas parametrizing by ε works fine for small dimensions but quickly collapses for larger dimensions.

8 Conclusion and Limitations


We proposed Diffusion-LM, a novel and controllable language model based on continuous diffusions, which enables new forms of complex fine-grained control tasks. We demonstrate Diffusion-LM's success in 6 fine-grained control tasks: our method almost doubles the control success rate of prior methods and is competitive with baseline fine-tuning methods that require additional training.

Figure 4: We measure the impact of our proposed design choices through lm-score (lower is better), comparing learned vs. random embeddings (left) and the x_0 vs. ε parametrization (right) across embedding dimensions d ∈ {16, 64, 128}. We find both learned embeddings and the x_0 reparametrization substantially improve sample quality.

We find the complex controls enabled by Diffusion-LM to be compelling, and we are excited by how Diffusion-LM is a substantial departure from the current paradigm of discrete autoregressive generation. As with any new technology, there are drawbacks to the Diffusion-LMs that we constructed: (1) it has higher perplexity; (2) decoding is substantially slower; and (3) training converges more slowly. We believe that with more follow-up work and optimization, many of these issues can be addressed, and this approach will turn out to be a compelling way to do controllable generation at scale.

Acknowledgments and Disclosure of Funding


We thank Yang Song, Jason Eisner, Tianyi Zhang, Rohan Taori, Xuechen Li, Niladri Chatterji, and the members of the p-lambda group for early discussions and feedback. We gratefully acknowledge the support of a PECASE award. Xiang Lisa Li is supported by a Stanford Graduate Fellowship.

References
[1] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg.
Structured denoising diffusion models in discrete state-spaces. In A. Beygelzimer, Y. Dauphin,
P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems,
2021. URL https://openreview.net/forum?id=h7-XixPCAL.
[2] Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman,
Hannah Rashkin, Doug Downey, Wen tau Yih, and Yejin Choi. Abductive commonsense
reasoning. In International Conference on Learning Representations, 2020. URL https:
//openreview.net/forum?id=Byg1v1HKDB.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar-
wal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma-
teusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCan-
dlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot
learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Ad-
vances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran
Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/
1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[4] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan.
Wavegrad: Estimating gradients for waveform generation. In International Conference on Learn-
ing Representations, 2021. URL https://openreview.net/forum?id=NsMLjcFaO8O.
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh,

Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay,
Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson,
Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin,
Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier
García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David
Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani
Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat,
Aitor Lewkowycz, Erica Oliveira Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee,
Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason
Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm:
Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[6] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason
Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled
text generation. In International Conference on Learning Representations, 2020. URL https:
//openreview.net/forum?id=H1edEyBKDS.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019.
[8] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image
synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances
in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?
id=AAWuCvzaVt.

[9] Chris Donahue, Mina Lee, and Percy Liang. Enabling language models to fill in the blanks. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
2492–2501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/
2020.acl-main.225. URL https://aclanthology.org/2020.acl-main.225.
[10] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. In J. Mach. Learn. Res., 2010.
[11] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-
autoregressive neural machine translation. In International Conference on Learning Rep-
resentations, 2018. URL https://openreview.net/forum?id=B1l8BtlCb.
[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In
Advances in Neural Information Processing Systems, pages 6840–6851, 2020.
[13] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural
text degeneration. In International Conference on Learning Representations, 2020. URL
https://openreview.net/forum?id=rygGQyrFvH.

[14] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax
flows and multinomial diffusion: Towards non-autoregressive language models. arXiv preprint
arXiv:2102.05379, 2021.
[15] Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg,
and Tim Salimans. Autoregressive diffusion models. In International Conference on Learning
Representations, 2022. URL https://openreview.net/forum?id=Lm8T39vLDTE.
[16] N. Keskar, B. McCann, L. R. Varshney, Caiming Xiong, and R. Socher. Ctrl: A conditional
transformer language model for controllable generation. ArXiv, abs/1909.05858, 2019.
[17] Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi,
Noah A. Smith, and Daniel S. Weld. Genie: A leaderboard for human-in-the-loop evaluation of
text generation. ArXiv, abs/2101.06561, 2021.
[18] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. International Confer-
ence on Learning Representations (ICLR), 2014.
[19] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.
arXiv preprint arXiv:2107.00630, 2021.

[20] Nikita Kitaev and Dan Klein. Constituency parsing with a self-attentive encoder. In Proceedings
of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 2676–2686, Melbourne, Australia, July 2018. Association for Computational
Linguistics. doi: 10.18653/v1/P18-1249. URL https://aclanthology.org/P18-1249.
[21] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile
diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
[22] Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty,
Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative Discriminator Guided Sequence
Generation. arXiv preprint arXiv:2009.06367, 2020.
[23] Shankar Kumar and William Byrne. Minimum Bayes-risk decoding for statistical machine
translation. In Proceedings of the Human Language Technology Conference of the North
American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages
169–176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational
Linguistics. URL https://aclanthology.org/N04-1022.
[24] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A.
Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts
and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Com-
putational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 6691–6706, Online, August 2021. Associa-
tion for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https:
//aclanthology.org/2021.acl-long.522.

[25] Gautam Mittal, Jesse Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation
with diffusion models. arXiv preprint arXiv:2103.16091, March 2021.
[26] Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy
Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper
understanding of commonsense stories. In Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies, pages 839–849, San Diego, California, June 2016. Association for Computational
Linguistics. doi: 10.18653/v1/N16-1098. URL https://aclanthology.org/N16-1098.
[27] Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv
preprint arXiv:2102.09672, 2021.
[28] Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for
end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and
Dialogue, pages 201–206, Saarbrücken, Germany, August 2017. Association for Computational
Linguistics. doi: 10.18653/v1/W17-5525. URL https://aclanthology.org/W17-5525.
[29] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for
word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association
for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.
org/D14-1162.

[30] Lianhui Qin, Vered Shwartz, Peter West, Chandra Bhagavatula, Jena D. Hwang, Ronan Le Bras,
Antoine Bosselut, and Yejin Choi. Back to the future: Unsupervised backprop-based decoding
for counterfactual and abductive commonsense reasoning. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language Processing (EMNLP), pages 794–805,
Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.
emnlp-main.58. URL https://aclanthology.org/2020.emnlp-main.58.
[31] Lianhui Qin, Sean Welleck, Daniel Khashabi, and Yejin Choi. Cold decoding: Energy-based
constrained text generation with langevin dynamics, 2022. URL https://arxiv.org/abs/
2202.11705.

[32] Alec Radford and Karthik Narasimhan. Improving language understanding by generative
pre-training. https://openai.com/blog/language-unsupervised/, 2018.

[33] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. https://openai.com/blog/better-language-models/,
2019.
[34] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical
text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, April
2022.
[35] Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, and Tie-Yan Liu. A study of non-
autoregressive model for sequence generation. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics, pages 149–159, Online, July 2020.
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.15. URL
https://aclanthology.org/2020.acl-main.15.

[36] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[37] Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. Non-autoregressive
machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1098–1108, 2020.
[38] Lei Sha. Gradient-guided unsupervised lexically constrained text generation. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
8692–8703, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/
v1/2020.emnlp-main.701. URL https://aclanthology.org/2020.emnlp-main.701.
[39] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsu-
pervised learning using nonequilibrium thermodynamics. In Francis Bach and David Blei,
editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of
Proceedings of Machine Learning Research, pages 2256–2265, Lille, France, 07–09 Jul 2015.
PMLR. URL https://proceedings.mlr.press/v37/sohl-dickstein15.html.
[40] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2021. URL https://openreview.
net/forum?id=PxTIG12RRHS.

[41] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of
Machine Learning Research, 2008.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Ad-
vances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran
Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

[43] Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language
Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
[44] Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators.
In Proceedings of the 2021 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 3511–3535, Online, June
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.276. URL
https://aclanthology.org/2021.naacl-main.276.

[45] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen,
Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam
Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and
Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022. URL https:
//arxiv.org/abs/2205.01068.

A Diffusion Noise Schedule
Because a diffusion model shares parameters across all diffusion steps, the noise schedule (parametrized by ᾱ_{1:T}) is an important hyperparameter that determines how much weight we assign to each denoising problem. We find that standard noise schedules for continuous diffusions are not robust for text data. We hypothesize that the discrete nature of text and the rounding step make the model insensitive to noise near t = 0. Concretely, adding a small amount of Gaussian noise to a word embedding is unlikely to change its nearest neighbor in the embedding space, making denoising an easy task near t = 0.

[Figure 5: Visualizing the noise standard deviation √(1 − ᾱ_t) of the linear, cosine, and sqrt schedules over 2000 diffusion steps.]

To address this, we introduce a new sqrt noise schedule that is better suited for text, shown in Figure 5 and defined by ᾱ_t = 1 − √(t/T + s), where s is a small constant that corresponds to the starting noise level (we set s = 1e-4 and T = 2000, which sets the initial standard deviation to 0.1). Compared to the standard linear and cosine schedules, our sqrt schedule starts with a higher noise level and increases noise rapidly for the first 50 steps. The sqrt schedule then slows down the noise injection to avoid spending too many steps on the high-noise problems, which may be too difficult to solve well.
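For concreteness, the sqrt schedule can be computed as in the following sketch, assuming T = 2000 and s = 1e-4 as above; the clipping guard is an added numerical safeguard, not part of the paper's definition.

```python
import numpy as np

# Minimal sketch of the sqrt schedule: alpha_bar[t] is the cumulative product term,
# and the noise standard deviation at step t is sqrt(1 - alpha_bar[t]).
def sqrt_alpha_bar(T: int = 2000, s: float = 1e-4) -> np.ndarray:
    t = np.arange(T + 1)
    return np.clip(1.0 - np.sqrt(t / T + s), 0.0, 1.0)  # clip is an added guard

alpha_bar = sqrt_alpha_bar()
noise_std = np.sqrt(1.0 - alpha_bar)
print(noise_std[0], noise_std[50], noise_std[-1])  # ~0.1 at t=0, rises quickly, ~1.0 at t=T
```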

B Hyperparameters
Diffusion-LM hyperparameters. The hyperparameters that are specific to Diffusion-LM include the number of diffusion steps, the architecture of the Diffusion-LM, the embedding dimension, and the noise schedule. We set the number of diffusion steps to 2000, the architecture to BERT-base [7], and the sequence length to 64. For the embedding dimension, we select from d ∈ {16, 64, 128, 256} and choose d = 16 for the E2E dataset and d = 128 for ROCStories. For the noise schedule, we design the sqrt schedule (Appendix A), which is more robust to different parametrizations and embedding dimensions, as shown in Appendix H. However, once we pick the x0-parametrization (§4.2), the advantage of the sqrt schedule is less salient.

Training hyperparameters. We train Diffusion-LMs using the AdamW optimizer with a linearly decaying learning rate starting at 1e-4, dropout of 0.1, and batch size of 64; the total number of training iterations is 200K for the E2E dataset and 800K for the ROCStories dataset. Our Diffusion-LMs are trained on a single GPU: an NVIDIA RTX A5000, NVIDIA GeForce RTX 3090, or NVIDIA A100. It takes approximately 5 hours to train for 200K iterations on a single A100 GPU. To stabilize training under the L^e2e_vlb objective, we find that we need to set gradient clipping to 1.0 and apply importance sampling to reweight each term in L_vlb [27]; neither trick is necessary for the L^e2e_simple objective.
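The following sketch illustrates this training setup (AdamW, learning rate 1e-4 with linear decay, gradient clipping at 1.0 for the vlb objective); `model`, `dataloader`, and `diffusion_loss` are hypothetical placeholders, not the released implementation.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Illustrative training loop under the stated hyperparameters.
def train(model, dataloader, diffusion_loss, total_steps=200_000, clip_grad=True):
    opt = AdamW(model.parameters(), lr=1e-4)
    sched = LambdaLR(opt, lambda step: max(0.0, 1.0 - step / total_steps))  # linear decay
    step = 0
    while step < total_steps:
        for batch in dataloader:
            loss = diffusion_loss(model, batch)
            opt.zero_grad()
            loss.backward()
            if clip_grad:  # needed when training with the vlb objective
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
            sched.step()
            step += 1
            if step >= total_steps:
                break
```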

Controllable Generation hyperparameters. To achieve controllable generation, we run gradient updates on the continuous latents of Diffusion-LM. We use the AdaGrad optimizer [10] to update the latent variables, and we tune the learning rate lr ∈ {0.05, 0.1, 0.15, 0.2} and the trade-off parameter λ ∈ {0.1, 0.01, 0.001, 0.0005}. Different plug-and-play controllable generation approaches trade off fluency and control by tuning different hyperparameters: PPLM uses the number of gradient updates per token, denoted k, and we tune k ∈ {10, 30}; FUDGE uses the trade-off parameter λ_FUDGE, and we tune λ_FUDGE ∈ {16, 8, 4, 2}. Table 6 contains the selected hyperparameters for each control task. Both PPLM and FUDGE have additional hyperparameters, and we follow the instructions from the original papers to set those: for PPLM, we set the learning rate to 0.04 and the KL-scale to 0.01; for FUDGE, we set the precondition top-K to 200 and the post top-K to 10.
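For concreteness, a rough sketch of the per-step latent update is shown below; `classifier_logp` and `fluency_logp` are hypothetical stand-ins for log p(c | x) and the diffusion model's fluency term, and this is not the exact released implementation.

```python
import torch

# Gradient-based control: update the continuous latent x_t with AdaGrad so that it
# satisfies the control (classifier term) while staying fluent (lambda-weighted term).
def control_update(x_t, t, control, classifier_logp, fluency_logp,
                   lr=0.1, lam=0.01, n_steps=3):
    x = x_t.detach().clone().requires_grad_(True)
    opt = torch.optim.Adagrad([x], lr=lr)
    for _ in range(n_steps):  # a few updates per diffusion step (cf. Appendix C)
        loss = -(classifier_logp(x, control) + lam * fluency_logp(x, t))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```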

C Decoding Speed
Sampling from Diffusion-LMs requires iterating through the 2000 diffusion steps, yielding O(2000) f_θ model calls. In contrast, sampling from autoregressive LMs takes O(n) calls, where n is the sequence length. Therefore, decoding Diffusion-LM is slower than decoding autoregressive LMs in the short and medium-length sequence regimes. Concretely, it takes around 1 minute to decode 50 sequences of length 64.

             Semantic      Parts-of-speech   Syntax Tree    Syntax Spans    Length
             tradeoff lr   tradeoff lr       tradeoff lr    tradeoff lr     tradeoff lr
PPLM         30      0.04  -        -        -       -      -       -       -       -
FUDGE        8.0     -     20.0     -        20.0    -      20.0    -       2.0     -
Diffusion-LM 0.01    0.1   0.0005   0.05     0.0005  0.2    0.1     0.15    0.01    0.1

Table 6: Hyperparameters for controllable generation methods.
To speed up decoding, we tried skipping steps in the generative diffusion process, downsampling the 2000 steps to 200 steps. Concretely, we set T = 200 and downsample the noise schedule to ᾱ't = ᾱ10t, which is equivalent to treating each unit transition as the transition xt → xt+10. We decode Diffusion-LM using this new noise schedule and discretization. We find that this naive approach does not hurt sample quality for simple language modeling tasks like E2E, but it does hurt sample quality for harder language modeling tasks like ROCStories.
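A minimal sketch of this downsampling, assuming `alpha_bar` holds the trained T = 2000 schedule (e.g., the sqrt schedule of Appendix A):

```python
import numpy as np

# Keep every 10th alpha_bar, giving a 200-step schedule whose unit transition
# corresponds to x_t -> x_{t+10} under the original discretization.
def downsample_schedule(alpha_bar: np.ndarray, stride: int = 10) -> np.ndarray:
    return alpha_bar[::stride]  # alpha_bar'[t] = alpha_bar[stride * t]
```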
For plug-and-play controllable generation tasks, extant approaches are even slower. PPLM takes
around 80 minutes to generate 50 samples (without batching), because it needs to run 30 gradient
updates for each token. FUDGE takes 50 seconds to generate 50 samples (with batching), because it
needs to call the lightweight classifier for each partial sequence, requiring 200 classifier calls for each
token, yielding 100× sequence length calls. We can batch the classifier calls, but it sometimes limits
batching across samples due to limited GPU memory. Our Diffusion-LM takes around 80 seconds to
generate 50 samples (with batching). Our method downsamples the number of diffusion steps to 200,
and it takes 3 classifier calls per diffusion step, yielding 600 model calls in total.

D Classifiers for Classifier-Guided Controls


Semantic Content. We train an autoregressive LM (GPT-2 small architecture) to predict the (field,
value) pair conditioned on text. To parametrize log p(c | xt ), we compute the logprob of “value” per
token.
Parts-of-speech. The classifier is parametrized as a parts-of-speech tagger, which estimates the probability of the target POS sequence conditioned on the latent variables. This tagger uses a BERT-base architecture: the input is the concatenated word embeddings, and the output is a softmax distribution over all POS tags for each input word. log p(c | xt) is the sum of POS log-probs for each word in the sequence.
Syntax Tree. We train a Transformer-based constituency parser [20]. Our parser makes locally normalized predictions for each span, predicting either "not a constituent" or a label for the constituent (e.g., Noun Phrase). log p(c | xt) is the sum of log-probs for each labeled and non-constituency span in the sequence.
Syntax Span. We use the same parser trained for the syntax tree. log p(c | xt ) is the log-probability
that the target span is annotated with the target label.
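As an illustration of this classifier-guided interface, the sketch below computes log p(c | x_t) for the parts-of-speech control; `pos_tagger` is a hypothetical stand-in for the BERT-base tagger described above, not the released classifier.

```python
import torch
import torch.nn.functional as F

# log p(c | x_t) for POS control: sum of per-word tag log-probabilities under a
# tagger that maps latent word vectors x_t (shape [n, d]) to per-position logits.
def pos_log_prob(pos_tagger, x_t: torch.Tensor, target_tags: torch.Tensor) -> torch.Tensor:
    logits = pos_tagger(x_t.unsqueeze(0)).squeeze(0)            # [n, num_tags]
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(1, target_tags.unsqueeze(1)).sum()  # sum over positions
```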

E End-to-end Objective Derivations


For continuous diffusion models (§3.3), L_simple is derived from the canonical objective L_vlb by reweighting each term. The first T terms in L_vlb are all KL divergences between two Gaussian distributions, which have closed-form solutions. Take the t-th term for example:

$$\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{q(x_{t-1}\mid x_0, x_t)}{p_\theta(x_{t-1}\mid x_t)}\right] = \mathbb{E}_{q(x_{1:T}|x_0)}\left[\frac{1}{2\sigma_t^2}\,\lVert \mu_\theta(x_t, t) - \hat{\mu}(x_t, x_0)\rVert^2\right] + C, \tag{3}$$

where C is a constant, μ̂ is the mean of the posterior q(x_{t−1} | x_0, x_t), and μ_θ is the mean of p_θ(x_{t−1} | x_t) predicted by the diffusion model. Intuitively, this simplification matches the predicted mean of x_{t−1} to its true posterior mean. The simplification removes the constant C and the scaling factor 1/(2σ_t^2), yielding one term in L_simple: $\mathbb{E}_{q(x_{1:T}|x_0)}\left[\lVert \mu_\theta(x_t, t) - \hat{\mu}(x_t, x_0)\rVert^2\right]$.

To apply continuous diffusion to model discrete text, we design Diffusion-LM (§4.1) and propose a new end-to-end training objective (equation (2)) that learns the diffusion model and the embedding parameters jointly. The objective L^e2e_vlb can be written out as

$$L^{\mathrm{e2e}}_{\mathrm{vlb}}(w) = \mathbb{E}_{q_\phi(x_0|w)}\big[L_{\mathrm{vlb}}(x_0) + \log q_\phi(x_0|w) - \log p_\theta(w|x_0)\big]$$
$$= \mathbb{E}_{q_\phi(x_{0:T}|w)}\Bigg[\underbrace{\log\frac{q(x_T|x_0)}{p_\theta(x_T)}}_{L_T} + \sum_{t=2}^{T}\underbrace{\log\frac{q(x_{t-1}|x_0,x_t)}{p_\theta(x_{t-1}|x_t)}}_{L_{t-1}} + \underbrace{\log\frac{q_\phi(x_0|w)}{p_\theta(x_0|x_1)}}_{L_0} \underbrace{-\,\log p_\theta(w|x_0)}_{L_{\mathrm{round}}}\Bigg]$$

We apply the same simplification which transforms L_vlb → L_simple to transform L^e2e_vlb → L^e2e_simple:

$$\mathbb{E}_{q_\phi(x_{0:T}|w)}[L_T] \;\to\; \mathbb{E}\big[\lVert \mathbb{E}_{x_T\sim q}[x_T|x_0] - 0\rVert^2\big] = \mathbb{E}\big[\lVert \hat{\mu}(x_T; x_0)\rVert^2\big]$$
$$\mathbb{E}_{q_\phi(x_{0:T}|w)}[L_{t-1}] \;\to\; \mathbb{E}\big[\lVert \mathbb{E}_{x_{t-1}\sim q}[x_{t-1}|x_0,x_t] - \mathbb{E}_{x_{t-1}\sim p_\theta}[x_{t-1}|x_t]\rVert^2\big] = \mathbb{E}\big[\lVert \hat{\mu}(x_t, x_0) - \mu_\theta(x_t, t)\rVert^2\big]$$
$$\mathbb{E}_{q_\phi(x_{0:T}|w)}[L_0] \;\to\; \mathbb{E}\big[\lVert \mathbb{E}_{x_0\sim q_\phi}[x_0|w] - \mathbb{E}_{x_0\sim p_\theta}[x_0|x_1]\rVert^2\big] = \mathbb{E}\big[\lVert \mathrm{EMB}(w) - \mu_\theta(x_1, 1)\rVert^2\big]$$

It is worth noting that the first term is constant if the noise schedule satisfies ᾱ_T = 0, which guarantees that x_T is pure Gaussian noise. In contrast, if the noise schedule does not go all the way to ᾱ_T = 0, so that x_T is not pure Gaussian noise, we need to include this regularization term to prevent the embeddings from learning overly large norms. Embeddings with large norms are a degenerate solution: they make all the other denoising transitions easy to predict, but they make it impossible to sample from p(x_T) accurately.
Combining these terms yields L^e2e_simple:

$$L^{\mathrm{e2e}}_{\mathrm{simple}}(w) = \mathbb{E}_{q_\phi(x_{0:T}|w)}\Bigg[\lVert\hat{\mu}(x_T; x_0)\rVert^2 + \sum_{t=2}^{T}\lVert\hat{\mu}(x_t, x_0) - \mu_\theta(x_t, t)\rVert^2\Bigg] + \mathbb{E}_{q_\phi(x_{0:1}|w)}\Big[\lVert\mathrm{EMB}(w) - \mu_\theta(x_1, 1)\rVert^2 - \log p_\theta(w|x_0)\Big].$$

Intuitively, we learn a Transformer model that takes as input (x_t, t) ∈ R^{nd} × R, with the goal of predicting the distribution of x_{t−1} ∈ R^{nd}. It is worth noting that this Transformer model is shared across all the diffusion steps t = 1 . . . T. As demonstrated in the derivation of L^e2e_simple, the most natural choice is to directly parametrize the neural network to predict the mean of x_{t−1} | x_t; we call this the μ_θ-parametrization.
There are other parametrizations that are equivalent to the μ_θ-parametrization up to a scaling constant. For example, in §4.2, we can train the Transformer model to directly predict x_0 via f_θ(x_t, t) and use the tractable Gaussian posterior q(x_{t−1} | x_0, x_t) to compute the mean of x_{t−1}, which has a closed-form solution conditioned on the predicted x_0 and the observed x_t:

$$\hat{\mu}(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t.$$

$$\begin{aligned}
&\lVert\hat{\mu}(x_t, x_0) - \mu_\theta(x_t, t)\rVert^2\\
&= \Big\lVert \Big(\tfrac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \tfrac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t\Big) - \Big(\tfrac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, f_\theta(x_t, t) + \tfrac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t\Big)\Big\rVert^2\\
&= \Big\lVert \tfrac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\big(x_0 - f_\theta(x_t, t)\big)\Big\rVert^2\\
&\propto \lVert x_0 - f_\theta(x_t, t)\rVert^2
\end{aligned}$$

Dataset Small AR Small Diffusion Medium Diffusion
E2E 1.77 2.28 -
ROCStories 3.05 3.88 -
ROCStories (+GPT-J) 2.41 3.59 3.10

Table 7: Log-likelihood results (nats per token)

These two parametrizations differ by a constant scaling, and we apply the x_0-parametrization to all terms in L^e2e_simple to reduce rounding errors, as discussed in §4.2:

$$L^{\mathrm{e2e}}_{x_0\text{-simple}}(w) = \mathbb{E}_{q_\phi(x_{0:T}|w)}\Bigg[\lVert\hat{\mu}(x_T; x_0)\rVert^2 + \sum_{t=2}^{T}\lVert x_0 - f_\theta(x_t, t)\rVert^2\Bigg] + \mathbb{E}_{q_\phi(x_{0:1}|w)}\Big[\lVert\mathrm{EMB}(w) - f_\theta(x_1, 1)\rVert^2 - \log p_\theta(w|x_0)\Big].$$

To generate samples from a Diffusion-LM with the x_0-parametrization, at each diffusion step the model estimates x_0 via f_θ(x_t, t), and we then sample x_{t−1} from q(x_{t−1} | f_θ(x_t, t), x_t), which is fed as input to the next diffusion step.
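A single reverse step under the x_0-parametrization could look like the following sketch, assuming `alpha_bar` is the cumulative noise schedule stored as a torch tensor and `sigma_t` is the posterior standard deviation at step t; this is an illustration, not the released implementation.

```python
import torch

# Predict x0 with f_theta, then sample x_{t-1} from the Gaussian posterior
# q(x_{t-1} | x0_hat, x_t) whose mean is the closed form derived above.
def reverse_step(f_theta, x_t, t, alpha_bar, sigma_t):
    x0_hat = f_theta(x_t, t)
    a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
    alpha_t = a_t / a_prev                      # per-step alpha
    beta_t = 1.0 - alpha_t
    mean = (torch.sqrt(a_prev) * beta_t / (1.0 - a_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1.0 - a_prev) / (1.0 - a_t)) * x_t
    return mean + sigma_t * torch.randn_like(x_t)
```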

F Log-Likelihood Models and Results


To investigate Diffusion-LM's log-likelihood performance, we make several departures from the training procedure of §4. Ultimately, the log-likelihood improvements described in this section did not translate into better generation quality in our experiments, and we therefore focus on the original method in the rest of the paper. Our likelihood models are trained as follows:
• Instead of training a diffusion model on sequences of low-dimensional token embeddings, we train a model directly on sequences of one-hot token vectors.
• Following the setup of Kingma et al. [19], we train a continuous-time diffusion model
against the log-likelihood bound and learn the noise schedule simultaneously with the rest
of the model to minimize the loss variance.
• Because our model predicts sequences of one-hot vectors, we use a softmax nonlinearity at
its output and replace all squared-error terms in the loss function with cross-entropy terms.
This choice of surrogate loss led to better optimization, even though we evaluate against the
original loss with squared-error terms.
• The model applies the following transformation to its inputs before any Transformer layers: x := softmax(α(t)x + β(t)), where α(t) ∈ R and β(t) ∈ Rv are learned functions of the diffusion timestep t parameterized by MLPs (v is the vocabulary size); a sketch of this transform follows the list.
• At inference time, we omit the rounding procedure in §4.2.
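A minimal sketch of this input transform is shown below; the MLP sizes and the timestep-embedding interface are assumptions, not the exact released architecture.

```python
import torch
import torch.nn as nn

# x := softmax(alpha(t) * x + beta(t)), with alpha(t) scalar and beta(t) in R^v,
# both produced by small MLPs over a timestep embedding.
class InputTransform(nn.Module):
    def __init__(self, vocab_size: int, t_dim: int = 128):
        super().__init__()
        self.alpha = nn.Sequential(nn.Linear(t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, 1))
        self.beta = nn.Sequential(nn.Linear(t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, vocab_size))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: [batch, n, vocab_size] noisy one-hot sequence; t_emb: [batch, t_dim]
        a = self.alpha(t_emb).unsqueeze(1)  # [batch, 1, 1]
        b = self.beta(t_emb).unsqueeze(1)   # [batch, 1, vocab_size]
        return torch.softmax(a * x + b, dim=-1)
```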
For exact model architecture and training hyperparameter details, please refer to our released code.
We train these diffusion models, as well as baseline autoregressive Transformers, on E2E and
ROCStories and report log-likelihoods in Table 7. We train two sizes of Transformers: “small”
models with roughly 100M parameters and “medium” models with roughly 300M parameters. Both
E2E and ROCStories are small enough datasets that all of our models reach their minimum test loss
early in training (and overfit after that). To additionally compare model performance in a large-dataset
regime, we also present “ROCStories (+GPT-J)” experiments in which we generate 8M examples
of synthetic ROCStories training data by finetuning GPT-J [43] on the original ROCStories data,
pretrain our models on the synthetic dataset, and then finetune and evaluate them on the original
ROCStories data.

G Qualitative Examples
We show randomly sampled outputs of Diffusion-LM both for unconditional generation and for the 5
control tasks. Table 8 shows the unconditional generation results. Table 9, Table 10, Table 12, and

Table 3 show the qualitative samples from span control, POS control, semantic content control, and
syntax tree control, respectively. Table 11 shows the results of length control.

H Additional Ablation Studies


In addition to the 2 ablation studies in §7.4, we provide more ablation results in Figure 6 about architecture choices and the noise schedule.
Learned vs. Random Embeddings (§4.1). Learned embeddings outperform random embeddings on both ROCStories and the E2E dataset, as shown in the first row of Figure 6.
Noise Schedule (Appendix A). We compare the sqrt schedule with the cosine [27] and linear [12] schedules proposed for image modeling. The middle row of Figure 6 demonstrates that the sqrt schedule attains consistently good and stable performance across all dimension and parametrization choices. While the sqrt schedule is less important with the x0-parametrization, it provides a substantially more robust noise schedule under alternative parametrizations.

Transformer vs. U-Net. The U-Net architecture in Ho et al. [12] uses 2D convolutional layers; we imitate their model architecture but replace the 2D convolutions with 1D convolutions, which are suitable for text data. Figure 6 (last row) shows that the Transformer architecture outperforms the U-Net.

I Societal Impacts
On the one hand, strong controllability in language models will help mitigate toxicity, making language models more reliable to deploy. We can also control a model to be more truthful, reducing the amount of inaccurate information it generates. On the other hand, one could also imagine more powerful targeted disinformation (e.g., narrative wedging) derived from this fine-grained controllability.
Towards this end, it may be worth considering generation methods that can watermark the generated outputs without affecting their fluency; this type of watermarking could itself be framed as a controllable generation problem, with distinguishability and fluency as the constraints.

[Figure 6: bar plots for the three additional ablations. Top row: learned vs. random embeddings on E2E and ROCStories across embedding dimensions 16/64/128. Middle row: sqrt vs. cosine vs. linear noise schedules under the two parametrizations (predicting x0 vs. the alternative). Bottom row: Transformer vs. U-Net architecture choice.]


Figure 6: Additional ablation results. The first row shows that Diffusion-LM with trainable embeddings outperforms random embeddings on both datasets (§4.1). The second row demonstrates that the sqrt schedule attains consistently good and stable performance across all dimension and parametrization choices. The last row shows that the Transformer architecture outperforms the U-Net architecture for language modeling.

ROCStories+Aug
Matt was at the store . He was looking at a new toothbrush . He found the perfect one . When
he got home , he bought it . It was bright and he loved it .

I and my friend were hungry . We were looking for some to eat . We went to the grocery store
. We bought some snacks . We decided to pick up some snacks .

I was at the store . I had no money to buy milk . I decided to use the restroom . I went to the
register . I was late to work .

The man wanted to lose weight . He did n’t know how to do weight . He decided to start
walking . He ate healthy and ate less . He lost ten pounds in three months .

I went to the aquarium . I wanted to feed something . I ordered a fish . When it arrived I had
to find something . I was disappointed .
Tom was planning a trip to California . He had fun in the new apartment . He was driving ,
ROCStories

until it began to rain . Unfortunately , he was soaked . Tom stayed in the rain at the beach .

Carrie wanted a new dress . She did not have enough money . She went to the bank to get one
, but saw the missed . Finally , she decided to call her mom . She could not wait to see her
new dress .

Tina went to her first football game . She was excited about it . When she got into the car she
realized she forgot her hand . She ended up getting too late . Tina had to start crying .

Michael was at the park . Suddenly he found a stray cat . He decided to keep the cat . He went
to his parents and demanded a leg . His parents gave him medicine to get it safe .

Tim was eating out with friends . They were out of service . Tim decided to have a pizza
sandwich . Tim searched for several hours . He was able to find it within minutes .
The Waterman is an expensive pub that serves Japanese food . It is located in Riverside and
has a low customer rating .
E2E

A high priced pub in the city centre is The Olive Grove . It is a family friendly pub serving
French food .

The Rice Boat offers moderate priced Chinese food with a customer rating of 3 out of 5 . It is
near Express by Holiday Inn .

There is a fast food restaurant , The Phoenix , in the city centre . It has a price range of more
than £ 30 and the customer ratings are low .

The Mill is a coffee shop based in the city centre area near to The Sorrento . It is in the high
price range and serves Indian food .

Table 8: Randomly sampled examples generated by unconditional sampling from Diffusion-LMs trained on 3 datasets. ROCStories+Aug denotes ROCStories with data augmentation: the augmented data is generated by first fine-tuning GPT-J on the ROCStories dataset and then sampling the large GPT-J model to generate 1M stories.

target span [3, 5, PP]
FUDGE UNK the UNK for Italian food , The Eagle coffee shop is near Burger King in the riverside
area . The Eagle has a customer rating of 5 out of 5 , and isn ’ t family - friendly . The Eagle
has a cheap price range .
Diffusion-LM The Plough , near Café Rouge , is a high priced fast food pub .
FT Along the riverside near Café Rouge is The Golden Curry . It serves Italian food in a family -
friendly environment . It has a low customer rating .
target span [10, 12, PP]
FUDGE Blue Spice is a high price range Fast food located in city centre .
Diffusion-LM The Phoenix is a high priced food restaurant , located near the river .
FT The Punter is a family restaurant with low prices and delicious sushi , located near the Café
Sicilia
target span [9, 14, S]
FUDGE Zizzi pub serves Italian food for adults only . It has been rated average by customers .
Diffusion-LM There is a Chinese restaurant called The Eagle , it has an average customer rating .
FT On the riverside area are located Alimentum , has a very good French food for adults and kids
, UNK price range are over 20 to 25 £ .
target span [4, 16, VP]
FUDGE The Cambridge Blue pub is near the Café Brazil and offers a high price range for their French
food .
Diffusion-LM On the Ranch there is a children friendly pub called The Cricketers with an average customer
rating .
FT The Travellers Rest Beefeater is an average rated restaurant located in the riverside area near
Café Adriatic . Their price range is less than £ 20 .
target span [0, 2, NP]
FUDGE The Golden Palace is a cheap , 5 - star coffee shop , located on the river in the north of the city
centre .
Diffusion-LM The Olive Grove is a pub that provides Indian food in the high price range . It is in the city
centre .
FT The Golden Curry is located in city centre near Café Rouge which provides English food . Its
customer rating is average and is not family - friendly .
target span [12, 13, NP]
FUDGE The Waterman is a family friendly place with a good rating .[missing span]
Diffusion-LM The Vaults is a high priced , family friendly restaurant that serves Italian food .
FT Strada is a restaurant which costs less than £ 20 , but is not family - friendly and has an average
rating .

Table 9: Qualitative output of the syntax span control tasks. The target span (i, j, label) means the span from position i to position j should be a constituent with a specific label: S is sentence, NP is noun phrase, VP is verb phrase, PP is prepositional phrase, etc. We color failed spans red and correct spans green.

target POS PROPN AUX DET ADJ NOUN NOUN VERB ADP DET NOUN ADP DET NOUN PUNCT
FUDGE Aromi is a non family - friendly fast food coffee shop in the riverside area with a low Customer
Rating .
Diffusion-LM Fitzbillies is a cheap coffee shop located on the outskirts of the city .
FT Aromi is a fast food pub located at the centre of the city.
target POS PROPN AUX DET NOUN VERB NOUN ADJ NOUN PUNCT PRON NOUN NOUN AUX
ADJ
FUDGE Cocum is a family - friendly coffee shop , that has a low price range and a low customer rating
.
Diffusion-LM Zizzi is a pub providing restaurant Chinese food . Its customer rating is low
FT Zizzi is a pub providing kids friendly services. Its customer rating is average
target POS DET NOUN PUNCT PROPN VERB ADJ CCONJ ADJ NOUN CCONJ AUX VERB ADP
DET PROPN ADJ PROPN PUNCT
FUDGE A child - friendly coffee shop , Cocum , offers fast food at an average price range of £ 20 - 25 .
Diffusion-LM The Waterman - friendly serves UNK and fast food and is located near the Crown Plaza Hotel .
FT The wine - Strada serves fast and cheap food and is located near the Rainbow Vegetarian Café.
target POS DET PROPN PROPN VERB ADJ NOUN ADP NOUN ADP SYM NUM PUNCT NOUN
NOUN AUX ADJ PUNCT DET PROPN PROPN AUX VERB ADP DET PROPN CCONJ
PROPN ADP PROPN PROPN PUNCT ADJ PUNCT DET NOUN PART AUX VERB PUNCT
FUDGE The Midsummer House offers cheap English food near All Bar One . Rated 5 out of 5 .
Diffusion-LM The Rice Boat provides Chinese food in £ 20 - 25 . Price range is high . The Rice Boat is
located near the Express by Holiday Inn and is kids friendly . The customer rating is high .
FT The Rice Boat welcomes Japanese food with prices under £ 20. Customer ratings are low. The
Rice Boat is located near the Express by Holiday Inn. Convenient. No children’s are allowed.
target POS PROPN PROPN AUX DET ADJ NOUN NOUN ADP DET NOUN NOUN ADP DET PROPN
PUNCT PRON AUX NOUN PUNCT ADJ PUNCT
FUDGE Loch Fyne is a Japanese restaurant with a moderate price range and kid - friendly atmosphere .
Diffusion-LM Browns Cambridge is an Italian restaurant shop in the city centre near The Sorrento . It is
family - friendly .
FT Browns Cambridge is a cheap coffee shop in the riverside area near The Sorrento, that is
family - friendly.
target POS PROPN VERB DET ADJ NOUN NOUN PROPN PUNCT PRON AUX ADJ VERB CCONJ
VERB NOUN SCONJ VERB ADJ NOUN PUNCT
FUDGE Fitzbillies coffee shop has a high price range , children friendly service and serves Japanese
food in riverside with high customer rating .
Diffusion-LM There has a high customer rating . It is kid friendly called The Golden Curry and serves Indian
food .
FT Customers give the French coffee shop Fitzbillies ; it is average rated and offers families where
serving light meals.
target POS DET NUM NUM VERB ADJ NOUN
FUDGE The Twenty Two serves Fast food and is kids friendly .
Diffusion-LM The Twenty Two provides Chinese food
FT The Twenty Two provides Indian food
target POS ADV NOUN ADV ADP PROPN PROPN PUNCT DET PROPN NOUN NOUN VERB ADJ
NOUN NOUN CCONJ AUX PART VERB NOUN NOUN PUNCT
FUDGE UNK your whole family to The Wrestlers , the best UNK the UNK UNK at the river
Diffusion-LM Located in riverside near The Sorrento , Browns Cambridge coffee shop serves Japanese food ,
and is not family - friendly .
FT Even adults only at Loch Fyne, The Rice Boat coffee shop has moderate price range and does
not cater kids age.
target POS DET PROPN AUX DET NUM NOUN NOUN NOUN VERB ADP DET PROPN PROPN
PUNCT
FUDGE The Eagle is a 3 star coffee shop located near Burger King , north of the City centre that
provides low - cost fast food .
Diffusion-LM The Cricketers is a five star coffee shop located near The Portland Arms .
FT The Vaults is a one star coffee shop located near the Café Brazil.

Table 10: Qualitative output of the POS control tasks. The target POS is the sequence of gold
parts-of-speech tags the generated texts should match.

target length 7
FUDGE Wildwood is a cheap Japanese pub . Low rating .
Diffusion-LM The Twenty Two serves Indian food .
FT The Mill is an Indian restaurant .
target length 12
FUDGE The Phoenix is an average Japanese restaurant that is in the City Centre .
Diffusion-LM The Twenty Two serves Chinese food and is not family friendly .
FT Green Man is an average priced restaurant located near All Bar One
target length 17
FUDGE Fitzbillies is an expensive Italian coffee shop in the city centre . It is not child friendly. .
Diffusion-LM The Twenty Two serves Indian food in the city centre . It is not family friendly .
FT For low - priced food and a family - friendly atmosphere, visit Fitzbillies near Express by
Holiday Inn
target length 22
FUDGE The Golden Curry is an English food restaurant located near the Café Rouge in the Riverside
area . The customer rating is average . Children are welcome .
Diffusion-LM Strada is a fast food pub located near Yippee Noodle Bar and has a customer rating of 3 out of
5.
FT There is an Italian kid friendly restaurant in the riverside area near The Sorrento named Browns
Cambridge in the riverside area .
target length 27
FUDGE The Olive Grove is an expensive , children friendly , Fast food restaurant in the city centre .
[missing 9 words]
Diffusion-LM The Eagle is a family friendly coffee shop in the city centre near Burger King . It serves Italian
food and has a low customer rating .
FT A pub in the city centre near Yippee Noodle Bar is named Strada. It serves French food and
has a customer rating of 3 out of 5
target length 32
FUDGE The Golden Curry is a Japanese food restaurant with a high customer Rating , kid friendly and
located along the riverside near Café Rouge . [missing 7 words]
Diffusion-LM There is a family - friendly coffee shop in the city centre , it is called Zizzi . It is cheap and
has a customer rating of 5 out of 5 .
FT In the city centre is a kids friendly place called Green Man. It has Japanese food and is near
All Bar One. It has a price range of £ 20 - 25
target length 37
FUDGE There is a coffee shop called Fitzbillies which offers French food at cheap prices . It is not
family - friendly and has a customer rating of 5 out of 5 . It is in riverside .
Diffusion-LM The Rice Boat provides Indian food in the moderate price range . It is located in the city centre
. It is near Express by Holiday Inn . Its customer rating is 3 out of 5 .
FT For a family friendly coffee shop that serves Italian food, with a customer rating of 5 out of 5
and a cheap price range, try The Eagle. It is located in the riverside area .

Table 11: Qualitative output of the length control tasks, where all the generated texts try to exactly match the target length. We mark words exceeding the target length in red.

target semantic content name : Bibimbap House
FUDGE Clare Hall , the Bibimbap House , serves high end Japanese food in the city centre .
Diffusion-LM Bibimbap House in riverside near Clare Hall has a cheap price range .
FT By Clare Hall is Bibimbap House which serves expensive noodles.
target semantic content name : Travellers Rest Beefeater
FUDGE Clowns near Clare Hall in riverside is a French coffee shop rated 5 out of 5
Diffusion-LM Green Man is an Italian pub located in the city centre near Café UNK .
FT Travellers Rest Beefeater is a reasonably priced restaurant that is family friendly.
target semantic content Type : coffee shop
FUDGE Wildwood is a coffee shop located near Ranch . It is expensive and highly UNK .
Diffusion-LM The Punter is a high priced coffee shop located near Café Sicilia that serves Japanese food . It
is not family - friendly and has a customer rating of 3 out of 5 .
FT Located in the riverside area is the coffee shop Fitzbillies. It has Indian food in the price Range
of less than £ 20 and a low customer Rating. It is not family Friendly.
target semantic content customer rating : low
FUDGE The Waterman is a fast food restaurant that is family - friendly near the city centre . [missing
content]
Diffusion-LM The Rice Boat restaurant has a low customer rating and is located in riverside . It serves Italian
food , and is not family - friendly .
FT The Eagle is low customer rating coffee shop with Italian food in riverside near Burger King.
Its price range is less than £ 20 and is family - friendly.
target semantic content near : The Sorrento
FUDGE Browns Cambridge provides Indian food in the less than £ 20 price range . Its customer rating
is low . [missing content]
Diffusion-LM Near The Sorrento on the riverside is a pub named Taste of Cambridge that serves Japanese
food .
FT Browns Cambridge sells Italian food and is also a coffee shop. It has an average customer
rating. It is located in the riverside area near Crowne Plaza Hotel and yes it is child friendly.
target semantic content food : Italian
FUDGE A non family - friendly Italian pub is Zizzi . It has an average customer rating .
Diffusion-LM Loch Fyne is Italian Japanese restaurant that is kid friendly .
FT situated near All Bar One is a child friendly Italian eatery called Green Man costing more than
£ 30 is a restaurant near the riverside
target semantic content price : high
FUDGE The Vaults is a high priced Italian Pub with a customer rating of 3 out of 5 near Café Adriatic
Diffusion-LM The Punter is a French restaurant with a high price range .
FT A fast food coffee shop that is not kid friendly is called Cocum. It is expensive and gets
average ratings.

Table 12: Qualitative output of the semantic content control task. We mark compliant spans in green, and spans that violate the control target in red.

