The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
[Figure 1 panels: characters generated from the prompts "A photo of a 50 years old man with curly hair.", "A portrait of a man with a mustache and a hat, fauvism.", and "A rendering of a cute albino porcupine, cozy indoor lighting.", each depicted "in the park", "reading a book", "at the beach", and "holding an avocado".]
Figure 1. The Chosen One: Given a text prompt describing a character, our method distills a representation that enables consistent
depiction of the same character in novel contexts.
[Figure 2: generations for the prompt "A plasticine of a cute baby cat with big eyes." (panel label: "Standard").]
… inversion of an existing depiction of a human face [36]. In this work, we argue that in many applications the goal is to generate some consistent character, rather than visually …
…texturing [67], typography generation [35], motion generation [60, 81], and solving inverse problems [32].

Text-to-image personalization. Text-conditioned models cannot generate an image of a specific object or character. To overcome this limitation, a line of works utilizes several images of the same instance to encapsulate new priors in the generative model. Existing solutions range from optimization of text tokens [20, 85, 88] to fine-tuning the parameters of the entire model [6, 70]; in the middle, recent works suggest fine-tuning a small subset of parameters [1, 17, 26, 33, 41, 71, 82]. Models trained in this manner can generate consistent images of the same subject. However, they typically require a collection of images depicting the subject, which naturally narrows their ability to generate any imaginary character. Moreover, when training on a single input image [6], these methods tend to overfit and produce similar images with minimal diversity during inference.

Unlike previous works, our method does not require an input image; instead, it can generate consistent and diverse images of the same character based only on a text description. Additional works aim to bypass the personalization training by introducing a dedicated personalization encoder [3, 16, 21, 37, 42, 76, 90, 93]. Given an image and a prompt, these works can produce images with a character similar to the input. However, as shown in Section 4.1, they lack consistency when generating multiple images from the same input. Concurrently, ConceptLab [66] is able to generate new members of a broad category (e.g., a new pet); in contrast, we seek a consistent instance of a character described by the input text prompt.

Story visualization. Consistent character generation is well studied in the field of story visualization. Early GAN works [43, 80] employ a story discriminator for image-text alignment. Recent works, such as StoryDALL-E [46] and Make-A-Story [62], utilize pre-trained T2I models for the image generation, while an adapter model is trained to embed story captions and previous images into the T2I model. However, those methods cannot generalize to novel characters, as they are trained over specific datasets. More closely related, Jeong et al. [36] generate consistent storybooks by combining textual inversion with a face-swapping mechanism; therefore, their work relies on images of existing human-like characters. TaleCrafter [24] presents a comprehensive pipeline for storybook visualization. However, their consistent character module is based on an existing personalization method that requires fine-tuning on several images of the same character.

Manual methods. Other attempts at achieving consistent character generation using a generative model rely on ad hoc and manually-intensive tricks, such as using text tokens of a celebrity, or a combination of celebrities [64], in order to create a consistent human; however, the generated characters resemble the original celebrities, and this approach does not generalize to other character types (e.g., animals). Users have also proposed to ensure consistency by manually crafting very long and elaborate text prompts [65], or by using image variations [63] and filtering them manually by similarity [65]. Other users suggested generating a full design sheet of a character, then manually filtering the best results and using them for further generation [94]. All these methods are manual, labor-intensive, and ad hoc for specific domains (e.g., humans). In contrast, our method is fully automated and domain-agnostic.

3. Method

As stated earlier, our goal in this work is to enable generation of consistent images of a character (or another kind of visual subject) based on a textual description. We achieve this by iteratively customizing a pre-trained text-to-image model, using sets of images generated by the model itself as training data. Intuitively, we refine the representation of the target character by repeatedly funneling the model's output into a consistent identity. Once the process has converged, the resulting model can be used to generate consistent images of the target character in novel contexts. In this section, we describe our method in detail.

Formally, we are given a text-to-image model M_Θ, parameterized by Θ, and a text prompt p that describes a target character. The parameters Θ consist of a set of model weights θ and a set of custom text embeddings τ. We seek a consistent representation Θ(p) for the target character.

Algorithm 1 Consistent Character Generation
Input: text-to-image diffusion model M, parameterized by Θ = (θ, τ), where θ are the LoRA weights and τ is a set of custom text embeddings; target prompt p; feature extractor F.
Hyper-parameters: number of generated images per step N, minimum cluster size d_min-c, target cluster size d_size-c, convergence criterion d_conv, maximum number of iterations d_iter.
Output: a consistent representation Θ(p).
repeat
    S = ⋃_N F(M_Θ(p))                                      {generate N images and embed them}
    C = K-MEANS++(S, k = ⌊N / d_size-c⌋)
    C = {c ∈ C | d_min-c < |c|}                             {filter small clusters}
    c_cohesive = argmin_{c ∈ C} (1/|c|) Σ_{e ∈ c} ∥e − c_cen∥²
    Θ = argmin_{(θ,τ)} L_rec over c_cohesive
until d_conv ≥ (1/|S|²) Σ_{s1,s2 ∈ S} ∥s1 − s2∥²           {or after d_iter iterations}
return Θ
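To make the iterative procedure concrete, below is a minimal Python sketch of the loop in Algorithm 1. The helpers generate_images, embed_images, and extract_identity are hypothetical stand-ins for SDXL sampling, the feature extractor F, and the personalization step described later in Section 3.2; choose_cohesive_cluster is sketched after Figure 3, and the hyper-parameter values here are illustrative rather than the paper's.

import numpy as np

def the_chosen_one(prompt, theta, N=128, d_conv=0.1, d_iter=20):
    # Sketch of Algorithm 1; generate_images / embed_images / extract_identity are hypothetical helpers.
    for _ in range(d_iter):                                     # at most d_iter refinement rounds
        images = generate_images(prompt, theta, n=N)            # N samples from M_Theta for prompt p
        S = embed_images(images)                                # semantic features F(.), array of shape (N, d)
        idx = choose_cohesive_cluster(S)                        # indices of the most cohesive cluster
        theta = extract_identity(prompt, theta, [images[i] for i in idx])  # optimize L_rec over (theta, tau)
        # convergence: mean pairwise squared distance between all N embeddings
        if np.mean(((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)) <= d_conv:
            break
    return theta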
"Luna is a forest sprite
with green skin and Cohesive
leaves for hair.”
Identity
MΘ F MΘ
Cluster Extract
S
t
C ea
Rep
Figure 3. Method overview. Given an input text prompt, we start by generating numerous images using the text-to-image model MΘ ,
which are embedded into a semantic feature space using the feature extractor F . Next, these embeddings are clustered and the most
cohesive group is chosen, since it contains images with shared characteristics. The “common ground” among the images in this set is used
to refine the representation Θ to better capture and fit the target. These steps are iterated until convergence to a consistent identity.
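The clustering and cohesive-cluster selection step summarized in Algorithm 1 and Figure 3 can be sketched as follows. The cluster-count rule k = ⌊N / d_size-c⌋, the minimum-size filter, and the mean-squared-distance-to-centroid criterion follow the algorithm; the concrete d_size and d_min values are placeholders, not the paper's settings.

import numpy as np
from sklearn.cluster import KMeans

def choose_cohesive_cluster(S, d_size=5, d_min=3):
    # S: (N, d) array of image embeddings; returns the indices of the most cohesive cluster.
    k = max(S.shape[0] // d_size, 1)                        # k = floor(N / d_size-c)
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10).fit_predict(S)
    best_idx, best_score = None, np.inf
    for c in range(k):
        members = np.where(labels == c)[0]
        if len(members) <= d_min:                           # filter out clusters that are too small
            continue
        centroid = S[members].mean(axis=0)
        score = np.mean(np.sum((S[members] - centroid) ** 2, axis=1))  # mean squared distance to centroid
        if score < best_score:
            best_idx, best_score = members, score
    return best_idx                                         # a real implementation would need a fallback if all clusters are filtered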
3.2. Identity Extraction

Depending on the diversity of the image set generated in the current iteration, the most cohesive cluster c_cohesive may still exhibit an inconsistent identity, as can be seen in Figure 3. The representation Θ is therefore not yet ready for consistent generation, and we further refine it by training on the images in c_cohesive to extract a more consistent identity. This refinement is performed using text-to-image personalization methods [20, 70], which aim to extract a character from a given set of several images that already depict a consistent identity. While we apply them to a set of images which are not completely consistent, the fact that these images are chosen based on their semantic similarity to each other enables these methods to nevertheless distill a common identity from them.

We base our solution on a pre-trained Stable Diffusion XL (SDXL) [57] model, which utilizes two text encoders: CLIP [61] and OpenCLIP [34]. We perform textual inversion [20] to add a new pair of textual tokens τ, one for each of the two text encoders. However, we found that this parameter space is not expressive enough, as demonstrated in Section 4.3, hence we also update the model weights θ via a low-rank adaptation (LoRA) [33, 71] of the self- and cross-attention layers of the model. We use the standard denoising loss:

L_rec = E_{x ∼ c_cohesive, z ∼ E(x), ϵ ∼ N(0,1), t} [ ∥ϵ − ϵ_Θ(p)(z_t, t)∥²₂ ],   (2)

where c_cohesive is the chosen cluster, E(x) is the VAE encoder of SDXL, ϵ is the sampled noise, t is the time step, and z_t is the latent z noised to time step t. We optimize L_rec over Θ = (θ, τ), the union of the LoRA weights and the newly-added textual tokens.

3.3. Convergence

As explained earlier (Algorithm 1 and Figure 3), the above process is performed iteratively. Note that the representation Θ extracted in each iteration is the one used to generate the set of N images for the next iteration. The generated images are thus funneled into a consistent identity.

Rather than using a fixed number of iterations, we apply a convergence criterion that enables early stopping. After each iteration, we calculate the average pairwise Euclidean distance between all N embeddings of the newly-generated images, and stop when this distance is smaller than a predefined threshold d_conv.

Finally, it should be noted that our method is nondeterministic, i.e., when running it multiple times on the same input prompt p, different consistent characters will be generated. This is aligned with the one-to-many nature of our task. For more details and examples, please refer to the supplementary material.
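Returning to the refinement objective of Eq. (2) in Section 3.2, the sketch below computes the denoising loss on a batch from c_cohesive. It follows the generic latent-diffusion training recipe with standard diffusers components and omits SDXL-specific conditioning (the pooled text embeddings and size embeddings passed via added_cond_kwargs); in the actual method only the LoRA weights θ and the new token embeddings τ would receive gradients.

import torch
import torch.nn.functional as F

def identity_extraction_loss(unet, vae, scheduler, text_hidden_states, images):
    # images: batch of cohesive-cluster images; text_hidden_states: encoder states for prompt p with the new tokens.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)                               # epsilon ~ N(0, 1)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)   # random time steps
    z_t = scheduler.add_noise(latents, noise, t)                    # latent z noised to time step t
    # SDXL would additionally require added_cond_kwargs; omitted in this sketch.
    pred = unet(z_t, t, encoder_hidden_states=text_hidden_states).sample
    return F.mse_loss(pred, noise)                                  # || eps - eps_Theta(p)(z_t, t) ||^2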
[Figure 5 (qualitative comparison): contexts "indoors" and "in the park"; columns TI [20], LoRA DB [71], ELITE [90], BLIP-diff [42], IP-Adapter [93], Ours.]
4. Experiments

In Section 4.1 we compare our method against several baselines, both qualitatively and quantitatively. Next, in Section 4.2 we describe the user study we conducted and present its results. The results of an ablation study are reported in Section 4.3. Finally, in Section 4.4 we demonstrate several applications of our method.

4.1. Qualitative and Quantitative Comparison

We compared our method against the most related personalization techniques [20, 42, 71, 89, 93]. In each experiment, each of these techniques is used to extract a character from a single image, generated by SDXL [57] from an input prompt p. The same prompt p is also provided as input to our method. Textual Inversion (TI) [20] optimizes a textual token using several images of the same concept, and we converted it to support SDXL by learning two text tokens, one for each of its text encoders, as we did in our method. In addition, we used LoRA DreamBooth [71] (LoRA DB), which we found less prone to overfitting than standard DB. Furthermore, we compared against all available image encoder techniques that encode a single image into the textual space of the diffusion model for later generation in novel contexts: BLIP-Diffusion [42], ELITE [89], and IP-Adapter [93]. For all the baselines, we used the same prompt p to generate a single image, and used it to extract the identity via optimization (TI and LoRA DB) or encoding (ELITE, BLIP-Diffusion and IP-Adapter).

In Figure 5 we qualitatively compare our method against the above baselines. While TI [20], BLIP-Diffusion [42] and IP-Adapter [93] are able to follow the specified prompt, they fail to produce a consistent character. LoRA DB [71] succeeds in consistent generation, but it does not always respond to the prompt. Furthermore, the resulting character is generated in the same fixed pose. ELITE [90] struggles with prompt following and the generated characters tend to be deformed. In comparison, our method is able to follow the prompt and maintain consistency, while generating appealing characters in different poses and viewing angles.

In order to automatically evaluate our method and the baselines quantitatively, we instructed ChatGPT [53] to generate prompts for characters of different types (e.g., animals, creatures, objects, etc.) in different styles (e.g., stickers, animations, photorealistic images, etc.). Each of these prompts was then used to extract a consistent character by our method and by each of the baselines. Next, we generated these characters in a predefined collection of novel contexts. For a visual comparison, please refer to the supplementary material.

We employ two standard evaluation metrics: prompt similarity and identity consistency, which are commonly used in the personalization literature [6, 20, 70]. Prompt similarity measures the correspondence between the generated images and the input text prompt. We use the standard CLIP [61] similarity, i.e., the normalized cosine similarity between the CLIP image embedding of the generated images and the CLIP text embedding of the source prompts. For measuring identity consistency, we calculate the similarity between the CLIP image embeddings of generated images of the same concept across different contexts.
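Both automatic metrics can be computed with an off-the-shelf CLIP model; a minimal sketch follows. The specific checkpoint and the averaging over prompts and images are assumptions, as the paper does not spell them out.

import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")      # assumed checkpoint
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_image_embeddings(images):
    feats = clip.get_image_features(**proc(images=images, return_tensors="pt"))
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def prompt_similarity(images, prompts):
    # normalized cosine similarity between each generated image and its source prompt
    img = clip_image_embeddings(images)
    txt = clip.get_text_features(**proc(text=prompts, return_tensors="pt", padding=True))
    txt = torch.nn.functional.normalize(txt, dim=-1)
    return (img * txt).sum(-1).mean().item()

@torch.no_grad()
def identity_consistency(images):
    # mean pairwise cosine similarity between images of the same character in different contexts
    emb = clip_image_embeddings(images)
    sim = emb @ emb.T
    n = sim.shape[0]
    return ((sim.sum() - sim.trace()) / (n * (n - 1))).item()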
[Figure 6 plots: automatic identity consistency vs. automatic prompt similarity (left) and user identity consistency ranking vs. user prompt similarity ranking (right), for TI, LoRA DB, ELITE, BLIP-diffusion, IP-Adapter, Ours, and the ablated variants (Ours w/o clustering, Ours single iter., Ours w reinit.).]
Figure 6. Quantitative Comparison and User Study. (Left) We compared our method quantitatively with various baselines in terms of
identity consistency and prompt similarity, as explained in Section 4.1. LoRA DB and ELITE maintain high identity consistency, while
sacrificing prompt similarity. TI and BLIP-diffusion achieve high prompt similarity but low identity consistency. Our method and IP-
adapter both lie on the Pareto front, but the better identity consistency of our method is perceptually significant, as demonstrated in the
user study. We also ablated some components of our method: removing the clustering stage, reducing the optimizable representation,
re-initializing the representation in each iteration and performing only a single iteration. All of the ablated cases resulted in a significant
degradation of consistency. (Right) The user study rankings also demonstrate that our method lies on the Pareto front, balancing between
identity consistency and prompt similarity.
As can be seen in Figure 6 (left), there is an inherent trade-off between prompt similarity and identity consistency: LoRA DB and ELITE exhibit high identity consistency, while sacrificing prompt similarity. TI and BLIP-diffusion achieve high prompt similarity but low identity consistency. Our method and IP-adapter both lie on the Pareto front. However, our method achieves better identity consistency than IP-adapter, which is significant from the user's perspective, as supported by our user study.

4.2. User Study

We conducted a user study to evaluate our method, using the Amazon Mechanical Turk (AMT) platform [2]. We used the same generated prompts and samples that were used in Section 4.1 and asked the evaluators to rate the prompt similarity and identity consistency of each result on a Likert scale of 1–5. For ranking the prompt similarity, the evaluators were presented with the target text prompt and the results of our method and the baselines on the same page, and were asked to rate each of the images. For identity consistency, for each of the generated concepts, we compared our method and the baselines by randomly choosing pairs of generated images with different target prompts, and the evaluators were asked to rate on a scale of 1–5 whether the images contain the same main character. Again, all the pairs of the same character for the different baselines were shown on the same page.

As can be seen in Figure 6 (right), our method again exhibits a good balance between identity consistency and prompt similarity, with a wider gap separating it from the baselines. For more details and statistical significance analysis, read the supplementary material.

4.3. Ablation Study

We conducted an ablation study for the following cases: (1) Without clustering — we omit the clustering step described in Section 3.1, and instead simply generate 5 images according to the input prompt. (2) Without LoRA — we reduce the optimizable representation Θ in the identity extraction stage, as described in Section 3.2, to consist of only the newly-added text tokens, without the additional LoRA weights. (3) With re-initialization — instead of using the latest representation Θ in each of the optimization iterations, as described in Section 3.3, we re-initialize it in each iteration. (4) Single iteration — rather than iterating until convergence (Section 3.3), we stop after a single iteration.

As can be seen in Figure 6 (left), all of the above key components are crucial for achieving a consistent identity in the final result: (1) removing the clustering harms the identity extraction stage because the training set is too diverse, (2) reducing the representation causes underfitting, as the model does not have enough parameters to properly capture the identity, and (3) re-initializing the representation in each iteration, or (4) performing a single iteration, does not allow the model to converge into a single identity.

For a visual comparison of the ablation study, as well as a comparison of alternative feature extractors (DINOv1 [14] and CLIP [61]), please refer to the supplementary material.

4.4. Applications

As demonstrated in Figure 7, our method can be used for various downstream tasks, such as (a) illustrating a story by breaking it into different scenes and using the same consistent character in all of them, (b) local image editing, and (c) additional pose control.
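As an illustration of how a refined representation can be reused for such applications, the sketch below loads hypothetical saved LoRA weights into an SDXL pipeline and renders the character in new scenes with a learned placeholder token; the path, the token name "<v>", and the assumption that the learned embeddings were registered in both SDXL text encoders are ours, not the paper's. Pose control would analogously swap in diffusers' StableDiffusionXLControlNetPipeline with an openpose ControlNet, and local editing would use a Blended Latent Diffusion style inpainting step.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./chosen_one_character")   # hypothetical path to the refined LoRA weights
# The learned embeddings for the placeholder token "<v>" are assumed to already be
# registered in both SDXL tokenizers / text encoders.

scenes = [
    "<v> jogging on the beach",
    "<v> drinking coffee with a friend in the heart of New York City",
    "<v> reviewing a paper in a cozy apartment",
]
images = [pipe(prompt=s, num_inference_steps=30).images[0] for s in scenes]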
[Figure 7 (applications): (a) Story illustration: "This is a story about Jasper, a cute mink with a brown jacket and red pants. Jasper started his day by jogging on the beach, and afterwards, he enjoyed a coffee meetup with a friend in the heart of New York City. As the day drew to a close, he settled into his cozy apartment to review a paper.", Scenes 1–4; (b) Local image editing: "a Plasticine of a cute baby cat with big eyes", image + mask, "sitting", "jumping", "wearing sunglasses"; (c) Additional pose control: "a photo of a ginger woman with a long hair".]
[Additional figure panels: (a) Inconsistent identity; (b) Inconsistent supporting elements; (c) Spurious attributes.]
Acknowledgments. We thank Yael Pitch, Matan Cohen, Neal Wadhwa and Yaron Brodsky for their valuable help and feedback.

A. Additional Experiments

Below we provide additional experiments that were omitted from the main paper. In Appendix A.1 we provide additional comparisons and results of our method, and demonstrate the nondeterministic nature of our method in Appendix A.2. Furthermore, in Appendix A.3 we compare our method against two naïve baselines. In addition, Appendix A.4 presents the results of our method using different feature extractors. Lastly, in Appendix A.5 we provide results that reduce the concerns of dataset memorization by our method.

A.1. Additional Comparisons and Results

In Figure 9 we provide a qualitative comparison on the automatically generated images. In Figure 11 we provide a qualitative comparison of the ablated cases. In Figure 12 we provide an additional qualitative comparison.

Concurrently to our work, the DALL·E 3 model [12] was commercially released as a part of the paid ChatGPT Plus [53] subscription, enabling image generation in a conversational setting. We tried, using a conversation, to create a consistent character of a Plasticine cat, as demonstrated in Figure 10. As can be seen, the generated characters share only some of the characteristics (e.g., big eyes) but not all of them (e.g., colors, textures and shapes).

In addition, as demonstrated in Figure 13, our approach is applicable to consistent generation of a wide range of subjects, without the requirement for them to necessarily depict human characters or creatures. Figure 14 shows additional results of our method, demonstrating a variety of character styles. Lastly, in Figure 15 we demonstrate the ability of creating a fully consistent "life story" of a character using our method.

A.2. Nondeterminism of Our Method

In Figures 16 and 17 we demonstrate the non-deterministic nature of our method. Using the same text prompt, we run our method multiple times with different initial seeds, thereby generating a different set of images for the identity clustering stage (Section 3.1 in the main paper). Consequently, the most cohesive cluster c_cohesive is different in each run, yielding different consistent identities. This behavior of our method is aligned with the one-to-many nature of our task — a single text prompt may correspond to many identities.

A.3. Naïve Baselines

As explained in Section 4.1 in the main paper, we compared our method against a version of TI [20] and LoRA DB [71] that were trained on a single image (with a single identity). Instead, we could generate a small set of 5 images for the given prompt (that are not guaranteed to be of the same identity), and use this mini dataset for the TI and LoRA DB baselines. As can be seen in Figure 18 and Figure 19, these baselines sacrifice the identity consistency.

A.4. Additional Feature Extractors

Instead of using DINOv2 [54] features for the identity clustering stage (Section 3.1 in the main paper), we also experimented with two alternative feature extractors: DINOv1 [14] and the CLIP [61] image encoder. We quantitatively evaluate our method with each of these feature extractors in terms of identity consistency and prompt similarity, as explained in Section 4.1 in the main paper. As can be seen in Figure 20, DINOv1 produces higher identity consistency, while sacrificing prompt similarity, whereas CLIP achieves higher prompt similarity at the expense of identity consistency. Qualitatively, as demonstrated in Figure 21, we found the DINOv1 extractor to perform similarly to DINOv2, whereas CLIP produces results with a slightly lower identity consistency.

A.5. Dataset Non-Memorization

Our method is able to produce consistent characters, which raises the question of whether these characters already exist in the training data of the generative model. We employed SDXL [57] as our text-to-image model, whose training dataset is, regrettably, undisclosed in the paper [57]. Consequently, we relied on the most likely overlapping dataset, LAION-5B [73], which was also utilized by Stable Diffusion V2.

To probe for dataset memorization, we found the top 5 nearest neighbors in the dataset, in terms of CLIP [61] image similarity, for a few representative characters from our paper, using an open-source solution [68]. As demonstrated in Figure 22, our method does not simply memorize images from the LAION-5B dataset.

B. Implementation Details

In this section, we provide the implementation details that were omitted from the main paper. In Appendix B.1 we provide the implementation details of our method and the baselines. Then, in Appendix B.2 we provide the implementation details of the automatic metrics that we used to evaluate our method against the baselines. In Appendix B.3 we provide the implementation details and the statistical analysis for the user study we conducted. Lastly, in Appendix B.4 we provide the implementation details for the applications we presented.
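As a concrete illustration of the feature-extraction step compared in Appendix A.4, the sketch below loads a DINOv2 backbone via torch.hub; the exact backbone size used in the paper is not stated, so ViT-B/14 here is an assumption, and a DINOv1 or CLIP image encoder could be dropped in at the same place to reproduce the comparison.

import torch
import torchvision.transforms as T

# DINOv2 backbone for embedding the generated images before clustering (ViT-B/14 is an assumed size).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed_images(pil_images):
    batch = torch.stack([preprocess(im) for im in pil_images])
    return dinov2(batch).cpu().numpy()     # one global (class-token) feature vector per image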
[Figure 9 (qualitative comparison on an automatically generated prompt, "a 2D animation of captivating Arctic fox with fluffy fur, bright eyes, and nimble movements, bringing the magic of the icy wilderness to animated life"): contexts "drinking a beer", "with a city in the background", "eating a burger", "wearing a hat", "wearing a blue hat"; columns TI [20], LoRA DB [71], ELITE [90], BLIP-diff [42], IP-Adapter [93], Ours.]
[Figure 10: DALL·E 3 [12] generations "in the park", "reading a book", "at the beach", and "holding an avocado".]
Table 1. Statistical analysis. We use Tukey's honestly significant difference procedure [83] to test whether the differences between our method and the baselines are statistically significant.
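A minimal sketch of the significance tests mentioned here and in Appendix B.3 (a Kruskal-Wallis omnibus test across all methods, followed by Tukey's HSD on the per-rating scores); the ratings dictionary is a hypothetical stand-in for the collected user-study responses.

import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# ratings: method name -> list of 1-5 Likert ratings (hypothetical data layout)
def analyze(ratings):
    h_stat, p_value = kruskal(*ratings.values())             # omnibus test across all methods
    scores = np.concatenate([np.asarray(v, dtype=float) for v in ratings.values()])
    groups = np.concatenate([[name] * len(v) for name, v in ratings.items()])
    tukey = pairwise_tukeyhsd(scores, groups)                 # pairwise honestly significant differences
    return p_value, tukey.summary()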
Table 2. Users' rankings means and variances. The means and variances of the rankings that are reported in the user study.

Method              Prompt similarity (↑)   Identity consistency (↑)
TI [20]             3.31 ± 1.43             3.17 ± 1.17
LoRA DB [71]        3.03 ± 1.43             3.67 ± 1.2
ELITE [90]          2.87 ± 1.46             3.2 ± 1.21
BLIP-Diffusion [42] 3.35 ± 1.41             2.76 ± 1.31
IP-Adapter [93]     3.25 ± 1.42             2.99 ± 1.28
Ours                3.3 ± 1.36              3.48 ± 1.2

… the main paper, and asked the evaluators to rate the prompt similarity and identity consistency of each result on a Likert scale of 1–5. For ranking the prompt similarity, the evaluators were instructed the following: "For each of the following images, please rank on a scale of 1 to 5 its correspondence to this text description: {PROMPT}. The character in the image can be anything (e.g., a person, an animal, a toy etc.)", where {PROMPT} is the target text prompt (in which we replaced the special token with the word "character"). All the baselines, as well as our method, were presented on the same page, and the evaluators were asked to rate each one of the results using a slider from 1 ("Do not match at all") to 5 ("Match perfectly"). Next, for assessing the identity consistency, we took, for each one of the characters, two generated images that correspond to different target text prompts, put them next to each other, and instructed the evaluators the following: "For each of the following image pairs, please rank on a scale of 1 to 5 if they contain the same character (1 means that they contain totally different characters and 5 means that they contain exactly the same character). The images can have different backgrounds". We put all the compared images on the same page, and the evaluators were asked to rate each one of the pairs using a slider from 1 ("Totally different characters") to 5 ("Exactly the same character").

We collected three ratings per question, resulting in 1104 ratings per task (prompt similarity and identity consistency). The time allotted per task was one hour, to allow the raters to properly evaluate the results without time pressure. The means and variances of the user study responses are reported in Table 2.

In addition, we conducted a statistical analysis of our user study by validating that the difference between all the conditions is statistically significant using the Kruskal-Wallis test [40] (p < 3.85e−28 for the text similarity test and p < 1.07e−76 for the identity consistency test). Lastly, we used Tukey's honestly significant difference procedure [83] to show that the comparison of our method against all the baselines is statistically significant, as detailed in Table 1.

B.4. Applications Implementation Details

In Section 4.4 in the main paper, we presented three downstream applications of our method.

Story illustration. Given a long story, e.g., "This is a story about Jasper, a cute mink with a brown jacket and red pants. Jasper started his day by jogging on the beach, and afterwards, he enjoyed a coffee meetup with a friend in the heart of New York City. As the day drew to a close, he settled into his cozy apartment to review a paper", one can create a consistent character from the main character description ("a cute mink with a brown jacket and red pants"), and then generate the various scenes by simply rephrasing the sentences:
1. "[v] jogging on the beach"
2. "[v] drinking coffee with his friend in the heart of New York City"
3. "[v] reviewing a paper in his cozy apartment"

Local image editing. Our method can be simply integrated with Blended Latent Diffusion [5, 7] for editing images locally: given a text prompt, we start by running our method to extract a consistent identity; then, given an input image and mask, we can plant the character in the image within the mask boundaries. In addition, we can provide a local text description for the character.

Additional pose control. Our method can be integrated with ControlNet [97]: given a text prompt, we first apply our method to extract a consistent identity; then, given an input pose, we can generate the character with this pose.

C. Societal Impact

We believe that the emergence of technology that facilitates the effortless creation of consistent characters holds exciting promise in a variety of creative and practical applications. It can empower storytellers and content creators to bring their narratives to life with vivid and unique characters, enhancing the immersive quality of their work. In addition, it may offer accessibility to those who may not possess traditional artistic skills, democratizing character design in the creative industry. Furthermore, it can reduce the cost of advertising, and open up new opportunities for small and underprivileged entrepreneurs, enabling them to reach a wider audience and compete in the market more effectively.

On the other hand, as any other generative AI technology, it can be misused by creating false and misleading visual content for deceptive purposes. Creating fake characters or personas can be used for online scams, disinformation campaigns, etc., making it challenging to discern genuine information from fabricated content. Such technologies underscore the vital importance of developing generated content detection systems, making it a compelling and appealing research direction to address.
[Figure 11 (ablation, qualitative): contexts "drinking a beer" and "with a city in the background"; columns Ours single iter., Ours w/o clust., Ours w/o LoRA, Ours w reinit., Ours.]
[Figure 13 contexts: "in the desert", "in Times Square", "near a lake", "near the Eiffel Tower", "near the Taj Mahal".]
Figure 13. Consistent generation of non-character objects. Our approach is applicable to a wide range of objects, without the require-
ment for them to depict human characters or creatures.
[Figure 14: "a purple astronaut with human face, digital art, smooth, sharp focus, vector art", shown "in the park", "reading a book", "at the beach", and "holding an avocado".]
Figure 14. Additional results. Our method is able to consistently generate different types and styles of characters, e.g., paintings,
animations, stickers and vector art.
[Figure 15 scenes: "as a baby", "as a small child", "as a teenager", "with his first girlfriend", "before the prom", "as a soldier", "in the college campus", "sitting in a lecture", "playing football", "drinking a beer", "studying in his room", "happy with his accepted paper", "giving a talk in a conference", "graduating from college", "a profile picture".]
Figure 15. Life story. Given a text prompt describing a fictional character, “a photo of a man with short black hair”, we can generate a
consistent life story for that character, demonstrating the applicability of our method for story generation.
[Figure 16 contexts: "in the park", "reading a book", "at the beach", "holding an avocado".]
Figure 16. Non-determinism. By running our method multiple times, given the same prompt “a photo of a 50 years old man with curly
hair”, but using different initial seeds, we obtain different consistent characters corresponding to the text prompt.
[Figure 17 contexts: "in the park", "reading a book", "at the beach", "holding an avocado".]
Figure 17. Non-determinism. By running our method multiple times, given the same prompt “a Plasticine of a cute baby cat with big
eyes”, but using different initial seeds, we obtain different consistent characters corresponding to the text prompt.
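The non-determinism comes from the seed used to draw the very first image set; a sketch of how different seeds would be fed to an SDXL pipeline for that initial clustering round is shown below (the pipeline call is the standard diffusers interface, the surrounding setup is assumed).

import torch

def initial_image_set(pipe, prompt, n_images, seed):
    # Different seeds yield different initial sets, hence different (but consistent) final identities.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt=prompt, num_images_per_prompt=n_images, generator=generator).images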
[Figures 18–20: qualitative comparison against the naïve baselines (columns TI mini, LoRA DB mini, Ours; contexts "drinking a beer", "with a city in the background") and plots of automatic identity consistency (↑) for TI multi, LoRA DB multi, and our method with the DINOv1, DINOv2, and CLIP feature extractors.]
[Figure 21: columns Ours with CLIP, Ours with DINOv1, Ours; contexts "drinking a beer" and "with a city in the background".]
Figure 22. Dataset non-memorization. We found the top 5 nearest neighbors in the LAION-5B dataset [73], in terms of CLIP [61] image
similarity, for a few representative characters from our paper, using an open-source solution [68]. As can be seen, our method does not
simply memorize images from the LAION-5B dataset.
References
[1] Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. A neural space-time representation for text-to-image personalization. ArXiv, abs/2305.15391, 2023.
[2] Amazon. Amazon Mechanical Turk. https://www.mturk.com/, 2023.
[3] Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, and Amit H Bermano. Domain-agnostic tuning-encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06925, 2023.
[4] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.
[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, 2022.
[6] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. ArXiv, abs/2305.16311, 2023.
[7] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM Trans. Graph., 42(4), 2023.
[8] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. SpaText: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18370–18380, 2023.
[9] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. ArXiv, abs/2211.01324, 2022.
[10] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723. Springer, 2022.
[11] Sagie Benaim, Frederik Warburg, Peter Ebert Christensen, and Serge J. Belongie. Volumetric disentanglement for 3d scene manipulation. ArXiv, abs/2206.02776, 2022.
[12] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. 2023.
[13] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, 2023.
[14] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021.
[15] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42:1–10, 2023.
[16] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Rui, Xuhui Jia, Ming-Wei Chang, and William W. Cohen. Subject-driven text-to-image generation via apprenticeship learning. ArXiv, abs/2304.00186, 2023.
[17] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. AnyDoor: Zero-shot object-level image customization. ArXiv, abs/2307.09481, 2023.
[18] Guillaume Couairon, Marlene Careil, Matthieu Cord, Stéphane Lathuilière, and Jakob Verbeek. Zero-shot spatial layout conditioning for text-to-image diffusion models. ArXiv, abs/2306.13754, 2023.
[19] Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. SceneScape: Text-driven consistent scene generation. ArXiv, abs/2302.01133, 2023.
[20] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2022.
[21] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023.
[22] Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang. Expressive text-to-image generation with rich text. ArXiv, abs/2304.06720, 2023.
[23] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
[24] Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. TaleCrafter: Interactive story visualization with multiple characters. ArXiv, abs/2305.18247, 2023.
[25] Ori Gordon, Omri Avrahami, and Dani Lischinski. Blended-NeRF: Zero-shot object generation and blending in existing neural radiance fields. ArXiv, abs/2306.12760, 2023.
[26] Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris N. Metaxas, and Feng Yang. SVDiff: Compact parameter space for diffusion fine-tuning. ArXiv, abs/2303.11305, 2023.
[27] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[28] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2328–2337, 2023.
[29] Geoffrey E. Hinton and Sam T. Roweis. Stochastic neighbor embedding. In NIPS, 2002.
[30] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020.
[31] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2Room: Extracting textured 3d meshes from 2d text-to-image models. ArXiv, abs/2303.11989, 2023.
[32] Eliahu Horwitz and Yedid Hoshen. Conffusion: Confidence intervals for diffusion models. ArXiv, abs/2211.09795, 2022.
[33] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
[34] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021.
[35] Shira Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for semantic typography. ACM Transactions on Graphics (TOG), 42:1–11, 2023.
[36] Hyeonho Jeong, Gihyun Kwon, and Jong-Chul Ye. Zero-shot generation of coherent storybook from plain text story using diffusion models. ArXiv, abs/2302.03900, 2023.
[37] Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han-Ying Zhang, Boqing Gong, Tingbo Hou, H. Wang, and Yu-Chuan Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. ArXiv, abs/2304.02642, 2023.
[38] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[39] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[40] William H. Kruskal and Wilson Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47:583–621, 1952.
[41] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
[42] Dongxu Li, Junnan Li, and Steven C. H. Hoi. BLIP-Diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. ArXiv, abs/2305.14720, 2023.
[43] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. StoryGAN: A sequential conditional GAN for story visualization. CVPR, 2019.
[44] Shaoteng Liu, Yuecheng Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-P2P: Video editing with cross-attention control. ArXiv, abs/2303.04761, 2023.
[45] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-P2P: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
[46] Adyasha Maharana, Darryl Hannan, and Mohit Bansal. StoryDALL-E: Adapting pretrained text-to-image transformers for story continuation. In European Conference on Computer Vision, pages 70–87. Springer, 2022.
[47] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
[48] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-NeRF for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12663–12673, 2023.
[49] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[50] Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Y. Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors. ArXiv, abs/2302.01329, 2023.
[51] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[52] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, 2021.
[53] OpenAI. ChatGPT. https://chat.openai.com/, 2022. Accessed: 2023-10-15.
[54] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Q. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russ Howes, Po-Yao (Bernie) Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Huijiao Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. ArXiv, abs/2304.07193, 2023.
[55] Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, and Daniel Cohen-Or. Localizing object-level shape variations with text-to-image diffusion models. ArXiv, abs/2303.11306, 2023.
[56] Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Bjorn Ommer, Christian Theobalt, Peter Wonka, and Gordon Wetzstein. State of the art on diffusion models for visual computing. ArXiv, abs/2310.07204, 2023.
[57] Dustin Podell, Zion English, Kyle Lacey, A. Blattmann, Tim Dockhorn, Jonas Muller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. ArXiv, abs/2307.01952, 2023.
[58] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[59] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
[60] Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H. Bermano, and Daniel Cohen-Or. Single motion diffusion. ArXiv, abs/2302.05905, 2023.
[61] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
[62] Tanzila Rahman, Hsin-Ying Lee, Jian Ren, S. Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-A-Story: Visual memory conditioned consistent story generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2493–2502, 2022.
[63] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[64] reddit.com. How to create consistent character faces without training (info in the comments) : StableDiffusion. https://www.reddit.com/r/StableDiffusion/comments/12djxvz/how_to_create_consistent_character_faces_without/, 2023.
[65] reddit.com. 8 ways to generate consistent characters (for comics, storyboards, books etc) : StableDiffusion. https://www.reddit.com/r/StableDiffusion/comments/10yxz3m/8_ways_to_generate_consistent_characters_for/, 2023.
[66] Elad Richardson, Kfir Goldberg, Yuval Alaluf, and Daniel Cohen-Or. ConceptLab: Creative generation using diffusion prior constraints. arXiv preprint arXiv:2308.02669, 2023.
[67] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. TEXTure: Text-guided texturing of 3d shapes. ACM SIGGRAPH 2023 Conference Proceedings, 2023.
[68] Romain Beaumont. Clip retrieval. https://github.com/rom1504/clip-retrieval, 2023.
[69] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
[70] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[71] Simo Ryu. Low-rank adaptation for fast text-to-image diffusion fine-tuning. https://github.com/cloneofsimo/lora, 2022.
[72] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[73] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. ArXiv, abs/2210.08402, 2022.
[74] Etai Sella, Gal Fiebelman, Peter Hedman, and Hadar Averbuch-Elor. Vox-E: Text-guided voxel editing of 3d objects. ArXiv, abs/2303.12048, 2023.
[75] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. kNN-Diffusion: Image generation via large-scale retrieval. In The Eleventh International Conference on Learning Representations, 2022.
[76] Jing Shi, Wei Xiong, Zhe L. Lin, and Hyun Joon Jung. InstantBooth: Personalized text-to-image generation without test-time finetuning. ArXiv, abs/2304.03411, 2023.
[77] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[78] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[79] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[80] Gábor Szűcs and Modafar Al-Shouha. Modular StoryGAN with background and theme awareness for story visualization. In International Conference on Pattern Recognition and Artificial Intelligence, pages 275–286. Springer, 2022.
[81] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model. ArXiv, abs/2209.14916, 2022.
[82] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization. ACM SIGGRAPH 2023 Conference Proceedings, 2023.
[83] John W. Tukey. Comparing individual means in the analysis of variance. Biometrics, 5(2):99–114, 1949.
[84] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
[85] Yael Vinker, Andrey Voynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration. ArXiv, abs/2305.18203, 2023.
[86] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[87] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. arXiv preprint arXiv:2211.13752, 2022.
[88] Andrey Voynov, Q. Chu, Daniel Cohen-Or, and Kfir Aberman. P+: Extended textual conditioning in text-to-image generation. ArXiv, abs/2303.09522, 2023.
[89] Yuxiang Wei. Official implementation of ELITE. https://github.com/csyxwei/ELITE, 2023. Accessed: 2023-05-01.
[90] Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. ArXiv, abs/2302.13848, 2023.
[91] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.
[92] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender A Video: Zero-shot text-guided video-to-video translation. ArXiv, abs/2306.07954, 2023.
[93] Hu Ye, Jun Zhang, Siyi Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. ArXiv, abs/2308.06721, 2023.
[94] youtube.com. How to create consistent characters in Midjourney. https://www.youtube.com/watch?v=Z7_ta3RHijQ, 2023.
[95] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[96] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In-So Kweon. Text-to-image diffusion models in generative AI: A survey. ArXiv, abs/2303.07909, 2023.
[97] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023.
[98] Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, and Guanbin Li. DreamEditor: Text-driven 3d scene editing with neural fields. ArXiv, abs/2306.13455, 2023.