
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

Omri Avrahami 1,2   Amir Hertz 1   Yael Vinker 1,3   Moab Arar 1,3   Shlomi Fruchter 1   Ohad Fried 4   Daniel Cohen-Or 1,3   Dani Lischinski 1,2
1 Google Research   2 The Hebrew University of Jerusalem   3 Tel Aviv University   4 Reichman University
arXiv:2311.10093v1 [cs.CV] 16 Nov 2023

[Figure 1: a grid of generated images. Columns: "in the park", "reading a book", "at the beach", "holding an avocado". Rows (input prompts): "A photo of a 50 years old man with curly hair.", "A portrait of a man with a mustache and a hat, fauvism.", "A rendering of a cute albino porcupine, cozy indoor lighting."]

Figure 1. The Chosen One: Given a text prompt describing a character, our method distills a representation that enables consistent depiction of the same character in novel contexts.

Abstract

Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.

(Project page is available at: https://omriavrahami.com/the-chosen-one. Omri, Yael, Moab, Daniel and Dani performed this work while working at Google.)

1. Introduction

The ability to maintain consistency of generated visual content across various contexts, as shown in Figure 1, plays a central role in numerous creative endeavors. These include illustrating a book, crafting a brand, creating comics, developing presentations, designing webpages, and more. Such consistency serves as the foundation for establishing brand identity, facilitating storytelling, enhancing communication, and nurturing emotional engagement.
Despite the increasingly impressive abilities of text-to-image generative models, these models struggle with such consistent generation, a shortcoming that we aim to rectify in this work. Specifically, we introduce the task of consistent character generation: given an input text prompt describing a character, we derive a representation that enables generating consistent depictions of the same character in novel contexts. Although we refer to characters throughout this paper, our work is in fact applicable to visual subjects in general.

Consider, for example, an illustrator working on a Plasticine cat character. As demonstrated in Figure 2, providing a state-of-the-art text-to-image model with a prompt describing the character results in a variety of outcomes, which may lack consistency (top row). In contrast, in this work we show how to distill a consistent representation of the cat (2nd row), which can then be used to depict the same character in a multitude of different contexts.

Figure 2. Identity consistency. Given the prompt "a Plasticine of a cute baby cat with big eyes", a standard text-to-image diffusion model (top row, "Standard") produces different cats (all corresponding to the input text), whereas our method (bottom row, "Ours") produces the same cat.

The widespread popularity of text-to-image generative models [57, 63, 69, 72], combined with the need for consistent character generation, has already spawned a variety of ad hoc solutions. These include, for example, using celebrity names in prompts [64] for creating consistent humans, or using image variations [63] and filtering them manually by similarity [65]. In contrast to these ad hoc, manually intensive solutions, we propose a fully automatic, principled approach to consistent character generation.

The academic works most closely related to our setting are ones dealing with personalization [20, 70] and story generation [24, 36, 62]. Some of these methods derive a representation for a given character from several user-provided images [20, 24, 70]. Others cannot generalize to novel characters that are not in the training data [62], or rely on textual inversion of an existing depiction of a human face [36].

In this work, we argue that in many applications the goal is to generate some consistent character, rather than to visually match a specific appearance. Thus, we address a new setting, where we aim to automatically distill a consistent representation of a character that is only required to comply with a single natural language description. Our method does not require any images of the target character as input; thus, it enables creating a novel consistent character that does not necessarily resemble any existing visual depiction.

Our fully automated solution to the task of consistent character generation is based on the assumption that a sufficiently large set of images generated for a certain prompt will contain groups of images with shared characteristics. Given such a cluster, one can extract a representation that captures the "common ground" among its images. Repeating the process with this representation, we can increase the consistency among the generated images, while still remaining faithful to the original input prompt.

We start by generating a gallery of images based on the provided text prompt, and embed them in a Euclidean space using a pre-trained feature extractor. Next, we cluster these embeddings, and choose the most cohesive cluster to serve as the input for a personalization method that attempts to extract a consistent identity. We then use the resulting model to generate the next gallery of images, which should exhibit more consistency, while still depicting the input prompt. This process is repeated iteratively until convergence.

We evaluate our method quantitatively and qualitatively against several baselines, and also conduct a user study. Finally, we present several applications of our method. In summary, our contributions are: (1) we formalize the task of consistent character generation, (2) we propose a novel solution to this task, and (3) we evaluate our method quantitatively and qualitatively, in addition to a user study, to demonstrate its effectiveness.

2. Related Work

Text-to-image generation. Text-conditioned image generative models (T2I) [63, 69, 95] show unprecedented capabilities of generating high-quality images from mere natural language text descriptions. They are quickly becoming a fundamental tool for any creative vision task. In particular, text-to-image diffusion models [9, 30, 52, 77–79] are employed for guided image synthesis [8, 15, 18, 22, 27, 51, 87, 97] and image editing tasks [5, 7, 10, 13, 28, 38, 47, 49, 55, 75, 84]. Using image editing methods, one can edit an image of a given character and change its pose, etc.; however, these methods cannot ensure consistency of the character in novel contexts, as our problem dictates. In addition, diffusion models were used in other tasks [56, 96], such as video editing [23, 44, 45, 50, 59, 92], 3D synthesis [19, 31, 48, 58], editing [11, 25, 74, 98] and texturing [67], typography generation [35], motion generation [60, 81], and solving inverse problems [32].
Text-to-image personalization. Text-conditioned models cannot generate an image of a specific object or character. To overcome this limitation, a line of works utilizes several images of the same instance to encapsulate new priors in the generative model. Existing solutions range from optimization of text tokens [20, 85, 88] to fine-tuning the parameters of the entire model [6, 70]; in the middle, recent works suggest fine-tuning a small subset of parameters [1, 17, 26, 33, 41, 71, 82]. Models trained in this manner can generate consistent images of the same subject. However, they typically require a collection of images depicting the subject, which naturally narrows their ability to generate any imaginary character. Moreover, when training on a single input image [6], these methods tend to overfit and produce similar images with minimal diversity during inference.

Unlike previous works, our method does not require an input image; instead, it can generate consistent and diverse images of the same character based only on a text description. Additional works aim to bypass the personalization training by introducing a dedicated personalization encoder [3, 16, 21, 37, 42, 76, 90, 93]. Given an image and a prompt, these works can produce images with a character similar to the input. However, as shown in Section 4.1, they lack consistency when generating multiple images from the same input. Concurrently, ConceptLab [66] is able to generate new members of a broad category (e.g., a new pet); in contrast, we seek a consistent instance of a character described by the input text prompt.

Story visualization. Consistent character generation is well studied in the field of story visualization. Early GAN works [43, 80] employ a story discriminator for image-text alignment. Recent works, such as StoryDALL-E [46] and Make-A-Story [62], utilize pre-trained T2I models for the image generation, while an adapter model is trained to embed story captions and previous images into the T2I model. However, those methods cannot generalize to novel characters, as they are trained over specific datasets. More closely related, Jeong et al. [36] generate consistent storybooks by combining textual inversion with a face-swapping mechanism; therefore, their work relies on images of existing human-like characters. TaleCrafter [24] presents a comprehensive pipeline for storybook visualization. However, its consistent character module is based on an existing personalization method that requires fine-tuning on several images of the same character.

Manual methods. Other attempts at achieving consistent character generation using a generative model rely on ad hoc and manually intensive tricks, such as using text tokens of a celebrity, or a combination of celebrities [64], in order to create a consistent human; however, the generated characters resemble the original celebrities, and this approach does not generalize to other character types (e.g., animals). Users have also proposed to ensure consistency by manually crafting very long and elaborate text prompts [65], or by using image variations [63] and filtering them manually by similarity [65]. Other users suggested generating a full design sheet of a character, then manually filtering the best results and using them for further generation [94]. All these methods are manual, labor-intensive, and ad hoc for specific domains (e.g., humans). In contrast, our method is fully automated and domain-agnostic.

Algorithm 1: Consistent Character Generation
Input: text-to-image diffusion model M, parameterized by Θ = (θ, τ), where θ are the LoRA weights and τ is a set of custom text embeddings; target prompt p; feature extractor F.
Hyper-parameters: number of generated images per step N, minimum cluster size d_min-c, target cluster size d_size-c, convergence criterion d_conv, maximum number of iterations d_iter.
Output: a consistent representation Θ(p).
repeat
    S = ∪_N F(M_Θ(p))                                  {generate N images and embed them}
    C = K-MEANS++(S, k = ⌊N / d_size-c⌋)
    C = {c ∈ C | d_min-c < |c|}                        {filter small clusters}
    c_cohesive = argmin_{c ∈ C} (1/|c|) Σ_{e ∈ c} ∥e − c_cen∥²
    Θ = argmin_{(θ, τ)} L_rec over c_cohesive
until d_conv ≥ (1/|S|²) Σ_{s1, s2 ∈ S} ∥s1 − s2∥², or d_iter iterations have been performed
return Θ

3. Method

As stated earlier, our goal in this work is to enable generation of consistent images of a character (or another kind of visual subject) based on a textual description. We achieve this by iteratively customizing a pre-trained text-to-image model, using sets of images generated by the model itself as training data. Intuitively, we refine the representation of the target character by repeatedly funneling the model's output into a consistent identity. Once the process has converged, the resulting model can be used to generate consistent images of the target character in novel contexts. In this section, we describe our method in detail.

Formally, we are given a text-to-image model MΘ, parameterized by Θ, and a text prompt p that describes a target character. The parameters Θ consist of a set of model weights θ and a set of custom text embeddings τ. We seek a representation Θ(p) such that the parameterized model MΘ(p) is able to generate consistent images of the character described by p in novel contexts.
"Luna is a forest sprite
with green skin and Cohesive
leaves for hair.”
Identity
MΘ F MΘ
Cluster Extract
S
t
C ea
Rep

Figure 3. Method overview. Given an input text prompt, we start by generating numerous images using the text-to-image model MΘ ,
which are embedded into a semantic feature space using the feature extractor F . Next, these embeddings are clustered and the most
cohesive group is chosen, since it contains images with shared characteristics. The “common ground” among the images in this set is used
to refine the representation Θ to better capture and fit the target. These steps are iterated until convergence to a consistent identity.

Our approach, described in Algorithm 1 and depicted in Figure 3, is based on the premise that a sufficiently large set of images generated by M for the same text prompt, but with different seeds, will reflect the non-uniform density of the manifold of generated images. Specifically, we expect to find some groups of images with shared characteristics. The "common ground" among the images in one of these groups can be used to refine the representation Θ(p) so as to better capture and fit the target. We therefore propose to iteratively cluster the generated images, and use the most cohesive cluster to refine Θ(p). This process is repeated, with the refined representation Θ(p), until convergence. Below, we describe the clustering and the representation refinement components of our method in detail.
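To make the iterative procedure concrete, the following Python sketch mirrors Algorithm 1. It is a minimal illustration rather than the released implementation: generate_images, embed_images, most_cohesive_cluster, and refine_representation are hypothetical callables wrapping the text-to-image model, the feature extractor, the clustering step (Section 3.1), and the personalization step (Section 3.2).

import numpy as np

def the_chosen_one(prompt, theta_init, generate_images, embed_images,
                   most_cohesive_cluster, refine_representation,
                   n_images=128, d_conv=None, max_iters=10):
    """Iteratively distill a consistent representation Theta(p) for `prompt`.

    Assumed caller-supplied callables:
      generate_images(theta, prompt, n)    -> list of n images
      embed_images(images)                 -> (n, d) array of embeddings (e.g. DINOv2)
      most_cohesive_cluster(embeddings)    -> indices of the chosen cluster
      refine_representation(theta, images) -> refined theta (LoRA weights + text tokens)
    """
    theta = theta_init
    for _ in range(max_iters):
        images = generate_images(theta, prompt, n_images)
        emb = np.asarray(embed_images(images))

        # Convergence test: mean pairwise squared distance of the new gallery,
        # as in the "until" condition of Algorithm 1.
        diffs = emb[:, None, :] - emb[None, :, :]
        if d_conv is not None and (diffs ** 2).sum(-1).mean() <= d_conv:
            break

        idx = most_cohesive_cluster(emb)                                 # Section 3.1
        theta = refine_representation(theta, [images[i] for i in idx])   # Section 3.2
    return theta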
3.1. Identity Clustering

We start each iteration by using MΘ, parameterized with the current representation Θ, to generate a collection of N images, each corresponding to a different random seed. Each image is embedded in a high-dimensional semantic embedding space, using a feature extractor F, to form a set of embeddings S = ∪_N F(MΘ(p)). In our experiments, we use DINOv2 [54] as the feature extractor F.

Next, we use the K-MEANS++ [4] algorithm to cluster the embeddings of the generated images according to cosine similarity in the embedding space. We filter the resulting collection of clusters C by removing all clusters whose size is below a pre-defined threshold d_min-c, since it was shown [6] that personalization algorithms are prone to overfitting on small datasets. Among the remaining clusters, we choose the most cohesive one to serve as the input for the identity extraction stage (see Figure 4). We define the cohesion of a cluster c as the average distance between the members of c and its centroid c_cen:

    cohesion(c) = (1/|c|) Σ_{e ∈ c} ∥e − c_cen∥².    (1)

In Figure 4 we show a visualization of the DINOv2 embedding space, where the high-dimensional embeddings S are projected into 2D using t-SNE [29] and colored according to their K-MEANS++ [4] clusters. Some of the embeddings are clustered together more tightly than others, and the black cluster is chosen as the most cohesive one.

Figure 4. Embedding visualization. Given generated images for the text prompt "a sticker of a ginger cat", we project the set S of their high-dimensional embeddings into 2D using t-SNE [29] and indicate different K-MEANS++ [4] clusters using different colors. Representative images are shown for three of the clusters. It may be seen that images in each cluster share the same characteristics: black cluster — full-body cats, red cluster — cat heads, brown cluster — images with multiple cats. According to our cohesion measure (1), the black cluster is the most cohesive and is therefore chosen for identity extraction (or refinement).
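As an illustration of the clustering step and the cohesion measure of Eq. (1), the sketch below uses scikit-learn's k-means++ seeding on L2-normalized embeddings. The normalization as a stand-in for cosine-similarity clustering, as well as the default values of d_size_c and d_min_c, are assumptions made for the sketch, not values reported in the paper.

import numpy as np
from sklearn.cluster import KMeans

def most_cohesive_cluster(embeddings, d_size_c=8, d_min_c=4):
    """Return indices of the most cohesive cluster of the gallery embeddings.

    embeddings: (N, d) array, e.g. DINOv2 features of the generated images.
    d_size_c:   target cluster size, giving k = floor(N / d_size_c) clusters.
    d_min_c:    clusters of at most this size are discarded (overfitting risk).
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = max(1, emb.shape[0] // d_size_c)
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10).fit_predict(emb)

    best_members, best_cohesion = None, np.inf
    for c in range(k):
        members = np.where(labels == c)[0]
        if len(members) <= d_min_c:                      # filter small clusters
            continue
        centroid = emb[members].mean(axis=0)
        # Eq. (1): mean squared distance of the members to their centroid.
        cohesion = np.mean(np.sum((emb[members] - centroid) ** 2, axis=1))
        if cohesion < best_cohesion:
            best_members, best_cohesion = members, cohesion
    return best_members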
3.2. Identity Extraction

Depending on the diversity of the image set generated in the current iteration, the most cohesive cluster c_cohesive may still exhibit an inconsistent identity, as can be seen in Figure 3. The representation Θ is therefore not yet ready for consistent generation, and we further refine it by training on the images in c_cohesive to extract a more consistent identity. This refinement is performed using text-to-image personalization methods [20, 70], which aim to extract a character from a given set of several images that already depict a consistent identity. While we apply them to a set of images which are not completely consistent, the fact that these images are chosen based on their semantic similarity to each other enables these methods to nevertheless distill a common identity from them.

We base our solution on a pre-trained Stable Diffusion XL (SDXL) [57] model, which utilizes two text encoders: CLIP [61] and OpenCLIP [34]. We perform textual inversion [20] to add a new pair of textual tokens τ, one for each of the two text encoders. However, we found that this parameter space is not expressive enough, as demonstrated in Section 4.3; hence, we also update the model weights θ via a low-rank adaptation (LoRA) [33, 71] of the self- and cross-attention layers of the model.

We use the standard denoising loss:

    L_rec = E_{x ∼ c_cohesive, z ∼ E(x), ϵ ∼ N(0,1), t} [ ∥ϵ − ϵ_Θ(p)(z_t, t)∥²₂ ],    (2)

where c_cohesive is the chosen cluster, E(x) is the VAE encoder of SDXL, ϵ is the sample's noise, t is the time step, and z_t is the latent z noised to time step t. We optimize L_rec over Θ = (θ, τ), the union of the LoRA weights and the newly-added textual tokens.
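For illustration, the objective in Eq. (2) can be written as a small PyTorch function. This is a simplified sketch rather than the training code: the VAE encoder, the forward noising process, and the noise-prediction network (carrying the trainable LoRA weights, the learned tokens, and the prompt conditioning) are assumed to be supplied by the caller.

import torch

def reconstruction_loss(eps_model, vae_encode, add_noise, images, num_timesteps=1000):
    """L_rec of Eq. (2) for a batch of images drawn from c_cohesive.

    eps_model(z_t, t)    -> predicted noise epsilon_Theta(p)(z_t, t)
    vae_encode(images)   -> latents z from the SDXL VAE encoder
    add_noise(z, eps, t) -> z_t, the latents noised to timestep t
    """
    z = vae_encode(images)
    eps = torch.randn_like(z)                                         # eps ~ N(0, I)
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)
    z_t = add_noise(z, eps, t)
    return torch.mean((eps - eps_model(z_t, t)) ** 2)                 # squared error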
3.3. Convergence

As explained earlier (Algorithm 1 and Figure 3), the above process is performed iteratively. Note that the representation Θ extracted in each iteration is the one used to generate the set of N images for the next iteration. The generated images are thus funneled into a consistent identity.

Rather than using a fixed number of iterations, we apply a convergence criterion that enables early stopping. After each iteration, we calculate the average pairwise Euclidean distance between all N embeddings of the newly-generated images, and stop when this distance is smaller than a pre-defined threshold d_conv.

Finally, it should be noted that our method is nondeterministic, i.e., when running our method multiple times on the same input prompt p, different consistent characters will be generated. This is aligned with the one-to-many nature of our task. For more details and examples, please refer to the supplementary material.

4. Experiments

In Section 4.1 we compare our method against several baselines, both qualitatively and quantitatively. Next, in Section 4.2 we describe the user study we conducted and present its results. The results of an ablation study are reported in Section 4.3. Finally, in Section 4.4 we demonstrate several applications of our method.

4.1. Qualitative and Quantitative Comparison

We compared our method against the most related personalization techniques [20, 42, 71, 89, 93]. In each experiment, each of these techniques is used to extract a character from a single image, generated by SDXL [57] from an input prompt p. The same prompt p is also provided as input to our method. Textual Inversion (TI) [20] optimizes a textual token using several images of the same concept, and we converted it to support SDXL by learning two text tokens, one for each of its text encoders, as we did in our method. In addition, we used LoRA DreamBooth [71] (LoRA DB), which we found less prone to overfitting than standard DB. Furthermore, we compared against all available image encoder techniques that encode a single image into the textual space of the diffusion model for later generation in novel contexts: BLIP-Diffusion [42], ELITE [89], and IP-Adapter [93]. For all the baselines, we used the same prompt p to generate a single image, and used it to extract the identity via optimization (TI and LoRA DB) or encoding (ELITE, BLIP-Diffusion and IP-Adapter).

In Figure 5 we qualitatively compare our method against the above baselines. While TI [20], BLIP-Diffusion [42] and IP-Adapter [93] are able to follow the specified prompt, they fail to produce a consistent character. LoRA DB [71] succeeds in consistent generation, but it does not always respond to the prompt. Furthermore, the resulting character is generated in the same fixed pose. ELITE [90] struggles with prompt following, and the generated characters tend to be deformed. In comparison, our method is able to follow the prompt and maintain consistency, while generating appealing characters in different poses and viewing angles.

In order to automatically evaluate our method and the baselines quantitatively, we instructed ChatGPT [53] to generate prompts for characters of different types (e.g., animals, creatures, objects, etc.) in different styles (e.g., stickers, animations, photorealistic images, etc.). Each of these prompts was then used to extract a consistent character by our method and by each of the baselines. Next, we generated these characters in a predefined collection of novel contexts. For a visual comparison, please refer to the supplementary material.

We employ two standard evaluation metrics: prompt similarity and identity consistency, which are commonly used in the personalization literature [6, 20, 70].
[Figure 5: qualitative comparison grids for three characters ("a photo of a white fluffy toy", "a 3D animation of a happy pig", "a rendering of a fox, full body"), each shown in contexts such as "indoors", "in the park", "wearing a red hat in the street", "jumping near the river", "near the Golden Gate Bridge", and "in the snow", across TI [20], LoRA DB [71], ELITE [90], BLIP-diff [42], IP-Adapter [93], and Ours.]

Figure 5. Qualitative comparison. We compare our method against several baselines: TI [20], BLIP-diffusion [42] and IP-adapter [93] are able to follow the target prompts, but do not preserve a consistent identity. LoRA DB [71] is able to maintain consistency, but it does not always follow the prompt. Furthermore, the character is generated in the same fixed pose. ELITE [90] struggles with prompt following and also tends to generate deformed characters. On the other hand, our method is able to follow the prompt and maintain consistent identities, while generating the characters in different poses and viewing angles.

Prompt similarity measures the correspondence between the generated images and the input text prompt. We use the standard CLIP [61] similarity, i.e., the normalized cosine similarity between the CLIP image embedding of the generated images and the CLIP text embedding of the source prompts. For measuring identity consistency, we calculate the similarity between the CLIP image embeddings of generated images of the same concept across different contexts.
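Both metrics can be computed with an off-the-shelf CLIP model. The snippet below is a minimal rendition using the HuggingFace transformers interface; the checkpoint choice and preprocessing details are assumptions rather than the exact evaluation configuration.

import torch
from itertools import combinations
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def _image_features(images):
    feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def prompt_similarity(images, prompts):
    """Mean cosine similarity between each generated image and its source prompt."""
    img = _image_features(images)
    txt = model.get_text_features(**processor(text=prompts, return_tensors="pt", padding=True))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).mean().item()

@torch.no_grad()
def identity_consistency(images):
    """Mean pairwise cosine similarity between images of the same character."""
    img = _image_features(images)
    sims = [float(img[i] @ img[j]) for i, j in combinations(range(len(img)), 2)]
    return sum(sims) / len(sims)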
[Figure 6: two scatter plots. Left: automatic identity consistency vs. automatic prompt similarity for LoRA DB, ELITE, TI, BLIP-diffusion, IP-adapter, Ours, and the ablated variants (Ours w/ reinit., Ours w/o clustering, Ours single iter., Ours w/o LoRA). Right: user-study identity consistency vs. prompt similarity rankings for the same baselines and Ours.]

Figure 6. Quantitative Comparison and User Study. (Left) We compared our method quantitatively with various baselines in terms of identity consistency and prompt similarity, as explained in Section 4.1. LoRA DB and ELITE maintain high identity consistency, while sacrificing prompt similarity. TI and BLIP-diffusion achieve high prompt similarity but low identity consistency. Our method and IP-adapter both lie on the Pareto front, but the better identity consistency of our method is perceptually significant, as demonstrated in the user study. We also ablated some components of our method: removing the clustering stage, reducing the optimizable representation, re-initializing the representation in each iteration, and performing only a single iteration. All of the ablated cases resulted in a significant degradation of consistency. (Right) The user study rankings also demonstrate that our method lies on the Pareto front, balancing between identity consistency and prompt similarity.

As can be seen in Figure 6 (left), there is an inherent trade-off between prompt similarity and identity consistency: LoRA DB and ELITE exhibit high identity consistency, while sacrificing prompt similarity. TI and BLIP-diffusion achieve high prompt similarity but low identity consistency. Our method and IP-adapter both lie on the Pareto front. However, our method achieves better identity consistency than IP-adapter, which is significant from the user's perspective, as supported by our user study.

4.2. User Study

We conducted a user study to evaluate our method, using the Amazon Mechanical Turk (AMT) platform [2]. We used the same generated prompts and samples that were used in Section 4.1, and asked the evaluators to rate the prompt similarity and identity consistency of each result on a Likert scale of 1–5. For ranking the prompt similarity, the evaluators were presented with the target text prompt and the results of our method and the baselines on the same page, and were asked to rate each of the images. For identity consistency, for each of the generated concepts, we compared our method and the baselines by randomly choosing pairs of generated images with different target prompts, and the evaluators were asked to rate on a scale of 1–5 whether the images contain the same main character. Again, all the pairs of the same character for the different baselines were shown on the same page.

As can be seen in Figure 6 (right), our method again exhibits a good balance between identity consistency and prompt similarity, with a wider gap separating it from the baselines. For more details and a statistical significance analysis, please refer to the supplementary material.

4.3. Ablation Study

We conducted an ablation study for the following cases: (1) Without clustering — we omit the clustering step described in Section 3.1, and instead simply generate 5 images according to the input prompt. (2) Without LoRA — we reduce the optimizable representation Θ in the identity extraction stage, as described in Section 3.2, to consist of only the newly-added text tokens without the additional LoRA weights. (3) With re-initialization — instead of using the latest representation Θ in each of the optimization iterations, as described in Section 3.3, we re-initialize it in each iteration. (4) Single iteration — rather than iterating until convergence (Section 3.3), we stop after a single iteration.

As can be seen in Figure 6 (left), all of the above key components are crucial for achieving a consistent identity in the final result: (1) removing the clustering harms the identity extraction stage because the training set is too diverse, (2) reducing the representation causes underfitting, as the model does not have enough parameters to properly capture the identity, and (3) re-initializing the representation in each iteration, or (4) performing a single iteration, does not allow the model to converge to a single identity.

For a visual comparison of the ablation study, as well as a comparison of alternative feature extractors (DINOv1 [14] and CLIP [61]), please refer to the supplementary material.

4.4. Applications

As demonstrated in Figure 7, our method can be used for various down-stream tasks, such as: (a) Illustrating a story by breaking it into different scenes and using the same consistent character for all of them.
(b) Local text-driven image editing by integrating Blended Latent Diffusion [5, 7] — a consistent character can be injected into a specified location of a provided background image, in a novel pose specified by a text prompt. (c) Generating a consistent character with an additional pose control using ControlNet [97]. For more details, please refer to the supplementary material.

[Figure 7: application examples — (a) a story about "Jasper, a cute mink with a brown jacket and red pants" illustrated over four scenes; (b) local editing of an image and mask with "a Plasticine of a cute baby cat with big eyes" in poses such as "sitting", "jumping", "wearing sunglasses"; (c) pose-controlled generation of "a photo of a ginger woman with a long hair" from two input poses.]

Figure 7. Applications. Our method can be used for various applications: (a) Illustrating a full story with the same consistent character. (b) Local text-driven image editing via integration with Blended Latent Diffusion [5, 7]. (c) Generating a consistent character with an additional pose control via integration with ControlNet [97].

5. Limitations and Conclusions

We found our method to suffer from the following limitations: (a) Inconsistent identity — in some cases, our method is not able to converge to a fully consistent identity (without overfitting). As demonstrated in Figure 8(a), when trying to generate a portrait of a robot, our method generated robots with slightly different colors and shapes (e.g., different arms). This may result from a prompt that is too general, for which identity clustering (Section 3.1) is not able to find a sufficiently cohesive set. (b) Inconsistent supporting characters/elements — although our method is able to find a consistent identity for the character described by the input prompt, the identities of other characters related to the input character (e.g., their pet) might be inconsistent. For example, in Figure 8(b) the input prompt p to our method described only the girl, and when asked to generate the girl with her cat, different cats were generated. In addition, our framework does not support finding multiple concepts concurrently [6]. (c) Spurious attributes — we found that in some cases, our method binds additional attributes, which are not part of the input text prompt, to the final identity of the character. For example, in Figure 8(c), the input text prompt was "a sticker of a ginger cat"; however, our method added green leaves to the generated sticker, even though it was not asked to do so. This stems from the stochastic nature of the text-to-image model — the model added these leaves in some of the stickers generated during the identity clustering stage (Section 3.1), and the stickers containing the leaves happened to form the most cohesive set c_cohesive. (d) Significant computational cost — each iteration of our method involves generating a large number of images and learning the identity of the most cohesive cluster. It takes about 20 minutes to converge to a consistent identity. Reducing the computational cost is an appealing direction for further research.

[Figure 8: failure cases — (a) "a portrait of a round robot with glasses ..." with slight color and arm-shape changes across generations; (b) "a hyper-realistic digital painting of a happy girl, brown eyes...", generated "with her cat", where the cat differs between images; (c) spurious green leaves added to "a sticker of a ginger cat".]

Figure 8. Limitations. Our method suffers from the following limitations: (a) In some cases, our method is not able to converge to a fully consistent identity — notice the slight color and arm-shape changes. (b) Our method is not able to associate a consistent identity to a supporting character that may appear with the main extracted character; for example, our method generates different cats for the same girl. (c) Our method sometimes adds spurious attributes to the character that were not present in the text prompt. For example, it learns to associate green leaves with the cat sticker.

In conclusion, in this paper we offered the first fully automated solution to the problem of consistent character generation. We hope that our work will pave the way for future advancements, as we believe this technology of consistent character generation may have a disruptive effect on numerous sectors, including education, storytelling, entertainment, fashion, brand design, advertising, and more.
Acknowledgments. We thank Yael Pitch, Matan Cohen, Neal Wadhwa and Yaron Brodsky for their valuable help and feedback.

A. Additional Experiments

Below we provide additional experiments that were omitted from the main paper. In Appendix A.1 we provide additional comparisons and results of our method, and we demonstrate the nondeterministic nature of our method in Appendix A.2. Furthermore, in Appendix A.3 we compare our method against two naïve baselines. In addition, Appendix A.4 presents the results of our method using different feature extractors. Lastly, in Appendix A.5 we provide results that reduce the concerns of dataset memorization by our method.

A.1. Additional Comparisons and Results

In Figure 9 we provide a qualitative comparison on the automatically generated images. In Figure 11 we provide a qualitative comparison of the ablated cases. In Figure 12 we provide an additional qualitative comparison.

Concurrently to our work, the DALL·E 3 model [12] was commercially released as part of the paid ChatGPT Plus [53] subscription, enabling image generation in a conversational setting. We tried, using a conversation, to create a consistent character of a Plasticine cat, as demonstrated in Figure 10. As can be seen, the generated characters share only some of the characteristics (e.g., big eyes) but not all of them (e.g., colors, textures and shapes).

In addition, as demonstrated in Figure 13, our approach is applicable to consistent generation of a wide range of subjects, without the requirement for them to necessarily depict human characters or creatures. Figure 14 shows additional results of our method, demonstrating a variety of character styles. Lastly, in Figure 15 we demonstrate the ability to create a fully consistent "life story" of a character using our method.

A.2. Nondeterminism of Our Method

In Figures 16 and 17 we demonstrate the non-deterministic nature of our method. Using the same text prompt, we run our method multiple times with different initial seeds, thereby generating a different set of images for the identity clustering stage (Section 3.1 in the main paper). Consequently, the most cohesive cluster c_cohesive is different in each run, yielding different consistent identities. This behavior of our method is aligned with the one-to-many nature of our task — a single text prompt may correspond to many identities.

A.3. Naïve Baselines

As explained in Section 4.1 in the main paper, we compared our method against a version of TI [20] and LoRA DB [71] that were trained on a single image (with a single identity). Instead, we could generate a small set of 5 images for the given prompt (that are not guaranteed to be of the same identity), and use this mini dataset for the TI and LoRA DB baselines. As can be seen in Figure 18 and Figure 19, these baselines sacrifice identity consistency.

A.4. Additional Feature Extractors

Instead of using DINOv2 [54] features for the identity clustering stage (Section 3.1 in the main paper), we also experimented with two alternative feature extractors: DINOv1 [14] and the CLIP [61] image encoder. We quantitatively evaluate our method with each of these feature extractors in terms of identity consistency and prompt similarity, as explained in Section 4.1 in the main paper. As can be seen in Figure 20, DINOv1 produces higher identity consistency, while sacrificing prompt similarity, whereas CLIP achieves higher prompt similarity at the expense of identity consistency. Qualitatively, as demonstrated in Figure 21, we found the DINOv1 extractor to perform similarly to DINOv2, whereas CLIP produces results with a slightly lower identity consistency.

A.5. Dataset Non-Memorization

Our method is able to produce consistent characters, which raises the question of whether these characters already exist in the training data of the generative model. We employed SDXL [57] as our text-to-image model, whose training dataset is, regrettably, undisclosed in the paper [57]. Consequently, we relied on the most likely overlapping dataset, LAION-5B [73], which was also utilized by Stable Diffusion V2.

To probe for dataset memorization, we found the top 5 nearest neighbors in the dataset, in terms of CLIP [61] image similarity, for a few representative characters from our paper, using an open-source solution [68]. As demonstrated in Figure 22, our method does not simply memorize images from the LAION-5B dataset.

B. Implementation Details

In this section, we provide the implementation details that were omitted from the main paper. In Appendix B.1 we provide the implementation details of our method and the baselines. Then, in Appendix B.2 we provide the implementation details of the automatic metrics that we used to evaluate our method against the baselines. In Appendix B.3 we provide the implementation details and the statistical analysis for the user study we conducted. Lastly, in Appendix B.4 we provide the implementation details for the applications we presented.
[Figure 9: automatic qualitative comparison grids for three generated prompts — "a 2D animation of a captivating Arctic fox with fluffy fur, bright eyes, and nimble movements, bringing the magic of the icy wilderness to animated life", "A watercolor portrayal of a joyful child, radiating innocence and wonder with rosy cheeks and a genuine, wide-eyed smile", and "A 3D animation of a playful kitten, with bright eyes and a mischievous expression, embodying youthful curiosity and joy" — shown in contexts such as "drinking a beer", "with a city in the background", "eating a burger", "wearing a blue hat", "near the Statue of Liberty", and "as a police officer", across TI [20], LoRA DB [71], ELITE [90], BLIP-diff [42], IP-Adapter [93], and Ours.]

Figure 9. Automatic qualitative comparison to baselines. We compared our method against several baselines: TI [20], BLIP-diffusion [42] and IP-adapter [93] are able to correspond to the target prompt but fail to produce consistent results. LoRA DB [71] is able to achieve consistency, but it does not always follow the prompt; in addition, the character is generated in the same fixed pose. ELITE [90] struggles with following the prompt and also tends to generate deformed characters. On the other hand, our method is able to follow the prompt and generate consistent characters in different poses and viewing angles.
Table 1. Statistical analysis. We use Tukey's honestly significant difference procedure [83] to test whether the differences between mean scores in our user study are statistically significant.

Method 1              Method 2   Prompt similarity p-value   Identity consistency p-value
TI [20]               Ours       p < 0.001                   p < 5.9e−10
LoRA DB [71]          Ours       p < 6.95e−13                p < 0.0002
ELITE [90]            Ours       p < 6.92e−13                p < 5.96e−7
BLIP-Diffusion [42]   Ours       p < 0.01                    p < 6.92e−13
IP-Adapter [93]       Ours       p < 4.23e−5                 p < 6.92e−13

[Figure 10: the contexts "in the park", "reading a book", "at the beach", "holding an avocado", generated by DALL·E 3 [12] (top) and by our method (bottom).]

Figure 10. DALL·E 3 comparison. We attempted to create a consistent character using the commercial ChatGPT Plus system, for the given prompt "a Plasticine of a cute baby cat with big eyes". As can be seen, the generated characters share only some of the characteristics (e.g., big eyes) but not all of them (e.g., colors, textures and shapes).
B.1. Method Implementation Details

We based our method, and all the baselines except ELITE [90] and BLIP-diffusion [42], on Stable Diffusion XL (SDXL) [57], which is the state-of-the-art open-source text-to-image model at the time of writing this paper. We used the official ELITE implementation, which uses Stable Diffusion V1.4, and the official implementation of BLIP-diffusion, which uses Stable Diffusion V1.5. We could not switch these two baselines to the SDXL backbone, as their encoders were trained on these specific models. For the rest of the baselines, we used the same SDXL architecture and weights.

For our method, we generated a set of N = 128 images at each iteration, which we found empirically to be sufficient. We utilized the Adam optimizer [39] with a learning rate of 3e−5, β1 = 0.9, β2 = 0.99, and a weight decay of 1e−2. In each iteration of our method, we used 500 optimization steps. We also found empirically that we can set the convergence criterion d_conv adaptively to 80% of the average pairwise Euclidean distance between all N initial image embeddings of the first iteration. In most cases, our method converges in 1–2 iterations, which takes about 13–26 minutes on an NVIDIA A100 GPU when using bfloat16 mixed precision.

List of the third-party packages that we used:
• Official SDXL [57] implementation by HuggingFace Diffusers [86] at https://github.com/huggingface/diffusers
• Official SDXL LoRA DB implementation by HuggingFace Diffusers [86] at https://github.com/huggingface/diffusers
• Official ELITE [90] implementation at https://github.com/csyxwei/ELITE
• Official BLIP-diffusion [42] implementation at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion
• Official IP-adapter [93] implementation at https://github.com/tencent-ailab/IP-Adapter
• DINOv2 [54] ViT-g/14, DINOv1 [14] ViT-B/16 and CLIP [61] ViT-L/14 implementations by HuggingFace Transformers [91] at https://github.com/huggingface/transformers
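For illustration, the optimizer configuration and the adaptive convergence threshold described above could be set up as follows; trainable_params (the LoRA weights plus the new token embeddings) and the first-iteration gallery embeddings are assumed inputs, and the snippet is a sketch rather than the released training code.

import numpy as np
import torch

def make_optimizer(trainable_params):
    # Adam with the hyper-parameters reported above.
    return torch.optim.Adam(trainable_params, lr=3e-5,
                            betas=(0.9, 0.99), weight_decay=1e-2)

def adaptive_convergence_threshold(initial_embeddings, ratio=0.8):
    """d_conv = 80% of the mean pairwise Euclidean distance between the
    N image embeddings of the first iteration (an (N, d) array)."""
    diffs = initial_embeddings[:, None, :] - initial_embeddings[None, :, :]
    return ratio * np.sqrt((diffs ** 2).sum(-1)).mean()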
B.2. Automatic Metrics Implementation Details

In order to automatically evaluate our method and the baselines quantitatively, we instructed ChatGPT [53] to generate prompts for characters of different types (e.g., animals, creatures, objects, etc.) in different styles (e.g., stickers, animations, photorealistic images, etc.). These prompts were then used to generate a set of consistent characters by our method and by each of the baselines. Next, these characters were generated in a predefined collection of novel contexts from the following list:
• "a photo of [v] at the beach"
• "a photo of [v] in the jungle"
• "a photo of [v] in the snow"
• "a photo of [v] in the street"
• "a photo of [v] with a city in the background"
• "a photo of [v] with a mountain in the background"
• "a photo of [v] with the Eiffel Tower in the background"
• "a photo of [v] near Statue of Liberty"
• "a photo of [v] near Sydney Opera House"
• "a photo of [v] floating on top of water"
• "a photo of [v] eating a burger"
• "a photo of [v] drinking a beer"
• "a photo of [v] wearing a blue hat"
• "a photo of [v] wearing sunglasses"
• "a photo of [v] playing a ball"
• "a photo of [v] as a police officer"
where [v] is the newly-added token that represents the consistent character.
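A small helper for instantiating these evaluation prompts might look as follows; the placeholder string used for the learned token is an assumption made for illustration.

CONTEXTS = [
    "at the beach", "in the jungle", "in the snow", "in the street",
    "with a city in the background", "with a mountain in the background",
    "with the Eiffel Tower in the background", "near Statue of Liberty",
    "near Sydney Opera House", "floating on top of water", "eating a burger",
    "drinking a beer", "wearing a blue hat", "wearing sunglasses",
    "playing a ball", "as a police officer",
]

def evaluation_prompts(token="<v>"):
    """Substitute the learned character token into the context templates."""
    return [f"a photo of {token} {context}" for context in CONTEXTS]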
B.3. User Study Details

As explained in Section 4.2 in the main paper, we conducted a user study to evaluate our method, using the Amazon Mechanical Turk (AMT) platform [2]. We used the same generated prompts and samples that were used in Section 4.1 in the main paper, and asked the evaluators to rate the prompt similarity and identity consistency of each result on a Likert scale of 1–5. For ranking the prompt similarity, the evaluators were instructed as follows: "For each of the following images, please rank on a scale of 1 to 5 its correspondence to this text description: {PROMPT}. The character in the image can be anything (e.g., a person, an animal, a toy etc.)", where {PROMPT} is the target text prompt (in which we replaced the special token with the word "character"). All the baselines, as well as our method, were presented on the same page, and the evaluators were asked to rate each one of the results using a slider from 1 ("Do not match at all") to 5 ("Match perfectly"). Next, for assessing the identity consistency, we took, for each one of the characters, two generated images that correspond to different target text prompts, put them next to each other, and instructed the evaluators as follows: "For each of the following image pairs, please rank on a scale of 1 to 5 if they contain the same character (1 means that they contain totally different characters and 5 means that they contain exactly the same character). The images can have different backgrounds". We put all the compared images on the same page, and the evaluators were asked to rate each one of the pairs using a slider from 1 ("Totally different characters") to 5 ("Exactly the same character").

We collected three ratings per question, resulting in 1104 ratings per task (prompt similarity and identity consistency). The time allotted per task was one hour, to allow the raters to properly evaluate the results without time pressure. The means and variances of the user study responses are reported in Table 2.

Table 2. Users' ranking means and variances: the means and variances of the rankings reported in the user study.

Method                Prompt similarity (↑)   Identity consistency (↑)
TI [20]               3.31 ± 1.43             3.17 ± 1.17
LoRA DB [71]          3.03 ± 1.43             3.67 ± 1.2
ELITE [90]            2.87 ± 1.46             3.2 ± 1.21
BLIP-Diffusion [42]   3.35 ± 1.41             2.76 ± 1.31
IP-Adapter [93]       3.25 ± 1.42             2.99 ± 1.28
Ours                  3.3 ± 1.36              3.48 ± 1.2

In addition, we conducted a statistical analysis of our user study by validating that the difference between all the conditions is statistically significant using the Kruskal-Wallis [40] test (p < 3.85e−28 for the prompt similarity test and p < 1.07e−76 for the identity consistency test). Lastly, we used Tukey's honestly significant difference procedure [83] to show that the comparison of our method against all the baselines is statistically significant, as detailed in Table 1.
baselines is statistically significant, as detailed in Table 1.
ated content detection systems, making it a compelling and
B.4. Applications Implementation Details appealing research direction to address.
In Section 4.4 in the main paper, we presented three down-
stream applications of our method.
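As a rough illustration of reusing a distilled identity for story illustration with the HuggingFace diffusers library, consider the sketch below. The SDXL checkpoint name is the public base model; the LoRA path, the placeholder token <v>, and the assumption that the learned token embeddings have already been installed into both SDXL text encoders are choices made for this sketch, not details taken from the paper.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load the LoRA weights produced by the identity-extraction stage; the two
# learned "<v>" text embeddings are assumed to already be registered with the
# pipeline's tokenizers and text encoders.
pipe.load_lora_weights("path/to/the-chosen-one-lora")  # hypothetical path

scenes = [
    "<v> jogging on the beach",
    "<v> drinking coffee with his friend in the heart of New York City",
    "<v> reviewing a paper in his cozy apartment",
]
for i, prompt in enumerate(scenes):
    pipe(prompt).images[0].save(f"scene_{i}.png")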

C. Societal Impact

We believe that the emergence of technology that facilitates the effortless creation of consistent characters holds exciting promise in a variety of creative and practical applications. It can empower storytellers and content creators to bring their narratives to life with vivid and unique characters, enhancing the immersive quality of their work. In addition, it may offer accessibility to those who may not possess traditional artistic skills, democratizing character design in the creative industry. Furthermore, it can reduce the cost of advertising, and open up new opportunities for small and underprivileged entrepreneurs, enabling them to reach a wider audience and compete in the market more effectively.

On the other hand, as any other generative AI technology, it can be misused by creating false and misleading visual content for deceptive purposes. Creating fake characters or personas can be used for online scams, disinformation campaigns, etc., making it challenging to discern genuine information from fabricated content. Such technologies underscore the vital importance of developing generated content detection systems, making it a compelling and appealing research direction to address.
[Figure 11: ablation comparison grids for the same three generated prompts as in Figure 9 (the Arctic fox animation, the watercolor child, and the 3D kitten), shown in contexts such as "drinking a beer", "with a city in the background", "eating a burger", "wearing a blue hat", "near the Statue of Liberty", and "as a police officer", across the ablated variants (single iteration, without clustering, without LoRA, with re-initialization) and our full method.]

Figure 11. Automatic qualitative comparison of ablations. We ablated the following components of our method: using a single iteration, removing the clustering stage, removing the LoRA trainable parameters, and using the same initial representation at every iteration. As can be seen, all these ablated cases struggle with preserving the character's consistency.
[Figure 12: additional qualitative comparison grids for three characters — "an oil painting of a man mustache and a large hat", "a Plasticine of a cute baby cat with big eyes", and "a rendering of a cute turtle, cozy lighting ..." — shown in contexts such as "in the desert", "taking a picture with his phone", "working on his laptop", "eating a burger", "celebrating in a party", and "in the forest", across TI [20], LoRA DB [71], ELITE [90], BLIP-diff [42], IP-Adapter [93], and Ours.]

Figure 12. Additional qualitative comparisons to baselines. We compared our method against several baselines: TI [20], BLIP-diffusion [42] and IP-adapter [93] are able to correspond to the target prompt but fail to produce consistent results. LoRA DB [71] is able to achieve consistency, but it does not always follow the prompt; in addition, the character is generated in the same fixed pose. ELITE [90] struggles with following the prompt and also tends to generate deformed characters. On the other hand, our method is able to follow the prompt and generate consistent characters in different poses and viewing angles.
[Figure 13: consistent generations of non-character objects — "a photo of a bottle of water", "a photo of a blue car", "a photo of a purple bag", and "a photo of a green bowl" — in the contexts "in the desert", "in Times Square", "near a lake", "near the Eiffel Tower", and "near the Taj Mahal".]

Figure 13. Consistent generation of non-character objects. Our approach is applicable to a wide range of objects, without the requirement for them to depict human characters or creatures.
[Figure 14: additional results for the prompts "a portrait of a woman with a large hat in a scenic environment, fauvism", "a 3D animation of a happy pig", "a sticker of a ginger cat", and "a purple astronaut with human face, digital art, smooth, sharp focus, vector art", each shown "in the park", "reading a book", "at the beach", and "holding an avocado".]

Figure 14. Additional results. Our method is able to consistently generate different types and styles of characters, e.g., paintings, animations, stickers and vector art.
[Figure 15: a "life story" sequence for a single character, including contexts such as "as a baby", "as a small child", "as a teenager", "with his first girlfriend", "before the prom", "as a soldier", "in the college campus", "sitting in a lecture", "playing football", "drinking a beer", "studying in his room", "happy with his accepted paper", "giving a talk in a conference", "graduating from college", "a profile picture", "working in a coffee shop", "in his wedding", "with his small child", "as a 50 years old man", "as a 70 years old man", and in the styles "a watercolor painting", "a pencil sketch", "a rendered avatar", "a 2D animation", "a graffiti".]

Figure 15. Life story. Given a text prompt describing a fictional character, "a photo of a man with short black hair", we can generate a consistent life story for that character, demonstrating the applicability of our method for story generation.
[Figure 16: multiple runs of our method for the same prompt, each shown "in the park", "reading a book", "at the beach", and "holding an avocado".]

Figure 16. Non-determinism. By running our method multiple times, given the same prompt "a photo of a 50 years old man with curly hair", but using different initial seeds, we obtain different consistent characters corresponding to the text prompt.
[Figure 17: multiple runs of our method for the same prompt, each shown "in the park", "reading a book", "at the beach", and "holding an avocado".]

Figure 17. Non-determinism. By running our method multiple times, given the same prompt "a Plasticine of a cute baby cat with big eyes", but using different initial seeds, we obtain different consistent characters corresponding to the text prompt.
[Figure 18: qualitative comparison grids for the naïve baselines (TI mini and LoRA DB mini, trained on 5 generated images) and our method, for the same three generated prompts as in Figure 9, in contexts such as "drinking a beer", "with a city in the background", "eating a burger", "wearing a blue hat", "near the Statue of Liberty", and "as a police officer".]

Figure 18. Qualitative comparison of the naïve baselines. We tested two additional naïve baselines against our method: TI [20] and LoRA DB [71] that were trained on a small dataset of 5 generated images. As can be seen, both of these baselines sacrifice identity consistency.
0.85 Ours
Automatic identity consistency (↑)

0.8

TI multi

0.75

LoRA DB multi

0.17 0.18 0.19 0.2


Automatic prompt similarity (↑)

Figure 19. Comparison of naı̈ve baselines. We tested two naı̈ve


baselines against our method: TI [20] and LoRA DB [71] that were
trained on a small dataset of 5 generated images. Our automatic
testing procedure, described in Section 4.1 in the main paper, mea-
sures identity consistency and prompt similarity. As can be seen,
both of these baselines, sacrificing identity consistency.

Ours w DINOv1
Automatic identity consistency (↑)

0.86

Ours w DINOv2

0.84

0.82
Ours w CLIP

0.16 0.17 0.17 0.18


Automatic prompt similarity (↑)

Figure 20. Comparison of feature extractors. We tested two additional feature extractors in our method: DINOv1 [14] and CLIP [61]. Our automatic testing procedure, described in Section 4.1 in the main paper, measures identity consistency and prompt similarity. As can be seen, DINOv1 produces higher identity consistency by sacrificing prompt similarity, while CLIP results in higher prompt similarity at the expense of lower identity consistency. In practice, however, the DINOv1 results are similar to those obtained with DINOv2 features in terms of prompt adherence (see Figure 21).
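The feature extractor is a drop-in component of the clustering stage, which is what makes this comparison possible. Below is a minimal sketch of how the three backbones could be loaded interchangeably, assuming the public torch.hub DINO/DINOv2 checkpoints and an open_clip ViT-B/32; the exact checkpoints and preprocessing in our experiments may differ.

```python
import torch
import open_clip
from torchvision import transforms

# ImageNet-style preprocessing for the DINO backbones (224 is divisible by both
# patch sizes, 16 and 14); CLIP ships with its own preprocessing transform.
IMAGENET_TF = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def load_feature_extractor(name: str):
    """Return (embed_fn, preprocess) for the requested backbone."""
    if name == "dinov1":
        model = torch.hub.load("facebookresearch/dino:main", "dino_vitb16").eval()
        return (lambda x: model(x)), IMAGENET_TF
    if name == "dinov2":
        model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
        return (lambda x: model(x)), IMAGENET_TF
    if name == "clip":
        model, _, preprocess = open_clip.create_model_and_transforms(
            "ViT-B-32", pretrained="laion2b_s34b_b79k")
        model.eval()
        return (lambda x: model.encode_image(x)), preprocess
    raise ValueError(f"unknown feature extractor: {name}")

# Usage: the embeddings feed the same clustering step regardless of backbone, e.g.
# embed, preprocess = load_feature_extractor("dinov2")
# feats = torch.cat([embed(preprocess(im).unsqueeze(0)) for im in images], dim=0)
```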

[Figure 21 grid. Columns: Ours with CLIP, Ours with DINOv1, Ours. Rows: same characters and contexts as in Figure 18 (the Arctic fox “drinking a beer” and “with a city in the background”; the watercolor child “eating a burger” and “wearing a blue hat”; the 3D kitten “as a police officer” and “near the Statue of Liberty”).]
Figure 21. Comparison of feature extractors. We experimented with two additional feature extractors in our method: DINOv1 [14] and CLIP [61]. As can be seen, the DINOv1 results are qualitatively similar to those obtained with DINOv2, whereas CLIP produces results with a slightly lower identity consistency.
[Figure 22 layout: each row shows a generated character alongside its top 5 nearest neighbors.]

Figure 22. Dataset non-memorization. We found the top 5 nearest neighbors in the LAION-5B dataset [73], in terms of CLIP [61] image
similarity, for a few representative characters from our paper, using an open-source solution [68]. As can be seen, our method does not
simply memorize images from the LAION-5B dataset.
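Once candidate neighbors are retrieved (in our case via the prebuilt LAION-5B index of [68]), the ranking itself reduces to CLIP image-image cosine similarity. The sketch below is an illustrative local re-implementation of that ranking, assuming the Hugging Face `transformers` CLIP checkpoint `openai/clip-vit-base-patch32`.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_embed(images: list[Image.Image]) -> torch.Tensor:
    inputs = proc(images=images, return_tensors="pt")
    return F.normalize(model.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def top_k_neighbors(query: Image.Image, candidates: list[Image.Image], k: int = 5):
    """Rank candidate images by CLIP cosine similarity to the generated character."""
    q = clip_image_embed([query])       # (1, d)
    c = clip_image_embed(candidates)    # (N, d)
    sims = (c @ q.T).squeeze(1)         # (N,)
    scores, idx = sims.topk(min(k, len(candidates)))
    return [(candidates[i], s.item()) for i, s in zip(idx.tolist(), scores)]
```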

References

[1] Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. A neural space-time representation for text-to-image personalization. ArXiv, abs/2305.15391, 2023. 3
[2] Amazon. Amazon mechanical turk. https://www.mturk.com/, 2023. 7, 11
[3] Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, and Amit H Bermano. Domain-agnostic tuning-encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06925, 2023. 3
[4] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007. 4
[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, 2022. 2, 8, 12
[6] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. ArXiv, abs/2305.16311, 2023. 3, 4, 5, 8
[7] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM Trans. Graph., 42(4), 2023. 2, 8, 12
[8] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18370–18380, 2023. 2
[9] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. ArXiv, abs/2211.01324, 2022. 2
[10] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European conference on computer vision, pages 707–723. Springer, 2022. 2
[11] Sagie Benaim, Frederik Warburg, Peter Ebert Christensen, and Serge J. Belongie. Volumetric disentanglement for 3d scene manipulation. ArXiv, abs/2206.02776, 2022. 2
[12] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. 2023. 9, 11
[13] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, 2023. 2
[14] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021. 7, 9, 11, 21, 22
[15] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42:1 – 10, 2023. 2
[16] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Rui, Xuhui Jia, Ming-Wei Chang, and William W. Cohen. Subject-driven text-to-image generation via apprenticeship learning. ArXiv, abs/2304.00186, 2023. 3
[17] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. ArXiv, abs/2307.09481, 2023. 3
[18] Guillaume Couairon, Marlene Careil, Matthieu Cord, Stéphane Lathuilière, and Jakob Verbeek. Zero-shot spatial layout conditioning for text-to-image diffusion models. ArXiv, abs/2306.13754, 2023. 2
[19] Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. Scenescape: Text-driven consistent scene generation. ArXiv, abs/2302.01133, 2023. 2
[20] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2022. 2, 3, 5, 6, 9, 10, 11, 12, 14, 20, 21
[21] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023. 3
[22] Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang. Expressive text-to-image generation with rich text. ArXiv, abs/2304.06720, 2023. 2
[23] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023. 2
[24] Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, and Yujiu Yang. TaleCrafter: interactive story visualization with multiple characters. ArXiv, abs/2305.18247, 2023. 2, 3
[25] Ori Gordon, Omri Avrahami, and Dani Lischinski. Blended-nerf: Zero-shot object generation and blending in existing neural radiance fields. ArXiv, abs/2306.12760, 2023. 2
[26] Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris N. Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning. ArXiv, abs/2303.11305, 2023. 3
[27] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 2
[28] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2328–2337, 2023. 2
[29] Geoffrey E. Hinton and Sam T. Roweis. Stochastic neighbor embedding. In NIPS, 2002. 4
[30] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020. 2
[31] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. ArXiv, abs/2303.11989, 2023. 2
[32] Eliahu Horwitz and Yedid Hoshen. Conffusion: Confidence intervals for diffusion models. ArXiv, abs/2211.09795, 2022. 3
[33] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021. 3, 5
[34] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021. 5
[35] Shira Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for semantic typography. ACM Transactions on Graphics (TOG), 42:1 – 11, 2023. 3
[36] Hyeonho Jeong, Gihyun Kwon, and Jong-Chul Ye. Zero-shot generation of coherent storybook from plain text story using diffusion models. ArXiv, abs/2302.03900, 2023. 2, 3
[37] Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han-Ying Zhang, Boqing Gong, Tingbo Hou, H. Wang, and Yu-Chuan Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. ArXiv, abs/2304.02642, 2023. 3
[38] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023. 2
[39] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 11
[40] William H. Kruskal and Wilson Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47:583–621, 1952. 12
[41] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023. 3
[42] Dongxu Li, Junnan Li, and Steven C. H. Hoi. BLIP-Diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. ArXiv, abs/2305.14720, 2023. 3, 5, 6, 10, 11, 12, 14
[43] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization. CVPR, 2019. 3
[44] Shaoteng Liu, Yuecheng Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. ArXiv, abs/2303.04761, 2023. 2
[45] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023. 2
[46] Adyasha Maharana, Darryl Hannan, and Mohit Bansal. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. In European Conference on Computer Vision, pages 70–87. Springer, 2022. 3
[47] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021. 2
[48] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12663–12673, 2023. 2
[49] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023. 2
[50] Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Y. Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors. ArXiv, abs/2302.01329, 2023. 2
[51] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023. 2
[52] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, 2021. 2
[53] OpenAI. ChatGPT. https://chat.openai.com/, 2022. Accessed: 2023-10-15. 5, 9, 11
[54] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Q. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russ Howes, Po-Yao (Bernie) Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Huijiao Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. ArXiv, abs/2304.07193, 2023. 4, 9, 11
[55] Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, and Daniel Cohen-Or. Localizing object-level shape variations with text-to-image diffusion models. ArXiv, abs/2303.11306, 2023. 2
[56] Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Bjorn Ommer, Christian Theobalt, Peter Wonka, and Gordon Wetzstein. State of the art on diffusion models for visual computing. ArXiv, abs/2310.07204, 2023. 2
[57] Dustin Podell, Zion English, Kyle Lacey, A. Blattmann, Tim Dockhorn, Jonas Muller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. ArXiv, abs/2307.01952, 2023. 2, 5, 9, 11
[58] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 2
[59] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023. 2
[60] Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H. Bermano, and Daniel Cohen-Or. Single motion diffusion. ArXiv, abs/2302.05905, 2023. 3
[61] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021. 5, 6, 7, 9, 11, 21, 22, 23
[62] Tanzila Rahman, Hsin-Ying Lee, Jian Ren, S. Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2493–2502, 2022. 2, 3
[63] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022. 2, 3
[64] reddit.com. How to create consistent character faces without training (info in the comments) : Stablediffusion. https://www.reddit.com/r/StableDiffusion/comments/12djxvz/how_to_create_consistent_character_faces_without/, 2023. 2, 3
[65] reddit.com. 8 ways to generate consistent characters (for comics, storyboards, books etc) : Stablediffusion. https://www.reddit.com/r/StableDiffusion/comments/10yxz3m/8_ways_to_generate_consistent_characters_for/, 2023. 2, 3
[66] Elad Richardson, Kfir Goldberg, Yuval Alaluf, and Daniel Cohen-Or. Conceptlab: Creative generation using diffusion prior constraints. arXiv preprint arXiv:2308.02669, 2023. 3
[67] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. ACM SIGGRAPH 2023 Conference Proceedings, 2023. 3
[68] Romain Beaumont. Clip retrieval. https://github.com/rom1504/clip-retrieval, 2023. 9, 23
[69] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021. 2
[70] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023. 2, 3, 5
[71] Simo Ryu. Low-rank adaptation for fast text-to-image diffusion fine-tuning. https://github.com/cloneofsimo/lora, 2022. 3, 5, 6, 9, 10, 11, 12, 14, 20, 21
[72] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. 2
[73] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models. ArXiv, abs/2210.08402, 2022. 9, 23
[74] Etai Sella, Gal Fiebelman, Peter Hedman, and Hadar Averbuch-Elor. Vox-e: Text-guided voxel editing of 3d objects. ArXiv, abs/2303.12048, 2023. 2
[75] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. knn-diffusion: Image generation via large-scale retrieval. In The Eleventh International Conference on Learning Representations, 2022. 2
[76] Jing Shi, Wei Xiong, Zhe L. Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. ArXiv, abs/2304.03411, 2023. 3
[77] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. 2
[78] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[79] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. 2
[80] Gábor Szűcs and Modafar Al-Shouha. Modular storygan with background and theme awareness for story visualization. In International Conference on Pattern Recognition and Artificial Intelligence, pages 275–286. Springer, 2022. 3
[81] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model. ArXiv, abs/2209.14916, 2022. 3
[82] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization. ACM SIGGRAPH 2023 Conference Proceedings, 2023. 3
[83] John W. Tukey. Comparing individual means in the analysis of variance. Biometrics, 5 2:99–114, 1949. 11, 12
[84] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023. 2
[85] Yael Vinker, Andrey Voynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration. ArXiv, abs/2305.18203, 2023. 3
[86] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022. 11
[87] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. arXiv preprint arXiv:2211.13752, 2022. 2
[88] Andrey Voynov, Q. Chu, Daniel Cohen-Or, and Kfir Aberman. P+: Extended textual conditioning in text-to-image generation. ArXiv, abs/2303.09522, 2023. 3
[89] Yuxiang Wei. Official implementation of ELITE. https://github.com/csyxwei/ELITE, 2023. Accessed: 2023-05-01. 5
[90] Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. ArXiv, abs/2302.13848, 2023. 3, 5, 6, 10, 11, 12, 14
[91] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics. 11
[92] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. ArXiv, abs/2306.07954, 2023. 2
[93] Hu Ye, Jun Zhang, Siyi Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. ArXiv, abs/2308.06721, 2023. 3, 5, 6, 10, 11, 12, 14
[94] youtube.com. How to create consistent characters in midjourney. https://www.youtube.com/watch?v=Z7_ta3RHijQ, 2023. 3
[95] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022. 2
[96] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In-So Kweon. Text-to-image diffusion models in generative ai: A survey. ArXiv, abs/2303.07909, 2023. 2
[97] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023. 2, 8, 12
[98] Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, and Guanbin Li. Dreameditor: Text-driven 3d scene editing with neural fields. ArXiv, abs/2306.13455, 2023. 2