A Prompt Log Analysis of Text-to-Image Generation Systems

ABSTRACT
Recent developments in large language models (LLMs) and generative AI have unleashed the astonishing capabilities of text-to-image generation systems to synthesize high-quality images that are faithful to a given reference text, known as a "prompt". These systems have immediately received much attention from researchers, creators, and common users. Despite the many efforts to improve the generative models, there is limited work on understanding the information needs of the users of these systems at scale. We conduct the first comprehensive analysis of large-scale prompt logs collected from multiple text-to-image generation systems. Our work is analogous to analyzing the query logs of Web search engines, a line of work that has made critical contributions to the success of the Web search industry and research. Compared with Web search queries, text-to-image prompts are significantly longer; they are often organized into special structures that consist of the subject, form, and intent of the generation tasks, and they present unique categories of information needs. Users make more edits within creation sessions, which present remarkable exploratory patterns. There is also a considerable gap between the user-input prompts and the captions of the images included in the open training data of the generative models. Our findings provide concrete implications on how to improve text-to-image generation systems for creation purposes.

KEYWORDS
Text-to-Image Generation, AI-Generated Content (AIGC), AI for Creativity, Prompt Analysis, Query Log Analysis.

ACM Reference Format:
Yutong Xie, Zhaoying Pan, Jinge Ma, Luo Jie, and Qiaozhu Mei. 2023. A Prompt Log Analysis of Text-to-Image Generation Systems. In Proceedings of the ACM Web Conference 2023 (WWW '23), April 30-May 4, 2023, Austin, TX, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3543507.3587430

∗ These authors contributed equally to this research.

1 INTRODUCTION
Recent developments in large language models (LLMs) (e.g., GPT-3 [4], PaLM [6], LLaMA [38], and GPT-4 [24]) and generative AI (especially the diffusion models [13, 37]) have enabled the astonishing image synthesis capabilities of text-to-image generation systems, such as DALL·E [29, 30], Midjourney [20], latent diffusion models (LDMs) [32], Imagen [33], and Stable Diffusion [32]. As these systems are able to produce images of high quality that are faithful to a given reference text (known as a "prompt"), they have immediately become a new source of creativity [25] and attracted a great number of creators, researchers, and common users. As a major prototype of generative AI, these systems are widely believed to be introducing fundamental changes to the creative work of humans [9].

While plenty of effort has gone into improving the performance of the underlying generative models, there is limited work on analyzing the information needs of the real users of these text-to-image systems, even though it is crucial to understand the objectives and workflows of the creators and to identify the gaps in how well the current systems facilitate the creators' needs.

In this paper, we take the initiative to investigate the information needs of text-to-image generation by conducting a comprehensive analysis of millions of user-input prompts in multiple popular systems, including Midjourney, Stable Diffusion, and LDMs. Our analysis is analogous to query log analysis of search engines, a line of work that has inspired many developments of modern information retrieval (IR) research and industry [3, 12, 14, 36, 44]. In this analogy, a text-to-image generation system is compared to a search engine, the pretrained large language model is compared to the search index, a user-input prompt can be compared to a search query that describes the user's information need, and a text-to-image generation model can be compared to the search or ranking algorithm that generates (rather than retrieves) one or multiple pieces of content (images) to fulfill the user's need (Table 1).

Through a large-scale analysis of the prompt logs, we aim to answer the following questions: (1) How do users describe their information needs in the prompts? (2) How do the information needs in text-to-image generation compare with those in Web search? (3) How are users' information needs satisfied? (4) How are users' information needs covered by the image captions in open datasets?

The results of our analysis suggest that (1) text-to-image prompts are usually structured with terms that describe the subject, the form, and the intent of the image to be created (Sec. 4); (2) text-to-image prompts are sufficiently different from Web search queries.
Table 2 lists the basic statistics of the datasets. In the raw data, one input prompt can correspond to multiple generated images and create multiple data entries for the same input. We remove these duplicates while preserving repeated inputs from users. More details about the data and data processing are described in Appendix A.

Table 2: Statistics of the Midjourney, DiffusionDB, and SAC datasets. The values (except for the raw number of records) are calculated after data processing.

4 PROMPT LOG ANALYSIS
We analyze the prompts in the datasets and aim to answer the four questions mentioned in Section 1.

4.1 How do Users Describe Information Needs?
We first investigate how users describe their information needs
by exploring the structures of prompts. We start by analyzing the usage of terms (tokens or words) in prompts. We conduct a first-order analysis that focuses on term frequency, followed by a second-order analysis that focuses on co-occurring term pairs. The significance of a term pair is measured with the $\chi^2$ metric [1, 36]:

$$\chi^2(a, b) = \frac{[E(ab) - O(ab)]^2}{E(ab)} + \frac{[E(a\bar{b}) - O(a\bar{b})]^2}{E(a\bar{b})} + \frac{[E(\bar{a}b) - O(\bar{a}b)]^2}{E(\bar{a}b)} + \frac{[E(\bar{a}\bar{b}) - O(\bar{a}\bar{b})]^2}{E(\bar{a}\bar{b})}, \quad (1)$$

where a, b are two terms, O(ab) is the number of prompts in which they co-occur, E(ab) is the expected number of co-occurrences under the independence assumption, and ā, b̄ stand for the absence of a and b.

In Table 3, we list the most frequent terms, measured by the number of text-to-image prompts they appear in. The most significant term pairs are listed in Table 4.

Table 3: Most frequent terms used in prompts.

Table 4: Most significant term pairs used in prompts.

Rank  Midjourney pair         Score   DiffusionDB pair          Score   SAC pair                 Score
1     (norman, rockwell)      0.285   (donald, trump)           0.345   (matsunuma, shingo)      1.000
2     (fenghua, zhong)        0.250   (emma, watson)            0.321   (lisa, mona)             1.000
3     (ngai, victo)           0.240   (biden, joe)              0.283   (elon, musk)             1.000
4     (makoto, shinkai)       0.237   (shinkawa, yoji)          0.266   (ariel, perez)           1.000
5     (ray, trace)            0.125   (blade, runner)           0.255   (angeles, los)           1.000
6     (fiction, science)      0.123   (katsuhiro, otomo)        0.238   (bradley, noah)          1.000
7     (anderson, wes)         0.106   (contest, winner)         0.237   (hayao, miyazaki)        1.000
8     (11:17, circa)          0.074   (takato, yamamoto)        0.236   (finnian, macmanus)      1.000
9     (jia, ruan)             0.071   (", ")                    0.216   (bartlett, bo)           1.000
10    (cushart, krenz)        0.070   (mead, syd)               0.130   (hasui, kawase)          0.500
11    (shinkawa, yoji)        0.062   (akihiko, yoshida)        0.123   (daniela, uhlig)         0.332
12    (albert, bierstadt)     0.060   (elvgren, gil)            0.114   (edlin, tyler)           0.318
13    (katsuhiro, otomo)      0.057   (new, york)               0.114   (jurgens, mandy)         0.286
14    ([, ])                  0.053   (gi, jung)                0.106   (bacon, francis)         0.286
15    (annie, leibovitz)      0.052   (dore, gustave)           0.103   (araki, hirohiko)        0.258
16    (adams, ansel)          0.045   (star, wars)              0.092   (radke, scott)           0.257
17    (mignola, mike)         0.043   (fiction, science)        0.087   (ca', n't)               0.252
18    (1800s, tintype)        0.036   (league, legends)         0.082   (card, tarot)            0.201
19    (dore, gustave)         0.036   (rule, thirds)            0.074   (claude, monet)          0.190
20    (adams, tintype)        0.029   (ngai, victo)             0.061   (gogh, van)              0.180
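To make the second-order analysis concrete, the following Python sketch scores co-occurring term pairs with Eq. (1). This is an illustrative re-implementation rather than the paper's released code; the `prompts` input (a list of tokenized prompts) and the `min_count` cutoff are assumptions.

```python
# A minimal sketch of the term-pair significance analysis of Eq. (1).
from collections import Counter
from itertools import combinations

def chi_square_pairs(prompts, min_count=5):
    """Score co-occurring term pairs with the chi^2 metric."""
    n = len(prompts)
    term_df = Counter()   # number of prompts each term appears in
    pair_df = Counter()   # number of prompts each (sorted) pair co-occurs in
    for tokens in prompts:
        terms = set(tokens)
        term_df.update(terms)
        pair_df.update(combinations(sorted(terms), 2))

    scores = {}
    for (a, b), o_ab in pair_df.items():
        if o_ab < min_count:
            continue
        # Observed counts of the 2x2 contingency table over prompts.
        o_a_nb = term_df[a] - o_ab            # a present, b absent
        o_na_b = term_df[b] - o_ab            # a absent, b present
        o_na_nb = n - term_df[a] - term_df[b] + o_ab
        p_a, p_b = term_df[a] / n, term_df[b] / n
        # Expected counts under the independence assumption.
        expected = [n * p_a * p_b, n * p_a * (1 - p_b),
                    n * (1 - p_a) * p_b, n * (1 - p_a) * (1 - p_b)]
        observed = [o_ab, o_a_nb, o_na_b, o_na_nb]
        scores[(a, b)] = sum((e - o) ** 2 / e
                             for e, o in zip(expected, observed) if e > 0)
    return scores
```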
4.1.1 Words in prompts describe subjects, forms, and intents. In Art, a piece of work is typically described with three basic components: subject, form, and content. In general, the subject defines "what" (the topic or focus); the form confines "how" (the development, composition, or substantiation); and the content articulates "why" (the intention or meaning) [22]. We are able to relate the terms in a text-to-image prompt to these three basic components. Note that the subject, form, and content of a work of art are often intertwined with each other. For example, a term describing the subject might also be related to the form or content, and vice versa.

Subject. A prompt often contains terms describing its topic or focus, referred to as the subject, which can be a person, an object, or a theme [26, 27]. Among the 50 most frequent terms of all three datasets (parts of them listed in Table 3), we discover 9 terms related to the subject.

Table 6: Statistics of prompt lengths.

Text-to-image prompts are significantly longer than Web search queries (whose average length is 5.0) [44], likely due to the highly specialized and complex nature of the creation tasks.
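Computing the statistics reported in Table 6 is straightforward. A minimal sketch, assuming `tokenized` maps each dataset name to its list of tokenized prompts (the variable name is an assumption):

```python
# Hypothetical input: {"Midjourney": [["hands", ",", "by", ...], ...], ...}
import statistics

def prompt_length_stats(tokenized):
    """Average and median prompt lengths (in terms) per dataset."""
    stats = {}
    for name, prompts in tokenized.items():
        lengths = [len(tokens) for tokens in prompts]
        stats[name] = {
            "avg": statistics.mean(lengths),
            "median": statistics.median(lengths),
        }
    return stats
```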
We then examine the distribution of prompt frequencies. From Figure 2, we find that the prompt frequency distribution of the larger dataset, DiffusionDB, does follow Zipf's law (except for the very top-ranked prompts), similar to the queries of Web and vertical search engines [36, 41, 44]. The most frequently used prompts are listed in Table 5. Interestingly, many of the top-ranked prompts are (1) lengthy and (2) only used by a few users. This indicates that although the prompt frequency distribution is similar to that of Web search, the underlying mechanism may be different (shorter Web queries are more frequent and shared by more users [36]).

Figure 2: Prompt frequencies of DiffusionDB plotted on a log-log scale. The distribution follows Zipf's law.

Table 5: Most frequent prompts in DiffusionDB. #Users indicates the number of users who have used this prompt.

Rank  Prompt                                                                     Freq.  #Users
1     painful pleasures by lynda benglis, octane render, colorful, 4k, 8k       1010   1
2     cinematic bust portrait of psychedelic robot from left, head and chest ...  240    2
3     divine chaos engine by karol bak, jean delville, william blake, gustav ...  240    7
4     divine chaos engine by karol bak and vincent van gogh                       228    1
5     soft greek sculpture of intertwined bodies painted by james jean ...        202    2
6     detailed realistic beautiful young medieval queen face portrait ...         202    1
7     animation magic background game design with miss pokemon ...                181    2
8     cat                                                                         174    69
9     wrc rally car stylize, art gta 5 cover, official fanart behance hd ...      166    4
10    futurism movement hyperrealism 4k detail flat kinetic                       157    1
11    a big pile of soft greek sculpture of intertwined bodies painted by ...     156    1
12    test                                                                        152    86
13    dream                                                                       149    133
14    realistic detailed face portrait of a beautiful futuristic viking warrior ... 149  2
15    spritesheet game asset vector art, smooth style beeple, by thomas ...       141    3
∗16                                                                               137    50
17    abstract 3d female portrait age five by james jean and jason chan, ...      134    1
18    symmetry!! egyptian prince of technology, solid cube of light, ...          130    1
19    retrofuturistic portrait of a woman in astronaut helmet, smooth ...         127    1
20    astronaut holding a flag in an underwater desert. a submarine is ...        127    1
∗ Row 16 is an empty prompt.
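The rank-frequency check behind Figure 2 can be reproduced along the following lines; `prompts` (the list of prompt submissions after de-duplication) is an assumed variable, and the log-log regression is only a rough Zipf diagnostic, not a rigorous power-law fit.

```python
from collections import Counter
import numpy as np

counts = Counter(prompts)
freqs = np.array(sorted(counts.values(), reverse=True))
ranks = np.arange(1, len(freqs) + 1)

# Under Zipf's law, log(frequency) is roughly linear in log(rank);
# the negated slope estimates the Zipf exponent.
slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"estimated Zipf exponent: {-slope:.2f}")
```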
Bundled queries. When queries are more complex and harder to compose, an effective practice used in medical search engines is to allow users to bundle a long query, save it for reuse, and share it with others. In the context of EHR search, bundled queries are significantly longer (with 58.9 terms on average, compared to 1.7 terms in user typed-in queries) [44, 46]. Bundled queries tend to have higher quality, and once shared, are more likely to be adopted by other users [46]. Table 5 seems to suggest the same opportunity, as certain well-composed lengthy prompts are revisited many times by their users. These prompts could be saved as "bundles" and potentially shared with other users. To illustrate the potential, we count the prompts used by multiple users and plot the distribution in Figure 3. We find that a total of 16,950 unique prompts (0.94% of all unique prompts) have been used across users, 782 have been used by five or more users, and 182 have been shared by 10 or more users. The result suggests that text-to-image generation users have already started to share bundled prompts spontaneously, even though this functionality has not been provided by the systems. Compared to vertical search engines that provide bundle-sharing features, the proportion of bundled prompts is still relatively small (compared with 19.3% for an EHR search engine [44]), indicating a huge opportunity for bundling and sharing prompts.

Figure 3: Prompts shared across users in DiffusionDB. The orange line plots the average prompt length in the blue bins.
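The cross-user sharing counts plotted in Figure 3 amount to a group-by over the log. A sketch, assuming a DataFrame `log` with `prompt` and `user_id` columns (the column names are ours, not the datasets'):

```python
import pandas as pd

users_per_prompt = log.groupby("prompt")["user_id"].nunique()
print(f"{(users_per_prompt > 1).sum()} unique prompts used by more than one user")
print(f"{(users_per_prompt >= 5).sum()} used by five or more users")
print(f"{(users_per_prompt >= 10).sum()} used by ten or more users")
```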
Table 7: Prompts shared by the largest numbers of users in DiffusionDB. Only prompts longer than five terms are reported below row 10.

Rank  Prompt                                                               #Users
1     dream                                                                133
2     stable diffusion                                                     91
3     help                                                                 89
4     test                                                                 86
5     cat                                                                  69
6     nothing                                                              66
7     god                                                                  58
8     the backrooms                                                        53
∗9                                                                         50
10    among us                                                             44
19    a man standing on top of a bridge over a city, cyberpunk art ...     32
20    mar - a - lago fbi raid lego set                                     32
34    an armchair in the shape of an avocado                               23
35    a giant luxury cruiseliner spaceship, shaped like a yacht, ...       23
42    a portrait photo of a kangaroo wearing an orange hoodie and ...      19
45    anakin skywalker vacuuming the beach to remove sand                  19
48    emma watson as an avocado chair                                      18
64    milkyway in a glass bottle, 4k, unreal engine, octane render         16
∗ Row 9 is an empty prompt.

Consecutive prompts that are submitted by the same user within 30 minutes are considered to be in the same session.

The statistics of sessions are listed in Table 8. Similar to prompts, text-to-image generation sessions also tend to be significantly longer than Web search sessions (in terms of the number of prompts in a session). A text-to-image generation session contains 10.25 (Midjourney) or 13.71 (DiffusionDB) prompts on average and a median of 4 (Midjourney) or 5 (DiffusionDB) prompts; in Web search, the average session length is around 2.02 and the median is 1 [36]. This is again likely due to the complexity of the creation task, which requires the users to update their prompts multiple times. Indeed, a user tends to change (add, delete, or replace) a median of 3 terms (measured by term-level edit distance) between two consecutive prompts in the same session on Midjourney (5 on DiffusionDB), astonishingly more than how much people edit Web search queries. Do these updates indicate different types of information needs?
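The sessionization and the term-level edit distance can be sketched as follows; the input format `records`, a list of (user_id, timestamp, tokens) tuples sorted by user and time, is an assumption about how the log is organized.

```python
# A minimal sketch of 30-minute-timeout sessionization and the
# term-level edit distance between consecutive prompts.
from datetime import timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(records):
    """Group consecutive submissions of a user within 30 minutes."""
    sessions, current = [], []
    for user, ts, tokens in records:
        if current and (user != current[-1][0] or ts - current[-1][1] > TIMEOUT):
            sessions.append(current)
            current = []
        current.append((user, ts, tokens))
    if current:
        sessions.append(current)
    return sessions

def term_edit_distance(a, b):
    """Levenshtein distance over terms (additions, deletions, replacements)."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ta != tb))
    return dp[len(b)]
```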
Table 8: Statistics of prompt sessions. Sessions are identified with a 30-minute timeout. Edit distances over terms are calculated between consecutive prompts in the same session.

Dataset                   Midjourney  DiffusionDB
#Sessions                 14,232      161,001
Avg. #sessions/user       8.52        15.51
Median #sessions/user     2           9
Avg. #prompts/session     10.19       13.71
Median #prompts/session   4           5
Avg. edit distance        8.53        9.42
Median edit distance      3           5

4.2.5 A new categorization of information needs. Web search queries are typically distinguished into three categories: (1) navigational queries, (2) informational queries, and (3) transactional queries [3]. Should text-to-image prompts be categorized in the same way? Or do prompts express new categories of information needs?

Navigational prompts. The most frequent queries in Web search are often navigational, where users simply use a query to lead them to a particular, known Website (e.g., "Facebook" or "YouTube"). In text-to-image generation, as the generation model often returns different images given the same text prompt due to randomization, the information need of "navigating" to a known image is rare. Indeed, the queries used by the largest numbers of users (Figure 3) are generally not tied to a particular image. Even though the shorter queries at the top look somewhat similar to "Facebook" or "YouTube", they are rather ambiguous and look more like tests of the system.

Informational prompts. Most other text-to-image prompts can be compared to informational queries in Web search, which aim to acquire certain information that is expected to be present on one or more Web pages [3]. The difference is that informational prompts aim to synthesize (rather than retrieve) an image, which is expected to exist in the latent representation space of images. Most prompts fall into this category, similar to the case in Web search [3].

Transactional prompts. Transactional queries are those intended to perform certain Web-related activities [3], such as completing a transaction (e.g., to book a flight or to make a purchase). One could superficially categorize all prompts as transactional, as they are all intended to conduct the activity of "generating images". Zooming into this superficial categorization, we could identify prompts that refer to specific and recurring tasks, such as "3D rendering", "post-processing", "global illumination", and "movie poster" (see more examples in Section 4.1.2). These tasks may be considered transactional in the context of text-to-image generation.

Exploratory prompts. Beyond the above categories corresponding to the three basic types of Web search queries, we discover a new type of information need in prompts, namely the exploratory prompts for text-to-image generation. Compared to an informational prompt that aims to generate a specific (hypothetical) image, an exploratory prompt often describes a vague or uncertain information need (or image generation requirement) that intentionally leads to multiple possible answers. The user intends to explore different possibilities, leveraging either the randomness of the model or the flexibility of terms used in a prompt session.

Indeed, rather than clearly specifying the requirements and constraints and gradually refining them within a session, in exploratory prompts or sessions the users tend to play with alternative terms of the same category (e.g., different colors or animals, or sibling terms) to explore how the generation results could differ or could cover a broader search space. Based on the session analysis, we count the most frequent term replacements in Table 9. In this table, we find 33 replacements that show exploratory patterns, such as ("man", "woman"), ("asian", "white"), ("dog", "cat"), ("red", "blue"), and ("16:9", "9:16").

On the contrary, in non-exploratory sessions it is more common to replace a term with its synonyms, hyponyms, or more specific concepts, which refines the search space (rather than exploring the generation space). In the table, we find a few such replacements: ("steampunk", "cyberpunk"), ("deco", "nouveau"), and ("crown", "throne"). There are also examples that replace terms with the correct spelling or adjust punctuation to refine: ("aphrodesiac", "aphrodisiac"), ("with", ","), (",", "and"), and (",", ".").
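One simple way to extract the replacements counted in Table 9 is to compare consecutive prompts of equal length within a session and keep the pairs that differ in exactly one position; this positional criterion is our assumption about how "exactly one term is replaced" can be operationalized.

```python
from collections import Counter

replacements = Counter()
for session in sessions:  # e.g., the output of sessionize() above
    for (_, _, prev), (_, _, cur) in zip(session, session[1:]):
        if len(prev) == len(cur):
            diff = [(a, b) for a, b in zip(prev, cur) if a != b]
            if len(diff) == 1:  # exactly one term replaced
                replacements[diff[0]] += 1

for (old, new), freq in replacements.most_common(20):
    print(f"({old!r}, {new!r}): {freq}")
```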
Table 9: Most frequent term replacements. This table only considers consecutive prompts from the same session where exactly one term is replaced. Green highlights replacements that might indicate exploratory patterns, while red highlights non-exploratory replacements.

Rank  Midjourney replacement             Freq.   DiffusionDB replacement        Freq.
1     ('deco', 'nouveau')                16      ('man', 'woman')               216
2     ('16:9', '9:16')                   15      ('woman', 'man')               187
3     ('9:16', '16:9')                   14      ('2', '3')                     161
4     ('2', '1')                         8       ('1', '2')                     147
5     ('16:9', '4:6')                    8       ('7', '8')                     140
6     ('1', '2')                         7       ('8', '9')                     139
7     ('3:4', '4:3')                     7       ('6', '7')                     135
8     ('1000', '10000')                  7       ('3', '4')                     132
9     ('artwork', 'parrot')              7       ('girl', 'woman')              128
10    ('16:9', '1:2')                    6       ('red', 'blue')                116
11    ('2:3', '3:2')                     6       ('5', '6')                     115
12    ('asian', 'white')                 6       ('4', '5')                     112
13    ('1', '0.5')                       5       ('female', 'male')             107
14    ('320', '384')                     5       ('male', 'female')             97
15    ('0.5', '1')                       4       ('blue', 'red')                93
16    ('crown', 'throne')                4       ('0', '1')                     89
17    ('blue', 'green')                  4       ('cat', 'dog')                 89
18    ('9:16', '4:5')                    4       ('woman', 'girl')              82
19    ('2:3', '1:2')                     4       ('dog', 'cat')                 79
20    ('--w', '--h')                     4       ('white', 'black')             72
21    ('nouveau', 'deco')                4       ('with', ',')                  71
22    ('red', 'blue')                    4       ('steampunk', 'cyberpunk')     71
23    ('guy', 'girl')                    4       ('red', 'green')               70
24    ('snake', 'apple')                 4       ('cyberpunk', 'steampunk')     70
25    ('japanese', 'korean')             4       (',', 'and')                   69
26    ('16:8', '8:11')                   4       ('painting', 'portrait')       68
27    ('insect', 'ladybug')              4       (',', '.')                     68
28    ('--hd', '--vibe')                 3       ('portrait', 'painting')       68
29    ('aphrodesiac', 'aphrodisiac')     3       ('girl', 'boy')                64
30    ('0.5', '2')                       3       ('green', 'blue')              63

Another indication of exploratory behavior is the repeated use of prompts. For example, among the top prompts in Table 5 (except those for testing purposes), each is repeatedly used by the same user more than 100 times. This might be because the user is exploring different generation results with the same prompt, leveraging the randomness of the generative model.

4.3 How are the Information Needs Satisfied?
Prompts are typically crafted to meet certain information needs by generating satisfactory images. In this subsection, we examine how well prompts fulfill this goal. With the rating annotations in the SAC dataset (the average rating is 5.53, and the median is 6), we calculate the correlation between the ratings and other variables, such as prompt lengths and term frequencies.

4.3.1 Longer prompts tend to be rated higher. We plot how the ratings of generated images correlate with prompt lengths in Figure 4, where we find a positive correlation with a Pearson correlation coefficient of 0.197. This means longer prompts tend to produce images of higher quality. This provides another perspective for understanding the large lengths of prompts and prompt sessions, and another motivation to bundle and share long prompts.

Figure 4: Prompt length is positively correlated with ratings. The Pearson correlation coefficient is 0.197.
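The correlation itself is a one-liner once prompt lengths and ratings are aligned. A sketch, assuming a DataFrame `sac` with a raw `prompt` string and an aggregated `rating` per prompt (the column names are assumptions):

```python
from scipy.stats import pearsonr

lengths = sac["prompt"].str.split().str.len()
r, p = pearsonr(lengths, sac["rating"])
print(f"Pearson r = {r:.3f} (p = {p:.2g})")  # the paper reports r = 0.197
```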
4.3.2 The choice of words matters. We also investigate how the choice of words influences the performance of image generation. We collect all the prompts that contain a particular term and calculate the average rating. Terms with the highest and lowest average ratings are listed in Table 12 in the appendix. We find that most high-rating terms are artist names, which provide clear constraints on the styles of images. In contrast, terms with low ratings are much vaguer and more abstract and might indicate exploratory behavior. More effort is needed to handle exploratory prompts and to encourage the users to refine their needs.

4.4 How are Users' Information Needs Covered by Image Captions?
Current text-to-image generation models are generally trained on large-scale image-text datasets, where the paired text usually comes from image captions. To figure out how well these training sets match the actual users' information needs, we compare the prompts with image captions in the open domain. In particular, we consider LAION-400M [35] as one of the main sources of text-to-image training data, since both the LDMs and the Stable Diffusion model employ this dataset. Texts in LAION-400M are extracted from the captions of the images collected from the Common Crawl, so they are supposed to convey the subject, form, and intent of the images. We randomly sample 1M texts from LAION-400M and compare them with the user-input prompts. We obtain the following finding.

Term usages are different between user-input prompts and image captions in open datasets. We construct a vocabulary based on LAION-400M and calculate the vocabulary coverage of the three prompt datasets (i.e., what proportion of the user-input terms is covered by the LAION vocabulary). The coverage is 25.94% for Midjourney, 43.17% for DiffusionDB, and 80.56% for SAC. The coverage is relatively high on SAC as this dataset is relatively clean. In comparison, the Midjourney and DiffusionDB datasets directly collect prompts from the Discord channels of Midjourney and Stable Diffusion, and over half of their terms are not covered by the LAION vocabulary. We also analyzed their embeddings and found that user-input prompts and image captions from the LAION dataset cover very different regions in the latent space (Figure 8 in the appendix).
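The vocabulary-coverage numbers can be reproduced with a set lookup. The sketch below counts token occurrences, which is one plausible reading of "proportion of user-input terms covered"; whether the paper counts tokens or unique types is not fully specified here.

```python
def vocabulary_coverage(prompt_tokens, caption_tokens):
    """Share of prompt-term occurrences found in the caption vocabulary."""
    vocab = set()
    for caption in caption_tokens:  # the 1M sampled LAION-400M texts
        vocab.update(caption)
    covered = total = 0
    for prompt in prompt_tokens:
        for term in prompt:
            total += 1
            covered += term in vocab
    return covered / total
```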
5 IMPLICATIONS
Our analysis reveals unique characteristics of user-input prompts, which help us better understand the limitations and opportunities of text-to-image generation systems and AI-facilitated creativity on the Web. Below we discuss a few concrete and actionable possibilities for improving the generation systems and enhancing creativity.

Building art creativity glossaries. As we discussed in Sec. 4.1.1, a text-to-image prompt can be decomposed into three aspects: subject ("what"), form ("how"), and intent ("why", or content as in the classical Art literature). If we can identify and analyze these specific elements in prompts, we may be able to better decipher users' information needs.

However, to the best of our knowledge, there is no existing tool that is able to extract the subject, form, and intent from text prompts. Besides, although users have spontaneously collected terms that describe the form and subject8, there is no high-quality and comprehensive glossary in the literature that covers these three basic components of art, nothing like the Unified Medical Language System (UMLS) for the biomedical and health domains [2]. Constructing such tools or glossaries is difficult and will rely heavily on domain knowledge, because: (1) the three components of art are often intertwined and inseparable in a piece of work [22], meaning a term may fall into any of the three categories; for example, in Process Art, the form and content seem to be the same thing [22]; and (2) terminologies about art are constantly updated, because new artists, styles, and art-related sites keep emerging. We call for a joint effort of the art and Web communities to build such vocabularies and tools.

Bundling and sharing prompts. Sec. 4.2.3 analyzes the lengths of text-to-image prompts, where we find an inadequate use of bundled prompts compared with other vertical search engines (e.g., EHR search engines). Since the prompts are generally much longer than Web search queries, and the information needs are also more complex, it is highly likely that bundled prompts can help users craft their prompts more effectively and efficiently. Though there are already prompt search websites such as Lexica9, PromptHero10, and PromptBase11 that provide millions of user-crafted prompts, such bundled search features are barely integrated into current text-to-image generation systems. As mentioned earlier, adding features to support bundling and sharing high-quality prompts could bring immediate benefits to text-to-image generation systems.

Personalized generation. The analysis in Sec. 4.2.4 suggests that the session lengths in text-to-image generation are also significantly larger than the session lengths in Web search, indicating a great opportunity for personalized generation. Currently, session-based generation features are mostly built upon the image initialization of diffusion models, i.e., using the output of the previous generation as the starting point of diffusion sampling. Compared with other session-based AI systems like ChatGPT [23], these session-based features still seem preliminary and give little consideration to personalized generation. Meanwhile, the explicit descriptions of form and intent in prompts also indicate opportunities to customize the generation models for these constraints (and for the potential applications listed in Section 4.1.2).

Handling exploratory prompts and sessions. In Sec. 4.2.5 we identify a new type of prompt in addition to the three typical categories of query in Web search (i.e., navigational, informational, and transactional queries), namely the exploratory prompts. To encourage the exploratory generation of images, reliable and informative exploration measures will be much needed. In other machine innovation areas, such as AI for molecular generation, efforts have been made to discuss the measurement of coverage and exploration of spaces [42, 43], but for text-to-image generation, such discussions are still rare. How to encourage the models to explore a larger space, generate novel and diverse images, and recommend exploratory prompts to users are all promising yet challenging directions.

Improving generation models with prompt logs. Finally, the gap between the image captions in open datasets and the user-input prompts (Sec. 4.4) indicates that it is desirable to improve model training directly using the prompt logs. Following the common practice in Web search engines, one may leverage both explicit and implicit feedback from the prompt logs (such as the ratings, or certain behavioral patterns or modifications in the prompts) as additional signals to update the generation models.

Although we focus our analysis on text-to-image generation, the analogy to Web search and some of the above implications also apply to other domains of AI-generated content (AIGC), such as AI chatbots (e.g., ChatGPT).

6 CONCLUSION
We take an initial step to investigate the information needs of text-to-image generation through a comprehensive and large-scale analysis of user-input prompts (analogous to Web search queries) in multiple popular systems. The results suggest that (1) text-to-image prompts are typically structured with terms that describe the subject, form, and intent; (2) text-to-image prompts are sufficiently different from Web search queries: our findings include the significantly lengthier prompts and sessions, the lack of navigational prompts, the new perspective on transactional prompts, and the prevalence of exploratory prompts; (3) image generation quality is correlated with the length of the prompt as well as the usage of terms; and (4) there is a considerable gap between the user-input prompts and the image captions used to train the models. Based on these findings, we present actionable insights to improve text-to-image generation systems. We anticipate that our study could help the text-to-image generation community better understand and facilitate creativity on the Web.

8 Prompt book for data lovers II: https://docs.google.com/presentation/d/1V8d6TIlKqB1j5xPFH7cCmgKOV_fMs4Cb4dwgjD5GIsg, retrieved on 3/14/2023.
9 Lexica: https://lexica.art/, retrieved on 3/14/2023.
10 PromptHero: https://prompthero.com/, retrieved on 3/14/2023.
11 PromptBase: https://promptbase.com/, retrieved on 3/14/2023.
REFERENCES
[1] Alan Agresti. 2012. Categorical Data Analysis. Vol. 792. John Wiley & Sons.
[2] Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32, suppl_1 (2004), D267–D270.
[3] Andrei Broder. 2002. A taxonomy of web search. In ACM SIGIR Forum, Vol. 36. ACM, New York, NY, USA, 3–10.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi. 2020. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers. In EMNLP.
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
[7] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30 (2017).
[8] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. 2009. Power-law distributions in empirical data. SIAM Review 51, 4 (2009), 661–703.
[9] Thomas H. Davenport and Nitin Mittal. 2022. How generative AI is changing creative work. https://hbr.org/2022/11/how-generative-ai-is-changing-creative-work Retrieved on 3/15/2023.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (Eds.), Vol. 27. Curran Associates, Inc.
[12] Jorge R Herskovic, Len Y Tanaka, William Hersh, and Elmer V Bernstam. 2007. A day in the life of PubMed: analysis of a typical day's query log. Journal of the American Medical Informatics Association 14, 2 (2007), 212–220.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[14] Bernard J Jansen, Amanda Spink, Judy Bateman, and Tefko Saracevic. 1998. Real life information retrieval: A study of user queries on the web. In ACM SIGIR Forum, Vol. 32. ACM, New York, NY, USA, 5–17.
[15] Rosie Jones and Kristina Lisa Klinkner. 2008. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In Proceedings of the 17th ACM Conference on Information and Knowledge Management. 699–708.
[16] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[17] Vivian Liu and Lydia B Chilton. 2022. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–23.
[18] Elman Mansimov, Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. 2016. Generating Images from Captions with Attention. In ICLR.
[19] Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
[20] Midjourney.com. 2022. Midjourney. https://midjourney.com/ Retrieved on 3/15/2023.
[21] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning. PMLR, 16784–16804.
[22] Otto G Ocvirk, Robert E Stinson, Philip R Wigg, Robert O Bone, and David L Cayton. 1968. Art Fundamentals: Theory and Practice. WC Brown Company.
[23] OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt Retrieved on 3/15/2023.
[24] OpenAI. 2023. GPT-4 Technical Report. https://cdn.openai.com/papers/gpt-4.pdf Retrieved on 3/15/2023.
[25] Jonas Oppenlaender. 2022. The Creativity of Text-to-Image Generation. In Proceedings of the 25th International Academic Mindtrek Conference. 192–202.
[26] Jonas Oppenlaender. 2022. A Taxonomy of Prompt Modifiers for Text-to-Image Generation. arXiv preprint arXiv:2204.13988 (2022).
[27] Nikita Pavlichenko and Dmitry Ustalov. 2022. Best Prompts for Text-to-Image Models and How to Find Them. (2022).
[28] John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. 2022. Simulacra Aesthetic Captions. https://github.com/JD-P/simulacra-aesthetic-captions Retrieved on 3/15/2023.
[29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
[30] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
[31] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning. PMLR, 1060–1069.
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
[33] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems.
[34] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. 2023. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv preprint arXiv:2301.09515 (2023).
[35] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
[36] Craig Silverstein, Hannes Marais, Monika Henzinger, and Michael Moricz. 1999. Analysis of a very large web search engine query log. In ACM SIGIR Forum, Vol. 33. ACM, New York, NY, USA, 6–12.
[37] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
[38] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[39] Iulia Turc and Gaurav Nemade. 2022. Midjourney User Prompts & Generated Images (250k). https://doi.org/10.34740/KAGGLE/DS/2349267 Retrieved on 3/15/2023.
[40] Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. 2022. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. arXiv preprint arXiv:2210.14896 (2022).
[41] Yinglian Xie and David O'Hallaron. 2002. Locality in search engine queries and its implications for caching. In Proceedings. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 3. IEEE, 1238–1247.
[42] Yutong Xie, Ziqiao Xu, Jiaqi Ma, and Qiaozhu Mei. 2022. How Much of the Chemical Space Has Been Explored? Selecting the Right Exploration Measure for Drug Discovery. In ICML 2022 2nd AI for Science Workshop.
[43] Yutong Xie, Ziqiao Xu, Jiaqi Ma, and Qiaozhu Mei. 2023. How Much Space Has Been Explored? Measuring the Chemical Space Covered by Databases and Machine-Generated Molecules. In The Eleventh International Conference on Learning Representations.
[44] Lei Yang, Qiaozhu Mei, Kai Zheng, and David A Hanauer. 2011. Query log analysis of an electronic health record search engine. In AMIA Annual Symposium Proceedings, Vol. 2011. American Medical Informatics Association, 915.
[45] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. 2022. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. Transactions on Machine Learning Research (2022).
[46] Kai Zheng, Qiaozhu Mei, and David A Hanauer. 2011. Collaborative search in electronic health records. Journal of the American Medical Informatics Association 18, 3 (2011), 282–291.
A DATA AND DATA PROCESSING

A.1 Datasets
For each dataset, we list the important features (prompt, timestamp, user ID, and rating), feature descriptions, and corresponding examples in Table 10.

Table 10: Feature descriptions and examples of the Midjourney, DiffusionDB, and SAC datasets.

Feature  Description                                      Examples
Prompt   The prompt used to generate images. Type: String  Midjourney: hands, by Karel Thole and Mike Mignola --ar 2:3; DiffusionDB: Ibai Berto Romero as

A.2 Data Processing
Midjourney. We extracted prompts, timestamps, and user IDs from the records in the Midjourney dataset. The prompts in Midjourney may contain specific syntactic parameters of the Midjourney model, such as "--ar" for aspect ratios, "--h" for heights, "--w" for widths, and "::" for assigning weights to certain terms in the prompts. We first take the lowercase characters from the tokenized prompts, using the spaCy tokenizer13. We treat parameters such as "--h" as single terms. Specially, we split the weighted terms from their weights, and consider "::" and "::-" (negative weight) as two different terms. During tokenization, we also removed redundant whitespaces. Midjourney allows users to upload reference images as parts of their prompts in the form of Discord links. These links are also processed as special terms.
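A minimal version of this tokenization step, using the spaCy tokenizer mentioned above; the extra rules for "--" parameters, "::" weights, and reference links are omitted here and would need to be layered on top.

```python
import spacy

nlp = spacy.blank("en")  # rule-based tokenizer only, no pipeline components

def tokenize_prompt(prompt: str):
    """Lowercased spaCy tokens with whitespace tokens removed."""
    return [t.text.lower() for t in nlp(prompt) if not t.is_space]

tokenize_prompt("hands, by Karel Thole and Mike Mignola")
# -> ['hands', ',', 'by', 'karel', 'thole', 'and', 'mike', 'mignola']
```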
DiffusionDB. We utilize the metadata of DiffusionDB-Large (14M) for the prompt analysis. We first remove duplicate data entries with the same prompt, timestamp, and user ID; such entries correspond to different images generated by the same user with the same prompt, which we treat as a single submission. As a result, we obtained 2,208,019 non-duplicate prompt submissions from users. Note that repeated submissions of prompts are preserved. We tokenize the prompts and remove the whitespace as we process the Midjourney data.
pairs and has often been used in large text-to-image model training
the same prompt, timestamp, and user ID, meaning these entries
[32–34, 45]. In the analysis, we use the LAION-400M dataset14 that
12 Note that one prompt may correspond to multiple images, and one image may have contains only English texts.
multiple ratings. Here we list all the ratings correlated to the example prompt.
13 Spacy: https://spacy.io/, retrieved on 3/15/2023. 14 LAION-400M dataset: https://laion.ai/blog/laion-400-open-dataset/.
A Prompt Log Analysis of Text-to-Image Generation Systems WWW ’23, April 30-May 4, 2023, Austin, TX, USA
Table 11: Most revisited prompts in DiffusionDB. Only revis- Table 12: Terms with the highest and the lowest average rat-
its across sessions are considered. ings. Only terms with frequencies larger than 100 are con-
sidered. “Avg.” and “Std.” are means and standard deviations
Prompt #Revisits of ratings respectively.
1 test 24
Terms with highest avg. ratings Terms with lowest avg. ratings
2 cat 19
Term Avg. Std. Freq. Term Avg. Std. Freq.
3 fat chuck is mad 15
4 15 1 shinjuku 8.55 0.90 168 equations 2.36 2.18 240
5 dog 15 2 gyuri 8.22 1.65 219 mathematical 2.37 2.18 230
6 symmetry!! egyptian prince of technology, solid cube of light, ... 13 3 lohuller 8.22 1.66 215 geismar 2.67 2.13 136
7 full character of a samurai, character design, painting by gaston ... 13 4 afremov 7.95 1.73 288 haviv 2.68 2.14 136
8 studio portrait of lawful good colorful female holy mecha paladin ... 11 5 leonid 7.95 1.73 288 chermayeff 2.73 2.14 136
9 full portrait and/or landscape. contemporary art print. high taste. ... 11 6 retrofuture 7.95 1.97 307 learning 3.10 2.64 112
10 woman wearing oculus and digital glitch head edward hopper and ... 11 7 merantz 7.93 1.77 463 pegasus 3.10 2.00 129
11 dream 10 8 josan 7.91 1.73 1,647 teacher 3.11 2.59 110
12 hyperrealistic portrait of a character in a scenic environment by ... 10 9 fantasyland 7.90 1.52 114 someone 3.14 2.45 574
13 full portrait &/or landscape painting for a wall. contemporary art ... 10 10 gensokyo 7.89 1.34 281 funny 3.17 2.52 208
14 zombie girl kawaii, trippy landscape, pop surrealism 10
15 creepy ventriloquist dummy in the style of roger ballen, 4k, bw, ... 9 Midjourney
16 cinematic bust portrait of psychedelic robot from left, head and ... 9 DiffusionDB
17 red ball 9 SAC
18 amazing landscape photo of mountains with lake in sunset by ... 9 LAION
19 female geisha girl, beautiful face, rule of thirds, intricate outfit, ... 9
20 full portrait and/or landscape painting for a wall. contemporary ... 9
0.050 Midjourney
Proportion of prompts
DiffusionDB
0.045
0.040
0.035
00:00
01:00
02:00
03:00
04:00
05:00
06:00
07:00
08:00
09:00
10:00
11:00
12:00
13:00
14:00
15:00
16:00
17:00
18:00
19:00
20:00
21:00
22:00
23:00
Time (hour)