3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows
Figure 1: 3DALL-E integrates a state-of-the-art text-to-image AI (DALL-E) into 3D CAD software Fusion 360. This plugin
generates 2D image inspiration for conceptual CAD and product design workflows. 3DALL-E helps users craft text prompts by
providing 3D keywords, designs/styles, and parts from GPT-3. Users can also generate from image prompts based on a render of
their current workspace, letting users use their 3D modeling progress as a basis for text-to-image generations.
ABSTRACT
Text-to-image AI are capable of generating novel images for inspiration, but their applications for 3D design workflows and how designers can build 3D models using AI-provided inspiration have not yet been explored. To investigate this, we integrated DALL-E, GPT-3, and CLIP within a CAD software in 3DALL-E, a plugin that generates 2D image inspiration for 3D design. 3DALL-E allows users to construct text and image prompts based on what they are modeling. In a study with 13 designers, we found that designers saw great potential in 3DALL-E within their workflows and could use text-to-image AI to produce reference images, prevent
design fixation, and inspire design considerations. We elaborate on prompting patterns observed across 3D modeling tasks and provide measures of prompt complexity observed across participants. From our findings, we discuss how 3DALL-E can merge with existing generative design workflows and propose prompt bibliographies as a form of human-AI design history.

CCS CONCEPTS
• Applied computing → Media arts; • Human-centered computing → Interactive systems and tools; • Computing methodologies → Natural language generation; Shape modeling.

KEYWORDS
creativity support tools, 3D design, DALL-E, GPT-3, CLIP, 3D modeling, CAD, co-creativity, creative copilot, ideation, prompt engineering, multimodal, text-to-image, AI applications, text-to-3D, workflow, diffusion

ACM Reference Format:
Vivian Liu, Jo Vermeulen, George Fitzmaurice, and Justin Matejka. 2023. 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows. In Designing Interactive Systems Conference (DIS '23), July 10–14, 2023, Pittsburgh, PA, USA. ACM, New York, NY, USA, 23 pages. https://doi.org/10.1145/3563657.3596098

1 INTRODUCTION
Designing 3D models in CAD software is challenging—designers have to satisfy a number of objectives that can range from functional and aesthetic goals to feasibility constraints. Coming up with ideas takes a lot of exploration, even for experienced designers, so they often consult external resources for inspiration on how to define their geometry. They browse 3D model repositories [23], video tutorials, and image search engines to understand conventional designs and different aesthetics [48]. This process of conceptualizing CAD designs is pivotal to the product design process, yet few computational methods support it [35, 48].

A recent innovation that can more directly provide inspiration to designers is text-to-image AI. Tools such as DALL-E [17], Imagen [77], Parti [89], and Stable Diffusion [75] are AI tools that have the generative capacity to access and combine many visual concepts into novel images. Given text prompts as input, these tools can capture a wide variety of subjects and styles [52]. In online communities, users have already developed methods to elicit images with 3D qualities [52, 66] by including prompt keywords such as “3D render” or “CGI”. Recent advancements have also allowed users to interact with text-to-image AI systems by passing in image prompts, where images are used as prompts in addition to text. Generations can now be varied or built off of previous generations. These innovative functions make the integration of text-to-image AI within existing creative authoring software more feasible.

However, how AI-provided image inspiration can contribute to CAD and product design workflows has not yet been fully explored. In this paper, we seek to understand how text-to-image AI can assist 3D designers with conceptual CAD and design inspiration and where in creative workflows designers can most benefit from AI assistance. Furthermore, we investigate how text-to-image tools respond to image prompts sent from 3D designers as they build up complexity in their designs. To do so, we integrated three large AI models—DALL-E, GPT-3, and CLIP—within Fusion 360, an industry standard software for computer-aided design (CAD). We implemented a plugin within the software which we call 3DALL-E. This plugin helps translate a designer's goals into multimodal (text and image) prompts which can produce image inspiration for them. After a designer inputs their goals (i.e. to design a “truck”), the plugin provides a number of related parts, styles, and designs that help users craft text prompts. These suggestions are drawn from the world knowledge of GPT-3 [5] to help users familiarize themselves with relevant design language and 3D keywords that can better specify the text prompt. The plugin interactively updates an image preview from the software viewport that shows an image prompt which can be passed into DALL-E [72], giving users a direct bridge between their 3D design workspace and an AI model that can generate image inspiration. Additionally, having a lens on what the designer is actively working on allows the plugin to highlight what prompt suggestions may work best, which is implemented in the system by using CLIP [71] to approximate model knowledge. To evaluate 3DALL-E and how well it can integrate into 3D workflows, we conducted a user study with thirteen users of Fusion 360 who spanned a variety of backgrounds from industrial design to robotics. We found that 3DALL-E can benefit CAD designers as a system that supports conceptual CAD, helps prevent design fixation, produces reference images, and inspires design considerations.

We present the following contributions:
• 3DALL-E, a plugin that generates AI-provided image inspiration for CAD and product design by helping users craft text prompts with design language (different parts, styles, and designs for a 3D object) and image prompts connected to their work in progress.
• An exploratory user study (n=13) demonstrating text-to-image AI use cases in 3D design workflows and an analysis of prompting patterns and prompt complexity.

In our discussion, we propose prompt bibliographies, a concept of human-AI design history to track inspiration from text-to-image AI. We conclude on how text-to-image AI can integrate with existing design workflows and what can be best practices for generative design going forward.

2 RELATED WORK

2.1 Prompting
Prompting is a novel form of interaction that has come about as a consequence of large language models (LLMs) [5]. Prompts allow users to engage with AI using natural language. For example, a user can prompt an AI, “What are different parts of a car?” and receive a response such as the following, “Wheels, tires, and headlights”. These prompts give LLMs context for what tasks they need to perform and help end users adapt the general pretraining of LLMs without further finetuning [4, 73]. By varying prompts, users can query LLMs for world knowledge, generative completions, summaries, translations, and so forth [5, 53]. Datasets around prompting are also beginning to emerge to benchmark generative AI abilities. PARTI [89] provides a schema and a set of prompts to investigate the visual language abilities of AI. Coauthor [50] provides a dataset of rich interactions between GPT-3 and writers. Audits of models
have also been performed by collecting generated outputs of AI models at scale and conducting annotation studies, as in [52] and [68]. As generative AI communities have gained momentum online, crowd-sourced efforts on Twitter and Discord have also organized to disseminate prompting guidance [66] that suggest experimentation with various style and medium keywords (e.g. “isometric”, “3D render”, “sculpture” etc.).

Recent research directions have begun to develop workflows around prompts. AI Chains [85] studied how complex tasks can be decomposed into smaller, prompt-addressable tasks. Promptchainer [84] unveiled an editor that helps users visually program chains of prompts. Prompt-based workflows were explored in [42] to make prototyping ML more accessible for industry practitioners. Other systems have tested pipelines that concatenate LLMs with text-to-image models. In Opal [53], a pipeline of GPT-3 initiated prompt suggestions generated galleries of text-to-image generations to help news illustrators explore design options in a structured manner. Similarly, a visual concept blending system in [27] used BERT [20] to surface shape analogies and prompt text-to-image AI for visual metaphors. A key finding from Opal and the visual blends system [27] that we apply in 3DALL-E is that LLMs can help generate prompts so end users can efficiently explore design outcomes.

New modes of prompting have also started to emerge. Users can now pass in image prompts and have AI models autocomplete images and canvases in methods called inpainting and outpainting [17, 65]. These functions have been implemented within state-of-the-art text-to-image AI systems [17]. 3DALL-E is the first to systematically generate image prompts from CAD software (Fusion 360) and help users incorporate their 3D design progress into text-to-image generations.

2.2 Generative Models
Generative AI models have long been excellent at image synthesis. However, many early models were class-conditional, meaning that they were only robust at generating images from the classes they were trained on [43, 44, 69, 70, 86, 88]. The most recent wave of generative AI models can now produce images from tens of thousands of visual concepts due to extensive pretraining. CLIP [71], a state-of-the-art multimodal embedding, was trained off of hundreds of millions of text and image pairs, giving it a broad understanding of both domains. The pretraining of CLIP has also helped it serve as an integral part of multiple generative workflows [15, 16, 22, 62] and training regimes [60, 78]. Large open-source efforts had previously paired CLIP with GAN models, using it as a discriminator to optimize generated images toward text prompts. The novelty of generating media through language has brought many text-to-image tools into production such as Midjourney, DALL-E, and Stable Diffusion. DALL-E [72] demonstrated how CLIP embeddings can help generate images with autoregressive and diffusion-based approaches. Diffusion is key within many of the aforementioned methods to increase the quality of text-to-image outputs [14, 17, 61].

New text-to-image approaches have led to more diverse methods of user interaction. Make-a-Scene [24] allows users to interact with generations by manipulating segmentation maps, and DALL-E gives users the ability to paint outside the edges of an image, allowing for unlimited canvases [65]. Textual inversion [75] gives users the ability to train and trade novel concepts learned by the AI [25] off of a few examples. These models have extraordinary generative capacity, but their ability to be used nefariously has also inspired new approaches to safeguarding AI outputs from redteaming [6] to large scale audits for social and gender biases [12].

Text-to-3D methods such as CLIP-Sculptor, DreamFusion, and Point-E [40, 67, 78, 79] also exist and are rapidly improving, but they have far longer inference times [40] and required computing power [40]. They are also often constrained to producing shapes that are limited in diversity [79], fidelity [78, 79], stylistic range [64], and capabilities for variable binding owing to the smaller volume of paired text-shape data online [79]. Advances using diffusion models as a prior have also made the generation of complex, textured 3D models possible [67]. However, text-to-3D approaches result in scene [67], voxel [78, 79], pointcloud [64], and mesh [67] representations that are medium or high fidelity from the get-go. This can start a designer off at an unfamiliar stage in their workflow (with a medium or high fidelity geometry they might not know how to edit) or with a representation they do not usually use for CAD. To support conceptual CAD from the earliest stages possible, we investigate text-to-image rather than text-to-3D in 3DALL-E as the most suitable starting point for AI-provided inspiration. We elaborate on how designers often start in 2D and build up to 3D forms using shape operations in Section 2.4.

2.3 Creativity Support Tools
Human-computer interaction research on creativity support tools has long showcased ways to facilitate text-based content creation. Early systems showed that users could iteratively define images based on chat and dialogue [21, 80]. AttriBit [9] allowed users to assemble 3D models out of parts matched on affective adjectives. Sceneseer [7] and Wordseye [13] allowed users to create scenes via sentences. However, since the advancement of AI tools, much of the momentum has now concentrated around human-AI co-piloted experiences. Systems such as Opal [53], Sparks [30], FashionQ [41], and the editors in [81] are examples of AI-assisted ideation. In tandem, many frameworks for computational creativity [54] and human-AI interaction [1] have cropped up to understand concerns such as ownership and agency when AI is involved in the creative process. Gero et al. [29] found that users can establish better mental models of what AI can and cannot do if they have a sense of its internal distribution of knowledge.

Practices for creativity support tools that we revisit from an AI perspective include the idea of design galleries [56], timelines and design history [32], natural language exploration [26], and collaboration support [82]. DataTone [26] demonstrated how interactive prompting with widgets can help build specificity in a text-based interface. Suh et al. [82] demonstrated that AI-generated content could facilitate teamwork within groups by helping establish common ground between collaborators. While many systems have been built with generative AI capabilities [18, 31, 55] and even for text-to-image workflows [53]—none that we know of have applied text-to-image AI for 3D design workflows.
2.4 CAD Conceptual Design and Workflows
CAD is a highly complex design activity that usually involves a significant amount of conceptual design, as later stages of prototyping can incur material costs. Because CAD evolved in part from 2D drafting, CAD often relies on 2D representations such as freehand drawing and computer-assisted sketches [34, 45, 46]. In these early stages, designers are also gathering inspiration from external sources like 3D model repositories [23] (e.g. Onshape, Google Poly), video tutorials, and reference images [87] to inform their sketches. Users operate over these 2D representations (sketch profiles and planes) to apply constraints and dimensions and to take their models into 3D using operations such as extrusion, lofting, revolving and so on [34]. It has been found that early stage CAD and product design “tends to be ambiguous, incomplete, and expressive with high levels of uncertainties” [45], and there is less focus on constraints and parameters [46, 74]. Conceptual CAD also can involve text and image exploration; mechanical engineers perform system decomposition to understand model needs, and industrial designers collect moodboards and perform market research [35].

One direction within HCI work has focused on capturing and understanding CAD workflows. Screencast [2, 32] collects timelines of authoring operations from CAD help forums. From Screencast data, workflow graphs [8] have been proposed as a way to characterize 3D modeling workflows. These graphs have shown that users can arrive at 3D models through different paths. For example, to design a mug, a user can design in parts and in interchangeable sequences; they can first create the body of the cup, and then the handle, or vice versa. Examinations of CAD experts have also generalized CAD modeling as procedures of increasing detail, working from sketches to geometric forms to finishing features [34].

Prior work on applying generative models and AI for knowledge-based design in CAD and industrial engineering does exist [28, 49, 51, 58]. Liao et al. note that parametric CAD tools do not offer “cognitive supports for search nor highlight new information a designer might not have thought of”, which is where generative AI can assist by providing triggers for novel solutions [3]. The closest works to 3DALL-E would be DreamSketch [45] and Dream Lens [57], systems for generative design exploration. DreamSketch helped explore 3D design ideas by passing in sketches, design variables, and constraints that retrieved generative designs from topology optimizers. Dream Lens helped users explore and visualize large-scale generative design datasets based on parameters. Rather than freehand sketches or parameters, 3DALL-E presents a method for supporting conceptual CAD through text-based exploration of design knowledge and text-to-image generations.

3 DESIGNING WITH 3DALL-E

3.1 Design Rationale
Engaging with text-to-image AI means coming up with many prompts. Users have to exhaustively experiment with AI to see what words it can understand and render well. To streamline prompt ideation for a CAD environment, 3DALL-E helps users efficiently assemble 3D design knowledge into prompts. For example, for a table, a user may know common designs like “dining table” or “desk” but may otherwise not know design vernacular (“lift-top”, “drop-leaf”, or “nesting” table) that 3DALL-E can efficiently supply.

3.2 The 3DALL-E interface
3DALL-E is provided as a panel on the right hand side of the 3D workspace (Fig. 1). Fig. 2 shows the steps users go through when designing with 3DALL-E inside their 3D workspace and presents the main interface components. 3DALL-E allows users to construct prompts relevant to their current 3D design, which can then be sent to DALL-E to retrieve AI-provided image inspiration. Once generations are received, users are able to download them, see a history of previous results, and create variations of generations that they want to explore more from. In what follows, we will present these different steps with a short walkthrough.

3.3 Constructing Text Prompts for AI-Provided Inspiration
Users begin at the starting state shown in Fig. 2-I, where they can describe what they want to make by typing in their goal (Fig. 2A). Once they do that, different prompt suggestions populate the sections with 3D keywords, designs/styles, and parts (Fig. 2-II). These suggestions help steer the generations toward results relevant to 3D modeling as well as provide design language a user might otherwise not be familiar with. For example, querying a chair could return a series of existing designs such as an egg chair, an Eames chair, or a Muskoka chair, helping familiarize the user with the design language befitting of chairs. Once users select a set of prompt suggestions (e.g. “3d render, isometric, plant stool, wrought-iron”), an automatically rephrased prompt appears in the final prompt box (e.g. “isometric 3d render of a wrought-iron plant stool”) as shown in Fig. 2-III. This prompt is still editable by the user, and a text box to add custom keywords is also available when clicking the orange ‘+’ button in the parts section (Fig. 2F).

Prompt suggestions (Fig. 2C–E) are color-coded with a color for the group they belong to (blue for designs, green for styles, orange for parts) and varied in opacity to indicate how strongly their text aligns with the image prompt (see Fig. 4 for implementation details). For example, from a set of styles like “mid-century modern, contemporary, and art deco”, if “art deco” was most strongly highlighted (i.e. more opaque – darker green), it meant that the image prompt had the greatest probability of being matched with “art deco”. 3DALL-E suggests keywords to elicit 3D qualities particular to 3D models and renders, following design guidelines from related work [52, 66]. Styles are suggested to allow users to steer the aesthetic language of their generation and engage with inspiration spanning different time periods, traditions, and mediums (e.g. “mid-century modern”, “Brutalist”, or “CGI”). Using style keywords is also a recommended tip from prior work and existing AI systems [17, 52, 66]. 3DALL-E suggests parts as 3D models are often assemblies of parts, as established in work on part-based authoring systems [10] and part datasets [47, 59, 83]. Other dimensions like material and function could have been explored without loss of generality. However, we chose to focus on geometry-relevant suggestions instead of appearance (material) or abstract goals (function).

3.4 Crafting an Image Prompt
Users can also choose to include an image that is automatically extracted from their current 3D modeling workspace in addition to
Figure 2: 3DALL-E walkthrough. Step I: Initial state, where users can type their design intentions. Step II: Users are presented
with prompt suggestions from GPT-3. Step III: Selected suggestions are rephrased into an editable prompt. Step IV: Users wait
as DALL-E generates. Step V: Results are shown. A cursor hovers over a shuffle icon, which is how users can launch variation
requests from DALL-E.
their text prompt (image+text prompt) or choose to exclude it (text-only prompt). Image prompts are only passed in when users select the image preview (Fig. 2B), making it active. Using the 3D software to render the viewport allows 3DALL-E to programmatically deliver clean prompts without tasking the user with any erasing or masking. Users can easily toggle the visibility of certain parts of their model using the 3D software's built-in functionality and request for DALL-E to fill in the details for those hidden parts.
3.5 Receiving DALL-E Results and Retrieving Variations
Once the user is satisfied with the prompt, they click the DALL-E button next to the final text prompt (Fig. 2G) to generate either a text-only or image+text prompt (depending on whether the image preview is selected). While waiting for results (Fig. 2-IV), the user is shown a spinner animation. When the results are ready, the user can click the orange download button to pull the results from DALL-E into the 3DALL-E interface.

Results are returned in sets of four (Fig. 2-V). When the user hovers over a result, they are presented with a menu that allows them to ‘star’ their favorite results and click the ‘shuffle’ button to get more variations on that particular result (Fig. 2I). These are retrieved using DALL-E's built-in functions that generate similar images given an image input. Lastly, 3DALL-E also keeps a history of previous generations (Fig. 2J).
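The plugin brokers these generation and variation requests through its own backend (Flask and Selenium, see Section 4). Purely as a hedged illustration of the same request/variation flow, the sketch below assumes the public OpenAI Images API (openai < 1.0) rather than the authors' implementation:

```python
# Hypothetical sketch: request a set of generations and variations of a result.
# The real 3DALL-E backend used Flask and Selenium; this stand-in assumes the
# OpenAI Images API with an API key configured elsewhere.
import openai

def generate_images(prompt, n=4):
    """Request a set of four generations for a (possibly rephrased) text prompt."""
    response = openai.Image.create(prompt=prompt, n=n, size="1024x1024")
    return [item["url"] for item in response["data"]]

def generate_variations(image_path, n=4):
    """Request variations of a previous generation, as the 'shuffle' button does."""
    with open(image_path, "rb") as image_file:
        response = openai.Image.create_variation(image=image_file, n=n, size="1024x1024")
    return [item["url"] for item in response["data"]]
```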
4 SYSTEM IMPLEMENTATION
3DALL-E was implemented within Autodesk Fusion 360 [37] as a plugin and written with the Fusion 360 API, Python, Javascript, Selenium, and Flask. Fig. 3 illustrates how we embedded DALL-E, GPT-3, and CLIP into one user interface. All actions in 3DALL-E were logged by the server to facilitate analysis of participant behavior in the study (Sect. 5). Note that 3DALL-E could be implemented generically in most 3D modeling tools. The needed functionality from Fusion 360 is relatively basic: a custom plugin system and ways to render the viewport as an image.

Prompt suggestions were populated by querying the GPT-3 API for the following: “List 10 popular 3D designs for {QUERY}? 1.”, “What are 10 popular styles of a {QUERY}? 1.”, and “What are 10 different parts of a {QUERY}? 1.”. These queries were split using regular expressions such that each suggestion was one button on the interface. To rephrase chosen suggestions, GPT-3 was prompted: “Put the following together: {SUGGESTIONS}”.
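A minimal sketch of this suggestion-and-rephrasing pipeline is shown below. It assumes the legacy openai Completion endpoint and a davinci-class model; the paper does not name the exact engine or decoding parameters.

```python
import re
import openai

TEMPLATES = [
    "List 10 popular 3D designs for {QUERY}? 1.",
    "What are 10 popular styles of a {QUERY}? 1.",
    "What are 10 different parts of a {QUERY}? 1.",
]

def prompt_suggestions(query, model="text-davinci-002"):  # model name is an assumption
    """Return suggestion lists (designs, styles, parts) for a user's goal, e.g. 'chair'."""
    results = []
    for template in TEMPLATES:
        completion = openai.Completion.create(
            model=model,
            prompt=template.replace("{QUERY}", query),
            max_tokens=128,
        )
        text = "1. " + completion["choices"][0]["text"]
        # Split the numbered list so that each item can become one button in the UI.
        items = [item.strip(" ,.\n") for item in re.split(r"\d+\.", text) if item.strip()]
        results.append(items)
    return results  # [designs, styles, parts]

def rephrase(selected_suggestions, model="text-davinci-002"):
    """Fuse the selected suggestions into a single editable prompt."""
    completion = openai.Completion.create(
        model=model,
        prompt="Put the following together: " + ", ".join(selected_suggestions),
        max_tokens=64,
    )
    return completion["choices"][0]["text"].strip()
```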
Ten 3D keywords are sampled from a set of high frequency words (n=121) in a Fusion 360 Screencast dataset. Screencasts are videos used to communicate help and tutorials in forums [2, 32]. Automatic speech recognition (ASR) of these videos produced transcripts; these transcripts were processed with standard count vectorization using NLP modules from Sklearn, filtered out for general purpose words (words that were not specific to CAD), and sorted by frequency to get the final keywords set.
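The keyword mining step could be reproduced roughly as follows; the filter list of general-purpose words here is a made-up placeholder, since the paper does not publish the actual list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder filter: the paper removed words that were not CAD-specific,
# but the actual list is not published.
GENERAL_PURPOSE_WORDS = {"click", "button", "video", "going", "thing"}

def mine_3d_keywords(transcripts, n_keywords=121):
    """Count-vectorize ASR transcripts and return the top-frequency CAD words."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(transcripts)     # documents x vocabulary
    totals = counts.sum(axis=0).A1                     # total count per vocabulary word
    vocabulary = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocabulary, totals), key=lambda pair: -pair[1])
    keywords = [word for word, _ in ranked if word not in GENERAL_PURPOSE_WORDS]
    return keywords[:n_keywords]
```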
Text highlights were calculated by passing each of the prompt suggestions and the image prompt to CLIP, which was hosted on a remote server. CLIP produces softmaxed logit scores¹ that suggest how similar each text option was to the image, a value 3DALL-E renders as the opacity of each highlight. The stronger the highlight, the greater the probability a text option matched what a user had in their viewport. DALL-E was trained with CLIP text and image embeddings. By using CLIP's embedding in this way, users receive a computational guess for how well DALL-E might be able to interpret each prompt suggestion, while also dialing down the options they need to focus on (Fig. 4). The 3D keywords were by default gray, while designs, styles, and parts were matched to gradations of blue, green, and orange respectively.

¹ Applying softmax to logit scores yields normalized linear probabilities.
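The highlight computation can be sketched with the open-source CLIP package as below; the model variant (ViT-B/32) and the direct mapping of probability to opacity are assumptions for illustration, not the authors' published configuration.

```python
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed CLIP variant

def highlight_opacities(viewport_png, suggestions):
    """Score each prompt suggestion against the viewport image; return opacities in [0, 1]."""
    image = preprocess(Image.open(viewport_png)).unsqueeze(0).to(device)
    text = clip.tokenize(suggestions).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)             # similarity logits
        probs = logits_per_image.softmax(dim=-1).squeeze(0)  # softmaxed scores sum to 1
    # Higher probability -> more opaque highlight in the UI.
    return dict(zip(suggestions, probs.tolist()))

# e.g. highlight_opacities("viewport.png", ["mid-century modern", "contemporary", "art deco"])
```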
We used the Fusion 360 API to automatically save the viewport to a PNG image every 0.3 seconds. The workspace of Fusion 360 (the gridded background pictured in Fig. 1) was rendered transparently in the PNG image.
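A stripped-down Fusion 360 script for the capture step might look like the following; the output path, image size, and timer wiring are assumptions, and the transparent-background handling is omitted.

```python
# Minimal Fusion 360 script sketch: save the active viewport as a PNG image.
# 3DALL-E repeated a capture like this on a timer (roughly every 0.3 s).
import adsk.core, adsk.fusion, traceback

def run(context):
    ui = None
    try:
        app = adsk.core.Application.get()
        ui = app.userInterface
        viewport = app.activeViewport
        viewport.saveAsImageFile('C:/temp/viewport.png', 1024, 1024)  # assumed path and size
    except:
        if ui:
            ui.messageBox('Viewport capture failed:\n{}'.format(traceback.format_exc()))
```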
5 EVALUATION
Implementing 3DALL-E within Fusion 360 gave us a focused application context to evaluate text-to-image AI within a creative workflow. We set out to investigate the following research questions for 3DALL-E to understand in what ways text-to-image AI can be useful for 3D designers.
• Generation Patterns within Workflows. Are there certain patterns to how CAD designers use text-to-image generations within their workflows, and do these patterns differ depending upon the 3D modeling task?
• Assisted Prompt Construction. How helpful are different features (prompt suggestions, CLIP highlighting, automatically captured viewport images) for the construction of text and image prompts?
• Prompt Complexity. How many concepts do people like to put within prompts?

To do so, we conducted an exploratory study with 3D CAD designers (n=13, 10 male, 3 female). Participants were recruited from internal channels within a 3D design software company as well as through a design institute mailing list at a local university. Participants were compensated with $50 for 1.5 hours of their time. The average age of the participants was 28, and they had an average of 4.13 years of experience with Fusion 360 (min=1 year, max=8 years). Five had experience with the generative design environment within Fusion, and three had prior experience with AI / generative art systems. The participants spanned a range of disciplines from machining to automotive design. Domains of expertise, frequency of use, and years of experience with the 3D software are listed in Table 1. Based on the system implementation in a CAD software, we focused on CAD designers and product designers rather than 3D artists and 3D concept art more broadly.

5.1 Experimental Design
Participants were given two different 3D modeling tasks: T_edit to edit an existing model and T_create to create a model from scratch. The intention of having these two tasks was to show how 3DALL-E might affect creative workflows at different stages of the 3D modeling process. The ordering of these tasks was counterbalanced to mitigate learning effects. This experimental design was approved by a relevant ethics board.

Before the study, participants were sent an email with DALL-E's content policy to disclose that they were going to use AI generative tools. During the study, participants were given a brief introduction to the different AI architectures involved (GPT-3, DALL-E) and given two general tips on prompting: 1) text prompts should include visual language, 2) text prompts are not highly sensitive to word ordering [52]. Participants were then given a walkthrough of the user interface and the different ways they could generate results from GPT-3 and DALL-E. The study was conducted virtually via Zoom and through remote control of the experimenter's Fusion 360 application and plugin.
Figure 3: System design showing the architectures involved in 3DALL-E, which incorporates three large AI models into the
workbench of an industry standard CAD software. In the top left panel, we show how text AI outputs are displayed in the UI.
In the bottom left panel, we show how users could pass in image prompts and retrieve DALL-E generations within the plugin.
Figure 4: Diagram showing how text highlights were calculated using CLIP with image and text from the prompt suggestions as
input. The CLIP logit score was set as the opacity of each prompt suggestion. Each type of suggestion was colored differently.
T_edit was to modify an existing 3D model that the participants had brought with them to the study. Participants were told to bring a non-sensitive model, meaning one that did not include corporate data. There were no constraints on what the model could have been. Examples of models brought in can be seen in Fig. 5. When a participant did not have a model to use, a random design was provided from the software's example library. This was the case for only one participant (P15).

For T_create, participants were allowed to pick whatever they wanted to design from scratch. For each task, participants had 30 minutes to work on their model with the assistance of 3DALL-E. We justify this duration of 30 minutes as a sufficient length of time based on prior work: DreamSketch [45] (30 to 60 minutes for 3D artifact creation) and Dream Lens [57] (25 minutes for generative design exploration). At the halfway point, participants were reminded of the time remaining and of any generation actions that they had not tried out yet from GPT-3 (prompt suggestions) or DALL-E (text-only prompts, image+text prompts, variations). Beyond this reminder, they were guided only if they needed assistance accomplishing something in the user interface. Examples of what participants created for T_create can be seen in Fig. 6. At the 30 minute mark, designers were told to wrap up their design.

After completing each task, participants marked generations in their history that they felt were inspiring and completed a post-task questionnaire, which included NASA-TLX [33], Creativity Support Index (CSI) [11, 55], and workflow-specific questions. These questions can be found in the supplementary material. A semi-structured interview was conducted to understand their experience.

5.2 Quantitative Feedback on 3DALL-E
5.2.1 Creativity Support and NASA-TLX Results. The metrics we measured showed that designers responded to 3DALL-E with enthusiasm. All responses were on a 7-point Likert scale. In terms of enjoyment, 12/13 participants rated their experience positively (≥ 5 out of 7) for T_edit (median: 6) and 11/13 for T_create (median: 6). The majority of participants also responded positively that they were able to find at least one design to satisfy their goal: 10/13 respondents in T_edit (median: 6), 12/13 respondents in T_create (median: 7). Likewise, most participants reported that the system helped them fully explore the space of designs (9/13 responded positively for T_edit (median: 6), 11/13 for T_create (median: 6)).

“I could spend ages in this.” - P18

In general, the post-task questionnaire results were similar for T_edit and T_create. However, on a few dimensions, participant responses were distributed slightly differently. For example, for effort, responses for T_edit about tool performance (“How successful were you in accomplishing what you set out to do?”) were split across the spectrum, with 6/13 rating the tool positively (median: 4). For T_create, 10/13 participants rated the performance positively (median: 5). In terms of ease of prompting, while 13/13 respondents were positive that for T_create it was easy to come up with prompts (median: 7), 10/13 responded positively for T_edit (median: 5). We hypothesize that this could have been because for T_edit participants had to work under more constraints, bringing in 3D models that were often custom and near finished.

We note that frustration was low for both tasks; 11/13 responded on the low side of the spectrum (≤ 3) for T_edit (median: 3), and 10/13 on the low side for T_create. For T_create (median: 2), frustration was low in spite of the fact that 6/13 of participants disagreed to some degree (≤ 3) about having control over the generations.

“The amount of control you have with the system is very dependent upon how specific you get with the text. For example, if I make it super broad, you're obviously going to have less control because DALL-E is working off of less information. So it may provide its own information. It has to kind of fill in the gaps of what you're trying to say. But the more specific I got, the better results I got.” - P1

“It was a bit difficult to control. Some things I wasn't quite expecting. For example, with this one [generation of a watch] I expect that it would have more circular watch faces, but it came with ones that were more angular.” - P8

5.2.2 Usefulness of GPT-3, CLIP Highlights, Image Prompts. Lastly, to understand how helpful different features (prompt suggestions, CLIP highlighting, automatically captured viewport images) are in the construction of text and image prompts, we discuss workflow-specific questions about the prompting pipeline of 3DALL-E. Participants were asked about the usefulness of 3DALL-E for their usual workflow. For T_edit, 10/13 felt that it would be helpful (median: 5). For T_create, 10/13 also felt it would be helpful (median: 7).

In another question, we asked whether it was easy for participants to come up with new ways to prompt the system. Participants responded unilaterally positively for T_create (13/13 responded ≥ 5) and positively for T_edit (10/13 responded ≥ 5, median: 6). Participants were also asked to rate how useful they found the GPT-3 suggestions. For T_edit and T_create, the responses were generally positive; at least 8/13 participants responded with 5 or higher for both tasks (T_edit median: 7, T_create median: 6).

“I'm looking for the right word and I think that's where this text [GPT-3] search can come in handy... I think it's helpful to know its language, to know what it finds.” - P4

“I think having the GPT-generated ones was useful. It allowed for some ideas I didn't consider... [ideas I] wouldn't have found the words for.” - P13

On whether or not the highlighting of prompt suggestions was useful, participants responded with more even distributions, though the distributions still skewed positive (8/13 in T_edit and 7/13 in T_create rated the statement at 5 or higher; median: 5 for both tasks). Participants tended to click on suggestions that were highlighted more strongly for text-image alignment, often choosing the most strongly highlighted suggestion within the category.

Lastly, we gauged participant response to image prompts, asking if they agreed that image prompts were incorporated well in their generations. For T_edit, 10/13 participants responded with a 5 or 6 for agreement (median: 6). For T_create, 8/13 participants responded with a 6 or 7 (median: 6).
Figure 5: Examples of 3D designs participants brought in during T_edit, which was to edit an existing model.
Table 1: Participant details, with discipline, Fusion 360 usage frequency, and years of experience. We list labels for the model they designed during T_create and labels for the model they brought in (T_edit).
Figure 6: Examples of 3D designs participants came up with during T_create, which was to create a model from scratch.

“Image prompts definitely allowed me to tailor the outcomes towards what I was hoping for or expecting maybe... I'd have struggled to replicate [the render type] if I hadn't done the click on the image [sent in an image prompt] and create some variations. I think once I found something I liked, using those variations made it much easier to stick to that design theme.” - P13

“This middle one is pretty insane... it has integrated my design into the image properly... even as an assembly, I think that's completely nuts... [An image prompt] connects what I'm working on with it [DALL-E]... otherwise it might be giving some random results, and after a while it might become redundant for me.” - P18

We analyzed participant prompt logs to quantify how often participants used 3DALL-E-provided prompt suggestions. For both T_edit and T_create, we counted how many times participants used the 3DALL-E-provided prompt suggestions (3D keywords, designs, parts, and styles) and how many times participants provided a custom keyword. Collectively, these represented all the keywords within prompts. Across both tasks and all participants, we found that 3DALL-E-provided prompt suggestions accounted for 63.61% of all prompt keywords, showing that participants heavily used the GPT-3 function of 3DALL-E. We also see in Fig. 7 that 3DALL-E provided the majority of prompt keywords (at least half) for 9/13 participants in T_create and 9/13 participants in T_edit. These results are summarized in Table 2.

Table 2: Source of prompt keywords across tasks, comparing the frequency of prompt keywords supplied by participants versus by 3DALL-E. 3DALL-E provided the majority of prompt keywords in both tasks.

Task          Participant-provided    3DALL-E-provided
T_edit        34.95%                  65.05%
T_create      38.64%                  61.36%
Both tasks    36.39%                  63.61%

5.3 Prompting Behavior
We were able to observe certain patterns of prompting with 3DALL-E as each generation action was logged by our interface. From these logs for both GPT-3 and DALL-E, we were able to provide timelines of generation activity in Fig. 10 (T_edit) and Fig. 11 (T_create).
Figure 7: Count of prompt keywords by source (3DALL-E- or participant-provided) for each participant during T_edit (top) and T_create (bottom). 3DALL-E provides at least half of prompt keywords for 9/13 participants in both tasks.

5.3.1 AI-first, AI-throughout, or AI-last. One of the most salient ways to distinguish participants was at which points in their workflow they took to 3DALL-E and at which points they focused on Fusion 360. Some participants were AI-first, meaning they tended to sift through AI generations first until they had a better grasp of its abilities or until they found a design that they liked before taking any significant 3D design actions. For example, P18 (top row of Fig. 11), a technical software specialist with an industrial design background, was trying to make a car. They first began looking for inspiration for a matchbox car, before diving into prompt suggestions like “sports car”. Text prompts that P18 tried included “a single sports car built like a Lego building block, view from the top.” and “The Dark Knight Rises: the body of a car as a Lego building set”. They added perspective (“view from the top”) and a number word (“single”) to specify the composition of their generation and tried “The Dark Knight Rises” as a style suggested by 3DALL-E for the query “matchbox car”. After liking one of the resulting generations (Fig. 13), P18 used the result as a reference image. For the rest of the duration of the task, P18 modelled within Fusion 360. P18 first traced over half of the generation like a blueprint before extruding faces to varying heights. They then beveled and chamfered these starting blocks of a car to add ridges and windshields and subtracted material to make room for wheels. They ended by mirroring the half of the car they modeled to create a full symmetrical car. P18's prompting and modeling workflow for T_create is shown in Fig. 9.

The AI-last pattern occurred when participants jumped straight into their existing workflows for 3D design and tried 3DALL-E later. We see this in the rows of Fig. 11 that start off with orange bars, which indicate that participants started modeling from the get-go of the task. P11, for example, was trying to make a bottle. They began by sketching the cross-section of a bottle and revolving it 360 degrees to create a form. After filleting the base to round it and hollowing it out with a hole, they found prompt suggestions from 3DALL-E like “fusion 360” and “Coca-Cola”. Using a generation prompted from “Front view Coca-Cola Bottle”, they edited their bottle cross-section to match that of the generation. Only after they had created this basic bottle did they start looking for inspiration; seeing generations of Coca-Cola bottles later helped P11 figure out how to bring complexity into the cross-section of their design. P16 (second row in Fig. 11) was another AI-last participant. They already had an existing screwdriver concept in their mind. They began by sketching and extruding a rounded rectangle for the grip of the screwdriver, dimensioning accordingly. They worked on the flat-head tip by extruding a narrow cylinder and lofting the face out to a point. After making a rough model, they tried 3DALL-E with prompts specific to flat-head screwdrivers and used their existing modeling progress as an image prompt. P16 commented that 3DALL-E inspired them to consider different handle cross-sections (e.g. hexagonal, square) and grooved grips. Note that the AI-last pattern, jumping into a participant's existing workflow with Fusion 360, was more prevalent in T_create.

However, there were also participants who queried AI-throughout. Many participants (P13, P1, P8, P10) would intermittently craft an image prompt by briefly working within Fusion 360 and then start generating. We see these actions whenever participants would have a short window of Fusion time that led up to image+text generation (medium blue dots in Fig. 10 and Fig. 11). During these short windows, participants were generally changing their camera perspective or the visibility of different parts in their assemblies. For example, P10 hid the hopper of a toy truck they had brought in and tried to generate different semi-trailers using prompts such as “Jeep Gladiator snow plow truck”. P13 (during T_create) was another AI-throughout participant. They first built up a base for an audio speaker they wanted to design and applied wood and chrome finishes for a Scandinavian design aesthetic. They then tried prompts with lighting elements (e.g. “Isometric Scandinavian minimalism audio speaker with built-in lights”). They built towards a generation they liked for a while, adding details of a speaker cone and applying tessellation and reducing operations to give the speaker body structural texture. Then they began to create image prompts for 3DALL-E to fill in—deleting faces and extrusions or hiding bodies in their geometry. They wanted to see the different ways the middle section of their speaker could be autocompleted. We see P13's work in T_create and the way they utilized AI throughout their workflow illustrated in Fig. 9.

Participants would also use text-only prompts to take them towards new directions. P9 used text prompts to pivot their design multiple times and better scope their 3D design. Originally, P9 intended on creating a prosthetic hand and tried generating “A 3D model of a robotic hand with two fingers”. After finding modeling a
Figure 8: Distribution of Likert scale responses on NASA-TLX, Creativity Support Index, and workflow-specific questions across all participants for both T_edit and T_create. Full questions are in the Appendix.
Figure 9: Prompting and 3D modeling workflows of three participants (P18, P13, and P1). P18 created a car, P13 created an audio speaker, and P1 edited a robot. Timelines are vertical, with markers representing different generation requests and yellow intervals representing CAD time. The markers preserve order, but the timestamps across participants are not aligned or to scale.
Figure 10: Pattern of generation activity for T_edit, when participants edited an existing model.
hand to be too complex because of how articulated they are, they tried text-only prompts “3d model of a human fist” and “3d model of mittens” to explore what they could more feasibly model, exploring divergently. Deciding on mittens, they imported a generation as a reference and sketched over it. After extruding the sketch and applying fillet operations to round out the mittens, P9 added cuff sleeves, a detail inspired by the generation.

In terms of generation patterns for GPT-3, nearly everyone started with generating from GPT-3 (though this could be because of the organization of the user interface). Many continued to use GPT-3 throughout each task, and we can see this reflected in the fact that there are purple diamonds (GPT-3 actions) at the early, middle, and late stages of workflows for both T_edit and T_create.

5.3.2 Switches between Types of Prompting. Eight participants passed in an image prompt as their first action in T_edit, and eight participants passed in text prompts as their first generation action for T_create. This suggests that participants may be more likely to pass in an image prompt if they already have work on their page. Aggregating across all the different generations across T_edit and T_create, we did not see that any mode of prompting was favored more than the rest. Preferences in prompting were highly dependent upon the participant and also how well the participant felt like the generations incorporated their image prompts. For example, even though P13 found image prompts useful, they felt like image prompts were incorporated in an “awkward” way, as they had more glaring visual artifacts than text-only generations.

In certain rows in Fig. 10 and Fig. 11, we could see that some participants would shift away from using image prompts and focus on text-only prompts. A case in point of this was when P1 worked on a tank-drive robot that they had built for a FIRST [39] robotics competition during T_edit (pictured in Fig. 9). To craft image prompts, they played around with different angles of their models and toggled the visibility of parts like the wheels and ground plane of their model. The robot was a highly convoluted assembly, and while they found that 3DALL-E could generate decently even on these visually complex image prompts, they ended up passing in a series of text-only prompts like “3D illustration of a Roomba with four wheels powered by motors” and “flat image of a toy wheel” (focusing in on a specific part rather than trying to get 3DALL-E to work with the full assembly was also a common strategy of participants). In this situation, the text-only generations were easier for P1 to parse and make sense of. P5 was another example of someone who pivoted away from passing in image prompts to use text-only prompts after receiving sets of unsatisfying generations during T_edit. The image prompt that they passed in was a mechanical base, so the generations building off of that were all visually indeterminate (not recognizable as any particular object). P5 instead decided to generate textures of water and maple syrup to project onto their original model (as seen in Fig. 6), finding this to be an easier way to make use of their part and 3DALL-E.
Figure 11: Pattern of generation activity for T_create, when participants created a model from scratch.
6 PROMPT COMPLEXITY
It can be challenging for an end user to understand how lengthy or detailed a text-to-image prompt should generally be, which is why we studied prompt complexity with 3DALL-E. In 3DALL-E, GPT-3 would automatically rephrase selected prompt suggestions while adding a small amount of connecting words. Based on this design, we could measure complexity as the number of concepts forming the basis of a prompt. For example, if “3d render, minimalist, chair” was rephrased as “3d render of a minimalist chair”, we gave the prompt a count of 3 concepts.

However, participants also had the ability to edit the final prompt and to add or subtract concepts of their own. In cases where the text prompt mostly came from the participant rather than GPT-3, we counted the number of concepts based on rules from linguistics and natural language processing. The prompt complexity was then the number of noun phrases and verbs in a prompt, ignoring prepositions, function words, and stop words. Count words were ignored; they were considered modifiers for the noun phrases they were a part of (e.g. “five fingers” was one concept).
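A rough approximation of this counting rule for participant-authored prompts (not the authors' code) can be written with spaCy; for prompts rephrased from selected suggestions, the count is simply the number of suggestions chosen.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def prompt_complexity(prompt: str) -> int:
    """Approximate concept count: noun phrases plus verbs, ignoring function words."""
    doc = nlp(prompt)
    # Noun chunks absorb their modifiers, so count words like "five" in
    # "five fingers" stay inside a single concept.
    concepts = len(list(doc.noun_chunks))
    # Verbs also count as concepts; prepositions, function words, and stop
    # words add no separate count because they are skipped or absorbed.
    concepts += sum(1 for token in doc if token.pos_ == "VERB")
    return concepts
```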
We annotated text-only and image+text prompts with the number of concepts. We did not annotate variations for complexity because the generation of those images was not directly informed by text prompts. From these annotations, we charted prompt complexity across participants in Fig. 12. We found that participants tended to explore prompts of between two to six concepts, which is where most of the density of points concentrates in Fig. 12. We see that participants were also willing to try a range of concepts, as we can see in the wide spread of P2, P9, and P10. Fig. 12 also shows that participants could easily assemble prompts of over six concepts with this workflow.

We note that even when the prompts were filled with concepts: “V-shape, Y, Tricopter, Sports, Abstract, Landscape, Aerial, Gimbal, Camera, Transmitter, Flight controller, Receiver”, 3DALL-E could still return legible images. For this prompt, P2 received generations that had laid out displays of product components. P2 was an obvious outlier in the complexity of the prompts that they provided. They were keen on trying to “break the system” and passed prompts averaging 10 concepts. We did not discern a difference between complexity observed for T_edit and T_create.

7 QUALITATIVE FEEDBACK

7.1 3DALL-E Use Cases for CAD Design
7.1.1 Use Case: Preventing Design Fixation. Participants demonstrated different use cases of 3DALL-E as they progressed through the tasks. The most commonly acknowledged use case was using the system for inspiration, particularly in the early stages of a design workflow. P10 contextualized some of the challenges that 3D designers face on the job, such as design fixation and time constraints. “A lot of times designers get stuck, they get tunnel vision...the folks at [toy design company] used to say to me, “We can't come up with enough designs...it takes too long to come up with a design, so then we only get two or three...we would like to see thousands of design
Figure 12: Prompt complexity measured across participants, where complexity is the count of concepts in each text-only and
image+text prompt. Participants span the X-axis, sorted by the count of their most complex prompt. The values are jittered to
show multiplicity; many prompts mapped to the same number of concepts. Complexity tended to concentrate between two to
six concepts, as seen by the density of prompts within that interval. Each datapoint was colored based on prompt task.
capture “isometric” and “perspective” views in the technically accurate sense of those words. Nonetheless, even if generations were not drawn to perspective or as clean as technical drawings and renders usually are, participants still found them useful as reference images. Other participants used the generations as references, albeit more loosely. P15, liking a “3D render of a desk lamp Victorian” (Fig. 13), made the arm of their lamp skinnier as per the generation. P9, observing generations from prompts such as “isometric 3d renders of a cleaning sprayer bottle” (Fig. 13), noted that they could subtract volume from the outer contours of their model and reduce the amount of material used, which was part of their goal to design a more sustainable spray bottle top.

7.1.3 Use Case: Textures and Renders for Editing Appearance. Participants would also edit their model appearance towards the look of generations (P13, P15, P10, P3). They could do this by applying textures within the software and dragging and dropping materials from the software’s material library onto surfaces. P5 innovatively used generations as textures to help build a 3D outdoor movie theater scene. Their scene was built out of simple geometries, and atop these geometries, they placed generations of a “jello bed” and generated portraits of pop culture characters (pictured in Fig. 6).

P1 mentioned that 3DALL-E could be useful for product design presentations to show the function or interaction of things being designed. As P1 made a prosthetic hand, they imported a generation and started to model atop it. Curious about how a text+image prompt would fare if it included a generation transparently overlaid over their geometry, they generated and found compelling images that could visually situate their designs with their use cases in product design presentations.

7.1.4 Use Case: Inspiring Collaboration. Design in industry is a team effort, and while 3DALL-E was evaluated in the context of a single user, many participants acknowledged that 3DALL-E could be beneficial in teams. P16 mentioned that from their industry experience, 3DALL-E would be excellent for establishing communication between mechanical engineers and industrial designers. Mechanical engineers focus on function, while industrial designers focus on aesthetics. P16 felt that 3DALL-E could help both sides pass around design materials for discussion and common ground.

P13, who was an industrial designer, noted that teams could also do multi-pronged exploration with 3DALL-E. Because each team member would have individual prompting trajectories, a team could easily produce diverse searches and more variety during brainstorming. P3 mentioned that there are already points within their industry (automotive) where there are hand-offs between the people who generate design ideas and the people who execute them. Technical sales specialist P4 also mentioned that they could instantly see 3DALL-E being useful for their clients, many of whom have bespoke requests such as organic fixtures for restaurants and museums or optimized shapes for certain materials.

7.1.5 Use Case: Inspiring Design Considerations. 3DALL-E also inspired design considerations by making participants think about different aspects such as functionality or manufacturability. For example, P1 was looking for a wheeled robot. Seeing generations where robot bodies varied in the number of wheels they had or how far off the ground they were made P1 think about the different amounts of motor power these robots would require. While 3DALL-E could not guarantee the feasibility of every generated design, some participants (P1, P8) liked that 3DALL-E inspired them to think through details such as how manufacturable a design was.

Participants also felt like they could elicit unique, out-of-the-norm designs from 3DALL-E and use it to gauge the uniqueness of their own designs. P4 wanted to design a product that did not exist in the real world yet: an ear gauge electronic for their son. They treated the model’s inability to come up with their exact vision in generations as a good thing, interpreting it to mean that the product did not exist yet and therefore had patentable value. “We [DALL-E] started to lose a little bit when we started putting in the ‘Bluetooth ring’, which is good because that tells me... probably out there in the real world, nobody’s actually doing this... that made me feel good about the fact that I might have a predicate design in my head.” P2, who had taught drone design classes, also felt that right off the bat, 3DALL-E was able to produce unique aesthetics beyond what is typically seen in drones, something their students generally struggled to do. P15 also felt like 3DALL-E could have educational value as they looked around for ways to accomplish something they saw in a lamp generation: “being able to reverse engineer... that is a cool learning aspect.” 3DALL-E could not guarantee the educational or patentable value of a generation, but it inspired participants (P4, P2, P15) to think about design considerations such as design conventions, uniqueness, and plausibility.

7.1.6 Weaknesses in terms of CAD. Some participants did comment that text-to-image AI may have weaknesses in applications like machining and simulation or the construction of internal components and other function-focused parts. P9 pointed out that it would be difficult to generate geometries that enclose parts, because if a user was to pass in an image prompt of that part, 3DALL-E would be unable to draw housing over it. Likewise, a participant mentioned that they could imagine 3DALL-E being used to design the facade of a car, but they did not believe that it could design a more internal component not easily describable in layman’s terms.

7.2 Comparing with Traditional Workflows
Our exploratory study invited designers to stress test 3DALL-E across the settings of a wide range of disciplines. Participants were impressed with the ability of the model to generate even when they passed prompts filled with technical jargon like “CNC machines”, “L-brackets”, or “drone landing gear”. Still, prompting remains very distinct from the workflows participants usually go through. Many participants described their regular design process as a progression through multiple phases, from low fidelity to high fidelity. They mentioned roughing out designs first, putting placeholders within robotic assemblies (P1), box blocking up to complexity (P13), and redesigning from the ground up again and again (P18). Even though 3DALL-E only provided images of 3D designs, these designs could have high fidelity details that could shortcut participants to later stages of the design process.
7.2.1 Text Interactions in 3D Workflow. The most distinct difference in workflows is that 3DALL-E is text-focused, but text is not central to 3D design workflows, which are usually based on the direct manipulation of geometry. P13 mentioned that designers primarily operate visually. “The only reason I really use text in an industrial design context is [for] making notations on a design...to explain what a feature is...to write a design specification...but the majority of the time is image focused.” Because of this, P13 preferred the “image-based approach” within 3DALL-E where they could “provide it with a starting image and get variants of that”. P4, however, thought that in some respects designers are often engaging with text, but in the form of numbers, properties, parameters, equations, and configurations. “[We] do it in a smart way...[we] drive it with the math equation. This is something we can do in parameters, and it is very text-based.”

7.2.2 Problem Solving with 3DALL-E. P10 and P4 described their day-to-day job tasks as customer-facing CAD specialists as problem solving and finding design solutions. P10 began the study wondering if 3DALL-E could solve a problem they were facing in their job: packaging a toy truck. To do so, they, like many of the participants, tried employing 3DALL-E as a problem solver. P10 tested prompts such as “create a toy dump truck and fire truck with plastic material” and “protect a sphere with foam” to see if 3DALL-E could help encase a 3D model. From the results they saw, they concluded that 3DALL-E “was not intended to be a problem solver type of tool”.

P13 set up image prompts as autocompletion problems. As they built an audio speaker during one of the tasks, they commented that they were “creating two pieces of geometry and using it [3DALL-E] as a connection between the two... kind of like the automated modeling command” [36]. They also tried other innovative ways of creating image prompts: “a hacky approach, trying to keep preserved geometries with the faces and using 3DALL-E to fill in the gaps”.

7.2.3 Driving the Design. When AI input is added into a workflow, questions of who drives the design process and who owns the final design can arise. While P9 liked that 3DALL-E augmented their workflow with what they called dynamic feedback, they felt as though their design was being driven by the generations. “Initially, the image did not really meet my expectation... but eventually I was also trying to not imagine anything and just depend upon what it was suggesting.” P3 mentioned that they felt as if they were driven by 3DALL-E, while P15 mentioned that sometimes, in the midst of exploring, they felt they were not gravitating towards building.

As for ownership, many participants felt like the designs they created with 3DALL-E would still be their own. P1 stated on ownership, “A lot of 3D modeling is stealing...borrowing premade files online, and then assembling it together into a new thing. For this robot, we borrowed these assemblies from already premade files that were sold by the company. We modelled based off of that, but the majority of this robot can be considered ours because we determined the placement.” P13 was also not worried about ownership concerns, stating that even now, anyone can recreate any model found online, but that “it’s about the steps you go through to get there.”

P18 mentioned that for an AI to be applied to the real world, it still takes an expert designer’s understanding of the market and customer needs. “I would use my know-how of manufacturing processes and the market or style. My service would adopt AI as a source of inspiration rather than as the solution.” Reflecting on what would happen if AI inspiration became mainstream without designers in the loop, they expressed concerns that “if everyone would converge on the same designs [because] it only learns from the input it gets from people... we might lose creativity.”

7.3 Comparison with Existing Generative CAD Tools
Five of 13 participants had experience with the existing generative design mode within the 3D CAD software [38]. Generative design (GD) is an environment in Fusion 360 in which the completion of a 3D design is set up like a problem: users define physical constraints and geometric filters that allow a model to be autocompleted. We did not directly compare with GD, because hardware constraints made 3DALL-E incompatible with GD. However, we did ask participants with experience in GD to compare and contrast the two.

A primary difference was that GD allows users to directly manipulate the model geometry, which differs from the text-based interaction of 3DALL-E. GD results therefore free the user from doing more modeling work. What one participant liked about GD was that “once they set up the problem, they could just hit go... don’t have to actually worry about lofting and modeling”. However, participants mentioned that GD has a higher barrier of entry; users are burdened with calculating loads and non-conflicting constraints, which requires some understanding of physics and engineering.

“You’re [GD] focused on strength, endurability of the model itself, really driven as a manufacturing task... your end result is something that’s makeable... whereas this process [3DALL-E] is more on the creative side.”

P2 mentioned that 3DALL-E allowed users to come up with outcomes far more efficiently than GD. In the span of a 30-minute task, users were able to browse hundreds of results, with the first results coming in a matter of seconds, whereas P2 had previously had to wait multiple hours or even days for GD. P2 and P18 were enthusiastic that GD and 3DALL-E could merge. P10 suggested that one way these two tools could complement each other is if “this tool [3DALL-E] could be used to generate shapes... pass it off to the generative design [GD] to optimize”.

8 DISCUSSION
Our results demonstrate high enthusiasm for text-to-image tools within 3D workflows. With 3DALL-E, participants had a tool for conceptual CAD that could help them combat design fixation and get a variety of reference images and inspiration. Furthermore, we elaborated on prompting patterns that can help clarify when and what types of text-to-image generation can be most helpful. In measuring prompt complexity, we showed that many prompts fall within a range of two to six concepts, providing a heuristic that can be implemented in text-to-image prompt interfaces. The following discussion focuses on best practices for helping 3D designers bring their own work into AI-assisted design workflows and the implications of these workflows.
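As a rough illustration of how such a heuristic could surface in a prompt interface (this is not the measurement procedure used in our analysis), the sketch below approximates concept counts from comma-separated phrases and descriptive keywords and returns a hint when a prompt falls outside the two-to-six range. The helper names (count_concepts, complexity_hint) and the concept-splitting rule are illustrative assumptions.

```python
# Minimal sketch of the two-to-six-concept heuristic. Concepts are approximated
# by comma-separated chunks and their non-stopword descriptors; this splitting
# rule and the function names are illustrative, not taken from 3DALL-E.
import re

STOPWORDS = {"a", "an", "the", "of", "with", "and", "in", "on", "for"}

def count_concepts(prompt: str) -> int:
    """Approximate the number of concepts in a text prompt."""
    chunks = [c.strip() for c in prompt.split(",") if c.strip()]
    concepts = 0
    for chunk in chunks:
        words = [w for w in re.findall(r"[A-Za-z0-9'-]+", chunk.lower())
                 if w not in STOPWORDS]
        # A chunk such as "isometric 3d render" contributes a few descriptors.
        concepts += max(1, len(words) // 2)
    return concepts

def complexity_hint(prompt: str, low: int = 2, high: int = 6) -> str:
    """Return feedback a prompt interface could show next to the text box."""
    n = count_concepts(prompt)
    if n < low:
        return f"{n} concept(s): consider adding a style, part, or design keyword."
    if n > high:
        return f"{n} concepts: consider splitting this into two simpler prompts."
    return f"{n} concepts: within the range most participants used."

if __name__ == "__main__":
    print(complexity_hint("isometric 3d render of a cleaning sprayer bottle"))
```

An interface could surface this hint before the user submits a generation request, nudging prompts toward the range where participants tended to work.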
8.1 Prompt Bibliographies
A strength of studying 3D workflows was that there was no conflict between the AI and the human on the canvas, as the AI had no part in the physical realization of the design. We believe this helps mitigate ownership concerns and makes text-to-image AI very promising for 3D design tools. Currently, AI-generated content is a gray area due to concerns of attribution and intellectual property [76], and there is no way to tell how heavily an AI-generated image borrows from existing materials. As generated content becomes more prevalent on platforms, it is important to develop practices of data provenance [19]. We propose the notion of prompt bibliographies to provide information on what informed designs and to separate out which contributions were human and which were AI. These can work to clarify ownership and intellectual property concerns.

Prompt bibliographies, illustrated in Fig. 14, could likewise help track designer intentions and enrich the design histories that software tools provide, which generally capture commands and actions (but not intentions). The bibliographies can be merged within the history timeline features that are present in tools like Fusion 360 and Photoshop, helping prompting integrate better with the traditional workspaces of creative tools.

Sharing prompt bibliographies along with their outcomes (i.e., 3D models) can also help respect all the parties that are behind these AI systems. End users can easily query for the styles of artists (as they already do) and create derivative works that dilute the pool of images attributed to artists. Prompt bibliographies may be especially relevant for CAD designers, as CAD is highly intertwined with patents, manufacturing, and consumer products.

8.2 Enriching Creative Workflows with Text
The advancements in prompting may push text prompting as a type of interaction into creative tools, even if creative workflows have traditionally not revolved around text. In 3DALL-E, we show the benefit of having a language model scaffold the prompting process. By giving the user fast ways to query and gesture towards what an AI is most likely to understand (as 3DALL-E did with the highlighted text options), we enable users to have more opportunities to understand what language may work best with an AI. At the same time, 3DALL-E helped users easily reach the design language of their domain, be it robotics or furniture design. In the quantitative survey results, participants felt it was easy to come up with prompts near unanimously for one task and unanimously for the other.

It is important to understand where in a workflow assistance can be of most use. Our survey results reflect that 3DALL-E produced a slightly more positive experience when it was introduced earlier in the process. This was corroborated by many participants who said they saw this tool being most helpful in the early stages of design. Well-placed AI assistance, such as early stage ideation with GPT-3, trying a text-only prompt to pivot directions, or carefully setting up an image prompt for 3DALL-E to fill in, can be greatly constructive and address pain points like design fixation that CAD and 3D designers in general feel today. Furthermore, if we understand the scope of the tasks we want AI to handle within a workflow, such as having GPT-3 suggest different parts of a model or having DALL-E generate reference images from front, side, and top views, we can better fit general purpose models to their task. We can have stronger checks on the prompt inputs and generation outputs if we understand what is within the scope of the task. For example, when P16 wanted a “flat head” screwdriver, they were returned results about a medical syndrome, something that could be avoided with content filtering guards checking for relevance to 3D design. AI models may not have to bear the full burden of providing good and ethical answers if we can have multiple checkpoints for propriety.

8.3 Generalizability
The design workflow posed in 3DALL-E is generalizable and can easily be used as a blueprint for text-to-image AI integration with different design software. The idea behind surfacing 3D keywords from application-related data (as we did with Fusion 360 Screencast data) also introduces ideas for how prompts can be tailored towards the technical vocabulary of a software. The idea of passing in image prompts is also easily extendable to different creative tools, even those outside of the 3D space. For example, graphic editing tools can pass in image prompts based on active layers chosen by a user. Animation software and video editors can send in choice frames for anchored animations and video stylization. A takeaway of this paper is to take advantage of the complex hierarchies that users build up as they design, such as the way 3DALL-E takes advantage of the fact that 3D models are generally assemblies of parts. With 3DALL-E, users could isolate parts and send clean image prompts without the burden of erasing or masking anything themselves.

8.4 Benefits of Text-to-Image for CAD
Few tools currently explicitly support conceptual CAD [35, 48]. 3DALL-E supports conceptual CAD not only at the beginning of the design process, but also throughout the workflow, as evidenced by the different usage patterns. It provides visual assets for CAD and product design as well as design knowledge that is otherwise difficult to collect (e.g., standard designs, specific part terminology). These visual assets can be utilized for detailed sketching within CAD, for appearance editing through materials, or for the inspiration of design considerations. 3DALL-E also presented directions that can address weaknesses of existing generative tools (GD) for CAD. By having 3DALL-E define shapes and then having the GD environment optimize them, existing generative tools could better align with what designers visually want, and go beyond physical constraints like loads and forces.

We demonstrated the efficacy of 3DALL-E at supporting a diverse set of potential CAD end users: mechanical engineers, industrial designers, roboticists, machining specialists, and hobbyist makers. 3DALL-E’s interdisciplinary design knowledge reflects both the strength of AI pretraining and the ability of designers to make integrative leaps to meet the AI halfway [81]. Additionally, the modular nature of 3DALL-E in Fusion 360 demonstrates an idea of separating out AI assistance from traditional non-AI direct manipulation features. Lastly, the text-based nature of the tool and its ready acceptance with designers demonstrates how text interactions can facilitate a low threshold, high ceiling design tool for CAD [63].

8.5 Future Work and Limitations
A necessary line of future work to make text-to-image AI more usable for CAD will be to integrate it with sketch-based modeling. Sketching is fundamental to CAD and relies on the creation and manipulation of clean primitives (splines, lines, etc.), so controlling the composition of text-to-image generations based on sketches would be highly useful.
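One rough way to prototype this direction, outside of what 3DALL-E implements, would be to rasterize the active sketch plane and send it, together with a mask over the regions the model may fill, to an image-editing endpoint so that generations respect the sketched composition. The sketch below assumes the 2022-era OpenAI Python client (openai.Image.create_edit); the helper names generate_from_sketch, export of sketch.png, and sketch_mask.png are hypothetical, and parameter details may differ across client versions.

```python
# Illustrative sketch only: constrain a text-to-image generation with a CAD
# sketch by sending the rasterized sketch plus a mask to an image edit endpoint.
# Assumes the pre-1.0 OpenAI Python client (openai.Image.create_edit) and that
# openai.api_key has already been configured; file names are placeholders.
import openai

def generate_from_sketch(prompt: str, sketch_png: str, mask_png: str, n: int = 4):
    """Request generations that keep the sketched strokes and fill masked areas."""
    with open(sketch_png, "rb") as image, open(mask_png, "rb") as mask:
        response = openai.Image.create_edit(
            image=image,   # square RGBA render of the sketch plane
            mask=mask,     # transparent pixels mark where the model may paint
            prompt=prompt,
            n=n,
            size="1024x1024",
        )
    return [item["url"] for item in response["data"]]

# Example (hypothetical files):
# urls = generate_from_sketch("3D render of a desk lamp, studio lighting",
#                             "sketch.png", "sketch_mask.png")
```

Whether such edits preserve spline-level precision well enough for CAD is exactly the open question this future work would need to answer.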
Figure 14: Prompt bibliographies, a design concept we propose for tracking human-AI design history. As prompts become a part of creative workflows, they may be integrated into the design histories already kept by creative authoring software. This bibliography tracks text and image prompts, as well as which generations inspired users during the tasks.
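To make the concept concrete, a prompt bibliography entry could be stored alongside the design file as a small structured record that separates human and AI contributions and links back to the history timeline. The sketch below is one possible shape for such a record; the field names and JSON export are illustrative assumptions, not part of 3DALL-E.

```python
# One possible shape for a prompt bibliography entry (illustrative only).
# Field names are hypothetical; the intent is to separate human and AI
# contributions and to link prompts to the design-history timeline.
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import List, Optional
import json

@dataclass
class PromptBibliographyEntry:
    text_prompt: str                      # human-authored text prompt
    gpt3_suggestions_used: List[str]      # AI-suggested keywords the user kept
    image_prompt_source: Optional[str]    # e.g. a viewport render or isolated part
    model: str                            # generator that produced the images
    generations_kept: List[str]           # ids/paths of generations marked as inspiring
    timeline_feature_id: Optional[str] = None  # link into the CAD history timeline
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

def export_bibliography(entries: List[PromptBibliographyEntry], path: str) -> None:
    """Write the bibliography next to the design file so it travels with it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(e) for e in entries], f, indent=2)

# Example (hypothetical values):
# entry = PromptBibliographyEntry(
#     text_prompt="isometric 3d render of a cleaning sprayer bottle",
#     gpt3_suggestions_used=["nozzle", "trigger"],
#     image_prompt_source="viewport_render_012.png",
#     model="DALL-E 2",
#     generations_kept=["gen_031.png"],
# )
```

Keeping such records per prompt, rather than per file, is what would let a history timeline show not only what was modeled but also which prompts and generations informed it.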
arXiv:1810.04805 [cs.CL]
[21] Alaa El-Nouby, Shikhar Sharma, Hannes Schulz, R Devon Hjelm, Layla El Asri, Samira Ebrahimi Kahou, Y. Bengio, and Graham Taylor. 2018. Keep Drawing It: Iterative language-based image generation and editing.
[22] Patrick Esser, Robin Rombach, and Björn Ommer. 2020. Taming Transformers for High-Resolution Image Synthesis. https://doi.org/10.48550/ARXIV.2012.09841
[23] Thomas Funkhouser, Michael Kazhdan, Philip Shilane, Patrick Min, William Kiefer, Ayellet Tal, Szymon Rusinkiewicz, and David Dobkin. 2004. Modeling by Example. ACM Trans. Graph. 23, 3 (Aug. 2004), 652–663. https://doi.org/10.1145/1015706.1015775
[24] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. Greater Creative Control for AI image generation. https://ai.facebook.com/blog/greater-creative-control-for-ai-image-generation/
[25] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. https://doi.org/10.48550/ARXIV.2208.01618
[26] Tong Gao, Mira Dontcheva, Eytan Adar, Zhicheng Liu, and Karrie G. Karahalios. 2015. DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology (Charlotte, NC, USA) (UIST '15). Association for Computing Machinery, New York, NY, USA, 489–500. https://doi.org/10.1145/2807442.2807478
[27] Songwei Ge and Devi Parikh. 2021. Visual Conceptual Blending with Large-scale Language and Vision Models. arXiv:2106.14127 [cs.CL]
[28] John S. Gero and Mary Lou Maher. 1993. Modeling Creativity and Knowledge-Based Creative Design.
[29] Katy Ilonka Gero, Zahra Ashktorab, Casey Dugan, Qian Pan, James Johnson, Werner Geyer, Maria Ruiz, Sarah Miller, David R. Millen, Murray Campbell, Sadhana Kumaravel, and Wei Zhang. 2020. Mental Models of AI Agents in a Cooperative Game Setting. Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376316
[30] Katy Ilonka Gero, Vivian Liu, and Lydia B. Chilton. 2021. Sparks: Inspiration for Science Writing using Language Models. https://doi.org/10.48550/ARXIV.2110.07640
[31] Arnab Ghosh, Richard Zhang, Puneet K. Dokania, Oliver Wang, Alexei A. Efros, Philip H. S. Torr, and Eli Shechtman. 2019. Interactive Sketch and Fill: Multiclass Sketch-to-Image Translation. arXiv:1909.11081 [cs.CV]
[32] Tovi Grossman, Justin Matejka, and George Fitzmaurice. 2010. Chronicle: Capture, Exploration, and Playback of Document Workflow Histories. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (New York, New York, USA) (UIST '10). Association for Computing Machinery, New York, NY, USA, 143–152. https://doi.org/10.1145/1866029.1866054
[33] S. G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. Advances in Psychology 52 (1988), 139–183.
[34] Nathan W. Hartman. 2004. Defining Expertise in the Use of Constraint-based CAD Tools by Examining Practicing Professionals. Engineering Design Graphics Journal 69 (2004).
[35] Fariz Muharram Hasby and Dradjad Irianto. 2022. Conceptual Design Assessment Method for Collaborative CAD System. In 4th Asia Pacific Conference on Research in Industrial and Systems Engineering 2021 (Depok, Indonesia) (APCORISE 2021). Association for Computing Machinery, New York, NY, USA, 254–261. https://doi.org/10.1145/3468013.3468340
[36] Autodesk Inc. 2022. Autodesk Fusion 360 faster performance and quality of life updates. https://www.autodesk.com/products/fusion-360/blog/usability-and-performance-improvements-fusion-360/
[37] Autodesk Inc. 2022. Fusion 360. https://help.autodesk.com/view/fusion360/ENU/?guid=GUID-7B5A90C8-E94C-48DA-B16B-430729B734DC
[38] Autodesk Inc. 2022. Generative design for manufacturing with Fusion 360. https://www.autodesk.com/solutions/generative-design/manufacturing
[39] For Inspiration and Recognition of Science and Technology (FIRST). 2022. FIRST Robotics Competition. https://www.firstinspires.org/robotics/frc
[40] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. 2022. Zero-Shot Text-Guided Object Generation with Dream Fields. (2022).
[41] Youngseung Jeon, Seungwan Jin, Patrick C. Shih, and Kyungsik Han. 2021. FashionQ: An AI-Driven Creativity Support Tool for Facilitating Ideation in Fashion Design. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3411764.3445093
[42] Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J Cai. 2022. PromptMaker: Prompt-based Prototyping with Large Language Models. https://doi.org/10.1145/3491101.3503564
[43] Tero Karras, Samuli Laine, and Timo Aila. 2018. A Style-Based Generator Architecture for Generative Adversarial Networks. https://doi.org/10.48550/ARXIV.1812.04948
[44] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. arXiv:1912.04958 [cs.CV]
[45] Rubaiat Habib Kazi, Tovi Grossman, Hyunmin Cheong, Ali Hashemi, and George W. Fitzmaurice. 2017. DreamSketch: Early Stage 3D Design Explorations with Sketching and Generative Design. Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (2017).
[46] Sumbul Khan and Bige Tunçer. 2019. Gesture and speech elicitation for 3D CAD modeling in conceptual design. Automation in Construction (2019).
[47] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. 2018. ABC: A Big CAD Model Dataset For Geometric Deep Learning. https://doi.org/10.48550/ARXIV.1812.06216
[48] Hitoshi Komoto and Tetsuo Tomiyama. 2012. A Framework for Computer-Aided Conceptual Design and Its Application to System Architecting of Mechatronics Products. Comput. Aided Des. 44, 10 (Oct. 2012), 931–946. https://doi.org/10.1016/j.cad.2012.02.004
[49] Carmen Krahe, Maksym Kalaidov, Markus Doellken, Thomas Gwosch, Andreas Kuhnle, Gisela Lanza, and Sven Matthiesen. 2021. AI-Based knowledge extraction for automatic design proposals using design-related patterns. Procedia CIRP 100 (2021), 397–402. https://doi.org/10.1016/j.procir.2021.05.093 31st CIRP Design Conference 2021 (CIRP Design 2021).
[50] Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In CHI Conference on Human Factors in Computing Systems. ACM. https://doi.org/10.1145/3491102.3502030
[51] Jing Liao, Preben Hansen, and Chunlei Chai. 2020. A framework of artificial intelligence augmented design support. Human–Computer Interaction 35, 5-6 (2020), 511–544. https://doi.org/10.1080/07370024.2020.1733576
[52] Vivian Liu and Lydia B. Chilton. 2021. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. arXiv:2109.06977 [cs.HC]
[53] Vivian Liu, Han Qiao, and Lydia Chilton. 2022. Opal: Multimodal Image Generation for News Illustration. https://doi.org/10.48550/ARXIV.2204.09007
[54] Maria Teresa Llano, Mark d'Inverno, Matthew Yee-King, Jon McCormack, Alon Ilsar, Alison Pease, and Simon Colton. 2022. Explainable Computational Creativity. (2022). https://doi.org/10.48550/ARXIV.2205.05682
[55] Ryan Louie, Any Cohen, Cheng-Zhi Anna Huang, Michael Terry, and Carrie J. Cai. 2020. Cococo: AI-Steering Tools for Music Novices Co-Creating with Generative Models. In HAI-GEN+user2agent@IUI.
[56] J. Marks, B. Andalman, P. A. Beardsley, W. Freeman, S. Gibson, J. Hodgins, T. Kang, B. Mirtich, H. Pfister, W. Ruml, K. Ryall, J. Seims, and S. Shieber. 1997. Design Galleries: A General Approach to Setting Parameters for Computer Graphics and Animation. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '97). ACM Press/Addison-Wesley Publishing Co., USA, 389–400. https://doi.org/10.1145/258734.258887
[57] Justin Matejka, Michael Glueck, Erin Bradner, Ali Hashemi, Tovi Grossman, and George Fitzmaurice. 2018. Dream Lens: Exploration and Visualization of Large-Scale Generative Design Datasets. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI '18). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3173574.3173943
[58] Gabriele Mirra and Alberto Pugnale. 2022. Expertise, playfulness and analogical reasoning: three strategies to train Artificial Intelligence for design applications. Architecture, Structures and Construction 2 (2022). https://doi.org/10.1007/s44150-022-00035-y
[59] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. 2019. PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society. https://doi.org/10.1109/CVPR.2019.00100
[60] Ron Mokady, Amir Hertz, and Amit H. Bermano. 2021. ClipCap: CLIP Prefix for Image Captioning. https://doi.org/10.48550/ARXIV.2111.09734
[61] Ryan Murdock. 2022. lucidrains/big-sleep: A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN. Technique was originally created by https://twitter.com/advadnoun. https://github.com/lucidrains/big-sleep
[62] Ryan Murdock and Phil Wang. 2021. Big Sleep.
[63] Brad Myers, Scott E. Hudson, and Randy Pausch. 2000. Past, Present, and Future of User Interface Software Tools. ACM Trans. Comput.-Hum. Interact. 7, 1 (March 2000), 3–28. https://doi.org/10.1145/344949.344959
[64] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. 2022. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. https://doi.org/10.48550/ARXIV.2212.08751
[65] OpenAI. 2022. DALL·E: Introducing outpainting. https://openai.com/blog/dall-e-introducing-outpainting/
[66] Guy Parsons. 2022. The DALL-E 2 prompt book. https://dallery.gallery/the-dalle-2-prompt-book/
[67] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv (2022).
[68] Han Qiao, Vivian Liu, and Lydia Chilton. 2022. Initial Images: Using Image Prompts to Improve Subject Representation in Multimodal AI Generated Art. In Creativity and Cognition (Venice, Italy) (C&C '22). Association for Computing Machinery, New York, NY, USA, 15–28. https://doi.org/10.1145/3527927.3532792
[69] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. 2019. Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/d18f655c3fce66ca401d5f38b48c89af-Paper.pdf
[70] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. 2019. MirrorGAN: Learning Text-to-image Generation by Redescription. https://doi.org/10.48550/ARXIV.1903.05854
[71] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
[72] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. https://doi.org/10.48550/ARXIV.2204.06125
[73] Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. arXiv:2102.07350 [cs.CL]
[74] B. F. Robertson and D. F. Radcliffe. 2009. Impact of CAD Tools on Creative Problem Solving in Engineering Design. Comput. Aided Des. 41, 3 (March 2009), 136–146. https://doi.org/10.1016/j.cad.2008.06.007
[75] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 10684–10695.
[76] Kevin Roose. 2022. An A.I.-Generated Picture Won an Art Prize. Artists Aren't Happy. https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html
[77] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. https://doi.org/10.48550/ARXIV.2205.11487
[78] Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. 2021. CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation. https://doi.org/10.48550/ARXIV.2110.02624
[79] Aditya Sanghi, Rao Fu, Vivian Liu, Karl Willis, Hooman Shayani, Amir Hosein Khasahmadi, Srinath Sridhar, and Daniel Ritchie. 2022. TextCraft: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Text. https://doi.org/10.48550/ARXIV.2211.01427
[80] Shikhar Sharma, Dendi Suhubdy, Vincent Michalski, Samira Ebrahimi Kahou, and Yoshua Bengio. 2018. ChatPainter: Improving Text to Image Generation using Dialogue. https://doi.org/10.48550/ARXIV.1802.08216
[81] Nikhil Singh, Guillermo Bernal, Daria Savchenko, and Elena L. Glassman. 2022. Where to Hide a Stolen Elephant: Leaps in Creative Writing with Multimodal Machine Intelligence. ACM Trans. Comput.-Hum. Interact. (Jan. 2022). https://doi.org/10.1145/3511599 Just Accepted.
[82] Minhyang (Mia) Suh, Emily Youngblom, Michael Terry, and Carrie J Cai. 2021. AI as Social Glue: Uncovering the Roles of Deep Generative AI during Social Music Composition. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 582, 11 pages. https://doi.org/10.1145/3411764.3445219
[83] Karl D. D. Willis, Yewen Pu, Jieliang Luo, Hang Chu, Tao Du, Joseph G. Lambourne, Armando Solar-Lezama, and Wojciech Matusik. 2021. Fusion 360 Gallery: A Dataset and Environment for Programmatic CAD Construction from Human Design Sequences. ACM Transactions on Graphics (TOG) 40, 4 (2021).
[84] Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022. PromptChainer: Chaining Large Language Model Prompts through Visual Programming. https://doi.org/10.48550/ARXIV.2203.06566
[85] Tongshuang Wu, Michael Terry, and Carrie J Cai. 2022. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. https://doi.org/10.1145/3491102.3517582
[86] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. 2020. TediGAN: Text-Guided Diverse Face Image Generation and Manipulation. https://doi.org/10.48550/ARXIV.2012.03308
[87] Kai Xu, Hanlin Zheng, Hao Zhang, Daniel Cohen-Or, Ligang Liu, and Yueshan Xiong. 2011. Photo-inspired model-driven 3D object modeling. In ACM SIGGRAPH 2011 Papers (SIGGRAPH '11). ACM Press, Vancouver, British Columbia, Canada, 1. https://doi.org/10.1145/1964921.1964975
[88] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2017. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. https://doi.org/10.48550/ARXIV.1711.10485
[89] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. 2022. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. https://doi.org/10.48550/ARXIV.2206.10789