Article
A New Approach to Interior Design: Generating Creative
Interior Design Videos of Various Design Styles from
Indoor Texture-Free 3D Models
Zichun Shao 2,† , Junming Chen 2,† , Hui Zeng 3 , Wenjie Hu 2 , Qiuyi Xu 4 and Yu Zhang 1, *

1 Cultural Creativity and Media, Hangzhou Normal University, Hangzhou 310000, China
2 Faculty of Humanities and Arts, Macau University of Science and Technology, Macao 999078, China;
2220015871@student.must.edu.mo (Z.S.); jmchen@must.edu.mo (J.C.);
2230006101@student.must.edu.mo (W.H.)
3 School of Design, Jiangnan University, Wuxi 214122, China; 7230306023@stu.jiangnan.edu.cn
4 Detroit Green Technology Institute, Hubei University of Technology, Wuhan 430068, China;
2111631203@hbut.edu.cn
* Correspondence: 20210009@hznu.edu.cn
† These authors contributed equally to this work.

Abstract: Interior design requires designer creativity and significant workforce investments. Mean-
while, Artificial Intelligence (AI) is crucial for enhancing the creativity and efficiency of interior
design. Therefore, this study proposes an innovative method to generate multistyle interior design
and videos with AI. First, this study created a new indoor dataset to train an AI that can generate a
specified design style. Subsequently, video generation and super-resolution modules are integrated to
establish an end-to-end workflow that generates interior design videos from texture-free 3D models.
The proposed method utilizes AI to produce diverse interior design videos directly, thus replacing
the tedious tasks of texture selection, lighting arrangement, and video rendering in traditional design
processes. The research results indicate that the proposed method can effectively provide diverse
interior design videos, thereby enriching design presentation and improving design efficiency. Additionally,
the proposed workflow is versatile and scalable, thus holding significant reference value for
transforming traditional design toward intelligence.

Keywords: interior design; video generation; fine-tuned model; design efficiency; design workflow

Citation: Shao, Z.; Chen, J.; Zeng, H.; Hu, W.; Xu, Q.; Zhang, Y. A New Approach to Interior Design:
Generating Creative Interior Design Videos of Various Design Styles from Indoor Texture-Free 3D Models.
Buildings 2024, 14, 1528. https://doi.org/10.3390/buildings14061528

Academic Editors: Soowon Chang, Youjin Jang and Jeehee Lee
Received: 25 April 2024; Revised: 12 May 2024; Accepted: 23 May 2024; Published: 24 May 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license
(https://creativecommons.org/licenses/by/4.0/).

1. Introduction
1.1. Background and Motivation
With improved living standards, more residents aspire to personalized and exquisitely designed
residences [1–5]. However, designers face challenges regarding insufficient creativity and cumbersome
design processes when adopting traditional interior design methods [6–11]. The lack of creativity is due
to the random emergence of design inspiration [6,12,13], while complex workflows decrease the design
efficiency [6,14,15]. Due to these factors, traditional design methods often prove inadequate to meet the
continuously growing interior design demands [7,10,16]. Artificial intelligence (AI), particularly big
data-based AI, possesses the capability to glean design rules and knowledge from extensive datasets,
thereby proficiently generating corresponding interior designs. This reduces the reliance on manual
design during the design process and enables designers to select from computer-generated designs. This
paradigm shift in design methodology can enhance design efficiency and foster creativity [14,17].

The motivation of this study is to propose a novel AI-based interior design method and a corresponding
workflow to assist designers in meeting customer design requirements. Overall, this study aims to
enhance design creativity and efficiency by directly generating various interior design videos using AI,
thereby driving the transformation of interior design toward intelligence.

1.2. Problem Statement and Objectives


Using generative AI models to produce images is a recent research hotspot [14,18–20].
Their ability to generate images based on textual descriptions makes it possible to produce
designs directly through texts [21–24]. However, two significant challenges emerge when
applying generative models to interior design. For one thing, conventional generative
models are trained using extensive datasets, yet there exists a deficiency of annotated
professional insights into design style and spatial functionality within these datasets.
Consequently, the learning ability of the models diminishes, thus hindering their ability
to generate appropriate interior designs [14,18]. Furthermore, generative models must
generate design videos to provide clients with immersive previews, where the technical
difficulty lies in ensuring texture consistency between different images [25–27]. Therefore,
promoting the application of generative models in interior design necessitates injecting
domain-specific knowledge into these models to enhance their capability of generating
designs with specific styles and addressing the texture consistency issue.
This study aims to develop a novel end-to-end method for artificial intelligence to
generate interior design videos and establish a fresh design workflow. The objective is
to equip designers with efficient and innovative design tools to enhance their efficacy
in completing design tasks. This end-to-end workflow was named “From 3D Model
to Interior Design Video (F3MTIDV)”. The new design tool allows designers to quickly
generate interior design videos in different styles, thus eliminating the tedious material
selection and video rendering work in traditional design processes. Figure 1 compares the
interior design videos generated by a conventional diffusion model and F3MTIDV, where
the conventional diffusion model faces challenges in specifying design styles [14,28] and
lacks texture consistency between the rendered video frames [25–27]. In contrast, F3MTIDV
can determine the design style in the generated video and maintain texture consistency.
Moreover, our improved method can generate high-quality design videos, thus improving
design efficiency and customer experience.

Figure 1. Comparison between interior design videos generated using a conventional diffusion
model and our method (F3MTIDV). Through generative AI, F3MTIDV can produce videos with a
specific design style and maintain texture consistency between frames, which is a capability that the
conventional diffusion model lacks.

1.3. Methodology Overview


The study is mainly divided into several steps. First, a dataset for indoor design styles
and spatial functionalities control (i.e., IDSSFCD-24) was created. Then, a new loss function
containing design styles and spatial functionalities was proposed. The diffusion model
was fine-tuned using this loss function and IDSSFCD-24 for training. The trained diffusion
model can generate indoor designs with specified styles and spatial functionalities. Finally,
crossframe attention and super-resolution modules were introduced, with the former
maintaining content consistency between frames and the latter enhancing the output video to
high-definition quality. This end-to-end workflow is named From 3D Model to Interior Design
Video (i.e., F3MTIDV).
With F3MTIDV, designers only need to export the texture-free 3D model as a video
and input it and corresponding design requirements into the model to generate indoor
design videos in the specified design style. Figure 2 illustrates the research framework and
the related video generation workflow.

Figure 2. Research framework and workflow. This study first constructed the IDSSFCD-24 dataset
and established a new composite loss function for fine-tuning the diffusion model, thus enabling
it to generate interior designs with specified styles and spatial functions. Subsequently, this study
introduced the crossframe attention and super-resolution modules to generate high-definition and
temporally consistent interior design videos, thus forming the F3MTIDV workflow. Users only need
to input the texture-free interior design video and design requirements into F3MTIDV to obtain
creative interior design videos with the specified style.

The proposed method alters the entire interior design process. With F3MTIDV, de-
signers can directly generate interior design videos from texture-free 3D models, thus
significantly enhancing the efficiency and creativity in interior design. Figure 3 demon-
strates that F3MTIDV can generate interior design videos with various styles and spatial
functions. In addition, the proposed method is scalable and applicable to other design tasks
by retraining on different datasets.

Figure 3. The interior design video generated by F3MTIDV presents diverse design styles and spatial
functions. F3MTIDV can generate videos of six specific design styles and four functional spaces, thus
totaling 24 unique types of interior designs.

2. Related Work
2.1. General Interior Design Process
Designers face challenges of low design efficiency and low creativity in interior
design [10,14,17]. One of the reasons for the low design efficiency is the cumbersome and
manual drawing-dependent traditional interior design process. Specifically, designers must
first plan the design by drawing floor plans and creating corresponding interior 3D models.
Then, they must add various texture maps and furniture to present the design style. Finally,
they need to configure appropriate lighting for the 3D model and render the effect image.
Designers use the completed renderings to communicate with clients and determine the
final design solution. Meanwhile, interior design is an iterative process, where designers
must manually select multiple texture maps and rerender effect images to provide clients
with various design choices. Each design modification requires repeating nearly the entire
design process, thus significantly increasing designer workloads [4,10,15,29]. Figure 4
illustrates the conventional design process.

Figure 4. The conventional interior design process. The process is quite cumbersome, and designers
must repeat the entire process for design modifications.

Another challenge faced by interior designers is the demand for creative designs.
Designers must continuously refine their designs to achieve innovative outcomes [13],
but the cumbersome traditional interior design workflow often leaves designers little time
to improve their creativity [10]. Thus, designers are often compelled to adopt fixed design
methods, thereby suppressing creative design production. In the meantime, transforming
creativity into visual expression requires significant labor. Therefore, there is an urgent need
to leverage advanced technology to assist interior design by enhancing design efficiency
and innovation capabilities.

2.2. Fine-Tuned Diffusion Model


Employing AI to generate images has recently become a research hotspot [30–32].
Owing to their near state-of-the-art generation quality, diffusion models have been widely applied in many
cutting-edge fields [18,33,34]. Diffusion models typically generate images rapidly based on
input text descriptions [18,21,35,36]. Applying diffusion models to interior design could
help designers quickly produce images, thus improving design efficiency and creativity.
A diffusion model consists of a diffusion module and a denoising module. The diffusion module
gradually adds noise to the original image until it becomes an image filled with noise. Correspondingly,
the denoising module learns to restore the noisy image to the original one; this learned restoration
ability is what enables image generation [23,35,37].
Additionally, diffusion models provide a flexible way to control the generated results.
During image generation, diffusion models can be guided to produce images consistent
with the added textual description [35,36].
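To make the diffusion and denoising modules described above concrete, the following is a minimal PyTorch sketch of one training step of a denoising diffusion model: the diffusion step adds Gaussian noise to an image at a random timestep, and the denoising network learns to predict that noise. The toy convolutional noise_predictor and the linear noise schedule are illustrative assumptions, not the text-conditioned U-Net used by the models discussed in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the denoising network (real models use a text-conditioned U-Net).
noise_predictor = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
                                nn.Conv2d(64, 3, 3, padding=1))

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # simple linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, 0)  # cumulative signal-retention factors

def training_step(x0, optimizer):
    """One DDPM-style step: noise an image, predict the noise, minimize the MSE."""
    t = torch.randint(0, T, (x0.shape[0],))             # random timestep per image
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)                          # Gaussian noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # diffusion (forward) step
    loss = F.mse_loss(noise_predictor(x_t), eps)        # denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data standing in for training images in [-1, 1].
images = torch.rand(4, 3, 64, 64) * 2 - 1
opt = torch.optim.AdamW(noise_predictor.parameters(), lr=1e-4)
print(training_step(images, opt))
```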
Although diffusion models perform well in general domains, there is still room for
improvement in fields with higher demands for expertise, such as interior design [6,17,18].
Existing diffusion models are trained based on large datasets but lack knowledge in spe-
cific professional domains. This limitation prevents users from effectively controlling the
content and quality of the generated images through professional cues [14,28]. Therefore,
integrating domain-specific knowledge into AI is necessary to enhance image generation
quality in specific domains.
There are four standard methods for injecting professional knowledge into AI to
improve the quality of the generated images. The first is Textual Inversion [38], which
constructs better representation vectors in the embedding space of TextEncoder without
changing the original weights of the diffusion model. This method has the shortest training
time and the minimum generated model parameters but the weakest control over the
generated images. The second is Hypernetwork [39], which inserts a new neural network
into the crossattention module as an intermediate layer to influence the generation results.
However, the images generated have lower quality. The third method is LoRA [40], which
acquires new knowledge by inserting an intermediate layer into the U-net structure of the
diffusion model. The advantage of LoRA is that it does not require replicating the entire model's
weights. The model size is moderate after training, and the quality of the generated
images is superior to the first two methods. Finally, there is Dreambooth [28], which
adjusts the weights of the entire diffusion model through training. This method addresses
language drift issues by setting new cue words and involving them in model training [41].
Additionally, it introduces a prior preservation loss to prevent model overfitting [28].
Dreambooth typically produces the best generation results among these four methods
through model fine-tuning.
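As an illustration of the LoRA idea summarized above (learning a small low-rank update while the original weights stay frozen), here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The rank, scaling factor, and the layer being wrapped are illustrative assumptions rather than the configuration of any specific fine-tuned diffusion model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # original weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)      # start as a zero update (no change at init)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap a projection layer such as those found in attention blocks.
proj = nn.Linear(768, 768)
lora_proj = LoRALinear(proj, rank=8)
out = lora_proj(torch.randn(2, 768))
print(out.shape)                                # torch.Size([2, 768])
```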
2.3. Controllable Video Generation
Fine-tuning diffusion models can enhance their capabilities in generating domain-
specific images [28,38–40]. However, textual descriptions alone are insufficient for precise
output control [14,18,36], which is a problem that becomes increasingly pronounced in fields
like interior design with extremely high control requirements. Therefore, it is necessary to
improve the controllability of the generated results.
Scholars have proposed to enhance the image generation controllability of diffusion
models by introducing additional neural networks [42–44]. For example, Voynov et al. [42]
proposed the Latent Edge Predictor (LGP), which is capable of predicting image edges
and comparing them with the actual image edges for loss calculation. The LGP can guide
diffusion models to generate results aligned with real edge sketches at the pixel level by
learning to minimize the loss. Li et al. [43] suggested training with bounding boxes or
human pose maps as control conditions to expand the available types of control networks.
Meanwhile, Zhang et al. [44] proposed the general framework, ControlNet, to support
using single or multiple control models to govern image generation, thereby broadening the
applicability of control conditions. These methods have enriched the means of controlling
image generation and improved the controllability of diffusion model-generated images.
Although adding a control network has improved the practicality of text-to-image
diffusion models [31,42,44], conveying information through videos is much more efficient
than through images. Therefore, researchers have focused more on generating videos
from texts [45–47]. For example, Zhang et al. [26] proposed a novel video synthesis method, Con-
trolVideo, which improved the control network by introducing crossframe attention, a
crossframe smoother, and a hierarchical sampler, thus successfully alleviating issues such
as high training costs, the inconsistent appearance of generated videos, and video flicker-
ing. Chen et al. [27] presented a spatiotemporal self-attention mechanism and introduced
residual noise initialization to ensure video appearance consistency. Guo et al. [25] embed-
ded a motion modeling module into a basic text-to-image diffusion model and retrained
it on a large-scale dataset, thus enabling the model to generate videos with continuous
motion. The above studies indicate that combining diffusion models and control network
improvements can produce high-quality videos.

3. Material and Methods


Although diffusion models have been successfully applied in various fields [14,18,33],
the research on their application in interior design is relatively limited, especially the direct
utilization of AI to generate interior design videos. This study proposes an improved
diffusion model and a corresponding workflow for generating interior design videos
(F3MTIDV). The primary advantage of F3MTIDV lies in its ability to rapidly generate
interior design videos embodying a designated style, thus surpassing traditional interior
design methodologies. This eliminates the need for conventional creative design and video
rendering tasks, thus enabling designers to promptly prepare design proposals for user
consideration and expediting the decision-making process.
The F3MTIDV workflow is implemented in five steps: building the new dataset
IDSSFCD-24, designing the new composite loss function incorporating design style and
spatial function losses, fine-tuning the basic diffusion model using IDSSFCD-24 and the
loss function, introducing the crossframe attention module and super-resolution module to
construct a complete interior design video generation workflow, and using this workflow
to generate design videos and make modifications.
During dataset construction, this study enlisted the assistance of professional interior
designers to collect over 20,000 high-quality, freely downloadable interior design images
from well-known interior design websites and annotate each image with design styles and
spatial functions. As a result, IDSSFCD-24 was successfully created to address the lack of
high-quality interior design datasets.
An innovative integrated loss function was proposed to effectively acquire knowledge
of new design styles and spatial functions. This loss function introduces design style, spatial
function, and prior style losses based on the traditional diffusion model loss function,
as expressed in Equation (1), thus forming a completely new integrated loss function,
as expressed in Equation (2). The model training process aims to gradually reduce design
style and spatial function losses, thereby gaining the ability to generate designs with
specified design styles and spatial functions.
The loss function of the basic diffusion model is expressed as follows:

$\mathcal{L} = \mathbb{E}_{X, h, \epsilon, t}\left[\, w_t \left\| \hat{X}_\theta(\alpha_t X + \sigma_t \epsilon,\, h) - X \right\|_2^2 \,\right]$    (1)

where $\mathcal{L}$ represents the average loss that the model training process aims to reduce to
achieve better design and generation quality, $\hat{X}_\theta$ is a trainable diffusion model that continuously
receives a noisy image vector $\alpha_t X + \sigma_t \epsilon$ and text guidance $h$ and predicts the denoised result,
$\hat{X}_\theta(\alpha_t X + \sigma_t \epsilon, h) - X$ represents the difference between that prediction and the
ground-truth image $X$ at that time step, and $w_t$ is a time-dependent term that weights the loss at each
diffusion step. During the training process, the diffusion model adjusts its parameters to
reduce the difference between the generated and real images, thus ultimately minimizing $\mathcal{L}$.
The proposed composite loss function is as follows:

$\mathcal{L} = \mathbb{E}_{X, h, \epsilon, \epsilon', t}\left[\, w_t \left\| \hat{X}_\theta(\alpha_t X + \sigma_t \epsilon,\, h) - X \right\|_2^2 + \lambda\, w_{t'} \left\| \hat{X}_\theta(\alpha_{t'} X_{\mathrm{pr}} + \sigma_{t'} \epsilon',\, h_{\mathrm{pr}}) - X_{\mathrm{pr}} \right\|_2^2 \,\right]$    (2)

The improved loss function in Equation (2) addresses the limitations of traditional
diffusion models in generating interior designs with specified design styles. Equation (2)
consists of two parts. The first part measures the difference between the image generated
by the fine-tuned diffusion model and the actual image. Here, $\hat{X}_\theta$ represents the new
diffusion model, which considers design styles and spatial functions as part of the loss.
The model quantifies the difference between its prediction $\hat{X}_\theta(\alpha_t X + \sigma_t \epsilon, h)$ and
the ground-truth image $X$ and treats it as a loss. The second part of the loss function is the prior
knowledge loss, which is obtained by calculating the difference between the prediction of the new
diffusion model $\hat{X}_\theta$ and the prior images $X_{\mathrm{pr}}$ produced by the pretrained diffusion model.
A minor difference indicates that, while retaining the general knowledge of the
original diffusion model, the new model has acquired better knowledge of design styles and spatial
functions. Here, $\lambda$ is a weighting coefficient that balances the two losses
above, allowing the diffusion model to generate better results. The new fine-tuned diffusion
model is trained with this improved loss function, preserving the general foundational
knowledge of the original pretrained model while also acquiring knowledge of design
styles and spatial functions. Therefore, the fine-tuned diffusion model can generate interior
designs with specific design styles.
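The following is a minimal PyTorch sketch of the composite loss in Equation (2): a reconstruction term on IDSSFCD-24 training images plus a prior-preservation term on images generated by the frozen pretrained model, balanced by λ. The model call signature and the omission of the per-timestep weights $w_t$ are simplifying assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def composite_loss(model, x, h, x_prior, h_prior, t, t_prior,
                   alphas_bar, lam: float = 1.0):
    """Equation (2): style/function reconstruction loss + prior-preservation loss.

    x / h             : training image and its text condition (style + spatial function)
    x_prior / h_prior : image generated by the frozen pretrained model and its prompt
    The per-timestep weights w_t are omitted (set to 1) for simplicity.
    """
    def noisy(x0, step):
        a_bar = alphas_bar[step].view(-1, 1, 1, 1)
        eps = torch.randn_like(x0)
        return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

    # First term: match the ground-truth interior design image.
    pred = model(noisy(x, t), h)                    # assumed signature: model(x_t, text)
    loss_design = F.mse_loss(pred, x)

    # Second term: stay close to the pretrained model's prior images, which helps
    # prevent overfitting and language drift while new style knowledge is injected.
    pred_prior = model(noisy(x_prior, t_prior), h_prior)
    loss_prior = F.mse_loss(pred_prior, x_prior)

    return loss_design + lam * loss_prior
```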
This study fine-tuned the original diffusion model using IDSSFCD-24. Specifically,
the design style and spatial function knowledge in IDSSFCD-24 was learned by minimizing
the new composite loss function. By reducing the loss, the fine-tuned model can generate
indoor designs with specified design styles and spatial functions. The model fine-tuning
process is shown in Figure 5.

Figure 5. Schematic of the fine-tuned diffusion model.

To construct the interior design video generation workflow, the ability to create in-
terior designs of specified styles was acquired by fine-tuning the diffusion model. Sub-
sequently, a crossframe attention module was employed to ensure consistency in content
and texture between the generated video frames [26]. Also, a super-resolution mod-
ule, BasicVSR++ [48], was utilized, which is an open-source super-resolution algorithm
designed to enhance visual details and improve the resolution of the generated videos.
Thus, the F3MTIDV workflow was completed, which enables the generation of interior
design videos with consistent design styles, coherent content between consecutive frames,
and high resolution.
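To illustrate how a crossframe attention module can tie textures together across frames, the sketch below replaces ordinary self-attention keys and values with those of a shared reference frame. Using the first frame as the reference follows training-free video methods such as ControlVideo [26]; the exact module used in F3MTIDV may differ, so treat this as an assumption-laden sketch rather than the authors' code.

```python
import torch

def crossframe_attention(frame_feats, w_q, w_k, w_v):
    """Self-attention variant where every frame attends to the FIRST frame.

    frame_feats  : (num_frames, tokens, dim) latent features of the video frames
    w_q, w_k, w_v: (dim, dim) projection matrices
    Sharing keys/values from one reference frame ties textures across frames.
    """
    q = frame_feats @ w_q                           # queries from each frame
    ref = frame_feats[0:1].expand_as(frame_feats)   # broadcast the first frame
    k = ref @ w_k                                   # keys from the reference frame
    v = ref @ w_v                                   # values from the reference frame
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Example with random features: 8 frames, 64 latent tokens, 320-dim features.
feats = torch.randn(8, 64, 320)
w = [torch.randn(320, 320) * 0.02 for _ in range(3)]
out = crossframe_attention(feats, *w)
print(out.shape)                                    # torch.Size([8, 64, 320])
```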

4. Experiment and Results


4.1. Experimental Settings
The diffusion model was fine-tuned on a Windows 10 PC with 64 GB of RAM and
an NVIDIA RTX A6000 graphics card with 48 GB of VRAM. The training was conducted
using PyTorch, with 100 iterations per image. PyTorch is an open-source machine learning
library that offers a rich set of tools and libraries, and it is widely praised for its flexibility
and ease of use. During image preprocessing, an automatic scaling method was employed
to adjust each image to one of 13 fixed ratios based on its aspect ratio, thus ensuring that
neither its width nor height exceeded 1024 after scaling. This scaling strategy aims to
keep the aspect ratio changes of the images within an acceptable range while preserving
image integrity. Data augmentation techniques, such as horizontal flipping, were applied.
The model was trained with a learning rate of 0.000001 and a batch size of 24. Xformers is a
library of optimized attention-based building blocks for neural networks; its memory-efficient
attention implementations speed up training and inference, thus making it more efficient to
process large-scale data. FP16 refers to the data type that represents numerical values with
16-bit floating-point numbers. Because computations on 16-bit floating-point numbers are faster
than on 32-bit floating-point numbers, using FP16 can accelerate neural network training and
inference. The computational process was accelerated using Xformers and FP16, and the total
training time for diffusion model fine-tuning amounted to 17 h.
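As an illustration of the preprocessing described above (scaling each image into one of a fixed set of aspect-ratio buckets so that neither side exceeds 1024 pixels), here is a small Python sketch. The particular list of bucket resolutions is a hypothetical stand-in, since the paper does not enumerate its 13 fixed ratios.

```python
from PIL import Image

# Hypothetical bucket resolutions (width, height); the paper uses 13 fixed ratios
# with a maximum side of 1024 but does not list them, so these are illustrative only.
BUCKETS = [(1024, 1024), (1024, 768), (768, 1024), (1024, 576), (576, 1024),
           (960, 640), (640, 960), (832, 512), (512, 832)]

def assign_bucket(width: int, height: int):
    """Pick the bucket whose aspect ratio is closest to the image's ratio."""
    ratio = width / height
    return min(BUCKETS, key=lambda wh: abs(wh[0] / wh[1] - ratio))

def resize_to_bucket(img: Image.Image) -> Image.Image:
    """Resize an image to its assigned bucket, keeping aspect-ratio distortion small."""
    return img.resize(assign_bucket(*img.size), Image.LANCZOS)

# Example: a 4000x3000 photo is mapped to a 4:3-like bucket such as (1024, 768).
print(assign_bucket(4000, 3000))
```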

4.2. Video Generation Efficiency


With F3MTIDV, designers only need to input design style descriptions and texture-free
videos to promptly obtain high-quality interior design videos aligned with the specified
style. Supported by an NVIDIA RTX A6000 graphics card, F3MTIDV generates a video
with a resolution of 832 × 512 and a duration of 6 s in only 10 min. Interior design videos
covering all spaces within a house can be generated in less than one hour. This streamlined
operational process significantly enhances the method’s practicality, thus offering interior
designers an entirely novel design approach.
By altering the prompts, designers can use F3MTIDV to generate indoor design videos
across a spectrum of styles in bulk. The ability to explore diverse prompts to create videos is
an essential skill for designers. Empowered by robust computational capabilities, users can
promptly access designs generated by designers and engage in real-time communication,
thus enhancing decision-making efficiency and design quality.

4.3. IDSSFCD-24
This study aims to employ AI to assist designers in rapidly and efficiently generating
interior design videos with specific design styles. Considering the lack of datasets for
interior design styles, we enlisted professional designers to collect over 30,000 high-quality,
freely available images from well-known interior design websites. Subsequently, the de-
signers meticulously reviewed these images, thus examining design styles and quality and
excluding contents not aligned with the specified standards. After filtering, over 20,000
images aligned with our criteria. Finally, multiple designers annotated the design style
and spatial function of the images, and the IDSSFCD-24 dataset was thus constructed.
At least five different designers annotated each image; when their annotations were inconsistent,
the most frequently assigned category was designated as the final label.
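The label-aggregation rule described above (at least five annotators per image, with the most frequent category winning) amounts to a simple majority vote, sketched below; the tie-breaking behavior is an assumption, since the paper does not specify one.

```python
from collections import Counter

def final_label(annotations):
    """Return the most frequently assigned category among the annotators.

    Ties are broken by order of first appearance (an assumption; the paper
    does not state its tie-breaking rule).
    """
    return Counter(annotations).most_common(1)[0][0]

# Example: five designers label the same image.
print(final_label(["Nordic style", "Nordic style", "Contemporary style",
                   "Nordic style", "Contemporary style"]))   # -> "Nordic style"
```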
The IDSSFCD-24 covers the classification of design styles and spatial functionalities.
The labels for design styles include six types: “contemporary style”, “Chinese style”,
“European style”, “Nordic style”, “American style”, and “Japanese style”. Meanwhile,
the annotations for interior functionalities encompass four types: “bedroom”, “dining
room”, “living room”, and “study room”. Table 1 provides a detailed display of the
quantity of images for different categories.

Table 1. Distribution of images of different design styles and spatial functions in IDSSFCD-24.

Spatial Function Contemporary Style Chinese Style European Style Nordic Style American Style Japanese Style
Bedroom 1073 898 831 995 697 812
Dining room 1491 761 964 993 789 552
Living room 1289 889 990 1486 689 731
Study room 868 662 632 752 636 580
Total 4721 3210 3417 4226 2811 2675

The IDSSFCD-24 contains 21,060 indoor design images, with 4721 in the contemporary
style (most numerous), 2675 in the Japanese style (least numerous), 6074 depicting living
rooms (most numerous), and 4130 depicting study rooms (least numerous). Figure 6
displays some training data samples of IDSSFCD-24.

Figure 6. Training data samples of IDSSFCD-24.

4.4. Subjective Assessment


Previous studies indicate that objective assessment metrics cannot fully reflect human
perception [49–51]. Therefore, instead of relying solely on objective assessment metrics, we
also conducted subjective assessments to enhance the credibility of the conclusions [49].
Specifically, subjective and objective assessment methods were adopted to evaluate the
quality of the generated interior design videos. This comprehensive evaluation approach
allows for a more thorough assessment of the quality and practical value of the generated
videos, thus aiding designers in better understanding the technology and applying it to
design practices.
In terms of subjective evaluation, Otani et al. [49] showed that the direct scoring
method outperformed the ranking scoring method in generated content assessment and
proposed the subjective evaluation metrics of Fidelity and Alignment. The Fidelity met-
ric evaluates how closely the generated images resemble the actual images. The Align-
ment metric assesses the consistency between the generated images and the prompt text.
Otani et al. [49] also experimentally demonstrated that providing detailed explanations for
each level option in the evaluation improves score consistency between different annotators,
thus outperforming traditional Likert scales in score consistency. Therefore, this study
adopts the direct scoring method and adds detailed annotations to the scoring options.
Considering the significance of rich design details in generating designs, we intro-
duced a “Design Details” metric. Since the consistency between consecutive video frames
is crucial for video tasks, we also added a “Visual Consistency” metric. Finally, a “Usability”
metric was introduced to comprehensively assess the overall quality of the generated
videos. The specific evaluation metrics and corresponding descriptions are presented in
Table 2. In this study, we mainly focus on the Alignment and Usability metrics among these
five. A high score in the Alignment metric indicates that the generated images align with
the textual prompts, while a high Usability score indicates that the video is directly usable.
The differences in the interior design videos generated by Stable Diffusion 1.5 (SD),
the Fine-tuned Diffusion Model (FTSD), SD + Control Video, and the proposed method in this
study were compared. SD is the basic diffusion model, while FTSD is a fine-tuned diffusion
model; both generate videos frame by frame as independent images. SD + Control Video
combines the basic diffusion model with a video generation module capable of directly
producing coherent videos. Our method (F3MTIDV) includes a fine-tuned diffusion model,
a crossframe attention module, and a super-resolution module, thus enabling the generation of
coherent, high-resolution videos.
The four methods above were used for subjective evaluation, with each method
sequentially generating interior design videos for 24 categories (including four spatial func-
tions and six design styles). Consecutive sets of 20 images were extracted from the videos
generated for each category for evaluation to total 1920 images. The assessments were
conducted by 30 undergraduate students majoring in interior design and 15 professional in-
terior designers, and the average results for the criteria of “Fidelity”, “Alignment”, “Design
Details”, “Visual Consistency”, and “Usability” were obtained, as presented in Table 3.

Table 2. Subjective assessment questionnaire questions.

1. Fidelity: Does the image look like an AI-generated photo or a real photo?
1. AI-generated photo.
2. Probably an AI-generated photo, but photorealistic.
3. Neutral.
4. Probably a real photo, but with irregular textures and shapes.
5. Real photo.

2. Alignment: The image matches the text description.


1. Does not match at all.
2. Has significant discrepancies.
3. Has several minor discrepancies.
4. Has a few minor discrepancies.
5. Matches exactly.

3. Design Details: Objects in the image have detail.


1. Minimal details: Almost all objects lack design details, appearing incomplete or blurry.
2. Some details: Only a few objects have certain details.
3. Moderate details: Nearly half of the objects have design details.
4. Good details: Most objects have design details.
5. High details: Almost all objects exhibit design details.

4. Visual Consistency: The video frames in the image have consistency.


1. Little consistency: Flickering and material changes between frames.
2. Some consistency: Most frames remain inconsistent.
3. Medium consistency: Nearly half of the frames are consistent.
4. Higher consistency: Most frames show consistency.
5. Close to complete consistency: Almost all frames are consistent.

5. Usability: Videos showcase design ideas and facilitate communication.


1. Not usable: Unrealistic video with irrelevant content, lack of details, screen flickering,
and inconsistency between frames.
2. Limited usability: Some improvements, but the overall results are still poor.
3. Partially usable: Some images meet standards, but most remain unusable.
4. Mostly usable: The majority of images meet acceptable standards.
5. Fully usable: The entire video is error-free and ready for use.

Table 3. Comparison of quantitative evaluation results of interior design videos generated by different methods.

Model Fidelity ↑ Alignment ↑ Design Details ↑ Visual Consistency ↑ Usability ↑
Stable Diffusion (SD) [52] 2.22 2.72 2.63 1.73 1.67
Fine-tuned SD (FTSD) [28] 2.46 3.05 2.97 2.08 1.67
SD [52] + Control Video [26] 2.35 3.14 2.64 2.83 2.48
F3MTIDV 2.61 3.31 2.98 3.18 3.08

Table 3 shows that F3MTIDV achieved optimal results in all five evaluation metrics.
In terms of Fidelity, F3MTIDV slightly outperformed other methods. However, all methods
exhibited relatively low Fidelity scores, thus indicating room for improvement in the au-
thenticity of AI-generated videos. Regarding Alignment, the improved scores with FTSD
or the added video control module suggest that these methods enhance the alignment be-
tween text descriptions and content. Regarding Design Details, SD and SD + Control Video
showed lower scores, while FTSD and F3MTIDV achieved higher scores, thus indicating
that fine-tuning the model enhances the ability to generate image design details. In terms of
Visual Consistency, the method with the added video control module significantly outper-
formed other methods, thus demonstrating the effectiveness of incorporating this module.
In terms of Usability, only F3MTIDV scored above three points. Our approach exhibits
a clear advantage in this comprehensive evaluation metric compared to other methods.
Overall, subjective assessment scores validate the effectiveness of F3MTIDV.

4.5. Objective Assessment


For the objective evaluation, we utilized the Structural Similarity Index (SSIM) [53],
Fréchet Inception Distance (FID) [54], and CLIP Score [55] to assess the consistency, quality,
and textual–visual alignment of the generated videos. Specifically, SSIM [53] serves as
an objective metric for measuring image quality, thus aiming to quantify the structural
similarity between two images. SSIM considers the brightness, contrast, structure, and how
human eyes perceive these factors. A value closer to one indicates higher structural
similarity between the generated and original images. SSIM is calculated as follows [53]:

$\mathrm{SSIM}(x, y) = L(x, y) \cdot C(x, y) \cdot S(x, y)$    (3)

where $x$ and $y$ are two images, $L(x, y)$ calculates the luminance difference between $x$ and $y$,
$C(x, y)$ calculates the contrast term based on the standard deviations of the images, and $S(x, y)$
computes the structural term from the covariance between the two images relative to the product of
their standard deviations. The SSIM value is the product of the components above.
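For reference, the three components of Equation (3) can be computed directly with NumPy, as in the sketch below. This is a simplified whole-image version with the usual stabilizing constants; practical SSIM implementations apply the same formula over local sliding windows and average the results.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """Global SSIM: product of luminance, contrast, and structure terms (Eq. 3)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    c3 = c2 / 2
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()

    luminance = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)   # L(x, y)
    contrast = (2 * sx * sy + c2) / (sx ** 2 + sy ** 2 + c2)    # C(x, y)
    structure = (sxy + c3) / (sx * sy + c3)                     # S(x, y)
    return float(luminance * contrast * structure)

# Identical images give SSIM ≈ 1; noisy copies score lower.
img = np.random.rand(512, 832) * 255
print(ssim_global(img, img))
print(ssim_global(img, img + np.random.randn(512, 832) * 20))
```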
FID [54] is an indicator assessing the quality of generated images. It measures the
quality of image generation models by calculating the difference between the distributions
of authentic and generated images. FID utilizes the Inception network to transform sets
of generated and authentic images into feature vectors and then computes the Fréchet
distance between the distributions of actual and generated images. The Fréchet distance
quantifies the similarity between two distributions, with a smaller FID indicating higher
similarity between authentic and generated images, thus suggesting better image quality.
The FID is calculated as follows:
$\mathrm{FID} = \left\| \mu_{\mathrm{real}} - \mu_{\mathrm{gen}} \right\|_2^2 + \mathrm{Tr}\!\left( \Sigma_{\mathrm{real}} + \Sigma_{\mathrm{gen}} - 2\left( \Sigma_{\mathrm{real}} \Sigma_{\mathrm{gen}} \right)^{1/2} \right)$    (4)

where $\mu_{\mathrm{real}}$ and $\mu_{\mathrm{gen}}$ represent the means of the feature vectors for real and generated images,
$\Sigma_{\mathrm{real}}$ and $\Sigma_{\mathrm{gen}}$ represent the covariance matrices of the feature vectors for real and generated
images, $\|\cdot\|_2$ denotes the L2 norm, and $\mathrm{Tr}(\cdot)$ represents the trace operation on matrices. This
formula measures the similarity between the feature distributions of real and generated
images through the Fréchet distance, thereby assessing the quality of generated images.
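Given Inception feature vectors for the real and generated image sets, Equation (4) can be computed with NumPy and SciPy as sketched below; the random features in the example are placeholders for actual Inception embeddings.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of image feature vectors, per Equation (4)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    cov_sqrt = sqrtm(sigma_r @ sigma_g)     # matrix square root of the product
    if np.iscomplexobj(cov_sqrt):           # drop tiny imaginary parts from numerics
        cov_sqrt = cov_sqrt.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * cov_sqrt))

# Example with low-dimensional stand-in features (real FID uses 2048-dim
# Inception-v3 pooling features for both image sets).
real = np.random.randn(500, 64)
gen = np.random.randn(500, 64) + 0.1
print(frechet_inception_distance(real, gen))
```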
The CLIP Score [55] measures the visual–textual consistency between descriptive
content and images. It transforms natural language and images into feature vectors and then
calculates their cosine similarity. A CLIP Score close to one indicates a higher correlation
between the image and the corresponding text. The CLIP Score is calculated as follows:

$\mathrm{CLIP\ Score}(c, v) = w \times \max(\cos(c, v),\, 0)$    (5)

where $c$ and $v$ are the feature vectors output by the CLIP encoders for the textual description
and the image, $\cos(c, v)$ represents the cosine similarity between $c$ and $v$, $w$ is a weight used
to adjust the impact of the similarity, and $\max(\cdot, 0)$ denotes taking the maximum value, thereby
ensuring the similarity does not fall below zero.
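Equation (5) is a thin wrapper around cosine similarity; given CLIP text and image embeddings it can be computed as below. The weight w = 2.5 follows the original CLIPScore paper [55], and the random vectors in the example are placeholders for real CLIP encoder outputs.

```python
import numpy as np

def clip_score(text_feat: np.ndarray, image_feat: np.ndarray, w: float = 2.5) -> float:
    """CLIP Score(c, v) = w * max(cos(c, v), 0), per Equation (5)."""
    cos = float(np.dot(text_feat, image_feat) /
                (np.linalg.norm(text_feat) * np.linalg.norm(image_feat)))
    return w * max(cos, 0.0)

# Example with placeholder 512-dim embeddings (real ones come from a CLIP encoder).
c = np.random.randn(512)
v = c + np.random.randn(512) * 0.5        # an image embedding correlated with the text
print(clip_score(c, v))
```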
The 1920 generated images were quantitatively evaluated using the SSIM [53], FID [54],
and CLIP Score [55]. The evaluation results are shown in Table 4.
The results in Table 4 indicate that F3MTIDV performed the best in the SSIM, FID,
and the CLIP Score. The research results show that fine-tuning the model and adding
the control video module can significantly improve structural similarity, image quality,
and the alignment between textual descriptions and content. The significant improve-
ments were mainly in the SSIM and FID, which is consistent with the conclusions from
subjective assessments.

Table 4. Quantitative evaluation results of images generated by different methods.

Model SSIM [53] ↑ FID [54] ↓ CLIP Score [55] ↑
Stable Diffusion (SD) [52] 0.4364 86.07 27.50
Fine-tuned SD (FTSD) [28] 0.4880 84.18 28.10
SD [52] + Control Video [26] 0.6268 79.02 29.05
F3MTIDV 0.6691 76.64 29.36

4.6. The Diversity of F3MTIDV-Generated Videos


Figure 7 illustrates the diversity of videos generated by F3MTIDV. We selected two
sets of distinct prompt words, with each yielding three videos, thus totaling six videos
for demonstration purposes. As depicted in the figure, a consistent design style is main-
tained across all video frames, which is crucial in practical design applications. Moreover,
the visuals indicate that varying prompt words can produce designs with differing stylistic
elements. Thus, F3MTIDV can provide designers with a spectrum of interior design options,
thereby augmenting both design efficiency and quality.

Figure 7. F3MTIDV generates diverse design videos. It can produce designs in various styles and
generate differentiated designs within the same design style. This new design approach provides
designers with different design videos for communication with users, thus accelerating the design
decision-making process.

4.7. Details in the Generated Videos


One frame was selected from the Chinese-style living room video generated by
F3MTIDV to showcase the details of the generated content (Figure 8). Figure 8 presents
F3MTIDV’s ability to create appropriate textures for textureless images. For instance,
F3MTIDV generated a white fabric texture for the sofa outline and produced red pillows.
For the bookshelf, F3MTIDV generated a dark wooden texture. In terms of flooring,
F3MTIDV created light-colored wooden floors and carpets. All the generated textures
and colors adhere to the Chinese design style. Additionally, F3MTIDV’s lighting design
is also reasonable, thus generating linear lighting for each bookshelf and spotlights at
the top of the two bookshelves. F3MTIDV also created auxiliary lighting sources on the
TV background wall to reduce the difference in brightness between the TV screen and
the background wall, thus contributing to eye protection. Furthermore, the design
generated by F3MTIDV exhibits rich design details, such as the black stitched edges at the
seams of the Chinese fabric sofa, which align with real sofa manufacturing processes. These
results collectively indicate that F3MTIDV has obtained outstanding generative design
capabilities through training on the dataset. Finally, F3MTIDV effectively conveys the
design intentions, thus making the content more realistic.

Figure 8. Details in the generated interior design videos. (Prompt word: Chinese-style living room.)

Nevertheless, the design videos generated by F3MTIDV still have room for improve-
ment. Firstly, the structural integrity of the objects can still be enhanced, such as the
insufficient verticality of the lines in the structures of the generated bookshelves and cab-
inets. Secondly, inaccurate lighting and shadow relationships still exist in the generated
videos. Finally, F3MTIDV needs to create videos with higher contrast to enhance realism.
Overall, generating videos with different design styles from textureless 3D models has
proven feasible with F3MTIDV, which will help designers quickly produce spatial designs
in different styles, thus enhancing design efficiency and decision-making effectiveness.

5. Discussion
The subjective and objective evaluations fully demonstrated the effectiveness of
F3MTIDV. Visual comparisons with other methods during subjective assessment showed
that F3MTIDV can facilitate end-to-end video generation with consistent design styles and
no flickering, which is unattainable by other methods. Furthermore, the questionnaire
assessments with five subjective evaluation metrics suggest that F3MTIDV has achieved
optimal results in all indicators. The exceptionally high Alignment and Usability scores
indicate that the videos generated by F3MTIDV have good visual–textual consistency and
usability. During objective evaluation, F3MTIDV yielded the highest scores in metrics such
as the SSIM [53], FID [54], and CLIP Score [55], thus further confirming its effectiveness.
By directly utilizing AI to generate diverse interior design videos, F3MTIDV replaces
the tedious tasks of design creativity, material selection, lighting arrangement, and render-
ing in traditional design practices. Compared to conventional methods, F3MTIDV excels in
efficiency and creative generation. Regarding design efficiency, completing designs and
corresponding modifications using traditional methods typically takes about half a month,
while F3MTIDV (on a PC with a 48 GB VRAM graphics card) can generate design videos
covering various styles and space functions within one hour. With the continuous comput-
ing power improvement, the speed of interior design video generation with F3MTIDV can
be further increased. In terms of creative design, F3MTIDV can generate multiple interior
design styles for users to choose from, thus reducing the complexity of creative design
and accelerating the design decision-making process. Overall, F3MTIDV demonstrates the
feasibility of an innovative approach to interior design. In addition, F3MTIDV is scalable.
By replacing the basic diffusion model, it can be adapted to design video generation in
other design tasks.
AI-generated content will profoundly impact current design approaches. In terms
of design efficiency, AI will increasingly take over tasks emphasizing logical and rational
descriptions, thus eventually forming an AI design chain. Simple design tasks will be
completed by AI, thus allowing designers more time to contemplate design creativity and
enhance design quality. Regarding role positioning, designers are no longer merely tradi-
tional design creators but are transforming into design facilitators collaborating with AI.
For example, the work of designers in this study goes beyond simply drawing images: they
collect and organize data through their professional knowledge and transfer knowledge to
AI models. This new human–machine collaboration approach may become the norm in
future interior design, thus driving the design process toward automation and intelligence.

6. Conclusions
Traditional interior design methods require designers to possess high creativity and
undertake heavy labor when creating interior design videos, thus leading to a lack of
creativity and low design efficiency. To address these issues, we propose F3MTIDV to auto-
mate interior design video generation. Experimental results demonstrate that F3MTIDV
can replace the laborious creative design and drawing work in traditional design processes,
thereby changing the formal design process and significantly improving design efficiency
and creativity.
Nonetheless, this study is not free from limitations. Firstly, controlling the entire
video generation process through a fixed set of prompts is challenging and can be further
improved by using dynamic vectors to automatically adapt to the content of each frame.
Secondly, texts cannot specify texture appearance in specific areas of the image when gen-
erating textures, thus requiring enhanced generation process controllability. Furthermore,
our understanding of design styles is categorized manually, and a more extensive and
automated annotation classification method may better represent real-world design classifi-
cations. Finally, the Fidelity metric of the videos generated by the proposed method can
still be improved, and enhancing the authenticity of the videos can increase their usability.
Artificial intelligence (AI) can be applied to future interior design stages. For example,
during the planning phase, AI can provide designers with an array of design schemes
for selection. Subsequently, image generation techniques based on generative adversarial
networks or diffusion models can yield numerous conceptual design renderings. In the
modeling phase, automated AI modeling methods reduce the traditional manual mod-
eling workload. Finally, automating material texturing and video rendering enhances
efficiency from design to presentation, thus culminating in a more comprehensive AI
design workflow.

Author Contributions: Conceptualization, J.C. and Z.S.; Methodology, J.C. and Z.S.; Software, J.C.
and Z.S.; Validation, J.C. and Z.S.; Formal analysis, J.C. and Z.S.; Investigation, J.C. and Z.S.; Resources,
Y.Z.; Writing—original draft, J.C., Z.S., H.Z., W.H., Q.X. and Y.Z.; Writing—review and editing, J.C.
and Z.S.; Supervision, Y.Z.; Project administration, Y.Z.; Funding acquisition, Y.Z. All authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The raw data supporting the conclusions of this article will be made
available by the authors upon request.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Colenberg, S.; Jylhä, T. Identifying interior design strategies for healthy workplaces—A literature review. J. Corp. Real Estate 2021,
24, 173–189. https://doi.org/10.1108/JCRE-12-2020-0068.
2. Ibadullaev, I.; Atoshov, S. The Effects of Colors on the Human Mind in the Interior Design. Indones. J. Innov. Stud. 2019, 7, 1–9.
https://doi.org/10.21070/ijins.v7i0.27.
3. Bettaieb, D.M.; Alsabban, R. Emerging living styles post-COVID-19: Housing flexibility as a fundamental requirement for
apartments in Jeddah. Archnet-IJAR Int. J. Archit. Res. 2021, 15, 28–50. https://doi.org/10.1108/ARCH-07-2020-0144.
4. Wang, Y.; Liang, C.; Huai, N.; Chen, J.; Zhang, C. A Survey of Personalized Interior Design. Comput. Graph. Forum 2023, 42,
e14844. https://doi.org/10.1111/cgf.14844.
5. Park, B.H.; Hyun, K.H. Analysis of pairings of colors and materials of furnishings in interior design with a data-driven framework.
J. Comput. Des. Eng. 2022, 9, 2419–2438. https://doi.org/10.1093/jcde/qwac114.
6. Ashour, M.; Mahdiyar, A.; Haron, S.H. A Comprehensive Review of Deterrents to the Practice of Sustainable Interior Architecture
and Design. Sustainability 2021, 13, 10403. https://doi.org/10.3390/su131810403.
7. Delgado, J.M.D.; Oyedele, L.; Ajayi, A.; Akanbi, L.; Akinade, O.; Bilal, M.; Owolabi, H. Robotics and automated systems in
construction: Understanding industry-specific challenges for adoption. J. Build. Eng. 2019, 26, 100868. https://doi.org/10.1016/j.
jobe.2019.100868.
8. Wang, D.; Li, J.; Ge, Z.; Han, J. A Computational Approach to Generate Design with Specific Style. Proc. Des. Soc. 2021, 1, 21–30.
https://doi.org/10.1017/pds.2021.3.
9. Chen, J.; Shao, Z.; Cen, C.; Li, J. HyNet: A novel hybrid deep learning approach for efficient interior design texture retrieval.
Multimed. Tools Appl. 2023, 83, 28125–28145. https://doi.org/10.1007/s11042-023-16579-0.
10. Bao, Z.; Laovisutthichai, V.; Tan, T.; Wang, Q.; Lu, W. Design for manufacture and assembly (DfMA) enablers for offsite interior
design and construction. Build. Res. Inf. 2022, 50, 325–338. https://doi.org/10.1080/09613218.2021.1966734.
11. Sinha, M.; Fukey, L.N. Sustainable Interior Designing in the 21st Century—A Review. ECS Trans. 2022, 107, 6801. https:
//doi.org/10.1149/10701.6801ecst.
12. Chen, L.; Wang, P.; Dong, H.; Shi, F.; Han, J.; Guo, Y.; Childs, P.R.; Xiao, J.; Wu, C. An artificial intelligence based data-driven
approach for design ideation. J. Vis. Commun. Image Represent. 2019, 61, 10–22. https://doi.org/10.1016/j.jvcir.2019.02.009.
13. Yilmaz, S.; Seifert, C.M. Creativity through design heuristics: A case study of expert product design. Des. Stud. 2011, 32, 384–415.
https://doi.org/10.1016/j.destud.2011.01.003.
14. Chen, J.; Wang, D.; Shao, Z.; Zhang, X.; Ruan, M.; Li, H.; Li, J. Using Artificial Intelligence to Generate Master-Quality Architectural
Designs from Text Descriptions. Buildings 2023, 13, 2285. https://doi.org/10.3390/buildings13092285.
15. Chen, J.; Shao, Z.; Zhu, H.; Chen, Y.; Li, Y.; Zeng, Z.; Yang, Y.; Wu, J.; Hu, B. Sustainable interior design: A new ap-
proach to intelligent design and automated manufacturing based on Grasshopper. Comput. Ind. Eng. 2023, 183, 109509.
https://doi.org/10.1016/j.cie.2023.109509.
16. Abd Hamid, A.B.; Taib, M.M.; Razak, A.A.; Embi, M.R. Building information modelling: Challenges and barriers in im-
plement of BIM for interior design industry in Malaysia. In Proceedings of the 4th International Conference on Civil and
Environmental Engineering for Sustainability (IConCEES 2017), Langkawi, Malaysia, 4–5 December 2017; Volume 140, p. 012002.
https://doi.org/10.1088/1755-1315/140/1/012002.
17. Karan, E.; Asgari, S.; Rashidi, A. A markov decision process workflow for automating interior design. KSCE J. Civ. Eng. 2021,
25, 3199–3212. https://doi.org/10.1007/s12205-021-1272-6.
18. Chen, J.; Shao, Z.; Hu, B. Generating Interior Design from Text: A New Diffusion Model-Based Method for Efficient Creative
Design. Buildings 2023, 13, 1861. https://doi.org/10.3390/buildings13071861.
19. Cheng, S.I.; Chen, Y.J.; Chiu, W.C.; Tseng, H.Y.; Lee, H.Y. Adaptively-realistic image generation from stroke and sketch with
diffusion model. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA,
2–7 January 2023; pp. 4054–4062. https://doi.org/10.48550/arXiv.2208.12675.
20. Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; Wen, F. Paint by example: Exemplar-based image editing with
diffusion models. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver,
BC, Canada, 17–24 June 2023; pp. 18381–18391. https://doi.org/10.48550/arXiv.2211.13227.
21. Brisco, R.; Hay, L.; Dhami, S. Exploring the Role of Text-to-Image AI in Concept Generation. Proc. Des. Soc. 2023, 3, 1835–1844.
https://doi.org/10.1017/pds.2023.184.
22. Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell.
2023, 45, 10850–10869. https://doi.org/10.1109/TPAMI.2023.3261988.
23. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
https://doi.org/10.48550/arXiv.2006.11239.
24. Vartiainen, H.; Tedre, M. Using artificial intelligence in craft education: Crafting with text-to-image generative models. Digit.
Creat. 2023, 34, 1–21. https://doi.org/10.1080/14626268.2023.2174557.
25. Guo, Y.; Yang, C.; Rao, A.; Wang, Y.; Qiao, Y.; Lin, D.; Dai, B. Animatediff: Animate your personalized text-to-image diffusion
models without specific tuning. arXiv 2023, arXiv:2307.04725. https://doi.org/10.48550/arXiv.2307.04725.
26. Zhang, Y.; Wei, Y.; Jiang, D.; Zhang, X.; Zuo, W.; Tian, Q. ControlVideo: Training-Free Controllable Text-to-Video Generation.
arXiv 2023, arXiv:2305.13077. https://doi.org/10.48550/arXiv.2305.13077.
27. Chen, W.; Wu, J.; Xie, P.; Wu, H.; Li, J.; Xia, X.; Xiao, X.; Lin, L. Control-A-Video: Controllable Text-to-Video Generation with
Diffusion Models. arXiv 2023, arXiv:2305.13840. https://doi.org/10.48550/arXiv.2305.13840.
28. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models
for subject-driven generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. https://doi.org/10.48550/arXiv.2208.12242.
29. Salvagioni, D.A.J.; Melanda, F.N.; Mesas, A.E.; González, A.D.; Gabani, F.L.; Andrade, S.M.d. Physical, psychological
and occupational consequences of job burnout: A systematic review of prospective studies. PLoS ONE 2017, 12, e0185781.
https://doi.org/10.1371/journal.pone.0185781.
30. Yang, C.; Liu, F.; Ye, J. A product form design method integrating Kansei engineering and diffusion model. Adv. Eng. Inform.
2023, 57, 102058. https://doi.org/10.1016/j.aei.2023.102058.
31. Zhao, S.; Chen, D.; Chen, Y.C.; Bao, J.; Hao, S.; Yuan, L.; Wong, K.Y.K. Uni-ControlNet: All-in-One Control to Text-to-Image
Diffusion Models. arXiv 2023, arXiv:2305.16322. https://doi.org/10.48550/arXiv.2305.16322.
32. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning
transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine
Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763. https://doi.org/10.48550/arXiv.2103.00020.
33. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image
synthesis. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA,
USA, 18–24 June 2022; pp. 10696–10706. https://doi.org/10.1109/CVPR52688.2022.01043.
34. Lyu, Y.; Wang, X.; Lin, R.; Wu, J. Communication in Human—AI Co-Creation: Perceptual Analysis of Paintings Generated by
Text-to-Image System. Appl. Sci. 2022, 12, 11312. https://doi.org/10.3390/app122211312.
35. Zhang, C.; Zhang, C.; Zhang, M.; Kweon, I.S. Text-to-image diffusion model in generative ai: A survey. arXiv 2023,
arXiv:2303.07909. https://doi.org/10.48550/arXiv.2303.07909.
36. Liu, B.; Lin, W.; Duan, Z.; Wang, C.; Ziheng, W.; Zipeng, Z.; Jia, K.; Jin, L.; Chen, C.; Huang, J. Rapid diffusion: Building domain-
specific text-to-image synthesizers with fast inference speed. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 295–304. https://doi.org/10.18653/v1/2023.acl-industry.28.
37. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive
survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. https://doi.org/10.1145/3626235.
38. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personaliz-
ing text-to-image generation using textual inversion. arXiv 2022, arXiv:2208.01618. https://doi.org/10.48550/arXiv.2208.01618.
39. Shamsian, A.; Navon, A.; Fetaya, E.; Chechik, G. Personalized federated learning using hypernetworks. In Proceedings of the
International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 9489–9502. https://doi.org/10.48550/arXiv.2
103.04628.
40. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models.
arXiv 2021, arXiv:2106.09685. https://doi.org/10.48550/arXiv.2106.09685.
41. Lee, J.; Cho, K.; Kiela, D. Countering Language Drift via Visual Grounding. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), Hong Kong, China, 4 November 2019; pp. 4385–4395. https://doi.org/10.18653/v1/D19-1447.
42. Voynov, A.; Aberman, K.; Cohen-Or, D. Sketch-guided text-to-image diffusion models. In Proceedings of the SIGGRAPH ’23:
Special Interest Group on Computer Graphics and Interactive Techniques Conference, Los Angeles, CA, USA, 6–10 August 2023;
pp. 1–11. https://doi.org/10.1145/3588432.3591560.
43. Li, Y.; Liu, H.; Wu, Q.; Mu, F.; Yang, J.; Gao, J.; Li, C.; Lee, Y.J. Gligen: Open-set grounded text-to-image generation. In Proceedings
of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023;
pp. 22511–22521. https://doi.org/10.48550/arXiv.2301.07093.
44. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3836–3847. https://doi.org/10.48550/arXiv.2
302.05543.
45. Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-based real image editing with
diffusion models. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver,
BC, Canada, 17–24 June 2023; pp. 6007–6017. https://doi.org/10.48550/arXiv.2210.09276.
46. Chu, E.; Lin, S.Y.; Chen, J.C. Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using
Conditional Image Diffusion Models. arXiv 2023, arXiv:2305.19193. https://doi.org/10.48550/arXiv.2305.19193.
47. Hu, Z.; Xu, D. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet.
arXiv 2023, arXiv:2307.14073. https://doi.org/10.48550/arXiv.2307.14073.
48. Chan, K.C.; Zhou, S.; Xu, X.; Loy, C.C. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June
2022; pp. 5972–5981. https://doi.org/10.48550/arXiv.2104.13371.
49. Otani, M.; Togashi, R.; Sawai, Y.; Ishigami, R.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Satoh, S. Toward verifiable and reproducible
human evaluation for text-to-image generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14277–14286. https://doi.org/10.48550/arXiv.2304.01816.
50. Guo, J.; Du, C.; Wang, J.; Huang, H.; Wan, P.; Huang, G. Assessing a Single Image in Reference-Guided Image Synthesis. In
Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; pp. 753–761.
https://doi.org/10.1609/aaai.v36i1.19956.
51. Seshadrinathan, K.; Soundararajan, R.; Bovik, A.C.; Cormack, L.K. Study of subjective and objective quality assessment of video.
IEEE Trans. Image Process. 2010, 19, 1427–1441. https://doi.org/10.1109/TIP.2010.2042111.
52. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In
Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June
2022; pp. 10684–10695. https://doi.org/10.1109/CVPR52688.2022.01042.
53. Bakurov, I.; Buzzelli, M.; Schettini, R.; Castelli, M.; Vanneschi, L. Structural similarity index (SSIM) revisited: A data-driven
approach. Expert Syst. Appl. 2022, 189, 116087. https://doi.org/10.1016/j.eswa.2021.116087.
54. Obukhov, A.; Krasnyanskiy, M. Quality assessment method for GAN based on modified metrics inception score and Fréchet
inception distance. In Software Engineering Perspectives in Intelligent Systems: Proceedings of 4th Computational Methods in Systems
and Software 2020; Springer: Cham, Switzerland, 2020; Volume 1294, pp. 102–114. https://doi.org/10.1007/978-3-030-63322-6_8.
55. Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R.L.; Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. arXiv
2021, arXiv:2104.08718. https://doi.org/10.48550/arXiv.2104.08718.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
