
Knowledge-Based Systems 294 (2024) 111795

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

SAW-GAN: Multi-granularity Text Fusion Generative Adversarial Networks for text-to-image generation
Dehu Jin, Qi Yu, Lan Yu, Meng Qi ∗
School of Information Science and Engineering, Shandong Normal University, Jinan, China

ARTICLE INFO

Keywords: Text-to-image generation; Text–image information fusion; Attention mechanism; CLIP

ABSTRACT

Text-to-image generation is a challenging task that aims to generate visually realistic images that are semantically consistent with a given text. Existing methods mainly exploit the global semantic information of a single sentence while ignoring fine-grained semantic information such as aspects and words, which are critical factors in bridging the semantic gap in text-to-image generation. We propose a Multi-granularity Text (Sentence-level, Aspect-level, and Word-level) Fusion Generative Adversarial Network (SAW-GAN), which comprehensively represents textual information at multiple granularities. To effectively fuse multi-granularity information, we design a Double-granularity-text Fusion Module (DFM), which fuses sentence and aspect information through parallel affine transformations, and a Triple-granularity-text Fusion Module (TFM), which fuses sentence, aspect and word information through a novel Coordinate Attention Module (CAM) that can precisely locate the visual areas associated with each aspect and word. Furthermore, we use CLIP (Contrastive Language-Image Pre-training) to provide visual information to bridge the semantic gap and improve the model's generalization ability. Our results show significant performance improvements over state-of-the-art methods using Conditional Generative Adversarial Networks (CGANs) on the CUB (FID from 13.91 to 10.45) and COCO (FID from 14.60 to 11.17) datasets, with photorealistic images of richer detail and better text–image consistency.

∗ Corresponding author. E-mail address: qimeng@sdnu.edu.cn (M. Qi).
https://doi.org/10.1016/j.knosys.2024.111795
Received 24 July 2023; Received in revised form 18 March 2024; Accepted 9 April 2024; Available online 12 April 2024
0950-7051/© 2024 Elsevier B.V. All rights reserved.

1. Introduction

The rapid development of artificial intelligence (AI) models has been applied to various fields, such as computer vision (CV) [1,2], natural language processing (NLP) [3,4] and remote sensing (RS) [5,6]. Cross-modal tasks [7–9] are sub-tasks of AI that involve information processing between different modalities (such as text, image, and speech) and are a hot research topic nowadays. Text-to-image generation is one of these research hotspots: it aims to generate realistic images that conform to the semantics of a given text, and has been applied in art generation [10], image editing [11], interactive entertainment [12], and so on.

Recently, methods based on diffusion and autoregressive models, such as DALL·E 2 [13] and Parti [14], have demonstrated strong generative capabilities and significantly outperform earlier conditional Generative Adversarial Networks (CGANs) [15]. Although these models have made significant progress, they rely on iterative inference and have huge model sizes, requiring extremely long computation times and demanding hardware. In contrast, a CGAN needs only a single forward pass to generate an image and its model size is relatively small, which makes it fast and computationally cheap. Moreover, the latest CGAN-based methods [16,17] show better image quality than methods based on diffusion and autoregressive models. In this paper, we continue to explore the potential of CGANs for text-to-image generation.

Existing CGAN-based text-to-image generation models can be broadly categorized into two types: multi-stage models and single-stage models. Multi-stage models [18–20] generate images in a coarse-to-fine manner by stacking a series of generator and discriminator pairs, gradually increasing the detail and resolution of the image, and finally obtaining a photorealistic image. Single-stage models [21–23] introduce only one generator/discriminator pair to produce photorealistic images that are semantically consistent with the textual descriptions. Compared with multi-stage models, single-stage models are more conducive to convergence and training stability. In this work, we also follow a single-stage architecture.

Although state-of-the-art single-stage models have achieved exciting results, due to the difference in modality structure between text and image data, the above generative models can only extract limited semantic information from the text for image generation, which leads to deficiencies in the quality and semantic consistency of the generated images. To narrow the semantic gap between text and generated images, we need to extract semantic information at different granularities from the text, such as sentence-, aspect-, and word-level granularity.
For example, for the text ''a striped bird with a red nape and a long pointed beak'', the coarse-grained sentence-level information provides the text's basic (global) semantics. Aspect-level granularity interprets the semantics of the text at a level intermediate between the sentence and word levels. It can serve as a bridge between the sentence and word levels, enabling the model to understand text semantics from coarse to fine instead of jumping directly. For the above text, its aspects are ''a striped bird'', ''a red nape'', and ''a long pointed beak'', each describing the object from a different perspective. Correspondingly, each word at the word-level granularity has a specific meaning and semantic relationship, such as the adjectives ''striped'' and ''red'' and the nouns ''nape'' and ''beak'', which provide detailed information.

We propose an innovative approach that fuses multi-granularity information from text (sentence-, aspect-, and word-level), called the Multi-granularity Text Fusion Generative Adversarial Network (SAW-GAN). SAW-GAN supplements sentences with fine-grained information (aspects and words), which enhances the model's ability to understand text representations and enables it to generate visually plausible samples that correspond to the semantic description. To effectively fuse text information at different granularities, we propose two new multi-granularity feature fusion modules: the Double-granularity-text Fusion Module (DFM) and the Triple-granularity-text Fusion Module (TFM). We fuse two granularities of text (sentence-level and aspect-level) in DFM as a transition for TFM, which fuses three granularities (sentence-, aspect-, and word-level). Specifically, DFM achieves a deep fusion of sentence-aspect text by means of parallel affine transformations to enhance the generator's understanding of the text. TFM first enhances the visual feature map based on aspect-word information and then fuses the sentence text to further improve the network's comprehensive understanding and expression of the text. To achieve an accurate fusion of fine-grained text with visual areas, we develop a novel attention mechanism called the Coordinate Attention Module (CAM). CAM represents the visual feature map along the horizontal and vertical directions and obtains two direction-aware features that can accurately locate local areas of the visual feature map. These features and the aspect-word embeddings compute semantic affinities to update and enhance semantically relevant visual regions while mitigating the interference of semantically irrelevant and redundant information. In addition, we introduce pre-trained CLIP [24] into the model to encode text and provide visual information, which bridges the semantic gap between text and images and improves generalization. Finally, we design text losses at different granularities to constrain the model, which enhances its ability to understand and express text information at different granularities.

We conduct extensive experiments on the CUB [25] and COCO [26] datasets to evaluate the effectiveness of our proposed SAW-GAN quantitatively and qualitatively. Experimental results show that our method outperforms current CGAN-based methods and even outperforms some methods based on diffusion and autoregressive models. Our main contributions are summarized as follows:

• We propose a novel single-stage architecture, SAW-GAN, that can efficiently and deeply fuse text and image features at multiple granularities (sentences, aspects, and words) to generate semantically consistent photorealistic images.
• Two new modules, DFM and TFM, are introduced to target the fusion of text and visual information at different granularities. DFM fuses sentence-aspect and visual information to enhance the text understanding ability of the generator. TFM attends to the visual areas related to aspect-word information, which helps the network realize the comprehensive integration of sentence, aspect and word information.
• A novel coordinate attention module is designed to accurately locate and enhance the visual feature map with aspect-word embeddings.
• Extensive experiments demonstrate our proposed method's effectiveness and superiority; it can even be slightly better than some diffusion models and autoregressive models.

The rest of this paper is organized as follows. In Section 2, we review related work. Section 3 details our proposed SAW-GAN as well as the new modules. Experimental results are reported in Section 4. Section 5 concludes the paper.

2. Related work

In this section, we describe the research areas related to our work from two aspects: text-to-image generation and the use of CLIP in text-to-image generation.

2.1. Text-to-image generation

In recent years, Artificial Intelligence in Computer Graphics (AICG) has become one of the research hotspots in computer science and artificial intelligence. With the rapid development of artificial intelligence and computer graphics, AICG has received extensive attention and exploration in academia and industry. Among its tasks, text-to-image synthesis has received increasing attention. Due to improvements in generative methods (CGAN) [15], text-to-image models have yielded confusingly realistic results. Subsequently, the development of large-scale generative methods (autoregressive and diffusion models) enabled significant improvements in text-to-image generation conditioned on arbitrary text descriptions.

The autoregressive model generates the corresponding image pixel values by decoding the input text features progressively. Yu et al. [14] proposed Parti, which views image generation as a translation task. Ding et al. [27] proposed CogView2, which performs image generation in a hierarchical Transformer and parallel autoregressive manner. The diffusion model gradually transforms random noise into a meaningful image through an iterative denoising process. Ramesh et al. [13] proposed DALL·E 2, which combines pre-trained CLIP and diffusion models to generate images. Zhang et al. [28] proposed ControlNet, which controls pre-trained large diffusion models to support additional input conditions. However, these methods all require a time-consuming and resource-intensive iterative process to generate high-quality images. A CGAN generates images with a single forward pass, significantly reducing the training time requirement. We classify existing CGAN-based methods into two categories by the number of generator/discriminator pairs, multi-stage and single-stage models, and introduce the existing methods in chronological order.

Multi-stage model: Zhang et al. [29,30] first proposed a multi-stage method, StackGAN, with multiple CGANs to generate exciting instances from a given text in a coarse-to-fine manner. To refine image details, Xu et al. [31] proposed AttnGAN, which uses an attention mechanism to refine image regions related to words, further enhancing the visual realism of images. Zhu et al. [32] proposed DM-GAN, which uses a memory network to store rich semantic information, allowing the network to better capture the details and semantics of images when processing complex text descriptions. Ruan et al. [33] proposed DAE-GAN, which can accurately capture multiple aspects of text and generate diverse and high-quality images. Tan et al. [19] proposed DR-GAN, which introduced a distribution regularization mechanism to make the generated image more closely match the distribution of the real image by minimizing the distribution difference between the real and generated images.


Fig. 1. The overall architecture of our proposed SAW-GAN. It consists of CLIP and a generator/discriminator pair. In the generator, FC is the fully connected layer, and GBlock and UpBlock are fusion modules based on a residual structure; the difference is that GBlock has no upsampling layer. DFM is the dual-granularity-text fusion module discussed in Section 3.2 for fusing sentence and aspect embeddings. TFM is the triple-granularity-text fusion module introduced in Section 3.3 for fusing sentence, aspect, and word embeddings. The discriminator consists of multiple residual DBlocks that take the output of the CLIP image encoder and fuse its visual information to obtain image features for computing the multi-granularity text losses.

Single-stage model: Reed et al. [34] were the first to utilize a single-stage CGAN for text-to-image generation and produced plausible images. However, the training process of a CGAN is relatively unstable and prone to mode collapse, resulting in an inability to create diverse samples. Tao et al. [21] proposed DF-GAN, which utilizes a Matching-Aware Gradient Penalty (MA-GP) to stabilize the training process and significantly enhance text–image semantic consistency without introducing additional networks. Liao et al. [22] proposed SSA-GAN, which combines semantic information and spatial layout information for joint optimization when generating images. Ye et al. [23] proposed RAT-GAN, which makes the generated image consistent with the input text by iteratively applying affine transformations and exploiting feedback from the discriminator.

Unlike the above methods, our method adopts a single-stage architecture and introduces aspect-level and word-level text information during the generation process, providing the model with fine-grained visual features and synthesis capabilities.

2.2. Use of CLIP in text-to-image generation

CLIP [24] is a model based on contrastive learning. It consists of an image encoder and a text encoder that are jointly trained on large-scale image–text datasets. Through this pre-training, CLIP learns a general feature space that enables the model to understand the semantic relationship between images and text without additional supervisory signals. Therefore, CLIP can be applied to many tasks, such as image classification [13,35], image generation [36,37], and image search [38,39].

Since the CLIP model is pre-trained on a large-scale image–text dataset, it bridges the semantic gap between vision and language to a certain extent, enabling cross-modal semantic understanding. In the text-to-image generation task, Zhou et al. [40] took advantage of the well-aligned multimodal semantic space of the CLIP model to train a text-to-image generation model without text data. Ramesh et al. [37] decomposed the generation process into multiple stages and used the latent variables of the CLIP model for conditional control, enabling the model to generate images with rich diversity and semantic decoupling. Tao et al. [17] combined GAN and CLIP, using the scene understanding ability of CLIP to enable the discriminator to evaluate image quality accurately and using CLIP to provide helpful visual concepts for the generator, so that the model can generate images that are realistic, diverse, and consistent with the input text.

Unlike previous methods, we fully utilize CLIP's cross-modal understanding ability, achieve semantic alignment at different granularities, and enhance the model's ability to express the text. Specifically, we utilize the CLIP text encoder to encode text at different granularities. Following the discriminator structure in [17], the CLIP image encoder encodes images to obtain CLIP image features, and the discriminator further extracts features to obtain effective visual features. Then, the losses between text features and visual features at different granularities are calculated to optimize the generator and discriminator jointly.

3. Method

In this section, we provide a detailed introduction to the overall architecture of our proposed SAW-GAN, as shown in Fig. 1. In order to integrate text information at multiple granularities, we propose two new modules that fuse different granularities of text embeddings. (i) The Dual-granularity-text Fusion Module (DFM) fuses sentence and aspect embeddings in a parallel affine transformation manner. (ii) The Triple-granularity-text Fusion Module (TFM) utilizes the Coordinate Attention Module (CAM) to modulate visual feature maps based on aspect and word embeddings, which are then used as inputs for the subsequent sentence affine transformation. Finally, we design separate loss functions for the different granularities of text embeddings to optimize the semantic direction of the network from global to local constraints.

3.1. Overall architecture

As shown in Fig. 1, our SAW-GAN consists of a text encoder and a generator/discriminator pair. Unlike previous works [21,29,31,32], we introduce the CLIP text encoder to encode the given text and aspect text, obtaining three different levels of text embeddings. The CLIP text encoder is designed based on the BERT pre-trained model [41]. After tokenization, each token in the given text is converted into a vector representation, forming a sequence that is passed through multiple Transformer encoder layers for feature extraction. Finally, a fixed-length vector representation is output for each token as the word embeddings, and the output of the last Transformer encoder is average-pooled to obtain the encoding of the entire sentence.

The generator network of SAW-GAN consists of one GBlock, five UpBlocks, one DFM, and one TFM, which handle visual feature maps of different scales and fuse text embeddings of different granularities, respectively. The generator takes a noise vector E_Z sampled from a Gaussian distribution, the sentence embedding E_S, the aspect embedding E_A, and the word embedding E_W as inputs. The noise vector and sentence embedding are concatenated and fed into a fully connected layer, reshaped into an initial visual feature map, and then fed into the UpBlocks. DFM first upsamples the output of the preceding UpBlock and then fuses the text embeddings through two two-way fusion blocks. The two-way fusion block effectively integrates the aspect embedding and sentence embedding into the visual features through parallel affine transformations. TFM first modulates the visual feature map with the aspect and word embeddings as conditions through a coordinate attention module and uses the result as input for the subsequent sentence embedding fusion.
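
As a concrete illustration of the three granularities of text embedding described above, the sketch below encodes a sentence, its aspect phrases, and its tokens with a CLIP text encoder. The Hugging Face model name, the mean-pooling choice, and the hand-written aspect list are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch: sentence-, aspect-, and word-level embeddings from a CLIP text encoder.
# Assumptions: Hugging Face CLIP ("openai/clip-vit-base-patch32"), mean-pooled sentence
# embedding, and manually supplied aspect phrases; the paper's exact pipeline may differ.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

sentence = "a striped bird with a red nape and a long pointed beak"
aspects = ["a striped bird", "a red nape", "a long pointed beak"]  # hypothetical aspect split

with torch.no_grad():
    tok = tokenizer([sentence], padding=True, return_tensors="pt")
    out = text_encoder(**tok)
    word_emb = out.last_hidden_state          # (1, T, 512): one vector per token (word level)
    sent_emb = word_emb.mean(dim=1)           # (1, 512): average pooling over tokens (sentence level)

    tok_a = tokenizer(aspects, padding=True, return_tensors="pt")
    aspect_emb = text_encoder(**tok_a).pooler_output  # (N, 512): one vector per aspect phrase

print(sent_emb.shape, aspect_emb.shape, word_emb.shape)
```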

Fig. 2. Overview of the proposed Dual-granularity-text fusion module. DFM aims to fuse sentence-aspect embeddings with visual features. It is composed of an upsampling layer and two two-way fusion blocks. Conditioned on the sentence and aspect embeddings, the two-way fusion block adjusts the feature maps with parallel sentence and aspect affine transformations, then concatenates and convolves them to complete the fusion. The aspect affine branch scales and biases the visual features conditioned on each aspect embedding and averages the results before passing them to the later processing.

The discriminator is designed to distinguish real and generated images. We follow the structure of the discriminator in [17], which consists of the CLIP image encoder and DBlocks. The real/generated images are first mapped into low-dimensional feature maps by the CLIP image encoder, and the DBlocks then integrate them with the informative visual features extracted by CLIP from shallow to deep layers. The final output is an image feature with rich visual information. It is concatenated with the sentence embedding to compute an adversarial loss as the sentence-level text loss L_S, a contrastive loss with the aspect embeddings as the aspect-level text loss L_A, and a similarity loss with the word embeddings as the word-level text loss L_W. The joint supervision of multiple losses provides a better optimization direction for the network.

The following subsections provide the technical details of DFM, TFM, and the loss functions.

3.2. Dual-granularity-text fusion module

The semantic affinity between text embeddings and visual features is significant for the authenticity and semantic consistency of generated images. Therefore, increasing the model's understanding of the conditional text can positively impact image synthesis. A sentence may contain multiple aspect terms that describe objects or scenes from different perspectives, such as ''grey and red wings'' and ''a red head'' in Fig. 1. This aspect information can help the model better understand the semantics contained in the text. Therefore, we extract multiple aspect-level texts from the sentence and develop DFM to integrate them into the generation process to improve the model's expressive performance.

As shown in Fig. 2, DFM consists of an upsampling module and two two-way fusion blocks. DFM takes the output h_{t-1} of the previous stage as input, with the sentence embedding E_S ∈ R^{512} and the aspect embedding E_A ∈ R^{n×512} as conditions. We upsample the input h_{t-1}, deeply fuse it with the sentence and aspect embeddings through the two two-way fusion blocks, and output the fused visual features h_t. We improve the deep fusion module of DF-GAN [21] by changing the original single-condition (sentence embedding) fusion into a dual-condition (sentence and aspect embedding) fusion. The sentence embedding and aspect embedding learn the scales and biases of the different text-granularity conditions through sentence affine and aspect affine transformations, which are then concatenated and convolved along the channel dimension to obtain the output. The aspect affine transformation learns the scale and bias of the aspect embeddings with two sets of MLPs; after the MLPs, the parameter tensor has dimension R^{n×c×h×w}. The visual feature h_{t-1} is first expanded to the matching shape, then scaled and biased, and finally averaged over the n aspects and passed to the next stage. This process is described as follows:

F_t^i = F^i(F_{t-1}^i, E_S, E_A),  i = 1, 2                                (1)

F^i = f_C(f_A(f_R(f_S(h_{t-1}^i, E_S)), f_R(f_a(h_{t-1}^i, E_A))))          (2)

where E_S is the sentence embedding, E_A is the aspect embedding, F^i denotes the i-th two-way fusion block, f_C denotes a 3 × 3 convolutional layer, f_A denotes the concatenation operation, f_R is the Leaky ReLU activation function [42], f_S denotes the sentence affine transformation, and f_a denotes the aspect affine transformation.

Our proposed DFM can effectively fuse textual information at multiple granularities, enabling the network to capture precise semantics while maintaining good image quality.
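
The following PyTorch sketch shows how one two-way fusion block of DFM can be read from Eqs. (1)–(2): parallel sentence and aspect affine transformations, averaging over the n aspects, channel-wise concatenation, and a 3 × 3 convolution. The MLP design and layer widths are our assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Predict channel-wise scale and bias from a text embedding (one set of MLPs each)."""
    def __init__(self, text_dim, channels):
        super().__init__()
        self.gamma = nn.Sequential(nn.Linear(text_dim, channels), nn.ReLU(), nn.Linear(channels, channels))
        self.beta = nn.Sequential(nn.Linear(text_dim, channels), nn.ReLU(), nn.Linear(channels, channels))

    def forward(self, h, e):                       # h: (B, C, H, W); e: (B, C_text) or (B, N, C_text)
        gamma, beta = self.gamma(e), self.beta(e)  # (..., C)
        if e.dim() == 3:                           # aspect branch: scale/bias per aspect, then average
            h = h.unsqueeze(1) * (1 + gamma)[..., None, None] + beta[..., None, None]
            return h.mean(dim=1)                   # average over the N aspects
        return h * (1 + gamma)[:, :, None, None] + beta[:, :, None, None]

class TwoWayFusionBlock(nn.Module):
    """One two-way fusion block of DFM (a sketch of Eq. (2))."""
    def __init__(self, channels, text_dim=512):
        super().__init__()
        self.sent_affine = Affine(text_dim, channels)
        self.aspect_affine = Affine(text_dim, channels)
        self.act = nn.LeakyReLU(0.2)
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, h, e_s, e_a):                # h: (B, C, H, W); e_s: (B, 512); e_a: (B, N, 512)
        branch_s = self.act(self.sent_affine(h, e_s))
        branch_a = self.act(self.aspect_affine(h, e_a))
        return self.conv(torch.cat([branch_s, branch_a], dim=1))  # concat channels, 3x3 conv

h = torch.randn(2, 256, 16, 16)
out = TwoWayFusionBlock(256)(h, torch.randn(2, 512), torch.randn(2, 3, 512))
print(out.shape)  # torch.Size([2, 256, 16, 16])
```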

3.3. Triple-granularity-text fusion module

Improving the semantic understanding ability of the model requires attention not only to sentence- and aspect-level text information but also to finer-grained word-level information. A given sentence often contains words that contribute differently to the synthesized sample. For example, adjectives such as ''small'' and ''long'' play a crucial role in the model's understanding of the shape or contour of the modified object. Therefore, exploring the potential interaction between word embeddings and local visual features can improve the quality and semantic consistency of generated images. For these reasons, we propose the Triple-granularity-text Fusion Module (TFM). As shown in Fig. 3(a), TFM consists of an aspect-word fusion module and a sentence fusion module. TFM has four inputs: the output h_{t-1} of the previous stage, as well as the aspect, word, and sentence embeddings. First, the aspect-word fusion module combines the aspect and word embeddings with the visual features and outputs the fused features as input to the sentence fusion module, which then fuses the sentence embedding.

To effectively integrate the aspect-word textual information, we design a novel attention module called the Coordinate Attention Module (CAM), as shown in Fig. 3(b). CAM can identify essential aspects and words in a given text description and allocate greater weights to the related visual regions to refine the local details of the generated image. CAM has three inputs: the feature map h_{t-1} ∈ R^{C×H×W} (where C, H, W denote the number of channels, height, and width of h_{t-1}, respectively), the aspect embedding E_A ∈ R^{D×N}, and the word embedding E_W ∈ R^{D×T} (where D, N, T denote the channel dimension of the text embeddings, the number of aspects, and the sequence length, respectively).


Fig. 3. Overview of the proposed Triple-granularity-text fusion module. (a) TFM consists of two processes: aspect-word fusion and sentence fusion. Aspect and word embeddings
are fused using the Coordinate Attention Module (CAM). (b) CAM calculates the correlation between aspect-word embeddings and feature maps, enhancing local regions of visual
features that align with the textual information to enrich image details. Sentence embedding is fused using multiple affine transformations.

We first process the original feature map with two global average pooling operations, along the horizontal and vertical directions, to aggregate the discriminative information in each specific direction, obtaining a horizontal feature vector F_x ∈ R^{C×H×1} and a vertical feature vector F_y ∈ R^{C×1×W}. Afterward, the two feature vectors are concatenated and fed into convolutional and normalization layers to integrate the location information, and the result is then split into the location-specific query feature maps F_x^q ∈ R^{C×H×1} and F_y^q ∈ R^{C×1×W}. This process is expressed mathematically as follows:

F_x = AvgPool_X(h_{t-1})                        (3)
F_y = AvgPool_Y(h_{t-1})                        (4)
F_x^q, F_y^q = f_s(f_n(f_cq(f_c(F_x, F_y))))    (5)

where AvgPool_X and AvgPool_Y denote global average pooling along the horizontal and vertical directions, respectively, f_c denotes the concatenation operation, f_cq denotes a 1 × 1 convolutional layer, f_n is Adaptive Layer-Instance Normalization [43], and f_s refers to the split operation.

For the aspect and word embeddings, we first concatenate them to integrate the local semantic information and then pass them through two different 1 × 1 convolutions to generate two local contextual embeddings, the key F_l^k ∈ R^{C×(N+T)} and the value F_l^v ∈ R^{C×(N+T)}:

F_l^k = f_ck(f_c(E_A, E_W))                     (6)
F_l^v = f_cv(f_c(E_A, E_W))                     (7)

where f_ck and f_cv denote two 1 × 1 convolutional layers, respectively, and f_c denotes the concatenation operation.

We model the semantic affinity between the two to capture the local correlation between the semantic information and the visual feature maps. We perform a dot product between the query and the key and process it with the softmax function to obtain similarity matrices in the two directions, M_x ∈ R^{H×(N+T)} and M_y ∈ R^{W×(N+T)}, which represent local correlations between the text and the sub-regions of a specific direction. Finally, the two similarity matrices are multiplied with F_l^k and passed through the softmax function to be reshaped into R^{C×H×1} and R^{C×1×W}. We denote these operations as:

M_x = Softmax(D(F_x^q, F_l^k))                  (8)
M_y = Softmax(D(F_y^q, F_l^k))                  (9)
W_x = Softmax(D(F_l^k, M_x))                    (10)
W_y = Softmax(D(F_l^k, M_y))                    (11)

where D(·, ·) denotes the dot product operation, and W_x ∈ R^{C×H×1} and W_y ∈ R^{C×1×W} are the attention weights in the horizontal and vertical directions, respectively.

After obtaining the attention weights, we multiply them with the original feature map to reweight the visual features. In this way, the network pays more attention to the local regions of the feature map that are related to the textual information and assigns larger weights to them. At the same time, we use an adaptive residual connection [44] to retain the rich information of the original feature map and stabilize the learning of the network, and finally obtain the adjusted feature map h'_{t-1} ∈ R^{C×H×W}:

F_w = h_{t-1} ⊙ W_x ⊙ W_y                       (12)
h'_{t-1} = λ · F_w + h_{t-1}                    (13)

where h_{t-1} is the input feature map, ⊙ denotes element-wise multiplication, and λ is a learnable parameter initialized to 0, which enables the network to autonomously adjust the weight of the reweighted features and obtain a more appropriate feature representation.
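
Below is a compact, hedged sketch of the coordinate attention computation in Eqs. (3)–(13). Only the overall flow (directional average pooling, aspect-word keys, two directional attention maps, and a learnable residual) follows the description above; the normalization, reshaping details, and layer widths are our own guesses rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordinateAttention(nn.Module):
    """Sketch of CAM: direction-aware attention driven by concatenated aspect/word embeddings."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.to_query = nn.Conv2d(channels, channels, kernel_size=1)   # stands in for f_cq in Eq. (5)
        self.to_key = nn.Conv1d(text_dim, channels, kernel_size=1)     # stands in for f_ck in Eq. (6)
        self.lam = nn.Parameter(torch.zeros(1))                        # lambda in Eq. (13)

    def forward(self, h, e_a, e_w):
        # h: (B, C, H, W); e_a: (B, D, N); e_w: (B, D, T)
        B, C, H, W = h.shape
        f_x = h.mean(dim=3, keepdim=True)                  # (B, C, H, 1)  horizontal pooling, Eq. (3)
        f_y = h.mean(dim=2, keepdim=True)                  # (B, C, 1, W)  vertical pooling, Eq. (4)
        q = self.to_query(torch.cat([f_x, f_y.transpose(2, 3)], dim=2))  # (B, C, H+W, 1), Eq. (5)
        q_x, q_y = q[:, :, :H, 0], q[:, :, H:, 0]          # (B, C, H) and (B, C, W)

        key = self.to_key(torch.cat([e_a, e_w], dim=2))    # (B, C, N+T), Eq. (6)
        m_x = F.softmax(torch.einsum("bch,bcl->bhl", q_x, key), dim=-1)   # (B, H, N+T), Eq. (8)
        m_y = F.softmax(torch.einsum("bcw,bcl->bwl", q_y, key), dim=-1)   # (B, W, N+T), Eq. (9)
        w_x = F.softmax(torch.einsum("bcl,bhl->bch", key, m_x), dim=1)    # (B, C, H),   Eq. (10)
        w_y = F.softmax(torch.einsum("bcl,bwl->bcw", key, m_y), dim=1)    # (B, C, W),   Eq. (11)

        f_w = h * w_x.unsqueeze(-1) * w_y.unsqueeze(2)     # (B, C, H, W), Eq. (12)
        return self.lam * f_w + h                          # adaptive residual, Eq. (13)

cam = CoordinateAttention(channels=256, text_dim=512)
out = cam(torch.randn(2, 256, 32, 32), torch.randn(2, 512, 3), torch.randn(2, 512, 18))
print(out.shape)  # torch.Size([2, 256, 32, 32])
```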

After the above operations, we have realized the fusion of local information. The fused feature maps are fed into the global fusion module, which consists of affine transformations, ReLU, and convolutional layers. It learns the scale and bias under the global (sentence) semantics, expands them to the shape of the input feature map, and then scales and biases the features. Using TFM, the fusion of coarse- and fine-grained text can be realized effectively, so that the network outputs an image that conforms to the semantics of the text and has rich details.

3.4. Objective function

We design loss functions from coarse to fine granularity based on the text to generate realistic images while ensuring semantic consistency between the text and the corresponding image.

Sentence-Level Text Loss: We use the adversarial loss proposed in [21], leveraging the hinge objective [45] to stabilize training. The adversarial loss for the discriminator is:

L_S^D = E_{x∼P_data}[max(0, 1 − D(x, s))]
      + (1/2) E_{x̂∼P_G}[max(0, 1 + D(x̂, s))]
      + (1/2) E_{x∼P_data}[max(0, 1 + D(x, ŝ))]             (14)

where s is the given textual description and ŝ is a mismatched natural language description.

In addition to the adversarial loss, we introduce the CLIP similarity loss in the generator. The loss of the generator is:

L_S^G = −E_{x̂∼P_G}[D(x̂, s)] − λ_s E_{x̂∼P_G}[S(x̂, s)]        (15)

where S(·, ·) denotes the cosine similarity between the CLIP-encoded visual features and textual features, and λ_s is a hyperparameter that weights the text–image similarity term.
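
A minimal sketch of the sentence-level objectives in Eqs. (14)–(15) is given below. It assumes a discriminator that returns a real-valued score D(image, sentence) and CLIP-encoded image/text features for the similarity term; both are placeholders rather than the authors' actual modules.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake, d_mismatch):
    """Discriminator hinge loss of Eq. (14).
    d_real: D(x, s); d_fake: D(x_hat, s); d_mismatch: D(x, s_hat) -- all (B,) score tensors."""
    return (F.relu(1.0 - d_real).mean()
            + 0.5 * F.relu(1.0 + d_fake).mean()
            + 0.5 * F.relu(1.0 + d_mismatch).mean())

def g_sentence_loss(d_fake, img_feat, txt_feat, lambda_s=4.0):
    """Generator sentence-level loss of Eq. (15): adversarial term plus CLIP similarity term.
    img_feat/txt_feat are assumed CLIP-encoded features of the fake image and the sentence."""
    sim = F.cosine_similarity(img_feat, txt_feat, dim=-1).mean()
    return -d_fake.mean() - lambda_s * sim

# Toy usage with random scores/features.
d_real, d_fake, d_mis = torch.randn(8), torch.randn(8), torch.randn(8)
print(d_hinge_loss(d_real, d_fake, d_mis).item())
print(g_sentence_loss(d_fake, torch.randn(8, 512), torch.randn(8, 512)).item())
```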

Aspect-Level Text Loss: The alignment of aspect embeddings and local region features of images is important. Therefore, we introduce contrastive learning to align aspect embeddings and local image features in a common space. We use cosine similarity as the distance metric:

cos(u, v) = uᵀv / (‖u‖ · ‖v‖)                               (16)

The contrastive loss takes a pair of inputs (u^i, v^j), minimizes the embedding distance when they come from the same pair (i.e., i = j), and maximizes the embedding distance of different pairs (i.e., i ≠ j):

L_cl(u, v) = −(1/m) Σ_{i=1}^{m} log( exp(cos(u^i, v^i)/η) / Σ_{j=1}^{m} exp(cos(u^i, v^j)/η) )   (17)

L_A^D = L_cl(f_img(x), a)                                   (18)
L_A^G = L_cl(f_img(x̂), a)                                   (19)

where m is the batch size, u^i is the i-th sample of a mini-batch of u, v^j is the j-th sample of v, and η is a hyperparameter. f_img denotes the image encoding process and a is the aspect embedding from the text encoder.

Word-Level Text Loss: To further supervise the semantic consistency of the network, we apply a cosine similarity loss to align word embeddings and local image features:

L_W^D = 1 − cos(f_img(x), w)                                (20)
L_W^G = 1 − cos(f_img(x̂), w)                                (21)

where w denotes the word embedding.

Generator Objective: The generator loss includes the sentence-level text loss L_S^G, the aspect-level text loss L_A^G, and the word-level text loss L_W^G:

L_G = L_S^G + λ_A^G L_A^G + λ_W^G L_W^G                     (22)

where λ_A^G and λ_W^G are hyperparameters.

Discriminator Objective: The discriminator loss includes the sentence-level text loss L_S^D, the aspect-level text loss L_A^D, and the word-level text loss L_W^D:

L_D = L_S^D + λ_A^D L_A^D + λ_W^D L_W^D                     (23)

where λ_A^D and λ_W^D are hyperparameters.

4. Experiment

In this section, to demonstrate the ability of SAW-GAN to generate semantically consistent and visually realistic images, we conduct a large number of quantitative and qualitative experiments on two benchmark datasets, Caltech-UCSD Birds 200 (CUB) [25] and MS COCO [26]. Specifically, Section 4.1 explains the experimental setup. In Section 4.2, the proposed SAW-GAN is quantitatively and qualitatively compared with state-of-the-art methods. Section 4.3 analyzes the contributions of DFM and TFM. Finally, Section 4.4 tests the impact of fusing different numbers of aspect-level texts.

4.1. Experiment setup

Datasets: We conduct extensive experiments on two benchmark datasets, CUB and COCO. The CUB dataset contains only birds as single objects, with 200 bird categories; 150 classes with 8855 images form the training set and the remaining 50 classes with 2933 images form the test set. Each image has 10 corresponding text descriptions. The COCO dataset consists of 123,287 complex everyday scene images, with 82,783 used for training and 40,504 for testing. Each image has 5 corresponding text descriptions.

Evaluation Metrics: Following previous work [21,31,32], two evaluation metrics commonly used in text-to-image generation, the Inception Score (IS) [46] and the Fréchet Inception Distance (FID) [47], are used to measure the quality of the images generated by the model.

IS is computed with the pre-trained Inception Net-V3 network. It calculates the KL divergence between the conditional class distribution and the marginal class distribution. A higher IS indicates that the generated images are clearer and their corresponding categories are easier to identify.

FID also uses the pre-trained Inception Net-V3 network. It calculates the Fréchet distance between the distribution of generated samples and the distribution of real data. A smaller distance, i.e., a lower FID, indicates that the generated distribution is closer to that of the real images.

Implementation details: We use the PyTorch framework to build the model and conduct experiments on an Nvidia RTX 3090 GPU (24 GB memory). We set the batch size to 64, the dimension of the noise vector to 100, and the size of the generated images to 224 × 224. The training process uses the Adam optimizer [48] with β1 = 0.0 and β2 = 0.9. The learning rate is set to 0.0001 for the generator and 0.0004 for the discriminator. The hyperparameter λ_s is set to 4.0, λ_A^G and λ_A^D are set to 0.1, λ_W^G and λ_W^D are set to 0.2, and η is set to 0.1. We set the channel dimension of the initial block to 512. The model contains 88M parameters. We train for 600 epochs on the CUB dataset and 120 epochs on the COCO dataset; the training time is about 27 h and 36 h, respectively. Generating one image takes about 0.05 s.

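
Putting the pieces of Section 3.4 together, the sketch below implements the contrastive loss of Eq. (17) and the weighted generator objective of Eq. (22) using the hyperparameters reported above (λ_A = 0.1, λ_W = 0.2, η = 0.1). The feature tensors are random stand-ins for the CLIP/discriminator outputs, so this illustrates only the loss arithmetic, not the full training pipeline.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(u, v, eta=0.1):
    """InfoNCE-style loss of Eq. (17): u, v are (m, d) batches of paired features."""
    u, v = F.normalize(u, dim=-1), F.normalize(v, dim=-1)
    logits = u @ v.t() / eta                  # (m, m) matrix of cos(u_i, v_j) / eta
    targets = torch.arange(u.size(0))
    return F.cross_entropy(logits, targets)   # -log softmax per row, matched pairs on the diagonal

def generator_total_loss(l_sent, img_feat, aspect_feat, word_feat,
                         lambda_a=0.1, lambda_w=0.2):
    """Weighted sum of Eq. (22): sentence loss + aspect contrastive loss + word cosine loss."""
    l_aspect = contrastive_loss(img_feat, aspect_feat)                        # Eq. (19)
    l_word = (1.0 - F.cosine_similarity(img_feat, word_feat, dim=-1)).mean()  # Eq. (21)
    return l_sent + lambda_a * l_aspect + lambda_w * l_word

# Toy usage with random features standing in for encoder outputs.
img = torch.randn(8, 512)
print(generator_total_loss(torch.tensor(0.5), img, torch.randn(8, 512), torch.randn(8, 512)).item())
```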

Fig. 4. Example results of DM-GAN [32], DF-GAN [21] and our SAW-GAN for text-to-image synthesis on the CUB dataset.

Table 1
IS of SOTA methods and our model on the CUB and COCO datasets (higher is better).

Method              CUB ↑          COCO ↑
AttnGAN [31]        4.36 ± 0.03    25.89 ± 0.47
MirrorGAN [49]      4.54 ± 0.17    26.47 ± 0.41
DM-GAN [32]         4.75 ± 0.07    30.49 ± 0.57
MA-GAN [50]         4.76 ± 0.09    –
DF-GAN [21]         5.10           –
DiverGAN [51]       4.98 ± 0.06    –
ARRPNGAN [18]       5.16 ± 0.06    32.36 ± 0.62
CF-GAN [52]         4.83 ± 0.08    31.13 ± 0.37
ALR-GAN [53]        4.96 ± 0.04    34.70 ± 0.66
Ours                4.63 ± 0.04    35.17 ± 0.49

Table 2
FID of SOTA methods and our model on the CUB and COCO datasets (lower is better).

Method              CUB ↓    COCO ↓
DM-GAN [32]         16.09    32.64
DAE-GAN [33]        15.19    28.12
DF-GAN [21]         14.81    21.41
SSA-GAN [22]        15.61    19.37
DiverGAN [51]       15.63    20.52
ARRPNGAN [18]       14.21    29.71
RAT-GAN [23]        13.91    14.60
DE-GAN [54]         18.94    28.79
ALR-GAN [53]        15.14    29.04
Cogview2 [27]       –        17.50
VQ-Diffusion [55]   10.32    13.86
Ours                10.45    11.17

4.2. Comparison with SOTA

4.2.1. Quantitative results

Our model is compared with previous models [18,21,31,32,49–53] on the CUB and COCO datasets. The IS results of our SAW-GAN and the other models are shown in Table 1. Our SAW-GAN achieves better performance, significantly improving the IS from 34.70 ± 0.66 to 35.17 ± 0.49 on the COCO dataset, and shows comparable performance on the CUB dataset. Unlike CUB, which has a relatively small sample size and focuses only on birds, COCO has more samples and more categories, which tests the model's ability to learn various visual features and variations in object appearance. Our model achieves the best results on the COCO dataset, which shows that it has more advantages when generating samples with multiple objects and categories.

Table 2 shows the FID comparison of SAW-GAN with CGAN-based models [18,21–23,32,33,51,53,54], an autoregressive model [27], and a diffusion model [55] on the CUB and COCO datasets. Compared with CGAN-based models, SAW-GAN reduces the FID from 13.91 to 10.45 on the CUB dataset and from 14.60 to 11.17 on the COCO dataset. SAW-GAN significantly reduces the FID from 17.50 to 11.17 compared to CogView2, which is based on an autoregressive model. Compared with VQ-Diffusion, which is based on a diffusion model and has four times more parameters than ours, SAW-GAN achieves a similar FID on the CUB dataset, while the improvement on the COCO dataset is very significant. The quantitative results on the two datasets show that our SAW-GAN outperforms existing GAN-based, autoregressive, and diffusion models and can generate high-fidelity images.

4.2.2. Qualitative results

In addition to the quantitative experiments, we conduct qualitative comparisons with DM-GAN [32] and DF-GAN [21] on the CUB and COCO datasets to evaluate the visual quality of the generated images. The generated results are shown in Fig. 4 and Fig. 5.

The generated samples on the CUB dataset are shown in Fig. 4. We can observe that DM-GAN, DF-GAN, and our SAW-GAN can all generate photorealistic images given a textual description, but SAW-GAN is stronger on image details and text–image consistency. In the visual comparison of the generated samples, the birds of DM-GAN (2nd and 8th columns) and DF-GAN (5th and 7th columns) have weird shapes, and the backgrounds of the images synthesized by DM-GAN (2nd and 5th columns) and DF-GAN (2nd and 5th columns) are implausible. In addition, in the 4th column, DF-GAN does not represent ''a red crown'' in the text very well. This shows that, through DFM, our model can represent aspect-level text well. In addition, through TFM, the model can further bridge the gap between visual feature maps and word embeddings, generating high-quality samples with rich detailed features and text–image consistency.


Fig. 5. Example results of DM-GAN [32], DF-GAN [21] and our SAW-GAN for text-to-image synthesis on the COCO dataset.

The generated samples on the COCO dataset are shown in Fig. 5. Compared with DM-GAN and DF-GAN, the images synthesized by our SAW-GAN are more realistic. In the 3rd column of Fig. 5, SAW-GAN synthesizes a giraffe walking on a lawn, whose body shape and details are more vivid and realistic than those of DM-GAN and DF-GAN. In the 4th column, a smiling man wearing a black suit and floral tie is correctly synthesized by SAW-GAN, while DM-GAN and DF-GAN generate a strange and indistinguishable object. SAW-GAN also generates a boat docked on the shore (1st column), an identifiable train parked on the rails (2nd column), a city with well-defined layers and tall buildings (5th column), street lamps hanging next to various signs (6th column), and pasta and vegetables with a rich color distribution that is straightforward to identify (7th column). In contrast, both DM-GAN and DF-GAN produce some oddly shaped objects (1st, 2nd, 3rd, 4th, 6th, and 7th columns) and objects whose number does not match the text descriptions (3rd and 8th columns). Since our SAW-GAN is equipped with DFM and TFM, it can capture aspect-level and word-level information in sentences and highlight the main objects of the text descriptions, generating scene images with complex spatial distributions. In addition, the introduction of pre-trained CLIP improves the generalization ability of the model. Even if only a small number of samples of some objects are learned, SAW-GAN can still generate highly semantically consistent and realistic images.

To verify the sensitivity of SAW-GAN to words, we test the model by modifying the color attribute values of words in the aspects of natural language descriptions. Fig. 6 shows three aspects [''the small bird'', ''colorful <color> feathers'', and ''colorful <color> belly''] extracted from the text. When we change the color attribute values of the latter two aspects, we can see that SAW-GAN produces consistent and well-distributed bird images based on the modified text. At the same time, the visual appearance of the unmodified parts (e.g., background, body shape, and texture) is well preserved. Therefore, through DFM and TFM, aspect- and word-level text can be well integrated into the visual features during image generation, which enables SAW-GAN to accurately control the expressive effect of sample regions according to the properties of the text descriptions and to generate images that conform to them.

Fig. 6. Samples of SAW-GAN generated by changing the color attribute value of words in an aspect of the natural language description.

4.3. Component analysis

In this section, we discuss the effects of our proposed modules, DFM and TFM, through ablation experiments, as well as the impact of placing them at different positions in the generator.

4.3.1. Ablation experiment

The superiority of our proposed SAW-GAN has been demonstrated by the quantitative and qualitative experimental results. However, it remains to be discovered which component is significant for the performance improvement. Therefore, we conduct ablation experiments on the CUB dataset to verify the contribution of each part of SAW-GAN, including the Dual-granularity-text fusion module (DFM) and the Triple-granularity-text fusion module (TFM). We gradually remove the corresponding parts of SAW-GAN to quantitatively explore the effectiveness of each component, namely, M1: SAW-GAN, M2: SAW-GAN without DFM, M3: SAW-GAN without TFM, and M4: SAW-GAN without DFM and TFM. All the results are shown in Table 3.

Table 3
Ablation experiments of our SAW-GAN on the CUB dataset. DFM and TFM denote the dual-granularity-text fusion module and the triple-granularity-text fusion module, respectively.

ID    DFM    TFM    IS ↑           FID ↓
M1    ✓      ✓      4.63 ± 0.04    10.45
M2           ✓      4.50 ± 0.08    11.50
M3    ✓             4.31 ± 0.06    12.50
M4                  4.23 ± 0.12    12.68

By comparing M1 (SAW-GAN) and M2 (removing DFM), the addition of DFM improves the IS from 4.50 ± 0.08 to 4.63 ± 0.04 and reduces the FID from 11.50 to 10.45 on the CUB dataset, proving the importance and effectiveness of DFM. By comparing M1 (SAW-GAN) and M3 (removing TFM), the IS improves from 4.31 ± 0.06 to 4.63 ± 0.04, and the FID improves significantly from 12.50 to 10.45, proving that TFM can help the generator produce more realistic images.

By comparing M4 (removing DFM and TFM) and M2 (removing DFM), we can see that TFM contributes more to the model. However, from the comparison of M4 (removing DFM and TFM) and M3 (removing TFM), the influence of DFM alone is not large. Nevertheless, from M1 (SAW-GAN) and M4 (removing DFM and TFM), the combined effect of DFM and TFM achieves better results. This may be because, when the model fuses text at multiple granularities, it needs a progressive fusion process, that is, fusing text information from coarse to fine rather than jumping directly from fusing one granularity of text information to three granularities. Therefore, we believe that DFM plays an auxiliary role in the model: it acts as a buffer when fusing text information at multiple granularities rather than spanning them directly. Only the interaction of DFM and TFM makes the model achieve the best effect.

4.3.2. Component location experiments

We name the positions of the six blocks after GBlock in the generator B-1 to B-6. We first place DFM and TFM at positions B-1 and B-2 and then shift them backward by one block at a time, observing how the performance changes on the CUB dataset. As shown in Table 4, the models with P1 and P2 (DFM and TFM located at B-1 to B-3) perform poorly. The best experimental results are achieved when DFM and TFM are placed in the middle positions of the generator, namely P3 and P4. However, when they are moved to the back end of the generator, according to the experimental results of P5, the performance also decreases. Considering these results, this may be because the overall semantic information is lost or distorted when the fine-grained text features are fused too early or too late. The fine-grained text controls the text–image fusion more easily in the middle positions, and the overall semantic information is better preserved. Therefore, DFM and TFM are added in the middle positions of the model, making it easier to fuse the fine-grained text.

Table 4
Effect of DFM and TFM at different positions in the generator on the CUB dataset. The positions of the six blocks after GBlock in the generator are named B-1 to B-6 in turn.

ID    Components            IS ↑           FID ↓
P1    DFM(B-1), TFM(B-2)    4.33 ± 0.05    11.99
P2    DFM(B-2), TFM(B-3)    4.18 ± 0.07    14.13
P3    DFM(B-3), TFM(B-4)    4.63 ± 0.04    10.45
P4    DFM(B-4), TFM(B-5)    4.60 ± 0.05    11.13
P5    DFM(B-5), TFM(B-6)    4.37 ± 0.04    12.96

4.4. Influence of the amount of fused aspect-level text

Since fine-grained text plays an essential role in our model, we wondered whether fusing different amounts of aspect-level text would affect the generation results. Therefore, we conduct experiments on the number of fused aspect-level texts.

We extract aspect-level texts from the given text and apply different numbers of them to the text–image fusion. For N1 (no fused aspect-level text), we replace the aspect embeddings in DFM and TFM with sentence embeddings. The other models, N2, N3 and N4, only adjust the number of fused aspect-level texts. As shown in Table 5, N4 (fusing 3 aspect-level texts) performs better than the other models, both in IS (increased from 4.42 ± 0.08 to 4.63 ± 0.04) and FID (decreased from 13.49 to 10.45). This shows that the more aspect-level text is integrated into the image generation process, the closer the final generated image is to the real text description, and the more realistic the image is.

Table 5
Effect of fusing different numbers of aspect-level texts on the CUB dataset.

ID    Number of aspects    IS ↑           FID ↓
N1    0                    4.42 ± 0.08    12.01
N2    1                    4.53 ± 0.09    13.49
N3    2                    4.59 ± 0.03    11.56
N4    3                    4.63 ± 0.04    10.45

As shown in Fig. 7, we change the number of fused aspect-level texts in the generation process to verify the impact of aspect-level texts on the visual quality of the generated image. Specifically, we give two text descriptions, each of which has three aspects. Models N1, N2, N3 and N4 fuse 0–3 aspects, respectively, and the visual results are shown in Fig. 7. In the example in the lower half, compared with N1 without fused aspect-level texts, N2, which fuses one aspect, ''this bird has black-and-white speckled plumage'', expresses the semantic information of that aspect well, such as ''speckled plumage''. However, the other two aspects in the text description [''a bright yellow crown'', ''long beak''] are not shown in the generated image. In the example in the upper half, as shown for N3, the two fused aspects [''a small brown bird'', ''a white breast''] are reflected in the corresponding positions of the bird. Because the third aspect, ''black eyes'', is not fused, it is not obvious in the generated image. In addition, from the visual results, the semantic information of each aspect-level text is reflected in the related local area of the image, which proves that CAM can locate and enrich the semantically related local visual areas.

Fig. 7. Samples generated by varying the number of fused aspect-level texts.

The quantitative and qualitative experimental results fully confirm the importance of refining images with aspect-level information. At the same time, SAW-GAN can effectively realize the fusion of multi-granularity text information using DFM and TFM and achieve text–image consistency for the generated images with fused aspect-level texts.

5. Conclusion

In this paper, we propose a single-stage framework for multi-granularity text fusion, called SAW-GAN, for synthesizing photo-realistic and semantically consistent images from a given text. Our main contribution is two new modules that effectively fuse multi-granularity textual and visual information, the Dual-granularity-text fusion module (DFM) and the Triple-granularity-text fusion module (TFM). DFM deeply integrates sentence-aspect and visual information by parallel affine transformations, which enhances the generator's ability to understand and express aspect information. TFM first modulates the local regions of the visual features related to aspect-words and then deeply fuses the sentence information to improve the overall understanding of the input text. For the accurate fusion of fine-grained text, we develop a novel coordinate attention module (CAM), which makes the generator pay more attention to the local visual regions matching the aspect-word semantics. In addition, we introduce pre-trained CLIP to encode text and images and improve the quality of generated images by using CLIP's generalization ability in the text–image domain. SAW-GAN achieves significant performance gains on both the CUB and COCO datasets, showing some superiority even compared with large diffusion and autoregressive models. However, its model size is much smaller than that of large models, which limits its ability to synthesize higher-quality images. In future work, we will explore the possibility of combining it with other pre-trained large models to generate higher-quality samples.

CRediT authorship contribution statement

Dehu Jin: Writing – original draft, Software, Methodology, Data curation, Conceptualization. Qi Yu: Conceptualization, Methodology. Lan Yu: Validation, Visualization. Meng Qi: Formal analysis, Resources, Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

The work is partially supported by the National Natural Science Foundation of China (No. 61902225) and the Joint Fund of the Natural Science Foundation of Shandong Province, China (No. ZR2021LZL011).

References

[1] Y. Long, Y. Wen, J. Han, H. Xu, P. Ren, W. Zhang, S. Zhao, X. Liang, CapDet: Unifying dense captioning and open-world detection pretraining, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15233–15243.
[2] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.
[3] E. Mitchell, Y. Lee, A. Khazatsky, C.D. Manning, C. Finn, DetectGPT: Zero-shot machine-generated text detection using probability curvature, in: International Conference on Machine Learning, PMLR, 2023, pp. 24950–24962.
[4] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, T. Scialom, Toolformer: Language models can teach themselves to use tools, Adv. Neural Inf. Process. Syst. 36 (2024).
[5] D. Hong, B. Zhang, H. Li, Y. Li, J. Yao, C. Li, M. Werner, J. Chanussot, A. Zipf, X.X. Zhu, Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks, Remote Sens. Environ. 299 (2023) 113856.
[6] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, et al., SpectralGPT: Spectral remote sensing foundation model, IEEE Trans. Pattern Anal. Mach. Intell. (2024).
[7] J. Wang, D. Chen, Z. Wu, C. Luo, L. Zhou, Y. Zhao, Y. Xie, C. Liu, Y.-G. Jiang, L. Yuan, OmniVL: One foundation model for image-language and video-language tasks, 2022, arXiv preprint arXiv:2209.07526.
[8] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, L. Wang, GIT: A generative image-to-text transformer for vision and language, 2022, arXiv preprint arXiv:2205.14100.
[9] W. Hong, M. Ding, W. Zheng, X. Liu, J. Tang, CogVideo: Large-scale pretraining for text-to-video generation via transformers, 2022, arXiv preprint arXiv:2205.15868.
[10] S. Shahriar, GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network, Displays (2022) 102237.
[11] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, M. Irani, Imagic: Text-based real image editing with diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023,
[15] M. Mirza, S. Osindero, Conditional generative adversarial nets, 2014, arXiv preprint arXiv:1411.1784.
[16] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, T. Park, Scaling up GANs for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10124–10134.
[17] M. Tao, B.-K. Bao, H. Tang, C. Xu, GALIP: Generative adversarial CLIPs for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14214–14223.
[18] F. Quan, B. Lang, Y. Liu, ARRPNGAN: Text-to-image GAN with attention regularization and region proposal networks, Signal Process., Image Commun. 106 (2022) 116728.
[19] H. Tan, X. Liu, B. Yin, X. Li, DR-GAN: Distribution regularization for text-to-image generation, IEEE Trans. Neural Netw. Learn. Syst. (2022).
[20] Q. Cheng, K. Wen, X. Gu, Vision-language matching for text-to-image synthesis via generative adversarial networks, IEEE Trans. Multimed. (2022).
[21] M. Tao, H. Tang, F. Wu, X.-Y. Jing, B.-K. Bao, C. Xu, DF-GAN: A simple and effective baseline for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16515–16525.
[22] W. Liao, K. Hu, M.Y. Yang, B. Rosenhahn, Text to image generation with semantic-spatial aware GAN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18187–18196.
[23] S. Ye, H. Wang, M. Tan, F. Liu, Recurrent affine transformation for text-to-image synthesis, IEEE Trans. Multimed. (2023).
[24] A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[25] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200–2011 Dataset, California Institute of Technology, 2011.
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Vol. 13, Springer, 2014, pp. 740–755.
[27] M. Ding, W. Zheng, W. Hong, J. Tang, CogView2: Faster and better text-to-image generation via hierarchical transformers, Adv. Neural Inf. Process. Syst. 35 (2022) 16890–16902.
[28] L. Zhang, A. Rao, M. Agrawala, Adding conditional control to text-to-image diffusion models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
[29] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
[30] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN++: Realistic image synthesis with stacked generative adversarial networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (8) (2018) 1947–1962.
[31] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, X. He, AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1316–1324.
[32] M. Zhu, P. Pan, W. Chen, Y. Yang, DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5802–5810.
[33] S. Ruan, Y. Zhang, K. Zhang, Y. Fan, F. Tang, Q. Liu, E. Chen, DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13960–13969.
[34] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, 2015, arXiv preprint arXiv:1511.06434.
[35] H. Liu, S. Xu, J. Fu, Y. Liu, N. Xie, C.-C. Wang, B. Wang, Y. Sun, CMA-CLIP: Cross-modality attention CLIP for image-text classification, 2021, arXiv preprint arXiv:2112.03562.
[36] Z. Wang, W. Liu, Q. He, X. Wu, Z. Yi, CLIP-GEN: Language-free training of a text-to-image generator with CLIP, 2022, arXiv preprint arXiv:2203.00386.
[37] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, 2022, arXiv preprint arXiv:2204.06125.
[38] A. Baldrati, M. Bertini, T. Uricchio, A. Del Bimbo, Effective conditioned and composed image retrieval combining CLIP-based features, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21466–21474.
[39] A. Sain, A.K. Bhunia, P.N. Chowdhury, S. Koley, T. Xiang, Y.-Z. Song, CLIP for all things zero-shot sketch-based image retrieval, fine-grained or not, in: Proceedings
pp. 6007–6017. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023,
[12] NVIDIA, Gaugan2, 2021, http://gaugan.org/gaugan2/. pp. 2765–2775.
[13] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional [40] Y. Zhou, R. Zhang, C. Chen, C. Li, C. Tensmeyer, T. Yu, J. Gu, J. Xu, T. Sun,
image generation with clip latents, 2022, arXiv preprint arXiv:2204.06125. 1 (2) Lafite: Towards language-free training for text-to-image generation, 2021, arXiv
(2022) 3. preprint arXiv:2111.13792.
[14] J. Yu, Y. Xu, J.Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, [41] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep
B.K. Ayan, et al., Scaling autoregressive models for content-rich text-to-image bidirectional transformers for language understanding, 2018, arXiv preprint
generation, 2022, arXiv preprint arXiv:2206.10789. arXiv:1810.04805.

[42] B. Xu, N. Wang, T. Chen, M. Li, Empirical evaluation of rectified activations in convolutional network, 2015, arXiv preprint arXiv:1505.00853.
[43] J. Kim, M. Kim, H. Kang, K. Lee, U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation, 2019, arXiv preprint arXiv:1907.10830.
[44] H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, Self-attention generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 7354–7363.
[45] J.H. Lim, J.C. Ye, Geometric gan, 2017, arXiv preprint arXiv:1705.02894.
[46] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training gans, in: Advances in Neural Information Processing Systems, vol. 29, 2016.
[47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[48] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[49] T. Qiao, J. Zhang, D. Xu, D. Tao, Mirrorgan: Learning text-to-image generation by redescription, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1505–1514.
[50] Y. Yang, L. Wang, D. Xie, C. Deng, D. Tao, Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis, IEEE Trans. Image Process. 30 (2021) 2798–2809.
[51] Z. Zhang, L. Schomaker, Divergan: An efficient and effective single-stage framework for diverse text-to-image generation, Neurocomputing 473 (2022) 182–198.
[52] Y. Zhang, S. Han, Z. Zhang, J. Wang, H. Bi, CF-GAN: cross-domain feature fusion generative adversarial network for text-to-image synthesis, Vis. Comput. 39 (4) (2023) 1283–1293.
[53] H. Tan, B. Yin, K. Wei, X. Liu, X. Li, Alr-gan: Adaptive layout refinement for text-to-image synthesis, IEEE Trans. Multimed. (2023).
[54] B. Jiang, W. Zeng, C. Yang, R. Wang, B. Zhang, DE-GAN: Text-to-image synthesis with dual and efficient fusion model, Multimedia Tools Appl. (2023) 1–14.
[55] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, B. Guo, Vector quantized diffusion model for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10696–10706.