SAW-GAN
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Keywords: Text-to-image generation; Text–image information fusion; Attention mechanism; CLIP

Abstract: Text-to-image generation is a challenging task that aims to generate visually realistic images that are semantically consistent with a given text. Existing methods mainly exploit the global semantic information of a single sentence while ignoring fine-grained semantic information such as aspects and words, which are critical factors in bridging the semantic gap in text-to-image generation. We propose a Multi-granularity Text (Sentence-level, Aspect-level, and Word-level) Fusion Generative Adversarial Network (SAW-GAN), which comprehensively represents textual information at multiple granularities. To fuse multi-granularity information effectively, we design a Double-granularity-text Fusion Module (DFM), which fuses sentence and aspect information through parallel affine transformations, and a Triple-granularity-text Fusion Module (TFM), which fuses sentence, aspect, and word information through a novel Coordinate Attention Module (CAM) that can precisely locate the visual areas associated with each aspect and word. Furthermore, we use CLIP (Contrastive Language-Image Pre-training) to provide visual information that bridges the semantic gap and improves the model's generalization ability. Our results show significant performance improvements over state-of-the-art methods based on Conditional Generative Adversarial Networks (CGANs) on the CUB (FID from 13.91 to 10.45) and COCO (FID from 14.60 to 11.17) datasets, with photorealistic images of richer detail and stronger text–image consistency.
1. Introduction

The rapid development of artificial intelligence (AI) models has led to applications in various fields, such as computer vision (CV) [1,2], natural language processing (NLP) [3,4] and remote sensing (RS) [5,6]. Cross-modal tasks [7–9] are AI sub-tasks that involve processing information across different modalities (such as text, image, and speech) and are a hot research topic nowadays. Text-to-image generation is one of these research hotspots; it aims to generate realistic images that conform to the semantics of a given text and has been applied to art generation [10], image editing [11], interactive entertainment [12], and so on.

Recently, methods based on diffusion and autoregressive models, such as DALL⋅E 2 [13] and Parti [14], have demonstrated strong generative capabilities and significantly outperform earlier conditional Generative Adversarial Networks (CGANs) [15]. Although these models have made significant progress, they all rely on iterative inference and have huge model sizes, thus requiring extremely long computation times and demanding hardware. A CGAN, in contrast, needs only a single forward pass to generate an image and has a relatively small model size, which makes it fast and computationally cheap. Moreover, the latest CGAN-based methods [16,17] show better image quality than methods based on diffusion and autoregressive models. In this paper, we continue to explore the potential of CGANs for text-to-image generation.

Existing CGAN-based text-to-image generation models can be broadly categorized into two types: multi-stage models and single-stage models. Multi-stage models [18–20] generate images in a coarse-to-fine manner by stacking a series of generator and discriminator pairs, gradually increasing the detail and resolution of the image and finally obtaining a photorealistic result. Single-stage models [21–23] introduce only one generator/discriminator pair to produce photorealistic images that are semantically consistent with the textual description. Compared with multi-stage models, single-stage models are more conducive to convergence and training stability. In this work, we also follow a single-stage architecture.

Although state-of-the-art single-stage models have achieved exciting results, due to the difference in modality structure between text and image data, these generative models can only extract limited semantic information from the text for image generation, which leads to deficiencies in the quality and semantic consistency of the generated images. To narrow the semantic gap between text and generated images, we need to extract semantic information at different granularities from the text, such as sentence-, aspect-, and word-level granularity.
For example, for the text "a striped bird with a red nape and a long pointed beak", the coarse-grained information at the sentence level provides the text's basic semantic information (global information). The degree to which aspect-level information interprets the semantics of a text lies between the sentence level and the word level. It can serve as a bridge between the sentence and word levels, enabling the model to understand text semantics from coarse to fine instead of jumping directly. For the above text, its aspects are "a striped bird", "a red nape", and "a long pointed beak", each describing the object from a different perspective. Correspondingly, each word at the word-level granularity has a specific meaning and semantic relationship, such as the adjectives "striped" and "red" and the nouns "nape" and "beak", which provide detailed information.

We propose an innovative approach that fuses multi-granularity information from text (sentence, aspect, and word level), called the Multi-granularity Text Fusion Generative Adversarial Network (SAW-GAN). SAW-GAN supplements sentences with fine-grained information (aspects and words), which enhances the model's ability to understand text representations and enables it to generate visually plausible samples that correspond to the semantic description. To effectively fuse text information at different granularities, we propose two new multi-granularity feature fusion modules: the Double-granularity-text Fusion Module (DFM) and the Triple-granularity-text Fusion Module (TFM). DFM fuses two granularities of text (sentence level and aspect level) as a transition for TFM, which fuses three granularities (sentence, aspect, and word level). Specifically, DFM achieves deep fusion of sentence-aspect text by means of parallel affine transformations to enhance the generator's understanding of the text. TFM first enhances the visual feature map based on aspects and words and then fuses the sentence text to further improve the network's comprehensive understanding and expression of the text. To achieve accurate fusion of fine-grained text with visual areas, we develop a novel attention mechanism called the Coordinate Attention Module (CAM). CAM represents the visual feature map along the horizontal and vertical directions and obtains two direction-aware features that can accurately locate local areas of the visual feature map. These features and the aspect-word embeddings are used to compute semantic affinities that update and enhance semantically relevant visual regions while mitigating the interference of semantically irrelevant and redundant information. In addition, we introduce pre-trained CLIP [24] into the model to encode text and provide visual information, bridging the semantic gap between text and images and improving generalization. Finally, we design text losses at different granularities to constrain the model, which enhances its ability to understand and express text information at each granularity.

We conduct extensive experiments on the CUB [25] and COCO [26] datasets to evaluate our proposed SAW-GAN quantitatively and qualitatively. Experimental results show that our method outperforms current CGAN-based methods and even outperforms some methods based on diffusion and autoregressive models. Our main contributions are summarized as follows:

• We propose a novel single-stage architecture, SAW-GAN, that can efficiently and deeply fuse text and image features at multiple granularities (sentences, aspects, and words) to generate semantically consistent photorealistic images.
• Two new modules, DFM and TFM, are introduced to fuse text and visual information at different granularities. DFM fuses sentence-aspect and visual information to enhance the text understanding ability of the generator. TFM attends to the visual areas related to aspects and words, which helps the network realize the comprehensive integration of sentence, aspect, and word information.
• A novel Coordinate Attention Module is designed to accurately locate and enhance the visual feature map with aspect-word embeddings.
• Extensive experiments demonstrate the effectiveness and superiority of our proposed method, which is even slightly better than some diffusion and autoregressive models.

The rest of this paper is organized as follows. In Section 2, we review related work. Section 3 details our proposed SAW-GAN as well as the new modules. Experimental results are reported in Section 4. Section 5 concludes the paper.

2. Related work

In this section, we describe the research areas related to our work from two aspects: text-to-image generation and the use of CLIP in text-to-image generation.

2.1. Text-to-image generation

In recent years, Artificial Intelligence in Computer Graphics (AICG) has become one of the research hotspots in computer science and artificial intelligence. With the rapid development of artificial intelligence and computer graphics, AICG has received extensive attention and exploration in academia and industry. Among these directions, the task of text-to-image synthesis has received increasing attention. Thanks to improvements in generative methods (CGANs) [15], text-to-image models have produced convincingly realistic results. Subsequently, the development of large-scale generative methods (autoregressive and diffusion models) enabled significant improvements in text-to-image generation conditioned on arbitrary text descriptions.

Autoregressive models generate the corresponding image pixel values by progressively decoding the input text features. Yu et al. [14] proposed Parti, which views image generation as a translation task. Ding et al. [27] proposed CogView2, which performs image generation with a hierarchical Transformer in a parallel autoregressive manner. Diffusion models gradually transition from random noise to a meaningful image by reversing a Gaussian noising process. Ramesh et al. [13] proposed DALL⋅E 2, which combines pre-trained CLIP and diffusion models to generate images. Zhang et al. [28] proposed ControlNet, which controls pre-trained large diffusion models to support additional input conditions. However, these methods all require a time-consuming and resource-intensive iterative process to generate high-quality images. A CGAN generates images with a single forward pass, significantly reducing the training time requirement. We classify existing CGAN-based methods into two categories by the number of generator/discriminator pairs, multi-stage and single-stage models, and introduce them in chronological order.

Multi-stage models: Zhang et al. [29,30] first proposed a multi-stage method, StackGAN, which stacks multiple CGANs to generate convincing instances from a given text in a coarse-to-fine manner. To refine image details, Xu et al. [31] proposed AttnGAN, which uses an attention mechanism to refine image regions related to words, further enhancing the visual realism of images. Zhu et al. [32] proposed DM-GAN, which uses a memory network to store rich semantic information, allowing the network to better capture the details and semantics of images when processing complex text descriptions. Ruan et al. [33] proposed DAE-GAN, which can accurately capture multiple aspects of the text and generate diverse, high-quality images. Tan et al. [19] proposed DR-GAN, which introduces a distribution regularization mechanism that makes the generated images match the distribution of real images more closely by minimizing the difference between the two distributions.

Single-stage models: Reed et al. [34] were the first to utilize a single-stage CGAN for image generation and produced plausible images. However, the training process of a CGAN is relatively unstable and prone to mode collapse, resulting in an inability to create diverse samples. Tao et al. [21] proposed DF-GAN, which utilizes a Matching-Aware Gradient Penalty (MA-GP) to stabilize the training process and significantly enhance text–image semantic consistency without introducing additional networks.
Fig. 1. The overall architecture of our proposed SAW-GAN. It consists of CLIP and a generator/discriminator pair. In the generator, FC is the fully connected layer, and GBlock
and UpBlock are fusion modules based on residual structure. The difference is that there is no upsampling layer in GBlock. DFM is a dual-granularity-text fusion module discussed
in Section 3.2 for fusing sentence and aspect embeddings. TFM is a triple-granularity-text fusion module introduced in Section 3.3 to fuse sentence, aspect, and word embeddings.
The discriminator consists of multiple residual-based structure DBlocks to accept the input of the CLIP image encoder and fuse visual information to obtain image features for
computing multi-granularity text loss.
Liao et al. [22] proposed SSA-GAN, which combines semantic information and spatial layout information for joint optimization when generating images. Ye et al. [23] proposed RAT-GAN, which makes the generated image consistent with the input text by iteratively applying affine transformations and exploiting feedback from the discriminator.

Unlike the above methods, our method adopts a single-stage architecture and introduces aspect-level and word-level text information during the generation process, providing the model with fine-grained visual features and synthesis capabilities.

2.2. Use of CLIP in text-to-image generation

CLIP [24] is a model based on contrastive learning. It consists of an image encoder and a text encoder; by jointly pre-training the two encoders on large-scale image–text datasets, CLIP learns a general feature space. This feature space enables the model to understand the semantic relationship between images and text without additional supervisory signals. Therefore, CLIP can be applied to many tasks, such as image classification [13,35], image generation [36,37], and image search [38,39].

Since CLIP is pre-trained on a large-scale image–text dataset, it bridges the semantic gap between vision and language to a certain extent, enabling cross-modal semantic understanding. In the text-to-image generation task, Zhou et al. [40] took advantage of the well-aligned multimodal semantic space of the CLIP model to train a text-to-image generation model without text data. Ramesh et al. [37] decomposed the generation process into multiple stages and used the latent variables of the CLIP model for conditional control, enabling the model to generate images with rich diversity and semantic decoupling. Tao et al. [17] combined GAN and CLIP, using the scene understanding ability of CLIP to enable the discriminator to evaluate image quality accurately and using CLIP to provide helpful visual concepts to the generator, so that the model can generate images that are realistic, diverse, and consistent with the input text.

Unlike previous methods, we fully utilize CLIP's cross-modal understanding ability, achieve semantic alignment at different granularities, and enhance the model's expressive effect on the text. Specifically, we utilize the CLIP text encoder to encode text at different granularities. Following the discriminator structure in [17], the CLIP image encoder encodes images to obtain CLIP image features, and the discriminator further extracts these features to obtain effective visual features. Then, the losses between text features and visual features at different granularities are calculated to optimize the generator and discriminator jointly.

3. Method

In this section, we provide a detailed introduction to the overall architecture of our proposed SAW-GAN, shown in Fig. 1. To integrate text information of multiple granularities, we propose two new modules that fuse text embeddings of different granularities. (i) The Dual-granularity-text Fusion Module (DFM) fuses sentence and aspect embeddings in a parallel affine transformation manner. (ii) The Triple-granularity-text Fusion Module (TFM) utilizes the Coordinate Attention Module (CAM) to modulate the visual feature maps based on aspect and word embeddings, and the result is then used as input for the subsequent sentence affine transformation. Finally, we design separate loss functions for the different granularities of text embeddings to optimize the semantic direction of the network from global to local constraints.

3.1. Overall architecture

As shown in Fig. 1, our SAW-GAN consists of a text encoder and a generator/discriminator pair. Unlike previous works [21,29,31,32], we introduce the CLIP text encoder to encode the given text and its aspect texts, obtaining three different levels of text embeddings. The CLIP text encoder is designed based on the BERT pre-trained model [41]. After tokenization, each token in the given text is converted into a vector representation, forming a sequence that is passed through multiple layers of Transformer encoders for feature extraction; the resulting fixed-length vector representations are output as word embeddings. The output of the last Transformer encoder is then average-pooled to obtain the encoding representation of the entire sentence.
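The following is a minimal illustrative sketch (not the authors' released code) of how the three granularities of text embeddings can be obtained from a CLIP text encoder, here using the Hugging Face CLIP text model as a stand-in; the aspect phrases are assumed to be extracted from the caption beforehand, and all names in the snippet are ours.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

def encode(texts):
    # Token-level features (B, T, 512) and a pooled text embedding (B, 512).
    # Note: the HF pooler uses the end-of-text token; the paper describes average pooling.
    tokens = tokenizer(texts, padding="max_length", truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**tokens)
    return out.last_hidden_state, out.pooler_output

caption = ["a striped bird with a red nape and a long pointed beak"]
aspects = ["a striped bird", "a red nape", "a long pointed beak"]   # assumed pre-extracted

E_W, E_S = encode(caption)    # word embeddings and sentence embedding
_, E_A = encode(aspects)      # one aspect embedding per aspect phrase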
The generator network of SAW-GAN consists of one GBlock, five UpBlocks, one DFM, and one TFM, which handle visual feature maps of different scales and fuse text embeddings of different granularities. The generator takes as inputs a noise vector E_Z sampled from a Gaussian distribution, the sentence embedding E_S, the aspect embeddings E_A, and the word embeddings E_W. The noise vector and the sentence embedding are concatenated and fed into a fully connected layer, reshaped into a visual feature map, and then fed into the UpBlocks. DFM first upsamples the output of the preceding UpBlock and then fuses the text embeddings through two two-way fusion blocks; each two-way fusion block integrates the aspect and sentence embeddings into the visual features through parallel affine transformations. TFM first modulates the visual feature map, conditioned on the aspect and word embeddings, through a Coordinate Attention Module and uses the result as input for the subsequent sentence embedding fusion.
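As a rough, illustrative sketch of this data flow (and not the released implementation), the generator can be organized as below. The channel widths, the simple stand-in fusion block, and the placement of DFM and TFM among the blocks B-1 to B-6 (cf. Table 4, where DFM at B-3 and TFM at B-4 work best) are assumptions; the real DFM and TFM are sketched in Sections 3.2 and 3.3.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    # Stand-in for GBlock/UpBlock/DFM/TFM: optional 2x upsampling + text-conditioned conv.
    def __init__(self, ch=256, d_txt=512, up=True):
        super().__init__()
        self.up, self.conv = up, nn.Conv2d(ch, ch, 3, 1, 1)
        self.scale, self.shift = nn.Linear(d_txt, ch), nn.Linear(d_txt, ch)

    def forward(self, h, cond):
        if self.up:
            h = F.interpolate(h, scale_factor=2, mode="nearest")
        g = self.scale(cond)[:, :, None, None]
        b = self.shift(cond)[:, :, None, None]
        return F.leaky_relu(self.conv(h) * (1 + g) + b, 0.2)

class SAWGeneratorSketch(nn.Module):
    def __init__(self, nz=100, d_txt=512, ch=256):
        super().__init__()
        self.fc = nn.Linear(nz + d_txt, ch * 4 * 4)      # [E_Z ; E_S] -> 4x4 feature map
        self.gblock = FuseBlock(ch, d_txt, up=False)     # GBlock: no upsampling layer
        self.blocks = nn.ModuleList([FuseBlock(ch, d_txt) for _ in range(6)])  # B-1 ... B-6
        self.to_rgb = nn.Sequential(nn.LeakyReLU(0.2), nn.Conv2d(ch, 3, 3, 1, 1), nn.Tanh())

    def forward(self, e_z, e_s, e_a, e_w):
        h = self.fc(torch.cat([e_z, e_s], dim=1)).view(e_z.size(0), -1, 4, 4)
        h = self.gblock(h, e_s)
        for blk in self.blocks:
            # In SAW-GAN, B-3 is the DFM (sentence + aspect) and B-4 the TFM
            # (sentence + aspect + word); every slot here uses the sentence-only stand-in.
            h = blk(h, e_s)
        return self.to_rgb(h)

g = SAWGeneratorSketch()
img = g(torch.randn(2, 100), torch.randn(2, 512), torch.randn(2, 3, 512), torch.randn(2, 16, 512))
print(img.shape)   # torch.Size([2, 3, 256, 256])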
Fig. 2. Overview of the proposed Dual-granularity-text fusion module. DFM aims to fuse sentence-aspect embeddings with the visual features. DFM is composed of an upsampling layer and two two-way fusion blocks. Conditioned on the sentence and aspect embeddings, the two-way fusion block uses parallel sentence and aspect affine transformations to adjust the feature maps, then concatenates and convolves the results to complete the fusion. The aspect affine passes the adjusted visual features to the later processors after scaling, biasing, and averaging them conditioned on the aspect embeddings.
The discriminator is designed to distinguish real from generated images. We follow the structure of the discriminator in [17], which consists of the CLIP image encoder and DBlocks. The real/generated images are first mapped to low-dimensional feature maps by the CLIP image encoder, and the DBlocks then integrate, from shallow to deep layers, the informative visual features extracted by CLIP. The final output is an image feature with rich visual information, which is concatenated with the sentence embedding to compute an adversarial loss as the sentence-level text loss L_S, a contrastive loss with the aspect embeddings as the aspect-level text loss L_A, and a similarity loss with the word embeddings as the word-level text loss L_W. The joint supervision of multiple losses provides a better optimization direction for the network.

The following subsections provide more detailed technical descriptions of DFM, TFM, and the loss functions.

3.2. Dual-granularity-text fusion module

The semantic affinity between text embeddings and visual features is significant for the authenticity and semantic consistency of the generated images. Therefore, increasing the model's understanding of the conditional text can positively impact image synthesis. A sentence may contain multiple aspect terms that describe objects or scenes from different perspectives, such as "grey and red wings" and "a red head" in Fig. 1. Such aspect information can help the model better understand the semantics contained in the text. Therefore, we extract multiple aspect-level texts from the sentence and develop DFM to integrate them into the generation process to improve the model's expressive power.

As shown in Fig. 2, DFM consists of an upsampling module and two two-way fusion blocks. DFM takes the output h_{t-1} of the previous step as input, with the sentence embedding E_S ∈ R^{512} and the aspect embeddings E_A ∈ R^{n×512} as conditions. We upsample the input h_{t-1}, deeply fuse it with the sentence and aspect embeddings through the two two-way fusion blocks, and output the fused visual features h_t. We improve the deep fusion module of DF-GAN [21] by changing the original single-condition fusion (sentence embedding only) to a dual-condition fusion (sentence and aspect embeddings). The sentence and aspect embeddings learn the scale and bias of their respective text conditions through the sentence affine and aspect affine transformations, and the two branches are then concatenated and convolved along the channel dimension to obtain the output. The aspect affine learns the scale and bias of the aspect embeddings with two sets of MLPs; after the MLPs, the parameter tensors have shape R^{n×c×h×w}. The visual feature h_{t-1} is first expanded to the matching shape, then scaled and biased, and finally averaged over the n aspects and passed to the next processor. This process is described as follows:

F_t^i = F^i(F_{t-1}^i, E_S, E_A), i = 1, 2    (1)

F^i = f_C(f_cat(f_R(f_S(h_{t-1}^i, E_S)), f_R(f_A(h_{t-1}^i, E_A))))    (2)

where E_S is the sentence embedding, E_A is the aspect embedding, F^i denotes the i-th two-way fusion block, f_C denotes a 3 × 3 convolutional layer, f_cat denotes the concatenation operation, f_R is the Leaky ReLU activation function [42], f_S denotes the sentence affine transformation, and f_A denotes the aspect affine transformation.

Our proposed DFM can effectively fuse textual information at multiple granularities, enabling the network to capture precise semantics while maintaining good image quality.
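A minimal PyTorch sketch of Eqs. (1)–(2) is given below; it is our illustration rather than the official implementation, and the MLP widths and the way the aspect branch is averaged over the n aspects are assumptions consistent with the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Affine(nn.Module):
    # Text-conditioned affine: two MLPs predict a per-channel scale and bias.
    def __init__(self, d_txt, ch):
        super().__init__()
        self.gamma = nn.Sequential(nn.Linear(d_txt, ch), nn.ReLU(True), nn.Linear(ch, ch))
        self.beta = nn.Sequential(nn.Linear(d_txt, ch), nn.ReLU(True), nn.Linear(ch, ch))

    def forward(self, h, e):                      # h: (B,C,H,W), e: (B,D) or (B,n,D)
        g, b = self.gamma(e), self.beta(e)        # (..., C)
        if e.dim() == 3:                          # aspect branch: one affine per aspect,
            h = h.unsqueeze(1)                    # broadcast over n aspects, then average
            out = h * (1 + g[..., None, None]) + b[..., None, None]
            return out.mean(dim=1)
        return h * (1 + g[:, :, None, None]) + b[:, :, None, None]

class TwoWayFusionBlock(nn.Module):
    # Eq. (2): parallel sentence/aspect affines -> LeakyReLU -> concat -> 3x3 conv.
    def __init__(self, ch, d_txt=512):
        super().__init__()
        self.sent_affine = Affine(d_txt, ch)
        self.aspect_affine = Affine(d_txt, ch)
        self.conv = nn.Conv2d(2 * ch, ch, 3, 1, 1)

    def forward(self, h, e_s, e_a):
        fs = F.leaky_relu(self.sent_affine(h, e_s), 0.2)
        fa = F.leaky_relu(self.aspect_affine(h, e_a), 0.2)
        return self.conv(torch.cat([fs, fa], dim=1))

class DFM(nn.Module):
    # Eq. (1): upsample, then two stacked two-way fusion blocks.
    def __init__(self, ch, d_txt=512):
        super().__init__()
        self.f1 = TwoWayFusionBlock(ch, d_txt)
        self.f2 = TwoWayFusionBlock(ch, d_txt)

    def forward(self, h, e_s, e_a):               # e_s: (B,512), e_a: (B,n,512)
        h = F.interpolate(h, scale_factor=2, mode="nearest")
        return self.f2(self.f1(h, e_s, e_a), e_s, e_a)

h_t = DFM(256)(torch.randn(2, 256, 16, 16), torch.randn(2, 512), torch.randn(2, 3, 512))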
3.3. Triple-granularity-text fusion module

Improving the semantic understanding ability of the model requires attention not only to sentence- and aspect-level text information but also to finer-grained word-level information. A given sentence often contains words that contribute differently to the synthesized sample. For example, adjectives such as "small" and "long" play a crucial role in the model's understanding of the modified object's shape or contour. Therefore, exploring the potential interaction between word embeddings and local visual features can improve the quality and semantic consistency of the generated images. For these reasons, we propose the Triple-granularity-text Fusion Module (TFM). As shown in Fig. 3(a), TFM consists of an aspect-word fusion module and a sentence fusion module. TFM has four inputs: the output h_{t-1} of the previous stage, as well as the aspect, word, and sentence embeddings. First, the aspect-word fusion module combines the aspect and word embeddings with the visual features and outputs the fused features as input to the sentence fusion module, which then fuses them with the sentence embedding.
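The sketch below illustrates this two-stage structure; it is our illustration only. It assumes a CAM module with the signature cam(h, E_A, E_W) (sketched after Eqs. (3)–(11)) and a simple affine, activation, and convolution stack for the sentence ("global") fusion, so the layer counts are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TFM(nn.Module):
    # Aspect-word fusion via a plugged-in CAM, followed by sentence-conditioned fusion.
    def __init__(self, cam, ch, d_txt=512):
        super().__init__()
        self.cam = cam                      # any callable: cam(h, e_a, e_w) -> enhanced h
        self.gamma = nn.Linear(d_txt, ch)   # sentence affine: per-channel scale ...
        self.beta = nn.Linear(d_txt, ch)    # ... and bias predicted from E_S
        self.conv = nn.Conv2d(ch, ch, 3, 1, 1)

    def forward(self, h, e_s, e_a, e_w):
        h = self.cam(h, e_a, e_w)                        # local aspect-word enhancement
        g = self.gamma(e_s)[:, :, None, None]
        b = self.beta(e_s)[:, :, None, None]
        return self.conv(F.relu(h * (1 + g) + b))        # global (sentence) fusion

# usage with a pass-through stand-in for CAM:
tfm = TFM(cam=lambda h, e_a, e_w: h, ch=256)
out = tfm(torch.randn(2, 256, 32, 32), torch.randn(2, 512),
          torch.randn(2, 3, 512), torch.randn(2, 16, 512))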
To effectively integrate aspect-word textual information, we design a novel attention module called the Coordinate Attention Module (CAM), as shown in Fig. 3(b). CAM can identify the essential aspects and words in a given text description and allocate greater weights to the related visual regions, optimizing the generation of local details in the output image. CAM has three inputs: the feature map h_{t-1} ∈ R^{C×H×W} (where C, H, and W are the number of channels, the height, and the width of h_{t-1}, respectively), the aspect embeddings E_A ∈ R^{D×N}, and the word embeddings E_W ∈ R^{D×T} (where D, N, and T are the channel dimension of the text embeddings, the number of aspects, and the sequence length, respectively).
Fig. 3. Overview of the proposed Triple-granularity-text fusion module. (a) TFM consists of two processes: aspect-word fusion and sentence fusion. Aspect and word embeddings
are fused using the Coordinate Attention Module (CAM). (b) CAM calculates the correlation between aspect-word embeddings and feature maps, enhancing local regions of visual
features that align with the textual information to enrich image details. Sentence embedding is fused using multiple affine transformations.
We first process the original feature map with two global average pooling operations, one in the horizontal and one in the vertical direction, to aggregate the discriminative information of each direction, obtaining a horizontal feature vector F_x ∈ R^{C×H×1} and a vertical feature vector F_y ∈ R^{C×1×W}. Afterward, the two feature vectors are concatenated and fed into convolutional and normalization layers to integrate the location information, and the result is then split back into the direction-specific query feature maps F_x^q ∈ R^{C×H×1} and F_y^q ∈ R^{C×1×W}. This process is expressed mathematically as follows:

F_x = AvgPool_X(h_{t-1})    (3)
F_y = AvgPool_Y(h_{t-1})    (4)
F_x^q, F_y^q = f_s(f_n(f_c^q(f_c(F_x, F_y))))    (5)

where AvgPool_X and AvgPool_Y denote global average pooling in the horizontal and vertical directions, respectively, f_c denotes the concatenation operation, f_c^q denotes a 1 × 1 convolutional layer, f_n is Adaptive Layer-Instance Normalization [43], and f_s denotes the split operation.
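A small PyTorch sketch of Eqs. (3)–(5) follows (our illustration, not the official code); GroupNorm is used as a stand-in for the Adaptive Layer-Instance Normalization of [43], and the pooled vectors are stored as (B, C, H) and (B, C, W) rather than keeping the singleton dimension.

import torch
import torch.nn as nn

class DirectionalQueries(nn.Module):
    # Eqs. (3)-(5): directional pooling -> concat -> 1x1 conv + norm -> split.
    def __init__(self, ch):
        super().__init__()
        self.conv1x1 = nn.Conv1d(ch, ch, kernel_size=1)
        self.norm = nn.GroupNorm(1, ch)    # stand-in for Adaptive Layer-Instance Norm [43]

    def forward(self, h):                               # h: (B, C, H, W)
        b, c, height, width = h.shape
        f_x = h.mean(dim=3)                             # average over W -> (B, C, H)
        f_y = h.mean(dim=2)                             # average over H -> (B, C, W)
        f = self.norm(self.conv1x1(torch.cat([f_x, f_y], dim=2)))   # (B, C, H+W)
        f_xq, f_yq = torch.split(f, [height, width], dim=2)         # split per direction
        return f_xq, f_yq                               # queries: (B, C, H) and (B, C, W)

f_xq, f_yq = DirectionalQueries(256)(torch.randn(2, 256, 32, 32))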
For the aspect and word embeddings, we first concatenate them to integrate the local semantic information and then pass the result through two different 1 × 1 convolutions to generate two local contextual embeddings, the key F_l^k ∈ R^{C×(N+T)} and the value F_l^v ∈ R^{C×(N+T)}:

F_l^k = f_c^k(f_c(E_A, E_W))    (6)
F_l^v = f_c^v(f_c(E_A, E_W))    (7)

where f_c^k and f_c^v denote two 1 × 1 convolutional layers, respectively, and f_c denotes the concatenation operation.

We model the semantic affinity between the directional queries and the local context to capture the local correlation between the semantic information and the visual feature map. We perform a dot-product operation between each query and the key and process the result with the softmax function to obtain similarity matrices in the two directions, M_x ∈ R^{H×(N+T)} and M_y ∈ R^{W×(N+T)}, which represent the local correlations between the text and the sub-regions of a specific direction. Finally, the two similarity matrices are transformed into attention weights of shape R^{C×H×1} and R^{C×1×W} by a dot product with F_l^k followed by the softmax function. We denote these operations as:

M_x = Softmax(D(F_x^q, F_l^k))    (8)
M_y = Softmax(D(F_y^q, F_l^k))    (9)
W_x = Softmax(D(F_l^k, M_x))    (10)
W_y = Softmax(D(F_l^k, M_y))    (11)

where D(·, ·) denotes the dot-product operation, and W_x ∈ R^{C×H×1} and W_y ∈ R^{C×1×W} are the attention weights obtained from M_x in the horizontal direction and from M_y in the vertical direction, respectively.
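The following sketch mirrors Eqs. (6)–(11) in PyTorch and is illustrative only. The softmax axes are our assumption (the text does not specify them), and the value embedding F_l^v of Eq. (7) is computed but left unused here, because the weights of Eqs. (10)–(11) are applied directly to the feature map in Eq. (12).

import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectWordAffinity(nn.Module):
    # Eqs. (6)-(11): key/value projections of the concatenated aspect-word embeddings
    # and the direction-wise attention weights W_x, W_y.
    def __init__(self, ch, d_txt=512):
        super().__init__()
        self.to_key = nn.Conv1d(d_txt, ch, 1)    # f_c^k
        self.to_val = nn.Conv1d(d_txt, ch, 1)    # f_c^v (kept for completeness)

    def forward(self, f_xq, f_yq, e_a, e_w):
        # e_a: (B, N, D) aspects, e_w: (B, T, D) words -> local context of length N+T
        ctx = torch.cat([e_a, e_w], dim=1).transpose(1, 2)                # (B, D, N+T)
        k, v = self.to_key(ctx), self.to_val(ctx)                         # (B, C, N+T)
        m_x = F.softmax(torch.einsum("bch,bcl->bhl", f_xq, k), dim=-1)    # Eq. (8)
        m_y = F.softmax(torch.einsum("bcw,bcl->bwl", f_yq, k), dim=-1)    # Eq. (9)
        w_x = F.softmax(torch.einsum("bcl,bhl->bch", k, m_x), dim=1)      # Eq. (10)
        w_y = F.softmax(torch.einsum("bcl,bwl->bcw", k, m_y), dim=1)      # Eq. (11)
        return w_x.unsqueeze(-1), w_y.unsqueeze(-2)    # (B,C,H,1) and (B,C,1,W)

aff = AspectWordAffinity(256)
w_x, w_y = aff(torch.randn(2, 256, 32), torch.randn(2, 256, 32),
               torch.randn(2, 3, 512), torch.randn(2, 16, 512))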
After obtaining the attention weights, we multiply them with the original feature map to reweight the visual features. In this way, the network pays more attention to the local regions of the feature map that are related to the textual information and assigns larger weights to them. At the same time, we use an adaptive residual connection [44] to retain the rich original information of the feature map and stabilize the learning of the network, finally obtaining the adjusted feature map h'_{t-1} ∈ R^{C×H×W}. It is defined as follows:

F_w = h_{t-1} ⊙ W_x ⊙ W_y    (12)
h'_{t-1} = λ · F_w + h_{t-1}    (13)

where h_{t-1} denotes the input feature map and ⊙ denotes element-wise multiplication. λ is a learnable parameter initialized to 0, which enables the network to autonomously adjust the weight of the attended features and obtain a more appropriate feature representation.
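Eqs. (12)–(13) amount to only a few lines; the sketch below is our paraphrase of them.

import torch
import torch.nn as nn

class AdaptiveResidual(nn.Module):
    # Eqs. (12)-(13): reweight the feature map with W_x, W_y and blend it back
    # through a learnable lambda initialised to zero (adaptive residual connection [44]).
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.zeros(1))   # lambda starts at 0

    def forward(self, h, w_x, w_y):               # h: (B,C,H,W), w_x: (B,C,H,1), w_y: (B,C,1,W)
        f_w = h * w_x * w_y                       # Eq. (12), element-wise with broadcasting
        return self.lam * f_w + h                 # Eq. (13)

Because λ starts at zero, CAM initially acts as an identity mapping, and its influence grows only as training finds the attended features useful, which is the stabilizing effect described above.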
After the above operations, we have realized the fusion of the local information. The fused feature map is fed into the global fusion module, which consists of affine transformation, ReLU, and convolutional layers. It learns the scale and bias under the global (sentence) semantics, expands the input feature map to the matching shape, and then scales and biases it. Using TFM, the fusion of coarse- and fine-grained text can be realized effectively, so that the network outputs an image that conforms to the semantic information of the text and has rich details.

3.4. Objective function

L_W^G = 1 − cos(f_img(x̂), w)    (21)

where w denotes the word embedding.

Generator objective: The generator loss includes the sentence-level text loss L_S^G, the aspect-level text loss L_A^G, and the word-level text loss L_W^G:

L_G = L_S^G + λ_A^G L_A^G + λ_W^G L_W^G    (22)

where λ_A^G and λ_W^G are hyperparameters.

Discriminator objective: The discriminator loss includes the sentence-level text loss L_S^D, the aspect-level text loss L_A^D, and the word-level text loss L_W^D:

L_D = L_S^D + λ_A^D L_A^D + λ_W^D L_W^D    (23)

where λ_A^D and λ_W^D are hyperparameters.
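For orientation, the snippet below shows how the weighted combinations of Eqs. (22)–(23) and the cosine form of Eq. (21) can be assembled in PyTorch. The sentence-level adversarial term is written here in hinge form and the aspect-level term as a simple batch-contrastive loss; these concrete forms, the pooling of the aspect and word embeddings, and the weights lam_a and lam_w are our assumptions and merely stand in for the paper's exact sentence- and aspect-level definitions.

import torch
import torch.nn.functional as F

def word_loss(f_img, e_w):
    # Eq. (21): 1 - cos(f_img(x_hat), w); e_w is pooled to one word vector per image.
    w = e_w.mean(dim=1) if e_w.dim() == 3 else e_w
    return (1.0 - F.cosine_similarity(f_img, w, dim=-1)).mean()

def aspect_loss(f_img, e_a, tau=0.07):
    # Assumed aspect-level contrastive loss: match each image feature to its own aspects.
    a = F.normalize(e_a.mean(dim=1), dim=-1)              # (B, D) pooled aspect embedding
    f = F.normalize(f_img, dim=-1)
    logits = f @ a.t() / tau                              # (B, B) similarity matrix
    target = torch.arange(f.size(0), device=f.device)
    return F.cross_entropy(logits, target)

def generator_loss(adv_fake, f_img_fake, e_a, e_w, lam_a=0.1, lam_w=0.1):
    # Eq. (22): L_G = L_S^G + lambda_A * L_A^G + lambda_W * L_W^G (hinge-style L_S^G assumed).
    l_s = -adv_fake.mean()
    return l_s + lam_a * aspect_loss(f_img_fake, e_a) + lam_w * word_loss(f_img_fake, e_w)

def discriminator_loss(adv_real, adv_fake, f_img_real, e_a, e_w, lam_a=0.1, lam_w=0.1):
    # Eq. (23): hinge adversarial term plus the aspect- and word-level terms on real images.
    l_s = F.relu(1.0 - adv_real).mean() + F.relu(1.0 + adv_fake).mean()
    return l_s + lam_a * aspect_loss(f_img_real, e_a) + lam_w * word_loss(f_img_real, e_w)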
4. Experiment
Fig. 4. Example results of DM-GAN [32], DF-GAN [21] and our SAW-GAN on text-to-image synthesis on the CUB dataset.
Table 1
IS of SOTA methods and our model on the CUB and COCO datasets (higher is better).
Method CUB ↑ COCO ↑
AttnGAN [31] 4.36 ± 0.03 25.89 ± 0.47
MirrorGAN [49] 4.54 ± 0.17 26.47 ± 0.41
DM-GAN [32] 4.75 ± 0.07 30.49 ± 0.57
MA-GAN [50] 4.76 ± 0.09 –
DF-GAN [21] 5.10 –
DiverGAN [51] 4.98 ± 0.06 –
ARRPNGAN [18] 5.16 ± 0.06 32.36 ± 0.62
CF-GAN [52] 4.83 ± 0.08 31.13 ± 0.37
ALR-GAN [53] 4.96 ± 0.04 34.70 ± 0.66
Ours 4.63 ± 0.04 35.17 ± 0.49

Table 2
FID of SOTA methods and our model on the CUB and COCO datasets (lower is better).
Method CUB ↓ COCO ↓
DM-GAN [32] 16.09 32.64
DAE-GAN [33] 15.19 28.12
DF-GAN [21] 14.81 21.41
SSA-GAN [22] 15.61 19.37
DiverGAN [51] 15.63 20.52
ARRPNGAN [18] 14.21 29.71
RAT-GAN [23] 13.91 14.60
DE-GAN [54] 18.94 28.79
ALR-GAN [53] 15.14 29.04
Cogview2 [27] – 17.50
VQ-Diffusion [55] 10.32 13.86
Ours 10.45 11.17
4.2. Compared with SOTA
Fig. 5. Example results of DM-GAN [32], DF-GAN [21] and our SAW-GAN for text-to-image synthesis on the COCO dataset.
Table 3
Ablation experiments of our SAW-GAN on CUB dataset. DFM and TFM repre-
sent dual-granularity-text fusion module and triple-granularity-text fusion module,
respectively.
ID Components IS ↑ FID ↓
DFM TFM
M1 ✓ ✓ 4.63 ± 0.04 10.45
M2 ✓ 4.50 ± 0.08 11.50
M3 ✓ 4.31 ± 0.06 12.50
M4 4.23 ± 0.12 12.68
DFM and TFM, aspect- and word-level text can be well integrated into the visual features during image generation, which enables SAW-GAN to accurately control the appearance of image regions according to the properties of the text description and to generate images that conform to it.
Fig. 6. The samples of SAW-GAN are generated by changing the color attribute value of words in the aspect of natural language description.

4.3. Component analysis
Table 4
Effect of DFM and TFM at different positions of the generator on the CUB dataset. The
positions of the six blocks after GBlock in the generator are named B-1 to B-6 in turn.
ID Components IS ↑ FID ↓
P1 DFM(B-1), TFM(B-2) 4.33 ± 0.05 11.99
P2 DFM(B-2), TFM(B-3) 4.18 ± 0.07 14.13
P3 DFM(B-3), TFM(B-4) 4.63 ± 0.04 10.45
P4 DFM(B-4), TFM(B-5) 4.60 ± 0.05 11.13
P5 DFM(B-5), TFM(B-6) 4.37 ± 0.04 12.96
Table 5
Effect of fusing different numbers of aspect-level text on the CUB dataset.
ID Amount of aspect IS ↑ FID ↓
N1 0 4.42 ± 0.08 12.01
N2 1 4.53 ± 0.09 13.49
N3 2 4.59 ± 0.03 11.56
N4 3 4.63 ± 0.04 10.45
to large models, which limits the ability to synthesize higher quality images. In future work, we will explore the possibility of combining with other pre-trained large models to generate higher quality samples.

CRediT authorship contribution statement

Dehu Jin: Writing – original draft, Software, Methodology, Data curation, Conceptualization. Qi Yu: Conceptualization, Methodology. Lan Yu: Validation, Visualization. Meng Qi: Formal analysis, Resources, Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

The work is partially supported by the National Natural Science Foundation of China (No. 61902225) and the Joint Fund of the Natural Science Foundation of Shandong Province, China (No. ZR2021LZL011).

References

[1] Y. Long, Y. Wen, J. Han, H. Xu, P. Ren, W. Zhang, S. Zhao, X. Liang, CapDet: Unifying dense captioning and open-world detection pretraining, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15233–15243.
[2] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.
[3] E. Mitchell, Y. Lee, A. Khazatsky, C.D. Manning, C. Finn, DetectGPT: Zero-shot machine-generated text detection using probability curvature, in: International Conference on Machine Learning, PMLR, 2023, pp. 24950–24962.
[4] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, T. Scialom, Toolformer: Language models can teach themselves to use tools, Adv. Neural Inf. Process. Syst. 36 (2024).
[5] D. Hong, B. Zhang, H. Li, Y. Li, J. Yao, C. Li, M. Werner, J. Chanussot, A. Zipf, X.X. Zhu, Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks, Remote Sens. Environ. 299 (2023) 113856.
[6] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, et al., SpectralGPT: Spectral remote sensing foundation model, IEEE Trans. Pattern Anal. Mach. Intell. (2024).
[7] J. Wang, D. Chen, Z. Wu, C. Luo, L. Zhou, Y. Zhao, Y. Xie, C. Liu, Y.-G. Jiang, L. Yuan, OmniVL: One foundation model for image-language and video-language tasks, 2022, arXiv preprint arXiv:2209.07526.
[8] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, L. Wang, GIT: A generative image-to-text transformer for vision and language, 2022, arXiv preprint arXiv:2205.14100.
[9] W. Hong, M. Ding, W. Zheng, X. Liu, J. Tang, CogVideo: Large-scale pretraining for text-to-video generation via transformers, 2022, arXiv preprint arXiv:2205.15868.
[10] S. Shahriar, GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network, Displays (2022) 102237.
[11] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, M. Irani, Imagic: Text-based real image editing with diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017.
[12] NVIDIA, GauGAN2, 2021, http://gaugan.org/gaugan2/.
[13] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, 2022, arXiv preprint arXiv:2204.06125.
[14] J. Yu, Y. Xu, J.Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B.K. Ayan, et al., Scaling autoregressive models for content-rich text-to-image generation, 2022, arXiv preprint arXiv:2206.10789.
[15] M. Mirza, S. Osindero, Conditional generative adversarial nets, 2014, arXiv preprint arXiv:1411.1784.
[16] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, T. Park, Scaling up GANs for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10124–10134.
[17] M. Tao, B.-K. Bao, H. Tang, C. Xu, GALIP: Generative adversarial CLIPs for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14214–14223.
[18] F. Quan, B. Lang, Y. Liu, ARRPNGAN: Text-to-image GAN with attention regularization and region proposal networks, Signal Process., Image Commun. 106 (2022) 116728.
[19] H. Tan, X. Liu, B. Yin, X. Li, DR-GAN: Distribution regularization for text-to-image generation, IEEE Trans. Neural Netw. Learn. Syst. (2022).
[20] Q. Cheng, K. Wen, X. Gu, Vision-language matching for text-to-image synthesis via generative adversarial networks, IEEE Trans. Multimed. (2022).
[21] M. Tao, H. Tang, F. Wu, X.-Y. Jing, B.-K. Bao, C. Xu, DF-GAN: A simple and effective baseline for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16515–16525.
[22] W. Liao, K. Hu, M.Y. Yang, B. Rosenhahn, Text to image generation with semantic-spatial aware GAN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18187–18196.
[23] S. Ye, H. Wang, M. Tan, F. Liu, Recurrent affine transformation for text-to-image synthesis, IEEE Trans. Multimed. (2023).
[24] A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[25] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, California Institute of Technology, 2011.
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Vol. 13, Springer, 2014, pp. 740–755.
[27] M. Ding, W. Zheng, W. Hong, J. Tang, CogView2: Faster and better text-to-image generation via hierarchical transformers, Adv. Neural Inf. Process. Syst. 35 (2022) 16890–16902.
[28] L. Zhang, A. Rao, M. Agrawala, Adding conditional control to text-to-image diffusion models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
[29] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
[30] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN++: Realistic image synthesis with stacked generative adversarial networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (8) (2018) 1947–1962.
[31] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, X. He, AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1316–1324.
[32] M. Zhu, P. Pan, W. Chen, Y. Yang, DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5802–5810.
[33] S. Ruan, Y. Zhang, K. Zhang, Y. Fan, F. Tang, Q. Liu, E. Chen, DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13960–13969.
[34] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, 2015, arXiv preprint arXiv:1511.06434.
[35] H. Liu, S. Xu, J. Fu, Y. Liu, N. Xie, C.-C. Wang, B. Wang, Y. Sun, CMA-CLIP: Cross-modality attention CLIP for image-text classification, 2021, arXiv preprint arXiv:2112.03562.
[36] Z. Wang, W. Liu, Q. He, X. Wu, Z. Yi, CLIP-GEN: Language-free training of a text-to-image generator with CLIP, 2022, arXiv preprint arXiv:2203.00386.
[37] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, 2022, arXiv preprint arXiv:2204.06125.
[38] A. Baldrati, M. Bertini, T. Uricchio, A. Del Bimbo, Effective conditioned and composed image retrieval combining CLIP-based features, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21466–21474.
[39] A. Sain, A.K. Bhunia, P.N. Chowdhury, S. Koley, T. Xiang, Y.-Z. Song, CLIP for all things zero-shot sketch-based image retrieval, fine-grained or not, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2765–2775.
[40] Y. Zhou, R. Zhang, C. Chen, C. Li, C. Tensmeyer, T. Yu, J. Gu, J. Xu, T. Sun, LAFITE: Towards language-free training for text-to-image generation, 2021, arXiv preprint arXiv:2111.13792.
[41] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[42] B. Xu, N. Wang, T. Chen, M. Li, Empirical evaluation of rectified activations in convolutional network, 2015, arXiv preprint arXiv:1505.00853.
[43] J. Kim, M. Kim, H. Kang, K. Lee, U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation, 2019, arXiv preprint arXiv:1907.10830.
[44] H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, Self-attention generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 7354–7363.
[45] J.H. Lim, J.C. Ye, Geometric GAN, 2017, arXiv preprint arXiv:1705.02894.
[46] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, in: Advances in Neural Information Processing Systems, vol. 29, 2016.
[47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[48] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[49] T. Qiao, J. Zhang, D. Xu, D. Tao, MirrorGAN: Learning text-to-image generation by redescription, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1505–1514.
[50] Y. Yang, L. Wang, D. Xie, C. Deng, D. Tao, Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis, IEEE Trans. Image Process. 30 (2021) 2798–2809.
[51] Z. Zhang, L. Schomaker, DiverGAN: An efficient and effective single-stage framework for diverse text-to-image generation, Neurocomputing 473 (2022) 182–198.
[52] Y. Zhang, S. Han, Z. Zhang, J. Wang, H. Bi, CF-GAN: Cross-domain feature fusion generative adversarial network for text-to-image synthesis, Vis. Comput. 39 (4) (2023) 1283–1293.
[53] H. Tan, B. Yin, K. Wei, X. Liu, X. Li, ALR-GAN: Adaptive layout refinement for text-to-image synthesis, IEEE Trans. Multimed. (2023).
[54] B. Jiang, W. Zeng, C. Yang, R. Wang, B. Zhang, DE-GAN: Text-to-image synthesis with dual and efficient fusion model, Multimedia Tools Appl. (2023) 1–14.
[55] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, B. Guo, Vector quantized diffusion model for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10696–10706.