Abstract
Although text-to-image models aim to generate realistic images that correspond to a text description, producing high-quality and accurate images remains a significant challenge. Most existing text-to-image methods use a two-stage stacked architecture: an initial image with a basic outline is generated first and then refined into a high-resolution image. However, the quality of the initial image limits this approach, because it directly determines the quality of the final high-resolution output and constrains its randomness. If the initial image is of low quality or lacks detail, the model struggles to produce a realistic final image; if it is too rigid or lacks randomness, the final image lacks diversity and appears artificial. To overcome the limitations of the stacked structure, we propose a new generative adversarial network that generates high-resolution images directly from text descriptions, providing a more efficient and effective way to synthesize realistic images from text. Multi-head channel attention and masked cross-attention mechanisms weigh relevance from different perspectives, enhancing features associated with the text description and suppressing features unrelated to the textual information. Image and text information are fused at a fine-grained level, while the masking mechanism reduces computational cost and speeds up image generation. Furthermore, a discriminator-based semantic consistency loss strengthens the visual coherence between text and images, guiding the generator toward images that are more realistic and more closely aligned with their text descriptions. The resulting model improves the semantic consistency between text and images and produces higher-quality generated images. Extensive experiments confirm that the proposed model outperforms ControlGAN: on the CUB dataset, the IS score increases from 4.58 to 4.96, and on the COCO dataset, from 24.06 to 33.56. Code is available at https://github.com/Leeziying0307/Github.git.
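For intuition, the sketch below shows one plausible way such a fusion block could be wired up in PyTorch: image-region queries attend to word embeddings, a relevance threshold masks out weakly related words before attention to save computation, and a multi-head channel gate re-weights the fused feature maps. This is a minimal sketch under assumed shapes and a simple quantile-based masking rule; the module names, the threshold rule (mask_ratio), and all hyperparameters are illustrative assumptions, not the authors' published implementation.

# Minimal sketch, assuming a PyTorch implementation; names, shapes, and the
# masking rule are illustrative and not taken from the paper's code.
import torch
import torch.nn as nn


class MultiHeadChannelAttention(nn.Module):
    # Channel gating computed independently per head group of channels.
    def __init__(self, channels: int, heads: int = 4, reduction: int = 8):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        group = channels // heads
        hidden = max(group // reduction, 4)
        self.fc = nn.Sequential(
            nn.Linear(group, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, group),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        g = x.view(B, self.heads, C // self.heads, H, W)
        w = torch.sigmoid(self.fc(g.mean(dim=(3, 4))))     # (B, heads, C/heads)
        return x * w.view(B, C, 1, 1)


class MaskedCrossAttention(nn.Module):
    # Image-region queries attend to word features; weakly related words are
    # masked out before attention to reduce computation.
    def __init__(self, img_dim: int, txt_dim: int, heads: int = 4,
                 mask_ratio: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(img_dim, heads, kdim=txt_dim,
                                          vdim=txt_dim, batch_first=True)
        self.rel_proj = nn.Linear(txt_dim, img_dim)        # used only for relevance scores
        self.gate = MultiHeadChannelAttention(img_dim, heads)
        self.mask_ratio = mask_ratio

    def forward(self, img_feat, word_feat):
        # img_feat: (B, C, H, W) region features; word_feat: (B, L, D) word embeddings
        B, C, H, W = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)            # (B, HW, C)

        # Average image-word relevance per word, then mask the weakest words.
        rel = torch.einsum('bqc,blc->bl', q, self.rel_proj(word_feat)) / (H * W)
        threshold = rel.quantile(self.mask_ratio, dim=1, keepdim=True)
        key_padding_mask = rel < threshold                 # True = word is ignored

        fused, _ = self.attn(q, word_feat, word_feat,
                             key_padding_mask=key_padding_mask)
        fused = fused.transpose(1, 2).reshape(B, C, H, W)
        return self.gate(fused + img_feat)                 # residual + channel gating


# Example shapes: 256-channel 16x16 region features fused with 18 word embeddings.
block = MaskedCrossAttention(img_dim=256, txt_dim=256)
out = block(torch.randn(2, 256, 16, 16), torch.randn(2, 18, 256))  # -> (2, 256, 16, 16)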






Data availability
The data and code that support the findings of this study are available from the corresponding author upon reasonable request.
Funding
This work was supported by the National Natural Science Foundation of China (62072150), the Shaanxi Provincial Key Research and Development Program (2023-YBGY-148), the Henan Provincial Science and Technology Plan Project (222102210240), the Henan Provincial Higher Education Key Scientific Research Project (22B520012, 22A510017), and the Shaanxi Provincial Social Science Fund Project (2022M007).
Author information
Contributions
SH was involved in supervision and writing—review and editing; ZL contributed to conceptualization, methodology, and writing—original draft; KW was involved in writing—review and editing; YZ contributed to formal analysis, supervision, and writing—review and editing; and HL was involved in writing—review and editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Not applicable.
Consent to participate
All authors agreed to participate in this paper.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hou, S., Li, Z., Wu, K. et al. Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation. Vis Comput 40, 8639–8651 (2024). https://doi.org/10.1007/s00371-024-03260-2