Modern Neural Network Technologies Text-to-Image: Scientific Visualization, 2023, Volume 15, Number 2, Pages 66 - 79
Abstract
This paper surveys state-of-the-art graphical text-to-image neural networks and methods of text-to-image conversion, analyzing the results achieved to date and the samples these systems produce. Ways of applying neural network text-to-image approaches to environmental monitoring, infrastructure and medical data analysis tasks are proposed. The paper reviews the results of neural network generation and their correlation with the linguistic constructions of the user's text queries, and identifies and classifies the flaws and artifacts typical of neural-network-generated images. The rapid development of neural network technologies in this field could have a significant impact on society, the professional market and the media, which makes the task of studying neural network images and distinguishing them from other graphic content particularly relevant.
Keywords: Machine Learning, Computer Vision and Pattern Recognition, Neural net-
work, Computer graphics, Text-to-image.
1. Introduction
The field of neural network technology is currently undergoing rapid development, becoming more sophisticated and acquiring new capabilities every day. Neural networks that process images in a variety of ways, from animating photographs to automatically creating full-fledged images from a user's text request, are becoming particularly popular.
The task of such a neural network is to form plausible images for a wide variety of sentences that exploit the compositional structure of language. A further task is the simultaneous management of multiple objects, their attributes and their spatial relations. To interpret a query sentence correctly, the algorithm must not only compose each object attribute correctly but also form the correct associations. For example, to visualize the sentence "hedgehog in red hat, yellow gloves, blue shirt and green trousers", the neural network needs to recognize the objects in the text and render them with the given combination of object and attribute, (hat, red), (gloves, yellow), (shirt, blue) and (trousers, green), without mixing them [1].
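As a toy illustration of the required attribute binding (not a representation used by any particular model), the correct pairing for this prompt can be written out explicitly:

```python
# Toy illustration of attribute binding for the example prompt above.
# A text-to-image model must associate each attribute with the right object;
# the correct pairing is written out here by hand.
prompt = "hedgehog in red hat, yellow gloves, blue shirt and green trousers"

correct_binding = {
    "hat": "red",
    "gloves": "yellow",
    "shirt": "blue",
    "trousers": "green",
}

# A failure case ("attribute leakage") would swap colours between objects,
# e.g. a yellow hat and red gloves generated for the same prompt.
```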
It should also be noted that the user of this type of neural network cannot yet predict in advance the visual result the network will produce for a given textual query. The correlation between the original query text and the resulting visual image is a separate class of problems that is currently being actively studied and addressed by the developers of the largest text-to-image neural networks, such as Midjourney.
Figure 1. A unique design for sneakers generated by Midjourney's neural network using the
query "nike sneakers in khokhloma style" [2].
It can be assumed that neural networks will evolve into a tool for one of the most sought-after business scenarios: personalizing content to individual user needs. However, the ability of neural networks to quickly and automatically generate an unlimited number of different images from a given textual description also opens up opportunities for scientific work.
With the ability to train off-the-shelf algorithms on thematically selected material (a prepared image database), it is possible to create specialized neural networks adapted to domain-specific terms and queries. For example, text-to-image neural networks could be applied in areas such as environmental monitoring or biomedical technology.
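As a minimal sketch of what such a prepared image database might look like in code, the following PyTorch dataset pairs domain images with expert-written captions; the directory layout, file names and caption format are assumptions for illustration only.

```python
# Hedged sketch: a domain-specific image-caption dataset of the kind an
# off-the-shelf text-to-image model could be fine-tuned on. Paths, file names
# and the captions.json format are illustrative assumptions.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class DomainImageCaptionDataset(Dataset):
    """Pairs of (image, domain-specific caption) for fine-tuning."""

    def __init__(self, root: str, captions_file: str = "captions.json"):
        self.root = Path(root)
        # captions.json is assumed to map file names to expert captions, e.g.
        # {"scan_001.png": "ultrasound image, healthy liver parenchyma"}.
        self.captions = json.loads((self.root / captions_file).read_text())
        self.names = sorted(self.captions)

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int):
        name = self.names[idx]
        image = Image.open(self.root / name).convert("RGB")
        return image, self.captions[name]
```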
Organizing environmental monitoring first requires collecting data. Since monitoring the processes taking place in the environment relies on data from many different sources - images, readings from heterogeneous sensors, textual data and others - the collected data are heterogeneous [3]. After analyzing the data and identifying the main components that most influence the overall situation, it becomes possible to summarize what is happening in textual form. An appropriate textual query can then be generated and a visual image modelled from the linguistic data.
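A hedged sketch of this monitoring-to-visualization pipeline is given below; summarize_observations() is a hypothetical stand-in for the domain-specific analysis step, and the resulting text would be passed to any text-to-image model.

```python
# Hedged sketch of the pipeline described above: heterogeneous monitoring data
# are reduced to a short textual description, which then serves as a prompt
# for a text-to-image model. All names and values here are illustrative.
from dataclasses import dataclass


@dataclass
class Observation:
    source: str   # e.g. "buoy-17", "coastal camera", "eyewitness report"
    kind: str     # "sensor", "image", "text", ...
    value: str    # textual rendering of the measurement or report


def summarize_observations(observations: list[Observation]) -> str:
    """Reduce heterogeneous monitoring data to a short textual description."""
    # A real system would first identify the components that most influence
    # the situation; here the textual values are simply joined together.
    return ", ".join(o.value for o in observations)


observations = [
    Observation("coastal camera", "image", "crimson sky"),
    Observation("buoy-17", "sensor", "high waves"),
    Observation("eyewitness report", "text", "a storm is approaching"),
]

prompt = summarize_observations(observations)
# prompt == "crimson sky, high waves, a storm is approaching"
# The prompt would then be sent to a text-to-image model, with the expert
# iterating on the wording until the picture reflects the phenomenon.
```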
Such combined analysis provides the most informative picture of the processes taking place and allows adequate conclusions to be drawn.
Using neural network graphics to rapidly generate illustrative images of the processes under study gives the most complete impression of what is happening. When textual eyewitness accounts are available, the visual picture of events can be quickly reconstructed and the overall situation visualized for further analysis.
An aggregate of various data can thus be transformed into visual form without relying on human imagination, while allowing online expert corrections that bring the final representation to the desired form, one that most accurately reflects the phenomenon being described. Figure 2 shows a visualization of a rather general query (query text: "crimson sky, high waves, a storm is approaching"). Nevertheless, the image is already highly detailed and presented in four versions, from which users can choose the one most suitable for their needs and make adjustments until a satisfactory result is achieved.
More specialized tasks, such as manufacturing or medicine, require specially trained neural networks capable of understanding professional jargon or scientific phrasing without allowing ambiguous interpretations. Given the vast amount of accumulated material and the existence of specialized archives in many fields, it may only be a matter of time before such domain-oriented graphical neural networks are developed.
Potentially, their application offers ample opportunities for analyzing various types of data, combining them and displaying the result in a clear and understandable way. They could also be used extensively in teaching and learning tools. For example, a neural network could depict the typical condition of an organ or tissue for a given set of symptoms listed in a query. If a textual description contains an indication of some pathology, the visual representation can help to highlight it and support the right decision.
Neural networks are already widely used in different fields of science. For example, tasks performed by an inpainting function (removing objects and then filling in the empty areas of an image so that the fill is unnoticeable) are in demand in archeology when a building of which only ruins remain must be recreated. A neural network can generate an image based on data about similar buildings and architectural styles.
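A minimal sketch of such inpainting, using the open-source Stable Diffusion inpainting pipeline from the diffusers library (the file names, mask and prompt are illustrative assumptions, not an archeological reconstruction workflow):

```python
# Hedged inpainting sketch with the open-source Stable Diffusion inpainting
# pipeline (Hugging Face diffusers). File names and the prompt are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")  # assumes a CUDA-capable GPU

ruin_photo = Image.open("ruins.png").convert("RGB")         # existing image
mask = Image.open("missing_walls_mask.png").convert("RGB")  # white = area to fill

restored = pipe(
    prompt="intact stone building in the same architectural style",
    image=ruin_photo,
    mask_image=mask,
).images[0]
restored.save("restored_building.png")
```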
2. Text-to-Image Graphical Neural Networks
This section presents brief descriptions of the largest and most popular text-to-image neural networks that have become widely known over the past year. These include Midjourney, which opened in March 2022; DALL-E 2, an updated version of the DALL-E model first demonstrated in January 2021; Stable Diffusion, an open-source neural network that has become the basis for dozens of new projects; and ruDALL-E, a Russian neural network based on generative models from SberDevices and Sber AI.
2.1 Midjourney
Midjourney [4] is proprietary software that creates images from text descriptions. The project was founded in February 2022 by scientist and entrepreneur David Holz. The Midjourney team positions itself as an independent research laboratory dedicated to expanding humanity's creative abilities.
Midjourney's work is enabled by two relatively recent technological breakthroughs in ar-
tificial intelligence: the ability of neural networks to understand human speech and create
images.
The neural network is trained to match textual descriptions with visual images across hundreds of millions of examples, using specially compiled collections that contain billions of images gathered from the Internet, together with matched image-text pairs. Such datasets can be commercial or open source, such as LAION [5], on which the well-known Stable Diffusion neural network was trained. Training of this kind allows various cross-modal tasks to be solved: generation of pictures from text descriptions, generation of text descriptions from pictures, regeneration or rendering of image parts, and so on. This makes it possible to advance such topical tasks as the visualization and completion of incomplete data.
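As an illustration of the text-to-picture direction of such cross-modal generation, the sketch below uses the open-source Stable Diffusion model through the Hugging Face diffusers library; the checkpoint name and sampling parameters are illustrative, and Midjourney's own proprietary pipeline is not shown here.

```python
# Hedged sketch: text-to-image generation with the open-source Stable Diffusion
# model via the Hugging Face diffusers library. Model id and parameters are
# illustrative and unrelated to Midjourney's proprietary pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # publicly available checkpoint
    torch_dtype=torch.float16,
).to("cuda")                             # assumes a CUDA-capable GPU

prompt = "red car on the road"           # example query discussed below
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("red_car.png")
```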
Midjourney, like most neural networks of this type, handles explicit, non-specific queries well. For example, given the query "red car on the road", it generates quite satisfactory options. One can experiment with the car's colour, size or background - these are still quite general queries. Problems may arise with more specific queries. A particular car model, for example, may already cause difficulties: the rarer the model is in the data available online, the lower the chance that the neural network will be able to draw it.
However, graphical neural networks are currently an extremely fast-developing area of computer graphics, so Midjourney versions are constantly being updated and improved. The paper [6] provides a comparative review of Midjourney versions v3 and v4, examining the key differences and features of the updated version. In March 2023, Midjourney version v5 was released, and its features are only beginning to be explored.
2.2 DALL-E 2
DALL-E 2 [7] is one of the most popular neural network graphics systems. It was developed by OpenAI with 12 billion parameters on the basis of GPT-3 (Generative Pre-trained Transformer 3, OpenAI's largest and most advanced language model) and trained to generate images from text descriptions using a dataset of text-image pairs. It can generate original images from textual descriptions and allows users to upload images and edit them, for example by adding elements. Furthermore, DALL-E can not only generate an image from scratch, it can also regenerate any rectangular area of an existing image.
According to the developers, "DALL-E 2 is an artificial intelligence system that can create realistic images and drawings from a natural language description".
DALL-E 2 started as a research project and is of interest primarily due to the publications of the developers, who have done extensive work in creating the algorithms and studying the behaviour and capabilities of the neural network [1, 8, 9].
The neural network can create images in a wide variety of drawing styles and techniques: the result may look like a frame from a cartoon or like a real photograph.
DALL-E 2 was trained on pairs of images and their respective captions. According to the developers, the pairs were taken from a combination of publicly available and licensed sources [10].
The software is currently available to a limited number of people, by subscription only. This is due both to limited server infrastructure capacity and to the developers' desire to control the development and self-learning of the neural network through user testing. In particular, because of concerns about misuse of the neural network, the developers carefully filter the content used for its training and block incoming requests on prohibited topics (violence, adult content, etc.).
Among the features provided in the latest updates are:
- higher image resolution;
- query processing in more than 107 languages, including Russian;
- high query recognition accuracy;
- the ability to set colour filters and image style;
- the ability to take an existing image as input and create a creative variation of it;
- the ability to refine an uploaded image.
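A hedged sketch of how the generation and variation features listed above can be exercised programmatically through the OpenAI Python SDK (openai>=1.0); the prompt, sizes and file names are illustrative assumptions:

```python
# Hedged sketch: image generation and variation through the OpenAI Python SDK.
# Requires an API key in the OPENAI_API_KEY environment variable; the prompt
# and file names are illustrative.
from openai import OpenAI

client = OpenAI()

# Generate an original image from a textual description.
generated = client.images.generate(
    model="dall-e-2",
    prompt="red car on the road",
    n=1,
    size="512x512",
)
print(generated.data[0].url)

# Create a creative variation of an existing image.
with open("red_car.png", "rb") as f:
    variation = client.images.create_variation(image=f, n=1, size="512x512")
print(variation.data[0].url)
```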
2.4 ruDALL-E
ruDALL-E [13] is a family of generative models from SberDevices and Sber AI. The neural network was developed and trained by Sber AI researchers, with partner support from scientists at the AIRI Artificial Intelligence Research Institute, on a combined Sber AI and SberDevices dataset of 1 billion text-image pairs. Teams from Sber AI, SberDevices, Samara University, AIRI and SberCloud actively participated in the project.
Specialists created and trained two versions of the model, named after two great Russian
abstractionists, Vasily Kandinsky and Kazimir Malevich:
ruDALL-E Kandinsky (XXL) with 12 billion parameters;
ruDALL-E Malevich (XL) with 1.3 billion parameters.
Both models are capable of generating colourful images on a variety of topics from a short textual description. According to the developers, Kandinsky uses reverse diffusion and can process queries in 101 languages without any loss in quality or speed. These include common languages such as Russian and English as well as rarer ones such as Mongolian. The system will understand a query even if it contains words in different languages.
Training the ruDALL-E neural network on the Christofari cluster was the largest computational task in Russia. It involved 196 NVIDIA A100 cards, each with 80 GB of memory. The whole training took 14 days, or 65,856 GPU-hours: the model was first trained for 5 days at 256x256 resolution, then for 6 days at 512x512 resolution, and for a further 3 days on maximally clean data.
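The stated total is consistent with the reported hardware and duration: 196 GPUs × 24 h/day × 14 days = 65,856 GPU-hours.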
The ruDALL-E Kandinsky 2.0 system is claimed to be the first multilingual diffusion neural network capable not only of accepting requests in different languages but also of reflecting the linguistic and visual shifts specific to different language cultures.
This statement is supported by a number of experiments [14]. In particular, queries such as "national dish" or "person with higher education" were tested (Figures 4 and 5). For the Russian-language query, the neural network produces predominantly white males, while for the same query in French the results are more varied. For the query in Chinese, the results contain more stylized images, but in most cases they also reflect the national component.
Figure 5. Testing the query "national dish" in Russian, Japanese and Hindi.
The author also conducted an experiment (Figure 6) on the FusionBrain platform [15], which confirmed the orientation of this neural network towards different language environments. The query "national dish", submitted in several languages, produced completely different results.
Figure 6. Testing the query "national dish" in Russian, Hindi and Italian (rows across).
It is worth noting that queries in different languages should be tested either on the above-mentioned platform or by interacting directly with the developers' repositories. The rudalle.ru platform is not adapted to such queries: it perceives a foreign language, identifies it, translates the query into Russian, and only then generates a visual image.
Such experiments open up a separate area of research, as preliminary studies suggest that neural networks of different language groups will have their own distortions and differences in the interpretation of the same phenomenon, depending on the mass culture associated with a particular language group.
Unfortunately, the original article [18] does not provide the exact text of the query, but judging from the result, we can conclude that it was a direct word-for-word translation along the lines of "wolf feet fed", and the neural network reproduced this query quite literally. Meanwhile, this proverb has a full English semantic analogue, "The dog that trots about finds a bone", as well as the translation offered by the online translator DeepL, "the wolf feeds the wolf", which imply completely different visual images while carrying the same meaning. Therefore, when giving a neural network a query, one should take into account the difference between a semantic translation and a direct translation, because the results can differ drastically. Making the right query thus becomes, in a sense, a profession: people who have learned to obtain the intended, high-quality result are already called "prompt engineers", and more and more offers to formulate a precise query for a neural network are appearing on freelance exchanges.
Another artifact that occurs quite often in Midjourney is bent spoons in pictures of food (Figures 9-10).
Texture artifacts. In this case, the artifacts do not affect the overall image and occur in places where the neural network cannot adequately process a highly detailed area or recreate the desired structure, such as hair, clothing fabric or skin. Such artifacts are inherent to neural networks that reconstruct part of an image, enhance its quality or generate an image from scratch.
More often than not, zooming in on the location of the artifact reveals a visible difference between the damaged area and the rest of the image. In Figure 11, for example, an odd ripple is visible in one section of the hair, unlike the rest of the hair. A neural network often produces this pixel-grid effect, but in most cases it is only visible at high magnification.
Conclusions
In this paper, state-of-the-art text-to-image graphical neural networks and methods of text-to-image transformation have been examined, and the results achieved have been analyzed. A number of problems in the images generated by these systems have been considered. Ways of applying neural network text-to-image approaches to environmental monitoring, infrastructure and medical data analysis tasks have been proposed.
References
1. Ramesh A., Pavlov M., Goh G., Gray S., Voss C., Radford A., Chen M., Sutskever I.,
2021. Zero-Shot Text-to-Image Generation, arXiv:2102.12092 [cs.CV],
https://doi.org/10.48550/arXiv.2102.12092
2. Telegram Channel «Neurodesign», 2023a, https://t.me/neurodes/343 (19 March
2023)
3. Yazikov E.G., Talovskaya A.V., Nadeina L.V., 2013. Geoecological environmental moni-
toring: coursebook / Tomsk Polytechnic University
4. Midjourney, https://www.midjourney.com/ (19 April 2023)
5. LAION. Large-scale Artificial Intelligence Open Network. https://laion.ai/ (19 March
2023)
6. Yubin Ma, 10 Incredible Prompt Styles to Try in Midjourney V4.
https://aituts.com/midjourney-v4-prompts-to-try/ (23 January 2023)
7. DALL·E 2, https://openai.com/product/dall-e-2 (19 April 2023)
8. Dhariwal P., Nichol A. 2021, Diffusion Models Beat GANs on Image Synthesis.
arXiv:2105.05233
https://doi.org/10.48550/arXiv.2105.05233
9. Radford A., Jong W.K., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A.,
Mishkin P., Clark J., Krueger G., Sutskever I. 2021. Learning Transferable Visual Models
From Natural Language Supervision. arXiv preprint arXiv:2103.00020 [cs.CV].
https://doi.org/10.48550/arXiv.2103.00020
10. DALL·E 2 Preview - Risks and Limitations, 2022, https://github.com/openai/dalle-2-
preview/blob/main/system-card.md#model (19 March 2023)
11. Stable Diffusion Online, https://stablediffusionweb.com/ (19 April 2023)
12. Alammar J. 2022, The Illustrated Stable Diffusion.
https://jalammar.github.io/illustrated-stable-diffusion/ (19 March 2023)
13. ruDALL-E, https://rudalle.ru/ (19 April 2023)
14. Shakhmatov A., Razhigayev A., Arkhipkin V., Nikolic A., Pavlov I., Kuznetsov A., Di-
mitrov D., Shavrina T., Markov S. Kandinsky 2.0 - the first multilingual diffusion for text-
based image generation.
https://habr.com/ru/company/sberbank/blog/701162/ (19 March 2023)
15. FusionBrain. https://fusionbrain.ai/diffusion (19 March 2023)
16. Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A., 2017. Image-to-image translation with
conditional adversarial networks. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 1125–1134.
17. Koh, J. Y., Baldridge, J., Lee, H., and Yang, Y., 2021. Text-to-image generation ground-
ed by fine-grained user attention. In Proceedings of the IEEE/CVF Winter Conference on Ap-
plications of Computer Vision, pp. 237–246.
18. Midjourney and idioms.
https://pikabu.ru/story/midjourney_i_frazeologizmyi_9768400 (23 January 2023)
19. Telegram Channel «Neurodesign», 2023b, https://t.me/neurodes/619 (19 March
2023)
20. Telegram Channel «Neurodesign», 2023c, https://t.me/neurodes/303 (19 March
2023)
21. Telegram Channel «Neurodesign», 2023d, https://t.me/neurodes/750 (19 March
2023)
22. Makushin A. https://t.me/makushinphoto/541 (23 January 2023)
23. Gelbart H., 2023, Scammers are profiting from the earthquake in Turkey by raising
money, supposedly to help the victims. https://www.bbc.com/russian/news-64640487 (19
March 2023)