Text to Image Synthesis Using Self-Attention Generative Adversarial Networks
May 2022
Name of Student : MICHELLE SARAH SIMON
Chennai - 600025
Bonafide certificate
Certified that this Project Report titled Text to Image Synthesis using Self-Attention
Generative Adversarial Network is the bonafide work of Ms. Michelle Sarah Simon, who
carried out the project under my supervision. Certified further, that to the best of my
knowledge, the work reported herein does not form part of any other Project Report on the
basis of which a degree or award was conferred on an earlier occasion on this or any other
candidate.
Abstract
Acknowledgement
Table of contents
List of tables
List of figures
CHAPTER 1
INTRODUCTION
In the image-text matching task, we pretrain an image encoder and a text encoder to learn
semantically consistent visual and textual representations of each image-text pair. At the same
time, we learn consistent textual representations by pulling together the captions of the same
image and pushing away the captions of different images via a contrastive loss. The pretrained
image encoder and text encoder are then used to extract consistent visual and textual features
in the subsequent stage of GAN training. A contrastive loss is then used to minimize the
distance between fake images generated from text descriptions of the same ground-truth image,
while maximizing the distance between those generated from descriptions of different
ground-truth images. We generalize existing text-to-image models to a unified framework so
that the approach can be integrated into them to improve their performance.
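To make the idea concrete, the sketch below shows one way such a caption-level contrastive loss could be written. It is not the project's actual training code: PyTorch, the function name, and the temperature value are assumptions for illustration only.

```python
# Minimal sketch of a caption-level contrastive loss (assumed PyTorch implementation).
# Captions that describe the same ground-truth image are pulled together; captions of
# different images are pushed apart.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(caption_emb, image_ids, temperature=0.1):
    """caption_emb: (N, D) caption features; image_ids: (N,) id of the image each caption describes."""
    z = F.normalize(caption_emb, dim=1)                      # compare captions by cosine similarity
    sim = z @ z.t() / temperature                            # (N, N) pairwise similarity matrix
    pos_mask = (image_ids.unsqueeze(0) == image_ids.unsqueeze(1)).float()
    pos_mask.fill_diagonal_(0)                               # a caption is not its own positive
    logits = sim - 1e9 * torch.eye(len(z), device=z.device)  # exclude self-similarity from the softmax
    log_prob = F.log_softmax(logits, dim=1)
    # average log-probability assigned to the positives (captions of the same image)
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

# usage (hypothetical names): loss = caption_contrastive_loss(text_encoder(captions), image_ids)
```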
Current models are still far from being capable of generating complex scenes with multiple
objects based only on textual descriptions. There is also very limited work on resolutions
higher than 256 × 256 pixels. It is challenging to reproduce the quantitative results of many
approaches, even when code and pre-trained models are provided. This is reflected in the
literature, where different quantitative results are often reported for the same model.
Furthermore, we observe that many of the currently used evaluation metrics are unsuitable for
evaluating text-to-image synthesis models and do not correlate well with human perception.
Only a few approaches perform human user studies to assess whether their improvements are
evident in a qualitative sense, and when they do, the studies are not standardized, making
comparison of results difficult.
CHAPTER 2
LITERATURE REVIEW
OVERVIEW
Enormous volumes of textual content are generated every day. In-app messaging platforms
such as WhatsApp and Telegram, social media sites such as Facebook and Instagram, news
publishing sites, Google searches, and a variety of other sources all contribute. The main
focus of this section is the popular NLP task of sentiment analysis. Sentiment analysis is a
powerful tool for extracting important information and helps organisations understand the
social sentiment of their brand, product, or service while monitoring online conversations.
This section investigates the various approaches and models used in the task of sentiment
analysis.
LITERATURE SURVEY
2. Improved Techniques for Training GANs - Tim Salimans, Ian Goodfellow, Wojciech Zaremba,
Vicki Cheung, Alec Radford, Xi Chen
We present a number of new architectural features and training procedures for the
generative adversarial networks (GANs) framework. We achieve cutting-edge results in
semi-supervised classification on MNIST, CIFAR-10, and SVHN using our new
techniques. A visual Turing test confirmed the high quality of the generated images: our
model generates MNIST samples that humans cannot distinguish from real data and
CIFAR-10 samples with a human error rate of 21.3 percent. We also show ImageNet
samples with unprecedented resolution and demonstrate how our methods enable the
model to learn recognisable ImageNet class features.
CHAPTER 3
THE STUDY
1. OPEN-SOURCE DATASET
Caltech-UCSD Birds 200 (CUB-200) is an image dataset annotated with 200 bird species.
It was created to enable the study of subordinate categorization, which is not possible with
other popular datasets that focus on basic level categories. The images were downloaded
from the website data.caltech.edu/records. Each image is annotated with a bounding box, a
rough bird segmentation, and a set of attribute labels.
CUB-200 includes 6,033 annotated images of birds belonging to 200 bird species, most of
them North American.
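For illustration, a minimal sketch of iterating over the downloaded images is given below. The directory layout and file extension used here are assumptions and should be adjusted to match the extracted archive.

```python
# Hypothetical sketch of walking through the CUB-200 images with Pillow.
# The path "CUB_200/images/<species>/<file>.jpg" is an assumed layout, not the
# dataset's documented structure.
from pathlib import Path
from PIL import Image

DATA_ROOT = Path("CUB_200/images")   # assumed location of the extracted dataset

def iter_bird_images(root=DATA_ROOT):
    """Yield (species_name, PIL.Image) pairs, one per annotated image."""
    for species_dir in sorted(root.iterdir()):
        if not species_dir.is_dir():
            continue
        for img_path in sorted(species_dir.glob("*.jpg")):
            yield species_dir.name, Image.open(img_path).convert("RGB")

# usage: species, img = next(iter_bird_images())
```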
2. FEATURE EXTRACTION
Consider that we are given the image below and need to identify the objects present in it.
As a human, you recognize the objects instantly: a dog, a car, and a cat. Shape could be one
important factor, followed by colour or size.
A similar idea is to extract edges as features and use them as input to the model. An edge is
essentially a location where there is a sharp change in colour. Look at the image below:
The machine can identify the edge because there is a change in colour from white to brown,
and from brown to black. An image is represented in the form of numbers, so we look for
pixel values around which there is a drastic change.
With the help of this, we can extract several features, such as eyes, ears, wings, feathers,
colour, etc.
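A small NumPy sketch of this idea follows: it flags pixels whose neighbours differ sharply in intensity. The threshold value is an arbitrary choice for illustration.

```python
# Illustrative edge detection: mark locations where adjacent grayscale pixel
# values change sharply (a drastic change in pixel values indicates an edge).
import numpy as np

def edge_map(gray, threshold=30):
    """gray: 2-D uint8 array. Returns a boolean map of sharp horizontal/vertical changes."""
    g = gray.astype(np.int32)
    dx = np.abs(np.diff(g, axis=1))    # change between horizontally adjacent pixels
    dy = np.abs(np.diff(g, axis=0))    # change between vertically adjacent pixels
    edges = np.zeros_like(g, dtype=bool)
    edges[:, 1:] |= dx > threshold
    edges[1:, :] |= dy > threshold
    return edges
```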
Feature extraction in text processing
In text processing, words represent discrete, categorical features. We must encode such data
in a way that the algorithms can use it. Feature extraction refers to the process of mapping
textual data to real-valued vectors. Bag of Words is one of the most basic techniques for
numerically representing text.
Bag of Words (BoW): We first create a vocabulary, the list of unique words in the text corpus.
Each sentence or document is then represented as a vector, with each vocabulary word marked
1 if it is present and 0 if it is absent. Another option is to count the number of times each
word appears in a document. The most widely used weighting scheme, however, is the Term
Frequency-Inverse Document Frequency (TF-IDF) technique.
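As a rough illustration, the snippet below builds both representations with scikit-learn on a few short, made-up documents (the documents here are placeholders, not the ones discussed later).

```python
# Sketch of Bag of Words and TF-IDF vectorisation with scikit-learn on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "birds fly in the sky"]          # placeholder documents

bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)     # raw word counts per document
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs) # counts reweighted by inverse document frequency
print(tfidf_matrix.toarray().round(2))
```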
As shown in Document 1, the TF-IDF method heavily penalises the word 'beautiful' while
giving more weight to the word 'day.' This is due to the IDF part, which gives more
weightage to distinct words. In other words, in the context of the entire corpus, 'day' is an
important word for Document 1. The Python scikit-learn library includes functions for
calculating the TF-IDF of a text vocabulary given a text corpus. For natural language
processing (NLP), maintaining the context of the words is of utmost importance. For this, we
use another approach called Word Embedding.
Word Embedding is a text representation in which words with the same meaning are
represented similarly. In other words, it represents words in a coordinate system where
related words are placed closer together based on a corpus of relationships. The better-known
word-embedding models are Word2Vec and Global Vectors (GloVe). For this project, I will be
using Word2Vec for the embedding and for defining word relationships.
Word2vec takes as its input a large corpus of text and produces a vector space with each
unique word assigned a corresponding vector in the space. Word vectors are positioned in the
vector space such that words that share common contexts in the corpus are in close proximity
to one another. Word2Vec is well known for capturing meaning and demonstrating it on tasks
such as answering analogy questions of the form a is to b as c is to ___. For example, man is
to woman as uncle is to ___ (aunt), using a simple vector offset method based on cosine
distance.
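The following is a minimal sketch of this analogy computation, assuming the gensim library (not specified in this report). The tiny placeholder corpus only keeps the snippet runnable; a real run would train on a large corpus.

```python
# Sketch of training Word2Vec and answering an analogy by vector offset (gensim assumed).
from gensim.models import Word2Vec

sentences = [["the", "man", "saw", "the", "woman"],
             ["the", "uncle", "greeted", "the", "aunt"]]   # placeholder corpus only

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

# "man is to woman as uncle is to ___": woman - man + uncle, ranked by cosine similarity
print(model.wv.most_similar(positive=["woman", "uncle"], negative=["man"], topn=1))
```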
3. EVALUATION METRICS
The Fréchet Inception Distance (FID) is a metric for evaluating the quality of generated
images, developed specifically to assess the performance of GANs. Activations from a
pretrained Inception network are summarized as a multivariate Gaussian by calculating their
mean and covariance. These statistics are computed for the activations of both the collection
of real images and the collection of generated images, and the deviation between the two
distributions is called the Fréchet distance.
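Given activation matrices for real and generated images, the distance can be computed roughly as in the sketch below (NumPy and SciPy assumed; extracting the Inception activations themselves is omitted here).

```python
# Sketch of the Fréchet distance between two sets of Inception activations.
# act_real and act_fake are (N, D) activation matrices computed beforehand.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(act_real, act_fake):
    mu1, sigma1 = act_real.mean(axis=0), np.cov(act_real, rowvar=False)
    mu2, sigma2 = act_fake.mean(axis=0), np.cov(act_fake, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```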
4. SELF ATTENTION
When modelling dependencies for small feature maps, it functions similarly to the local
convolution. It demonstrates how the attention mechanism empowers both the generator and
the discriminator to directly model the feature maps' long-range dependencies. Furthermore,
a comparison of our SAGAN and the baseline model without attention demonstrates the
efficacy of the proposed self-attention mechanism.
The self-attention blocks perform better than residual blocks with the same number of
parameters. Even when the training goes smoothly, replacing the self-attention block with
the residual block results in worse FID and Inception score results. This comparison shows
that the performance boost provided by SAGAN is not simply due to an increase in model
depth and capacity. To better understand what was learned during the generation process, we
visualised the generator's attention weights in SAGAN for various images.
Fig. 3.4.1: Self-attention architecture for image processing
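A minimal sketch of such a self-attention block is shown below, assuming PyTorch (the report does not fix the framework). The layer sizes follow the common SAGAN convention of reducing the query/key channels by a factor of eight, and the learnable gamma is initialised to zero so the block starts as an identity mapping.

```python
# Sketch of a SAGAN-style self-attention block over convolutional feature maps (PyTorch assumed).
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)   # 1x1 conv projections
        self.key   = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))            # block starts as an identity mapping

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c//8)
        k = self.key(x).flatten(2)                     # (b, c//8, hw)
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw) long-range attention weights
        v = self.value(x).flatten(2)                   # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual connection back to the input
```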
5. SPECTRAL NORMALIZATION
Miyato et al. first proposed applying spectral normalisation to the discriminator network to
stabilise GAN training. By limiting the spectral norm of each layer, the discriminator's
Lipschitz constant is constrained. In comparison to other normalisation techniques, spectral
normalisation does not require additional hyper-parameter tuning (in practice, setting the
spectral norm of all weight layers to 1 consistently performs well). Furthermore, the
computational cost is relatively low.
Spectral normalisation in the generator can prevent the escalation of parameter magnitudes
and avoid unusual gradients. We find that spectral normalisation of both the generator and the
discriminator allows us to use fewer discriminator updates per generator update, significantly
lowering the computational cost of training. The method also exhibits more consistent
training behaviour.
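As a rough illustration, the snippet below wraps discriminator layers with PyTorch's built-in spectral-norm utility (PyTorch and the layer sizes are assumptions, not the project's exact architecture). Each wrapped weight is rescaled by its largest singular value at every forward pass, keeping the layer's spectral norm at 1.

```python
# Sketch of applying spectral normalisation to discriminator layers (PyTorch assumed).
import torch.nn as nn
from torch.nn.utils import spectral_norm

discriminator_block = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
)
```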
CHAPTER 4
RESULTS
After pre-training the text encoder with BERT and adding the self-attention mechanism, we
obtain the following table.
Table: FID of the SAGAN model for different numbers of attention layers and feature sizes (feat 8, feat 16, feat 32, feat 64)
We can clearly see that, as the number of features is decreased, the FID distance reduces,
reaching its optimum at 32 features; decreasing the features further causes the distance to
increase again.
Fig. 4.2: Fréchet Inception Distance of the self-attention GAN, visually represented
This project shows that our approach is satisfactory based on the FID score and in comparison
with the results of the papers discussed in the literature review. On the CUB-200 dataset, our
method improves the FID by 21.11%. Because image-text representation learning is a
fundamental task, we believe our approach has potential applicability in a wide range of
cross-domain tasks, such as visual question answering, image-text retrieval, and text-to-image
synthesis.
CHAPTER 5
CONCLUSION