0% found this document useful (0 votes)
124 views20 pages

Text To Image Synthesis Using Self

The document discusses text-to-image synthesis using self-attention generative adversarial networks. It describes how earlier convolutional GANs could generate high-resolution images but lacked the ability to maintain detailed features. The self-attention GAN model allows for long-range dependency modelling which helps preserve image details. It also has a more thorough discriminator that can check for fine-grained features. The model is evaluated using the Fréchet Inception distance metric.

Uploaded by

PG Guides
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
124 views20 pages

Text To Image Synthesis Using Self

The document discusses text-to-image synthesis using self-attention generative adversarial networks. It describes how earlier convolutional GANs could generate high-resolution images but lacked the ability to maintain detailed features. The self-attention GAN model allows for long-range dependency modelling which helps preserve image details. It also has a more thorough discriminator that can check for fine-grained features. The model is evaluated using the Fréchet Inception distance metric.

Uploaded by

PG Guides
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 20

TEXT TO IMAGE SYNTHESIS USING SELF-ATTENTION GENERATIVE

ADVERSARIAL NETWORKS

MICHELLE SARAH SIMON

A project report submitted


in partial fulfilment of the requirement for the award of the

POST GRADUATE DIPLOMA IN MANAGEMENT

IN

RESEARCH & BUSINESS ANALYTICS


MADRAS SCHOOL OF ECONOMICS
MSE BUSINESS SCHOOL

May 2022

MADRAS SCHOOL OF ECONOMICS


Chennai ‐600025
Degree and Branch : PGDM

(RESEARCH & BUSINESS ANALYTICS)

Month and Year of Submission : MAY 2022

Title of the Project Work : TEXT TO IMAGE SYNTHESIS USING


SELF-ATTENTION GENERATIVE
ADVERSARIAL NETWORKS

ii
Name of Student : MICHELLE SARAH SIMON

Roll Number : 2020DMB06

Name and Designation : Mr. BALAJI MUTHUKRISHNAN

of Supervisor Visiting faculty

Madras School of Economics

Chennai‐ 600025

Bonafide certificate

Certified that this Project Report titled Text to Image Synthesis using Self-Attention
Generative Adversarial Network is the bonafide work of Ms. Michelle Sarah Simon, who
carried out the project under my supervision. Certified further, that to the best of my
knowledge the work reported herein does not form part of any other Project Report of the
basis of which a degree or award was conferred on an earlier occasion on this or any other
candidate

iii
Abstract

In this project, I attempt the self-attention generative adversarial network (SAGAN)


which allows for attention-driven, long-range dependency modelling for image
generation. For a long time, convolutional GANs were used. While they were of high
resolution, they were not able to maintain several detailed features of the image. In self
attention GAN, the details of the images, the features of the images were salient. In
addition, self-attention GAN’s discriminator is more thorough, and can check for
highly detailed features in the images generated. The key is applying spectral
normalization, and not batch normalization in case of Deep Convoluted GANs. The
metrics that the model is judged on is the Fréchet Inception distance (FID).

iv
Acknowledgement

Thank you everyone

v
Table of contents

vi
List of tables

List of figures

vii
CHAPTER 1

INTRODUCTION

The objective of the text-to-image synthesis problem is to generate high-quality images


from the specific text descriptions. It is a fundamental problem with a wide range of
practical applications, including art generation, image editing, and computer-aided design.
While the idea of text to image generation started in 2008, it was the creation of GANs , or
generative adversarial networks, in 2014 that led to the current state-of-the-art synthesizers
that exist today. Conditioned on the text descriptions, the GAN-based models can generate
realistic images with consistent semantic meaning. In practice, one image is associated to
multiple captions in the datasets. These text descriptions annotated by humans for the same
image are highly subjective and diverse in terms of contents and choice of words.
Additionally, some text descriptions do not even provide sufficient semantic information to
guide the image generation. The linguistic variance and inadequacy between the captions of
the identical image leads to the synthetic images conditioned on them deviating from the
ground truth.

In the image-text matching task, we pretrain an image encoder and a text encoder to learn
the semantically consistent visual and textual representations of the image-text pair.
Meanwhile, we learn the consistent textual representations by pushing together the captions
of the same image and pushing way the captions of different images via the contrastive
loss. The pretrained image encoder and text encoder are leveraged to extract consistent
visual and textual features in the following stage of GAN training. Then contrastive loss is
used to minimize the distance of the fake images generated from text descriptions related to
the same ground truth image while maximizing those related to different ground truth
images. We generalize the existing text-to-image models to a unified framework so that the
approach can be integrated into them to improve their performance.

Generation of complicated, real-world images such as MS-COCO remains an open


challenge. GANs have sparked a lot of interest and advanced research efforts in
viii
synthesising images. They framed the image synthesis task as a two-player game of two
competing artificial neural networks. A generator network is trained to produce realistic
samples, while a discriminator network is trained to distinguish between real and generated
images. The training objective of the generator is to fool the discriminator. This approach
has successfully been adapted to many applications such as high-resolution synthesis of
human faces, image super resolution, image in-painting, data augmentation, style transfer,
image-to-image translation, and representation learning.

Current models are still far from being capable of generating complex scenes with multiple
objects based only on textual descriptions. There is also very limited work on for
resolutions higher than 256 × 256 pixels. It is challenging to reproduce the quantitative
results of many approaches, even if code and pre-trained models are provided. This is
reflected in the literature provided, where often different quantitative results are reported
for the same model. Furthermore, it is observe that many of the currently used evaluation
metrics are unsuitable for evaluating text-to-image synthesis models and do not correlate
well with human perception. This is because only a few approaches perform human user
studies to assess if their improvements are evident in a qualitative sense, and if they do, the
studies are not standardized, making the comparison of results difficult.

ix
CHAPTER 2

LITERATURE REVIEW

OVERVIEW
There are billions of volumes of textual content generated every day in today's world. In-
app messaging such as WhatsApp and Telegram, social media sites such as Facebook and
Instagram, news publishing sites, Google searches, and a variety of other sources are all
possible sources. The main focus of this section is on the popular NLP task of sentiment
analysis. Sentiment analysis is a fantastic tool for users to extract important information
and assists organisations in understanding the social sentiment of their brand, product, or
service while monitoring online conversations. This section investigates the various
approaches and models used in the task of sentiment analysis.

LITERATURES 

For this project, I take the help of

1. SPECTRAL NORMALIZATION FOR GENERATIVE ADVERSARIAL NETWORKS -


Takeru Miyato , Toshiki Kataoka , Masanori Koyama , Yuichi Yoshida:
One of the difficulties in studying generative adversarial networks is the instability of their
training. In this paper, we propose spectral normalisation, a novel weight normalisation
technique for stabilising discriminator training. Our new normalisation technique is
computationally light and simple to implement in existing systems. We tested the efficacy of
spectral normalisation on the CIFAR10, STL-10, and ILSVRC2012 datasets, and we found that
spectrally normalised GANs (SN-GANs) can generate images of comparable or higher quality
than previous training stabilisation techniques.

2. Improved Techniques for Training GANs - Tim Salimans, Ian Goodfellow, Wojciech Zaremba,
Vicki Cheung, Alec Radford, Xi Chen
We present a number of new architectural features and training procedures for the
generative adversarial networks (GANs) framework. We achieve cutting-edge results in

x
semi-supervised classification on MNIST, CIFAR-10, and SVHN using our new
techniques. A visual Turing test confirmed the high quality of the generated images: our
model generates MNIST samples that humans cannot distinguish from real data and
CIFAR-10 samples with a human error rate of 21.3 percent. We also show ImageNet
samples with unprecedented resolution and demonstrate how our methods enable the
model to learn recognisable ImageNet class features.

xi
CHAPTER 3

THE STUDY

1. OPEN-SOURCE DATASET
Caltech-UCSD Birds 200 (CUB-200) is an image dataset annotated with 200 bird species.
It was created to enable the study of subordinate categorization, which is not possible with
other popular datasets that focus on basic level categories. The images were downloaded
from the website data.caltech.edu/records. Each image is annotated with a bounding box, a
rough bird segmentation, and a set of attribute labels.

CUB-200 includes 6,033 annotated images of birds, belonging to 200, mostly North
American, bird species.

2. FEATURE EXTRACTION

Feature extraction in image processing


Feature extraction is a step in the dimensionality reduction process that divides and reduces
an initial set of raw data to more manageable groups. As a result, processing will be
simpler. The most important feature of these large data sets is the large number of
variables. These variables necessitate a significant amount of computing power to process.
As a result, feature extraction aids in obtaining the best feature from large data sets by
selecting and combining variables into features, effectively reducing the amount of data.
These features are simple to process while accurately and uniquely describing the actual
data set.

For image processing, there are three methods to extract features.


1. Grayscale Pixel Values as Features
2. Mean Pixel Value of Channels
3. Extracting Edges

For this project, I will be using extracting edges method.

xii
Consider that we are given the below image and we need to identify the objects present in
it:

As a human, you recognize the images instantly - a dog, a car, and a cat. The shape could
be one important factor, followed by colour, or size.

A similar idea is to extract edges as features and use that as the input for the model. Edge is
basically where there is a sharp change in colour. Look at the below image:

The machine can identify the edge because there was a change in colour from white to
brown, and brown to black. An image is represented in the form of numbers. So, we will
look for pixels values around which there is a drastic change in the pixel values.

With the help of this, we can extract several features, such as eyes, ears, wings, feathers,
colour, etc.

xiii
Feature extraction in text processing

Text words represent discrete, categorical features in text processing. We encode such data
in a way that the algorithms can use it. Feature extraction refers to the process of mapping
textual data to real-valued vectors. Bag of Words is one of the most basic techniques for
numerically representing text.

Bag of Words (BOW): The vocabulary is a list of unique words in the text corpus that we
create. Then, for each sentence or document, we can represent it as a vector, with each word
represented as 1 for present and 0 for absent from the vocabulary. Another way to represent
this is to count the number of times each word appears in a document. The most widely used
method is the Term Frequency-Inverse Document Frequency (TF-IDF) technique.

 Term Frequency (TF) = (Number of times term t appears in a document)/(Number of


terms in the document)
 Inverse Document Frequency (IDF) = log(N/n), where N is the number of documents and
n is the number of documents a term t has appeared in. The IDF of a rare word is high,
whereas the IDF of a frequent word is likely to be low. Thus having the effect of
highlighting words that are distinct.
 We calculate TF-IDF value of a term as = TF * IDF

Let us take an example to calculate TF-IDF of a term in a document.

xiv
As shown in Document 1, the TF-IDF method heavily penalises the word 'beautiful' while
giving more weight to the word 'day.' This is due to the IDF part, which gives more
weightage to distinct words. In other words, in the context of the entire corpus, 'day' is an
important word for Document1. The Python scikit-learn library includes functions for
calculating the TF-IDF of text vocabulary given a text corpus. For natural language
processing (NLP) maintaining the context of the words is of utmost importance. For this, we
use another approach called Word Embedding.

Word Embedding is a text representation in which words with the same meaning are
represented similarly. In other words, it represents words in a coordinate system where
related words are placed closer together based on a corpus of relationships. The more well-
known models of word embedding are Word2Vec and Global vectors (GloVe). For this,
project, I will be using Word2Vec for the embedding, and relationship definition.

Word2vec takes as its input a large corpus of text and produces a vector space with each
unique word being assigned a corresponding vector in the space. Word vectors are
positioned in the vector space such that words that share common contexts in the corpus are
in close proximity to one another in the space. Word2Vec is very famous at capturing
meaning and demonstrating it on tasks like calculating analogy questions of the form a is
to b as c is to ___. For example, man is to woman as uncle is to ___ (aunt) using a simple
vector offset method based on cosine distance.

xv
3. EVALUATION METRICS

The Fréchet Inception Distance (FID) is a metric for evaluating the quality of generated
images and specifically developed to evaluate the performance of GANs. The activations are
summarized as a multivariate Gaussian by calculating the mean and covariance of the
images. These statistics are calculated for the activations across the collection of real and
generated images. The deviation between these two distributions is called the Fréchet
distance.

4. SELF ATTENTION

To evaluate the effect of the proposed self-attention mechanism, we constructed several


SAGAN models by incorporating the self-attention mechanism at various stages of the
generator and discriminator. SAGAN models with self-attention mechanisms at the middle-
to-high level feature maps outperform models with self-attention mechanisms at the low
level feature maps. With larger feature maps, self-attention receives more evidence and has
more freedom to choose conditions, which results in a lower FID score. It works in
conjunction with convolution for large feature maps.

When modelling dependencies for small feature maps, it functions similarly to the local
convolution. It demonstrates how the attention mechanism empowers both the generator and
the discriminator to directly model the feature maps' long-range dependencies. Furthermore,
a comparison of our SAGAN and the baseline model without attention demonstrates the
efficacy of the proposed self-attention mechanism.

The self-attention blocks perform better than residual blocks with the same number of
parameters. Even when the training goes smoothly, replacing the self-attention block with
the residual block results in worse FID and Inception score results. This comparison shows
that the performance boost provided by SAGAN is not simply due to an increase in model
depth and capacity. To better understand what was learned during the generation process, we
visualised the generator's attention weights in SAGAN for various images.

xvi
Fig. 3.4.1: Self attention architecture for image processing

5. SPECTRAL NORMALIZATION

Miyato first proposed using spectral normalisation to the discriminator network to stabilise
GAN training. By limiting the spectral norm of each layer, the discriminator's Lipschitz
constant is constrained. In comparison to other normalisation techniques, spectral
normalisation does not require additional hyper-parameter tuning (in practise, setting the
spectral norm of all weight layers to 1 consistently performs well). Furthermore, the
computational cost is relatively low.

Spectral normalisation in the generator can prevent parameter magnitude spike and avoid
unusual gradients. We find that spectral normalisation of both the generator and the
discriminator allows us to use fewer discriminator updates per generator update,
significantly lowering the computational cost of training. The method also exhibits more
consistent training behaviour.

xvii
CHAPTER 4

RESULTS

After pre-training the model on BERT, and developing the self-attention, we get the
following table.
SAGAN
Model No. of attention layers
feat 8 feat 16 feat 32 feat 64

FID 23 22.98 22.14 18.28 18.65

Table 4.1: Fréchet inception distance of the self-attention GAN

We can clearly see that, by decreasing the number of features, the FID distance reduces. It is
optimal at 32 features, after the distance starts to increase.

Fig. 4.2: Fréchet inception distance of the self-attention GAN visually represented

The self-attention module is useful for modelling long-term dependencies. Furthermore, we


show that applying spectral normalisation to the generator stabilises GAN training and
speeds up training of regularised discriminators.

xviii
This project shows that our approach is satisfactory based on the FID score and when
compared to the results of the papers mentioned in the literature reviews. On the CUB-200
dataset, our method increases the FID by 21.11%. We believe our approach has potential
applicability in a wide range of cross domain tasks, such as visual question answering,
image-text retrieval, and text-to-image synthesis, because image-text representation learning
is a fundamental task.

xix
CHAPTER 5

CONCLUSION

We demonstrated how to incorporate the self-attention model into text-to-image models to


improve their performance in this project. To begin, we use the BERT and Self-attention to
train the image-text matching task to push together the textual representations corresponding
to the same image. Furthermore, we use the method to improve the consistency of generated
images based on captions for the same image. We propose a generalised framework for text-
to-image models and use the FID score to evaluate our model.

xx

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy