Liu Hu Report
Abstract
Generative adversarial networks (GANs) are powerful generative models that have led to breakthroughs
in image generation. In this project, we investigate the use of GANs in generating synthetic
data from the MNIST dataset to either augment or replace the original data when training classifiers.
We demonstrate that training classifiers on purely synthetic data achieves results comparable to
training solely on real data, and show that for small sets of training data, augmenting the dataset by
first training GANs on the data can lead to dramatic improvement in classifier performance. We also
begin to explore using GAN-generated data to recursively train other GANs.
1 Introduction
Described by Yann LeCun, Director of AI Research at Facebook AI, as "the most interesting idea in the last 10 years
in Machine Learning", Generative Adversarial Networks (GANs) are powerful generative models which reformulate
the task of learning a data distribution as an adversarial game. A fundamental bottleneck in machine learning is data
availability, and a variety of techniques are used to augment datasets to create more training data. As powerful gen-
erative models, GANs are good candidates for data augmentation. In recent years, there has been some development
in exploring the use of GANs in generating synthetic data for data augmentation given limited or imbalanced datasets
[1]. Aside from augmenting real data, there are scenarios in which one may wish to substitute synthetic data for
real data directly. For example, when people provide images in a medical context, using a GAN as the "middle man"
would grant confidentiality to the parties providing the original data.
Additionally, it has been shown that GANs can translate images from one context to another in the absence of paired
examples [2]. This has been used to combine datasets of similar images in different formats. One GAN variant,
CycleGAN, has been used to combine datasets of contrast and non-contrast medical images [3].
Figure 1: Left Three: GAN-generated numbers from the GAN with α = 4 and training size of 2000 at 2000 epochs.
Right Three: Real samples from the MNIST dataset.
2 Methods
In this section, we discuss our GAN objectives and the model architectures that we use for our tasks. All of the
models we describe in the following subsections are built from scratch.
2.1 GANs
We trained a separate GAN to generate images of each digit. When training GANs, the generator and discriminator
are playing a two-player minimax game [7] with value function
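$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
where $G : \mathbb{R}^{100} \to \mathbb{R}^{784}$ maps a noise sample to a 28 × 28 image and $D : \mathbb{R}^{784} \to \mathbb{R}$ maps an image to a probability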
that the image came from the true data distribution (rather than the generator). The discriminator tries to maximize the
objective function, so its loss function over all examples is
$$J_D(x, z; \theta_d, \theta_g) = \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right]$$
The generator can’t affect the first term in the summation, so it tries to minimize the objective function by minimizing
its loss
$$J_G(z; \theta_d, \theta_g) = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)$$
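The two quantities above can be evaluated directly from the discriminator's outputs on a batch. The following is a minimal NumPy sketch rather than the project's code; the function names and the example probabilities are illustrative assumptions.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """J_D: batch mean of log D(x) + log(1 - D(G(z)))."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real) + np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    """J_G: batch mean of log(1 - D(G(z)))."""
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(1.0 - d_fake)))

# Hypothetical discriminator outputs on three real and three generated images.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.2, 0.05])

print("J_D:", discriminator_objective(d_real, d_fake))  # ascended by the discriminator
print("J_G:", generator_loss(d_fake))                   # descended by the generator
```

Both values are at most zero. The discriminator's objective approaches zero as it becomes more accurate, while the generator's loss falls as the discriminator assigns higher probabilities to generated images.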
When training a GAN, we ascend the discriminator’s stochastic gradient and descend the generator’s stochastic
gradient with respect to $\theta_d$ and $\theta_g$, respectively, until the losses stop changing. The sizes of the hidden
layers in each GAN are directly proportional to a parameter α. To combat overconfidence in the discriminator, we
use one-sided label smoothing, which penalizes the discriminator for predictions on real images that exceed 0.9.
To implement this, we simply replace every instance of $1 - D(G(z^{(i)}))$ with $0.9 - D(G(z^{(i)}))$ in the equations
above.
2.2.2 Discriminator
The discriminator takes in an image in the form of a 784-dimensional vector. We again use 3 hidden layers with
LeakyReLU and dropout, and output a probability that the image is genuinely from the dataset. Letting $h^{[0]}$ denote
the input image, and $W^{[j]}$ and $b^{[j]}$ denote the weight matrix and the bias vector in the j-th hidden layer, we have
$$h^{[i]} = \text{LeakyReLU}(W^{[i-1]} h^{[i-1]} + b^{[i-1]})$$
for i = 1, 2, 3. The sizes of the hidden layers were designed to decay exponentially between layers while parameterized
by α, with $h^{[i]} \in \mathbb{R}^{256 \alpha 2^{-i}}$ for i = 1, 2, 3, and we output the prediction that the image comes from the original data,
p ∈ (0, 1), via
$$p = \sigma(W^{[3]} h^{[3]} + b^{[3]}).$$
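As an illustration of the layer sizes and forward pass just described, here is a minimal NumPy sketch. It is not the authors' implementation: the LeakyReLU slope of 0.2, the Gaussian weight initialization, and all function names are assumptions, and dropout is omitted, as it would be at inference time.

```python
import numpy as np

def hidden_sizes(alpha):
    """Hidden-layer widths 256 * alpha * 2**(-i) for i = 1, 2, 3."""
    return [int(256 * alpha * 2 ** (-i)) for i in (1, 2, 3)]

def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_discriminator(alpha, rng, input_dim=784):
    """Return [(W[0], b[0]), ..., (W[3], b[3])]; the init scheme is an assumption."""
    dims = [input_dim] + hidden_sizes(alpha) + [1]
    return [(rng.normal(0.0, 0.02, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(dims[:-1], dims[1:])]

def discriminator_forward(params, image):
    """Map a 784-dimensional image to a probability in (0, 1)."""
    h = image
    for W, b in params[:-1]:       # three LeakyReLU hidden layers
        h = leaky_relu(W @ h + b)
    W, b = params[-1]              # sigmoid output layer
    return float(sigmoid(W @ h + b)[0])

rng = np.random.default_rng(0)
params = init_discriminator(alpha=4, rng=rng)
print(hidden_sizes(4))  # [512, 256, 128]
print(discriminator_forward(params, rng.random(784)))
```

With α = 4, the size used for most experiments in this report, the hidden layers have 512, 256, and 128 units.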
2.3 Classifier
The classifier uses a single hidden layer of size 300 with a sigmoid non-linearity and outputs a 10-dimensional vector
representing how likely an image is to be each digit. Letting p denote this prediction vector and $i \in \mathbb{R}^{784}$ denote
the input image, we have
$$p = W^{[1]} \sigma(W^{[0]} i + b^{[0]}) + b^{[1]}$$
For prediction, the largest entry in the output vector corresponds to the classifier’s prediction. The loss function used is
the cross entropy loss with an additional regularization term preventing the model from over-fitting. When trained on
the entire MNIST training set, this classifier is able to achieve over 96 percent accuracy [8].
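The classifier's forward pass and loss can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the code used in the experiments: the report does not specify the form of the regularizer, so an L2 weight penalty with a hypothetical coefficient `reg` is assumed, and the cross entropy is taken over a softmax of the output vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classifier_output(params, image):
    """p = W[1] sigmoid(W[0] i + b[0]) + b[1], a 10-dimensional score vector."""
    W0, b0, W1, b1 = params
    return W1 @ sigmoid(W0 @ image + b0) + b1

def predict(params, image):
    """The predicted digit is the index of the largest output entry."""
    return int(np.argmax(classifier_output(params, image)))

def regularized_cross_entropy(params, image, label, reg=1e-4):
    """Softmax cross entropy plus an assumed L2 penalty on the weights."""
    W0, _, W1, _ = params
    scores = classifier_output(params, image)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    penalty = reg * (np.sum(W0 ** 2) + np.sum(W1 ** 2))
    return float(-log_probs[label] + penalty)

rng = np.random.default_rng(0)
params = (rng.normal(0, 0.01, (300, 784)), np.zeros(300),
          rng.normal(0, 0.01, (10, 300)), np.zeros(10))
image = rng.random(784)
print(predict(params, image), regularized_cross_entropy(params, image, label=3))
```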
3 Experiments
3.1 Data
We used a random subset of the MNIST dataset for each of our experiments. Our largest experiments used 20,000 of
the 60,000 available training images due to GPU limitations. Achieving state-of-the-art results (over 99.8% accuracy
on MNIST [9]) is not crucial, since we just want to show the efficacy of synthetic data in training
a GAN.
Train size represents the number of images of each number used to train each GAN. The baseline performance is that
of a classifier trained solely on the original set of images. Despite the intuition that smaller models may outperform
larger models when given less training data, the accuracies in Table 1 show that for all sizes of training data, GANs of
size α = 4 outperformed all other models. Thus, we decided to further evaluate mixed datasets of real and synthetic
data for the models with α = 4.
We proceeded to test the validity of using GANs to augment data. Using GANs of size α = 4, the original images
used to train the GANs were supplemented by GAN-generated images. This mixture of original and synthetic images
was then used to train the classifier. As in the previous section, each classifier’s accuracy shown in Figure 2 reflects
the average performance across 5 trials. The mixing ratio, $\frac{\text{Synthetic Data}}{\text{Real Data}}$, reflects the amount of GAN-generated data
used to augment the original data.
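To make the mixing ratio concrete, the sketch below builds an augmented training set from real and synthetic images. It is a hypothetical helper rather than the project's pipeline; the function name, the rounding of the synthetic count, and the final shuffle are assumptions.

```python
import numpy as np

def mix_datasets(real, synthetic, ratio, rng):
    """Return the real images plus round(ratio * len(real)) synthetic ones, shuffled."""
    n_synth = int(round(ratio * len(real)))
    if n_synth > len(synthetic):
        raise ValueError("not enough synthetic images for this mixing ratio")
    # Sample the synthetic images without replacement.
    picked = synthetic[rng.choice(len(synthetic), size=n_synth, replace=False)]
    mixed = np.concatenate([real, picked], axis=0)
    return mixed[rng.permutation(len(mixed))]

rng = np.random.default_rng(0)
real = np.zeros((500, 784))        # stand-in for 500 real images of one digit
synthetic = np.ones((1000, 784))   # stand-in for GAN-generated images
mixed = mix_datasets(real, synthetic, ratio=1.0, rng=rng)
print(mixed.shape)  # (1000, 784)
```

A ratio of 1.0 with 500 real images gives a set of 500 real and 500 synthetic images.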
Lastly, because GAN-generated data is successful in augmenting the training data for training classifiers, we tried
repeatedly using GANs to augment the dataset. This was done (for each number) by starting with a training set of
$s_0 = 250$ real images. During each iteration $i$, the current training set of size $s_i$ was used to train a GAN, which was in
turn used to generate $0.1 s_i$ synthetic images. These images were then added to the training set and used in subsequent
iterations. The GAN size was kept constant across iterations, but in each iteration, the GANs were trained anew. To
evaluate each generation of GANs, classifiers were trained solely on GAN-generated data as in Experiment 1. The
classifier accuracies of each generation of GANs are shown in Figure 3.
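The recursive procedure can be sketched as a loop in which the training set grows by 10% per iteration. The code below is a minimal illustration and not the authors' training script: `train_gan` is a hypothetical callable that returns a sampling function, `resampling_stub` is a placeholder that merely resamples its training data rather than training a GAN, and rounding the number of new images down is an assumption.

```python
import numpy as np

def recursive_augmentation(real_images, train_gan, iterations, growth=0.1):
    """Repeatedly train a fresh model on the current set and add its samples."""
    training_set = np.asarray(real_images)
    sizes = [len(training_set)]
    for _ in range(iterations):
        generator = train_gan(training_set)        # trained anew every iteration
        n_new = int(growth * len(training_set))    # 0.1 * s_i, rounded down (assumption)
        synthetic = generator(n_new)
        training_set = np.concatenate([training_set, synthetic], axis=0)
        sizes.append(len(training_set))
    return training_set, sizes

def resampling_stub(training_set, seed=0):
    """Placeholder for GAN training: 'generates' by resampling the training data."""
    rng = np.random.default_rng(seed)
    def generate(n):
        return training_set[rng.integers(0, len(training_set), size=n)]
    return generate

real = np.zeros((250, 784))  # stand-in for 250 real images of one digit
final_set, sizes = recursive_augmentation(real, resampling_stub, iterations=3)
print(sizes)  # [250, 275, 302, 332]
```

Because every added image comes from a model trained only on earlier sets, the proportion of real images shrinks with each iteration even though no new real data enters the loop.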
4 Analysis
From Experiment 1, we observe that classifiers trained on purely synthetic data can achieve comparable and sometimes
better results than classifiers trained directly on MNIST images. This suggests that in situations where privacy concerns
make disclosing real data undesirable, GAN-generated data represents a comparable alternative to the original data. In
addition, we note that regardless of train size, the GAN of size α = 4 outperformed other model sizes. This suggests
that the optimal GAN hyperparameters are influenced more by the variability and nature of the training data than
by its quantity.
Experiment 2 showed that adding synthetic data to real data increases classifier performance for all train sizes (see
Table 1). For small training sizes of 250 and 500, the classifier performance increases by over 15 percent when adding
GAN-generated images. For all training sizes, we observe a point at which incorporating additional synthetic data
decreases classifier performance. This optimal ratio of synthetic to real data differs based on the original train
size, and decreases as the training size increases. In addition, classifier performance when given a train size of 2000
was barely impacted by data augmentation. These two observations are in line with the general intuition that data
augmentation is more beneficial when there is less data. Perhaps most surprising is the observation that incorporating
GAN-generated data can improve classifier performance more than incorporating additional real images. For s =
250, 500, 1000, we observe that classifier performance when trained on s real and s synthetic images exceeds classifier
performance when trained on 2s real images, despite the synthetic numbers being generated from GANs trained on
the s real numbers and thus not containing any additional information about the dataset. In Figure 4, the classifier
losses when training a classifier on 500 real and 500 synthetic images of each number vs. training on 1000 real images
are shown. Both trials used the same number of images for the same duration, but training on a mixture of original and
synthetic data shows a far smoother and less variable descent of the training loss. This different behavior could be a
result of the GAN-generated images having less variability than real images; then, the greater variability and noise in
the training loss of the classifier on real data would reflect the greater variability in the images it is being trained on.
Figure 4: Classifier losses for Experiment 2. Left: 500 real and 500 synthetic numbers; Right: 1000 real numbers
For Experiment 3, we don’t observe any significant increase in performance as we train for more iterations. However,
the change between iterations wasn’t monotonic either; rather, it appears periodic. Since the only real data the GAN
is trained on is the original 250 images, it’s not unreasonable for the accuracy to be stagnant because the GAN never
gets more information about the distribution of the data. The oscillating accuracy is likely due to the noisiness of the
new synthetic data, and if we kept running the experiment for longer, we would expect the accuracies to drop, just as
they did in Experiment 2 when too much synthetic data was added.
5 Conclusion
Our findings show that training classifiers solely on GAN-generated data can produce performance comparable to that
of classifiers trained on the original data. This suggests that using GANs as an intermediary is a viable way to
work with personal or private data. We also show that using GAN-generated data for augmentation can significantly
improve classifier performance on small datasets. We even show scenarios in which the performance
gain from adding GAN-generated data exceeds that of adding more true images. However, additional tests suggest
that we cannot benefit from recursively augmenting the training data of additional GANs, since classifiers trained on
self-training GANs never seem to do better than after the single-iteration mix.
The primary limitation of our study was limited computational power; due to limited access to
GPUs, we had to train everything on local machines. This ended up being quite costly and limited the rate at
which we could train, and will only get worse if we intend to scale up to more complex images. If given access
to more computational power, it would be worth doing similar classification on more complex datasets, such as
CIFAR-10 or CIFAR-100. Additionally, some GANs clearly had better generator losses than others due to the differing
complexity and variability of each number. Some future work could include training each GAN with a different set
of hyperparameters in order to have the generators for different digits be more consistent. In addition, exploring data
augmentation using other generative models, such as variational autoencoders, both in isolation and in conjunction
with GANs could be interesting. Lastly, there’s more work to be done on self-training (perhaps in the form of a
hyperparameter sweep, or increasing model size as the dataset increases), and with more time, we would have liked
to investigate in more detail how a recursive approach to training compares to a one-shot approach.
6 Contributions
Both team members contributed equally to the project and report. The majority of the actual modules in the code (clas-
sifier, GANs, training scripts) and the design of experiments were a joint effort, and Nathan built the data processing
pipelines while David built the plotting and model loading pipelines.
7 Code
The repository for this project is available at https://github.com/dliud/gan.
References
[1] Fabio Henrique Kiyoiti dos Santos Tanaka and Claus Aranha. Data Augmentation Using GANs. In Proceedings
of Machine Learning Research, 2019.
[2] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei Efros. Unpaired Image-to-Image Translation using Cycle-
Consistent Adversarial Networks. In International Conference on Computer Vision, 2017.
[3] Veit Sandfort, Ke Yan, Perry Pickhardt, and Ronald Summers. Data augmentation using generative adversarial
networks (CycleGAN) to improve generalizability in CT segmentation tasks. In Nature Research, 2019.
[4] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and Im-
proving the Image Quality of StyleGAN. 12 2019.
[5] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. SinGAN: Learning a Generative Model from a Single
Natural Image. In International Conference on Computer Vision, 2019.
[6] Jiajun Shen, Peng-Jen Chen, Matt Le, Junxian He, Jiatao Gu, Myle Ott, Michael Auli, and Marc’Aurelio Ranzato.
The source-target domain mismatch problem in machine translation. 09 2019.
[7] Ian Goodfellow et al. Generative Adversarial Nets. In Conference on Neural Information Processing Systems,
2014.
[8] CS 229. Problem Set 4. 2020.
[9] Adam Byerly, Tatiana Kalganova, and Ian Dear. A Branching and Merging Convolutional Network with Homo-
geneous Filter Capsules. In arXiv, 2019.