
High-resolution Deep Convolutional Generative Adversarial Networks

J. D. Curtó∗,1,2,3,4, I. C. Zarza∗,1,2,3,4, F. Torre2,5, I. King1, and M. R. Lyu1.


1 The Chinese University of Hong Kong. 2 Carnegie Mellon.
3 Eidgenössische Technische Hochschule Zürich. 4 City University of Hong Kong. 5 Facebook.
∗ Both authors contributed equally.
arXiv:1711.06491v18 [cs.CV] 17 Apr 2020

Figure 1. HDCGAN Synthetic Images. A set of random samples. Our system generates high-resolution synthetic faces with an extremely high level of
detail. HDCGAN goes from random noise to realistic synthetic pictures that can even fool humans. To demonstrate this effect, we create the Dataset of Curtó
& Zarza, the first GAN augmented dataset of faces.

Generative Adversarial Networks (GANs) [Goodfellow et al. 2014] convergence in a high-resolution setting with a computational constraint of GPU memory capacity has been beset with difficulty due to the known lack of convergence rate stability. In order to boost network convergence of DCGAN (Deep Convolutional Generative Adversarial Networks) [Radford et al. 2016] and achieve good-looking high-resolution results, we propose a new layered network, HDCGAN, that incorporates current state-of-the-art techniques for this effect. Glasses, a mechanism to arbitrarily improve the final GAN generated results by enlarging the input size by a telescope ζ, is also presented. A novel bias-free dataset, Curtó & Zarza¹,², containing human faces from different ethnic groups in a wide variety of illumination conditions and image resolutions, is introduced. Curtó is enhanced with HDCGAN synthetic images, thus being the first GAN-augmented dataset of faces. We conduct extensive experiments on CelebA [Liu et al. 2015], CelebA-hq [Karras et al. 2018] and Curtó. HDCGAN is the current state-of-the-art in synthetic image generation on CelebA, achieving an MS-SSIM of 0.1978 and a Fréchet Inception Distance of 8.44.

CCS Concepts: • Neural Networks;

Additional Key Words and Phrases: Generative Adversarial Network, Convolutional Neural Network, Synthetic Faces.

1 INTRODUCTION

Developing a Generative Adversarial Network (GAN) [Goodfellow et al. 2014] able to produce good-quality high-resolution samples from images has important applications [Bousmalis et al. 2017; Chen et al. 2019; Li et al. 2017; Lombardi et al. 2018; Portenier et al. 2018; Romero et al. 2018; Sankaranarayanan et al. 2018; Wang et al. 2018a,b; Wu et al. 2016; Yang et al. 2017; Yu et al. 2018; Zhu et al. 2017], including image inpainting, 3D data, domain translation, video synthesis, image editing, semantic segmentation and semi-supervised learning.

In this paper, we focus on the task of face generation, as it gives GANs a huge space of learning attributes. In this context, we introduce the Dataset of Curtó & Zarza, a well-balanced collection of images containing 14,248 human faces from different ethnic groups, rich in a wide range of learnable attributes such as gender and age diversity, hair-style and pose variation, and presence of smile, glasses, hats and fashion items. We also ensure the presence of changes in illumination and image resolution. We propose to use Curtó as a de facto approach to empirically test the distribution learned by a GAN, as it offers a challenging problem to solve while keeping the number of samples, and therefore training time, bounded. It can also be used as a drop-in substitute for MNIST in simple classification tasks, for instance using labels of ethnicity, gender, age, hair style or smile. It ships with scripts in TensorFlow and Python that allow classification benchmarks. A set of random samples can be seen in Figure 2.

Despite improvements in GAN training stability [Mescheder et al. 2018, 2017; Salimans et al. 2016] and specific-task design during the last years, it is still challenging to train GANs to generate high-resolution images, due to the disjunction in the high-dimensional pixel space between the supports of the real-image and implied model distributions [Arjovsky and Bottou 2017; Sønderby et al. 2017].

¹ Curtó is available at https://www.github.com/curto2/c/
² Code is available at https://www.github.com/curto2/graphics/
{curto,zarza,king,lyu}@cse.cuhk.edu.hk, ftorre@cs.cmu.edu
decurto.tw dezarza.tw
Figure 2. Samples of Curtó. A set of random instances for each class of ethnicity: African American, White, East-Asian and South-Asian. See Table 1 for numerics.

Our goal is to be able to generate indistinguishable sample instances using face data, pushing the boundaries of GAN image generation so that it scales well to high-resolution images (such as 512×512) and context information is maintained.

In this sense, Deep Learning has a tremendous appetite for data. The question that arises instantly is: what if we were able to generate additional realistic data to aid learning, using the same techniques that are later used to train the system? The first step would then be to have an image generation tool able to sample from a very precise distribution (e.g. faces of celebrities) whose instances resemble, or highly correlate with, real sample images of the underlying true distribution. Once achieved, what is desirable and comes next is that these generated image points not only fit well into the original distribution of images but also add useful information, such as redundancy or different poses, or even cover highly probable scenarios that could appear in the original dataset but are actually not present.

Current research trends link Deep Learning and Kernel Methods to establish a unifying theory of learning [Curtó et al. 2017]. The next frontier in GANs would be to achieve learning at scale with very few examples. Toward this goal, this work contributes the following:

• A network that achieves compelling results and scales well to the high-resolution setting where, to the best of our knowledge, the majority of other variants are unable to continue learning or fall into mode collapse.

• A new dataset targeted at GAN training, Curtó, that introduces a wide space of learning attributes. It aims to provide a well-posed difficult task while keeping training time and resources tightly bounded, to spearhead research in the area.

2 PRIOR WORK

Generative image generation is a key problem in Computer Vision and Computer Graphics. Remarkable advances have been made with the renaissance of Deep Learning. Variational Autoencoders (VAE) [Kingma and Welling 2014; Lombardi et al. 2018] formulate the problem with an approach that builds on probabilistic graphical models, where the lower bound of the data likelihood is maximized. Autoregressive models (scilicet PixelRNN [van den Oord et al. 2016]), based on modeling the conditional distribution of the pixel space, have also had relative success in generating synthetic images. Lately, Generative Adversarial Networks (GANs) [Antoniou et al. 2018; Goodfellow et al. 2014; Odena et al. 2017; Portenier et al. 2018; Radford et al. 2016; Wang and Gupta 2016; Zhu et al. 2016] have shown strong performance in image generation. However, training instability makes it very hard to scale to high-resolution (256×256 or 512×512) samples. Some current works pinpoint this specific problem [Zhang et al. 2017], where conditional image generation is also tackled, while other recent techniques [Brock et al. 2019; Chen and Koltun 2017; Dosovitskiy and Brox 2016; Karras et al. 2018; Salimans et al. 2016; Wei et al. 2018; Zhao et al. 2017] try to stabilize training.

3 DATASET OF CURTÓ & ZARZA

Curtó contains 14,248 faces balanced in terms of ethnicity: African American, East-Asian, South-Asian and White. Mirror images are included to enhance pose variation, and each class accounts for roughly 25% of the images. Attribute information, see Table 1, is composed of thorough labels of gender, age, ethnicity, hair color, hair style, eye color, facial hair, glasses, visible forehead, hair covered and smile. There is also an extra set with 3,384 cropped labeled images of faces, ethnicity White, no mirror samples included; see Column 4 in Table 1 for statistics. We crawled Flickr to download images of faces from several countries that contain different hair-style variations and style attributes. These images were then processed to extract 49 facial landmark points using [Xiong and Torre 2013]. We ensure using Mechanical Turk that the detected faces are correct in terms of ethnicity and face detection. Cropped faces are then extracted to generate multiple resolution sources. Mirror augmentation is performed to further enhance pose variation.

Curtó introduces a difficult learning paradigm, where different ethnic groups are present, with very varied fashion and hair styles. The fact that the photos are taken with non-professional cameras in a non-controlled environment gives us multiple poses, illumination conditions and camera qualities.
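As mentioned in the introduction, Curtó ships with TensorFlow and Python scripts for classification benchmarks. Purely as a hypothetical illustration, and not the shipped scripts, a simple ethnicity-classification baseline in PyTorch could look like the sketch below; the directory layout c/train/<class>/ and the ResNet-18 backbone are assumptions.

# Hypothetical sketch of a classification benchmark on Curtó (ethnicity labels).
# The directory layout and model choice are assumptions, not the dataset's shipped scripts.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

# Assumed layout: c/train/African_American/*.jpg, c/train/White/*.jpg, ...
train_set = datasets.ImageFolder("c/train", transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

model = models.resnet18(num_classes=len(train_set.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()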
4 APPROACH

Generative Adversarial Networks (GANs), proposed by [Goodfellow et al. 2014], are based on two dueling networks, Figure 3: a Generator G and a Discriminator D. In essence, the process of learning consists of a two-player game where D tries to distinguish between the prediction of G and the ground truth, while at the same time G tries to fool D by producing fake instance samples as close to the real ones as possible. The solution to the game is called a Nash equilibrium.
Table 1. Dataset of Curtó & Zarza. Attribute Information. Classes are listed in descending order of number of samples (Column 3); Column 4 gives the counts in the extra set.

Attribute | Class | # Samples | # Extra
Age | Early Adulthood | 3606 | 966
Age | Middle Aged | 2954 | 875
Age | Teenager | 2202 | 178
Age | Adult | 1806 | 565
Age | Kid | 1706 | 85
Age | Senior | 1102 | 402
Age | Retirement | 436 | 218
Age | Baby | 232 | 14
Ethnicity | African American | 4348 | 0
Ethnicity | White | 3442 | 3384
Ethnicity | East Asian | 3244 | 0
Ethnicity | South Asian | 3214 | 0
Eye Color | Brown | 9116 | 2119
Eye Color | Other | 4136 | 875
Eye Color | Blue | 580 | 262
Eye Color | Green | 416 | 128
Facial Hair | No | 12592 | 2821
Facial Hair | Light Mustache | 466 | 156
Facial Hair | Light Goatee | 444 | 96
Facial Hair | Light Beard | 258 | 142
Facial Hair | Thick Goatee | 168 | 39
Facial Hair | Thick Beard | 166 | 68
Facial Hair | Thick Mustache | 154 | 62
Gender | Male | 7554 | 1998
Gender | Female | 6694 | 1386
Glasses | No | 12576 | 2756
Glasses | Eyeglasses | 1464 | 539
Glasses | Sunglasses | 208 | 89
Hair Color | Black | 8402 | 964
Hair Color | Brown | 3038 | 1241
Hair Color | Other | 1554 | 253
Hair Color | Blonde | 616 | 543
Hair Color | White | 590 | 347
Hair Color | Red | 48 | 36
Hair Covered | No | 12292 | 3060
Hair Covered | Turban | 1206 | 76
Hair Covered | Cap | 722 | 237
Hair Covered | Helmet | 28 | 11
Hair Style | Short Straight | 5038 | 1642
Hair Style | Long Straight | 2858 | 857
Hair Style | Short Curly | 2524 | 287
Hair Style | Other | 2016 | 249
Hair Style | Bald | 1298 | 187
Hair Style | Long Curly | 514 | 162
Smile | Yes | 8428 | 2118
Smile | No | 5820 | 1266
Visible Forehead | Yes | 11890 | 3033
Visible Forehead | No | 2358 | 351

Figure 3. Generative Adversarial Networks. A two-player game between the Generator G and the Discriminator D. The dotted line denotes elements that will not be further used after the game stops, namely, end of training.

The min-max game entails the following objective function

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],   (1)–(2)

where x is a ground-truth image sampled from the true distribution p_{data}, and z is a noise vector sampled from p_z (that is, a uniform or normal distribution). G and D are parametric functions, where G : p_z \to p_{data} maps samples from the noise distribution p_z to the data distribution p_{data}.

The goal of the Discriminator is to minimize

L(D) = -\tfrac{1}{2}\,\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].   (3)–(4)

If we differentiate it w.r.t. D(x) and set the derivative equal to zero, we obtain the optimal strategy

D(x) = \frac{p_{data}(x)}{p_z(x) + p_{data}(x)}.   (5)

This can be understood intuitively as follows. Accept an input, evaluate its probability under the distribution of the data, p_{data}, and then evaluate its probability under the generator's distribution of the data, p_z. Under the condition that D has enough capacity, it can achieve this optimum. Note that the discriminator does not have access to the distribution of the data; it is learned through training. The same applies to the generator's distribution of the data. Under the condition that G has enough capacity, it will set p_z = p_{data}. This results in D(x) = 1/2, which is precisely the Nash equilibrium. In this situation, the generator is a perfect generative model, sampling from p(x).
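To make the min-max game concrete, the following is a generic PyTorch sketch (not the authors' released code) of one alternating update of D and G using the binary cross-entropy form of Equations (1)–(4); the networks, noise dimension and optimizers are placeholders, and the generator update uses the usual non-saturating surrogate rather than literally minimizing log(1 − D(G(z))).

# Generic sketch of one GAN update implementing the min-max objective above.
# netG, netD, nz (noise dimension) and the optimizers are placeholders.
import torch
import torch.nn.functional as F

def gan_step(netG, netD, real, nz, optD, optG, device="cpu"):
    b = real.size(0)
    ones = torch.ones(b, device=device)
    zeros = torch.zeros(b, device=device)

    # Discriminator: minimize L(D) = -1/2 E[log D(x)] - 1/2 E[log(1 - D(G(z)))]
    z = torch.randn(b, nz, 1, 1, device=device)
    fake = netG(z)
    lossD = 0.5 * (F.binary_cross_entropy(netD(real).view(-1), ones) +
                   F.binary_cross_entropy(netD(fake.detach()).view(-1), zeros))
    optD.zero_grad()
    lossD.backward()
    optD.step()

    # Generator: maximize E[log D(G(z))] (non-saturating surrogate of the objective).
    lossG = F.binary_cross_entropy(netD(fake).view(-1), ones)
    optG.zero_grad()
    lossG.backward()
    optG.step()
    return lossD.item(), lossG.item()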
As an extension to this framework, DCGAN [Radford et al. 2016] proposes an architectural topology based on Convolutional Neural Networks (CNNs) to stabilize training and re-use state-of-the-art networks from tasks of classification. This direction has recently received a lot of attention due to its compelling results in supervised and unsupervised learning. We build on this to propose a novel DCGAN architecture that addresses the problem of high-resolution image generation. We name this approach HDCGAN.

4.1 HDCGAN

Despite their undoubtable success, GANs are still arduous to train, particularly when we use big images (e.g. 512×512). It is very common to see D beating G in the process of learning, or the reverse, ending in unrecognizable imagery, also known as mode collapse. Only when stable learning is achieved is the GAN structure able to keep producing better results with time.

This issue is what drives us to carefully derive a simple yet powerful structure that sidesteps these common problems and yields a stable and steady training mechanism.

Self-normalizing Neural Networks (SNNs) were introduced in [Klambauer et al. 2017]. We consider a neural network with activation function f, connected to the next layer by a weight matrix W, and whose inputs are the activations from the preceding layer x, that is, y = f(Wx).

We can define a mapping g that maps mean and variance from one layer to mean and variance of the following layer,

g : (\mu, \nu) \mapsto (\tilde{\mu}, \tilde{\nu}).   (6)

Common normalization tactics such as batch normalization ensure a mapping g that keeps (µ, ν) and (µ̃, ν̃) close to a desired value, normally (0, 1).

SNNs go beyond this assumption and require the existence of a mapping g : Ω ↦ Ω that, for each activation y, maps mean and variance from one layer to the next layer and at the same time has a stable and attracting fixed point depending on (ω, τ) in Ω. Moreover, the mean and variance remain in the domain Ω and, when iteratively applying the mapping g, each point within Ω converges to this fixed point. Therefore, SNNs keep activations normalized when propagating them through the layers of the network.

Here (ω, τ) are defined as follows. For n units with activations x_c, 1 ≤ c ≤ n, in the lower layer, we set n times the mean of the weight vector w ∈ R^n as ω := Σ_{c=1}^{n} w_c and n times the second moment as τ := Σ_{c=1}^{n} w_c².

The Scaled Exponential Linear Unit (SELU) [Klambauer et al. 2017] is introduced as the choice of activation function in Feed-forward Neural Networks (FNNs) to construct a mapping g with properties that lead to SNNs:

\mathrm{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \leq 0. \end{cases}   (7)

Empirical observation leads us to say that the use of SELU greatly improves the convergence speed of the DCGAN structure; however, after some iterations, mode collapse and gradient explosion completely destroy training when using high-resolution images. We conclude that although SELU gives theoretical guarantees as the optimal activation function in FNNs, numerical errors in the GPU computation degrade its performance in the overall min-max game of DCGAN. To alleviate this problem, we propose to use SELU and BatchNorm [Ioffe and Szegedy 2015] together. The motivation is that when numerical errors move (µ̃, ν̃) away from the attracting point that depends on (ω, τ) ∈ Ω, BatchNorm will ensure it stays close to a desired value and therefore maintain the convergence rate.

Experiments show that this technique stabilizes training and allows us to use fewer GPU resources, with steadily diminishing errors in G and D. It also accelerates convergence speed by a great factor, as can be seen after a few epochs of training on CelebA in Figure 8.

Figure 4. HDCGAN Architecture. Generator and Discriminator. (Per the diagram: the Generator starts with a 4×4 stride-1 convolution followed by BatchNorm and SELU, stacks 4×4 stride-2 up-sampling blocks with BatchNorm and SELU, and ends with a 4×4 stride-2 convolution and Tanh; the Discriminator starts with a 4×4 stride-2 convolution and SELU, stacks 4×4 stride-2 down-sampling blocks with BatchNorm and SELU, and ends with a 4×4 stride-1 convolution and Sigmoid.)

As SELU + BatchNorm (BS) layers keep mean and variance close to (0, 1), we get an unbiased estimator of p_{data} with contractive finite variance. These are very desirable properties from the point of view of an estimator, as we are iteratively looking for an MVU (Minimum Variance Unbiased) criterion and thus solving MSE (Minimum Square Error) among unbiased estimators. Hence, if the MVU estimator exists and the network has enough capacity to actually find the solution, given a sufficiently large sample size, by the Central Limit Theorem we can attain a Nash equilibrium.

The HDCGAN architecture is described in Figure 4. It differs from traditional DCGAN in the use of BS layers instead of ReLUs.

We observe that, when having difficulty in training DCGAN, it is always better to use a fixed learning rate and instead increase the batch size. This is because having more diversity in training gives a steadily diminishing loss and better generalization.
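A minimal PyTorch sketch of how SELU + BatchNorm (BS) blocks can be stacked into a DCGAN-style Generator and Discriminator in the spirit of Figure 4. The channel widths and the exact assembly are assumptions, not the released implementation; the number of stride-2 blocks follows the rule noted in Section 4.2 (4 layers for a 32-pixel input, 7 for 256, i.e. roughly log2(size) − 1).

# Sketch of SELU + BatchNorm (BS) blocks stacked into a DCGAN-style G and D.
# Channel widths and the overall assembly are illustrative assumptions.
import math
import torch.nn as nn

def bs_up(c_in, c_out, stride=2):
    # ConvTranspose -> BatchNorm -> SELU ("BS" block, generator side)
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, 4, stride, 1 if stride == 2 else 0, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SELU(inplace=True))

def bs_down(c_in, c_out):
    # Conv -> BatchNorm -> SELU ("BS" block, discriminator side)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 4, 2, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SELU(inplace=True))

def make_generator(nz=100, ngf=64, image_size=512):
    n_blocks = int(math.log2(image_size)) - 1            # e.g. 4 for 32, 7 for 256
    layers, c = [bs_up(nz, ngf * 8, stride=1)], ngf * 8  # 4x4 stride-1 stem: 1x1 -> 4x4
    for _ in range(n_blocks - 2):                        # stride-2 up-sampling blocks
        layers.append(bs_up(c, max(c // 2, ngf)))
        c = max(c // 2, ngf)
    layers += [nn.ConvTranspose2d(c, 3, 4, 2, 1, bias=False), nn.Tanh()]
    return nn.Sequential(*layers)

def make_discriminator(ndf=64, image_size=512):
    n_blocks = int(math.log2(image_size)) - 1
    layers, c = [nn.Conv2d(3, ndf, 4, 2, 1, bias=False), nn.SELU(inplace=True)], ndf
    for _ in range(n_blocks - 2):                        # stride-2 down-sampling blocks
        layers.append(bs_down(c, min(c * 2, ndf * 8)))
        c = min(c * 2, ndf * 8)
    layers += [nn.Conv2d(c, 1, 4, 1, 0, bias=False), nn.Sigmoid()]
    return nn.Sequential(*layers)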
To aid learning, noise following a Normal N(0, 1) is added to both the inputs of D and G. We see that this helps overcome mode saturation and collapse, whereas it does not change the distribution of the original data.

We empirically show that the use of BS induces SNN properties in the GAN structure, and thus makes learning highly robust, even in the stark presence of noise and perturbations. This behavior can be observed when the zero-sum game stabilizes and the errors in D and G jointly diminish, Figure 9. Comparison to traditional DCGAN, Wasserstein GAN [Arjovsky et al. 2017] and WGAN-GP [Gulrajani et al. 2017] is not possible, as to date the majority of former methods, such as [Denton et al. 2015], cannot generate recognizable results at image size 512×512 in a 24GB GPU memory setting.

Thus, HDCGAN pushes up state-of-the-art results, beating all former DCGAN-based architectures, and shows that, under the right circumstances, BS can solve the min-max game efficiently.

4.2 Glasses

We introduce here a key technique behind the success of HDCGAN. Once we have a good convergence mechanism for large input samples, that is, a concatenation of BS layers, we observe that we can arbitrarily improve the final results of the GAN structure by the use of a Magnifying Glass approach. Assuming our input size is N × M, we can enlarge it by a constant factor to ζ1 N × ζ2 M, which we call telescope, and then feed it into the network, maintaining the size of the convolutional filters untouched. This simple procedure works similarly to how contact lenses correct or assist defective eyesight in humans, and empowers the GAN structure to appreciate the inner properties of the samples.

Note that as the input gets bigger, so does the neural network. That is, the number of layers is implicitly set by the image size; see the up-sampling and down-sampling blocks in Figure 4. For example, for an input size of 32 we have 4 layers, while for an input size of 256 we have 7 layers.

Figure 5. Glasses on a set of samples from CelebA. HDCGAN introduces the use of a Magnifying Glass approach, enlarging the input size by a telescope ζ.

We can empirically observe that BS layers together with Glasses induce high capacity into the GAN structure, so that a Nash equilibrium can be reached. That is to say, the generator draws samples from p_{data}, which is the distribution of the data, and the discriminator is not able to distinguish between them, D(x) = 1/2 ∀x.
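One possible reading of the Glasses mechanism in code, an illustration rather than the authors' implementation: the N × M input is enlarged by the telescope ζ = (ζ1, ζ2) before entering the network, while the convolutional filters keep their original size. The bilinear interpolation mode is an assumption; the added N(0, 1) input noise follows the description in Section 4.1.

# Sketch of the "Glasses" idea: enlarge an N x M input by a telescope factor zeta
# before feeding it to the network, keeping the filter sizes untouched.
import torch
import torch.nn.functional as F

def glasses(x, zeta=(2, 2)):
    """Enlarge the input by the telescope zeta; downstream filters are unchanged."""
    n, m = x.shape[-2], x.shape[-1]
    return F.interpolate(x, size=(zeta[0] * n, zeta[1] * m),
                         mode="bilinear", align_corners=False)

batch = torch.rand(8, 3, 128, 128)             # an N x M input batch
enlarged = glasses(batch, zeta=(2, 2))          # now 256 x 256
noisy = enlarged + torch.randn_like(enlarged)   # N(0,1) noise on D/G inputs (Section 4.1)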

5 EMPIRICAL ANALYSIS

We build on DCGAN and extend the framework to train with high-resolution images using PyTorch. Our experiments are conducted with a fixed learning rate of 0.0002 and the ADAM solver [Kingma and Ba 2015], with batch size 32, 512×512 samples, and the number of filters of G and D equal to 64.

In order to test generalization capability, we train HDCGAN on the newly introduced Curtó, CelebA and CelebA-hq.

Technical Specifications: 2 × NVIDIA Titan X, Intel Core i7-5820k @ 3.30GHz.

5.1 Curtó

The results after 150 epochs are shown in Figure 6. We can see that HDCGAN captures the underlying features that represent faces and does not merely memorize training examples. We retrieve nearest neighbors to the generated images in Figure 7 to illustrate this effect.

Figure 6. HDCGAN Example Results. Dataset of Curtó & Zarza. 150 epochs of training. Image size 512×512.

Figure 7. Nearest Neighbors. Dataset of Curtó & Zarza. Generated samples in the first row and their five nearest neighbors in training (rows 2-6).
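The configuration reported at the start of Section 5 (fixed learning rate 0.0002, ADAM, batch size 32, 512×512 samples, 64 filters for G and D) maps roughly to the following PyTorch setup; the dataset path, the Adam betas and the use of ImageFolder are assumptions.

# Rough setup matching the reported configuration; paths, betas and ImageFolder
# usage are assumptions for illustration only.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # match the Tanh output range
])
dataset = datasets.ImageFolder("data/celeba", transform=transform)  # or Curtó / CelebA-hq
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

netG = make_generator(nz=100, ngf=64, image_size=512)   # sketches from Section 4.1
netD = make_discriminator(ndf=64, image_size=512)
optG = torch.optim.Adam(netG.parameters(), lr=0.0002, betas=(0.5, 0.999))
optD = torch.optim.Adam(netD.parameters(), lr=0.0002, betas=(0.5, 0.999))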
5.2 CelebA

CelebA is a large-scale dataset with 202,599 celebrity faces. It mainly contains frontal portraits and is particularly biased towards groups of ethnicity White. The fact that it presents very controlled illumination settings and good photo resolution makes it a considerably easier problem than Curtó. The results after 19 epochs of training are shown in Figure 8.

Figure 8. HDCGAN Example Results. CelebA. 19 epochs of training. Image size 512×512. The network swiftly learns a clear pattern of the face.

In Figure 9 we can observe that BS stabilizes the zero-sum game, where the errors in D and G concomitantly diminish. To show the validity of our method, we enclose Figure 10, presenting a large number of samples for epoch 39. We also attach a zoomed-in example to appreciate the quality and size of the generated samples, Figure 11. Failure cases can be observed in Figure 12.

Figure 9. HDCGAN on CelebA. Error in Discriminator (top) and Error in Generator (bottom). 19 epochs of training. (TensorBoard scalar plots of errD and errG over roughly 240k iterations.)

Figure 10. HDCGAN Example Results. CelebA. 39 epochs of training. Image size 512×512. The network generates distinctly accurate and assorted faces, including exhaustive details.

Figure 11. HDCGAN Example Result. CelebA. 39 epochs of training. Image size 512×512. 27% of full-scale image.

Figure 12. HDCGAN Example Results. CelebA. 39 epochs of training. Image size 512×512. Failure cases. The number of failure cases declines over time, and when present, they are of a more meticulous nature.
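The error curves of Figure 9 appear to be TensorBoard scalar plots; a minimal, assumed sketch of logging errD and errG is given below (gan_step, loader, networks and optimizers refer to the earlier sketches).

# Minimal sketch of logging errD / errG to TensorBoard, as in Figure 9.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/hdcgan_celeba")
global_step = 0
for epoch in range(19):                          # 19 epochs, as in Figure 9
    for real, _ in loader:                       # loader from the setup sketch above
        errD, errG = gan_step(netG, netD, real, nz=100, optD=optD, optG=optG)
        writer.add_scalar("errD", errD, global_step)
        writer.add_scalar("errG", errG, global_step)
        global_step += 1
writer.close()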
Besides, to illustrate how fundamental our approach is, we enlarge Curtó with 4,239 unlabeled synthetic images generated by HDCGAN on CelebA; a random set can be seen in Figure 13.

Figure 13. HDCGAN Synthetic Images. A set of random samples.

5.3 CelebA-hq

[Karras et al. 2018] introduces CelebA-hq, a set of 30,000 high-definition images to improve training on CelebA. A set of samples generated by HDCGAN on CelebA-hq can be seen in Figures 1, 14 and 15.

Figure 14. HDCGAN Example Results. CelebA-hq. 229 epochs of training. Image size 512×512. The network generates superior faces, with great attention to detail and quality.

Figure 15. HDCGAN Example Result. CelebA-hq. 229 epochs of training. Image size 512×512. 27% of full-scale image.

To exemplify that the model is generating new bona fide instances instead of memorizing samples from the training set, we retrieve nearest neighbors to the generated images in Figure 16.

Figure 16. Nearest Neighbors. CelebA-hq. Generated samples in the first row and their five nearest neighbors in training (rows 2-6).

6 ASSESSING THE DISCRIMINABILITY AND QUALITY OF GENERATED SAMPLES

We build on previous image similarity metrics to quantitatively evaluate generated samples of generative models. The most effective of these is multi-scale structural similarity (MS-SSIM) [Odena et al. 2017]. We make the comparison at resized image size 128×128 on CelebA. MS-SSIM results are averaged over 10,000 pairs of generated samples. Table 2 shows HDCGAN significantly improves state-of-the-art results.

Table 2. Multi-scale structural similarity (MS-SSIM) results on CelebA at resized image size 128×128. Lower is better.

Method | MS-SSIM
[Gulrajani et al. 2017] | 0.2854
[Karras et al. 2018] | 0.2838
HDCGAN | 0.1978

We monitor MS-SSIM scores across several epochs, averaging over 10,000 pairs of generated images, to see the temporal performance, Figure 17. HDCGAN improves the quality of the samples while increasing the diversity of the generated distribution.
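A sketch of the pairwise MS-SSIM protocol described above, assuming the third-party pytorch-msssim package (the text does not state which implementation was used): sample pairs of generated images, resize them to 128×128 and average the per-pair scores.

# Sketch of averaging MS-SSIM over pairs of generated samples (Table 2 protocol).
# The pytorch-msssim package and the sampling details are assumptions.
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim

@torch.no_grad()
def mean_ms_ssim(netG, nz=100, n_pairs=10000, batch=50, device="cpu"):
    scores = []
    for _ in range(n_pairs // batch):
        z1 = torch.randn(batch, nz, 1, 1, device=device)
        z2 = torch.randn(batch, nz, 1, 1, device=device)
        a = F.interpolate((netG(z1) + 1) / 2, size=128)   # Tanh output -> [0, 1]
        b = F.interpolate((netG(z2) + 1) / 2, size=128)
        # win_size=7 so the five MS-SSIM scales fit a 128-pixel image
        scores.append(ms_ssim(a, b, data_range=1.0, size_average=True, win_size=7))
    return torch.stack(scores).mean().item()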
Figure 17. MS-SSIM Scores on CelebA across several epochs. Results are averaged from 10,000 pairs of generated images from epoch 19 to 74. Comparison is made at resized image size 128×128. Affine interpolation is shown in red.

In [Heusel et al. 2017] they propose to evaluate GANs using the Fréchet Inception Distance (FID), which assesses the similarity between two distributions by the difference of two Gaussians. We make the comparison at resized image size 64×64 on CelebA. Results are computed from 10,000 512×512 generated samples from epochs 36 to 52, resized to image size 64×64, yielding a value of 8.44, Table 3, clearly outperforming currently reported scores for DCGAN architectures [Wu et al. 2018].

Table 3. Fréchet Inception Distance on CelebA at resized image size 64×64. Lower is better.

Method | Fréchet Inception Distance
[Karras et al. 2018] | 16.3
[Wu et al. 2018] | 16.0
HDCGAN | 8.44
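The FID protocol of Table 3 can be approximated with off-the-shelf tooling; the sketch below assumes the torchmetrics implementation, which is not necessarily what was used here, and feeds 10,000 real and generated images resized to 64×64.

# Sketch of the FID protocol of Table 3 using torchmetrics (an assumption;
# the text does not state which FID implementation was used).
import torch
import torch.nn.functional as F
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def fid_64(netG, real_loader, nz=100, n_fake=10000, device="cpu"):
    fid = FrechetInceptionDistance(feature=2048).to(device)

    def to_uint8_64(x):                        # resize to 64x64, scale to [0, 255]
        x = F.interpolate(x, size=64)
        return (x.clamp(0, 1) * 255).to(torch.uint8)

    for real, _ in real_loader:                # real images, assumed in [0, 1]
        fid.update(to_uint8_64(real.to(device)), real=True)

    done = 0
    while done < n_fake:
        z = torch.randn(50, nz, 1, 1, device=device)
        fake = (netG(z) + 1) / 2               # Tanh output -> [0, 1]
        fid.update(to_uint8_64(fake), real=False)
        done += 50
    return fid.compute().item()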
7 DISCUSSION

In this paper, we propose High-resolution Deep Convolutional Generative Adversarial Networks (HDCGAN) by stacking SELU + BatchNorm (BS) layers. The proposed method generates high-resolution images (e.g. 512×512) in circumstances where the majority of former methods fail, and it exhibits a steady and smooth training mechanism. It also introduces Glasses, the notion that enlarging the input image by a telescope ζ while keeping all convolutional filters unchanged can arbitrarily improve the final generated results. HDCGAN is the current state-of-the-art in synthetic image generation on CelebA (MS-SSIM 0.1978 and Fréchet Inception Distance 8.44).

Further, we present a bias-free dataset of faces containing well-balanced ethnic groups, Curtó & Zarza, that poses a very difficult challenge and is rich in learning attributes to sample from. Moreover, we enhance Curtó with 4,239 unlabeled synthetic images generated by HDCGAN, making it the first GAN-augmented dataset of faces.

REFERENCES

A. Antoniou, A. Storkey, and H. Edwards. 2018. Data Augmentation Generative Adversarial Networks. ICLR (2018).
M. Arjovsky and L. Bottou. 2017. Towards Principled Methods for Training Generative Adversarial Networks. ICLR (2017).
M. Arjovsky, S. Chintala, and L. Bottou. 2017. Wasserstein GAN. ICML (2017).
K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. 2017. Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks. CVPR (2017).
A. Brock, J. Donahue, and K. Simonyan. 2019. Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR (2019).
Q. Chen and V. Koltun. 2017. Photographic Image Synthesis with Cascaded Refinement Networks. ICCV (2017).
Y. Chen, W. Li, X. Chen, and L. Gool. 2019. Learning Semantic Segmentation from Synthetic Data: A Geometrically Guided Input-Output Adaptation Approach. CVPR (2019).
J. D. Curtó, I. C. Zarza, F. Yang, A. Smola, F. Torre, C. W. Ngo, and L. Gool. 2017. McKernel: A Library for Approximate Kernel Expansions in Log-linear Time. arXiv:1702.08159 (2017).
E. Denton, S. Chintala, A. Szlam, and R. Fergus. 2015. Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks. NIPS (2015).
A. Dosovitskiy and T. Brox. 2016. Generating Images with Perceptual Similarity Metrics based on Deep Networks. NIPS (2016).
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative Adversarial Networks. NIPS (2014).
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. 2017. Improved Training of Wasserstein GANs. NIPS (2017).
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS (2017).
S. Ioffe and C. Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML (2015).
T. Karras, T. Aila, S. Laine, and J. Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR (2018).
D. P. Kingma and J. Ba. 2015. Adam: A Method for Stochastic Optimization. ICLR (2015).
D. P. Kingma and M. Welling. 2014. Auto-Encoding Variational Bayes. ICLR (2014).
G. Klambauer, T. Unterthiner, and A. Mayr. 2017. Self-Normalizing Neural Networks. NIPS (2017).
C. Li, K. Xu, J. Zhu, and B. Zhang. 2017. Triple Generative Adversarial Nets. NIPS (2017).
Z. Liu, P. Luo, X. Wang, and X. Tang. 2015. Deep Learning Face Attributes in the Wild. ICCV (2015).
S. Lombardi, J. Saragih, T. Simon, and Y. Sheikh. 2018. Deep Appearance Models for Face Rendering. SIGGRAPH (2018).
L. Mescheder, A. Geiger, and S. Nowozin. 2018. Which Training Methods for GANs Do Actually Converge? ICML (2018).
L. Mescheder, S. Nowozin, and A. Geiger. 2017. The Numerics of GANs. NIPS (2017).
A. Odena, C. Olah, and J. Shlens. 2017. Conditional Image Synthesis With Auxiliary Classifier GANs. ICML (2017).
T. Portenier, Q. Hu, A. Szabó, S. A. Bigdeli, P. Favaro, and M. Zwicker. 2018. FaceShop: Deep Sketch-based Face Image Editing. SIGGRAPH (2018).
A. Radford, L. Metz, and S. Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR (2016).
A. Romero, P. Arbeláez, L. Gool, and R. Timofte. 2018. SMIT: Stochastic Multi-Label Image-to-Image Translation. arXiv:1812.03704 (2018).
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. 2016. Improved Techniques for Training GANs. NIPS (2016).
S. Sankaranarayanan, Y. Balaji, A. Jain, S. Lim, and R. Chellappa. 2018. Learning from Synthetic Data: Addressing Domain Shift for Semantic Segmentation. CVPR (2018).
C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. 2017. Amortised MAP Inference for Image Super-Resolution. ICLR (2017).
A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. 2016. Pixel Recurrent Neural Networks. ICML (2016).
T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. 2018a. Video-to-Video Synthesis. NIPS (2018).
T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro. 2018b. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. CVPR (2018).
X. Wang and A. Gupta. 2016. Generative Image Modeling using Style and Structure Adversarial Networks. ECCV (2016).
X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang. 2018. Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect. ICLR (2018).
J. Wu, Z. Huang, J. Thoma, D. Acharya, and L. Gool. 2018. Wasserstein Divergence for GANs. ECCV (2018).
J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. 2016. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. NIPS (2016).
X. Xiong and F. Torre. 2013. Supervised Descent Method and its Application to Face Alignment. CVPR (2013).
C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. 2017. High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis. CVPR (2017).
J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. 2018. Generative Image Inpainting with Contextual Attention. CVPR (2018).
H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. 2017. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. CVPR (2017).
J. Zhao, M. Mathieu, and Y. LeCun. 2017. Energy-based Generative Adversarial Network. ICLR (2017).
J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. 2016. Generative Visual Manipulation on the Natural Image Manifold. ECCV (2016).
J. Zhu, T. Park, P. Isola, and A. A. Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV (2017).
