GLeaD: Improving GANs with A Generator-Leading Task
Qingyan Bai1∗  Ceyuan Yang2  Yinghao Xu3∗  Xihui Liu4  Yujiu Yang1†  Yujun Shen5
1 Tsinghua Shenzhen International Graduate School, Tsinghua University   2 Shanghai AI Laboratory
3 The Chinese University of Hong Kong   4 The University of Hong Kong   5 Ant Group
∗ This work was done during an internship at Ant Group.
† Corresponding author. This work was partly supported by the National Natural Science Foundation of China (Grant No. U1903213) and the Shenzhen Science and Technology Program (JCYJ20220818101014030).

Abstract

A generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D), where D is asked to differentiate whether an image comes from real data or is produced by G. Under such a formulation, D plays as the rule maker and hence tends to dominate the competition. Towards a fairer game in GANs, we propose a new paradigm for adversarial training, which makes G assign a task to D as well. Specifically, given an image, we expect D to extract representative features that can be adequately decoded by G to reconstruct the input. That way, instead of learning freely, D is urged to align with the view of G for domain classification. Experimental results on various datasets demonstrate the substantial superiority of our approach over the baselines. For instance, we improve the FID of StyleGAN2 from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. We believe that the pioneering attempt presented in this work could inspire the community with better designed generator-leading tasks for GAN improvement. Project page is at https://ezioby.github.io/glead/.

Figure 1. Concept diagram of our proposed generator-leading task (bottom), as complementary to the discriminator-leading task in the original formulation of GANs (upper). D is required to extract representative features that can be adequately decoded by G to reconstruct the input.

1. Introduction

Generative adversarial networks (GANs) [18] have significantly advanced image synthesis, which is typically formulated as a two-player game. The generator (G) aims at synthesizing realistic data to fool the discriminator (D), while D pours attention on distinguishing the synthesized samples from the real ones. Ideally, it would come to an optimal solution where G can recover the real data distribution, and D can hardly tell the source of images anymore [18].

However, the competition between G and D seems to be unfair. Specifically, on the one hand, D acts as a player in this adversarial game by measuring the discrepancy between the real and synthesized samples. But on the other hand, the learning signals (i.e., gradients) of G are only derived from D, making the latter naturally become a referee in the competition. Such a formulation easily allows D to rule the game. Massive experimental results could serve as supporting evidence for this analysis. For instance, in practice, D can successfully distinguish real and fake samples from a pretty early stage of training and is able to maintain its advantage during the entire training process [64]. Accordingly, the capability of the discriminator usually more or less determines the generation performance. For instance, a discriminator that has over-fitted the whole training set always results in synthesis with limited diversity and poor visual quality [33]. Following this philosophy, many attempts [28, 29, 38, 40, 57, 71] have been made for discriminator improvement.

This work offers a different perspective on GAN improvement. In particular, we propose a new adversarial paradigm where G is assigned a new role, i.e., playing as the referee as well to guide D. Recall that producing realistic images usually requires G to generate all-level concepts
adequately. Nevertheless, due to the asymmetrical status of G and D, D is able to tell apart the real and synthesized data merely from limited discriminative regions [64]. We therefore would like to encourage D to extract as much information from an image as possible, such that the features learned by D could be rendered back to the input with a frozen G, as in Fig. 1. That is, D is enforced to align with the view of G (i.e., focusing on the entire image region) instead of learning freely for domain classification.

Our method is termed GLeaD because we propose to assign D a generator-leading task. In particular, given a real or synthesized image, the discriminator delivers extra spatial representations and latent representations that are then fed into a frozen generator to reproduce the original image. A reconstruction loss (perceptual loss in practice) penalizes the difference between the input image and the reconstructed image and derives gradients for updating the parameters of the discriminator. Moreover, comprehensive experiments are conducted on various datasets, demonstrating the effectiveness of the proposed method. Particularly, our method improves Fréchet Inception Distance (FID) [23] from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. We also improve Recall [39] by a large margin (56%), from 0.25 to 0.39, on LSUN Bedroom. In addition, thorough ablation studies suggest that applying the generator-leading task to require D to reconstruct only real or only fake images already boosts synthesis quality, while a larger improvement is gained when both real and synthesized images are incorporated. Last but not least, experimental results in Sec. 4 reveal that our method can indeed boost the fairness between G and D as well as improve the spatial attention of D.

2. Related Work

Generative adversarial networks. As one of the popular paradigms for generative models, generative adversarial networks (GANs) [18] have significantly advanced image synthesis [9, 15, 21, 31, 34–37, 42, 45, 50], as well as various tasks like image manipulation [22, 48, 55, 61, 67, 76], image translation [11, 27, 43, 60, 63, 78], image restoration [2, 20, 47, 65], 3D-aware image synthesis [10, 19, 56, 69, 75], and talking head generation [24, 66, 72]. In the traditional setting of GAN training, D serves as the referee of synthesis quality and thus tends to dominate the competition. As a result, in practice D can always tell the real and fake samples apart, and the equilibrium between G and D turns out to be hard to achieve as expected [5, 18]. Some earlier works [5, 7, 17] try to boost GAN equilibrium to stabilize GAN training and improve synthesis quality. Recently, EqGAN-SA [64] proposes to boost GAN equilibrium by raising the spatial awareness of G: concretely, the spatial attention of D is utilized to supervise and strengthen G. In contrast, our method, for the first time, forces D to fulfill a reconstruction task provided by G without improving the capacity of G. To learn useful feature representations with weak supervision, BiGAN [14] proposes to learn an encoder that projects real samples back into the GAN latent space in addition to the original G and D, and D is required to discriminate samples jointly in data and latent space. In this way, the well-trained encoder could serve as a feature extractor learned in a weakly-supervised manner. Differently, we directly adopt D to extract features of both real and synthesized samples and reconstruct them with G for a fairer setting, instead of for representation learning.

Improving GANs with an enhanced discriminator. Considering that D largely dominates the competition with G, many prior works attempt to boost synthesis quality by improving D. Jolicoeur-Martineau [29] employs a relativistic discriminator to estimate the probability that the given real data is more realistic than fake data, for better training stability and synthesis quality. Yang et al. [71] propose to improve the representation of D by additionally requiring D to distinguish every individual real and fake image. Kumari et al. [38] propose to ensemble selected backbones pre-trained on visual understanding tasks with the original D as a strengthened D. The effect of discriminators with various capacities on training a generator is also investigated in [70]. Based on the finding of OASIS [59] that dense supervision such as segmentation labels could improve the representation of D in conditional synthesis, GGDR [40] leverages the feature maps of G to supervise the output features of D for unconditional synthesis. However, unlike the discrimination process, G does not pass any gradients back to D in GGDR. Contrasted with GGDR, our method aims at a fairer setting rather than at gaining more supervision for D. Also, our D receives gradients from G, leading to fairer competition.

Image reconstruction with GANs. GAN inversion [68] aims to reconstruct an input image with a pre-trained GAN generator. Mainstream GAN inversion methods predict desirable latent codes corresponding to the images by learning an encoder [3, 49, 52, 62, 77] or through optimization [1, 12, 20, 46, 51, 53]. Most works choose to predict latent codes in the native latent spaces of StyleGAN such as Z, W, or W+. Recently, some works [6, 30] extend the latent space or fine-tune [4, 13] the generator for better reconstruction. Note that, although our method could achieve image reconstruction with the well-trained D and G, our motivation lies in boosting generative quality by making G assign the generator-leading reconstruction task to D, rather than in the reconstruction performance itself. Another significant difference is that we adopt D, which is trained simultaneously with G, to extract representative features for reconstruction, while in GAN inversion the feature extractor (namely, the encoder) is learned on top of a pre-trained G.
Figure 2. Illustration of how a generator-leading task is incorporated into GAN training from the perspective of discriminator optimization. Given an image (i.e., either real or synthesized) as the input, D is asked to extract representative features from the input in addition to predicting a realness score. These features, including spatial features f and global latent codes w, are sent to the fixed G to reconstruct the inputs of D. The perceptual loss is adopted to penalize the difference between the reconstruction and the inputs. The sub-figure on the right demonstrates the specific architecture of our D: a decoder h composed of a series of 1 × 1 convolution layers is attached to the original backbone Denc to extract f and w. This training process is described in detail in Sec. 3.2.
As mentioned before, it seems unfair that a discriminator (D) competes against a generator (G) while D not only joins the two-player game as a player but also guides the learning of G, namely, serves as a referee for G. Sec. 3.1 presents the vanilla formulation. To chase a fairer game, Sec. 3.2 introduces a new adversarial paradigm, GLeaD, that assigns a new generator-leading task to D, which in turn is judged by G.

3.1. Preliminary

A GAN usually consists of two components: a generator G(·) and a discriminator D(·). The former aims at mapping a random latent code z to an image, while the latter learns to distinguish the synthesized image G(z) from the real one x. These two networks compete with each other and are jointly optimized with the following learning objectives:

    L_G = −E_{z∈Z}[log(D(G(z)))],                                  (1)
    L_D = −E_{x∈X}[log(D(x))] − E_{z∈Z}[log(1 − D(G(z)))],         (2)

where Z and X denote a pre-defined latent distribution and the data distribution, respectively.

Ideally, the optimal solution is that G manages to reproduce the realistic data distribution while D is not able to tell the real and synthesized samples apart [18]. However, during the iterative training of the generator and discriminator, there exists an unfair competition since D plays the player and referee roles simultaneously. Thus the ideal solution is hard to achieve in practice [16, 64].
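For reference, Eq. (1) and Eq. (2) correspond to the following minimal PyTorch-style sketch, assuming D returns raw realness logits; it is an illustration rather than the actual training code:

import torch.nn.functional as F

def g_loss(G, D, z):
    # Eq. (1): L_G = -E[log D(G(z))], with D returning a raw logit.
    return F.softplus(-D(G(z))).mean()                 # -log(sigmoid(logit))

def d_loss(G, D, x, z):
    # Eq. (2): L_D = -E[log D(x)] - E[log(1 - D(G(z)))].
    loss_real = F.softplus(-D(x)).mean()               # -log(sigmoid(logit))
    loss_fake = F.softplus(D(G(z).detach())).mean()    # -log(1 - sigmoid(logit)); no gradient to G
    return loss_real + loss_fake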
Considering the unfair division of labor in this two-player game, we turn to assign a new role to G so that it can supervise the learning of D in turn. Recall that the target of generation is to produce realistic samples, which usually requires all concepts to be well synthesized. However, it has been suggested [64] that the most discriminative regions of given real or synthesized images are sufficient for domain classification. Therefore, we propose a generator-leading task that enforces D to extract as many representative features as possible, retaining adequate information to reconstruct a given image through a frozen generator, as described in Fig. 2 and Algorithm 1. Note that we empirically validate in Sec. 4.3 that requiring D to extract spatial representations is essential to improve synthesis quality. Taking StyleGAN2 [37] as an example, we introduce the detailed instantiations in the following.

Extracting representations through D. The original D of StyleGAN is a convolutional network composed of a series of downsampling convolution layers. For convenience, the backbone of the original D (namely, the parts of D except the final head predicting the realness score) is denoted as Denc in the following. To predict the representative features of a given image while retaining various information from low level to high level, we additionally affiliate Denc with a decoder h(·) to construct our new D with a multi-level feature pyramid [41]. Based on such a feature hierarchy ending with a convolutional head, spatial representations f and latent representations w are predicted respectively. In particular, the newly-attached parts over the backbone adopt convolution layers with a kernel size of 1 × 1. This is because the crucial part in D that influences the synthesis quality of G is the backbone.
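As a shape-level sketch of this design (illustrative module names, not the released implementation), the extended discriminator can be organized as a backbone that exposes its intermediate feature maps, a realness head on the final 4 × 4 features, and the decoder h:

import torch.nn as nn

class GLeaDDiscriminator(nn.Module):
    # Illustrative wrapper: a StyleGAN2-style backbone, a realness head, and the decoder h.
    def __init__(self, backbone, realness_head, decoder):
        super().__init__()
        self.backbone = backbone            # D_enc: returns multi-level feature maps
        self.realness_head = realness_head  # scoring head operating on the 4x4 features
        self.decoder = decoder              # h: 1x1-convolution feature-pyramid decoder

    def forward(self, img):
        feats = self.backbone(img)          # e.g. a dict keyed by resolution {32: ..., 16: ..., 8: ..., 4: ...}
        score = self.realness_head(feats[4])
        f, w = self.decoder(feats)          # spatial representations f and latent codes w
        return score, f, w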
Algorithm 1 GAN training with the proposed generator-leading task.
Input: G and our D (including h), initialized with random parameters. Training data {xi}.
Hyperparameters: T: maximum number of training iterations.
1: for t = 1 to T do
2:   Sample z ∼ P(Z)            ▷ Begin training of G.
3:   Update G with Eq. (1)
4:   Sample z ∼ P(Z)            ▷ Begin training of D.
5:   Reconstruct G(z) with Eq. (4) and Eq. (6)
6:   Sample x ∼ {xi}
7:   Reconstruct x with Eq. (3) and Eq. (5)
8:   Discriminate images by D(G(z)) and D(x)
9:   Update D with Eq. (2), Eq. (7), and Eq. (9)
10: end for
Output: G with the best training-set FID.

Here, the perceptual loss [74] L_per is adopted as the loss function:

    L_rec = λ1 L_per(x, x′_real) + λ2 L_per(G(z), x′_fake),        (7)

where λ1 and λ2 denote the weights of the two terms. Note that setting one weight to zero is identical to disabling the reconstruction task on real/synthesized images, which may deteriorate the synthesis performance to some extent. Our final algorithm is summarized in Algorithm 1.

Full objective. With the updated D architecture and the generator-leading task, the discriminator and generator are jointly optimized with

    L′_G = L_G,                                                     (8)
    L′_D = L_D + L_rec.                                             (9)
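To make Eq. (7) and Eq. (9) concrete, a simplified discriminator update is sketched below. The names are assumptions rather than the released code: D is assumed to return a realness logit together with (f, w), G.synthesis is a placeholder for decoding (f, w) with the fixed generator, perceptual stands for a perceptual metric such as [74], and keeping G frozen is assumed to be handled by stepping only D's optimizer:

import torch
import torch.nn.functional as F

def d_step(G, D, perceptual, x, z, lambda1=10.0, lambda2=3.0):
    # One discriminator update combining Eq. (2) with the reconstruction term of Eq. (7).
    with torch.no_grad():
        fake = G(z)                                  # synthesized image, treated as input data for D
    logit_real, f_real, w_real = D(x)
    logit_fake, f_fake, w_fake = D(fake)
    adv = F.softplus(-logit_real).mean() + F.softplus(logit_fake).mean()   # Eq. (2)
    # The frozen G decodes (f, w) back into images; gradients only update D.
    x_rec_real = G.synthesis(f_real, w_real)         # placeholder call for the fixed generator
    x_rec_fake = G.synthesis(f_fake, w_fake)
    rec = lambda1 * perceptual(x, x_rec_real) + lambda2 * perceptual(fake, x_rec_fake)  # Eq. (7)
    return adv + rec                                 # Eq. (9): L'_D = L_D + L_rec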
Table 1. Comparisons on FFHQ [36], LSUN Bedroom, and LSUN Church [73]. Our method improves StyleGAN2 [37] on large datasets in terms of FID [23] and recall. P and R denote precision and recall [39]. Lower FID and higher precision and recall indicate better performance. The bold numbers indicate the best metrics for each dataset. The blue numbers in brackets indicate the improvements.

As for the loss weights, we set λ1 = 10 and λ2 = 3.

Table 2. Ablation studies on the loss weights λ1 and λ2. The numbers in bold indicate the best FID in each sub-table.
Figure 3. Synthesized images by our models respectively trained on FFHQ [36], LSUN Bedroom, and LSUN Church [73].
[Figure: realness scores predicted by D for real images (top) and fake images (bottom) throughout training (0 to 50000 Kimg), on a score axis from −1.6 to 1.6, comparing the baseline, GGDR, and ours.]
In other words, GLeaD can improve the fairness between G and D, as well as the synthesis quality. On the contrary, the effectiveness of GGDR is not brought by an improvement of fairness, which reinforces the viewpoint that, in order to boost fairness between G and D, it is necessary to pass gradients from G to D as our method does.
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[2] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. ReStyle: A residual-based StyleGAN encoder via iterative refinement. In Int. Conf. Comput. Vis., 2021.
[4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. HyperStyle: StyleGAN inversion with hypernetworks for real image editing. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Int. Conf. Mach. Learn., 2017.
[6] Qingyan Bai, Yinghao Xu, Jiapeng Zhu, Weihao Xia, Yujiu Yang, and Yujun Shen. High-fidelity GAN inversion with padding space. In Eur. Conf. Comput. Vis., 2022.
[7] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[8] Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P Breckon, and Chris G Willcocks. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In Eur. Conf. Comput. Vis., 2022.
[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019.
[10] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[11] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[12] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Trans. Neur. Network. Learn. Syst., 2018.
[13] Tan M Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son Hua. HyperInverter: Improving StyleGAN inversion via hypernetwork. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[14] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Int. Conf. Learn. Represent., 2017.
[15] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[16] Farzan Farnia and Asuman Ozdaglar. Do GANs always have Nash equilibria? In Int. Conf. Mach. Learn., 2020.
[17] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In Int. Conf. Learn. Represent., 2018.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Adv. Neural Inform. Process. Syst., 2014.
[19] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. In Int. Conf. Learn. Represent., 2022.
[20] Jinjin Gu, Yujun Shen, and Bolei Zhou. Image processing using multi-code GAN prior. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[21] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Adv. Neural Inform. Process. Syst., 2017.
[22] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering interpretable GAN controls. In Adv. Neural Inform. Process. Syst., 2020.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Adv. Neural Inform. Process. Syst., 2017.
[24] Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[25] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk. Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[26] J Stuart Hunter. The exponentially weighted moving average. Journal of Quality Technology, 18(4), 1986.
[27] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[28] Jongheon Jeong and Jinwoo Shin. Training GANs with stronger augmentations via contrastive discriminator. In Int. Conf. Learn. Represent., 2021.
[29] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. In Int. Conf. Learn. Represent., 2019.
[30] Kyoungkook Kang, Seongtae Kim, and Sunghyun Cho. GAN inversion for out-of-range images with geometric transformations. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[31] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Int. Conf. Learn. Represent., 2018.
[32] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Int. Conf. Learn. Represent., 2018.
[33] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Adv. Neural Inform. Process. Syst., 2020.
[34] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Adv. Neural Inform. Process. Syst., 2020.
[35] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Adv. Neural Inform. Process. Syst., 2021.
[36] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[37] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[38] Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for GAN training. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[39] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Adv. Neural Inform. Process. Syst., 2019.
[40] Gayoung Lee, Hyunsu Kim, Junho Kim, Seonghyeon Kim, Jung-Woo Ha, and Yunjey Choi. Generator knows what discriminator should learn in unconditional GANs. In Eur. Conf. Comput. Vis., 2022.
[41] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[42] Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis. In Int. Conf. Learn. Represent., 2021.
[43] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Adv. Neural Inform. Process. Syst., 2017.
[44] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Int. Conf. Comput. Vis., 2015.
[45] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[46] Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, and Ping Luo. Exploiting deep generative prior for versatile image restoration and manipulation. In Eur. Conf. Comput. Vis., 2020.
[47] Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, and Ping Luo. Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[48] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Int. Conf. Comput. Vis., 2021.
[49] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[50] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Int. Conf. Learn. Represent., 2016.
[51] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Int. Conf. Comput. Vis., 2019.
[52] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a StyleGAN encoder for image-to-image translation. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[53] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Trans. Graph., 2021.
[54] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[55] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of GANs for semantic face editing. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[56] Zifan Shi, Yujun Shen, Jiapeng Zhu, Dit-Yan Yeung, and Qifeng Chen. 3D-aware indoor scene synthesis with depth priors. In Eur. Conf. Comput. Vis., 2022.
[57] Zifan Shi, Yinghao Xu, Yujun Shen, Deli Zhao, Qifeng Chen, and Dit-Yan Yeung. Improving 3D-aware image synthesis with a geometry-aware discriminator. In Adv. Neural Inform. Process. Syst., 2022.
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Int. Conf. Learn. Represent., 2015.
[59] Vadim Sushko, Edgar Schönfeld, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In Int. Conf. Learn. Represent., 2021.
[60] Hao Tang, Song Bai, Li Zhang, Philip HS Torr, and Nicu Sebe. XingGAN for person image generation. In Eur. Conf. Comput. Vis., 2020.
[61] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph., 40(4):1–14, 2021.
[62] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph., 2021.
[63] Ziyu Wan, Bo Zhang, Dongdong Chen, Pan Zhang, Dong Chen, Jing Liao, and Fang Wen. Bringing old photos back to life. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[64] Jianyuan Wang, Ceyuan Yang, Yinghao Xu, Yujun Shen, Hongdong Li, and Bolei Zhou. Improving GAN equilibrium by raising spatial awareness. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[65] Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[66] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. ReenactGAN: Learning to reenact faces via boundary transfer. In Eur. Conf. Comput. Vis., 2018.
[67] Zongze Wu, Dani Lischinski, and Eli Shechtman. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[68] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN inversion: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 2022.
[69] Yinghao Xu, Sida Peng, Ceyuan Yang, Yujun Shen, and Bolei Zhou. 3D-aware image synthesis via learning structural and textural representations. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[70] Ceyuan Yang, Yujun Shen, Yinghao Xu, Deli Zhao, Bo Dai, and Bolei Zhou. Improving GANs with a dynamic discriminator. In Adv. Neural Inform. Process. Syst., 2022.
[71] Ceyuan Yang, Yujun Shen, Yinghao Xu, and Bolei Zhou. Data-efficient instance generation from instance discrimination. In Adv. Neural Inform. Process. Syst., 2021.
[72] Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. StyleHEAT: One-shot high-resolution editable talking face generation via pretrained StyleGAN. In Eur. Conf. Comput. Vis., 2022.
[73] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[74] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[75] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021.
[76] Jiapeng Zhu, Ruili Feng, Yujun Shen, Deli Zhao, Zheng-Jun Zha, Jingren Zhou, and Qifeng Chen. Low-rank subspaces in GANs. In Adv. Neural Inform. Process. Syst., 2021.
[77] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain GAN inversion for real image editing. In Eur. Conf. Comput. Vis., 2020.
[78] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.

Appendix

A. Discriminator Network Structure

Recall that our D includes a backbone Denc, a head predicting realness scores, and a decoder h for predicting representative features f and w. Taking an image of resolution 256 × 256 as an instance, the backbone Denc is first employed to extract features from the input image. The very last 4 × 4 feature map is sent to the scoring head to predict the realness score, while the multi-level feature maps are sent to the decoder h to predict the representative features adequate for G to reconstruct the original image. As described in the main paper, the representative features consist of latent codes w and the spatial representations f, which include a low-level representation and a high-level representation. Recall that these spatial representations are sent to the fixed generator to serve as the basis of the reconstruction and are modulated by the latent codes to predict the final results. We illustrate the architectures of the three aforementioned components of D in Tab. 4, Tab. 5, and Tab. 6, respectively.

Table 4. Network structure of the backbone Denc. The output size is given in the order {C × H × W}, where C, H, and W respectively denote the channel dimension, height, and width of the output.

Stage    Block                Output Size
input    −                    3 × 256 × 256
block1   1×1 Conv, 128        128 × 128 × 128
         2×3×3 Conv, 128
         1×1 Conv, 128
         Downsample
         LeakyReLU, 0.2
block2   2×3×3 Conv, 256      256 × 64 × 64
         1×1 Conv, 256
         Downsample
         LeakyReLU, 0.2
block3   2×3×3 Conv, 512      512 × 32 × 32
         1×1 Conv, 512
         Downsample
         LeakyReLU, 0.2
block4   2×3×3 Conv, 512      512 × 16 × 16
         1×1 Conv, 512
         Downsample
         LeakyReLU, 0.2
block5   2×3×3 Conv, 512      512 × 8 × 8
         1×1 Conv, 512
         Downsample
         LeakyReLU, 0.2
block6   2×3×3 Conv, 512      512 × 4 × 4
         1×1 Conv, 512
         Downsample
         LeakyReLU, 0.2
Table 5. Network structure of the decoder h predicting the low-level spatial representation, the high-level spatial representation, and the 512-channel latent codes. Note that h receives multi-level features as inputs due to its feature pyramid architecture [41]. The output size is given in the order {C × H × W}.

Stage    Block                Output Size
input    −                    512 × 32 × 32
                              512 × 16 × 16
                              512 × 8 × 8
                              512 × 4 × 4
block1   1×1 Conv, 512        512 × 8 × 8
         Upsample
block2   1×1 Conv, 512        512 × 16 × 16
         Upsample
block3   1×1 Conv, 512        512 × 32 × 32
         Upsample
block4   1×1 Conv, 3          3 × 32 × 32
         2×1×1 Conv, 512      512 × 32 × 32
         Downsample           512

Table 6. Network structure of the head predicting the realness score from the 4 × 4 backbone features.

Stage    Block                Output Size
block1   Mbstd, 1
         3×3 Conv, 512
         LeakyReLU, 0.2
         Downsample
output   FC, 512              1
         LeakyReLU, 0.2
         FC, 1
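A rough PyTorch-style counterpart of Tab. 5 is sketched below; it follows the 1 × 1-convolution, feature-pyramid design [41], while the exact arrangement of the three output heads (the 3-channel low-level map, the 512-channel high-level map, and the 512-d latent codes) is an assumption inferred from the listed shapes rather than the released code:

import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    # A feature-pyramid decoder built from 1x1 convolutions (cf. Tab. 5); layer layout is illustrative.
    def __init__(self, channels=512):
        super().__init__()
        self.up4  = nn.Conv2d(channels, channels, 1)   # block1: 4x4  -> 8x8
        self.up8  = nn.Conv2d(channels, channels, 1)   # block2: 8x8  -> 16x16
        self.up16 = nn.Conv2d(channels, channels, 1)   # block3: 16x16 -> 32x32
        self.to_low  = nn.Conv2d(channels, 3, 1)       # low-level spatial map, 3 x 32 x 32
        self.to_high = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                     nn.Conv2d(channels, channels, 1))  # 512 x 32 x 32
        self.to_w = nn.Linear(channels, channels)      # 512-d latent codes w

    def forward(self, feats):
        # feats: backbone features keyed by resolution, e.g. {4: ..., 8: ..., 16: ..., 32: ...}
        x = F.interpolate(self.up4(feats[4]), scale_factor=2) + feats[8]
        x = F.interpolate(self.up8(x), scale_factor=2) + feats[16]
        x = F.interpolate(self.up16(x), scale_factor=2) + feats[32]
        f_low, f_high = self.to_low(x), self.to_high(x)
        w = self.to_w(x.mean(dim=(2, 3)))              # global pooling -> 512-d codes
        return (f_low, f_high), w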
B. Computational Costs

We first compute the discriminator parameter counts of the baseline and our method. As shown in Tab. 7, our method merely brings 7.4% additional parameters over the baseline, owing to the proposed lightweight design of h composed of 1 × 1 convolutions. Then we compare the inference time of the discriminators on a single A6000 GPU. At last, we compare the training time: we separately train the baseline model [37] and our model with 8 A100 GPUs on LSUN Church and record how much time the training costs. From the numbers in Tab. 7, we can conclude that our method improves the synthesis quality without much additional computational burden.
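As a quick check of the reported overhead, the relative parameter count of the two discriminators can be computed with a generic utility like the one below (not code from the paper; the two modules are assumed to be already-constructed discriminators):

import torch.nn as nn

def param_overhead(d_ours: nn.Module, d_base: nn.Module) -> float:
    # Relative number of extra trainable parameters in our D compared with the baseline D.
    count = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
    return (count(d_ours) - count(d_base)) / count(d_base)

# Tab. 7 reports an overhead of roughly 7.4%, coming from the lightweight
# 1x1-convolution decoder h.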