
GLeaD: Improving GANs with A Generator-Leading Task

Qingyan Bai1∗  Ceyuan Yang2  Yinghao Xu3∗  Xihui Liu4  Yujiu Yang1†  Yujun Shen5
1 Tsinghua Shenzhen International Graduate School, Tsinghua University   2 Shanghai AI Laboratory
3 The Chinese University of Hong Kong   4 The University of Hong Kong   5 Ant Group
arXiv:2212.03752v2 [cs.CV] 7 Jun 2023

∗ This work was done during an internship at Ant Group.
† Corresponding author. This work was partly supported by the National Natural Science Foundation of China (Grant No. U1903213) and the Shenzhen Science and Technology Program (JCYJ20220818101014030).

Abstract

Generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D), where D is asked to differentiate whether an image comes from real data or is produced by G. Under such a formulation, D plays as the rule maker and hence tends to dominate the competition. Towards a fairer game in GANs, we propose a new paradigm for adversarial training, which makes G assign a task to D as well. Specifically, given an image, we expect D to extract representative features that can be adequately decoded by G to reconstruct the input. That way, instead of learning freely, D is urged to align with the view of G for domain classification. Experimental results on various datasets demonstrate the substantial superiority of our approach over the baselines. For instance, we improve the FID of StyleGAN2 from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. We believe that the pioneering attempt presented in this work could inspire the community with better designed generator-leading tasks for GAN improvement. Project page is at https://ezioby.github.io/glead/.

Figure 1. Concept diagram of our proposed generator-leading task (bottom), as complementary to the discriminator-leading task in the original formulation of GANs (upper). D is required to extract representative features that can be adequately decoded by G to reconstruct the input.

1. Introduction

Generative adversarial networks (GANs) [18] have significantly advanced image synthesis, which is typically formulated as a two-player game. The generator (G) aims at synthesizing realistic data to fool the discriminator (D), while D pours attention on distinguishing the synthesized samples from the real ones. Ideally, it would come to an optimal solution where G can recover the real data distribution, and D can hardly tell the source of images anymore [18].

However, the competition between G and D seems to be unfair. Specifically, on the one hand, D acts as a player in this adversarial game by measuring the discrepancy between the real and synthesized samples. But on the other hand, the learning signals (i.e., gradients) of G are only derived from D, making the latter naturally become a referee in the competition. Such a formulation easily allows D to rule the game. Massive experimental results could serve as supporting evidence for this theoretical analysis. For instance, in practice, D can successfully distinguish real and fake samples from a pretty early stage of training and is able to maintain its advantage throughout the entire training process [64]. Accordingly, the capability of the discriminator largely determines the generation performance. For instance, a discriminator that has over-fitted the whole training set always results in synthesis with limited diversity and poor visual quality [33]. Following this philosophy, many attempts [28, 29, 38, 40, 57, 71] have been made for discriminator improvement.

This work offers a different perspective on GAN improvement. In particular, we propose a new adversarial paradigm where G is assigned a new role, i.e., playing as the referee as well to guide D. Recall that producing realistic images usually requires G to generate all-level concepts

adequately. Nevertheless, due to the asymmetrical status of G and D, D is able to tell apart the real and synthesized data merely from limited discriminative regions [64]. We, therefore, would like to encourage D to extract as much information from an image as possible, such that the features learned by D could be rendered back to the input with a frozen G, as in Fig. 1. That is, D is enforced to align with the view of G (i.e., focusing on the entire image region) instead of learning freely for domain classification.

Our method is termed GLeaD because we propose to assign D a generator-leading task. In particular, given a real or synthesized image, the discriminator delivers extra spatial representations and latent representations that are then fed into a frozen generator to reproduce the original image. A reconstruction loss (a perceptual loss is adopted in practice) penalizes the difference between the input image and the reconstructed image and derives gradients for updating the parameters of the discriminator. Moreover, comprehensive experiments are conducted on various datasets, demonstrating the effectiveness of the proposed method. In particular, our method improves Frechet Inception Distance (FID) [23] from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. We also improve Recall [39] substantially (by 56%), from 0.25 to 0.39, on LSUN Bedroom. In addition, thorough ablation studies suggest that applying the generator-leading task to require D to reconstruct only real or only fake images already boosts synthesis quality, while a larger improvement is gained when both real and synthesized images are incorporated. Last but not least, experimental results in Sec. 4 reveal that our method can indeed boost the fairness between G and D as well as improve the spatial attention of D.

2. Related Work

Generative adversarial networks. As one of the popular paradigms for generative models, generative adversarial networks (GANs) [18] have significantly advanced image synthesis [9, 15, 21, 31, 34–37, 42, 45, 50], as well as various tasks like image manipulation [22, 48, 55, 61, 67, 76], image translation [11, 27, 43, 60, 63, 78], image restoration [2, 20, 47, 65], 3D-aware image synthesis [10, 19, 56, 69, 75], and talking head generation [24, 66, 72]. In the traditional setting of GAN training, D serves as the referee of synthesis quality and thus tends to dominate the competition. As a result, in practice D can always tell the real and fake samples apart, and the equilibrium between G and D turns out to be hard to achieve as expected [5, 18]. Some earlier works [5, 7, 17] try to boost GAN equilibrium to stabilize GAN training and improve synthesis quality. Recently, EqGAN-SA [64] proposes to boost GAN equilibrium by raising the spatial awareness of G; concretely, the spatial attention of D is utilized to supervise and strengthen G. Our method, in contrast, forces D to fulfill a reconstruction task provided by G for the first time, without improving the capacity of G. To learn useful feature representations with weak supervision, BiGAN [14] proposes to learn an encoder that projects real samples back into the GAN latent space in addition to the original G and D, and D is required to discriminate samples jointly in data and latent space. In this way, the well-trained encoder could serve as a feature extractor in a weakly-supervised manner. Differently, we directly adopt D to extract features of both real and synthesized samples and reconstruct them with G, aiming at a fairer setting instead of representation learning.

Improving GANs with an enhanced discriminator. Considering that D largely dominates the competition with G, many prior works attempt to boost synthesis quality by improving D. Jolicoeur-Martineau [29] employs a relativistic discriminator to estimate the probability that the given real data is more realistic than fake data, for better training stability and synthesis quality. Yang et al. [71] propose to improve the representation of D by additionally requiring D to distinguish every individual real and fake image. Kumari et al. [38] propose to ensemble selected backbones pre-trained on visual understanding tasks in addition to the original D as a strengthened D. The effect of varying discriminator capacity on training a generator is also investigated in [70]. Based on the finding of OASIS [59] that dense supervision such as segmentation labels could improve the representation of D in conditional synthesis, GGDR [40] leverages the feature map of G to supervise the output features of D for unconditional synthesis. However, unlike in the discrimination process, G does not pass any gradient back to D in that work. Contrasted with GGDR, our method aims at a fairer setting rather than gaining more supervision for D. Also, our D receives gradients from G, leading to fairer competition.

Image reconstruction with GANs. GAN inversion [68] aims to reconstruct the input image with a pre-trained GAN generator. Mainstream GAN inversion methods predict desirable latent codes corresponding to the images either by learning an encoder [3, 49, 52, 62, 77] or via optimization [1, 12, 20, 46, 51, 53]. Most works choose to predict latent codes in the native latent space of StyleGAN, such as Z, W, or W+. Recently, some works [6, 30] extend the latent space or fine-tune [4, 13] the generator for better reconstruction. Note that although our method could achieve image reconstruction with the well-trained D and G, our motivation lies in boosting generative quality by making G assign the generator-leading reconstruction task to D, rather than in the reconstruction performance itself. Another significant difference lies in that we adopt D, which is trained simultaneously with G, to extract representative features for reconstruction, while in GAN inversion the feature extractor (namely the encoder) is learned based on a pre-trained G.

Figure 2. Illustration of how a generator-leading task is incorporated into GAN training from the perspective of discriminator
optimization. Given an image (i.e., either real or synthesized) as the input, D is asked to extract representative features from the input
in addition to predicting a realness score. These features including spatial features f and global latent codes w are sent to the fixed G to
reconstruct the inputs of D. The perceptual loss is adopted to penalize the difference between the reconstruction and inputs. The sub-figure
on the right demonstrates the specific architecture of our D. A decoder h composed of a series of 1 × 1 convolution layers is attached to
the original backbone Denc to extract f and w. This training process is described in detail in Sec. 3.2.

3. Method

As mentioned before, it seems to be unfair that a discriminator (D) competes against a generator (G), since D does not only join the two-player game as a player but also guides the learning of G, namely serves as a referee for G. Sec. 3.1 presents the vanilla formulation. To chase a fairer game, Sec. 3.2 introduces a new adversarial paradigm, GLeaD, which assigns a new generator-leading task to D that is in turn judged by G.

3.1. Preliminary

A GAN usually consists of two components: a generator G(·) and a discriminator D(·). The former aims at mapping a random latent code z to an image, while the latter learns to distinguish the synthesized image G(z) from the real one x. These two networks compete with each other and are jointly optimized with the following learning objectives:

L_G = −E_{z∈Z}[log(D(G(z)))],   (1)
L_D = −E_{x∈X}[log(D(x))] − E_{z∈Z}[log(1 − D(G(z)))],   (2)

where Z and X denote a pre-defined latent distribution and the data distribution, respectively.

Ideally, the optimal solution is that G manages to reproduce the realistic data distribution while D is not able to tell the real and synthesized samples apart [18]. However, during the iterative training of the generator and discriminator, there exists an unfair competition since D plays the player and referee roles simultaneously. Thus the ideal solution is hard to achieve in practice [16, 64].
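As a concrete reference, below is a minimal PyTorch-style sketch of the two objectives in Eqs. (1)-(2). The softplus-based logistic form (equivalent to the log-sigmoid terms above when D outputs raw logits) and the G/D call interfaces are assumptions for illustration, not the exact StyleGAN2 training code.

```python
import torch
import torch.nn.functional as F

def g_loss(G, D, z):
    # Eq. (1): L_G = -E_z[log D(G(z))]; softplus(-x) = -log(sigmoid(x)) when D returns logits.
    return F.softplus(-D(G(z))).mean()

def d_loss(G, D, x, z):
    # Eq. (2): L_D = -E_x[log D(x)] - E_z[log(1 - D(G(z)))]
    real_term = F.softplus(-D(x)).mean()             # -log(sigmoid(D(x)))
    fake_term = F.softplus(D(G(z).detach())).mean()  # -log(1 - sigmoid(D(G(z))))
    return real_term + fake_term
```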
3.2. Generator-leading Task

Considering the unfair division of labor in this two-player game, we turn to assign a new role to G that could supervise the learning of D in turn. Recall that the target of generation is to produce realistic samples, which usually requires all concepts to be well synthesized. However, it is suggested [64] that the most discriminative regions of given real or synthesized images are sufficient for domain classification. Therefore, we propose a generator-leading task that enforces D to extract as many representative features as possible, retaining adequate information to reconstruct a given image through a frozen generator, as described in Fig. 2 and Algorithm 1. Note that we empirically validate in Sec. 4.3 that requiring D to extract spatial representations is essential for improving synthesis quality. Taking StyleGAN2 [37] as an example, we introduce the detailed instantiations in the following context.

Extracting representations through D. The original D of StyleGAN is a convolutional network composed of a series of downsampling convolution layers. For convenience, the backbone network of the original D (namely, the parts of D except the final head predicting the realness score) is denoted as Denc in the following statement. In order to predict the representative features of a given image while retaining information from low-level to high-level, we additionally affiliate Denc with a decoder h(·) to construct our new D with a multi-level feature pyramid [41]. Based on such a feature hierarchy ending with a convolutional head, spatial representations f and latent representations w are predicted, respectively. In particular, the newly-attached parts over the backbone adopt convolution layers with a kernel size of 1 × 1. This is because the crucial part of D that influences the synthesis quality of G is the backbone, and introducing too many parameters into h would encourage the optimization to focus on this reconstruction branch (decoder).

Moreover, considering the residual architecture of G, the spatial representation f consists of a low-level feature and a high-level one. More details are available in the Supplementary Material. Therefore, given a real image x or a synthesized one G(z), the corresponding representative features are obtained by:

f_real, w_real = h(Denc(x)),   (3)
f_fake, w_fake = h(Denc(G(z))).   (4)
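To make the extraction step of Eqs. (3)-(4) concrete, here is a minimal PyTorch-style sketch of a decoder h built only from 1 × 1 convolutions over a multi-level feature pyramid. The class name, the lateral-connection arrangement, and the global-average pooling used to obtain w are illustrative assumptions; the configuration actually used is given in Tab. 5 of the Appendix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoder(nn.Module):
    """Lightweight feature-pyramid decoder h using only 1x1 convolutions (illustrative sketch)."""
    def __init__(self, ch=512, w_dim=512):
        super().__init__()
        self.lat4 = nn.Conv2d(ch, ch, 1)   # lateral 1x1 convs for the 4x4 / 8x8 / 16x16 / 32x32 levels
        self.lat8 = nn.Conv2d(ch, ch, 1)
        self.lat16 = nn.Conv2d(ch, ch, 1)
        self.lat32 = nn.Conv2d(ch, ch, 1)
        self.to_f_low = nn.Conv2d(ch, 3, 1)    # low-level spatial representation (3 x 32 x 32)
        self.to_f_high = nn.Conv2d(ch, ch, 1)  # high-level spatial representation (512 x 32 x 32)
        self.to_w = nn.Conv2d(ch, w_dim, 1)    # reduced to a global latent code after pooling

    def forward(self, feats):
        # feats: dict of backbone features {32: B x 512 x 32 x 32, ..., 4: B x 512 x 4 x 4}
        x = self.lat4(feats[4])
        x = F.interpolate(x, scale_factor=2) + self.lat8(feats[8])    # 8 x 8
        x = F.interpolate(x, scale_factor=2) + self.lat16(feats[16])  # 16 x 16
        x = F.interpolate(x, scale_factor=2) + self.lat32(feats[32])  # 32 x 32
        f_low, f_high = self.to_f_low(x), self.to_f_high(x)
        w = self.to_w(x).mean(dim=(2, 3))                             # global average -> B x 512
        return (f_low, f_high), w
```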
Reconstructing images via a frozen G. For a fair comparison, the generator of the original StyleGAN2 is adopted without any modification, which stacks a series of convolutional "synthesis blocks". Notably, the StyleGAN2 generator is designed with a residual architecture, which synthesizes images progressively from a lower resolution to a higher one. For instance, the 16 × 16 output of the synthesis block at a lower resolution is first upsampled to 32 × 32, and then the 32 × 32 synthesis block only predicts the residual between the upsampled result and the desired 32 × 32 image. As mentioned before, our predicted spatial representations contain exactly two features, which serve as the basis and the residual, respectively. The latent representation is sent to the synthesis blocks to modulate the features and generate the final output, just as in [36, 37]. As such, the reconstructed images are derived from:

x′_real = G(f_real, w_real),   (5)
x′_fake = G(f_fake, w_fake),   (6)

where G is fully frozen.

Reconstruction loss. After gathering the reconstructed real and synthesized images, we can easily penalize the differences between the original images and the reconstructed ones. Here, the perceptual loss [74] L_per is adopted as the loss function:

L_rec = λ1 L_per(x, x′_real) + λ2 L_per(G(z), x′_fake),   (7)

where λ1 and λ2 denote the weights for the different terms. Note that setting one weight to zero is identical to disabling the reconstruction task on real or synthesized images, which may deteriorate the synthesis performance to some extent. Our final algorithm is summarized in Algorithm 1.

Full objective. With the updated D architecture and the generator-leading task, the discriminator and generator are jointly optimized with

L′_G = L_G,   (8)
L′_D = L_D + L_rec.   (9)

Algorithm 1 GAN training with the proposed generator-leading task.
Input: G and our D (including h), initialized with random parameters; training data {x_i}.
Hyperparameters: T: maximum number of training iterations.
1: for t = 1 to T do
2:     Sample z ∼ P(Z)    ▷ Begin training of G.
3:     Update G with Eq. (1)
4:     Sample z ∼ P(Z)    ▷ Begin training of D.
5:     Reconstruct G(z) with Eq. (4) and Eq. (6)
6:     Sample x ∼ {x_i}
7:     Reconstruct x with Eq. (3) and Eq. (5)
8:     Discriminate images by D(G(z)) and D(x)
9:     Update D with Eq. (2), Eq. (7), and Eq. (9)
10: end for
Output: G with the best training-set FID.
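To illustrate how Eqs. (7) and (9) enter one discriminator step of Algorithm 1, here is a minimal PyTorch-style sketch. The D.score/D.extract/G.synthesize interfaces, the optimizer setup, and the perceptual_loss callable (e.g., a VGG- or LPIPS-style distance standing in for L_per) are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

lambda_1, lambda_2 = 10.0, 3.0   # loss weights reported in Sec. 4.1

def d_step(G, D, x_real, z, perceptual_loss, opt_D):
    """One discriminator update combining Eq. (2) with the reconstruction term of Eqs. (7) and (9)."""
    x_fake = G(z).detach()

    # Adversarial term, Eq. (2), written in the logistic form over raw logits.
    loss_adv = F.softplus(-D.score(x_real)).mean() + F.softplus(D.score(x_fake)).mean()

    # Generator-leading term: D extracts (f, w), the frozen G decodes them back (Eqs. (3)-(6)),
    # and a perceptual distance penalizes the reconstructions (Eq. (7)). G stays frozen simply
    # because its parameters are not handed to opt_D; gradients still flow through G back into D.
    f_real, w_real = D.extract(x_real)
    f_fake, w_fake = D.extract(x_fake)
    rec_real = G.synthesize(f_real, w_real)
    rec_fake = G.synthesize(f_fake, w_fake)
    loss_rec = lambda_1 * perceptual_loss(x_real, rec_real).mean() \
             + lambda_2 * perceptual_loss(x_fake, rec_fake).mean()

    loss = loss_adv + loss_rec       # Eq. (9): L'_D = L_D + L_rec
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()
    return loss.item()
```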
4. Experiments

We conduct extensive experiments on various benchmark datasets to demonstrate the effectiveness of the proposed method and the superiority of the specific settings. The subsections are arranged as follows: Sec. 4.1 introduces our detailed experimental settings; Sec. 4.2 demonstrates the qualitative and quantitative superiority of GLeaD; Sec. 4.3 includes comprehensive ablation studies of the designed components; Sec. 4.4 visualizes the realness score curves of D to validate the improvement of fairness; finally, Sec. 4.5 and Sec. 4.6 provide qualitative reconstruction results and validate the improvement of D's spatial attention, respectively.

4.1. Experimental Setup

Datasets. We conduct experiments on FFHQ [36], consisting of 70K high-resolution portraits, for face synthesis. We also adopt the training sets of LSUN Bedroom and LSUN Church [73] for indoor and outdoor scene synthesis, which contain about 3M and 126K 256 × 256 images, respectively.

Evaluation. We mainly adopt the prevalent Frechet Inception Distance (FID) [23] for evaluation. Precision & Recall (P&R) [39] are also adopted as a supplement to FID for a more grounded evaluation. In particular, we calculate FID and P&R between all the real samples and 50K synthesized ones for experiments on FFHQ and LSUN Church, while for LSUN Bedroom we calculate FID and P&R between 50K real samples and 50K synthesized ones, because extracting features from 3M samples is rather costly.
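As a hedged illustration of this evaluation protocol (not the authors' exact pipeline), FID can be computed with an off-the-shelf implementation such as torchmetrics; the generator interface, batch size, and data-loader format below are assumptions.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def compute_fid(G, real_loader, num_fake=50_000, batch=50, device="cuda"):
    fid = FrechetInceptionDistance(feature=2048).to(device)
    for x in real_loader:                                  # real images as uint8 tensors, B x 3 x H x W
        fid.update(x.to(device), real=True)
    for _ in range(num_fake // batch):
        z = torch.randn(batch, G.z_dim, device=device)     # G.z_dim is an assumed attribute
        img = G(z)                                         # assume outputs in [-1, 1]
        img = ((img.clamp(-1, 1) + 1) * 127.5).to(torch.uint8)
        fid.update(img, real=False)
    return fid.compute().item()
```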
Other settings. For both the baselines and our models, on FFHQ we keep training until D has been shown 25M images, with mirror augmentation, while models on LSUN Church and Bedroom are trained until 50M images have been shown to D for more sufficient convergence. We adopt VGG [58] as the pre-trained feature extractor for perceptual loss calculation. As for the loss weights, we set λ1 = 10 and λ2 = 3.

Table 1. Comparisons on FFHQ [36], LSUN Bedroom, and LSUN Church [73]. Our method improves StyleGAN2 [37] on large datasets in terms of FID [23] and recall. P and R denote precision and recall [39]. Lower FID and higher precision and recall indicate better performance. The bold numbers indicate the best metrics for each dataset. The numbers in brackets indicate the improvements.

Method            FFHQ [36]                    LSUN Bedroom [73]            LSUN Church [73]
                  FID↓          P↑     R↑      FID↓          P↑     R↑      FID↓          P↑     R↑
UT [8]            6.11          0.73   0.48    -             -      -       4.07          0.71   0.45
Polarity [25]     -             -      -       -             -      -       3.92          0.61   0.39
StyleGAN2 [37]    3.79          0.68   0.44    4.30          0.59   0.25    4.04          0.58   0.40
Ours              3.24 (−0.55)  0.69   0.47    2.55 (−1.75)  0.62   0.39    2.82 (−1.22)  0.62   0.43
GGDR [40]         3.25          0.66   0.51    3.71          0.62   0.33    2.81          0.61   0.46
Ours*             2.90 (−0.35)  0.69   0.50    2.72 (−0.99)  0.62   0.37    2.15 (−0.66)  0.61   0.48

4.2. Main Results

Quantitative comparisons. In order to compare our GLeaD against prior works, e.g., UT [8], Polarity [25], and StyleGAN2 [37], we calculate FID and Precision and Recall [39] (P&R) to measure the synthesis. In particular, Precision and Recall reflect the synthesis quality and diversity to some extent. Moreover, considering that the recent work GGDR [40] also leverages G to enhance the representations of D, we further incorporate it with our method to check whether a consistent gain exists.

Tab. 1 presents the results. From the perspective of FID, our direct baseline StyleGAN2 is substantially improved with the proposed GLeaD, outperforming other approaches by a clear margin. These results strongly demonstrate the effectiveness of our GLeaD. Moreover, combined with GGDR (Ours* in the table), our GLeaD introduces further significant gains, achieving new state-of-the-art performance on various datasets. Namely, the proposed GLeaD is compatible with the recent work GGDR, which also considers improving D through G.

Regarding Precision and Recall, clear gains are also observed on multiple benchmarks. Importantly, the improvements mainly come from the Recall side, i.e., the synthesis diversity is further improved. This matches our motivation that the generator-leading task urges D to extract more representative features rather than focus on limited discriminative regions; as a result, G has to synthesize images with a variety of modes to fool D in turn. Moreover, the diversity on LSUN Bedroom is significantly improved from 0.25 to 0.39 (56%). This may imply that our GLeaD could continuously benefit from a larger-scale reconstruction task, which we leave to future studies.

Qualitative results. Fig. 3 presents samples synthesized by our GLeaD. The models are respectively trained on FFHQ, LSUN Bedroom, and LSUN Church. Obviously, all models generate images with desirable quality and coverage.

Computational costs. We evaluate the proposed model in terms of parameter amount and inference time. The specific results can be found in the Supplementary Material.

Table 2. Ablation studies on the loss weights λ1 and λ2. The numbers in bold indicate the best FID in each sub-table.

λ1    λ2    FID          λ1    λ2    FID
0     0     4.04         0     10    3.32
100   0     331          10    10    3.15
10    0     3.10         10    3     2.82
1     0     3.27         10    1     2.85

Table 3. Ablation studies on the resolution of f. The upper line indicates the resolution settings and the bottom line gives the corresponding FID. The number in bold indicates the best FID in the table.

Resolution    Baseline    1 × 1    8 × 8    16 × 16    32 × 32    64 × 64
FID           4.04        4.68     3.27     3.01       2.82       2.88

4.3. Ablation Studies

Constraint strength. Here we ablate the specific target of the proposed generator-leading task on LSUN Church. Recall that λ1 and λ2 respectively control the constraint strength when reconstructing real and fake images in Eq. (7). As shown in the left sub-table of Tab. 2, we first set λ1 = λ2 = 0 to get the baseline performance. Then we set λ2 to 0 and explore a proper λ1 for reconstructing only real images. Experiments suggest that an overly large weight like 100 makes the proposed task interfere with the adversarial training so that the model cannot converge, while 10 turns out to be a proper choice for λ1, improving FID from 4.04 to 3.10.

Figure 3. Synthesized images by our models respectively trained on FFHQ [36], LSUN Bedroom, and LSUN Church [73].
The results incorporating the reconstruction of fake images are shown in the right sub-table of Tab. 2. We first set λ1 = 0 and λ2 = 10 to validate that merely reconstructing fake images also benefits the synthesis quality. Then we search for an appropriate λ2 when the reconstruction of real images is incorporated in the task (λ1 = 10). Through the aforementioned experiments, reconstructing both real and fake images with λ1 = 10 and λ2 = 3 turns out to be the best strategy.

Resolution of f. Recall that we require D to extract spatial features f as the basis of the image reconstruction, and the predicted latent codes w modulate the later features of G to generate the reconstructed image based on f. Here we conduct ablation studies on the resolution of f on LSUN Church. As shown in Tab. 3, extracting f at a resolution of 32 × 32 brings the best synthesis quality. The 1 × 1 entry in the table indicates the setting where D only predicts latent codes w without any spatial dimension; notably, the model performance under this setting is even inferior to the baseline, suggesting the necessity of extracting spatial features.

4.4. Validation of the Fairer Game

Recall that, aiming to improve the synthesis quality through a fairer setting between G and D, we provide the generator-leading task for D to extract representative features adequate for reconstruction. In this subsection, we validate the boosted fairness through experiments. Following [64], we visualize the mean realness score predicted by the discriminators throughout the training process on LSUN Bedroom. Note that the curves are smoothed with exponentially weighted averages [26] for clearer understanding. The top of Fig. 4 shows the curves for real images while the bottom shows the score curves for synthesized images. The colors of the curves indicate the various settings for training GANs, as labeled on the right of Fig. 4.

Figure 4. Curves of realness scores predicted by various discriminators during training. The corresponding settings are labeled on the right. We separately visualize the realness scores from the discriminators of the StyleGAN2 baseline [37], GGDR [40], and the proposed method.

From the figure, it can be found that, equipped with our generator-leading task, the absolute score values of our method become smaller than those of the baseline, whereas GGDR [40] (the red curve) maintains and even enlarges the gap between the absolute values and zero compared to the baseline, even though it improves FID.

We can thus draw the conclusion that, with the aid of the generator-leading task, it becomes much more challenging for D to distinguish the real and fake samples. In other words, GLeaD improves the fairness between G and D, as well as the synthesis quality. On the contrary, the effectiveness of GGDR is not brought by an improvement of fairness, which emphasizes the viewpoint that, in order to boost fairness between G and D, it is necessary to pass gradients of G to D as in our method.
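For reference, the exponentially weighted smoothing [26] applied to the score curves of Fig. 4 can be implemented in a few lines; the smoothing factor below is an arbitrary illustrative choice, not necessarily the one used for the figure.

```python
def ewma(values, alpha=0.05):
    """Exponentially weighted moving average, used only to plot smoother score curves."""
    smoothed, acc = [], None
    for v in values:
        acc = v if acc is None else alpha * v + (1 - alpha) * acc
        smoothed.append(acc)
    return smoothed
```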

4.5. Reconstruction Results

Figure 5. Reconstruction results of real and synthesized input images. "Input" and "Rec" respectively denote the input images and the reconstruction results by our D and G.

Recall that we instantiate the generator-leading task as a reconstruction task. In this subsection, we provide reconstruction results of real and fake images with the well-trained D and G. To explore the reconstruction ability of D more accurately, we provide it with unseen real and synthesized images to extract features. These features are then fed into the corresponding G to reconstruct the images fed to D, as in the training stage. As mentioned in Sec. 4.1, we train GANs on FFHQ for the face domain and on the training set of LSUN Church for outdoor scenes. Thus, here we randomly sample real images from CelebA-HQ [32, 44] (another widely-used face dataset) and from the validation set of LSUN Church. Fake images are sampled with the generators corresponding to the tested discriminators.

As shown in Fig. 5, although some out-of-domain objects (e.g., crowds in Church) and high-frequency details (e.g., the teeth of the child) are not perfectly reconstructed, our well-trained discriminator manages to extract representative features and reproduce the input real and fake images with G. This indicates that our D learns features aligned with the domain of G, matching our motivation.

4.6. Spatial Attention Visualization for D

Figure 6. Attention heatmaps of the discriminators visualized by GradCAM [54]. We feed our D and the baseline D with generated images containing artifacts and expect them to pour attention on these regions. Please zoom in to view the artifacts more clearly.

We also visualize the spatial attention of the well-trained discriminators with the help of GradCAM [54]. As mentioned in Sec. 1, we expect D to avoid focusing on some limited regions or objects by extracting spatial representative features. Here, the discriminators of the baseline and our method are chosen to validate the improvement in terms of spatial attention. Considering that the discriminators have been fully trained, we pick some generated images with unacceptable artifacts, expecting D to be aware of these artifact regions. For a fair comparison, the G of GGDR is adopted to synthesize the images rather than the baseline's or ours. The spatial attention maps are demonstrated in Fig. 6; note that we pick the gradient map with a relatively higher resolution (64 × 64) because it is more spatially aligned with the original image than an abstract one (e.g., 8 × 8).

As in Fig. 6, the provided fake images contain various kinds of unpleasant artifacts: the background of the portrait is full of unidentified filamentous artifacts, and there is a weird object on the bed in the bedroom picture. Compared with the baseline D, our D pays much more attention to the artifacts instead of focusing on the face and the bed, which are well synthesized as the subjects. Recall that under the generator-leading task, D is forced to extract representative spatial features to faithfully reconstruct the inputs. To achieve this additional task, the backbone of D (namely Denc) is naturally forced to learn a much stronger representation than what is needed for only fulfilling the binary classification task. Moreover, this suggests that the strengthened representation of D is strong enough to better detect the generated artifacts. In contrast, the red regions in the attention map of the baseline are mainly distributed on the face or the bed, which means D pays more attention to the subject of the training set even though there are artifacts generated by G. Naturally, D's success in detecting and penalizing the artifacts will improve the synthesis capability of G, and this could serve as one of the reasons why GLeaD boosts the synthesis quality of GANs.
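Attention maps of this kind can be produced with a generic Grad-CAM [54] routine; a minimal sketch over an arbitrary convolutional layer of D is given below, where the choice of layer handle and the use of the realness score as the scalar target are assumptions about the setup rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def grad_cam(D, layer, images):
    """Grad-CAM heatmaps of the realness score w.r.t. one convolutional layer of D."""
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        score = D(images).sum()      # scalar target: summed realness scores of the batch
        D.zero_grad()
        score.backward()
        weights = grads["a"].mean(dim=(2, 3), keepdim=True)      # channel-wise importance
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=images.shape[-2:], mode="bilinear", align_corners=False)
        cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)  # normalize to [0, 1] per image
    finally:
        h1.remove(); h2.remove()
    return cam
```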
5. Conclusion

Generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D). In order to establish a fairer game between G and D, we propose a new adversarial paradigm that additionally assigns D a generator-leading task, which is termed GLeaD. Specifically, we urge D to extract adequate features from the input real and fake images. These features should be representative enough that G can reconstruct the original inputs from them. As a result, D is forced to learn a stronger representation aligned with G instead of learning and discriminating freely, and thus the unfairness between G and D can be alleviated. Massive experiments demonstrate that GLeaD significantly improves the synthesis quality over the baseline.

References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[2] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. In IEEE Conf. Comput. Vis. Pattern Recog., pages 221–231, 2019.
[3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. ReStyle: A residual-based StyleGAN encoder via iterative refinement. In Int. Conf. Comput. Vis., 2021.
[4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. HyperStyle: StyleGAN inversion with hypernetworks for real image editing. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Int. Conf. Mach. Learn., 2017.
[6] Qingyan Bai, Yinghao Xu, Jiapeng Zhu, Weihao Xia, Yujiu Yang, and Yujun Shen. High-fidelity GAN inversion with padding space. In Eur. Conf. Comput. Vis., 2022.
[7] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[8] Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P Breckon, and Chris G Willcocks. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In Eur. Conf. Comput. Vis., 2022.
[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019.
[10] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[11] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[12] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Trans. Neur. Network. Learn. Syst., 2018.
[13] Tan M Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son Hua. HyperInverter: Improving StyleGAN inversion via hypernetwork. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[14] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Int. Conf. Learn. Represent., 2017.
[15] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[16] Farzan Farnia and Asuman Ozdaglar. Do GANs always have Nash equilibria? In Int. Conf. Mach. Learn., 2020.
[17] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In Int. Conf. Learn. Represent., 2018.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Adv. Neural Inform. Process. Syst., 2014.
[19] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. In Int. Conf. Learn. Represent., 2022.
[20] Jinjin Gu, Yujun Shen, and Bolei Zhou. Image processing using multi-code GAN prior. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[21] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Adv. Neural Inform. Process. Syst., 2017.
[22] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering interpretable GAN controls. In Adv. Neural Inform. Process. Syst., 2020.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Adv. Neural Inform. Process. Syst., 2017.
[24] Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[25] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk. Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[26] J Stuart Hunter. The exponentially weighted moving average. Journal of Quality Technology, 18(4), 1986.
[27] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[28] Jongheon Jeong and Jinwoo Shin. Training GANs with stronger augmentations via contrastive discriminator. In Int. Conf. Learn. Represent., 2021.
[29] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. In Int. Conf. Learn. Represent., 2019.
[30] Kyoungkook Kang, Seongtae Kim, and Sunghyun Cho. GAN inversion for out-of-range images with geometric transformations. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[31] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Int. Conf. Learn. Represent., 2018.
[32] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Int. Conf. Learn. Represent., 2018.

[33] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Adv. Neural Inform. Process. Syst., 2020.
[34] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Adv. Neural Inform. Process. Syst., volume 33, 2020.
[35] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Adv. Neural Inform. Process. Syst., volume 34, 2021.
[36] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[37] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[38] Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for GAN training. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[39] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Adv. Neural Inform. Process. Syst., 2019.
[40] Gayoung Lee, Hyunsu Kim, Junho Kim, Seonghyeon Kim, Jung-Woo Ha, and Yunjey Choi. Generator knows what discriminator should learn in unconditional GANs. In Eur. Conf. Comput. Vis., 2022.
[41] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[42] Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis. In Int. Conf. Learn. Represent., 2021.
[43] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Adv. Neural Inform. Process. Syst., 2017.
[44] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Int. Conf. Comput. Vis., 2015.
[45] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[46] Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, and Ping Luo. Exploiting deep generative prior for versatile image restoration and manipulation. In Eur. Conf. Comput. Vis., 2020.
[47] Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, and Ping Luo. Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[48] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Int. Conf. Comput. Vis., 2021.
[49] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[50] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Int. Conf. Learn. Represent., 2016.
[51] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Int. Conf. Comput. Vis., 2019.
[52] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: A StyleGAN encoder for image-to-image translation. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[53] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Trans. Graph., 2021.
[54] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[55] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of GANs for semantic face editing. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[56] Zifan Shi, Yujun Shen, Jiapeng Zhu, Dit-Yan Yeung, and Qifeng Chen. 3D-aware indoor scene synthesis with depth priors. In Eur. Conf. Comput. Vis., 2022.
[57] Zifan Shi, Yinghao Xu, Yujun Shen, Deli Zhao, Qifeng Chen, and Dit-Yan Yeung. Improving 3D-aware image synthesis with a geometry-aware discriminator. In Adv. Neural Inform. Process. Syst., 2022.
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Int. Conf. Learn. Represent., 2015.
[59] Vadim Sushko, Edgar Schönfeld, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In Int. Conf. Learn. Represent., 2021.
[60] Hao Tang, Song Bai, Li Zhang, Philip HS Torr, and Nicu Sebe. XingGAN for person image generation. In Eur. Conf. Comput. Vis., pages 717–734. Springer, 2020.
[61] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph., 40(4):1–14, 2021.
[62] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph., 2021.
[63] Ziyu Wan, Bo Zhang, Dongdong Chen, Pan Zhang, Dong Chen, Jing Liao, and Fang Wen. Bringing old photos back to life. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[64] Jianyuan Wang, Ceyuan Yang, Yinghao Xu, Yujun Shen, Hongdong Li, and Bolei Zhou. Improving GAN equilibrium by raising spatial awareness. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.

[65] Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[66] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. ReenactGAN: Learning to reenact faces via boundary transfer. In Eur. Conf. Comput. Vis., 2018.
[67] Zongze Wu, Dani Lischinski, and Eli Shechtman. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[68] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN inversion: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 2022.
[69] Yinghao Xu, Sida Peng, Ceyuan Yang, Yujun Shen, and Bolei Zhou. 3D-aware image synthesis via learning structural and textural representations. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[70] Ceyuan Yang, Yujun Shen, Yinghao Xu, Deli Zhao, Bo Dai, and Bolei Zhou. Improving GANs with a dynamic discriminator. In Adv. Neural Inform. Process. Syst., 2022.
[71] Ceyuan Yang, Yujun Shen, Yinghao Xu, and Bolei Zhou. Data-efficient instance generation from instance discrimination. In Adv. Neural Inform. Process. Syst., 2021.
[72] Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. StyleHEAT: One-shot high-resolution editable talking face generation via pretrained StyleGAN. In Eur. Conf. Comput. Vis., 2022.
[73] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[74] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[75] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021.
[76] Jiapeng Zhu, Ruili Feng, Yujun Shen, Deli Zhao, Zheng-Jun Zha, Jingren Zhou, and Qifeng Chen. Low-rank subspaces in GANs. In Adv. Neural Inform. Process. Syst., volume 34, 2021.
[77] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain GAN inversion for real image editing. In Eur. Conf. Comput. Vis., 2020.
[78] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.

Appendix

A. Discriminator Network Structure

Recall that our D includes a backbone Denc, a head predicting realness scores, and a decoder h for predicting the representative features f and w. Taking an image whose resolution is 256 × 256 as an instance, the backbone Denc is first employed to extract features from the input image. The very last 4 × 4 feature map is sent to the scoring head to predict the realness score, while the multi-level feature maps are sent to the decoder h to predict the representative features adequate for G to reconstruct the original image. As described in the submission, the representative features consist of latent codes w and the spatial representations f, which include a low-level representation and a high-level representation. Recall that these spatial representations are sent to the fixed generator to serve as the basis of the reconstruction and are modulated by the latent codes to predict the final results. We illustrate the architectures of the three aforementioned components of D in Tab. 4, Tab. 5, and Tab. 6, respectively.

Table 4. Network structure of the backbone Denc. The output size is in the order {C × H × W}, where C, H, and W respectively denote the channel dimension, height, and width of the output.

Stage     Block                                                                        Output Size
input     -                                                                            3 × 256 × 256
block1    [1×1 Conv, 128; 2× 3×3 Conv, 128; 1×1 Conv, 128; Downsample; LeakyReLU, 0.2] 128 × 128 × 128
block2    [2× 3×3 Conv, 256; 1×1 Conv, 256; Downsample; LeakyReLU, 0.2]                256 × 64 × 64
block3    [2× 3×3 Conv, 512; 1×1 Conv, 512; Downsample; LeakyReLU, 0.2]                512 × 32 × 32
block4    [2× 3×3 Conv, 512; 1×1 Conv, 512; Downsample; LeakyReLU, 0.2]                512 × 16 × 16
block5    [2× 3×3 Conv, 512; 1×1 Conv, 512; Downsample; LeakyReLU, 0.2]                512 × 8 × 8
block6    [2× 3×3 Conv, 512; 1×1 Conv, 512; Downsample; LeakyReLU, 0.2]                512 × 4 × 4
Table 5. Network structure of the decoder h predicting the low-level spatial representation, the high-level spatial representation, and the 512-channel latent codes. Note that h receives multi-level features as inputs due to its feature pyramid architecture [41]. The output size is in the order {C × H × W}.

Stage     Block                        Output Size
input     -                            512 × 32 × 32; 512 × 16 × 16; 512 × 8 × 8; 512 × 4 × 4
block1    [1×1 Conv, 512; Upsample]    512 × 8 × 8
block2    [1×1 Conv, 512; Upsample]    512 × 16 × 16
block3    [1×1 Conv, 512; Upsample]    512 × 32 × 32
block4    [1×1 Conv, 3]                3 × 32 × 32
          [2× 1×1 Conv, 512]           512 × 32 × 32
          [Downsample]                 512

Table 6. Network structure of the head predicting realness scores, which are scalars. The output size is in the order {C × H × W}.

Stage     Block                                                                                  Output Size
input     -                                                                                      512 × 4 × 4
block1    [Mbstd, 1; 3×3 Conv, 512; LeakyReLU, 0.2; Downsample; FC, 512; LeakyReLU, 0.2; FC, 1]  1

Table 7. Computational cost comparisons.

Method     # params    inference time (s)    training time (h)
Baseline   24.00M      0.0184                43.83
GLeaD      25.77M      0.0219                55.78

B. Computational Costs

We first compute the discriminator parameter counts of the baseline and our method. As shown in Tab. 7, our method brings merely 7.4% additional parameters over the baseline, thanks to the proposed lightweight design of h composed of 1 × 1 convolutions. We then compare the inference time of the discriminators on a single A6000 GPU. At last, we compare the training time: we separately train the baseline model [37] and our model with 8 A100 GPUs on LSUN Church and record how much time the training costs. From the numbers in Tab. 7, we can conclude that our method improves the synthesis quality without much additional computational burden.
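For completeness, parameter counts and inference timings of the kind reported in Tab. 7 can be gathered with a few lines of standard PyTorch; the discriminator handle, input resolution, and repetition count below are illustrative assumptions rather than the exact measurement script.

```python
import time
import torch

@torch.no_grad()
def profile_discriminator(D, resolution=256, reps=100, device="cuda"):
    n_params = sum(p.numel() for p in D.parameters())       # total trainable + buffer-free params
    x = torch.randn(1, 3, resolution, resolution, device=device)
    D(x)                                                     # warm-up pass
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(reps):
        D(x)
    torch.cuda.synchronize()
    return n_params, (time.time() - start) / reps            # params, average seconds per forward
```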
