3D-Aware Conditional Image Synthesis
Kangle Deng    Gengshan Yang    Deva Ramanan    Jun-Yan Zhu
Carnegie Mellon University
Figure 1. Given a 2D label map as input, such as a segmentation or edge map, our model learns to predict high-quality 3D labels, geometry,
and appearance, which enables us to render both labels and RGB images from different viewpoints. The inferred 3D labels further allow
interactive editing of label maps from any viewpoint, as shown in Figure 10.
Abstract

We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely available posed monocular image and label map pairs, our model learns to assign a label to every 3D point in addition to color and density, which enables it to render the image and a pixel-aligned label map simultaneously. Finally, we build an interactive system that allows users to edit the label map from different viewpoints and generate outputs accordingly.

1. Introduction

Content creation with generative models has witnessed tremendous progress in recent years, enabling high-quality, user-controllable image and video synthesis [19, 20, 24, 34]. In particular, image-to-image translation methods [29, 56, 86] allow users to interactively create and manipulate a high-resolution image given a 2D input label map. Unfortunately, existing image-to-image translation methods operate purely in 2D, without explicit reasoning about the underlying 3D structure of the content. As shown in Figure 1, we aim to make conditional image synthesis 3D-aware, allowing not only 3D content generation but also viewpoint manipulation and attribute editing (e.g., car shape) in 3D.

Synthesizing 3D content conditioned on user input is challenging. For model training, it is costly to obtain large-scale datasets with paired user inputs and their desired 3D outputs. At test time, 3D content creation often requires multi-view user inputs, as a user may want to specify the details of 3D objects using 2D interfaces from different viewpoints. However, these inputs may not be 3D-consistent, providing conflicting signals for 3D content creation.

To address the above challenges, we extend conditional generative models with 3D neural scene representations. To enable cross-view editing, we additionally encode semantic information in 3D, which can then be rendered as 2D label maps from different viewpoints. We learn this 3D representation using only 2D supervision in the form of image reconstruction and adversarial losses. While the reconstruction loss ensures the alignment between 2D user inputs and the corresponding 3D content, our pixel-aligned conditional discriminator encourages the appearance and labels to look plausible and to remain pixel-aligned when rendered from novel viewpoints. We also propose a cross-view consistency loss to enforce the latent codes to be consistent across different viewpoints.

We focus on 3D-aware semantic image synthesis on the CelebAMask-HQ [38], AFHQ-cat [16], and shapenet-car [10] datasets. Our method works well for various 2D user inputs, including segmentation maps and edge maps, and outperforms several 2D and 3D baselines, such as Pix2NeRF variants [6], SoFGAN [11], and SEAN [89]. We further ablate the impact of various design choices and demonstrate applications of our method, such as cross-view editing and explicit user control over semantics and style. Please see our website for more results and code.

2. Related Work

Neural Implicit Representation. Neural implicit fields, such as DeepSDF and NeRFs [46, 54], model the appearance of objects and scenes with an implicitly defined, continuous 3D representation parameterized by neural networks. They have produced significant results for 3D reconstruction [67, 90] and novel view synthesis [39, 43, 44, 48, 81] thanks to their compactness and expressiveness. NeRF and its descendants aim to optimize a network for an individual scene, given hundreds of images from multiple viewpoints. Recent works further reduce the number of training views by learning network initializations [13, 70, 79], leveraging auxiliary supervision [18, 30], or imposing regularization terms [50]. Recently, explicit or hybrid representations of radiance fields [12, 48, 61] have also shown promising results regarding quality and speed. In our work, we use hybrid representations for modeling both user inputs and outputs in 3D, focusing on synthesizing novel images rather than reconstructing an existing scene. A recent work, Pix2NeRF [6], aims to translate a single image to a neural radiance field, which allows single-image novel view synthesis. In contrast, we focus on 3D-aware user-controlled content generation.

Conditional GANs. Generative adversarial networks (GANs) learn the distribution of natural images by forcing the generated and real images to be indistinguishable. They have demonstrated high-quality results for 2D image synthesis and manipulation [1, 3, 5, 20, 33–35, 59, 65, 72, 84, 85]. Several methods adopt image-conditional GANs [29, 47] for user-guided image synthesis and editing applications [26, 27, 38, 40, 55, 56, 62, 74, 86, 89]. In contrast, we propose a 3D-aware generative model conditioned on 2D user inputs that can render view-consistent images and enable interactive 3D editing. Recently, SoFGAN [11] used a 3D semantic map generator followed by a 2D semantic-to-image generator to enable 3D-aware generation, but using 2D generators does not ensure 3D consistency.

3D-aware Image Synthesis. Early data-driven 3D image editing systems can achieve various 3D effects but often require a huge amount of manual effort [14, 37]. Recent works have integrated 3D structure into learning-based image generation pipelines using various geometric representations, including voxels [22, 88], voxelized 3D features [49], and 3D morphable models [71, 78]. However, many rely on external 3D data [71, 78, 88]. Recently, neural scene representations have been integrated into GANs to enable 3D-aware image synthesis [8, 9, 21, 51–53, 64, 77]. Intriguingly, these 3D-aware GANs can learn 3D structures without any 3D supervision. For example, StyleNeRF [21] and EG3D [8] learn to generate 3D representations by modulating either NeRFs or explicit representations with latent style vectors, which allows them to render high-resolution view-consistent images. Unlike the above methods, we focus on conditional synthesis and interactive editing rather than random sampling. Several works [17, 28, 42, 76] have explored sketch-based shape generation, but they do not allow realistic image synthesis.

Closely related to our work, Huang et al. [25] propose synthesizing novel views conditioned on a semantic map. Our work differs in three ways. First, we predict full 3D labels, geometry, and appearance, rather than only 2D views, which enables cross-view editing. Second, our method can synthesize images with a much wider baseline than Huang et al. [25]. Finally, our learning algorithm does not require ground-truth multi-view images of the same scene. Two recent works, FENeRF [69] and 3DSGAN [80], also leverage semantic labels for training 3D-aware GANs, but they do not support conditional inputs and require additional effort (e.g., GAN inversion) to allow user editing. Three concurrent works, IDE-3D [68], NeRFFaceEditing [31], and sem2nerf [15], also explore the task of 3D-aware generation based on segmentation masks. However, IDE-3D and sem2nerf only allow editing from a fixed view, and NeRFFaceEditing focuses on real image editing rather than generation. None of them includes results for other input modalities. In contrast, we present a general-purpose method that works well for diverse datasets and input controls.

3. Method

Given a 2D label map Is, such as a segmentation or edge map, pix2pix3D generates a 3D volumetric representation of geometry, appearance, and labels that can be rendered from different viewpoints. Figure 2 provides an overview.
Figure 2. Overall pipeline. Given a 2D label map (e.g., segmentation map), a random latent code z, and a camera pose P̂ as inputs, our
generator renders the label map and image from viewpoint P̂ . Intuitively, the input label map specifies the geometric structure, while the
latent code captures the appearance, such as hair color. We begin with an encoder that encodes both the input label map and the latent
code into style vectors w+ . We then use w+ to modulate our 3D representation, which takes a spatial point x and outputs (1) color c ∈ R3 ,
(2) density σ, (3) feature ϕ ∈ Rl , and (4) label s ∈ Rc . We then perform volumetric rendering and 2D upsampling to get the high-res
label map Î+s and RGB image Î+c. For those rendered from ground-truth poses, we compare them to ground-truth labels and images with
an LPIPS loss and label reconstruction loss. We apply a GAN loss on labels and images rendered from both novel and original viewpoints.
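The pipeline in Figure 2 can be summarized by the following interface sketch. This is our own illustration, not the released pix2pix3D API; the names Renders and generate, and the latent dimension, are assumptions.

```python
# Illustrative interface only; the real generator is the encoder -> tri-plane ->
# renderer -> upsampler chain described in Section 3.1.
from dataclasses import dataclass
import torch

@dataclass
class Renders:
    image: torch.Tensor       # Î+_c, (B, 3, 512, 512)
    labels: torch.Tensor      # Î+_s, (B, C, 512, 512), pixel-aligned with `image`
    image_low: torch.Tensor   # Î_c,  (B, 3, 64, 64)
    labels_low: torch.Tensor  # Î_s,  (B, C, 64, 64)

def generate(label_map: torch.Tensor, z: torch.Tensor, pose: torch.Tensor) -> Renders:
    """label_map: (B, C, H, W); z: (B, zdim) Gaussian code (512 here is an assumption);
    pose: (B, 4, 4) camera pose P̂."""
    raise NotImplementedError  # stands in for the full conditional generator
```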
We first introduce the formulation of our 3D conditional generative model for 3D-aware image synthesis in Section 3.1. Then, in Section 3.2, we discuss how to learn the model from color and label map pairs {Ic, Is} associated with poses P.

3.1. Conditional 3D Generative Models

Similar to EG3D [8], we adopt a hybrid representation for the density and appearance of a scene and use style vectors to modulate the 3D generation. To condition the 3D representation on 2D label map inputs, we introduce a conditional encoder that maps a 2D label map into latent style vectors. Additionally, pix2pix3D produces 3D labels that can be rendered from different viewpoints, allowing for cross-view user editing.

Conditional Encoder. Given a 2D label map input Is and a random latent code sampled from the spherical Gaussian space z ∼ N(0, I), our conditional encoder E outputs a list of style vectors w+ ∈ R^{l×256},

\mathbf{w}^{+} = E(\mathbf{I}_{\mathbf{s}}, \mathbf{z}),

where l = 13 is the number of layers to be modulated. Specifically, we encode Is into the first 7 style vectors, which represent the global geometric information of the scene. We then feed the random latent code z through a Multi-Layer Perceptron (MLP) mapping network to obtain the remaining style vectors, which control the appearance.

Conditional 3D Representation. Our 3D representation is parameterized by tri-planes followed by a 2-layer MLP f [8], which takes a spatial point x ∈ R3 and returns four types of outputs: (1) color c ∈ R3, (2) density σ ∈ R+, (3) a feature ϕ ∈ R64 used for 2D upsampling, and, most notably, (4) a label s ∈ Rc, where c is the number of classes for segmentation maps and 1 for edge maps. We make the field conditional by modulating the generation of the tri-planes F^tri with the style vectors w+. We also remove the view dependence of the color following [8, 21]. Formally,

(\mathbf{c}, \mathbf{s}, \sigma, \phi) = f(F^{\text{tri}}_{\mathbf{w}^{+}}(\mathbf{x})).
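As a concrete illustration, a minimal PyTorch-style sketch of such a conditional field is shown below. It is our own simplification, not the released pix2pix3D code: sample_triplane stands in for the feature lookup on the style-modulated tri-planes, and the head widths and class count are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, x):
    # planes: (B, 3, C, H, W) style-modulated tri-planes; x: (B, N, 3) points in [-1, 1].
    feats = []
    for i, dims in enumerate([(0, 1), (0, 2), (1, 2)]):      # XY, XZ, YZ projections
        grid = x[..., dims].unsqueeze(1)                      # (B, 1, N, 2)
        f = F.grid_sample(planes[:, i], grid, align_corners=False)  # (B, C, 1, N)
        feats.append(f.squeeze(2).permute(0, 2, 1))           # (B, N, C)
    return sum(feats)                                         # aggregate plane features

class FieldHead(nn.Module):
    """2-layer MLP decoding a point feature into (c, s, sigma, phi) as in Section 3.1."""
    def __init__(self, feat_dim=32, n_classes=6, up_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.Softplus())
        self.to_color = nn.Linear(64, 3)           # c: RGB, no view dependence
        self.to_label = nn.Linear(64, n_classes)   # s: class logits (1 channel for edges)
        self.to_density = nn.Linear(64, 1)         # sigma >= 0
        self.to_feature = nn.Linear(64, up_dim)    # phi, consumed by the 2D upsampler

    def forward(self, planes, x):
        h = self.mlp(sample_triplane(planes, x))
        sigma = F.softplus(self.to_density(h))
        return self.to_color(h), self.to_label(h), sigma, self.to_feature(h)
```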
Volume Rendering and Upsampling. We apply volumetric rendering to synthesize color images [32, 46]. In addition, we render label maps, which are crucial for enabling cross-view editing (Section 4.3) and improving rendering quality (Table 1). Given a viewpoint P̂ looking at the scene origin, we sample N points along the ray r that emanates from a pixel location and query density, color, labels, and feature information from our 3D representation. Let xi be the i-th sampled point along the ray; we composite the per-point outputs as

\hat{\mathbf{I}}_{\mathbf{c}}(r) = \sum_{i=1}^{N} \tau_i \mathbf{c}_i, \qquad \hat{\mathbf{I}}_{\mathbf{s}}(r) = \sum_{i=1}^{N} \tau_i \mathbf{s}_i, \qquad \hat{\mathbf{I}}_{\boldsymbol{\phi}}(r) = \sum_{i=1}^{N} \tau_i \boldsymbol{\phi}_i, \tag{1}

where the transmittance τi is computed from the probability of a photon traversing between the camera center and the i-th point, given the length δi of the i-th interval:

\tau_i = \prod_{j=1}^{i} \exp(-\sigma_j \delta_j)\,(1 - \exp(-\sigma_i \delta_i)).
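The compositing in Equation 1 can be sketched as follows (our illustration, assuming samples are ordered front to back along each ray). The key point is that the same weights τi blend colors, labels, and features, which keeps the rendered label map pixel-aligned with the rendered image.

```python
import torch

def composite(sigma, delta, color, label, feat):
    # sigma, delta: (R, N); color: (R, N, 3); label: (R, N, C); feat: (R, N, 64).
    alpha = 1.0 - torch.exp(-sigma * delta)                   # per-interval opacity
    trans = torch.exp(-torch.cumsum(sigma * delta, dim=-1))   # accumulated transmittance
    w = (trans * alpha).unsqueeze(-1)                         # weights tau_i as in Eq. 1
    img = (w * color).sum(dim=1)   # Î_c(r)
    seg = (w * label).sum(dim=1)   # Î_s(r)
    phi = (w * feat).sum(dim=1)    # Î_phi(r)
    return img, seg, phi
```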
Similar to prior works [8, 21, 52], we approximate Equation 1 with a 2D upsampler U to reduce the computational cost. We render high-res 512 × 512 images in two passes. In the first pass, we render low-res 64 × 64 maps Îc, Îs, and Îϕ. Then the CNN upsampler U is applied to obtain the high-res outputs:

\hat{\mathbf{I}}^{+}_{\mathbf{c}} = U(\hat{\mathbf{I}}_{\mathbf{c}}, \hat{\mathbf{I}}_{\boldsymbol{\phi}}), \qquad \hat{\mathbf{I}}^{+}_{\mathbf{s}} = U(\hat{\mathbf{I}}_{\mathbf{s}}, \hat{\mathbf{I}}_{\boldsymbol{\phi}}).
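Below is a toy version of the two-pass scheme (our sketch only; the real upsampler is a CNN whose architecture we do not reproduce, and whether one network handles both the image and the label branch is an implementation detail we do not claim here). The 64 × 64 renderings are concatenated with the rendered feature map Îϕ and upsampled to the output resolution.

```python
import torch
import torch.nn as nn

class Upsampler(nn.Module):
    """Toy stand-in for U: low-res rendering + feature map Î_phi -> high-res output."""
    def __init__(self, out_ch, feat_ch=64, scale=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(out_ch + feat_ch, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False),
            nn.Conv2d(64, out_ch, 3, padding=1))

    def forward(self, low, feat):
        return self.net(torch.cat([low, feat], dim=1))

img_lo, seg_lo, phi_lo = torch.randn(1, 3, 64, 64), torch.randn(1, 6, 64, 64), torch.randn(1, 64, 64, 64)
img_hi = Upsampler(out_ch=3)(img_lo, phi_lo)   # Î+_c at 512 x 512
seg_hi = Upsampler(out_ch=6)(seg_lo, phi_lo)   # Î+_s at 512 x 512
```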
3.2. Learning Objective

Learning conditional 3D representations from monocular images is challenging due to its under-constrained nature. Given training data of associated images, label maps, and camera poses predicted by an off-the-shelf model, we carefully construct learning objectives, including reconstruction, adversarial, and cross-view consistency losses, described below.

Reconstruction Loss. Given a ground-truth viewpoint P associated with the color and label maps {Ic, Is}, we render color and label maps from P and compute reconstruction losses for both the high-res and low-res outputs. We use LPIPS [82] to compute the image reconstruction loss Lc for color images. For the label reconstruction loss Ls, we use a balanced cross-entropy loss for segmentation maps or an L2 loss for edge maps:

\mathcal{L}_{\text{recon}} = \lambda_c \mathcal{L}_c(\mathbf{I}_{\mathbf{c}},\{\hat{\mathbf{I}}_{\mathbf{c}}, \hat{\mathbf{I}}^{+}_{\mathbf{c}}\}) + \lambda_s \mathcal{L}_s(\mathbf{I}_{\mathbf{s}}, \{\hat{\mathbf{I}}_{\mathbf{s}}, \hat{\mathbf{I}}^{+}_{\mathbf{s}}\}),

where λc and λs balance the two terms.
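A hedged sketch of this objective is given below. The loss weights, the exact class-balancing scheme, and the tensor layout are placeholder assumptions, not the paper's hyper-parameters: LPIPS is applied to the low-res and high-res images, and a class-weighted cross-entropy to the rendered segmentation logits.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')

def recon_loss(I_c, I_s, out, lambda_c=1.0, lambda_s=1.0, class_weights=None):
    # I_c: (B, 3, H, W) image in [-1, 1]; I_s: (B, H, W) class-index map.
    # out: dict of renderings: img_lo, img_hi (RGB) and seg_lo, seg_hi (logits).
    loss_c = sum(lpips_fn(p, F.interpolate(I_c, p.shape[-2:])).mean()
                 for p in (out['img_lo'], out['img_hi']))
    loss_s = 0.0
    for p in (out['seg_lo'], out['seg_hi']):
        tgt = F.interpolate(I_s[:, None].float(), p.shape[-2:]).squeeze(1).long()
        # "balanced" cross-entropy approximated here by per-class weights
        loss_s = loss_s + F.cross_entropy(p, tgt, weight=class_weights)
    return lambda_c * loss_c + lambda_s * loss_s
```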
Pixel-aligned Conditional Discriminator. The reconstruction loss alone fails to synthesize detailed results from novel viewpoints. Therefore, we use an adversarial loss [20] to enforce renderings to look realistic from random viewpoints. Specifically, we have two discriminators, Dc and Ds, for RGB images and label maps, respectively. Dc is a widely used GAN discriminator that takes real and fake images as input, while the pixel-aligned conditional discriminator Ds concatenates color images and label maps as input, which encourages pixel alignment between color images and label maps. Notably, in Ds, we stop the gradients for the color images to prevent a potential quality downgrade. We also feed the rendered low-res images to prevent the upsampler from hallucinating details inconsistent with the low-res output. The adversarial loss can be written as

\mathcal{L}_{\text{GAN}} = \lambda_{D_{\mathbf{c}}}\mathcal{L}_{D_{\mathbf{c}}}(\hat{\mathbf{I}}^{+}_{\mathbf{c}}, \hat{\mathbf{I}}_{\mathbf{c}}) + \lambda_{D_{\mathbf{s}}} \mathcal{L}_{D_{\mathbf{s}}}(\hat{\mathbf{I}}^{+}_{\mathbf{c}}, \hat{\mathbf{I}}_{\mathbf{c}}, \hat{\mathbf{I}}^{+}_{\mathbf{s}}, \hat{\mathbf{I}}_{\mathbf{s}}),

where λDc and λDs balance the two terms. To stabilize the GAN training, we adopt the R1 regularization loss [45].
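The pixel-aligned conditioning can be sketched as below (our reading of the text, with placeholder tensor names): color renderings are detached so that Ds does not backpropagate into the RGB branch, the low-res renderings are fed alongside the high-res ones, and labels are concatenated channel-wise so that Ds judges image and label pairs at every pixel.

```python
import torch
import torch.nn.functional as F

def ds_input(img_hi, img_lo, seg_hi, seg_lo):
    # img_*: (B, 3, ., .) RGB renderings; seg_*: (B, C, ., .) rendered label maps.
    size = img_hi.shape[-2:]
    up = lambda t: F.interpolate(t, size=size, mode='bilinear', align_corners=False)
    # Stop gradients on the color images only, as described in the text.
    return torch.cat([img_hi.detach(), up(img_lo).detach(), seg_hi, up(seg_lo)], dim=1)
```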
Figure 3. Cross-View Consistency Loss. Given an input label map Is and its associated pose P, we first infer the geometry latent code wg. From wg, we can generate a label map Îs from the same pose P, and Î′s from a random pose P′. Next, we infer w′g from the novel view Î′s, and render it back to the original pose P to obtain Î′′s. Finally, we add a reconstruction loss: LCVC = λCVC Ls(Î′′s, Îs).

Cross-view Consistency Loss. We observe that inputting label maps of the same object from different viewpoints can sometimes result in different 3D shapes. We therefore add a cross-view consistency loss to regularize the training, as illustrated in Figure 3. Given an input label map Is and its associated pose P, we generate the label map Î′s from a different viewpoint P′, and render the label map Î′′s back to the pose P using Î′s as input. We add a reconstruction loss between Î′′s and Îs:

\mathcal{L}_{\text{CVC}} = \lambda_{\text{CVC}}\,\mathcal{L}_s(\hat{\mathbf{I}}''_{\mathbf{s}}, \hat{\mathbf{I}}_{\mathbf{s}}),

where Ls denotes the reconstruction loss in the label space and λCVC weights the term. This loss is crucial for reducing error accumulation during cross-view editing.

Optimization. Our final learning objective is

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{GAN}} + \mathcal{L}_{\text{CVC}}.

At every iteration, we determine whether to use a ground-truth pose or to sample a random one with probability p. We use both the reconstruction loss and the GAN loss for ground-truth poses, while for random poses we only use the GAN loss. We provide the hyper-parameters and more implementation details in Appendix B.
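Putting the objectives together, one training iteration could look like the following sketch. Every callable here (the generator G, the loss helpers, and the pose sampler) is passed in as a placeholder rather than taken from the actual codebase, and applying the cross-view consistency term only on ground-truth-pose iterations is our assumption.

```python
import random

def training_step(G, batch, sample_pose, gan_loss, recon_loss, label_loss,
                  p=0.5, lambda_cvc=1.0):
    """One iteration. batch: dict with image 'I_c', label map 'I_s', pose 'P', latent 'z'."""
    use_random_pose = random.random() < p
    pose = sample_pose() if use_random_pose else batch['P']
    out = G(batch['I_s'], batch['z'], pose)          # renders Î_c, Î+_c, Î_s, Î+_s
    loss = gan_loss(out)                             # L_GAN is applied for every pose
    if not use_random_pose:
        loss = loss + recon_loss(batch['I_c'], batch['I_s'], out)   # L_recon, GT poses only
        # Cross-view consistency (Figure 3): render the label map from a second pose,
        # feed it back as input, and require the original-view label map to reappear.
        seg_novel = G(batch['I_s'], batch['z'], sample_pose())['seg_hi']
        seg_back = G(seg_novel, batch['z'], batch['P'])['seg_hi']
        loss = loss + lambda_cvc * label_loss(seg_back, out['seg_hi'])
    return loss
```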
Figure 4. Qualitative Comparison with Pix2NeRF [6], SoFGAN [11], and SEAN [89] on the CelebAMask-HQ dataset for the seg2face task. SEAN
fails in multi-view synthesis, while SoFGAN suffers from multi-view inconsistency (e.g., face identity changes across viewpoints). Our
method renders high-quality images while maintaining multi-view consistency. Please check our website for more examples.
Figure 5. Qualitative ablation on seg2face and seg2cat. We ablate our method by removing the branch that renders label maps (w/o 3D
Labels). Our results better align with input labels (e.g., hairlines and the cat’s ear).
                              Quality                      Alignment
  Edge2Car               FID ↓   KID ↓   SG Diversity ↑    AP ↑
  Pix2NeRF [6]           23.42   0.014   0.06              0.28
  pix2pix3D (Ours)
    w/o 3D Labels        10.73   0.005   0.12              0.45 (0.42)
    w/o CVC               9.42   0.004   0.13              0.61 (0.59)
    Full model            8.31   0.004   0.13              0.63 (0.59)

Table 2. Edge2car Evaluation. We compare our method with Pix2NeRF [6] on edge2car using the shapenet-car [10] dataset. Similar to Table 1, we evaluate FID, KID, and SG Diversity for image quality. We also evaluate the alignment with the input edge map using AP. Similarly, we can either run informative drawing [7] on generated images to obtain edge maps (numbers in parentheses) or directly use the generated edge maps to calculate the metrics. We achieve better image quality and alignment than Pix2NeRF. We also find that using 3D labels and the cross-view consistency loss is helpful regarding the FID and AP metrics.

…a training set of 24,183, a validation set of 2,993, and a test set of 2,824, following the original work [38]. For seg2cat and edge2cat, we use AFHQ-cat [16], which contains 5,065 images at 512× resolution. We estimate the viewpoints using unsup3d [75]. We extract the edges using pidinet [66] and obtain segmentation by clustering DINO features [2] into 6 classes. For edge2car, we use 3D models from shapenet-car [10] and render 500,000 images at 128× resolution for training, and 30,000 for evaluation. We extract the edges using informative drawing [7]. We train our model at 512× resolution, except for the edge2car task, which uses 128×.

Running Time. Training the model at 512× resolution takes about three days on eight RTX 3090 GPUs, but we can significantly reduce the training time to 4 hours if we initialize parts of our model with pretrained weights from EG3D [8]. During inference, our model takes 10 ms to obtain the style vector and another 30 ms to render the final image and the label map on a single RTX A5000. The low latency (25 FPS) allows for interactive user editing.

4.1. Evaluation metrics

We evaluate the models from two aspects: 1) image quality regarding fidelity and diversity, and 2) the alignment between input label maps and generated outputs.

Quality Metrics. Following prior works [21, 57], we use the clean-fid library [58] to compute the Fréchet Inception Distance (FID) [23] and Kernel Inception Distance (KID) [4], which measure the distribution distance between synthesized results and real images. We also evaluate single-generation diversity (SG Diversity) by calculating the LPIPS metric between randomly generated pairs given a single input, following prior works [11, 87].
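For reference, the quality metrics above can be computed with the cited libraries roughly as follows. Directory paths and the pairing scheme are placeholders, not the authors' evaluation scripts.

```python
import itertools
import lpips
from cleanfid import fid   # clean-fid library [58]

# FID / KID between generated renderings and the full set of real images.
fid_score = fid.compute_fid('generated_images/', 'real_images/')
kid_score = fid.compute_kid('generated_images/', 'real_images/')

# SG Diversity: mean LPIPS over pairs of generations from the same label map.
lpips_fn = lpips.LPIPS(net='alex')

def sg_diversity(samples):
    # samples: list of (3, H, W) tensors in [-1, 1], all generated from one input.
    dists = [lpips_fn(a[None], b[None]).item()
             for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)
```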
Figure 9. We study the effect of random pose sampling probability p during training. Without random poses (p = 0), the model achieves the
best alignment with input semantic maps but reduced image quality. In contrast, using only random poses (p = 1) achieves the best image
quality, while results fail to align with input maps. We find p = 0.5 balances the image quality and input alignment.
Figure 10. Cross-view Editing of Edge2Car. Our 3D editing system allows users to edit label maps from any viewpoint instead of only the
input view. Importantly, our feed-forward encoder allows fast inference of the latent code without GAN-inversion. Typically, a single forward
pass of rendering takes only 40 ms on a single RTX A5000, which enables interactive editing. Please check our demo video on our website.
For FID and KID, we generate 10 images per label map in the test set using randomly sampled z, and we compare the generated images with the whole dataset, including training and test images.

Alignment Metrics. We evaluate models on the test set using mean Intersection-over-Union (mIoU) and pixel accuracy (acc) for segmentation maps, following existing works [57, 63], and average precision (AP) for edge maps. For models that render label maps as output, we directly compare the rendered labels with the ground truth. Otherwise, we first predict the label maps from the output RGB images using off-the-shelf networks [38, 66] and then compare the prediction with the ground truth. The metrics computed on such predicted label maps are reported within brackets in Table 1 and Table 2.

For seg2face, we evaluate the preservation of facial identity across different viewpoints (FVV Identity) by calculating the distances between renderings of the same scene from different viewpoints with the dlib face recognition algorithm*.

* https://github.com/ageitgey/face_recognition
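A sketch of how this metric can be computed with that library follows; the file handling is ours, and the authors' exact protocol (e.g., which view pairs are compared) is not specified here.

```python
import numpy as np
import face_recognition  # dlib-based library referenced in the footnote

def fvv_identity(image_paths):
    """Average face-embedding distance across renderings of the same scene
    from different viewpoints; a lower distance means the identity is better preserved."""
    encodings = [face_recognition.face_encodings(face_recognition.load_image_file(p))[0]
                 for p in image_paths]
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(encodings) for b in encodings[i + 1:]]
    return float(np.mean(dists))
```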
4.2. Baseline comparison

Baselines. Since there are no prior works on conditional 3D-aware image synthesis, we make minimal modifications to Pix2NeRF [6] so that it is conditioned on label maps instead of images. For a thorough comparison, we introduce several additional baselines: SEAN [89] and SoFGAN [11]. 2D baselines like SEAN [89] cannot generate multi-view images by design (N/A for FVV Identity), while SoFGAN [11] uses an unconditional 3D semantic map generator before the 2D generator, so we can evaluate FVV Identity for it.

Results. Figure 4 shows the qualitative comparison for seg2face, and Table 1 reports the evaluation results. SoFGAN [11] tends to produce results with slightly better alignment but worse 3D consistency due to its 2D RGB generator. Our method achieves the best quality, alignment accuracy, and FVV Identity while being competitive with 2D baselines on SG Diversity. Figure 5 shows the qualitative ablation on seg2face and seg2cat, and Table 5 reports the metrics for seg2cat. Figure 6 shows example results for edge2cat. Figure 7 shows the qualitative comparison for edge2car, and Table 2 reports the metrics; our method achieves the best image quality and alignment. Figure 8 shows semantic meshes of human and cat faces, extracted by marching cubes and colored by our learned 3D labels. We provide more evaluation results in Appendix A.

Ablation Study. We compare our full method to several variants: (1) w/o 3D Labels, where we remove the branch that renders label maps, and (2) w/o CVC, where we remove the cross-view consistency loss. As shown in Table 1, Table 2, and Figure 5, rendering label maps is crucial for alignment with the input. We posit that the joint learning of appearance, geometry, and label information imposes strong constraints on the correspondence between the input label maps and the 3D representation; thus our method can synthesize images pixel-aligned with the inputs. Our CVC loss helps preserve facial identity across different viewpoints.
Figure 13. Cross-view Editing of Seg2cat. The 3D representation can be edited from a viewpoint different than the input seg map.
Figure 16. Our model projects the user input onto the learned
manifold. Even if the user input contains errors, our model is able
to fix them, e.g., completing the missing eyes and eyebrows.
Figure 14. Visual Results of Seg2Car.

…erated results, and find our generated images can be detected with an accuracy of 89.77% and an average precision of 99.97%.