
3D-aware Conditional Image Synthesis

Kangle Deng Gengshan Yang Deva Ramanan Jun-Yan Zhu


Carnegie Mellon University



arXiv:2302.08509v2 [cs.CV] 1 May 2023



Figure 1. Given a 2D label map as input, such as a segmentation or edge map, our model learns to predict high-quality 3D labels, geometry,
and appearance, which enables us to render both labels and RGB images from different viewpoints. The inferred 3D labels further allow
interactive editing of label maps from any viewpoint, as shown in Figure 10.

Abstract

We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely-available posed monocular image and label map pairs, our model learns to assign a label to every 3D point in addition to color and density, which enables it to render the image and pixel-aligned label map simultaneously. Finally, we build an interactive system that allows users to edit the label map from different viewpoints and generate outputs accordingly.

1. Introduction

Content creation with generative models has witnessed tremendous progress in recent years, enabling high-quality, user-controllable image and video synthesis [19, 20, 24, 34]. In particular, image-to-image translation methods [29, 56, 86] allow users to interactively create and manipulate a high-resolution image given a 2D input label map. Unfortunately, existing image-to-image translation methods operate purely in 2D, without explicit reasoning of the underlying 3D structure of the content. As shown in Figure 1, we aim to make conditional image synthesis 3D-aware, allowing not only 3D content generation but also viewpoint manipulation and attribute editing (e.g., car shape) in 3D.

Synthesizing 3D content conditioned on user input is challenging. For model training, it is costly to obtain large-scale datasets with paired user inputs and their desired 3D outputs. During test time, 3D content creation often requires multi-view user inputs, as a user may want to specify the details of 3D objects using 2D interfaces from different viewpoints. However, these inputs may not be 3D-consistent, providing conflicting signals for 3D content creation.

To address the above challenges, we extend conditional generative models with 3D neural scene representations. To enable cross-view editing, we additionally encode semantic information in 3D, which can then be rendered as 2D label maps from different viewpoints. We learn the aforementioned 3D representation using only 2D supervision in the form of image reconstruction and adversarial losses. While the reconstruction loss ensures the alignment between 2D user inputs and corresponding 3D content, our pixel-aligned conditional discriminator encourages the appearance and labels to look plausible while remaining pixel-aligned when rendered into novel viewpoints. We also propose a cross-view consistency loss to enforce the latent codes to be consistent from different viewpoints.

We focus on 3D-aware semantic image synthesis on the CelebAMask-HQ [38], AFHQ-cat [16], and shapenet-car [10] datasets. Our method works well for various 2D user inputs, including segmentation maps and edge maps. Our method outperforms several 2D and 3D baselines, such as Pix2NeRF variants [6], SoFGAN [11], and SEAN [89]. We further ablate the impact of various design choices and demonstrate applications of our method, such as cross-view editing and explicit user control over semantics and style. Please see our website for more results and code.

2. Related Work

Neural Implicit Representation. Neural implicit fields, such as DeepSDF and NeRFs [46, 54], model the appearance of objects and scenes with an implicitly defined, continuous 3D representation parameterized by neural networks. They have produced significant results for 3D reconstruction [67, 90] and novel view synthesis applications [39, 43, 44, 48, 81] thanks to their compactness and expressiveness. NeRF and its descendants aim to optimize a network for an individual scene, given hundreds of images from multiple viewpoints. Recent works further reduce the number of training views through learning network initializations [13, 70, 79], leveraging auxiliary supervision [18, 30], or imposing regularization terms [50]. Recently, explicit or hybrid representations of radiance fields [12, 48, 61] have also shown promising results regarding quality and speed. In our work, we use hybrid representations for modeling both user inputs and outputs in 3D, focusing on synthesizing novel images rather than reconstructing an existing scene. A recent work, Pix2NeRF [6], aims to translate a single image to a neural radiance field, which allows single-image novel view synthesis. In contrast, we focus on 3D-aware user-controlled content generation.

Conditional GANs. Generative adversarial networks (GANs) learn the distribution of natural images by forcing the generated and real images to be indistinguishable. They have demonstrated high-quality results on 2D image synthesis and manipulation [1, 3, 5, 20, 33-35, 59, 65, 72, 84, 85]. Several methods adopt image-conditional GANs [29, 47] for user-guided image synthesis and editing applications [26, 27, 38, 40, 55, 56, 62, 74, 86, 89]. In contrast, we propose a 3D-aware generative model conditioned on 2D user inputs that can render view-consistent images and enable interactive 3D editing. Recently, SoFGAN [11] uses a 3D semantic map generator and a 2D semantic-to-image generator to enable 3D-aware generation, but using 2D generators does not ensure 3D consistency.

3D-aware Image Synthesis. Early data-driven 3D image editing systems can achieve various 3D effects but often require a huge amount of manual effort [14, 37]. Recent works have integrated 3D structure into learning-based image generation pipelines using various geometric representations, including voxels [22, 88], voxelized 3D features [49], and 3D morphable models [71, 78]. However, many rely on external 3D data [71, 78, 88]. Recently, neural scene representations have been integrated into GANs to enable 3D-aware image synthesis [8, 9, 21, 51-53, 64, 77]. Intriguingly, these 3D-aware GANs can learn 3D structures without any 3D supervision. For example, StyleNeRF [21] and EG3D [8] learn to generate 3D representations by modulating either NeRFs or explicit representations with latent style vectors. This allows them to render high-resolution view-consistent images. Unlike the above methods, we focus on conditional synthesis and interactive editing rather than random sampling. Several works [17, 28, 42, 76] have explored sketch-based shape generation, but they do not allow realistic image synthesis.

Closely related to our work, Huang et al. [25] propose synthesizing novel views conditioned on a semantic map. Our work differs in three ways. First, we can predict full 3D labels, geometry, and appearance, rather than only 2D views, which enables cross-view editing. Second, our method can synthesize images with a much wider baseline than Huang et al. [25]. Finally, our learning algorithm does not require ground-truth multi-view images of the same scene. Two recent works, FENeRF [69] and 3DSGAN [80], also leverage semantic labels for training 3D-aware GANs, but they do not support conditional inputs and require additional efforts (e.g., GAN inversion) to allow user editing. Three concurrent works, IDE-3D [68], NeRFFaceEditing [31], and sem2nerf [15], also explore the task of 3D-aware generation based on segmentation masks. However, IDE-3D and sem2nerf only allow editing from a fixed view, and NeRFFaceEditing focuses on real image editing rather than generation. None of them include results for other input modalities. In contrast, we present a general-purpose method that works well for diverse datasets and input controls.

3. Method

Given a 2D label map Is, such as a segmentation or edge map, pix2pix3D generates a 3D volumetric representation of geometry, appearance, and labels that can be rendered from different viewpoints. Figure 2 provides an overview.

[Figure 2 diagram. Recoverable annotations: low-res renders Îc ∈ R^{64×64×3}, Îs ∈ R^{64×64×c}, Îφ ∈ R^{64×64×l}; upsampled outputs Î+c ∈ R^{512×512×3}, Î+s ∈ R^{512×512×c}.]
Figure 2. Overall pipeline. Given a 2D label map (e.g., segmentation map), a random latent code z, and a camera pose P̂ as inputs, our
generator renders the label map and image from viewpoint P̂ . Intuitively, the input label map specifies the geometric structure, while the
latent code captures the appearance, such as hair color. We begin with an encoder that encodes both the input label map and the latent
code into style vectors w+ . We then use w+ to modulate our 3D representation, which takes a spatial point x and outputs (1) color c ∈ R3 ,
(2) density σ, (3) feature ϕ ∈ Rl , and (4) label s ∈ Rc . We then perform volumetric rendering and 2D upsampling to get the high-res
label map Î+s and RGB image Î+c. For those rendered from ground-truth poses, we compare them to ground-truth labels and images with
an LPIPS loss and label reconstruction loss. We apply a GAN loss on labels and images rendered from both novel and original viewpoints.

We first introduce the formulation of our 3D conditional generative model for 3D-aware image synthesis in Section 3.1. Then, in Section 3.2, we discuss how to learn the model from color and label map pairs {Ic, Is} associated with poses P.

3.1. Conditional 3D Generative Models

Similar to EG3D [8], we adopt a hybrid representation for the density and appearance of a scene and use style vectors to modulate the 3D generations. To condition the 3D representations on 2D label map inputs, we introduce a conditional encoder that maps a 2D label map into a latent style vector. Additionally, pix2pix3D produces 3D labels that can be rendered from different viewpoints, allowing for cross-view user editing.

Conditional Encoder. Given a 2D label map input Is and a random latent code sampled from the spherical Gaussian space z ∼ N(0, I), our conditional encoder E outputs a list of style vectors w+ ∈ R^{l×256},

\mathbf{w}^{+} = E(\mathbf{I}_\mathbf{s}, \mathbf{z}),

where l = 13 is the number of layers to be modulated. Specifically, we encode Is into the first 7 style vectors that represent the global geometric information of the scene. We then feed the random latent code z through a Multi-Layer Perceptron (MLP) mapping network to obtain the rest of the style vectors that control the appearance.

Conditional 3D Representation. Our 3D representation is parameterized by tri-planes followed by a 2-layer MLP f [8], which takes in a spatial point x ∈ R^3 and returns 4 types of outputs: (1) color c ∈ R^3, (2) density σ ∈ R^+, (3) feature φ ∈ R^{64} for the purpose of 2D upsampling, and, most notably, (4) label s ∈ R^c, where c is the number of classes if Is is a segmentation map, otherwise 1 for edge labels. We make the field conditional by modulating the generation of the tri-planes F^{tri} with the style vectors w+. We also remove the view dependence of the color following [8, 21]. Formally,

(\mathbf{c}, \mathbf{s}, \sigma, \phi) = f(F^{\text{tri}}_{\mathbf{w}^{+}}(\mathbf{x})).
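A tri-plane query of this kind can be summarized with a short sketch; the feature width, activations, and MLP depth below are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes; the paper's exact widths may differ.
N_PTS, FEAT, N_CLASSES, FEAT_DIM = 1024, 32, 19, 64

def query_triplane(planes, pts, mlp):
    """planes: (3, FEAT, H, W) tri-plane features (already modulated by w+).
    pts: (N_PTS, 3) points in [-1, 1]^3.  Returns color, label logits, density, feature."""
    feats = 0
    for i, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):    # xy, xz, yz planes
        uv = pts[:, [a, b]].view(1, 1, -1, 2)                # (1, 1, N, 2) sampling grid
        sampled = F.grid_sample(planes[i:i+1], uv, mode='bilinear', align_corners=False)
        feats = feats + sampled.view(FEAT, -1).t()           # (N, FEAT), summed over planes
    out = mlp(feats)                                         # (N, 3 + 1 + FEAT_DIM + N_CLASSES)
    color, density, feature, label = torch.split(out, [3, 1, FEAT_DIM, N_CLASSES], dim=-1)
    return torch.sigmoid(color), label, F.softplus(density), feature

planes = torch.randn(3, FEAT, 256, 256)
mlp = torch.nn.Sequential(torch.nn.Linear(FEAT, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 3 + 1 + FEAT_DIM + N_CLASSES))
c, s, sigma, phi = query_triplane(planes, torch.rand(N_PTS, 3) * 2 - 1, mlp)
```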
Volume Rendering and Upsampling. We apply volumetric rendering to synthesize color images [32, 46]. In addition, we render label maps, which are crucial for enabling cross-view editing (Section 4.3) and improving rendering quality (Table 1). Given a viewpoint P̂ looking at the scene origin, we sample N points along the ray that emanates from a pixel location and query density, color, labels, and feature information from our 3D representation. Let xi be the i-th sampled point along the ray r. Let ci, si, and φi be the color, labels, and features of xi. Similar to [69], the color, label map, and feature images are computed as the weighted combination of queried values,

\mathbf{\hat I}_{\mathbf{c}}(r) = \sum_{i=1}^{N} \tau_i \mathbf{c}_i, \quad \mathbf{\hat I}_{\mathbf{s}}(r) = \sum_{i=1}^{N} \tau_i \mathbf{s}_i, \quad \mathbf{\hat I}_{\phi}(r) = \sum_{i=1}^{N} \tau_i \phi_i,   (1)

where the transmittance τi is computed as the probability of a photon traversing between the camera center and the i-th point given the length of the i-th interval δi,

\tau_i = \prod_{j=1}^{i} \exp(-\sigma_j \delta_j) \, (1 - \exp(-\sigma_i \delta_i)).
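A sketch of the compositing in Equation 1 for a batch of rays, using the transmittance definition above (tensor shapes and sample counts are assumptions):

```python
import torch

def composite(sigma, delta, color, label, feat):
    """sigma, delta: (R, N); color: (R, N, 3); label: (R, N, C); feat: (R, N, F).
    Returns per-ray color, label, and feature values via weighted compositing."""
    alpha = 1.0 - torch.exp(-sigma * delta)                    # opacity of each interval
    trans = torch.cumprod(torch.exp(-sigma * delta), dim=-1)   # accumulated transmittance
    w = trans * alpha                                          # tau_i in the paper
    I_c = (w.unsqueeze(-1) * color).sum(dim=1)
    I_s = (w.unsqueeze(-1) * label).sum(dim=1)
    I_phi = (w.unsqueeze(-1) * feat).sum(dim=1)
    return I_c, I_s, I_phi

R, N = 4096, 48
I_c, I_s, I_phi = composite(torch.rand(R, N), torch.full((R, N), 0.02),
                            torch.rand(R, N, 3), torch.rand(R, N, 19), torch.rand(R, N, 64))
```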
Figure 3. Cross-View Consistency Loss. Given an input label map Is and its associated pose P, we first infer the geometry latent code wg. From wg, we can generate a label map Îs from the same pose P, and Î′s from a random pose P′. Next, we infer w′g from the novel view Î′s, and render it back to the original pose P to obtain Î′′s. Finally, we add a reconstruction loss: LCVC = λCVC Ls(Î′′s, Îs).

Similar to prior works [8, 21, 52], we approximate Equation 1 with a 2D upsampler U to reduce the computational cost. We render high-res 512 × 512 images in two passes. In the first pass, we render low-res 64 × 64 images Îc, Îs, Îφ. Then a CNN up-sampler U is applied to obtain high-res images,

\mathbf{\hat I}_{\mathbf{c}}^+ = U(\mathbf{\hat I}_{\mathbf{c}}, \mathbf{\hat I}_{\phi}), \qquad \mathbf{\hat I}_{\mathbf{s}}^+ = U(\mathbf{\hat I}_{\mathbf{s}}, \mathbf{\hat I}_{\phi}).
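A toy stand-in for this two-pass upsampling step; the layer sizes and the single 8x upsampling below are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class Upsampler(nn.Module):
    """Maps a low-res render (RGB or label map) concatenated with the rendered
    feature image to a high-res output."""
    def __init__(self, in_ch=3, feat_dim=64, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + feat_dim, 128, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            nn.Conv2d(128, out_ch, 3, padding=1))

    def forward(self, low_res, feat_img):
        return self.net(torch.cat([low_res, feat_img], dim=1))

U = Upsampler()
I_c_hi = U(torch.randn(1, 3, 64, 64), torch.randn(1, 64, 64, 64))  # -> (1, 3, 512, 512)
```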
3.2. Learning Objective

Learning conditional 3D representations from monocular images is challenging due to its under-constrained nature. Given training data of associated images, label maps, and camera poses predicted by an off-the-shelf model, we carefully construct learning objectives, including reconstruction, adversarial, and cross-view consistency losses. These objectives will be described below.

Reconstruction Loss. Given a ground-truth viewpoint P associated with the color and label maps {Ic, Is}, we render color and label maps from P and compute reconstruction losses for both high-res and low-res output. We use LPIPS [82] to compute the image reconstruction loss Lc for color images. For the label reconstruction loss Ls, we use the balanced cross-entropy loss for segmentation maps or the L2 loss for edge maps,

\mathcal{L}_{\text{recon}} = \lambda_c \mathcal{L}_c(\mathbf{I}_{\mathbf{c}}, \{\mathbf{\hat I}_{\mathbf{c}}, \mathbf{\hat I}_{\mathbf{c}}^+\}) + \lambda_s \mathcal{L}_s(\mathbf{I}_{\mathbf{s}}, \{\mathbf{\hat I}_{\mathbf{s}}, \mathbf{\hat I}_{\mathbf{s}}^+\}),

where λc and λs balance the two terms.
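A sketch of such a reconstruction loss using the LPIPS package for Lc and a cross-entropy or L2 term for Ls (the library choice and default weights are assumptions):

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; perceptual metric of Zhang et al. [82]

lpips_fn = lpips.LPIPS(net='vgg')  # backbone choice is an assumption

def recon_loss(I_c, I_c_hat, I_s, I_s_hat, lambda_c=1.0, lambda_s=1.0, seg=True):
    """I_c, I_c_hat: real/rendered RGB in [-1, 1], shape (B, 3, H, W).
    I_s: (B, H, W) class indices if seg else (B, 1, H, W) edges; I_s_hat: logits/edges."""
    L_c = lpips_fn(I_c_hat, I_c).mean()
    if seg:
        L_s = F.cross_entropy(I_s_hat, I_s)   # class weights omitted here; see Appendix B
    else:
        L_s = F.mse_loss(I_s_hat, I_s)
    return lambda_c * L_c + lambda_s * L_s
```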
Pixel-aligned Conditional Discriminator. The reconstruction loss alone fails to synthesize detailed results from novel viewpoints. Therefore, we use an adversarial loss [20] to enforce renderings to look realistic from random viewpoints. Specifically, we have two discriminators Dc and Ds for RGB images and label maps, respectively. Dc is a widely-used GAN loss that takes real and fake images as input, while the pixel-aligned conditional discriminator Ds concatenates color images and label maps as input, which encourages pixel alignment between color images and label maps. Notably, in Ds, we stop the gradients for the color images to prevent a potential quality downgrade. We also feed the rendered low-res images to prevent the upsampler from hallucinating details inconsistent with the low-res output. The adversarial loss can be written as follows:

\mathcal{L}_{\text{GAN}} = \lambda_{D_{\mathbf{c}}} \mathcal{L}_{D_{\mathbf{c}}}(\mathbf{\hat I}_{\mathbf{c}}^+, \mathbf{\hat I}_{\mathbf{c}}) + \lambda_{D_{\mathbf{s}}} \mathcal{L}_{D_{\mathbf{s}}}(\mathbf{\hat I}_{\mathbf{c}}^+, \mathbf{\hat I}_{\mathbf{c}}, \mathbf{\hat I}_{\mathbf{s}}^+, \mathbf{\hat I}_{\mathbf{s}}),

where λDc and λDs balance the two terms. To stabilize the GAN training, we adopt the R1 regularization loss [45].
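One way to assemble the input of the pixel-aligned conditional discriminator Ds following the description above; whether the low-res renders are channel-concatenated or passed separately is an assumption:

```python
import torch
import torch.nn.functional as F

def d_s_input(I_c, I_s, I_c_lo, I_s_lo):
    """Concatenate RGB and label maps along the channel axis for D_s.  The RGB
    images are detached (stop-gradient), and the low-res renders are upsampled
    and appended so the upsampler cannot hallucinate inconsistent details."""
    lo = F.interpolate(torch.cat([I_c_lo, I_s_lo], 1), size=I_c.shape[-2:],
                       mode='bilinear', align_corners=False)
    return torch.cat([I_c.detach(), I_s, lo], dim=1)
```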
Cross-view Consistency Loss. We observe that inputting label maps of the same object from different viewpoints will sometimes result in different 3D shapes. Therefore, we add a cross-view consistency loss to regularize the training, as illustrated in Figure 3. Given an input label map Is and its associated pose P, we generate the label map Î′s from a different viewpoint P′, and render the label map Î′′s back to the pose P using Î′s as input. We add a reconstruction loss between Î′′s and Îs:

\mathcal{L}_{\text{CVC}} = \lambda_{\text{CVC}} \mathcal{L}_s(\mathbf{\hat I}_{\mathbf{s}}'', \mathbf{\hat I}_{\mathbf{s}}),

where Ls denotes the reconstruction loss in the label space, and λCVC weights the loss term. This loss is crucial for reducing error accumulation during cross-view editing.
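A sketch of the cross-view consistency round trip, where encode, render_label, sample_pose, and label_loss are hypothetical hooks standing in for the conditional encoder and renderer:

```python
def cvc_loss(I_s, P, encode, render_label, sample_pose, label_loss, lam_cvc=1.0):
    """Encode the input label map, render it from a random pose P', re-encode that
    rendering, render back at the original pose P, and penalize the mismatch."""
    w_g = encode(I_s)                            # geometry latent from the input view
    I_s_hat = render_label(w_g, P)               # label map re-rendered at the input pose
    P_prime = sample_pose()
    I_s_prime = render_label(w_g, P_prime)       # novel-view label map
    w_g_prime = encode(I_s_prime)                # re-infer geometry from the novel view
    I_s_roundtrip = render_label(w_g_prime, P)   # back to the original pose
    return lam_cvc * label_loss(I_s_roundtrip, I_s_hat)
```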
Optimization. Our final learning objective is written as follows:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{GAN}} + \mathcal{L}_{\text{CVC}}.

At every iteration, we determine whether to use a ground-truth pose or sample a random one with a probability of p. We use the reconstruction loss and GAN loss for ground-truth poses, while for random poses, we only use the GAN loss. We provide the hyper-parameters and more implementation details in Appendix B.
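A sketch of the per-iteration pose sampling described above; render, gan_loss, recon_loss, and sample_pose are hypothetical hooks:

```python
import random

def training_step(batch, render, gan_loss, recon_loss, sample_pose, p=0.5):
    """With probability p, render from a random pose and apply only the GAN loss;
    otherwise render from the ground-truth pose and apply reconstruction + GAN losses."""
    use_random_pose = random.random() < p
    pose = sample_pose() if use_random_pose else batch['pose']
    out = render(batch['label_map'], batch['z'], pose)
    loss = gan_loss(out)
    if not use_random_pose:
        loss = loss + recon_loss(out, batch)
    return loss
```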

Figure 4. Qualitative Comparison with Pix2NeRF [6], SoFGAN [11], and SEAN [89] on CelebAMask dataset for seg2face task. SEAN
fails in multi-view synthesis, while SoFGAN suffers from multi-view inconsistency (e.g., face identity changes across viewpoints). Our
method renders high-quality images while maintaining multi-view consistency. Please check our website for more examples.

Figure 5. Qualitative ablation on seg2face and seg2cat. We ablate our method by removing the branch that renders label maps (w/o 3D
Labels). Our results better align with input labels (e.g., hairlines and the cat’s ear).

Figure 6. Results on edge2cat. Our model is trained on AFHQ-cat [16] with edges extracted by pidinet [66].

Figure 7. Qualitative comparisons on edge2car. pix2pix3D (Ours) and Pix2NeRF [6] are trained on shapenet-car [10], and pix2pix3D achieves better quality and alignment than Pix2NeRF.
4. Experiment

We first introduce the datasets and evaluation metrics. Then we compare our method with the baselines. Finally, we demonstrate cross-view editing and multi-modal synthesis applications enabled by our method.

Datasets. We consider four tasks: seg2face, seg2cat, edge2cat, and edge2car in our experiments. For seg2face, we use CelebAMask-HQ [38] for evaluation. CelebAMask-HQ contains 30,000 high-resolution face images from CelebA [41], and each image has a facial part segmentation mask and a predicted pose. The segmentation masks contain 19 classes, including skin, eyebrows, ears, mouth, lip, etc. The pose associated with each image segmentation is predicted by HopeNet [60]. We split the CelebAMask-HQ dataset into a training set of 24,183, a validation set of 2,993, and a test set of 2,824, following the original work [38]. For seg2cat and edge2cat, we use AFHQ-cat [16], which contains 5,065 images at 512x resolution. We estimate the viewpoints using unsup3d [75]. We extract the edges using pidinet [66] and obtain segmentation by clustering DINO features [2] into 6 classes. For edge2car, we use 3D models from shapenet-car [10] and render 500,000 images at 128x resolution for training, and 30,000 for evaluation. We extract the edges using informative drawing [7]. We train our model at 512x resolution except for 128x in the edge2car task.

Running Time. For training the model at 512x resolution, it takes about three days on eight RTX 3090 GPUs. But we can significantly reduce the training time to 4 hours if we initialize parts of our model with pretrained weights from EG3D [8]. During inference, our model takes 10 ms to obtain the style vector, and another 30 ms to render the final image and the label map on a single RTX A5000. The low latency (25 FPS) allows for interactive user editing.
Seg2Face (CelebAMask [38])
Method              FID ↓   KID ↓   SG Diversity ↑   mIoU ↑        acc ↑         FVV Identity ↓
SEAN [89]           32.74   0.018   0.29             0.52          0.85          N/A
SoFGAN [11]         23.34   0.012   0.33             0.53          0.89          0.58
Pix2NeRF [6]        54.23   0.042   0.16             0.36          0.65          0.44
pix2pix3D (Ours)
  w/o 3D Labels     12.96   0.005   0.30             N/A (0.43)    N/A (0.81)    0.38
  w/o CVC           11.62   0.004   0.30             0.50 (0.50)   0.87 (0.85)   0.42
  Full Model        11.54   0.003   0.28             0.51 (0.52)   0.90 (0.88)   0.36
  Full Model †      11.13   0.003   0.29             0.51 (0.50)   0.90 (0.87)   0.36

Table 1. Seg2face Evaluation. Our metrics include image quality (FID, KID, SG Diversity), alignment (mIoU and acc against GT label maps), and multi-view consistency (FVV Identity). Single-generation diversity (SG Diversity) is obtained by computing the LPIPS metric between randomly generated pairs given a single conditional input. To evaluate alignment, we compare the generated label maps against the ground truth in terms of mIoU and pixel accuracy (acc). Alternatively, given a generated image, one could estimate label maps via a face parser, and compare those against the ground truth (numbers in parentheses). We include SEAN [89] and SoFGAN [11] as baselines, and modify Pix2NeRF [6] to take conditional input. Our method achieves the best quality, alignment acc, and FVV Identity while being competitive on SG Diversity. SoFGAN tends to have better alignment but worse 3D consistency. We also ablate our method w.r.t. the 3D labels and the cross-view consistency (CVC) loss. Our 3D labels are crucial for alignment, while the CVC loss improves multi-view consistency. Using pretrained models from EG3D (†) also improves the performance.

Seg2Cat (AFHQ-Cat [16])
Method              FID ↓   KID ↓   SG Diversity ↑   mIoU ↑        acc ↑
Pix2NeRF [6]        43.92   0.081   0.15             0.27          0.58
Ours
  w/o 3D Labels     10.41   0.004   0.26             N/A (0.49)    N/A (0.69)
  w/o CVC            9.64   0.004   0.26             0.66 (0.63)   0.76 (0.73)
  Full Model         8.62   0.003   0.27             0.66 (0.62)   0.78 (0.73)

Table 3. Seg2cat Evaluation. We compare our method with Pix2NeRF [6] on Seg2Cat using the AFHQ-cat dataset [16], with segmentation obtained by clustering DINO features [2]. Similar to Table 1, we evaluate the image quality and alignment. Ours performs better in all metrics.

Figure 8. Semantic Mesh. We show semantic meshes of human and cat faces from marching cubes colored by 3D labels.
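A sketch of how a semantic mesh like those in Figure 8 could be extracted from sampled density and label grids; the grid resolution and iso-level are assumptions:

```python
import numpy as np
from skimage import measure

def semantic_mesh(density, labels, level=10.0):
    """density: (R, R, R) sampled sigma grid; labels: (R, R, R, C) sampled label logits.
    Runs marching cubes on the density and colors each vertex by the argmax label of
    the nearest grid cell.  The iso-level is a hypothetical choice."""
    verts, faces, _, _ = measure.marching_cubes(density, level=level)
    idx = np.clip(np.round(verts).astype(int), 0, density.shape[0] - 1)
    vert_labels = labels[idx[:, 0], idx[:, 1], idx[:, 2]].argmax(-1)
    return verts, faces, vert_labels

density = np.random.rand(64, 64, 64) * 20
labels = np.random.rand(64, 64, 64, 19)
verts, faces, vert_labels = semantic_mesh(density, labels)
```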

Edge2Car (shapenet-car [10])
Method              FID ↓   KID ↓   SG Diversity ↑   AP ↑
Pix2NeRF [6]        23.42   0.014   0.06             0.28
pix2pix3D (Ours)
  w/o 3D Labels     10.73   0.005   0.12             0.45 (0.42)
  w/o CVC            9.42   0.004   0.13             0.61 (0.59)
  Full Model         8.31   0.004   0.13             0.63 (0.59)

Table 2. Edge2car Evaluation. We compare our method with Pix2NeRF [6] on edge2car using the shapenet-car [10] dataset. Similar to Table 1, we evaluate FID, KID, and SG Diversity for image quality. We also evaluate the alignment with the input edge map using AP. Similarly, we can either run informative drawing [7] on generated images to obtain edge maps (numbers in parentheses) or directly use generated edge maps to calculate the metrics. We achieve better image quality and alignment than Pix2NeRF. We also find that using 3D labels and the cross-view consistency loss is helpful regarding the FID and AP metrics.

4.1. Evaluation metrics

We evaluate the models from two aspects: 1) the image quality regarding fidelity and diversity, and 2) the alignment between input label maps and generated outputs.

Quality Metrics. Following prior works [21, 57], we use the clean-fid library [58] to compute Fréchet Inception Distance (FID) [23] and Kernel Inception Distance (KID) [4] to measure the distribution distance between synthesized results and real images. We also evaluate the single-generation diversity (SG Diversity) by calculating the LPIPS metric between randomly generated pairs given a single input, following prior works [11, 87]. For FID and KID, we generate 10 images per label map in the test set using randomly sampled z. We compare our generated images with the whole dataset, including training and test images.
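A sketch of these quality metrics using the cited clean-fid library and an LPIPS implementation; the directory paths and the LPIPS backbone are placeholders/assumptions:

```python
from cleanfid import fid      # clean-fid library [58]
import itertools
import lpips

# Distribution metrics between generated images and the full dataset.
fid_score = fid.compute_fid("generated_images/", "dataset_images/")
kid_score = fid.compute_kid("generated_images/", "dataset_images/")

# SG Diversity: mean LPIPS between pairs generated from a single label map.
lpips_fn = lpips.LPIPS(net='alex')   # backbone choice is an assumption

def sg_diversity(images):
    """images: list of (1, 3, H, W) tensors in [-1, 1] generated from one input."""
    dists = [lpips_fn(a, b).item() for a, b in itertools.combinations(images, 2)]
    return sum(dists) / len(dists)
```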
[Figure 9 plots: FID, KID, mIoU (generated and predicted), and acc (generated and predicted) as functions of the random-pose sampling probability p.]

Figure 9. We study the effect of random pose sampling probability p during training. Without random poses (p = 0), the model achieves the best alignment with input semantic maps, with reduced image quality. In contrast, only using random poses (p = 1) achieves the best image quality, while results fail to align with input maps. We find p = 0.5 balances the image quality and input alignment.

Figure 10. Cross-view Editing of Edge2Car. Our 3D editing system allows users to edit label maps from any viewpoint instead of only the input view. Importantly, our feed-forward encoder allows fast inference of the latent code without GAN inversion. Typically, a single forward pass of rendering takes only 40 ms on a single RTX A5000, which enables interactive editing. Please check our demo video on our website.

Alignment Metrics. We evaluate models on the test set using mean Intersection-over-Union (mIoU) and pixel accuracy (acc) for segmentation maps following existing works [57, 63], and average precision (AP) for edge maps. For those models that render label maps as output, we directly compare them with ground-truth labels. Otherwise, we first predict the label maps from the output RGB images using off-the-shelf networks [38, 66], and then compare the prediction with the ground truth. The metrics regarding such predicted semantic maps are reported within brackets in Table 1 and Table 2.
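A minimal sketch of mIoU and pixel accuracy computed from a confusion matrix over predicted and ground-truth label maps:

```python
import numpy as np

def miou_and_acc(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape.  Returns (mIoU, pixel accuracy);
    classes absent from both prediction and ground truth are ignored in the mean."""
    conf = np.bincount(num_classes * gt.reshape(-1) + pred.reshape(-1),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter[union > 0] / union[union > 0]
    return iou.mean(), inter.sum() / conf.sum()
```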
For seg2face, we evaluate the preservation of facial identity from different viewpoints (FVV Identity) by calculating their distances with the dlib face recognition algorithm*.

* https://github.com/ageitgey/face_recognition

4.2. Baseline comparison

Baselines. Since there are no prior works on conditional 3D-aware image synthesis, we make minimum modifications to Pix2NeRF [6] to be conditional on label maps instead of images. For a thorough comparison, we introduce several baselines: SEAN [89] and SoFGAN [11]. 2D baselines like SEAN [89] cannot generate multi-view images by design (N/A for FVV Identity), while SoFGAN [11] uses an unconditional 3D semantic map generator before the 2D generator, so we can evaluate FVV Identity for that.

Results. Figure 4 shows the qualitative comparison for seg2face and Table 1 reports the evaluation results. SoFGAN [11] tends to produce results with slightly better alignment but worse 3D consistency due to its 2D RGB generator. Our method achieves the best quality, alignment acc, and FVV Identity while being competitive with 2D baselines on SG diversity. Figure 5 shows the qualitative ablation on seg2face and seg2cat. Table 3 reports the metrics for seg2cat. Figure 6 shows the example results for edge2cat. Figure 7 shows the qualitative comparison for edge2car and Table 2 reports the metrics. Our method achieves the best image quality and alignment. Figure 8 shows semantic meshes of human and cat faces, extracted by marching cubes and colored by our learned 3D labels. We provide more evaluation results in Appendix A.

Ablation Study. We compare our full method to several variants. Specifically, (1) w/o 3D Labels: we remove the branch of rendering label maps from our method, and (2) w/o CVC: we remove the cross-view consistency loss. From Table 1, Table 2, and Figure 5, rendering label maps is crucial for the alignment with the input. We posit that the joint learning of appearance, geometry, and label information poses strong constraints on the correspondence between the input label maps and the 3D representation. Thus our method can synthesize images pixel-aligned with the inputs. Our CVC loss helps preserve the facial identity from different viewpoints.
Figure 11. Multi-modal Synthesis. The leftmost column is the input segmentation map. We use the same segmentation map for each row. We generate multi-modal results by randomly sampling an appearance style for each column.
Analysis on random sampling of poses. We study the effect of different probabilities of sampling random poses during training, as shown in Figure 9. When sampling no random poses (p = 0), the model best aligns with input label maps but with suboptimal image quality. Conversely, only sampling random poses (p = 1) gives the best image quality but suffers from huge misalignment with input label maps. We find p = 0.5 achieves a balance between the image quality and the alignment with the input.

Figure 12. Interpolation. In each 5 x 5 grid, the images at the top left and bottom right are generated from the input maps next to them. Each row interpolates two images in label space, while each column interpolates the appearance. For camera poses, we interpolate the pitch along the row and the yaw along the column.

4.3. Applications

Cross-view Editing. As shown in Figure 10, our 3D editing system allows users to generate and edit label maps from any viewpoint instead of only the input view. The edited label map is further fed into the conditional encoder to update the 3D representation. Unlike GAN inversion [85], our feed-forward conditional encoder allows fast inference of the latent code. Thus, a single forward pass of our full model takes only 40 ms on a single RTX A5000.

Multi-modal synthesis and interpolation. Like other style-based generative models [8, 21, 34, 36], our method can disentangle the geometry and appearance information. Specifically, the input label map captures the geometry information while the randomly sampled latent code controls the appearance. We show style manipulation results in Figure 11. We can also interpolate both the geometry styles and the appearance styles (Figure 12). These results show the clear disentanglement of our 3D representation.

5. Discussion

We have introduced pix2pix3D, a 3D-aware conditional generative model for controllable image synthesis. Given a 2D label map, our model allows users to render images from any viewpoint. Our model augments the neural field with 3D labels, assigning label, color, and density to every 3D point, allowing for the simultaneous rendering of the image and a pixel-aligned label map. The learned 3D labels further enable interactive 3D cross-view editing. We discuss the broader impact and limitations in the appendix.

Acknowledgments. We thank Sheng-Yu Wang, Nupur Kumari, Gaurav Parmer, Ruihan Gao, Muyang Li, George Cazenavette, Andrew Song, Zhipeng Bao, Tamaki Kojima, Krishna Wadhwani, Takuya Narihira, and Tatsuo Fujiwara for their discussion and help. We are grateful for the support from Sony Corporation, Singapore DSTA, and the CMU Argo AI Center for Autonomous Vehicle Research.
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In IEEE International Conference on Computer Vision (ICCV), 2019.
[2] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep ViT features as dense visual descriptors. ECCVW What is Motion For?, 2022.
[3] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. In ACM SIGGRAPH, 2019.
[4] Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
[5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
[6] Shengqu Cai, Anton Obukhov, Dengxin Dai, and Luc Van Gool. Pix2NeRF: Unsupervised conditional π-GAN for single image to neural radiance fields translation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[7] Caroline Chan, Frédo Durand, and Phillip Isola. Learning to generate line drawings that convey geometry and semantics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[8] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[9] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[10] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University / Princeton University / Toyota Technological Institute at Chicago, 2015.
[11] Anpei Chen, Ruiyang Liu, Ling Xie, Zhang Chen, Hao Su, and Jingyi Yu. SofGAN: A portrait image generator with dynamic styling. In ACM SIGGRAPH, 2021.
[12] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. In European Conference on Computer Vision (ECCV), 2022.
[13] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[14] Tao Chen, Zhe Zhu, Ariel Shamir, Shi-Min Hu, and Daniel Cohen-Or. 3-Sweep: Extracting editable objects from a single photo. ACM Transactions on Graphics (TOG), 32(6):1-10, 2013.
[15] Yuedong Chen, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. Sem2NeRF: Converting single-view semantic masks to neural radiance fields. In European Conference on Computer Vision (ECCV), 2022.
[16] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[17] Johanna Delanoy, Adrien Bousseau, Mathieu Aubry, Phillip Isola, and Alexei A. Efros. What you sketch is what you get: 3D sketching using multi-view deep volumetric prediction. In ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), 2018.
[18] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[19] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[21] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. In International Conference on Learning Representations (ICLR), 2022.
[22] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato's cave: 3D shape from adversarial rendering. In IEEE International Conference on Computer Vision (ICCV), 2019.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[25] Hsin-Ping Huang, Hung-Yu Tseng, Hsin-Ying Lee, and Jia-Bin Huang. Semantic view synthesis. In European Conference on Computer Vision (ECCV), 2020.
[26] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision (ECCV), 2018.
[27] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe Legendre, Linjie Luo, Chongyang Ma, and Hao Li. Deep volumetric video from very sparse multi-view performance capture. In European Conference on Computer Vision (ECCV), 2018.
[28] Takeo Igarashi, Satoshi Matsuoka, and Hidehiko Tanaka. Teddy: A sketching interface for 3D freeform design. In ACM SIGGRAPH, 1999.
[29] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[30] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting NeRF on a diet: Semantically consistent few-shot view synthesis. In IEEE International Conference on Computer Vision (ICCV), 2021.
[31] Kaiwen Jiang, Shu-Yu Chen, Feng-Lin Liu, Hongbo Fu, and Lin Gao. NeRFFaceEditing: Disentangled face editing in neural radiance fields. In ACM SIGGRAPH Asia, 2022.
[32] James T. Kajiya and Brian P. Von Herzen. Ray tracing volume densities. ACM SIGGRAPH, 18(3):165-174, 1984.
[33] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[34] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[35] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[36] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[37] Natasha Kholgade, Tomas Simon, Alexei Efros, and Yaser Sheikh. 3D object manipulation in a single photograph using stock 3D models. ACM Transactions on Graphics (TOG), 33(4):1-12, 2014.
[38] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. MaskGAN: Towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[39] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: Bundle-adjusting neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
[40] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[41] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision (ICCV), 2015.
[42] Zhaoliang Lun, Matheus Gadelha, Evangelos Kalogerakis, Subhransu Maji, and Rui Wang. 3D shape reconstruction from sketches via multi-view convolutional networks. In International Conference on 3D Vision (3DV), 2017.
[43] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[44] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su, Lan Xu, Xuming He, and Jingyi Yu. GNeRF: GAN-based neural radiance field without posed camera. In IEEE International Conference on Computer Vision (ICCV), 2021.
[45] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
[46] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
[47] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[48] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. In ACM SIGGRAPH, 2022.
[49] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In IEEE International Conference on Computer Vision (ICCV), 2019.
[50] Michael Niemeyer, Jonathan T. Barron, Ben Mildenhall, Mehdi S. M. Sajjadi, Andreas Geiger, and Noha Radwan. RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[51] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[52] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. StyleSDF: High-resolution 3D-consistent image and geometry generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[53] Xingang Pan, Xudong Xu, Chen Change Loy, Christian Theobalt, and Bo Dai. A shading-guided generative implicit model for shape-accurate 3D-aware image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[54] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[55] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision (ECCV), 2020.
[56] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[57] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[58] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[59] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In IEEE International Conference on Computer Vision (ICCV), 2021.
[60] Nataniel Ruiz, Eunji Chong, and James M. Rehg. Fine-grained head pose estimation without keypoints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
[61] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[62] Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In International Conference on Learning Representations (ICLR), 2020.
[63] Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In International Conference on Learning Representations (ICLR), 2021.
[64] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[65] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[66] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikainen, and Li Liu. Pixel difference networks for efficient edge detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[67] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew Davison. iMAP: Implicit mapping and positioning in real-time. In IEEE International Conference on Computer Vision (ICCV), 2021.
[68] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. IDE-3D: Interactive disentangled editing for high-resolution 3D-aware portrait synthesis. In ACM Transactions on Graphics (TOG), 2022.
[69] Jingxiang Sun, Xuan Wang, Yong Zhang, Xiaoyu Li, Qi Zhang, Yebin Liu, and Jue Wang. FENeRF: Face editing in neural radiance fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[70] Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[71] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhofer, and Christian Theobalt. StyleRig: Rigging StyleGAN for 3D control over portrait images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[72] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. In ACM Transactions on Graphics (TOG), 2021.
[73] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[74] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[75] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[76] Xiaohua Xie, Kai Xu, Niloy J. Mitra, Daniel Cohen-Or, Wenyong Gong, Qi Su, and Baoquan Chen. Sketch-to-design: Context-based part assembly. Computer Graphics Forum, 32:233-245, 2013.
[77] Yinghao Xu, Sida Peng, Ceyuan Yang, Yujun Shen, and Bolei Zhou. 3D-aware image synthesis via learning structural and textural representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[78] Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, Bill Freeman, and Josh Tenenbaum. 3D-aware scene manipulation via inverse graphics. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[79] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[80] Jichao Zhang, Enver Sangineto, Hao Tang, Aliaksandr Siarohin, Zhun Zhong, Nicu Sebe, and Wei Wang. 3D-aware semantic-guided generative model for human synthesis. In European Conference on Computer Vision (ECCV), 2022.
[81] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
[82] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[83] Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, and Sanja Fidler. DatasetGAN: Efficient labeled data factory with minimal human effort. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[84] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain GAN inversion for real image editing. In European Conference on Computer Vision (ECCV), 2020.
[85] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision (ECCV), 2016.
[86] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[87] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[88] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman. Visual object networks: Image generation with disentangled 3D representations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[89] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. SEAN: Image synthesis with semantic region-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[90] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Appendix

We include additional experimental results, implementation details, and the societal impact of our work. Please also view our webpage for our interactive editing demo video and additional visual results.

Seg2Car                    Quality                        Alignment
(ShapeNet-Car)        FID ↓    KID ↓    SG Diversity ↑    mIoU ↑    acc ↑
Pix2NeRF              25.86    0.018    0.08              0.24      0.59
Ours                   9.35    0.004    0.14              0.58      0.88

Table 5. Seg2Car Evaluation. We compare our method with Pix2NeRF [6]. Ours performs better in all metrics.
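For reference, the Alignment columns in Table 5 (and in Table 4 below) report mean IoU and pixel accuracy between label maps. The snippet below is a generic sketch of these two metrics over integer class-ID maps; the exact evaluation protocol (views, resolution, and how the compared label maps are obtained) follows the main paper, and the function name is ours.

```python
import numpy as np

def miou_and_pixel_acc(pred_labels, gt_labels, num_classes):
    """Sketch of mean IoU and pixel accuracy between two integer class-ID
    label maps (e.g., the input label map and the label map rendered at
    the same view)."""
    pred = np.asarray(pred_labels).ravel()
    gt = np.asarray(gt_labels).ravel()
    pixel_acc = float((pred == gt).mean())

    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return float(np.mean(ious)), pixel_acc
```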

A. Additional Experiments

Cross-view Editing of Seg2cat. In addition to the edge2car editing example in the main paper, we showcase the editability of segmentation maps in Figure 13. Note that the edited segmentation map does not have to be from the same viewpoint as the input segmentation map.

Ablation study on discriminator design. In the main paper, we introduce our Pixel-aligned Conditional Discriminator, which concatenates RGB images and label maps as input. To verify the effectiveness of this design, we report three ablation experiments in Table 4. We find that our image discriminator helps improve image quality, while our pixel-aligned conditional discriminator is crucial for alignment; a minimal sketch of the pixel-aligned conditioning follows Table 4.

CelebA-Mask              Quality             Alignment
                      FID ↓    KID ↓     mIoU ↑         acc ↑
Ours                  11.54    0.003     0.51 (0.52)    0.90 (0.88)
w/o Image D           15.32    0.006     0.51 (0.52)    0.89 (0.85)
w/o Conditional D     12.02    0.004     0.37 (0.47)    0.82 (0.80)
w/o Pixel-Align D     11.94    0.003     0.41 (0.40)    0.82 (0.81)

Table 4. Ablation Study on Discriminator Design. To verify the effectiveness of our discriminator design, we introduce three ablation experiments: (1) w/o Image D, we remove the image discriminator and only keep the conditional discriminator that accepts the concatenation of image and segmentation maps; (2) w/o Conditional D, we remove the conditional discriminator and only keep the image discriminator; (3) w/o Pixel-Align D, we keep both discriminators, but the conditional discriminator no longer concatenates color images as part of the input. Our image discriminator improves the image quality, while our pixel-aligned conditional discriminator ensures alignment.
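To make the pixel-aligned conditioning concrete, here is a minimal sketch of a conditional discriminator that concatenates the label map with the RGB image along the channel dimension, so that real/fake evidence stays tied to spatial locations. The layer widths, depth, and class name below are illustrative placeholders, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class PixelAlignedConditionalD(nn.Module):
    # Sketch only: a patch-style conditional discriminator that scores an RGB
    # image together with a pixel-aligned label map via channel concatenation.
    def __init__(self, label_channels, width=64):
        super().__init__()
        in_ch = 3 + label_channels  # RGB + one-hot (or rendered) label channels
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(width, width * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(width * 2, 1, 4, stride=1, padding=1),  # per-patch logits
        )

    def forward(self, rgb, label_map):
        # Concatenating along channels keeps the conditioning pixel-aligned.
        return self.net(torch.cat([rgb, label_map], dim=1))
```

Dropping the RGB channels from this concatenated input corresponds to the w/o Pixel-Align D variant in Table 4, while removing this discriminator entirely corresponds to w/o Conditional D.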
Evaluation on Seg2Car. We evaluate our method on an additional non-face dataset, Seg2Car, where we obtain the segmentation model from DatasetGAN [83]. We show the visual and evaluation results in Figure 14 and Table 5 and find that our method outperforms Pix2NeRF [6]. Figure 15 compares our method with SoFGAN [11] regarding multi-view consistency. We also show our method's capability of correcting errors in the user input in Figure 16.

B. Implementation Details

Class-balanced cross-entropy. In Section 3.2 of the main text, we mentioned using a class-balanced cross-entropy loss for reconstructing 2D segmentation maps. Specifically,

\mathcal{L}_s(\hat{\mathbf{I}}_{\mathbf{s}}, \mathbf{I}_{\mathbf{s}}^+) = \mathbb{E}_n\!\left[-\sum_{c=1}^{C} w_c \, y_{n,c} \log \frac{\exp(x_{n,c})}{\sum_{i=1}^{C} \exp(x_{n,i})}\right],

where x_{n,c} is the semantic logit of class c at location n of I_s^+, y_{n,c} is the ground-truth probability of class c at location n of Î_s, and w_c is the weight of class c.

In our case, 2D segmentation maps are imbalanced, as skin and hair cover much larger areas than the other classes. We therefore calculate w_c based on the inverse frequency of each class in the training set:

w_c = \sqrt{\frac{\text{\# of all pixels}}{\text{\# of pixels with class } c}}.
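As a concrete illustration, the snippet below sketches this weighted loss with PyTorch's built-in cross-entropy. The clamping and the exact place where the weights are computed (here, over a batch of training label maps) are our own choices for the sketch, not a prescription from the paper.

```python
import torch
import torch.nn.functional as F

def inverse_frequency_weights(label_maps, num_classes):
    # w_c = sqrt(# of all pixels / # of pixels with class c), computed over training labels.
    counts = torch.bincount(label_maps.flatten(), minlength=num_classes).float()
    return torch.sqrt(counts.sum() / counts.clamp(min=1.0))

def class_balanced_ce(seg_logits, target_labels, class_weights):
    # seg_logits: (B, C, H, W) semantic logits; target_labels: (B, H, W) class IDs.
    # F.cross_entropy combines log-softmax and NLL and applies per-class weights,
    # matching the weighted cross-entropy L_s above (up to its reduction convention).
    return F.cross_entropy(seg_logits, target_labels, weight=class_weights)
```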
Regularization. As mentioned in the main text, we use the non-saturating loss [20] and R1 regularization [45] for GAN training, following [21, 36]. Specifically,

\mathcal{L}_{\text{GAN}}(G, D) = \mathbb{E}\!\left[f\!\left(D(G(z, \hat{\mathbf{I}}_{\mathbf{s}}))\right)\right] + \mathbb{E}\!\left[f\!\left(-D(\hat{\mathbf{I}}_{\mathbf{c}})\right) + \lambda \,\|\nabla D(\hat{\mathbf{I}}_{\mathbf{c}})\|^2\right],

where G is the generator, D is the discriminator, f(u) = -\log(1 + \exp(-u)), and \lambda = 0.5.
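The snippet below sketches this objective in the standard StyleGAN2-style form (softplus-based non-saturating loss plus an R1 gradient penalty on the discriminator's real inputs). Sign conventions and where the penalty is applied vary between papers and codebases, so treat this as an illustration of the equation above rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_logits):
    # Non-saturating generator loss: -log sigmoid(D(G(z, label))) = softplus(-logits).
    return F.softplus(-fake_logits).mean()

def discriminator_loss(real_logits, fake_logits, real_images, r1_weight=0.5):
    # Real/fake terms use f(u) = -log(1 + exp(-u)) = -softplus(-u) from above,
    # written here in the usual minimization form; r1_weight matches lambda = 0.5.
    adv = F.softplus(fake_logits).mean() + F.softplus(-real_logits).mean()
    # R1 penalty: squared gradient norm of D w.r.t. its image input.
    # Note: real_images must be created with requires_grad=True.
    grad = torch.autograd.grad(real_logits.sum(), real_images, create_graph=True)[0]
    r1 = grad.pow(2).reshape(grad.shape[0], -1).sum(dim=1).mean()
    return adv + r1_weight * r1
```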
Hyper-parameters. We set λ_c = 1, λ_s = 5 for edge maps and λ_s = 1 for segmentation maps, λ_{D_c} = 1, λ_{D_s} = 0.1, and λ_{CVC} = 1e-5. Please check our codebase for more detailed hyperparameters.

Figure 13. Cross-view Editing of Seg2cat. The 3D representation can be edited from a viewpoint different than the input seg map. (Panels: input seg map; edited seg map with some face pixels erased; multi-view generation of RGB & seg maps.)

Figure 14. Visual Results of Seg2Car. (Panels: input seg map at GT and novel views; output images; output seg maps.)

Figure 15. Multi-view Consistency. We compare our method with SoFGAN [11] regarding multi-view consistency. Although SoFGAN can generate images from different viewpoints, their method shifts the face identity across viewpoints. In contrast, our method better preserves the identity. (Rows: input; ours; SoFGAN.)

Figure 16. Our model projects the user input onto the learned manifold. Even if the user input contains errors, our model is able to fix them, e.g., completing the missing eyes and eyebrows. (Panels: input seg map with missing eyes & eyebrows; output seg map and output image with completed eyes & eyebrows.)

C. Discussion

Broader Impact. Our work allows a novice user to create 3D content more easily. The 3D outputs can be directly used in photo editing software as well as virtual reality and augmented reality applications.

Similar to recent works on data-driven 2D and 3D face synthesis, we suffer from biases in the underlying dataset. Our model is trained on the CelebAMask-HQ dataset, as it provides segmentation masks that can be used as conditional input. To reduce the dataset bias, one future direction is to run our model on more diverse datasets with a pre-trained face parser. While our work allows for controllable 3D content generation, there may be potential misuse of the generated content. As an attempt to identify generated content from real photos, we run a forensics detector [73] on our generated results and find that our generated images can be detected with an accuracy of 89.77% and an average precision of 99.97%.

Usage of Existing Assets. We use the CelebAMask-HQ dataset [38]. The CelebA dataset is available for non-commercial research purposes only. All images of the CelebA dataset are obtained from the Internet. The face identities are released upon request for research purposes only. See the CelebA website for details. We also use the AFHQ Cat dataset [16], which is under a Creative Commons license. Our work is also inspired by a few codebases. The StyleNeRF codebase [21] is under a Creative Commons license. The StyleGAN2 codebase [36] and the EG3D codebase [8] are under the Nvidia Source Code License.

Limitations. Our current method has three major limitations. First, it mainly focuses on modeling the appearance and geometry of a single object category. Extending the method to more complex scene datasets with multiple objects is a promising next step, though defining a canonical pose for generic scenes poses a nontrivial challenge. Second, our model's generation is limited to the dataset's distribution. Our model will not follow the user input unless it is within the dataset's distribution. Incorporating diffusion models and training on more diverse datasets can potentially improve the generalization. Finally, our model training requires camera poses associated with each training image, though our method does not require poses during inference time. Eliminating the requirement for pose information will further broaden the scope of applications.
D. Changelog
V2. Add more citations in Section 2. Fix some typos.
V1. Initial preprint release (CVPR 2023).
