Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation
Haochen Wang*¹  Xiaodan Du*¹  Jiahao Li*¹  Raymond A. Yeh²  Greg Shakhnarovich¹
¹TTI-Chicago  ²Purdue University
Figure 1. Results for text-driven 3D generation using Score Jacobian Chaining with Stable Diffusion as the pretrained model. Prompts: "A zoomed out high-quality photo of Temple of Heaven"; "A high quality photo of a delicious burger".
function f with parameters θ, i.e., x = f(θ). Applying the chain rule through the Jacobian ∂x/∂θ converts a gradient on the image x into a gradient on the parameters θ. There are many potential use cases for pairing a pretrained diffusion model with different choices of f. In this work we are interested in exploring the connection between 3D and multiview 2D by choosing f to be a differentiable renderer, thus creating a 3D generative model using only pretrained 2D resources.

Many prior works [2, 58, 60] perform 3D generative modeling by training on 3D datasets [5, 23, 55, 59]. This approach is often as challenging as it is format-ambiguous. In addition to the high data acquisition cost of 3D assets [9], there is no universal data format: point clouds, meshes, volumetric radiance fields, etc., all have computational trade-offs. What is common to these 3D assets is that they can be rendered into 2D images. An inverse rendering system, or a differentiable renderer [24, 26, 29, 34, 39], provides access to the Jacobian J_π ≜ ∂x_π/∂θ of a rendered image x_π at camera viewpoint π with respect to the underlying 3D parameterization θ. Our method uses differentiable rendering to aggregate 2D image gradients over multiple viewpoints into a 3D asset gradient, and lifts a generative model from 2D to 3D. We parameterize a 3D asset θ as a radiance field stored on voxels and choose f to be the volume rendering function.

A key technical challenge is that computing the 2D score by directly evaluating a diffusion model on a rendered image x_π leads to an out-of-distribution (OOD) problem. Generally, diffusion models are trained as denoisers and have only seen noisy inputs during training. Our method, on the other hand, requires evaluating the denoiser on non-noisy rendered images from a 3D asset during optimization, which leads to the OOD problem. To address this issue, we propose Perturb-and-Average Scoring, an approach to estimate the score for non-noisy images.

Empirically, we first validate the effectiveness of Perturb-and-Average Scoring at solving the OOD problem and explore the hyperparameter choices on a simple 2D image canvas. Here we identify open problems in using unconditioned diffusion models trained on FFHQ and LSUN Bedroom. Next, we use Stable Diffusion, a model pretrained on the web-scale LAION dataset, to perform SJC for 3D generation, as shown in Fig. 1. Our contributions are as follows:

• We propose a method for lifting a 2D diffusion model to 3D via an application of the chain rule.
• We illustrate the challenge of OOD when using a pretrained denoiser and propose Perturb-and-Average Scoring to resolve it.
• We point out the subtleties and open problems in applying Perturb-and-Average Scoring as a gradient for optimization.
• We demonstrate the effectiveness of SJC for the task of 3D text-driven generation.

2. Related Works

Diffusion models have recently advanced to image generation on Internet-scale datasets [10, 36, 42, 44–47]. A diffusion model can be interpreted as either a VAE [12, 49] or a denoising score-matcher [51, 53, 56]. Notably, models trained under one regime can be directly used for inference and sampling by the other [18, 53]; they are in practice largely equivalent.

Neural radiance fields (NeRF) are a family of inverse rendering algorithms that have excelled at multiview 3D reconstruction tasks including view synthesis and surface geometry estimation [31, 34, 40, 57, 61]. Conceptually, a 3D asset is represented as a dense grid of RGB colors and spatial density τ, and rendered into images in a way analogous to alpha compositing [32]. NeRF parameterizes the (RGB, τ) volume with a neural network, but querying the network densely in 3D incurs significant compute costs. Alternatively, voxel NeRFs [6, 27, 54, 62] store the volume on voxels and observe no loss in end-task performance [54, 62]. Querying voxels is a simple memory operation that is much faster than a feedforward pass of a neural network. Here we use a customized voxel radiance field with hyperparameters based on DVGO [54] and TensoRF [6].

2D-supervised 3D GANs [35, 43, 64] pioneered the approach of training 3D generative models using only unstructured 2D images, and promise greater scalability in terms of data. Rather than supervising directly on the 3D asset a model generates, these methods supervise the 2D renderings of the generated 3D asset, often using an adversarial loss [3, 4, 38, 48, 65]. In other words, only images are needed as training data. However, training such a 3D generative model from scratch is still challenging [37], and recent empirical evaluation remains mostly on human and animal faces [3]. Our method does the opposite: we take an image generative model that is already pretrained on large amounts of 2D data and use it to guide the iterative optimization of a 3D asset. Optimization-based generation is much slower than 3D GANs, but it becomes possible to harness powerful off-the-shelf 2D generative models such as Stable Diffusion [45] for greater content diversity.

CLIP-guided, optimization-based 3D generative models share a similar philosophy of optimizing 3D assets by guiding on 2D renderings [14, 16, 17, 20, 25, 33]. Among them, DreamFields [16] and PureClipNeRF [25] also use NeRF as their differentiable renderer. In this case, the 2D guidance comes from CLIP [42], a pretrained image-text matching model. These works optimize the 3D assets so that the image renderings match a user-provided text prompt. Since CLIP is not a 2D generative model per se, such a pipeline usually creates abstract, distilled content [28] that looks very different from real images. In contrast, we use diffusion models, which are proper 2D generative models, to create realistic-looking 3D content.

DreamFusion. The recently arXived work by Poole et al. [41], independent of and concurrent to our work, proposes an algorithm that is similar to our approach at the pseudo-code level. Differently, their procedure uses the mathematical setup of Graikos et al. [11] to search for an image parametrization that minimizes the training loss of a diffusion model. In contrast, our work is motivated by applying the chain rule to the 2D score. The key differences are summarized in Sec. 4.3. In terms of implementation, we do not have access to the closed-source Imagen [46] diffusion model. Instead, we use the pretrained Stable Diffusion model released by Rombach et al. [45]. For a comparison with DreamFusion, we use a third-party implementation based on the same diffusion model, namely Stable-DreamFusion.

sample through a sequence of noise levels σ_T > ··· > σ_0 = 0. {σ_i} are chosen empirically, with a typical range being [0.01, 157] [12] in the case of DDPM.

Score as mean-shift. A helpful intuition is that the score behaves like mean-shift [7, 8]. If we simplify p_data to be an empirical data distribution over the i.i.d. samples {y_i}, then at noise level σ, p_σ(x) takes the form of a mixture of Gaussians [52]:

  p_σ(x) = E_{y∼p_data} N(x; y, σ²I).    (4)

In this case there exists a closed-form expression [18, 52] for the optimal denoiser:

  D(x; σ) = [Σ_i N(x; y_i, σ²I) y_i] / [Σ_i N(x; y_i, σ²I)].    (5)
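To make the mean-shift intuition concrete, the following is a minimal NumPy sketch of the closed-form optimal denoiser in Eq. (5) for an empirical data distribution; the function name and array layout are our own illustration, not part of any released implementation.

```python
import numpy as np

def optimal_denoiser(x, ys, sigma):
    """Eq. (5): for an empirical data distribution over samples {y_i},
    the optimal denoiser is a softmax-weighted average of the data
    points, i.e., a weighted nearest neighbor of the query x.

    x:  (d,)   query image, flattened
    ys: (N, d) data samples y_i, flattened
    """
    # log N(x; y_i, sigma^2 I), up to an additive constant shared by all i
    logw = -((x - ys) ** 2).sum(axis=1) / (2 * sigma**2)
    w = np.exp(logw - logw.max())  # subtract max for numerical stability
    w /= w.sum()
    return w @ ys                  # sum_i w_i * y_i

# The corresponding score, as in Eq. (12): (D(x; sigma) - x) / sigma^2.
```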
Figure 2. Illustration of the denoiser's OOD issue using a denoiser pretrained on FFHQ. When directly evaluating D(x_blob, σ), the model did not correct the orange blob into a face image. Contrarily, evaluating the denoiser on the noised input, D(x_blob + σn, σ), produces an image that successfully merges the blob with the face manifold.

Figure 3. Computing PAAS on 2D renderings x_π. Directly evaluating D(x_π; σ) leads to an OOD problem. Instead, we add noise to x_π and evaluate D(x_π + σn; σ) (blue dots). The PAAS is then computed by averaging over the brown dashed arrows, corresponding to multiple samples of n. See Sec. 4.1 for details.

4.1. Computing 2D Score on Non-Noisy Images

Computing the 3D score in Eq. (11) requires the 2D score on x_π. A first attempt would be to directly apply the score from the denoiser in Eq. (3), i.e.,

  score(x_π, σ) ≜ (D(x_π; σ) − x_π)/σ².    (12)

Unfortunately, evaluating the pretrained denoiser D on x_π causes an out-of-distribution (OOD) problem. From the training objective in Eq. (1), at each noise level σ, the denoiser D has only seen noisy inputs of the distribution y + σn, where y ∼ p_data and n ∼ N(0, I). However, a rendered image x_π from a 3D asset θ is generally not consistent with this distribution.

We illustrate this OOD situation in Fig. 2. Given a denoiser pretrained on FFHQ [19] by Baranchuk et al. [1], we visualize the output D(x_blob; σ = 6.5), where the input x_blob is a non-noisy image showing an orange blob centered on a grey canvas. Under the intuition that D predicts a weighted nearest neighbor, as reviewed in Eq. (5), we expect the denoiser to blend the orange blob with the manifold of faces. In reality, however, we observe sharp artifacts when updating with this score (D(x_blob; σ) − x_blob)/σ², and the image moves further away from the face manifold.

Perturb-and-Average Scoring. To address the OOD problem, we propose Perturb-and-Average Scoring (PAAS). It computes the score on non-noisy images x_π with a denoiser D by adding noise to the input and then taking the expectation of the predicted scores w.r.t. the random noise:

  PAAS(x_π, √2σ) ≜ E_{n∼N(0,I)}[score(x_π + σn, σ)]    (13–14)
  = E_n[(D(x_π + σn, σ) − (x_π + σn))/σ²]    (15)
  = E_n[(D(x_π + σn, σ) − x_π)/σ²] − E_n[n]/σ,    (16)

where the last term E_n[n]/σ = 0. In practice, we use a Monte Carlo estimate of the expectation in Eq. (16). The algorithm is illustrated in Fig. 3. Given a set of sampled noises {n_i}, each D(x_π + σn_i; σ) provides an update direction on the perturbed input x_π + σn_i. By averaging over the noise perturbations {n_i}, we obtain an update direction on x_π itself.

Justifying PAAS in Eq. (13). We show that Perturb-and-Average Scoring provides an approximation to the score on x_π at an inflated noise level of √2σ:

  PAAS(x_π, √2σ) ≈ ∇_{x_π} log p_{√2σ}(x_π).    (17)

Lemma 1. Assuming an empirical data distribution p_σ(x) as in Eq. (4), for any x ∈ R^d,

  log p_{√2σ}(x) ≥ E_{n∼N(0,I)} log p_σ(x + σn).    (18)

Proof. Observe that the LHS of Eq. (19) is a convolution of two Gaussians, therefore

  E_{n∼N(0,I)}[N(x + σn; μ, σ²I)] = N(x; μ, 2σ²I).    (19)

Recall that p_σ(x) is a mixture of Gaussians per Eq. (4):

  p_{√2σ}(x) = E_{y∼p_data} N(x; y, 2σ²I)    (20)
  = E_{y∼p_data} E_{n∼N(0,I)} N(x + σn; y, σ²I)    (21)
  = E_{n∼N(0,I)} E_{y∼p_data} N(x + σn; y, σ²I)    (22)
  = E_{n∼N(0,I)} p_σ(x + σn).    (23)

Taking the log on both sides of Eq. (23) and applying Jensen's inequality, we arrive at Eq. (18).

Claim 1. Assuming a trained denoiser D as in Eq. (3), our PAAS(x_π, √2σ) in Eq. (13) computes the gradient of a lower bound of log p_{√2σ}(x).

Proof. Taking the gradient of the RHS of Lemma 1,

  ∇_x E_n log p_σ(x + σn) = E_n ∇_{x+σn} log p_σ(x + σn) = E_n[score(x + σn, σ)],    (24)

which is the proposed PAAS algorithm in Eq. (13).
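For concreteness, here is a minimal PyTorch-style sketch of the Monte Carlo estimate of Eq. (16); `denoiser` stands in for a pretrained D(·; σ), and the function name and sample count are illustrative, not the released implementation.

```python
import torch

def paas(x_pi, sigma, denoiser, n_samples=4):
    """Perturb-and-Average Scoring, Eqs. (13)-(16), via Monte Carlo.

    x_pi:     rendered, non-noisy image
    sigma:    noise level at which the denoiser is evaluated
    denoiser: pretrained D(x, sigma) -> denoised image
    Returns an estimate of the score of p_{sqrt(2) sigma} at x_pi.
    """
    scores = []
    for _ in range(n_samples):
        n = torch.randn_like(x_pi)             # n ~ N(0, I)
        d = denoiser(x_pi + sigma * n, sigma)  # evaluate D on a perturbed input
        # Eq. (16): the E[n]/sigma term vanishes in expectation, so drop it
        scores.append((d - x_pi) / sigma**2)
    return torch.stack(scores).mean(dim=0)
```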
Figure 4. Qualitative comparison of the Annealed σ and Random σ schedules, on models trained on FFHQ and LSUN.

4.2. Inverse Rendering on Voxel Radiance Field

With the computation of the 2D score resolved, the other half of our setup in Eq. (11) requires access to the Jacobian of a differentiable renderer.

3D Representation. We represent a 3D asset θ as a voxel radiance field [6, 54, 62], which is much faster to access and update than a vanilla NeRF parameterized by a neural network [34]. The parameters θ consist of a density voxel grid V^(density) ∈ R^{1×N_x×N_y×N_z} and a voxel grid of appearance features V^(app) ∈ R^{c×N_x×N_y×N_z}. Convention-
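Below is a minimal sketch of how the renderer Jacobian chains the 2D score into a 3D asset gradient via automatic differentiation. Here `render` is a placeholder for a differentiable volume renderer over the voxel grids, `paas` is the scoring routine of Sec. 4.1, and the update rule shown is an illustration of the chaining, not the exact released optimizer.

```python
import torch

def sjc_step(theta, render, paas, denoiser, cameras, sigma, lr=0.05):
    """One SJC update. For each viewpoint pi, render x_pi = f(theta),
    compute the 2D score with PAAS (treated as a constant, i.e., no
    gradient flows through the denoiser), and chain it through the
    renderer Jacobian J_pi = d x_pi / d theta. Gradients from all
    viewpoints accumulate into a single 3D asset gradient."""
    theta.grad = None
    for cam in cameras:
        x_pi = render(theta, cam)                # differentiable rendering
        with torch.no_grad():
            score = paas(x_pi, sigma, denoiser)  # 2D score on the rendering
        x_pi.backward(gradient=score)            # accumulates J_pi^T @ score
    with torch.no_grad():
        theta += lr * theta.grad                 # ascend along the 3D score
    return theta
```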
Figure 5. Qualitative results of text-prompted generation of 3D models with SJC, purely from the pretrained Stable Diffusion (2D) image model. Each row shows two views, with associated depth maps (blue is far, red is near), for a single 3D model generated for a given prompt. Note the detailed appearance as well as a sharp, well-defined depth structure. Prompts: "A DSLR photo of a yellow duck"; "A ficus planted in a pot"; "A zoomed out photo of a small castle"; "A high quality photo of a toy motorcycle"; "A zoomed out high quality photo of Sydney Opera House"; "A photo of a horse walking".
Figure 6. Qualitative comparison between Stable-DreamFusion (StableDF) and Ours. The prompts are: (a) "A high quality photo of a delicious burger"; (b) "a DSLR photo of a yellow duck"; (c) "A ficus planted in a pot"; (d) "A product photo of a toy tank"; (e) "A high quality photo of a chocolate icecream cone"; (f) "A wide angle zoomed out photo of a giraffe". Both methods are run for 10k iterations without per-prompt finetuning of the hyperparameters. The images on the left are rendered RGB images and the images on the right are depth visualizations.
on the vector field produced by PAAS creates high-quality images. Here an important decision to make is the schedule of {σ_i} at which we compute PAAS.

We experimented with an annealed schedule (Annealed σ) vs. a random schedule (Random σ) as proposed in DreamFusion. Under the Annealed σ schedule, we start from a large σ and gradually decrease it as we update the image canvas x. PAAS computed at a larger σ attends to high-level image structure, while a smaller σ provides stronger guidance on detailed features. The Random σ schedule, on the contrary, uniformly samples a σ at every step. We show qualitative comparisons in Fig. 4.

For unconditioned diffusion models trained on FFHQ, we observe that Annealed σ performs better than Random σ, and the image samples have better pose variation and quality. In particular, Random σ exhibits severe mode-seeking behavior, converging to average faces. In the case of LSUN Bedroom, the mode-seeking behavior results in a blurry image canvas with no content.

On the other hand, natural language prompting plays a critical role when sampling images with Stable Diffusion. When the language guidance is set to a regular level of 3.0, the observations are broadly consistent with sampling on FFHQ and LSUN: the Random σ schedule produces blurry outputs. However, when the guidance scale is elevated to 10.0, the Random σ schedule begins to generate crisp, clean images and outperforms the Annealed σ schedule. Despite various sophisticated strategies for Annealed σ scheduling (see our code for details), at a high language guidance scale Random σ remains the better option. We hypothesize that stronger language guidance forces the image distribution to be narrower, which benefits a mode-seeking algorithm. We acknowledge that none of the images in Fig. 4 can match the sample quality of a standard diffusion inference pipeline, and the right way to apply PAAS as a gradient for optimization remains an open problem.
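A sketch of the two schedules discussed above; the σ range is borrowed from the DDPM range quoted earlier, and the geometric anneal is one simple choice, not the exact strategy in our released code.

```python
import numpy as np

SIGMA_MAX, SIGMA_MIN = 157.0, 0.01  # typical DDPM range quoted above

def annealed_sigma(step, n_steps):
    """Annealed schedule: geometric decay from a large sigma to a small one."""
    t = step / max(n_steps - 1, 1)
    return SIGMA_MAX * (SIGMA_MIN / SIGMA_MAX) ** t

def random_sigma(rng):
    """Random schedule: uniformly sample a sigma at every step."""
    return rng.uniform(SIGMA_MIN, SIGMA_MAX)

rng = np.random.default_rng(0)
sigmas = [annealed_sigma(i, 100) for i in range(100)]  # large -> small
```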
Figure 7. Ablation experiments on the proposed emptiness loss schedule, for the prompt "A zoomed out high quality photo of Temple of Heaven". For each setting of the loss weight λ, we show a rendered image and the associated depth map from a randomly sampled viewpoint. Ours incorporates the loss with the weight schedule described in Sec. 5.2. It leads to better 3D shape, as evidenced by the cleaner depth maps. Setting the loss weight too low yields "cloudy" depth fields; setting it too high makes SJC fail to produce meaningful 3D models.
7. Acknowledgements

The authors would like to thank David McAllester for feedback on an early pitch of the work, and Shashank Srivastava and Madhur Tulsiani for discussing the √2 factor on synthetic experiments. We would like to thank friends at TRI and the 3DL lab at UChicago for suggestions on the manuscript. HC would like to thank Kavya Ravichandran for incredible officemate support, and Michael Maire for the discussion and encouragement while riding Metra.

References

[1] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.
[2] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. In Eur. Conf. Comput. Vis., 2020.
[3] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3D generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[4] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. arXiv preprint arXiv:2203.09517, 2022.
[7] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 1995.
[8] Dorin Comaniciu and Peter Meer. Mean shift analysis and applications. In Int. Conf. Comput. Vis., 1999.
[9] Amélie Deltombe. How much does it cost to create 3D models?, Apr 2022.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Adv. Neural Inform. Process. Syst., 2021.
[11] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. arXiv preprint arXiv:2206.09012, 2022.
[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Adv. Neural Inform. Process. Syst., 2020.
[13] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[14] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars. arXiv preprint arXiv:2205.08535, 2022.
[15] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 2005.
[16] Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[17] Nikolay Jetchev. ClipMatrix: Text-controlled creation of 3D textured meshes. arXiv preprint arXiv:2109.12922, 2021.
[18] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
[19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[20] Nasir Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. CLIP-Mesh: Generating textured meshes from text using pretrained image-text models. ACM Trans. Graph., 2022.
[21] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Adv. Neural Inform. Process. Syst., 2021.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[24] Christoph Lassner and Michael Zollhöfer. Pulsar: Efficient sphere-based neural rendering. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[25] Han-Hung Lee and Angel X Chang. Understanding pure CLIP guidance for voxel grid NeRF models. arXiv preprint arXiv:2209.15172, 2022.
[26] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. ACM Trans. Graph., 2018.
[27] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Adv. Neural Inform. Process. Syst., 2020.
[28] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573, 2021.
[29] Matthew M Loper and Michael J Black. OpenDR: An approximate differentiable renderer. In Eur. Conf. Comput. Vis., 2014.
[30] Dimitra Maoutsa, Sebastian Reich, and Manfred Opper. Interacting particle solutions of Fokker–Planck equations through gradient–log–density estimation. Entropy, 2020.
[31] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[32] Nelson Max. Optical models for direct volume rendering. IEEE Trans. Vis. Comput. Graph., 1995.
[33] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2Mesh: Text-driven neural stylization for meshes. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[34] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM, 2021.
[35] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In Int. Conf. Comput. Vis., 2019.
[36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[37] Michael Niemeyer and Andreas Geiger. CAMPARI: Camera-aware decomposed generative neural radiance fields. In Int. Conf. 3DV, 2021.
[38] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[39] Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wenzel Jakob. Mitsuba 2: A retargetable forward and inverse renderer. ACM Trans. Graph., 2019.
[40] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Int. Conf. Comput. Vis., 2021.
[41] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Int. Conf. Mach. Learn., 2021.
[43] Sai Rajeswar, Fahim Mannan, Florian Golemo, David Vazquez, Derek Nowrouzezahrai, and Aaron Courville. Pix2Scene: Learning implicit 3D representations from images. OpenReview, 2018.
[44] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[45] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[46] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[47] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[48] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. In Adv. Neural Inform. Process. Syst., 2020.
[49] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Int. Conf. Mach. Learn., 2015.
[50] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Int. Conf. Learn. Represent., 2021.
[51] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Adv. Neural Inform. Process. Syst., 2019.
[52] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Adv. Neural Inform. Process. Syst., 2020.
[53] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Int. Conf. Learn. Represent., 2021.
[54] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[55] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3D: Dataset and methods for single-image 3D shape modeling. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[56] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 2011.
[57] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
[58] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Adv. Neural Inform. Process. Syst., 2016.
[59] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conf. Comput. Vis. Pattern Recog., 2015.
[60] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. PointFlow: 3D point cloud generation with continuous normalizing flows. In Int. Conf. Comput. Vis., 2019.
[61] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Adv. Neural Inform. Process. Syst., 2021.
[62] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021.
[63] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[64] Yuxuan Zhang, Wenzheng Chen, Huan Ling, Jun Gao, Yinan Zhang, Antonio Torralba, and Sanja Fidler. Image GANs meet differentiable rendering for inverse graphics and interpretable 3D neural rendering. arXiv preprint arXiv:2010.09125, 2020.
[65] Xiaoming Zhao, Fangchang Ma, David Güera, Zhile Ren, Alexander G Schwing, and Alex Colburn. Generative multiplane images: Making a 2D GAN 3D-aware. In Eur. Conf. Comput. Vis., 2022.
Appendix

• In Sec. A1, we review diffusion models from a score-based perspective, following Karras et al. [18].
• In Sec. A2, we provide additional experiments on our approach, including an additional ablation study, qualitative results, and video results.

Figure A1. Training and sampling algorithm card for score-based methods with scaling s(t) = 1 and σ(t) = t. Note that the inference step is analogous to DDIM [50], and simplifies to a weighted averaging between the current iterate x_t and the denoiser output D(x_t, σ_t). This particular scheduling allows for taking large step sizes, and a sample can be generated in as few as 80 network evaluations [18] while maintaining high image quality.
where z ∼ N(0, I) and x_0 is a sample drawn from the data distribution. s(t) and σ(t) are user-defined coefficients. Here the coefficient on the noise z is parameterized as the product of s(t) and σ(t), so that σ(t) represents the noise-to-signal ratio in x_t.

SMLD [51, 52], DDIM [50] and Karras [18] set the scaling to identity, i.e., s(t) = 1, and therefore adding noise by x_0 + σ(t)z causes x_t to grow numerically as t increases. DDPM, on the other hand, introduced a rapidly decreasing s(t) to scale down the successive x_t so that at any time t, p_t(x) has variance fixed at 1. This goal of maintaining a standard deviation of 1 requires that DDPM specify s(t) through a set of β_t, i.e.,

  s(t) = √ᾱ_t = √(∏_{i≤t} α_i) = √(∏_{i≤t} (1 − β_i)),  and therefore  σ(t) = √((1 − ᾱ_t)/ᾱ_t).
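A small sketch of these coefficients, assuming an illustrative linear β schedule (the constants are common DDPM defaults, not values prescribed here):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # illustrative linear schedule
alpha_bar = np.cumprod(1.0 - betas)           # prod_{i<=t} (1 - beta_i)

s = np.sqrt(alpha_bar)                        # s(t): rapidly decreasing scaling
sigma = np.sqrt((1 - alpha_bar) / alpha_bar)  # sigma(t): noise-to-signal ratio

# With x_t = s(t) * (x_0 + sigma(t) * z), a unit-variance x_0 gives
# Var[x_t] = alpha_bar + (1 - alpha_bar) = 1 at every t, as intended.
```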
The noising step (28) describes the marginal distribution p_t(x). The infinitesimal time evolution of this process can be written as the following stochastic differential equation [53]:

  dx = f(t)x dt + g(t) dω_t,  where  f(t) = ṡ(t)/s(t),  g(t) = s(t)√(2σ̇(t)σ(t)).    (33)

The Fokker–Planck equation [30] states that a stochastic differential equation of the form (33) is identified with a partial differential equation describing the marginal probability density p_t(x):

  dx = f(x, t) dt + g(t) dω_t   ⟷   ∂p_t(x)/∂t = −∇ · [ f(x, t) p_t(x) − (g(t)²/2) ∇_x p_t(x) ].    (34)
Applying this identity tells us that a stochastic differential equation like (33) implies a deterministic, ordinary differential equation. Here we illustrate the proof schematically. The stochastic process

  dx = f(t)x dt + g(t) dω_t

corresponds, via Fokker–Planck, to

  ∂p_t(x)/∂t = −∇ · [ f(t)x p_t(x) − (g(t)²/2) ∇_x p_t(x) ],    (35)

while the deterministic process

  dx = [ f(t)x − (g(t)²/2) ∇_x log p_t(x) ] dt + 0 · dω_t

corresponds to

  ∂p_t(x)/∂t = −∇ · [ ( f(t)x − (g(t)²/2) ∇_x log p_t(x) ) p_t(x) ] − 0.

The two right-hand sides are equal by the log-derivative trick:

  ∂p_t(x)/∂t = −∇ · [ f(t)x p_t(x) − (g(t)²/2) ∇_x p_t(x) ]    (36)
  = −∇ · [ ( ( f(t)x p_t(x) − (g(t)²/2) ∇_x p_t(x) ) / p_t(x) ) p_t(x) ]    (37)
  = −∇ · [ ( f(t)x − (g(t)²/2) ∇_x p_t(x)/p_t(x) ) p_t(x) ]    (38)
  = −∇ · [ ( f(t)x − (g(t)²/2) ∇_x log p_t(x) ) p_t(x) ].    (39)

Hence the two processes share the same marginals p_t(x).
Substituting the expressions for f(t) and g(t) from (33), we obtain an ODE from which we can sample data by applying the score function, with a step schedule that theoretically guarantees to take us back to the initial, clean data distribution:

  dx = [ (ṡ(t)/s(t)) x − (1/2) s(t)² · 2σ̇(t)σ(t) ∇_x log p_t(x) ] dt    (40)
  dx = [ (ṡ(t)/s(t)) x − s(t) (σ̇(t)/σ(t)) ( D(x/s(t); σ(t)) − x/s(t) ) ] dt.    (41)

When s(t) = 1, σ(t) = t, the above simplifies to

  dx = −σ_t · (D(x; σ_t) − x)/σ_t² dt    (42)
  dx = −σ_t · score(x, σ_t) dt.    (43)

Note that this schedule with s(t) = 1, σ(t) = t allows for taking large step sizes during inference, since it introduces no extra curvature in the trajectory beyond what is induced by the score function itself. The discretized sampling algorithm of equation (43) is described in Fig. A1.
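A minimal Euler discretization of Eq. (43), assuming s(t) = 1, σ(t) = t; `denoiser` and the σ grid are placeholders, and this sketch omits the refinements of the full algorithm card in Fig. A1.

```python
import torch

def sample_ode(denoiser, sigmas, shape):
    """Integrate dx = -sigma_t * score(x, sigma_t) dt with Euler steps.
    sigmas: decreasing noise levels sigma_T > ... > sigma_0 = 0."""
    x = sigmas[0] * torch.randn(shape)  # initialize from the widest noise level
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        score = (denoiser(x, s_cur) - x) / s_cur**2   # Eq. (42)
        x = x + (s_next - s_cur) * (-s_cur * score)   # Euler step of Eq. (43)
    return x
```

Expanding one step gives x ← (σ_{i+1}/σ_i) x + (1 − σ_{i+1}/σ_i) D(x, σ_i), i.e., exactly the weighted averaging between the current iterate and the denoiser output noted in the caption of Fig. A1.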
Figure A2. Ablation experiments on the proposed center depth loss, for the prompt "A high quality photo of french fries from McDonald's". Each pair of corresponding columns of the same prompt is visualized from the same camera angle.
"view of" depending on the camera location. More specifically, when the camera elevation is above 30 degrees we use the "overhead view" prompt. Otherwise, the prompts are assigned based on the azimuth quadrant the camera falls into. This technique helps to alleviate the degeneracy of multiple frontal faces being painted around an object during optimization. As future work, we hope to develop a more general solution that induces the optimization towards more plausible geometry without using language as guidance.
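A sketch of this prompt-assignment rule: the 30-degree elevation threshold follows the text, while the azimuth-quadrant-to-phrase mapping below is an illustrative assumption.

```python
def view_dependent_prompt(base_prompt, elevation_deg, azimuth_deg):
    """Prefix the prompt with a view phrase chosen from the camera pose."""
    if elevation_deg > 30:
        view = "overhead view"
    else:
        quadrant = int((azimuth_deg % 360) // 90)  # four azimuth quadrants
        view = ["front view", "side view", "back view", "side view"][quadrant]
    return f"{view} of {base_prompt}"
```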
Figure A3. Additional results of text-prompted generation of 3D models with SJC. Prompts: "Trump figure"; "Obama figure"; "Biden figure"; "Zelda Link"; "A pig".