
Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation

Haochen Wang*1   Xiaodan Du*1   Jiahao Li*1   Raymond A. Yeh2   Greg Shakhnarovich1
1TTI-Chicago   2Purdue University
*Equal contribution.

arXiv:2212.00774v1 [cs.CV] 1 Dec 2022

Figure 1. Results for text-driven 3D generation using Score Jacobian Chaining with Stable Diffusion as the pretrained model. Prompts: "A zoomed out high-quality photo of Temple of Heaven", "A high quality photo of a delicious burger", "A high quality photo of a Victorian style wooden chair with velvet upholstery", "A high quality photo of a classic silver muscle car".

Abstract

A diffusion model learns to predict a vector field of gradients. We propose to apply chain rule on the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and re-purposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION 5B dataset.

1. Introduction

We introduce a method that converts a pretrained 2D diffusion generative model on images into a 3D generative model of radiance fields, without requiring access to any 3D data. The key insight is to interpret diffusion models as learned predictors of a gradient field, often referred to as the score function of the data log-likelihood. We apply the chain rule on the estimated score, hence the name Score Jacobian Chaining (SJC).

Following Hyvärinen and Dayan [15], the score is defined as the gradient of the log-density function with respect to the data. Diffusion models of various families [12, 49, 51, 53] can all be interpreted [18, 21, 53] as modeling ∇x log pσ(x), i.e., the denoising score at noise level σ. For readability, we refer to the denoising score as the score. Generating a sample from a diffusion model involves repeated evaluations of the score function from large to small σ levels, so that a sample x gradually moves closer to the data manifold. It can be loosely interpreted as gradient descent, with precise control on the step sizes so that the data distribution evolves to match the annealed σ level (ancestral sampler [12], SDE and probability-flow ODE [53], etc.). While there are other perspectives on a diffusion model [12, 49], here we are primarily motivated by the viewpoint that diffusion models produce a gradient field.

A natural question to ask is whether the chain rule can be applied to the learned gradients. Consider a diffusion model on images. An image x may be parameterized by some function f with parameters θ, i.e., x = f(θ). Applying the chain rule through the Jacobian ∂x/∂θ converts a gradient on the image x into a gradient on the parameters θ. There are many potential use cases for pairing a pretrained diffusion model with different choices of f. In this work we are interested in exploring the connection between 3D and multiview 2D by choosing f to be a differentiable renderer, thus creating a 3D generative model using only pretrained 2D resources.

Many prior works [2, 58, 60] perform 3D generative modeling by training on 3D datasets [5, 23, 55, 59]. This approach is often as challenging as it is format-ambiguous. In addition to the high data acquisition cost of 3D assets [9], there is no universal data format: point clouds, meshes, volumetric radiance fields, etc., all have computational trade-offs. What is common to these 3D assets is that they can be rendered into 2D images. An inverse rendering system, or a differentiable renderer [24, 26, 29, 34, 39], provides access to the Jacobian Jπ ≜ ∂xπ/∂θ of a rendered image xπ at camera viewpoint π with respect to the underlying 3D parameterization θ. Our method uses differentiable rendering to aggregate 2D image gradients over multiple viewpoints into a 3D asset gradient, and lifts a generative model from 2D to 3D. We parameterize a 3D asset θ as a radiance field stored on voxels and choose f to be the volume rendering function.

A key technical challenge is that computing the 2D score by directly evaluating a diffusion model on a rendered image xπ leads to an out-of-distribution (OOD) problem. Generally, diffusion models are trained as denoisers and have only seen noisy inputs during training. On the other hand, our method requires evaluating the denoiser on non-noisy rendered images from a 3D asset during optimization, which leads to the OOD problem. To address the issue, we propose Perturb-and-Average Scoring, an approach to estimate the score for non-noisy images.

Empirically, we first validate the effectiveness of Perturb-and-Average Scoring at solving the OOD problem and explore the hyperparameter choices on a simple 2D image canvas. Here we identify open problems on using unconditioned diffusion models trained on FFHQ and LSUN Bedroom. Next, we use Stable Diffusion, a model pretrained on the web-scale LAION dataset, to perform SJC for 3D generation, as shown in Fig. 1. Our contributions are as follows:

• We propose a method for lifting a 2D diffusion model to 3D via an application of the chain rule.

• We illustrate the challenge of OOD when using a pretrained denoiser and propose Perturb-and-Average Scoring to resolve it.

• We point out the subtleties and open problems in applying Perturb-and-Average Scoring as a gradient for optimization.

• We demonstrate the effectiveness of SJC for the task of 3D text-driven generation.

2. Related Works

Diffusion models have recently advanced to image generation on Internet-scale datasets [10, 36, 42, 44–47]. A diffusion model could be interpreted as either a VAE [12, 49] or a denoising score-matcher [51, 53, 56]. Notably, models trained under one regime can be directly used for inference and sampling by the other [18, 53]; they are in practice largely equivalent.

Neural radiance fields (NeRF) are a family of inverse rendering algorithms that have excelled at multiview 3D reconstruction tasks including view synthesis and surface geometry estimation [31, 34, 40, 57, 61]. Conceptually, a 3D asset is represented as a dense grid of RGB colors and spatial density τ, and rendered into images in a way analogous to alpha compositing [32]. NeRF parameterizes the (RGB, τ) volume with a neural network, but querying the network densely in 3D incurs significant compute costs. Alternatively, voxel NeRFs [6, 27, 54, 62] store the volume on voxels and observe no loss in end-task performance [54, 62]. Querying voxels is a simple memory operation that is much faster than a feedforward pass of a neural network. Here we use a customized voxel radiance field with hyperparameters based on DVGO [54] and TensoRF [6].

2D-supervised 3D GANs pioneered [35, 43, 64] the approach of training 3D generative models using only unstructured 2D images, and promise greater scalability in terms of data. Rather than supervising directly on the 3D asset a model generates, these methods supervise the 2D renderings of the generated 3D asset, often using an adversarial loss [3, 4, 38, 48, 65]. In other words, only images are needed as training data. However, training such a 3D generative model from scratch is still challenging [37]. Recent empirical evaluation remains mostly on human and animal faces [3]. Our method does the opposite: we take an image generative model that is already pretrained on large amounts of 2D data and use it to guide the iterative optimization of a 3D asset. Optimization-based generation makes it much slower compared to 3D GANs, but it becomes possible to harness powerful off-the-shelf 2D generative models such as Stable Diffusion [45] for greater content diversity.

CLIP-guided, optimization-based 3D generative models share a similar philosophy of optimizing 3D assets by guiding on 2D renderings [14, 16, 17, 20, 25, 33]. Among them, DreamFields [16] and PureClipNeRF [25] also use NeRF as their differentiable renderers. In this case, the 2D guidance comes from CLIP [42], a pretrained image-text matching model. These works optimize the 3D assets so that the image renderings match a user-provided text prompt. Since CLIP is not a 2D generative model per se, such a pipeline usually creates some abstract distilled content [28] that looks very different from real images. In contrast, we use diffusion models, which are proper 2D generative models, to create realistic-looking 3D content.
DreamFusion. The recently arXived work by Poole et al. [41], independent of and concurrent to our work, proposes an algorithm that is similar to our approach at the pseudo-code level. Differently, their procedure uses the mathematical setup by Graikos et al. [11] to search for an image parametrization that minimizes the training loss of a diffusion model. In contrast, our work is motivated by applying the chain rule to the 2D score. The key differences are summarized in Sec. 4.3. In terms of implementation, we do not have access to the closed-source Imagen [46] diffusion model. Instead, we use the pretrained Stable Diffusion model released by Rombach et al. [45]. For a comparison with DreamFusion, we use a third-party implementation based on the same diffusion model, namely Stable-DreamFusion (github.com/ashawkey/stable-dreamfusion).

3. Preliminaries

To establish a common notation, we briefly review the score-based perspective of diffusion models. For readers familiar with the VAE literature on diffusion models, we provide a concise score-based formula card and more details in Appendix Sec. A1 to connect these ideas.

Denoising score matching. Given a dataset of samples Y = {yi} drawn from pdata, a diffusion model revolves primarily around learning a denoiser D by minimizing the difference between a noised sample y + σn and y,

Ey∼pdata En∼N(0,I) ‖D(y + σn; σ) − y‖₂²,   (1)

i.e., D denoises the input y + σn, for a range of σ values. For 2D images, D is commonly chosen to be a ConvNet. Variants such as DDPM [12] parameterize the ConvNet to instead predict a noise residual ε̂, and these models can be converted back to the form of a denoiser by [53]

D(x; σ) = x − σ ε̂(x).   (2)

In this paper, we treat all pretrained diffusion models as denoisers, and perform the interface conversion in our implementation when needed.

Score from denoiser. Let pσ(x) denote the data distribution perturbed by Gaussian noise of standard deviation σ. It is shown in prior works [15, 51] that the denoiser D trained according to Eq. (1) provides a good approximation to the denoising score:

∇x log pσ(x) ≈ (D(x; σ) − x) / σ².   (3)

A denoising diffusion model estimates the score function of the noised distribution pσ(x) at various σ ∈ {σi}, i = 1, …, T. To perform sampling, the diffusion model gradually updates a sample through a sequence of noise levels σT > · · · > σ0 = 0. The {σi} are chosen empirically, with a typical range being [0.01, 157] [12] in the case of DDPM.
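To make the interface conversion of Eqs. (2) and (3) concrete, the sketch below wraps a noise-residual predictor into a denoiser and a score function. Here eps_model is a hypothetical stand-in for any pretrained ε-prediction network; the wrapper is a minimal illustration under that assumption, not the released implementation.

def make_denoiser(eps_model):
    """Wrap an eps-prediction network into a denoiser, Eq. (2): D(x; sigma) = x - sigma * eps(x, sigma).
    eps_model(x, sigma) is an assumed callable returning the predicted noise residual."""
    def D(x, sigma):
        return x - sigma * eps_model(x, sigma)
    return D

def score(x, sigma, D):
    """Denoising score, Eq. (3): approximately grad_x log p_sigma(x) = (D(x; sigma) - x) / sigma**2."""
    return (D(x, sigma) - x) / sigma**2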


Score as mean-shift. A helpful intuition is that the score behaves like mean-shift [7, 8]. If we simplify pdata to be an empirical data distribution over the i.i.d. samples {yi}, then at noise level σ, pσ(x) takes the form of a mixture of Gaussians [52]

pσ(x) = Ey∼pdata N(x; y, σ²I).   (4)

In this case there exists a closed-form expression [18, 52] for the optimal denoiser

D(x; σ) = Σi N(x; yi, σ²I) yi / Σi N(x; yi, σ²I).   (5)

In other words, D(x; σ) is a locally weighted mean of data samples {yi} around x under a Gaussian kernel with bandwidth σ. The denoising score function can be thought of as a non-parametric guide on how to update x in order to move it towards its weighted nearest neighbors.
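The mean-shift intuition of Eq. (5) can be checked directly on a toy empirical dataset. The sketch below is only an illustration of the closed-form optimal denoiser for point data, assuming numpy; it is not the network denoiser used in our experiments.

import numpy as np

def optimal_denoiser(x, ys, sigma):
    """Closed-form denoiser for an empirical dataset, Eq. (5):
    a Gaussian-kernel weighted mean of the data points {y_i} around x."""
    d2 = np.sum((ys - x) ** 2, axis=1)          # squared distances ||x - y_i||^2
    logw = -d2 / (2 * sigma**2)                  # log N(x; y_i, sigma^2 I) up to a shared constant
    w = np.exp(logw - logw.max())                # numerically stabilized, unnormalized weights
    w = w / w.sum()
    return (w[:, None] * ys).sum(axis=0)         # locally weighted mean of the data samples

ys = np.random.randn(100, 2)                     # toy 2D dataset
x = np.array([3.0, 0.0])
print((optimal_denoiser(x, ys, sigma=1.0) - x) / 1.0**2)   # the induced score (Eq. 3) points towards the data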
4. Score Jacobian Chaining for 3D Generation

Let θ denote the parameters of a 3D asset, e.g., a voxel grid of (RGB, τ) as in Sec. 4.2. Our goal is to model and sample from the distribution p(θ) to generate a 3D scene. In our setting, only a pretrained 2D diffusion model on images p(x) is given and we do not have access to 3D data. To relate the 2D and 3D distributions p(x) and p(θ), we assume that the probability density of a 3D asset θ is proportional to the expected probability densities of its multiview 2D image renderings xπ over camera poses π, i.e.,

pσ(θ) ∝ Eπ [pσ(xπ(θ))],   (6)

up to a normalization constant Z = ∫ Eπ [pσ(xπ(θ))] dθ. That is, a 3D asset θ is as likely as its 2D renderings xπ.

Next, we establish a lower bound, log p̃σ(θ), on the distribution in Eq. (6) using Jensen's inequality:

log pσ(θ) = log Eπ [pσ(xπ)] − log Z   (7)
≥ Eπ [log pσ(xπ)] − log Z ≜ log p̃σ(θ).   (8)

Recall that the score is the gradient of the log probability density of the data. By the chain rule,

∇θ log p̃σ(θ) = Eπ [∇θ log pσ(xπ)]   (9)
= Eπ [ ∂log pσ(xπ)/∂xπ · ∂xπ/∂θ ]   (10)
= Eπ [ ∇xπ log pσ(xπ) · Jπ ],   (11)

where ∇θ log p̃σ(θ) is the 3D score, ∇xπ log pσ(xπ) is the 2D score from the pretrained model, and Jπ is the renderer Jacobian. We will next discuss how to compute the 2D score in practice using a pretrained diffusion model.
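Eq. (11) never needs the Jacobian Jπ explicitly; reverse-mode autodiff gives the vector–Jacobian product directly. The sketch below is a minimal PyTorch illustration of this aggregation, where render and score_2d are hypothetical stand-ins for a differentiable renderer and a pretrained 2D score model.

import torch

def sjc_3d_gradient(theta, cameras, render, score_2d, sigma):
    """Eq. (11): average over camera poses of the 2D score back-propagated through the renderer.
    theta must have requires_grad=True; render(theta, pose) must be differentiable in theta."""
    grad = torch.zeros_like(theta)
    for pose in cameras:
        x_pi = render(theta, pose)                  # rendered image x_pi, a function of theta
        with torch.no_grad():
            v = score_2d(x_pi, sigma)               # 2D score, treated as a constant vector
        # vector-Jacobian product: v . (dx_pi / dtheta), i.e. the 2D score chained through J_pi
        grad = grad + torch.autograd.grad(x_pi, theta, grad_outputs=v)[0]
    return grad / len(cameras)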
Figure 2. Illustration of the denoiser's OOD issue using a denoiser pretrained on FFHQ (panels: input xblob, D(xblob, σ), and D(xblob + σn, σ)). When directly evaluating D(xblob, σ), the model does not correct the orange blob into a face image. Contrarily, evaluating the denoiser on the noised input, D(xblob + σn, σ), produces an image that successfully merges the blob with the face manifold.

Figure 3. Computing PAAS on 2D renderings xπ. Directly evaluating D(xπ; σ) leads to an OOD problem. Instead, we add noise to xπ, and evaluate D(xπ + σn; σ) on perturbed inputs xπ + σni with ni ∼ N(0, I) (blue dots). The PAAS is then computed by averaging over the brown dashed arrows, corresponding to multiple samples of n. See Sec. 4.1 for details.

4.1. Computing 2D Score on Non-Noisy Images

Computing the 3D score in Eq. (11) requires the 2D score on xπ. A first attempt would be to directly apply the score from the denoiser in Eq. (3), i.e.,

score(xπ, σ) ≜ (D(xπ; σ) − xπ)/σ².   (12)

Unfortunately, evaluating the pretrained denoiser D on xπ causes an out-of-distribution (OOD) problem. From the training objective in Eq. (1), at each noise level σ, the denoiser D has only seen noisy inputs of the distribution y + σn, where y ∼ pdata and n ∼ N(0, I). However, a rendered image xπ from a 3D asset θ is generally not consistent with this distribution.

We illustrate this OOD situation in Fig. 2. Given a denoiser pretrained on FFHQ [19] by Baranchuk et al. [1], we visualize the output D(xblob; σ = 6.5), where the input xblob is a non-noisy image showing an orange blob centered on a grey canvas. Under the intuition that D predicts a weighted nearest neighbor as reviewed in Eq. (5), we expect the denoiser to blend the orange blob with the manifold of faces. However, in reality we observe sharp artifacts when updating with this score (D(xblob; σ) − xblob)/σ², and the image moves further away from the face manifold.

Perturb-and-Average Scoring. To address the OOD problem, we propose Perturb-and-Average Scoring (PAAS). It computes the score on non-noisy images xπ with a denoiser D by adding noise to the input, and then taking the expectation of the predicted scores w.r.t. the random noise,

PAAS(xπ, √2σ)   (13)
≜ En∼N(0,I) [score(xπ + σn, σ)]   (14)
= En [ (D(xπ + σn, σ) − (xπ + σn)) / σ² ]   (15)
= En [ (D(xπ + σn, σ) − xπ) / σ² ] − En [n/σ],   (16)

where the last term En [n/σ] vanishes in expectation.

In practice, we use the Monte Carlo estimate of the expectation in Eq. (16). The algorithm is illustrated in Fig. 3. Given a set of sampled noises {ni}, each D(xπ + σni) provides an update direction on the perturbed input xπ + σni. By averaging over the noise perturbations {ni}, we obtain an update direction on xπ itself.

Justifying PAAS in Eq. (13). We show that Perturb-and-Average Scoring provides an approximation to the score on xπ at an inflated noise level of √2σ:

PAAS(xπ, √2σ) ≈ ∇xπ log p√2σ(xπ).   (17)

Lemma 1. Assuming an empirical data distribution pσ(x) as in Eq. (4), for any x ∈ Rᵈ,

log p√2σ(x) ≥ En∼N(0,I) log pσ(x + σn).   (18)

Proof. Observe that the LHS of Eq. (19) is a convolution of two Gaussians, therefore

En∼N(0,I) [N(x + σn; µ, σ²I)] = N(x; µ, 2σ²I).   (19)

Recall that pσ(x) is a mixture of Gaussians per Eq. (4);

p√2σ(x) = Ey∼pdata N(x; y, 2σ²I)   (20)
= Ey∼pdata En∼N(0,I) N(x + σn; y, σ²I)   (21)
= En∼N(0,I) Ey∼pdata N(x + σn; y, σ²I)   (22)
= En∼N(0,I) pσ(x + σn).   (23)

Taking the log on both sides of Eq. (23) and applying Jensen's inequality, we arrive at Eq. (18). □

Claim 1. Assuming a trained denoiser D as in Eq. (3), our PAAS(xπ, √2σ) in Eq. (13) computes the gradient w.r.t. a lower bound of log p√2σ(x).

Proof. By taking the gradient of the RHS of Lemma 1,

∇x En log pσ(x + σn) = En ∇x+σn log pσ(x + σn)   (24)
= En [score(x + σn, σ)],

which is the proposed PAAS algorithm in Eq. (13). □
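The Monte Carlo estimator of Eq. (16) is short in code. The sketch below assumes a denoiser D with the interface of Eq. (2); the noise term that cancels in expectation is dropped, as in Eq. (16).

import torch

def paas(x_pi, sigma, D, n_samples=4):
    """Perturb-and-Average Scoring (Eqs. 13-16): perturb the non-noisy rendering x_pi,
    evaluate the denoiser on the perturbed inputs, and average the resulting update directions."""
    est = torch.zeros_like(x_pi)
    for _ in range(n_samples):
        n = torch.randn_like(x_pi)
        est = est + (D(x_pi + sigma * n, sigma) - x_pi) / sigma**2   # Eq. (16), E[n/sigma] term omitted
    return est / n_samples

In practice (see Sec. A3), a single noise sample per step can suffice, with the averaging performed implicitly by the optimizer's momentum state.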
4.2. Inverse Rendering on Voxel Radiance Field

With the computation of the 2D score resolved, the other half of our setup in Eq. (11) requires access to the Jacobian of a differentiable renderer.

3D Representation. We represent a 3D asset θ as a voxel radiance field [6, 54, 62], which is much faster to access and update compared to a vanilla NeRF parameterized by a neural network [34]. The parameters θ consist of a density voxel grid V(density) ∈ R^{1×Nx×Ny×Nz} and a voxel grid of appearance features V(app) ∈ R^{c×Nx×Ny×Nz}. Conventionally the appearance features are simply the RGB colors and c = 3. For simplicity, we do not model view dependencies in this work.

Inverse Volumetric Rendering. Image rendering is performed independently along a camera ray through each pixel. We cut a camera light ray into equally distanced segments of length d, and at the spatial location corresponding to the beginning of the i-th segment we sample an (RGBi, τi) tuple from the color and density grids using trilinear interpolation. These values are alpha-composited using volume rendering quadrature [32] into the pixel color C = Σi wi · RGBi, where

wi = αi · Π_{j=0}^{i−1} (1 − αj);   αi = 1 − exp(−τi d).   (25)

Volume rendering of θ is directly differentiable. At a rendered image xπ, the vector–Jacobian product in Eq. (11) between PAAS(xπ) and the Jacobian Jπ = ∂xπ/∂θ is computed by back-propagating the score through Eq. (25). This vector–Jacobian product provides us with the 3D gradient needed for generative modeling on the voxel radiance field.
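A minimal sketch of the quadrature in Eq. (25) for a single ray, assuming per-segment densities and colors have already been sampled from the grids by trilinear interpolation. Running autodiff through this computation is what yields the renderer Jacobian used in the vector–Jacobian product above.

import torch

def composite_ray(rgb, tau, d):
    """Volume rendering quadrature, Eq. (25), for one ray.
    rgb: (N, 3) colors; tau: (N,) densities; d: segment length."""
    alpha = 1.0 - torch.exp(-tau * d)                       # alpha_i = 1 - exp(-tau_i * d)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)       # prod_{j<=i} (1 - alpha_j)
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]])   # shift so w_i uses the product over j < i
    w = alpha * trans                                        # w_i = alpha_i * prod_{j<i} (1 - alpha_j)
    color = (w[:, None] * rgb).sum(dim=0)                    # C = sum_i w_i * RGB_i
    return color, w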
Regularization Strategies. The voxel grid is a very powerful 3D representation for volumetric rendering. Given noisy 2D guidance, the model may cheat by populating the entire grid with small densities such that the combined effect from one view hallucinates a plausible image. We propose several techniques to encourage the formation of a coherent 3D structure.

Emptiness Loss: Ideally, the space should be sparse, with near-zero densities except at the object. We propose an emptiness loss to encourage sparsity on a ray r:

Lemptiness(r) = (1/N) Σ_{i=1}^{N} log(1 + β · wi),   (26)

where wi are the alpha-composited weights shown in Eq. (25). The shape of the log function imposes severe penalties at the onset of small weights, but does not grow aggressively if the weights are large. This is consistent with our aim to eliminate small densities. The hyperparameter β controls the steepness of the loss function near 0. A larger β puts more emphasis on eliminating low-density noise. We set β = 10.

Emptiness Loss Schedule: We use a hyperparameter λ to control the contribution of the emptiness loss. If we apply a large emptiness loss, it will hinder the learning of geometry in the early stage of training. But if the emptiness loss is too small, there will be floating density artifacts. We adopt a two-stage noise elimination schedule to deal with this problem. In the first K iterations, we use a relatively small weighting factor λ1. After the K-th iteration, it is increased to a larger λ2. In our experiments λ1 = 1 × 10⁴ and λ2 = 2 × 10⁵. We provide an ablation study of this technique in Fig. 7 to show its effectiveness.

Center Depth Loss: Sometimes the optimization places the object away from the scene center. The object either becomes small or wanders around the image boundary. For the few cases when this happens we apply a center depth loss

Lcenter(D) = −log( (1/|B|) Σ_{p∈B} D(p) − (1/|B∁|) Σ_{q∈B∁} D(q) ),   (27)
where D is the depth image, B is a box (a set of pixel locations) at the center of the image, and B∁ is its complement.
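The two regularizers and the two-stage weight schedule reduce to a few lines. The sketch below follows Eqs. (26) and (27) directly; the compositing weights w, the depth image, and the boolean center-box mask are assumed inputs, and the loss is written exactly as in Eq. (27), which assumes the bracketed difference is positive.

import torch

def emptiness_loss(w, beta=10.0):
    """Eq. (26): penalize small non-zero compositing weights along each ray."""
    return torch.log(1.0 + beta * w).mean()

def emptiness_weight(step, K, lambda_1=1e4, lambda_2=2e5):
    """Two-stage schedule: a small weight for the first K iterations, then a larger one."""
    return lambda_1 if step < K else lambda_2

def center_depth_loss(depth, center_mask):
    """Eq. (27), as written: -log of (mean of D over the center box B minus mean of D over its complement)."""
    diff = depth[center_mask].mean() - depth[~center_mask].mean()
    return -torch.log(diff)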
4.3. SJC vs. DreamFusion

In this section, we describe the differences and the connections between our SJC and DreamFusion.

Differences from DreamFusion. In terms of formulation, DreamFusion's computation of the gradient w.r.t. θ involves a U-Net Jacobian term (see Eq. 2 in their paper [41]). In practice, they found that "omitting the U-Net Jacobian term" is more effective. On the other hand, this U-Net Jacobian term does not appear in our formulation. Their additional justification in the appendix actually leans more towards our viewpoint. An additional contribution of ours beyond DreamFusion [41] is our analysis of the effect that the OOD problem has when using a denoiser on rendered images (Claim 1), and the PAAS method to address it. For the variance reduction technique, namely the use of the Monte Carlo estimate on Eq. (16), or ε̂ − ε (in DreamFusion), vs. on Eq. (15), we observe comparable performance between the two methods empirically for 3D generation.

Influences by DreamFusion. At the time of this submission, DreamFusion is a concurrent arXiv paper. However, as we have read the paper, our research was influenced by their reported observations. In particular, we adopted the idea of randomized scheduling of σ during 3D optimization for easier hyperparameter tuning, and used view-augmented language prompting that improves the overall 3D quality. For future work we do hope to explore a more general solution than view-dependent prompts.

Figure 5. Qualitative results of text-prompted generation of 3D models with SJC, purely from the pretrained Stable Diffusion (2D) image model. Prompts: "A DSLR photo of a yellow duck", "A ficus planted in a pot", "A zoomed out photo of a small castle", "A high quality photo of a toy motorcycle", "A zoomed out high quality photo of Sydney Opera House", "A photo of a horse walking". Each row shows two views, with associated depth maps (blue is far, red is near), for a single 3D model generated for a given prompt. Note the detailed appearance as well as the sharp, well-defined depth structure.

5. Experiments

We conduct experiments on both unconditioned and conditioned diffusion models to have a more comprehensive understanding of the properties of SJC.

DDPMs trained on FFHQ and LSUN Bedroom are unconditioned diffusion models with an architecture based on the implementation by Dhariwal and Nichol [10]. They are trained at an image resolution of 256 × 256. FFHQ [19] is a dataset of aligned faces with diverse coverage of gender, age, race, and facial appearance as well as head poses. LSUN Bedroom [63] includes bedroom images with varied furniture layout plans and rich interior design styles.

Stable Diffusion is an expanded work based on the Latent Diffusion Model (LDM) developed by Rombach et al. [45]. It is trained on the LAION-5B dataset [47]. We use release version v1.5. Diffusion is performed on a latent space of 4 × 64 × 64, then upsampled to 3 × 256 × 256 by a decoder. The model is natively trained for text-conditioned image generation, and exposes a guidance scale parameter that controls the strength of language conditioning [13]. Intuitively, a larger guidance scale makes the conditioned image distribution more faithful to the text prompt by trading off sample diversity.

5.1. Validating PAAS on 2D images

Before directly jumping to 3D generation, we first verify that PAAS provides effective guidance on a simple 2D image canvas. In other words, here θ is a grid of RGB values and f is an identity function. The hope is that gradient descent on the vector field produced by PAAS creates high-quality images.
Here an important decision to make is the schedule of {σi} at which we compute PAAS.

We experimented with an annealed schedule (Annealed σ) vs. a random schedule (Random σ) as proposed in DreamFusion. Under the Annealed σ schedule, we start from a large σ and gradually decrease it as we update the image canvas x. PAAS computed at a larger σ level attends to high-level image structure, while a smaller σ provides stronger guidance on detailed features. The Random σ schedule, on the contrary, uniformly samples a σ at every step. We show qualitative comparisons in Fig. 4.

Figure 4. Sampling 2D images with Perturb-and-Average Scoring. We compare the Annealed vs. Random σ schedules against several diffusion models. Rows 1 & 2 (FFHQ, LSUN): the Random σ schedule exhibits strong mode-seeking behavior, and it results in low-quality "mean" images on unconditioned diffusion models trained on FFHQ and LSUN Bedroom. In this case, we need a carefully designed Annealed σ schedule to produce better, more diverse samples. Rows 3 & 4 (SD at guidance scale 10 and 3): Stable Diffusion (SD) is conditioned on the prompt "a squirrel holding a saxophone". The use of natural language makes the conditioned distribution much easier to sample from. When the guidance scale is elevated to 10, the Random σ schedule that fails on FFHQ and LSUN starts to produce crisp, clean images.

For unconditioned diffusion models trained on FFHQ, we observe that Annealed σ performs better than Random σ, and the image samples have better pose variation and quality. Particularly, the randomized σ exhibits severe mode-seeking behavior converging to average faces. In the case of LSUN Bedroom, the mode-seeking behavior results in a blurry image canvas with no content.

On the other hand, natural language prompting plays a critical role when sampling images with Stable Diffusion. When the language guidance is set to a regular level of 3.0, the observations are broadly consistent with sampling on FFHQ and LSUN: the Random σ schedule produces blurry outputs. However, when the guidance scale is elevated to 10.0, the Random σ schedule begins to generate crisp, clean images and outperforms the Annealed σ schedule. Despite various sophisticated strategies for Annealed σ scheduling (see our code for details), at a high language guidance scale Random σ remains the better option. We hypothesize that stronger language guidance forces the image distribution to be narrower and more beneficial for a mode-seeking algorithm. We acknowledge that none of the images in Fig. 4 can match the sample quality of a standard diffusion inference pipeline, and the right way to apply PAAS as a gradient for optimization remains an open problem.
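The 2D sanity check in this subsection amounts to a short optimization loop: θ is the image itself (f is the identity) and the canvas is updated along the PAAS vector field under either σ schedule. The sketch below is illustrative only; paas is the estimator sketched in Sec. 4.1, and the schedule endpoints and learning rate are assumed values, not the exact ones used in our experiments.

import numpy as np

def sigma_schedule(step, n_steps, mode, sigma_max=10.0, sigma_min=0.1):
    """Annealed: decay sigma geometrically over the run. Random: sample uniformly at every step."""
    if mode == "annealed":
        t = step / max(n_steps - 1, 1)
        return sigma_max * (sigma_min / sigma_max) ** t
    return np.random.uniform(sigma_min, sigma_max)

def optimize_canvas(paas, shape=(64, 64, 3), n_steps=1000, lr=0.05, mode="random"):
    """Gradient ascent on a 2D canvas along the PAAS vector field (Sec. 5.1 setup)."""
    x = np.random.randn(*shape) * 0.1            # theta is the image; f is the identity
    for step in range(n_steps):
        sigma = sigma_schedule(step, n_steps, mode)
        x = x + lr * paas(x, sigma)              # move the canvas along the estimated score
    return x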
Figure 6. Qualitative comparison between Stable-DreamFusion (StableDF) and Ours. The prompts are: (a) "A high quality photo of a delicious burger"; (b) "a DSLR photo of a yellow duck"; (c) "A ficus planted in a pot"; (d) "A product photo of a toy tank"; (e) "A high quality photo of a chocolate icecream cone"; (f) "A wide angle zoomed out photo of a giraffe". Both methods are run for 10k iterations without per-prompt finetuning of the hyperparameters. For each method, the images on the left are rendered RGB images and the images on the right are depth visualizations.

Figure 7. Ablation experiments on the proposed emptiness loss schedule, on the prompts "A zoomed out high quality photo of Temple of Heaven", "a DSLR photo of a rose", and "A modern house with flat roof floating on water". Columns compare no emptiness loss (λ = 0), constant λ = 1e4, constant λ = 2e5, and Ours. For each setting of the loss weight λ, we show a rendered image and the associated depth map from a randomly sampled viewpoint. Ours incorporates the loss with the weight schedule described in Sec. 4.2. It leads to better 3D shape, as evidenced by the cleaner depth maps. Setting the loss weight too low yields "cloudy" depth fields. When setting the weight too high, SJC fails to produce meaningful 3D models.

5.2. 3D Generation

In this paper, we focus on 3D generation with the language-conditioned Stable Diffusion model. We found that tuning the Annealed σ schedule on FFHQ and LSUN Bedroom in the 3D domain is difficult in practice, and leave it as future work. Based on the insights from the 2D experiments earlier, we use the Random σ schedule coupled with a high language guidance scale.

Rendering with Latent 3D Features. Stable Diffusion economizes compute by performing diffusion modeling on the latent features of a pretrained autoencoder. We therefore choose to render a feature image in this latent space from a feature field [3, 38] represented by a voxel grid in R^{4×Nx×Ny×Nz}.

Qualitative Comparison. In Fig. 5, we show text-prompted 3D generation results from SJC. It is capable of generating complex 3D models over a diverse set of prompts ranging from animals to the Sydney Opera House. Next, we compare SJC with Stable-DreamFusion, the third-party implementation based on the same pretrained Stable Diffusion model. In Fig. 6, we show qualitative comparisons of generated 3D assets given the same prompt. We observe that SJC generates 3D models with better image quality and more sensible structure than Stable-DreamFusion in a significant number of cases. We acknowledge that both systems exhibit quality fluctuations over different trials, and the point of this comparison is to show that our overall pipeline is competitive.

Ablations. In Fig. 7, we conduct ablations to demonstrate the importance of the proposed emptiness loss and the scheduling of its weight λ discussed in Sec. 4.2. We show results without the emptiness loss, with a constant weight λ, and with our proposed scheduling of λ. We observe that our complete method (Ours) improves the quality of the generated 3D models, e.g., fewer floating artifacts and better geometry.

6. Conclusion

We propose an optimization-based approach to generate 3D assets from pretrained image (2D) diffusion models. The key technical contribution is the derivation of the Perturb-and-Average Scoring method, which bridges the gap between denoising-trained diffusion models and the non-noisy images encountered in the process of optimizing a 3D model guided by the diffusion model. We also propose a new regularization loss for improving the quality of the generated 3D scene. Working with the large-scale Stable Diffusion model, we demonstrate that our approach can generate compelling 3D models, comparing favorably to available concurrent work. Finally, we investigate an interesting distinction between the effect of the noise scheduling regime in unconditional image diffusion models and a text-conditional model, and identify an avenue for future work.
7. Acknowledgements

The authors would like to thank David McAllester for feedback on an early pitch of the work, and Shashank Srivastava and Madhur Tulsiani for discussing the √2 factor on synthetic experiments. We would like to thank friends at TRI and the 3DL lab at UChicago for suggestions on the manuscript. HC would like to thank Kavya Ravichandran for incredible officemate support, and Michael Maire for the discussion and encouragement while riding Metra.

References

[1] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.
[2] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. In Eur. Conf. Comput. Vis., 2020.
[3] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3D generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[4] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. arXiv preprint arXiv:2203.09517, 2022.
[7] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 1995.
[8] Dorin Comaniciu and Peter Meer. Mean shift analysis and applications. In Int. Conf. Comput. Vis., 1999.
[9] Amélie Deltombe. How much does it cost to create 3D models?, Apr 2022.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Adv. Neural Inform. Process. Syst., 2021.
[11] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. arXiv preprint arXiv:2206.09012, 2022.
[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Adv. Neural Inform. Process. Syst., 2020.
[13] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[14] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars. arXiv preprint arXiv:2205.08535, 2022.
[15] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 2005.
[16] Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[17] Nikolay Jetchev. ClipMatrix: Text-controlled creation of 3D textured meshes. arXiv preprint arXiv:2109.12922, 2021.
[18] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
[19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[20] Nasir Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. CLIP-Mesh: Generating textured meshes from text using pretrained image-text models. ACM Trans. Graph., 2022.
[21] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Adv. Neural Inform. Process. Syst., 34:21696–21707, 2021.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[24] Christoph Lassner and Michael Zollhofer. Pulsar: Efficient sphere-based neural rendering. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[25] Han-Hung Lee and Angel X Chang. Understanding pure CLIP guidance for voxel grid NeRF models. arXiv preprint arXiv:2209.15172, 2022.
[26] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. ACM Trans. Graph., 2018.
[27] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Adv. Neural Inform. Process. Syst., 2020.
[28] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573, 2021.
[29] Matthew M Loper and Michael J Black. OpenDR: An approximate differentiable renderer. In Eur. Conf. Comput. Vis., 2014.
[30] Dimitra Maoutsa, Sebastian Reich, and Manfred Opper. Interacting particle solutions of Fokker–Planck equations through gradient–log–density estimation. Entropy, 2020.
[31] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[32] Nelson Max. Optical models for direct volume rendering. IEEE Trans. Vis. Comput. Graph., 1995.
[33] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2Mesh: Text-driven neural stylization for meshes. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[34] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM, 2021.
[35] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In Int. Conf. Comput. Vis., 2019.
[36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[37] Michael Niemeyer and Andreas Geiger. CAMPARI: Camera-aware decomposed generative neural radiance fields. In Int. Conf. 3DV, 2021.
[38] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[39] Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wenzel Jakob. Mitsuba 2: A retargetable forward and inverse renderer. ACM Trans. Graph., 2019.
[40] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Int. Conf. Comput. Vis., 2021.
[41] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. 2021.
[43] Sai Rajeswar, Fahim Mannan, Florian Golemo, David Vazquez, Derek Nowrouzezahrai, and Aaron Courville. Pix2Scene: Learning implicit 3D representations from images. OpenReview, 2018.
[44] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[45] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[46] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[47] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[48] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. In Adv. Neural Inform. Process. Syst., 2020.
[49] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[50] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Int. Conf. Learn. Represent., 2021.
[51] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Adv. Neural Inform. Process. Syst., 2019.
[52] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Adv. Neural Inform. Process. Syst., 2020.
[53] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Int. Conf. Learn. Represent., 2021.
[54] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[55] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3D: Dataset and methods for single-image 3D shape modeling. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[56] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 2011.
[57] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
[58] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Adv. Neural Inform. Process. Syst., 2016.
[59] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conf. Comput. Vis. Pattern Recog., 2015.
[60] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. PointFlow: 3D point cloud generation with continuous normalizing flows. In Int. Conf. Comput. Vis., 2019.
[61] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Adv. Neural Inform. Process. Syst., 2021.
[62] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021.
[63] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[64] Yuxuan Zhang, Wenzheng Chen, Huan Ling, Jun Gao, Yinan Zhang, Antonio Torralba, and Sanja Fidler. Image GANs meet differentiable rendering for inverse graphics and interpretable 3D neural rendering. arXiv preprint arXiv:2010.09125, 2020.
[65] Xiaoming Zhao, Fangchang Ma, David Güera, Zhile Ren, Alexander G Schwing, and Alex Colburn. Generative multiplane images: Making a 2D GAN 3D-aware. In Eur. Conf. Comput. Vis., 2022.
Appendix
• In Sec. A1, we review diffusion models from a score-based perspective following Karras et al. [18].

• In Sec. A2, we provide additional experiments on our approach, including an additional ablation study, qualitative results, and video results.

• In Sec. A3, we document implementation details.

Algorithm 1 Training
1: repeat
2:   x ∼ pdata
3:   σ ∼ [σmin, σmax]
4:   z ∼ N(0, I)
5:   Take a gradient descent step on ∇φ ‖Dφ(x + σz, σ) − x‖²
6: until converged
7: score(x, σ) = ∇x log pσ(x) = (Dφ(x, σ) − x)/σ²

Algorithm 2 Deterministic Sampling
1: {σi}, i = T, …, 1, descending; σ0 = 0
2: xT = σT z,  z ∼ N(0, I)
3: for t = T, …, 1 do
4:   x_{t−1} = x_t + (σt − σ_{t−1}) · σt · score(x_t, σt)
5:        = (1 − wt) x_t + wt Dφ(x_t, σt),   where wt = (σt − σ_{t−1})/σt   (a weighted average)
6: return x0

Figure A1. Training and Sampling Algorithm Card for Score-Based Methods with numerical scaling s(t) = 1 and σ(t) = t. Note that the inference step is analogous to DDIM [50], and simplifies to a weighted averaging between the current iterate xt and the denoiser output D(xt, σt). This particular scheduling allows for taking large step sizes, and a sample can be generated in as few as 80 network evaluations [18] while maintaining high image quality.
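A direct transcription of Algorithm 2 with s(t) = 1, σ(t) = t, assuming a denoiser D with the interface of Eq. (2); each update is the weighted average between the current iterate and the denoiser output. This is a minimal sketch, not the pipeline used for our qualitative results.

import torch

def deterministic_sample(D, sigmas, shape):
    """Algorithm 2: sigmas must be a descending sequence; sigma_0 = 0 is appended implicitly."""
    sigmas = list(sigmas) + [0.0]                 # append sigma_0 = 0
    x = sigmas[0] * torch.randn(shape)            # x_T = sigma_T * z
    for t in range(len(sigmas) - 1):
        sig, sig_next = sigmas[t], sigmas[t + 1]
        w = (sig - sig_next) / sig                # w_t = (sigma_t - sigma_{t-1}) / sigma_t
        x = (1.0 - w) * x + w * D(x, sig)         # weighted-average form of the Euler step
    return x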

A1. Diffusion Models from a Score-Based Perspective

We provide a more detailed recap of diffusion models from the score-based perspective. For a quick overview, we summarize the training and deterministic sampling algorithms in Fig. A1; the deterministic sampling algorithm can be made stochastic by adding noise and adjusting the σ level after each update (for details see Karras et al. [18]).

In the following analysis we assume that each dimension of the random vector x is independent, and that the variance in each dimension is 1. The general form of the forward noising step of a diffusion model can be described as scaling and adding noise, i.e.,

xt = s(t) x0 + s(t) σ(t) z,   (28)

where z ∼ N(0, I) and x0 is a sample drawn from the data distribution. s(t) and σ(t) are user-defined coefficients. Here the coefficient on the noise z is parameterized as the product of s(t) and σ(t) so that σ(t) represents the noise-to-signal ratio in xt. SMLD [51, 52], DDIM [50] and Karras et al. [18] set the scaling to s(t) = 1, and therefore adding noise by x0 + σ(t)z causes xt to numerically grow larger as t increases. DDPM, on the other hand, introduced a rapidly decreasing s(t) to scale down the successive xt so that at any time t, pt(x) has variance fixed at 1. This goal of maintaining a standard deviation of 1 requires that

Var[xt] = Var[s(t) x0] + Var[s(t) σ(t) z]   (29)
s(t)² Var[x0] + s(t)² σ(t)² Var[z] = I,   with Var[x0] = I and Var[z] = I,   (30)
s(t)² + s(t)² σ(t)² = 1   (31)
σ(t) = √((1 − s(t)²) / s(t)²).   (32)

DDPM specifies s(t) by a set of βt, i.e., s(t) = √ᾱt = √(Π_{i≤t} αi) = √(Π_{i≤t} (1 − βi)), and therefore σ(t) = √((1 − ᾱt)/ᾱt).
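A small numerical check of Eqs. (31) and (32) and the DDPM specialization: given a β schedule (the values below are illustrative, not prescribed by this paper), s(t) = √ᾱt and σ(t) = √((1 − ᾱt)/ᾱt), and the unit-variance identity s(t)² + s(t)²σ(t)² = 1 holds.

import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)              # an illustrative DDPM beta schedule
alphas_bar = np.cumprod(1.0 - betas)               # alpha_bar_t = prod_{i<=t} (1 - beta_i)
s = np.sqrt(alphas_bar)                            # scaling s(t)
sigma = np.sqrt((1.0 - alphas_bar) / alphas_bar)   # noise-to-signal ratio sigma(t), Eq. (32)
assert np.allclose(s**2 + (s * sigma)**2, 1.0)     # unit-variance constraint, Eq. (31)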
The noising step (28) describes the marginal distribution pt(x). The infinitesimal time evolution of this process can be written as the following stochastic differential equation [53]:

dx = f(t) x dt + g(t) dωt,   where f(t) = ṡ(t)/s(t),   g(t) = s(t) √(2 σ̇(t) σ(t)).   (33)

The Fokker–Planck equation [30] states that a stochastic differential equation of the form (33) is identified with a partial differential equation describing the marginal probability density pt(x):

dx = f(x, t) dt + g(t) dωt   ⟷   ∂pt(x)/∂t = −∇ · [ f(x, t) pt(x) − (g(t)²/2) ∇x pt(x) ].   (34)

Applying this identity tells us that a stochastic differential equation like (33) implies a deterministic, ordinary differential equation. Here we illustrate the proof schematically:

(stochastic)    dx = f(t) x dt + g(t) dωt
   ⟶ (FP)    ∂pt(x)/∂t = −∇ · [ f(t) x pt(x) − (g(t)²/2) ∇x pt(x) ]
   which is equal (by the log-derivative trick; expanded below) to
                 ∂pt(x)/∂t = −∇ · [ ( f(t) x − (g(t)²/2) ∇x log pt(x) ) pt(x) − 0 ]
   ⟵ (FP)    dx = [ f(t) x − (g(t)²/2) ∇x log pt(x) ] dt + 0 · dωt    (deterministic)   (35)

The application of the log-derivative trick is expanded below:

∂pt(x)/∂t = −∇ · [ f(t) x pt(x) − (g(t)²/2) ∇x pt(x) ]   (36)
= −∇ · [ ( f(t) x pt(x) − (g(t)²/2) ∇x pt(x) ) / pt(x) · pt(x) ]   (37)
= −∇ · [ ( f(t) x − (g(t)²/2) · ∇x pt(x)/pt(x) ) pt(x) ]   (38)
= −∇ · [ ( f(t) x − (g(t)²/2) ∇x log pt(x) ) pt(x) ].   (39)

Substituting the expressions for f(t) and g(t) from (33), we obtain an ODE from which we can sample the data by applying the score function with a step schedule that theoretically guarantees to take us back to the initial, clean data distribution:

dx = [ ṡ(t)/s(t) · x − (1/2) s(t)² · 2 σ̇(t) σ(t) · ∇x log pt(x) ] dt   (40)
dx = [ ṡ(t)/s(t) · x − s(t) · (σ̇(t)/σ(t)) · ( D(x/s(t); σ(t)) − x/s(t) ) ] dt.   (41)

When s(t) = 1, σ(t) = t, the above simplifies to

dx = −σt · (D(x; σt) − x)/σt² · dt   (42)
dx = −σt · score(x, σt) dt.   (43)

Note that this schedule with s(t) = 1, σ(t) = t allows for taking large step sizes during inference since it introduces no extra curvature in the trajectory beyond what is induced by the score function itself. The discretized sampling algorithm of Equation (43) is described in Fig. A1.
Figure A2. Ablation experiments on the proposed center depth loss, comparing "No center depth loss" against "With center depth loss (weight = 100)" on the prompts "A high quality photo of french fries from McDonald's" and "a DSLR photo of a rose". Each pair of corresponding columns of the same prompt is visualized from the same camera angle.

A2. Additional Experiments


Ablation on center depth loss. In Fig. A2, we illustrate the effect of the center depth loss proposed in Eq. (27). Without the
center depth loss, we observe that some objects, e.g., French Fries, are placed far from the center of the scene box and tend to
drift around when the camera viewpoints are changed. This effect is more pronounced in the provided video result. In contrast,
a moderate center depth loss forces the object to be placed at the scene box center. Additionally, we observe that the objects
tend to be enlarged to occupy more of the visible screen space without wasting model capacity.
Additional qualitative results. We provide additional qualitative results from SJC in Fig. A3. Note that we increase the
resolution of the depth maps beyond the 64 × 64 resolution of the image latents by rendering subpixel rays. In general, we
observe that the volumetric renderer is powerful enough to hallucinate shadows (horse), water surfaces (Sydney opera house,
duck), grasslands (zebra) and even a traffic lane (school bus), using the volume densities.
Video results. We have attached numerous video results in the supplemental materials. Please see the attached HTML and
videos. We have named each file after the text prompt used to generate the 3D asset. In addition, we included the videos for the
ablation experiments in Fig. 7 and Fig. A2.

A3. Implementation Details


3D scene setup. Our voxel grids are of size 100³, and placed at the world origin spanning the normalized cube [−1, 1]³. We sample
cameras uniformly on a hemisphere that covers the voxel cube with a radius of 1.5, with look-at directions pointing at the origin.
The camera field of view is randomly sampled from 40 degrees to 70 degrees during optimization, and fixed to 60 degrees at
test time. We found the jittering on FoV to help with 3D optimization in some cases, and this data augmentation technique is
reported in DreamFusion [41]. Our scene background consists of an optimizable image of size 4 × 4 environment-mapped to
the spherical surface by azimuth and elevation angles of the incoming ray. The small image size with constrained capacity
helps to avoid confounding visual artifacts accumulating in the background during optimization.
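A sketch of the camera sampling described above: poses drawn on a hemisphere of radius 1.5 looking at the origin, with a field of view jittered between 40 and 70 degrees during optimization. The exact sampling distributions (e.g., how elevation is drawn) are assumptions made for illustration only.

import numpy as np

def sample_camera(radius=1.5, fov_range=(40.0, 70.0)):
    """Sample a camera position on the upper hemisphere, looking at the world origin."""
    azimuth = np.random.uniform(0.0, 2.0 * np.pi)
    elevation = np.arcsin(np.random.uniform(0.0, 1.0))   # assumed: area-uniform over the hemisphere
    eye = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    fov = np.random.uniform(*fov_range)                   # FoV jitter used as data augmentation
    return eye, fov                                        # look-at direction points from eye to the origin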
Optimization. We use the Adamax [22] optimizer and perform gradient descent at a learning rate of 0.05 for 10,000 steps, with
some prompts running at a longer schedule for better quality. Note that when performing gradient descent with PAAS, we
implicitly rely on the optimizer’s momentum state to perform the averaging. We have tried explicitly averaging the scores at
multiple noise perturbations, but observed no benefits or degradation. The language-guidance scale is set to 100. Our system
consumes 9GB of GPU memory during optimization, and takes approximately 25 minutes on an A6000 GPU including the
time spent on miscellaneous tasks like visualization.
View-dependent prompting. An influence of DreamFusion [41] on our work is the use of view-dependent prompting.
Language prompts are prepended with one of the following: “overhead view of”, “front view of”, “backside view of”, “side

view of” depending on the camera location. More specifically, when camera elevation is above 30 degrees we use the “overhead
view” prompt. Otherwise, the prompts are assigned based on the azimuth quadrant the camera falls into. This technique helps
to alleviate the degeneracy of multiple frontal faces being painted around an object during optimization. We hope as part of our
future work to develop a more general solution to induce the optimization towards more plausible geometry without using
language as guidance.
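The view-dependent prompt assignment above reduces to a small lookup. In the sketch below, the 30-degree elevation threshold follows the text, while the exact azimuth-quadrant-to-phrase mapping is an assumption for illustration.

def view_prompt(prompt, azimuth_deg, elevation_deg):
    """Prepend a view phrase based on the camera pose (Sec. A3)."""
    if elevation_deg > 30.0:
        view = "overhead view of"
    else:
        # assign by azimuth quadrant; the specific quadrant-to-phrase mapping is assumed
        quadrant = int((azimuth_deg % 360.0) // 90.0)
        view = ["front view of", "side view of", "backside view of", "side view of"][quadrant]
    return f"{view} {prompt}"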

Figure A3. Additional results of text-prompted generation of 3D models with SJC. Prompts: "Trump figure", "Obama figure", "Biden figure", "Zelda Link", "A product photo of a Canon home printer", "A pig", "A photo of a zebra walking", "A wide angle zoomed out photo of Saturn V rocket from distance", "A high quality photo of a yellow school bus".
