
Improved Techniques for Training Score-Based Generative Models

Yang Song
Computer Science Department, Stanford University
yangsong@cs.stanford.edu

Stefano Ermon
Computer Science Department, Stanford University
ermon@cs.stanford.edu
arXiv:2006.09011v2 [cs.LG] 23 Oct 2020

Abstract
Score-based generative models can produce high quality image samples comparable
to GANs, without requiring adversarial optimization. However, existing training
procedures are limited to images of low resolution (typically below 32 × 32),
and can be unstable under some settings. We provide a new theoretical analysis
of learning and sampling from score-based models in high dimensional spaces,
explaining existing failure modes and motivating new solutions that generalize
across datasets. To enhance stability, we also propose to maintain an exponential
moving average of model weights. With these improvements, we can scale score-
based generative models to various image datasets, with diverse resolutions ranging
from 64 × 64 to 256 × 256. Our score-based models can generate high-fidelity
samples that rival best-in-class GANs on various image datasets, including CelebA,
FFHQ, and several LSUN categories.

1 Introduction
Score-based generative models [1] represent probability distributions through score—a vector field
pointing in the direction where the likelihood of data increases most rapidly. Remarkably, these
score functions can be learned from data without requiring adversarial optimization, and can produce
realistic image samples that rival GANs on simple datasets such as CIFAR-10 [2].
Despite this success, existing score-based generative models only work on low resolution images
(32 × 32) due to several limiting factors. First, the score function is learned via denoising score
matching [3, 4, 5]. Intuitively, this means a neural network (named the score network) is trained
to denoise images blurred with Gaussian noise. A key insight from [1] is to perturb the data using
multiple noise scales so that the score network captures both coarse and fine-grained image features.
However, it is an open question how these noise scales should be chosen. The recommended settings
in [1] work well for 32 × 32 images, but perform poorly when the resolution gets higher. Second,
samples are generated by running Langevin dynamics [6, 7]. This method starts from white noise
and progressively denoises it into an image using the score network. This procedure, however, might
fail or take an extremely long time to converge when used in high-dimensions and with a necessarily
imperfect (learned) score network.
We propose a set of techniques to scale score-based generative models to high resolution images.
Based on a new theoretical analysis on a simplified mixture model, we provide a method to analytically
compute an effective set of Gaussian noise scales from training data. Additionally, we propose an
efficient architecture to amortize the score estimation task across a large (possibly infinite) number
of noise scales with a single neural network. Based on a simplified analysis of the convergence
properties of the underlying Langevin dynamics sampling procedure, we also derive a technique to
approximately optimize its performance as a function of the noise scales. Combining these techniques
with an exponential moving average (EMA) of model parameters, we are able to significantly improve

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
Figure 1: Generated samples on datasets of decreasing resolutions. From left to right: FFHQ
256 × 256, LSUN bedroom 128 × 128, LSUN tower 128 × 128, LSUN church_outdoor 96 × 96,
and CelebA 64 × 64.

the sample quality, and successfully scale to images of resolutions ranging from 64 × 64 to 256 × 256,
which was previously impossible for score-based generative models. As illustrated in Fig. 1, the
samples are sharp and diverse.

2 Background
2.1 Langevin dynamics

For any continuously differentiable probability density p(x), we call ∇x log p(x) its score function.
In many situations the score function is easier to model and estimate than the original probability
density function [3, 8]. For example, for an unnormalized density it does not depend on the partition
function. Once the score function is known, we can employ Langevin dynamics to sample from the
corresponding distribution. Given a step size α > 0, a total number of iterations T , and an initial
sample x0 from any prior distribution π(x), Langevin dynamics iteratively evaluate the following

$$x_t \leftarrow x_{t-1} + \alpha \nabla_x \log p(x_{t-1}) + \sqrt{2\alpha}\, z_t, \qquad 1 \le t \le T, \tag{1}$$
where zt ∼ N (0, I). When α is sufficiently small and T is sufficiently large, the distribution
of xT will be close to p(x) under some regularity conditions [6, 7]. Suppose we have a neural
network sθ (x) (called the score network) parameterized by θ, and it has been trained such that
sθ (x) ≈ ∇x log p(x). We can approximately generate samples from p(x) using Langevin dynamics
by replacing ∇x log p(xt−1 ) with sθ (xt−1 ) in Eq. (1). Note that Eq. (1) can be interpreted as noisy
gradient ascent on the log-density log p(x).
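To make Eq. (1) concrete, below is a minimal PyTorch sketch of this sampler. Here `score` stands for any callable approximating ∇_x log p(x) (for example a trained score network s_θ), and the step size and iteration count are illustrative placeholders rather than recommended settings.

```python
import torch

@torch.no_grad()
def langevin_dynamics(score, x0, alpha=1e-4, T=1000):
    """Plain Langevin dynamics following Eq. (1)."""
    x = x0.clone()                       # x_0 drawn from an arbitrary prior pi(x)
    for _ in range(T):
        z = torch.randn_like(x)          # z_t ~ N(0, I)
        x = x + alpha * score(x) + (2 * alpha) ** 0.5 * z
    return x                             # close to a sample from p(x) for small alpha and large T
```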

2.2 Score-based generative modeling

We can estimate the score function from data and generate new samples with Langevin dynamics.
This idea was named score-based generative modeling by ref. [1]. Because the estimated score
function is inaccurate in regions without training data, Langevin dynamics may not converge correctly
when a sampling trajectory encounters those regions (see more detailed analysis in ref. [1]). As a
remedy, ref. [1] proposes to perturb the data with Gaussian noise of different intensities and jointly
estimate the score functions of all noise-perturbed data distributions. During inference, they combine
the information from all noise scales by sampling from each noise-perturbed distribution sequentially
with Langevin dynamics.
More specifically, suppose we have an underlying data distribution p_data(x) and consider a sequence of noise scales {σ_i}_{i=1}^L that satisfies σ_1 > σ_2 > ··· > σ_L. Let p_σ(x̃ | x) = N(x̃ | x, σ² I), and denote the corresponding perturbed data distribution as p_σ(x̃) ≜ ∫ p_σ(x̃ | x) p_data(x) dx. Ref. [1]

proposes to estimate the score function of each p_{σ_i}(x) by training a joint neural network s_θ(x, σ) (called the noise conditional score network) with the following loss:

$$\frac{1}{2L}\sum_{i=1}^{L} \mathbb{E}_{p_{\text{data}}(x)}\,\mathbb{E}_{p_{\sigma_i}(\tilde{x}\mid x)}\left[\left\| \sigma_i\, s_\theta(\tilde{x}, \sigma_i) + \frac{\tilde{x} - x}{\sigma_i} \right\|_2^2\right], \tag{2}$$

where all expectations can be efficiently estimated using empirical averages. When trained to
the optimum (denoted as sθ∗ (x, σ)), the noise conditional score network (NCSN) satisfies ∀i :
sθ∗ (x, σi ) = ∇x log pσi (x) almost everywhere [1], assuming enough data and model capacity.
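As a concrete reference, the following PyTorch sketch computes a stochastic estimate of the objective in Eq. (2) by sampling one noise-scale index per training example; `score_net(x, sigma)` is an assumed interface for the noise conditional score network, and `sigmas` is a 1-D tensor holding {σ_i}_{i=1}^L.

```python
import torch

def ncsn_dsm_loss(score_net, x, sigmas):
    """One stochastic estimate of the denoising score matching loss in Eq. (2)."""
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))   # one noise scale per example
    z = torch.randn_like(x)
    x_tilde = x + sigma * z                                # sample from p_sigma_i(x_tilde | x)
    score = score_net(x_tilde, sigmas[idx])
    # since (x_tilde - x) / sigma_i equals z, this residual matches the term inside Eq. (2);
    # averaging over the random index i estimates the average over i = 1..L
    return 0.5 * ((sigma * score + z) ** 2).flatten(1).sum(dim=-1).mean()
```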
After training an NCSN, ref. [1] generates samples by annealed Langevin dynamics, a method that combines information from all noise scales. We provide its pseudo-code in Algorithm 1. The approach amounts to sampling from p_{σ_1}(x), p_{σ_2}(x), ···, p_{σ_L}(x) sequentially with Langevin dynamics, using a special step size schedule α_i = ε σ_i²/σ_L² for the i-th noise scale. Samples from each noise scale are used to initialize Langevin dynamics for the next noise scale, until reaching the smallest one, which provides the final samples for the NCSN.

Algorithm 1 Annealed Langevin dynamics [1]
Require: {σ_i}_{i=1}^L, ε, T.
 1: Initialize x_0
 2: for i ← 1 to L do
 3:   α_i ← ε · σ_i²/σ_L²                  ▷ α_i is the step size.
 4:   for t ← 1 to T do
 5:     Draw z_t ∼ N(0, I)
 6:     x_t ← x_{t−1} + α_i s_θ(x_{t−1}, σ_i) + √(2α_i) z_t
 7:   x_0 ← x_T
 8: if denoise x_T then
 9:   return x_T + σ_L² s_θ(x_T, σ_L)
10: else
11:   return x_T

Following the first public release of this work, ref. [9] noticed that adding an extra denoising step after the original annealed Langevin dynamics in [1], similar to [10, 11, 12], often significantly improves FID scores [13] without affecting the visual appearance of samples. Instead of directly returning x_T, this denoising step returns x_T + σ_L² s_θ(x_T, σ_L) (see Algorithm 1), which essentially removes the unwanted noise N(0, σ_L² I) from x_T using Tweedie's formula [14]. Therefore, we have updated results in the main paper by incorporating this denoising trick, but kept some original results without this denoising step in the appendix for reference.
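For reference, a minimal PyTorch sketch of Algorithm 1 is given below; `score_net(x, sigma)` is again an assumed interface, the uniform initialization mirrors the experiments in this paper, and the optional final step implements the denoising trick discussed above.

```python
import torch

@torch.no_grad()
def annealed_langevin_dynamics(score_net, sigmas, shape, eps=2e-5, T=100,
                               denoise=True, device="cpu"):
    """Annealed Langevin dynamics (Algorithm 1) with the optional denoising step."""
    x = torch.rand(shape, device=device)                  # x_0 from a uniform prior
    for sigma in sigmas:                                  # sigma_1 > ... > sigma_L
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2        # step size schedule alpha_i
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + alpha * score_net(x, sigma) + (2 * alpha) ** 0.5 * z
    if denoise:                                           # Tweedie step: x_T + sigma_L^2 * s(x_T, sigma_L)
        x = x + sigmas[-1] ** 2 * score_net(x, sigmas[-1])
    return x
```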
There are many design choices that are critical to the successful training and inference of NCSNs,
including (i) the set of noise scales {σ_i}_{i=1}^L, (ii) the way that s_θ(x, σ) incorporates information of σ, (iii) the step size parameter ε, and (iv) the number of sampling steps per noise scale T in Algorithm 1.
Below we provide theoretically motivated ways to configure them without manual tuning, which
significantly improve the performance of NCSNs on high resolution images.

3 Choosing noise scales

Noise scales are critical for the success of NCSNs. As shown in [1], score networks trained with a
single noise can never produce convincing samples for large images. Intuitively, high noise facilitates
the estimation of score functions, but also leads to corrupted samples; while lower noise gives clean
samples but makes score functions harder to estimate. One should therefore leverage different noise
scales together to get the best of both worlds.
When the range of pixel values is [0, 1], the original work on NCSN [1] recommends choosing
{σ_i}_{i=1}^L as a geometric sequence where L = 10, σ_1 = 1, and σ_L = 0.01. It is reasonable that the smallest noise scale σ_L = 0.01 ≪ 1, because we sample from perturbed distributions with
descending noise scales and we want to add low noise at the end. However, some important questions
remain unanswered, which turn out to be critical to the success of NCSNs on high resolution images:
(i) Is σ1 = 1 appropriate? If not, how should we adjust σ1 for different datasets? (ii) Is geometric
progression a good choice? (iii) Is L = 10 good across different datasets? If not, how many noise
scales are ideal?
Below we provide answers to the above questions, motivated by theoretical analyses on simple
mathematical models. Our insights are effective for configuring score-based generative modeling in
practice, as corroborated by experimental results in Section 6.

3.1 Initial noise scale

The algorithm of annealed Langevin dynamics (Algorithm 1) is an iterative refining procedure that
starts from generating coarse samples with rich variation under large noise, before converging to fine
samples with less variation under small noise. The initial noise scale σ1 largely controls the diversity
of the final samples. In order to promote sample diversity, we might want to choose σ1 to be as
large as possible. However, an excessively large σ1 will require more noise scales (to be discussed in
Section 3.2) and make annealed Langevin dynamics more expensive. Below we present an analysis
to guide the choice of σ1 and provide a technique to strike the right balance.
Real-world data distributions are complex and hard to analyze, so we approximate them with empirical
distributions. Suppose we have a dataset {x^(1), x^(2), ···, x^(N)} which is i.i.d. sampled from p_data(x). Assuming N is sufficiently large, we have p_data(x) ≈ p̂_data(x) ≜ (1/N) Σ_{i=1}^N δ(x = x^(i)), where δ(·) denotes a point mass distribution. When perturbed with N(0, σ_1² I), the empirical distribution becomes p̂_{σ_1}(x) ≜ (1/N) Σ_{i=1}^N p^(i)(x), where p^(i)(x) ≜ N(x | x^(i), σ_1² I). For generating diverse samples regardless of initialization, we naturally expect that Langevin dynamics can explore any component p^(i)(x) when initialized from any other component p^(j)(x), where i ≠ j. The performance of Langevin dynamics is governed by the score function ∇_x log p̂_{σ_1}(x) (see Eq. (1)).
Proposition 1. Let p̂_{σ_1}(x) ≜ (1/N) Σ_{i=1}^N p^(i)(x), where p^(i)(x) ≜ N(x | x^(i), σ_1² I). With r^(i)(x) ≜ p^(i)(x) / Σ_{k=1}^N p^(k)(x), the score function is ∇_x log p̂_{σ_1}(x) = Σ_{i=1}^N r^(i)(x) ∇_x log p^(i)(x). Moreover,

$$\mathbb{E}_{p^{(i)}(x)}\!\left[r^{(j)}(x)\right] \le \frac{1}{2}\exp\!\left(-\frac{\|x^{(i)} - x^{(j)}\|_2^2}{8\sigma_1^2}\right). \tag{3}$$

In order for Langevin dynamics to transition from p^(i)(x) to p^(j)(x) easily for i ≠ j, E_{p^(i)(x)}[r^(j)(x)] has to be relatively large, because otherwise ∇_x log p̂_{σ_1}(x) = Σ_{k=1}^N r^(k)(x) ∇_x log p^(k)(x) will ignore the component p^(j)(x) (on average) when initialized with x ∼ p^(i)(x), in which case Langevin dynamics will act as if p^(j)(x) did not exist. The bound of Eq. (3) indicates that E_{p^(i)(x)}[r^(j)(x)] can decay exponentially fast if σ_1 is small compared to ‖x^(i) − x^(j)‖_2. As a result, it is necessary for σ_1 to be numerically comparable to the maximum pairwise distance of the data to facilitate transitioning of Langevin dynamics and hence improve sample diversity. In particular, we suggest:
Technique 1 (Initial noise scale). Choose σ1 to be as large as the maximum Euclidean distance
between all pairs of training data points.
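A simple way to apply Technique 1 is to compute the maximum pairwise Euclidean distance directly from the (flattened) training images, as in the NumPy sketch below; the batching is only there to bound memory, and for very large datasets one can subsample first (as done in Appendix B.1).

```python
import numpy as np

def max_pairwise_distance(x, batch=256):
    """Return max_{i,j} ||x_i - x_j||_2 for a (N, D) array of flattened images."""
    sq = (x ** 2).sum(axis=1)
    best = 0.0
    for i in range(0, len(x), batch):
        xi = x[i:i + batch]
        # squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        d2 = sq[i:i + batch, None] + sq[None, :] - 2.0 * xi @ x.T
        best = max(best, float(np.sqrt(np.maximum(d2, 0.0)).max()))
    return best

# e.g. sigma1 = max_pairwise_distance(train_images.reshape(len(train_images), -1))
```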

Taking CIFAR-10 as an example, the median pairwise distance between all training images is around 18, so σ_1 = 1 as in [1] implies E[r(x)] < 10⁻¹⁷ and is unlikely to produce diverse samples as per our analysis. To test whether choosing σ_1 according to Technique 1 (i.e., σ_1 = 50) gives significantly more diverse samples than using σ_1 = 1, we run annealed Langevin dynamics to sample from a mixture of Gaussians with 10000 components, where each component is centered at one CIFAR-10 test image. All initial samples are drawn from a uniform distribution over [0, 1]^{32×32×3}. This setting allows us to avoid confounders introduced by NCSN training because we use ground truth score functions. As shown in Fig. 2, samples in Fig. 2c (using Technique 1) exhibit comparable diversity to ground-truth images (Fig. 2a), and have better variety than Fig. 2b (σ_1 = 1). Quantitatively, the average pairwise distance of samples in Fig. 2c is 18.65, comparable to data (17.78) but much higher than that of Fig. 2b (10.12).

Figure 2: Running annealed Langevin dynamics to sample from a mixture of Gaussians centered at images in the CIFAR-10 test set. (a) Data; (b) σ_1 = 1; (c) σ_1 = 50.

3.2 Other noise scales

After setting σ_L and σ_1, we need to choose the number of noise scales L and specify the other elements of {σ_i}_{i=1}^L. As analyzed in [1], it is crucial for the success of score-based generative models to ensure that p_{σ_i}(x) generates a sufficient number of samples in the high density regions of p_{σ_{i−1}}(x) for all 1 < i ≤ L. The intuition is that we need reliable gradient signals for p_{σ_i}(x) when initializing Langevin dynamics with samples from p_{σ_{i−1}}(x).
However, an extensive grid search on {σ_i}_{i=1}^L can be very expensive. To give some theoretical guidance on finding good noise scales, we consider a simple case where the dataset contains only one data point, or equivalently, ∀ 1 ≤ i ≤ L : p_{σ_i}(x) = N(x | 0, σ_i² I). Our first step is to understand the distributions of p_{σ_i}(x) better, especially when x has high dimensionality. We can decompose p_{σ_i}(x) in hyperspherical coordinates as p(φ) p_{σ_i}(r), where r and φ denote the radial and angular coordinates of x respectively. Because p_{σ_i}(x) is an isotropic Gaussian, the angular component p(φ) is uniform and shared across all noise scales. As for p_{σ_i}(r), we have the following
Proposition 2. Let x ∈ R^D ∼ N(0, σ²I), and r = ‖x‖_2. We have

$$p(r) = \frac{r^{D-1}}{2^{D/2-1}\Gamma(D/2)\,\sigma^D}\exp\!\left(-\frac{r^2}{2\sigma^2}\right) \quad \text{and} \quad r - \sqrt{D}\sigma \xrightarrow{d} \mathcal{N}(0, \sigma^2/2) \ \text{when } D \to \infty.$$

In practice, dimensions of image data can range from several thousand to millions, and are typically large enough to warrant p(r) ≈ N(r | √D σ, σ²/2) with negligible error. We therefore take p_{σ_i}(r) = N(r | m_i, s_i²) to simplify our analysis, where m_i ≜ √D σ_i and s_i² ≜ σ_i²/2.
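Proposition 2 is easy to check numerically; the short NumPy sketch below uses D = 3 × 32 × 32 and σ = 1 as illustrative values and confirms that ‖x‖_2 concentrates around √D σ with standard deviation close to σ/√2.

```python
import numpy as np

D, sigma, n = 3 * 32 * 32, 1.0, 10000
rng = np.random.default_rng(0)
r = np.linalg.norm(rng.normal(0.0, sigma, size=(n, D)), axis=1)  # radial coordinates
print(r.mean(), np.sqrt(D) * sigma)   # both close to 55.4
print(r.std(), sigma / np.sqrt(2))    # both close to 0.71
```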
Recall that our goal is to make sure samples from p_{σ_i}(x) will cover high density regions of p_{σ_{i−1}}(x). Because p(φ) is shared across all noise scales, p_{σ_i}(x) already covers the angular component of p_{σ_{i−1}}(x). Therefore, we need the radial components of p_{σ_i}(x) and p_{σ_{i−1}}(x) to have large overlap. Since p_{σ_{i−1}}(r) has high density in I_{i−1} ≜ [m_{i−1} − 3s_{i−1}, m_{i−1} + 3s_{i−1}] (employing the "three-sigma rule of thumb" [15]), a natural choice is to fix

$$p_{\sigma_i}(r \in I_{i-1}) = \Phi\!\left(\sqrt{2D}(\gamma_i - 1) + 3\gamma_i\right) - \Phi\!\left(\sqrt{2D}(\gamma_i - 1) - 3\gamma_i\right) = C$$

with some moderately large constant C > 0 for all 1 < i ≤ L, where γ_i ≜ σ_{i−1}/σ_i and Φ(·) is the CDF of the standard Gaussian. This choice immediately implies that γ_2 = γ_3 = ··· = γ_L and thus {σ_i}_{i=1}^L is a geometric progression.
Ideally, we should choose as many noise scales as possible to make C ≈ 1. However, having too
many noise scales will make sampling very costly, as we need to run Langevin dynamics for each
noise scale in sequence. On the other hand, L = 10 (for 32 × 32 images) as in the original setting
of [1] is arguably too small, for which C = 0 up to numerical precision. To strike a balance, we
recommend C ≈ 0.5 which performs well in our experiments. In summary,
Technique 2 (Other noise scales). Choose {σ_i}_{i=1}^L as a geometric progression with common ratio γ, such that Φ(√(2D)(γ − 1) + 3γ) − Φ(√(2D)(γ − 1) − 3γ) ≈ 0.5.
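In practice, Technique 2 can be applied by numerically solving the equation above for the common ratio γ and then laying out the geometric schedule between σ_1 and σ_L. The sketch below uses SciPy root finding; the exact number of scales it returns may differ slightly from Table 4 depending on rounding conventions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def overlap(gamma, D):
    """P(r in I_{i-1}) under p_sigma_i(r), for common ratio gamma = sigma_{i-1}/sigma_i."""
    c = np.sqrt(2 * D) * (gamma - 1)
    return norm.cdf(c + 3 * gamma) - norm.cdf(c - 3 * gamma)

def geometric_noise_scales(sigma1, sigmaL, D, target=0.5):
    # solve overlap(gamma) = target, then round up the number of scales L
    gamma = brentq(lambda g: overlap(g, D) - target, 1.0 + 1e-9, 2.0)
    L = int(np.ceil(np.log(sigma1 / sigmaL) / np.log(gamma))) + 1
    return np.geomspace(sigma1, sigmaL, L)

sigmas = geometric_noise_scales(sigma1=50.0, sigmaL=0.01, D=3 * 32 * 32)
print(len(sigmas), sigmas[0], sigmas[-1])
```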

3.3 Incorporating the noise information

For high resolution images, we need a large σ_1 and a huge number of noise scales as per Techniques 1 and 2. Recall that the NCSN is a single amortized network that takes a noise scale and gives the corresponding score. In [1], the authors use a separate set of scale and bias parameters in normalization layers to incorporate the information from each noise scale. However, its memory consumption grows linearly w.r.t. L, and it is not applicable when the NCSN has no normalization layers.

Figure 3: Training loss curves of two noise conditioning methods.

We propose an efficient alternative that is easier to implement and more widely applicable. For p_σ(x) = N(x | 0, σ² I) analyzed in Section 3.2, we observe that E[‖∇_x log p_σ(x)‖_2] ≈ √D/σ. Moreover, as empirically noted in [1], ‖s_θ(x, σ)‖_2 ∝ 1/σ for a trained NCSN on real data. Because the norm of score functions scales inversely proportionally to σ, we can incorporate the noise information by rescaling the output of an unconditional score network s_θ(x) with 1/σ. This motivates the following recommendation:
Technique 3 (Noise conditioning). Parameterize the NCSN with sθ (x, σ) = sθ (x)/σ, where sθ (x)
is an unconditional score network.
It is typically hard for deep networks to automatically learn this rescaling, because σ_1 and σ_L can differ by several orders of magnitude. This simple choice is easier to implement, and can easily handle a large number of noise scales (even continuous ones). As shown in Fig. 3 (detailed settings in Appendix B), it achieves similar training losses compared to the original noise conditioning approach in [1], and generates samples of better quality (see Appendix C.4).
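Technique 3 amounts to a one-line wrapper around any unconditional score network; a PyTorch sketch is shown below, where `base` stands for an assumed backbone such as the RefineNet variant described in Appendix B.1.

```python
import torch.nn as nn

class NoiseConditionedScore(nn.Module):
    """s_theta(x, sigma) = s_theta(x) / sigma (Technique 3)."""

    def __init__(self, base):
        super().__init__()
        self.base = base          # unconditional score network, output shaped like x

    def forward(self, x, sigma):
        # sigma may be a float or a (batch,) tensor; broadcast it over pixels
        if hasattr(sigma, "view"):
            sigma = sigma.view(-1, *([1] * (x.dim() - 1)))
        return self.base(x) / sigma
```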

4 Configuring annealed Langevin dynamics


In order to sample from an NCSN with annealed Langevin dynamics, we need to specify the number of sampling steps per noise scale T and the step size parameter ε in Algorithm 1. The authors of [1] recommend ε = 2 × 10⁻⁵ and T = 100. It remains unclear how we should change ε and T for different sets of noise scales.
To gain some theoretical insight, we revisit the setting in Section 3.2 where the dataset has one point (i.e., p_{σ_i}(x) = N(x | 0, σ_i² I)). Annealed Langevin dynamics connect two adjacent noise scales σ_{i−1} > σ_i by initializing the Langevin dynamics for p_{σ_i}(x) with samples obtained from p_{σ_{i−1}}(x). When applying Langevin dynamics to p_{σ_i}(x), we have x_{t+1} ← x_t + α ∇_x log p_{σ_i}(x_t) + √(2α) z_t, where x_0 ∼ p_{σ_{i−1}}(x) and z_t ∼ N(0, I). The distribution of x_T can be computed in closed form:
Proposition 3. Let γ = σ_{i−1}/σ_i. For α = ε · σ_i²/σ_L² (as in Algorithm 1), we have x_T ∼ N(0, s_T² I), where

$$\frac{s_T^2}{\sigma_i^2} = \left(1 - \frac{\epsilon}{\sigma_L^2}\right)^{2T}\left(\gamma^2 - \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2}\right) + \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2}. \tag{4}$$

When {σ_i}_{i=1}^L is a geometric progression as advocated by Technique 2, we immediately see that s_T²/σ_i² is identical across all 1 < i ≤ L because of the shared γ. Furthermore, the value of s_T²/σ_i² has no explicit dependency on the dimensionality D.
For better mixing of annealed Langevin dynamics, we hope s_T²/σ_i² approaches 1 across all noise scales, which can be achieved by finding ε and T that minimize the difference between Eq. (4) and 1. Unfortunately, this often results in an unnecessarily large T that makes sampling very expensive for large L. As an alternative, we propose to first choose T based on a reasonable computing budget (typically T × L is several thousand), and subsequently find ε by making Eq. (4) as close to 1 as possible. In summary:
Technique 4 (Selecting T and ε). Choose T as large as allowed by a computing budget and then select an ε that makes Eq. (4) maximally close to 1.

We follow this guidance to generate all samples in this paper, except for those from the original NCSN where we adopt the same settings as in [1]. When finding ε with Technique 4 and Eq. (4), we recommend performing grid search over ε, rather than using gradient-based optimization methods.
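Technique 4 is cheap to implement because Eq. (4) is available in closed form; the sketch below evaluates it on a grid of candidate ε values and picks the one closest to 1 (the grid range is an illustrative choice).

```python
import numpy as np

def eq4(eps, gamma, sigma_L, T):
    """s_T^2 / sigma_i^2 from Eq. (4); identical for every i under a geometric schedule."""
    a = (1.0 - eps / sigma_L ** 2) ** 2
    v = 2.0 * eps / (sigma_L ** 2 * (1.0 - a))
    return a ** T * (gamma ** 2 - v) + v

def find_eps(sigmas, T):
    gamma, sigma_L = sigmas[0] / sigmas[1], sigmas[-1]
    grid = np.geomspace(1e-9, 1e-3, 1000)          # candidate step size parameters
    return min(grid, key=lambda e: abs(eq4(e, gamma, sigma_L, T) - 1.0))

# e.g. eps = find_eps(sigmas, T=5)
```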

5 Improving stability with moving average


Unlike GANs, score-based generative models have one unified objective (Eq. (2)) and require no
adversarial training. However, even though the loss function of NCSNs typically decreases steadily
over the course of training, we observe that the generated image samples sometimes exhibit unstable
visual quality, especially for images of larger resolutions. We empirically demonstrate this fact
by training NCSNs on CIFAR-10 32 × 32 and CelebA [16] 64 × 64 following the settings of [1],
which exemplifies typical behavior on other image datasets. We report FID scores [13] computed
on 1000 samples every 5000 iterations. Results in Fig. 4 are computed with the denoising step, but
results without the denoising step are similar (see Fig. 8 in Appendix C.1). As shown in Figs. 4
and 8, the FID scores for the vanilla NCSN often fluctuate significantly during training. Additionally,
samples from the vanilla NCSN sometimes exhibit characteristic artifacts: image samples from the same checkpoint have a strong tendency to share a common color shift. Moreover, samples are shifted towards different colors throughout training. We provide more samples in Appendix C.3 to illustrate this artifact.
Figure 4: FIDs and color artifacts over the course of training (best viewed in color). The FIDs of NCSN have much higher volatility compared to NCSN with EMA. Samples from the vanilla NCSN often have obvious color shifts. All FIDs are computed with the denoising step.

Figure 5: FIDs for different groups of techniques. (a) CIFAR-10 FIDs; (b) CelebA FIDs. Subscripts of "NCSN" are IDs of techniques in effect. "NCSNv2" uses all techniques. Results are computed with the denoising step.

Table 1: Inception and FID scores.

Model                     Inception ↑    FID ↓
CIFAR-10 Unconditional
PixelCNN [17]             4.60           65.93
IGEBM [18]                6.02           40.58
WGAN-GP [19]              7.86 ± .07     36.4
SNGAN [20]                8.22 ± .05     21.7
NCSN [1]                  8.87 ± .12     25.32
NCSN (w/ denoising)       7.32 ± .12     29.8
NCSNv2 (w/o denoising)    8.73 ± .13     31.75
NCSNv2 (w/ denoising)     8.40 ± .07     10.87
CelebA 64 × 64
NCSN (w/o denoising)      -              26.89
NCSN (w/ denoising)       -              25.30
NCSNv2 (w/o denoising)    -              28.86
NCSNv2 (w/ denoising)     -              10.23

This issue can be easily fixed by exponential moving average (EMA). Specifically, let θ_i denote the parameters of an NCSN after the i-th training iteration, and θ′ be an independent copy of the parameters. We update θ′ with θ′ ← m θ′ + (1 − m) θ_i after each optimization step, where m is the momentum parameter and typically m = 0.999. When producing samples, we use s_{θ′}(x, σ) instead of s_{θ_i}(x, σ). As shown in Fig. 4, EMA can effectively stabilize FIDs, remove artifacts (more samples in Appendix C.3) and give better FID scores in most cases. Empirically, we observe the effectiveness of EMA is universal across a large number of different image datasets. As a result, we recommend the following rule of thumb:
Technique 5 (EMA). Apply exponential moving average to parameters when sampling.
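Technique 5 is straightforward to add to an existing training loop; a PyTorch sketch is given below, where the shadow copy plays the role of θ′, `update` is called after every optimizer step, and sampling uses `ema.shadow` in place of the live model.

```python
import copy
import torch

class EMA:
    """Exponential moving average of model parameters (Technique 5)."""

    def __init__(self, model, m=0.999):
        self.m = m
        self.shadow = copy.deepcopy(model)   # theta', an independent copy of the parameters
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # theta' <- m * theta' + (1 - m) * theta_i after each optimization step
        for p_ema, p in zip(self.shadow.parameters(), model.parameters()):
            p_ema.mul_(self.m).add_(p, alpha=1.0 - self.m)
```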

6 Combining all techniques together


Employing Technique 1–5, we build NCSNs that can readily work across a large number of different
datasets, including high resolution images that were previously out of reach with score-based genera-
tive modeling. Our modified model is named NCSNv2. For a complete description on experimental
details and more results, please refer to Appendix B and C.
Quantitative results: We consider CIFAR-10 32×32 and CelebA 64×64 where NCSN and NCSNv2
both produce reasonable samples. We report FIDs (lower is better) every 5000 iterations of training
on 1000 samples and give results in Fig. 5 (with denoising) and Fig. 9 (without denoising, deferred
to Appendix C.1). As shown in Figs. 5 and 9, we observe that the FID scores of NCSNv2 (with all
techniques applied) are on average better than those of NCSN, and have much smaller variance over
the course of training. Following [1], we select checkpoints with the smallest FIDs (on 1000 samples)
encountered during training, and compute full FID and Inception scores on more samples from them.
As shown by results in Table 1, NCSNv2 (w/ denoising) is able to significantly improve the FID
scores of NCSN on both CIFAR-10 and CelebA, while bearing a slight loss of Inception scores on
CIFAR-10. However, we note that Inception and FID scores have known issues [21, 22] and they
should be interpreted with caution as they may not correlate with visual quality in the expected way.
In particular, they can be sensitive to slight noise perturbations [23], as shown by the difference of scores with and without denoising in Table 1. To verify that NCSNv2 indeed generates better images than NCSN, we provide additional uncurated samples in Appendix C.4 for visual comparison.

Figure 6: From top to bottom: FFHQ 256², LSUN bedroom 128², LSUN tower 128², and LSUN church_outdoor 96². Within each group of images: the first row shows uncurated samples from NCSNv2, and the second shows the interpolation results between the leftmost and rightmost samples with NCSNv2. You may zoom in to view more details.

Figure 7: NCSN vs. NCSNv2 samples on LSUN church_outdoor and LSUN bedroom. Panels: (a) NCSN and (b) NCSNv2 on church_outdoor; (c) NCSN and (d) NCSNv2 on bedroom.
Ablation studies: We conduct ablation studies to isolate the contributions of different techniques. We
partition all techniques into three groups: (i) Technique 5, (ii) Technique 1,2,4, and (iii) Technique 3,
where different groups can be applied simultaneously. Technique 1,2 and 4 are grouped together
because Technique 1 and 2 collectively determine the set of noise scales, and to sample from NCSNs
trained with these noise scales we need Technique 4 to configure annealed Langevin dynamics
properly. We test the performance of successively removing groups (iii), (ii), (i) from NCSNv2, and
report results in Fig. 5 for sampling with denoising and in Fig. 9 (Appendix C.1) for sampling without
denoising. All groups of techniques improve over the vanilla NCSN. Although the FID scores are
not strictly increasing when removing (iii), (ii), and (i) progressively, we note that FIDs may not
always correlate with sample quality well. In fact, we do observe decreasing sample quality by visual
inspection (see Appendix C.4), and combining all techniques gives the best samples.
Towards higher resolution: The original NCSN only succeeds at generating images of low resolu-
tion. In fact, [1] only tested it on MNIST 28 × 28 and CelebA/CIFAR-10 32 × 32. For slightly larger
images such as CelebA 64 × 64, NCSN can generate images of consistent global structure, yet with
strong color artifacts that are easily noticeable (see Fig. 4 and compare Fig. 10a with Fig. 10b). For
images with resolutions beyond 96 × 96, NCSN will completely fail to produce samples with correct
structure or color (see Fig. 7). All samples shown here are generated without the denoising step, but
since σL is very small, they are visually indistinguishable from ones with the denoising step.
By combining Technique 1–5, NCSNv2 can work on images of much higher resolution. Note that
we directly calculated the noise scales for training NCSNs, and computed the step size for annealed

Langevin dynamics sampling without manual hyper-parameter tuning. The network architectures are
the same across datasets, except that for ones with higher resolution we use more layers and more
filters to ensure the receptive field and model capacity are large enough (see details in Appendix B.1).
In Fig. 6 and 1, we show NCSNv2 is capable of generating high-fidelity image samples with
resolutions ranging from 96 × 96 to 256 × 256. To show that this high sample quality is not a result
of dataset memorization, we provide the loss curves for training/test, as well as nearest neighbors
for samples in Appendix C.5. In addition, NCSNv2 can produce smooth interpolations between two
given samples as in Fig. 6 (details in Appendix B.2), indicating the ability to learn generalizable
image representations.

7 Conclusion
Motivated by both theoretical analyses and empirical observations, we propose a set of techniques
to improve score-based generative models. Our techniques significantly improve the training and
sampling processes, lead to better sample quality, and enable high-fidelity image generation at
high resolutions. Although our techniques work well without manual tuning, we believe that the
performance can be improved even more by fine-tuning various hyper-parameters. Future directions include a better theoretical understanding of the sample quality of score-based generative models, as well as noise distributions other than Gaussian perturbations.

Broader Impact
Our work represents another step towards more powerful generative models. While we focused
on images, it is quite likely that similar techniques could be applicable to other data modalities
such as speech or behavioral data (in the context of imitation learning). Like other generative
models that have been previously proposed, such as GANs and WaveNets, score models have a
multitude of applications. Among many other applications, they could be used to synthesize new
data automatically, detect anomalies and adversarial examples, and also improve results in key tasks
such as semi-supervised learning and reinforcement learning. In turn, these techniques can have both
positive and negative impacts on society, depending on the application. In particular, the models we
trained on image datasets can be used to synthesize new images that are hard to distinguish from
real ones by humans. Synthetic images from generative models have already been used to deceive
humans in malicious ways. There are also positive uses of these technologies, for example in the arts
and as a tool to aid design in engineering. We also note that our models have been trained on datasets
that have biases (e.g., CelebA is not gender-balanced), and the learned distribution is likely to have
inherited them, in addition to others that are caused by the so-called inductive bias of models.

Acknowledgments and Disclosure of Funding


The authors would like to thank Aditya Grover, Rui Shu and Shengjia Zhao for reviewing an early
draft of this paper, as well as Gabby Wright and Sharon Zhou for resolving technical issues in
computing HYPE∞ scores. This research was supported by NSF (#1651565, #1522054, #1733686),
ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and Amazon AWS.

References
[1] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
[2] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
[3] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal
of Machine Learning Research, 6(Apr):695–709, 2005.
[4] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural
computation, 23(7):1661–1674, 2011.
[5] Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision.
Neural computation, 23(2):374–420, 2011.

[6] Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of Langevin distributions
and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
[7] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics.
In Proceedings of the 28th international conference on machine learning (ICML-11), pages
681–688, 2011.
[8] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable
approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on
Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, page 204,
2019.
[9] Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Ioannis Mitliagkas, and Rémi Tachet des
Combes. Adversarial score matching and improved sampling for image generation. arXiv
preprint arXiv:2009.05475, 2020.
[10] Saeed Saremi and Aapo Hyvarinen. Neural empirical bayes. Journal of Machine Learning
Research, 20:1–23, 2019.
[11] Zengyi Li, Yubei Chen, and Friedrich T Sommer. Learning energy-based models in high-
dimensional spaces with multi-scale denoising score matching. arXiv, pages arXiv–1910,
2019.
[12] Zahra Kadkhodaie and Eero P Simoncelli. Solving linear inverse problems using the prior
implicit in a denoiser. arXiv preprint arXiv:2007.13640, 2020.
[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances
in neural information processing systems, pages 6626–6637, 2017.
[14] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical
Association, 106(496):1602–1614, 2011.
[15] Erik W Grafarend. Linear and nonlinear models: fixed effects, random effects, and mixed
models. de Gruyter, 2006.
[16] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the
wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[17] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
Conditional image generation with PixelCNN decoders. In Advances in neural information
processing systems, pages 4790–4798, 2016.
[18] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models.
arXiv preprint arXiv:1903.08689, 2019.
[19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.
Improved training of Wasserstein GANs. In Advances in neural information processing systems,
pages 5767–5777, 2017.
[20] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization
for generative adversarial networks. In International Conference on Learning Representations,
2018.
[21] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973,
2018.
[22] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assess-
ing generative models via precision and recall. In Advances in Neural Information Processing
Systems, pages 5228–5237, 2018.
[23] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images
with VQ-VAE-2. In Advances in Neural Information Processing Systems, pages 14837–14847,
2019.
[24] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement
networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1925–1934, 2017.
[25] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network
learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[27] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN:
Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv
preprint arXiv:1506.03365, 2015.
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4401–4410, 2019.
[29] Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F Fei-Fei, and Michael
Bernstein. Hype: A benchmark for human eye perceptual evaluation of generative models. In
Advances in Neural Information Processing Systems, pages 3444–3456, 2019.
[30] Jiaming Song and Stefano Ermon. Bridging the gap between f-GANs and Wasserstein GANs.
arXiv preprint arXiv:1910.09779, 2019.
[31] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond.
arXiv preprint arXiv:1904.09237, 2019.
[32] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[33] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative
adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

A Proofs
Proposition 1. Let p̂_{σ_1}(x) ≜ (1/N) Σ_{i=1}^N p^(i)(x), where p^(i)(x) ≜ N(x | x^(i), σ_1² I). With r^(i)(x) ≜ p^(i)(x) / Σ_{k=1}^N p^(k)(x), the score function is ∇_x log p̂_{σ_1}(x) = Σ_{i=1}^N r^(i)(x) ∇_x log p^(i)(x). Moreover,

$$\mathbb{E}_{p^{(i)}(x)}\!\left[r^{(j)}(x)\right] \le \frac{1}{2}\exp\!\left(-\frac{\|x^{(i)} - x^{(j)}\|_2^2}{8\sigma_1^2}\right). \tag{5}$$

Proof. According to the definition of p̂_{σ_1}(x) and r^(i)(x), we have

$$\nabla_x \log \hat{p}_{\sigma_1}(x) = \nabla_x \log\left(\frac{1}{N}\sum_{i=1}^N p^{(i)}(x)\right) = \sum_{i=1}^N \frac{\nabla_x p^{(i)}(x)}{\sum_{j=1}^N p^{(j)}(x)} = \sum_{i=1}^N \frac{p^{(i)}(x)\,\nabla_x \log p^{(i)}(x)}{\sum_{j=1}^N p^{(j)}(x)} = \sum_{i=1}^N r^{(i)}(x)\,\nabla_x \log p^{(i)}(x).$$

Next, assuming x ∈ R^D, we have

$$\mathbb{E}_{p^{(i)}(x)}\!\left[r^{(j)}(x)\right] = \int \frac{p^{(i)}(x)\,p^{(j)}(x)}{\sum_{k=1}^N p^{(k)}(x)}\,\mathrm{d}x \le \int \frac{p^{(i)}(x)\,p^{(j)}(x)}{p^{(i)}(x)+p^{(j)}(x)}\,\mathrm{d}x = \int \frac{1}{\frac{1}{p^{(i)}(x)}+\frac{1}{p^{(j)}(x)}}\,\mathrm{d}x \overset{(1)}{\le} \frac{1}{2}\int \sqrt{p^{(i)}(x)\,p^{(j)}(x)}\,\mathrm{d}x,$$

where

$$\sqrt{p^{(i)}(x)\,p^{(j)}(x)} = \frac{1}{(2\pi\sigma_1^2)^{D/2}}\exp\!\left(-\frac{\|x - x^{(i)}\|_2^2 + \|x - x^{(j)}\|_2^2}{4\sigma_1^2}\right).$$

Writing x − x^(j) = (x − x^(i)) + (x^(i) − x^(j)) and completing the square gives

$$\|x - x^{(i)}\|_2^2 + \|x - x^{(j)}\|_2^2 = 2\left\|x - x^{(i)} + \frac{x^{(i)} - x^{(j)}}{2}\right\|_2^2 + \frac{\|x^{(i)} - x^{(j)}\|_2^2}{2},$$

so that

$$\mathbb{E}_{p^{(i)}(x)}\!\left[r^{(j)}(x)\right] \le \frac{1}{2}\exp\!\left(-\frac{\|x^{(i)} - x^{(j)}\|_2^2}{8\sigma_1^2}\right)\int \frac{1}{(2\pi\sigma_1^2)^{D/2}}\exp\!\left(-\frac{\|x - x^{(i)} + (x^{(i)} - x^{(j)})/2\|_2^2}{2\sigma_1^2}\right)\mathrm{d}x = \frac{1}{2}\exp\!\left(-\frac{\|x^{(i)} - x^{(j)}\|_2^2}{8\sigma_1^2}\right),$$

where (1) is due to the geometric mean–harmonic mean inequality.
Proposition 2. Let x ∈ R^D ∼ N(0, σ²I), and r = ‖x‖_2. We have

$$p(r) = \frac{r^{D-1}}{2^{D/2-1}\Gamma(D/2)\,\sigma^D}\exp\!\left(-\frac{r^2}{2\sigma^2}\right) \quad \text{and} \quad r - \sqrt{D}\sigma \xrightarrow{d} \mathcal{N}(0, \sigma^2/2) \ \text{when } D \to \infty.$$

Proof. Since x ∼ N(0, σ²I), we have s ≜ ‖x‖_2²/σ² ∼ χ²_D, i.e.,

$$p_s(s) = \frac{1}{2^{D/2}\Gamma(D/2)}\, s^{D/2-1} e^{-s/2}.$$

Because r = ‖x‖_2 = σ√s, we can use the change of variables formula to get

$$p(r) = \frac{2r}{\sigma^2}\, p_s\!\left(\frac{r^2}{\sigma^2}\right) = \frac{r^{D-1}}{2^{D/2-1}\Gamma(D/2)\,\sigma^D}\exp\!\left(-\frac{r^2}{2\sigma^2}\right),$$

which proves our first result. Next, we notice that if x ∼ N(0, σ²), we have x²/σ² ∼ χ²_1 and thus E[x²] = σ², Var[x²] = 2σ⁴. As a result, if x_1, x_2, ···, x_D are i.i.d. N(0, σ²), the law of large numbers and the central limit theorem imply that as D → ∞, both of the following hold:

$$\frac{x_1^2 + x_2^2 + \cdots + x_D^2}{D} \xrightarrow{p} \sigma^2, \qquad \sqrt{D}\left(\frac{x_1^2 + x_2^2 + \cdots + x_D^2}{D} - \sigma^2\right) \xrightarrow{d} \mathcal{N}(0, 2\sigma^4).$$

Equivalently,

$$\sqrt{D}\left(\frac{r^2}{D} - \sigma^2\right) \xrightarrow{d} \mathcal{N}(0, 2\sigma^4).$$

Applying the delta method, we obtain

$$\sqrt{D}\left(\frac{r}{\sqrt{D}} - \sigma\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2/2),$$

and therefore r − √D σ →^d N(0, σ²/2).

Proposition 3. Let γ = σ_{i−1}/σ_i. For α = ε · σ_i²/σ_L² (as in Algorithm 1), we have x_T ∼ N(0, s_T² I), where

$$\frac{s_T^2}{\sigma_i^2} = \left(1 - \frac{\epsilon}{\sigma_L^2}\right)^{2T}\left(\gamma^2 - \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2}\right) + \frac{2\epsilon}{\sigma_L^2 - \sigma_L^2\left(1 - \frac{\epsilon}{\sigma_L^2}\right)^2}. \tag{6}$$

Proof. First, the conditions we know are

$$x_0 \sim p_{\sigma_{i-1}}(x) = \mathcal{N}(0, \sigma_{i-1}^2 I), \qquad x_{t+1} \leftarrow x_t + \alpha\nabla_x \log p_{\sigma_i}(x_t) + \sqrt{2\alpha}\, z_t = x_t - \alpha\frac{x_t}{\sigma_i^2} + \sqrt{2\alpha}\, z_t,$$

where z_t ∼ N(0, I). Therefore, the variance of x_t satisfies

$$\mathrm{Var}[x_t] = \begin{cases} \sigma_{i-1}^2 I & \text{if } t = 0, \\ \left(1 - \frac{\alpha}{\sigma_i^2}\right)^2 \mathrm{Var}[x_{t-1}] + 2\alpha I & \text{otherwise.} \end{cases}$$

Now let $v \triangleq \frac{2\alpha}{1 - \left(1 - \frac{\alpha}{\sigma_i^2}\right)^2}\, I$; we have

$$\mathrm{Var}[x_t] - v = \left(1 - \frac{\alpha}{\sigma_i^2}\right)^2 \left(\mathrm{Var}[x_{t-1}] - v\right).$$

Therefore,

$$\mathrm{Var}[x_T] = \left(1 - \frac{\alpha}{\sigma_i^2}\right)^{2T}\left(\mathrm{Var}[x_0] - v\right) + v \implies s_T^2 = \left(1 - \frac{\alpha}{\sigma_i^2}\right)^{2T}\left(\sigma_{i-1}^2 - \frac{2\alpha}{1 - \left(1 - \frac{\alpha}{\sigma_i^2}\right)^2}\right) + \frac{2\alpha}{1 - \left(1 - \frac{\alpha}{\sigma_i^2}\right)^2}. \tag{7}$$

Substituting ε σ_i²/σ_L² for α in Eq. (7), we immediately obtain Eq. (6).
B Experimental details

B.1 Network architectures and hyperparameters

The original NCSN in [1] uses a network structure based on RefineNet [24]—a classical architecture
for semantic segmentation. There are three major modifications to the original RefineNet in NCSN:
(i) adding an enhanced version of conditional instance normalization (designed in [1] and named
CondInstanceNorm++) for every convolutional layer; (ii) replacing max pooling with average pooling
in RefineNet blocks; and (iii) using dilated convolutions in the ResNet backend of RefineNet. We
use exactly the same architecture for NCSN experiments, but for NCSNv2 or any other architecture
implementing Technique 3, we apply the following modifications: (i) setting the number of classes
in CondInstanceNorm++ to 1 (which we name as InstanceNorm++); (ii) changing average pooling
back to max pooling; and (iii) removing all normalization layers in RefineNet blocks. Here (ii)
and (iii) do not affect the results much, but they are included because we hope to minimize the
number of unnecessary changes to the standard RefineNet architecture (the original RefineNet
blocks in [24] use max pooling and have no normalization layers). We name a ResNet block
(with InstanceNorm++ instead of BatchNorm) “ResBlock”, and a RefineNet block “RefineBlock”.
When CondInstanceNorm++ is added, we name them “CondResBlock” and “CondRefineBlock”
respectively. We use the ELU activation function [25] throughout all architectures.
To ensure sufficient capacity and receptive fields, the network structures for images of different
resolutions have different numbers of layers and filters. We summarize the architectures in Table 2
and Table 3.

Table 2: The architectures of NCSN for images of various resolutions.


(a) NCSN 32²–64²    (b) NCSN 96²–128²

3x3 Conv2D, 128 3x3 Conv2D, 128


CondResBlock, 128 CondResBlock, 128
CondResBlock, 128 CondResBlock, 128
CondResBlock down, 256 CondResBlock down, 256
CondResBlock, 256 CondResBlock, 256
CondResBlock down, 256 CondResBlock down, 256
dilation 2
CondResBlock, 256
CondResBlock, 256
CondResBlock down, 512
dilation 2
dilation 2
CondResBlock down, 256
CondResBlock, 512
dilation 4
dilation 2
CondResBlock, 256
CondResBlock down, 512
dilation 4
dilation 4
CondRefineBlock, 256
CondResBlock, 512
CondRefineBlock, 256 dilation 4
CondRefineBlock, 128 CondRefineBlock, 512
CondRefineBlock, 128 CondRefineBlock, 256
3x3 Conv2D, 3 CondRefineBlock, 256
CondRefineBlock, 128
CondRefineBlock, 128
3x3 Conv2D, 3

Table 3: The architectures of NCSNv2 for images of various resolutions.
(a) NCSNv2 32²–64²    (b) NCSNv2 96²–128²    (c) NCSNv2 256²

3x3 Conv2D, 128 3x3 Conv2D, 128 3x3 Conv2D, 128


ResBlock, 128 ResBlock, 128 ResBlock, 128
ResBlock, 128 ResBlock, 128 ResBlock, 128
ResBlock down, 256 ResBlock down, 256 ResBlock down, 256
ResBlock, 256 ResBlock, 256 ResBlock, 256
ResBlock down, 256 ResBlock down, 256 ResBlock down, 256
dilation 2
ResBlock, 256 ResBlock, 256
ResBlock, 256
ResBlock down, 512 ResBlock down, 256
dilation 2
dilation 2
ResBlock, 256
ResBlock down, 256
ResBlock, 512
dilation 4 ResBlock down, 512
dilation 2
dilation 2
ResBlock, 256
ResBlock down, 512
dilation 4 ResBlock, 512
dilation 4
dilation 2
RefineBlock, 256
ResBlock, 512
ResBlock down, 512
RefineBlock, 256 dilation 4
dilation 4
RefineBlock, 128 RefineBlock, 512
ResBlock, 512
RefineBlock, 128 RefineBlock, 256 dilation 4
3x3 Conv2D, 3 RefineBlock, 256 RefineBlock, 512
RefineBlock, 128 RefineBlock, 256
RefineBlock, 128 RefineBlock, 256
3x3 Conv2D, 3 RefineBlock, 256
RefineBlock, 128
RefineBlock, 128
3x3 Conv2D, 3

We use the Adam optimizer [26] for all models. When Technique 3 is not in effect, we choose a learning rate of 0.001; otherwise we use a learning rate of 0.0001 to avoid loss explosion. We set the ε parameter of Adam to 10⁻³ for FFHQ and 10⁻⁸ otherwise. We provide other hyperparameters in Table 4, where σ_1, L, T, and ε of NCSNv2 are all chosen in accordance with our proposed techniques.
When the number of training data is larger than 60000, we randomly sample 10000 of them and
compute the maximum pairwise distance, which is set as σ1 for NCSNv2.
Table 4: Hyperparameters of NCSN/NCSNv2. The latter is configured according to Techniques 1–4. σ_1 and L determine the set of noise levels. T and ε are parameters of annealed Langevin dynamics.

Model    Dataset                       σ_1   L     T    ε        Batch size   Training iterations
NCSN     CIFAR-10 32²                  1     10    100  2e-5     128          300k
NCSN     CelebA 64²                    1     10    100  2e-5     128          210k
NCSN     LSUN church_outdoor 96²       1     10    100  2e-5     128          200k
NCSN     LSUN bedroom 128²             1     10    100  2e-5     64           150k
NCSNv2   CIFAR-10 32²                  50    232   5    6.2e-6   128          300k
NCSNv2   CelebA 64²                    90    500   5    3.3e-6   128          210k
NCSNv2   LSUN church_outdoor 96²       140   788   4    4.9e-6   128          200k
NCSNv2   LSUN bedroom/tower 128²       190   1086  3    1.8e-6   128          150k
NCSNv2   FFHQ 256²                     348   2311  3    0.9e-7   32           80k

B.2 Additional settings

Datasets: We use the following datasets in our experiments: CIFAR-10 [2], CelebA [16], LSUN [27],
and FFHQ [28]. CIFAR-10 contains 50000 training images and 10000 test images, all of resolution
32 × 32. CelebA contains 162770 training images and 19962 test images with various resolutions.
For preprocessing, we first center crop them to size 140 × 140, and then resize them to 64 × 64.
We choose the church_outdoor, bedroom and tower categories in the LSUN dataset. They contain
126227, 3033042, and 708264 training images respectively, and all have 300 validation images. For
preprocessing, we first resize them so that the smallest dimension of images is 96 (for church_outdoor)
or 128 (for bedroom and tower), and then center crop them to equalize their lengths and heights.
Finally, the FFHQ dataset consists of 70000 high-quality facial images at resolution 1024 × 1024.
We resize them to 256 × 256 in our experiments. Because FFHQ does not have an official test dataset,
we randomly select 63000 images for training and the remaining 7000 as the test dataset. In addition,
we apply random horizontal flip as data augmentation in all cases.
Metrics: We use FID [13] and HYPE∞ [29] scores for quantitative comparison of results. When
computing FIDs on CIFAR-10 32 × 32, we measure the distance between the statistics of samples
and training data. When computing FIDs on CelebA 64 × 64, we follow the settings in [30] where
the distance is measured between 10000 samples and the test dataset. We use the official website
https://hype.stanford.edu for computing HYPE∞ scores. Regarding model selection, we follow the
settings in [1], where we compute FID scores on 1000 samples every 5000 training iterations and
choose the checkpoint with the smallest FID for computing both full FID scores (with more samples
from it) and the HYPE∞ scores.
Training: We use the Adam [26] optimizer with default hyperparameters. The learning rates and
batch sizes are provided in Appendix B.1 and Table 4. We observe that for images at resolution
128 × 128 or 256 × 256, training can be unstable when the loss is near convergence. We note,
however, this is a well-known problem of the Adam optimizer, and can be mitigated by techniques
such as AMSGrad [31]. We trained all models on Nvidia Tesla V100 GPUs.
Settings for Section 3.3: The loss curves in Fig. 3 are results of two settings: (i) Technique 1, 2, 4
and 5 are in effect, but the model architecture is the same as the original NCSN (i.e., Table 2a); and
(ii) all techniques are in effect, i.e., the model is the same as NCSNv2 depicted in Table 3a. We apply
EMA with momentum 0.9 to smooth the curves in Fig. 3. We observe that despite being simpler
to implement, the new noise conditioning method proposed in Technique 3 performs as well as the
original and arguably more complex one in [1] in terms of the training loss. See the ablation studies
in Section 6 and Appendix C.4 for additional results.
Interpolation: We can interpolate between two different samples from NCSN/NCSNv2 via interpo-
lating the Gaussian random noise injected by annealed Langevin dynamics. Specifically, suppose
we have a total of L noise levels, and for each noise level we run T steps of Langevin dynamics.
Let {z_{ij}}_{1≤i≤L,1≤j≤T} ≜ {z_{11}, z_{12}, ···, z_{1T}, z_{21}, z_{22}, ···, z_{2T}, ···, z_{L1}, z_{L2}, ···, z_{LT}} denote the set of all Gaussian noise used in this procedure, where z_{ij} is the noise injected at the j-th iteration of Langevin dynamics corresponding to the i-th noise level. Next, suppose we have two samples x^(1) and x^(2) with the same initialization x_0, and denote the corresponding sets of Gaussian noise as {z_{ij}^(1)}_{1≤i≤L,1≤j≤T} and {z_{ij}^(2)}_{1≤i≤L,1≤j≤T} respectively. We can generate N interpolated samples between x^(1) and x^(2), where for the k-th interpolated sample we use Gaussian noise {cos(kπ/(2(N+1))) z_{ij}^(1) + sin(kπ/(2(N+1))) z_{ij}^(2)}_{1≤i≤L,1≤j≤T} and initialization x_0.
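A minimal sketch of this interpolation scheme is given below; `z1` and `z2` are assumed to stack all L × T noise variables used to generate x^(1) and x^(2) (same shapes and ordering), and re-running annealed Langevin dynamics from the shared x_0 with the returned noise yields the k-th interpolated sample. Since cos² + sin² = 1, the marginal distribution of the interpolated noise remains standard Gaussian.

```python
import math
import torch

def interpolation_noise(z1, z2, k, N):
    """Noise for the k-th of N interpolated samples (Appendix B.2)."""
    t = k * math.pi / (2 * (N + 1))
    return math.cos(t) * z1 + math.sin(t) * z2   # preserves the scale of the injected noise
```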

C Additional experimental results

C.1 Additional results without the denoising step

We further demonstrate the stabilizing effect of EMA in Fig. 8, where FIDs are computed without
the denoising step. As indicated by Figs. 4 and 8, EMA can stabilize training and remove sample
artifacts regardless of whether denoising is used or not.
FID scores should be interpreted with caution because they may not align well with human judgement. For example, the samples from NCSNv2 as demonstrated in Fig. 10b have an FID score of 28.9 (without denoising), worse than NCSN (Fig. 10a) whose FID is 26.9 (without denoising), but arguably produce much more visually appealing samples. To investigate whether FID scores align well with human ratings, we use the HYPE∞ [29] score (higher is better), a metric of sample quality based on human evaluation, to compare the two models that generated samples in Figs. 10a and 10b. We provide full results in Table 5, where all numbers except those for NCSN and NCSNv2 are directly taken from [29]. As Table 5 shows, our NCSNv2 achieves 37.3 on CelebA 64 × 64 which is comparable to ProgressiveGAN [32], whereas NCSN achieves 19.8. This is completely different from the ranking indicated by FIDs.

Figure 8: FIDs and color artifacts over the course of training (best viewed in color). The FIDs of NCSN have much higher volatility compared to NCSN with EMA. Samples from the vanilla NCSN often have obvious color shifts. All FIDs are computed without the denoising step. (a) CIFAR-10 FIDs; (b) CelebA FIDs.

Figure 9: FIDs for different groups of techniques. Subscripts of "NCSN" are IDs of techniques in effect. "NCSNv2" uses all techniques. Results are computed without the denoising step.

Figure 10: Uncurated samples from NCSN (a) and NCSNv2 (b) on CelebA 64 × 64.

Table 5: HYPE∞ scores on CelebA 64 × 64. ∗ With truncation tricks.


Model HYPE∞ (%) Fakes Error(%) Reals Error(%) Std.

StyleGAN [28] 50.7 62.2 39.3 1.3
ProgressiveGAN [32] 40.3 46.2 34.4 0.9
BEGAN [33] 10 6.2 13.8 1.6
WGAN-GP [19] 3.8 1.7 5.9 0.6
NCSN 19.8 22.3 17.3 0.4
NCSNv2 37.3 49.8 24.8 0.5


Finally, we provide ablation results without the denoising step in Fig. 9. It is qualitatively similar to
Fig. 5 where results are computed with denoising.

C.2 Training and sampling speed

In Table 6, we provide the time cost for training and sampling from NCSNv2 models on various
datasets considered in our experiments.

Table 6: Training and sampling speed of NCSNv2 on various datasets.


Dataset Device Sampling time Training time
CIFAR-10 2x V100 2 min 22 h
CelebA 4x V100 7 min 29 h
Church 8x V100 17 min 52 h
Bedroom 8x V100 19 min 52 h
Tower 8x V100 19 min 52 h
FFHQ 8x V100 50 min 41 h

C.3 Color shifts

(a) NCSN (Iter. = 50k) (b) NCSN (Iter. = 100k) (c) NCSN (Iter. = 200k)

(d) NCSN w/ EMA (Iter. = 50k) (e) NCSN w/ EMA (Iter. = 100k) (f) NCSN w/ EMA (Iter. = 200k)
Figure 11: EMA reduces undesirable color shifts on CIFAR-10. We show samples from NCSN and
NCSN with EMA at the 50k/100k/200k-th training iteration.

(a) NCSN (Iter. = 50k) (b) NCSN (Iter. = 100k) (c) NCSN (Iter. = 150k)

(d) NCSN w/ EMA (Iter. = 50k) (e) NCSN w/ EMA (Iter. = 100k) (f) NCSN w/ EMA (Iter. = 150k)
Figure 12: EMA reduces undesirable color shifts on CelebA. We show samples from NCSN and
NCSN with EMA at the 50k/100k/150k-th training iteration.

C.4 Additional results on ablation studies

As discussed in Section 6, we partition all techniques into three groups: (i) Technique 5, (ii)
Technique 1,2,4, and (iii) Technique 3, and investigate the performance of models after successively
removing (iii), (ii), and (i) from NCSNv2. Aside from the FID curves in Figs. 5 and 9, we also provide
samples from different models for visual inspection in Figs. 13 and 14. To generate these samples, we
compute the FID scores on 1000 samples every 5000 training iterations for each considered model,
and sample from the checkpoint of the smallest FID (the same setting as in [1]). From samples in
Figs. 13 and 14, we easily observe that removing any group of techniques leads to worse samples.

(a) NCSN on CIFAR-10 (b) NCSN on CelebA

(c) NCSN5 on CIFAR-10 (d) NCSN5 on CelebA

(e) NCSN1,2,4,5 on CIFAR-10 (f) NCSN1,2,4,5 on CelebA

(g) NCSNv2 on CIFAR-10 (h) NCSNv2 on CelebA


Figure 13: Samples from models with different groups of techniques applied. NCSN is the original
model in [1] and does not use any of the newly proposed techniques. Subscripts of “NCSN” denote
the IDs of techniques in effect. NCSN5 only applies EMA. NCSN1,2,4,5 applies both EMA and
technique group (ii). NCSNv2 is the result of all techniques combined. Checkpoints are selected
according to the lowest FID (with denoising) over the course of training.

(a) NCSN on CIFAR-10 (b) NCSN on CelebA

(c) NCSN5 on CIFAR-10 (d) NCSN5 on CelebA

(e) NCSN1,2,4,5 on CIFAR-10 (f) NCSN1,2,4,5 on CelebA

(g) NCSNv2 on CIFAR-10 (h) NCSNv2 on CelebA


Figure 14: Samples from models with different groups of techniques applied. NCSN is the original
model in [1] and does not use any of the newly proposed techniques. Subscripts of “NCSN” denote
the IDs of techniques in effect. NCSN5 only applies EMA. NCSN1,2,4,5 applies both EMA and
technique group (ii). NCSNv2 is the result of all techniques combined. Checkpoints are selected
according to the lowest FID (without denoising) over the course of training.

C.5 Generalization

C.5.1 Loss curves

First, we demonstrate that our NCSNv2 does not overfit to the training dataset by showing the curves
of training/test loss in Fig. 15. Since the loss on the test dataset is always close to the loss on the
training dataset during the course of training, this indicates that our model does not simply memorize
training data.


Figure 15: Training vs. test loss curves of NCSNv2.

C.5.2 Nearest neighbors

Starting from this section, all samples are from NCSNv2 at the last training iteration. For each
generated sample, we show the nearest neighbors from the training dataset, measured by ℓ2 distance
in the feature space of a pre-trained InceptionV3 network. Since we apply random horizontal flip
when training, we also take this into consideration when computing nearest neighbors, so that we can
detect cases in which NCSNv2 memorizes a flipped training data point.
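A possible implementation of this nearest neighbor search is sketched below using torchvision's InceptionV3; the resizing, the omission of ImageNet normalization, and the batch-free formulation are simplifications, and in practice the feature extraction and distance computation would be batched.

```python
import torch
import torchvision

@torch.no_grad()
def nearest_neighbors(samples, train_images, k=5):
    """Indices of the k nearest training images in InceptionV3 feature space."""
    net = torchvision.models.inception_v3(weights="DEFAULT")
    net.fc = torch.nn.Identity()     # expose the 2048-d pooled features
    net.eval()

    def feats(x):                    # x: (N, 3, H, W) in [0, 1]
        x = torch.nn.functional.interpolate(x, size=(299, 299), mode="bilinear")
        return net(x)

    d = torch.cdist(feats(samples), feats(train_images))   # pairwise L2 distances
    return d.topk(k, largest=False).indices                # shape (num_samples, k)
```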

Figure 16: Nearest neighbors on CIFAR-10. NCSNv2 samples are on the left side of the red vertical
line. Corresponding nearest neighbors are on the right side in the same row.

Figure 17: Nearest neighbors on CelebA 64 × 64.

Figure 18: Nearest neighbors on LSUN church_outdoor 96 × 96.

Figure 19: Nearest neighbors on FFHQ 256 × 256.

C.5.3 Additional interpolation results
We generate samples from NCSNv2 and interpolate between them using the method described in
Appendix B.2.

Figure 20: NCSNv2 interpolation results on CelebA 64 × 64.

Figure 21: NCSNv2 interpolation results on LSUN church_outdoor 96 × 96.

Figure 22: NCSNv2 interpolation results on LSUN bedroom 128 × 128.

Figure 23: NCSNv2 interpolation results on LSUN tower 128 × 128.

Figure 24: NCSNv2 interpolation results on FFHQ 256 × 256.

C.6 Additional uncurated samples

Figure 25: Uncurated CIFAR-10 32 × 32 samples from NCSNv2.

Figure 26: Uncurated CelebA 64 × 64 samples from NCSNv2.

Figure 27: Uncurated LSUN church_outdoor 96 × 96 samples from NCSNv2.

Figure 28: Uncurated LSUN bedroom 128 × 128 samples from NCSNv2.

Figure 29: Uncurated LSUN tower 128 × 128 samples from NCSNv2.

Figure 30: Uncurated FFHQ 256 × 256 samples from NCSNv2.

