
Maximum Likelihood Training of Score-Based Diffusion Models

Yang Song* (Computer Science Department, Stanford University, yangsong@cs.stanford.edu)
Conor Durkan* (School of Informatics, University of Edinburgh, conor.durkan@ed.ac.uk)
Iain Murray (School of Informatics, University of Edinburgh, i.murray@ed.ac.uk)
Stefano Ermon (Computer Science Department, Stanford University, ermon@cs.stanford.edu)

arXiv:2101.09258v4 [stat.ML] 21 Oct 2021

Abstract

Score-based diffusion models synthesize samples by reversing a stochastic process


that diffuses data to noise, and are trained by minimizing a weighted combination
of score matching losses. The log-likelihood of score-based diffusion models can
be tractably computed through a connection to continuous normalizing flows, but
log-likelihood is not directly optimized by the weighted combination of score
matching losses. We show that for a specific weighting scheme, the objective upper
bounds the negative log-likelihood, thus enabling approximate maximum likelihood
training of score-based diffusion models. We empirically observe that maximum
likelihood training consistently improves the likelihood of score-based diffusion
models across multiple datasets, stochastic processes, and model architectures.
Our best models achieve negative log-likelihoods of 2.83 and 3.76 bits/dim on
CIFAR-10 and ImageNet 32×32 without any data augmentation, on a par with
state-of-the-art autoregressive models on these tasks.

1 Introduction
Score-based generative models [44, 45, 48] and diffusion probabilistic models [43, 19] have recently
achieved state-of-the-art sample quality in a number of tasks, including image generation [48, 11],
audio synthesis [5, 27, 37], and shape generation [3]. Both families of models perturb data with a
sequence of noise distributions, and generate samples by learning to reverse this path from noise
to data. Through stochastic calculus, these approaches can be unified into a single framework [48]
which we refer to as score-based diffusion models in this paper.
The framework of score-based diffusion models [48] involves gradually diffusing the data distribution
towards a given noise distribution using a stochastic differential equation (SDE), and learning the
time reversal of this SDE for sample generation. Crucially, the reverse-time SDE has a closed-form
expression which depends solely on a time-dependent gradient field (a.k.a., score) of the perturbed
data distribution. This gradient field can be efficiently estimated by training a neural network (called
a score-based model [44, 45]) with a weighted combination of score matching losses [23, 56, 46] as
the objective. A key advantage of score-based diffusion models is that they can be transformed into
continuous normalizing flows (CNFs) [6, 15], thus allowing tractable likelihood computation with
numerical ODE solvers.
* Equal contribution.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).


Compared to vanilla CNFs, score-based diffusion models are much more efficient to train. This is
because the maximum likelihood objective for training CNFs requires running an expensive ODE
solver for every optimization step, while the weighted combination of score matching losses for
training score-based models does not. However, unlike maximum likelihood training, minimizing a
combination of score matching losses does not necessarily lead to better likelihood values. Since
better likelihoods are useful for applications including compression [21, 20, 51], semi-supervised
learning [10], adversarial purification [47], and comparing against likelihood-based generative models,
we seek a training objective for score-based diffusion models that is as efficient as score matching but
also promotes higher likelihoods.
We show that such an objective can be readily obtained through slight modification of the weighted
combination of score matching losses. Our theory reveals that with a specific choice of weighting,
which we term the likelihood weighting, the combination of score matching losses actually upper
bounds the negative log-likelihood. We further prove that this upper bound becomes tight when our
score-based model corresponds to the true time-dependent gradient field of a certain reverse-time
SDE. Using likelihood weighting increases the variance of our objective, which we counteract by
introducing a variance reduction technique based on importance sampling. Our bound is analogous
to the classic evidence lower bound used for training latent-variable models in the variational
autoencoding framework [26, 39], and can be viewed as a continuous-time generalization of [43].
With our likelihood weighting, we can minimize the weighted combination of score matching losses
for approximate maximum likelihood training of score-based diffusion models. Compared to weight-
ings in previous work [48], we consistently improve likelihood values across multiple datasets, model
architectures, and SDEs, with only slight degradation of Fréchet Inception distances [17]. Moreover,
our upper bound on negative log-likelihood allows training with variational dequantization [18],
with which we reach negative log-likelihoods of 2.83 bits/dim on CIFAR-10 [28] and 3.76 bits/dim
on ImageNet 32×32 [55] with no data augmentation. Our models present the first instances of
normalizing flows which achieve comparable likelihood to cutting-edge autoregressive models.

2 Score-based diffusion models


Score-based diffusion models are deep generative models that smoothly transform data to noise
with a diffusion process, and synthesize samples by learning and simulating the time reversal of this
diffusion. The overall idea is illustrated in Fig. 1.

2.1 Diffusing data to noise with an SDE

Let p(x) denote the unknown distribution of a dataset consisting of D-dimensional i.i.d. samples. Score-based diffusion models [48] employ a stochastic differential equation (SDE) to diffuse p(x) towards a noise distribution. The SDEs are of the form

dx = f(x, t) dt + g(t) dw,    (1)

where f(·, t): R^D → R^D is the drift coefficient, g(t) ∈ R is the diffusion coefficient, and w ∈ R^D denotes a standard Wiener process (a.k.a., Brownian motion). Intuitively, we can interpret dw as infinitesimal Gaussian noise. The solution of an SDE is a diffusion process {x(t)}_{t∈[0,T]}, where [0, T] is a fixed time horizon. We let p_t(x) denote the marginal distribution of x(t), and p_{0t}(x' | x) denote the transition distribution from x(0) to x(t). Note that by definition we always have p_0 = p when using an SDE to perturb the data distribution.
The role of the SDE is to smooth the data distribution by adding noise, gradually removing structure until little of the original signal remains. In the framework of score-based diffusion models, we choose f(x, t), g(t), and T such that the diffusion process {x(t)}_{t∈[0,T]} approaches some analytically tractable prior distribution π(x) at t = T, meaning p_T(x) ≈ π(x). Three families of SDEs suitable for this task are outlined in [48], namely Variance Exploding (VE) SDEs, Variance Preserving (VP) SDEs, and subVP SDEs.
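To make the forward diffusion concrete, the following sketch (a minimal illustration, not code from the paper) perturbs a toy batch with the closed-form Gaussian transition kernel of the VP SDE; the linear β(t) schedule with β_min = 0.1 and β_max = 20 is an assumed example rather than a value prescribed in this section.

import numpy as np

# Minimal sketch: diffusing data with the VP SDE dx = -1/2 beta(t) x dt + sqrt(beta(t)) dw.
# Its transition kernel is Gaussian: p_{0t}(x' | x) = N(x'; x exp(-B(t)/2), (1 - exp(-B(t))) I),
# where B(t) = int_0^t beta(s) ds. The linear beta schedule below is an assumed example.
beta_min, beta_max, T = 0.1, 20.0, 1.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def int_beta(t):
    return beta_min * t + 0.5 * t ** 2 * (beta_max - beta_min)   # B(t)

def perturb(x0, t, rng):
    """Draw x(t) ~ p_{0t}(. | x0) in closed form."""
    mean_coef = np.exp(-0.5 * int_beta(t))
    std = np.sqrt(1.0 - np.exp(-int_beta(t)))
    return mean_coef * x0 + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = 2.0 * rng.standard_normal((4, 3, 32, 32)) + 1.0             # stand-in for a data batch
for t in (0.05, 0.5, 1.0):
    xt = perturb(x0, t, rng)
    print(f"t = {t:4.2f}   mean = {xt.mean():+.3f}   std = {xt.std():.3f}")

As t approaches T, the printed statistics approach those of a standard normal, illustrating p_T(x) ≈ π(x).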

2.2 Generating samples with the reverse SDE

Sample generation in score-based diffusion models relies on time-reversal of the diffusion process. For well-behaved drift and diffusion coefficients, the forward diffusion described in Eq. (1) has an associated reverse-time diffusion process [1, 16] given by the following SDE

dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄,    (2)

where w̄ is now a standard Wiener process in the reverse-time direction. Here dt represents an infinitesimal negative time step, meaning that the above SDE must be solved from t = T to t = 0. This reverse-time SDE results in exactly the same diffusion process {x(t)}_{t∈[0,T]} as Eq. (1), assuming it is initialized with x(T) ~ p_T(x). This result allows for the construction of diffusion-based generative models, and its functional form reveals the key target for learning: the time-dependent score function ∇_x log p_t(x). Again, see Fig. 1 for a helpful visualization of this two-part formulation.

Figure 1: We can use an SDE to diffuse data to a simple noise distribution. This SDE can be reversed once we know the score of the marginal distribution at each intermediate time step, ∇_x log p_t(x). (The figure shows data mapped to noise by the forward SDE dx = f(x, t) dt + g(t) dw, and samples drawn from the prior mapped back to data by the reverse-time SDE, which depends on the score function.)

In order to estimate ∇_x log p_t(x) from a given dataset, we fit the parameters of a neural network s_θ(x, t), termed a score-based model, such that s_θ(x, t) ≈ ∇_x log p_t(x) for almost all x ∈ R^D and t ∈ [0, T]. Unlike many likelihood-based generative models, a score-based model does not need to satisfy the integral constraints of a density function, and is therefore much easier to parameterize.
Good score-based models should keep the following least squares loss small:

J_SM(θ; λ(·)) := (1/2) ∫_0^T E_{p_t(x)}[ λ(t) ||∇_x log p_t(x) − s_θ(x, t)||_2^2 ] dt,    (3)

where λ: [0, T] → R_{>0} is a positive weighting function. The integrand features the well-known score matching [23] objective E_{p_t(x)}[ ||∇_x log p_t(x) − s_θ(x, t)||_2^2 ]. We therefore refer to Eq. (3) as a weighted combination of score matching losses.
With score matching techniques [56, 46], we can compute Eq. (3) up to an additive constant and minimize it for training score-based models. For example, we can use denoising score matching [56] to transform J_SM(θ; λ(·)) into the following, which is equivalent up to a constant independent of θ:

J_DSM(θ; λ(·)) := (1/2) ∫_0^T E_{p(x) p_{0t}(x' | x)}[ λ(t) ||∇_{x'} log p_{0t}(x' | x) − s_θ(x', t)||_2^2 ] dt.    (4)

Whenever the drift coefficient f(·, t) is linear in x (which is true for all SDEs in [48]), the transition density p_{0t}(x' | x) is a tractable Gaussian distribution. We can form a Monte Carlo estimate of both the time integral and expectation in J_DSM(θ; λ(·)) with a sample (t, x, x'), where t is uniformly drawn from [0, T], x ~ p(x) is a sample from the dataset, and x' ~ p_{0t}(x' | x). The gradient ∇_{x'} log p_{0t}(x' | x) can also be computed in closed form since p_{0t}(x' | x) is Gaussian.
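The following sketch (an illustration under assumptions, not the authors' training code) forms one such Monte Carlo estimate of J_DSM for the VP SDE; score_model is only a placeholder for a trained network s_θ(x', t), and both the linear β(t) schedule and the choice of λ(t) are assumed examples.

import numpy as np

# One Monte Carlo estimate of J_DSM in Eq. (4) for the VP SDE, whose Gaussian transition
# kernel gives grad_{x'} log p_{0t}(x' | x) = -(x' - mean) / var in closed form.
beta_min, beta_max, T = 0.1, 20.0, 1.0
def beta(t): return beta_min + t * (beta_max - beta_min)
def int_beta(t): return beta_min * t + 0.5 * t ** 2 * (beta_max - beta_min)

def score_model(x, t):
    return -x                                    # placeholder for s_theta(x', t)

def dsm_estimate(x0, lam, rng, eps=1e-5):
    t = rng.uniform(eps, T)                      # uniform t; small eps avoids t -> 0 instabilities
    mean_coef = np.exp(-0.5 * int_beta(t))
    var = 1.0 - np.exp(-int_beta(t))
    xt = mean_coef * x0 + np.sqrt(var) * rng.standard_normal(x0.shape)
    target = -(xt - mean_coef * x0) / var        # grad_{x'} log p_{0t}(x' | x)
    sq_err = np.sum((score_model(xt, t) - target) ** 2) / x0.shape[0]
    return 0.5 * (T - eps) * lam(t) * sq_err     # single-sample estimate of the time integral

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 64))               # toy data batch
lam = lambda t: 1.0 - np.exp(-int_beta(t))       # one example weighting lambda(t)
print("one-sample J_DSM estimate:", dsm_estimate(x0, lam, rng))

Averaging this quantity over many draws of (t, x, x') recovers Eq. (4) in expectation.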
After training a score-based model s_θ(x, t) with J_DSM(θ; λ(·)), we can plug it into the reverse-time SDE in Eq. (2). Samples are then generated by solving this reverse-time SDE with numerical SDE solvers, given an initial sample from π(x) at t = T. Since the forward SDE Eq. (1) is designed such that p_T(x) ≈ π(x), the reverse-time SDE will closely trace the diffusion process given by Eq. (1) in the reverse time direction, and yield an approximate data sample at t = 0 (as visualized in Fig. 1).
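A minimal sampler in this spirit is sketched below; it is a plain Euler-Maruyama discretization of the reverse-time VP SDE (not the solvers used in the paper), with the exact score of a standard normal standing in for a trained s_θ so that the loop is verifiable end to end.

import numpy as np

# Euler-Maruyama sketch of the reverse-time VP SDE in Eq. (2), integrated from t = T to t = 0.
# The placeholder score -x is exact when the data distribution is N(0, I); a learned s_theta(x, t)
# would be dropped into the same loop. Schedule values are assumed examples.
beta_min, beta_max, T, num_steps = 0.1, 20.0, 1.0, 1000
def beta(t): return beta_min + t * (beta_max - beta_min)

def score(x, t):
    return -x                                    # stand-in for s_theta(x, t)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2))                  # x(T) ~ pi = N(0, I)
dt = T / num_steps
for i in range(num_steps, 0, -1):
    t = i * dt
    drift = -0.5 * beta(t) * x - beta(t) * score(x, t)        # f(x, t) - g(t)^2 s_theta(x, t)
    x = x + drift * (-dt) + np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)
print(f"sample mean = {x.mean():+.3f} (near 0), sample std = {x.std():.3f} (near 1)")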

3 Likelihood of score-based diffusion models


The forward and backward diffusion processes in score-based diffusion models induce two probabilistic models for which we can define a likelihood. The first probabilistic model, denoted as p_θ^SDE(x), is given by the approximate reverse-time SDE constructed from our score-based model s_θ(x, t). In particular, suppose {x̂_θ(t)}_{t∈[0,T]} is a stochastic process given by

dx̂ = [f(x̂, t) − g(t)² s_θ(x̂, t)] dt + g(t) dw̄,    x̂_θ(T) ~ π.    (5)

We define p_θ^SDE as the marginal distribution of x̂_θ(0). The probabilistic model p_θ^SDE is jointly defined by the score-based model s_θ(x, t), the prior π, plus the drift and diffusion coefficients of the forward SDE in Eq. (1). We can obtain a sample x̂_θ(0) ~ p_θ^SDE by numerically solving the reverse-time SDE in Eq. (5) with an initial noise vector x̂_θ(T) ~ π.
The other probabilistic model, denoted p_θ^ODE(x), is derived from the SDE's associated probability flow ODE [32, 48]. Every SDE has a corresponding probability flow ODE whose marginal distribution at each time t matches that of the SDE, so that they share the same p_t(x) for all time. In particular, the ODE corresponding to the SDE in Eq. (1) is given by

dx/dt = f(x, t) − (1/2) g(t)² ∇_x log p_t(x).    (6)

Unlike the SDEs in Eq. (1) and Eq. (2), this ODE describes fully deterministic dynamics for the process. Notably, it still features the same time-dependent score function ∇_x log p_t(x). By approximating this score function with our model s_θ(x, t), the probability flow ODE becomes

dx̃/dt = f(x̃, t) − (1/2) g(t)² s_θ(x̃, t).    (7)

In fact, this ODE is an instance of a continuous normalizing flow (CNF) [15], and we can quantify how the ODE dynamics transform volumes across time in exactly the same way as these traditional flow-based models [6]. Given a prior distribution π(x), and a trajectory function x̃_θ: [0, T] → R^D satisfying the ODE in Eq. (7), we define p_θ^ODE as the marginal distribution of x̃_θ(0) when x̃_θ(T) ~ π. Similarly to p_θ^SDE, the model p_θ^ODE is jointly defined by the score-based model s_θ(x, t), the prior π, and the forward SDE in Eq. (1). Leveraging the instantaneous change-of-variables formula [6], we can evaluate log p_θ^ODE(x) exactly with numerical ODE solvers. Since p_θ^ODE is a CNF, we can generate a sample x̃_θ(0) ~ p_θ^ODE by numerically solving the ODE in Eq. (7) with an initial value x̃_θ(T) ~ π.
Although computing log p_θ^ODE(x) is tractable, training p_θ^ODE with maximum likelihood will require calling an ODE solver for every optimization step [6, 15], which can be prohibitively expensive for large-scale score-based models. Unlike p_θ^ODE, we cannot evaluate log p_θ^SDE(x) exactly for an arbitrary data point x. However, we have a lower bound on log p_θ^SDE(x) which allows both efficient evaluation and optimization, as will be shown in Section 4.2.
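As a sanity check on the instantaneous change-of-variables computation, the toy 1-D sketch below evaluates log p_ODE for Gaussian data N(0, s0²) under the VP SDE, where the exact score −x / v(t) is available in closed form and stands in for s_θ; a practical evaluator would instead use an adaptive ODE solver and an unbiased divergence estimator. All schedule values are assumed examples.

import numpy as np

# Integrate the probability flow ODE (Eq. 6) from t = 0 to t = T with Euler steps while
# accumulating the divergence of its drift, then add log pi(x(T)). For Gaussian data the
# result should match the exact log-density of N(0, s0^2) up to discretization error.
beta_min, beta_max, T, s0 = 0.1, 20.0, 1.0, 2.0
def beta(t): return beta_min + t * (beta_max - beta_min)
def int_beta(t): return beta_min * t + 0.5 * t ** 2 * (beta_max - beta_min)

def v(t):                                        # variance of p_t when p_0 = N(0, s0^2)
    eb = np.exp(-int_beta(t))
    return s0 ** 2 * eb + (1.0 - eb)

def ode_drift(x, t):                             # f(x, t) - 1/2 g(t)^2 * score(x, t)
    return -0.5 * beta(t) * x + 0.5 * beta(t) * x / v(t)

def ode_div(t):                                  # divergence of the drift (exact in 1-D)
    return -0.5 * beta(t) * (1.0 - 1.0 / v(t))

x, logdet, steps = 1.5, 0.0, 20000               # evaluate the model density at x = 1.5
dt = T / steps
for i in range(steps):
    t = (i + 0.5) * dt
    logdet += ode_div(t) * dt
    x += ode_drift(x, t) * dt
log_prior = -0.5 * (x ** 2 + np.log(2 * np.pi))  # log pi(x(T)), pi = N(0, 1)
print("ODE-based estimate of log p(1.5):", log_prior + logdet)
print("exact log N(1.5; 0, s0^2)       :", -0.5 * (1.5 ** 2 / s0 ** 2 + np.log(2 * np.pi * s0 ** 2)))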

4 Bounding the likelihood of score-based diffusion models


Many applications benefit from models which achieve high likelihood. One example is lossless
compression, where log-likelihood directly corresponds to the minimum expected number of bits
needed to encode a message. Popular likelihood-based models such as variational autoencoders
and normalizing flows have already found success in image compression [51, 20, 21]. Despite
some known drawbacks [50], likelihood is still one of the most popular metrics for evaluating and
comparing generative models.
Maximizing the likelihood of score-based diffusion models can be accomplished by either maximizing the likelihood of p_θ^SDE or p_θ^ODE. Although p_θ^ODE is a continuous normalizing flow (CNF) and its log-likelihood is tractable, training with maximum likelihood is expensive. As mentioned already, it requires solving an ODE at every optimization step in order to evaluate the log-likelihood on a batch of training data. In contrast, training with the weighted combination of score matching losses is much more efficient, yet in general it does not directly promote high likelihood of either p_θ^SDE or p_θ^ODE.
In what follows, we show that with a specific choice of the weighting function λ(t), the combination of score matching losses J_SM(θ; λ(·)) actually becomes an upper bound on D_KL(p ∥ p_θ^SDE), and can therefore serve as an efficient proxy for maximum likelihood training. In addition, we provide a related lower bound on log p_θ^SDE(x) that can be evaluated efficiently on any individual datapoint x.

4.1 Bounding the KL divergence with likelihood weighting

It is well-known that maximizing the log-likelihood of a probabilistic model is equivalent to minimizing the KL divergence from the data distribution to the model distribution. We show in the following theorem that for the model p_θ^SDE, this KL divergence can be upper bounded by J_SM(θ; λ(·)) when using the weighting function λ(t) = g(t)², where g(t) is the diffusion coefficient of the SDE in Eq. (1).
Table 1: SDEs and their corresponding weightings for score matching losses.

SDE      Formula                                                           λ(t) in [48]                    Likelihood weighting
VE       dx = √(d[σ²(t)]/dt) dw                                            σ²(t)                           d[σ²(t)]/dt
VP       dx = −(1/2) β(t) x dt + √(β(t)) dw                                1 − e^{−∫₀ᵗ β(s) ds}            β(t)
subVP    dx = −(1/2) β(t) x dt + √(β(t)(1 − e^{−2∫₀ᵗ β(s) ds})) dw         (1 − e^{−∫₀ᵗ β(s) ds})²         β(t)(1 − e^{−2∫₀ᵗ β(s) ds})

Theorem 1. Let p(x) be the data distribution, π(x) be a known prior distribution, and p_θ^SDE be defined as in Section 3. Suppose {x(t)}_{t∈[0,T]} is a stochastic process defined by the SDE in Eq. (1) with x(0) ~ p, where the marginal distribution of x(t) is denoted as p_t. Under some regularity conditions detailed in Appendix A, we have

D_KL(p ∥ p_θ^SDE) ≤ J_SM(θ; g(·)²) + D_KL(p_T ∥ π).    (8)

Sketch of proof. Let µ and ν denote the path measures of the SDEs in Eq. (1) and Eq. (5) respectively. Intuitively, µ is the joint distribution of the diffusion process {x(t)}_{t∈[0,T]} given in Section 2.1, and ν represents the joint distribution of the process {x̂_θ(t)}_{t∈[0,T]} defined in Section 3. Since we can marginalize µ and ν to obtain distributions p and p_θ^SDE, the data processing inequality gives D_KL(p ∥ p_θ^SDE) ≤ D_KL(µ ∥ ν). From the chain rule for the KL divergence, we also have D_KL(µ ∥ ν) = D_KL(p_T ∥ π) + E_{p_T(z)}[D_KL(µ(· | x(T) = z) ∥ ν(· | x̂_θ(T) = z))], where the KL divergence in the final term can be computed by applying the Girsanov theorem [34] to Eq. (5) and the reverse-time SDE of Eq. (1).

When the prior distribution π is fixed, Theorem 1 guarantees that optimizing the weighted combination of score matching losses J_SM(θ; g(·)²) is equivalent to minimizing an upper bound on the KL divergence from the data distribution p to the model distribution p_θ^SDE. Due to the well-known equivalence between minimizing KL divergence and maximizing likelihood, we have the following corollary.
Corollary 1. Consider the same conditions and notations in Theorem 1. When π is a fixed prior distribution that does not depend on θ, we have

−E_{p(x)}[log p_θ^SDE(x)] ≤ J_SM(θ; g(·)²) + C₁ = J_DSM(θ; g(·)²) + C₂,

where C₁ and C₂ are constants independent of θ.


In light of the result in Corollary 1, we henceforth term λptq “ gptq2 the likelihood weighting.
The original weighting functions in [48] are inspired from earlier work such as [44, 45] and [19],
which are motivated by balancing different score matching losses in the combination, and justified
by empirical performance. In contrast, likelihood weighting is motivated from maximizing the
likelihood of a probabilistic model induced by the diffusion process, and derived by theoretical
analysis. There are three types of SDEs considered in [48]: the Variance Exploding (VE) SDE, the
Variance Preserving (VP) SDE, and the subVP SDE. In Table 1, we summarize all these SDEs and
contrast their original weighting functions with our likelihood weighting. For VE SDE, our likelihood
weighting incidentally coincides with the original weighting used in [48], whereas for VP and subVP
SDEs they differ from one another.
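The numerical comparison below (an illustration, not a result from the paper) evaluates the two VP SDE weightings of Table 1 under an assumed linear β(t) schedule, showing how differently they distribute emphasis over time.

import numpy as np

# Compare the original VP weighting lambda(t) = 1 - exp(-int_0^t beta(s) ds) with the
# likelihood weighting g(t)^2 = beta(t). The linear beta schedule is an assumed example.
beta_min, beta_max = 0.1, 20.0
def beta(t): return beta_min + t * (beta_max - beta_min)
def int_beta(t): return beta_min * t + 0.5 * t ** 2 * (beta_max - beta_min)

print("  t     original   likelihood   ratio g^2/alpha^2")
for t in (0.01, 0.1, 0.25, 0.5, 0.75, 1.0):
    original = 1.0 - np.exp(-int_beta(t))
    likelihood = beta(t)
    print(f"{t:5.2f}   {original:8.4f}   {likelihood:10.2f}   {likelihood / original:12.1f}")

The ratio g(t)²/α(t)² in the last column, largest near t = 0, is exactly the unnormalized proposal density used for variance reduction in Section 5.1.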
Theorem 1 leaves two questions unanswered. First, what are the conditions for the bound to be tight (become an equality)? Second, is there any connection between p_θ^SDE and p_θ^ODE under some conditions? We provide both answers in the following theorem.
Theorem 2. Suppose p(x) and q(x) have continuous second-order derivatives and finite second moments. Let {x(t)}_{t∈[0,T]} be the diffusion process defined by the SDE in Eq. (1). We use p_t and q_t to denote the distributions of x(t) when x(0) ~ p and x(0) ~ q, and assume they satisfy the same assumptions in Appendix A. Under the conditions q_T = π and s_θ(x, t) ≡ ∇_x log q_t(x) for all t ∈ [0, T], we have the following equivalence in distributions

p_θ^SDE = p_θ^ODE = q.    (9)

Moreover, we have

D_KL(p ∥ p_θ^SDE) = J_SM(θ; g(·)²) + D_KL(p_T ∥ π).    (10)

Sketch of proof. When s_θ(x, t) matches ∇_x log q_t(x), they both represent the time-dependent score of the same stochastic process, so we immediately have p_θ^SDE = q. According to the theory of probability flow ODEs, we also have p_θ^ODE = q = p_θ^SDE. To prove Eq. (10), we note that D_KL(p ∥ p_θ^SDE) = D_KL(p ∥ q) = D_KL(p_T ∥ q_T) − ∫_0^T (d/dt) D_KL(p_t ∥ q_t) dt = D_KL(p_T ∥ π) − ∫_0^T (d/dt) D_KL(p_t ∥ q_t) dt. We can now complete the proof by simplifying the integrand using the Fokker–Planck equation of p_t and q_t followed by integration by parts.

In practice, the conditions of Theorem 2 are hard to satisfy since our score-based model s_θ(x, t) will not exactly match the score function ∇_x log q_t(x) of some reverse-time diffusion process with the initial distribution q_T = π. In other words, our score model may not be a valid time-dependent score function of a stochastic process with an appropriate initial distribution. Therefore, although score matching with likelihood weighting performs approximate maximum likelihood training for p_θ^SDE, we emphasize that it is not theoretically guaranteed to make the likelihood of p_θ^ODE better. That said, p_θ^ODE will closely match p_θ^SDE if our score-based model well-approximates the true score such that s_θ(x, t) ≈ ∇_x log p_t(x) for all x and t ∈ [0, T]. Moreover, we empirically observe in our experiments (see Table 2) that training with the likelihood weighting is actually able to consistently improve the likelihood of p_θ^ODE across multiple datasets, SDEs, and model architectures.

4.2 Bounding the log-likelihood on individual datapoints

The bound in Theorem 1 is for the entire distributions of p and p_θ^SDE, but we often seek to bound the log-likelihood for an individual data point x. In addition, J_SM(θ; λ(·)) in the bound is not directly computable due to the unknown quantity ∇_x log p_t(x), and can only be evaluated up to an additive constant through J_DSM(θ; λ(·)) (as we already discussed in Section 2.2). Therefore, the bound in Theorem 1 is only suitable for training purposes. To address these issues, we provide the following bounds for individual data points.
Theorem 3. Let p_{0t}(x' | x) denote the transition distribution from p_0(x) to p_t(x) for the SDE in Eq. (1). With the same notations and conditions in Theorem 1, we have

−log p_θ^SDE(x) ≤ L_θ^SM(x) = L_θ^DSM(x),    (11)

where L_θ^SM(x) is defined as

−E_{p_{0T}(x'|x)}[log π(x')] + (1/2) ∫_0^T E_{p_{0t}(x'|x)}[ 2 g(t)² ∇_{x'} · s_θ(x', t) + g(t)² ||s_θ(x', t)||_2^2 − 2 ∇_{x'} · f(x', t) ] dt,

and L_θ^DSM(x) is given by

−E_{p_{0T}(x'|x)}[log π(x')] + (1/2) ∫_0^T E_{p_{0t}(x'|x)}[ g(t)² ||s_θ(x', t) − ∇_{x'} log p_{0t}(x' | x)||_2^2 ] dt
  − (1/2) ∫_0^T E_{p_{0t}(x'|x)}[ g(t)² ||∇_{x'} log p_{0t}(x' | x)||_2^2 + 2 ∇_{x'} · f(x', t) ] dt.

Sketch of proof. For any continuous data distribution p, we have −E_{p(x)}[log p_θ^SDE(x)] = D_KL(p ∥ p_θ^SDE) + H(p), where H(p) denotes the differential entropy of p. The KL term can be bounded according to Theorem 1, while the differential entropy has an identity similar to Theorem 2 (see Theorem 4 in Appendix A). Combining the bound on D_KL(p ∥ p_θ^SDE) and the identity of H(p), we obtain a bound on −E_{p(x)}[log p_θ^SDE(x)] that holds for all continuous distributions p. Removing the expectation over p on both sides then gives us a bound on −log p_θ^SDE(x) for an individual datapoint x. We can simplify this bound to L_θ^SM(x) and L_θ^DSM(x) with similar techniques to [23] and [56].

We provide two equivalent bounds L_θ^SM(x) and L_θ^DSM(x). The former bears resemblance to score matching while the second resembles denoising score matching. Both admit efficient unbiased estimators when f(·, t) is linear, as the time integrals and expectations in L_θ^SM(x) and L_θ^DSM(x) can be estimated by samples of the form (t, x'), where t is uniformly sampled over [0, T], and x' ~ p_{0t}(x' | x). Since the transition distribution p_{0t}(x' | x) is a tractable Gaussian when f(·, t) is linear, we can easily sample from it as well as evaluate ∇_{x'} log p_{0t}(x' | x) for computing L_θ^DSM(x).

Moreover, the divergences ∇_x · s_θ(x, t) and ∇_x · f(x, t) in L_θ^SM(x) and L_θ^DSM(x) have efficient unbiased estimators via the Skilling–Hutchinson trick [42, 22].
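A hedged PyTorch sketch of that estimator is given below; the toy score_fn is a placeholder whose exact divergence is known, so the Monte Carlo estimate can be checked directly.

import torch

# Skilling-Hutchinson trick: div_x s(x, t) = tr(ds/dx) is estimated by E_v[v^T (ds/dx) v]
# with Rademacher probes v, computed through vector-Jacobian products (no full Jacobian).
def score_fn(x, t):
    # toy placeholder: -x / (1 + t) plus a shifted term that contributes zero divergence
    return -x / (1.0 + t) + 0.1 * torch.roll(x, shifts=1, dims=-1)

def hutchinson_divergence(fn, x, t, n_probes=64):
    x = x.detach().requires_grad_(True)
    total = torch.zeros(x.shape[0])
    for _ in range(n_probes):
        v = (torch.rand_like(x) < 0.5).to(x.dtype) * 2.0 - 1.0    # Rademacher noise
        s = fn(x, t)
        (vjp,) = torch.autograd.grad(s, x, grad_outputs=v)        # v^T (ds/dx)
        total = total + (vjp * v).flatten(1).sum(dim=1)
    return total / n_probes

x = torch.randn(4, 10)
print("Hutchinson estimate:", hutchinson_divergence(score_fn, x, t=0.3))
print("exact divergence   :", torch.full((4,), -10.0 / 1.3))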
We can view L_θ^DSM(x) as a continuous-time generalization of the evidence lower bound (ELBO) in diffusion probabilistic models [43, 19]. Our bounds in Theorem 3 are not only useful for optimizing and estimating log p_θ^SDE(x), but also for training the drift and diffusion coefficients f(x, t) and g(t) jointly with the score-based model s_θ(x, t); we leave this avenue of research for future work. In addition, we can plug the bounds in Theorem 3 into any objective that involves minimizing −log p_θ^SDE(x) to obtain an efficient surrogate. Section 5.2 provides an example, where we perform variational dequantization to further improve the likelihood of score-based diffusion models.
Similar to the observation in Section 4.1, L_θ^SM(x) and L_θ^DSM(x) are not guaranteed to upper bound −log p_θ^ODE(x). However, they should become approximate upper bounds when s_θ(x, t) is trained sufficiently close to the ground truth. In fact, we empirically observe that −log p_θ^ODE(x) ≤ L_θ^SM(x) = L_θ^DSM(x) holds true for x sampled from the dataset in all experiments.

4.3 Numerical stability

So far we have assumed that the SDEs are defined on the time horizon [0, T] in all theoretical analysis. In practice, however, we often face numerical instabilities when t → 0. To avoid them, we choose a small non-zero starting time ε > 0, and train/evaluate score-based diffusion models on the time horizon [ε, T] instead of [0, T]. Since ε is small, training score-based diffusion models with likelihood weighting still approximately maximizes their model likelihood. Yet at test time, the likelihood bound as computed in Theorem 3 is slightly biased, rendering the values not directly comparable to results reported in other works. We use Jensen's inequality to correct for this bias in our experiments, for which we provide a detailed explanation in Appendix B.

4.4 Related work

Our result in Theorem 2 can be viewed as a generalization of de Bruijn's identity ([49], Eq. 2.12) from its original differential form to an integral form. De Bruijn's identity relates the rate of change
of the Shannon entropy under an additive Gaussian noise channel to the Fisher information, a result
which can be interpreted geometrically as relating the rate of change of the volume of a distribution’s
typical set to its surface area. Ref. [2] (Lemma 1) builds on this result and presents an integral and
relative form of de Bruijn’s identity which relates the KL divergence to the integral of the relative
Fisher information for a distribution of interest and a reference standard normal. More generally,
various identities and inequalities involving the (relative) Shannon entropy and (relative) Fisher
information have found use in proofs of the central limit theorem [24]. Ref. [31] (Theorem 1) covers
similar ground to the relative form of de Bruijn’s identity, but is perhaps the first to consider its
implications for learning in probabilistic models by framing the discussion in terms of the score
matching objective ([23], Eq. 2).

5 Improving the likelihood of score-based diffusion models


Our theoretical analysis implies that training with the likelihood weighting should improve the likelihood of score-based diffusion models. To verify this empirically, we test likelihood weighting with different model architectures, SDEs, and datasets. We observe that switching to likelihood weighting increases the variance of the training objective and propose to counteract it with importance sampling. We additionally combine our bound with variational dequantization [18], which narrows the gap between the likelihood of continuous and discrete probability models. All combined, we observe consistent improvement of likelihoods for both p_θ^SDE and p_θ^ODE across all settings. We term the model p_θ^ODE trained in this way ScoreFlow, and show that it achieves excellent likelihoods on CIFAR-10 [28] and ImageNet 32×32 [55], on a par with cutting-edge autoregressive models.

5.1 Variance reduction via importance sampling

As mentioned in Section 2.2, we typically use Monte Carlo sampling to approximate the time integral in J_DSM(θ; λ(·)) during training. In particular, we first uniformly sample a time step t ~ U[0, T], and then use the denoising score matching loss at t as an estimate for the whole time integral. This Monte Carlo approximation is much faster than computing the time integral accurately, but introduces additional variance to the training loss.

Figure 2: Learning curves with the likelihood weighting on the CIFAR-10 dataset (smoothed with exponential moving average). Importance sampling significantly reduces the loss variance.
We empirically observe that this Monte Carlo approximation suffers from a larger variance when using our likelihood weighting instead of the original weightings in [48]. Leveraging importance sampling, we propose a new Monte Carlo approximation that significantly reduces the variance of learning curves under likelihood weighting, as demonstrated in Fig. 2. In fact, with importance sampling, the loss variance (after convergence) decreases from 98.48 to 0.068 on CIFAR-10, and decreases from 0.51 to 0.043 on ImageNet.
Let λ(t) = α(t)² denote the weightings in [48] (reproduced in Table 1), and recall that our likelihood weighting is λ(t) = g(t)². Since α(t)² empirically leads to lower variance, we can use a proposal distribution p(t) := g(t)² / (α(t)² Z) to change the weighting in J_DSM(θ; g(·)²) from g(t)² to α(t)² with importance sampling, where Z is a normalizing constant that ensures ∫ p(t) dt = 1. Specifically, for any function h(t), we estimate the time integral ∫_0^T g(t)² h(t) dt with

∫_0^T g(t)² h(t) dt = Z ∫_0^T p(t) α(t)² h(t) dt ≈ Z α(t̃)² h(t̃),    (12)

where t̃ is a sample from p(t). When training score-based models with likelihood weighting, h(t) corresponds to the denoising score matching loss at time t.
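A possible implementation of this estimator for the VP SDE is sketched below (an assumption-laden illustration, not the paper's implementation): the proposal p(t) ∝ g(t)²/α(t)² is sampled on [ε, T] by a grid-based inverse CDF, and a toy h(t) stands in for the denoising score matching loss.

import numpy as np

# Importance-sampled estimate of int g(t)^2 h(t) dt as in Eq. (12) for the VP SDE,
# where g(t)^2 = beta(t) and alpha(t)^2 = 1 - exp(-int_0^t beta(s) ds).
beta_min, beta_max, T, eps = 0.1, 20.0, 1.0, 1e-3
def beta(t): return beta_min + t * (beta_max - beta_min)
def alpha2(t): return 1.0 - np.exp(-(beta_min * t + 0.5 * t ** 2 * (beta_max - beta_min)))

grid = np.linspace(eps, T, 10000)
dt = grid[1] - grid[0]
density = beta(grid) / alpha2(grid)              # unnormalized proposal g^2 / alpha^2
Z = density.sum() * dt                           # normalizing constant Z
cdf = np.cumsum(density) / density.sum()

def sample_t(rng, n):
    """Inverse-CDF samples t~ ~ p(t) on the grid."""
    return np.interp(rng.uniform(size=n), cdf, grid)

rng = np.random.default_rng(0)
h = lambda t: np.sin(t) ** 2 + 1.0               # toy stand-in for the DSM loss at time t
ts = sample_t(rng, 256)
is_estimate = np.mean(Z * alpha2(ts) * h(ts))    # Z * alpha(t~)^2 * h(t~), averaged over draws
quadrature = np.sum(beta(grid) * h(grid)) * dt   # reference value of the time integral on [eps, T]
print("importance-sampled estimate:", is_estimate)
print("grid quadrature            :", quadrature)

The two printed values should roughly agree, up to Monte Carlo error.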
Ref. [33] also observes that optimizing the ELBO for diffusion probabilistic models has large variance,
and proposes to reduce it with importance sampling. They build their proposal distribution based on
historical loss values stored at thousands of discrete time steps. Despite this similarity, our method is
easier to implement without needing to maintain history, can be used for evaluation, and is particularly
suited to the continuous-time setting.

5.2 Variational dequantization

Digital images are discrete data, and must be dequantized when training continuous density models like normalizing flows [12, 13] and score-based diffusion models. One popular approach to this is uniform dequantization [53, 50], where we add small uniform noise over [0, 1) to images taking values in {0, 1, ..., 255}. As shown in [50], training a continuous model p_θ(x) on uniformly dequantized data implicitly maximizes a lower bound on the log-likelihood of a certain discrete model P_θ(x). Due to the gap between p_θ(x) and P_θ(x), comparing the likelihood of continuous density models to models which fit discrete data directly, such as autoregressive models [55] or variational autoencoders, naturally puts the former at a disadvantage.
To minimize the gap between p_θ(x) and P_θ(x), ref. [18] proposes variational dequantization, where a separate normalizing flow model q_φ(u | x) is trained to produce the dequantization noise by optimizing the following objective:

max_φ E_{x~p(x)} E_{u~q_φ(·|x)}[ log p_θ(x + u) − log q_φ(u | x) ].    (13)

Plugging in the lower bound on log p_θ(x) from Theorem 3, we can optimize Eq. (13) to improve the likelihood of score-based diffusion models.

Table 2: Negative log-likelihood (bits/dim) and sample quality (FID scores) on CIFAR-10 and ImageNet 32×32. Abbreviations: "NLL" for "negative log-likelihood"; "Uni. deq." for "uniform dequantization"; "Var. deq." for "variational dequantization"; "LW" for "likelihood weighting"; and "IS" for "importance sampling". Bold indicates best result in the corresponding column. Rows containing "+ LW + IS" are models trained with both likelihood weighting and importance sampling.

                                CIFAR-10                                     ImageNet 32×32
                                Uni. deq.      Var. deq.                     Uni. deq.      Var. deq.
Model                 SDE       NLL↓  Bound↓   NLL↓  Bound↓   FID↓           NLL↓  Bound↓   NLL↓  Bound↓   FID↓
Baseline              VP        3.16  3.28     3.04  3.14     3.98           3.90  3.96     3.84  3.91      8.34
Baseline + LW         VP        3.06  3.18     2.94  3.03     5.18           3.91  3.96     3.86  3.92     17.75
Baseline + LW + IS    VP        2.95  3.08     2.83  2.94     6.03           3.86  3.92     3.80  3.88     11.15
Deep                  VP        3.13  3.25     3.01  3.10     3.09           3.89  3.95     3.84  3.90      8.40
Deep + LW             VP        3.06  3.17     2.93  3.02     7.88           3.91  3.96     3.86  3.92     17.73
Deep + LW + IS        VP        2.93  3.06     2.80  2.92     5.34           3.85  3.92     3.79  3.88     11.20
Baseline              subVP     2.99  3.09     2.88  2.98     3.20           3.87  3.92     3.82  3.88      8.71
Baseline + LW         subVP     2.97  3.07     2.86  2.96     7.33           3.87  3.92     3.82  3.88     12.99
Baseline + LW + IS    subVP     2.94  3.05     2.84  2.94     5.58           3.84  3.91     3.79  3.87     10.57
Deep                  subVP     2.96  3.06     2.85  2.95     2.86           3.86  3.91     3.81  3.87      8.87
Deep + LW             subVP     2.95  3.05     2.85  2.94     6.57           3.88  3.93     3.83  3.88     16.55
Deep + LW + IS        subVP     2.90  3.02     2.81  2.90     5.40           3.82  3.90     3.76  3.86     10.18

5.3 Experiments

We empirically test the performance of likelihood weighting, importance sampling and variational dequantization across multiple architectures of score-based models, SDEs, and datasets. In particular, we consider DDPM++ ("Baseline" in Table 2) and DDPM++ (deep) ("Deep" in Table 2) models with VP and subVP SDEs [48] on the CIFAR-10 [28] and ImageNet 32×32 [55] datasets. We omit experiments on the VE SDE since (i) under this SDE our likelihood weighting is the same as the original weighting in [48]; and (ii) we empirically observe that the best VE SDE model achieves around 3.4 bits/dim on CIFAR-10 in our experiments, which is significantly worse than other SDEs. For each experiment, we report −E[log p_θ^ODE(x)] ("NLL" in Table 2), and the upper bound E[L_θ^DSM(x)] on −E[log p_θ^SDE(x)] ("Bound" in Table 2). In addition, we report FID scores [17] for samples from p_θ^ODE, produced by solving the corresponding ODE with the Dormand–Prince RK45 [14] solver. Unless otherwise noted, we apply horizontal flipping as data augmentation when training models on CIFAR-10, so as to match the settings in [48, 19]. A detailed description of all our experiments can be found in Appendices B and C.
We summarize all results in Table 2. Our key observations are as follows:

1. Although Theorem 3 only guarantees E[L_θ^DSM(x)] ≥ −E[log p_θ^SDE(x)], and in general we have p_θ^SDE ≠ p_θ^ODE, we still find that E[L_θ^DSM(x)] ("Bound" in Table 2) ≥ −E[log p_θ^ODE(x)] ("NLL" in Table 2) in all our settings.
2. When all conditions are fixed except for the weighting in the training objective, having a lower value of the bound for p_θ^SDE always leads to a lower negative log-likelihood for p_θ^ODE.
3. With only likelihood weighting, we can uniformly improve the likelihood of p_θ^ODE and the bound of p_θ^SDE on CIFAR-10 across model architectures and SDEs, but it is not sufficient to guarantee likelihood improvement on ImageNet 32×32.
4. By combining importance sampling and likelihood weighting, we are able to achieve uniformly better likelihoods for p_θ^ODE and bounds for p_θ^SDE across all model architectures, SDEs, and datasets, with only slight degradation of sample quality as measured by FID [17].
5. Variational dequantization uniformly improves both the bound for p_θ^SDE and the negative log-likelihood (NLL) of p_θ^ODE in all settings, regardless of likelihood weighting.

Our experiments confirm that with importance sampling, likelihood weighting is not only effective for maximizing the lower bound on the log-likelihood of p_θ^SDE, but also for improving the log-likelihood of p_θ^ODE. In agreement with [19, 33], we observe that models achieving better likelihood tend to have worse FIDs. However, we emphasize that this degradation of FID is small, and samples actually have no obvious difference in visual quality (see Figs. 3 and 4). To trade likelihood for FID, we can use weighting functions that interpolate between likelihood weighting and the original weighting functions in [48]. Our FID scores are still much better than those of most other likelihood-based models.
We term p_θ^ODE a ScoreFlow when its corresponding score-based model s_θ(x, t) is trained with likelihood weighting, importance sampling, and variational dequantization combined. It can be viewed as a continuous normalizing flow, but is parameterized by a score-based model and trained in a more efficient way. With variational dequantization, we show ScoreFlows obtain competitive negative log-likelihoods (NLLs) of 2.83 bits/dim on CIFAR-10 and 3.76 bits/dim on ImageNet 32×32. Here the ScoreFlow on CIFAR-10 is trained without horizontal flipping (different from the setting in Table 2). As shown in Table 3, our results are on a par with the state-of-the-art autoregressive models on these tasks, and outperform all existing normalizing flow models. The likelihood on CIFAR-10 can be significantly improved by incorporating advanced data augmentation, as demonstrated in [25, 41]. While we do not compare against them, we believe that incorporating the same data augmentation techniques can also improve the likelihood of ScoreFlows.

Table 3: NLLs on CIFAR-10 and ImageNet 32×32.

Model                       CIFAR-10    ImageNet
FFJORD [15]                 3.40        -
Flow++ [18]                 3.08        3.86
Gated PixelCNN [35]         3.03        3.83
VFlow [4]                   2.98        3.83
PixelCNN++ [40]             2.92        -
NVAE [54]                   2.91        3.92
Image Transformer [36]      2.90        3.77
Very Deep VAE [8]           2.87        3.80
PixelSNAIL [7]              2.85        3.80
δ-VAE [38]                  2.83        3.77
Sparse Transformer [9]      2.80        -
ScoreFlow (Ours)            2.83        3.76

6 Conclusion

We propose an efficient training objective for approximate maximum likelihood training of score-
based diffusion models. Our theoretical analysis shows that the weighted combination of score
matching losses upper bounds the negative log-likelihood when using a particular weighting function
which we term the likelihood weighting. By minimizing this upper bound, we consistently improve
the likelihood of score-based diffusion models across multiple model architectures, SDEs, and
datasets. When combined with variational dequantization, we achieve competitive likelihoods on
CIFAR-10 and ImageNet 32×32, matching the performance of best-in-class autoregressive models.
Our upper bound is analogous to the evidence lower bound commonly used for training variational
autoencoders. Aside from promoting higher likelihood, the bound can be combined with other
objectives that depend on the negative log-likelihood, and also enables joint training of the forward
and backward SDEs, which we leave as a future research direction. Our results suggest that score-based diffusion models are competitive alternatives to continuous normalizing flows: they enjoy the same tractable likelihood computation while admitting more efficient maximum likelihood training.

Limitations and broader impact Despite promising experimental results, we would like to emphasize that there is no theoretical guarantee that improving the SDE likelihood will improve the ODE likelihood, and this is explicitly a limitation of our work. Score-based diffusion models also suffer from slow sampling. In our experiments, the ODE solver typically needs around 550 and 450 evaluations of the score-based model for generation and likelihood computation on CIFAR-10 and ImageNet respectively, which is considerably slower than alternative generative models like VAEs and GANs. In addition, the current formulation of score-based diffusion models only supports continuous data, and cannot be naturally adapted to discrete data without resorting to dequantization. Like other deep generative models, score-based diffusion models can potentially be used to generate harmful media content such as "deepfakes", and might reflect and amplify undesirable social biases present in the training dataset.

Author Contributions

Yang Song wrote the code, ran the experiments, proposed and proved Theorems 1 and 3, and wrote
most of the paper. Conor Durkan proposed and proved a first version of Theorem 2, and wrote the
paper. Iain Murray and Stefano Ermon co-advised the project and provided helpful edits to the draft.

Acknowledgments and Disclosure of Funding
The authors would like to thank Sam Power, George Papamakarios, and Adji Dieng for helpful feedback,
and Duoduo for providing her photos in Fig. 1. This research was supported by NSF (#1651565,
#1522054, #1733686), ONR (N000141912145), AFOSR (FA95501910024), ARO (W911NF-21-1-
0125), Sloan Fellowship, and Google TPU Research Cloud. This research was also supported by the
EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical
Sciences Research Council (grant EP/L016427/1), and the University of Edinburgh. Yang Song was
supported by the Apple PhD Fellowship in AI/ML.

References
[1] B. D. Anderson. Reverse-Time Diffusion Equation Models. Stochastic Processes and their
Applications, 12(3):313–326, 1982.
[2] A. R. Barron. Entropy and the Central Limit Theorem. Annals of Probability, 14(1):336–342,
1986.
[3] R. Cai, G. Yang, H. Averbuch-Elor, Z. Hao, S. Belongie, N. Snavely, and B. Hariharan. Learning
Gradient Fields for Shape Generation. In Proceedings of the European Conference on Computer
Vision (ECCV), 2020.
[4] J. Chen, C. Lu, B. Chenli, J. Zhu, and T. Tian. Vflow: More expressive generative flows
with variational data augmentation. In International Conference on Machine Learning, pages
1660–1669. PMLR, 2020.
[5] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan. WaveGrad: Estimating
Gradients for Waveform Generation. arXiv preprint arXiv:2009.00713, 2020.
[6] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural Ordinary Differential
Equations. In Advances in neural information processing systems, pages 6571–6583, 2018.
[7] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel. Pixelsnail: An improved autoregressive
generative model. In International Conference on Machine Learning, pages 864–872. PMLR,
2018.
[8] R. Child. Very deep VAEs generalize autoregressive models and can outperform them on
images. In International Conference on Learning Representations, 2021.
[9] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse
transformers. arXiv preprint arXiv:1904.10509, 2019.
[10] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. Salakhutdinov. Good semi-supervised learning
that requires a bad GAN. In Proceedings of the 31st International Conference on Neural
Information Processing Systems, pages 6513–6523, 2017.
[11] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. arXiv preprint
arXiv:2105.05233, 2021.
[12] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-Linear Independent Components Estimation.
arXiv preprint arXiv:1410.8516, 2014.
[13] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. In 5th Interna-
tional Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conference Track Proceedings. OpenReview.net, 2017.
[14] J. R. Dormand and P. J. Prince. A family of embedded runge-kutta formulae. Journal of
computational and applied mathematics, 6(1):19–26, 1980.
[15] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD:
Free-form Continuous Dynamics for Scalable Reversible Generative Models. In International
Conference on Learning Representations, 2019.
[16] U. G. Haussmann and E. Pardoux. Time reversal of diffusions. The Annals of Probability, pages
1188–1205, 1986.
[17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a
two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, U. von Luxburg,
S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances
in Neural Information Processing Systems 30: Annual Conference on Neural Information
Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6626–6637, 2017.
[18] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative
models with variational dequantization and architecture design. In International Conference on
Machine Learning, pages 2722–2730. PMLR, 2019.
[19] J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. Advances in Neural
Information Processing Systems, 33, 2020.
[20] J. Ho, E. Lohn, and P. Abbeel. Compression with flows via local bits-back coding. In
H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural
Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
Canada, pages 3874–3883, 2019.
[21] E. Hoogeboom, J. W. T. Peters, R. van den Berg, and M. Welling. Integer discrete flows and
lossless compression. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B.
Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual
Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14,
2019, Vancouver, BC, Canada, pages 12134–12144, 2019.
[22] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian
smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–
1076, 1989.
[23] A. Hyvärinen. Estimation of Non-Normalized Statistical Models by Score Matching. Journal
of Machine Learning Research, 6(Apr):695–709, 2005.
[24] O. Johnson and A. Barron. Fisher Information inequalities and the Central Limit Theorem.
Probability Theory and Related Fields, 129(3):391–409, 2004.
[25] H. Jun, R. Child, M. Chen, J. Schulman, A. Ramesh, A. Radford, and I. Sutskever. Distribution
augmentation for generative modeling. In International Conference on Machine Learning,
pages 5006–5019. PMLR, 2020.
[26] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In International Conference
on Learning Representations, 2014.
[27] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A Versatile Diffusion
Model for Audio Synthesis. arXiv preprint arXiv:2009.09761, 2020.
[28] A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 Dataset. online: http://www.cs.toronto.edu/kriz/cifar.html, 55, 2014.
[29] C. Léonard. Some properties of path measures. In Séminaire de Probabilités XLVI, pages
207–230. Springer, 2014.
[30] X. Li, T.-K. L. Wong, R. T. Q. Chen, and D. Duvenaud. Scalable Gradients for Stochastic
Differential Equations. In Proceedings of the 23rd International Conference on Artificial
Intelligence and Statistics, 2020.
[31] S. Lyu. Interpretation and Generalization of Score Matching. In Proceedings of the 25th
Conference on Uncertainty in Artificial Intelligence, pages 359–366, 2009.
[32] D. Maoutsa, S. Reich, and M. Opper. Interacting particle solutions of Fokker–Planck equations
through gradient-log-density estimation. arXiv preprint arXiv:2006.00702, 2020.
[33] A. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. arXiv preprint
arXiv:2102.09672, 2021.
[34] B. Oksendal. Stochastic differential equations: an introduction with applications. Springer
Science & Business Media, 2013.
[35] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu.
Conditional image generation with pixelcnn decoders. In Proceedings of the 30th International
Conference on Neural Information Processing Systems, pages 4797–4805, 2016.
[36] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image
transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR,
2018.
[37] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov. Grad-TTS: A diffusion
probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337, 2021.
[38] A. Razavi, A. van den Oord, B. Poole, and O. Vinyals. Preventing posterior collapse with
delta-vaes. In 7th International Conference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[39] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate
Inference in Deep Generative Models. In Proceedings of the 31st International Conference on
Machine Learning, volume 32, 2014.
[40] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with
discretized logistic mixture likelihood and other modifications. In 5th International Conference
on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings. OpenReview.net, 2017.
[41] S. Sinha and A. B. Dieng. Consistency regularization for variational auto-encoders. arXiv
preprint arXiv:2105.14859, 2021.
[42] J. Skilling. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian
Methods, pages 455–466. Springer, 1989.
[43] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep Unsupervised Learning
Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning,
pages 2256–2265, 2015.
[44] Y. Song and S. Ermon. Generative Modeling by Estimating Gradients of the Data Distribution.
In Advances in Neural Information Processing Systems, pages 11918–11930, 2019.
[45] Y. Song and S. Ermon. Improved Techniques for Training Score-Based Generative Models.
Advances in Neural Information Processing Systems, 33, 2020.
[46] Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced Score Matching: A Scalable Approach to Density
and Score Estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial
Intelligence, page 204, 2019.
[47] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. Pixeldefend: Leveraging generative
models to understand and defend against adversarial examples. In International Conference on
Learning Representations, 2018.
[48] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-Based
Generative Modeling Through Stochastic Differential Equations. In International Conference
on Learning Representations, 2021.
[49] A. Stam. Some Inequalities Satisfied by the Quantities of Information of Fisher and Shannon.
Information and Control, 2(2):101–112, June 1959.
[50] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In
Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations,
ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
[51] J. Townsend, T. Bird, and D. Barber. Practical lossless compression with latent variables using
bits back coding. In International Conference on Learning Representations, 2019.
[52] B. Tzen and M. Raginsky. Neural Stochastic Differential Equations: Deep Latent Gaussian
Models in the Diffusion Limit. arXiv:1905.09883, 2019.
[53] B. Uria, I. Murray, and H. Larochelle. Rnade: the real-valued neural autoregressive density-
estimator. In Proceedings of the 26th International Conference on Neural Information Process-
ing Systems-Volume 2, pages 2175–2183, 2013.
[54] A. Vahdat and J. Kautz. NVAE: A deep hierarchical variational autoencoder. In H. Larochelle,
M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,
NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[55] A. Van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In
International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[56] P. Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural
Computation, 23(7):1661–1674, 2011.
[57] K. Yang, J. Yau, L. Fei-Fei, J. Deng, and O. Russakovsky. A study of face obfuscation in
imagenet. arXiv preprint arXiv:2103.06191, 2021.

Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] The limitations of our theory are
discussed inline in Section 4.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] Discussed
in conclusion.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] The full set of
assumptions are in Appendix A.
(b) Did you include complete proofs of all theoretical results? [Yes] The full proofs are all
provided in Appendix A.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [Yes] Code is released
at https://github.com/yang-song/score_flow.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] All experimental details are in Appendix C.
(c) Did you report error bars (e.g., with respect to the random seed after running exper-
iments multiple times)? [No] Training our models is expensive and we do not have
enough resources or time for multiple repetitions of our experiments.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] Compute and resource types are
given in Appendix C.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We cited the creators
of CIFAR-10 and down-sampled ImageNet.
(b) Did you mention the license of the assets? [No] Licenses are standard and can be found
online.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
Code and checkpoint are released at https://github.com/yang-song/score_flow.
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [No] All datasets used in our work are publicly available.
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [Yes] We mentioned privacy issues of the ImageNet
dataset inline in Appendix C.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]

A Proofs
We first summarize the notations and assumptions used in our theorems.

Notations The drift and diffusion coefficients of the SDE in Eq. (1) are denoted as $f: \mathbb{R}^D \times [0,T] \to \mathbb{R}^D$ and $g: [0,T] \to \mathbb{R}$ respectively, where $[0,T]$ represents a fixed time horizon and $\times$ denotes the Cartesian product. The solution to Eq. (1) is a stochastic process $\{\mathbf{x}(t)\}_{t\in[0,T]}$. We use $p_t$ to represent the marginal distribution of $\mathbf{x}(t)$, and $p_{0t}(\mathbf{x}' \mid \mathbf{x})$ to denote the transition distribution from $\mathbf{x}(0)$ to $\mathbf{x}(t)$. The data distribution and prior distribution are given by $p$ and $\pi$. We use $\mathcal{C}$ to denote all continuous functions, and let $\mathcal{C}^k$ denote the family of functions with continuous $k$-th order derivatives. For any vector-valued function $\mathbf{h}: \mathbb{R}^D \times [0,T] \to \mathbb{R}^D$, we use $\nabla \cdot \mathbf{h}(\mathbf{x}, t)$ to represent its divergence with respect to the first input variable.

Assumptions We make the following assumptions throughout the paper:

(i) $p(\mathbf{x}) \in \mathcal{C}^2$ and $\mathbb{E}_{\mathbf{x}\sim p}[\|\mathbf{x}\|_2^2] < \infty$.
(ii) $\pi(\mathbf{x}) \in \mathcal{C}^2$ and $\mathbb{E}_{\mathbf{x}\sim \pi}[\|\mathbf{x}\|_2^2] < \infty$.
(iii) $\forall t \in [0,T]: f(\cdot, t) \in \mathcal{C}^1$, and $\exists C > 0\ \forall \mathbf{x} \in \mathbb{R}^D, t \in [0,T]: \|f(\mathbf{x}, t)\|_2 \le C(1 + \|\mathbf{x}\|_2)$.
(iv) $\exists C > 0\ \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^D: \|f(\mathbf{x}, t) - f(\mathbf{y}, t)\|_2 \le C\|\mathbf{x} - \mathbf{y}\|_2$.
(v) $g \in \mathcal{C}$ and $\forall t \in [0,T]: |g(t)| > 0$.
(vi) For any open bounded set $\mathcal{O}$, $\int_0^T \int_{\mathcal{O}} \|p_t(\mathbf{x})\|_2^2 + D\, g(t)^2 \|\nabla_{\mathbf{x}} p_t(\mathbf{x})\|_2^2 \,\mathrm{d}\mathbf{x}\,\mathrm{d}t < \infty$.
(vii) $\exists C > 0\ \forall \mathbf{x} \in \mathbb{R}^D, t \in [0,T]: \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2 \le C(1 + \|\mathbf{x}\|_2)$.
(viii) $\exists C > 0\ \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^D: \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \nabla_{\mathbf{y}} \log p_t(\mathbf{y})\|_2 \le C\|\mathbf{x} - \mathbf{y}\|_2$.
(ix) $\exists C > 0\ \forall \mathbf{x} \in \mathbb{R}^D, t \in [0,T]: \|s_\theta(\mathbf{x}, t)\|_2 \le C(1 + \|\mathbf{x}\|_2)$.
(x) $\exists C > 0\ \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^D: \|s_\theta(\mathbf{x}, t) - s_\theta(\mathbf{y}, t)\|_2 \le C\|\mathbf{x} - \mathbf{y}\|_2$.
(xi) Novikov's condition: $\mathbb{E}\big[\exp\big(\tfrac{1}{2}\int_0^T \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 \,\mathrm{d}t\big)\big] < \infty$.
(xii) $\forall t \in [0,T]\ \exists k > 0: p_t(\mathbf{x}) = O\big(e^{-\|\mathbf{x}\|_2^k}\big)$ as $\|\mathbf{x}\|_2 \to \infty$.
Below we provide all proofs for our theorems.
Theorem 1. Let $p(\mathbf{x})$ be the data distribution, $\pi(\mathbf{x})$ be a known prior distribution, and $p_\theta^{\mathrm{SDE}}$ be defined as in Section 3. Suppose $\{\mathbf{x}(t)\}_{t\in[0,T]}$ is a stochastic process defined by the SDE in Eq. (1) with $\mathbf{x}(0) \sim p$, where the marginal distribution of $\mathbf{x}(t)$ is denoted as $p_t$. Under some regularity conditions detailed in Appendix A, we have
\[
D_{\mathrm{KL}}(p \,\|\, p_\theta^{\mathrm{SDE}}) \le \mathcal{J}_{\mathrm{SM}}(\theta; g(\cdot)^2) + D_{\mathrm{KL}}(p_T \,\|\, \pi). \tag{8}
\]

Proof. We denote the path measures of $\{\mathbf{x}(t)\}_{t\in[0,T]}$ and $\{\hat{\mathbf{x}}_\theta(t)\}_{t\in[0,T]}$ as $\mu$ and $\nu$ respectively. Due to assumptions (i) (ii) (iii) (iv) (v) (ix) and (x), both $\mu$ and $\nu$ are uniquely given by the corresponding SDEs. Consider a Markov kernel $K(\{\mathbf{z}(t)\}_{t\in[0,T]}, \mathbf{y}) := \delta(\mathbf{z}(0) = \mathbf{y})$. Since $\mathbf{x}(0) \sim p_0$ and $\hat{\mathbf{x}}_\theta(0) \sim p_\theta$, we have the following result
\[
\begin{aligned}
\int K(\{\mathbf{x}(t)\}_{t\in[0,T]}, \mathbf{x}) \,\mathrm{d}\mu(\{\mathbf{x}(t)\}_{t\in[0,T]}) &= p_0(\mathbf{x}), \\
\int K(\{\hat{\mathbf{x}}_\theta(t)\}_{t\in[0,T]}, \mathbf{x}) \,\mathrm{d}\nu(\{\hat{\mathbf{x}}_\theta(t)\}_{t\in[0,T]}) &= p_\theta(\mathbf{x}).
\end{aligned}
\]
Here the Markov kernel $K$ essentially performs marginalization of path measures to obtain "sliced" distributions at $t = 0$. We can use the data processing inequality with this Markov kernel to obtain
\[
\begin{aligned}
D_{\mathrm{KL}}(p \,\|\, p_\theta) &= D_{\mathrm{KL}}(p_0 \,\|\, p_\theta) \\
&= D_{\mathrm{KL}}\Big( \int K(\{\mathbf{x}(t)\}_{t\in[0,T]}, \mathbf{x}) \,\mathrm{d}\mu \,\Big\|\, \int K(\{\hat{\mathbf{x}}_\theta(t)\}_{t\in[0,T]}, \mathbf{x}) \,\mathrm{d}\nu \Big) \\
&\le D_{\mathrm{KL}}(\mu \,\|\, \nu). \quad (14)
\end{aligned}
\]
Recall that by definition $\mathbf{x}(T) \sim p_T$ and $\hat{\mathbf{x}}_\theta(T) \sim \pi$. Leveraging the chain rule of KL divergences (see, for example, Theorem 2.4 in [29]), we have
\[
D_{\mathrm{KL}}(\mu \,\|\, \nu) = D_{\mathrm{KL}}(p_T \,\|\, \pi) + \mathbb{E}_{\mathbf{z}\sim p_T}\big[ D_{\mathrm{KL}}\big( \mu(\cdot \mid \mathbf{x}(T) = \mathbf{z}) \,\|\, \nu(\cdot \mid \hat{\mathbf{x}}_\theta(T) = \mathbf{z}) \big) \big]. \tag{15}
\]
Under assumptions (i) (iii) (iv) (v) (vi) (vii) (viii), the SDE in Eq. (1) has a corresponding reverse-time SDE given by
\[
\mathrm{d}\mathbf{x} = \big[ f(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}. \tag{16}
\]
Since Eq. (16) is the time reversal of Eq. (1), it induces the same path measure $\mu$. As a result, $D_{\mathrm{KL}}(\mu(\cdot \mid \mathbf{x}(T) = \mathbf{z}) \,\|\, \nu(\cdot \mid \hat{\mathbf{x}}_\theta(T) = \mathbf{z}))$ can be viewed as the KL divergence between the path measures induced by the following two (reverse-time) SDEs:
\[
\begin{aligned}
\mathrm{d}\mathbf{x} &= \big[ f(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}, \quad \mathbf{x}(T) = \mathbf{z}, \\
\mathrm{d}\hat{\mathbf{x}} &= \big[ f(\hat{\mathbf{x}}, t) - g(t)^2 s_\theta(\hat{\mathbf{x}}, t) \big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}, \quad \hat{\mathbf{x}}_\theta(T) = \mathbf{z}.
\end{aligned}
\]
The KL divergence between two SDEs with shared diffusion coefficients and starting points exists under assumptions (vii) (viii) (ix) (x) (xi) (see, e.g., [52, 30]), and can be computed via the Girsanov theorem [34]:
\[
\begin{aligned}
&D_{\mathrm{KL}}\big( \mu(\cdot \mid \mathbf{x}(T) = \mathbf{z}) \,\|\, \nu(\cdot \mid \hat{\mathbf{x}}_\theta(T) = \mathbf{z}) \big) \\
&= -\mathbb{E}_{\mu}\Big[ \log \frac{\mathrm{d}\nu}{\mathrm{d}\mu} \Big] \quad (17) \\
&\overset{(i)}{=} \mathbb{E}_{\mu}\bigg[ \int_0^T g(t)\big( \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t) \big)\,\mathrm{d}\bar{\mathbf{w}}_t + \frac{1}{2}\int_0^T g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 \,\mathrm{d}t \bigg] \\
&\overset{(ii)}{=} \mathbb{E}_{\mu}\bigg[ \frac{1}{2}\int_0^T g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 \,\mathrm{d}t \bigg] \\
&= \frac{1}{2}\int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t \\
&= \mathcal{J}_{\mathrm{SM}}(\theta; g(\cdot)^2), \quad (18)
\end{aligned}
\]
where (i) is due to Girsanov Theorem II [34, Theorem 8.6.6], and (ii) is due to the martingale property of Itô integrals. Combining Eqs. (14), (15) and (18) completes the proof.
Theorem 2. Suppose $p(\mathbf{x})$ and $q(\mathbf{x})$ have continuous second-order derivatives and finite second moments. Let $\{\mathbf{x}(t)\}_{t\in[0,T]}$ be the diffusion process defined by the SDE in Eq. (1). We use $p_t$ and $q_t$ to denote the distributions of $\mathbf{x}(t)$ when $\mathbf{x}(0) \sim p$ and $\mathbf{x}(0) \sim q$, and assume they satisfy the same assumptions in Appendix A. Under the conditions $q_T = \pi$ and $s_\theta(\mathbf{x}, t) \equiv \nabla_{\mathbf{x}} \log q_t(\mathbf{x})$ for all $t \in [0,T]$, we have the following equivalence in distributions
\[
p_\theta^{\mathrm{SDE}} = p_\theta^{\mathrm{ODE}} = q. \tag{9}
\]
Moreover, we have
\[
D_{\mathrm{KL}}(p \,\|\, p_\theta^{\mathrm{SDE}}) = \mathcal{J}_{\mathrm{SM}}(\theta; g(\cdot)^2) + D_{\mathrm{KL}}(p_T \,\|\, \pi). \tag{10}
\]

Proof. When $\pi = q_T$ and $s_\theta(\mathbf{x}, t) \equiv \nabla_{\mathbf{x}} \log q_t(\mathbf{x})$, the reverse-time SDE that defines $p_\theta^{\mathrm{SDE}}$, i.e.,
\[
\mathrm{d}\hat{\mathbf{x}} = \big[ f(\hat{\mathbf{x}}, t) - g(t)^2 s_\theta(\hat{\mathbf{x}}, t) \big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}, \quad \hat{\mathbf{x}}_\theta(T) \sim \pi, \tag{19}
\]
becomes equivalent to
\[
\mathrm{d}\hat{\mathbf{x}} = \big[ f(\hat{\mathbf{x}}, t) - g(t)^2 \nabla_{\hat{\mathbf{x}}} \log q_t(\hat{\mathbf{x}}) \big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}, \quad \hat{\mathbf{x}}_\theta(T) \sim q_T, \tag{20}
\]
which yields the same stochastic process as the following forward-time SDE
\[
\mathrm{d}\hat{\mathbf{x}} = f(\hat{\mathbf{x}}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}, \quad \hat{\mathbf{x}}_\theta(0) \sim q. \tag{21}
\]
Since $\hat{\mathbf{x}}_\theta(0) \sim p_\theta^{\mathrm{SDE}}$ by definition, we immediately have $p_\theta^{\mathrm{SDE}} = q$. Similarly, the ODE that defines $p_\theta^{\mathrm{ODE}}$ is
\[
\frac{\mathrm{d}\tilde{\mathbf{x}}}{\mathrm{d}t} = f(\tilde{\mathbf{x}}, t) - \frac{1}{2} g(t)^2 s_\theta(\tilde{\mathbf{x}}, t), \quad \tilde{\mathbf{x}}_\theta(T) \sim \pi, \tag{22}
\]
which is equivalent to the following when $q_T = \pi$ and $s_\theta(\mathbf{x}, t) \equiv \nabla_{\mathbf{x}} \log q_t(\mathbf{x})$:
\[
\frac{\mathrm{d}\tilde{\mathbf{x}}}{\mathrm{d}t} = f(\tilde{\mathbf{x}}, t) - \frac{1}{2} g(t)^2 \nabla_{\tilde{\mathbf{x}}} \log q_t(\tilde{\mathbf{x}}), \quad \tilde{\mathbf{x}}_\theta(T) \sim q_T. \tag{23}
\]
The theory of probability flow ODEs [48] guarantees that Eq. (21) and Eq. (23) share the same set of marginal distributions, $\{q_t\}_{t\in[0,T]}$, which implies that $\tilde{\mathbf{x}}_\theta(0) \sim q$. Since by definition $\tilde{\mathbf{x}}_\theta(0) \sim p_\theta^{\mathrm{ODE}}$, we have $p_\theta^{\mathrm{ODE}} = q$.
The next part of the theorem can be proved by first rewriting the KL divergence from $p$ to $q$ in an integral form:
\[
\begin{aligned}
D_{\mathrm{KL}}(p(\mathbf{x}) \,\|\, q(\mathbf{x})) &\overset{(i)}{=} D_{\mathrm{KL}}(p_0(\mathbf{x}) \,\|\, q_0(\mathbf{x})) - D_{\mathrm{KL}}(p_T(\mathbf{x}) \,\|\, q_T(\mathbf{x})) + D_{\mathrm{KL}}(p_T(\mathbf{x}) \,\|\, q_T(\mathbf{x})) \\
&\overset{(ii)}{=} \int_T^0 \frac{\partial D_{\mathrm{KL}}(p_t(\mathbf{x}) \,\|\, q_t(\mathbf{x}))}{\partial t}\,\mathrm{d}t + D_{\mathrm{KL}}(p_T(\mathbf{x}) \,\|\, q_T(\mathbf{x})), \quad (24)
\end{aligned}
\]
where (i) holds due to our definition $p_0(\mathbf{x}) \equiv p(\mathbf{x})$ and $q_0(\mathbf{x}) \equiv q(\mathbf{x})$; (ii) is due to the fundamental theorem of calculus.
Next, we show how to rewrite Eq. (24) as a mixture of score matching losses. The Fokker–Planck equation for the SDE in Eq. (1) describes the time-evolution of the stochastic process's associated probability density function, and is given by
\[
\frac{\partial p_t(\mathbf{x})}{\partial t} = \nabla_{\mathbf{x}} \cdot \Big( \frac{1}{2} g^2(t) p_t(\mathbf{x}) \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - f(\mathbf{x}, t) p_t(\mathbf{x}) \Big) = \nabla_{\mathbf{x}} \cdot \big( \mathbf{h}_p(\mathbf{x}, t)\, p_t(\mathbf{x}) \big),
\]
where for simplified notations we define $\mathbf{h}_p(\mathbf{x}, t) := \frac{1}{2} g^2(t) \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - f(\mathbf{x}, t)$. Similarly, $\frac{\partial q_t(\mathbf{x})}{\partial t} = \nabla_{\mathbf{x}} \cdot (\mathbf{h}_q(\mathbf{x}, t)\, q_t(\mathbf{x}))$. Since we assume $\log p_t(\mathbf{x})$ and $\log q_t(\mathbf{x})$ are smooth functions with at most polynomial growth at infinity (assumption (xii)), we have $\lim_{\mathbf{x}\to\infty} \mathbf{h}_p(\mathbf{x}, t) p_t(\mathbf{x}) = 0$ and $\lim_{\mathbf{x}\to\infty} \mathbf{h}_q(\mathbf{x}, t) q_t(\mathbf{x}) = 0$ for all $t$. Then, the time-derivative of $D_{\mathrm{KL}}(p_t \,\|\, q_t)$ can be rewritten in the following way:
\[
\begin{aligned}
\frac{\partial D_{\mathrm{KL}}(p_t(\mathbf{x}) \,\|\, q_t(\mathbf{x}))}{\partial t}
&= \frac{\partial}{\partial t} \int p_t(\mathbf{x}) \log \frac{p_t(\mathbf{x})}{q_t(\mathbf{x})}\,\mathrm{d}\mathbf{x} \\
&= \int \frac{\partial p_t(\mathbf{x})}{\partial t} \log \frac{p_t(\mathbf{x})}{q_t(\mathbf{x})}\,\mathrm{d}\mathbf{x} + \underbrace{\int \frac{\partial p_t(\mathbf{x})}{\partial t}\,\mathrm{d}\mathbf{x}}_{=0} - \int \frac{p_t(\mathbf{x})}{q_t(\mathbf{x})} \frac{\partial q_t(\mathbf{x})}{\partial t}\,\mathrm{d}\mathbf{x} \\
&= \int \nabla_{\mathbf{x}} \cdot \big( \mathbf{h}_p(\mathbf{x}, t) p_t(\mathbf{x}) \big) \log \frac{p_t(\mathbf{x})}{q_t(\mathbf{x})}\,\mathrm{d}\mathbf{x} - \int \nabla_{\mathbf{x}} \cdot \big( \mathbf{h}_q(\mathbf{x}, t) q_t(\mathbf{x}) \big) \frac{p_t(\mathbf{x})}{q_t(\mathbf{x})}\,\mathrm{d}\mathbf{x} \\
&\overset{(i)}{=} -\int p_t(\mathbf{x}) \big[ \mathbf{h}_p^\top(\mathbf{x}, t) - \mathbf{h}_q^\top(\mathbf{x}, t) \big] \big[ \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x}) \big]\,\mathrm{d}\mathbf{x} \\
&= -\frac{1}{2} \int p_t(\mathbf{x})\, g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x})\|_2^2\,\mathrm{d}\mathbf{x},
\end{aligned}
\]
where (i) is due to integration by parts. Combining with Eq. (24), we can conclude that
\[
D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2} \int_0^T \mathbb{E}_{\mathbf{x}\sim p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t + D_{\mathrm{KL}}(p_T \,\|\, q_T). \tag{25}
\]
Since $p_\theta^{\mathrm{SDE}} = q$ and $q_T = \pi$, we also have
\[
\begin{aligned}
D_{\mathrm{KL}}(p \,\|\, p_\theta^{\mathrm{SDE}}) &= \frac{1}{2} \int_0^T \mathbb{E}_{\mathbf{x}\sim p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t + D_{\mathrm{KL}}(p_T \,\|\, q_T) \\
&= \mathcal{J}_{\mathrm{SM}}(\theta; g(\cdot)^2) + D_{\mathrm{KL}}(p_T \,\|\, q_T), \quad (26)
\end{aligned}
\]
which completes the proof.

Using a similar technique to Theorem 2, we can express the entropy of a distribution in terms of a
time-dependent score function, as detailed in the following theorem.

Theorem 4. Let $H(p(\mathbf{x}))$ be the differential entropy of the initial probability density $p(\mathbf{x})$. Under the same conditions in Theorem 2, we have
\[
\begin{aligned}
H(p(\mathbf{x})) &= H(p_T(\mathbf{x})) + \frac{1}{2} \int_0^T \mathbb{E}_{\mathbf{x}\sim p_t(\mathbf{x})}\big[ 2 f(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t \quad (27) \\
&= H(p_T(\mathbf{x})) - \frac{1}{2} \int_0^T \mathbb{E}_{\mathbf{x}\sim p_t(\mathbf{x})}\big[ 2 \nabla \cdot f(\mathbf{x}, t) + g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t. \quad (28)
\end{aligned}
\]

Proof. Once more we proceed analogously to the proofs of Theorem 2. We have
\[
H(p(\mathbf{x})) - H(p_T(\mathbf{x})) = \int_T^0 \frac{\partial}{\partial t} H(p_t(\mathbf{x}))\,\mathrm{d}t. \tag{29}
\]
Expanding the integrand, we have
\[
\begin{aligned}
\frac{\partial}{\partial t} H(p_t(\mathbf{x})) &= -\frac{\partial}{\partial t} \int p_t(\mathbf{x}) \log p_t(\mathbf{x})\,\mathrm{d}\mathbf{x} \\
&= -\int \frac{\partial p_t(\mathbf{x})}{\partial t} \log p_t(\mathbf{x}) + \frac{\partial p_t(\mathbf{x})}{\partial t}\,\mathrm{d}\mathbf{x} \\
&= -\int \frac{\partial p_t(\mathbf{x})}{\partial t} \log p_t(\mathbf{x})\,\mathrm{d}\mathbf{x} - \frac{\partial}{\partial t} \underbrace{\int p_t(\mathbf{x})\,\mathrm{d}\mathbf{x}}_{=1} \\
&= -\int \nabla_{\mathbf{x}} \cdot \big( \mathbf{h}_p(\mathbf{x}, t) p_t(\mathbf{x}) \big) \log p_t(\mathbf{x})\,\mathrm{d}\mathbf{x} \\
&\overset{(i)}{=} \int p_t(\mathbf{x})\, \mathbf{h}_p^\top(\mathbf{x}, t) \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\,\mathrm{d}\mathbf{x} \\
&= \frac{1}{2} \mathbb{E}_{\mathbf{x}\sim p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2^2 - 2 f(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big],
\end{aligned}
\]
where again (i) follows from integration by parts and the limiting behaviour of $\mathbf{h}_p$ given by assumption (xii). Plugging this expression in for the integrand in Eq. (29) then completes the proof for Eq. (27).
For Eq. (28), we can once again perform integration by parts and leverage the limiting behavior of $p_t(\mathbf{x})$ in assumption (xii) to get
\[
\mathbb{E}_{p_t(\mathbf{x})}\big[ f(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big] = \int f(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} p_t(\mathbf{x})\,\mathrm{d}\mathbf{x} = -\int p_t(\mathbf{x})\, \nabla \cdot f(\mathbf{x}, t)\,\mathrm{d}\mathbf{x},
\]
which establishes the equivalence between Eq. (28) and Eq. (27).

Remark The formula in Theorem 4 provides a new way to estimate the entropy of a data distribution from i.i.d. samples. Specifically, given $\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\} \overset{\text{i.i.d.}}{\sim} p(\mathbf{x})$ and an SDE like Eq. (1), we can first apply score matching to train a time-dependent score-based model such that $s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, and then plug $s_\theta(\mathbf{x}, t)$ into Eq. (27) to obtain the following estimator of $H(p(\mathbf{x}))$:
\[
H(p_T(\mathbf{x})) + \frac{1}{2N} \sum_{i=1}^N \int_0^T \big[ 2 f(\mathbf{x}_i, t)^\top s_\theta(\mathbf{x}_i, t) - g(t)^2 \|s_\theta(\mathbf{x}_i, t)\|_2^2 \big]\,\mathrm{d}t,
\]
or plug it into Eq. (28) to obtain the following alternative estimator
\[
H(p_T(\mathbf{x})) - \frac{1}{2N} \sum_{i=1}^N \int_0^T \big[ 2 \nabla \cdot f(\mathbf{x}_i, t) + g(t)^2 \|s_\theta(\mathbf{x}_i, t)\|_2^2 \big]\,\mathrm{d}t.
\]
Both estimators can be computed from a score-based model alone, and do not require training a density model.
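For concreteness, the sketch below instantiates the first estimator for a VP SDE with drift $f(\mathbf{x}, t) = -\frac{1}{2}\beta(t)\mathbf{x}$ and $g(t)^2 = \beta(t)$, whose prior $p_T$ is approximately standard Gaussian. The uniform Monte Carlo quadrature over $t$, and the choice to evaluate the time-$t$ expectation at samples pushed through the Gaussian transition kernel, are implementation assumptions of this sketch rather than prescriptions of the theorem.

```python
import math
import torch

def entropy_estimate(score_model, data, beta_min=0.1, beta_max=20.0,
                     n_time_samples=100, eps=1e-5):
    """Hedged sketch of the entropy estimator suggested by Theorem 4, specialized
    to a VP SDE with f(x, t) = -0.5 * beta(t) * x and g(t)^2 = beta(t). The time
    integral is approximated by uniform Monte Carlo over t, and the integrand is
    evaluated at data perturbed by the transition kernel (an assumption of this
    sketch). The result is in nats."""
    n, d = data.shape[0], data[0].numel()
    # Differential entropy of the (approximately) standard Gaussian prior p_T.
    h_prior = 0.5 * d * (1.0 + math.log(2.0 * math.pi))

    total = 0.0
    for t in torch.rand(n_time_samples, device=data.device) * (1.0 - eps) + eps:
        beta = beta_min + t * (beta_max - beta_min)
        log_alpha = -0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2)
        alpha = torch.exp(log_alpha)
        sigma = torch.sqrt(1.0 - torch.exp(2.0 * log_alpha))
        x_t = alpha * data + sigma * torch.randn_like(data)     # approx. sample from p_t
        s = score_model(x_t, t * torch.ones(n, device=data.device))
        drift = -0.5 * beta * x_t                               # f(x, t) for the VP SDE
        integrand = (2.0 * (drift * s).flatten(1).sum(1)        # 2 f(x, t)^T s_theta(x, t)
                     - beta * s.flatten(1).pow(2).sum(1))       # - g(t)^2 ||s_theta(x, t)||_2^2
        total = total + integrand.mean()
    return h_prior + 0.5 * total / n_time_samples               # H(p_T) + (1/2) * integral estimate
```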

Theorem 5. Let $p_{0t}(\mathbf{x}' \mid \mathbf{x})$ denote the transition kernel from $p_0(\mathbf{x})$ to $p_t(\mathbf{x})$ for any $t \in (0, T]$. With the same conditions and notations in Theorem 1, we have
\[
\begin{aligned}
-\mathbb{E}_{p(\mathbf{x})}[\log p_\theta^{\mathrm{SDE}}(\mathbf{x})]
&\le -\mathbb{E}_{p_T(\mathbf{x})}[\log \pi(\mathbf{x})] + \frac{1}{2} \int_0^T \mathbb{E}_{\mathbf{x}\sim p_t(\mathbf{x})}\big[ 2 g(t)^2 \nabla \cdot s_\theta(\mathbf{x}, t) + g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 - 2 \nabla \cdot f(\mathbf{x}, t) \big]\,\mathrm{d}t \quad (30) \\
&= -\mathbb{E}_{p_T(\mathbf{x})}[\log \pi(\mathbf{x})] + \frac{1}{2} \int_0^T \mathbb{E}_{p_{0t}(\mathbf{x}'\mid\mathbf{x})p(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}', t) - \nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 \\
&\qquad\qquad - g(t)^2 \|\nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 - 2 \nabla \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t. \quad (31)
\end{aligned}
\]

Proof. Since $-\mathbb{E}_{p(\mathbf{x})}[\log p_\theta^{\mathrm{SDE}}(\mathbf{x})] = D_{\mathrm{KL}}(p \,\|\, p_\theta^{\mathrm{SDE}}) + H(p)$, we can combine Theorem 1 and Theorem 4 to obtain
\[
\begin{aligned}
-\mathbb{E}_{p(\mathbf{x})}[\log p_\theta^{\mathrm{SDE}}(\mathbf{x})]
&\le \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t + D_{\mathrm{KL}}(p_T \,\|\, \pi) \\
&\quad + H(p_T(\mathbf{x})) - \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ 2 \nabla \cdot f(\mathbf{x}, t) + g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t \\
&= -\mathbb{E}_{p_T(\mathbf{x})}[\log \pi(\mathbf{x})] \\
&\quad + \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 - g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t \\
&\quad - \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ \nabla \cdot f(\mathbf{x}, t) \big]\,\mathrm{d}t. \quad (32)
\end{aligned}
\]
The second term of Eq. (32) can be simplified via integration by parts
\[
\begin{aligned}
&\frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 - g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 - 2 g(t)^2 s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big]\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t - \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big]\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t - \int_0^T\!\!\int g(t)^2 p_t(\mathbf{x})\, s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t - \int_0^T\!\!\int g(t)^2 s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} p_t(\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \\
&\overset{(i)}{=} \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t + \int_0^T\!\!\int g(t)^2 p_t(\mathbf{x})\, \nabla \cdot s_\theta(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 + 2 g(t)^2 \nabla \cdot s_\theta(\mathbf{x}, t) \big]\,\mathrm{d}t, \quad (33)
\end{aligned}
\]
where (i) is due to integration by parts and the limiting behavior of $p_t(\mathbf{x})$ given by assumption (xii). Combining Eq. (33) and Eq. (32) completes the proof for Eq. (30).
The proof for Eq. (31) parallels that of denoising score matching [56]. Observe that $p_t(\mathbf{x}) = \int p(\mathbf{x}') p_{0t}(\mathbf{x} \mid \mathbf{x}')\,\mathrm{d}\mathbf{x}'$. As a result,
\[
\begin{aligned}
\int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big]\,\mathrm{d}t
&= \int_0^T\!\!\int g(t)^2 s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} p_t(\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \\
&= \int_0^T\!\!\int g(t)^2 s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \int p(\mathbf{x}') p_{0t}(\mathbf{x} \mid \mathbf{x}')\,\mathrm{d}\mathbf{x}'\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \\
&= \int_0^T\!\!\int g(t)^2 s_\theta(\mathbf{x}, t)^\top \int p(\mathbf{x}') \nabla_{\mathbf{x}} p_{0t}(\mathbf{x} \mid \mathbf{x}')\,\mathrm{d}\mathbf{x}'\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \\
&= \int_0^T\!\!\int g(t)^2 s_\theta(\mathbf{x}, t)^\top \int p(\mathbf{x}') p_{0t}(\mathbf{x} \mid \mathbf{x}') \nabla_{\mathbf{x}} \log p_{0t}(\mathbf{x} \mid \mathbf{x}')\,\mathrm{d}\mathbf{x}'\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \\
&= \int_0^T \mathbb{E}_{p(\mathbf{x}) p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 s_\theta(\mathbf{x}', t)^\top \nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x}) \big]\,\mathrm{d}t. \quad (34)
\end{aligned}
\]
Substituting Eq. (34) into the second term of Eq. (32), we have
\[
\begin{aligned}
&\frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 - g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\|_2^2 \big]\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}, t)\|_2^2 - 2 g(t)^2 s_\theta(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big]\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p(\mathbf{x}) p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}', t)\|_2^2 - 2 g(t)^2 s_\theta(\mathbf{x}', t)^\top \nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x}) \big]\,\mathrm{d}t \\
&= \frac{1}{2} \int_0^T \mathbb{E}_{p(\mathbf{x}) p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}', t) - \nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 - g(t)^2 \|\nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 \big]\,\mathrm{d}t. \quad (35)
\end{aligned}
\]
We can now complete the proof for Eq. (31) by combining Eq. (35) and Eq. (32).
Theorem 3. Let $p_{0t}(\mathbf{x}' \mid \mathbf{x})$ denote the transition distribution from $p_0(\mathbf{x})$ to $p_t(\mathbf{x})$ for the SDE in Eq. (1). With the same notations and conditions in Theorem 1, we have
\[
-\log p_\theta^{\mathrm{SDE}}(\mathbf{x}) \le \mathcal{L}_\theta^{\mathrm{SM}}(\mathbf{x}) = \mathcal{L}_\theta^{\mathrm{DSM}}(\mathbf{x}), \tag{11}
\]
where $\mathcal{L}_\theta^{\mathrm{SM}}(\mathbf{x})$ is defined as
\[
-\mathbb{E}_{p_{0T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_0^T \mathbb{E}_{p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ 2 g(t)^2 \nabla_{\mathbf{x}'} \cdot s_\theta(\mathbf{x}', t) + g(t)^2 \|s_\theta(\mathbf{x}', t)\|_2^2 - 2 \nabla_{\mathbf{x}'} \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t,
\]
and $\mathcal{L}_\theta^{\mathrm{DSM}}(\mathbf{x})$ is given by
\[
\begin{aligned}
&-\mathbb{E}_{p_{0T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_0^T \mathbb{E}_{p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}', t) - \nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 \big]\,\mathrm{d}t \\
&\quad - \frac{1}{2} \int_0^T \mathbb{E}_{p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 + 2 \nabla_{\mathbf{x}'} \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t.
\end{aligned}
\]

Proof. The result in Theorem 5 can be re-written as
\[
\begin{aligned}
-\mathbb{E}_{p(\mathbf{x})}[\log p_\theta^{\mathrm{SDE}}(\mathbf{x})]
&\le -\mathbb{E}_{p(\mathbf{x}) p_{0T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_0^T \mathbb{E}_{p(\mathbf{x}) p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ 2 g(t)^2 \nabla \cdot s_\theta(\mathbf{x}', t) + g(t)^2 \|s_\theta(\mathbf{x}', t)\|_2^2 - 2 \nabla \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t \\
&= -\mathbb{E}_{p(\mathbf{x}) p_{0T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_0^T \mathbb{E}_{p(\mathbf{x}) p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}', t) - \nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 \\
&\qquad\qquad - g(t)^2 \|\nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 - 2 \nabla \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t.
\end{aligned}
\]
Given a fixed SDE (and its transition kernel $p_{0t}(\mathbf{x}' \mid \mathbf{x})$), Theorem 5 holds for any data distribution $p$ that satisfies our assumptions. Leveraging proof by contradiction, we can easily see that $\mathbb{E}_{p(\mathbf{x})}$ on both sides of Eqs. (30) and (31) can be cancelled to get
\[
-\log p_\theta^{\mathrm{SDE}}(\mathbf{x}) \le \mathcal{L}_\theta^{\mathrm{SM}}(\mathbf{x}) = \mathcal{L}_\theta^{\mathrm{DSM}}(\mathbf{x}),
\]
which finishes the proof.

B Numerical stability
In our previous theoretical discussion, we always assume that data are perturbed with an SDE starting from $t = 0$. However, in practical implementations, $t = 0$ often leads to numerical instability. As a pragmatic solution, we choose a small non-zero starting time $\epsilon > 0$, and consider the SDE in the time horizon $[\epsilon, T]$. Using the same proof techniques, we can easily see that when the time horizon is $[\epsilon, T]$ instead of $[0, T]$, the original bound in Theorem 1,
\[
D_{\mathrm{KL}}(p \,\|\, p_\theta^{\mathrm{SDE}}) \le \mathcal{J}_{\mathrm{SM}}(\theta; g(\cdot)^2) + D_{\mathrm{KL}}(p_T \,\|\, \pi) = \frac{1}{2} \int_0^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t + D_{\mathrm{KL}}(p_T \,\|\, \pi),
\]
shall be replaced with
\[
D_{\mathrm{KL}}(\tilde{p} \,\|\, \tilde{p}_\theta^{\mathrm{SDE}}) \le \frac{1}{2} \int_\epsilon^T \mathbb{E}_{p_t(\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t)\|_2^2 \big]\,\mathrm{d}t + D_{\mathrm{KL}}(p_T \,\|\, \pi), \tag{36}
\]
where $\tilde{p}(\mathbf{x}) := \int p(\tilde{\mathbf{x}}) p_{0\epsilon}(\mathbf{x} \mid \tilde{\mathbf{x}})\,\mathrm{d}\tilde{\mathbf{x}}$, and $\tilde{p}_\theta^{\mathrm{SDE}}$ denotes the marginal distribution of $\hat{\mathbf{x}}_\theta(\epsilon)$. Here the stochastic process $\{\hat{\mathbf{x}}_\theta(t)\}_{t\in[0,T]}$ is defined according to Eq. (5). When $\epsilon$ is sufficiently small, we always have
\[
D_{\mathrm{KL}}(\tilde{p} \,\|\, \tilde{p}_\theta^{\mathrm{SDE}}) \approx D_{\mathrm{KL}}(p \,\|\, p_\theta^{\mathrm{SDE}}),
\]
so we train with Eq. (36) to approximately maximize the model likelihood for $p_\theta^{\mathrm{SDE}}$. However, at test time, we should report the likelihood bound for $p_\theta^{\mathrm{SDE}}$ for mathematical rigor, not $\tilde{p}_\theta^{\mathrm{SDE}}$. To this end, we first derive an analogous result to Theorem 3 with the time horizon $[\epsilon, T]$, given as below.
Theorem 6. Let $p_{0t}(\mathbf{x}' \mid \mathbf{x})$ denote the transition distribution from $p_0(\mathbf{x})$ to $p_t(\mathbf{x})$ for the SDE in Eq. (1). With the same notations and conditions in Theorem 3, as well as the definitions of $\tilde{p}$ and $\tilde{p}_\theta^{\mathrm{SDE}}$ given above, we have
\[
-\mathbb{E}_{p_{0\epsilon}(\mathbf{x}'\mid\mathbf{x})}[\log \tilde{p}_\theta^{\mathrm{SDE}}(\mathbf{x}')] \le \mathcal{L}_\theta^{\mathrm{SM}}(\mathbf{x}, \epsilon) = \mathcal{L}_\theta^{\mathrm{DSM}}(\mathbf{x}, \epsilon), \tag{37}
\]
where $\mathcal{L}_\theta^{\mathrm{SM}}(\mathbf{x}, \epsilon)$ is defined as
\[
-\mathbb{E}_{p_{0T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_\epsilon^T \mathbb{E}_{p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ 2 g(t)^2 \nabla_{\mathbf{x}'} \cdot s_\theta(\mathbf{x}', t) + g(t)^2 \|s_\theta(\mathbf{x}', t)\|_2^2 - 2 \nabla_{\mathbf{x}'} \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t,
\]
and $\mathcal{L}_\theta^{\mathrm{DSM}}(\mathbf{x}, \epsilon)$ is given by
\[
\begin{aligned}
&-\mathbb{E}_{p_{0T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_\epsilon^T \mathbb{E}_{p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}', t) - \nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 \big]\,\mathrm{d}t \\
&\quad - \frac{1}{2} \int_\epsilon^T \mathbb{E}_{p_{0t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}'} \log p_{0t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 + 2 \nabla_{\mathbf{x}'} \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t.
\end{aligned}
\]

Proof. The proof closely parallels that of Theorem 3, by noting that $\tilde{p}(\mathbf{x}) = \int p(\mathbf{x}') p_{0\epsilon}(\mathbf{x} \mid \mathbf{x}')\,\mathrm{d}\mathbf{x}'$.

Although $\tilde{p}_\theta^{\mathrm{SDE}}$ is a probabilistic model for $\tilde{p}$, we can transform it into a probabilistic model for $p$ by leveraging a denoising distribution $q_\theta(\mathbf{x} \mid \mathbf{x}')$ that approximately converts $\tilde{p}$ to $p$. Suppose $p_{0\epsilon}(\mathbf{x}' \mid \mathbf{x}) = \mathcal{N}(\mathbf{x}' \mid \alpha\mathbf{x}, \beta^2 \mathbf{I})$. Inspired by Tweedie's formula, we choose
\[
q_\theta(\mathbf{x} \mid \mathbf{x}') := \mathcal{N}\Big( \mathbf{x} \,\Big|\, \frac{\mathbf{x}'}{\alpha} + \frac{\beta^2}{\alpha} s_\theta(\mathbf{x}', \epsilon),\ \frac{\beta^2}{\alpha^2} \mathbf{I} \Big),
\]
and define $p_\theta(\mathbf{x}) := \int q_\theta(\mathbf{x} \mid \mathbf{x}')\, \tilde{p}_\theta^{\mathrm{SDE}}(\mathbf{x}')\,\mathrm{d}\mathbf{x}'$, which is a probabilistic model for $p$. With slight abuse of notation, we identify $p_\theta$ with $p_\theta^{\mathrm{SDE}}$ in Table 2. With Jensen's inequality, we have
\[
-\log p_\theta(\mathbf{x}) \le -\mathbb{E}_{p_{0\epsilon}(\mathbf{x}'\mid\mathbf{x})}\bigg[ \log \frac{q_\theta(\mathbf{x} \mid \mathbf{x}')\, \tilde{p}_\theta^{\mathrm{SDE}}(\mathbf{x}')}{p_{0\epsilon}(\mathbf{x}' \mid \mathbf{x})} \bigg].
\]
Combined with Theorem 6, we have
\[
\begin{aligned}
-\log p_\theta(\mathbf{x}) &\le -\mathbb{E}_{p_{0\epsilon}(\mathbf{x}'\mid\mathbf{x})}\big[ \log q_\theta(\mathbf{x} \mid \mathbf{x}') - \log p_{0\epsilon}(\mathbf{x}' \mid \mathbf{x}) \big] + \mathcal{L}_\theta^{\mathrm{SM}}(\mathbf{x}, \epsilon) \quad (38) \\
&= -\mathbb{E}_{p_{0\epsilon}(\mathbf{x}'\mid\mathbf{x})}\big[ \log q_\theta(\mathbf{x} \mid \mathbf{x}') - \log p_{0\epsilon}(\mathbf{x}' \mid \mathbf{x}) \big] + \mathcal{L}_\theta^{\mathrm{DSM}}(\mathbf{x}, \epsilon). \quad (39)
\end{aligned}
\]
The above bound Eq. (39) was applied both to computing the test-time likelihood bounds in Table 2 and to training the flow model used in variational dequantization. Note that it was not used to train the time-dependent score-based model.
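The sketch below shows one way to Monte Carlo estimate the correction term $-\mathbb{E}_{p_{0\epsilon}(\mathbf{x}'\mid\mathbf{x})}[\log q_\theta(\mathbf{x}\mid\mathbf{x}') - \log p_{0\epsilon}(\mathbf{x}'\mid\mathbf{x})]$ appearing in Eq. (39), assuming the Gaussian transition kernel above. The function name and the convention that the caller supplies the scalars $\alpha$, $\beta$, and $\epsilon$ for the chosen SDE are assumptions of this sketch.

```python
import math
import torch

def tweedie_correction(score_model, x, alpha, beta, t_eps, n_samples=1):
    """Monte Carlo sketch of -E_{p_{0eps}(x'|x)}[log q_theta(x|x') - log p_{0eps}(x'|x)]
    from Eq. (39), assuming p_{0eps}(x'|x) = N(x'; alpha * x, beta^2 I) and the
    Tweedie-style q_theta defined above; alpha, beta, t_eps are Python scalars
    supplied by the caller for the SDE at hand. Returns per-example values in nats."""
    d = x[0].numel()
    total = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_samples):
        z = torch.randn_like(x)
        x_prime = alpha * x + beta * z                          # x' ~ p_{0eps}(. | x)
        # log p_{0eps}(x'|x): isotropic Gaussian with mean alpha * x and std beta.
        log_p = -0.5 * (z.flatten(1).pow(2).sum(1) + d * math.log(2.0 * math.pi * beta ** 2))
        # log q_theta(x|x'): mean x'/alpha + (beta^2/alpha) s_theta(x', eps), var beta^2/alpha^2.
        t = t_eps * torch.ones(x.shape[0], device=x.device)
        mean = x_prime / alpha + (beta ** 2 / alpha) * score_model(x_prime, t)
        var = (beta / alpha) ** 2
        log_q = -0.5 * ((x - mean).flatten(1).pow(2).sum(1) / var + d * math.log(2.0 * math.pi * var))
        total = total + (log_p - log_q)
    return total / n_samples
```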
In practice, we choose $\epsilon = 10^{-5}$ for VP SDEs and $\epsilon = 10^{-2}$ for subVP SDEs, except that on ImageNet we use $\epsilon = 5 \times 10^{-5}$ for VP SDE models trained with likelihood weighting and importance sampling. Note that [48] chooses $\epsilon = 10^{-5}$ for all cases. We found that when using likelihood weighting and optionally importance sampling, $\epsilon = 10^{-5}$ for subVP SDEs can cause stiffness for numerical ODE solvers. In contrast, using $\epsilon = 10^{-2}$ for subVP SDEs sidesteps numerical issues without hurting the performance of score-based models trained with the original weightings in [48]. For the bound values in Table 2, we draw 1000 time values uniformly in $[\epsilon, T]$ and use them to estimate $\mathcal{L}_\theta^{\mathrm{DSM}}$ for each datapoint, with the same importance sampling technique as in Eq. (12). We use the correction in Eq. (39) and report upper bounds for $-\log p_\theta(\mathbf{x})$. For computing the likelihood of $p_\theta^{\mathrm{ODE}}$, we use the Dormand–Prince RK45 ODE solver [14] with absolute and relative tolerances set to $10^{-5}$. We do not use the correction in Eq. (39) for $p_\theta^{\mathrm{ODE}}$, because it is still a valid likelihood for the data distribution even in the time horizon $[\epsilon, T]$.
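For reference, the sketch below outlines likelihood computation with the probability flow ODE for a VP SDE. To keep it short it uses a fixed-step Euler integrator and a single Rademacher probe for the Hutchinson divergence estimator [22, 42], rather than the adaptive RK45 solver with $10^{-5}$ tolerances used for the numbers reported in Table 2, so its output should be treated as illustrative only; the function name and defaults are assumptions of the sketch.

```python
import math
import torch

def ode_log_likelihood(score_model, x, beta_min=0.1, beta_max=20.0, eps=1e-5, n_steps=500):
    """Hedged sketch of log p_theta^ODE(x) for a VP SDE via the probability flow ODE,
    using fixed-step Euler integration and a one-probe Hutchinson divergence estimate.
    Returns per-example log-likelihoods in nats (prior p_T taken as N(0, I))."""
    def drift(x_in, t):   # f(x, t) - 0.5 * g(t)^2 * s_theta(x, t) for the VP SDE
        beta = beta_min + t * (beta_max - beta_min)
        t_vec = t * torch.ones(x_in.shape[0], device=x_in.device)
        return -0.5 * beta * x_in - 0.5 * beta * score_model(x_in, t_vec)

    d = x[0].numel()
    x_t = x.clone()
    delta_logp = torch.zeros(x.shape[0], device=x.device)
    dt = (1.0 - eps) / n_steps
    for k in range(n_steps):                      # integrate forward from t = eps to t = 1
        t = eps + k * dt
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            f = drift(x_in, t)
            v = torch.randint(0, 2, x_in.shape, device=x.device).float() * 2 - 1
            # Hutchinson estimate of the divergence: v^T (dF/dx) v.
            div = torch.autograd.grad((f * v).sum(), x_in)[0].mul(v).flatten(1).sum(1)
        x_t = x_t + f.detach() * dt
        delta_logp = delta_logp + div.detach() * dt
    log_prior = -0.5 * (x_t.flatten(1).pow(2).sum(1) + d * math.log(2.0 * math.pi))
    return log_prior + delta_logp
```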
Below is a related result that bounds $\log \tilde{p}_\theta^{\mathrm{SDE}}(\mathbf{x})$ directly. We include it here for completeness, though we do not use it for either training or inference in our experiments.
Theorem 7. Let $p_{0t}(\mathbf{x}' \mid \mathbf{x})$ denote the transition distribution from $p_0(\mathbf{x})$ to $p_t(\mathbf{x})$ for the SDE in Eq. (1), and write $p_{\epsilon t}(\mathbf{x}' \mid \mathbf{x})$ for the corresponding transition distribution from time $\epsilon$ to time $t$. With the same notations and conditions in Theorem 3, as well as the definitions of $\tilde{p}$ and $\tilde{p}_\theta^{\mathrm{SDE}}$ in Theorem 6, we have
\[
-\log \tilde{p}_\theta^{\mathrm{SDE}}(\mathbf{x}) \le \mathcal{L}_{\theta,\epsilon}^{\mathrm{SM}}(\mathbf{x}) = \mathcal{L}_{\theta,\epsilon}^{\mathrm{DSM}}(\mathbf{x}), \tag{40}
\]
where $\mathcal{L}_{\theta,\epsilon}^{\mathrm{SM}}(\mathbf{x})$ is defined as
\[
-\mathbb{E}_{p_{\epsilon T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_\epsilon^T \mathbb{E}_{p_{\epsilon t}(\mathbf{x}'\mid\mathbf{x})}\big[ 2 g(t)^2 \nabla_{\mathbf{x}'} \cdot s_\theta(\mathbf{x}', t) + g(t)^2 \|s_\theta(\mathbf{x}', t)\|_2^2 - 2 \nabla_{\mathbf{x}'} \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t,
\]
and $\mathcal{L}_{\theta,\epsilon}^{\mathrm{DSM}}(\mathbf{x})$ is given by
\[
\begin{aligned}
&-\mathbb{E}_{p_{\epsilon T}(\mathbf{x}'\mid\mathbf{x})}[\log \pi(\mathbf{x}')] + \frac{1}{2} \int_\epsilon^T \mathbb{E}_{p_{\epsilon t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|s_\theta(\mathbf{x}', t) - \nabla_{\mathbf{x}'} \log p_{\epsilon t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 \big]\,\mathrm{d}t \\
&\quad - \frac{1}{2} \int_\epsilon^T \mathbb{E}_{p_{\epsilon t}(\mathbf{x}'\mid\mathbf{x})}\big[ g(t)^2 \|\nabla_{\mathbf{x}'} \log p_{\epsilon t}(\mathbf{x}' \mid \mathbf{x})\|_2^2 + 2 \nabla_{\mathbf{x}'} \cdot f(\mathbf{x}', t) \big]\,\mathrm{d}t.
\end{aligned}
\]

Proof. The proof closely parallels those of Theorems 3 and 6.

C Experimental details
Datasets All our experiments are performed on two image datasets: CIFAR-10 [28] and down-sampled ImageNet [55]. Both contain images of resolution 32 × 32. CIFAR-10 has 50000 images in the training set and 10000 images in the test set. Down-sampled ImageNet has 1281149 training images and 49999 test images. It is well known that ImageNet contains some personally sensitive information and may cause privacy concerns [57]. We minimize this risk by using the dataset at a small resolution (32 × 32).

Model architectures Our variational dequantization model, $q_\phi(\mathbf{u} \mid \mathbf{x})$, follows the same architecture as Flow++ [18]. We do not use dropout for score-based models trained on ImageNet. We did not tune model architectures or training hyper-parameters specifically for maximizing likelihoods. All likelihood values were reported using the last checkpoint of each setting.

Figure 3: Samples on CIFAR-10. (a) DDPM++ (deep, subVP) [48], the model with the best FID (2.86). (b) ScoreFlow trained with likelihood weighting + importance sampling + VP SDE (FID 5.34). Samples of both models are generated with the same random seed.

Training We follow the same training procedure for score-based models in [48]. We also use the
same hyperparameters for training the variational dequantization model, except that we train it for
only 300000 iterations while fixing the score-based model. All models are trained on Cloud TPU
v3-8 (roughly equivalent to 4 Tesla V100 GPUs). The baseline DDPM++ model requires around 33
hours to finish training, while the deep DDPM++ model requires around 44 hours. The variational
dequantization model for the former requires around 7 hours to train, and for the latter it requires
around 9.5 hours.

Confidence intervals All likelihood values are obtained by averaging the results on around 50000
datapoints, sampled with replacement from the test dataset. We can compute the confidence intervals
with Student’s t-test. On CIFAR-10, the radius of 95% confidence intervals is typically around 0.006
bits/dim, while on ImageNet it is around 0.008 bits/dim.
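A minimal sketch of this interval computation is given below; the function name and the assumption that per-datapoint bits/dim values are available as an array are ours.

```python
import numpy as np
from scipy import stats

def bpd_confidence_interval(bpd_values, confidence=0.95):
    """Mean and Student-t confidence-interval radius for per-datapoint bits/dim
    values (assumed to be provided as an array-like of floats)."""
    bpd = np.asarray(bpd_values, dtype=np.float64)
    sem = stats.sem(bpd)                                  # standard error of the mean
    radius = sem * stats.t.ppf(0.5 * (1.0 + confidence), df=bpd.size - 1)
    return bpd.mean(), radius
```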

Sample quality All FID values are computed on 50000 samples from $p_\theta^{\mathrm{ODE}}$, generated with numerical ODE solvers as in [48]. We compute FIDs between samples and training/test data for CIFAR-10/ImageNet. Although likelihood weighting + importance sampling slightly increases FID scores, their samples have comparable visual quality, as demonstrated in Figs. 3 and 4.

Figure 4: Samples on ImageNet 32 × 32. (a) DDPM++ (VP) [48], the model with the best FID (8.34). (b) ScoreFlow trained with likelihood weighting + importance sampling + VP SDE (FID 10.18). Samples of both models are generated with the same random seed.
