
Published as a conference paper at ICLR 2019

Analyzing Inverse Problems with Invertible Neural Networks

Lynton Ardizzone¹, Jakob Kruse¹, Sebastian Wirkert², Daniel Rahner³, Eric W. Pellegrini³,
Ralf S. Klessen³, Lena Maier-Hein², Carsten Rother¹, Ullrich Köthe¹

¹ Visual Learning Lab Heidelberg, ² German Cancer Research Center (DKFZ),
³ Zentrum für Astronomie der Universität Heidelberg (ZAH)

¹ lynton.ardizzone@iwr.uni-heidelberg.de, ² s.wirkert@dkfz-heidelberg.de,
³ daniel.rahner@uni-heidelberg.de

arXiv:1808.04730v3 [cs.LG] 6 Feb 2019

Abstract
For many applications, in particular in natural science, the task is to
determine hidden system parameters from a set of measurements. Often,
the forward process from parameter- to measurement-space is well-defined,
whereas the inverse problem is ambiguous: multiple parameter sets can
result in the same measurement. To fully characterize this ambiguity, the full
posterior parameter distribution, conditioned on an observed measurement,
has to be determined. We argue that a particular class of neural networks
is well suited for this task – so-called Invertible Neural Networks (INNs).
Unlike classical neural networks, which attempt to solve the ambiguous
inverse problem directly, INNs focus on learning the forward process, using
additional latent output variables to capture the information otherwise
lost. Due to invertibility, a model of the corresponding inverse process is
learned implicitly. Given a specific measurement and the distribution of
the latent variables, the inverse pass of the INN provides the full posterior
over parameter space. We prove theoretically and verify experimentally, on
artificial data and real-world problems from medicine and astrophysics, that
INNs are a powerful analysis tool to find multi-modalities in parameter space,
uncover parameter correlations, and identify unrecoverable parameters.

1 Introduction
When analyzing complex physical systems, a common problem is that the system parameters
of interest cannot be measured directly. For many of these systems, scientists have developed
sophisticated theories on how measurable quantities y arise from the hidden parameters x.
We will call such mappings the forward process. However, the inverse process is required to
infer the hidden states of a system from measurements. Unfortunately, the inverse is often
both intractable and ill-posed, since crucial information is lost in the forward process.
To fully assess the diversity of possible inverse solutions for a given measurement, an inverse
solver should be able to estimate the complete posterior of the parameters, conditioned on an
observation. This makes it possible to quantify uncertainty, reveal multi-modal distributions,
and identify degenerate and unrecoverable parameters – all highly relevant for applications
in natural science. In this paper, we ask if invertible neural networks (INNs) are a suitable
model class for this task. INNs are characterized by three properties:
(i) The mapping from inputs to outputs is bijective, i.e. its inverse exists,
(ii) both forward and inverse mapping are efficiently computable, and
(iii) both mappings have a tractable Jacobian, which allows explicit computation of
posterior probabilities.
Networks that are invertible by construction offer a unique opportunity: We can train
them on the well-understood forward process x → y and get the inverse y → x for free by


[Figure 1 diagram. Left: Standard (Bayesian) Neural Network, forward (simulation) x → y, inverse (prediction) y → x with a supervised loss (SL). Right: Invertible Neural Network, forward (simulation) x → y with SL, latent variables z and generated x constrained by unsupervised losses (USL), inverse (sampling) [y, z] → x.]

Figure 1: Abstract comparison of standard approach (left) and ours (right). The
standard direct approach requires a discriminative, supervised loss (SL) term between
predicted and true x, causing problems when y → x is ambiguous. Our network uses a
supervised loss only for the well-defined forward process x → y. Generated x are required
to follow the prior p(x) by an unsupervised loss (USL), while the latent variables z are made
to follow a Gaussian distribution, also by an unsupervised loss. See details in Section 3.3.
running them backwards at prediction time. To counteract the inherent information loss
of the forward process, we introduce additional latent output variables z, which capture
the information about x that is not contained in y. Thus, our INN learns to associate
hidden parameter values x with unique pairs [y, z] of measurements and latent variables.
Forward training optimizes the mapping [y, z] = f (x) and implicitly determines its inverse
x = f −1 (y, z) = g(y, z). Additionally, we make sure that the density p(z) of the latent
variables is shaped as a Gaussian distribution. Thus, the INN represents the desired
posterior p(x | y) by a deterministic function x = g(y, z) that transforms (“pushes”) the
known distribution p(z) to x-space, conditional on y.
Compared to standard approaches (see Fig. 1, left), INNs circumvent a fundamental difficulty
of learning inverse problems: Defining a sensible supervised loss for direct posterior learning is
problematic since it requires prior knowledge about that posterior’s behavior, constituting a
kind of hen-and-egg problem. If the loss does not match the possibly complicated (e.g. multi-
modal) shape of the posterior, learning will converge to incorrect or misleading solutions.
Since the forward process is usually much simpler and better understood, forward training
diminishes this difficulty. Specifically, we make the following contributions:

• We show that the full posterior of an inverse problem can be estimated with invertible
networks, both theoretically in the asymptotic limit of zero loss, and practically on
synthetic and real-world data from astrophysics and medicine.
• The architectural restrictions imposed by invertibility do not seem to have detrimental
effects on our network’s representational power.
• While forward training is sufficient in the asymptotic limit, we find that a combination
with unsupervised backward training improves results on finite training sets.
• In our applications, our approach to learning the posterior compares favourably
to approximate Bayesian computation (ABC) and conditional VAEs. This enables
identifying unrecoverable parameters, parameter correlations and multimodalities.

2 Related work
Modeling the conditional posterior of an inverse process is a classical statistical task that
can in principle be solved by Bayesian methods. Unfortunately, exact Bayesian treatment
of real-world problems is usually intractable. The most common (but expensive) solution
is to resort to sampling, typically by a variant of Markov Chain Monte Carlo (Robert and
Casella, 2004; Gamerman and Lopes, 2006). If a model y = s(x) for the forward process is
available, approximate Bayesian computation (ABC) is often preferred, which embeds the
forward model in a rejection sampling scheme for the posterior p(x|y) (Sunnåker et al., 2013;
Lintusaari et al., 2017; Wilkinson, 2013).
Variational methods offer a more efficient alternative, approximating the posterior by an
optimally chosen member of a tractable distribution family (Blei et al., 2017). Neural
networks can be trained to predict accurate sufficient statistics for parametric posteriors
(Papamakarios and Murray, 2016; Siddharth et al., 2017), or can be designed to learn a
mean-field distribution for the network’s weights via dropout variational inference (Gal and
Ghahramani, 2015; Kingma et al., 2015). Both ideas can be combined (Kendall and Gal, 2017)
to differentiate between data-related and model-related uncertainty. However, the restriction
to limited distribution families fails if the true distribution is too complex (e.g. when it
requires multiple modes to represent ambiguous or degenerate solutions) and essentially
counters the ability of neural networks to act as universal approximators. Conditional GANs
(cGANs; Mirza and Osindero, 2014; Isola et al., 2017) overcome this restriction in principle,
but often lack satisfactory diversity in practice (Zhu et al., 2017b). For our tasks, conditional
variational autoencoders (cVAEs; Sohn et al., 2015) perform better than cGANs, and are
also conceptually closer to our approach (see appendix Sec. 2), and hence serve as a baseline
in our experiments.
Generative modeling via learning of a non-linear transformation between the data distribution
and a simple prior distribution (Deco and Brauer, 1995; Hyvärinen and Pajunen, 1999)
has the potential to solve these problems. Today, this approach is often formulated as a
normalizing flow (Tabak et al., 2010; Tabak and Turner, 2013), which gradually transforms a
normal density into the desired data density and relies on bijectivity to ensure the mapping’s
validity. These ideas were applied to neural networks by Deco and Brauer (1995); Rippel and
Adams (2013); Rezende and Mohamed (2015) and refined by Tomczak and Welling (2016);
Berg et al. (2018); Trippe and Turner (2018). Today, the most common realizations use
auto-regressive flows, where the density is decomposed according to the Bayesian chain rule
(Kingma et al., 2016; Huang et al., 2018; Germain et al., 2015; Papamakarios et al., 2017;
Oord et al., 2016; Kolesnikov and Lampert, 2017; Salimans et al., 2017; Uria et al., 2016).
These networks successfully learned unconditional generative distributions for artificial data
and standard image sets (e.g. MNIST, CelebA, LSUN bedrooms), and some encouraging
results for conditional modeling exist as well (Oord et al., 2016; Salimans et al., 2017;
Papamakarios et al., 2017; Uria et al., 2016).
These normalizing flows possess property (i) of an INN, and are usually designed to fulfill
requirement (iii) as well. In other words, flow-based networks are invertible in principle,
but the actual computation of their inverse is too costly to be practical, i.e. INN property
(ii) is not fulfilled. This precludes the possibility of bi-directional or cyclic training, which
has been shown to be very beneficial in generative adversarial nets and auto-encoders (Zhu
et al., 2017a; Dumoulin et al., 2016; Donahue et al., 2017; Teng et al., 2018). In fact,
optimization for cycle consistency forces such models to converge to invertible architectures,
making fully invertible networks a natural choice. True INNs can be built using coupling
layers, as introduced in the NICE (Dinh et al., 2014) and RealNVP (Dinh et al., 2016)
architectures. Despite their simple design and training, these networks were rarely studied:
Gomez et al. (2017) used a NICE-like design as a memory-efficient alternative to residual
networks, Jacobsen et al. (2018) demonstrated that the lack of information reduction from
input to representation does not cause overfitting, and Schirrmeister et al. (2018) trained such
a network as an adversarial autoencoder. Danihelka et al. (2017) showed that minimization
of an adversarial loss is superior to maximum likelihood training in RealNVPs, whereas
the Flow-GAN of Grover et al. (2017) performs even better using bidirectional training, a
combination of maximum likelihood and adversarial loss. The Glow architecture by Kingma
and Dhariwal (2018) incorporates invertible 1x1 convolutions into RealNVPs to achieve
impressive image manipulations. This line of research inspired us to extend RealNVPs for the
task of computing posteriors in real-world inverse problems from natural and life sciences.

3 Methods

3.1 Problem specification

We consider a common scenario in natural and life sciences: Researchers are interested in a
set of variables x ∈ RD describing some phenomenon of interest, but only variables y ∈ RM
can actually be observed, for which the theory of the respective research field provides a
model y = s(x) for the forward process. Since the transformation from x to y incurs an
information loss, the intrinsic dimension m of y is in general smaller than D, even if the
nominal dimensions satisfy M > D. Hence we want to express the inverse model as a
conditional probability p(x | y), but its mathematical derivation from the forward model is
intractable in the applications we are going to address.
We aim at approximating p(x | y) by a tractable model q(x | y), taking advantage of the
possibility to create an arbitrary amount of training data {(x_i, y_i)}_{i=1}^N from the known
forward model s(x) and a suitable prior p(x). While this would allow for training of a
standard regression model, we want to approximate the full posterior probability. To this
end, we introduce a latent random variable z ∈ RK drawn from a multi-variate standard
normal distribution and reparametrize q(x | y) in terms of a deterministic function g of y
and z, represented by a neural network with parameters θ:
x = g(y, z; θ) with z ∼ p(z) = N (z; 0, IK ). (1)
Note that we distinguish between hidden parameters x representing unobservable real-world
properties and latent variables z carrying information intrinsic to our model. Choosing a
Gaussian prior for z poses no additional limitation, as proven by the theory of non-linear
independent component analysis (Hyvärinen and Pajunen, 1999).
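To make Eq. 1 concrete, the following minimal sketch (not taken from the authors' code; the function g, its signature and the dimensions are placeholders) shows how posterior samples for a fixed measurement y∗ would be drawn once a model of the inverse process is available:

```python
# Minimal sketch of Eq. 1: sampling q(x | y*) with a learned inverse model g.
# `g` is a placeholder for x = g(y, z; theta); name and signature are assumed.
import torch

def sample_posterior(g, y_star, n_samples=4096, dim_z=2):
    """Draw x ~ q(x | y*) by pushing Gaussian latents through g."""
    y_rep = y_star.expand(n_samples, -1)   # repeat the fixed observation y*
    z = torch.randn(n_samples, dim_z)      # z ~ N(0, I_K)
    with torch.no_grad():
        x = g(y_rep, z)                    # x = g(y*, z; theta)
    return x
```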
In contrast to standard methodology, we propose to learn the model g(y, z; θ) of the inverse
process jointly with a model f (x; θ) approximating the known forward process s(x):
[y, z] = f(x; θ) = [fy(x; θ), fz(x; θ)] = g⁻¹(x; θ)   with   fy(x; θ) ≈ s(x).     (2)
Functions f and g share the same parameters θ and are implemented by a single invertible
neural network. Our experiments show that joint bi-directional training of f and g avoids
many complications arising in e.g. cVAEs or Bayesian neural networks, which have to learn
the forward process implicitly.
The relation f = g −1 is enforced by the invertible network architecture, provided that the
nominal and intrinsic dimensions of both sides match. When m ≤ M denotes the intrinsic
dimension of y, the latent variable z must have dimension K = D − m, assuming that the
intrinsic dimension of x equals its nominal dimension D. If the resulting nominal output
dimension M + K exceeds D, we augment the input with a vector x0 ∈ RM +K−D of zeros
and replace x with the concatenation [x, x0 ] everywhere. Combining these definitions, our
network expresses q(x | y) as
q(x = g(y, z; θ) | y) = p(z) · Jx⁻¹,    Jx = det( ∂g(y, z; θ) / ∂[y, z] ) evaluated at [y, fz(x)]     (3)

with Jacobian determinant Jx . When using coupling layers, according to Dinh et al. (2016),
computation of Jx is simple, as each transformation has a triangular Jacobian matrix.

3.2 Invertible architecture

To create a fully invertible neural network, we follow the architecture proposed by Dinh et al.
(2016): The basic unit of this network is a reversible block consisting of two complementary
affine coupling layers. Hereby, the block’s input vector u is split into two halves, u1 and
u2, which are transformed by an affine function with coefficients exp(si) and ti (i ∈ {1, 2}),
using element-wise multiplication (⊙) and addition:
v1 = u1 ⊙ exp(s2(u2)) + t2(u2),    v2 = u2 ⊙ exp(s1(v1)) + t1(v1).     (4)
Given the output v = [v1, v2], these expressions are trivially invertible:
u2 = (v2 − t1(v1)) ⊙ exp(−s1(v1)),    u1 = (v1 − t2(u2)) ⊙ exp(−s2(u2)).     (5)
Importantly, the mappings si and ti can be arbitrarily complicated functions of v1 and
u2 and need not themselves be invertible. In our implementation, they are realized by a
succession of several fully connected layers with leaky ReLU activations.
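The following is a minimal PyTorch sketch of such a reversible block with two affine coupling layers (Eqs. 4 and 5); the subnetwork sizes and the class layout are illustrative choices, not the authors' exact configuration:

```python
# Sketch of one reversible block built from two affine coupling layers (Eqs. 4-5).
import torch
import torch.nn as nn

def subnet(dim_in, dim_out, hidden=128):
    # s_i and t_i need not be invertible: plain fully connected nets with leaky ReLUs.
    return nn.Sequential(nn.Linear(dim_in, hidden), nn.LeakyReLU(),
                         nn.Linear(hidden, hidden), nn.LeakyReLU(),
                         nn.Linear(hidden, dim_out))

class CouplingBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.d1 = dim // 2
        self.d2 = dim - self.d1
        # s1, t1 act on the first half; s2, t2 on the second half.
        self.s1, self.t1 = subnet(self.d1, self.d2), subnet(self.d1, self.d2)
        self.s2, self.t2 = subnet(self.d2, self.d1), subnet(self.d2, self.d1)

    def forward(self, u):                                    # Eq. 4
        u1, u2 = u[:, :self.d1], u[:, self.d1:]
        v1 = u1 * torch.exp(self.s2(u2)) + self.t2(u2)
        v2 = u2 * torch.exp(self.s1(v1)) + self.t1(v1)
        return torch.cat([v1, v2], dim=1)

    def inverse(self, v):                                    # Eq. 5
        v1, v2 = v[:, :self.d1], v[:, self.d1:]
        u2 = (v2 - self.t1(v1)) * torch.exp(-self.s1(v1))
        u1 = (v1 - self.t2(u2)) * torch.exp(-self.s2(u2))
        return torch.cat([u1, u2], dim=1)
```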
A deep invertible network is composed of a sequence of these reversible blocks. To increase
model capacity, we apply a few simple extensions to this basic architecture. Firstly, if the
dimension D is small, but a complex transformation has to be learned, we find it advantageous
to pad both the in- and output of the network with an equal number of zeros. This does not
change the intrinsic dimensions of in- and output, but enables the network’s interior layers
to embed the data into a larger representation space in a more flexible manner. Secondly,
we insert permutation layers between reversible blocks, which shuffle the elements of the
subsequent layer’s input in a randomized, but fixed, way. This causes the splits u = [u1 , u2 ]
to vary between layers and enhances interaction among the individual variables. Kingma
and Dhariwal (2018) use a similar architecture with learned permutations.
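A deep invertible network along these lines can be sketched by stacking such blocks with fixed random permutations in between; this builds on the CouplingBlock sketch above, and the class layout is again only illustrative:

```python
# Sketch: coupling blocks interleaved with randomized but fixed permutations.
import torch
import torch.nn as nn

class INN(nn.Module):
    def __init__(self, dim, n_blocks=6):
        super().__init__()
        self.blocks = nn.ModuleList([CouplingBlock(dim) for _ in range(n_blocks)])
        for i in range(n_blocks):
            perm = torch.randperm(dim)                     # random but fixed shuffle
            self.register_buffer(f"perm_{i}", perm)
            self.register_buffer(f"perm_inv_{i}", torch.argsort(perm))

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = x[:, getattr(self, f"perm_{i}")]           # vary the split between blocks
            x = block(x)
        return x

    def inverse(self, y):
        for i, block in reversed(list(enumerate(self.blocks))):
            y = block.inverse(y)
            y = y[:, getattr(self, f"perm_inv_{i}")]       # undo the shuffle
        return y
```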

3.3 Bi-directional training

Invertible networks offer the opportunity to simultaneously optimize for losses on both the in-
and output domains (Grover et al., 2017), which allows for more effective training. Hereby, we
perform forward and backward iterations in an alternating fashion, accumulating gradients
from both directions before performing a parameter update. For the forward iteration,
we penalize deviations between simulation outcomes yi = s(xi ) and network predictions
fy(xi) with a loss Ly(yi, fy(xi)). Depending on the problem, Ly can be any supervised loss,
e.g. squared loss for regression or cross-entropy for classification.
The loss for latent variables penalizes the mismatch between the joint distribution of network
outputs q(y = fy(x), z = fz(x)) = p(x)/|Jyz| and the product of the marginal distributions of
simulation outcomes p(y = s(x)) = p(x)/|Js| and latents p(z), as Lz(q(y, z), p(y) p(z)).
We block the gradients of Lz with respect to y to ensure the resulting updates only affect the
predictions of z and do not worsen the predictions of y. Thus, Lz enforces two things: firstly,
the generated z must follow the desired normal distribution p(z); secondly, y and z must be
independent upon convergence (i.e. p(z | y) = p(z)), and not encode the same information
twice. As Lz is implemented by Maximum Mean Discrepancy D (Sec. 3.4), which only
requires samples from the distributions to be compared, the Jacobian determinants Jyz and
Js do not have to be known explicitly. In appendix Sec. 1, we prove the following theorem:
Theorem: If an INN f(x) = [y, z] is trained as proposed, and both the supervised loss
Ly = E[(y − fy(x))²] and the unsupervised loss Lz = D(q(y, z), p(y) p(z)) reach zero, sampling
according to Eq. 1 with g = f⁻¹ returns the true posterior p(x | y∗) for any measurement y∗.
Although Ly and Lz are sufficient asymptotically, a small amount of residual dependency
between y and z remains after a finite amount of training. This causes q(x | y) to deviate
from the true posterior p(x | y). To speed up convergence, we also define a loss Lx on the
input side, implemented again by MMD. It matches the distribution of backward predictions
q(x) = p(y = fy(x)) p(z = fz(x)) / |Jx| against the prior data distribution p(x) through
Lx(p(x), q(x)). In the appendix, Sec. 1, we prove that Lx is guaranteed to be zero when the
forward losses Ly and Lz have converged to zero. Thus, incorporating Lx does not alter the
optimum, but improves convergence in practice.
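A schematic training step combining the three losses might look as follows; `inn` and `mmd` refer to the sketches in Sec. 3.2 and Sec. 3.4 (the latter appears below), the loss weights are hyperparameters, and zero padding is omitted for brevity:

```python
# Schematic bi-directional training step (Sec. 3.3); gradients from both
# directions are accumulated before a single parameter update.
import torch

def training_step(inn, optimizer, x, y, dim_y, dim_z, w_y=1.0, w_z=1.0, w_x=1.0):
    optimizer.zero_grad()

    # Forward direction: predict [y, z] from x.
    out = inn(x)
    y_pred, z_pred = out[:, :dim_y], out[:, dim_y:]
    loss_y = w_y * torch.mean((y - y_pred) ** 2)                 # supervised L_y
    # L_z: match q(y, z) to p(y) p(z); gradients w.r.t. y are blocked via detach().
    yz_pred = torch.cat([y_pred.detach(), z_pred], dim=1)
    yz_target = torch.cat([y, torch.randn_like(z_pred)], dim=1)  # independent y and z
    loss_z = w_z * mmd(yz_pred, yz_target)
    (loss_y + loss_z).backward()

    # Backward direction: L_x matches generated x against prior samples.
    yz_sample = torch.cat([y, torch.randn(x.shape[0], dim_z)], dim=1)
    x_gen = inn.inverse(yz_sample)
    loss_x = w_x * mmd(x_gen, x)
    loss_x.backward()

    optimizer.step()
    return loss_y.item(), loss_z.item(), loss_x.item()
```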
Finally, if we use padding on either network side, loss terms are needed to ensure no
information is encoded in the additional dimensions. We a) use a squared loss to keep those
values close to zero and b) in an additional inverse training pass, overwrite the padding
dimensions with noise of the same amplitude and minimize a reconstruction loss, which
forces these dimensions to be ignored.

3.4 Maximum mean discrepancy

Maximum Mean Discrepancy (MMD) is a kernel-based method for comparison of two


probability distributions that are only accessible through samples (Gretton et al., 2012).
While a trainable discriminator loss is often preferred for this task in high-dimensional
problems, especially in GAN-based image generation, MMD also works well, is easier to use
and much cheaper, and leads to more stable training (Tolstikhin et al., 2017). The method
requires a kernel function as a design parameter, and we found that kernels with heavier
tails than Gaussian are needed to get meaningful gradients for outliers. We achieved best
results with the Inverse Multiquadratic kernel k(x, x′) = 1/(1 + ‖(x − x′)/h‖₂²), reconfirming the


[Figure 2 panels, left to right: Ground truth | INN, all losses | INN, only Ly + Lz | INN, only Lx]

Figure 2: Viability of INN for a basic inverse problem. The task is to produce the
correct (multi-modal) distribution of 2D points x, given only the color label y∗ . When
trained with all loss terms from Sec. 3.3, the INN output matches ground truth almost
exactly (2nd image). The ablations (3rd and 4th image) show that we need Ly and Lz to
learn the conditioning correctly, whereas Lx helps us remain faithful to the prior.

suggestion from Tolstikhin et al. (2017). Since the magnitude of the MMD depends on the
kernel choice, the relative weights of the losses Lx , Ly , Lz are adjusted as hyperparameters,
such that their effect is about equal.
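A minimal sample-based estimator of the (squared) MMD with this kernel could look as follows; using a single bandwidth h and the biased estimator are simplifications on our part:

```python
# Sketch of MMD^2 with the inverse multiquadratic kernel k(x, x') = 1/(1 + ||(x - x')/h||^2).
import torch

def mmd(p_samples, q_samples, h=1.0):
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared Euclidean distances
        return 1.0 / (1.0 + d2 / h ** 2)     # heavier tails than a Gaussian kernel
    k_pp = kernel(p_samples, p_samples).mean()
    k_qq = kernel(q_samples, q_samples).mean()
    k_pq = kernel(p_samples, q_samples).mean()
    return k_pp + k_qq - 2.0 * k_pq
```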

4 Experiments
We first demonstrate the capabilities of INNs on two well-behaved synthetic problems and
then show results for two real-world applications from the fields of medicine and astrophysics.
Additional details on the datasets and network architectures are provided in the appendix.

4.1 Artificial data

Gaussian mixture model: To test basic viability of INNs for inverse problems, we train
them on a standard 8-component Gaussian mixture model p(x). The forward process is very
simple: The first four mixture components (clockwise) are assigned label y = red, the next
two get label y = blue, and the final two are labeled y = green and y = purple (Fig. 2).
The true inverse posteriors p(x | y∗ ) consist of the mixture components corresponding to
the given one-hot-encoded label y∗ . We train the INN to directly regress one-hot vectors
y using a squared loss Ly , so that we can provide plain one-hot vectors y∗ to the inverse
network when sampling p(x | y∗ ). We observe the following: (i) The INN learns very
accurate approximations of the posteriors and does not suffer from mode collapse. (ii) The
coupling block architecture does not reduce the network’s representational power – results
are similar to standard networks of comparable size (see appendix Sec. 2). (iii) Bidirectional
training works best, whereas forward training alone (using only Ly and Lz ) captures the
conditional relationships properly, but places too much mass in unpopulated regions of
x-space. Conversely, pure inverse training (just Lx ) learns the correct x-distribution, but
loses all conditioning information.
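For illustration, training pairs for this toy problem can be generated along the following lines; placing the eight components on a circle and the exact component-to-label assignment are our own assumptions, meant only to mirror the description above:

```python
# Sketch: training data for the 8-component Gaussian mixture toy problem.
# Labels y are one-hot vectors for red / blue / green / purple.
import numpy as np

def make_mixture_data(n_samples, std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
    means = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # 8 component centers
    labels = np.array([0, 0, 0, 0, 1, 1, 2, 3])                  # 4x red, 2x blue, green, purple
    comp = rng.integers(0, 8, size=n_samples)                    # pick a component per sample
    x = means[comp] + std * rng.standard_normal((n_samples, 2))
    y = np.eye(4)[labels[comp]]                                  # one-hot label vectors
    return x.astype(np.float32), y.astype(np.float32)
```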
Inverse kinematics: For a task with a more complex and continuous forward process,
we simulate a simple inverse kinematics problem in 2D space: An articulated arm moves
vertically along a rail and rotates at three joints. These four degrees of freedom constitute
the parameters x. Their priors are given by a normal distribution, which favors a pose with
180◦ angles and centered origin. The forward process is to calculate the coordinates of the
end point y, given a configuration x. The inverse problem asks for the posterior distribution
over all possible inputs x that place the arm’s end point at a given y position. An example
for a fixed y∗ is shown in Fig. 3, where we compare our INN to a conditional VAE (see
appendix Fig. 7 for conceptual comparison of architectures). Adding Inverse Autoregressive
Flow (IAF, Kingma et al., 2016) does not improve cVAE performance in this case (see
appendix, Table 2). The y∗ chosen in Fig. 3 is a hard example, as it is unlikely under the
prior p(x) (Fig. 3, right) and has a strongly bi-modal posterior p(x | y∗ ).
In this case, due to the computationally cheap forward process, we can use approximate
Bayesian computation (ABC, see appendix Sec. 7) to sample from the ground truth posterior.
Compared to ground truth, we find that both INN and cVAE recover the two symmetric


Figure 3: Distribution over articulated poses x, conditioned on the end point y∗ .


The desired end point y∗ is marked by a gray cross. A dotted line on the left represents the
rail the arm is based on, and the faint colored lines indicate sampled arm configurations x
taken from the true (ABC) or learned (INN, cVAE) posterior p(x | y∗ ). The prior (right) is
shown for reference. The actual end point of each sample may deviate slightly from the target
y∗ ; contour lines enclose the regions containing 97% of these end points. We emphasize the
articulated arm with the highest estimated likelihood for illustrative purposes.

modes well. However, the true end points of x-samples produced by the cVAE tend to miss
the target y∗ by a wider margin. This is because the forward process x → y is only learned
implicitly during cVAE training. See appendix for quantitative analysis and details.

4.2 Real-world applications


After demonstrating the viability on synthetic data, we apply our method to two real world
problems from medicine and astronomy. While we focus on the medical task in the following,
the astronomy application is shown in Fig. 5.
In medical science, the functional state of biological tissue is of interest for many applications.
Tumors, for example, are expected to show changes in oxygen saturation sO2 (Hanahan and
Weinberg, 2011). Such changes cannot be measured directly, but influence the reflectance of
the tissue, which can be measured by multispectral cameras (Lu and Fei, 2014). Since ground
truth data can not be obtained from living tissue, we create training data by simulating
observed spectra y from a tissue model x involving sO2 , blood volume fraction vhb , scattering
magnitude amie , anisotropy g and tissue layer thickness d (Wirkert et al., 2016). This model
constitutes the forward process, and traditional methods to learn point estimates of the
inverse (Wirkert et al., 2016; 2017; Claridge and Hidovic-Rowe, 2013) are already sufficiently
reliable to be used in clinical trials. However, these methods can not adequately express
uncertainty and ambiguity, which may be vital for an accurate diagnosis.
Competitors. We train an INN for this problem, along with two ablations (as in Fig. 2),
as well as a cVAE with and without IAF (Kingma et al., 2016) and a network using the
method of Kendall and Gal (2017), with dropout sampling and additional aleatoric error
terms for each parameter. The latter also provides a point-estimate baseline (classical NN)
when used without dropout and error terms, which matches the current state-of-the-art
results in Wirkert et al. (2017). Finally, we compare to ABC, approximating p(x | y∗ ) with
the 256 samples closest to y∗ . Note that with enough samples ABC would produce the true
posterior. We performed 50 000 simulations to generate samples for ABC at test time, taking
one week on a GPU, but still measure inconsistencies in the posteriors. The learning-based
methods are trained within minutes, on a training set of 15 000 samples generated offline.
Error measures. We are interested in both the accuracy (point estimates), and the shape
of the posterior distributions. For point estimates x̂, i.e. MAP estimates, we compute the
deviation from ground-truth values x∗ in terms of the RMSE over test set observations y∗,
RMSE = √( E_y∗[ ‖x̂ − x∗‖² ] ). The scores are reported both for the main parameter of interest
sO2 , and the parameter subspace of sO2 , vhb , amie , which we found to be the only recoverable
parameters. Furthermore, we check the re-simulation error: We apply the simulation s(x̂) to
the point estimate, and compare the simulation outcome to the conditioning y∗ . To evaluate
the shape of the posteriors, we compute the calibration error for the sampling-based methods,
based on the fraction of ground truth inliers αinl. for corresponding α-confidence-region of


Table 1: Quantitative results in medical application. We measure the accuracy of


point/MAP estimates as detailed in Sec. 4.2. Best results within measurement error are bold,
and we determine uncertainties (±) by statistical bootstrapping. The parameter sO2 is the
most relevant in this application, whereas error all means all recoverable parameters (sO2 ,
vhb and amie ). Re-simulation error measures how well the MAP estimate x̂ is conditioned on
the observation y∗ . Calibration error is the most important, as it summarizes correctness of
the posterior shape in one number; see appendix Fig. 11 for more calibration results.
Method           | MAP error sO2 | MAP error all | MAP re-simulation error | Calibration error
NN (+ Dropout)   | 0.057 ± 0.003 | 0.56 ± 0.01   | 0.397 ± 0.008           | 1.91%
INN              | 0.041 ± 0.002 | 0.57 ± 0.02   | 0.327 ± 0.007           | 0.34%
INN, only Ly, Lz | 0.066 ± 0.003 | 0.71 ± 0.02   | 0.506 ± 0.010           | 1.62%
INN, only Lx     | 0.861 ± 0.033 | 1.70 ± 0.02   | 2.281 ± 0.045           | 3.20%
cVAE             | 0.050 ± 0.002 | 0.74 ± 0.02   | 0.314 ± 0.007           | 2.19%
cVAE-IAF         | 0.050 ± 0.002 | 0.74 ± 0.03   | 0.313 ± 0.008           | 1.40%
ABC              | 0.036 ± 0.001 | 0.54 ± 0.02   | 0.284 ± 0.005           | 0.90%
Simulation noise 0.129 ± 0.001

[Figure 4 panels: marginal posteriors from ABC, INN, cVAE-IAF and Dropout (rows) over sO2, vhb, amie, d, g (columns), plus the correlation matrix of x.]

Figure 4: Sampled posterior of 5 parameters for fixed y in medical application.
For a fixed observation y∗ , we compare the estimated posteriors p(x | y∗ ) of different methods.
The bottom row also includes the point estimate (dashed green line). Ground truth values
x∗ (dashed black line) and prior p(x) over all data (gray area) are provided for reference.

the marginal posteriors of x. The reported error is the median of |αinl. − α| over all α. All
values are computed over 5000 test-set observations y∗ , or 1000 observations in the case of
re-simulation error. Each posterior uses 4096 samples, or 256 for ABC; all MAP estimates
are found using the mean-shift algorithm.
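The calibration error can be computed from marginal posterior samples roughly as follows; defining the α-confidence region through symmetric sample quantiles is our assumption:

```python
# Sketch of the calibration error (Sec. 4.2): for each confidence level alpha, count how
# often the ground truth lies inside the central alpha-region of the sampled marginal
# posterior, then take the median absolute deviation |alpha_inl - alpha|.
import numpy as np

def calibration_error(post_samples, x_true, alphas=np.linspace(0.05, 0.95, 19)):
    """post_samples: (n_obs, n_samples) marginal posterior samples per test observation y*.
       x_true: (n_obs,) ground-truth parameter values."""
    errors = []
    for alpha in alphas:
        lo = np.quantile(post_samples, 0.5 - alpha / 2, axis=1)
        hi = np.quantile(post_samples, 0.5 + alpha / 2, axis=1)
        alpha_inl = np.mean((x_true >= lo) & (x_true <= hi))   # fraction of inliers
        errors.append(abs(alpha_inl - alpha))
    return np.median(errors)
```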
Quantitative results. Evaluation results for all methods are presented in Table 1. The
INN matches or outperforms other methods in terms of point estimate error. Its accuracy
deteriorates slightly when trained without Lx , and entirely when trained without the
conditioning losses Ly and Lz , just as in Fig. 2. For our purpose, the calibration error is the
most important metric, as it summarizes the correctness of the whole posterior distribution
in one number (see appendix Fig. 11). Here, the INN has a big lead over cVAE(-IAF) and
Dropout, and even over ABC due to the low ABC sample count.
Qualitative results. Fig. 4 shows generated parameter distributions for one fixed measure-
ment y∗ , comparing the INN to cVAE-IAF, Dropout sampling and ABC. The three former
methods use a sample count of 160 000 to produce smooth curves. Due to the sparse posteriors


[Figure 5 panels: marginal posteriors over Ionizing Luminosity, Ionizing Emission Rate, Cloud Density, Expansion Velocity, Age of the Cluster, plus the correlation matrix of x.]
Figure 5: Astrophysics application. Properties x of star clusters in interstellar gas clouds
are inferred from multispectral measurements y. We train an INN on simulated data, and
show the sampled posterior of 5 parameters for one y∗ (colors as in Fig. 4, second row). The
peculiar shape of the prior is due to the dynamic nature of these simulations. We include
this application as a real-world example for the INN’s ability to recover multiple posterior
modes, and strong correlations in p(x | y∗ ), see details in appendix, Sec. 5.

of 256 samples in the case of ABC, kernel density estimation was applied to its results,
with a bandwidth of σ = 0.1. The results produced by the INN provide relevant insights:
First, we find that the posteriors for layer thickness d and anisotropy g match the shape of
their priors, i.e. y∗ holds no information about these parameters – they are unrecoverable.
This finding is supported by the ABC results, whereas the other two methods misleadingly
suggest a roughly Gaussian posterior. Second, we find that the sampled distributions for the
blood volume fraction vhb and scattering amplitude amie are strongly correlated (rightmost
plot). This phenomenon is not an analysis artifact, but has a sound physical explanation: As
blood volume fraction increases, more light is absorbed inside the tissue. For the sensor to
record the same intensities y∗ as before, scattering must be increased accordingly. In Fig. 10
in the appendix, we show how the INN is applied to real multispectral images.

5 Conclusion
We have shown that the full posterior of an inverse problem can be estimated with invertible
networks, both theoretically and practically on problems from medicine and astrophysics.
We share the excitement of the application experts to develop INNs as a generic tool, helping
them to better interpret their data and models, and to improve experimental setups. As a
side effect, our results confirm the findings of others that the restriction to coupling layers
does not noticeably reduce the expressive power of the network.
In summary, we see the following fundamental advantages of our INN-based method compared
to alternative approaches: Firstly, one can learn the forward process and obtain the (more
complicated) inverse process ‘for free’, as opposed to e.g. cGANs, which focus on the inverse
and learn the forward process only implicitly. Secondly, the learned posteriors are not
restricted to a particular parametric form, in contrast to classical variational methods. Lastly,
in comparison to ABC and related Bayesian methods, the generation of the INN posteriors is
computationally very cheap. In future work, we plan to systematically analyze the properties
of different invertible architectures, as well as more flexible models utilizing cycle losses, in
the context of representative inverse problems. We are also interested in how our method can
be scaled up to higher dimensionalities, where MMD becomes less effective.

Acknowledgments

LA received funding by the Federal Ministry of Education and Research of Germany, project
‘High Performance Deep Learning Framework’ (No 01IH17002). JK, CR and UK received
financial support from the European Research Council (ERC) under the European Union's
Horizon 2020 research and innovation program (grant agreement No 647769). SW and LMH
received funding from the European Research Council (ERC) starting grant COMBIOSCOPY
(637960). EWP, DR, and RSK acknowledge support by Collaborative Research Centre (SFB
881) ‘The Milky Way System’ (subprojects B1, B2 and B8), the Priority Program SPP 1573
‘Physics of the Interstellar Medium’ (grant numbers KL 1358/18.1, KL 1358/19.2 and GL
668/2-1) and the European Research Council in the ERC Advanced Grant STARLIGHT
(project no. 339177)


References
Jack A. Baldwin, Mark M. Phillips, and Roberto Terlevich. Classification parameters for the
emission-line spectra of extragalactic objects. PASP, 93:5–19, February 1981.
Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
normalizing flows for variational inference. arXiv:1803.05649, 2018.
David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for
statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
Ela Claridge and Dzena Hidovic-Rowe. Model based inversion for deriving maps of histological
parameters characteristic of cancer from ex-vivo multispectral images of the colon. IEEE
Trans Med Imaging, November 2013.
Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter
Dayan. Comparison of maximum likelihood and GAN-based training of Real NVPs.
arXiv:1705.05263, 2017.
Gustavo Deco and Wilfried Brauer. Nonlinear higher-order statistical decorrelation by
volume-conserving neural architectures. Neural Networks, 8(4):525–535, 1995.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
estimation. arXiv:1410.8516, 2014.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP.
arXiv:1605.08803, 2016.
Chris Donahue, Akshay Balsubramani, Julian McAuley, and Zachary C Lipton. Semantically
decomposing the latent spaces of generative adversarial networks. arXiv:1705.07904, 2017.
Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin
Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv:1606.00704, 2016.
Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli
approximate variational inference. arXiv:1506.02158, 2015.
Dani Gamerman and Hedibert F Lopes. Markov Chain Monte Carlo: Stochastic simulation
for Bayesian inference. Chapman and Hall/CRC, 2006.
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked
autoencoder for distribution estimation. In International Conference on Machine Learning,
pages 881–889, 2015.
Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual
network: Backpropagation without storing activations. In Advances in Neural Information
Processing Systems, pages 2211–2221, 2017.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773,
2012.
Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-GAN: Combining maximum likelihood
and adversarial learning in generative models. arXiv:1705.08868, 2017.
Douglas Hanahan and Robert A. Weinberg. Hallmarks of cancer: The next generation. Cell,
144(5):646–674, March 2011. ISSN 00928674.
Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autore-
gressive flows. arXiv:1804.00779, 2018.
Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence
and uniqueness results. Neural Networks, 12(3):429–439, 1999.


Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation
with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.

Jörn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible
networks. arXiv:1802.07088, 2018.

Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for
computer vision? In Advances in Neural Information Processing Systems, pages 5580–5590,
2017.

Lisa J. Kewley, Michael A. Dopita, Claus Leitherer, Romeel Davé, Tiantian Yuan, Mark
Allen, Brent Groves, and Ralph Sutherland. Theoretical evolution of optical strong lines
across cosmic time. The Astrophysical Journal, 774(2):100, 2013.

Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1
convolutions. arXiv:1807.03039, 2018.

Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local
reparameterization trick. In Advances in Neural Information Processing Systems, pages
2575–2583, 2015.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. Improved variational inference with inverse autoregressive flow. In Advances in
Neural Information Processing Systems, pages 4743–4751, 2016.

Alexander Kolesnikov and Christoph H. Lampert. PixelCNN models with auxiliary variables
for natural image modeling. In International Conference on Machine Learning, pages
1905–1914, 2017.

Jarno Lintusaari, Michael U. Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander.
Fundamentals and recent developments in approximate bayesian computation. Systematic
Biology, 66(1):e66–e82, 2017.

Guolan Lu and Baowei Fei. Medical hyperspectral imaging: a review. Journal of Biomedical
Optics, 19(1):10901, January 2014. ISSN 1560-2281.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784,
2014.

Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and
Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Advances
in Neural Information Processing Systems, pages 4797–4805, 2016.

George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with
bayesian conditional density estimation. In Advances in Neural Information Processing
Systems, pages 1028–1036, 2016.

George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for
density estimation. In Advances in Neural Information Processing Systems, pages 2335–
2344, 2017.

Eric W. Pellegrini, Jack A. Baldwin, and Gary J. Ferland. Structure and feedback in 30
Doradus. II. Structure and chemical abundances. The Astrophysical Journal, 738(1):34,
2011.

Daniel Rahner, Eric W. Pellegrini, Simon C. O. Glover, and Ralf S. Klessen. Winds and
radiation in unison: A new semi-analytic feedback model for cloud dissolution. Monthly
Notices of the Royal Astronomical Society, 470:4453–4472, 10 2017.

Stefan Reissl, Robert Brauer, and Sebastian Wolf. Radiative transfer with polaris: I. analysis
of magnetic fields through synthetic dust continuum polarization measurements. Astronomy
& Astrophysics, 593, 04 2016.


Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In
International Conference on Machine Learning, pages 1530–1538, 2015.

Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep
density models. arXiv:1302.5125, 2013.

Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer, 2004.

Ralf S. Klessen and Simon C. O. Glover. Physical processes in the interstellar medium.
Saas-Fee Advanced Course, 43:85, 2016.

Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improv-
ing the PixelCNN with discretized logistic mixture likelihood and other modifications.
arXiv:1701.05517, 2017.

R.T. Schirrmeister, P. Chraba̧szcz, F. Hutter, and T. Ball. Training generative reversible networks. arXiv:1806.01610, 2018.

N Siddharth, Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, and Philip HS
Torr. Learning disentangled representations with semi-supervised deep generative models.
In Advances in Neural Information Processing Systems, pages 5925–5935, 2017.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation
using deep conditional generative models. In Advances in Neural Information Processing
Systems, pages 3483–3491, 2015.

Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander, Matthieu
Foll, and Christophe Dessimoz. Approximate bayesian computation. PLoS computational
biology, 9(1):e1002803, 2013.

E. G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013. doi: 10.1002/cpa.21423.

Esteban G Tabak, Eric Vanden-Eijnden, et al. Density estimation by dual ascent of the
log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.

Yunfei Teng, Anna Choromanska, and Mariusz Bojarski. Invertible autoencoder for domain
adaptation. arXiv:1802.06869, 2018.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein
auto-encoders. arXiv:1711.01558, 2017.

Jakub M Tomczak and Max Welling. Improving variational auto-encoders using householder
flow. arXiv:1611.09630, 2016.

Brian L Trippe and Richard E Turner. Conditional density estimation with bayesian
normalising flows. arXiv:1802.04908, 2018.

Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural
autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):
1–37, 2016.

Richard David Wilkinson. Approximate bayesian computation (abc) gives exact results
under the assumption of model error. Statistical applications in genetics and molecular
biology, 12(2):129–141, 2013.

Sebastian J Wirkert, Hannes Kenngott, Benjamin Mayer, Patrick Mietkowski, Martin Wagner,
Peter Sauer, Neil T Clancy, Daniel S Elson, and Lena Maier-Hein. Robust near real-time
estimation of physiological parameters from megapixel multispectral images with inverse
monte carlo and random forest regression. International journal of computer assisted
radiology and surgery, 11(6):909–917, 2016.


Sebastian J. Wirkert, Anant S. Vemuri, Hannes G. Kenngott, Sara Moccia, Michael Götz,
Benjamin F. B. Mayer, Klaus H. Maier-Hein, Daniel S. Elson, and Lena Maier-Hein.
Physiological Parameter Estimation from Multispectral Images Unleashed. In Medical
Image Computing and Computer-Assisted Intervention, Lecture Notes in Computer Science,
pages 134–141. Springer, Cham, September 2017.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image
translation using cycle-consistent adversarial networks. In CVPR, pages 2223–2232, 2017a.
Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang,
and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural
Information Processing Systems, pages 465–476, 2017b.


Appendix

1 Proof of correctness of generated posteriors


Lemma: If some bijective function f : x → z transforms a probability density pX (x) to
pZ (z), then the inverse function f −1 transforms pZ (z) back to pX (x).

Proof: We denote the probability density obtained through the reverse transformation as
p∗X (x). Therefore, we have to show that p∗X (x) = pX (x). For the forward direction, via the
change-of-variables formula, we have
pZ(z) = pX(x = f⁻¹(z)) · |det[∂z(f⁻¹)]|     (6)
with the Jacobian ∂z f⁻¹ ≡ ∂fi⁻¹/∂zj. For the reverse transformation, we have
p∗X(x) = pZ(z = f(x)) · |det[∂x f]|.     (7)
We can substitute pZ from Eq. 6 and obtain
p∗X(x) = pX(x = f⁻¹(f(x))) · |det[(∂z(f⁻¹))(∂x f)]|     (8)
       = pX(x) · |det[(∂z f⁻¹)(∂x f)]|     (9)
       = pX(x) · |det[I]| = pX(x).     (10)
In Eq. 9, the Jacobians cancel out due to the inverse function theorem, i.e. the Jacobian
∂z(f⁻¹) is the matrix inverse of ∂x f.

Theorem: If an INN f(x) = [y, z] is trained as proposed, and both the supervised loss
Ly = E[(y − fy(x))²] and the unsupervised loss Lz = D(q(y, z), p(y) p(z)) reach zero, sampling
according to Eq. 1 with g = f⁻¹ returns the true posterior p(x | y∗) for any measurement y∗.

Proof: We denote the chosen latent distribution as pZ (z), the distribution of observations
as pY (y), and the joint distribution of network outputs as q(y, z). As shown by Gretton
et al. (2012), if the MMD loss converges to 0, the network outputs follow the prescribed
distribution:
Lz = 0 ⇐⇒ q(y, z) = pY (y) pZ (z) (11)
Suppose we take a posterior conditioned on a fixed y∗ , i.e. p(x | y∗ ), and transform it using
the forward pass of our perfectly converged INN. From this we obtain an output distribution
q ∗ (y, z). Because Ly = 0, we know that the output distribution of y (marginalized over z)
must be q ∗ (y) = δ(y − y∗ ). Also, because of the independence between z and y in the output,
the distribution of z-outputs is still q ∗ (z) = pZ (z). So the joint distribution of outputs is
q ∗ (y, z) = δ(y − y∗ ) pZ (z) (12)

When we invert the network, and repeatedly input y while sampling z ∼ pZ (z), this is the
same as sampling [y, z] from the q ∗ (y, z) above. Using the Lemma from above, we know
that the inverted network will output samples from p(x | y∗ ).

Corollary: If the conditions of the theorem above are fulfilled, the unsupervised reverse
loss Lx = D(q(x), pX(x)) between the marginalized outputs of the inverted network, q(x),
and the prior data distribution, pX (x), will also be 0. This justifies using the loss on the
prior to speed up convergence, without altering the final results.


Proof: Due to the theorem, the estimated posteriors generated by the INN are correct,
i.e. q(x | y∗ ) = p(x | y∗ ). If they are marginalized over observations y∗ from the training
data, then q(x) will be equal to pX (x) by definition. As shown by Gretton et al. (2012), this
is equivalent to Lx = 0.

2 Artificial data – Gaussian mixture


In Sec. 4.1, we demonstrate that the proposed INN can approximate the true posteriors very
well and is not hindered by the required coupling block architecture. Here we show how
some existing methods do on the same task, using neural networks of similar size as the INN.

[Figure 6 panels. Top row: Ground truth | INN, all losses | cVAE | cVAE-IAF. Bottom row: cGAN | Larger cGAN | Generator + MMD | Dropout sampling.]

Figure 6: Results of several existing methods for the Gaussian mixture toy example.

cGAN Training a conditional GAN of network size comparable to the INN (counting
only the generator) and only two noise dimensions turned out to be challenging. Even with
additional pre-training to avoid mode collapse, the individual modes belonging to one label
are reduced to nearly one-dimensional structures.

Larger cGAN In order to match the results of the INN, we trained a more complex cGAN
with 2M parameters instead of the previous 10K, and a latent dimension of 128, instead of 2.
To prevent mode collapse, we introduced an additional regularization: an extra loss term
forces the variance of generator outputs to match the variance of the training data prior.
With these changes, the cGAN can be seen to recover the posteriors reasonably well.

Generator + MMD Another option is to keep the cGAN generator the same size as our
INN, but replace the discriminator with an MMD loss (cf. Sec. 3.4). This loss receives a
concatenation of the generator output x and the label y it was supplied with, and compares
these batch-wise with the concatenation of ground truth (x, y)-pairs. Note that in contrast
to this, the corresponding MMD loss of the INN only receives x, and no information about y.
For this small toy problem, we find that the hand-crafted MMD loss dramatically improves
results compared to the smaller learned discriminator.

cVAE We also compare to a conditional Variational Autoencoder of same total size as the
INN. There is some similarity between the training setup of our method (Fig. 7, right) and


[Figure 7 diagram. Left: Conditional VAE with Inverse Autoregressive Flow, encoder [x, y] → (µz, σz) → z with ELBO and reconstruction losses, IAF between encoder and decoder, decoder [z, y] → x. Right: Invertible Neural Network, forward (simulation) x → y with SL, inverse (sampling) [y, z] → x with USL, as in Fig. 1.]
Figure 7: Abstraction of the cVAE-IAF training scheme compared to our INN from Fig. 1.
For the standard cVAE, the IAF component is omitted.

that of cVAE (Fig. 7, left), as the forward and inverse pass of an INN can also be seen as
an encoder-decoder pair. The main differences are that the cVAE learns the relationship
x → y only indirectly, since there is no explicit loss for it, and that the INN requires no
reconstruction loss, since it is bijective by construction.

cVAE-IAF We adapt the cVAE to use Inverse Autoregressive Flow (Kingma et al., 2016)
between the encoder and decoder. On the Gaussian mixture toy problem, the trained
cVAE-IAF generates correct posteriors on par with our INN (see Fig. 6).

Dropout sampling The method of dropout sampling with learned error terms is by
construction not able to produce multi-modal outputs, and therefore fails on this task.

2.1 Latent space analysis

To analyze how the latent space of our INN is structured for this task, we choose a fixed
label y∗ and sample z from a dense grid. For each z, we compute x through our inverse
network and colorize this point in latent (z) space according to the distance from the closest
mode in x-space. We can see that our network learns to shape the latent space such that
each mode receives the expected fraction of samples (Fig. 8).

Figure 8: Layout of INN latent space for one fixed label y∗ , colored by mode closest
to x = g(y∗ , z). For each latent position z, the hue encodes which mode the corresponding x
belongs to and the luminosity encodes how close x is to this mode. Note that colors used
here do not relate to those in Fig. 2, and encode the position x instead of the label y. The
first three columns correspond to labels green, blue and red in Fig. 2. White circles mark areas
that contain 50% and 90% of the probability mass of the latent prior p(z).
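The grid-based latent-space analysis can be sketched as follows; `inn`, `y_star` and the mixture mode centers are placeholders:

```python
# Sketch of the Fig. 8 analysis: push a dense 2D z-grid through the inverse network
# for one fixed label y* and record which mixture mode each output x is closest to.
import torch

def latent_mode_map(inn, y_star, mode_centers, grid_size=200, extent=3.0):
    lin = torch.linspace(-extent, extent, grid_size)
    z1 = lin.repeat_interleave(grid_size)
    z2 = lin.repeat(grid_size)
    z = torch.stack([z1, z2], dim=1)                      # dense grid of latent positions
    y_rep = y_star.expand(z.shape[0], -1)
    with torch.no_grad():
        x = inn.inverse(torch.cat([y_rep, z], dim=1))
    dists = torch.cdist(x, mode_centers)                  # distance of each x to each mode
    closest = dists.argmin(dim=1).reshape(grid_size, grid_size)
    min_dist = dists.min(dim=1).values.reshape(grid_size, grid_size)
    return closest, min_dist                              # hue and luminosity of Fig. 8
```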

3 Artificial data – inverse kinematics

A short video demonstrating the structure of our INN’s latent space can be found under
https://gfycat.com/SoggyCleanHog, for a slightly different arm setup.


The dataset is constructed using Gaussian priors xi ∼ N(0, σi), with σ1 = 0.25 and
σ2 = σ3 = σ4 = 0.5 (≈ 28.65°). The forward process is given by
y1 = x1 + l1 sin(x2) + l2 sin(x3 − x2) + l3 sin(x4 − x2 − x3)     (13)
y2 = l1 cos(x2) + l2 cos(x3 − x2) + l3 cos(x4 − x2 − x3)     (14)
with the arm lengths l1 = 0.5, l2 = 0.5, l3 = 1.0.
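The prior and the forward process of Eqs. 13 and 14 translate directly into code; the sketch below is only illustrative:

```python
# Sketch of the inverse kinematics prior and forward process (Eqs. 13-14).
import numpy as np

L1, L2, L3 = 0.5, 0.5, 1.0                  # arm segment lengths
SIGMAS = np.array([0.25, 0.5, 0.5, 0.5])    # prior standard deviations for x1..x4

def sample_prior(n, seed=0):
    rng = np.random.default_rng(seed)
    return SIGMAS * rng.standard_normal((n, 4))

def forward_kinematics(x):
    """Map arm parameters x (rail offset and three joint angles) to the end point y."""
    x1, x2, x3, x4 = x[:, 0], x[:, 1], x[:, 2], x[:, 3]
    y1 = x1 + L1 * np.sin(x2) + L2 * np.sin(x3 - x2) + L3 * np.sin(x4 - x2 - x3)
    y2 = L1 * np.cos(x2) + L2 * np.cos(x3 - x2) + L3 * np.cos(x4 - x2 - x3)
    return np.stack([y1, y2], axis=1)
```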
To judge the quality of posteriors, we quantify both the re-simulation error and the calibration
error over the test set, as in Sec. 4.2 of the paper. Because of the cheap simulation, we
average the re-simulation error over the whole posterior, and not only the MAP estimate.
In Table 2, we find that the INN has a clear advantage in both metrics, confirming the
observations from Fig. 3.
Table 2: Quantitative evaluation of the inverse kinematics experiment

Method   | Mean re-sim. err. | Median re-sim. err. | Calibration err.
cVAE     | 0.0368            | 0.0307              | 7.78%
cVAE-IAF | 0.0368            | 0.0307              | 7.81%
INN      | 0.0139            | 0.0113              | 0.96%

Figure 9: Posteriors generated for less challenging observations y∗ than in Fig. 3.

4 Multispectral measurements of biological tissue

The following figure shows the results when the INN trained in Sec. 4.2 is applied pixel-wise
to multispectral endoscopic footage. In addition to estimating the oxygenation sO2 , we
measure the uncertainty in the form of the 68% confidence interval.


Figure 10: INN applied to real footage to predict oxygenation sO2 and uncertainty.
The clips (arrows) on the connecting tissue cause lower oxygenation (blue) in the small
intestine. Uncertainty is low in crucial areas and high only at some edges and specularities.

5 Star cluster spectral data


Star clusters are born from a large reservoir of gas and dust that permeates the Galaxy,
the interstellar medium (ISM). The densest parts of the ISM are called molecular clouds,
and star formation occurs in regions that become unstable under their own weight. The
process is governed by the complex interplay of competing physical agents such as gravity,
turbulence, magnetic fields, and radiation; with stellar feedback playing a decisive regulatory
role (S. Klessen and C. O. Glover, 2016). To characterize the impact of the energy and
momentum input from young star clusters on the dynamical evolution of the ISM, astronomers
frequently study emission lines from chemical elements such as hydrogen or oxygen. These
lines are produced when gas is ionized by stellar radiation, and their relative intensities
depend on the ionization potential of the chemical species, the spectrum of the ionizing
radiation, the gas density as well as the 3D geometry of the cloud, and the absolute intensity
of the radiation (Pellegrini et al., 2011). Key diagnostic tools are the so-called BPT diagrams
(after Baldwin et al., 1981), which use the emission of ionized hydrogen, H+, to normalize the recombination
lines of O++, O+ and S+ (see also Kewley et al., 2013). We investigate the dynamical
feedback of young star clusters on their parental cloud using the WARPFIELD 1D model
developed by Rahner et al. (2017). It follows the entire temporal evolution of the system
until the cloud is destroyed, which could take several stellar populations to happen. At
each timestep we employ radiative transfer calculations (Reissl et al., 2016) to generate
synthetic emission line maps which we use to train the neural network. Similar to the medical
application from Section 4.2, the mapping from simulated observations to underlying physical
parameters (such as cloud and cluster mass, and total age of the system) is highly degenerate
and ill-posed. As an intermediary step, we therefore train our forward model to predict the
observable quantities y (emission line ratios) from composite simulation outputs x (such
as ionizing luminosity and emission rate, cloud density, expansion velocity, and age of the
youngest cluster in the system, which in the case of multiple stellar populations could be
considerably smaller than the total age). Using the inverse of our trained model for a given
set of observations y∗ , we can obtain a distribution over the unobservable properties x of
the system.
Results for one specific y∗ are shown in Fig. 5. Note that our network recovers a decidedly
multimodal distribution of x that visibly deviates from the prior p(x). Note also the strong
correlations in the system. For example, the measurements y∗ investigated may correspond
to a young cluster with large expansion velocity, or to an older system that expands slowly.
Finding these ambiguities in p(x | y∗ ) and identifying degeneracies in the underlying model
are pivotal aspects of astrophysical research, and a method to effectively approximate full
posterior distributions has the potential to lead to a major breakthrough in this field.
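
A brief sketch of how such posterior samples, and the correlations among the physical parameters, can be obtained from the trained model (inn_inverse is an assumed name and signature for the inverse pass; dim(z) = 17 as in Table 3):

    import numpy as np

    def posterior_correlations(inn_inverse, y_star, n_samples=10**4, dim_z=17):
        """Sample p(x | y*) by drawing z from the latent prior and running the
        inverse pass, then inspect pairwise correlations (e.g. expansion
        velocity vs. cluster age) to expose degeneracies.
        inn_inverse(y, z) -> x is an assumed name/signature."""
        z = np.random.randn(n_samples, dim_z)
        x = np.stack([inn_inverse(y_star, zk) for zk in z])
        return x, np.corrcoef(x, rowvar=False)   # samples and (dim_x, dim_x) matrix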

6 Calibration curve for tissue parameter estimation


In Sec. 4.2, we report the median calibration error for each method. The following figure
plots the calibration error, q_inliers − q, against the confidence level q. Negative values mean
that a model is overconfident, positive values that it is underconfident.
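
A sketch of how the curve below can be computed (inlier_fn is a placeholder for the model-specific test of whether the ground truth lies inside the q-confidence region; all names are ours):

    import numpy as np

    def calibration_curve(inlier_fn, test_cases, qs=np.linspace(0.05, 0.95, 19)):
        """inlier_fn(case, q) -> True if the ground-truth x of this test case lies
        inside the model's q-confidence region (model-specific, assumed callable).
        Returns the calibration error q_inliers - q for every confidence level q."""
        errors = []
        for q in qs:
            q_inliers = np.mean([inlier_fn(case, q) for case in test_cases])
            errors.append(q_inliers - q)   # < 0: overconfident, > 0: underconfident
        return qs, np.array(errors)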


Figure 11: Calibration curves for all four methods compared in Sec. 4.2 (ABC, INN, cVAE,
MC Dropout); the calibration error q_inliers − q is plotted against the confidence q ∈ [0, 1].

7 Approximate Bayesian computation (ABC)


While there is a whole field of research concerned with ABC approaches and their efficiency-
accuracy tradeoffs, our use of the method here is limited to the essential principle of rejection
sampling. When we require N samples of x from the posterior p(x | y∗ ) conditioned on some
y∗ , there are two basic ways to obtain them:

Threshold: We set an acceptance threshold ε, repeatedly draw x-samples from the prior,
compute the corresponding y-values (via simulation) and keep those where dist(y, y∗) < ε,
until we have accepted N samples. The smaller we want ε, the more simulations have to be
run, which is why we use this approach only for the experiment in Sec. 4.1, where we can
afford to run the forward process millions or even billions of times.

Quantile: Alternatively, we choose what quantile q of samples shall be accepted, and then
run exactly N/q simulations. All sampled pairs (x, y) are sorted by dist(y, y∗ ) and the
N closest to y∗ form the posterior. This allows for a more predictable runtime when the
simulations are costly, as in the medical application in Sec. 4.2 where q = 0.005.
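
Both variants amount to plain rejection sampling; the following sketch (NumPy; prior_sample and simulate are assumed callables, names are ours) spells them out.

    import numpy as np

    def abc_threshold(prior_sample, simulate, y_star, eps, n_accept):
        """Threshold variant: draw from the prior until n_accept samples yield
        simulated observations within distance eps of y*."""
        accepted = []
        while len(accepted) < n_accept:
            x = prior_sample()
            if np.linalg.norm(simulate(x) - y_star) < eps:
                accepted.append(x)
        return np.stack(accepted)

    def abc_quantile(prior_sample, simulate, y_star, q, n_accept):
        """Quantile variant: run exactly n_accept / q simulations and keep the
        n_accept samples whose simulated y lies closest to y*."""
        n_total = int(np.ceil(n_accept / q))
        xs = np.stack([prior_sample() for _ in range(n_total)])
        ys = np.stack([simulate(x) for x in xs])
        d = np.linalg.norm(ys - y_star, axis=1)
        return xs[np.argsort(d)[:n_accept]]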

8 Details of datasets and network architectures


Table 3 summarizes the datasets used throughout the paper. The architecture details are
given in the following.

Table 3: Dimensionalities and training set sizes for each experiment.

Experiment           training data   dim(x)   dim(y)   dim(z)   see also
Gaussian mixture     10^6            2        8        2
Inverse kinematics   10^6            4        2        2
Medical data         15 000          13       8        13       Wirkert et al. (2016)
Astronomy            8 772           19       69       17       Pellegrini et al. (2011)

8.1 Artificial data – Gaussian mixture

INN: 3 invertible blocks, 3 fully connected layers per affine coefficient function with ReLU
activation functions in the intermediate layers, zero padding to a nominal dimension of 16,
Adam optimizer, decaying learning rate from 10^−3 to 10^−5, batch size 200. The inverse
multiquadratic kernel was used for MMD, with h = 0.2 in both x- and z-space.
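
For illustration, a stripped-down PyTorch sketch of one such invertible block with its affine coefficient subnetworks (a generic coupling block in the spirit of the architecture described above, not the authors' exact implementation; the hidden width is our choice):

    import torch
    import torch.nn as nn

    def subnet(dim_in, dim_out, hidden=128):
        # 3 fully connected layers, ReLU in the intermediate layers (hidden width ours)
        return nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(),
                             nn.Linear(hidden, hidden), nn.ReLU(),
                             nn.Linear(hidden, dim_out))

    class AffineCouplingBlock(nn.Module):
        """Generic coupling block: split the (zero-padded) vector into halves u1, u2,
        then v1 = u1 * exp(s2(u2)) + t2(u2) and v2 = u2 * exp(s1(v1)) + t1(v1).
        Both directions are cheap to evaluate; this is a sketch, not the authors' code."""
        def __init__(self, dim_total=16):
            super().__init__()
            self.d = dim_total // 2
            self.s1, self.t1 = subnet(self.d, self.d), subnet(self.d, self.d)
            self.s2, self.t2 = subnet(self.d, self.d), subnet(self.d, self.d)

        def forward(self, u):
            u1, u2 = u[:, :self.d], u[:, self.d:]
            v1 = u1 * torch.exp(self.s2(u2)) + self.t2(u2)
            v2 = u2 * torch.exp(self.s1(v1)) + self.t1(v1)
            return torch.cat([v1, v2], dim=1)

        def inverse(self, v):
            v1, v2 = v[:, :self.d], v[:, self.d:]
            u2 = (v2 - self.t1(v1)) * torch.exp(-self.s1(v1))
            u1 = (v1 - self.t2(u2)) * torch.exp(-self.s2(u2))
            return torch.cat([u1, u2], dim=1)

In the full model, several such blocks are stacked with permutations of the dimensions in between, and practical implementations usually clamp the outputs of the scale networks for numerical stability.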


Dropout sampling: 6 fully connected layers with ReLU activations, Adam optimizer,
learning rate decay from 10^−3 to 10^−5, batch size 200, dropout probability p = 0.2.

cGAN: 6 fully connected layers for the generator and 8 for the discriminator, all with leaky
ReLU activations. Adam was used for the generator, SGD for the discriminator, learning
rates decaying from 2 · 10^−3 to 2 · 10^−6, batch size 256. Initially 100 iterations of training
with L = (1/N) Σ_i ‖g(z_i, y_i) − x_i‖_2^2 to separate the differently labeled modes, followed
by pure GAN training.

Larger cGAN: 2 fully connected layers with 1024 neurons each for discriminator and
generator, batch size 512, Adam optimizer with learning rate 8 · 10^−4 for the generator, SGD
with learning rate 1.2 · 10^−3 and momentum 0.05 for the discriminator, 1.6 · 10^−3 weight decay
for both, 0.25 dropout probability for the generator at training and test time. Equal weighting
of the discriminator loss and the output-variance penalty L = (Var_i[g(z_i, y_i)] − Var_i[x_i])².

Generator with MMD: 8 fully connected layers with leaky ReLU activations, Adam
optimizer, decaying learning rate from 10^−3 to 10^−6, batch size 256. Inverse multiquadratic
kernel, h = 0.5.
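
All MMD terms in these experiments are computed from batch samples with a characteristic kernel. A sketch of such a batch estimate with one common form of the inverse multiquadratic kernel (the exact normalization used in the experiments may differ; h is the bandwidth quoted above):

    import torch

    def inv_multiquadratic(a, b, h):
        """One common form of the inverse multiquadratic kernel,
        k(a, b) = 1 / (1 + ||(a - b)/h||_2^2); the exact normalization used
        in the experiments may differ."""
        return 1.0 / (1.0 + torch.cdist(a, b).pow(2) / h**2)

    def mmd2(x, y, h):
        """Biased (V-statistic) batch estimate of MMD^2 between samples x and y."""
        return (inv_multiquadratic(x, x, h).mean()
                + inv_multiquadratic(y, y, h).mean()
                - 2.0 * inv_multiquadratic(x, y, h).mean())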

cVAE: 3 fully connected layers each for encoder and decoder, ReLU activations, learning
rate 2 · 10^−2, decay to 2.5 · 10^−5, Adam optimizer, batch size 25, reconstruction loss weighted
50:1 versus KL divergence loss.

8.2 Artificial data – inverse kinematics

INN: 6 affine coupling blocks with 3 fully connected layers each and leaky ReLU activations.
Adam optimizer, decaying learning rate from 10^−2 to 10^−4, multiquadratic kernel with
h = 1.2.

cVAE: 4 fully connected layers each for encoder and decoder, ReLU activations, learning
rate 5 · 10^−3, decay to 1.6 · 10^−5, Adam optimizer, batch size 250, reconstruction loss weighted
15:1 versus KL divergence loss.

8.3 Functional parameter estimation from multispectral tissue images

INN: 3 invertible blocks, 4 fully connected layers per affine coefficient function with leaky
ReLUs in the intermediate layers, zero padding to double the original width. Adam optimizer,
learning rate decay from 2 · 10^−3 to 2 · 10^−5, batch size 200. Inverse multiquadratic kernel
with h = 1; MMD terms weighted by observation distance, with γ decaying from 0.2 to 0.

Dropout sampling/point estimate: 8 fully connected layers, ReLU activations, Adam
with decaying learning rate from 10^−2 to 10^−5, batch size 100, dropout probability p = 0.2.

cVAE: 4 fully connected layers each for encoder and decoder, ReLU activations, learning
rate 10^−3, decay to 3.2 · 10^−6, Adam optimizer, batch size 25, reconstruction loss weighted
10^3:1 versus KL divergence loss.

8.4 Impact of star clusters on the dynamical evolution of the galactic gas

INN: 5 invertible blocks, 4 fully connected layers per affine coefficient function with
leaky ReLUs in the intermediate layers, no additional zero padding. Adam optimizer with
decaying learning rate from 2 · 10^−3 to 1.5 · 10^−6, batch size 500. Kernel for latent space:
k(z, z′) = exp(−‖(z − z′)/h‖_2^{1/4}) with h = 7.1. Kernel for x-space: k(x, x′) = −‖x − x′‖^{1/2}.
Due to the complex nature of the prior distributions, these kernels were found to capture
the details correctly, whereas the peak of the inverse multiquadratic kernel was too broad for
this purpose.
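
These two kernels can be written as drop-in replacements for the kernel in the MMD estimate sketched in Sec. 8.1 above; note that we read the exponents as 1/4 and 1/2 on the respective norms, which is an assumption about the original typesetting.

    import torch

    def latent_kernel(a, b, h=7.1):
        # k(z, z') = exp(-||(z - z') / h||_2^(1/4)); exponent 1/4 is our reading (assumption)
        return torch.exp(-(torch.cdist(a, b) / h).pow(0.25))

    def x_space_kernel(a, b):
        # k(x, x') = -||x - x'||^(1/2), an energy-distance-style kernel (exponent assumed)
        return -torch.cdist(a, b).pow(0.5)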
