Neural Processes
Marta Garnelo 1   Jonathan Schwarz 1   Dan Rosenbaum 1   Fabio Viola 1   Danilo J. Rezende 1   S. M. Ali Eslami 1   Yee Whye Teh 1
Abstract

Neural networks (NNs) are parameterised functions that can be tuned via gradient descent to approximate a labelled collection of data with high precision. A Gaussian process (GP), on the other hand, is a probabilistic model that defines a distribution over possible functions, and is updated in light of data via the rules of probabilistic inference. GPs are probabilistic, data-efficient and flexible, however they are also computationally intensive and thus limited in their applicability. We introduce a class of neural latent variable models which we call Neural Processes (NPs), combining the best of both worlds. Like GPs, NPs define distributions over functions, are capable of rapid adaptation to new observations, and can estimate the uncertainty in their predictions. Like NNs, NPs are computationally efficient during training and evaluation but also learn to adapt their priors to data. We demonstrate the performance of NPs on a range of learning tasks, including regression and optimisation, and compare and contrast with related models in the literature.

1 DeepMind, London, UK. Correspondence to: Marta Garnelo <garnelo@google.com>.

Presented at the ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models.

1. Introduction

Function approximation lies at the core of numerous problems in machine learning, and one approach that has been exceptionally popular for this purpose over the past decade is deep neural networks. At a high level neural networks constitute black-box function approximators that learn to parameterise a single function from a large number of training data points. As such, the majority of the workload of a network falls on the training phase, while the evaluation and testing phases are reduced to quick forward passes. Although high test-time performance is valuable for many real-world applications, the fact that network outputs cannot be updated after training may be undesirable. Meta-learning, for example, is an increasingly popular field of research that addresses this issue.

As an alternative to using neural networks one can also perform inference on a stochastic process in order to carry out function regression. The most common instantiation of this approach is a Gaussian process (GP), a model with complementary properties to those of neural networks: GPs do not require a costly training phase and can carry out inference about the underlying ground truth function conditioned on some observations, which renders them very flexible at test time. In addition, GPs represent infinitely many different functions at locations that have not been observed, thereby capturing the uncertainty over their predictions given some observations. However, GPs are computationally expensive: in their original formulation they scale cubically with respect to the number of data points, and current state-of-the-art approximations still scale quadratically (Quiñonero-Candela & Rasmussen, 2005). Furthermore, the available kernels are usually restricted in their functional form, and an additional optimisation procedure is required to identify the most suitable kernel, as well as its hyperparameters, for any given task.

As a result, there is growing interest in combining aspects of neural networks and inference on stochastic processes as a potential solution to some of the downsides of both (Huang et al., 2015; Wilson et al., 2016). In this work we introduce a neural network-based formulation that learns an approximation of a stochastic process, which we term Neural Processes (NPs). NPs display some of the fundamental properties of GPs, namely they learn to model distributions over functions, are able to estimate the uncertainty over their predictions conditioned on context observations, and shift some of the workload from training to test time, which allows for model flexibility. Crucially, NPs generate predictions in a computationally efficient way. Given n context points and m target points, inference with a trained NP corresponds to a forward pass in a deep NN, which scales with O(n + m), as opposed to the O((n + m)^3) runtime of classic GPs. Furthermore, the model overcomes many functional design restrictions by learning an implicit kernel from the data directly.
Our main contributions are:

1. We introduce Neural Processes, a class of models that combine benefits of neural networks and stochastic processes.

2. We compare NPs to related work in meta-learning, deep latent variable models and Gaussian processes. Given that NPs are linked to many of these areas, they form a bridge for comparison between many related topics.

3. We showcase the benefits and abilities of NPs by applying them to a range of tasks including 1-D regression, real-world image completion, Bayesian optimisation and contextual bandits.

2. Model

2.1. Neural processes as stochastic processes

The standard approach to defining a stochastic process is via its finite-dimensional marginal distributions. Specifically, we consider the process as a random function F : X → Y and for each finite sequence x1:n = (x1, ..., xn) with xi ∈ X, we define the marginal joint distribution over the function values Y1:n := (F(x1), ..., F(xn)). For example, in the case of GPs, these joint distributions are multivariate Gaussians parameterised by a mean and a covariance function.

Given a collection of joint distributions ρx1:n we can derive two necessary conditions to be able to define a stochastic process F such that ρx1:n is the marginal distribution of (F(x1), ..., F(xn)), for each finite sequence x1:n. These conditions are: (finite) exchangeability and consistency. As stated by the Kolmogorov Extension Theorem (Øksendal, 2003) these conditions are sufficient to define a stochastic process.

Exchangeability. This condition requires the joint distributions to be invariant to permutations of the elements in x1:n. More precisely, for each finite n, if π is a permutation of {1, ..., n}, then:

ρ_{x_{1:n}}(y_{1:n}) := ρ_{x_1,...,x_n}(y_1, ..., y_n) = ρ_{x_{π(1)},...,x_{π(n)}}(y_{π(1)}, ..., y_{π(n)}) =: ρ_{π(x_{1:n})}(π(y_{1:n}))    (1)

where π(x_{1:n}) := (x_{π(1)}, ..., x_{π(n)}) and π(y_{1:n}) := (y_{π(1)}, ..., y_{π(n)}).

Consistency. If we marginalise out a part of the sequence the resulting marginal distribution is the same as that defined on the original sequence. More precisely, if 1 ≤ m ≤ n, then:

ρ_{x_{1:m}}(y_{1:m}) = ∫ ρ_{x_{1:n}}(y_{1:n}) dy_{m+1:n}.    (2)

Take, for example, three different sequences x1:n, π(x1:n) and x1:m as well as their corresponding joint distributions ρx1:n, ρπ(x1:n) and ρx1:m. In order for these joint distributions to all be marginals of some higher-dimensional distribution given by the stochastic process F, they have to satisfy equations 1 and 2 above.

Given a particular instantiation of the stochastic process f the joint distribution is defined as:

ρ_{x_{1:n}}(y_{1:n}) = ∫ p(f) p(y_{1:n} | f, x_{1:n}) df.    (3)

Here p denotes the abstract probability distribution over all random quantities. Instead of Yi = F(xi), we add some observation noise Yi ∼ N(F(xi), σ²) and define p as:

p(y_{1:n} | f, x_{1:n}) = Π_{i=1}^{n} N(y_i | f(x_i), σ²).    (4)

Inserting this into equation 3, the stochastic process is specified by:

ρ_{x_{1:n}}(y_{1:n}) = ∫ p(f) Π_{i=1}^{n} N(y_i | f(x_i), σ²) df.    (5)

In other words, exchangeability and consistency of the collection of joint distributions {ρx1:n} imply the existence of a stochastic process F such that the observations Y1:n become iid conditional upon F. This essentially corresponds to a conditional version of de Finetti's Theorem that anchors much of Bayesian nonparametrics (De Finetti, 1937). In order to represent a stochastic process using an NP, we will approximate it with a neural network, and assume that F can be parameterised by a high-dimensional random vector z, and write F(x) = g(x, z) for some fixed and learnable function g (i.e. the randomness in F is due to that of z). The generative model (Figure 1a) then follows from (5):

p(z, y_{1:n} | x_{1:n}) = p(z) Π_{i=1}^{n} N(y_i | g(x_i, z), σ²)    (6)

where, following ideas of variational auto-encoders, we assume p(z) is a multivariate standard normal, and g(xi, z) is a neural network which captures the complexities of the model.

To learn such a distribution over random functions, rather than a single function, it is essential to train the system using multiple datasets concurrently, with each dataset being a sequence of inputs x1:n and outputs y1:n, so that we can learn the variability of the random function from the variability of the datasets (see section 2.2).
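To make the generative side of equation (6) concrete, the following minimal sketch samples functions by drawing a single global latent z ∼ N(0, I) and decoding it at a grid of inputs. The two-layer MLP standing in for g, the latent size and the fixed noise level σ are illustrative assumptions, not the implementation used in the paper.

```python
import torch
import torch.nn as nn

# Sketch of the generative model in equation (6):
# z ~ p(z) = N(0, I), y_i ~ N(g(x_i, z), sigma^2) for each location x_i.
# The decoder below is an illustrative two-layer MLP, not the authors' architecture.
Z_DIM, X_DIM, Y_DIM, SIGMA = 16, 1, 1, 0.1

decoder_g = nn.Sequential(
    nn.Linear(Z_DIM + X_DIM, 64), nn.ReLU(),
    nn.Linear(64, Y_DIM),
)

def sample_functions(x, n_samples=3):
    """Draw n_samples function realisations at locations x (shape [n, X_DIM])."""
    ys = []
    for _ in range(n_samples):
        z = torch.randn(Z_DIM)                            # one global latent per function
        zx = torch.cat([z.expand(x.shape[0], -1), x], dim=-1)
        mean = decoder_g(zx)                              # g(x_i, z)
        ys.append(mean + SIGMA * torch.randn_like(mean))  # add observation noise
    return torch.stack(ys)

x_grid = torch.linspace(-2.0, 2.0, 50).unsqueeze(-1)
print(sample_functions(x_grid).shape)  # torch.Size([3, 50, 1])
```

Each draw of z corresponds to one sampled function, which is what makes the model a distribution over functions rather than a single regressor.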
Figure 1. Neural process model. (a) Graphical model of a neural process. x and y correspond to the data, where y = f(x). C and T are the number of context points and target points respectively, and z is the global latent variable. A grey background indicates that the variable is observed. (b) Diagram of our neural process implementation. Variables in circles correspond to the variables of the graphical model in (a), variables in square boxes to the intermediate representations of NPs, and unbound, bold letters to the following computation modules: h - encoder, a - aggregator and g - decoder. In our implementation h and g correspond to neural networks and a to the mean function. The continuous lines depict the generative process, the dotted lines the inference.
Since the decoder g is non-linear, we can use amortised variational inference to learn it. Let q(z | x1:n, y1:n) be a variational posterior of the latent variables z, parameterised by another neural network that is invariant to permutations of the sequences x1:n, y1:n. Then the evidence lower-bound (ELBO) is given by:

log p(y_{1:n} | x_{1:n}) ≥ E_{q(z | x_{1:n}, y_{1:n})} [ Σ_{i=1}^{n} log p(y_i | z, x_i) + log ( p(z) / q(z | x_{1:n}, y_{1:n}) ) ]    (7)

In an alternative objective that better reflects the desired model behaviour at test time, we split the dataset into a context set, x_{1:m}, y_{1:m}, and a target set, x_{m+1:n}, y_{m+1:n}, and model the conditional of the target given the context. This gives:

log p(y_{m+1:n} | x_{1:n}, y_{1:m}) ≥ E_{q(z | x_{1:n}, y_{1:n})} [ Σ_{i=m+1}^{n} log p(y_i | z, x_i) + log ( p(z | x_{1:m}, y_{1:m}) / q(z | x_{1:n}, y_{1:n}) ) ]    (8)

Note that in the above the conditional prior p(z | x_{1:m}, y_{1:m}) is intractable. We can approximate it using the variational posterior q(z | x_{1:m}, y_{1:m}), which gives:

log p(y_{m+1:n} | x_{1:n}, y_{1:m}) ≥ E_{q(z | x_{1:n}, y_{1:n})} [ Σ_{i=m+1}^{n} log p(y_i | z, x_i) + log ( q(z | x_{1:m}, y_{1:m}) / q(z | x_{1:n}, y_{1:n}) ) ]    (9)
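The objective in equation (9) can be estimated with a single reparameterised sample of z, giving a loss of the familiar form: reconstruction term minus the KL between the posterior conditioned on context plus targets and the one conditioned on the context alone. The sketch below assumes the encoder has already produced Gaussian parameters for both posteriors; the fixed observation noise and the placeholder decoder are assumptions for illustration only.

```python
import torch
from torch.distributions import Normal, kl_divergence

# One-sample Monte-Carlo estimate of the objective in equation (9), written as a loss:
#   -( sum_i log p(y_i | z, x_i) - KL( q(z | x_{1:n}, y_{1:n}) || q(z | x_{1:m}, y_{1:m}) ) ).
# How the two Gaussians are produced by the encoder is left abstract here.
def np_loss(q_context: Normal, q_full: Normal, target_y: torch.Tensor, decode_mean):
    z = q_full.rsample()                          # reparameterised sample of the global latent
    pred = Normal(decode_mean(z), 0.1)            # p(y_i | z, x_i), fixed sigma for the sketch
    log_lik = pred.log_prob(target_y).sum()       # sum_i log p(y_i | z, x_i)
    kl = kl_divergence(q_full, q_context).sum()   # KL between full and context-only posteriors
    return -(log_lik - kl)

# Placeholder usage with made-up distribution parameters and a dummy decoder mean.
q_context = Normal(torch.zeros(16), torch.ones(16))
q_full = Normal(0.1 * torch.ones(16), 0.9 * torch.ones(16))
target_y = torch.randn(10, 1)
print(float(np_loss(q_context, q_full, target_y, decode_mean=lambda z: torch.zeros(10, 1))))
```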
2.2. Distributions over functions

A key motivation for NPs is the ability to represent a distribution over functions rather than a single function. In order to train such a model we need a training procedure that reflects this task.

More formally, to train an NP we form a dataset that consists of functions f : X → Y that are sampled from some underlying distribution D. As an illustrative example, consider a dataset consisting of functions fd(x) ∼ GP that have been generated using a Gaussian process with a fixed kernel. For each of the functions fd(x) our dataset contains a number of (x, y)i tuples where yi = fd(xi). For training purposes we divide these points into a set of n context points C = {(x, y)_i}_{i=1}^{n} and a set of n + m target points, which consists of all points in C as well as m additional unobserved points, T = {(x, y)_i}_{i=1}^{n+m}. During testing the model is presented with some context C and has to predict the target values yT = f(xT) at target positions xT.

In order to be able to predict accurately across the entire dataset a model needs to learn a distribution that covers all of the functions observed in training and be able to take into account the context data at test time.
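As a rough illustration of this training setup, the snippet below builds one episode from a single sampled function: a random number of points forms the context C, and the targets T consist of the context plus additional unobserved points. The helper name, the sine stand-in for fd and the size limits are hypothetical choices, not the paper's data pipeline.

```python
import numpy as np

# Illustrative construction of one training episode as described in Section 2.2:
# the targets contain the context points plus additional unobserved points.
rng = np.random.default_rng(0)

def make_episode(xs: np.ndarray, ys: np.ndarray, max_context: int = 10, n_extra: int = 10):
    """Split one sampled function (xs, ys) into context C and targets T = C + extra."""
    n_context = rng.integers(1, max_context + 1)
    idx = rng.permutation(len(xs))[: n_context + n_extra]
    context_idx, target_idx = idx[:n_context], idx          # targets include the context
    return (xs[context_idx], ys[context_idx]), (xs[target_idx], ys[target_idx])

xs = np.linspace(-2, 2, 100)
ys = np.sin(xs)                      # stand-in for one function f_d drawn from D
(context_x, context_y), (target_x, target_y) = make_episode(xs, ys)
print(len(context_x), len(target_x))
```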
Figure 2. Graphical models of related models (a-c) and of the neural process (d): (a) conditional VAE, (b) neural statistician, (c) conditional neural process, (d) neural process. Gray shading indicates the variable is observed. C stands for context variables and T for target variables, i.e. the variables to predict given C.
2.3. Global latent variable

As mentioned above, neural processes include a latent variable z that captures F. This latent variable is of particular interest because it captures the global uncertainty, which allows us to sample at a global level – one function fd at a time – rather than at a local output level – one yi value for each xi at a time (independently of the remaining yT).

In addition, since we are passing all of the context's information through this single variable, we can formulate the model in a Bayesian framework. In the absence of context points C the latent distribution p(z) would correspond to a data-specific prior the model has learned during training. As we add observations the latent distribution encoded by the model amounts to the posterior p(z|C) over the function given the context. On top of this, as shown in equation 9, instead of using a zero-information prior p(z), we condition the prior on the context. As such this prior is equivalent to a less informed posterior of the underlying function. This formulation makes it clear that the posterior given a subset of the context points will serve as the prior when additional context points are included. By using this setup, and training with different sizes of context, we encourage the learned model to be flexible with regards to the number and position of the context points.
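The difference between global and local sampling can be seen with a toy decoder: drawing one z and reusing it at every input yields a single coherent function sample, whereas re-drawing a latent per input produces predictions that are uncorrelated across inputs. The untrained MLP below is only a stand-in for g, used to show where the randomness enters.

```python
import torch
import torch.nn as nn

# Toy contrast between global and local sampling (Section 2.3).
torch.manual_seed(0)
decoder = nn.Sequential(nn.Linear(9, 32), nn.Tanh(), nn.Linear(32, 1))  # stand-in for g
x = torch.linspace(-2, 2, 50).unsqueeze(-1)                             # 50 target locations

def decode(z):
    return decoder(torch.cat([z.expand(x.shape[0], -1), x], dim=-1))

# Global sampling: one z shared by every x_i -> a single coherent function sample.
y_global = decode(torch.randn(8))

# Local sampling: an independent latent per x_i -> outputs uncorrelated across inputs.
y_local = torch.stack([decode(torch.randn(8))[i] for i in range(x.shape[0])])

print(y_global.shape, y_local.shape)  # torch.Size([50, 1]) torch.Size([50, 1])
```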
2.4. The Neural process model

In our implementation of NPs we accommodate two additional desiderata: invariance to the order of context points and computational efficiency. The resulting model can be boiled down to three core components (see Figure 1b); a short sketch of how they fit together follows the list:

• An encoder h from input space into representation space that takes in pairs of (x, y)i context values and produces a representation ri = h((x, y)i) for each of the pairs. We parameterise h as a neural network.

• An aggregator a that summarises the encoded inputs. We are interested in obtaining a single order-invariant global representation r that parameterises the latent distribution z ∼ N(µ(r), Iσ(r)). The simplest operation that ensures order-invariance and works well in practice is the mean function r = a(ri) = (1/n) Σ_{i=1}^{n} ri. Crucially, the aggregator reduces the runtime to O(n + m), where n and m are the number of context and target points respectively.

• A conditional decoder g that takes as input the sampled global latent variable z as well as the new target locations xT and outputs the predictions ŷT for the corresponding values of f(xT) = yT.
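A compact sketch of how these three components could be wired together is given below. Layer widths, the softplus parameterisation of σ(r) and the overall class layout are assumptions made for illustration; only the structure (per-pair encoder h, mean aggregator a, latent z ∼ N(µ(r), Iσ(r)), conditional decoder g) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class TinyNeuralProcess(nn.Module):
    """Minimal encoder/aggregator/decoder wiring following Figure 1b (sizes are illustrative)."""

    def __init__(self, x_dim=1, y_dim=1, r_dim=32, z_dim=16):
        super().__init__()
        self.encoder_h = nn.Sequential(nn.Linear(x_dim + y_dim, 64), nn.ReLU(), nn.Linear(64, r_dim))
        self.to_mu = nn.Linear(r_dim, z_dim)
        self.to_sigma = nn.Linear(r_dim, z_dim)
        self.decoder_g = nn.Sequential(nn.Linear(z_dim + x_dim, 64), nn.ReLU(), nn.Linear(64, y_dim))

    def latent(self, x, y):
        r_i = self.encoder_h(torch.cat([x, y], dim=-1))   # r_i = h((x, y)_i)
        r = r_i.mean(dim=0)                               # order-invariant aggregation a
        return Normal(self.to_mu(r), F.softplus(self.to_sigma(r)))  # z ~ N(mu(r), I sigma(r))

    def forward(self, context_x, context_y, target_x):
        q_context = self.latent(context_x, context_y)
        z = q_context.rsample()
        zx = torch.cat([z.expand(target_x.shape[0], -1), target_x], dim=-1)
        return self.decoder_g(zx)                         # predictive mean for y_T

model = TinyNeuralProcess()
cx, cy = torch.randn(5, 1), torch.randn(5, 1)
print(model(cx, cy, torch.linspace(-2, 2, 20).unsqueeze(-1)).shape)  # torch.Size([20, 1])
```

Because the aggregation is a simple mean over per-pair representations, the forward pass costs O(n + m) as noted above, and permuting the context pairs leaves r, and hence the latent distribution, unchanged.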
3. Related work

3.1. Conditional neural processes

Neural Processes (NPs) are a generalisation of Conditional Neural Processes (CNPs, Garnelo et al. (2018)). CNPs share a large part of the motivation behind neural processes, but lack a latent variable that allows for global sampling (see Figure 2c for a diagram of the model). As a result, CNPs are unable to produce different function samples for the same context data, which can be important if modelling this uncertainty is desirable. It is worth mentioning that the original CNP formulation did include experiments with a latent variable in addition to the deterministic connection. However, given the deterministic connections to the predicted variables, the role of the global latent variable is not clear. In contrast, NPs constitute a more clear-cut generalisation of the original deterministic CNP with stronger parallels to other latent variable models and approximate Bayesian methods. These parallels allow us to compare our model to a wide range of related research areas in the following sections.

Finally, NPs and CNPs themselves can be seen as generalisations of the recently published generative query networks (GQN), which apply a similar training procedure to predict new viewpoints in 3D scenes given some context observations (Eslami et al., 2018). Consistent GQN (CGQN) is an extension of GQN that focuses on generating consistent samples and is thus also closely related to NPs (Kumar et al., 2018).

3.2. Gaussian processes

We start by considering models that, like NPs, lie on the spectrum between neural networks (NNs) and Gaussian processes (GPs). Algorithms on the NN end of the spectrum fit a single function that they learn from a very large amount of data directly. GPs on the other hand can represent a distribution over a family of functions, which is constrained by an assumption on the functional form of the covariance between two points.

Scattered across this spectrum, we can place recent research that has combined ideas from Bayesian non-parametrics with neural networks. Methods like (Calandra et al., 2016; Huang et al., 2015) remain fairly close to the GPs, but incorporate
Welling, 2017). These models learn distributions over the network weights and use the posterior of these weights to estimate the values of yT given yC. In this context NPs can be thought of as an amortised version of Bayesian DL.

3.5. Conditional latent variable models

We have covered algorithms that are conceptually similar to NPs and algorithms that carry out similar tasks to NPs. In this section we look at models of the same family as NPs: conditional latent variable models. Such models (Figure 2a) learn the conditional distribution p(yT | yC, z) where z is a latent variable that can be sampled to generate different predictions. Training this type of directed graphical model is intractable and, as with variational autoencoders (VAEs, Rezende et al. (2014); Kingma & Welling (2013)), conditional variational autoencoders (CVAEs, Sohn et al. (2015)) approximate the objective function using the variational lower bound on the log likelihood:

log p(y_t | y_c) ≥ E_{q(z_T | y_c, y_t)} [ log p(y_t | z_T, y_c) + log ( p(z_T | y_c) / q(z_T | y_c, y_t) ) ]    (11)

We refer to the latent variable of CVAEs z_T as a local latent variable in order to distinguish it from the global latent variables that are present in the models later on. We call this latent variable local as it is sampled anew for each of the output predictions y_{T,i}. This is in contrast to a global latent variable that is only sampled once and used to predict multiple values of y_t. In the CVAE, conditioning on the context is done by adding the dependence both in the prior p(z_T | y_c) and decoder p(y | z_T, y_c), so they can be considered as deterministic functions of the context.

CVAEs have been extended in a number of ways, for example by adding attention (Rezende et al., 2016). Another related extension is generative matching networks (GMNs, Bartunov & Vetrov (2016)), where the conditioning input is pre-processed in a way that is similar to the matching networks model.

A more complex version of the CVAE that is very relevant in this context is the neural statistician (NS, Edwards & Storkey (2016)). Similar to the neural process, the neural statistician contains a global latent variable z that captures global uncertainty (see Figure 2b). A crucial difference is that while NPs represent the distribution over functions, NS represents the distribution over sets. Since NS does not contain a corresponding x value for each y value, it does not capture a pair-wise relation like GPs and NPs, but rather a general distribution of the y values. Rather than generating different y values by querying the model with different x values, NS generates different y values by sampling an additional local hidden variable z_T.

The ELBO of the NS reflects the hierarchical nature of the model with a double expectation over the local and the global variable. If we leave out the local latent variable for a more direct comparison to NPs the ELBO becomes:

log p(y_t, y_c) ≥ E_{q(z | y_c, y_t)} [ log p(y_t | z) + log ( p(z) / q(z | y_c, y_t) ) ]    (12)

Notably the prior p(z) of NS is not conditioned on the context. The prior of NPs on the other hand is conditional (equation 9), which brings the training objective closer to the way the model is used at test time. Another variant of the ELBO is presented in the variational homoencoder (Hewitt et al., 2018), a model that is very similar to the neural statistician but uses a separate subset of data points for the context and predictions.

As reflected in Figure 2, the main difference between NPs and the conditional latent variable models is the lack of an x variable that allows for targeted sampling of the latent distribution. This change, despite seeming small, drastically changes the range of applications. Targeted sampling, for example, allows for generation and completion tasks (e.g. the image completion tasks) or the addition of some downstream task (like using an NP for reinforcement learning). It is worth mentioning that all of the conditional latent variable models have also been applied to few-shot classification problems, where the data space generally consists of input tuples (x, y), rather than just single outputs y. The models are able to carry this out by framing classification either as a comparison between the log likelihoods of the different classes or by looking at the KL between the different posteriors, thereby overcoming the need of working with data tuples.

4. Results

4.1. 1-D function regression

In order to test whether neural processes indeed learn to model distributions over functions we first apply them to a 1-D function regression task. The functions for this experiment are generated using a GP with varying kernel parameters for each function. At every training step we sample a set of values for the Gaussian kernel of a GP and use those to sample a function fD(x). A random number of the (x, y)C pairs are passed into the encoder of the NP as context points. We pick additional unobserved pairs (x, y)U, which we combine with the observed context points (x, y)C as targets, and feed xT to the decoder, which returns its estimate ŷT of the underlying value of yT.
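A possible version of this data generator is sketched below: kernel hyperparameters are redrawn for every curve, a function is sampled from a GP with a squared-exponential ("Gaussian") kernel, and a random subset of points is kept as context. The parameter ranges and grid are assumptions, not the settings used for the reported experiments.

```python
import numpy as np

# Sketch of the 1-D data generator: draw kernel hyperparameters, sample one curve
# from a GP with a squared-exponential kernel, and pick a random context subset.
rng = np.random.default_rng(0)

def sample_gp_curve(n_points=100):
    x = np.linspace(-2.0, 2.0, n_points)
    lengthscale = rng.uniform(0.1, 1.0)          # redrawn for every function
    variance = rng.uniform(0.1, 1.0)
    diff = x[:, None] - x[None, :]
    cov = variance * np.exp(-0.5 * (diff / lengthscale) ** 2) + 1e-6 * np.eye(n_points)
    y = rng.multivariate_normal(np.zeros(n_points), cov)
    return x, y

x, y = sample_gp_curve()
n_context = rng.integers(1, 11)                  # random number of context points
context_idx = rng.choice(len(x), size=n_context, replace=False)
print(n_context, x[context_idx][:3])
```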
Some sample curves are shown in Figure 3. For the same underlying ground truth curve (black line) we run the neural process using varying numbers of context points and generate several samples for each run (light-blue lines). As evidenced by the results, the model has learned some key properties of the 1-D curves from the data, such as continuity and the general shape of functions sampled from a GP with a Gaussian kernel. When provided with only one context point the model generates curves that fluctuate around 0, the prior of the data-generating GP. Crucially, these curves go through or near the observed context point and display a higher variance in regions where no observations are present. As the number of context points increases this uncertainty is reduced and the model's predictions better match the underlying ground truth. Given that this is a neural approximation, the curves will sometimes only approach the observation points, as opposed to going through them as is the case for GPs. On the other hand, once the model is trained it can regress more than just one dataset, i.e. it will produce sensible results for curves generated using any kernel parameters observed during training.

4.2. 2-D function regression

One of the benefits of neural processes is their functional flexibility as they can learn non-trivial 'kernels' from the data directly. In order to test this we apply NPs to a more complex regression problem. We carry out image completion as a regression task, where we provide some of the pixels as context and do pixel-wise prediction over the entire image. In this formulation the xi values would correspond to the Cartesian coordinates of each pixel and the yi values to the pixel intensity (see Figure 4 for an explanation of this). It is important to point out that we choose images as our dataset because they constitute a complex 2-D function and are easy to evaluate visually; NPs, as such, have not been designed for image generation like other specialised generative models.

We train separate models on the MNIST (LeCun et al., 1998) and the CelebA (Liu et al., 2015) datasets. As shown in Figure 4, the model performs well on both tasks. In the case of the MNIST digits the uncertainty is reflected in the variability of the generated digits. Given only a few context points, more than just one digit can fit the observations, and as a result the model produces different digits when sampled several times. As the number of context points increases the set of possible digits is reduced and the model produces the same digit, albeit with structural modifications that become smaller as the number of context points increases.

The same holds for the CelebA dataset. In this case, when provided limited context the model samples from a wider range of possible faces, and as it observes more context points it converges towards very similar looking faces. We do not expect the model to reconstruct the target image perfectly even when all the pixels are provided as context, since the latent variable z constitutes a strong bottleneck. This can be seen in the final column of the figure, where the predicted images are not only not identical to the ground truth but also vary between themselves. The latter is likely a consequence of the latent variance, which has been clipped to a small value to avoid collapsing, so even when no uncertainty is present we can generate different samples from p(z|C).
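To make the image-completion formulation of Section 4.2 concrete, the sketch below turns an image into (x, y) pairs, with x the pixel's 2-D coordinates and y its intensity, and selects a random subset of pixels as context. The coordinate normalisation and context size are illustrative choices, not the exact preprocessing used in the experiments.

```python
import numpy as np

# Image completion as regression: every pixel index becomes an input x_i (its 2-D
# coordinates) and the pixel intensity becomes the output y_i.
rng = np.random.default_rng(0)
image = rng.random((28, 28))                        # stand-in for an MNIST digit

rows, cols = np.meshgrid(np.arange(28), np.arange(28), indexing="ij")
coords = np.stack([rows, cols], axis=-1).reshape(-1, 2) / 27.0 * 2.0 - 1.0  # x_i in [-1, 1]^2
intensities = image.reshape(-1, 1)                  # y_i in [0, 1]

n_context = 50
context_idx = rng.choice(coords.shape[0], size=n_context, replace=False)
context_x, context_y = coords[context_idx], intensities[context_idx]
target_x = coords                                   # predict every pixel of the image
print(context_x.shape, context_y.shape, target_x.shape)  # (50, 2) (50, 1) (784, 2)
```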
[Figure 5: five panels, one per iteration, each plotting the interval -2 to 2 on the x-axis; legend entries: objective function, prediction, context points, new observation, next evaluation.]

Figure 5. Thompson sampling with neural processes on a 1-D objective function. The plots show the optimisation process over five iterations. Each prediction function (blue) is drawn by sampling a latent variable conditioned on an increasing number of context points (black circles). The underlying ground truth function is depicted as a black dotted line. The red triangle indicates the next evaluation point, which corresponds to the minimum value of the sampled NP curve. The red circle in the following iteration corresponds to this evaluation point with its underlying ground truth value, which serves as a new context point to the NP.
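The optimisation loop depicted in Figure 5 can be written as a few lines of Thompson sampling: at each iteration one function is sampled from the model conditioned on the current context, the objective is evaluated at that sample's minimiser, and the observation is added to the context. Since no trained NP is defined here, the sampler below is a stand-in placeholder so that the loop is runnable; with a trained model it would draw a curve by sampling z from the posterior given the context.

```python
import numpy as np

# Thompson sampling with a function-sampling model, as in Figure 5. A trained NP
# would provide `sample_posterior_function`; the random smooth sampler below is
# only a placeholder so the loop runs end to end.
rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 2.0, 200)
objective = lambda x: np.sin(3 * x) + 0.5 * x ** 2       # toy ground-truth objective

def sample_posterior_function(context_x, context_y):
    """Placeholder for drawing one function sample from the model given the context."""
    weights = rng.normal(size=4)
    return sum(w * np.cos((k + 1) * grid) for k, w in enumerate(weights))

context_x, context_y = [], []
for step in range(5):
    f_sample = sample_posterior_function(context_x, context_y)
    x_next = grid[np.argmin(f_sample)]                   # next evaluation: minimum of the sample
    context_x.append(x_next)
    context_y.append(objective(x_next))                  # observe the ground truth there
    print(f"step {step}: x={x_next:.2f}, f(x)={context_y[-1]:.2f}")
```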
Table 2. Results on the wheel bandit problem for increasing values of δ. Shown are mean and standard errors for both cumulative and simple regret (a measure of the quality of the final policy) over 100 trials. Results are normalised with respect to the performance of a uniform agent.

δ                                          0.5              0.7              0.9              0.95             0.99

Cumulative regret
Uniform                                    100.00 ± 0.08    100.00 ± 0.09    100.00 ± 0.25    100.00 ± 0.37    100.00 ± 0.78
LinGreedy (ε = 0.0)                        65.89 ± 4.90     71.71 ± 4.31     108.86 ± 3.10    102.80 ± 3.06    104.80 ± 0.91
Dropout                                    7.89 ± 1.51      9.03 ± 2.58      36.58 ± 3.62     63.12 ± 4.26     98.68 ± 1.59
LinGreedy (ε = 0.05)                       7.86 ± 0.27      9.58 ± 0.35      19.42 ± 0.78     33.06 ± 2.06     74.17 ± 1.63
Bayes by Backprop (Blundell et al., 2015)  1.37 ± 0.07      3.32 ± 0.80      34.42 ± 5.50     59.04 ± 5.59     97.38 ± 2.66
NeuralLinear                               0.95 ± 0.02      1.60 ± 0.03      4.65 ± 0.18      9.56 ± 0.36      49.63 ± 2.41
MAML (Finn et al., 2017)                   2.95 ± 0.12      3.11 ± 0.16      4.84 ± 0.22      7.01 ± 0.33      22.93 ± 1.57
Neural Processes                           1.60 ± 0.06      1.75 ± 0.05      3.31 ± 0.10      5.71 ± 0.24      22.13 ± 1.23

Simple regret
Uniform                                    100.00 ± 0.45    100.00 ± 0.78    100.00 ± 1.18    100.00 ± 2.21    100.00 ± 4.21
LinGreedy (ε = 0.0)                        66.59 ± 5.02     73.06 ± 4.55     108.56 ± 3.65    105.01 ± 3.59    105.19 ± 4.14
Dropout                                    6.57 ± 1.48      6.37 ± 2.53      35.02 ± 3.94     59.45 ± 4.74     102.12 ± 4.76
LinGreedy (ε = 0.05)                       5.53 ± 0.19      6.07 ± 0.24      8.49 ± 0.47      12.65 ± 1.12     57.62 ± 3.57
Bayes by Backprop (Blundell et al., 2015)  0.60 ± 0.09      1.45 ± 0.61      27.03 ± 6.19     56.64 ± 6.36     102.96 ± 5.93
NeuralLinear                               0.33 ± 0.04      0.79 ± 0.07      2.17 ± 0.14      4.08 ± 0.20      35.89 ± 2.98
MAML (Finn et al., 2017)                   2.49 ± 0.12      3.00 ± 0.35      4.75 ± 0.48      7.10 ± 0.77      22.89 ± 1.41
Neural Processes                           1.04 ± 0.06      1.26 ± 0.21      2.90 ± 0.35      5.45 ± 0.47      21.45 ± 1.3
References

Øksendal, B. Stochastic differential equations. In Stochastic differential equations, pp. 11. Springer, 2003.

Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Reed, S., Chen, Y., Paine, T., van den Oord, A., Eslami, S., Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot autoregressive density estimation: Towards learning to learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Rezende, D. J., Mohamed, S., Danihelka, I., Gregor, K., and Wierstra, D. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.

Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.

Sun, S., Zhang, G., Wang, C., Zeng, W., Li, J., and Grosse, R. Differentiable compositional kernel learning for Gaussian processes. arXiv preprint arXiv:1806.04326, 2018.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Wilson, A. and Nickisch, H. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775–1784, 2015.

Wilson, A. G., Knowles, D. A., and Ghahramani, Z. Gaussian process regression networks. arXiv preprint arXiv:1110.4411, 2011.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016.