
Neural Processes

Marta Garnelo 1 Jonathan Schwarz 1 Dan Rosenbaum 1 Fabio Viola 1 Danilo J. Rezende 1 S. M. Ali Eslami 1
Yee Whye Teh 1

arXiv:1807.01622v1 [cs.LG] 4 Jul 2018

Abstract

A neural network (NN) is a parameterised function that can be tuned via gradient descent to approximate a labelled collection of data with high precision. A Gaussian process (GP), on the other hand, is a probabilistic model that defines a distribution over possible functions, and is updated in light of data via the rules of probabilistic inference. GPs are probabilistic, data-efficient and flexible, however they are also computationally intensive and thus limited in their applicability. We introduce a class of neural latent variable models which we call Neural Processes (NPs), combining the best of both worlds. Like GPs, NPs define distributions over functions, are capable of rapid adaptation to new observations, and can estimate the uncertainty in their predictions. Like NNs, NPs are computationally efficient during training and evaluation but also learn to adapt their priors to data. We demonstrate the performance of NPs on a range of learning tasks, including regression and optimisation, and compare and contrast with related models in the literature.

1. Introduction

Function approximation lies at the core of numerous problems in machine learning, and one approach that has been exceptionally popular for this purpose over the past decade is deep neural networks. At a high level, neural networks constitute black-box function approximators that learn to parameterise a single function from a large number of training data points. As such, the majority of the workload of a network falls on the training phase, while the evaluation and testing phases are reduced to quick forward passes. Although high test-time performance is valuable for many real-world applications, the fact that network outputs cannot be updated after training may be undesirable. Meta-learning, for example, is an increasingly popular field of research that addresses exactly this limitation (Sutskever et al., 2014; Wang et al., 2016; Vinyals et al., 2016; Finn et al., 2017).

As an alternative to using neural networks one can also perform inference on a stochastic process in order to carry out function regression. The most common instantiation of this approach is a Gaussian process (GP), a model with complementary properties to those of neural networks: GPs do not require a costly training phase and can carry out inference about the underlying ground-truth function conditioned on some observations, which renders them very flexible at test time. In addition, GPs represent infinitely many different functions at locations that have not been observed, thereby capturing the uncertainty over their predictions given some observations. However, GPs are computationally expensive: in their original formulation they scale cubically with respect to the number of data points, and current state-of-the-art approximations still scale quadratically (Quiñonero-Candela & Rasmussen, 2005). Furthermore, the available kernels are usually restricted in their functional form, and an additional optimisation procedure is required to identify the most suitable kernel, as well as its hyperparameters, for any given task.

As a result, there is growing interest in combining aspects of neural networks and inference on stochastic processes as a potential solution to some of the downsides of both (Huang et al., 2015; Wilson et al., 2016). In this work we introduce a neural network-based formulation that learns an approximation of a stochastic process, which we term Neural Processes (NPs). NPs display some of the fundamental properties of GPs, namely they learn to model distributions over functions, are able to estimate the uncertainty over their predictions conditioned on context observations, and shift some of the workload from training to test time, which allows for model flexibility. Crucially, NPs generate predictions in a computationally efficient way. Given n context points and m target points, inference with a trained NP corresponds to a forward pass in a deep NN, which scales with O(n + m) as opposed to the O((n + m)^3) runtime of classic GPs. Furthermore, the model overcomes many functional design restrictions by learning an implicit kernel from the data directly.

1 DeepMind, London, UK. Correspondence to: Marta Garnelo <garnelo@google.com>.

Presented at the ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models.

Our main contributions are:

1. We introduce Neural Processes, a class of models that combine benefits of neural networks and stochastic processes.

2. We compare NPs to related work in meta-learning, deep latent variable models and Gaussian processes. Given that NPs are linked to many of these areas, they form a bridge for comparison between many related topics.

3. We showcase the benefits and abilities of NPs by applying them to a range of tasks including 1-D regression, real-world image completion, Bayesian optimisation and contextual bandits.

2. Model

2.1. Neural processes as stochastic processes

The standard approach to defining a stochastic process is via its finite-dimensional marginal distributions. Specifically, we consider the process as a random function F : X → Y and for each finite sequence x1:n = (x1, . . . , xn) with xi ∈ X, we define the marginal joint distribution over the function values Y1:n := (F(x1), . . . , F(xn)). For example, in the case of GPs, these joint distributions are multivariate Gaussians parameterised by a mean and a covariance function.

Given a collection of joint distributions ρx1:n we can derive two necessary conditions to be able to define a stochastic process F such that ρx1:n is the marginal distribution of (F(x1), . . . , F(xn)), for each finite sequence x1:n. These conditions are: (finite) exchangeability and consistency. As stated by the Kolmogorov Extension Theorem (Øksendal, 2003), these conditions are sufficient to define a stochastic process.

Exchangeability. This condition requires the joint distributions to be invariant to permutations of the elements in x1:n. More precisely, for each finite n, if π is a permutation of {1, . . . , n}, then:

    \rho_{x_{1:n}}(y_{1:n}) := \rho_{x_1,\ldots,x_n}(y_1,\ldots,y_n) = \rho_{x_{\pi(1)},\ldots,x_{\pi(n)}}(y_{\pi(1)},\ldots,y_{\pi(n)}) =: \rho_{\pi(x_{1:n})}(\pi(y_{1:n}))    (1)

where π(x1:n) := (xπ(1), . . . , xπ(n)) and π(y1:n) := (yπ(1), . . . , yπ(n)).

Consistency. If we marginalise out a part of the sequence, the resulting marginal distribution is the same as that defined on the original sequence. More precisely, if 1 ≤ m ≤ n, then:

    \rho_{x_{1:m}}(y_{1:m}) = \int \rho_{x_{1:n}}(y_{1:n})\, dy_{m+1:n}.    (2)

Take, for example, three different sequences x1:n, π(x1:n) and x1:m as well as their corresponding joint distributions ρx1:n, ρπ(x1:n) and ρx1:m. In order for these joint distributions to all be marginals of some higher-dimensional distribution given by the stochastic process F, they have to satisfy equations 1 and 2 above.

Given a particular instantiation of the stochastic process f, the joint distribution is defined as:

    \rho_{x_{1:n}}(y_{1:n}) = \int p(f)\, p(y_{1:n} \mid f, x_{1:n})\, df.    (3)

Here p denotes the abstract probability distribution over all random quantities. Instead of Yi = F(xi), we add some observation noise Yi ∼ N(F(xi), σ²) and define p as:

    p(y_{1:n} \mid f, x_{1:n}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2).    (4)

Inserting this into equation 3, the stochastic process is specified by:

    \rho_{x_{1:n}}(y_{1:n}) = \int p(f) \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2)\, df.    (5)

In other words, exchangeability and consistency of the collection of joint distributions {ρx1:n} imply the existence of a stochastic process F such that the observations Y1:n become iid conditional upon F. This essentially corresponds to a conditional version of de Finetti's Theorem that anchors much of Bayesian nonparametrics (De Finetti, 1937). In order to represent a stochastic process using an NP, we approximate it with a neural network, and assume that F can be parameterised by a high-dimensional random vector z, writing F(x) = g(x, z) for some fixed and learnable function g (i.e. the randomness in F is due to that of z). The generative model (Figure 1a) then follows from (5):

    p(z, y_{1:n} \mid x_{1:n}) = p(z) \prod_{i=1}^{n} \mathcal{N}(y_i \mid g(x_i, z), \sigma^2)    (6)

where, following ideas of variational auto-encoders, we assume p(z) is a multivariate standard normal, and g(xi, z) is a neural network which captures the complexities of the model.

To learn such a distribution over random functions, rather than a single function, it is essential to train the system using multiple datasets concurrently, with each dataset being a sequence of inputs x1:n and outputs y1:n, so that we can learn the variability of the random function from the variability of the datasets (see section 2.2).
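To make the generative model in equation (6) concrete, here is a minimal sketch (not the authors' code) that draws coherent function samples from a toy instance of the model: z is sampled from the standard normal prior and a small, randomly initialised MLP stands in for the decoder g. The network width, the latent dimensionality and the noise level σ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
Z_DIM, HIDDEN, SIGMA = 4, 32, 0.05

# A fixed, randomly initialised MLP standing in for the decoder g(x, z).
W1 = rng.normal(scale=1.0, size=(1 + Z_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=1.0, size=(HIDDEN, 1))
b2 = np.zeros(1)

def g(x, z):
    """Decoder g(x, z): maps inputs x of shape (n, 1) and a latent z (Z_DIM,) to means (n,)."""
    inp = np.concatenate([x, np.tile(z, (x.shape[0], 1))], axis=1)
    h = np.tanh(inp @ W1 + b1)
    return (h @ W2 + b2).ravel()

# Generative model of equation (6): z ~ N(0, I), y_i ~ N(g(x_i, z), sigma^2).
x = np.linspace(-2, 2, 50).reshape(-1, 1)
for _ in range(3):                      # each z gives one coherent function sample
    z = rng.normal(size=Z_DIM)
    y = g(x, z) + SIGMA * rng.normal(size=x.shape[0])
    print(y[:3])                        # first few values of this function draw
```

Because all n outputs share the same draw of z, each pass produces a consistent function sample rather than independent per-point predictions.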
Figure 1. Neural process model. (a) Graphical model of a neural process. x and y correspond to the data where y = f(x). C and T are the number of context points and target points respectively, and z is the global latent variable. A grey background indicates that the variable is observed. (b) Diagram of our neural process implementation. Variables in circles correspond to the variables of the graphical model in (a), variables in square boxes to the intermediate representations of NPs, and unbound, bold letters to the following computation modules: h - encoder, a - aggregator and g - decoder. In our implementation h and g correspond to neural networks and a to the mean function. The continuous lines depict the generative process, the dotted lines the inference.

Since the decoder g is non-linear, we can use amortised variational inference to learn it. Let q(z | x1:n, y1:n) be a variational posterior of the latent variable z, parameterised by another neural network that is invariant to permutations of the sequences x1:n, y1:n. Then the evidence lower bound (ELBO) is given by:

    \log p(y_{1:n} \mid x_{1:n}) \geq \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})}\left[ \sum_{i=1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z)}{q(z \mid x_{1:n}, y_{1:n})} \right]    (7)

In an alternative objective that better reflects the desired model behaviour at test time, we split the dataset into a context set, x1:m, y1:m, and a target set, xm+1:n, ym+1:n, and model the conditional of the target given the context. This gives:

    \log p(y_{m+1:n} \mid x_{1:n}, y_{1:m}) \geq \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})}\left[ \sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{1:n}, y_{1:n})} \right]    (8)

Note that in the above the conditional prior p(z | x1:m, y1:m) is intractable. We can approximate it using the variational posterior q(z | x1:m, y1:m), which gives:

    \log p(y_{m+1:n} \mid x_{1:n}, y_{1:m}) \geq \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})}\left[ \sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{q(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{1:n}, y_{1:n})} \right]    (9)

2.2. Distributions over functions

A key motivation for NPs is the ability to represent a distribution over functions rather than a single function. In order to train such a model we need a training procedure that reflects this task.

More formally, to train an NP we form a dataset that consists of functions f : X → Y that are sampled from some underlying distribution D. As an illustrating example, consider a dataset consisting of functions fd(x) ∼ GP that have been generated using a Gaussian process with a fixed kernel. For each of the functions fd(x) our dataset contains a number of (x, y)i tuples where yi = fd(xi). For training purposes we divide these points into a set of n context points C = {(x, y)i}_{i=1}^{n} and a set of n + m target points T = {(x, y)i}_{i=1}^{n+m}, which consists of all points in C as well as m additional unobserved points. During testing the model is presented with some context C and has to predict the target values yT = f(xT) at target positions xT.

In order to be able to predict accurately across the entire dataset a model needs to learn a distribution that covers all of the functions observed in training and be able to take into account the context data at test time.
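To illustrate how the objective in equation (9) is assembled during training, the sketch below (a toy illustration, not the authors' implementation) splits one sampled dataset into context and target points, forms the two approximate posteriors, and computes a single-sample estimate of the bound. The stand-in posterior and decoder, the latent dimensionality and the noise level are all assumptions made so the snippet runs on its own; in a real NP these quantities come from the learned encoder, aggregator and decoder.

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 0.1          # assumed observation noise
Z_DIM = 3            # assumed latent dimensionality

def toy_posterior(x, y):
    """Hypothetical stand-in for the amortised posterior q(z | x, y).

    A real NP would compute (mu, sigma) with its encoder and aggregator;
    here simple data statistics are used so the sketch runs on its own."""
    stats = np.array([x.mean(), y.mean(), y.std() + 1e-3])
    mu = np.tanh(stats)[:Z_DIM]
    sigma = 0.1 + 0.9 / (1.0 + len(x))      # tighter with more observed points
    return mu, np.full(Z_DIM, sigma)

def toy_decoder(x, z):
    """Hypothetical stand-in for g(x, z); returns predicted means for y."""
    return z[0] * np.sin(x) + z[1] * x + z[2]

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ) for diagonal Gaussians."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

# One training step on one sampled dataset (x, y): random context/target split.
x = np.sort(rng.uniform(-2, 2, size=20))
y = np.sin(x) + 0.1 * rng.normal(size=20)
m = rng.integers(3, 10)                          # number of context points
ctx, tgt = np.arange(m), np.arange(m, 20)

mu_all, sig_all = toy_posterior(x, y)            # q(z | context and targets)
mu_ctx, sig_ctx = toy_posterior(x[ctx], y[ctx])  # q(z | context only)

z = mu_all + sig_all * rng.normal(size=Z_DIM)    # single-sample estimate
pred = toy_decoder(x[tgt], z)
log_lik = np.sum(-0.5 * ((y[tgt] - pred) / SIGMA) ** 2
                 - np.log(SIGMA * np.sqrt(2 * np.pi)))
elbo = log_lik - gaussian_kl(mu_all, sig_all, mu_ctx, sig_ctx)
print(f"single-sample estimate of the bound in eq. (9): {elbo:.2f}")
```

In practice this estimate is averaged over a batch of datasets and maximised with respect to the network parameters by stochastic gradient ascent.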
Figure 2. Graphical models of related models (a-c) and of the neural process (d): (a) conditional VAE, (b) neural statistician, (c) conditional neural process, (d) neural process. Grey shading indicates the variable is observed. C stands for context variables and T for target variables, i.e. the variables to predict given C.

2.3. Global latent variable

As mentioned above, neural processes include a latent variable z that captures F. This latent variable is of particular interest because it captures the global uncertainty, which allows us to sample at a global level – one function fd at a time – rather than at a local output level – one yi value for each xi at a time (independently of the remaining yT).

In addition, since we are passing all of the context's information through this single variable, we can formulate the model in a Bayesian framework. In the absence of context points C, the latent distribution p(z) would correspond to a data-specific prior the model has learned during training. As we add observations, the latent distribution encoded by the model amounts to the posterior p(z|C) over the function given the context. On top of this, as shown in equation 9, instead of using a zero-information prior p(z), we condition the prior on the context. As such this prior is equivalent to a less informed posterior of the underlying function. This formulation makes it clear that the posterior given a subset of the context points will serve as the prior when additional context points are included. By using this setup, and training with different sizes of context, we encourage the learned model to be flexible with regards to the number and position of the context points.

2.4. The Neural process model

In our implementation of NPs we accommodate two additional desiderata: invariance to the order of context points and computational efficiency. The resulting model can be boiled down to three core components (see Figure 1b), sketched in code after the list below:

• An encoder h from input space into representation space that takes in pairs of (x, y)i context values and produces a representation ri = h((x, y)i) for each of the pairs. We parameterise h as a neural network.

• An aggregator a that summarises the encoded inputs. We are interested in obtaining a single order-invariant global representation r that parameterises the latent distribution z ∼ N(µ(r), Iσ(r)). The simplest operation that ensures order-invariance and works well in practice is the mean function r = a(ri) = (1/n) Σ_{i=1}^{n} ri. Crucially, the aggregator reduces the runtime to O(n + m), where n and m are the number of context and target points respectively.

• A conditional decoder g that takes as input the sampled global latent variable z as well as the new target locations xT, and outputs the predictions ŷT for the corresponding values of f(xT) = yT.
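The three components map directly onto a small amount of code. The following is a minimal sketch with randomly initialised (untrained) weights; the layer sizes, the single hidden layer per module and the exponential parameterisation of σ(r) are illustrative assumptions rather than the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes (assumptions): representation, latent and hidden widths.
R_DIM, Z_DIM, HIDDEN = 8, 4, 16

def mlp_params(in_dim, out_dim):
    return (rng.normal(scale=0.5, size=(in_dim, HIDDEN)),
            rng.normal(scale=0.5, size=(HIDDEN, out_dim)))

Wh1, Wh2 = mlp_params(2, R_DIM)          # encoder h over (x, y) pairs
Wz1, Wz2 = mlp_params(R_DIM, 2 * Z_DIM)  # maps r to (mu(r), log sigma(r))
Wg1, Wg2 = mlp_params(1 + Z_DIM, 1)      # decoder g over (x_target, z)

def h(xy):                               # encoder: r_i = h((x, y)_i)
    return np.tanh(xy @ Wh1) @ Wh2

def a(r):                                # aggregator: order-invariant mean
    return r.mean(axis=0)

def g(x_target, z):                      # decoder: predictions at target inputs
    inp = np.concatenate([x_target, np.tile(z, (len(x_target), 1))], axis=1)
    return (np.tanh(inp @ Wg1) @ Wg2).ravel()

def np_predict(x_ctx, y_ctx, x_target, n_samples=3):
    """One forward pass of the three components: encode, aggregate, sample z, decode."""
    r = a(h(np.stack([x_ctx, y_ctx], axis=1)))          # global representation r
    stats = np.tanh(r @ Wz1) @ Wz2
    mu, sigma = stats[:Z_DIM], np.exp(stats[Z_DIM:])    # z ~ N(mu(r), I sigma(r))
    zs = mu + sigma * rng.normal(size=(n_samples, Z_DIM))
    return np.stack([g(x_target.reshape(-1, 1), z) for z in zs])

x_ctx = np.array([-1.0, 0.0, 1.5])
y_ctx = np.sin(x_ctx)
samples = np_predict(x_ctx, y_ctx, np.linspace(-2, 2, 5))
print(samples.shape)   # (3 function samples, 5 target locations)
```

Because the context only enters through the mean of the ri, the cost of a forward pass grows linearly with the number of context and target points, which is the O(n + m) scaling noted above.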
3. Related work

3.1. Conditional neural processes

Neural Processes (NPs) are a generalisation of Conditional Neural Processes (CNPs, Garnelo et al. (2018)). CNPs share a large part of the motivation behind neural processes, but lack a latent variable that allows for global sampling (see Figure 2c for a diagram of the model). As a result, CNPs are unable to produce different function samples for the same context data, which can be important if modelling this uncertainty is desirable. It is worth mentioning that the original CNP formulation did include experiments with a latent variable in addition to the deterministic connection. However, given the deterministic connections to the predicted variables, the role of the global latent variable is not clear. In contrast, NPs constitute a more clear-cut generalisation of the original deterministic CNP with stronger parallels to other latent variable models and approximate Bayesian methods. These parallels allow us to compare our model to a wide range of related research areas in the following sections.

Finally, NPs and CNPs themselves can be seen as generalisations of the recently published generative query networks (GQN), which apply a similar training procedure to predict new viewpoints in 3D scenes given some context observations (Eslami et al., 2018). Consistent GQN (CGQN) is an extension of GQN that focuses on generating consistent samples and is thus also closely related to NPs (Kumar et al., 2018).

3.2. Gaussian processes

We start by considering models that, like NPs, lie on the spectrum between neural networks (NNs) and Gaussian processes (GPs). Algorithms on the NN end of the spectrum fit a single function that they learn from a very large amount of data directly. GPs, on the other hand, can represent a distribution over a family of functions, which is constrained by an assumption on the functional form of the covariance between two points.

Scattered across this spectrum, we can place recent research that has combined ideas from Bayesian non-parametrics with neural networks. Methods like (Calandra et al., 2016; Huang et al., 2015) remain fairly close to GPs, but incorporate NNs to pre-process the input data.
Deep GPs have some conceptual similarity to NNs as they stack GPs to obtain deep models (Damianou & Lawrence, 2013). Approaches that are more similar to NNs include, for example, neural networks whose weights are sampled using a GP (Wilson et al., 2011) or networks where each unit represents a different kernel (Sun et al., 2018).

There are two models on this spectrum that are closely related to NPs: matching networks (MN, Vinyals et al. (2016)) and deep kernel learning (DKL, Wilson et al. (2016)). As with NPs, both use NNs to extract representations from the data, but while NPs learn the 'kernel' to compare data points implicitly, these other two models pass the representation to an explicit distance kernel. MNs use this kernel to measure the similarity between contexts and targets for few-shot classification, while the kernel in DKL is used to parametrise a GP. Because of this explicit kernel the computational complexity of MNs and DKL would be quadratic and cubic respectively, instead of O(n + m) as it is for NPs. To overcome this computational complexity DKL replaces a standard GP with a kernel approximation given by a KISS GP (Wilson & Nickisch, 2015), while prototypical networks (Snell et al., 2017) are introduced as a more light-weight version of MNs that also scales with O(n + m).

Finally, concurrent work by Ma et al. introduces variational implicit processes, which share a large part of the motivation of NPs but are implemented as GPs (Ma et al., 2018). In this context NPs can be interpreted as a neural implementation of an implicit stochastic process.

On this spectrum from NNs to GPs, neural processes remain closer to the neural end than most of the models mentioned above. By giving up on the explicit definition of a kernel, NPs lose some of the mathematical guarantees of GPs, but trade this off for data-driven 'priors' and computational efficiency.

Figure 3. 1-D function regression. The plots show samples of curves conditioned on an increasing number of context points (1, 10 and 100 in each row respectively). The true underlying curve is shown in black and the context points as black circles. Each column corresponds to a different example. With only one observation the variance away from that context point is high between the sampled curves. As the number of context points increases the sampled curves increasingly resemble the ground truth curve and the overall variance is reduced.

3.3. Meta-learning

In contemporary meta-learning vocabulary, NPs and GPs can be seen as methods for 'few-shot function estimation'. In this section we compare with related models that can be used to the same end. A prominent example is matching networks (Vinyals et al., 2016), but there is a large literature of similar models for classification (Koch et al., 2015; Santoro et al., 2016), reinforcement learning (Wang et al., 2016), parameter update (Finn et al., 2017; 2018), natural language processing (Bowman et al., 2015) and program induction (Devlin et al., 2017). Related are generative meta-learning approaches that carry out few-shot estimation of the data densities (van den Oord et al., 2016; Reed et al., 2017; Bornschein et al., 2017; Rezende et al., 2016).

Meta-learning models share the fundamental motivations of NPs as they shift workload from training time to test time. NPs can therefore be described as meta-learning algorithms for few-shot function regression, although as shown in Garnelo et al. (2018) they can also be applied to few-shot learning tasks beyond regression.

3.4. Bayesian methods

The link between meta-learning methods and other research areas, like GPs and Bayesian methods, is not always evident. Interestingly, recent work by Grant et al. (2018) points out the relation between model-agnostic meta-learning (MAML, Finn et al. (2017)) and hierarchical Bayesian inference. In this work the meta-learning properties of MAML are formulated as a result of task-specific variables that are conditionally independent given a higher-level variable. This hierarchical description can be rewritten as a probabilistic inference problem, and the resulting marginal likelihood p(y|C) matches the original MAML objective. The parallels between NPs and hierarchical Bayesian methods are similarly straightforward. Given the graphical model in Figure 2d we can write out the conditional marginal likelihood as a hierarchical inference problem:

    \log p(y_t \mid C, x_t) = \log \int p(y_t \mid z, x_t)\, p(z \mid C)\, dz    (10)
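Equation (10) is intractable for a neural decoder, but it suggests the simple Monte Carlo approximation that can be used at test time: draw several values of z given the context and average the resulting likelihoods of the target. The sketch below assumes hypothetical stand-ins for the context posterior and the decoder (a trained NP would use its encoder and decoder networks) and uses a log-sum-exp for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(3)
SIGMA, Z_DIM, S = 0.1, 2, 100            # assumed noise, latent size, sample count

def posterior_given_context(x_c, y_c):
    """Hypothetical stand-in for p(z | C); an NP would use its learned encoder."""
    mu = np.array([y_c.mean(), y_c.std()])
    return mu, np.full(Z_DIM, 1.0 / (1.0 + len(x_c)))

def decoder(x, z):
    """Hypothetical stand-in for the mean of p(y_t | z, x_t)."""
    return z[0] + z[1] * np.sin(x)

# Context and a single target point.
x_c = np.array([-1.0, 0.0, 1.0]); y_c = np.sin(x_c)
x_t, y_t = 0.5, np.sin(0.5)

mu, sig = posterior_given_context(x_c, y_c)
zs = mu + sig * rng.normal(size=(S, Z_DIM))                  # z_s ~ p(z | C)
log_p = (-0.5 * ((y_t - np.array([decoder(x_t, z) for z in zs])) / SIGMA) ** 2
         - np.log(SIGMA * np.sqrt(2 * np.pi)))
# log p(y_t | C, x_t) ~= log (1/S) sum_s p(y_t | z_s, x_t), via log-sum-exp.
estimate = np.logaddexp.reduce(log_p) - np.log(S)
print(f"Monte Carlo estimate of eq. (10): {estimate:.2f}")
```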
Another interesting area that connects Bayesian methods and NNs is Bayesian neural networks (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). These models learn distributions over the network weights and use the posterior of these weights to estimate the values of yT given yC. In this context NPs can be thought of as an amortised version of Bayesian deep learning.

3.5. Conditional latent variable models

We have covered algorithms that are conceptually similar to NPs and algorithms that carry out similar tasks to NPs. In this section we look at models of the same family as NPs: conditional latent variable models. Such models (Figure 2a) learn the conditional distribution p(yT | yC, z) where z is a latent variable that can be sampled to generate different predictions. Training this type of directed graphical model is intractable, and as with variational autoencoders (VAEs, Rezende et al. (2014); Kingma & Welling (2013)), conditional variational autoencoders (CVAEs, Sohn et al. (2015)) approximate the objective function using the variational lower bound on the log likelihood:

    \log p(y_t \mid y_c) \geq \mathbb{E}_{q(z_T \mid y_c, y_t)}\left[ \log p(y_t \mid z_T, y_c) + \log \frac{p(z_T \mid y_c)}{q(z_T \mid y_c, y_t)} \right]    (11)

We refer to the latent variable of CVAEs zT as a local latent variable in order to distinguish it from the global latent variables that are present in the models later on. We call this latent variable local as it is sampled anew for each of the output predictions yT,i. This is in contrast to a global latent variable that is only sampled once and used to predict multiple values of yt. In the CVAE, conditioning on the context is done by adding the dependence both in the prior p(zT | yc) and the decoder p(y | zT, yc), so they can be considered as deterministic functions of the context.

CVAEs have been extended in a number of ways, for example by adding attention (Rezende et al., 2016). Another related extension is generative matching networks (GMNs, Bartunov & Vetrov (2016)), where the conditioning input is pre-processed in a way that is similar to the matching networks model.

A more complex version of the CVAE that is very relevant in this context is the neural statistician (NS, Edwards & Storkey (2016)). Similar to the neural process, the neural statistician contains a global latent variable z that captures global uncertainty (see Figure 2b). A crucial difference is that while NPs represent the distribution over functions, NS represents the distribution over sets. Since NS does not contain a corresponding x value for each y value, it does not capture a pair-wise relation like GPs and NPs, but rather a general distribution of the y values. Rather than generating different y values by querying the model with different x values, NS generates different y values by sampling an additional local hidden variable zT.

The ELBO of the NS reflects the hierarchical nature of the model with a double expectation over the local and the global variable. If we leave out the local latent variable for a more direct comparison to NPs, the ELBO becomes:

    \log p(y_t, y_c) \geq \mathbb{E}_{q(z \mid y_c, y_t)}\left[ \log p(y_t \mid z) + \log \frac{p(z)}{q(z \mid y_c, y_t)} \right]    (12)

Notably, the prior p(z) of NS is not conditioned on the context. The prior of NPs, on the other hand, is conditional (equation 9), which brings the training objective closer to the way the model is used at test time. Another variant of the ELBO is presented in the variational homoencoder (Hewitt et al., 2018), a model that is very similar to the neural statistician but uses a separate subset of data points for the context and predictions.

As reflected in Figure 2, the main difference between NPs and the conditional latent variable models is the lack of an x variable that allows for targeted sampling of the latent distribution. This change, despite seeming small, drastically changes the range of applications. Targeted sampling, for example, allows for generation and completion tasks (e.g. the image completion tasks) or the addition of some downstream task (like using an NP for reinforcement learning). It is worth mentioning that all of the conditional latent variable models have also been applied to few-shot classification problems, where the data space generally consists of input tuples (x, y), rather than just single outputs y. The models are able to carry this out by framing classification either as a comparison between the log likelihoods of the different classes or by looking at the KL between the different posteriors, thereby overcoming the need of working with data tuples.

4. Results

4.1. 1-D function regression

In order to test whether neural processes indeed learn to model distributions over functions we first apply them to a 1-D function regression task. The functions for this experiment are generated using a GP with varying kernel parameters for each function. At every training step we sample a set of values for the Gaussian kernel of a GP and use those to sample a function fD(x). A random number of the (x, y)C pairs are passed into the encoder of the NP as context points. We pick additional unobserved pairs (x, y)U, which we combine with the observed context points (x, y)C as targets, and feed xT to the decoder, which returns its estimate ŷT of the underlying value of yT.
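As a sketch of how such training data can be generated, the snippet below draws random hyperparameters for a squared-exponential kernel, samples one function from the corresponding GP prior, and splits the points into context and target sets. The hyperparameter ranges and the maximum context size are illustrative assumptions; the paper does not specify the exact values used.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_gp_dataset(n_points=100, noise=1e-6):
    """Sample one function from a GP prior with random kernel hyperparameters."""
    lengthscale = rng.uniform(0.3, 1.5)       # illustrative ranges
    variance = rng.uniform(0.5, 2.0)
    x = rng.uniform(-2, 2, size=n_points)
    d = x[:, None] - x[None, :]
    K = variance * np.exp(-0.5 * (d / lengthscale) ** 2) + noise * np.eye(n_points)
    y = np.linalg.cholesky(K) @ rng.normal(size=n_points)
    return x, y

def context_target_split(x, y, max_context=10):
    """Random context set; targets are the context plus the remaining unobserved points."""
    n_ctx = rng.integers(1, max_context + 1)
    idx = rng.permutation(len(x))
    ctx = idx[:n_ctx]
    tgt = idx                       # targets include the context points, as in Section 2.2
    return (x[ctx], y[ctx]), (x[tgt], y[tgt])

x, y = sample_gp_dataset()
context, target = context_target_split(x, y)
print(len(context[0]), "context points,", len(target[0]), "target points")
```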
Figure 4. Pixel-wise regression on MNIST and CelebA. The diagram on the left visualises how pixel-wise image completion can be framed as a 2-D regression task where f(pixel coordinates) = pixel brightness. The figures to the right of the diagram show the results on image completion for MNIST and CelebA, for an increasing number of context points (10, 100, 300 and 784 pixels for MNIST; 15, 30, 90 and 1024 pixels for CelebA). The images on the top correspond to the context points provided to the model. For better clarity the unobserved pixels have been coloured blue for the MNIST images and white for CelebA. Each of the rows corresponds to a different sample given the context points. As the number of context points increases the predicted pixels get closer to the underlying ones and the variance across samples decreases.

Some sample curves are shown in Figure 3. For the same underlying ground truth curve (black line) we run the neural process using varying numbers of context points and generate several samples for each run (light-blue lines). As evidenced by the results, the model has learned some key properties of the 1-D curves from the data, such as continuity and the general shape of functions sampled from a GP with a Gaussian kernel. When provided with only one context point the model generates curves that fluctuate around 0, the prior of the data-generating GP. Crucially, these curves go through or near the observed context point and display a higher variance in regions where no observations are present. As the number of context points increases this uncertainty is reduced and the model's predictions better match the underlying ground truth. Given that this is a neural approximation, the curves will sometimes only approach the observation points, as opposed to passing through them as GP samples do. On the other hand, once the model is trained it can regress more than just one dataset, i.e. it will produce sensible results for curves generated using any kernel parameters observed during training.

4.2. 2-D function regression

One of the benefits of neural processes is their functional flexibility, as they can learn non-trivial 'kernels' from the data directly. In order to test this we apply NPs to a more complex regression problem. We carry out image completion as a regression task, where we provide some of the pixels as context and do pixel-wise prediction over the entire image. In this formulation the xi values correspond to the Cartesian coordinates of each pixel and the yi values to the pixel intensity (see Figure 4 for an explanation of this). We choose images as our dataset because they constitute a complex 2-D function and are easy to evaluate visually. It is important to point out, however, that NPs as such have not been designed for image generation like other specialised generative models.
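A minimal sketch of this framing is given below: it converts a generic greyscale image array into (x, y) pairs of normalised pixel coordinates and intensities and picks a random subset as context. The use of a random array as a stand-in for an MNIST digit and the normalisation choices are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

def image_to_regression_data(image):
    """Turn an image into pairs (x_i, y_i): x_i = 2-D pixel coordinates in [0, 1],
    y_i = pixel intensity in [0, 1]."""
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    x = np.stack([rows.ravel() / (h - 1), cols.ravel() / (w - 1)], axis=1)
    y = image.ravel().astype(np.float64) / 255.0
    return x, y

def random_context(x, y, n_context):
    idx = rng.choice(len(x), size=n_context, replace=False)
    return x[idx], y[idx]

image = rng.integers(0, 256, size=(28, 28))    # stand-in for an MNIST digit
x, y = image_to_regression_data(image)
x_ctx, y_ctx = random_context(x, y, n_context=100)
print(x.shape, y.shape, x_ctx.shape)           # (784, 2) (784,) (100, 2)
```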
We train separate models on the MNIST (LeCun et al., 1998) and the CelebA (Liu et al., 2015) datasets. As shown in Figure 4 the model performs well on both tasks. In the case of the MNIST digits the uncertainty is reflected in the variability of the generated digit. Given only a few context points, more than just one digit can fit the observations, and as a result the model produces different digits when sampled several times. As the number of context points increases the set of possible digits is reduced and the model produces the same digit, albeit with structural modifications that become smaller as the number of context points increases.

The same holds for the CelebA dataset. In this case, when provided limited context the model samples from a wider range of possible faces, and as it observes more context points it converges towards very similar looking faces. We do not expect the model to reconstruct the target image perfectly even when all the pixels are provided as context, since the latent variable z constitutes a strong bottleneck. This can be seen in the final column of the figure, where the predicted images are not only not identical to the ground truth but also vary between themselves. The latter is likely a consequence of the latent variance, which has been clipped to a small value to avoid collapse, so even when no uncertainty is present we can generate different samples from p(z|C).
Figure 5. Thompson sampling with neural processes on a 1-D objective function. The plots show the optimisation process over five iterations. Each prediction function (blue) is drawn by sampling a latent variable conditioned on an increasing number of context points (black circles). The underlying ground truth function is depicted as a black dotted line. The red triangle indicates the next evaluation point, which corresponds to the minimum value of the sampled NP curve. The red circle in the following iteration corresponds to this evaluation point with its underlying ground truth value, which serves as a new context point to the NP.

4.3. Black-box optimisation with Thompson sampling

To showcase the utility of sampling entire consistent trajectories, we apply neural processes to Bayesian optimisation on a 1-D function using Thompson sampling (Thompson, 1933). Thompson sampling (also known as randomised probability matching) is an approach to tackle the exploration-exploitation dilemma by maintaining a posterior distribution over model parameters. A decision is taken by drawing a sample of model parameters and acting greedily under the resulting policy. The posterior distribution is then updated and the process is repeated. Despite its simplicity, Thompson sampling has been shown to be highly effective both empirically and in theory. It is commonly applied to black-box optimisation and multi-armed bandit problems (e.g. Agrawal & Goyal, 2012; Shahriari et al., 2016).

Neural processes lend themselves naturally to Thompson sampling: instead of sampling parameters, we draw a function over the space of interest, find its minimum and add the observed outcome to the context set for the next iteration. As shown in Figure 3, function draws show high variance when only a few observations are available, modelling uncertainty in a similar way to draws from a posterior over parameters given a small data set. An example of this procedure for neural processes on a 1-D objective function is shown in Figure 5.
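Schematically, the procedure amounts to the loop below. The NP function sampler is a hypothetical stand-in (a noisy interpolation of the current context) so that the sketch is self-contained; a real implementation would sample z conditioned on the context and decode it over a grid of candidate locations.

```python
import numpy as np

rng = np.random.default_rng(6)

def objective(x):                       # black-box function to be minimised
    return np.sin(3 * x) + 0.5 * x

def sample_np_function(x_grid, context):
    """Hypothetical stand-in for one NP function draw conditioned on the context.

    A real NP would sample z ~ q(z | context) and return g(x_grid, z); here the
    draw just interpolates the context with added noise so the loop runs."""
    noise = rng.normal(scale=1.0 / (1 + len(context)), size=len(x_grid))
    if not context:
        return noise
    xs, ys = map(np.array, zip(*context))
    return np.interp(x_grid, np.sort(xs), ys[np.argsort(xs)]) + noise

x_grid = np.linspace(-2, 2, 200)
context = []                            # observed (x, f(x)) pairs so far
for step in range(5):
    draw = sample_np_function(x_grid, context)   # Thompson sample of the function
    x_next = x_grid[np.argmin(draw)]             # act greedily on the sampled draw
    context.append((x_next, objective(x_next)))  # evaluate and grow the context
    print(f"step {step}: evaluated x = {x_next:+.2f}")
```

The only model-specific part is the function draw; the surrounding loop is standard Thompson sampling.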
We report the average number of steps required by an NP to reach the global minimum of a function generated from a GP prior in Table 1. For an easier comparison the values are normalised by the number of steps required when doing optimisation using random search. On average, NPs take four times fewer iterations than random search on this task. An upper bound on performance is given by a Gaussian process with the same kernel as the GP that generated the function to be optimised. NPs do not reach this optimal performance, as their samples are noisier than those of a GP, but they are faster to evaluate since merely a forward pass through the network is needed. This difference in computational speed is bound to become more notable as the dimensionality of the problem and the number of necessary function evaluations increase.

    Neural process    Gaussian process    Random search
    0.26              0.14                1.00

Table 1. Bayesian optimisation using Thompson sampling. Average number of optimisation steps needed to reach the global minimum of a 1-D function generated by a Gaussian process. The values are normalised by the number of steps taken using random search. The performance of the Gaussian process with the correct kernel constitutes an upper bound on performance.

4.4. Contextual bandits

Finally, we apply neural processes to the wheel bandit problem introduced in Riquelme et al. (2018), which constitutes a contextual bandit task on the unit circle with varying needs for exploration that can be smoothly parameterised. The problem can be summarised as follows (see Figure 6 for clarity): a unit circle is divided into a low-reward region (blue area) and four high-reward regions (the other four coloured areas). The size of the low-reward region is defined by a scalar δ. At every episode a different value for δ is selected. The agent is then provided with some coordinates X = (X1, X2) within the circle and has to choose among k = 5 arms depending on the area the coordinates fall into. If ||X|| ≤ δ, the sample falls within the low-reward region (blue). In this case k = 1 is the optimal action, as it provides a reward drawn from r ∼ N(1.2, 0.01²), while all other actions only return r ∼ N(1.0, 0.01²). If the sample falls within any of the four high-reward regions (||X|| > δ), the optimal arm will be any of the remaining four, k = 2−5, depending on the specific area. Pulling the optimal arm here results in a high reward r ∼ N(50.0, 0.01²), and as before all other arms receive N(1.0, 0.01²), except for arm k = 1 which again returns N(1.2, 0.01²).
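For concreteness, the reward structure described above can be written down directly. The sketch below restates the construction as summarised in the text; the particular mapping from the four outer quadrants to arms 2-5 is an assumption for illustration, since the text only states that one of those arms is optimal in each region.

```python
import numpy as np

rng = np.random.default_rng(7)

def wheel_reward(X, arm, delta):
    """Reward for pulling `arm` (1..5) at context X = (X1, X2) on the wheel problem.

    Inside the low-reward disc (||X|| <= delta) arm 1 is optimal (mean 1.2);
    outside it, one arm among 2-5 pays mean 50 and the rest pay mean 1.0."""
    if np.linalg.norm(X) <= delta:
        mean = 1.2 if arm == 1 else 1.0
    else:
        # Assumption: arms 2-5 correspond to the four quadrants.
        optimal = 2 + (X[0] < 0) * 1 + (X[1] < 0) * 2
        if arm == optimal:
            mean = 50.0
        elif arm == 1:
            mean = 1.2
        else:
            mean = 1.0
    return rng.normal(mean, 0.01)

delta = rng.uniform(0, 1)                                   # a new delta per episode
theta, radius = rng.uniform(0, 2 * np.pi), np.sqrt(rng.uniform(0, 1))
X = radius * np.array([np.cos(theta), np.sin(theta)])       # uniform point in the circle
print([round(wheel_reward(X, arm, delta), 2) for arm in range(1, 6)])
```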
Table 2. Results on the wheel bandit problem for increasing values of δ. Shown are mean and standard errors for both cumulative and simple regret (a measure of the quality of the final policy) over 100 trials. Results are normalised with respect to the performance of a uniform agent.

Cumulative regret
δ                                          0.5             0.7             0.9             0.95            0.99
Uniform                                    100.00 ± 0.08   100.00 ± 0.09   100.00 ± 0.25   100.00 ± 0.37   100.00 ± 0.78
LinGreedy (ε = 0.0)                        65.89 ± 4.90    71.71 ± 4.31    108.86 ± 3.10   102.80 ± 3.06   104.80 ± 0.91
Dropout                                    7.89 ± 1.51     9.03 ± 2.58     36.58 ± 3.62    63.12 ± 4.26    98.68 ± 1.59
LinGreedy (ε = 0.05)                       7.86 ± 0.27     9.58 ± 0.35     19.42 ± 0.78    33.06 ± 2.06    74.17 ± 1.63
Bayes by Backprop (Blundell et al., 2015)  1.37 ± 0.07     3.32 ± 0.80     34.42 ± 5.50    59.04 ± 5.59    97.38 ± 2.66
NeuralLinear                               0.95 ± 0.02     1.60 ± 0.03     4.65 ± 0.18     9.56 ± 0.36     49.63 ± 2.41
MAML (Finn et al., 2017)                   2.95 ± 0.12     3.11 ± 0.16     4.84 ± 0.22     7.01 ± 0.33     22.93 ± 1.57
Neural Processes                           1.60 ± 0.06     1.75 ± 0.05     3.31 ± 0.10     5.71 ± 0.24     22.13 ± 1.23

Simple regret
δ                                          0.5             0.7             0.9             0.95            0.99
Uniform                                    100.00 ± 0.45   100.00 ± 0.78   100.00 ± 1.18   100.00 ± 2.21   100.00 ± 4.21
LinGreedy (ε = 0.0)                        66.59 ± 5.02    73.06 ± 4.55    108.56 ± 3.65   105.01 ± 3.59   105.19 ± 4.14
Dropout                                    6.57 ± 1.48     6.37 ± 2.53     35.02 ± 3.94    59.45 ± 4.74    102.12 ± 4.76
LinGreedy (ε = 0.05)                       5.53 ± 0.19     6.07 ± 0.24     8.49 ± 0.47     12.65 ± 1.12    57.62 ± 3.57
Bayes by Backprop (Blundell et al., 2015)  0.60 ± 0.09     1.45 ± 0.61     27.03 ± 6.19    56.64 ± 6.36    102.96 ± 5.93
NeuralLinear                               0.33 ± 0.04     0.79 ± 0.07     2.17 ± 0.14     4.08 ± 0.20     35.89 ± 2.98
MAML (Finn et al., 2017)                   2.49 ± 0.12     3.00 ± 0.35     4.75 ± 0.48     7.10 ± 0.77     22.89 ± 1.41
Neural Processes                           1.04 ± 0.06     1.26 ± 0.21     2.90 ± 0.35     5.45 ± 0.47     21.45 ± 1.3

We compare our model to a large range of methods that can be used for Thompson sampling, taking results from Riquelme et al. (2018), who kindly agreed to share the experiment and evaluation code with us. Neural Processes can be applied to this problem by training on a distribution of tasks before applying the method. Since the methods described in Riquelme et al. (2018) do not require such a pre-training phase, we also include model-agnostic meta-learning (MAML, Finn et al. (2017)), a method relying on a similar pre-training phase, using code made available by the authors. For both NPs and MAML, we create a batch for pre-training by first sampling M different wheel problems {δ_i}_{i=1}^{M}, δ_i ∼ U(0, 1), followed by sampling tuples {(X, a, r)_j}_{j=1}^{N} for context X, arm a and associated reward r for each δ_i. We set M = 64, N = 562, using 512 context and 50 target points for Neural Processes, and an equal amount of data points for the meta- and inner-updates in MAML. Note that since gradient steps are necessary for MAML to adapt to data from each test problem, we reset the parameters after each evaluation run. This additional step is not necessary for neural processes.

Table 2 shows the quantitative evaluation on this task. We observe that Neural Processes are a highly competitive method, performing similarly to MAML and the NeuralLinear baseline in Riquelme et al. (2018), which is consistently among the best out of the 20 algorithms compared.

Figure 6. The wheel bandit problem with varying values of δ.

5. Discussion

We introduce Neural Processes, a family of models that combines the benefits of stochastic processes and neural networks. NPs learn to represent distributions over functions and make flexible predictions at test time conditioned on some context input. Instead of requiring a handcrafted kernel, NPs learn an implicit measure from the data directly. We apply NPs to a range of regression tasks to showcase their flexibility. The goal of this paper is to introduce NPs and compare them to the currently ongoing research. As such, the tasks presented here are diverse but relatively low-dimensional. We leave it to future work to scale NPs up to higher-dimensional problems that are likely to highlight the benefit of lower computational complexity and data-driven representations.
Acknowledgements

We would like to thank Tiago Ramalho, Oriol Vinyals, Adam Kosiorek, Irene Garnelo, Daniel Burgess, Kevin McKee and Claire McCoy for insightful discussions and being awesome people. We would also like to thank Carlos Riquelme, George Tucker and Jasper Snoek for providing the code to reproduce the results of their contextual bandits experiments (and, of course, also being awesome people).

References

Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012.

Bartunov, S. and Vetrov, D. P. Fast adaptation in generative models with generative matching networks. arXiv preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and Rezende, D. J. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pp. 3923–3932, 2017.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. Manifold Gaussian processes for regression. In Neural Networks (IJCNN), 2016 International Joint Conference on, pp. 3338–3345. IEEE, 2016.

Damianou, A. and Lawrence, N. Deep Gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.

De Finetti, B. La prévision: ses lois logiques, ses sources subjectives. In Annales de l'institut Henri Poincaré, volume 7, pp. 1–68, 1937.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and Kohli, P. Neural program meta-induction. In Advances in Neural Information Processing Systems, pp. 2077–2085, 2017.

Edwards, H. and Storkey, A. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.

Eslami, S. A., Rezende, D. J., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.

Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, A. Conditional neural processes. In International Conference on Machine Learning, 2018.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The variational homoencoder: Learning to infer high-capacity generative models from few examples. 2018.

Huang, W.-b., Zhao, D., Sun, F., Liu, H., and Chang, E. Y. Scalable Gaussian process regression using deep neural networks. In IJCAI, pp. 3576–3582, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

Kumar, A., Eslami, S. M. A., Rezende, D. J., Garnelo, M., Viola, F., Lockhart, E., and Shanahan, M. Consistent generative query networks. CoRR, 2018.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3290–3300, 2017.

Ma, C., Li, Y., and Hernández-Lobato, J. M. Variational implicit processes. arXiv preprint arXiv:1806.02390, 2018.
Øksendal, B. Stochastic differential equations. In Stochastic Differential Equations, pp. 11. Springer, 2003.

Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S., Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot autoregressive density estimation: Towards learning to learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Rezende, D. J., Mohamed, S., Danihelka, I., Gregor, K., and Wierstra, D. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.

Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090, 2017.

Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.

Sun, S., Zhang, G., Wang, C., Zeng, W., Li, J., and Grosse, R. Differentiable compositional kernel learning for Gaussian processes. arXiv preprint arXiv:1806.04326, 2018.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Wilson, A. and Nickisch, H. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775–1784, 2015.

Wilson, A. G., Knowles, D. A., and Ghahramani, Z. Gaussian process regression networks. arXiv preprint arXiv:1110.4411, 2011.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016.
