Damianou, Lawrence - 2013 - Deep Gaussian Processes
dinov, 2006]. Traditional GP models have been extended to more expressive variants, for example by considering sophisticated covariance functions [Durrande et al., 2011, Gönen and Alpaydin, 2011] or by embedding GPs in more complex probabilistic structures [Snelson et al., 2004, Wilson et al., 2012] able to learn more powerful representations of the data. However, all GP-based approaches considered so far do not lead to a principled way of obtaining truly deep architectures and, to date, the field of deep learning remains mainly associated with RBM-based models.

The conditional probability of a single hidden unit in an RBM model, given its parents, is written as

p(y|x) = σ(w⊤x)^y (1 − σ(w⊤x))^(1−y),

where here y is the output variable of the RBM, x is the set of inputs being conditioned on and σ(z) = (1 + exp(−z))⁻¹. The conditional density of the output depends only on a linear weighted sum of the inputs. The representational power of a Gaussian process in the same role is significantly greater than that of an RBM. For the GP the corresponding likelihood is over a continuous variable, but it is a nonlinear function of the inputs,

p(y|x) = N(y | f(x), σ²),

where N(·|µ, σ²) is a Gaussian density with mean µ and variance σ². In this case the likelihood is dependent on a mapping function, f(·), rather than a set of intermediate parameters, w. The approach in Gaussian process modelling is to place a prior directly over the classes of functions (which often specifies smooth, stationary nonlinear functions) and integrate them out. This can be done analytically. In the RBM the model likelihood is estimated and maximized with respect to the parameters, w. For the RBM, marginalizing w is not analytically tractable. We note in passing that the two approaches can be mixed if

p(y|x) = σ(f(x))^y (1 − σ(f(x)))^(1−y),

which recovers a GP classification model. Analytic integration is no longer possible though, and a common approach to approximate inference is the expectation propagation algorithm [see e.g. Rasmussen and Williams, 2006]. However, we do not consider this idea further in this paper.

Inference in deep models requires marginalization of x, as the inputs are typically treated as latent variables², which in the case of the RBM are binary variables. The number of terms in the sum scales exponentially with the input dimension, rendering it intractable for anything but the smallest models. In practice, sampling and, in particular, the contrastive divergence algorithm are used for training. Similarly, marginalizing x in the GP is analytically intractable, even for simple prior densities like the Gaussian. In the GP-LVM [Lawrence, 2005] this problem is solved through maximizing with respect to the variables (instead of the parameters, which are marginalized), and these models have been combined in stacks to form the hierarchical GP-LVM [Lawrence and Moore, 2007], which is a maximum a posteriori (MAP) approach for learning deep GP models. For this MAP approach to work, however, a strong prior is required on the top level of the hierarchy to ensure the algorithm works, and MAP learning prohibits model selection because no estimate of the marginal likelihood is available.

² They can also be treated as observed, e.g. in the upper most layer of the hierarchy where we might include the data label.

There are two main contributions in this paper. Firstly, we exploit recent advances in variational inference [Titsias and Lawrence, 2010] to marginalize the latent variables in the hierarchy variationally. Damianou et al. [2011] have already shown how, using these approaches, two Gaussian process models can be stacked. This paper goes further to show that through variational approximations any number of GP models can be stacked to give truly deep hierarchies. The variational approach gives us a rigorous lower bound on the marginal likelihood of the model, allowing it to be used for model selection. Our second contribution is to use this lower bound to demonstrate the applicability of deep models even when data is scarce. The variational lower bound gives us an objective measure from which we can select different structures for our deep hierarchy (number of layers, number of nodes per layer). In a simple digits example we find that the best lower bound is given by the model with the deepest hierarchy we applied (5 layers).

The deep GP consists of a cascade of hidden layers of latent variables where each node acts as output for the layer above and as input for the layer below, with the observed outputs being placed in the leaves of the hierarchy. Gaussian processes govern the mappings between the layers.

A single layer of the deep GP is effectively a Gaussian process latent variable model (GP-LVM), just as a single layer of a regular deep model is typically an RBM. Titsias and Lawrence [2010] have shown that latent variables can be approximately marginalized in the GP-LVM, allowing a variational lower bound on the likelihood to be computed. The appropriate size of the latent space can be computed using automatic relevance determination (ARD) priors [Neal, 1996]. Damianou et al. [2011] extended this approach by placing a GP prior over the latent space, resulting in a Bayesian dynamical GP-LVM. Here we extend that approach to allow us to approximately marginalize any number of hidden layers. We demonstrate how a deep hierarchy of Gaussian processes can be obtained by marginalising out the latent variables in the structure, obtaining an approximation to the fully Bayesian training procedure and a variational approximation to the true posterior of the latent variables given the outputs. The resulting model is very flexible and should open up a range of applications for deep structures³.

³ A preliminary version of this paper has been presented in [Damianou and Lawrence, 2012].
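To make the contrast drawn above between the RBM unit's conditional and the GP likelihood concrete, the following minimal numpy sketch (an editorial illustration, not code from the paper; the weight vector, nonlinear mapping and noise level are invented) evaluates the Bernoulli conditional p(y|x) = σ(w⊤x)^y (1 − σ(w⊤x))^(1−y) alongside the Gaussian likelihood N(y|f(x), σ²) for an arbitrary nonlinear f:

```python
import numpy as np

def rbm_unit_conditional(y, x, w):
    """Bernoulli conditional of a single binary RBM hidden unit; it depends
    on x only through the linear weighted sum w^T x."""
    p = 1.0 / (1.0 + np.exp(-w @ x))            # sigma(w^T x)
    return p**y * (1.0 - p)**(1 - y)

def gp_style_likelihood(y, x, f, noise_var):
    """Gaussian likelihood N(y | f(x), sigma^2); f may be any nonlinear
    mapping of the inputs, here a stand-in for a GP draw."""
    mu = f(x)
    return np.exp(-0.5 * (y - mu)**2 / noise_var) / np.sqrt(2 * np.pi * noise_var)

x = np.array([0.3, -1.2, 0.7])                  # inputs being conditioned on
w = np.array([1.5, 0.2, -0.4])                  # hypothetical RBM weights
f = lambda z: np.sin(z).sum()                   # hypothetical nonlinear mapping
print(rbm_unit_conditional(1, x, w))            # P(y = 1 | x) for the binary unit
print(gp_style_likelihood(0.5, x, f, 0.1))      # density of a continuous output
```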
2 The Model

We first consider standard approaches to modeling with GPs. We then extend these ideas to deep GPs by considering Gaussian process priors over the inputs to the GP model. We can apply this idea recursively to obtain a deep GP model.

2.1 Standard GP Modelling

In the traditional probabilistic inference framework, we are given a set of training input–output pairs, stored in matrices X ∈ R^{N×Q} and Y ∈ R^{N×D} respectively, and seek to estimate the unobserved, latent function f = f(x), responsible for generating Y given X. In this setting, Gaussian processes (GPs) [Rasmussen and Williams, 2006] can be employed as nonparametric prior distributions over the latent function f. More formally, we assume that each datapoint y_n is generated from the corresponding f(x_n) by adding independent Gaussian noise, i.e.

y_n = f(x_n) + ε_n,   ε ∼ N(0, σ_ε² I),   (1)

and f is drawn from a Gaussian process, i.e. f(x) ∼ GP(0, k(x, x′)). This (zero-mean) Gaussian process prior only depends on the covariance function k operating on the inputs X. As we wish to obtain a flexible model, we only make very general assumptions about the form of the generative mapping f and this is reflected in the choice of the covariance function which defines the properties of this mapping. For example, an exponentiated quadratic covariance function, k(x_i, x_j) = σ_se² exp(−(x_i − x_j)²/(2l²)), forces the latent functions to be infinitely smooth. We denote any covariance function hyperparameters (such as (σ_se, l) of the aforementioned covariance function) by θ. The collection of latent function instantiations, denoted by F = {f_n}_{n=1}^N, is normally distributed, allowing us to compute analytically the marginal likelihood⁴

p(Y|X) = ∫ ∏_{n=1}^N p(y_n|f_n) p(f_n|x_n) dF = N(Y|0, K_{NN} + σ_ε² I),   K_{NN} = k(X, X).   (2)

⁴ All probabilities involving f should also have θ in the conditioning set, but here we omit it for clarity.

Gaussian processes have also been used with success in unsupervised learning scenarios, where the input data X are not directly observed. The Gaussian process latent variable model (GP-LVM) [Lawrence, 2005, 2004] provides an elegant solution to this problem by treating the unobserved inputs X as latent variables, while employing a product of D independent GPs as prior for the latent mapping. The assumed generative procedure takes the form: y_{nd} = f_d(x_n) + ε_{nd}, where ε is again Gaussian with variance σ_ε² and F = {f_d}_{d=1}^D with f_{nd} = f_d(x_n). Given a finite data set, the Gaussian process priors take the form

p(F|X) = ∏_{d=1}^D N(f_d | 0, K_{NN}),   (3)

which is a Gaussian and, thus, allows for general non-linear mappings to be marginalised out analytically to obtain the likelihood p(Y|X) = ∏_{d=1}^D N(y_d | 0, K_{NN} + σ_ε² I), analogously to equation (2).

2.2 Deep Gaussian Processes

Our deep Gaussian process architecture corresponds to a graphical model with three kinds of nodes, illustrated in figure 1(a): the leaf nodes Y ∈ R^{N×D} which are observed, the intermediate latent spaces X_h ∈ R^{N×Q_h}, h = 1, ..., H−1, where H is the number of hidden layers, and the parent latent node Z = X_H ∈ R^{N×Q_Z}. The parent node can be unobserved and potentially constrained with a prior of our choice (e.g. a dynamical prior), or could constitute the given inputs for a supervised learning task. For simplicity, here we focus on the unsupervised learning scenario. In this deep architecture, all intermediate nodes X_h act as inputs for the layer below (including the leaves) and as outputs for the layer above. For simplicity, consider a structure with only two hidden units, like the one depicted in figure 1(b). The generative process takes the form:

y_{nd} = f_d^Y(x_n) + ε^Y_{nd},   d = 1, ..., D,   x_n ∈ R^Q
x_{nq} = f_q^X(z_n) + ε^X_{nq},   q = 1, ..., Q,   z_n ∈ R^{Q_Z}   (4)

and the intermediate node is involved in two Gaussian processes, f^Y and f^X, playing the role of an input and an output respectively: f^Y ∼ GP(0, k^Y(X, X)) and f^X ∼ GP(0, k^X(Z, Z)). This structure can be naturally extended vertically (i.e. deeper hierarchies) or horizontally (i.e. segmentation of each layer into different partitions of the output space), as we will see later in the paper. However, it is already obvious how each layer adds a significant number of model parameters (X_h) as well as a regularization challenge, since the size of each latent layer is crucial but has to be a priori defined. For this reason, unlike Lawrence and Moore [2007], we seek to variationally marginalise out the whole latent space. Not only will this allow us to obtain an automatic Occam's razor due to the Bayesian training, but we will also end up with a significantly lower number of model parameters, since the variational procedure only adds variational parameters. The first step to this approach is to define automatic relevance determination (ARD) covariance functions for the GPs:

k(x_i, x_j) = σ_ard² exp(−½ ∑_{q=1}^Q w_q (x_{i,q} − x_{j,q})²).   (5)

This covariance function assumes a different weight w_q for each latent dimension and this can be exploited in a Bayesian training framework in order to "switch off" irrelevant dimensions by driving their corresponding weight to zero, thus helping towards automatically finding the structure of complex models. However, the nonlinearities introduced by this covariance function make the Bayesian treatment of this model challenging. Nevertheless, following recent non-standard variational inference methods we can define analytically an approximate Bayesian training procedure, as will be explained in the next section.
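Before turning to the Bayesian treatment, the following numpy sketch makes equations (2) and (5) concrete (an editorial illustration rather than code from the paper; the toy data, hyperparameter values and sizes are invented). It builds the ARD covariance matrix K_NN and evaluates log N(Y|0, K_NN + σ_ε²I) summed over D independent output dimensions; setting a weight w_q to zero removes dimension q from the covariance, which is the "switching off" mechanism described above:

```python
import numpy as np

def ard_kernel(X1, X2, sigma_ard2, w):
    """ARD exponentiated quadratic kernel of equation (5):
    k(x_i, x_j) = sigma_ard^2 * exp(-0.5 * sum_q w_q (x_iq - x_jq)^2)."""
    diff = X1[:, None, :] - X2[None, :, :]                 # (N1, N2, Q)
    return sigma_ard2 * np.exp(-0.5 * np.einsum('q,ijq->ij', w, diff**2))

def gp_log_marginal(Y, X, sigma_ard2, w, noise_var):
    """log N(Y | 0, K_NN + noise_var * I), summed over the D independent
    output dimensions, as in equation (2) and its GP-LVM extension."""
    N, D = Y.shape
    K = ard_kernel(X, X, sigma_ard2, w) + noise_var * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))    # (K_NN + noise I)^{-1} Y
    return (-0.5 * np.sum(Y * alpha)
            - D * np.sum(np.log(np.diag(L)))
            - 0.5 * N * D * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                               # toy inputs, Q = 3
Y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(30, 2))      # toy outputs, D = 2
w = np.array([1.0, 0.0, 0.0])                              # dimensions 2 and 3 switched off
print(gp_log_marginal(Y, X, sigma_ard2=1.0, w=w, noise_var=0.01))
```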
2.3 Bayesian Training

A Bayesian training procedure requires optimisation of the model evidence:

log p(Y) = log ∫_{X,Z} p(Y|X) p(X|Z) p(Z).   (6)

When prior information is available regarding the observed data (e.g. their dynamical nature is known a priori), the prior distribution on the parent latent node can be selected so as to constrain the whole latent space through propagation of the prior density through the cascade. Here we take the general case where p(Z) = N(Z|0, I). However, the integral of equation (6) is intractable due to the nonlinear way in which X and Z are treated through the GP priors f^Y and f^X. As a first step, we apply Jensen's inequality to find a variational lower bound F_v ≤ log p(Y), with

F_v = ∫_{X,Z,F^Y,F^X} Q log [ p(Y, F^Y, F^X, X, Z) / Q ],   (7)

where we introduced a variational distribution Q, the form of which will be defined later on. By noticing that the joint distribution appearing above can be expanded in the form

p(Y, F^Y, F^X, X, Z) = p(Y|F^Y) p(F^Y|X) p(X|F^X) p(F^X|Z) p(Z),   (8)

we see that the integral of equation (7) is still intractable because X and Z still appear nonlinearly in the p(F^Y|X) and p(F^X|Z) terms respectively. A key result of Titsias and Lawrence [2010] is that expanding the probability space of the GP prior p(F|X) with extra variables allows for priors on the latent space to be propagated through the nonlinear mapping f. More precisely, we augment the probability space of equation (3) with K auxiliary pseudo-inputs X̃ ∈ R^{K×Q} and Z̃ ∈ R^{K×Q_Z} that correspond to collections of function values U^Y ∈ R^{K×D} and U^X ∈ R^{K×Q} respectively⁵. Following this approach, we obtain the augmented probability space:

p(Y, F^Y, F^X, X, Z, U^Y, U^X, X̃, Z̃) = p(Y|F^Y) p(F^Y|U^Y, X) p(U^Y|X̃) · p(X|F^X) p(F^X|U^X, Z) p(U^X|Z̃) p(Z).   (9)

The pseudo-inputs X̃ and Z̃ are known as inducing points, and will be dropped from our expressions from now on, for clarity. Note that F^Y and U^Y are draws from the same GP, so that p(U^Y) and p(F^Y|U^Y, X) are also Gaussian distributions (and similarly for p(U^X), p(F^X|U^X, Z)).

⁵ The number of inducing points, K, does not need to be the same for every GP of the overall deep structure.

We are now able to define a variational distribution Q which, when combined with the new expressions for the augmented GP priors, results in a tractable variational bound. Specifically, we have:

Q = p(F^Y|U^Y, X) q(U^Y) q(X) · p(F^X|U^X, Z) q(U^X) q(Z).   (10)

We select q(U^Y) and q(U^X) to be free-form variational distributions, while q(X) and q(Z) are chosen to be Gaussian, factorised with respect to dimensions:

q(X) = ∏_{q=1}^Q N(µ^X_q, S^X_q),   q(Z) = ∏_{q=1}^{Q_Z} N(µ^Z_q, S^Z_q).   (11)

By substituting equation (10) back into (7), while also replacing the original joint distribution with its augmented version in equation (9), we see that the "difficult" terms p(F^Y|U^Y, X) and p(F^X|U^X, Z) cancel out in the fraction, leaving a quantity that can be computed analytically:

F_v = ∫ Q log [ p(Y|F^Y) p(U^Y) p(X|F^X) p(U^X) p(Z) / Q₀ ],   (12)

where Q₀ = q(U^Y) q(X) q(U^X) q(Z) and the above integration is with respect to {X, Z, F^Y, F^X, U^Y, U^X}. More specifically, we can break the logarithm in equation (12) by grouping the variables of the fraction in such a way that the bound can be written as:

F_v = g_Y + r_X + H_{q(X)} − KL(q(Z) ‖ p(Z)),   (13)

where H represents the entropy with respect to a distribution, KL denotes the Kullback–Leibler divergence and, using ⟨·⟩ to denote expectations,

g_Y = g(Y, F^Y, U^Y, X) = ⟨log p(Y|F^Y)⟩_{p(F^Y|U^Y,X) q(U^Y) q(X)} + ⟨log p(U^Y)⟩_{q(U^Y)}
r_X = r(X, F^X, U^X, Z) = ⟨log p(X|F^X)⟩_{p(F^X|U^X,Z) q(U^X) q(X) q(Z)} + ⟨log p(U^X)⟩_{q(U^X)}.   (14)

Both terms g_Y and r_X involve known Gaussian densities and are, thus, tractable. The g_Y term is only associated with the leaves and, thus, is the same as the bound found for the Bayesian GP-LVM [Titsias and Lawrence, 2010]. Since it only involves expectations with respect to Gaussian distributions, the GP output variables are only involved in a quantity of the form YYᵀ. Further, as can be seen from the above equations, the function r(·) is similar to g(·) but it requires expectations with respect to densities of all of the variables involved (i.e. with respect to all function inputs). Therefore, r_X will involve X (the outputs of the top layer) in a term ⟨XXᵀ⟩_{q(X)} = ∑_{q=1}^Q [ µ^X_q (µ^X_q)ᵀ + S^X_q ].
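The augmentation above can be made concrete with the standard sparse-GP identities (a sketch under the usual jointly-Gaussian conditioning rules, not the paper's own code; the kernel choice and sizes are arbitrary): since F and U are draws from the same GP, p(U) = N(U|0, K_KK) and p(F|U, X) = N(F | K_NK K_KK⁻¹ U, K_NN − K_NK K_KK⁻¹ K_KN), with K_KK = k(X̃, X̃) and K_NK = k(X, X̃), and both factors are Gaussian as noted above:

```python
import numpy as np

def rbf_kernel(A, B, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between two sets of points."""
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def augmented_gp_prior(X, X_tilde, jitter=1e-6):
    """Parameters of the two Gaussian factors of the augmented prior:
    p(U) = N(0, K_KK) and p(F | U, X) = N(A U, K_NN - A K_KN),
    where A = K_NK K_KK^{-1}."""
    K_NN = rbf_kernel(X, X)
    K_NK = rbf_kernel(X, X_tilde)
    K_KK = rbf_kernel(X_tilde, X_tilde) + jitter * np.eye(len(X_tilde))
    A = np.linalg.solve(K_KK, K_NK.T).T        # K_NK K_KK^{-1}
    cond_cov = K_NN - A @ K_NK.T               # conditional covariance (independent of U)
    return K_KK, A, cond_cov                   # conditional mean of F given U is A @ U

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                                  # N = 50 inputs
X_tilde = X[rng.choice(50, size=10, replace=False)]           # K = 10 inducing inputs
K_KK, A, cond_cov = augmented_gp_prior(X, X_tilde)
U = np.linalg.cholesky(K_KK) @ rng.normal(size=(10, 1))       # a draw U ~ p(U)
F_mean = A @ U                                                # E[F | U, X]
```

In the bound of equation (12) it is exactly the factor p(F|U, X) that cancels against the matching factor inside Q, which is what leaves a quantity that can be computed analytically.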
3 Extending the hierarchy

Although the main calculations were demonstrated in a simple hierarchy, it is easy to extend the model vertically, i.e. by adding more hidden layers, or horizontally, i.e. by considering conditional independencies of the latent variables belonging to the same layer. The first case only requires adding more r_X functions to the variational bound, i.e. instead of a single r_X term we will now have the sum ∑_{h=1}^{H−1} r_{X_h}, where r_{X_h} = r(X_h, F^{X_h}, U^{X_h}, X_{h+1}) and X_H = Z.

Now consider the horizontal expansion scenario and assume that we wish to break the single latent space X_h, of layer h, into M_h conditionally independent subsets. As long as the variational distribution q(X_h) of equation (11) is chosen to be factorised in a consistent way, this is feasible by just breaking the original r_{X_h} term of equation (14) into the sum ∑_{m=1}^{M_h} r^{(m)}_{X_h}. This follows just from the fact that, due to the independence assumption, it holds that log p(X_h|X_{h+1}) = ∑_{m=1}^{M_h} log p(X_h^{(m)}|X_{h+1}). Notice that the same principle can also be applied to the leaves by breaking the g_Y term of the bound. This scenario arises when, for example, we are presented with multiple different output spaces which, however, we believe have some commonality; for example, when the observed data are coming from a video and an audio recording of the same event. Given the above, the variational bound for the most general version of the model takes the form:

F_v = ∑_{m=1}^{M_Y} g_Y^{(m)} + ∑_{h=1}^{H−1} ∑_{m=1}^{M_h} r_{X_h}^{(m)} + ∑_{h=1}^{H−1} H_{q(X_h)} − KL(q(Z) ‖ p(Z)).   (15)

Figure 1(c) shows the association of this objective function's terms with each layer of the hierarchy. Recall that each r^{(m)}_{X_h} and g^{(m)}_Y term is associated with a different GP and, thus, comes with its own set of automatic relevance determination (ARD) weights (described in equation (5)).

3.1 Deep multiple-output Gaussian processes

The particular way of extending the hierarchies horizontally, as presented above, can be seen as a means of performing unsupervised multiple-output GP learning. This only requires assigning a different g_Y term (and, thus, associated ARD weights) to each vector y_d, where d indexes the output dimensions. After training our model, we hope that the columns of Y that encode similar information will be assigned relevance weight vectors that are also similar. This idea can be extended to all levels of the hierarchy, thus obtaining a fully factorised deep GP model.

This special case of our model makes the connection between our model's structure and neural network architectures more obvious: the ARD parameters play a role similar to the weights of neural networks, while the latent variables play the role of neurons which learn hierarchies of features.

Figure 1: Different representations of the Deep GP model: (a) shows the general architecture with a cascade of H hidden layers, (b) depicts a simplification of a two hidden layer hierarchy also demonstrating the corresponding GP mappings and (c) illustrates the most general case where the leaves and all intermediate nodes are allowed to form conditionally independent groups. The terms of the objective (15) corresponding to each layer are included on the left.

3.2 Parameters and complexity

In all graphical variants shown in figure 1, every arrow represents a generative procedure with a GP prior, corresponding to a set of parameters {X̃, θ, σ_ε}. Each layer of latent variables corresponds to a variational distribution q(X) which is associated with a set of variational means and covariances, as shown in equation (11). The parent node can have the same form as equation (11) or can be constrained with a more informative prior which would couple the points of q(Z). For example, a dynamical prior would introduce Q × N² parameters which can, nevertheless, be reparametrized using fewer variables [Damianou et al., 2011]. However, as is evident from equations (10) and (12), the inducing points and the parameters of q(X) and q(Z) are variational rather than model parameters, something which significantly helps in regularizing the problem. Therefore, adding more layers to the hierarchy does not introduce many more model parameters. Moreover, as in common sparse methods for Gaussian processes [Titsias, 2009], the complexity of each generative GP mapping is reduced from the typical O(N³) to O(NK²), where K is the number of inducing points.
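As a rough illustration of this bookkeeping (an editorial sketch, not the paper's implementation; it assumes diagonal covariances S_q for each q(X_h) and a common number K of inducing points per GP), the function below counts the model parameters (ARD weights, kernel variance and noise variance per mapping) separately from the variational parameters (inducing inputs plus the means and covariances of q(X_h) and q(Z)) for a deep GP of a given shape:

```python
def deep_gp_parameter_counts(N, layer_dims, K=30):
    """Count parameters for a deep GP with layer_dims = [D, Q_1, ..., Q_H]
    (observed dimension first, parent node Z last), N data points and K
    inducing points per GP; kernel hyperparameters are those of the ARD
    covariance in equation (5), and q covariances are taken to be diagonal."""
    model, variational = 0, 0
    for q_in in layer_dims[1:]:          # each latent layer drives one GP mapping below it
        model += q_in + 2                # ARD weights w_q, sigma_ard^2, sigma_eps^2
        variational += K * q_in          # inducing inputs of that mapping
        variational += 2 * N * q_in      # variational means and (diagonal) covariances
    return model, variational

# e.g. 5-d observations, two hidden layers of width 3 and a 2-d parent node
print(deep_gp_parameter_counts(N=100, layer_dims=[5, 3, 3, 2]))
```

Adding a layer therefore adds only a handful of model parameters, while the per-layer parameters that scale with N are variational, consistent with the discussion above.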
4 Demonstration

In this section we demonstrate the deep GP model on toy and real-world data sets. For all experiments, the model is initialised by performing dimensionality reduction on the observations to obtain the first hidden layer and then repeating this process greedily for the next layers. To obtain the stacked initial spaces we experimented with PCA and the Bayesian GP-LVM, but the end result did not vary significantly. Note that the usual process in deep learning is to seek a dimensional expansion, particularly in the lower layers. In deep GP models, such an expansion does occur between the latent layers because there is an infinite basis layer associated with the GP between each latent layer.

4.1 Toy Data

We first test our model on toy data, created by sampling from a three-level stack of GPs. Figure 2 (a) depicts the true hierarchy: from the top latent layer two intermediate latent signals are generated. These, in turn, together generate 10-dimensional observations (not depicted) through sampling of another GP. These observations are then used to train the following models: a deep GP, a simple stacked Isomap [Tenenbaum et al., 2000] and a stacked PCA method, the results of which are shown in figures 2 (b, c, d) respectively. From these models, only the deep GP marginalises the latent spaces and, in contrast to the other two, it is not given any information about the dimensionality of each true signal in the hierarchy; instead, this is learnt automatically through ARD. As can be seen in figure 2, the deep GP finds the correct dimensionality for each hidden layer, but it also discovers latent signals which are closer to the real ones. This result is encouraging, as it indicates that the model can recover the ground truth when samples from it are taken, and gives confidence in the variational learning procedure.

learning problem we have described, but in the uppermost layer we make observations of some set of inputs. For this simple example we created a toy data set by stacking two Gaussian processes as follows: the first Gaussian process employed a covariance function which was the sum of a linear and a quadratic exponential kernel and received as input an equally spaced vector of 120 points. We generated 1-dimensional samples from the first GP and used them as input for the second GP, which employed a quadratic exponential kernel. Finally, we generated 10-dimensional samples with the second GP, thus overall simulating a warped process. The final data set was created by simply ignoring the intermediate layer (the samples from the first GP) and presenting to the tested methods only the continuous equally spaced input given to the first GP and the output of the second GP. To make the data set more challenging, we randomly selected only 25 datapoints for the training set and left the rest for the test set.

Figure 3 nicely illustrates the effects of sampling through two GP models: nonstationarity and long range correlations across the input space become prevalent. A data set of this form would be challenging for traditional approaches because of these long range correlations. Another way of thinking of data like this is as a nonlinear warping of the input space to the GP. Because this type of deep GP only contains one hidden layer, it is identical to the model developed by Damianou et al. [2011] (where the input given at the top layer of their model was a time vector, but their code is trivially generalized). The additional contribution in this paper will be to provide a more complex deep hierarchy, but still learn the underlying representation correctly. To this end we applied a standard GP (1 layer less than the actual process that generated the data) and a deep GP with two hidden layers (1 layer more than the actual generating process). We repeated our experiment 10 times, each time obtaining different samples from the simulated warped process and different random training splits. Our results show that the deep GP predicted the unseen data better, as can be seen in figure 3(b). The results, therefore, suggest that our deep model can at the same time be flexible enough to model difficult data as well as robust when modelling data that is less complex than that representable by the hierarchy. We assign these characteristics to the Bayesian learning approach that deals with capacity control automatically.

Figure 3: (a) shows the toy data created for the regression experiment. The top plot shows the (hidden) warping function and the bottom plot shows the final (observed) output. (b) shows the results (MSE of the GP and the deep GP) obtained over each experiment repetition.

Figure 4: Figure (a) shows the deep GP model employed. Figure (b) shows the ARD weights for f^{Y_1} (blue/wider bins) and f^{Y_2} (red/thinner bins) and figure (c) those for f^X.
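The warped-process construction just described is straightforward to reproduce; the sketch below (an editorial illustration, not the original experimental code; the seed and kernel hyperparameters are arbitrary) draws a 1-dimensional hidden function from a GP with a linear plus exponentiated quadratic covariance on 120 equally spaced inputs and then pushes it through a second exponentiated quadratic GP to obtain 10-dimensional outputs:

```python
import numpy as np

def eq_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance for 1-d inputs."""
    d2 = (a[:, None] - b[None, :])**2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
N, D, jitter = 120, 10, 1e-5
t = np.linspace(0.0, 10.0, N)                       # equally spaced inputs

# layer 1: linear + exponentiated quadratic covariance, one 1-d sample
K1 = t[:, None] * t[None, :] + eq_kernel(t, t, lengthscale=2.0)
h = np.linalg.cholesky(K1 + jitter * np.eye(N)) @ rng.normal(size=N)

# layer 2: exponentiated quadratic covariance evaluated on the hidden sample,
# D independent draws give a warped, nonstationary-looking 10-d output
K2 = eq_kernel(h, h, lengthscale=1.0)
L2 = np.linalg.cholesky(K2 + jitter * np.eye(N))
Y = L2 @ rng.normal(size=(N, D))

# the hidden layer h is discarded; only (t, Y) is presented to the models,
# with a random 25-point subset used for training as in the experiment above
train_idx = rng.choice(N, size=25, replace=False)
```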
Figure 6: The ARD weights of a deep GP with 5 hidden layers as learned for the digits experiment.

Figure 7: The nearest neighbour class separation test on a deep GP model with depth 5.

5 Discussion and future work

We have introduced a framework for efficient Bayesian training of hierarchical Gaussian process mappings. Our approach approximately marginalises out the latent space, thus allowing for automatic structure discovery in the hierarchy. The method was able to successfully learn a hierarchy of features which describe natural human motion and the pixels of handwritten digits. Our variational lower bound selected a deep hierarchical representation for handwritten digits even though the data in our experiment was relatively scarce (150 data points). We gave persuasive evidence that deep GP models are powerful enough to encode abstract information even for smaller data sets. Further exploration could include testing the model on other inference tasks, such as class conditional density estimation, to further validate the ideas. Our method can also be used to improve existing deep algorithms, something which we plan to further investigate by incorporating ideas from past approaches. Indeed, previous efforts to combine GPs with deep structures were successful at unsupervised pre-training [Erhan et al., 2010] or guiding [Snoek et al., 2012] of traditional deep models.

Although the experiments presented here considered only up to 5 layers in the hierarchy, the methodology is directly applicable to deeper architectures, with which we intend to experiment in the future. The marginalisation of the latent space allows for such an expansion with simultaneous regularisation. The variational lower bound allows us to make a principled choice between models trained using different initializations and with different numbers of layers.

The deep hierarchy we have proposed can also be used with inputs governing the top layer of the hierarchy, leading to a powerful model for regression based on Gaussian processes, but which is not itself a Gaussian process. In the future, we wish to test this model for applications in multi-task learning (where intermediate layers could learn representations shared across the tasks) and in modelling nonstationary data or data involving jumps. These are both areas where a single layer GP struggles.

A remaining challenge is to extend our methodologies to very large data sets. A very promising approach would be to apply stochastic variational inference [Hoffman et al., 2012]. In a recent workshop publication, Hensman and Lawrence [2012] have shown that the standard variational GP and Bayesian GP-LVM can be made to fit within this formalism. The next step for deep GPs will be to incorporate these large scale variational learning algorithms.

Acknowledgements

Research was supported by the University of Sheffield Moody endowment fund and the Greek State Scholarships Foundation (IKY).