
Deep Gaussian Processes

Andreas C. Damianou, Neil D. Lawrence


Dept. of Computer Science & Sheffield Institute for Translational Neuroscience,
University of Sheffield, UK

Abstract

In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GP-LVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.

1 Introduction

Probabilistic modelling with neural network architectures constitutes a well studied area of machine learning. Recent advances in the domain of deep learning [Hinton and Osindero, 2006, Bengio et al., 2012] have brought this kind of model back into popularity. Empirically, deep models seem to have structural advantages that can improve the quality of learning in complicated data sets associated with abstract information [Bengio, 2009]. Most deep algorithms require a large amount of data to perform learning; however, we know that humans are able to perform inductive reasoning (equivalent to concept generalization) with only a few examples [Tenenbaum et al., 2006]. This provokes the question as to whether deep structures and the learning of abstract structure can be undertaken in smaller data sets. For smaller data sets, questions of generalization arise: to demonstrate such structures are justified it is useful to have an objective measure of the model's applicability.

The traditional approach to deep learning is based around binary latent variables and the restricted Boltzmann machine (RBM) [Hinton, 2010]. Deep hierarchies are constructed by stacking these models, and various approximate inference techniques (such as contrastive divergence) are used for estimating model parameters. A significant amount of work has then to be done with annealed importance sampling if even the likelihood¹ of a data set under the RBM model is to be estimated [Salakhutdinov and Murray, 2008]. When deeper hierarchies are considered, the estimate is only of a lower bound on the data likelihood. Fitting such models to smaller data sets and using Bayesian approaches to deal with the complexity seems completely futile when faced with these intractabilities.

¹ We use emphasis to clarify we are referring to the model likelihood, not the marginal likelihood required in Bayesian model selection.

The emergence of the Boltzmann machine (BM) at the core of one of the most interesting approaches to modern machine learning is very much a case of the field going back to the future: BMs rose to prominence in the early 1980s, but the practical implications associated with their training led to their neglect until families of algorithms were developed for the RBM model with its reintroduction as a product of experts in the late nineties [Hinton, 1999].

The computational intractabilities of Boltzmann machines led to other families of methods, in particular kernel methods such as the support vector machine (SVM), to be considered for the domain of data classification. Almost contemporaneously to the SVM, Gaussian process (GP) models [Rasmussen and Williams, 2006] were introduced as a fully probabilistic substitute for the multilayer perceptron (MLP), inspired by the observation [Neal, 1996] that, under certain conditions, a GP is an MLP with infinitely many units in the hidden layer. MLPs also relate to deep learning models: deep learning algorithms have been used to pretrain autoencoders for dimensionality reduction [Hinton and Salakhutdinov, 2006].

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

Traditional GP models have been extended to more expressive variants, for example by considering sophisticated covariance functions [Durrande et al., 2011, Gönen and Alpaydin, 2011] or by embedding GPs in more complex probabilistic structures [Snelson et al., 2004, Wilson et al., 2012] able to learn more powerful representations of the data. However, none of the GP-based approaches considered so far lead to a principled way of obtaining truly deep architectures and, to date, the field of deep learning remains mainly associated with RBM-based models.

The conditional probability of a single hidden unit in an RBM model, given its parents, is written as

p(y|x) = \sigma(w^\top x)^y (1 - \sigma(w^\top x))^{1-y},

where here y is the output variable of the RBM, x is the set of inputs being conditioned on and \sigma(z) = (1 + \exp(-z))^{-1}. The conditional density of the output depends only on a linear weighted sum of the inputs. The representational power of a Gaussian process in the same role is significantly greater than that of an RBM. For the GP the corresponding likelihood is over a continuous variable, but it is a nonlinear function of the inputs,

p(y|x) = \mathcal{N}(y | f(x), \sigma^2),

where \mathcal{N}(\cdot | \mu, \sigma^2) is a Gaussian density with mean \mu and variance \sigma^2. In this case the likelihood is dependent on a mapping function, f(\cdot), rather than a set of intermediate parameters, w. The approach in Gaussian process modelling is to place a prior directly over the classes of functions (which often specifies smooth, stationary nonlinear functions) and integrate them out. This can be done analytically. In the RBM the model likelihood is estimated and maximized with respect to the parameters, w. For the RBM, marginalizing w is not analytically tractable. We note in passing that the two approaches can be mixed if

p(y|x) = \sigma(f(x))^y (1 - \sigma(f(x)))^{1-y},

which recovers a GP classification model. Analytic integration is no longer possible though, and a common approach to approximate inference is the expectation propagation algorithm [see e.g. Rasmussen and Williams, 2006]. However, we don't consider this idea further in this paper.
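As a minimal numerical illustration of the two conditional densities above (the RBM's Bernoulli output driven by the linear score w^\top x, and the GP's Gaussian likelihood around a nonlinear f), with arbitrary choices for the weight vector, kernel and noise level, and a single draw from a GP prior standing in for f:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)[:, None]        # 1-D inputs, one per row

# RBM-style conditional: P(y = 1 | x) = sigma(w^T x), linear in x through the score.
w = np.array([1.5])
p_y1 = 1.0 / (1.0 + np.exp(-(x @ w)))

# GP-style likelihood: y = f(x) + noise, with f a single draw from GP(0, k).
def exp_quad(a, b, variance=1.0, lengthscale=1.0):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

K = exp_quad(x, x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(len(x)), K)
noise_var = 0.01
y = f + rng.normal(0.0, np.sqrt(noise_var), size=f.shape)

# Per-point Gaussian log-density N(y_n | f(x_n), noise_var).
log_lik = -0.5 * (np.log(2.0 * np.pi * noise_var) + (y - f) ** 2 / noise_var)
print(p_y1[:3].ravel(), log_lik[:3])
```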
Inference in deep models requires marginalization of x, as they are typically treated as latent variables², which in the case of the RBM are binary variables. The number of terms in the sum scales exponentially with the input dimension, rendering it intractable for anything but the smallest models. In practice, sampling and, in particular, the contrastive divergence algorithm are used for training. Similarly, marginalizing x in the GP is analytically intractable, even for simple prior densities like the Gaussian. In the GP-LVM [Lawrence, 2005] this problem is solved through maximizing with respect to the variables (instead of the parameters, which are marginalized) and these models have been combined in stacks to form the hierarchical GP-LVM [Lawrence and Moore, 2007], which is a maximum a posteriori (MAP) approach for learning deep GP models. For this MAP approach to work, however, a strong prior is required on the top level of the hierarchy to ensure the algorithm works, and MAP learning prohibits model selection because no estimate of the marginal likelihood is available.

² They can also be treated as observed, e.g. in the uppermost layer of the hierarchy where we might include the data label.

There are two main contributions in this paper. Firstly, we exploit recent advances in variational inference [Titsias and Lawrence, 2010] to marginalize the latent variables in the hierarchy variationally. Damianou et al. [2011] have already shown how, using these approaches, two Gaussian process models can be stacked. This paper goes further to show that through variational approximations any number of GP models can be stacked to give truly deep hierarchies. The variational approach gives us a rigorous lower bound on the marginal likelihood of the model, allowing it to be used for model selection. Our second contribution is to use this lower bound to demonstrate the applicability of deep models even when data is scarce. The variational lower bound gives us an objective measure from which we can select different structures for our deep hierarchy (number of layers, number of nodes per layer). In a simple digits example we find that the best lower bound is given by the model with the deepest hierarchy we applied (5 layers).

The deep GP consists of a cascade of hidden layers of latent variables where each node acts as output for the layer above and as input for the layer below, with the observed outputs being placed in the leaves of the hierarchy. Gaussian processes govern the mappings between the layers.

A single layer of the deep GP is effectively a Gaussian process latent variable model (GP-LVM), just as a single layer of a regular deep model is typically an RBM. Titsias and Lawrence [2010] have shown that latent variables can be approximately marginalized in the GP-LVM, allowing a variational lower bound on the likelihood to be computed. The appropriate size of the latent space can be computed using automatic relevance determination (ARD) priors [Neal, 1996]. Damianou et al. [2011] extended this approach by placing a GP prior over the latent space, resulting in a Bayesian dynamical GP-LVM. Here we extend that approach to allow us to approximately marginalize any number of hidden layers. We demonstrate how a deep hierarchy of Gaussian processes can be obtained by marginalising out the latent variables in the structure, obtaining an approximation to the fully Bayesian training procedure and a variational approximation to the true posterior of the latent variables given the outputs. The resulting model is very flexible and should open up a range of applications for deep structures³.

³ A preliminary version of this paper has been presented in [Damianou and Lawrence, 2012].

2 The Model

We first consider standard approaches to modeling with GPs. We then extend these ideas to deep GPs by considering Gaussian process priors over the inputs to the GP model. We can apply this idea recursively to obtain a deep GP model.

2.1 Standard GP Modelling

In the traditional probabilistic inference framework, we are given a set of training input-output pairs, stored in matrices X \in \mathbb{R}^{N \times Q} and Y \in \mathbb{R}^{N \times D} respectively, and seek to estimate the unobserved, latent function f = f(x), responsible for generating Y given X. In this setting, Gaussian processes (GPs) [Rasmussen and Williams, 2006] can be employed as nonparametric prior distributions over the latent function f. More formally, we assume that each datapoint y_n is generated from the corresponding f(x_n) by adding independent Gaussian noise, i.e.

y_n = f(x_n) + \epsilon_n, \quad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2 I),    (1)

and f is drawn from a Gaussian process, i.e. f(x) \sim \mathcal{GP}(0, k(x, x')). This (zero-mean) Gaussian process prior only depends on the covariance function k operating on the inputs X. As we wish to obtain a flexible model, we only make very general assumptions about the form of the generative mapping f, and this is reflected in the choice of the covariance function which defines the properties of this mapping. For example, an exponentiated quadratic covariance function, k(x_i, x_j) = (\sigma_{se})^2 \exp\left(-\frac{(x_i - x_j)^2}{2\ell^2}\right), forces the latent functions to be infinitely smooth. We denote any covariance function hyperparameters (such as (\sigma_{se}, \ell) of the aforementioned covariance function) by \theta. The collection of latent function instantiations, denoted by F = \{f_n\}_{n=1}^{N}, is normally distributed, allowing us to compute analytically the marginal likelihood⁴

p(Y|X) = \int \prod_{n=1}^{N} p(y_n | f_n) \, p(f_n | x_n) \, \mathrm{d}F = \mathcal{N}(Y | 0, K_{NN} + \sigma_\epsilon^2 I), \quad K_{NN} = k(X, X).    (2)

⁴ All probabilities involving f should also have \theta in the conditioning set, but here we omit it for clarity.

Gaussian processes have also been used with success in unsupervised learning scenarios, where the input data X are not directly observed. The Gaussian process latent variable model (GP-LVM) [Lawrence, 2005, 2004] provides an elegant solution to this problem by treating the unobserved inputs X as latent variables, while employing a product of D independent GPs as prior for the latent mapping. The assumed generative procedure takes the form: y_{nd} = f_d(x_n) + \epsilon_{nd}, where \epsilon is again Gaussian with variance \sigma_\epsilon^2 and F = \{f_d\}_{d=1}^{D} with f_{nd} = f_d(x_n). Given a finite data set, the Gaussian process priors take the form

p(F|X) = \prod_{d=1}^{D} \mathcal{N}(f_d | 0, K_{NN}),    (3)

which is Gaussian and, thus, allows for general non-linear mappings to be marginalised out analytically to obtain the likelihood p(Y|X) = \prod_{d=1}^{D} \mathcal{N}(y_d | 0, K_{NN} + \sigma_\epsilon^2 I), analogously to equation (2).
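As a minimal sketch of equation (2), assuming an exponentiated quadratic covariance and arbitrary hyperparameter values, the log marginal likelihood \log \mathcal{N}(Y | 0, K_{NN} + \sigma_\epsilon^2 I), summed over the D independent output dimensions, can be computed as follows:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def exp_quad(X, X2, variance=1.0, lengthscale=1.0):
    # Exponentiated quadratic covariance from Section 2.1.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_log_marginal(X, Y, noise_var=0.1, **kern_args):
    # log N(Y | 0, K_NN + noise_var * I) with D independent output columns.
    N, D = Y.shape
    K = exp_quad(X, X, **kern_args) + noise_var * np.eye(N)
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), Y)                 # (K + sigma^2 I)^{-1} Y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (np.sum(Y * alpha) + D * logdet + N * D * np.log(2.0 * np.pi))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
Y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(30, 1))
print(gp_log_marginal(X, Y, noise_var=0.01, variance=1.0, lengthscale=1.0))
```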
2.2 Deep Gaussian Processes

Our deep Gaussian process architecture corresponds to a graphical model with three kinds of nodes, illustrated in figure 1(a): the leaf nodes Y \in \mathbb{R}^{N \times D} which are observed, the intermediate latent spaces X_h \in \mathbb{R}^{N \times Q_h}, h = 1, \dots, H-1, where H is the number of hidden layers, and the parent latent node Z = X_H \in \mathbb{R}^{N \times Q_Z}. The parent node can be unobserved and potentially constrained with a prior of our choice (e.g. a dynamical prior), or could constitute the given inputs for a supervised learning task. For simplicity, here we focus on the unsupervised learning scenario. In this deep architecture, all intermediate nodes X_h act as inputs for the layer below (including the leaves) and as outputs for the layer above. For simplicity, consider a structure with only two hidden units, as the one depicted in figure 1(b). The generative process takes the form:

y_{nd} = f_d^Y(x_n) + \epsilon_{nd}^Y, \quad d = 1, \dots, D, \quad x_n \in \mathbb{R}^Q,
x_{nq} = f_q^X(z_n) + \epsilon_{nq}^X, \quad q = 1, \dots, Q, \quad z_n \in \mathbb{R}^{Q_Z},    (4)

and the intermediate node is involved in two Gaussian processes, f^Y and f^X, playing the role of an input and an output respectively: f^Y \sim \mathcal{GP}(0, k^Y(X, X)) and f^X \sim \mathcal{GP}(0, k^X(Z, Z)). This structure can be naturally extended vertically (i.e. deeper hierarchies) or horizontally (i.e. segmentation of each layer into different partitions of the output space), as we will see later in the paper. However, it is already obvious how each layer adds a significant number of model parameters (X_h) as well as a regularization challenge, since the size of each latent layer is crucial but has to be defined a priori. For this reason, unlike Lawrence and Moore [2007], we seek to variationally marginalise out the whole latent space. Not only will this allow us to obtain an automatic Occam's razor due to the Bayesian training, but we will also end up with a significantly lower number of model parameters, since the variational procedure only adds variational parameters. The first step in this approach is to define automatic relevance determination (ARD) covariance functions for the GPs:

k(x_i, x_j) = \sigma_{ard}^2 \exp\left(-\frac{1}{2} \sum_{q=1}^{Q} w_q (x_{i,q} - x_{j,q})^2\right).    (5)

This covariance function assumes a different weight w_q for each latent dimension, and this can be exploited in a Bayesian training framework in order to "switch off" irrelevant dimensions by driving their corresponding weight to zero, thus helping towards automatically finding the structure of complex models.

However, the nonlinearities introduced by this covariance function make the Bayesian treatment of this model challenging. Nevertheless, following recent non-standard variational inference methods we can define analytically an approximate Bayesian training procedure, as will be explained in the next section.
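Before turning to training, the generative view of equations (4) and (5) can be made concrete with a minimal sampling sketch, assuming a two-hidden-unit structure as in figure 1(b) and arbitrary choices for the sizes, ARD weights and noise variances:

```python
import numpy as np

rng = np.random.default_rng(2)

def ard_kernel(A, B, variance=1.0, weights=None):
    # k(x_i, x_j) = sigma_ard^2 exp(-0.5 * sum_q w_q (x_iq - x_jq)^2), equation (5).
    weights = np.ones(A.shape[1]) if weights is None else np.asarray(weights)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 * weights).sum(-1)
    return variance * np.exp(-0.5 * d2)

def gp_layer(inputs, out_dim, noise_var=1e-3, **kern_args):
    # Sample out_dim independent GP functions at `inputs` and add Gaussian noise.
    N = inputs.shape[0]
    K = ard_kernel(inputs, inputs, **kern_args) + 1e-8 * np.eye(N)
    F = rng.multivariate_normal(np.zeros(N), K, size=out_dim).T   # N x out_dim
    return F + rng.normal(0.0, np.sqrt(noise_var), size=F.shape)

N, Q_Z, Q, D = 60, 2, 3, 10
Z = rng.normal(size=(N, Q_Z))                 # parent node, p(Z) = N(0, I)
X = gp_layer(Z, Q, weights=[1.0, 0.2])        # hidden layer, f^X ~ GP(0, k^X(Z, Z))
Y = gp_layer(X, D, weights=[1.0, 0.5, 0.1])   # observed layer, f^Y ~ GP(0, k^Y(X, X))
print(Y.shape)                                # (60, 10)
```

Sampling forward through such a stack is also how the toy data of Section 4 is generated.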
2.3 Bayesian Training

A Bayesian training procedure requires optimisation of the model evidence:

\log p(Y) = \log \int_{X, Z} p(Y|X) \, p(X|Z) \, p(Z).    (6)

When prior information is available regarding the observed data (e.g. their dynamical nature is known a priori), the prior distribution on the parent latent node can be selected so as to constrain the whole latent space through propagation of the prior density through the cascade. Here we take the general case where p(Z) = \mathcal{N}(Z | 0, I). However, the integral of equation (6) is intractable due to the nonlinear way in which X and Z are treated through the GP priors f^Y and f^X. As a first step, we apply Jensen's inequality to find a variational lower bound F_v \leq \log p(Y), with

F_v = \int_{X, Z, F^Y, F^X} Q \log \frac{p(Y, F^Y, F^X, X, Z)}{Q},    (7)

where we introduced a variational distribution Q, the form of which will be defined later on. By noticing that the joint distribution appearing above can be expanded in the form

p(Y, F^Y, F^X, X, Z) = p(Y|F^Y) \, p(F^Y|X) \, p(X|F^X) \, p(F^X|Z) \, p(Z),    (8)

we see that the integral of equation (7) is still intractable because X and Z still appear nonlinearly in the p(F^Y|X) and p(F^X|Z) terms respectively. A key result of Titsias and Lawrence [2010] is that expanding the probability space of the GP prior p(F|X) with extra variables allows for priors on the latent space to be propagated through the nonlinear mapping f. More precisely, we augment the probability space of equation (3) with K auxiliary pseudo-inputs \tilde{X} \in \mathbb{R}^{K \times Q} and \tilde{Z} \in \mathbb{R}^{K \times Q_Z} that correspond to a collection of function values U^Y \in \mathbb{R}^{K \times D} and U^X \in \mathbb{R}^{K \times Q} respectively⁵. Following this approach, we obtain the augmented probability space:

p(Y, F^Y, F^X, X, Z, U^Y, U^X, \tilde{X}, \tilde{Z}) = p(Y|F^Y) \, p(F^Y|U^Y, X) \, p(U^Y|\tilde{X}) \cdot p(X|F^X) \, p(F^X|U^X, Z) \, p(U^X|\tilde{Z}) \, p(Z).    (9)

⁵ The number of inducing points, K, does not need to be the same for every GP of the overall deep structure.

The pseudo-inputs \tilde{X} and \tilde{Z} are known as inducing points, and will be dropped from our expressions from now on, for clarity. Note that F^Y and U^Y are draws from the same GP, so that p(U^Y) and p(F^Y|U^Y, X) are also Gaussian distributions (and similarly for p(U^X), p(F^X|U^X, Z)).

We are now able to define a variational distribution Q which, when combined with the new expressions for the augmented GP priors, results in a tractable variational bound. Specifically, we have:

Q = p(F^Y|U^Y, X) \, q(U^Y) \, q(X) \cdot p(F^X|U^X, Z) \, q(U^X) \, q(Z).    (10)

We select q(U^Y) and q(U^X) to be free-form variational distributions, while q(X) and q(Z) are chosen to be Gaussian, factorised with respect to dimensions:

q(X) = \prod_{q=1}^{Q} \mathcal{N}(\mu_q^X, S_q^X), \quad q(Z) = \prod_{q=1}^{Q_Z} \mathcal{N}(\mu_q^Z, S_q^Z).    (11)

By substituting equation (10) back into (7), while also replacing the original joint distribution with its augmented version in equation (9), we see that the "difficult" terms p(F^Y|U^Y, X) and p(F^X|U^X, Z) cancel out in the fraction, leaving a quantity that can be computed analytically:

F_v = \int Q \log \frac{p(Y|F^Y) \, p(U^Y) \, p(X|F^X) \, p(U^X) \, p(Z)}{Q'},    (12)

where Q' = q(U^Y) \, q(X) \, q(U^X) \, q(Z) and the above integration is with respect to \{X, Z, F^Y, F^X, U^Y, U^X\}. More specifically, we can break the logarithm in equation (12) by grouping the variables of the fraction in such a way that the bound can be written as:

F_v = g_Y + r_X + H_{q(X)} - \mathrm{KL}(q(Z) \,\|\, p(Z)),    (13)

where H represents the entropy with respect to a distribution, KL denotes the Kullback-Leibler divergence and, using \langle \cdot \rangle to denote expectations,

g_Y = g(Y, F^Y, U^Y, X) = \langle \log p(Y|F^Y) \rangle_{p(F^Y|U^Y, X) q(U^Y) q(X)} + \langle \log p(U^Y) \rangle_{q(U^Y)},
r_X = r(X, F^X, U^X, Z) = \langle \log p(X|F^X) \rangle_{p(F^X|U^X, Z) q(U^X) q(X) q(Z)} + \langle \log p(U^X) \rangle_{q(U^X)}.    (14)
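Spelling out the substitution that produces this cancellation (a worked restatement of equations (9), (10) and (12) in the same notation, with the inducing inputs suppressed):

```latex
F_v = \int Q \log
  \frac{p(Y|F^Y)\, p(F^Y|U^Y,X)\, p(U^Y)\; p(X|F^X)\, p(F^X|U^X,Z)\, p(U^X)\, p(Z)}
       {p(F^Y|U^Y,X)\, q(U^Y)\, q(X)\; p(F^X|U^X,Z)\, q(U^X)\, q(Z)}
    = \int Q \log
  \frac{p(Y|F^Y)\, p(U^Y)\, p(X|F^X)\, p(U^X)\, p(Z)}
       {q(U^Y)\, q(X)\, q(U^X)\, q(Z)},
```

since the conditional GP terms p(F^Y|U^Y, X) and p(F^X|U^X, Z) appear both in the augmented joint of equation (9) and in Q of equation (10), which is exactly the cancellation that yields equation (12).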
Both terms g_Y and r_X involve known Gaussian densities and are, thus, tractable. The g_Y term is only associated with the leaves and, thus, is the same as the bound found for the Bayesian GP-LVM [Titsias and Lawrence, 2010]. Since it only involves expectations with respect to Gaussian distributions, the GP output variables are only involved in a quantity of the form YY^\top.

Further, as can be seen from the above equations, the function r(\cdot) is similar to g(\cdot) but it requires expectations with respect to densities of all of the variables involved (i.e. with respect to all function inputs). Therefore, r_X will involve X (the outputs of the top layer) in a term \langle XX^\top \rangle_{q(X)} = \sum_{q=1}^{Q} \left( \mu_q^X (\mu_q^X)^\top + S_q^X \right).

3 Extending the hierarchy

Although the main calculations were demonstrated in a simple hierarchy, it is easy to extend the model vertically, i.e. by adding more hidden layers, or horizontally, i.e. by considering conditional independencies of the latent variables belonging to the same layer. The first case only requires adding more r_X functions to the variational bound, i.e. instead of a single r_X term we will now have the sum \sum_{h=1}^{H-1} r_{X_h}, where r_{X_h} = r(X_h, F^{X_h}, U^{X_h}, X_{h+1}) and X_H = Z.

Now consider the horizontal expansion scenario and assume that we wish to break the single latent space X_h of layer h into M_h conditionally independent subsets. As long as the variational distribution q(X_h) of equation (11) is chosen to be factorised in a consistent way, this is feasible by just breaking the original r_{X_h} term of equation (14) into the sum \sum_{m=1}^{M_h} r_{X_h}^{(m)}. This follows just from the fact that, due to the independence assumption, it holds that \log p(X_h | X_{h+1}) = \sum_{m=1}^{M_h} \log p(X_h^{(m)} | X_{h+1}). Notice that the same principle can also be applied to the leaves by breaking the g_Y term of the bound. This scenario arises when, for example, we are presented with multiple different output spaces which, however, we believe have some commonality; for example, when the observed data are coming from a video and an audio recording of the same event. Given the above, the variational bound for the most general version of the model takes the form:

F_v = \sum_{m=1}^{M_Y} g_Y^{(m)} + \sum_{h=1}^{H-1} \sum_{m=1}^{M_h} r_{X_h}^{(m)} + \sum_{h=1}^{H-1} H_{q(X_h)} - \mathrm{KL}(q(Z) \,\|\, p(Z)).    (15)

Figure 1(c) shows the association of this objective function's terms with each layer of the hierarchy. Recall that each r_{X_h}^{(m)} and g_Y^{(m)} term is associated with a different GP and, thus, comes with its own set of automatic relevance determination (ARD) weights (described in equation (5)).

3.1 Deep multiple-output Gaussian processes

The particular way of extending the hierarchies horizontally, as presented above, can be seen as a means of performing unsupervised multiple-output GP learning. This only requires assigning a different g_Y term (and, thus, associated ARD weights) to each vector y_d, where d indexes the output dimensions. After training our model, we hope that the columns of Y that encode similar information will be assigned relevance weight vectors that are also similar. This idea can be extended to all levels of the hierarchy, thus obtaining a fully factorised deep GP model.

This special case of our model makes the connection between our model's structure and neural network architectures more obvious: the ARD parameters play a role similar to the weights of neural networks, while the latent variables play the role of neurons which learn hierarchies of features.

Figure 1: Different representations of the Deep GP model: (a) shows the general architecture with a cascade of H hidden layers, (b) depicts a simplification of a two hidden layer hierarchy also demonstrating the corresponding GP mappings and (c) illustrates the most general case where the leaves and all intermediate nodes are allowed to form conditionally independent groups. The terms of the objective (15) corresponding to each layer are included on the left.

3.2 Parameters and complexity

In all graphical variants shown in figure 1, every arrow represents a generative procedure with a GP prior, corresponding to a set of parameters \{\tilde{X}, \theta, \sigma_\epsilon\}. Each layer of latent variables corresponds to a variational distribution q(X) which is associated with a set of variational means and covariances, as shown in equation (11). The parent node can have the same form as equation (11) or can be constrained with a more informative prior which would couple the points of q(Z). For example, a dynamical prior would introduce Q \times N^2 parameters which can, nevertheless, be reparametrized using fewer variables [Damianou et al., 2011]. However, as is evident from equations (10) and (12), the inducing points and the parameters of q(X) and q(Z) are variational rather than model parameters, something which significantly helps in regularizing the problem.

Therefore, adding more layers to the hierarchy does not introduce many more model parameters. Moreover, as in common sparse methods for Gaussian processes [Titsias, 2009], the complexity of each generative GP mapping is reduced from the typical O(N^3) to O(NM^2).
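As a minimal sketch of where the O(NM^2) cost comes from, assuming an exponentiated quadratic covariance and arbitrary sizes, the expensive quantities in sparse approximations in the style of Titsias [2009] involve only the N x M cross-covariance and the M x M covariance at the inducing inputs, never the full N x N matrix:

```python
import numpy as np

def exp_quad(A, B, variance=1.0, lengthscale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(3)
N, M, Q = 2000, 50, 3
X = rng.normal(size=(N, Q))                         # all inputs
X_tilde = X[rng.choice(N, size=M, replace=False)]   # M << N inducing inputs

K_MM = exp_quad(X_tilde, X_tilde) + 1e-6 * np.eye(M)  # O(M^2) storage, O(M^3) factorisation
K_NM = exp_quad(X, X_tilde)                           # O(N M) storage
# The dominant cost is the M x M product K_MN K_NM accumulated over N points: O(N M^2).
A = K_NM.T @ K_NM
print(A.shape)   # (50, 50); the N x N covariance is never built explicitly
```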
4 Demonstration

In this section we demonstrate the deep GP model on toy and real-world data sets. For all experiments, the model is initialised by performing dimensionality reduction on the observations to obtain the first hidden layer and then repeating this process greedily for the next layers. To obtain the stacked initial spaces we experimented with PCA and the Bayesian GP-LVM, but the end result did not vary significantly. Note that the usual process in deep learning is to seek a dimensional expansion, particularly in the lower layers. In deep GP models, such an expansion does occur between the latent layers because there is an infinite basis layer associated with the GP between each latent layer.
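A minimal sketch of this greedy initialisation, assuming plain PCA for each stage and arbitrary layer dimensionalities (the Bayesian GP-LVM can be substituted for the projection step):

```python
import numpy as np

def pca_project(Y, q):
    # Project centred data onto its top-q principal directions.
    Yc = Y - Y.mean(0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:q].T

def greedy_init(Y, layer_dims):
    # layer_dims, e.g. [10, 5, 2]: dimensionality of each latent layer, leaves upwards.
    layers, current = [], Y
    for q in layer_dims:
        current = pca_project(current, q)
        layers.append(current)          # initial values for the q(X_h) means
    return layers

rng = np.random.default_rng(4)
Y = rng.normal(size=(150, 256))         # e.g. 150 digit images of 16 x 16 pixels
inits = greedy_init(Y, [10, 5, 2])
print([Z.shape for Z in inits])         # [(150, 10), (150, 5), (150, 2)]
```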
4.1 Toy Data

We first test our model on toy data, created by sampling from a three-level stack of GPs. Figure 2(a) depicts the true hierarchy: from the top latent layer two intermediate latent signals are generated. These, in turn, together generate 10-dimensional observations (not depicted) through sampling of another GP. These observations are then used to train the following models: a deep GP, a simple stacked Isomap [Tenenbaum et al., 2000] and a stacked PCA method, the results of which are shown in figures 2(b, c, d) respectively. Of these models, only the deep GP marginalises the latent spaces and, in contrast to the other two, it is not given any information about the dimensionality of each true signal in the hierarchy; instead, this is learnt automatically through ARD. As can be seen in figure 2, the deep GP finds the correct dimensionality for each hidden layer, but it also discovers latent signals which are closer to the real ones. This result is encouraging, as it indicates that the model can recover the ground truth when samples from it are taken, and gives confidence in the variational learning procedure.

Figure 2: Attempts to reconstruct the real data (fig. (a)) with our model (b), stacked Isomap (c) and stacked PCA (d). Our model can also find the correct dimensionalities automatically.

We next tested our model on a toy regression problem. A deep regression problem is similar to the unsupervised learning problem we have described, but in the uppermost layer we make observations of some set of inputs. For this simple example we created a toy data set by stacking two Gaussian processes as follows: the first Gaussian process employed a covariance function which was the sum of a linear and a quadratic exponential kernel and received as input an equally spaced vector of 120 points. We generated 1-dimensional samples from the first GP and used them as input for the second GP, which employed a quadratic exponential kernel. Finally, we generated 10-dimensional samples with the second GP, thus overall simulating a warped process. The final data set was created by simply ignoring the intermediate layer (the samples from the first GP) and presenting to the tested methods only the continuous equally spaced input given to the first GP and the output of the second GP. To make the data set more challenging, we randomly selected only 25 datapoints for the training set and left the rest for the test set.
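A minimal sketch of this data-generating recipe, with kernel hyperparameters and the random seed chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(5)

def exp_quad(A, B, variance=1.0, lengthscale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def linear(A, B, variance=1.0):
    return variance * A @ B.T

t = np.linspace(0.0, 1.0, 120)[:, None]                    # equally spaced inputs
K1 = linear(t, t) + exp_quad(t, t, lengthscale=0.2) + 1e-8 * np.eye(120)
h = rng.multivariate_normal(np.zeros(120), K1)[:, None]    # hidden 1-D warping

K2 = exp_quad(h, h, lengthscale=0.3) + 1e-8 * np.eye(120)
Y = rng.multivariate_normal(np.zeros(120), K2, size=10).T  # 10-D observed outputs

train = rng.choice(120, size=25, replace=False)            # small training split
test = np.setdiff1d(np.arange(120), train)
print(t[train].shape, Y[train].shape, Y[test].shape)       # (25, 1) (25, 10) (95, 10)
```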

Figure 3 nicely illustrates the effects of sampling through two GP models: nonstationarity and long range correlations across the input space become prevalent. A data set of this form would be challenging for traditional approaches because of these long range correlations. Another way of thinking of data like this is as a nonlinear warping of the input space to the GP. Because this type of deep GP only contains one hidden layer, it is identical to the model developed by Damianou et al. [2011] (where the input given at the top layer of their model was a time vector, but their code is trivially generalized). The additional contribution in this paper will be to provide a more complex deep hierarchy, but still learn the underlying representation correctly. To this end we applied a standard GP (one layer less than the actual process that generated the data) and a deep GP with two hidden layers (one layer more than the actual generating process). We repeated our experiment 10 times, each time obtaining different samples from the simulated warped process and different random training splits. Our results show that the deep GP predicted the unseen data better, as can be seen in figure 3(b). The results, therefore, suggest that our deep model can at the same time be flexible enough to model difficult data as well as robust when modelling data that is less complex than that representable by the hierarchy. We assign these characteristics to the Bayesian learning approach that deals with capacity control automatically.

4.2 Modeling human motion

For our first demonstration on real data we recreate a motion capture data experiment from Lawrence and Moore [2007]. They used data from the CMU MOCAP database representing two subjects walking towards each other and performing a 'high-five'. The data contains 78 frames of motion and each character has 62 dimensions, leading to 124 dimensions in total (i.e. more dimensions than data). To account for the correlated motions of the subjects we applied our method with a two-level hierarchy where the two observation sets were taken to be conditionally independent given their parent latent layer.

In the layer closest to the data we associated each GP-LVM with a different set of ARD parameters, allowing the layer above to be used in different ways for each character. In this approach we are inspired by the shared GP-LVM structure of Damianou et al. [2012] which is designed to model loosely correlated data sets within the same model. The end result was that we obtained three optimised sets of ARD parameters: one for each modality of the bottom layer (fig. 4(b)), and one for the top node (fig. 4(c)). Our model discovered a common subspace in the intermediate layer, since for dimensions 2 and 6 both ARD sets have a non-zero value. This is expected, as the two subjects perform very similar motions in opposite directions. The ARD weights are also a means of automatically selecting the dimensionality of each layer and subspace. This kind of modelling is impossible for a MAP method like [Lawrence and Moore, 2007] which requires the exact latent structure to be given a priori. The full latent space learned by the aforementioned MAP method is plotted in figure 5(d,e,f), where fig. (d) corresponds to the top latent space and each of the other two encodes information for each of the two interacting subjects. Our method is not constrained to two-dimensional spaces, so for comparison we plot two-dimensional projections of the dominant dimensions of each subspace in figure 5(a,b,c). The similarity of the latent spaces is obvious. In contrast to Lawrence and Moore [2007], we did not have to constrain the latent space with dynamics in order to obtain results of good quality.

Further, we can sample from these spaces to see what kind of information they encode. Indeed, we observed that the top layer generates outputs which correspond to different variations of the whole sequence, while when sampling from the first layer we obtain outputs which only differ in a small subset of the output dimensions, e.g. those corresponding to the subject's hand.

Figure 3: (a) shows the toy data created for the regression experiment. The top plot shows the (hidden) warping function and the bottom plot shows the final (observed) output. (b) shows the results obtained over each experiment repetition.

Figure 4: Figure (a) shows the deep GP model employed. Figure (b) shows the ARD weights for f^{Y_1} (blue/wider bins) and f^{Y_2} (red/thinner bins) and figure (c) those for f^X.

Figure 5: Left (a,b,c): projections of the latent spaces discovered by our model. Right (d,e,f): the full latent space learned for the model of Lawrence and Moore [2007].

4.3 Deep learning of digit images

Our final experiment demonstrates the ability of our model to learn latent features of increasing abstraction, and we demonstrate the usefulness of an analytic bound on the model evidence as a means of evaluating the quality of the model fit for different choices of the overall depth of the hierarchy. Many deep learning approaches are applied to large digit data sets such as MNIST. Our specific intention is to explore the utility of deep hierarchies when the digit data set is small. We subsampled a data set consisting of 50 examples for each of the digits {0, 1, 6} taken from the USPS handwritten digit database. Each digit is represented as an image in 16 x 16 pixels. We experimented with deep GP models of depth ranging from 1 (equivalent to Bayesian GP-LVM) to 5 hidden layers and evaluated each model by measuring the nearest neighbour error in the latent features discovered in each hierarchy. We found that the lower bound on the model evidence increased with the number of layers, as did the quality of the model in terms of nearest neighbour errors⁶. Indeed, the single-layer model made 5 mistakes even though it automatically decided to use 10 latent dimensions, and the quality of the trained models was increasing with the number of hidden layers. Finally, only one point had a nearest neighbour of a different class in the 4-dimensional top level's feature space of a model with depth 5. A 2D projection of this space is plotted in fig. 7. The ARD weights for this model are depicted in fig. 6.

⁶ As parameters increase linearly in the deep GP with latent units, we also considered the Bayesian Information Criterion, but we found that it had no effect on the ranking of model quality.
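A minimal sketch of this nearest-neighbour evaluation, using a random stand-in for the learned top-level latent space and the digit labels of the experiment (in practice the inputs would be the top-layer latent means):

```python
import numpy as np

def nn_errors(latent, labels):
    # For each point, find its nearest other point in the latent space and count
    # how often that neighbour carries a different class label.
    d2 = ((latent[:, None, :] - latent[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude each point itself
    neighbour = d2.argmin(axis=1)
    return int((labels[neighbour] != labels).sum())

rng = np.random.default_rng(6)
labels = np.repeat([0, 1, 6], 50)          # 150 digits, 50 per class as in the experiment
latent = rng.normal(size=(150, 4))         # stand-in for a learned 4-D top-layer space
print(nn_errors(latent, labels))
```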

Our final goal is to demonstrate that, as we rise in the hierarchy, features of increasing abstraction are accounted for. To this end, we generated outputs by sampling from each hidden layer. The samples are shown in figure 8. There, it can be seen that the lower levels encode local features whereas the higher ones encode more abstract information.

Figure 6: The ARD weights of a deep GP with 5 hidden layers as learned for the digits experiment.

Figure 7: The nearest neighbour class separation test on a deep GP model with depth 5.

Figure 8: The first two rows (top-down) show outputs obtained when sampling from layers 1 and 2 respectively; these encode very local features, e.g. explaining whether a "0" has a closed circle or how big the circle of a "6" is. We found many more local features when we sampled from different dimensions. Conversely, when we sampled from the two dominant dimensions of the parent node (the two rows at the bottom) we got much more varied outputs, i.e. the higher levels indeed encode much more abstract information.

5 Discussion and future work

We have introduced a framework for efficient Bayesian training of hierarchical Gaussian process mappings. Our approach approximately marginalises out the latent space, thus allowing for automatic structure discovery in the hierarchy. The method was able to successfully learn a hierarchy of features which describe natural human motion and the pixels of handwritten digits. Our variational lower bound selected a deep hierarchical representation for handwritten digits even though the data in our experiment was relatively scarce (150 data points). We gave persuasive evidence that deep GP models are powerful enough to encode abstract information even for smaller data sets. Further exploration could include testing the model on other inference tasks, such as class conditional density estimation, to further validate the ideas. Our method can also be used to improve existing deep algorithms, something which we plan to further investigate by incorporating ideas from past approaches. Indeed, previous efforts to combine GPs with deep structures were successful at unsupervised pre-training [Erhan et al., 2010] or guiding [Snoek et al., 2012] of traditional deep models.

Although the experiments presented here considered only up to 5 layers in the hierarchy, the methodology is directly applicable to deeper architectures, with which we intend to experiment in the future. The marginalisation of the latent space allows for such an expansion with simultaneous regularisation. The variational lower bound allows us to make a principled choice between models trained using different initializations and with different numbers of layers.

The deep hierarchy we have proposed can also be used with inputs governing the top layer of the hierarchy, leading to a powerful model for regression based on Gaussian processes, but which is not itself a Gaussian process. In the future, we wish to test this model for applications in multi-task learning (where intermediate layers could learn representations shared across the tasks) and in modelling nonstationary data or data involving jumps. These are both areas where a single layer GP struggles.

A remaining challenge is to extend our methodologies to very large data sets. A very promising approach would be to apply stochastic variational inference [Hoffman et al., 2012]. In a recent workshop publication Hensman and Lawrence [2012] have shown that the standard variational GP and Bayesian GP-LVM can be made to fit within this formalism. The next step for deep GPs will be to incorporate these large scale variational learning algorithms.

Acknowledgements

Research was supported by the University of Sheffield Moody endowment fund and the Greek State Scholarships Foundation (IKY).

References

Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1-127, Jan. 2009. ISSN 1935-8237. doi: 10.1561/2200000006.
Y. Bengio, A. C. Courville, and P. Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012.
A. C. Damianou and N. D. Lawrence. Deep Gaussian processes. NIPS workshop on Deep Learning and Unsupervised Feature Learning (arXiv:1211.0358v1), 2012.
A. C. Damianou, M. Titsias, and N. D. Lawrence. Variational Gaussian process dynamical systems. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2510-2518, 2011.
A. C. Damianou, C. H. Ek, M. K. Titsias, and N. D. Lawrence. Manifold relevance determination. In J. Langford and J. Pineau, editors, Proceedings of the International Conference in Machine Learning, volume 29, San Francisco, CA, 2012. Morgan Kauffman.
N. Durrande, D. Ginsbourger, and O. Roustant. Additive kernels for Gaussian process modeling. ArXiv e-prints 1103.4023, 2011.
D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res., 11:625-660, Mar. 2010. ISSN 1532-4435.
M. Gönen and E. Alpaydin. Multiple kernel learning algorithms. J. Mach. Learn. Res., 12:2211-2268, Jul. 2011. ISSN 1532-4435.
J. Hensman and N. Lawrence. Gaussian processes for big data through stochastic variational inference. NIPS workshop on Big Learning, 2012.
G. Hinton. A practical guide to training restricted Boltzmann machines. Technical report, 2010.
G. E. Hinton. Training products of experts by maximizing contrastive likelihood. Technical report, Gatsby Computational Neuroscience Unit, 1999.
G. E. Hinton and S. Osindero. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. ArXiv e-prints 1206.7051, 2012.
N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems 16, 2004.
N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, Nov. 2005.
N. D. Lawrence and A. J. Moore. Hierarchical Gaussian process latent variable models. In Z. Ghahramani, editor, Proceedings of the International Conference in Machine Learning, volume 24, pages 481-488. Omnipress, 2007. ISBN 1-59593-793-3.
R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996. Lecture Notes in Statistics 118.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 0-262-18253-X.
R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25, 2008.
E. Snelson, C. E. Rasmussen, and Z. Ghahramani. Warped Gaussian processes. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
J. Snoek, R. P. Adams, and H. Larochelle. On nonparametric guidance for learning autoencoder representations. In Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000. doi: 10.1126/science.290.5500.2319.
J. B. Tenenbaum, C. Kemp, and P. Shafto. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, pages 309-318, 2006.
M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. JMLR W&CP, 5:567-574, 2009.
M. K. Titsias and N. D. Lawrence. Bayesian Gaussian process latent variable model. In Y. W. Teh and D. M. Titterington, editors, Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics, volume 9, pages 844-851, Chia Laguna Resort, Sardinia, Italy, 13-16 May 2010. JMLR W&CP 9.
A. G. Wilson, D. A. Knowles, and Z. Ghahramani. Gaussian process regression networks. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, June 2012. Omnipress.
