Gaussian Processes For Machine Learning

Degree in Mathematics (Grau de Matemàtiques)
Contents

Introduction
2 Covariance Functions
  2.1 Definition
  2.2 Examples of Kernels
    2.2.1 Stationary Kernels
    2.2.2 Dot Product Covariance Functions
    2.2.3 Other Non-stationary Covariance Functions
  2.3 Combining Kernels
    2.3.1 Combining Kernels through multiplication
    2.3.2 Multi-dimensional models
    2.3.3 Modelling Sums of Functions
    2.3.4 Changepoints
5 Conclusions
Bibliography
Abstract
The main focus of this project is to present clearly and concisely an overview of the main ideas of Gaussian processes for regression in a machine learning context. The introductory chapters contain core material: how to construct Gaussian process models, how many types of structure can be expressed through kernels, and how adding and multiplying different kernels combines their properties. Explanatory examples are also included in order to gain a deeper understanding of the material. Finally, we provide an alternative approach to a complex machine learning problem tackled by a prestigious European bank in its endeavour to obtain a reliable method for predicting customer expenses and, consequently, forecasting the balance of bank accounts. The aim was to analyse this problem in depth using Gaussian process regression from two different perspectives: first treating all the bank accounts as a single global entity, and then clustering the accounts according to their similarity, which allows a more flexible and adaptive approach.
Introduction
In this work we will be concerned with supervised learning, which is the problem of learning an input-output mapping from empirical data. Since we work with continuous outputs, this problem is known as regression.
In general we denote the input as x and the output (or target) as y. The input is usually represented as a vector x, as there are in general many input variables, and the target y in regression problems, as mentioned before, must be continuous. Thus, if we have a dataset D of n observations, we write D = {(x_i, y_i) | i = 1, ..., n} or D = {X, y}.
Given this training data we wish to make predictions for new inputs x_* that we have not seen in the training set. Hence, we need to move from the finite training data D to a function f that makes predictions for all possible input values. To do this we must make assumptions about the characteristics of the underlying function, as otherwise any function which is consistent with the training data would be equally valid. In that context, one particularly elegant way to learn functions is by giving a prior probability to every possible function, where higher probabilities are given to functions that we consider to be more likely.
In other words, we assume that y_i = f(x_i), for some unknown function f, possibly corrupted by noise. Then we infer a distribution over functions given the data, p(f | X, y), and we use this to make predictions given new inputs, i.e., to compute

p(y_* | x_*, X, y) = ∫ p(y_* | f, x_*) p(f | X, y) df
Thus, we will discuss a way to perform Bayesian inference over functions themselves. However, this approach appears to have a serious problem: surely there is an uncountably infinite set of possible functions, so how are we going to compute with this set in finite time? This is where the Gaussian process, which is a generalization of the Gaussian probability distribution, comes to our rescue.
A GP defines a prior over functions, which can be converted into a posterior over functions
once we have seen some data. Although it might seem difficult to represent a distribution
over a function, it turns out that we only need to be able to define a distribution over the
function's values at a finite, but arbitrary, set of points, say x_1, . . . , x_N. A GP assumes that p(f(x_1), ..., f(x_N)) is jointly Gaussian, with some mean µ(x) and covariance matrix Σ given by Σ_ij = k(x_i, x_j), where k is a positive definite kernel function.
Actually, one of the main attractions of the Gaussian process framework is precisely that it unites a sophisticated and consistent view with computational tractability. Indeed, many models that are commonly employed in both machine learning and statistics are in fact special cases or restricted kinds of Gaussian processes.
To sum up, Gaussian Processes provide a principled, practical, probabilistic approach to
learning in kernel machines and in some sense bring together work in the statistics and
machine learning communities. The main goal of this work is to present clearly and
concisely an overview of the main ideas of Gaussian processes in a machine learning
context and to apply them to a complex forecasting problem consisting of predicting the movements of a bank account (its expenses) in order to determine whether or not this bank account will go into the red, having access only to a small sample of historical values.
The main interest of being able to answer that question is that a leading bank has recently been allocating a large amount of resources and effort to solving this riddle and to creating a tool able to warn customers when a major expense is likely to occur. Actually, it has already launched a preliminary version that has been tested on its employees. However, it addressed the problem in a different way, since it faced this time series prediction problem using Long Short-Term Memory (LSTM) networks, a type of recurrent neural network used in deep learning that is capable of successfully training very large architectures.
As a matter of fact, those responsible for that project also tried to provide a solution using Gaussian processes, which in the end did not perform as well as the LSTM networks, although the differences were not very large. Thus, another added incentive is to reduce the gap between both approaches.
The work has a natural split into two parts, with the chapters up to and including chapter 3 covering core material, such as the definition of Gaussian processes for regression, a detailed analysis of kernel functions, model selection and a practical example. The remaining chapter covers the bank account forecasting problem.
The programming language used throughout this work is Python 2.7, together with machine learning libraries such as scikit-learn, GPy (University of Sheffield) and PyGPs (University of Bremen). Source code to produce all figures and examples is available at http://www.github.com/gerardmartinezcanelles.
Symbol          Meaning
\               left matrix divide: A\b is the vector x which solves Ax = b
|K|             determinant of the matrix K
y^T             the transpose of the vector y
∝               proportional to; e.g. p(x|y) ∝ f(x,y) means that p(x|y) is equal to f(x,y) times a factor which is independent of x
∼               distributed according to; example: x ∼ N(µ, σ²)
C               number of classes in a classification problem
cholesky(A)     Cholesky decomposition: L is a lower triangular matrix such that LL^T = A
cov(f_*)        Gaussian process posterior covariance
D               dimension of the input space X
D               data set: D = {(x_i, y_i) | i = 1, ..., n}
diag(w)         (vector argument) a diagonal matrix containing the elements of the vector w
diag(W)         (matrix argument) a vector containing the diagonal elements of the matrix W
δ_pq            Kronecker delta, δ_pq = 1 iff p = q and 0 otherwise
f(x) or f       Gaussian process latent function values, f = (f(x_1), . . . , f(x_n))^T
f_*             Gaussian process (posterior) prediction (random variable)
f̄_*             Gaussian process posterior mean
GP              Gaussian process: f ∼ GP(m(x), k(x, x')), the function f is distributed as a Gaussian process with mean function m(x) and covariance function k(x, x')
GPR             Gaussian process regression
I or I_n        the identity matrix (of size n)
J_ν(z)          Bessel function of the first kind
k(x, x')        covariance (or kernel) function evaluated at x and x'
K(X, X')        n × n covariance (or Gram) matrix
K_*             n × n_* matrix K(X, X_*), the covariance between training and test cases
k(x_*)          vector, short for K(X, x_*), when there is only a single test case
K_f or K        covariance matrix for the (noise free) f values
K_y             covariance matrix for the (noisy) y values; for independent homoscedastic noise, K_y = K_f + σ_n² I
L(a, b)         loss function; the loss of predicting b when a is true; note the argument order
log(z)          natural logarithm (base e)
ℓ or ℓ_d        characteristic length-scale (for input dimension d)
m(x)            the mean function of a Gaussian process
N(µ, Σ)         multivariate Gaussian (normal) distribution with mean vector µ and covariance matrix Σ
N(x)            short for the unit Gaussian, x ∼ N(0, I)
n and n_*       number of training (and test) cases
N               dimension of feature space
N_H             number of hidden units in a neural network
y|x and p(y|x)  conditional random variable y given x and its probability (density)
σ_f²            variance of the (noise free) signal
σ_n²            noise variance
θ               vector of hyperparameters (parameters of the covariance function)
tr(A)           trace of the (square) matrix A
X               input space and also the index set for the stochastic process
X               D × n matrix of the training inputs {x_i}_{i=1}^n: the design matrix
X_*             matrix of test inputs
x_i             the i-th training input
O(·)            the big-O asymptotic complexity of an algorithm
SE              the squared-exponential kernel, also known as the radial-basis function (RBF) kernel, the Gaussian kernel, or the exponentiated quadratic
RQ              the rational-quadratic kernel
Per             the periodic kernel
Lin             the linear kernel
WN              the white-noise kernel
C               the constant kernel
Chapter 1

Gaussian Process for Regression
In this first chapter Gaussian process methods for regression problems are described. There are several ways to interpret Gaussian process (GP) regression models. One of them is the function-space view, which defines a distribution over functions, with inference taking place directly in the space of functions. This approach is discussed in section 1.1. Another interesting and equivalent view of GPs, which may be more familiar and accessible, is the weight-space view, where GPs are presented as a Bayesian analysis of the standard linear regression model.¹ This first chapter also includes a section where we discuss how to incorporate non-zero mean functions into the models, and in the last section an easy example is shown.
Gaussian processes are a simple and general class of models of functions: we use them to describe a distribution over functions with a continuous domain. Formally:

Definition 1.1. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.²
¹ This particular view is extensively explained in Carl E. Rasmussen and Christopher K.I. Williams, Gaussian Processes for Machine Learning.
² Equivalently, a Gaussian process is a stochastic process {X_t, t ∈ T}, where T is an index set, all of whose finite-dimensional distributions are multivariate normal. That is, for any choice of distinct values t_1, . . . , t_k ∈ T, the random vector X = (X_{t_1}, . . . , X_{t_k}) has a multivariate normal distribution, where a multivariate Gaussian (or normal) distribution has joint probability density

p(x | m, Σ) = (2π)^{−M/2} |Σ|^{−1/2} exp(−½ (x − m)^T Σ^{−1} (x − m))     (1.1)

where m is the mean vector (of length M) and Σ is the (symmetric, positive definite) covariance matrix (of size M × M).
A Gaussian process is completely specified by its mean function and covariance function, defined as

m(x) = E[f(x)]     (1.2)

k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]     (1.3)

and we write

f(x) ∼ GP(m(x), k(x, x'))     (1.4)
It is common practice to assume that the mean function is simply zero everywhere, although this is not necessary, as we will see, since we can incorporate non-zero mean functions into the models.
Note that in our case the random variables represent the value of the function f(x) at location x. On the other hand, the index set of the random variables may also be time; in other words, Gaussian processes can be defined over time. Indeed, many of the examples discussed in this work are time series.
After accounting for the mean, the kind of structure that can be captured by a GP model is
entirely determined by its covariance function, also known as kernel. The kernel specifies
how the model generalizes, or extrapolates to new data. There are many possible choices
of covariance function, and we can specify a wide range of models just by specifying the
kernel of a GP. Actually, one of the main difficulties in using GPs is constructing a kernel
which represents the particular structure present in the data being modelled. An example
of a covariance function is the squared exponential (SE), which specifies the covariance between pairs of random variables as

Cov(f(x_p), f(x_q)) = k(x_p, x_q) = exp(−½ |x_p − x_q|²)     (1.5)
Note that the covariance between the outputs is written as a function of the inputs. As said
before, the specification of the covariance function implies a distribution over functions.
In the example in Figure 1.1 we have drawn samples from the distribution of functions
evaluated at any number of input points X after determining the covariance function
using the SE kernel. More precisely, we have generated a random Gaussian vector with this covariance matrix,

f ∼ N(0, K(X, X))

and plotted the generated values as a function of the inputs. The mechanism used to generate multivariate Gaussian samples is explained in detail in the GitHub repository.
Figure 1.1: Panel (a) shows three functions drawn at random from a GP prior, in other words a GP not conditioned on any datapoints. Panel (b) shows the posterior after conditioning on five noise-free observations. The shaded area represents the 95% confidence region, corresponding to the pointwise mean plus and minus two times the standard deviation for each input value.
To make predictions we write the joint distribution of the training outputs f and the test outputs f_* under the prior as

[ f ; f_* ] ∼ N( 0, [ K(X, X)   K(X, X_*) ;  K(X_*, X)   K(X_*, X_*) ] )     (1.6)
where K(X, X_*) denotes the n × n_* matrix of the covariances evaluated at all pairs of training and test points, and similarly for the other entries K(X, X), K(X_*, X_*) and K(X_*, X).
The noise-free predictive distribution is given by

f_* | X_*, X, f ∼ N( K(X_*, X) K(X, X)^{−1} f,  K(X_*, X_*) − K(X_*, X) K(X, X)^{−1} K(X, X_*) )     (1.7)
Function values f_* can be sampled from the joint posterior distribution by evaluating the mean and covariance matrix from equation (1.7). Figure 1.1b shows the result of these computations given the five datapoints marked with blue dots.
In practice we do not have access to noise-free function values but only to noisy versions y = f(x) + ε, with additive i.i.d. Gaussian noise ε of variance σ_n², so that

cov(y_p, y_q) = k(x_p, x_q) + σ_n² δ_pq     (1.8)

where δ_pq is a Kronecker delta which is one iff p = q and zero otherwise. Taking this noise term into consideration and introducing it in equation (1.6), we can write the joint distribution of the observed target values and the function values at the test locations under the prior as

[ y ; f_* ] ∼ N( 0, [ K(X, X) + σ_n² I   K(X, X_*) ;  K(X_*, X)   K(X_*, X_*) ] )     (1.9)
Thus, the key predictive equations for Gaussian process regression, following the conclusions obtained in (1.7), are

f_* | X, y, X_* ∼ N( f̄_*, cov(f_*) )     (1.10)

where

f̄_* ≜ E[f_* | X, y, X_*] = K(X_*, X) [K(X, X) + σ_n² I]^{−1} y     (1.11)

cov(f_*) = K(X_*, X_*) − K(X_*, X) [K(X, X) + σ_n² I]^{−1} K(X, X_*)     (1.12)
Before considering one particular case of GP, we should examine in more detail the expression for the variance given in equation (1.12). First, note that the variance only depends on the inputs, and not on the observed targets. Secondly, splitting (1.12) term by term, we see that the variance is the difference between two terms: the prior covariance K(X_*, X_*) minus the information the observations give us about the function. Moreover, the equation holds unchanged when X_* denotes multiple test inputs.
Now let us evaluate the case when the test set consists of a single point x_*. In that context the predictive distributions obtained by adapting equations (1.11) and (1.12) to that particular case reduce to

f̄_* = k_*^T [K(X, X) + σ_n² I]^{−1} y     (1.13)

V[f_*] = k(x_*, x_*) − k_*^T [K(X, X) + σ_n² I]^{−1} k_*     (1.14)
where k_* = k(x_*) denotes the vector of covariances between the test point and the n training points. Since equation (1.13) is a linear combination of the observations y, it is sometimes referred to as a linear predictor; it can thus be rewritten as a linear combination of n kernel functions, each one centered on a training point, by writing

f̄(x_*) = Σ_{i=1}^n α_i k(x_i, x_*)     (1.15)
where α = (K(X, X) + σ_n² I)^{−1} y and α_i is its i-th component. Even though the GP defines a joint Gaussian distribution over all of the y variables, for making predictions at x_* we only care about the (n+1)-dimensional distribution defined by the n training points and the test point. Conditioning this (n+1)-dimensional distribution on the observations gives us the desired result, since a Gaussian distribution is marginalized by just taking the relevant block of the joint covariance matrix.
Before concluding this section, I would like to introduce the concept of the marginal likelihood p(y|X). The marginal likelihood is the integral of the likelihood times the prior,

p(y|X) = ∫ p(y|f, X) p(f|X) df     (1.16)
The term marginal likelihood refers to the marginalization over the function values f. Under the Gaussian process model the prior is Gaussian, f|X ∼ N(0, K(X, X)), or

log p(f|X) = −½ f^T K(X, X)^{−1} f − ½ log|K| − (n/2) log 2π     (1.17)

and the likelihood is a factorized Gaussian, y|f ∼ N(f, σ_n² I), so knowing that the product of two Gaussians gives another (un-normalized) Gaussian³ and that the normalizing constant itself looks like a Gaussian,⁴ the log marginal likelihood can be written as

log p(y|X) = −½ y^T (K(X, X) + σ_n² I)^{−1} y − ½ log|K(X, X) + σ_n² I| − (n/2) log 2π     (1.18)
This result can also be obtained directly by observing that y ∼ N(0, K(X, X) + σ_n² I).
A practical implementation of Gaussian process regression is shown below, and the complete Python code is provided in the GitHub repository. The algorithm uses the Cholesky decomposition, since it is numerically more stable and faster. Computing the matrix inverses in the above equations in a conventional way takes O(n³) time, making exact inference prohibitively slow for more than a few thousand datapoints. The computational complexity of the Cholesky decomposition in line 2 is O(n³/6). We also have to consider O(n²/2) complexity for solving the triangular systems in line 3 and (for each test case) in line 5. The algorithm returns the predictive mean and variance for noise-free test data; adding the noise variance σ_n² to the predictive variance of f_* allows us to compute the predictive distribution for noisy test data y_*.

³ N(x|a, A) N(x|b, B) = Z^{−1} N(x|c, C), where c = C(A^{−1} a + B^{−1} b) and C = (A^{−1} + B^{−1})^{−1}.
⁴ Z^{−1} = (2π)^{−D/2} |A + B|^{−1/2} exp(−½ (a − b)^T (A + B)^{−1} (a − b)).
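The listing that the line references above point to is Algorithm 2.1 of Rasmussen and Williams; a minimal NumPy/SciPy sketch of that procedure (not the thesis implementation) could look as follows, where the kernel function k and the noise variance sigma_n2 are assumed given.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X, y, x_star, k, sigma_n2):
    """Cholesky-based GP regression (cf. Rasmussen & Williams, Algorithm 2.1)."""
    n = len(y)
    K = k(X, X)
    L = cholesky(K + sigma_n2 * np.eye(n), lower=True)                 # line 2: O(n^3/6)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))  # line 3: two O(n^2/2) solves
    k_star = k(X, x_star)                       # covariances between training and test inputs
    f_mean = k_star.T @ alpha                   # predictive mean, eq. (1.13)
    v = solve_triangular(L, k_star, lower=True)                        # line 5: O(n^2/2) per test case
    f_var = k(x_star, x_star) - v.T @ v         # predictive (co)variance, eq. (1.14)
    log_ml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)  # eq. (1.18)
    return f_mean, f_var, log_ml
```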
Non-zero mean functions can be incorporated by specifying

f(x) ∼ GP(m(x), k(x, x'))     (1.19)

in which case the predictive mean becomes

f̄_* = m(X_*) + K(X_*, X) K_y^{−1} (y − m(X))     (1.20)

where K_y = K + σ_n² I, and the predictive variance remains unchanged from eq. (1.12).
1.1.4 Example
After providing a theoretical explanation of how Gaussian process regression works, it is interesting to show a step-by-step example in order to make the forecasting process more understandable.
Let us consider six noisy data points (error bars are indicated with vertical lines) and suppose we are interested in estimating a seventh at x_*.
For example, (x_i, y_i) = [(−2.5, 0.6), (−1.50, 0.1), (−0.5, 0.3), (0.75, 0.45), (1.95, 0.6), (2.8, 0.75)] for i = 1, . . . , 6, and x_* = 3.2.
Figure 1.2: Blue points represent the known noisy data and the green one is the seventh point that we are interested in estimating.
As mentioned in previous sections what relates one observation to another in such cases
is the covariance function. Purely for simplicity of exposition, our choice is the squared exponential (SE) kernel, eq. (1.5), and since the data points are noisy we fold the noise into k(x, x') by writing

k(x, x') = σ_f² exp(−(x − x')²/(2ℓ²)) + σ_n² δ(x, x')     (1.21)

where σ_f² should be high for functions which cover a broad range on the y axis, δ(x, x') is the Kronecker delta function and ℓ is the length-scale parameter, whose role will be explained in detail in the next chapter.
To prepare for GPR, we calculate the covariance function among all possible combinations of these points, summarizing our findings in three matrices:

K = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_n)
      k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_n)
      ...          ...          ...  ...
      k(x_n, x_1)  k(x_n, x_2)  ...  k(x_n, x_n) ]
In our particular case there are 6 observations with x_i = [−2.5, −1.50, −0.5, 0.75, 1.95, 2.8] for i = 1, . . . , 6, and with judicious choices of σ_f² = 1.5, ℓ = 2.0 and σ_n² = 0.1 from the error bars we have enough to calculate the covariance matrices using eq. (1.21). Thus:
K = [ 1.6       1.323745  0.909795  0.400577  0.126205  0.044789
      1.323745  1.6       1.323745  0.796643  0.33879   0.148705
      0.909795  1.323745  1.6       1.233866  0.708328  0.384510
      0.400577  0.796643  1.233866  1.6       1.252905  0.88705
      0.126205  0.33879   0.708328  1.252905  1.6       1.370468
      0.044789  0.148705  0.384510  0.88705   1.370468  1.6     ]
Since the test set consists of a single point x_*, and taking into consideration the expressions given by (1.13) and (1.14), the mean of the distribution and the uncertainty of the estimation captured by the variance are obtained as shown in Figure 1.3.
Figure 1.3: The GP posterior conditioned on six noisy observations and the prediction for the test point x_* = 3.2. The shaded area represents the 95% confidence region.
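The prediction above can be reproduced numerically with a few lines of NumPy; the following sketch uses eq. (1.21) with the stated values σ_f² = 1.5, ℓ = 2.0 and σ_n² = 0.1, and takes the y-values exactly as printed above.

```python
import numpy as np

x = np.array([-2.5, -1.50, -0.5, 0.75, 1.95, 2.8])
y = np.array([0.6, 0.1, 0.3, 0.45, 0.6, 0.75])
x_star = 3.2
sf2, ell, sn2 = 1.5, 2.0, 0.1

def k(a, b):
    """Kernel of eq. (1.21): SE term plus a noise term on coincident inputs."""
    return sf2 * np.exp(-(a - b) ** 2 / (2 * ell ** 2)) + sn2 * (a == b)

K = k(x[:, None], x[None, :])                     # 6x6 training covariance (matches the matrix above)
k_star = k(x, np.full_like(x, x_star))            # covariances between the training points and x_*
mean = k_star @ np.linalg.solve(K, y)             # eq. (1.13)
var = k(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)   # eq. (1.14)
# Note: k(x_*, x_*) includes sn2, so `var` is the variance of a noisy observation y_*;
# subtract sn2 to obtain the variance of the latent f_*.
```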
Chapter 2
Covariance Functions
The covariance function (also called kernel, kernel function, or covariance kernel) is the
driving factor in a Gaussian process predictor as it specifies which functions are likely
under the GP prior, which in turn determines the generalization properties of the model.
In other words, choosing a useful kernel is equivalent to learning a useful representation
of the input as it encodes our assumptions about the function which we wish to learn.
Colloquially, kernels are often said to specify the similarity between two objects, in our
case, data points. In this way, as we have already mentioned in the first chapter, inputs x
which are close are likely to have similar target values y, and thus training points that are
near to a test point should be informative about the prediction at that point.
The purpose of this chapter is to give some theoretical properties of covariance functions as well as some of the commonly-used examples. We also show how to use kernels to
build models of functions with many different kinds of structure: additivity, symmetry,
periodicity, interactions between variables, changepoints and some of the structures which
can be obtained by combining them.
2.1 Definition

Definition 2.1. Let X be a nonempty set, sometimes referred to as the index set. A symmetric function K : X × X → R is called a positive definite (p.d.) kernel on X if

Σ_{i,j=1}^n c_i c_j K(x_i, x_j) ≥ 0     (2.1)

holds for any n ∈ N, any choice of points x_1, . . . , x_n ∈ X and any c_1, . . . , c_n ∈ R.
If K_1, . . . , K_n are p.d. kernels on sets X_1, . . . , X_n respectively, then, for example,

• K((x_1, . . . , x_n), (y_1, . . . , y_n)) = Σ_{i=1}^n K_i(x_i, y_i)

is a p.d. kernel on X = X_1 × · · · × X_n. Moreover, let X_0 ⊂ X. Then the restriction K_0 of K to X_0 × X_0 is also a p.d. kernel.
In this work we consider covariance functions where the input domain X is a subset of the vector space R^D. A footnote providing an alternative and curious definition of kernel¹ in a machine learning context, due to R. Schaback and H. Wendland, as well as a more detailed explanation of the positive definite matrix property,² is added below.
Each kernel has a number of parameters which specify the precise shape of the covariance
function. These are sometimes referred to as hyperparameters, since they can be viewed
as specifying a distribution over function parameters, instead of being parameters which
specify a function directly. The length-scale `, the signal variance s2f and the noise vari-
ance sn2 are the most representative. In the next section, for every mentioned kernel we
will provide some graphics showing the effects of varying the hyperparameters on GP
prediction as well as more formal definitions.
Before giving an overview of some commonly used kernels we will provide a popular
classification of them.
If a covariance function is a function of x − x', and is thus invariant to translations in the input space, it is called a stationary covariance function. If, further, the covariance function is invariant to all rigid motions, meaning that it is a function only of |x − x'|, it is called isotropic. For example, the squared exponential covariance function given in equation (1.5) is isotropic and consequently stationary.³
On the other hand, if a covariance function depends on x and x' only through x · x', we call it a dot product covariance function. An important example is the inhomogeneous polynomial kernel k(x, x') = (σ_0² + x · x')^p.
¹ A kernel is a function K : Ω × Ω → R, where Ω can be an arbitrary nonempty set. Some readers may consider this as being far too general. However, in the context of learning algorithms, the set Ω defines the possible learning inputs. Thus Ω should be general enough to allow Shakespeare texts or X-ray images, i.e. Ω should better have no predefined structure at all. Thus the kernels occurring in machine learning are extremely general, but they still take a special form which can be tailored to meet the demands of applications.
² A real n × n covariance matrix K which satisfies Q(v) = v^T K v ≥ 0 for all vectors v ∈ R^n is called positive semidefinite (PSD). If Q(v) = 0 only when v = 0, the matrix is positive definite. Q(v) is called a quadratic form. A symmetric matrix is PSD iff all of its eigenvalues are non-negative.
³ As the kernel is now only a function of r = |x − x'|, these are also known as radial basis functions (RBFs).
By Bochner's theorem, a stationary kernel k(τ), with τ = x − x', can be represented as the Fourier transform of a positive finite measure µ. If µ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:⁴

k(τ) = ∫ S(s) e^{2πi s^T τ} ds     (2.2)

S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ     (2.3)
In other words, a spectral density entirely determines the properties of a stationary kernel, and spectral densities are often more interpretable than kernels. If we Fourier transform a stationary kernel, the resulting spectral density shows the distribution of support over frequencies. A heavy-tailed spectral density has relatively large support for high frequencies, while a uniform density over the spectrum corresponds to white noise. Therefore draws from a process with a heavy-tailed spectral density tend to appear more erratic (containing higher frequencies, and behaving more like white noise) than draws from a process whose spectral density concentrates its support on low-frequency functions. Indeed, one can gain insight into kernels by considering their spectral densities.

⁴ A proof and further reading can be found in Stein, M. L. (1999), Interpolation of Spatial Data, Springer-Verlag.
We now give some examples of stationary covariance functions.
Squared Exponential

The SE kernel has become the de facto default kernel for GPs and SVMs (Support Vector Machines). It is also known as the Radial Basis Function or the Gaussian kernel. It has already been introduced in the first chapter, equation (1.5), and has the form

k_SE(x, x') = σ_f² exp(−(x − x')²/(2ℓ²))     (2.4)

This covariance function has some nice properties, such as being infinitely differentiable, which means that the GP has mean square derivatives of all orders and is thus very smooth. It has two parameters:
- The length-scale ℓ determines the length of the 'wiggles' in the function. In general, we will not be able to extrapolate more than ℓ units away from the data. Informally, it can be thought of as roughly the distance one has to move in input space before the function value can change significantly.⁶
- The output variance σ_f² determines the average distance of the function away from its mean. Every kernel has this parameter out in front; it is just a scale factor.
(a) (ℓ, σ_f²) = (1, 1)   (b) (ℓ, σ_f²) = (0.1, 1)   (c) (ℓ, σ_f²) = (1, 3.3)

Figure 2.1: GP priors generated with the SE kernel for different hyperparameter values. Panel (a) shows priors generated with (ℓ, σ_f²) = (1, 1): the functions are very smooth and the average distance of the functions away from their mean is quite controlled, since the variance term is small. In panel (b) the length-scale has been shortened and the priors are wigglier and oscillate more quickly. In panel (c), as the variance is larger, the range of the priors increases.
⁶ For a 1-d GP, one way to understand the characteristic length-scale of the process is in terms of the number of upcrossings of a level u. The expected number of upcrossings E[N_u] of the level u on the unit interval by a zero-mean, stationary, almost surely continuous Gaussian process is given by

E[N_u] = (1/2π) √(−k''(0)/k(0)) exp(−u²/(2k(0)))
Matérn

The Matérn class of covariance functions is given by

k_Matern(τ) = (2^{1−ν}/Γ(ν)) (√(2ν) τ/ℓ)^ν K_ν(√(2ν) τ/ℓ)     (2.5)

with positive parameters ν and ℓ, where K_ν is a modified Bessel function.⁷ The most interesting cases for machine learning are ν = 3/2 and ν = 5/2, for which

k_{ν=3/2}(τ) = (1 + √3 τ/ℓ) exp(−√3 τ/ℓ)     (2.6)

k_{ν=5/2}(τ) = (1 + √5 τ/ℓ + 5τ²/(3ℓ²)) exp(−√5 τ/ℓ)     (2.7)

Another special case is ν = 1/2, which gives the exponential (Laplacian) covariance function; its sample paths, corresponding to the Ornstein–Uhlenbeck process, are very rough.
Figure 2.2: GP priors generated with the Matérn kernel for different values of the hyperparameter ν. (c) For ν = 1/2 the process becomes very rough. (d) For values of ν ≥ 7/2 it is hard to distinguish between finite values of ν and ν → ∞, the smooth squared exponential case.
⁷ The modified Bessel functions (also named the hyperbolic Bessel functions) of the first and second kind are defined by

I_α(x) = i^{−α} J_α(ix) = Σ_{m=0}^∞ 1/(m! Γ(m+α+1)) (x/2)^{2m+α}

K_α(x) = (π/2) (I_{−α}(x) − I_α(x)) / sin(απ)

when α is not an integer; when α is an integer, the limit is used.
Rational Quadratic

The rational quadratic (RQ) kernel

k_RQ(τ) = σ_f² (1 + τ²/(2αℓ²))^{−α}     (2.8)

with α, ℓ > 0 can be seen as an infinite sum of squared exponential (SE) covariance functions with different characteristic length-scales. GP priors with this kernel therefore expect to see functions which vary smoothly across many length-scales. The parameter α determines the relative weighting of large-scale and small-scale variations. When α → ∞, the RQ is identical to the SE. Note that the process is infinitely mean-squared differentiable for every α, in contrast to the Matérn kernel.
(a) (σ_f², ℓ, α) = (1.0, 1.0, 1.0)   (b) (σ_f², ℓ, α) = (1.0, 1.0, 0.01)   (c) (σ_f², ℓ, α) = (1.0, 1.0, 100)   (d) (σ_f², ℓ, α) = (1.0, 0.1, 40)

Figure 2.3: Note that panel (c), since the α value is high, is almost identical to Figure 2.1a. In panel (d), as the value of the length-scale decreases, the function becomes more wiggly, as expected.
Periodic

The periodic kernel allows modelling functions which repeat themselves with period p:

k_Per(x, x') = σ_f² exp(−2 sin²(π |x − x'|/p)/ℓ²)     (2.9)
(a) (σ_f², ℓ, p) = (1, 1, 1)   (b) (σ_f², ℓ, p) = (1, 1, 0.1)   (c) (σ_f², ℓ, p) = (2, 0.8, 0.5)

Figure 2.4
2.2.2 Dot Product Covariance Functions

The simplest member of this family is the constant kernel,

k(x, x') = σ_0²     (2.10)

The linear covariance function can be written as

k_Lin(x, x') = σ_b² + σ_v² (x − c)(x' − c)     (2.11)

with a bias variance σ_b², a slope variance σ_v² and an offset c.
If you use just a linear kernel in a GP you are simply doing Bayesian linear regression, which significantly improves the computation time: instead of taking O(n³) time, inference can be done in O(n). Moreover, combining it with other kernels gives rise to some nice properties, as will be shown further on.
(a) (σ_b², σ_v², c, p) = (1.0, 0.3, 0, 1)   (b) (σ_b², σ_v², c, p) = (2.0, 1.2, 0, 1)   (c) (σ_b², σ_v², c, p) = (1.0, 0.4, 0, 2)

Figure 2.5: In panel (c), since p = 2, meaning that the kernel is the product of two linear kernels, the prior is a quadratic function.
2.2.3 Other Non-stationary Covariance Functions

An important example of a non-stationary kernel is the neural network covariance function. Consider a neural network with one hidden layer,

f(x) = b + Σ_{i=1}^J v_i h(x; u_i)     (2.12)

where the v_i are the hidden-to-output weights, h is any bounded hidden unit transfer function, the u_i are the input-to-hidden weights, and J is the number of hidden units. Let the bias b and the v's have independent zero-mean distributions with variances σ_b² and σ_v²/J respectively, and let the weights u_i for each hidden unit have independent and identical distributions.
Collecting all the weights together into a vector w, the first two moments of f(x) in equation (2.12) are

E[f(x)] = 0     (2.13)

cov[f(x), f(x')] = E_w[f(x) f(x')] = σ_b² + σ_v² (1/J) Σ_{i=1}^J E_u[h(x; u_i) h(x'; u_i)] = σ_b² + σ_v² E_u[h(x; u) h(x'; u)]     (2.14)
where the last equality follows from the fact that the u_i are identically distributed. The sum in equation (2.14) is over J i.i.d. random variables and all moments are bounded, so if b has a Gaussian distribution the central limit theorem can be applied to show that, as J → ∞, any collection of function values f(x_1), . . . , f(x_N) has a joint Gaussian distribution, and thus the neural network in equation (2.12) becomes a Gaussian process with covariance function given by the last term in (2.14).
For a specific choice of transfer function and Gaussian input-to-hidden weights the covariance can be computed analytically,⁸ giving the neural network kernel

k_NN(x, x') = (2/π) sin^{−1}( 2 x̃^T Σ x̃' / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃'^T Σ x̃')) )     (2.15)

where x̃ denotes the augmented input vector.
White Noise

The white noise kernel

k(x, x') = σ_f² δ(x − x')     (2.16)

is mainly used as part of a sum-kernel, where it explains the noise component of the signal; tuning its parameter corresponds to estimating the noise level.
⁸ A detailed step-by-step explanation is provided by Williams, C. K. I. (1998) in Computation with Infinite Neural Networks.
The following table summarizes several of the covariance functions discussed in this section (see also http://ttic.uchicago.edu/ dmcallester/ttic101-07/lectures/kernels/kernels):

Kernel               Expression
Matérn               k_Matern(τ) = (2^{1−ν}/Γ(ν)) (√(2ν)τ/ℓ)^ν K_ν(√(2ν)τ/ℓ)
Rational Quadratic   k_RQ(τ) = σ_f² (1 + τ²/(2αℓ²))^{−α}
Periodic             k_Per(x, x') = σ² exp(−2 sin²(π|x − x'|/p)/ℓ²)
Neural Network       k_NN(x, x') = (2/π) sin^{−1}( 2x̃^T Σ x̃' / √((1 + 2x̃^T Σ x̃)(1 + 2x̃'^T Σ x̃')) )
2.3 Combining Kernels
Two kernels can be combined into a new kernel by addition or multiplication:

(k_1 + k_2)(x, x') = k_1(x, x') + k_2(x, x')     (2.17)

(k_1 × k_2)(x, x') = k_1(x, x') × k_2(x, x')     (2.18)
Figure: draws from GP priors with product kernels. (a) Lin × Lin  (b) SE × Per  (c) Lin × SE  (d) Lin × Per
2.3.2 Multi-dimensional models

A standard way to model functions of several inputs is to multiply one-dimensional SE kernels, one per input dimension, which gives the SE-ARD kernel

SE-ARD(x, x') = ∏_{d=1}^D σ_d² exp(−(x_d − x_d')²/(2ℓ_d²)) = σ_f² exp(−½ Σ_{d=1}^D (x_d − x_d')²/ℓ_d²)     (2.19)

Note that this procedure is named ARD (automatic relevance determination) since estimating the length-scale parameters ℓ_1, ℓ_2, . . . , ℓ_D implicitly determines the relevance of each dimension: input dimensions with relatively large length-scales imply relatively little variation along those dimensions in the function being modelled.
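In scikit-learn, one of the libraries used in this work, the SE-ARD kernel of eq. (2.19) corresponds to an anisotropic RBF kernel with one length-scale per dimension; the following is an illustrative sketch with made-up data and starting values, not an experiment from this work.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# One length-scale per input dimension: large fitted values flag irrelevant dimensions.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0, 5.0])

X = np.random.randn(50, 3)
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + 0.05 * np.random.randn(50)   # toy data: dimension 3 is irrelevant
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X, y)
print(gpr.kernel_)   # the optimized length-scales reveal the relevance of each dimension
```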
2.3.3 Modelling Sums of Functions

If f_a ∼ GP(µ_a, k_a) and f_b ∼ GP(µ_b, k_b) are independent, then

f_a + f_b ∼ GP(µ_a + µ_b, k_a + k_b)     (2.20)

so it is easy to encode additivity into GP models. Note that the kernels k_a and k_b can be of different types, allowing us to model the data as a sum of independent functions, each possibly representing a different type of structure.
Figure: draws from GP priors with sum kernels. (a) Lin + Per: periodic plus trend  (b) SE + Per: periodic plus noise  (c) SE + Lin: linear plus variation  (d) SE_long + SE_short: slow and fast variation
2.3.4 Changepoints

An example of how combining kernels can give rise to more structured priors is given by changepoint kernels, which can express a change between different types of structure. Changepoint kernels can be defined through addition and multiplication with sigmoidal functions such as σ(x) = 1/(1 + exp(−x)):

CP(k_1, k_2)(x, x') = k_1 × σ + k_2 × σ̄     (2.22)

where σ̄ = 1 − σ.
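A minimal NumPy sketch of this construction, written with the common convention in which each sigmoid is applied to both arguments of the kernel it multiplies (the changepoint location x0 and the steepness below are illustrative parameters, not values used in this work):

```python
import numpy as np

def sigmoid(x, x0=0.0, steepness=1.0):
    """Sigmoidal switch s(x) = 1 / (1 + exp(-steepness * (x - x0)))."""
    return 1.0 / (1.0 + np.exp(-steepness * (x - x0)))

def changepoint_gram(k1, k2, X, Xp, x0=0.0, steepness=1.0):
    """Gram matrix of CP(k1, k2): k1 is active where s ~ 1, k2 where s ~ 0."""
    s = sigmoid(X, x0, steepness)[:, None]      # s(x)  as a column
    sp = sigmoid(Xp, x0, steepness)[None, :]    # s(x') as a row
    return (k1(X[:, None], Xp[None, :]) * s * sp
            + k2(X[:, None], Xp[None, :]) * (1 - s) * (1 - sp))

# Example: switch from an SE kernel to a linear kernel at x0 = 0.
# K = changepoint_gram(lambda a, b: np.exp(-0.5 * (a - b) ** 2), lambda a, b: a * b, X, X)
```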
Chapter 3

Model Selection and Adaptation of Hyperparameters

In chapter 1 we saw how to do regression using a Gaussian process with a given fixed covariance function, and in chapter 2 several examples of covariance functions were presented. While some of their properties, such as stationarity, may be easy to determine from the context, it may not be trivial to specify with confidence the values of the free hyperparameters, such as length-scales and variances. Therefore the natural question that follows, and the one that turns Gaussian processes into powerful practical tools, is how to develop methods that address the model selection problem, referring both to the discrete choice of the functional form of the covariance function and to the values of its hyperparameters.
In section 3.1 we outline model selection for regression problems, focusing on the Bayesian approach in section 3.1.1 and on cross-validation, in particular the leave-one-out estimator, in section 3.1.2. In the remaining section Bayesian principles are applied to a specific case in order to provide a practical view.
The marginal likelihood¹ is the integral of the likelihood times the prior,

p(y|X) = ∫ p(y|f, X) p(f|X) df     (3.1)

Since p(f|X) = N(f | 0, K) and p(y|f) = ∏_i N(y_i | f_i, σ_y²), the marginal likelihood is given by

log p(y|X) = log N(y | 0, K_y) = −½ y^T K_y^{−1} y − ½ log|K_y| − (N/2) log(2π)     (3.2)

where K_y = K_f + σ_n² I is the covariance matrix for the noisy targets y and K_f is the covariance matrix for the noise-free latent f.
The first term of the marginal likelihood in equation (3.2) is the data-fit term, since it is the only one involving the observed targets. The second term, −½ log|K_y|, is a complexity penalty, depending only on the covariance function and the inputs. Lastly, −(N/2) log(2π) is just a normalization constant.
To understand the trade-off between the first two terms, consider a 1-dimensional SE kernel, as we vary the length-scale ℓ and hold σ_y² fixed. For short length-scales the fit will be good, so y^T K_y^{−1} y will be small. However, the model complexity will be high: K_y will be almost diagonal, since most points will not be considered near any others, so log|K_y| will be large. For long length-scales the fit will be poor but the model complexity will be low: K_y will be almost all 1's, so log|K_y| will be small.
Now we maximize the marginal likelihood in order to set the hyperparameters. We seek the partial derivatives of the marginal likelihood w.r.t. the hyperparameters, denoted by θ. Using (3.2) and the rules of matrix derivatives,² one can show that

∂/∂θ_j log p(y|X, θ) = ½ y^T K_y^{−1} (∂K_y/∂θ_j) K_y^{−1} y − ½ tr(K_y^{−1} ∂K_y/∂θ_j) = ½ tr( (αα^T − K_y^{−1}) ∂K_y/∂θ_j )     (3.3)

where α = K_y^{−1} y. It takes O(n³) time to compute K_y^{−1}, and O(n²) time per hyperparameter to compute the gradient. Often there exist constraints on the hyperparameters, such as σ_y² ≥ 0. In this case we can define θ = log(σ_y²) and use the chain rule.
Note that there is no guarantee that the marginal likelihood does not suffer from multiple local optima, since the objective is not convex; although this is not a devastating problem, every local maximum corresponds to a particular interpretation of the data.
¹ The reason it is called the marginal likelihood, rather than just the likelihood, is that we have marginalized out the latent function values f.
In Figure 3.1 an example with two local optima is provided. For a randomly generated data set of 20 observations, the underlying function has been inferred considering two different sum-kernels. Although both result from the addition of a squared exponential plus a white noise kernel, they differ in the values of hyperparameters such as the length-scale and the noise level.
One might expect that, even though the kernels start from different values, the log-marginal-likelihood would converge to a global maximum after the optimization process. However, the illustration of the log-marginal-likelihood (LML) landscape shows that there exist two local maxima of the LML. The first corresponds to a model with a high noise level and a large length-scale, which explains all variation in the data by noise. The second one has a smaller noise level and a shorter length-scale, which explains most of the variation by the noise-free functional relationship.
(a) Initial: (ℓ, σ_f², σ_n²) = (100, 1, 1); optimum: (ℓ, σ_f², σ_n²) = (109, 0.00316, 0.637); log-marginal-likelihood: 23.872337362
(b) Initial: (ℓ, σ_f², σ_n²) = (1, 1, 1e-5); optimum: (ℓ, σ_f², σ_n²) = (0.365, 0.64, 0.294); log-marginal-likelihood: 21.8050908902

Figure 3.1: Panels (a) and (b) show the underlying functions (and 95% confidence intervals). Panel (c) shows the marginal likelihood as a function of the hyperparameters ℓ and σ_n².
The basic idea of cross-validation is to split the training set into two disjoint sets, one used for training and the other, usually called the validation set, used to monitor performance. Having defined a dataset on which to test the model during the training phase mitigates problems like overfitting and gives an insight into how the model will generalize to an independent dataset.
Although different types of cross-validation can be distinguished, in the Gaussian process context the most widespread are those that involve multiple rounds of cross-validation using different partitions, with the validation results averaged over the rounds. Methods such as k-fold cross-validation and leave-one-out cross-validation (LOO-CV) belong to this framework.
In the k-fold CV approach the training set is split into k smaller sets. The model is then trained using k − 1 of the folds as training data and validated on the remaining part of the data. Finally, as mentioned before, the performance measure reported by k-fold cross-validation is the average of the values computed in the loop.
On the other hand, LOO-CV is the extreme case of k-fold cross-validation in which the number of folds equals the number of training cases, k = n. Although its computational cost appears prohibitive, there are computational shortcuts that make LOO-CV an efficient way to tune the hyperparameters and select the model.
To begin with, since cross-validation can be used with any loss function, a natural choice is the log predictive probability when leaving out training case i,

log p(y_i | X, y_{−i}, θ) = −½ log σ_i² − (y_i − µ_i)²/(2σ_i²) − ½ log 2π     (3.4)

where the notation y_{−i} means all the targets except number i, the training set is taken to be (X_{−i}, y_{−i}), and µ_i and σ_i² are computed according to (1.11) and (1.12) respectively. Thus the LOO log predictive probability is

L_LOO(X, y, θ) = Σ_{i=1}^n log p(y_i | X, y_{−i}, θ)     (3.5)
Note that in each of the n LOO-CV folds the inverse of the covariance matrix has to be computed in order to determine the mean and variance in eqs. (1.11) and (1.12). However, in each rotation the expressions are almost the same, since only a single column and row of the covariance matrix are removed. Therefore, applying matrix inversion by partitioning of the complete covariance matrix increases efficiency, and the expressions for the LOO-CV predictive mean and variance become

µ_i = y_i − [K^{−1} y]_i / [K^{−1}]_ii   and   σ_i² = 1 / [K^{−1}]_ii     (3.6)
where the computational expense of these quantities is O(n³) once, for computing the inverse of K, plus O(n²) for the entire LOO-CV procedure. At this point, substituting the expressions of (3.6) into eqs. (3.4) and (3.5) gives rise to a performance estimator that can be optimized w.r.t. the hyperparameters to do model selection.
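The shortcut of eq. (3.6) is straightforward to implement once K^{−1} is available; a small NumPy sketch, assuming the noise variance has already been folded into K:

```python
import numpy as np

def loo_mean_var(K, y):
    """Leave-one-out predictive means and variances from eq. (3.6)."""
    K_inv = np.linalg.inv(K)           # O(n^3), done once
    diag = np.diag(K_inv)
    mu = y - (K_inv @ y) / diag        # mu_i = y_i - [K^{-1} y]_i / [K^{-1}]_ii
    var = 1.0 / diag                   # sigma_i^2 = 1 / [K^{-1}]_ii
    return mu, var
```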
Before giving an expression for the partial derivatives of L_LOO w.r.t. the hyperparameters, we need the partial derivatives of the LOO-CV predictive mean and variance from eq. (3.6) w.r.t. the hyperparameters, which are

∂µ_i/∂θ_j = [Z_j α]_i / [K^{−1}]_ii − α_i [Z_j K^{−1}]_ii / [K^{−1}]_ii²,    ∂σ_i²/∂θ_j = [Z_j K^{−1}]_ii / [K^{−1}]_ii²     (3.7)

where α = K^{−1} y and Z_j = K^{−1} ∂K/∂θ_j. Thus the partial derivatives of (3.5), obtained using the chain rule and eq. (3.7), are

∂L_LOO/∂θ_j = Σ_{i=1}^n ( ∂log p(y_i|X, y_{−i}, θ)/∂µ_i · ∂µ_i/∂θ_j + ∂log p(y_i|X, y_{−i}, θ)/∂σ_i² · ∂σ_i²/∂θ_j )     (3.8)

= Σ_{i=1}^n ( α_i [Z_j α]_i − ½ (1 + α_i²/[K^{−1}]_ii) [Z_j K^{−1}]_ii ) / [K^{−1}]_ii     (3.9)

where the computational complexity is O(n³) for computing the inverse of K, plus O(n³) per hyperparameter for the derivative equation above.
Having reached this stage, a natural question to ask is under which circumstances each of the methods discussed, marginal likelihood or LOO-CV, might be preferable, since their computational complexities are very similar. In the following sections, as in the bank account balance forecasting problem, model selection and adaptation of the hyperparameters are done using the marginal likelihood, for programming convenience and because of its widespread use in the machine learning community.
3.2 Mauna Loa Atmospheric Carbon Dioxide Example and Discussion

This example is based on Section 5.4.3 of Gaussian Processes for Machine Learning [Rasmussen & C.K.I. Williams] and it represents a clear example of a complex kernel and of hyperparameter optimization using gradient ascent on the log-marginal-likelihood.
The data consist of the monthly average atmospheric CO2 concentrations (in parts per million by volume, ppmv) collected at the Mauna Loa Observatory in Hawaii between 1959 and 2003. The data are shown in Figure 3.2. The objective is to model the CO2 concentration as a function of the time x; thus we are working in a time series setting.
Figure 3.2: The 540 observations of monthly CO2 concentration made between 1959 and 2003. Three missing values were replaced by the mean concentration of the year. Note the rising trend and the seasonal variations.
A reasonable way to start the modelling process is to identify the different components of our data. Classic deterministic analysis of time series suggests that every series can be decomposed into all or some of these features: trend, seasonal variation, medium-term irregularities and noise. Once this first step is accomplished, the selection and combination of kernels and hyperparameters should take care of these individual properties in order to provide an excellent fit to the data.
In our particular case the Mauna Loa dataset presents a pronounced long-term rising trend, which we try to model using a squared exponential kernel with two hyperparameters controlling the variance, θ_1, and the characteristic length-scale, θ_2:

k_1(x, x') = θ_1² exp(−(x − x')²/(2θ_2²))     (3.10)

The SE kernel with a large length-scale enforces this component to be smooth; note that it is not enforced that the trend is rising, which leaves this choice to the GP.
On the other hand, the seasonal component³ also seems evident. We can use the periodic kernel with a fixed periodicity of 1 year. The length-scale θ_5 of this periodic component, controlling its smoothness, is a free parameter and θ_3 gives the magnitude. In order to allow a decay away from exact periodicity, the product with an SE kernel is taken.

³ The seasonal component is caused by plant activity, which changes the CO2 concentration over the year.
The length-scale θ_4 of this SE component controls the decay time and is a further free parameter:

k_2(x, x') = θ_3² exp(−(x − x')²/(2θ_4²) − 2 sin²(π(x − x'))/θ_5²)     (3.11)
In order to capture the medium-term irregularities a rational quadratic term is used. These irregularities are better explained by a rational quadratic than by an SE component, since the RQ can accommodate several length-scales:

k_3(x, x') = θ_6² (1 + (x − x')²/(2θ_8 θ_7²))^{−θ_8}     (3.12)

As before, θ_6 is the magnitude; the length-scale θ_7 and θ_8 (also known as the α parameter, which determines the diffuseness of the length-scales) are to be determined.
The last feature to model, the noise term, is specified as the sum of an SE kernel, which explains correlated noise components such as local weather phenomena, and a white noise kernel:

k_4(x, x') = θ_9² exp(−(x − x')²/(2θ_10²)) + θ_11² δ_pq     (3.13)

where θ_9 is the magnitude of the correlated noise component, θ_10 is its length-scale and θ_11² is the variance of the independent noise component.
To sum up, the final covariance function is

k(x, x') = k_1(x, x') + k_2(x, x') + k_3(x, x') + k_4(x, x')     (3.14)
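In scikit-learn the composite kernel (3.14) can be written directly with the built-in kernel classes; in the sketch below the numeric values are illustrative initializations that the optimizer then adapts by maximizing the log marginal likelihood, not the fitted hyperparameters.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, ConstantKernel, ExpSineSquared,
                                              RationalQuadratic, WhiteKernel)

k1 = ConstantKernel(50.0 ** 2) * RBF(length_scale=50.0)                      # long-term rising trend
k2 = (ConstantKernel(2.0 ** 2) * RBF(length_scale=100.0)
      * ExpSineSquared(length_scale=1.0, periodicity=1.0))                   # decaying yearly seasonality
k3 = ConstantKernel(0.5 ** 2) * RationalQuadratic(length_scale=1.0, alpha=1.0)  # medium-term irregularities
k4 = ConstantKernel(0.1 ** 2) * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.1 ** 2)  # noise model

kernel = k1 + k2 + k3 + k4
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gpr.fit(X_years, y_co2)   # X_years: column vector of decimal years, y_co2: CO2 concentrations
```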
(b) Long-term smooth rising trend   (c) Seasonal variation over the years

Figure 3.3: Panel (a) shows the underlying function together with the 95% predictive confidence region for predictions until 2017. The confidence interval gets wider as time increases. In panels (b), (c), (d), (e) some components capture slowly-changing structure while others capture quickly-varying structure.
To sum up, the Mauna Loa example has been a good way to test the powerful possibilities that inference with composite covariance functions provides, as well as the usefulness of letting the data determine the hyperparameters. The experience acquired facing this problem allows us to confront the main project of this work, the bank account forecasting problem, with a more solid background.
Chapter 4

Bank Account Forecasting Problem
In this final chapter, since the main theoretical contributions and an introductory example such as Mauna Loa have been covered, we face a more complex problem consisting of forecasting bank account balances using Gaussian processes.
Nowadays, with the advance of affordable, fast computation, the machine learning community has addressed increasingly large and difficult problems, and one appealing area of application is banking. Even though it is an extremely regulated environment, it offers many possibilities, and it becomes extremely exciting to work with massive amounts of financial data and to figure out how to use them to produce new tools.
In that context, anticipating the behaviour of personal accounts or predicting people's expenses is a challenge that holds a special place in any modern platform of intelligent banking services. In this particular exercise we have 3.35 million accounts¹ and we try to predict, for each client, the 13th state of the account balance given only the 12 previous historical values. More precisely, we work on just a year's worth of data with monthly aggregations, and the 13th month is our forecasting goal.²
As mentioned in the introduction, a prestigious European bank tackled this problem in order to provide its customers with a warning system for cases in which the balance of their bank accounts is likely to decrease significantly or even become negative.
We are therefore facing a time-series³ regression problem. However, it is important to note that, since our time series are very short, statistical methods such as GARCH, ARIMA or Holt-Winters do not perform accurately. This leads us to the next question: can Gaussian processes accurately predict the 13th state of a bank account given the 12 previous ones?
To answer this question we approach the problem in two different ways.
¹ Due to computational limitations only 100.000 accounts are analyzed, as they are a representative sample. Examining the whole sample could be considered a very interesting Big Data problem.
² The anonymized bank account data come from real customers and are provided by a bank whose name cannot be disclosed.
First, all the bank accounts are analyzed as a global entity, meaning that one main kernel is used to perform the Gaussian process regression. In the second approach, in order to be more flexible and adaptive, the bank accounts are clustered according to their similarity; thus, for every cluster a different and more appropriate kernel can be considered. The details of the clustering process, such as the similarity measure and the choice of centroids and kernels, are explained in the Clustering section.
To sum up, the questions to answer are whether or not GPR works as expected and whether the clustering process improves the performance of GPR. A comparative evaluation of both approaches will therefore be shown.
Data description
Before starting the forecasting process we focus a bit more on the structure of the data. We have at our disposal 3.35 million bank accounts. Each account x_i is a vector with 1 row and 13 columns, where the first 12 columns represent the last 12 monthly states of the account and the 13th state is our prediction goal. For example, the first of our accounts is:
x_1 = [ 4200]
We actually know the value of the 13th state, in order to evaluate the error of our forecast. More formally, our dataset can be expressed as D = {(x_i, y_i) for i = 1, . . . , 3.351.168}, where each x_i contains the first 12 monthly states and y_i the 13th; alternatively it can be thought of as a matrix with 3.351.168 rows and 13 columns.
Given the size of the dataset we performed some data cleaning to reduce the number of rows. Firstly, we dropped all the accounts whose first 12 states were all zeros: of the 3.35 million accounts, 107.245 were dropped. Since 3.243.923 accounts still remain and the computational expense⁴ is unaffordable, we consider two simple random samples of 25.000 and 100.000 accounts. The reason for these sample sizes is that in the clustering approach the 25.000 accounts will be distributed into 20 groups and the 100.000 into 40, matching all the different patterns observed and letting us check whether more clusters mean more accuracy.

⁴ Moreover, the forecasting process for 100.000 accounts takes more than 36 hours.
Figure 4.1: Note that the accounts tend to have many zeros over the different months.
Forecasting
As mentioned above, two different simple random samples have been considered (N_1 = 25.000 and N_2 = 100.000 accounts). Both have been analyzed with the same kernel function, which has been designed to adapt to and fit well the general properties of the accounts. Note that, since we have to be able to model a wide range of different behaviour patterns, the properties expressed by the kernel cannot be very specific. Thus, the kernel function is derived by combining several different kinds of simple covariance functions. The final covariance function is
k(x, x') = k_1(x, x') + k_3(x, x') + k_4(x, x')     (4.1)

where

k_1(x, x') = θ_1² exp(−(x − x')²/(2θ_2²)) = 450² exp(−(x − x')²/(2 · 75.0²))     (4.2)

To model the long-term smooth trend we use an SE covariance function with a high amplitude parameter θ_1, since the variance is high, and a large characteristic length-scale θ_2.
To model the medium-term irregularities a rational quadratic term is used:

k_3(x, x') = θ_3² (1 + (x − x')²/(2θ_4² θ_5))^{−θ_5} = 5.5² (1 + (x − x')²/(2 · 2.0² · 1.5))^{−1.5}     (4.3)

One could also have used an SE form for this component, but it turns out that the rational quadratic works better. Finally, we specify a noise model as the sum of an SE and a WN kernel; noise in the series models rare behaviour changes caused by unexpected expenditures.
k_4(x, x') = θ_6² exp(−(x − x')²/(2θ_7²)) + θ_8² δ_pq = 4.5² exp(−(x − x')²/(2 · 0.3²)) + 20.2² δ_pq     (4.4)
Note that we have not considered a periodic covariance function, since it was not clear that all the accounts have a seasonal trend, let alone an exactly periodic one.⁵ Actually, some trials were done considering a seasonal kernel, but it turned out not to be relevant, since the decay-time parameter became very large.
The hyperparameters were fitted by optimizing the marginal likelihood with a gradient-based (BFGS) optimizer, as we proceeded in the Mauna Loa problem. Moreover, for each account we used k-fold cross-validation with k = 3, since it gives rise to more accurate predictions.
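The per-account model can be sketched as follows with scikit-learn kernels (the thesis code relies on the GPy and PyGPs libraries, so this is an equivalent formulation rather than the original implementation); the months are encoded as t = 1, . . . , 12 and the initial hyperparameter values are the ones quoted in eqs. (4.2)–(4.4).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, ConstantKernel,
                                              RationalQuadratic, WhiteKernel)

k1 = ConstantKernel(450.0 ** 2) * RBF(length_scale=75.0)                        # long-term trend, eq. (4.2)
k3 = ConstantKernel(5.5 ** 2) * RationalQuadratic(length_scale=2.0, alpha=1.5)  # irregularities, eq. (4.3)
k4 = ConstantKernel(4.5 ** 2) * RBF(length_scale=0.3) + WhiteKernel(noise_level=20.2 ** 2)  # noise, eq. (4.4)
kernel = k1 + k3 + k4

def forecast_account(balances):
    """Fit a GP to the 12 known monthly balances and predict the 13th."""
    t = np.arange(1, 13, dtype=float)[:, None]
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.0, normalize_y=True)
    gpr.fit(t, np.asarray(balances, dtype=float))
    mean, std = gpr.predict(np.array([[13.0]]), return_std=True)
    return mean[0], std[0]
```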
Before showing the predictions and the results achieved, we must define the prediction metrics or, taking into account the final goal of the forecasting, the accuracy measures that allow us to grade the forecast 13th state of a bank account as good or poor. Some of the measures of forecast accuracy used are standard in statistics, while others were specially designed for this particular problem. We considered, among others,
MASE = (1/N) Σ_{i=1}^N |ŷ^(i) − y^(i)| / ( (1/(T−1)) Σ_{t=2}^T |y_t^(i) − y_{t−1}^(i)| + λ )
One interesting and favorable property of MASE, when compared to other methods for calculating forecast errors, is its interpretability: values greater than one indicate that in-sample one-step forecasts from the naïve method perform better than the forecast values under consideration, so values near zero mean good accuracy compared to the naïve method. For our particular case T = 12, and λ ≠ 0 is a constant parameter to avoid division by zero.
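A direct implementation of this measure, with T = 12 and a small λ to avoid division by zero, might look like this:

```python
import numpy as np

def mase(y_true, y_pred, histories, lam=1e-6):
    """Mean absolute scaled error over N accounts; histories[i] holds the 12 known values of account i."""
    errors = []
    for yt, yp, h in zip(y_true, y_pred, histories):
        naive = np.mean(np.abs(np.diff(h)))       # average in-sample one-step error of the naive method
        errors.append(abs(yp - yt) / (naive + lam))
    return np.mean(errors)
```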
⁵ However, some of the accounts do behave seasonally, and in the clustering approach the kernel will capture that.
Results
                       MAE        RMSE       MASE     SAFC      SAFC0     SAFCcor   95% conf. inter.
N = 25.000 accounts    150.6495   1897.5950  0.92243  65.644%   91.724%   81.464%   88.288%
N = 100.000 accounts   159.9937   5389.1124  0.71326  64.826%   91.409%   81.058%   88.233%
After many trials, especially in order to decide the structure of the kernel and the initial values of the hyperparameters, the final performance is the one presented above. On the whole, the results are cautiously positive; however, it is obvious that there is still room for improvement, which we hope to achieve with the clustering approach.
The MASE for both cases is less than one, meaning that the GP performs better than the naïve method. Moreover, the SAFC measure and especially SAFC0 show very good accuracy, achieving more than 91% of correct sign predictions (as it counts zeros). At the same time SAFCcor is around 81%, meaning that the gap between SAFC0 and SAFCcor is approximately 10%; consequently, a small percentage of the zeros counted as good predictions in SAFC0 corresponds to predictions that are actually distant from 0.
On the negative side, the percentage of accounts whose predicted value lies in the 95% confidence interval is lower than expected, since it barely exceeds 88%. This is not excessively worrying, because that value is not far from 95%, but given the definition of a confidence interval and the fact that the uncertainty region is wide, we expect the results to improve in the clustering approach. From a strategic and business perspective this index is extremely critical, seeing as the goal of the bank was to provide a warning system able to predict when the balance of a bank account will become negative, or when a drastic decrease of the balance can occur, and thus send a message to the customer. Obviously, the customer might feel confused, upset or ashamed if, after receiving the alarm message, the balance of the bank account remains essentially unchanged. Thus, the bank expects to be highly confident, and in this way be a powerful aid instead of disturbing customers. In the next section we therefore cluster the accounts with the aim of improving the results.
4.2 Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. For our particular problem, classification methods such as SVMs and Naive Bayes would not be a good choice: Naive Bayes assumes that the input features are independent, and neither method exploits the temporal structure of the series. The k-NN algorithm could still work; however, it relies on a notion of a similarity measure between input examples. Thus, how do we measure the similarity between two time series?
At first, one might think that simply calculating the Euclidean distance between two time
series would give us a good idea of the similarity between them. Given two time series
of length n, { Xt : t = 1, . . . , n} and {Yt : t = 1, . . . , n} we define the Euclidean distance
between them as:
\mathrm{dist}(X_t, Y_t) = \sqrt{\sum_{t=1}^{n} (X_t - Y_t)^2} \qquad (4.5)
After all, the Euclidean distance between identical time series is zero and between very
different time series is large. However, before we settle on Euclidean distance as a similarity measure we should clearly state our desired criteria for determining the similarity
between two time series.
With a good similarity measure, small changes in two time series should result in small
changes in their similarity. With respect to Euclidean distance this is true for changes in
the y-axis, but it is not true for changes in the time axis. This is the problem with using
the Euclidean distance measure. It often produces pessimistic similarity measures when
it encounters distortion in the time axis. The way to deal with this is to use dynamic time
warping.
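A minimal sketch of the classical dynamic-time-warping distance is given below; it is only illustrative, since the text does not reproduce the implementation actually used for the bank accounts.

```python
# Straightforward dynamic time warping between two 1-D series: a cumulative cost
# matrix is filled so that D[i, j] is the cheapest warped alignment of the first
# i points of x with the first j points of y.
import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three allowed warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])
```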
The only important drawback is that dynamic time warping is quadratic in the length of the time series used: for two time series of lengths n and m, the complexity is O(nm). Since we perform DTW many times, this can be prohibitively expensive, so we speed things up using a lower bound of DTW known as LB Keogh, which is defined as:
\mathrm{LBKeogh}(X_t, Y_t) = \sum_{t=1}^{n} (Y_t - U_t)^2 \, I(Y_t > U_t) + (Y_t - L_t)^2 \, I(Y_t < L_t) \qquad (4.7)
where U_t and L_t are upper and lower bounds for the time series X_t, defined as U_t = \max(X_{t-r}, \ldots, X_{t+r}) and L_t = \min(X_{t-r}, \ldots, X_{t+r}) for a reach r, and I is the indicator function.6
6 The indicator function of a subset A of a set X is the function I_A : X \to \{0, 1\} defined as I_A(x) := 1 if x \in A and I_A(x) := 0 if x \notin A.
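A sketch of this lower bound, under the assumption of equally long series (the 12 monthly states), might look as follows; the final square root is our choice, so that the bound stays on the same scale as the DTW sketch above.

```python
# LB_Keogh lower bound of equation (4.7): an envelope of reach r is built around
# the series x, and only the points of y falling outside the envelope contribute.
import numpy as np

def lb_keogh(x, y, r):
    total = 0.0
    for t in range(len(y)):
        lo, hi = max(0, t - r), min(len(x), t + r + 1)
        U, L = np.max(x[lo:hi]), np.min(x[lo:hi])   # U_t and L_t as defined above
        if y[t] > U:
            total += (y[t] - U) ** 2
        elif y[t] < L:
            total += (y[t] - L) ** 2
    # square root so the bound is comparable with dtw_distance above
    return np.sqrt(total)
```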
Another way to speed things up when using DTW is to enforce a locality constraint. This works for long time series under the assumption that it is unlikely for two elements to be matched if they are too far apart. The threshold is determined by a window size, usually denoted w. This way, only mappings within this window are considered, which speeds up the inner loop. Since the time series involved in our problem are quite short, this second speed-up was not required.
In the next subsection the clustering conducted for the bank account forecasting problem is discussed in detail.
Once a reliable method to determine the similarity between two time series has been specified, everything is set to begin the clustering process. The purpose, as mentioned above, is to group together the bank accounts that behave in the same way, in order to specify a kernel function that better captures the properties of the accounts in each cluster and to determine whether clustering improves the performance of GPR.
After a deep data exploration with the aim of identifying the most common behaviour patterns, and since the number of clusters has to be chosen manually, two clustering processes have been performed: the first one groups 25.000 accounts into 20 different clusters, and the second one groups 100.000 accounts into 40 clusters.7
For both approaches the clustering process begins by taking a simple, representative random sample of the bank accounts: 10.000 accounts for the 20-cluster case and 15.000 for the 40-cluster case. Then 20 and 40 accounts, again randomly chosen, are specified as centroids. For each of the 10.000 (respectively 15.000) accounts initially taken, a search is performed to find the centroid that minimizes the DTW distance to the account, or in other words, the centroid that is most similar to the account analyzed.
Given that DTW is quadratic, this can be computationally expensive. However, since the LB Keogh lower bound is linear, we can speed up the assignment. Note that given two time series Xt, Yt it is always true that LBKeogh(Xt, Yt) ≤ DTW(Xt, Yt). Hence we can discard time series that cannot possibly be more similar than the current most similar time series, eliminating many unnecessary DTW computations, as in the sketch below.
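A possible sketch of this pruning step, reusing the dtw_distance and lb_keogh sketches above (names of our own choosing), is:

```python
# Assign an account to its most similar centroid: the cheap LB_Keogh bound is
# checked first and the expensive DTW distance is only computed when the bound
# does not already rule the centroid out.
def nearest_centroid(account, centroids, r=2):
    best_idx, best_dist = None, float("inf")
    for idx, centroid in enumerate(centroids):
        if lb_keogh(account, centroid, r) >= best_dist:
            continue                      # lower bound already too large: skip DTW
        d = dtw_distance(account, centroid)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist
```

Because LBKeogh(X, Y) ≤ DTW(X, Y), a centroid whose lower bound already exceeds the best DTW distance found so far can be discarded without computing DTW at all.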
After all the accounts have been assigned to a cluster, we recalculate the centroid of each cluster by finding the account that minimizes the global distance to all the accounts in the cluster; the main goal is to provide a representative centroid for each cluster. Consequently, to determine which cluster a new account belongs to, we only need to compute the DTW distance between the account and the centroids. Figure 4.2 shows a graphical representation of the clustering process performed in the first approach (20 clusters).
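The centroid update described above (choosing, within each cluster, the member account that minimizes the total DTW distance to all the other members) could be sketched as:

```python
# Recompute the representative centroid of one cluster as its medoid: the member
# account with the smallest total DTW distance to every other member.
def recompute_centroid(cluster_accounts):
    best_account, best_total = None, float("inf")
    for candidate in cluster_accounts:
        total = sum(dtw_distance(candidate, other) for other in cluster_accounts)
        if total < best_total:
            best_account, best_total = candidate, total
    return best_account
```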
7 Given the computational limitations we could not perform a clustering involving 100 groups or more.
Figure 4.2: Clustering of 25.000 bank accounts in 20 clusters. [1] Selection process of the representative centroids: take a simple random sample of 10.000 bank accounts. [1.1] Again randomly, specify 20 centroids. [1.2] Compute the DTW distance between the 10.000 accounts and the centroids, and cluster the accounts. [1.3] Recalculate the centroids by finding the account that minimizes the global distance to all the accounts of its cluster. [2] Clustering of the accounts: take a simple random sample of 25.000 bank accounts. [2.1] Compute the DTW distance between the accounts and the 20 centroids found in [1]. Then each account is assigned to the cluster whose centroid minimizes the distance to the account.
Note that the clustering of 100.000 bank accounts in 40 clusters is done using the same
mechanism as shown above. However, instead of taking 10.000 accounts in order to de-
termine the centroids, the sample size was 15.000. The figures below show the centroids
found in both approaches.
Figure 4.3: Plot showing the 12 historical values of the different centroids. Panel (a) shows the 20 centroids found for the first approach; panel (b) shows the 40 centroids of the second approach.
Finally, some images of accounts belonging to the same cluster are provided. Many of these clusters contain accounts that are extremely similar, meaning that, even though DTW provides a non-linear alignment, the similarity between the accounts in the cluster is remarkably high.
Figure 4.4: Examples of bank account clusters. Panel (a) shows a cluster which contains accounts with seasonal behaviour and medium variance. Panels (b) and (c) present large instability in the last quarter. Panels (d), (e) and (f) show extremely high variance with seasonal patterns. Panels (g) and (h) are quite stable with no remarkable oscillations; only (h) shows variation in the early and final months. Panel (i) differs from (a) in that during the last quarter the accounts show more activity.
Forecast
Once the clustering process is complete for both samples of bank accounts we are able to begin the forecasting. The first stage involves a deep and exhaustive exploration of the bank accounts of each cluster, trying to figure out the particularities and features that have to be expressed by each kernel function.
Since the choice of a specific kernel function is done manually8, it is sometimes a trial
8 On this topic, the Cambridge Machine Learning Group is currently developing an automatic statistician,
https://www.automaticstatistician.com/index/
and error procedure. Thus, our particular choice might not be optimal; however, to mitigate this problem several restarts and a wide variety of kernel functions have been considered for each cluster.
As in the global approach, the hyperparameters were fitted by optimizing the marginal likelihood using a gradient-based optimizer such as BFGS. On the other hand, since our goal is to compare the results achieved by the different approaches, the measures of accuracy considered in the clustering are the same as in the global view, but for exploratory reasons we also incorporate the 68,75% confidence interval. Hence, we report the forecast results for each cluster as well as for all the clusters as a whole. The results of the samples containing 25.000 and 100.000 bank accounts, clustered into 20 and 40 different groups respectively, will be analyzed separately.
Results of the 25.000 accounts sampled in 20 clusters
The analysis will proceed from the general to the particular, since we will compare first
the results achieved with the clustering to the ones in the global approach. Then, we will
provide a table showing for each cluster a plot of the accounts, the kernel function used
and the different measures of accuracy.
The general results of the approaches used to forecast the 13th state of 25.000 bank accounts are summarized in the table below. We can state that clustering has outperformed the global study in all of the measures of accuracy; most significantly, in the increase of the percentage of predictions that actually fall inside the 95% confidence interval, as well as in the reduction of all the error magnitudes. There is a substantial improvement, and the fact that we reach a percentage close to 95% in the last measure of accuracy is promising. However, it has been noted that those accounts whose confidence interval was not accurately specified have a larger DTW distance from their centroid, meaning that there is no representative cluster for those particular accounts and consequently the kernel function used was not the appropriate one. Those misspecifications are caused by accounts with an erratic behaviour and high oscillations in the 13th state that are not captured by the GPs.
From a business point of view it would be extremely profitable to study each cluster independently, since it provides a more comprehensive and useful way to understand for which type of accounts the GP performs better. Therefore, we provide a table showing the performance of each cluster in the following pages.
                                   MAE        RMSE        MASE       SAFC      SAFC0     SAFCcor   95% conf. inter.
N = 25.000 accounts (clustering)   148,7359   1786,455    0,901701   65,68%    96,712%   83.142%   93,568%
N = 25.000 accounts (global)       150.6495   1897.5950   0.92243    65.644%   91.724%   81.464%   88.288%
Entry 1: kernel = k_1 + k_2 + k_3 + k_4, with
  • k_2(x, x') = 20^2 \exp\big(-(x - x')^2/(2 \cdot 10^2) - 2\sin^2(\pi(x - x')/4)/20^2\big)
  • k_3(x, x') = 0.5^2 \big(1 + (x - x')^2/(2 \cdot 0.5)\big)^{-0.5}
  • k_4(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 0.2^2 \delta_{pq}
  Accuracy: MAE 209,915; RMSE 325,979; MASE 0,7799; SAFC 0,931574; SAFC0 0,998045; SAFCc 0,931574; 95% CI 0,937439; 68,75% CI 0,680352.

Entry 2: kernel = k_1 + k_2 + k_3 + k_4, with
  • k_2(x, x') = 20^2 \exp\big(-(x - x')^2/(2 \cdot 8^2) - 2\sin^2(\pi(x - x')/4)/12^2\big)
  • k_3(x, x') = 1.5^2 \big(1 + (x - x')^2/(2 \cdot 0.6)\big)^{-0.6}
  • k_4(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 0.2^2 \delta_{pq}
  Accuracy: MAE 90,822; RMSE 388,137; MASE 0,5103; SAFC 0,5891; SAFC0 0,9666; SAFCc 0,7524; 95% CI 0,9631; 68,75% CI 0,9376.
Entry 3: kernel = k_1 + k_2 + k_3 + k_4, with
  • k_2(x, x') = 3^2 \exp\big(-(x - x')^2/(2 \cdot 1.1^2)\big) + 40 \exp\big(-2\sin^2(\pi(x - x')/3)/2^2\big)
  • k_3(x, x') = 10.5^2 \big(1 + (x - x')^2/(2 \cdot 1.5)\big)^{-1.5}
  • k_4(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 5.2^2 \delta_{pq}
  Accuracy: MAE 208,0299; RMSE 520,6157; MASE 0,6107; SAFC 0,5601; SAFC0 0,9680; SAFCc 0,6240; 95% CI 0,9557; 68,75% CI 0,8525.

Entry 4: kernel = k_1 + k_2 + k_3 + k_4, with
  • k_2(x, x') = 68^2 \exp\big(-(x - x')^2/(2 \cdot 10^2) - 2\sin^2(\pi(x - x')/4)/2^2\big)
  • k_3(x, x') = 10^2 \exp\big(-(x - x')^2/(2 \cdot 0.4^2)\big)
  • k_4(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 8.2^2 \delta_{pq}
  Accuracy: MAE 180,942; RMSE 1359,294; MASE 0,8055; SAFC 0,6657; SAFC0 0,9662; SAFCc 0,7674; 95% CI 0,9429; 68,75% CI 0,8732.
Entry 5: kernel = k_1 + k_2 + k_3, with
  • k_1(x, x') = 30^2 \exp\big(-(x - x')^2/(2 \cdot 20.0^2)\big)
  • k_2(x, x') = 1.5^2 \big(1 + (x - x')^2/(2 \cdot 0.5)\big)^{-0.5}
  • k_3(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 0.2^2 \delta_{pq}
  Accuracy: MAE 20,3396; RMSE 97,1138; MASE 0,8351; SAFC 0,6498; SAFC0 0,97643; SAFCc 0,94276; 95% CI 0,92929; 68,75% CI 0,88552.

Entry 6: kernel = k_1 + k_2 + k_3 + k_4, with
  • k_2(x, x') = 550^2 \exp\big(-(x - x')^2/(2 \cdot 56^2) - 2\sin^2(\pi(x - x')/4)/(2 \cdot 21^2)\big)
  • k_3(x, x') = 35.5^2 \big(1 + (x - x')^2/(2 \cdot 2 \cdot 0.8)\big)^{-0.8}
  • k_4(x, x') = 0.9^2 \exp\big(-(x - x')^2/(2 \cdot 0.8^2)\big) + 10.2^2 \delta_{pq}
  Accuracy: MAE 695,318; RMSE 4083,819; MASE 0,5403; SAFC 0,7067; SAFC0 0,9654; SAFCc 0,7578; 95% CI 0,9093; 68,75% CI 0,7578.

Entry 7: kernel = k_1 + k_2 + k_3, with
  • k_1(x, x') = 105^2 \exp\big(-(x - x')^2/(2 \cdot 22.5^2)\big)
  • k_2(x, x') = 1.5^2 \big(1 + (x - x')^2/(2 \cdot 0.5)\big)^{-0.5}
  • k_3(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.5^2)\big) + 12.5^2 \delta_{pq}
  Accuracy: MAE 39,096; RMSE 82,827; MASE 0,6128; SAFC 0,4781; SAFC0 0,9895; SAFCc 0,8376; 95% CI 0,9546; 68,75% CI 0,8708.
Entry 8: kernel = k_1 + k_2 + k_3, with
  • k_1(x, x') = 65^2 \exp\big(-(x - x')^2/(2 \cdot 25.0^2)\big)
  • k_2(x, x') = 3.5^2 \big(1 + (x - x')^2/(2 \cdot 2 \cdot 0.5)\big)^{-0.5}
  • k_3(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 0.2^2 \delta_{pq}
  Accuracy: MAE 5,7145; RMSE 31,596; MASE 0,389677; SAFC 0,742025; SAFC0 0,941285; SAFCc 0,973185; 95% CI 0,930652; 68,75% CI 0,900139.

Entry 9: kernel = k_1 + k_2 + k_3 + k_4, with
  • k_2(x, x') = 85^2 \exp\big(-2\sin^2(\pi(x - x')/4)/(2 \cdot 35^2)\big)
  • k_3(x, x') = 135.5^2 \big(1 + (x - x')^2/(2 \cdot 67 \cdot 1.5)\big)^{-1.5}
  • k_4(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 45.2^2 \delta_{pq}
  Accuracy: MAE 930,989; RMSE 4652,450; MASE 0,5371; SAFC 0,8397; SAFC0 0,9761; SAFCc 0,8443; 95% CI 0,8690; 68,75% CI 0,7710.

Entry 10: kernel = k_1 + k_2 + k_3 + k_4, with
  • k_2(x, x') = 25.0^2 \exp\big(-2\sin^2(\pi(x - x')/3)/2^2\big) \cdot \big(10.5 + 5.0^2 (x - c)(x' - c)\big)
  • k_3(x, x') = 1.5^2 \big(1 + (x - x')^2/(2 \cdot 0.5)\big)^{-0.5}
  • k_4(x, x') = 1.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.8^2)\big) + 0.2^2 \delta_{pq}
  Accuracy: MAE 58,0384; RMSE 1973,60; MASE 3,3544; SAFC 0,6620; SAFC0 0,9649; SAFCc 0,9793; 95% CI 0,9401; 68,75% CI 0,9036.

Entry 11: kernel = k_1 + k_2 + k_3, with
  • k_1(x, x') = 40^2 \exp\big(-(x - x')^2/(2 \cdot 20.0^2)\big)
  • k_2(x, x') = 1.5^2 \big(1 + (x - x')^2/(2 \cdot 0.65)\big)^{-0.65}
  • k_3(x, x') = 2.8^2 \exp\big(-(x - x')^2/(2 \cdot 0.8^2)\big) + 0.2^2 \delta_{pq}
  Accuracy: MAE 19,7471; RMSE 399,311; MASE 0,9873; SAFC 0,6725; SAFC0 0,9475; SAFCc 0,95; 95% CI 0,9435; 68,75% CI 0,9035.

Entry 12: kernel = k_1 + k_2 + k_3, with
  • k_1(x, x') = 40^2 \exp\big(-(x - x')^2/(2 \cdot 20.0^2)\big)
  • k_2(x, x') = 12^2 \big(1 + (x - x')^2/(2 \cdot 8 \cdot 0.5)\big)^{-0.65}
  • k_3(x, x') = 0.5^2 \exp\big(-(x - x')^2/(2 \cdot 0.3^2)\big) + 10.2^2 \delta_{pq}
  Accuracy: MAE 16,5497; RMSE 36,331; MASE 0,5600; SAFC 0,4279; SAFC0 0,9927; SAFCc 0,9854; 95% CI 0,9285; 68,75% CI 0,8008.
In the table above, for each cluster we provide a plot of the accounts; in the second column we specify the kernel function used to perform the forecasting. Note that each kernel is the result of adding different covariance functions, which can themselves be combinations (products or sums) of other covariance functions, in order to capture the particularities and properties of the accounts. In the third column the measures of accuracy are reported.
                                    MAE        RMSE        MASE       SAFC       SAFC0     SAFCcor   95% conf. inter.
N = 100.000 accounts (clustering)   134,2835   5146,1119   0,691138   64,8630%   96,965%   83.312%   93,4240%
N = 100.000 accounts (global)       159.9937   5389.1124   0.71326    64.826%    91.409%   81.058%   88.233%
To sum up, once the comparative analysis between the global approach and the clustering has been discussed, we can state without reservation that the latter works better and provides a more accurate forecast. As expected, being able to adapt the kernel functions to the behaviour of the accounts gives rise to a more flexible model. Obviously, there are
some aspects that have to be improved, specifically being able to work with the whole sample of bank accounts while speeding up the computations. We could then establish with greater certainty the optimal number of clusters, redefine some covariance functions and, ultimately, upgrade the forecasting model.
Finally, in the last chapter we analyze all the results from a more global perspective, outlining many of the aspects developed throughout the project, especially for the bank account problem, and suggesting possible future directions of work on Gaussian processes.
Chapter 5
Conclusions
In this chapter we briefly wrap up some threads developed throughout the project, give a business perspective on the bank account forecasting problem and, as said before, propose new and exciting problems that can be addressed using Gaussian processes.
Over the course of the project we saw how Gaussian process regression is a natural extension of Bayesian linear regression to a more flexible class of models, and how placing Gaussian process priors over functions is a clearly Bayesian viewpoint. Since the adoption of Bayesian methods in the machine learning community is quite widespread, this allowed me to acquire solid foundations in order to move forward in this exciting field.
From my personal perspective, one of the most challenging issues was working with covariance functions and the incorporation of prior knowledge through the choice and the parameterization of the kernels. I do not like to view Gaussian processes as a black box (where what exactly goes in the box is less important, as long as it gives good predictions), so I have always tried to ask how and why the models work or fail. This meant testing different hypotheses, trying out different components and basically experimenting by trial and error in order to gain real insight into the data.
In that sense the first part of the work, the chapters up to and including chapter 3, where the core material and theoretical concepts are given an in-depth treatment, represented a substantial part of the time invested. I attempted to put all that knowledge into practice with the prediction example developed at the end of chapter 3, and hence to face the bank account forecasting problem with the highest expectations of success.
Once the problem of the bank accounts has been addressed and the results provided, we can assert that Gaussian processes have not performed as well as the LSTM. However, our goal was not to outperform the LSTM, since the bank had already established that the latter works better. The aim was rather to provide a forecasting model using Gaussian processes for regression and to decide whether or not clustering the accounts gives rise to a better performance. Both questions have been answered positively.
With the clustering approach we were able to determine an uncertainty region for the
prediction of each account and it turned out that 93,42% of the values fitted in that region.
Moreover, having defined different clusters, having access to individual measures of accuracy and knowing the behaviour patterns of the accounts belonging to each cluster, we can adapt our predictions to a wide range of needs.
In that sense, given the task of predicting the 13th state of a new account, the procedure is clear. Firstly, we would compute the DTW distance between the account and all of the centroids and assign the account to the centroid that minimizes the DTW. Once the account is placed in a cluster we can begin the forecasting, taking into consideration the features of the cluster and adapting all of the decisions to those particular properties. Thus, if the account is placed in a cluster that actually gives poor performance we have to be cautious and prudent with the applications of that forecast; by contrast, if the account belongs to a cluster with good performance rates we can be more confident.
To sum up, our model performs quite accurately, but when a company such as a bank is developing a risky project like this, every variable has to be carefully studied and no mistakes are allowed. Hence, it would be unacceptable to send a message to a client warning him to take care of his expenses and then fail in the forecast. The model has to be trained, and the more data we have the more precise the prediction will be, since the kernel function can fit the data better or the cluster can be more precise. In fact, as mentioned at the end of chapter 4, one area for future work might be learning to deal with the 3.35 million bank accounts.
Finally, let me state some interesting developments in the Gaussian process context. In recent years one of the most important advances in the Gaussian process framework has been the creation of an Automatic Statistician. This project, developed by a group of prestigious experts in Gaussian processes such as James Lloyd, David Duvenaud and Zoubin Ghahramani from the University of Cambridge, automatically produces a 10-15 page report describing patterns discovered in data. Moreover, it returns a statistical model with state-of-the-art extrapolation performance evaluated over real time series data sets from various domains. The system is based on reasoning over an open-ended language of nonparametric models using Bayesian inference. The Automatic Statistician project has won a Google Focused Research Award, which consists of a no-strings-attached donation to support research on this topic.
Another interesting area in which there has been an explosion of work over the last years is the use of deep neural networks for modeling high-dimensional functions, as a popular alternative to Gaussian processes. In that sense, many researchers propose to study deep neural networks as priors on functions. As a starting point, they relate neural networks to Gaussian processes and examine infinitely wide deep networks, which give rise to deep Gaussian processes: compositions of functions drawn from GP priors. One of the main reasons why deep GPs are attractive from a model-analysis point of view is that they abstract away some of the details of finite neural networks. Thus, this area seems to be exciting and appealing for further study.