
Final Degree Project

DEGREE IN MATHEMATICS

Facultat de Matemàtiques i Informàtica

Universitat de Barcelona

Gaussian Processes for Machine Learning

Author: Gerard Martínez Canelles

Advisor: Dr. Jordi Vitrià Marca

Carried out at: Mathematics and Computer Science

Barcelona, 28th June 2017


The capacity of a parametric model, the amount of information that the model can
represent, is bounded, even if the amount of observed data becomes unbounded.

-Zoubin Ghahramani [2013]


Acknowledgements
I would like to thank my advisor, Jordi Vitrià Marca, for so much encouragement, support
and feedback. Jordi was patient while I spent the first few months chasing half-baked
ideas, and then gently suggested a series of notions which worked. It was wonderful
working with someone who is always willing to help and whose dedication to science is
inspiring.
Contents

Introduction

1 Gaussian Process for Regression
  1.1 Function-space View
    1.1.1 Prediction with Noise-free Observations
    1.1.2 Prediction using Noisy Observations
    1.1.3 Non-zero mean Functions
    1.1.4 Example

2 Covariance Functions
  2.1 Definition
  2.2 Examples of Kernels
    2.2.1 Stationary Kernels
    2.2.2 Dot Product Covariance Functions
    2.2.3 Other Non-stationary Covariance Functions
  2.3 Combining Kernels
    2.3.1 Combining Kernels through multiplication
    2.3.2 Multi-dimensional models
    2.3.3 Modelling Sums of Functions
    2.3.4 Changepoints

3 Model Selection and Adaptation of Hyperparameters
  3.1 Model Selection for GP Regression
    3.1.1 Marginal Likelihood
    3.1.2 Cross Validation
  3.2 Mauna Loa Atmospheric Carbon Dioxide Example and Discussion

4 Bank account forecasting Problem
  4.1 Global approach
  4.2 Clustering
    4.2.1 Dynamic Time Warping
    4.2.2 Clustering of Bank Accounts

5 Conclusions

Bibliography
Abstract
The main focus of this project is to present clearly and concisely an overview of the main ideas of Gaussian processes for regression in a machine learning context. The introductory chapters contain the core material and give some theoretical analysis of how to construct Gaussian process models, of the way many types of structure can be expressed through kernels, and of how adding and multiplying different kernels combines their properties. Explanatory examples are also covered in order to gain a deeper understanding of the material. Finally, we provide an alternative approach to a complex machine learning problem tackled by a large European bank, which seeks a reliable method to predict customer expenses and consequently to forecast the balance of bank accounts. The aim is to analyse this problem in depth using Gaussian process regression from two different perspectives: first treating all the bank accounts as a single global entity, and then clustering the accounts by similarity, allowing a more flexible and adaptive approach.
Introduction

In this work we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data. Since we will work with continuous outputs, this problem is known as regression.
In general we denote the input as $x$ and the output (or target) as $y$. The input is usually represented as a vector $x$, as there are in general many input variables, and the target $y$ in regression problems, as mentioned before, must be continuous. Thus, if we have a dataset $\mathcal{D}$ of $n$ observations, we write $\mathcal{D} = \{(x_i, y_i) \mid i = 1, \dots, n\}$ or $\mathcal{D} = \{X, y\}$.
Given this training data we wish to make predictions for new inputs $x_*$ that we have not seen in the training set. Hence, we need to move from the finite training data $\mathcal{D}$ to a function $f$ that makes predictions for all possible input values. To do this we must make assumptions about the characteristics of the underlying function, as otherwise any function which is consistent with the training data would be equally valid. In that context, one particularly elegant way to learn functions is to give a prior probability to every possible function, where higher probabilities are given to functions that we consider to be more likely.
In other words, we assume that $y_i = f(x_i)$ for some unknown function $f$, possibly corrupted by noise. Then we infer a distribution over functions given the data, $p(f \mid X, y)$, and we use this to make predictions given new inputs, i.e., to compute
\[ p(y_* \mid x_*, X, y) = \int p(y_* \mid f, x_*)\, p(f \mid X, y)\, df \]

Thus, we will discuss a way to perform Bayesian inference over functions themselves. However, this approach appears to have a serious problem: surely there is an uncountably infinite set of possible functions, so how are we going to compute with this set in finite time? This is where the Gaussian process, which is a generalization of the Gaussian probability distribution, comes to our rescue.
A GP defines a prior over functions, which can be converted into a posterior over functions once we have seen some data. Although it might seem difficult to represent a distribution over functions, it turns out that we only need to be able to define a distribution over the function's values at a finite, but arbitrary, set of points, say $x_1, \dots, x_N$. A GP assumes that $p(f(x_1), \dots, f(x_N))$ is jointly Gaussian, with some mean $\mu(x)$ and covariance $\Sigma(x)$ given by $\Sigma_{ij} = k(x_i, x_j)$, where $k$ is a positive definite kernel function.
Actually, one of the main attractions of the Gaussian process framework is precisely that it unites a sophisticated and consistent view with computational tractability. Indeed, many models that are commonly employed in both machine learning and statistics are in fact special cases or restricted kinds of Gaussian processes.
To sum up, Gaussian processes provide a principled, practical, probabilistic approach to learning in kernel machines and in some sense bring together work in the statistics and machine learning communities. The main goal of this work is to present clearly and concisely an overview of the main ideas of Gaussian processes in a machine learning context and to apply them to a complex forecasting problem: predicting the movements (expenses) of a bank account in order to determine whether or not this account will go into the red, having access only to a small sample of historical values.
The main interest in being able to answer that question is that quite recently a leading bank has been allocating a huge amount of resources and effort to solving that riddle and to creating a tool able to warn customers when a major expense is likely to occur. In fact, it has already launched a preliminary version that has been tested on its employees. However, the bank addressed the problem in a different way, since it faced this time series prediction problem using Long Short-Term Memory (LSTM) networks, a type of recurrent neural network used in deep learning that is capable of successfully training very large architectures.
As a matter of fact, those responsible for that project also tried to provide a solution using Gaussian processes, which in the end did not perform as well as the LSTM networks, although the differences were not very large. Thus, another added incentive is to reduce the gap between the two approaches.
The work has a natural split into two parts, with the chapters up to and including chapter 3 covering core material, such as the definition of Gaussian processes for regression, a detailed analysis of kernel functions, model selection and a practical example. The remaining sections cover the bank account forecasting problem.
The programming language used throughout this work is Python 2.7, together with machine learning libraries such as scikit-learn, GPy (University of Sheffield) and pyGPs. Source code to produce all figures and examples is available at http://www.github.com/gerardmartinezcanelles.

Symbols and Notation

Matrices are always capitalized. A subscript asterisk, such as in $X_*$, indicates reference to a test set quantity. A superscript asterisk denotes complex conjugate.

Symbol : Meaning
$\backslash$ : left matrix divide: $A \backslash b$ is the vector $x$ which solves $Ax = b$
$|K|$ : determinant of the matrix $K$
$y^T$ : the transpose of the vector $y$
$\propto$ : proportional to; e.g. $p(x|y) \propto f(x,y)$ means that $p(x|y)$ is equal to $f(x,y)$ times a factor which is independent of $x$
$\sim$ : distributed according to; example: $x \sim \mathcal{N}(\mu, \sigma^2)$
$C$ : number of classes in a classification problem
$\mathrm{cholesky}(A)$ : Cholesky decomposition: $L$ is a lower triangular matrix such that $LL^T = A$
$\mathrm{cov}(f_*)$ : Gaussian process posterior covariance
$D$ : dimension of the input space $\mathcal{X}$
$\mathcal{D}$ : data set: $\mathcal{D} = \{(x_i, y_i) \mid i = 1, \dots, n\}$
$\mathrm{diag}(w)$ : (vector argument) a diagonal matrix containing the elements of the vector $w$
$\mathrm{diag}(W)$ : (matrix argument) a vector containing the diagonal elements of the matrix $W$
$\delta_{pq}$ : Kronecker delta, $\delta_{pq} = 1$ iff $p = q$ and $0$ otherwise
$f(x)$ or $f$ : Gaussian process latent function values, $f = (f(x_1), \dots, f(x_n))^T$
$f_*$ : Gaussian process (posterior) prediction (random variable)
$\bar{f}_*$ : Gaussian process posterior mean
$\mathcal{GP}$ : Gaussian process: $f \sim \mathcal{GP}(m(x), k(x, x'))$, the function $f$ is distributed as a Gaussian process with mean function $m(x)$ and covariance function $k(x, x')$
GPR : Gaussian process regression
$I$ or $I_n$ : the identity matrix (of size $n$)
$J_\nu(z)$ : Bessel function of the first kind
$k(x, x')$ : covariance (or kernel) function evaluated at $x$ and $x'$
$K(X, X')$ : $n \times n$ covariance (or Gram) matrix
$K_*$ : $n \times n_*$ matrix $K(X, X_*)$, the covariance between training and test cases
$k(x_*)$ : vector, short for $K(X, x_*)$, when there is only a single test case
$K_f$ or $K$ : covariance matrix for the (noise free) $f$ values
$K_y$ : covariance matrix for the (noisy) $y$ values; for independent homoscedastic noise, $K_y = K_f + \sigma_n^2 I$
$L(a, b)$ : loss function; the loss of predicting $b$ when $a$ is true; note the argument order
$\log(z)$ : natural logarithm (base $e$)
$\ell$ or $\ell_d$ : characteristic length-scale (for input dimension $d$)
$m(x)$ : the mean function of a Gaussian process
$\mathcal{N}(\mu, \Sigma)$ : multivariate Gaussian (normal) distribution with mean vector $\mu$ and covariance matrix $\Sigma$
$\mathcal{N}(x)$ : short for the unit Gaussian $x \sim \mathcal{N}(0, I)$
$n$ and $n_*$ : number of training (and test) cases
$N$ : dimension of feature space
$N_H$ : number of hidden units in a neural network
$y|x$ and $p(y|x)$ : conditional random variable $y$ given $x$ and its probability (density)
$\sigma_f^2$ : variance of the (noise free) signal
$\sigma_n^2$ : noise variance
$\theta$ : vector of hyperparameters (parameters of the covariance function)
$\mathrm{tr}(A)$ : trace of the (square) matrix $A$
$\mathcal{X}$ : input space and also the index set for the stochastic process
$X$ : $D \times n$ matrix of the training inputs $\{x_i\}_{i=1}^{n}$: the design matrix
$X_*$ : matrix of test inputs
$x_i$ : the $i$-th training input
$\mathcal{O}(\cdot)$ : the big-O asymptotic complexity of an algorithm
SE : the squared-exponential kernel, also known as the radial-basis function (RBF) kernel, the Gaussian kernel, or the exponentiated quadratic
RQ : the rational-quadratic kernel
Per : the periodic kernel
Lin : the linear kernel
WN : the white-noise kernel
C : the constant kernel
Chapter 1

Gaussian Process for Regression

In this first chapter, Gaussian process methods for regression problems are described. There are several ways to interpret Gaussian process (GP) regression models, and one of them is the function-space view, which defines a distribution over functions, with inference taking place directly in the space of functions. This approach is discussed in section 1.1. Another interesting and equivalent view of GPs, which may be more familiar and accessible, is the weight-space view, where GPs are presented as a Bayesian analysis of the standard linear regression model.¹ This first chapter also includes a section where we discuss how to incorporate non-zero mean functions into the models, and in the last section an easy example is shown.

1.1 Function-space View

Gaussian processes are a simple and general class of models of functions, meaning that we use them to describe a distribution over functions with a continuous domain. Formally:

Definition 1.1. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.²

¹ This particular view is extensively explained in Carl E. Rasmussen and Christopher K.I. Williams, Gaussian Processes for Machine Learning, volume 38.

² In probability theory and statistics, Gaussian processes are usually defined as a real-valued stochastic process $\{X_t, t \in T\}$, where $T$ is an index set and all the finite-dimensional distributions have a multivariate normal distribution. That is, for any choice of distinct values $t_1, \dots, t_k \in T$, the random vector $X = (X_{t_1}, \dots, X_{t_k})$ has a multivariate normal distribution, where a multivariate Gaussian (or normal) distribution has a joint probability density given by
\[ p(x \mid m, \Sigma) = (2\pi)^{-M/2}\, |\Sigma|^{-1/2} \exp\!\left( -\tfrac{1}{2} (x - m)^T \Sigma^{-1} (x - m) \right) \tag{1.1} \]
where $m$ is the mean vector (of length $M$) and $\Sigma$ is the (symmetric, positive definite) covariance matrix (of size $M \times M$).

A Gaussian process is completely specified by its mean function and covariance function, defined as
\[ m(x) = \mathbb{E}[f(x)] \tag{1.2} \]
\[ k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))] \tag{1.3} \]
and we will write the Gaussian process as
\[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \tag{1.4} \]

It is common practice to assume that the mean function is simply zero everywhere, although this is not necessary, as we will see, since we can incorporate non-zero mean functions into the models.
Note that in our case the random variables represent the value of the function $f(x)$ at location $x$. On the other hand, it may happen that the index set of the random variables is time; in other words, Gaussian processes can be defined over time. Indeed, many of the examples discussed in this work are time series.
After accounting for the mean, the kind of structure that can be captured by a GP model is entirely determined by its covariance function, also known as the kernel. The kernel specifies how the model generalizes, or extrapolates, to new data. There are many possible choices of covariance function, and we can specify a wide range of models just by specifying the kernel of a GP. Actually, one of the main difficulties in using GPs is constructing a kernel which represents the particular structure present in the data being modelled. An example of a covariance function is the squared exponential (SE), which specifies the covariance between pairs of random variables as

\[ \mathrm{Cov}(f(x_p), f(x_q)) = k(x_p, x_q) = \exp\!\left( -\tfrac{1}{2} |x_p - x_q|^2 \right) \tag{1.5} \]
Note that the covariance between the outputs is written as a function of the inputs. As said before, the specification of the covariance function implies a distribution over functions.
In the example in Figure 1.1 we have drawn samples from the distribution of functions evaluated at a number of input points $X$, after specifying the covariance function as the SE kernel. More precisely, we have generated a random Gaussian vector with this covariance matrix,
\[ f \sim \mathcal{N}(0, K(X, X)), \]
and plotted the generated values as a function of the inputs. The mechanism used to generate multivariate Gaussian samples is explained in detail on the GitHub repository.
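As an aside, the sampling just described can be reproduced in a few lines of NumPy. The following is a minimal sketch, not the thesis' GitHub code: the function name, the grid of inputs and the small jitter term added for numerical stability are assumptions of the example.

```python
import numpy as np

def k_se(xp, xq, ell=1.0, sigma_f=1.0):
    """Squared exponential kernel of eq. (1.5), with amplitude sigma_f^2."""
    sq_dist = (xp[:, None] - xq[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / ell**2)

# Inputs at which the prior is evaluated
X = np.linspace(-5, 5, 100)

# Covariance matrix of the prior; a tiny jitter keeps the Cholesky factorization stable
K = k_se(X, X) + 1e-8 * np.eye(len(X))

# Draw three sample functions from f ~ N(0, K(X, X)) via the Cholesky factor
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K)
samples = L @ rng.standard_normal((len(X), 3))  # each column is one function drawn from the prior
```

Plotting each column of `samples` against `X` gives curves like those in panel (a) of Figure 1.1.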

Figure 1.1: Panel (a) shows three functions drawn at random from a GP prior, in other words, a GP not conditioned on any data points. Panel (b) shows the posterior after conditioning on five noise-free observations. The shaded area represents the 95% confidence region, corresponding to the pointwise mean plus and minus two times the standard deviation for each input value.

1.1.1 Prediction with Noise-free Observations


Taking into account the information provided by Figure 1.1, one may intuitively think of obtaining the posterior distribution over functions by generating functions from the prior and rejecting the ones that disagree with the noise-free observations. However, this strategy would not be computationally efficient. Fortunately, in probabilistic terms, conditioning the joint Gaussian prior distribution on the observations is extremely simple.
Let $\{(x_i, f_i) \mid i = 1, \dots, n\}$ be the noise-free observations, $n$ the number of training points and $n_*$ the number of test points. Expressing the idea mentioned above more formally, the joint distribution of the training outputs $f$ and the test outputs $f_*$ according to the prior is

 ✓  ◆
f K ( X, X ) K ( X, X⇤ )
⇠N 0, (1.6)
f⇤ K ( X⇤ , X ) K ( X⇤ , X⇤ )

where $K(X, X_*)$ denotes the $n \times n_*$ matrix of the covariances evaluated at all pairs of training and test points, and similarly for the other entries $K(X, X)$, $K(X_*, X_*)$ and $K(X_*, X)$. The noise-free predictive distribution is given by

\[ f_* \mid X_*, X, f \sim \mathcal{N}\big( K(X_*, X) K(X, X)^{-1} f,\;\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*) \big) \tag{1.7} \]

Function values $f_*$ can be sampled from the joint posterior distribution by evaluating the mean and covariance matrix from equation (1.7). Figure 1.1b shows the results of these computations given the five data points marked with blue dots.
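For illustration, the conditioning in eq. (1.7) can be carried out directly with a few linear-algebra operations. This is a sketch only; the five observations and the helper `k_se` below are invented for the example and are not the data used in Figure 1.1.

```python
import numpy as np

def k_se(xp, xq, ell=1.0):
    return np.exp(-0.5 * (xp[:, None] - xq[None, :]) ** 2 / ell**2)

# Noise-free observations (X, f) and a grid of test inputs X_star
X = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
f = np.sin(X)
X_star = np.linspace(-5, 5, 200)

K    = k_se(X, X) + 1e-10 * np.eye(len(X))  # K(X, X) with jitter
K_s  = k_se(X, X_star)                      # K(X, X_*)
K_ss = k_se(X_star, X_star)                 # K(X_*, X_*)

# Posterior mean and covariance of eq. (1.7)
K_inv = np.linalg.inv(K)
mean = K_s.T @ K_inv @ f
cov  = K_ss - K_s.T @ K_inv @ K_s

# Pointwise 95% band: mean plus/minus two standard deviations
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```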

1.1.2 Prediction using Noisy Observations


Although there are some situations where it is reasonable to assume that the observations are noise-free (e.g. computer simulations), it is uncommon to have access to the function values themselves.
Thus, we tend to consider models that incorporate a noise term $\varepsilon$, giving rise to the expression $y = f(x) + \varepsilon$. We assume the noise is additive, independent, identically distributed Gaussian with variance $\sigma_n^2$, so $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$.
Indeed, although more complicated noise models with non-trivial covariance structure can also be handled, we will assume independence between the noise terms. With these assumptions, the prior on the noisy observations becomes
\[ \mathrm{cov}(y_p, y_q) = k(x_p, x_q) + \sigma_n^2 \delta_{pq} \quad \text{or} \quad \mathrm{cov}(y) = K(X, X) + \sigma_n^2 I \tag{1.8} \]

where $\delta_{pq}$ is a Kronecker delta which is one iff $p = q$ and zero otherwise. Taking the noise term into consideration and introducing it in equation (1.6), we can write the joint distribution of the observed target values and the function values at the test locations under the prior as

 ✓  ◆
y K ( X, X ) + sn2 I K ( X, X⇤ )
⇠N 0, (1.9)
f⇤ K ( X⇤ , X ) K ( X⇤ , X⇤ )

Thus, the key predictive equations for Gaussian process regression, following the reasoning that led to (1.7), are
\[ f_* \mid X, y, X_* \sim \mathcal{N}(\bar{f}_*, \mathrm{cov}(f_*)) \tag{1.10} \]
where
\[ \bar{f}_* \triangleq \mathbb{E}[f_* \mid X, y, X_*] = K(X_*, X)\, [K(X, X) + \sigma_n^2 I]^{-1} y \tag{1.11} \]
\[ \mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\, [K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*) \tag{1.12} \]

Before considering one particular case of GP, we should examine in more detail the expression for the variance given in equation (1.12). First, note that the variance only depends on the inputs, and not on the observed targets. Secondly, splitting (1.12) term by term, we realize that the variance is the difference between two terms: the simple prior covariance $K(X_*, X_*)$ minus the information the observations give us about the function. Moreover, the equation holds unchanged when $X_*$ denotes multiple test inputs.
Now let us evaluate the case when the test set consists of a single point $x_*$. In that context the predictive distributions obtained, after adapting equations (1.11) and (1.12) to that particular case, reduce to

\[ \bar{f}_* = k_*^T\, [K(X, X) + \sigma_n^2 I]^{-1} y \tag{1.13} \]
\[ \mathbb{V}[f_*] = k(x_*, x_*) - k_*^T\, [K(X, X) + \sigma_n^2 I]^{-1} k_* \tag{1.14} \]

where $k_* = k(x_*)$ denotes the vector of covariances between the test point and the $n$ training points. Since equation (1.13) is a linear combination of the observations $y$, it is sometimes referred to as a linear predictor; thus it can be rewritten as a linear combination of $n$ kernel functions, each one centred on a training point, by writing
\[ \bar{f}(x_*) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x_*) \tag{1.15} \]
where $\alpha_i$ is the $i$-th component of $\alpha = (K(X, X) + \sigma_n^2 I)^{-1} y$. Even though the GP defines a joint Gaussian distribution over all of the $y$ variables, for making predictions at $x_*$ we only care about the $(n+1)$-dimensional distribution defined by the $n$ training points and the test point. Conditioning this $(n+1)$-dimensional distribution on the observations gives us the desired result, since a Gaussian distribution is marginalized by just taking the relevant block of the joint covariance matrix.
Before concluding this section, I would like to introduce the concept of the marginal likelihood $p(y|X)$. The marginal likelihood is the integral of the likelihood times the prior,
\[ p(y|X) = \int p(y|f, X)\, p(f|X)\, df \tag{1.16} \]
The term marginal likelihood refers to the marginalization over the function values $f$. Under the Gaussian process model the prior is Gaussian, $f|X \sim \mathcal{N}(0, K(X, X))$, or
\[ \log p(f|X) = -\tfrac{1}{2} f^T K(X, X)^{-1} f - \tfrac{1}{2} \log|K| - \tfrac{n}{2} \log 2\pi \tag{1.17} \]
and the likelihood is a factorized Gaussian, $y|f \sim \mathcal{N}(f, \sigma_n^2 I)$. Knowing that the product of two Gaussians gives another (un-normalized) Gaussian³ and that the normalizing constant itself looks like a Gaussian⁴, the log marginal likelihood can be written as
\[ \log p(y|X) = -\tfrac{1}{2} y^T (K(X, X) + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log|K(X, X) + \sigma_n^2 I| - \tfrac{n}{2} \log 2\pi \tag{1.18} \]
This result can also be obtained directly by observing that
\[ y \sim \mathcal{N}(0, K(X, X) + \sigma_n^2 I). \]

A practical implementation of Gaussian process regression is shown below, and the complete Python code is provided on the GitHub repository. The algorithm uses the Cholesky decomposition since it is numerically more stable and faster: computing the matrix inverse in the equations above in a conventional way takes $\mathcal{O}(n^3)$ time, making exact inference prohibitively slow for more than a few thousand data points, whereas the computational complexity of the Cholesky decomposition in line 2 is $\mathcal{O}(n^3/6)$. We also have to consider $\mathcal{O}(n^2/2)$ complexity for solving the triangular systems in line 3 and (for each test case) in line 5. The algorithm returns the predictive mean and variance for noise-free test data; adding the noise variance $\sigma_n^2$ to the predictive variance of $f_*$ allows us to compute the predictive distribution for noisy test data $y_*$.

³ $\mathcal{N}(x|a, A)\, \mathcal{N}(x|b, B) = Z^{-1} \mathcal{N}(x|c, C)$, where $c = C(A^{-1}a + B^{-1}b)$ and $C = (A^{-1} + B^{-1})^{-1}$.
⁴ $Z^{-1} = (2\pi)^{-D/2} |A + B|^{-1/2} \exp\!\left( -\tfrac{1}{2} (a - b)^T (A + B)^{-1} (a - b) \right)$.

    input: $X$ (inputs), $y$ (targets), $k$ (covariance function), $\sigma_n^2$ (noise level), $x_*$ (test input)
2:  $L := \mathrm{cholesky}(K + \sigma_n^2 I)$
    $\alpha := L^T \backslash (L \backslash y)$
4:  $\bar{f}(x_*) := k_*^T \alpha$
    $v := L \backslash k_*$
6:  $\mathbb{V}[f_*] := k(x_*, x_*) - v^T v$
    $\log p(y|X) := -\tfrac{1}{2} y^T \alpha - \sum_i \log L_{ii} - \tfrac{n}{2} \log 2\pi$
8:  return: $\bar{f}(x_*)$ (mean), $\mathbb{V}[f_*]$ (variance), $\log p(y|X)$ (log marginal likelihood)
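A direct NumPy translation of the algorithm above could look as follows. This is a sketch, not the thesis' published code; the SE kernel and the synthetic data at the bottom are placeholders, and `scipy.linalg.solve_triangular` plays the role of the backslash (triangular solve) operator.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_regression(X, y, kernel, sigma_n2, X_star):
    """Cholesky-based GP regression: predictive mean, variance and log p(y|X)."""
    n = len(X)
    K = kernel(X, X) + sigma_n2 * np.eye(n)                            # K + sigma_n^2 I
    L = cholesky(K, lower=True)                                        # line 2
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))  # line 3
    K_s = kernel(X, X_star)                                            # k_* for every test input
    mean = K_s.T @ alpha                                               # line 4
    v = solve_triangular(L, K_s, lower=True)                           # line 5
    var = np.diag(kernel(X_star, X_star)) - np.sum(v**2, axis=0)       # line 6
    log_marglik = (-0.5 * y @ alpha                                    # line 7
                   - np.sum(np.log(np.diag(L)))
                   - 0.5 * n * np.log(2 * np.pi))
    return mean, var, log_marglik

# Example usage with the SE kernel and synthetic data
se = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
X_train = np.linspace(-3, 3, 20)
y_train = np.sin(X_train) + 0.1 * np.random.default_rng(0).standard_normal(20)
mean, var, lml = gp_regression(X_train, y_train, se, 0.01, np.linspace(-4, 4, 50))
```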

1.1.3 Non-zero mean Functions


There are several reasons why it is quite common to consider GPs with a zero mean function, such as interpretability of the model or convenience of expressing prior information. Note that this is not necessarily a drastic limitation, since the mean of the posterior process is not confined to be zero. Moreover, the use of an explicit fixed (deterministic) mean function $m(x)$ is a way to specify a non-zero mean over functions. In other words, we can simply apply the zero-mean GP to the difference between the observations and the fixed mean function. With
\[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \tag{1.19} \]
the predictive mean becomes
\[ \bar{f}_* = m(X_*) + K(X_*, X)\, K_y^{-1} (y - m(X)), \tag{1.20} \]
where $K_y = K + \sigma_n^2 I$, and the predictive variance remains unchanged from eq. (1.12).
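In code this amounts to subtracting the fixed mean from the targets before the regression and adding it back to the predictive mean afterwards. A tiny sketch, reusing the `gp_regression` helper and the training data from the previous listing, and assuming a hypothetical linear mean function:

```python
import numpy as np

def m(x):
    # assumed fixed (deterministic) mean function: a linear trend
    return 0.5 * x

# Zero-mean GP applied to the residuals y - m(X), then the mean added back, eq. (1.20)
X_test = np.linspace(-4, 4, 50)
res_mean, res_var, _ = gp_regression(X_train, y_train - m(X_train), se, 0.01, X_test)
predictive_mean = m(X_test) + res_mean
```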

1.1.4 Example
After providing a theoretical explanation of how Gaussian process regression works, it is interesting to show a step-by-step example in order to make the forecasting process more understandable.
Let us consider six noisy data points (error bars are indicated with vertical lines) and suppose we are interested in estimating a seventh at $x_*$. For example,
\[ (x_i, y_i) = [(-2.5, 0.6), (-1.50, 0.1), (-0.5, 0.3), (0.75, 0.45), (1.95, 0.6), (2.8, 0.75)] \]
for $i = 1, \dots, 6$, and $x_* = 3.2$.

Figure 1.2: Blue points represent the known noisy data, and the green one is the seventh point that we are interested in estimating.

As mentioned in previous sections, what relates one observation to another in such cases is the covariance function. Purely for simplicity of exposition our choice is the squared exponential (SE) of eq. (1.5), and since the data points are noisy we fold the noise into $k(x, x')$ by writing
\[ k(x, x') = \sigma_f^2 \exp\!\left( -\frac{(x - x')^2}{2\ell^2} \right) + \sigma_n^2\, \delta(x, x') \tag{1.21} \]
where $\sigma_f^2$ should be high for functions which cover a broad range on the $y$ axis, $\delta(x, x')$ is the Kronecker delta function and $\ell$ is the length-scale parameter, whose role will be explained in detail in the next chapter.
To prepare for GPR, we calculate the covariance function among all possible combinations of these points, summarizing our findings in three matrices:
\[ K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix} \]
\[ K_* = [k(x_*, x_1),\; k(x_*, x_2),\; \cdots,\; k(x_*, x_n)] \qquad K_{**} = [k(x_*, x_*)] \]



In our particular case there are 6 observations, with $x_i = [-2.5, -1.50, -0.5, 0.75, 1.95, 2.8]$ for $i = 1, \dots, 6$, and with judicious choices of $\sigma_f^2 = 1.5$, $\ell = 2.0$ and $\sigma_n^2 = 0.1$ from the error bars we have enough to calculate the covariance matrices using eq. (1.21):
\[ K = \begin{bmatrix} 1.6 & 1.323745 & 0.909795 & 0.400577 & 0.126205 & 0.044789 \\ 1.323745 & 1.6 & 1.323745 & 0.796643 & 0.33879 & 0.148705 \\ 0.909795 & 1.323745 & 1.6 & 1.233866 & 0.708328 & 0.384510 \\ 0.400577 & 0.796643 & 1.233866 & 1.6 & 1.252905 & 0.88705 \\ 0.126205 & 0.33879 & 0.708328 & 1.252905 & 1.6 & 1.370468 \\ 0.044789 & 0.148705 & 0.384510 & 0.88705 & 1.370468 & 1.6 \end{bmatrix} \]
\[ K_* = [0.025841,\; 0.094819,\; 0.270959,\; 0.708328,\; 1.233866,\; 1.470298] \qquad \text{and} \qquad K_{**} = [1.6] \]

Since the test set consists of a single point $x_*$, and taking into consideration the expressions given by (1.13) and (1.14), the mean of the distribution and the uncertainty of the estimation captured by the variance are
\[ \bar{f}_* = 0.68098409, \qquad \mathbb{V}[f_*] = 0.23471905 \]

Figure 1.3: The GP posterior conditioned on six noisy observations and the prediction for the test point $x_* = 3.2$. The shaded area represents the 95% confidence region.
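The figures above can be checked with a short script. The sketch below (not part of the thesis) applies eqs. (1.13) and (1.14) to the six data points with the stated hyperparameters; it should reproduce the covariance matrices and the predictive mean and variance reported above up to rounding.

```python
import numpy as np

X = np.array([-2.5, -1.5, -0.5, 0.75, 1.95, 2.8])
y = np.array([0.6, 0.1, 0.3, 0.45, 0.6, 0.75])
x_star = 3.2
sigma_f2, ell, sigma_n2 = 1.5, 2.0, 0.1

# Kernel of eq. (1.21): SE term plus noise on coincident points
K = sigma_f2 * np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * ell**2)) + sigma_n2 * np.eye(len(X))
K_star = sigma_f2 * np.exp(-(x_star - X) ** 2 / (2 * ell**2))   # k(x_*, x_i), no noise term
K_star_star = sigma_f2 + sigma_n2                               # k(x_*, x_*) = 1.6

K_inv = np.linalg.inv(K)
mean = K_star @ K_inv @ y                       # eq. (1.13)
var = K_star_star - K_star @ K_inv @ K_star     # eq. (1.14)
print(mean, var)   # expected to be close to 0.681 and 0.235
```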
Chapter 2

Covariance Functions

The covariance function (also called kernel, kernel function, or covariance kernel) is the
driving factor in a Gaussian process predictor as it specifies which functions are likely
under the GP prior, which in turn determines the generalization properties of the model.
In other words, choosing a useful kernel is equivalent to learning a useful representation
of the input as it encodes our assumptions about the function which we wish to learn.
Colloquially, kernels are often said to specify the similarity between two objects, in our
case, data points. In this way, as we have already mentioned in the first chapter, inputs x
which are close are likely to have similar target values y, and thus training points that are
near to a test point should be informative about the prediction at that point.
The purpose of this chapter is to give some theoretical properties of covariance functions, as well as some commonly used examples. We also show how to use kernels to
build models of functions with many different kinds of structure: additivity, symmetry,
periodicity, interactions between variables, changepoints and some of the structures which
can be obtained by combining them.

2.1 Definition
Definition 2.1. Let $\mathcal{X}$ be a nonempty set, sometimes referred to as the index set. A symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite (p.d.) kernel on $\mathcal{X}$ if
\[ \sum_{i,j=1}^{n} c_i c_j K(x_i, x_j) \ge 0 \tag{2.1} \]
holds for any $n \in \mathbb{N}$, $x_1, \dots, x_n \in \mathcal{X}$, $c_1, \dots, c_n \in \mathbb{R}$.


Kernels are in general complex-valued functions, but in this work we assume real-valued functions, which is the common practice in machine learning and other applications of p.d. kernels.


Some general properties are listed below; a short numerical check follows the list.

For a family of p.d. kernels $(K_i)_{i \in \mathbb{N}}$, $K_i : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:
• The sum $\sum_{i}^{n} \lambda_i K_i$ is p.d., given $\lambda_1, \dots, \lambda_n \ge 0$
• The product $K_1^{a_1} \cdots K_n^{a_n}$ is p.d., given $a_1, \dots, a_n \in \mathbb{N}$
• The limit $K = \lim_{n \to \infty} K_n$ is p.d., if the limit exists

If $(\mathcal{X}_i)_{i=1}^{n}$ is a sequence of sets, and $(K_i)_{i=1}^{n}$, $K_i : \mathcal{X}_i \times \mathcal{X}_i \to \mathbb{R}$, a sequence of p.d. kernels, then both
• $K((x_1, \dots, x_n), (y_1, \dots, y_n)) = \prod_{i=1}^{n} K_i(x_i, y_i)$
• $K((x_1, \dots, x_n), (y_1, \dots, y_n)) = \sum_{i=1}^{n} K_i(x_i, y_i)$
are p.d. kernels on $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$.
Moreover, let $\mathcal{X}_0 \subset \mathcal{X}$. Then the restriction $K_0$ of $K$ to $\mathcal{X}_0 \times \mathcal{X}_0$ is also a p.d. kernel.
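Definition 2.1 and the closure properties above can be checked numerically: on any finite set of inputs the Gram matrix of a p.d. kernel must have non-negative eigenvalues (up to round-off), and so must sums and elementwise products of such matrices. The following sketch uses arbitrary example inputs and two simple kernels chosen only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=20)            # an arbitrary finite set of inputs

def gram(kernel, x):
    return np.array([[kernel(a, b) for b in x] for a in x])

k_se  = lambda a, b: np.exp(-0.5 * (a - b) ** 2)   # squared exponential kernel
k_lin = lambda a, b: 1.0 + a * b                   # inhomogeneous linear kernel

cases = [("SE", gram(k_se, x)),
         ("Lin", gram(k_lin, x)),
         ("SE + Lin", gram(k_se, x) + gram(k_lin, x)),    # sum of kernels
         ("SE * Lin", gram(k_se, x) * gram(k_lin, x))]    # product (elementwise Schur product)

for name, K in cases:
    smallest = np.linalg.eigvalsh(K).min()
    print(f"{name:9s} smallest eigenvalue: {smallest:+.2e}")   # non-negative up to round-off
```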
In this work we consider covariance functions where the input domain $\mathcal{X}$ is a subset of the vector space $\mathbb{R}^D$. A footnote providing an alternative and curious definition of a kernel¹ in a machine learning context, due to R. Schaback and H. Wendland, as well as a more detailed explanation of the positive definite matrix property², is added below.
Each kernel has a number of parameters which specify the precise shape of the covariance function. These are sometimes referred to as hyperparameters, since they can be viewed as specifying a distribution over function parameters, instead of being parameters which specify a function directly. The length-scale $\ell$, the signal variance $\sigma_f^2$ and the noise variance $\sigma_n^2$ are the most representative. In the next section, for every kernel mentioned we will provide some plots showing the effects of varying the hyperparameters on GP prediction, as well as more formal definitions.
Before giving an overview of some commonly used kernels, we provide a popular classification of them.
If a covariance function is a function of $x - x'$, and is thus invariant to translations in the input space, it is called a stationary covariance function. If, further, the covariance function is invariant to all rigid motions, meaning it is a function only of $|x - x'|$, it is called isotropic. For example, the squared exponential covariance function given in equation (1.5) is isotropic and consequently stationary.³
On the other hand, if a covariance function depends on $x$ and $x'$ only through $x \cdot x'$, we call it a dot product covariance function. An important example is the inhomogeneous polynomial kernel $k(x, x') = (\sigma_0^2 + x \cdot x')^p$, where $p$ is a positive integer. Note that dot product covariance functions are invariant to a rotation of the coordinates about the origin, but not to translations.
Finally, there are also other interesting kernels which are not included in the groups of stationary or dot product kernels. One such covariance function belongs to a particular type of neural network, and its construction is due to Radford M. Neal and his exhaustive research in Bayesian learning for neural networks.

¹ A kernel is a function $K : \Omega \times \Omega \to \mathbb{R}$ where $\Omega$ can be an arbitrary nonempty set. Some readers may consider this as being far too general. However, in the context of learning algorithms, the set $\Omega$ defines the possible learning inputs. Thus $\Omega$ should be general enough to allow Shakespeare texts or X-ray images, i.e. $\Omega$ should better have no predefined structure at all. Thus the kernels occurring in machine learning are extremely general, but still they take a special form which can be tailored to meet the demands of applications.
² A real $n \times n$ covariance matrix $K$ which satisfies $Q(v) = v^T K v \ge 0$ for all vectors $v \in \mathbb{R}^n$ is called positive semidefinite (PSD). If $Q(v) = 0$ only when $v = 0$, the matrix is positive definite. $Q(v)$ is called a quadratic form. A symmetric matrix is PSD iff all of its eigenvalues are non-negative.
³ As the kernel is now only a function of $r = |x - x'|$, these are also known as radial basis functions (RBFs).

2.2 Examples of Kernels


2.2.1 Stationary Kernels
As mentioned above, a stationary kernel is a function of t = x x0 . Sometimes in this case
we will write k as a function of a single argument, i.e, k (t ). Although, not being the main
aim of this section it must be said that the covariance function of a stationary process can
be represented as the Fourier transform of a positive finite measure as Bochner’s theorem
states. 4

Theorem 2.2. (Bochner's theorem) A complex-valued function $k$ on $\mathbb{R}^d$ is the covariance function for a weakly stationary⁵ mean square continuous complex-valued random process on $\mathbb{R}^d$ if and only if it can be represented as
\[ k(\tau) = \int_{\mathbb{R}^d} e^{2\pi i\, s \cdot \tau}\, d\mu(s) \]
where $\mu$ is a positive finite measure.

If $\mu$ has a density $S(s)$, then $S$ is called the spectral density or power spectrum of $k$, and $k$ and $S$ are Fourier duals:
\[ k(\tau) = \int S(s)\, e^{2\pi i\, s^T \tau}\, ds, \tag{2.2} \]
\[ S(s) = \int k(\tau)\, e^{-2\pi i\, s^T \tau}\, d\tau. \tag{2.3} \]

In other words, a spectral density entirely determines the properties of a stationary kernel, and spectral densities are often more interpretable than kernels. If we Fourier transform a stationary kernel, the resulting spectral density shows the distribution of support over frequencies. A heavy-tailed spectral density has relatively large support for high frequencies, while a uniform density over the spectrum corresponds to white noise. Therefore draws from a process with a heavy-tailed spectral density tend to appear more erratic (containing higher frequencies, and behaving more like white noise) than draws from a process whose spectral density concentrates its support on low-frequency functions. Indeed, one can gain insight into kernels by considering their spectral densities.

⁴ A proof and further reading can be found in Stein, M. L. (1999). Interpolation of Spatial Data. Springer-Verlag, New York.
⁵ A Gaussian time series $\{X_t\}$ is said to be stationary if $m(t) = \mathbb{E}[X_t] = \mu$ is independent of $t$ and $\mathrm{Cov}(X_{t+h}, X_t)$ is independent of $t$ for all $h$.
We now give some examples of stationary covariance functions.

Squared Exponential
The SE kernel has become the de facto default kernel for GPs and SVMs (support vector machines). It is also known as the radial basis function kernel or the Gaussian kernel. It has already been introduced in the first chapter, equation (1.5), and has the form
\[ k_{SE}(x, x') = \sigma_f^2 \exp\!\left( -\frac{(x - x')^2}{2\ell^2} \right) \tag{2.4} \]

This covariance function has some nice properties, such as being infinitely differentiable, which means that the GP has mean square derivatives of all orders and thus is very smooth. It has two parameters:
- The length-scale $\ell$ determines the length of the 'wiggles' in the function. In general, we will not be able to extrapolate more than $\ell$ units away from the data. Informally, it can be thought of as roughly the distance one has to move in input space before the function value can change significantly.⁶
- The output variance $\sigma_f^2$ determines the average distance of the function away from its mean. Every kernel has this parameter out in front; it is just a scale factor.

Figure 2.1: GP priors generated with the SE kernel for different hyperparameter values. Panel (a) shows priors generated with $(\ell, \sigma_f^2) = (1, 1)$: the functions are very smooth and the average distance of the functions from their mean is quite controlled, since the variance term is small. In panel (b), $(\ell, \sigma_f^2) = (0.1, 1)$: the length-scale has been shortened and the priors are more wiggly and oscillate more quickly. In panel (c), $(\ell, \sigma_f^2) = (1, 3.3)$: as the variance is larger, the range of the priors increases.
⁶ For a 1-d GP, one way to understand the characteristic length-scale of the process is in terms of the number of upcrossings of a level $u$. The expected number of upcrossings $\mathbb{E}[N_u]$ of the level $u$ on the unit interval by a zero-mean, stationary, almost surely continuous Gaussian process is given by
\[ \mathbb{E}[N_u] = \frac{1}{2\pi} \sqrt{\frac{-k''(0)}{k(0)}}\, \exp\!\left( -\frac{u^2}{2k(0)} \right) \]

The Matérn Class of Covariance Functions

The Matérn class of covariance functions is given by
\[ k_{\mathrm{Matern}}(\tau) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\tau}{\ell} \right)^{\!\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\,\tau}{\ell} \right) \tag{2.5} \]
with positive parameters $\nu$ and $\ell$, where $K_\nu$ is a modified Bessel function.⁷
It might not be immediately obvious, but note that the scaling is chosen so that for $\nu \to \infty$ we obtain the SE covariance function. The most interesting cases for machine learning are $\nu = 3/2$ and $\nu = 5/2$, for which
\[ k_{\nu=3/2}(\tau) = \left( 1 + \frac{\sqrt{3}\,\tau}{\ell} \right) \exp\!\left( -\frac{\sqrt{3}\,\tau}{\ell} \right), \tag{2.6} \]
\[ k_{\nu=5/2}(\tau) = \left( 1 + \frac{\sqrt{5}\,\tau}{\ell} + \frac{5\tau^2}{3\ell^2} \right) \exp\!\left( -\frac{\sqrt{5}\,\tau}{\ell} \right) \tag{2.7} \]
Another special case is $\nu = 1/2$, known as the Laplacian (exponential) covariance function, whose samples are very rough and behave like a stationary version of Brownian motion.

Figure 2.2: GP priors generated with the Matérn kernel for different values of the hyperparameter $\nu$: (a) $\nu = 3/2$, (b) $\nu = 5/2$, (c) $\nu = 1/2$, (d) $\nu = 100$. For $\nu = 1/2$ (panel c) the process becomes very rough. For values $\nu \ge 7/2$ (panel d) it is hard to distinguish between finite values of $\nu$ and $\nu \to \infty$, the smooth squared exponential case.
⁷ The modified Bessel functions (also named the hyperbolic Bessel functions) of the first and second kind are defined by
\[ I_\alpha(x) = i^{-\alpha} J_\alpha(ix) = \sum_{m=0}^{\infty} \frac{1}{m!\, \Gamma(m + \alpha + 1)} \left( \frac{x}{2} \right)^{2m+\alpha} \]
\[ K_\alpha(x) = \frac{\pi}{2}\, \frac{I_{-\alpha}(x) - I_\alpha(x)}{\sin(\alpha\pi)} \]
when $\alpha$ is not an integer; when $\alpha$ is an integer, the limit is used.

The Rational Quadratic Covariance Function

The rational quadratic (RQ) covariance function is
\[ k_{RQ}(\tau) = \sigma_f^2 \left( 1 + \frac{\tau^2}{2\alpha\ell^2} \right)^{-\alpha} \tag{2.8} \]
with $\alpha, \ell > 0$. It can be seen as an infinite sum of squared exponential (SE) covariance functions with different characteristic length-scales, so GP priors with this kernel expect to see functions which vary smoothly across many length-scales. The parameter $\alpha$ determines the relative weighting of large-scale and small-scale variations. When $\alpha \to \infty$, the RQ is identical to the SE. Note that the process is infinitely mean-square differentiable for every $\alpha$, in contrast to the Matérn kernel.

Figure 2.3: GP priors generated with the RQ kernel for (a) $(\sigma_f^2, \ell, \alpha) = (1.0, 1.0, 1.0)$, (b) $(1.0, 1.0, 0.01)$, (c) $(1.0, 1.0, 100)$ and (d) $(1.0, 0.1, 40)$. Note that panel (c) is almost identical to Figure 2.1a, since the value of $\alpha$ is high. In panel (d), as the value of the length-scale decreases the functions become more wiggly, as expected.

The Periodic Covariance Function

This kernel is useful when the data has periodicity; it allows one to model functions which repeat themselves exactly. The expression is given by
\[ k_{Per}(x, x') = \sigma_f^2 \exp\!\left( -\frac{2\sin^2(\pi |x - x'| / p)}{\ell^2} \right) \tag{2.9} \]

Its parameters are easily interpretable:
- The period $p$ simply determines the distance between repetitions of the function.
- The length-scale acts in the same way as in the SE kernel.

Figure 2.4: GP priors generated with the periodic kernel for (a) $(\sigma_f^2, \ell, p) = (1, 1, 1)$, (b) $(1, 1, 0.1)$ and (c) $(2, 0.8, 0.5)$.

Constant covariance function


This kernel can be used as part of a product kernel, where it scales the magnitude of the other factor (kernel), or as part of a sum kernel, where it modifies the mean of the Gaussian process:
\[ k(x, x') = \sigma_0^2 \tag{2.10} \]

2.2.2 Dot Product Covariance Functions


Above we have seen examples of stationary kernels. However, there are also other interesting kernels which are not of this form. Being non-stationary means that the kernel depends on more than the separation of the inputs; in particular, its parameters specify an origin. In this section we describe and provide a graphical understanding of the linear kernel.
Linear Kernel
The linear kernel is just a particular case of the polynomial kernel, where the positive integer $p$ in the equation that follows is $p = 1$:
\[ k_{Pol}(x, x') = (\sigma_b^2 + \sigma_v^2 (x - c)(x' - c))^p \tag{2.11} \]

where the hyperparameters mean the following:
- The offset $c$ determines the $x$-coordinate of the point that all the lines in the posterior go through. At this point the function will have zero variance (unless noise is added).
- The constant variance $\sigma_b^2$ determines how far from $0$ the height of the function will be at zero. This is a little confusing, because it does not specify that value directly, but rather puts a prior on it; it is equivalent to adding an uncertain offset to our model.
- The hyperparameter $p$ specifies the degree of the polynomial resulting from the product of $p$ linear kernels. Therefore, if $p = 2$ the functions sampled from the GP prior will be quadratic, and likewise cubic if $p = 3$.

If you use just a linear kernel in a GP you are simply doing Bayesian linear regression, significantly improving the computation time: instead of taking $\mathcal{O}(n^3)$ time, inference can be made in $\mathcal{O}(n)$. Moreover, combining it with other kernels gives rise to some nice properties, as will be shown further on.

Figure 2.5: GP priors generated with the linear (polynomial) kernel for (a) $(\sigma_b^2, \sigma_v^2, c, p) = (1.0, 0.3, 0, 1)$, (b) $(2.0, 1.2, 0, 1)$ and (c) $(1.0, 0.4, 0, 2)$. In panel (c), since $p = 2$, meaning that the kernel is the product of two linear kernels, the priors are quadratic functions.

2.2.3 Other Non-stationary Covariance Functions


In this section we will describe a covariance function belonging to a particular type of neural network. The neural network kernel is perhaps the most notable one for research on Gaussian processes within the machine learning community. Its construction is due to Radford M. Neal, who pursued the limits of large models and showed that a Bayesian neural network becomes a Gaussian process with a neural network kernel as the number of hidden units approaches infinity. Moreover, a brief explanation of the white noise kernel is provided.

Neural Network covariance function


Consider a neural network with one hidden layer,
\[ f(x) = b + \sum_{i=1}^{J} v_i\, h(x; u_i), \tag{2.12} \]
where the $v_i$ are the hidden-to-output weights, $h$ is any bounded hidden unit transfer function, the $u_i$ are the input-to-hidden weights, and $J$ is the number of hidden units. Let the bias $b$ and the $v$'s have independent zero-mean distributions of variance $\sigma_b^2$ and $\sigma_v^2 / J$ respectively, and let the weights $u_i$ for each hidden unit have independent and identical distributions. The first two moments of $f(x)$ in equation (2.12), collecting all weights together into the vector $w$, are
\[ \mathbb{E}[f(x)] = 0 \tag{2.13} \]
\[ \mathrm{cov}[f(x), f(x')] = \mathbb{E}_w[f(x) f(x')] = \sigma_b^2 + \sigma_v^2 \frac{1}{J} \sum_{i=1}^{J} \mathbb{E}_u[h(x; u_i)\, h(x'; u_i)] = \sigma_b^2 + \sigma_v^2\, \mathbb{E}_u[h(x; u)\, h(x'; u)] \tag{2.14} \]
where the last equality follows from the fact that the $u_i$ are identically distributed. The sum in equation (2.14) is over $J$ i.i.d. random variables, and all moments are bounded. If $b$ has a Gaussian distribution, the central limit theorem can be applied to show that as $J \to \infty$ any collection of function values $f(x_1), \dots, f(x_N)$ has a joint Gaussian distribution, and thus the neural network in equation (2.12) becomes a Gaussian process with covariance function given by the last term in (2.14).

If we choose the transfer function as $h(x; u) = \mathrm{erf}(u_0 + \sum_{j=1}^{P} u_j x_j)$, where $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt$, and we choose $u \sim \mathcal{N}(0, \Sigma)$, then we obtain⁸ from (2.14)
\[ k_{NN}(x, x') = \frac{2}{\pi} \sin^{-1}\!\left( \frac{2\tilde{x}^T \Sigma\, \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^T \Sigma\, \tilde{x})(1 + 2\tilde{x}'^T \Sigma\, \tilde{x}')}} \right) \tag{2.15} \]
where $x \in \mathbb{R}^P$ and $\tilde{x} = (1, x^T)^T$. This is a true neural network covariance function. On the other hand, the sigmoid kernel $k(x, x') = \tanh(a + b\, x \cdot x')$, although it has been proposed, is never positive definite and thus is not a valid covariance function.
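A sketch of eq. (2.15) for one-dimensional inputs is given below; the choice of $\Sigma$ (a scaled identity) and the example inputs are illustrative assumptions, not values used in the thesis.

```python
import numpy as np

def k_nn(x, x_prime, Sigma):
    """Neural network covariance function of eq. (2.15) for scalar inputs."""
    xt  = np.array([1.0, x])          # augmented input (1, x)^T
    xpt = np.array([1.0, x_prime])
    num = 2.0 * xt @ Sigma @ xpt
    den = np.sqrt((1.0 + 2.0 * xt @ Sigma @ xt) * (1.0 + 2.0 * xpt @ Sigma @ xpt))
    return (2.0 / np.pi) * np.arcsin(num / den)

Sigma = 2.0 * np.eye(2)               # assumed covariance of the input-to-hidden weights
print(k_nn(0.5, -1.0, Sigma), k_nn(0.5, 0.5, Sigma))
```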

White noise covariance function

The main use case of this kernel is as part of a sum kernel, where it explains the noise component of the signal. Tuning its parameter corresponds to estimating the noise level:
\[ k(x, x') = \sigma_f^2\, \delta(x - x') \tag{2.16} \]

Figure 2.6: GP priors drawn from a white noise kernel, with (a) $\sigma_f^2 = 0.4$ and (b) $\sigma_f^2 = 3.0$.

⁸ A detailed step-by-step explanation is provided by Williams, C. K. I. (1998), Computation with Infinite Neural Networks, Neural Computation.



Table 2.1: Summary of covariance functions

Covariance function : Expression : Stationary
Constant : $k(x, x') = \sigma_0^2$ : yes
Linear : $k_{Pol}(x, x') = \sigma_b^2 + \sigma_v^2 (x - c)(x' - c)$ : no
Polynomial : $k_{Pol}(x, x') = (\sigma_b^2 + \sigma_v^2 (x - c)(x' - c))^p$ : no
Squared Exponential : $k_{SE}(x, x') = \sigma_f^2 \exp\!\left( -\frac{(x - x')^2}{2\ell^2} \right)$ : yes
Matérn : $k_{\mathrm{Matern}}(\tau) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\tau}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\,\tau}{\ell} \right)$ : yes
Rational Quadratic : $k_{RQ}(\tau) = \sigma_f^2 \left( 1 + \frac{\tau^2}{2\alpha\ell^2} \right)^{-\alpha}$ : yes
Periodic : $k_{Per}(x, x') = \sigma_f^2 \exp\!\left( -\frac{2\sin^2(\pi |x - x'|/p)}{\ell^2} \right)$ : yes
Neural Network : $k_{NN}(x, x') = \frac{2}{\pi} \sin^{-1}\!\left( \frac{2\tilde{x}^T \Sigma \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^T \Sigma \tilde{x})(1 + 2\tilde{x}'^T \Sigma \tilde{x}')}} \right)$ : no
White Noise : $k(x, x') = \sigma_f^2\, \delta(x - x')$ : yes
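For reference, most of the kernels in Table 2.1 can be written in a few lines each. The sketch below is a plain NumPy/SciPy rendering for scalar inputs, with arbitrary default hyperparameters; it is not the implementation used later in the thesis (which relies on libraries such as GPy and scikit-learn).

```python
import numpy as np
from scipy.special import gamma, kv    # Gamma function and modified Bessel function K_nu

def k_const(x, xp, s0=1.0):                 return s0**2
def k_lin(x, xp, sb=1.0, sv=1.0, c=0.0):    return sb**2 + sv**2 * (x - c) * (xp - c)
def k_se(x, xp, sf=1.0, ell=1.0):           return sf**2 * np.exp(-(x - xp)**2 / (2 * ell**2))

def k_rq(x, xp, sf=1.0, ell=1.0, alpha=1.0):
    return sf**2 * (1 + (x - xp)**2 / (2 * alpha * ell**2)) ** (-alpha)

def k_per(x, xp, sf=1.0, ell=1.0, p=1.0):
    return sf**2 * np.exp(-2 * np.sin(np.pi * abs(x - xp) / p)**2 / ell**2)

def k_matern(x, xp, nu=1.5, ell=1.0):
    tau = abs(x - xp)
    if tau == 0.0:
        return 1.0                          # limiting value as tau -> 0
    z = np.sqrt(2 * nu) * tau / ell
    return 2**(1 - nu) / gamma(nu) * z**nu * kv(nu, z)

for k in (k_const, k_lin, k_se, k_rq, k_per, k_matern):
    print(k.__name__, round(float(k(0.3, 1.2)), 4))
```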

2.3 Combining Kernels


In the previous section we developed many covariance functions, all of them summarized in Table 2.1. The kernels above are useful if the data is all of the same type; however, this is not a common situation. Therefore, we must provide solutions when the kind of structure in our data is not expressed by any single known kernel.
The next few sections of this chapter explore ways in which kernels can be combined to create new ones with different properties. This will allow us to include as much high-level structure as necessary in our models.
Although there are many ways of combining kernels, such as convolution⁹ or tensor product, the most widespread, and the ones analysed here, are addition and multiplication.¹⁰

⁹ Consider an arbitrary fixed kernel $h(x, z)$ and the map $g(x) = \int h(x, z) f(z)\, dz$. Then $\mathrm{cov}(g(x), g(x')) = \int\!\!\int h(x, z)\, k(z, z')\, h(x', z')\, dz\, dz'$.
¹⁰ For a further explanation proving that the sum and the product of two kernels is a kernel, see http://ttic.uchicago.edu/ dmcallester/ttic101-07/lectures/kernels/kernels
Sums and products of kernels are usually written as:
\[ (k_1 + k_2)(x, x') = k_1(x, x') + k_2(x, x') \tag{2.17} \]
\[ (k_1 \times k_2)(x, x') = k_1(x, x') \times k_2(x, x') \tag{2.18} \]

2.3.1 Combining Kernels through multiplication


Each kernel in a product modifies the resulting GP model in a consistent way. Thus, multiplying kernels is an efficient way to produce kernels combining several high-level properties. But what properties do these new kernels have? Here we discuss a few examples; a short code sketch follows this list.
- Multiplication by SE: this converts any global correlation structure into local correlation, since $SE(x, x')$ decreases monotonically to $0$ as $|x - x'|$ increases. For example, the periodic kernel corresponds to exactly periodic structure, whereas $Per \times SE$ corresponds to locally periodic structure, as shown in Figure 2.7.
- Multiplying linear kernels: by multiplying together $n$ linear kernels we obtain a prior on polynomials of degree $n$, as shown in the previous section in Figure 2.5c.
- Multiplication by a linear kernel: this is equivalent to multiplying the function being modelled by a linear function. If $f(x) \sim \mathcal{GP}(0, k)$, then $x \times f(x) \sim \mathcal{GP}(0, Lin \times k)$; thus the marginal standard deviation of the function being modelled grows linearly away from the location given by the kernel parameter $c$.
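The sketch below illustrates the $Per \times SE$ case from the list above: it builds both Gram matrices on a grid and draws one function from the purely periodic prior and one from the locally periodic prior. The grid, the hyperparameters and the random seed are arbitrary choices for the illustration.

```python
import numpy as np

x = np.linspace(0, 10, 300)
d = np.abs(x[:, None] - x[None, :])                 # pairwise distances

K_per = np.exp(-2 * np.sin(np.pi * d / 2.0)**2 / 0.5**2)   # periodic kernel, period 2
K_se  = np.exp(-0.5 * d**2 / 3.0**2)                        # SE kernel, length-scale 3
K_locally_periodic = K_per * K_se                   # product kernel Per x SE (elementwise)

rng = np.random.default_rng(3)
jitter = 1e-6 * np.eye(len(x))
f_per = np.linalg.cholesky(K_per + jitter) @ rng.standard_normal(len(x))
f_loc = np.linalg.cholesky(K_locally_periodic + jitter) @ rng.standard_normal(len(x))
# f_per repeats itself exactly; f_loc is periodic only locally, its shape drifting with x
```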

Figure 2.7: Examples of one-dimensional structures expressible by multiplying kernels. The first row shows $k(x, x')$; the second row shows functions $f(x)$ sampled from the corresponding GP prior. Panels: (a) $Lin \times Lin$, quadratic functions; (b) $SE \times Per$, locally periodic; (c) $Lin \times SE$, increasing variation; (d) $Lin \times Per$, growing amplitude.


2.3.2 Multi-dimensional models


Let us briefly discuss a way to model functions having more than one input. It consists of multiplying together kernels defined on each individual input, for example a product of SE kernels over the different dimensions, each having its own length-scale parameter. Usually named Automatic Relevance Determination (ARD), the particular case mentioned before has the following expression:
\[ \text{SE-ARD}(x, x') = \prod_{d=1}^{D} \sigma_d^2 \exp\!\left( -\frac{(x_d - x_d')^2}{2\ell_d^2} \right) = \sigma_f^2 \exp\!\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x_d')^2}{\ell_d^2} \right) \tag{2.19} \]
Note that this procedure was named ARD because estimating the length-scale parameters $\ell_1, \ell_2, \dots, \ell_D$ implicitly determines the relevance of each dimension: input dimensions with relatively large length-scales imply relatively little variation along those dimensions in the function being modelled.
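A sketch of the SE-ARD kernel of eq. (2.19) for vector-valued inputs is given below; the length-scales are invented for the illustration, with a very large $\ell_2$ effectively switching off the second dimension.

```python
import numpy as np

def se_ard(X, X2, lengthscales, sigma_f=1.0):
    """SE-ARD kernel of eq. (2.19): one length-scale per input dimension."""
    Xs, X2s = X / lengthscales, X2 / lengthscales      # scale each dimension
    sq_dist = np.sum((Xs[:, None, :] - X2s[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dist)

X = np.random.default_rng(0).standard_normal((5, 2))   # five 2-dimensional inputs
K = se_ard(X, X, lengthscales=np.array([0.5, 50.0]))
# With ell_2 = 50 the kernel barely varies along dimension 2, so that dimension is nearly irrelevant
print(np.round(K, 3))
```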

2.3.3 Modelling Sums of Functions


Let $f_a$, $f_b$ be functions drawn independently from GP priors $f_a \sim \mathcal{GP}(\mu_a, k_a)$ and $f_b \sim \mathcal{GP}(\mu_b, k_b)$. Then the distribution of the sum of those functions is simply another GP:
\[ f_a + f_b \sim \mathcal{GP}(\mu_a + \mu_b,\; k_a + k_b) \tag{2.20} \]
Since it is this easy to encode additivity into GP models, and since the kernels $k_a$ and $k_b$ can be of different types, we can model the data as a sum of independent functions, each possibly representing a different type of structure.

Figure 2.8: Examples of one-dimensional structures expressible by adding kernels. Panels: (a) $Lin + Per$, periodic plus trend; (b) $SE + Per$, periodic plus noise; (c) $SE + Lin$, linear plus variation; (d) $SE_{long} + SE_{short}$, slow and fast variation. $SE_{long}$ denotes a SE kernel whose length-scale is long relative to that of $SE_{short}$.

2.3.4 Changepoints
An example of how combining kernels can give rise to more structured priors is given by changepoint kernels, which can express a change between different types of structure. Changepoint kernels can be defined through addition and multiplication with sigmoidal functions such as $\sigma(x) = \frac{1}{1 + \exp(-x)}$:
\[ CP(k_1, k_2)(x, x') = \sigma(x)\, k_1(x, x')\, \sigma(x') + (1 - \sigma(x))\, k_2(x, x')\, (1 - \sigma(x')) \tag{2.21} \]
which can be written in shorthand as
\[ CP(k_1, k_2) = k_1 \times s + k_2 \times \bar{s} \tag{2.22} \]
where $s = \sigma(x)\sigma(x')$ and $\bar{s} = (1 - \sigma(x))(1 - \sigma(x'))$. This compound kernel expresses a change from one kernel to another. The parameters of the sigmoid determine where, and how rapidly, this change occurs.
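A sketch of the changepoint construction of eq. (2.21) is shown below, switching from a rough SE kernel to a smooth one around $x = 5$; the two base kernels, the changepoint location and the steepness of the sigmoid are arbitrary choices for the illustration.

```python
import numpy as np

def sigmoid(x, x0=5.0, steepness=3.0):
    # sigmoidal switch centred at x0
    return 1.0 / (1.0 + np.exp(-steepness * (x - x0)))

def k_se(x, xp, ell):
    return np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / ell**2)

def k_changepoint(x, xp):
    s, sp = sigmoid(x)[:, None], sigmoid(xp)[None, :]
    k1 = k_se(x, xp, ell=3.0)   # smooth structure, dominant after the changepoint
    k2 = k_se(x, xp, ell=0.3)   # rough structure, dominant before it
    # eq. (2.21): sigma(x) k1(x,x') sigma(x') + (1 - sigma(x)) k2(x,x') (1 - sigma(x'))
    return s * k1 * sp + (1 - s) * k2 * (1 - sp)

x = np.linspace(0, 10, 200)
K = k_changepoint(x, x)
f = np.linalg.cholesky(K + 1e-6 * np.eye(len(x))) @ np.random.default_rng(0).standard_normal(len(x))
# f is wiggly to the left of x = 5 and smooth to the right of it
```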
Chapter 3

Model Selection and Adaptation of Hyperparameters

In chapter 1 we saw how to do regression using a Gaussian process with a given fixed covariance function, and in chapter 2 several examples of such covariance functions were presented. While some of their properties, such as stationarity, may be easy to determine from the context, it may not be trivial to specify with confidence the values of the free hyperparameters, length-scales and variances. Therefore the natural question that follows, and the one that turns Gaussian processes into powerful practical tools, is how to develop methods that address the model selection problem, referring both to the discrete choice of the functional form of the covariance function and to the values of any hyperparameters.

In section 3.1 we outline model selection for regression problems, focusing on the Bayesian approach in section 3.1.1 and on cross-validation, in particular the leave-one-out estimator, in section 3.1.2. In the remaining section, Bayesian principles are applied to a specific case in order to provide a practical view.

3.1 Model Selection for GP Regression


Based on a set of training data, our task is to make inferences about the form and parameters of the covariance function, in other words, about the relationships in the data. It should be clear that model selection is essentially open ended, meaning that even for a particular kernel there is a huge variety of possible distance measures. Thus we need to be able to compare two (or more) methods differing in the values of particular parameters, or in the shape of the covariance function, or to compare a GP model to any other kind of model.
In the next section we describe the Bayesian view on model selection, which involves the computation of the probability of the model given the data, based on the marginal likelihood.

3.1.1 Marginal Likelihood


To estimate the kernel parameters we could use exhaustive search over a discrete grid of values, with validation loss as an objective, but this can be quite slow. Here we consider an empirical Bayes approach, which allows us to use continuous optimization methods, which are much faster. In particular, we will maximize the marginal likelihood¹
\[ p(y|X) = \int p(y|f, X)\, p(f|X)\, df \tag{3.1} \]
Since $p(f|X) = \mathcal{N}(f|0, K)$ and $p(y|f) = \prod_i \mathcal{N}(y_i | f_i, \sigma_y^2)$, the marginal likelihood is given by
\[ \log p(y|X) = \log \mathcal{N}(y|0, K_y) = -\tfrac{1}{2} y^T K_y^{-1} y - \tfrac{1}{2} \log|K_y| - \tfrac{N}{2} \log(2\pi) \tag{3.2} \]
where Ky = K f + sn2 I is the covariance matrix for the noisy targets y and K f is the covari-
ance matrix for the noise-free latent f.
The first term of the marginal likelihood in equation 3.2 is a data fit term since it is the only
one involving the observed targets. The second term, log|Ky |/2, is the model complexity
depending only on the covariance function and the inputs. Lastly, log(2p )/2 is just a
normalization constant.
To understand the tradeoff between the first two terms, consider a 1-dimensional SE kernel as we vary the length-scale $\ell$ and hold $\sigma_y^2$ fixed. For short length-scales the fit will be good, so $y^T K_y^{-1} y$ will be small; however, the model complexity will be high: $K_y$ will be almost diagonal, since most points will not be considered near any others, so $\log|K_y|$ will be large. For long length-scales the fit will be poor but the model complexity will be low: $K_y$ will be almost all 1's, so $\log|K_y|$ will be small.
Now it is time to maximize the marginal likelihood in order to set the hyperparameters. We seek the partial derivatives of the marginal likelihood w.r.t. the hyperparameters, denoted by $\theta$. Using (3.2) and the rules of matrix derivatives² one can show that
\[ \frac{\partial}{\partial \theta_j} \log p(y|X, \theta) = \frac{1}{2} y^T K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} y - \frac{1}{2} \mathrm{tr}\!\left( K_y^{-1} \frac{\partial K_y}{\partial \theta_j} \right) = \frac{1}{2} \mathrm{tr}\!\left( (\alpha\alpha^T - K_y^{-1}) \frac{\partial K_y}{\partial \theta_j} \right) \tag{3.3} \]
where $\alpha = K_y^{-1} y$. It takes $\mathcal{O}(n^3)$ time to compute $K_y^{-1}$, and $\mathcal{O}(n^2)$ time per hyperparameter to compute the gradient. Often there exist some constraints on the hyperparameters, such as $\sigma_y^2 \ge 0$. In this case, we can define $\theta = \log(\sigma_y^2)$ and then use the chain rule.
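In practice this gradient-based optimisation is usually delegated to a library. The sketch below uses scikit-learn (one of the libraries mentioned in the introduction); the kernel composition, its initial values and the synthetic data are placeholders rather than the configuration used in the thesis, and the random restarts are one simple way to cope with the local optima discussed next.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

# Synthetic data for the illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.2 * rng.standard_normal(30)

# sigma_f^2 * SE(ell) + white noise; hyperparameters are set by maximising
# the log marginal likelihood of eq. (3.2) with a gradient-based optimizer
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

print(gpr.kernel_)                                      # optimised kernel hyperparameters
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))   # value of eq. (3.2) at the optimum
```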
Although this is not a devastating problem, note that there is no guarantee that the marginal likelihood does not suffer from multiple local optima, since the objective is not convex; every local maximum corresponds to a particular interpretation of the data.
1 The reason it is called the marginal likelihood, rather than just likelihood, is because we have marginalized out the latent Gaussian vector f.

2 ∂K^{-1}/∂θ = −K^{-1} (∂K/∂θ) K^{-1}, where ∂K/∂θ is a matrix of elementwise derivatives. For the log determinant of a p.d. symmetric matrix we have ∂ log|K|/∂θ = tr(K^{-1} ∂K/∂θ).
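To make these computations concrete, the following is a minimal NumPy sketch of the log marginal likelihood (3.2) and the gradient identity (3.3) for a one-dimensional SE kernel; the function names and the use of a Cholesky factorization are our own choices, not part of any reference implementation.

    import numpy as np

    def se_kernel(X, lengthscale):
        # Squared exponential kernel matrix for 1-D inputs X of shape (n,)
        d2 = (X[:, None] - X[None, :]) ** 2
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def log_marginal_likelihood(X, y, lengthscale, noise_var):
        # Eq. (3.2): data fit + complexity penalty + normalization constant
        n = len(y)
        Ky = se_kernel(X, lengthscale) + noise_var * np.eye(n)
        L = np.linalg.cholesky(Ky)                           # Ky = L L^T
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = Ky^{-1} y
        data_fit = -0.5 * y @ alpha
        complexity = -np.sum(np.log(np.diag(L)))             # = -0.5 log|Ky|
        return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

    def lml_gradient(Ky, y, dKy_dtheta):
        # Eq. (3.3): 0.5 * tr((alpha alpha^T - Ky^{-1}) dKy/dtheta_j)
        Kinv = np.linalg.inv(Ky)
        alpha = Kinv @ y
        return 0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dKy_dtheta)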

In Figure 3.1 an example with two local optima is provided. For a data set of 20 randomly generated observations, the inferred underlying functions have been specified considering two different sum-kernels. Although both result from the addition of a
Squared Exponential plus a White Noise kernel, they differ in the values of the hyperparameters, such as the length-scale and the noise level.

One might expect that, even though the kernels are initialized differently, the log-marginal-likelihood would converge to a global maximum after the optimization process.
However, the illustration of the log-marginal-likelihood (LML) landscape shows that there
exist two local maxima of the LML. The first corresponds to a model with a high noise level
and a large length-scale, which explains all variation in the data by noise. The second one
has a smaller noise level and a shorter length-scale, which explains most of the variation by
the noise-free functional relationship.

Figure 3.1: Panels (a) and (b) show the underlying functions (and 95% confidence intervals) obtained from two initializations. Panel (a): initial (ℓ, σ_f^2, σ_n^2) = (100, 1, 1), optimum (109, 0.00316, 0.637), log-marginal-likelihood 23.872337362. Panel (b): initial (ℓ, σ_f^2, σ_n^2) = (1, 1, 1e−5), optimum (0.365, 0.64, 0.294), log-marginal-likelihood 21.8050908902. Panel (c) shows the marginal likelihood as a function of the hyperparameters ℓ and σ_n^2.
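The behaviour illustrated in Figure 3.1 can be reproduced along these lines with scikit-learn; the sketch below is only indicative: the data are synthetic (the exact 20 observations are not reproduced here) and the two kernel initializations follow panels (a) and (b).

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, 20)[:, None]
    y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.5, X.shape[0])

    for length_scale, noise_level in [(100.0, 1.0), (1.0, 1e-5)]:
        kernel = 1.0 * RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise_level)
        gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.0).fit(X, y)
        # depending on the initialization, the optimizer ends in a different local maximum
        print(gpr.kernel_, gpr.log_marginal_likelihood_value_)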

3.1.2 Cross Validation

The basic idea of cross-validation is to split the training set into two disjoint sets, one
used for training and the other used to monitor performance, usually called the validation
set. Having defined a dataset on which to test the model during the training phase reduces problems like
overfitting and variability and gives insight into how the model will generalize to an
independent dataset.
Although different types of cross-validation implementations can be distinguished, in the
Gaussian processes context the most widespread are those that involve multiple rounds of
cross-validation using different partitions, with the validation results averaged over the
rounds. Methods such as k-fold cross-validation and leave-one-out cross-validation (LOO-CV) belong to that particular framework.
In the k-fold CV approach the training set is split into k smaller sets. Then the model
is trained using k − 1 of the folds as training data and the resulting model is validated on
the remaining part of the data. Finally, as mentioned before, the performance measure
reported by k-fold cross-validation is the average of the values computed in the loop.
On the other hand, LOO-CV is an extreme case of k-fold cross-validation, since the number
of folds equals the number of training cases, k = n. Although the naive computational cost is prohibitive, there
are computational shortcuts that allow LOO-CV to become an efficient way to tune the
hyperparameters and perform model selection.
To begin with, since cross-validation can be used with any loss function, we use the negative
log predictive probability as the loss; when leaving out training case i, the log predictive probability is

log p(y_i | X, y_{−i}, θ) = −(1/2) log σ_i^2 − (y_i − μ_i)^2/(2σ_i^2) − (1/2) log 2π        (3.4)

where the notation y_{−i} means all the targets except number i, the training sets are taken
to be (X_{−i}, y_{−i}), and μ_i and σ_i^2 are computed according to (1.11) and (1.12) respectively. Thus,
the LOO log predictive probability is

L_LOO(X, y, θ) = Σ_{i=1}^n log p(y_i | X, y_{−i}, θ)        (3.5)

Note that in each of the n LOO-CV loops the inverse of the covariance matrix has to be
computed in order to determine the mean and variance in eqs. (1.11) and (1.12). However,
in each rotation the expressions are almost the same, since only a single column and row
of the covariance matrix are removed. Therefore, applying matrix inversion by partitioning on the complete covariance matrix increases efficiency, since the expressions for
the LOO-CV predictive mean and variance become

μ_i = y_i − [K^{-1} y]_i / [K^{-1}]_{ii}   and   σ_i^2 = 1 / [K^{-1}]_{ii},        (3.6)

where the computational expense of these quantities is O(n^3) once for computing the
inverse of K, plus O(n^2) for the entire LOO-CV procedure. At this point, substituting the
expressions of (3.6) into eqs. (3.4) and (3.5) gives rise to a performance estimator that can
be optimized w.r.t. the hyperparameters to do model selection.

But before giving an expression for the partial derivatives of L_LOO w.r.t. the hyperparameters, we need the partial derivatives of the LOO-CV predictive mean and variance from
eq. (3.6) w.r.t. the hyperparameters, and those are

∂μ_i/∂θ_j = [Z_j α]_i / [K^{-1}]_{ii} − α_i [Z_j K^{-1}]_{ii} / [K^{-1}]_{ii}^2,    ∂σ_i^2/∂θ_j = [Z_j K^{-1}]_{ii} / [K^{-1}]_{ii}^2        (3.7)

where α = K^{-1} y and Z_j = K^{-1} ∂K/∂θ_j. Thus the partial derivatives of (3.5), obtained using the
chain rule and eq. (3.7), are

∂L_LOO/∂θ_j = Σ_{i=1}^n [ (∂ log p(y_i|X, y_{−i}, θ)/∂μ_i) (∂μ_i/∂θ_j) + (∂ log p(y_i|X, y_{−i}, θ)/∂σ_i^2) (∂σ_i^2/∂θ_j) ]        (3.8)
            = Σ_{i=1}^n ( α_i [Z_j α]_i − (1/2)(1 + α_i^2/[K^{-1}]_{ii}) [Z_j K^{-1}]_{ii} ) / [K^{-1}]_{ii}        (3.9)

where the computational complexity is O(n^3) for the inverse of K and O(n^3) per hyperparameter for the derivative equation above.
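As a sketch of how eqs. (3.4)-(3.6) translate into code, the LOO-CV predictive means, variances and log predictive probability can all be obtained from a single inversion of the covariance matrix; the following NumPy function is our own illustration and assumes the noisy covariance matrix K has already been built.

    import numpy as np

    def loo_log_predictive(K, y):
        Kinv = np.linalg.inv(K)       # O(n^3), computed once
        alpha = Kinv @ y              # K^{-1} y
        d = np.diag(Kinv)             # [K^{-1}]_ii
        mu = y - alpha / d            # eq. (3.6), LOO-CV predictive means
        var = 1.0 / d                 # eq. (3.6), LOO-CV predictive variances
        logp = -0.5 * np.log(var) - (y - mu) ** 2 / (2 * var) - 0.5 * np.log(2 * np.pi)
        return mu, var, np.sum(logp)  # eqs. (3.4) and (3.5)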

Having reached this stage, one natural question is under which circumstances each of
the methods discussed, marginal likelihood or LOO-CV, might be preferable, since their
computational complexity is very similar. In the following sections, as in the bank account
balance forecasting problem, model selection and adaptation of the hyperparameters has
been done using the marginal likelihood, for programming convenience and because of its
widespread use in the machine learning community.

3.2 Mauna Loa Atmospheric Carbon Dioxide Example and Discussion

This example is based on Section 5.4.3 of ’Gaussian Processes for Machine Learning’
[Rasmussen & C.K.I Williams] and it represents a clear example of complex kernel and
hyperparameter optimization using gradient ascent on the log-marginal-likelihood.

The data consists of the monthly average atmospheric CO2 concentrations (in parts per
million by volume (ppmv)) collected at the Mauna Loa Observatory in Hawaii, between
1959 and 2003. The data is shown in figure 3.2. The objective is to model the CO2 concen-
tration as a function of the time x. Thus, we are working in a time series environment.

Figure 3.2: The 540 observations of monthly CO2 concentration made between 1959 and
2003. Three missing values were replaced by the mean concentration of the year. Note the
rising trend and seasonal variations.

A reasonable way to start the modelling process is to identify the different components
of our data. Classic deterministic analysis of time series suggests that every function
should be decomposed into all or some of these features: trend, seasonal variation, medium-term
irregularities and noise. Once this first step is accomplished, the kernel and, therefore, the
hyperparameter selection and combination should take care of these individual properties
in order to provide an excellent fit to the data.
In our particular case the Mauna Loa dataset presents a pronounced long-term rising trend
that we try to model using a squared exponential kernel with two hyperparameters, controlling the variance θ1 and the characteristic length-scale θ2:

k1(x, x′) = θ1^2 exp(−(x − x′)^2 / (2θ2^2))        (3.10)

The SE kernel with a large length-scale enforces this component to be smooth; note that it
is not enforced that the trend is rising, which leaves this choice to the GP.
On the other hand, the seasonal component³ also seems evident. We can use the periodic
kernel with a fixed periodicity of 1 year. The length-scale θ5 of this periodic component,
controlling its smoothness, is a free parameter and θ3 gives the magnitude. In order
to allow decay away from exact periodicity, the product with an SE kernel is taken.
3 The seasonal component is caused by the varying CO2 concentrations due to plants over the year.

The length-scale θ4 of this SE component controls the decay time and is a further free
parameter.

k2(x, x′) = θ3^2 exp(−(x − x′)^2 / (2θ4^2) − 2 sin^2(π(x − x′)) / θ5^2)        (3.11)

In order to capture the medium-term irregularities a rational quadratic term is used. These
irregularities are better explained by a Rational Quadratic than by an SE kernel component,
since it yields a higher marginal likelihood, as it can accommodate several length-scales.

k3(x, x′) = θ6^2 (1 + (x − x′)^2 / (2θ7^2 θ8))^(−θ8)        (3.12)

As before θ6 is the variance, and the length-scale θ7 and the shape parameter θ8 (which
determines the diffuseness of the length-scales) are to be determined.

The last feature to model, the noise term, has been specified using an SE kernel, which
explains the correlated noise components such as local weather phenomena, plus a White
Noise kernel

k4(x, x′) = θ9^2 exp(−(x − x′)^2 / (2θ10^2)) + θ11^2 δ_pq        (3.13)

where θ9 is the magnitude of the correlated noise component, θ10 is its length-scale and
θ11 is the variance of the independent noise component.
To sum up, the final covariance function is

k(x, x′) = k1(x, x′) + k2(x, x′) + k3(x, x′) + k4(x, x′)        (3.14)

with hyperparameters θ = (θ1, ..., θ11)^T. Before fitting the hyperparameters by optimizing the marginal likelihood using a gradient-based optimizer such as the BFGS⁴ algorithm, we
subtract the empirical mean of the data. To avoid bad local minima and to improve the
global performance, a few random restarts are tried, as well as k-fold cross-validation on the training
data with k = 3. The best marginal likelihood result was log p(y|X, θ) = 77.146, which
gives rise to an R^2 = 0.9455 score on the predictions.
The plot containing the final model and the predictions together with its decomposition
into additive components is shown in figure 3.3.

4 The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems.
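In a library such as scikit-learn the composite kernel (3.14) could be written as below; this is a sketch, and the initial hyperparameter values are only illustrative starting points in the spirit of those reported by Rasmussen & Williams, not the optimized values.

    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import (RBF, WhiteKernel,
                                                  RationalQuadratic, ExpSineSquared)

    k1 = 66.0 ** 2 * RBF(length_scale=67.0)                           # long-term rising trend
    k2 = (2.4 ** 2 * RBF(length_scale=90.0)
          * ExpSineSquared(length_scale=1.3, periodicity=1.0,
                           periodicity_bounds="fixed"))               # decaying yearly periodicity
    k3 = 0.66 ** 2 * RationalQuadratic(length_scale=1.2, alpha=0.78)  # medium-term irregularities
    k4 = 0.18 ** 2 * RBF(length_scale=0.134) + WhiteKernel(noise_level=0.19 ** 2)  # noise

    kernel = k1 + k2 + k3 + k4
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    # gpr.fit(X, y)   # X: time in years as a column vector, y: CO2 concentration (ppmv)
    # gpr.predict(X_future, return_std=True)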



Figure 3.3: Panel (a) shows the full posterior on the Mauna Loa dataset: the underlying function together with the 95% predictive confidence region for predictions until 2017; the confidence interval gets wider as time increases. Panels (b) long-term smooth rising trend, (c) seasonal variation over the years, (d) medium-term irregularities and (e) white noise show the additive components: some capture slowly-changing structure while others capture quickly-varying structure.

To sum up, Mauna Loa has been a good way to test the powerful possibilities that inference with composite covariance functions provides, as well as the utility of allowing the data
to determine the hyperparameters. The experience acquired facing this problem enables us
to confront the main project of this work, the bank account forecasting problem, with a
more solid background.
Chapter 4

Bank account forecasting Problem

In this final chapter, since the main theoretical contributions and an introductory example
such as Mauna Loa have been covered, we face a more complex problem that
consists of forecasting bank account balances using Gaussian Processes.
Nowadays, with the advance of affordable, fast computation, the machine learning community has addressed increasingly large and difficult problems, and one appealing application area
is banking. Even though it is an extremely regulated environment, it
offers many possibilities, and it becomes extremely exciting to work with massive
amounts of financial data and to decide how to deal with them in order to produce new tools.
In that context, anticipating the behaviour of personal accounts or predicting people's
expenses is a challenge that holds a special place in any modern platform of intelligent
banking services. In this particular exercise, we have 3.35 million accounts¹ and we are
trying to predict for each client the 13th state of the account balance having only
12 historical values. More precisely, we work on just a year's worth of data with monthly
aggregations, and the 13th month is our forecasting goal.²
As mentioned in the introduction, a prestigious European bank tackled this problem in
order to provide their customers with a warning system in case the balance of their bank
accounts has a high probability of decreasing significantly or even becoming negative.
Therefore, we are facing a time-series³ regression problem. However, it is important to
note that since our time series are very short, statistical methods such as GARCH, ARIMA
or Holt-Winters do not perform accurately. This led us to the following question: can Gaussian
Processes accurately predict the 13th state of a bank account given the 12 previous ones?
To answer that question we are going to approach the problem in two different ways. First,
1 Due to computational limitations only 100.000 accounts will be analyzed, as they are a representative sample. Examining the whole sample can be considered a very interesting Big Data problem.
2 The anonymized bank account data comes from real customers and is provided by a bank whose name cannot be disclosed.
3 A time series is a doubly infinite sequence ..., X_{−2}, X_{−1}, X_0, X_1, X_2, ... of random variables or random vectors. We refer to the index t of X_t as time and think of X_t as the state or output of a stochastic system at time t. The variable X_t is assumed to be real valued, and for our purposes t = 1, ..., 13.


all the bank accounts will be analyzed as a global entity, meaning that one main kernel
will be considered to perform the Gaussian Process regression. In the second approach, on the
other hand, in order to be more flexible and adaptive, the bank accounts will be
clustered according to their similarity; thus, for every cluster a different and more appropriate
kernel can be considered. The details of the clustering process, such as the similarity
measure and the choice of centroids and kernels, will be explained in the Clustering section.

To sum up, the questions to answer are whether or not GPR works as expected and whether
clustering improves the performance of GPR. Thus, a comparative evaluation
of both approaches will be shown.

4.1 Global approach

Data description
Before starting the forecasting process we will focus a bit more on the structure of the
data. We have at our disposal 3.35 million bank accounts. Each account x_i is a vector of
1 row and 13 columns, where the first 12 columns represent the last 12 monthly states of
the account and the 13th state is our prediction target. For example, the first of our accounts
is:

x_1 = [3000, 5000, 3000, 5500, 0, 4000, 2000, 2000, 0, 5000, 0, 2000]

x_1* = [4200]

We actually know the value of the 13th state in order to evaluate the error of our forecasting. Thus, more formally, our dataset can be expressed as D = {(x_i, y_i) : i =
1, ..., 3.351.168}, where x_i collects the first 12 states of an account and y_i is its 13th state; alternatively it can be thought of as a matrix of 3.351.168
rows and 13 columns.
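For illustration, assuming the accounts are stored in a NumPy array with one row per account and 13 columns (the file name below is hypothetical), the inputs and the forecasting target can be separated as:

    import numpy as np

    accounts = np.loadtxt("accounts.csv", delimiter=",")  # hypothetical file, 13 columns per row
    X_hist = accounts[:, :12]   # the 12 known monthly states
    y_true = accounts[:, 12]    # the 13th state, our forecasting target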

Given the size of the dataset we performed some data cleaning to reduce the number
of rows. Firstly, we dropped all the accounts whose first 12 states were all zeros: out of the 3.35
million accounts, 107.245 were dropped. Since 3.243.923 accounts still remain and
the computational expenses⁴ are unaffordable, we consider two simple random samples
of 25.000 and 100.000 accounts. The reason we take these sample sizes is that
in the clustering the 25.000 accounts will be distributed into 20 groups and the 100.000 into
40, matching all the different patterns observed and allowing us to check whether more clusters mean
more accuracy.

4 Moreover, the forecasting process for 100.000 accounts lasts more than 36 hours.

Figure 4.1: Panel (a) shows a histogram of the distribution of zeros over the first 12 states of the accounts; panel (b) shows 400 bank accounts. Note that the accounts tend to have many zeros over the different months.

Forecasting
As mentioned above two different random simple samples have been considered (N1 =
25.000 and N2 = 100.000 accounts). Both have been analyzed in terms of the same kernel
function, which has been designed to adapt and fit well with the general properties of the
accounts. Note that since we have to be able to model a wide range of different behaviour
patterns, the properties expressed by kernels can not be very specific or particular. Thus,
the kernel function is derived by combining several different kinds of simple covariance
functions. The final covariance function is

k(x, x′) = k1(x, x′) + k2(x, x′) + k3(x, x′)        (4.1)

where

k1(x, x′) = θ1^2 exp(−(x − x′)^2 / (2θ2^2)) = 450^2 exp(−(x − x′)^2 / (2 · 75.0^2))        (4.2)

To model the long-term smooth trend we use an SE covariance function with a high amplitude parameter θ1, since the variance is high, and a large characteristic length-scale θ2.
To model the medium-term irregularities a Rational Quadratic term is used:

k2(x, x′) = θ3^2 (1 + (x − x′)^2 / (2θ4^2 θ5))^(−θ5) = 5.5^2 (1 + (x − x′)^2 / (2 · 2.0^2 · 1.5))^(−1.5)        (4.3)

One could also have used an SE form for this component, but it turns out that the rational
quadratic works better. Finally, we specify a noise model as the sum of an SE and a WN
kernel. Noise in the series models rare behaviour changes caused by unexpected expenditures.

k3(x, x′) = θ6^2 exp(−(x − x′)^2 / (2θ7^2)) + θ8^2 δ_pq = 4.5^2 exp(−(x − x′)^2 / (2 · 0.3^2)) + 20.2^2 δ_pq        (4.4)

Note that we have not considered a periodic covariance function, since it was not clear that
all the accounts have a seasonal trend, let alone an exactly periodic one⁵. Actually, some
trials were done considering a seasonal kernel, but it turned out not to be relevant,
since the decay-time parameter became very large.
The hyperparameters were fitted by optimizing the marginal likelihood using a gradient-based
optimizer such as BFGS, as we proceeded in the Mauna Loa problem.
Moreover, for each account we have considered k-fold cross-validation with k = 3, since it gives
rise to more accurate predictions.
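A sketch of how the global model could be implemented with scikit-learn is given below; the variable names are ours, and the optimizer used by GaussianProcessRegressor is by default L-BFGS-B rather than plain BFGS.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, WhiteKernel

    k1 = 450.0 ** 2 * RBF(length_scale=75.0)                        # long-term trend, eq. (4.2)
    k2 = 5.5 ** 2 * RationalQuadratic(length_scale=2.0, alpha=1.5)  # medium-term irregularities, eq. (4.3)
    k3 = 4.5 ** 2 * RBF(length_scale=0.3) + WhiteKernel(noise_level=20.2 ** 2)  # noise, eq. (4.4)
    kernel = k1 + k2 + k3

    months = np.arange(1, 13, dtype=float)[:, None]   # inputs t = 1, ..., 12

    def forecast_account(balances):
        # balances: the 12 observed monthly states of one account
        gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gpr.fit(months, balances)
        mean, std = gpr.predict(np.array([[13.0]]), return_std=True)
        return mean[0], std[0]    # point forecast and uncertainty for the 13th month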
Before showing the predictions and the results achieved we must define the prediction
metrics or, at least, taking into account the final goal of the forecasting, the accuracy measures that allow us to grade the forecast of the 13th state of a bank account as good or poor.
Some of the measures of forecast accuracy used are the usual ones in the statistics field,
while others were specially designed for this particular problem. Therefore we considered:

• Mean Absolute Error (MAE): MAE = (1/N) Σ_{i=1}^N |ŷ^(i) − y^(i)|, where ŷ^(i) is the predicted value and y^(i) is the actual one.

• Root Mean Squared Error (RMSE): RMSE = √((1/N) Σ_{i=1}^N (ŷ^(i) − y^(i))^2)

• Mean Absolute Scaled Error (MASE):

  MASE = (1/N) Σ_{i=1}^N |ŷ^(i) − y^(i)| / ((1/(T−1)) Σ_{t=2}^T |y_t^(i) − y_{t−1}^(i)| + λ)

  One interesting and favourable property of MASE, compared to other methods for calculating forecast errors, is its interpretability, since values greater than one indicate that in-sample one-step forecasts from the naïve method perform better than the forecast values under consideration. Thus, values near zero mean good accuracy compared to the naïve method. For our particular case T = 12 and λ ≠ 0 is a small constant parameter to avoid division by zero.

• Percentage of sign accounts forecasted correctly (SAFC):

  SAFC = (Number of accounts whose sign is predicted correctly) / N

  Given the initial aim of the project driven by the bank, this index can be seen as an intuitive measure of accuracy. If the sign of the forecasted bank account balance matches the sign of the actual 13th state of the account, a counter is increased. The following line of code expresses that idea (see also the sketch after this list):

  if (accountsreg[i][12] < 0 and ypred < 0) or (accountsreg[i][12] > 0 and ypred > 0) or (np.abs(accountsreg[i][12] - ypred) <= 1e-1):

5 However, some of the accounts do indeed behave seasonally, and in the clustering the kernel will capture that.

• Percentage of sign accounts forecasted correctly including 0 (SAFC0):

  SAFC0 = (Number of accounts whose sign, including 0, is predicted correctly) / N

  Similar to the measure mentioned above, but with one important modification. Throughout the project we realized that in a significant number of bank accounts the 13th balance state was 0. However, for a non-negligible number of accounts in that situation our forecast was very close to 0 but not exactly that value. Thus we reformulate the indicator SAFC in order to accommodate that situation, changing the code to:

  if (accountsreg[i][12] <= 0 and ypred < 0) or (accountsreg[i][12] >= 0 and ypred > 0) or (np.abs(accountsreg[i][12] - ypred) <= 1e-1):

• Percentage of sign accounts forecasted correctly including an uncertainty zone around 0 (SAFCcor):

  SAFCcor = (Number of accounts whose sign is predicted correctly + #{y^(i) = 0 ∧ ŷ^(i) ∈ [−5, 5]}) / N

  The definition of SAFC0, and especially the inclusion of 0, makes sense for the type of bank accounts we are facing. However, we can provide a more accurate measure, since SAFC0 takes as a good forecast, for example, a prediction of 350 when the actual value is 0. Thus, we define SAFCcor, which counts the accounts whose sign prediction matches the actual 13th state and, instead of simply including 0 (which seems quite gullible), also considers as correct those predictions whose actual final value was 0 but whose forecasted value was not, provided it lies in a small uncertainty zone around zero (we define a bounded zone from −5 to 5). For example, a prediction of 2.5 when the actual value is 0 would be considered a good forecast.

• 95% confidence intervals:

  (Number of y^(i) ∈ [ŷ^(i) − 1.96 · √var(ŷ^(i)), ŷ^(i) + 1.96 · √var(ŷ^(i))]) / N

  Since for every prediction we have access to its 95% confidence interval, we can compute the percentage of predictions whose actual value falls in that interval.
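As a small illustration, the following sketch implements MASE and the SAFC-type measures defined above, assuming NumPy arrays with one entry per account; the array and parameter names are ours.

    import numpy as np

    def mase(y_true, y_pred, history, lam=1e-8):
        # history: array of shape (N, 12) with the 12 observed states of each account
        naive = np.mean(np.abs(np.diff(history, axis=1)), axis=1) + lam
        return np.mean(np.abs(y_pred - y_true) / naive)

    def safc(y_true, y_pred, include_zero=False, tol=1e-1):
        # SAFC (and SAFC0 when include_zero=True): fraction of correctly predicted signs
        if include_zero:
            ok = ((y_true <= 0) & (y_pred < 0)) | ((y_true >= 0) & (y_pred > 0))
        else:
            ok = ((y_true < 0) & (y_pred < 0)) | ((y_true > 0) & (y_pred > 0))
        ok |= np.abs(y_true - y_pred) <= tol
        return np.mean(ok)

    def safc_cor(y_true, y_pred, zone=5.0):
        # SAFCcor: correct sign, or true value 0 with the prediction inside [-zone, zone]
        ok = ((y_true < 0) & (y_pred < 0)) | ((y_true > 0) & (y_pred > 0))
        ok |= (y_true == 0) & (np.abs(y_pred) <= zone)
        return np.mean(ok)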

Results

                        MAE        RMSE       MASE     SAFC      SAFC0     SAFCcor   95% conf. inter.
N = 25.000 accounts     150.6495   1897.5950  0.92243  65.644%   91.724%   81.464%   88.288%
N = 100.000 accounts    159.9937   5389.1124  0.71326  64.826%   91.409%   81.058%   88.233%

After many trials, especially in order to decide the structure of the kernel and the initial
values of the hyperparameters, the final performance is the one presented above. On the
whole, the results are cautiously positive; however, it is obvious that there is still room for
improvement, and hopefully it will be achieved with the clustering approach.
MASE for both cases is less than one, meaning that the GP performs better than the naïve method.
Moreover, the SAFC measure and mainly SAFC0 show very good accuracy, achieving more
than 91% of correct sign predictions (as it includes zeros). At the same time SAFCcor
is around 81%, meaning that the gap between SAFC0 and SAFCcor is approximately
10%. Consequently, a small percentage of the zeros counted as good predictions in SAFC0
corresponds to predictions that are actually distant from 0.
On the negative side, the percentage of accounts whose actual value lies in the 95%
confidence interval is lower than expected, since it barely exceeds 88%. This is
not excessively worrying, because that value is not far from 95%,
but given the definition of a confidence interval and the fact that the uncertainty
region is wide, we expect the results to improve with the clustering approach. From
a strategic and business perspective, this index is extremely critical, seeing as the goal
of the bank was to provide a warning system able to predict when the balance of the bank
account will become negative, or when a drastic decrease of the balance can occur, and thus send a
message to the customer. Obviously, the customer might feel confused, upset or ashamed
if, after receiving the alarm message, the balance of the bank account remains without great
changes. Thus, the bank expects to be highly confident and in this way be a powerful aid
instead of disturbing customers. In the next section we cluster the accounts
with the aim of improving the results.

4.2 Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that ob-
jects in the same group are more similar to each other than to those in another group. For
our particular problem, classification methods such as SVMs and Naive Bayes would not
be a good choice since they assume that the input features are independent. Meanwhile,
the k-NN algorithm could still work; however, it relies on the notion of a similarity
measure between input examples. Thus, how do we measure the similarity between two
time series?
At first, one might think that simply calculating the Euclidean distance between two time
series would give us a good idea of the similarity between them. Given two time series
of length n, { Xt : t = 1, . . . , n} and {Yt : t = 1, . . . , n} we define the Euclidean distance
between them as:

dist(X_t, Y_t) = √( Σ_{t=1}^n (X_t − Y_t)^2 )        (4.5)

After all, the Euclidean distance between identical time series is zero and between very
different time series is large. However, before we settle on Euclidean distance as a simi-

larity measure we should clearly state our desired criteria for determining the similarity
between two time series.
With a good similarity measure, small changes in two time series should result in small
changes in their similarity. With respect to Euclidean distance this is true for changes in
the y-axis, but it is not true for changes in the time axis. This is the problem with using
the Euclidean distance measure. It often produces pessimistic similarity measures when
it encounters distortion in the time axis. The way to deal with this is to use dynamic time
warping.

4.2.1 Dynamic Time Warping


Dynamic time warping (DTW) finds the optimal non-linear alignment between two time
series. Thus, the Euclidean distances between alignments are much less susceptible
to pessimistic similarity measurements due to distortion in the time axis.
Considering two time series {X_t : t = 1, ..., n} and {Y_t : t = 1, ..., n} of the same length n,
dynamic time warping works in the following way. The first thing is to construct an n × n
matrix where the element (i, j) is the Euclidean distance between X_i and Y_j. The goal is
to find a path through this matrix that minimizes the cumulative distance. This path then
determines the optimal alignment between the two time series. It should be noted that it
is possible for one point in a time series to be mapped to multiple points in the other time
series.
Let M = m_1, m_2, ..., m_k be the path, where each element of M represents the distance between a point X_i and a point Y_j. The optimal path, the one with the minimum cumulative
distance, given by the expression M* = argmin_M √(Σ_{i=1}^k m_i), is found via dynamic programming, specifically applying the following recursive function

γ(i, j) = dist(X_i, Y_j) + min(γ(i−1, j−1), γ(i−1, j), γ(i, j−1))        (4.6)
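A direct dynamic-programming implementation of the recursion (4.6) could look as follows; this is a sketch that accumulates squared point-wise differences and takes a final square root, matching the expression for M* above.

    import numpy as np

    def dtw_distance(x, y):
        n, m = len(x), len(y)
        g = np.full((n + 1, m + 1), np.inf)
        g[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = (x[i - 1] - y[j - 1]) ** 2
                # eq. (4.6): extend the cheapest of the three neighbouring paths
                g[i, j] = cost + min(g[i - 1, j - 1], g[i - 1, j], g[i, j - 1])
        return np.sqrt(g[n, m])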

The only important drawback is that dynamic time warping is quadratic in the length of
the time series used. Thus, for two time series where n is the length of the first one
and m is the length of the second, the complexity is O(nm). Therefore, since
we are performing DTW multiple times and this can be prohibitively expensive, we speed
things up using a lower bound of DTW known as LB Keogh, which is defined as:

LBKeogh(X_t, Y_t) = Σ_{t=1}^n (Y_t − U_t)^2 I(Y_t > U_t) + (Y_t − L_t)^2 I(Y_t < L_t)        (4.7)

where U_t and L_t are upper and lower bounds for the time series X_t, defined as
U_t = max(X_{t−r}, ..., X_{t+r}) and L_t = min(X_{t−r}, ..., X_{t+r}) for a reach r, and I is the indicator
function.⁶
6 The indicator function of a subset A of a set X is a function I_A : X → {0, 1} defined as I_A(x) := 1 if x ∈ A, and 0 if x ∉ A.
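A corresponding sketch of the LB_Keogh bound (4.7) is given below; the envelope is built around the first series and compared against the second, and the final square root is only there to keep the value on the same scale as the DTW distance sketched above.

    import numpy as np

    def lb_keogh(x, y, r=3):
        # Envelope U_t, L_t around x over a reach r, compared against y as in eq. (4.7)
        lb = 0.0
        n = len(x)
        for t in range(n):
            lo, hi = max(0, t - r), min(n, t + r + 1)
            U, L = np.max(x[lo:hi]), np.min(x[lo:hi])
            if y[t] > U:
                lb += (y[t] - U) ** 2
            elif y[t] < L:
                lb += (y[t] - L) ** 2
        return np.sqrt(lb)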

Another way to speed things up when using DTW is to enforce a locality constraint. This
works for long time series under the assumption that it is unlikely for two elements to be
matched if they are too far apart. The threshold is determined by a window size, usually
noted w; this way, only mappings within this window are considered, which speeds up
the inner loop. Since the time series involved in our problem are quite short, this second
speed-up was not required.

In the next subsection the clustering conducted in the bank account forecasting problem is discussed in detail.

4.2.2 Clustering of Bank Accounts

Once a reliable method to determine the similarity between two time series has been
specified, everything is set to begin the process of clustering. The purpose, as mentioned
above, is to lump together the bank accounts that behave the same way in order to specify
a kernel function that better captures the properties of the accounts in each cluster, and to be
able to demonstrate whether clustering improves the performance of GPR.

After a deep data exploration with the aim of identifying the most common behaviour
patterns, and since the number of clusters has to be chosen manually, two clustering
processes have been performed: the first one groups 25.000 accounts into 20 different
clusters and the second one 100.000 into 40.⁷

For both approaches the clustering process begins by taking a simple and representative
random sample of the bank accounts: more precisely, 10.000 for the 20 clusters and 15.000
for the 40 clusters. Then, again randomly, 20 and 40 accounts respectively are specified as
centroids. Next, for each of the 10.000 or 15.000 accounts initially taken, a search is
performed in order to find the centroid that minimizes the DTW distance to the account,
or in other words, the centroid that is most similar to the account analyzed.

Given that DTW is quadratic, this can be computationally expensive. Thus, since the LB
Keogh lower bound is linear, we can speed up classification. Note that given two time
series X_t, Y_t it is always true that LBKeogh(X_t, Y_t) ≤ DTW(X_t, Y_t). Hence we can eliminate
time series that cannot possibly be more similar than the current most similar time series.
In this way we avoid many unnecessary DTW computations.

After all the accounts have been assigned to a cluster, we recalculate the centroid of each
cluster by finding the account that minimizes the global distance to all the accounts
in the cluster. The main goal is to provide a representative centroid for each cluster.
Consequently, given the need to determine whether a new account belongs to one cluster
or another, we only need to compute the DTW distance between the account and the
centroids. Figure 4.2 shows a graphical representation of the clustering process performed
in the first approach (20 clusters).

7 Given the computational limitations we could not perform a clustering involving 100 groups or more. Thus, it can be understood as an exciting future exercise.



Figure 4.2: Clustering of 25.000 bank accounts into 20 clusters. [1] Selection process of
representative centroids: take a simple random sample of 10.000 bank accounts. [1.1]
Again randomly, specify 20 centroids. [1.2] Compute the DTW distance between the 10.000
accounts and the centroids and cluster the accounts. [1.3] Recalculate the centroids by finding the
account that minimizes the global distance to all the accounts of the cluster. [2]
Clustering of the accounts: take a simple random sample of 25.000 bank accounts.
[2.1] Compute the DTW distance between the accounts and the 20 centroids found in [1]; then each
account is assigned to the cluster whose centroid minimizes the distance to the account.
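The assignment step of the process sketched in Figure 4.2 could be written as follows, reusing the dtw_distance and lb_keogh sketches from Section 4.2.1; the array names and the pruning logic are our own illustration.

    import numpy as np

    def assign_to_clusters(accounts, centroids, r=3):
        # accounts: (N, 12) array, centroids: (k, 12) array of representative accounts
        labels = np.empty(len(accounts), dtype=int)
        for i, acc in enumerate(accounts):
            best_dist, best_c = np.inf, -1
            for c, cen in enumerate(centroids):
                # cheap lower bound first: skip the quadratic DTW when it cannot win
                if lb_keogh(acc, cen, r) < best_dist:
                    d = dtw_distance(acc, cen)
                    if d < best_dist:
                        best_dist, best_c = d, c
            labels[i] = best_c
        return labels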

Note that the clustering of 100.000 bank accounts in 40 clusters is done using the same
mechanism as shown above. However, instead of taking 10.000 accounts in order to de-
termine the centroids, the sample size was 15.000. The figures below show the centroids
found in both approaches.

Figure 4.3: Plot showing the 12 historical values of the different centroids. Panel (a) shows the 20 centroids found for the first approach; panel (b) shows the 40 centroids of the second.

Finally, some images of accounts belonging to the same cluster are provided. Many of these
clusters include accounts that are extremely similar, meaning that even though DTW provides a
non-linear alignment, the similarity between the accounts in each cluster is remarkably high.

Figure 4.4: Examples of bank account clusters: (a) seasonal, (b) last-quarter variance, (c) last-quarter high variance, (d)-(f) seasonal and high variance, (g) stationary, (h) earlier and final variations, (i) seasonal and final variations. Panel (a) shows a cluster which contains accounts with seasonal behaviour and medium variance. Panels (b) and (c) present large instability in the last quarter. Panels (d), (e) and (f) show extremely high variance with seasonal patterns. Panels (g) and (h) are quite stable with no remarkable oscillations; only (h) shows variation in the early and final months. Panel (i) differs from (a) in that the accounts show more activity in the last quarter.

Forecast
Once the clustering process is complete for both samples of bank accounts we are able to
begin with the forecasting. The first stage involves a deep and exhaustive exploration of
the bank accounts of each cluster, trying to figure out the particularities and features that have
to be expressed by each kernel function.
Since the choice of a specific kernel function is done manually⁸, it is sometimes a trial
8 The Cambridge Machine Learning Group is currently developing an automatic statistician on this topic, https://www.automaticstatistician.com/index/

and error procedure. Thus, our particular choice might not be optimal; however, to
mitigate this problem, several restarts and a wide variety of kernel functions have been
considered for each cluster.
As in the global approach, the hyperparameters were fitted by optimizing the marginal
likelihood using a gradient-based optimizer such as BFGS. On the other hand, since our goal
is to compare the results achieved in the different approaches, the measures of accuracy
considered in the clustering are the same as in the global view, but for exploratory reasons
we incorporate the 68,75% confidence interval. Hence, we specify the forecast results for
each cluster as well as for all the clusters as an entity. The results of the samples containing
25.000 and 100.000 bank accounts, clustered into 20 and 40 different groups respectively,
will be analyzed separately.
Results of the 25.000 accounts sampled in 20 clusters
The analysis will proceed from the general to the particular, since we will first compare
the results achieved with the clustering to the ones of the global approach. Then, we will
provide a table showing, for each cluster, a plot of the accounts, the kernel function used
and the different measures of accuracy.
The general results of the approaches used to forecast the 13th state of 25.000 bank accounts are summarized in the table below. We can state that clustering has outperformed
the global study in all of the measures of accuracy; most significantly, the increase in the
percentage of predictions that actually fit in the 95% confidence interval, as well as the
reduction in all the error magnitudes. There is a substantial improvement, and the fact that we
reach a percentage close to 95% in the last measure of accuracy is promising. However,
it has been noted that those accounts whose confidence interval has not been accurately
specified have a larger DTW distance from the centroid, meaning that there is no representative cluster for those particular accounts and consequently the kernel function used
was not appropriate. Those misspecifications are caused by accounts with an erratic
behaviour and high oscillations in the 13th state that do not get captured by GPs.
From a business point of view it would be extremely profitable to study each cluster
independently, since it provides a more comprehensive and useful way to understand in
which types of accounts GPs perform better. Therefore, we provide a table showing the
performance of each cluster in the following pages.

Table 4.1: Measures of accuracy achieved examining the 25.000 accounts.

                                   MAE        RMSE       MASE      SAFC      SAFC0     SAFCcor   95% conf. inter.
N = 25.000 accounts (clustered)    148.7359   1786.455   0.901701  65.68%    96.712%   83.142%   93.568%
N = 25.000 accounts (global)       150.6495   1897.5950  0.92243   65.644%   91.724%   81.464%   88.288%

Clusters Kernel Measures of accuracy

[1] 1023 accounts • k1 ( x, x 0 ) = 10502 exp(


( x x 0 )2
)
2⇤70.02

( x x 0 )2 MAE: 209,915
• k2 ( x, x 0 ) = 202 exp( 2⇤102 RMSE 325,979
2 sin2 (p ( x x 0 )/4)
202
) MASE: 0,7799
( x x 0 )2 SAFC: 0,931574
• k3 ( x, x 0 ) = 0.52 (1 + 2⇤0.5 )
0.5
SAFC0: 0,998045
( x x 0 )2 SAFCc: 0,931574
• k4 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) +
95%: 0,937439
0.22 dpq
68,75%: 0,680352
• kernel = k1 + k2 + k3 + k4

[2] 1108 accounts


( x x 0 )2
• k1 ( x, x 0 ) = 1702 exp( 2⇤23.02
)
MAE: 91,7583
( x x 0 )2 RMSE 164,059
• k2 ( x, x 0 ) = 102 exp( 2⇤0.62
)
MASE: 0,73564
( x x 0 )2
• k3 ( x, x 0 ) = 3.52 (1 + 2⇤0.7 )
0.7
SAFC: 0,8510
SAFC0: 0,9981
( x x 0 )2
• k4 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) + SAFCc: 0,8574
0.22 dpq 95%: 0,9205
• kernel = k1 + k2 + k3 + k4 68,75%: 0,7229

[3] 1647 accounts ( x x 0 )2


• k1 ( x, x 0 ) = 402 exp( 2⇤15.02
)
MAE: 20,667
• k2 ( x, x 0 ) = (7.5 + 5.02 ( x
RMSE 94,138
c)( x 0 c))
MASE: 0,5581
( x x 0 )2 SAFC: 0,63995
• k3 ( x, x 0 ) = 1.52 (1 + 2⇤0.8 )
0.8
SAFC0: 0,9477
( x x 0 )2
• k4 ( x, x 0 ) = 1.52 exp( 2⇤0.82
) + SAFCc: 0,9453
0.22 dpq 95%: 0,9489
68,75%: 0,9058
• kernel = k1 + k2 + k3 + k4

[4] 1139 accounts • k1 ( x, x 0 ) = 7252 exp(


( x x 0 )2
)
2⇤45.02

( x x 0 )2 MAE: 90,822
• k2 ( x, x 0 ) = 202 exp( 2 ⇤ 82 RMSE 388,137
2 sin2 (p ( x x 0 )/4)
122
) MASE: 0,5103
( x x 0 )2 SAFC: 0,5891
• k3 ( x, x 0 ) = 1.52 (1 + 2⇤0.6 )
0.6
SAFC0: 0,9666
( x x 0 )2 SAFCc: 0,7524
• k4 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) +
95%: 0,9631
0.22 dpq
68,75%: 0,9376
• kernel = k1 + k2 + k3 + k4


[5] 128 accounts
( x x 0 )2
• k1 ( x, x 0 ) = 15702 exp( 2⇤43.02
)
MAE: 3413,176
( x x 0 )2 RMSE 15365,86
• k2 ( x, x 0 ) = 182 exp( 2⇤0.62
)
MASE: 0,1883
( x x 0 )2
• k3 ( x, x 0 ) = 3.52 (1 + 2⇤0.7 )
0.7
SAFC: 0,7421
SAFC0: 0,9843
( x x 0 )2
• k4 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) + SAFCc: 0,7656
0.22 dpq 95%: 0,7890
• kernel = k1 + k2 + k3 + k4 68,75%: 0,7656

[6] 1561 accounts


( x x 0 )2
• k1 ( x, x 0 ) = 382 exp( 2⇤15.02
)
MAE: 37,225
• k2 ( x, x 0 ) = (7.5 + 4.5.02 ( x )( x 0 )) RMSE 119,897
MASE: 0,4687
( x x 0 )2
• k3 ( x, x 0 ) = 1.22 (1 + 2⇤0.8 )
0.8
SAFC: 0,6111
( x x 0 )2 SAFC0: 0,9160
• k4 ( x, x 0 ) = 1.52 exp( 2⇤0.82
) +
SAFCc: 0,8821
0.42 dpq
95%: 0,9525
• kernel = k1 + k2 + k3 + k4 68,75%: 0,9160

[7] 407 accounts • k1 ( x, x 0 ) = 11502 exp(


( x x 0 )2
)
2⇤20.02

( x x 0 )2 MAE: 208,0299
• k2 ( x, x 0 ) = 32 exp( 2⇤1.12
) +
RMSE 520,6157
2 sin2 (p ( x x 0 )/3)
40 exp( 22
) MASE: 0,6107
( x x 0 )2 SAFC: 0,5601
• k3 ( x, x 0 ) = 10.52 (1 + 2⇤1.5 )
1.5
SAFC0: 0,9680
( x x 0 )2 SAFCc: 0,6240
• k4 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) +
95%: 0,9557
5.22 dpq
68,75%: 0,8525
• kernel = k1 + k2 + k3 + k4

[8] 2666 accounts • k1 ( x, x 0 ) = 18402 exp(


( x x 0 )2
)
2⇤431.02

( x x 0 )2 MAE: 180,942
• k2 ( x, x 0 ) = 682 exp( 2⇤102
2 sin2 (p ( x x 0 )/4)
RMSE 1359,294
22
) MASE: 0,8055
( x x 0 )2 SAFC: 0,6657
• k3 ( x, x 0 ) = 102 exp( 2⇤0.42
)
SAFC0: 0,9662
( x x 0 )2 SAFCc: 0,7674
• k4 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) +
95%: 0,9429
8.22 dpq
68,75%: 0,8732
• kernel = k1 + k2 + k3 + k4


[9] 1066 accounts ( x x 0 )2
• k1 ( x, x 0 ) = 402 exp( 2⇤20.02
)
MAE: 39,290
• k2 ( x, x 0 ) =
2 sin2 (p ( x x 0 )/4) RMSE 108,776
3.02 exp( 2 ⇤ 22
)
MASE: 0,7780
( x x 0 )2 SAFC: 0,7298
• k3 ( x, x 0 ) = 1.52 (1 + 2⇤2⇤0.8 )
0.8
SAFC0: 0,9765
( x x 0 )2 SAFCc: 0,8958
• k4 ( x, x 0 ) = 0.92 exp( 2⇤0.52
) +
0.22 dpq 95%: 0,93808
68,75%: 0,8151
• kernel = k1 + k2 + k3 + k4

[10] 297 accounts

( x x 0 )2 MAE: 20,3396
• k1 ( x, x 0 ) = 302 exp( 2⇤20.02
)
RMSE 97,1138
( x x 0 )2
• k2 ( x, x 0 ) = 1.52 (1 + 2⇤0.5 )
0.5 MASE: 0,8351
SAFC: 0,6498
( x x 0 )2
• k3 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) + SAFC0: 0,97643
0.22 dpq SAFCc: 0,94276
95%: 0,92929
• kernel = k1 + k2 + k3
68,75%: 0,88552

[11] 1214 accounts • k1 ( x, x 0 ) = 21982 exp(


( x x 0 )2
)
2⇤760.02

( x x 0 )2 MAE: 695,318
• k2 ( x, x 0 ) = 5502 exp( 2⇤562 RMSE 4083,819
2 sin2 (p ( x x 0 )/4)
2⇤212
) MASE: 0,5403
( x x 0 )2 SAFC: 0,7067
• k2 ( x, x 0 ) = 35.52 (1 + 2⇤2⇤0.8 )
0.8
SAFC0: 0,9654
( x x 0 )2 SAFCc: 0,7578
• k4 ( x, x 0 ) = 0.92 exp( 2⇤0.82
) +
95%: 0,9093
10.22 dpq
68,75%: 0,7578
• kernel = k1 + k2 + k3 + k4

[12] 573 accounts

( x x 0 )2 MAE: 39,096
• k1 ( x, x 0 ) = 1052 exp( 2⇤22.52
)
RMSE 82,827
( x x 0 )2
• k2 ( x, x 0 ) = 1.52 (1 + 2⇤0.5 )
0.5 MASE: 0,6128
SAFC: 0,4781
( x x 0 )2
• k3 ( x, x 0 ) = 0.52 exp( 2⇤0.52
) + SAFC0: 0,9895
12.52 dpq SAFCc: 0,8376
95%: 0,9546
• kernel = k1 + k2 + k3
68,75%: 0,8708


[13] 2163 accounts

( x x 0 )2 MAE: 5,7145
• k1 ( x, x 0 ) = 652 exp( 2⇤25.02
)
RMSE 31,596
( x x 0 )2
• k2 ( x, x 0 ) = 3.52 (1 + 2⇤2⇤0.5 )
0.5 MASE: 0,389677
SAFC: 0,742025
( x x 0 )2
• k3 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) + SAFC0: 0,941285
0.22 dpq SAFCc: 0,973185
95%: 0,930652
• kernel = k1 + k2 + k3
68,75%: 0,900139

[14] 1092 accounts ( x x 0 )2


• k1 ( x, x 0 ) = 90502 exp( 2⇤955.02
)

• k2 ( x, x 0 ) = MAE: 930,989
2 sin2 (p ( x x 0 )/4)
852 exp( 2⇤352
) RMSE 4652,450
MASE: 0,5371
• k3 ( x, x 0 ) = 135.52 (1 + SAFC: 0,8397
(x x 0 )21.5
2⇤67⇤1.5 ) SAFC0: 0,9761
( x x 0 )2 SAFCc: 0,8443
• k4 ( x, x 0 ) = 0.52 exp( ) +
2⇤0.32 95%: 0,8690
45.22 dpq 68,75%: 0,7710
• kernel = k1 + k2 + k3 + k4

[15] 2657 accounts ( x x 0 )2


• k1 ( x, x 0 ) = 1402 exp( 2⇤20.02
)

• k2 ( x, x 0 ) = MAE: 58,0384
2 sin2 (p ( x x 0 )/3)
25.02 exp( 22
) ⇤ RMSE 1973,60
2
(10.5 + 5.0 ( x c)( x c)) 0 MASE: 3,3544
SAFC: 0,6620
( x x 0 )2
• k3 ( x, x 0 ) = 1.52 (1 + 2⇤0.5 )
0.5
SAFC0: 0,9649
( x x 0 )2 SAFCc: 0,9793
• k4 ( x, x 0 ) = 1.52 exp( ) +
2⇤0.82 95%: 0,9401
0.22 dpq 68,75%: 0,9036
• kernel = k1 + k2 + k3 + k4

[16] 261 accounts


( x x 0 )2
• k1 ( x, x 0 ) = 1202 exp( 2⇤60.02
) ⇤ MAE: 80,1597
(5.02 ( x c)( x 0 c)) RMSE 176,4381
( x x 0 )2 MASE: 0,4522
• k2 ( x, x 0 ) = 3.52 (1 + 2⇤12⇤0.5 )
0.5
SAFC: 0,6551
( x x 0 )2 SAFC0: 0,9731
• k3 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) +
12.02 dpq SAFCc: 0,7816
95%: 0,9808
• kernel = k1 + k2 + k3 68,75%: 0,9080


[17] 1316 accounts ( x x 0 )2
• k1 ( x, x 0 ) = 402 exp( 2⇤20.02
)
MAE: 44,1591
• k2 ( x, x 0 ) = 12 ⇤
2 sin2 (p ( x x 0 )/4) RMSE 88,0341
exp( 2 ⇤ 22
)
MASE: 0,6279
( x x 0 )2 SAFC: 0,4962
• k3 ( x, x 0 ) = 1.52 (1 + 2⇤2.0⇤0.5 )
0.5
SAFC0: 0,9992
( x x 0 )2 SAFCc: 0,7963
• k4 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) +
8.22 dpq 95%: 0,9354
68,75%: 0,7796
• kernel = k1 + k2 + k3 + k4

[18] 2000 accounts

( x x 0 )2 MAE: 19,7471
• k1 ( x, x 0 ) = 402 exp( 2⇤20.02
)
RMSE 399,311
( x x 0 )2
• k2 ( x, x 0 ) = 1.52 (1 + 2⇤0.65 )
0.65 MASE: 0,9873
SAFC: 0,6725
( x x 0 )2
• k3 ( x, x 0 ) = 2.82 exp( 2⇤0.82
) + SAFC0: 0,9475
0.22 dpq SAFCc: 0,95
95%: 0,9435
• kernel = k1 + k2 + k3
68,75%: 0,9035

[19] 1030 accounts


( x x 0 )2
• k1 ( x, x 0 ) = 402 exp( 2⇤20.02
)
MAE: 11,8496
( x x 0 )2 RMSE 29,2630
• k2 ( x, x 0 ) = 1.52 (1 + 2⇤5⇤0.5 ) 0.5
MASE: 0,5213
( x x 0 )2
• k3 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) + SAFC: 0,4728
1.82 dpq SAFC0: 0,9932
SAFCc: 0,9631
• kernel = k1 + k2 + k3
95%: 0,9388
68,75%: 0,8048

[20] 1652 accounts

( x x 0 )2 MAE: 16,5497
• k1 ( x, x 0 ) = 402 exp( 2⇤20.02
)
RMSE 36,331
( x x 0 )2
• k2 ( x, x 0 ) = 122 (1 + 2⇤8⇤0.5 )
0.65 MASE: 0,5600
SAFC: 0,4279
( x x 0 )2
• k3 ( x, x 0 ) = 0.52 exp( 2⇤0.32
) + SAFC0: 0,9927
10.22 dpq SAFCc: 0,9854
95%: 0,9285
• kernel = k1 + k2 + k3
68,75%: 0,8008

In the table above, for each cluster the first column provides a plot of the accounts, the second
column specifies the kernel function used to perform the forecasting process, and the third
column reports the measures of accuracy. Note that each kernel is the result of adding different
covariance functions, which can themselves be combinations of products or sums of other
covariance functions, in order to capture the particularities and properties of the accounts.

Results of the 100.000 accounts sampled in 40 clusters


As seen in the table below, the clustering process has improved all the previous results given
by the global approach. MAE has decreased by 17%, RMSE by slightly more than 5%, and
MASE has consolidated this positive trend since it has also been reduced. Note that MAE,
RMSE and MASE are cumulative; thus, even though the error is divided by the size of the
sample, it seems natural that the error measures for the 100.000 bank accounts clustering
would be greater than in the 25.000 sample, as occurs with MAE and RMSE. However,
MASE decreases as we consider more data, meaning that the naïve method performs worse.
The accuracy of the sign predictions including 0, expressed by SAFC0, reaches an impressive
96,96%, and SAFCcor shows a performance slightly greater than 83%, also improving on the
non-clustering approach.
Again the 95% confidence interval measure of accuracy is promising. We are not far from
the desired result, and the reason why we did not achieve it is that in a few clusters,
the ones with high variance and steep oscillations, the performance is less satisfactory and
affects the global accuracy negatively.
However, from the perspective of the bank this is not an important drawback, since once a
particular account is assigned to that type of cluster they have to take into consideration
that characteristic behaviour and be more adaptive and flexible. Similarly to the 20-cluster
approach, a table showing a plot of the accounts, the kernel function
used and the different measures of accuracy for each cluster has been produced, but given the space
it takes up it is attached in the GitHub repository.

Table 4.2: Measures of accuracy achieved examining the 100.000 accounts.

                                    MAE        RMSE       MASE      SAFC       SAFC0     SAFCcor   95% conf. inter.
N = 100.000 accounts (clustered)    134.2835   5146.1119  0.691138  64.8630%   96.965%   83.312%   93.4240%
N = 100.000 accounts (global)       159.9937   5389.1124  0.71326   64.826%    91.409%   81.058%   88.233%

To sum up, once the comparative analysis between the global approach and clustering has been
widely discussed, we can state without reservation that the latter works better and
provides more accurate forecasting. As expected, being able to adapt the kernel functions
to the behaviour of the accounts gives rise to a more flexible model. Obviously, there are
some aspects that have to be improved, specifically being able to work with the whole
sample of bank accounts while speeding up the computations. Thus, we could establish
with greater certitude the optimal number of clusters, redefine some covariance functions
and, ultimately, upgrade the forecasting model.
Finally, in the last chapter we analyze all the results from a more global perspective, outlining many of the aspects developed throughout the project, especially for the bank account
problem, and suggesting possible future directions of work on Gaussian processes.
Chapter 5

Conclusions

In this chapter we briefly wrap up some threads developed throughout the project, give a
business perspective on the bank account forecasting problem and, as said before, propose
exciting new problems that can be addressed using Gaussian Processes.
Over the course of the project we saw how Gaussian process regression is a natural extension of Bayesian linear regression to a more flexible class of models. Placing Gaussian
process priors over functions is a clearly Bayesian viewpoint. Thus, since the adoption of
Bayesian methods in the machine learning community is quite widespread, this allowed
me to acquire solid foundations in order to move forward in that exciting field.
From my personal perspective, one of the most challenging issues was working with covariance functions and the incorporation of prior knowledge through the choice and the
parameterization of the kernels. I do not like to view Gaussian processes as a black box
(where what exactly goes in the box is less important, as long as it gives good predictions), so I
have always tried to ask how and why the models work or fail. This meant testing different hypotheses, trying out different components and basically experimenting by trial and
error in order to gain real insight into the data.
In that sense, the first part of the work, the chapters up to and including chapter
3, where core material and theoretical concepts are given an in-depth treatment,
represented a substantial part of the time invested. I attempted to put all that
knowledge into practice with the prediction example developed at the end of chapter 3, and hence
to face the bank account forecasting problem with the highest expectations of success.
Once the problem of the bank accounts has been addressed and the results provided, we
can assert that Gaussian Processes have not performed as well as LSTM. However, our goal
was not to outperform LSTM, since the bank had already established that the latter works
better. The aim was rather to provide a forecasting model using Gaussian Processes
for Regression and to decide whether or not clustering the accounts gives rise to a better
performance. Both questions have been answered positively.
With the clustering approach we were able to determine an uncertainty region for the
prediction of each account and it turned out that 93,42% of the values fitted in that region.


Moreover, having defined different clusters, having access to individual measures of accuracy and knowing the behaviour patterns of the accounts belonging to each cluster, we can
adapt our predictions to a wide range of needs.
In that sense, given the task of predicting the 13th state of a new account, the procedure is
clear. Firstly, we would compute the DTW distance between the account and all of the centroids
and assign the account to the centroid that minimizes the DTW. Once the account is
placed in a cluster, we can begin the forecasting taking into consideration the features of
the cluster and adapting all of the decisions to those particular properties. Thus,
if the account was placed in a cluster that actually gives poor performance, we have to be
cautious and prudent with the applications of that forecast. By contrast, if the account
belongs to a cluster with good performance rates, we can be more confident.
To sum up, our model performs quite accurately, but when a company such as a bank is developing a risky project like this, every variable has to be perfectly studied and no mistakes
are allowed. Hence, it would be unacceptable to send a message to a client warning him
to take care of his expenses and then fail in the forecast. The model has to be trained, and the
more data we have the more precise the prediction will be, since the kernel function can fit
the data better or the cluster can be more precise. In fact, as mentioned at the end of chapter
4, one area for future work might be learning to deal with the 3.35 million bank accounts.
Finally, let me state some interesting developments in the Gaussian Process context.
In recent years one of the most important advances in the Gaussian process framework
has been the creation of an Automatic Statistician. This project, developed by a group of
the most prestigious experts in Gaussian processes, such as James Lloyd, David Duvenaud and
Zoubin Ghahramani from Cambridge University, automatically produces a 10-15 page
report describing patterns discovered in data. Moreover, it returns a statistical model
with state-of-the-art extrapolation performance evaluated over real time series data sets
from various domains. The system is based on reasoning over an open-ended language
of nonparametric models using Bayesian inference. In fact, the Automatic Statistician
project has won the Google Focused Research Award, which consists of a no-strings-attached donation to support research on this topic.
Another interesting area in which there has been an explosion of work over the last years
is the use of deep neural networks for modeling high-dimensional functions as a popular
alternative to Gaussian processes. In that sense, many researchers propose to study deep
neural networks as priors on functions. As a starting point, they relate neural networks
to Gaussian processes and examine a class of infinitely wide deep neural networks called
deep Gaussian processes, which turn out to be compositions of functions drawn from GP
priors. One of the main reasons why deep GPs are attractive from a model-analysis point
of view is that they remove some of the details of finite neural networks. Thus, this
area seems exciting and appealing for further study.
Bibliography

[1] Cunningham, J. P., Shenoy, K. V., and Sahani, M. (2008). Fast Gaussian process methods for point process intensity estimation. In International Conference on Machine Learning, pages 192-199.

[2] Duvenaud, D., Lloyd, J. R., Grosse, R. B., Tenenbaum, J. B., and Ghahramani, Z. Structure discovery in nonparametric regression through compositional kernel search. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[3] Ebden, M. Gaussian Processes for Regression: A Quick Introduction. University of Oxford, 2008.

[4] Wilson, A. G. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. Trinity College & University of Cambridge, 2014, p. 12-45.

[5] Duvenaud, D. K. Automatic Model Construction with Gaussian Processes. University of Cambridge, 2014, p. 8-30.

[6] MacKay, D. J. C. Introduction to Gaussian processes. NATO ASI Series F Computer and Systems Sciences, 1998.

[7] MacKay, D. J. C. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.

[8] Neal, R. M. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[9] Murphy, K. P. Machine Learning, a Probabilistic Perspective. The MIT Press, 2006, p. 496-536.

[10] Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.

[11] Sardá-Espinosa, A. Comparing Time-Series Clustering Algorithms in R Using the dtwclust Package. Cologne University of Applied Science, p. 4-22.

[12] Schaback, R. and Wendland, H. Kernel Techniques: From Machine Learning to Meshless Methods. Cambridge University Press, 2006.

