
MathematicS In Action

Predictive machines with uncertainty quantification


Philippe G. LeFloch
Jean-Marc Mercier


Laboratoire Jacques-Louis Lions, Centre National de la Recherche Scientifique, Sorbonne
Université, 4 Place Jussieu, 75252 Paris, France.
E-mail address: contact@philippelefloch.org

MPG-Partners, 136 Boulevard Haussmann, 75008 Paris, France.
E-mail address: jean-marc.mercier@mpg-partners.com.
Abstract. We outline a strategy proposed by the authors to design high-dimensional ex-
trapolation algorithms, also called learning or predictive machines, which are endowed with
numerically computable, uncertainty quantification estimates. We provide a computational
framework based on kernels, which applies as well to neural networks (also called deep learning
networks). This framework was primarily designed to target models based on partial differ-
ential equations. There exist numerous extrapolation strategies in the literature, with applications ranging from numerical simulations to statistics and machine learning. Among them, we advocate here the use of predictive machines based on kernels and endowed with error estimates, since they are efficient, versatile, and competitive in industrial applications. We highlight their benefits with illustrative examples consisting of fully
reproducible benchmarks and we compare our results with more traditional approaches. Our
presentation here is focused on tests relevant to machine learning, and we discuss the natural
connections between predictive machines and techniques of optimal transport. In particular,
we provide a risk management framework that solely relies on historical observations of time
series. The proposed strategy has led us, in collaboration with S. Miryusupov, to a Python
code (referred to as CodPy) which is now publicly available.

1. Introduction

For problems in large dimensions, we consider here extrapolation methods (referred to as learning or predictive machines) which allow us to make predictions supported by quantifiable numerical error estimates. Indeed, a major interest of the proposed "machines" is that they are endowed with uncertainty quantification estimates based on a particularly simple notion of "distance", making these machines competitive and versatile for industrial applications.
In machine learning, the first application of an error estimate is to provide a confidence
criterion on any prediction, allowing one to trigger an alert when a given tolerance is reached.
A numerical estimate allows one to fully understand the prediction of a learning machine, since
its performance can be fully explained in terms of training set, test set, and internal parameters
of the method. Such methods are then explainable to the user, hence auditable, and escape the "black box" effect, a major and common criticism of artificial intelligence methods.
Such explainability properties are required in order to pass the qualification tests for critical
applications in an industrial context.
A second benefit of a (numerical) error estimate is to provide a clear view on how efficient a
learning machine is, as it allows one to study its convergence rate. Via the notion of performance
profile, the convergence rate associated with a given machine allows one to estimate its algorith-
mic complexity, a figure that is directly linked to its electrical consumption and environmental
impact. Methods endowed with error estimates and optimization procedures lead to efficient

clustering methods, or learning machines with superior convergence rate, computing optimized
sequences of points (centroids or clusters). This strategy has analogies with quasi-Monte Carlo low-discrepancy sequences (LDS), such as the popular Sobol sequences [15]. However, optimized sequences exhibit a superior convergence rate (in comparison to LDS) for numerical integration, at the expense of a heavier computational load; we refer to [11], where such sequences are analyzed under the terminology 'sharp discrepancy sequences'. This approach is at the heart
of a numerical strategy developed by the authors in [10]–[9] in order to solve a wide range of
partial differential equations (PDEs) in high dimensions.
Finally, as these learning machines are endowed with a notion of distance, their study relates naturally to the theory of optimal transport. Given any such machine, we can consider the polar factorization of maps [3, 13] and the Monge-Kantorovitch problem which, in their discrete versions, can be expressed in terms of a linear sum assignment problem (LSAP) (as
first observed in [4]). This allows us to design some novel algorithms that combine elements
from machine learning and optimal transportation, and rely on the computation of transition
probabilities of martingale processes. This strategy [9] was first introduced for applications in
mathematical finance and provides an alternative algorithm to the standard Sinkhorn algorithm
[14].
This text is organized by following the lines of the three paragraphs above. In Section 2 we (informally) describe our strategy and discuss how to compute error estimates. We illustrate the interest of the method numerically with two examples¹. The first test is a benchmark of different methods applied to the MNIST dataset, illustrating how scores of learning machines can be explained with error estimates. The second test is a benchmark of different methods applied to the so-called Boston housing prices dataset, illustrating a reproducibility property that is of crucial importance in industrial applications.

¹All numerical examples are based on our code CodPy (see [12]), hence are fully reproducible.
In Section 3 we discuss distance-based clustering algorithms. We illustrate the interest of our approach with two benchmarks of unsupervised learning methods, namely the standard k-means algorithm and the design of sharp discrepancy sequences, for the problem of credit card fraud detection (a large, unbalanced, real dataset). To support our conclusions and highlight potential gains in convergence rates, we also provide a similar benchmark for the MNIST problem, which is considered in the supervised learning case in Section 2.4.
Finally, in Section 4 we discuss algorithms based on optimal transport. We present two nu-
merical examples of particular importance in financial applications: a benchmark of methods
computing transition probabilities, and an application to time series predictions. In mathe-
matical finance, these are the two pillars of a risk management framework based on historical
observations of time series.

2. Learning machines with error estimates

2.1. Extrapolation methods


There exist numerous methods for high-dimensional data extrapolation, and we indeed consider
several such methods in the present text. We introduce here a unified approach, as well as some
terminology used in machine learning.
A predictive (learning) machine m is a function taking, as input, a sequence of points X = (x^1, ..., x^{N_X}) of R^D, called the training set of variables, a continuous vector-valued function f(X) = (f(x^1), ..., f(x^{N_X})) ∈ R^{N_X×D_f}, called the training set of values, and a sequence of points Z = (z^1, ..., z^{N_Z}), called the test set; it is applied as follows:

    Z ↦ f_Z = P_m(X, Y, Z, f(X)) ≃ f(Z).    (2.1)
In principle, fZ predicts the ground truth values f (Z), extrapolated from the data X, f (X).
Here, Y is a set of internal parameters required by the method m. When a method m can also
compute consistent differential operators, as for instance the gradient,
    Z ↦ (∇f)_Z = ∇_Z P_m(X, Y, Z, f(X)) ∼ (∇f)(Z),    (2.2)
then we say that it is a differentiable learning machine, and it can then also be used for
numerical simulations of PDE models.
The accuracy of the method m can be measured through numerous performance indicators (or metrics) of the form ‖f(Z) − f_Z‖, as for instance the root mean squared error (RMSE) ‖f(Z) − f_Z‖_{ℓ²}. To analyze these indicators, a standard method is cross-validation, which relies on a statistical argument: one can compute

    ‖f(X_2) − P_m(X_1, Y, X_2, f(X_1))‖,

for one (or several) suitably chosen partitions of the training set X_1 ∪ X_2 = X, and one can expect that ‖f(Z) − f_Z‖ behaves similarly, if for instance Z and X are i.i.d. from the same distribution.
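To fix ideas, the following minimal sketch implements one possible instance of (2.1), namely kernel regression with a Gaussian kernel, together with the cross-validation comparison just described; the kernel, its scale, the regularization term, and the toy function are illustrative assumptions of ours, not the specific choices made in CodPy.

```python
import numpy as np

def gaussian_kernel(X, Z, scale=0.5):
    # Pairwise Gaussian kernel matrix (K(x^i, z^j))_{i,j}.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * scale ** 2))

def P_m(X, Z, fX, scale=0.5, reg=1e-8):
    # One possible predictive machine (2.1): f_Z = K(Z, X) K(X, X)^{-1} f(X),
    # with a small Tikhonov regularization added for numerical stability.
    Kxx = gaussian_kernel(X, X, scale) + reg * np.eye(len(X))
    Kzx = gaussian_kernel(Z, X, scale)
    return Kzx @ np.linalg.solve(Kxx, fX)

# Toy cross-validation in the spirit of the text: split X into X1, X2 and
# compare the error on X2 with the error on an independent test set Z.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Z = rng.uniform(-1, 1, size=(100, 2))
f = lambda x: np.sin(np.pi * x[:, :1]) * x[:, 1:2]
X1, X2 = X[:150], X[150:]
err_cv = np.sqrt(np.mean((f(X2) - P_m(X1, X2, f(X1))) ** 2))
err_test = np.sqrt(np.mean((f(Z) - P_m(X, Z, f(X))) ** 2))
print(err_cv, err_test)
```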

2.2. Discrepancy functionals


We are especially interested in error measures for learning machines that are based on a notion of distance between measures. This is the case for kernel-based methods from the theory of reproducing kernel Hilbert spaces (RKHS) (see [8, 2, 5]), as well as for neural network methods.
Specifically, consider a kernel, that is, a continuous and symmetric mapping (x, y) ∈ R^D × R^D ↦ K(x, y) ∈ R satisfying K(x, y) = K(y, x). Given any µ, ν in the set of probability measures M on R^D, we define the associated discrepancy, also referred to as the maximum mean discrepancy functional (MMD), by

    d_K(µ, ν) = ∫∫ K(x, y) dµ_x dµ_y + ∫∫ K(x, y) dν_x dν_y − 2 ∫∫ K(x, y) dν_x dµ_y.    (2.3)

Under suitable assumptions [16], this defines a metric on M; for instance, K is assumed to be a positive definite kernel, that is, the matrix K(X, X) = (K(x^i, x^j))_{i,j=1..N_X} is positive definite for all sequences of distinct points X. Observe that the discrete case δ_X = (1/N_X) Σ_i δ_{x^i} (where δ_x is the Dirac mass) was introduced in [6]. Interestingly, this definition leads to a numerically tractable formula:

    d_K(δ_X, δ_Z) = (1/N_X²) Σ_{i,j=1..N_X} K(x^i, x^j) + (1/N_Z²) Σ_{i,j=1..N_Z} K(z^i, z^j) − (2/(N_X N_Z)) Σ_{i=1..N_X} Σ_{j=1..N_Z} K(x^i, z^j),    (2.4)

also explored numerically in [11] in the case d_K(µ, δ_Y). Discrepancies are also available for neural
network methods², as these methods can be interpreted as linear kernel methods using a nonlinear mapping of the form (cf. [18])

    K(x, y) = ⟨S(x), S(y)⟩,    S(x) = W_L σ_L( ... σ_1(W_1 x + b_1) + b_2 ... ) + b_L,

with weight matrices W_i ∈ R^{D_i×D_{i−1}}, bias terms b_i ∈ R^{D_i}, and activation functions σ_i : R ↦ R applied dimension-wise, for i = 1..L, L being the depth of the network, and ⟨·,·⟩ denoting the standard scalar product. Considering the framework (2.1), deep learning methods define their set of internal parameters as Y = {W_i, b_i, i = 1..L}.

²Although we do not know of any framework in which these indicators are computed in practice.
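Returning to the discrete formula (2.4), the discrepancy between two point clouds is directly computable; in the sketch below the Gaussian kernel and its scale are again illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def discrepancy(X, Z, scale=0.5):
    # Discrete MMD d_K(delta_X, delta_Z) of (2.4) for a Gaussian kernel.
    k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * scale ** 2))
    return k(X, X).mean() + k(Z, Z).mean() - 2.0 * k(X, Z).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Z = rng.normal(size=(300, 3))               # same law as X: small discrepancy
W = rng.normal(loc=1.5, size=(300, 3))      # shifted law: larger discrepancy
print(discrepancy(X, Z), discrepancy(X, W))
```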

2.3. Quantitative error bounds


A positive definite kernel K defines a Hilbert space of continuous functions, denoted H_K, endowed with a scalar product ⟨·,·⟩_{H_K}, with the reproducing property ⟨ϕ(·), K(x, ·)⟩_{H_K} = ϕ(x) for any ϕ ∈ H_K (see [2]). Considering the discrepancy functional (2.3), the following estimate holds (see [16]):

    | ∫_{R^D} ϕ(x) dµ_x − ∫_{R^D} ϕ(y) dν_y | ≤ d_K(µ, ν) ‖ϕ‖_{H_K}.    (2.5)
This inequality is computationally tractable in the discrete case: in the right-hand side, the discrepancy functional can be estimated using (2.4), while a lower bound of the norm of a function ϕ ∈ H_K is given by the bilinear form

    ‖ϕ‖²_{H_K} ≥ ‖ϕ‖²_{H_{K_X}} = ϕ(X)^T K(X, X)^{−1} ϕ(X),    (2.6)

where H_{K_X} ⊂ H_K is a finite-dimensional functional space of dimension N_X, equipped with the scalar product ⟨·,·⟩_{H_K}. This lower bound captures all the information available in the training data set. The estimate (2.5) has a quite natural interpretation: the integration error is split into two parts, the first one being the distance between the training set and the test set variables, and the second one being a norm measured on the training set of values.
Observe also that the estimates (2.5)-(2.6) are very general and relate to many other performance indicators. For instance, one can deduce the pointwise estimate

    | ϕ(x) − ∫_{R^D} ϕ(y) dν_y | ≤ d_K(δ_x, ν) ‖ϕ‖_{H_K},

or the discrete RMSE error estimator

    ‖ϕ(Z) − ϕ_Z‖_{ℓ²} ≤ d_K(δ_Z, δ_X) ‖ϕ‖_{H_K}.    (2.7)
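Every quantity appearing on the right-hand side of (2.6)-(2.7) is computable from the data, which is what makes these estimates usable as confidence indicators. The sketch below evaluates both sides of (2.7) for a toy function; the kernel, its scale, and the test function are our own illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist

scale = 0.3
k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * scale ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (100, 2))             # training set of variables
Z = rng.uniform(-1, 1, (80, 2))              # test set
phi = lambda x: np.cos(np.pi * x[:, 0]) * x[:, 1]

# Lower bound (2.6) of the RKHS norm, computed from training data only.
Kxx = k(X, X) + 1e-8 * np.eye(len(X))
norm_X = np.sqrt(phi(X) @ np.linalg.solve(Kxx, phi(X)))

# Discrepancy d_K(delta_Z, delta_X) from (2.4).
dK = k(Z, Z).mean() + k(X, X).mean() - 2.0 * k(Z, X).mean()

# Kernel prediction phi_Z and the two sides of the RMSE estimate (2.7).
phi_Z = k(Z, X) @ np.linalg.solve(Kxx, phi(X))
print(np.linalg.norm(phi(Z) - phi_Z), dK * norm_X)
```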

2.4. Error estimates from the MNIST dataset


The MNIST dataset is composed of a training set of variables containing 60,000 images of handwritten digits, and we considered the set available at LeCun's MNIST home page³. This is a classical test for supervised learning machines. Each image is a vector of dimension 784 (a 28 × 28 grayscale image flattened in row order). Each image comes with a label, taking values in the 10 digits 0-9, defining the training set of values. The test set is composed of 10,000 images with their labels. The test consists in predicting these labels.
Considering our notion of learning machines in (2.1), the problem is formalized as follows. Given the training set of variables represented by a matrix X ∈ R^{N_X×D}, D = 784, the training set of values (labels) f(X) ∈ R^{N_X×D_f}, D_f = 10, and the test set Z ∈ R^{N_Z×D}, N_Z = 10000, predict the label function f(Z) ∈ R^{N_Z×D_f}.

³http://yann.lecun.com/exdb/mnist/

Figure 1. Performance profiles of several methods for the MNIST dataset.

Figure 2. Discrepancy profile d_K(δ_X, δ_Z) for the MNIST dataset.
Performance is measured here with a common indicator used to compare labelled supervised learning methods, defined as the following score:

    (1/N_Z) #{ n = 1..N_Z : f_{z^n} = f(z^n) },    (2.8)

with N_Z = 10000. This produces an indicator between 0 and 1, higher being better. We
benchmark here the performance profile, considering scores as a function of the training set size
NX . Our purpose in this presentation is not to discuss each method m under consideration;
all of the methods we consider have a public documentation that the reader can consult. All
these methods are taken with standard internal parameters, and the performance profiles are
reported in Figure 1.
One of these methods, referred to as CodPy, is a kernel-based method and includes a computation of the discrepancy functional (2.4); we plot the corresponding results in Figure 2. One can see, and we checked this numerically, that the indicator 1 − d_K(δ_X, δ_Z) is a strict lower bound of the score for this kernel method, in accordance with the error measure (2.5) (observing that the score function (2.8) is normalized). The discrepancy is thus a reliable indicator to explain the performance profile of this method.
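As the text does not list the benchmarked methods, the following sketch shows how a performance profile in the spirit of Figure 1 can be reproduced with an off-the-shelf scikit-learn classifier on subsets of MNIST of increasing size N_X; the classifier and the chosen sizes are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

# MNIST: 70,000 flattened 28x28 images; the first 60,000 form the training
# set of variables, the last 10,000 the test set. (Downloads on first call.)
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X_all, y_all = mnist.data / 255.0, mnist.target
X_train, y_train = X_all[:60000], y_all[:60000]
Z, fZ = X_all[60000:], y_all[60000:]

# Performance profile: score (2.8) as a function of the training set size N_X.
for NX in (256, 1024, 4096):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train[:NX], y_train[:NX])
    print(NX, (clf.predict(Z) == fZ).mean())
```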

2.5. Reproducibility with the Boston Housing prices dataset


We now illustrate with a numerical example the reproducibility property of the training set, an
important property for industrial applications.


Figure 3. Convergence profile for the housing dataset.

This property is illustrated now with the so-called Boston housing prices dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. There are 506 houses with 13 attributes (variables or features) and a target column (the values, namely the housing prices). Further details can be found in [7].
The benchmark is formalized as follows. Given the test set Z ∈ R^{N_Z×D}, N_Z = 506, D = 13, consider a subset X ⊂ Z, X ∈ R^{N_X×D}, as the training set of variables, together with the training set of values (labels) f(X) ∈ R^{N_X×D_f}, D_f = 1, and predict the housing prices f(Z) ∈ R^{N_Z×D_f}.
We plot the results of a benchmark in Figure 3, considering a normalized version of the RMSE error

    ‖f(Z) − P_m(X, Y, Z, f(X))‖_{ℓ²} / ( ‖P_m(X, Y, Z, f(X))‖_{ℓ²} + ‖f(Z)‖_{ℓ²} ),    (2.9)

so here lower is better. Considering Figure 3, there is one method reaching zero at N_X = 506, this very last point being the one where the training set and the test set coincide, X = Z. We say that a method m satisfies the reproducibility property if its error vanishes when the test set equals the training set; this notion is motivated by the error estimate (2.5), since d_K(δ_X, δ_X) = 0.
Beyond accuracy, reproducibility is important for explainability purposes, as well as for data quality. Indeed, it implies that the predictions f_Z can be expressed as a linear combination of the training set. This is a useful facet of explainability, allowing one to question predictions directly using the training set and to challenge the input data for potential detection of outliers. Another important aspect of the reproducibility property is to ensure that no artificial noise is added to the training set, a welcome property when computing differential operators such as (2.2) for numerical simulations.
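The reproducibility property is easy to verify numerically for exact kernel interpolation: predicting on the training set itself returns the training values up to round-off, so the normalized RMSE (2.9) vanishes. The kernel and toy data below are our own illustrative choices and do not correspond to a particular method of the benchmark.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
fX = np.sin(X @ np.array([1.0, 2.0, 3.0]))

# Exact kernel interpolation: with Z = X, the prediction K(X,X) K(X,X)^{-1} f(X)
# reproduces the training values, so the normalized RMSE (2.9) vanishes.
K = np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * 0.2 ** 2))
fX_pred = K @ np.linalg.solve(K, fX)
rmse = np.linalg.norm(fX - fX_pred) / (np.linalg.norm(fX) + np.linalg.norm(fX_pred))
print(rmse)   # at round-off level
```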

3. Error-minimization based algorithms

3.1. Clustering methods


In this section, we discuss an important class of popular unsupervised machine learning methods, rooted in quantization and called clustering. These methods aim to partition N_X variables X into N_Y < N_X clusters Y, in which each observation x ∈ X belongs to a cluster defined by a point y ∈ Y called a centroid. This partition is computed by considering a distance between sets of points d(X, Y), equivalently a distance between the discrete measures δ_X, δ_Y. With this notation, the minimization problem can be expressed as

    Ȳ = arg inf_{Y ∈ R^{N_Y×D}} d(X, Y).    (3.1)

These distances between sets of points are usually based on simpler distances, as for instance the Euclidean, Manhattan, or log-entropy distances, depending upon the problem under consideration. Consider any point x ∈ R^D. Then x is naturally attached to the centroid y^{σ(x,Y)}, where the index function σ(x, Y) is defined as

    σ(x, Y) = arg inf_{k=1..N_Y} d(x, y^k).

It also defines the (Voronoi) cells Ω_i = {x ∈ R^D : d(x, y^i) = d(x, y^{σ(x,Y)})}, quantizing (i.e. partitioning) the space R^D. A popular choice for the distance in (3.1), called the inertia, leads to the k-means algorithm, usually considered with the Euclidean distance d(x, y) = |x − y|:

    d(X, Y) = Σ_{n=1..N_X} d(x^n, y^{σ(x^n,Y)}).    (3.2)

The link between supervised learning (extrapolation) and unsupervised learning (such as clustering) is straightforward, as one can quantize any set of observations f(X) with f(Y), defining a learning machine as follows, using the notation (2.1):

    f_Z = P_d(X, Y, Z, f(X)) = f(y^{σ(Z,Y)}).    (3.3)

In other words, for these learning machines the set of internal parameters Y in (2.1) consists of the centroids.
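A minimal sketch of the quantized learning machine (3.3) reads as follows, with k-means centroids and with f(Y) obtained by a majority vote over each cell; the majority-vote quantization and the toy data are assumptions made for the illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
fX = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy training values

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
Y = km.cluster_centers_                                # centroids minimizing the inertia (3.2)
# Quantize f(X) into f(Y): here, a majority vote over each Voronoi cell.
fY = np.array([np.round(fX[km.labels_ == i].mean()) for i in range(len(Y))])

def predict(Z):
    # Learning machine (3.3): f_Z = f(y^{sigma(Z,Y)}).
    sigma = cdist(Z, Y).argmin(axis=1)
    return fY[sigma]

Z = rng.normal(size=(500, 2))
print((predict(Z) == (Z[:, 0] + Z[:, 1] > 0)).mean())  # score (2.8)
```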

3.2. Sharp discrepancy sequences


Sharp discrepancy sequences (SDS), introduced in [9] for the semi-discrete case, can be interpreted in the discrete case as clustering methods defining the centroids in (3.1) as

    Ȳ = arg inf_{Y ∈ R^{N_Y×D}} d_K(δ_X, δ_Y),    (3.4)

where the considered distance is the discrepancy (2.3). They are designed to minimize the numerical integration error (2.5).
Minimization problems such as (3.1)-(3.4) are expected to be quite challenging to solve. For instance, if a sequence Ȳ is a solution to (3.1), then any permutation of Ȳ is also expected to provide a solution, and the functional d(X, Y) has numerous global minima. Moreover, the functional Y ↦ d(X, Y) is not expected to be convex. We plot in Figure 4 two examples of very simple discrepancy functionals, starting from five one-dimensional points X ∈ [−1, 1]^5 (orange dots), and plotting the function y ↦ d_K(X, y) for two different kernels, a Gaussian and a Matérn one (see [2] for a definition of these kernels). We also plot a linear interpolation between the five points, so that the reader can appreciate the non-convexity of the discrepancy functional, even in this simple case. A consequence is that standard gradient descent algorithms usually fail to find a global minimum, and more refined algorithms are needed, such as those described in [12] or [17]. In the next paragraph, we illustrate an example of the performance gain obtained with this strategy.
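As a concrete illustration of the objective (3.4), the sketch below performs a plain local minimization of the discrepancy over the centroid positions. Consistently with the discussion above, it only returns a local minimizer depending on the initial guess, and it is not the refined algorithm of [12] or [17]; the kernel and its scale are our own choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def mmd(X, Y, scale=0.3):
    # Discrepancy d_K(delta_X, delta_Y) of (2.4) with a Gaussian kernel.
    k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * scale ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # data to be summarized
NY, D = 20, 2
Y0 = X[rng.choice(len(X), NY, replace=False)]      # initial centroids

# Plain local descent on (3.4); the functional is non-convex, so this only
# returns a local minimizer that depends on the initial guess Y0.
res = minimize(lambda y: mmd(X, y.reshape(NY, D)), Y0.ravel(),
               method="L-BFGS-B", options={"maxiter": 100})
Y = res.x.reshape(NY, D)
print(mmd(X, Y0), mmd(X, Y))                       # discrepancy before / after
```

Restarting from several initial guesses and keeping the best local minimizer is a simple way to mitigate the non-convexity in this sketch.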


Figure 4. Examples of discrepancy functionals: left, Gaussian kernel; right, Matérn kernel.

3.3. Clustering benchmarks for the CCFD dataset


We now illustrate a benchmark of k-means centroid computations versus SDS computations in terms of discrepancy errors, inertia, accuracy scores, and execution time, for the credit card fraud dataset (CCFD).
This dataset contains transactions made by credit cards in September 2013 by European cardholders, covering transactions that occurred over two days, with 492 frauds out of 284807 transactions, each transaction characterized by 30 numerical values resulting from a PCA analysis. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions⁴.
The benchmark is formalized as follows. Given the training set of variables X ∈ R^{N_X×D}, N_X = 284807, D = 30, compute the parameter set (clusters) Y ∈ R^{N_Y×D}, and predict the training set of values f(X) ∈ {0, 1}^{N_X} (1 for fraud, 0 otherwise) using the machine (3.3). Scores (2.8), inertia (3.2), discrepancy (2.3), and computational times are reported in Figure 5 for different numbers N_Y of clusters, which we now comment on.
Scores obtained with SDS seem stable. The discrepancy indicator is better for SDS, which is expected, as SDS are algorithms specialized in minimizing the discrepancy. However, notice that both algorithms provide comparable results regarding inertia, which is somewhat surprising, as k-means algorithms are specialized in minimizing this quantity. Indeed, on the manifold characterized by sequences of points minimizing the inertia, those minimizing the discrepancy provide a way to refine the centroid locations.

3.4. Unsupervised learning benchmarks - MNIST


We perform here the same benchmark methodology for unsupervised learning machines (3.3) on the MNIST problem, which was illustrated in the supervised case in Section 2.4.
The setting is slightly different from the one in Section 3.3: we considered a training set of fixed size N_X = 4096 hand-written images, and plot the results using the score (2.8) as a function of the number of clusters N_Y. Results are presented in Figure 6, confirming the conclusions of Section 3.3, while also showing that SDS provide a superior convergence rate to k-means for this test. We end with Figure 7, which allows one to appreciate some differences between the clusters y ∈ Y computed by k-means and by SDS.

⁴Further details are available at the Kaggle site https://www.kaggle.com/mlg-ulb/creditcardfraud.

Figure 5. Benchmark profiles for the credit card dataset: scores, discrepancy errors, inertia, and execution time as functions of the number of clusters N_Y, for CodPy (SDS) and k-means.

Figure 6. Benchmark profiles for the unsupervised learning MNIST test.

Figure 7. Examples of images rendered from computed MNIST clusters: left, SDS; right, k-means.

4. Optimal transport algorithms

4.1. Discrepancy-based polar factorization of maps


Let us briefly introduce the polar factorization of maps. These tools are credited to [3, 13] for the continuous case, and we focus on the discrete case discussed in [4]. We then describe how we connect these tools with error-based learning machines.


Consider a probability measure ν ∈ M on R^D and a mapping S : R^D ↦ R^D. Denote by µ ∈ M the measure defined through the change of variables

    ∫_{R^D} ϕ(x) dµ = ∫_{R^D} (ϕ ∘ S)(x) dν,    ϕ ∈ C(R^D).

We say that S transports ν into µ, and S_#ν = µ is the push-forward. Consider a cost function, that is, a positive, scalar-valued, symmetric C¹ function c(·, ·). The Monge problem, given ν, µ, consists in finding a mapping S minimizing the transportation cost from ν to µ, that is,

    S̄ = arg inf_{S_#ν=µ} ∫_{R^D} c(x, S(x)) dν.    (4.1)

From a discrete point of view, consider two discrete measures µ = δ_X, ν = δ_Z, with X = (x^1, ..., x^N), Z = (z^1, ..., z^N) two sequences of distinct points having equal lengths. Then the Monge problem (4.1) amounts to determining a permutation σ : [1..N] ↦ [1..N] satisfying

    S̄(x^n) = z^{σ(n)}    with    σ = arg inf_{σ∈Σ} Σ_{n=1..N} c(x^n, z^{σ(n)}),    (4.2)

Σ being the set of all permutations, and we denote for short in this document S̄(X) = Z^σ.
Consider the matrix C(X, Z) = (c(x^i, z^j))_{i,j=1..N}; then the following problem is called the discrete Kantorovitch problem:

    γ̄ = arg inf_{γ∈Γ} C(X, Z) · γ,    (4.3)

where A · B denotes the Frobenius matrix scalar product and Γ is the set of all bi-stochastic matrices γ ∈ R^{N×N}, i.e. satisfying Σ_{n=1..N} γ_{m,n} = Σ_{n=1..N} γ_{n,m} = 1 and γ_{n,m} ≥ 0 for all m = 1, ..., N. The minimization problem (4.3) admits a dual expression, called the dual Kantorovitch problem:

    (ϕ, ψ) = arg sup_{ϕ,ψ} Σ_{n=1..N} ϕ(x^n) − ψ(z^n),    subject to ϕ(x^n) − ψ(z^m) ≤ c(x^n, z^m),    (4.4)

where ϕ : X ↦ R, ψ : Z ↦ R are discrete functions. As stated in [4], the three discrete problems (4.3), (4.4), and (4.2) are equivalent. We observe that the discrete Monge problem (4.2) is also known as the linear sum assignment problem (LSAP), solved in the early 1950s by an algorithm due to H.W. Kuhn, known as the Hungarian method⁵.
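In practice, the discrete Monge problem (4.2) can be handed to a standard LSAP routine; for instance, SciPy ships a solver of Hungarian type. The quadratic cost below is an illustrative choice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.normal(size=(N, D))
Z = rng.uniform(-2, 2, size=(N, D))

# Discrete Monge problem (4.2) with the quadratic cost c(x, z) = |x - z|^2,
# solved as a linear sum assignment problem.
C = cdist(X, Z, "sqeuclidean")
rows, sigma = linear_sum_assignment(C)
print(C[rows, sigma].sum(), C.trace())   # optimal cost vs. identity assignment
```

Since (4.2), (4.3), and (4.4) are equivalent, the optimal assignment also yields an optimal (permutation) bi-stochastic matrix for the discrete Kantorovitch problem.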
For the continuous case, under suitable conditions on ν, µ (having compact, connected, smooth support), any transport map S_#ν = µ can be polar factorized as

    S(x) = S̄ ∘ T(x),    T_#ν = ν,    (4.5)

where S̄ is the unique solution to the Monge problem (4.1), which has the property of being the gradient of a c-convex potential, S̄(x) = exp_x(−∇h(x)), exp_x being the standard notation for the exponential map in Riemannian geometry. A scalar function h is said to be c-convex if h^{cc} = h, where h^c(z) = inf_x c(x, z) − h(x) is called the infimal c-convolution. Standard convexity coincides with c-convexity for convex cost functions such as the Euclidean one, in which case the following polar factorization holds: S(x) = (∇h) ∘ T(x), h convex.
Consider now a learning machine (2.1), for which the discrepancy (2.3) is defined, and consider the cost function c(x, z) = d_K(δ_x, δ_z). Consider as above two discrete measures µ = δ_X, ν = δ_Z, defining the map S(x^n) = z^n. In this setting, the measure-preserving map T appearing in the right-hand side of the polar factorization (4.5) amounts to finding the permutation

    σ = arg inf_{σ∈Σ} Σ_{n=1..N} d_K(δ_{x^n}, δ_{z^{σ(n)}}).    (4.6)

⁵This algorithm is often credited to Jacobi in 1890.

Then, considering a differential learning machine (2.2), a discrete polar factorization consists
in solving the following equation for the unknown potential h:

    Z^σ = exp_X(−∇_X P_m(X, Y, X, h(X))).    (4.7)

Such algorithms can be implemented for any differential, error-based learning machine.

4.2. SDE-based conditional expectation algorithm


For a better understanding of the next section, we summarize the numerical strategy developed in [9], which can be implemented with any differential learning machine. Consider w.l.o.g. any martingale process t ↦ X(t) ∈ R^D, driven by the following stochastic differential equation (SDE):

    dX_t = σ(t, X_t) dB_t,    t ≥ 0,    X(0) ∈ R^D,    (4.8)

B_t being the standard Brownian motion and σ(t, x) ∈ R^{D×D} any smooth matrix-valued field. Consider the density of the process X(t), denoted as the probability measure t ↦ µ(t, ·) ∈ M. Then µ satisfies the Fokker-Planck equation

    ∂_t µ = (1/2) ∇² · (Aµ),    A = σσ^T,    µ(0, ·) = δ_x,    (4.9)
∇² = ∂_i ∂_j being the Hessian. Consider any time T > 0 and any integrable vector-valued function ψ(·) ∈ L¹(µ(T, ·)). The Feynman-Kac theorem states that the conditional expectation E^{X(T)}(ψ(·)|X(t) = x) =: ϕ(t, x) satisfies a Kolmogorov equation, which is the dual of the Fokker-Planck equation (4.9), to be solved backward in time:

    ∂_t ϕ − A · ∇²ϕ = 0,    t ∈ [0, T],    ϕ(T, x) = ψ(x).    (4.10)
This continuous setting translates into a Markov one in the discrete case as follows. The measure µ is approximated by δ_{X(t)}, where the sequence t ↦ X(t) of length N_X is computed by the following discrete Fokker-Planck equation, modeling (4.9):

    ∂_t δ_{X(t)} = (1/2) ∇²_{X(t)} · (A δ_{X(t)}),    A = σσ^T,    µ(0, ·) = δ_{X(0)},    (4.11)

where ∇²_{X(t)} is a consistent discrete Hessian operator, computable by any differential learning machine through (2.2) for instance. Once the trajectories t ↦ X(t) are computed, the discrete Kolmogorov equation is solved backward in time as

    ∂_t ϕ = A · ∇²_{X(t)} ϕ,    ϕ(T, X(T)) = ψ(X(T)).    (4.12)
Specifically, solutions to (4.12) can be written as ϕ(t, X(t)) = Π_{X(t)} ψ(X(T)), where t ↦ Π_{X(t)}, the generator of the equation (4.12), is a stochastic matrix, i.e. a matrix satisfying Σ_{n=1..N_X} Π_{X(t)}(m, n) = 1 for all m = 1, ..., N_X, solution to

    ∂_t Π = A · ∇²_{X(t)} Π,    Π_{X(T)} = I_{N_X},    (4.13)

I_{N_X} being the identity matrix. The solution t ↦ Π_{X(t)} defines a path of stochastic matrices, that is, of transition probability matrices, and Π_{X(t)}(m, n) represents the probability that the discrete Markov chain jumps from the state x^m(t) to the state x^n(T). The particular case ϕ(t, X(t)) = X(t) leads to the relation X(t) = Π_{X(t)} X(T).
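For comparison purposes only, the following sketch is a plain Monte Carlo counterpart of this construction: it simulates the SDE (4.8) with an Euler-Maruyama scheme and estimates the conditional expectation by a Nadaraya-Watson weighting of the simulated states, the row-stochastic weight matrix playing a role loosely analogous to Π_{X(t)}. This is not the discrete Fokker-Planck/Kolmogorov scheme (4.11)-(4.13) of the text, and the volatility field, payoff, and bandwidth are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
D, N, dt = 2, 2000, 0.01
t_mid, T = 0.5, 1.0
sigma = 0.2 * np.eye(D)                        # assumed constant case of sigma(t, x)

X = np.ones((N, D))                            # X(0) = (1, ..., 1)
X_mid = None
n_steps, n_mid = round(T / dt), round(t_mid / dt)
for k in range(n_steps):
    X = X + rng.normal(scale=np.sqrt(dt), size=(N, D)) @ sigma.T   # dX = sigma dB
    if k + 1 == n_mid:
        X_mid = X.copy()                       # samples of X(t_mid)
psi_T = np.maximum(X.sum(axis=1) - 2.0, 0.0)   # psi(X(T)) for a toy payoff

# Nadaraya-Watson estimate of phi(t_mid, x) = E(psi(X(T)) | X(t_mid) = x),
# evaluated at the simulated states; Pi is a row-stochastic weight matrix.
Pi = np.exp(-cdist(X_mid, X_mid, "sqeuclidean") / (2.0 * 0.05 ** 2))
Pi = Pi / Pi.sum(axis=1, keepdims=True)
phi_mid = Pi @ psi_T
print(phi_mid[:5])
```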

4.3. Historical-based conditional expectation algorithms


In this section, we consider a slightly different setting, in which the underlying stochastic process is not given by the SDE (4.8), but observed from historical values, input as N_X i.i.d. samples of a martingale stochastic process t ↦ X(t). The main difference with the previous section is thus that the matrix-valued field σ(t, X_t) is unknown. Such algorithms have a lot of applications, as they provide an analysis based on observed historical data, a quite reasonable point of view for risk management. For industrial purposes, these problems are historically tackled with parametric models: one supposes that the stochastic process follows a given SDE (as in ARCH-GARCH models), whose parameters are calibrated to fit the historical values. Machine learning based algorithms depart from the parametric approach because they are somewhat agnostic, i.e. no hypothesis is made on the historical process, a motivating property.
From a discrete point of view, we would like somehow to calibrate σ(t, X(t)) such that t ↦ X(t) is a solution to (4.11). Once σ is determined, we compute the transition probability matrix for risk analysis, using for instance the method described in (4.13) or any alternative method. The whole process can be described by a function, which we denote by Π and define by

    f_{Z|X} = Π(Z, X, f(Z)),    (4.14)
where the inputs are as follows:

• X ∈ R^{N_X×D} is an i.i.d. sample of X(t_1), where t_1 is a given time.

• Z ∈ R^{N_Z×D} is another i.i.d. sample of X(t_2), at any time t_2 > t_1.

• f(Z) ∈ R^{N_Z×D_f} is any discrete vector-valued function.

The output is a matrix f_{Z|X}, representing the conditional expectation

    f_{Z|X} ∼ E^{X(t_2)}(f(·)|X(t_1)) ∈ R^{N_X×D_f} =: f(Z|X).    (4.15)

For instance, the previous section proposes a method to compute (4.14) as Π(Z, X)f(Z), where Π(Z, X) is a stochastic matrix solving (4.13). This method, which also uses the polar factorization (4.7), is benchmarked against alternative ones in Section 4.5.

4.4. Time series predictions


Time series prediction is a quite active field of applied research, and machine learning methods such as recurrent neural networks, long short-term memory networks, neural Turing machines, etc. are very popular neural-network-based time series predictors. This section discusses two different approaches to time series prediction, one being quite similar to the methods just mentioned, the other being a new time series prediction algorithm based on the polar factorization.
The notation in this section is as follows:

    X ∈ R^{N_X×D×T_X}    (4.16)

is a three-dimensional tensor, representing N_X i.i.d. samples of a D-dimensional process X(t) ∈ R^D, sampled on a time grid t_1 < ... < t_{T_X} of size T_X.

Figure 8. Recurrent kernels: generated (yellow) BTC-USD (left) and HR (right), versus historical (blue).

4.4.1. Recurrent methods for time series predictions


Let us describe recurrent methods that can be implemented for any predictive machine (2.1), and discuss an example of prediction.
Consider some historical observations X as in (4.16), and two integers H and P satisfying H + P ≤ T_X. H is called the historical depth, P the prediction depth. This setting defines a sliding window of size H + P over the data X, used to define the training set as follows:

    X' = X^{[·,·,i:i+H]} ∈ R^{Ñ_X×D×H},    f(X') = X^{[·,·,i+H:i+H+P]} ∈ R^{Ñ_X×D×P},

for any i = 1, ..., Ñ_X, with Ñ_X = (T_X − H − P) N_X. We can iterate the procedure, producing at each step P new predicted values, using a predictive machine (2.1) recursively as follows:

    X^{k+1} = [X^k, f(X^k)],    f(X^{k+1}) = P_m(X^k, Y, X^{k+1}, f(X^k)),    (4.17)

[X^k, f(X^k)] being the concatenation of these two tensors in the last variable. Such a construction allows one to produce predicted values of the temporal series at any future time.
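The recurrent construction (4.17) can be sketched with any regressor standing in for the predictive machine P_m; below we use scikit-learn's kernel ridge regression on a synthetic one-dimensional series (N_X = 1, D = 1). The values of H and P, the kernel parameters, and the toy series are illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)

H, P = 30, 5                      # historical depth and prediction depth
X = np.stack([series[i:i + H] for i in range(len(series) - H - P)])
fX = np.stack([series[i + H:i + H + P] for i in range(len(series) - H - P)])

machine = KernelRidge(kernel="rbf", gamma=0.1, alpha=1e-3).fit(X, fX)

# Recursive forecast: each step feeds the P newly predicted values back
# into the sliding window, as in (4.17).
window = series[-H:].copy()
forecast = []
for _ in range(20):
    new = machine.predict(window[None, :])[0]
    forecast.extend(new)
    window = np.concatenate([window, new])[-H:]
print(np.round(forecast[:10], 3))
```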
The recurrent method (4.17) allows one to draw one trajectory, which can be considered as an i.i.d. realization of the temporal series, based on the knowledge of its history. Figure 8 shows a toy example of historical temporal series forecasting with two components, the Bitcoin price and the hash-rate values, for which we considered T_X covering daily observations from 01/01/2015 to 23/11/2020, while H and P are set to cover six months of data. In Figure 8, we have chosen N_X = 1, D = 2, T_X = 1460 in (4.16). Starting from this setting, we predict the temporal series up to 31/12/2021 and compare it with the historically observed one, using a kernel implementation of the scheme (4.17).
This method has a lot of forecasting applications, useful for professional purposes. However, in the context of temporal series forecasting, such a method faces a number of challenges. First, we are left with two extra parameters, H and P. Secondly, it is not clear how to generate other realizations of the studied temporal series; as a consequence, it is not clear either how to generate a pertinent mean estimator using this construction. Finally, we do not have any argument ensuring the stability of the recurrent scheme (4.17). Thus we provide an alternative method in the next paragraph.

Figure 9. Optimal transport: mean estimator (yellow) and random path (shaded) for BTC-USD (left) and HR (right), versus historical (blue).

4.4.2. Optimal transport methods for time series predictions


In this section, we illustrate a time series prediction algorithm based on the polar factorization, which we now describe briefly. Consider a temporal series X ∈ R^{N_X×D×T_X}, including a time-stamp in the first dimension, X(·, 1, k) = t_k, k = 1, ..., T_X. Consider any uniform distribution U_X ∈ [0, 1]^{N_X×D×T_X}, except for the time-stamp values U_X(·, 1, k) = t_k. Consider the polar factorization (4.7) of the map S(U_X) = X, defined by solving the following equation for the unknown scalar function h:

    X^σ = exp_X(−∇_X P_m(U_X, Y, U_X, h(U_X))).    (4.18)

See (4.6) for the definition of the permutation σ. Suppose that we want to sample N_Z new trajectories Z ∈ R^{N_Z×D×T_Z} on a time grid t_1 < ... < t_{T_Z}. Then we fill out a uniform distribution U_Z ∈ [0, 1]^{N_Z×D×T_Z}, adding the time-stamp values U_Z(·, 1, k) = t_k for k = 1, ..., T_Z, and we can use the polar factorization algorithm as follows:

    Z = exp_X(−∇_X P_m(U_X, Y, U_Z, h(U_X))).    (4.19)
This method enjoys some remarkable properties. First, it is numerically very efficient, allowing one to sample large numbers of trajectories with few computational resources. Secondly, we also have a quite clear interpretation of a mean estimator, obtained by considering U_Z(·, d, ·) = 1/2 for d = 2, ..., D. A numerical illustration of this method is provided in Figure 9 for the Bitcoin/hash-rate example discussed above, where the mean estimator is plotted. To complete this section, we observe that this method also allows one to use sharp discrepancy sequences of the uniform distribution to fill out U_Z, providing sampling methods with a higher convergence rate, as illustrated in the next section.

4.5. The Bachelier problem


4.5.1. Description of the problem
This section provides a benchmark of the methods (4.14) approximating the conditional ex-
pectation (4.15) for the Bachelier problem, which we describe now. Consider a martingale
process t 7→ X(t) ∈ RD , given by the Brownian motion dX = σdWt , where the matrix
σ ∈ RD×D is randomly generated. The initial condition is X(0) = (1, · · · , 1), w.l.o.g. Con-
sider two times 1 = t_1 < t_2 = 2, t_2 being the maturity of an option, whose payoff is the function f(x) = max(b(x) − K, 0), where K = 1.1 and b(x) = x · a with random weights a ∈ R^D. It is straightforward to verify that b(x) follows a Brownian motion db = θ dW_t. To get a fixed value of θ (set to 0.2 in our tests), we normalize the diffusion matrix σ above.

Figure 10. Bachelier problem. Left: training set b(Z), f(X); right: test set b(X), f(Z|X).
In this setting, the conditional expectation (4.15) can be determined by using a closed formula, which provides us with the reference value

    f(x) = θ √(t_2 − t_1) pdf(d) + (b(x) − K) cdf(d),    d(x, K) = (b(x) − K) / (θ √(t_2 − t_1)),    (4.20)

pdf (resp. cdf) denoting the probability density function (resp. cumulative distribution function) of the standard normal law.
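A direct transcription of the reference value (4.20), using SciPy's normal distribution, reads as follows; the default arguments correspond to the setting of the text (K = 1.1, θ = 0.2, t_1 = 1, t_2 = 2).

```python
import numpy as np
from scipy.stats import norm

def reference_value(b_x, K=1.1, theta=0.2, t1=1.0, t2=2.0):
    # Closed formula (4.20) for the conditional expectation of the payoff
    # max(b(X(t2)) - K, 0) given X(t1) = x, where b(X(t)) has volatility theta.
    s = theta * np.sqrt(t2 - t1)
    d = (b_x - K) / s
    return s * norm.pdf(d) + (b_x - K) * norm.cdf(d)

print(reference_value(np.array([0.9, 1.0, 1.1, 1.2, 1.3])))
```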

4.5.2. Methodology and input/output data


We test different numerical methods implementing (4.14), with the following inputs:

• X ∈ R^{N_X×D} is given by i.i.d. samples of the Brownian motion X(t_1) at time t_1 = 1. The reference values are f(Z|X) ∈ R^{N_X×1}, computed using (4.20).

• Z ∈ R^{N_Z×D} is an i.i.d. realization of the Brownian motion X(t_2)|X(t_1) at time t_2 = 2, and f(Z) ∈ R^{N_Z×1} are the corresponding function values.

For each method, the output is f_{Z|X} ∈ R^{N_Z×D_f}, approximating (4.15), hence compared to f(Z|X) in our experiments. We plot the generated learning and test sets in Figure 10, comparing the observed variable f_Z and the reference values f(Z|X). Thus the problem can be stated as follows: knowing the noisy data on the left, deduce the data on the right.

4.5.3. Four methods to tackle the Bachelier problem


We compare four methods for the Bachelier problem. Two methods are based on a standard approach, which uses predictive machines of the form (2.1) in order to approximate the conditional expectation (4.14) as

    f_{Z|X} = P_m(Z, Y, X, f(Z)).    (4.21)

The first machine m is a neural network method, the second a kernel one, labelled ANN and CodPy pred in the figures. The third machine solves (4.13), labelled Pi:iid in the figures. The fourth follows a similar approach, but picks X (resp. Z) as the sharp discrepancy sequences (SDS) of X(t_1) (resp. X(t_2)), and is labelled Pi:sharp in our figures.
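To make the benchmark concrete, the sketch below generates input data in the spirit of Section 4.5.2 and applies the naive approach (4.21), with kernel ridge regression standing in for the predictive machine P_m; this stand-in, the kernel parameters, and the random seed are assumptions of ours and do not reproduce the ANN or CodPy predictors of the text.

```python
import numpy as np
from scipy.stats import norm
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
D, N = 2, 512
theta, strike, t1, t2 = 0.2, 1.1, 1.0, 2.0

a = rng.uniform(size=D)
a = a / a.sum()                                        # basket weights, b(x) = x . a
sigma = (theta / np.sqrt(a @ a)) * np.eye(D)           # normalized so that db = theta dW

X = 1.0 + np.sqrt(t1) * rng.normal(size=(N, D)) @ sigma.T        # samples of X(t1)
Z = X + np.sqrt(t2 - t1) * rng.normal(size=(N, D)) @ sigma.T     # samples of X(t2)
fZ = np.maximum(Z @ a - strike, 0.0)                             # payoff f(Z)

# Reference values f(Z|X) from the closed formula (4.20), evaluated at X.
s = theta * np.sqrt(t2 - t1)
d = (X @ a - strike) / s
ref = s * norm.pdf(d) + (X @ a - strike) * norm.cdf(d)

# Naive estimator (4.21): regress f(Z) on Z and evaluate the regression at X.
pred = KernelRidge(kernel="rbf", gamma=2.0, alpha=1e-3).fit(Z, fZ).predict(X)
err = np.linalg.norm(pred - ref) / (np.linalg.norm(pred) + np.linalg.norm(ref))
print(err)   # normalized RMSE (2.9)
```

According to Section 4.5.4, this error does not tend to zero as N_X grows, which is precisely the motivation for the transition-probability approaches Pi:iid and Pi:sharp.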


Figure 11. Exact and predicted values for sharp discrepancy sequences (Pi:sharp): predicted (red) versus test (green) variables and values, plotted as functions of the basket values.

ANN
0.6 Pi:i.i.d.
Pi:sharp
0.5 codpy pred
0.4
scores

0.3
0.2
0.1

102 103 104


log2(Nx)

Figure 12. Benchmark of scores

To illustrate a typical benchmark run of one of these four methods, Figure 11 shows the predicted values f_{Z|X} against the exact ones f(Z|X), as functions of the basket values b(Z), for the last method (SDS). We show five runs of the method with N_X = N_Z = 32, 64, 128, 256, 512.
Figure 12 presents a benchmark of scores, computed according to the normalized RMSE (2.9) (lower is better), for the two-dimensional case D = 2; the results are similar whatever the dimension.

4.5.4. Concluding remarks

We emphasize that the horizontal axis in Figure 12 is the training set size N_X in log scale. This test shows numerically that both predictive methods based on (4.21) are not converging. The method Pi:iid (in yellow) shows a performance profile with a convergence pattern at the statistical rate 1/√N_X, that is, the one expected with randomly sampled data. The method Pi:sharp (in green) is an illustration of the performance gains obtained when using the proposed sharp discrepancy sequences.


Bibliography
[1] I. Babuska, U. Banerjee, and J.E. Osborn, Survey of mesh-less and generalized finite element
methods: a unified approach, Acta Numer. 12 (2003), 1–125.

[2] A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and sta-
tistics, Springer Science, Business Media, LLC, 2004.

[3] Y. Brenier, Polar factorization and monotone rearrangement of vector-valued functions, Comm.
Pure Applied Math. 44 (1991), 375–417.

[4] H. Brezis, Remarques sur le problème de Monge-Kantorovich dans le cas discret, Comptes Rendus Math. 356 (2018), 207–213.

[5] G.E. Fasshauer, Mesh-free methods, in “Handbook of Theoretical and Computational Nanotech-
nology”, Vol. 2, 2006.

[6] A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, and A.J. Smola, A kernel method for the two-sample problem, Proc. 19th Int. Conf. on Neural Information Processing Systems, 2006, pp. 513–520.

[7] D. Harrison and D.L. Rubinfeld, Hedonic prices and the demand for clean air, J. Environ.
Economics & Management 5 (1978), 81–102.

[8] T. Hofmann, B. Schölkopf, and A.J. Smola, Kernel methods in machine learning, Ann. Statist. 36 (2008), 1171–1220.

[9] P.G. LeFloch and J.-M. Mercier, A new method for solving Kolmogorov equations in math-
ematical finance, C. R. Math. Acad. Sci. Paris 355 (2017), 680–686.

[10] P.G. LeFloch and J.-M. Mercier, The Transport-based Mesh-free Method (TMM). A short
review, The Wilmott journal 109 (2020), 52–57. Available at ArXiv:1911.00992.

[11] P.G. LeFloch and J.-M. Mercier, Mesh-free error integration in arbitrary dimensions: a nu-
merical study of discrepancy functions, Comput. Methods Appl. Mech. Engrg. 369 (2020), 113245.

[12] P.G. LeFloch, J.-M. Mercier, and S. Miryusupov, CodPy : a Python library for machine
learning, statistic, and numerical simulations, Monograph in preparation. Code available at https:
//pypi.org/project/codpy/.

[13] R. McCann, Polar factorization of maps on Riemannian manifolds, Geom. Funct. Anal. 11 (2001),
589–608.

[14] R. Sinkhorn and P. Knopp, Concerning nonnegative matrices and doubly stochastic matrices,
Pacific J. Math. 21 (1967), 343–348.

[15] I.M. Sobol, Distribution of points in a cube and approximate evaluation of integrals, U.S.S.R
Comput. Maths. Math. Phys. 7 (1967), 86–112.

[16] B.K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G.R. Lanckriet, Hilbert space embeddings and metrics on probability measures, J. Mach. Learn. Res. 11 (2010), 1517–1561.


[17] O. Teymur, J. Gorham, M. Riabiz, and C.J. Oates, Optimal quantisation of probability measures using maximum mean discrepancy, Proc. 24th Int. Conf. on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA, Volume 130, pp. 1027–1035.

[18] T. Wenzel, G. Santin, and B. Haasdonk, Universality and optimality of structured deep
kernel networks, ArXiv:2105.07228.

