Predictive Machines with Uncertainty Quantification (2022)
Philippe G. LeFloch∗ and Jean-Marc Mercier†

∗ Laboratoire Jacques-Louis Lions, Centre National de la Recherche Scientifique, Sorbonne Université, 4 Place Jussieu, 75252 Paris, France. E-mail: contact@philippelefloch.org

† MPG-Partners, 136 Boulevard Haussmann, 75008 Paris, France. E-mail: jean-marc.mercier@mpg-partners.com
Abstract. We outline a strategy proposed by the authors to design high-dimensional extrapolation algorithms, also called learning or predictive machines, which are endowed with numerically computable uncertainty quantification estimates. We provide a computational framework based on kernels, which applies as well to neural networks (also called deep learning networks). This framework was primarily designed to target models based on partial differential equations. There exist numerous extrapolation strategies in the literature, which apply to a wide range of applications ranging from numerical simulations to statistics and machine learning. Among them, we advocate here the use of predictive machines based on kernels and endowed with error estimates, since they are efficient, versatile, and competitive in industrial applications. We highlight their benefits with illustrative examples consisting of fully reproducible benchmarks, and we compare our results with more traditional approaches. Our presentation here is focused on tests relevant to machine learning, and we discuss the natural connections between predictive machines and techniques of optimal transport. In particular, we provide a risk management framework that solely relies on historical observations of time series. The proposed strategy has led us, in collaboration with S. Miryusupov, to a Python code (referred to as CodPy) which is now publicly available.
1. Introduction
For problems in large dimensions, we consider here extrapolation methods (referred to as learning or predictive machines) which allow us to make predictions supported by quantifiable numerical error estimates. Indeed, a major interest of the proposed "machines" is that they are endowed with uncertainty quantification estimates based on a particularly simple notion of "distance", making these machines competitive and versatile for industrial applications.
In machine learning, the first application of an error estimate is to provide a confidence criterion on any prediction, allowing one to trigger an alert when a given tolerance is reached. A numerical estimate allows one to fully understand the prediction of a learning machine, since its performance can be fully explained in terms of the training set, the test set, and the internal parameters of the method. Such methods are then explainable to the user, hence auditable, and escape the black-box effect, a major and common criticism of artificial intelligence methods. Such explainability properties are required in order to pass the qualification tests for critical applications in an industrial context.
A second benefit of a (numerical) error estimate is to provide a clear view of how efficient a learning machine is, as it allows one to study its convergence rate. Via the notion of performance profile, the convergence rate associated with a given machine allows one to estimate its algorithmic complexity, a figure that is directly linked to its electrical consumption and environmental impact. Methods endowed with error estimates and optimization procedures lead to efficient
clustering methods, or learning machines with superior convergence rates, which compute optimized sequences of points (centroids or clusters). This strategy has analogies with quasi-Monte Carlo low-discrepancy sequences (LDS), such as the popular Sobol sequences [15]. However, optimized sequences exhibit a superior convergence rate (in comparison to LDS) for numerical integration, at the expense of a heavier computational load; we refer to [11], where such sequences are analyzed under the terminology 'sharp discrepancy sequences'. This approach is at the heart of a numerical strategy developed by the authors in [10]–[9] in order to solve a wide range of partial differential equations (PDEs) in high dimensions.
Finally, as these learning machines are endowed with a notion of distance, their study relates naturally to the theory of optimal transport. Given any such machine, we can consider the polar factorization of maps [3, 13] and the Monge-Kantorovich problem which, in their discrete versions, can be expressed in terms of a linear sum assignment problem (LSAP) (as first observed in [4]). This allows us to design novel algorithms that combine elements from machine learning and optimal transportation, and rely on the computation of transition probabilities of martingale processes. This strategy [9] was first introduced for applications in mathematical finance and provides an alternative to the standard Sinkhorn algorithm [14].
This text is organized following the three paragraphs above. In Section 2 we (informally) describe our strategy and discuss how to compute error estimates. We illustrate the interest of the method numerically with two examples. The first test is a benchmark of different methods applied to the MNIST dataset, illustrating how the scores of learning machines can be explained with error estimates. The second test is a benchmark of different methods applied to the so-called Boston housing prices dataset, illustrating a reproducibility property that is of crucial importance in industrial applications.
In Section 3 we discuss distance-based clustering algorithms. We illustrate the interest of our approach with two benchmarks of unsupervised learning methods, namely the standard k-means algorithm and the design of sharp discrepancy sequences, for the problem of credit card fraud detection (a large, unbalanced, real dataset). To support our conclusions and highlight potential gains in convergence rates, we also provide a similar benchmark for the MNIST problem, which is considered in the supervised learning case; cf. Section 3.
Finally, in Section 4 we discuss algorithms based on optimal transport. We present two numerical examples of particular importance in financial applications: a benchmark of methods computing transition probabilities, and an application to time series prediction. In mathematical finance, these are the two pillars of a risk management framework based on historical observations of time series.
An extrapolation (or learning) method $m$ takes as input a training set of variables $X = (x^1, \ldots, x^{N_X}) \in \mathbb{R}^{N_X \times D}$ together with the corresponding values $f(X) = (f(x^1), \ldots, f(x^{N_X})) \in \mathbb{R}^{N_X \times D_f}$, called the training set, and a sequence of points $Z = (z^1, \ldots, z^{N_Z})$, called the test set, and is applied as follows:
$$ Z \mapsto f_Z = P_m(X, Y, Z, f(X)) \simeq f(Z). \qquad (2.1) $$
In principle, fZ predicts the ground truth values f (Z), extrapolated from the data X, f (X).
Here, Y is a set of internal parameters required by the method m. When a method m can also
compute consistent differential operators, as for instance the gradient,
$$ Z \mapsto (\nabla f)_Z = \nabla_Z P_m(X, Y, Z, f(X)) \sim (\nabla f)(Z), \qquad (2.2) $$
then we say that it is a differentiable learning machine, and it can then also be used for
numerical simulations of PDE models.
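For concreteness, the following minimal sketch implements one possible instance of the map (2.1), namely plain kernel (ridge) regression with a Gaussian kernel. The kernel choice, the regularization parameter, and all function names are our own illustrative assumptions and do not describe the CodPy interface.

```python
import numpy as np

def gaussian_kernel(X, Z, length_scale=1.0):
    """Gram matrix K(X, Z) for a Gaussian kernel (an illustrative choice of K)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def predict(X, Z, fX, length_scale=1.0, reg=1e-8):
    """One possible realization of Z -> f_Z = P_m(X, Y, Z, f(X)) in (2.1)."""
    K_XX = gaussian_kernel(X, X, length_scale)
    K_ZX = gaussian_kernel(Z, X, length_scale)
    theta = np.linalg.solve(K_XX + reg * np.eye(len(X)), fX)
    return K_ZX @ theta

# Toy usage: extrapolate f(x) = sin(x_1) from N_X = 100 points in dimension D = 2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
Z = rng.uniform(-1.0, 1.0, size=(50, 2))
fZ = predict(X, Z, np.sin(X[:, :1]))
print(np.abs(fZ - np.sin(Z[:, :1])).max())
```

The subsequent snippets in this section build on these definitions as one running toy example.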
The accuracy of the method $m$ can be measured through numerous performance indicators (or metrics) of the form $\|f(Z) - f_Z\|$, for instance the root mean squared error (RMSE) $\|f(Z) - f_Z\|_{\ell^2}$. To analyze these indicators, a standard approach is cross-validation, which relies on a statistical argument: one can compute
$$ \|f(X_2) - P_m(X_1, Y, X_2, f(X_1))\| $$
for one (or several) suitably chosen partitions of the training set $X_1 \cup X_2 = X$, and one can expect that $\|f(Z) - f_Z\|$ behaves similarly if, for instance, $Z$ and $X$ are drawn i.i.d. from the same distribution.
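A minimal sketch of this cross-validation argument, reusing the hypothetical `predict` helper above; the random half/half split and the RMSE norm are our own choices.

```python
def cross_validation_error(X, fX, n_splits=5, seed=0):
    """Average RMSE of predict() over random partitions X = X1 u X2."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(len(X))
        half = len(X) // 2
        i1, i2 = perm[:half], perm[half:]
        fX2_pred = predict(X[i1], X[i2], fX[i1])
        errors.append(np.sqrt(np.mean((fX[i2] - fX2_pred) ** 2)))
    return float(np.mean(errors))

# Used as a statistical proxy for the (unknown) test error ||f(Z) - f_Z||.
print(cross_validation_error(X, np.sin(X[:, :1])))
```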
The error estimates below rely on a kernel $K : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ and on the associated discrepancy $d_K(\mu, \nu)$ between probability measures, introduced in (2.3). Under suitable assumptions [16], this defines a metric on the set of probability measures $\mathcal{M}$; for instance, $K$ is assumed to be a positive definite kernel, that is, the matrix $K(X, X) = \big(K(x^i, x^j)\big)_{i, j = 1, \ldots, N_X}$ is positive definite for all sequences of distinct points $X$. Observe that the discrete case $\delta_X = \frac{1}{N_X} \sum_i \delta_{x^i}$ (where $\delta_x$ is the Dirac mass) was introduced in [6]. Interestingly, this definition leads to a numerically tractable formula:
$$ d_K(\delta_X, \delta_Z)^2 = \frac{1}{N_X^2} \sum_{i,j=1}^{N_X} K(x^i, x^j) + \frac{1}{N_Z^2} \sum_{i,j=1}^{N_Z} K(z^i, z^j) - \frac{2}{N_X N_Z} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Z} K(x^i, z^j), \qquad (2.4) $$
also explored numerically in [11] in the case $d_K(\mu, \delta_Y)$. Discrepancy indicators are also available for neural network (deep learning) methods, although we do not know of any framework in which they are computed in practice. Indeed, these methods can be interpreted as linear kernel methods based on a nonsymmetric feature map, obtained by composing layers of the form $x \mapsto \sigma_i(W_i x + b_i)$ and taking a final scalar product $\langle \cdot, \cdot \rangle$ with an output weight vector. Here, $W_i \in \mathbb{R}^{D_i \times D_{i-1}}$ are the weight matrices, $b_i \in \mathbb{R}^{D_i}$ the bias terms, and $\sigma_i : \mathbb{R} \to \mathbb{R}$ the activation functions applied dimension-wise, for $i = 1, \ldots, L$, $L$ being the depth of the network. In the framework (2.1), deep learning methods define their set of internal parameters as $Y = \{W_i, b_i, \ i = 1, \ldots, L\}$.
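The discrete discrepancy (2.4) is straightforward to evaluate. The sketch below does so for the same illustrative Gaussian kernel as above; any other positive definite kernel could be substituted.

```python
def discrepancy(X, Z, kernel=gaussian_kernel):
    """Kernel discrepancy d_K(delta_X, delta_Z) computed from formula (2.4)."""
    d2 = kernel(X, X).mean() + kernel(Z, Z).mean() - 2.0 * kernel(X, Z).mean()
    return np.sqrt(max(d2, 0.0))  # clip tiny negative values due to round-off

print(discrepancy(X, Z))   # small when X and Z sample the same distribution
print(discrepancy(X, X))   # zero, up to round-off
```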
Here $H_{K,X} \subset H_K$ denotes a finite-dimensional functional space of dimension $N_X$, equipped with the norm induced by $\langle \cdot, \cdot \rangle_{H_K}$, which enters the estimate (2.5). This lower bound captures all the information available in the training data set. The estimate (2.5) has a quite natural interpretation: the integration error is split into two parts, the first one being the distance between the training set and the test set variables, and the second one being measured from the training set of values.
Observe also that the estimate (2.6) is very general and relates to many other performance indicators. For instance, one can deduce the pointwise estimate
$$ \Big| \varphi(x) - \int_{\mathbb{R}^D} \varphi(y) \, d\nu(y) \Big| \le d_K(\delta_x, \nu) \, \|\varphi\|_{H_K}, $$
or the discrete RMSE error estimator
$$ \|\varphi(Z) - \varphi_Z\|_{\ell^2} \le d_K(\delta_Z, \delta_X) \, \|\varphi\|_{H_K}. \qquad (2.7) $$
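The right-hand side of (2.7) is computable from the data alone once $\|\varphi\|_{H_K}$ is estimated; a common kernel-method estimator (our assumption here, not a statement from the text) is $\|\varphi\|_{H_K}^2 \approx f(X)^\top K(X, X)^{-1} f(X)$. A sketch comparing both sides of (2.7) on the toy data above:

```python
def rkhs_norm(X, fX, kernel=gaussian_kernel, reg=1e-8):
    """Assumed estimator of ||phi||_{H_K} from the training values."""
    K_XX = kernel(X, X) + reg * np.eye(len(X))
    return np.sqrt((fX.T @ np.linalg.solve(K_XX, fX)).item())

fX, fZ_true = np.sin(X[:, :1]), np.sin(Z[:, :1])
lhs = np.sqrt(np.mean((fZ_true - predict(X, Z, fX)) ** 2))   # discrete RMSE
rhs = discrepancy(X, Z) * rkhs_norm(X, fX)                   # computable indicator
print(lhs, rhs)
```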
The MNIST benchmark is formalized as follows. Given the training set of variables $X \in \mathbb{R}^{N_X \times D}$ (images of handwritten digits, $D = 784$), the training set of values (labels) $f(X) \in \mathbb{R}^{N_X \times D_f}$, $D_f = 10$, and the test set $Z \in \mathbb{R}^{N_Z \times D}$, $N_Z = 10000$, predict the label function $f(Z) \in \mathbb{R}^{N_Z \times D_f}$.
Performance is measured here with a common indicator used to compare labelled supervised learning methods, defined as the score
$$ \frac{1}{N_Z} \, \#\big\{\, f_{z^n} = f(z^n), \ n = 1, \ldots, N_Z \,\big\}, \qquad (2.8) $$
with $N_Z = 10000$. This produces an indicator between 0 and 1, higher being better. We benchmark here the performance profile, considering scores as a function of the training set size $N_X$. Our purpose in this presentation is not to discuss each method $m$ under consideration; all of the methods we consider have public documentation that the reader can consult. All these methods are used with their standard internal parameters, and the performance profiles are reported in Figure 1.
One of these methods, referred to as CodPy, is a kernel-based method and includes a computation of the discrepancy functional (2.4); we plot the corresponding results in Figure 2. One can see, and we checked this numerically, that the indicator $1 - d_K(\delta_X, \delta_Z)$ is a strict minorant of the score for this kernel method, in accordance with the error estimate (2.5) (observing that the score function (2.8) is normalized). The discrepancy is thus a reliable indicator to explain the performance profile of this method.
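As an illustration, the score (2.8) and the lower-bound indicator $1 - d_K(\delta_X, \delta_Z)$ discussed above can both be computed in a few lines; decoding labels with an argmax over the $D_f = 10$ one-hot columns is our own assumption about the label encoding.

```python
def classification_score(fZ_pred, fZ_true):
    """Score (2.8): fraction of test points whose predicted label is exact."""
    return float(np.mean(fZ_pred.argmax(axis=1) == fZ_true.argmax(axis=1)))

def score_lower_bound(X, Z):
    """Indicator 1 - d_K(delta_X, delta_Z), compared to the scores in Figure 2."""
    return 1.0 - discrepancy(X, Z)
```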
This property is illustrated now with the so-called Boston housing price dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It describes 506 houses through 13 attributes (variables or features), together with a target column containing the housing prices. Further details can be found in [7].
The benchmark is formalized as follows. Given the test set $Z \in \mathbb{R}^{N_Z \times D}$, $N_Z = 506$, $D = 13$, consider a subset $X \in \mathbb{R}^{N_X \times D} \subset Z$ as the training set of variables, together with the training set of values (labels) $f(X) \in \mathbb{R}^{N_X \times D_f}$, $D_f = 1$, and predict the housing prices $f(Z) \in \mathbb{R}^{N_Z \times D_f}$.
We plot the results of a benchmark in Figure 3, considering a normalized version of the RMSE error
$$ \frac{\|f(Z) - P_m(X, Y, Z, f(X))\|_{\ell^2}}{\|P_m(X, Y, Z, f(X))\|_{\ell^2} + \|f(Z)\|_{\ell^2}}, \qquad (2.9) $$
so that here lower is better. Considering Figure 3, there is one method reaching zero at $N_X = 506$, this very last point being the one where the training set and the test set coincide, $X = Z$. We say that a method $m$ satisfies the reproducibility property if it satisfies $d_K(\delta_X, \delta_X) = 0$, this notion being motivated by the error estimate (2.5).
Beyond accuracy, reproducibility is important for explainability purposes, as well as for data quality. Indeed, it implies that the predictions $f_Z$ can be expressed as a linear combination of the training set. This is a useful facet of explainability, allowing one to question predictions directly in terms of the training set and to challenge the input data for potential detection of outliers. Another important aspect of the reproducibility property is to ensure that no artificial noise is added to the training set, a welcome relief when computing differential operators as in (2.2) for numerical simulations.
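A minimal numerical check of this reproducibility property, using the illustrative kernel predictor introduced earlier: when the test set equals the training set, the normalized RMSE (2.9) vanishes, up to the small regularization we added.

```python
def normalized_rmse(fZ_true, fZ_pred):
    """Normalized RMSE score (2.9); lower is better."""
    return np.linalg.norm(fZ_true - fZ_pred) / (
        np.linalg.norm(fZ_pred) + np.linalg.norm(fZ_true))

fX = np.sin(X[:, :1])
print(normalized_rmse(fX, predict(X, X, fX)))   # ~0: the training data is reproduced
```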
These distances between sets of points are usually based on simpler distances, as for instance the Euclidean, the Manhattan, or the log-entropy distance, depending upon the problem under consideration. Consider any point $x \in \mathbb{R}^D$. Then $x$ is attached naturally to the centroid $y^{\sigma(x,Y)}$, where the index function $\sigma(x, Y)$ is defined as
$$ \sigma(x, Y) = \arg\min_{i = 1, \ldots, N_Y} d(x, y^i). \qquad (3.1) $$
It also defines the (Voronoi) cells $\Omega_i = \{x \in \mathbb{R}^D : d(x, y^i) = d(x, y^{\sigma(x,Y)})\}$, quantizing (i.e. partitioning) the space $\mathbb{R}^D$. A popular choice for the distance in (3.1), called the inertia, leads to the k-means algorithm, usually considered with the Euclidean distance $d(x, y) = |x - y|$:
$$ d(X, Y) = \sum_{n=1}^{N_X} d\big(x^n, y^{\sigma(x^n, Y)}\big). \qquad (3.2) $$
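A minimal sketch of the index function (3.1) and of the objective (3.2); we use the squared Euclidean distance, the usual k-means convention, which is our own choice relative to the text.

```python
def assign(X, Y):
    """Index function sigma(x, Y): nearest centroid for each point of X, as in (3.1)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return d2.argmin(axis=1)

def inertia(X, Y):
    """Clustering objective (3.2), here with the squared Euclidean distance."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).sum())

Y_centroids = X[np.random.default_rng(1).choice(len(X), size=10, replace=False)]
print(assign(X, Y_centroids)[:5], inertia(X, Y_centroids))
```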
The link between supervised learning (extrapolation) and unsupervised learning (such as clustering) is straightforward, as one can quantize any set of observations $f(X)$ with $f(Y)$, thus defining a learning machine, with the notation (2.1), that extrapolates from the centroids $Y, f(Y)$ instead of $X, f(X)$. Sharp discrepancy sequences are defined in the same spirit, but the considered distance is the discrepancy (2.3); they are designed to minimize the numerical integration error (2.5).
Minimization problems such as (3.1)–(3.4) are expected to be quite challenging to solve. For instance, if a sequence Ȳ is a solution to (3.1), then any permutation of Ȳ is also expected to provide a solution, so the functional $d(X, Y)$ has numerous global minima. Moreover, the functional $Y \mapsto d(X, Y)$ is not expected to be convex. We plot in Figure 4 two examples of a very simple discrepancy, starting from five one-dimensional points $X \in [-1, 1]^5$ (orange dots) and plotting the function $y \mapsto d_K(X, y)$ for two different kernels, a Gaussian and a Matérn one (see [2] for a definition of these kernels). We also plot a linear interpolation between the five points, so that the reader can appreciate the non-convexity of the discrepancy functional, even in this simple case. A consequence is that standard gradient-descent algorithms usually fail to find a global minimum, and more refined algorithms are needed, such as those described in [12] or [17]. In the next paragraph, we illustrate an example of the performance gain obtained with this strategy.
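The non-convexity is easy to visualize. The sketch below evaluates $y \mapsto d_K(\delta_X, \delta_y)$ for five one-dimensional points and the illustrative Gaussian kernel, in the spirit of Figure 4; the specific points and bandwidth are our own choices.

```python
import matplotlib.pyplot as plt

X1d = np.array([-0.9, -0.4, 0.0, 0.3, 0.8]).reshape(-1, 1)     # five 1-D points
ys = np.linspace(-1.5, 1.5, 400)
vals = [discrepancy(X1d, np.array([[y]])) for y in ys]          # y -> d_K(delta_X, delta_y)

plt.plot(ys, vals)
plt.scatter(X1d.ravel(), np.zeros(len(X1d)), color="orange")    # the five points
plt.xlabel("y")
plt.ylabel("d_K(delta_X, delta_y)")
plt.show()
```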
[Figure: benchmark of the CodPy clustering method against the k-means algorithm, showing scores, discrepancy errors, execution times, and inertia as functions of the number of centroids N_y.]
where $\varphi : X \to \mathbb{R}$ and $\psi : Z \to \mathbb{R}$ are discrete functions. As stated in [4], the three discrete problems (4.3), (4.4), and (4.2) are equivalent. We observe that the discrete Monge problem (4.2) is also known as the linear sum assignment problem (LSAP), solved in the early 1950s by an algorithm due to H.W. Kuhn, known as the Hungarian method (an algorithm often credited to Jacobi, 1890).
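For reference, the LSAP can be solved directly with SciPy's implementation of this type of algorithm; the quadratic cost $c(x, z) = |x - z|^2$ and the two Gaussian point clouds below are our own illustrative choices.

```python
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
A = rng.normal(size=(64, 2))                 # samples of a source measure
B = rng.normal(size=(64, 2)) + 1.0           # samples of a target measure

C = cdist(A, B, metric="sqeuclidean")        # cost matrix c(x^n, z^m)
rows, sigma = linear_sum_assignment(C)       # Hungarian-type solver for the LSAP
print(C[rows, sigma].sum())                  # optimal assignment (transport) cost
```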
For the continuous case, under suitable conditions on $\nu$, $\mu$ (having compact, connected, smooth support), any transport map $S$, with $S_\# \nu = \mu$, can be polar factorized as
$$ S(x) = \bar{S} \circ T(x), \qquad T_\# \nu = \nu, \qquad (4.5) $$
where $\bar{S}$ is the unique solution to the Monge problem (4.1), having the property of being the gradient of a $c$-convex potential, $\bar{S}(x) = \exp_x\!\big({-\nabla h(x)}\big)$, $\exp_x$ being the standard notation for the exponential map in Riemannian geometry. A scalar function $h$ is said to be $c$-convex if $h^{cc} = h$, where $h^c(z) = \inf_x \big(c(x, z) - h(x)\big)$ is called the infimal $c$-convolution. Standard convexity coincides with $c$-convexity for convex cost functions such as the Euclidean one, in which case the following polar factorization holds: $S(x) = (\nabla h) \circ T(x)$, with $h$ convex.
Consider now a learning machine (2.1), for which the discrepancy (2.3) is defined, and take the cost function $c(x, z) = d_K(\delta_x, \delta_z)$. Consider as above two discrete measures $\mu, \nu = \delta_X, \delta_Z$,
defining the map $S(x^n) = z^n$. In this setting, determining the measure-preserving map $T$ appearing in the right-hand side of the polar factorization (4.5) amounts to finding the permutation
$$ \sigma = \arg\inf_{\sigma \in \Sigma} \sum_{n=1}^{N} d_K\big(\delta_{x^n}, \delta_{z^{\sigma(n)}}\big). \qquad (4.6) $$
Then, considering a differentiable learning machine (2.2), a discrete polar factorization consists in solving the following equation for the unknown potential $h$:
$$ Z^\sigma = \exp_X\!\big({-\nabla_X P_m(X, Y, X, h(X))}\big). \qquad (4.7) $$
Such algorithms can be implemented for any differentiable, error-based learning machine.
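Since the permutation problem (4.6) is itself an LSAP, it can be solved with the same tool as above, now with the kernel-induced cost $c(x, z) = d_K(\delta_x, \delta_z)$. The sketch below, reusing the point clouds A, B, the illustrative Gaussian kernel, and the solver imported earlier, writes this cost explicitly.

```python
def kernel_cost_matrix(X, Z, kernel=gaussian_kernel):
    """C[n, m] = d_K(delta_{x^n}, delta_{z^m}) = sqrt(K(x,x) + K(z,z) - 2 K(x,z))."""
    Kxx = np.diag(kernel(X, X))[:, None]
    Kzz = np.diag(kernel(Z, Z))[None, :]
    return np.sqrt(np.maximum(Kxx + Kzz - 2.0 * kernel(X, Z), 0.0))

def polar_permutation(X, Z):
    """Permutation sigma solving (4.6), posed as a linear sum assignment problem."""
    _, sigma = linear_sum_assignment(kernel_cost_matrix(X, Z))
    return sigma

sigma = polar_permutation(A, B)   # Z^sigma = B[sigma] is the rearranged target cloud
```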
$I_{N_X}$ being the identity matrix. The solution $t \mapsto \Pi_{X(t)}$ defines a path of stochastic matrices, that is, of transition probability matrices, and $\Pi_{X(t)}(m, n)$ represents the probability of the discrete Markov chain jumping from the state $x^m(t)$ to the state $x^n(T)$. The particular case $\varphi(t, X(t)) = X(t)$ leads to the relation $X(t) = \Pi_{X(t)} X(T)$.
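A small sanity-check sketch of these objects: given a candidate matrix Π, one can verify that it is row-stochastic and that it maps the terminal states X(T) to the intermediate states X(t), as in the relation above. The matrix and states used here are hypothetical examples, not the output of the algorithm described in the text.

```python
def is_stochastic(Pi, tol=1e-10):
    """Check that Pi has nonnegative entries and rows summing to one."""
    return bool((Pi >= -tol).all() and np.allclose(Pi.sum(axis=1), 1.0))

def conditional_expectation(Pi, X_T):
    """Relation X(t) = Pi_{X(t)} X(T): expectation of terminal states from time t."""
    return Pi @ X_T

Pi = np.full((4, 4), 0.25)                       # hypothetical transition matrix
X_T = np.array([[0.9], [1.1], [0.8], [1.2]])     # terminal states X(T)
print(is_stochastic(Pi), conditional_expectation(Pi, X_T).ravel())
```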
argument to ensure the stability of the recurrent scheme (4.17). Thus we provide an alternative method in the next paragraph.
[Figure 10. Bachelier problem. Left: training set b(Z), f(X); right: test set b(X), f(Z|X), both plotted against basket values.]
• X ∈ R^{N_X×D} is given by i.i.d. samples of the Brownian motion X(t_1) at time t_1 = 1. The reference values are f(Z|X) ∈ R^{N_X×1}, computed using (4.20).
For each method, the output is f_{Z|X} ∈ R^{N_Z×D_f}, approximating (4.15), and is compared to f(Z|X) in our experiments. We plot the generated training and test sets in Figure 10, comparing the observed variable f_Z and the reference values f(Z|X). Thus the problem can be stated as: knowing the noisy data on the left-hand side, deduce those on the right.
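The construction of such a benchmark can be sketched as follows, in a one-dimensional setting and with our own choices for the payoff, the dates, and the strike (the paper's exact specification (4.20) is not reproduced here): Brownian samples at $t_1 = 1$ serve as the variables, noisy payoffs observed at a later date $t_2$ serve as the values, and the closed-form conditional expectation of the Bachelier (normal) model serves as the reference $f(Z|X)$.

```python
from scipy.stats import norm

rng = np.random.default_rng(3)
N, t1, t2, strike = 512, 1.0, 2.0, 1.0

X_t1 = 1.0 + rng.normal(scale=np.sqrt(t1), size=N)         # Brownian samples at t1
X_t2 = X_t1 + rng.normal(scale=np.sqrt(t2 - t1), size=N)   # samples at the later date t2
payoff = np.maximum(X_t2 - strike, 0.0)                    # noisy observed values

def bachelier_call(x, tau, k):
    """E[(X_{t2} - k)^+ | X_{t1} = x] in the Bachelier (normal) model."""
    s = np.sqrt(tau)
    d = (x - k) / s
    return (x - k) * norm.cdf(d) + s * norm.pdf(d)

reference = bachelier_call(X_t1, t2 - t1, strike)          # reference values f(Z | X)
```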
[Figure 11. Exact and predicted values for sharp discrepancy sequences, plotted against basket values (five panels, Pi:sharp).]
[Figure 12. Benchmark of scores for the methods ANN, Pi:i.i.d., Pi:sharp, and codpy pred, as functions of the training set size.]
To illustrate a typical benchmark run of one of these four methods, Figure 11 shows the predicted values $f_{Z|X}$ against the exact ones $f(Z|X)$, as functions of the basket values $b(Z)$, for the last method (SDS). We show five runs of the method with $N_X = N_Z = 32, 64, 128, 256, 512$.
Figure 12 presents a benchmark of the scores, computed according to the normalized RMSE (2.9) (lower is better), for the two-dimensional case $D = 2$; the results are, however, similar whatever the dimension.
We emphasize that the horizontal axis in Figure 12 is the training set size $N_X$ in log-scale. This test shows numerically that both predictive methods based on (4.21) are not converging. The method Pi:i.i.d. (in yellow) shows a performance profile with a convergence pattern at the statistical rate $1/\sqrt{N_X}$, that is, the one expected with randomly sampled data. The method Pi:sharp (in green) illustrates the performance gains obtained when using the proposed sharp discrepancy sequences.
Bibliography
[1] I. Babuska, U. Banerjee, and J.E. Osborn, Survey of mesh-less and generalized finite element
methods: a unified approach, Acta Numer. 12 (2003), 1–125.
[2] A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics, Springer Science+Business Media, 2004.
[3] Y. Brenier, Polar factorization and monotone rearrangement of vector-valued functions, Comm.
Pure Applied Math. 44 (1991), 375–417.
[4] H. Brezis, Remarques sur le problème de Monge–Kantorovich dans le cas discret, Comptes Rendus Math. 356 (2018), 207–213.
[5] G.E. Fasshauer, Mesh-free methods, in “Handbook of Theoretical and Computational Nanotech-
nology”, Vol. 2, 2006.
[6] A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, and A.J. Smola, A kernel method for the two-sample problem, Proc. 19th Int. Conf. on Neural Information Processing Systems, 2006, pp. 513–520.
[7] D. Harrison and D.L. Rubinfeld, Hedonic prices and the demand for clean air, J. Environ.
Economics & Management 5 (1978), 81–102.
[8] T. Hofmann, B. Schölkopf, and A.J. Smola, Kernel methods in machine learning, Ann. Statist. 36 (2008), 1171–1220.
[9] P.G. LeFloch and J.-M. Mercier, A new method for solving Kolmogorov equations in math-
ematical finance, C. R. Math. Acad. Sci. Paris 355 (2017), 680–686.
[10] P.G. LeFloch and J.-M. Mercier, The Transport-based Mesh-free Method (TMM). A short
review, The Wilmott journal 109 (2020), 52–57. Available at ArXiv:1911.00992.
[11] P.G. LeFloch and J.-M. Mercier, Mesh-free error integration in arbitrary dimensions: a nu-
merical study of discrepancy functions, Comput. Methods Appl. Mech. Engrg. 369 (2020), 113245.
[12] P.G. LeFloch, J.-M. Mercier, and S. Miryusupov, CodPy: a Python library for machine learning, statistics, and numerical simulations, monograph in preparation. Code available at https://pypi.org/project/codpy/.
[13] R. McCann, Polar factorization of maps on Riemannian manifolds, Geom. Funct. Anal. 11 (2001),
589–608.
[14] R. Sinkhorn and P. Knopp, Concerning nonnegative matrices and doubly stochastic matrices,
Pacific J. Math. 21 (1967), 343–348.
[15] I.M. Sobol, Distribution of points in a cube and approximate evaluation of integrals, U.S.S.R. Comput. Math. Math. Phys. 7 (1967), 86–112.
[17] O. Teymur, J. Gorham, M. Riabiz, and C.J. Oates, Optimal quantisation of probability measures using maximum mean discrepancy, Proc. 24th Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2021), PMLR 130, pp. 1027–1035.
[18] T. Wenzel, G. Santin, and B. Haasdonk, Universality and optimality of structured deep
kernel networks, ArXiv:2105.07228.