Predictive Machines with Uncertainty Quantification (2022)
Philippe G. LeFloch∗ and Jean-Marc Mercier†

∗ Laboratoire Jacques-Louis Lions, Centre National de la Recherche Scientifique, Sorbonne Université, 4 Place Jussieu, 75252 Paris, France. E-mail: contact@philippelefloch.org

† MPG-Partners, 136 Boulevard Haussmann, 75008 Paris, France. E-mail: jean-marc.mercier@mpg-partners.com
Abstract. We outline a strategy proposed by the authors to design high-dimensional extrapolation algorithms, also called learning or predictive machines, which are endowed with numerically computable uncertainty quantification estimates. We provide a computational framework based on kernels, which applies as well to neural networks (also called deep learning networks). This framework was primarily designed to target models based on partial differential equations. There exist numerous extrapolation strategies in the literature, which apply to a wide range of applications ranging from numerical simulations to statistics and machine learning. Among them, we advocate here the use of predictive machines based on kernels and endowed with error estimates, since they are efficient, versatile, and competitive in industrial applications. We highlight their benefits with illustrative examples consisting of fully reproducible benchmarks, and we compare our results with more traditional approaches. Our presentation here is focused on tests relevant to machine learning, and we discuss the natural connections between predictive machines and techniques of optimal transport. In particular, we provide a risk management framework that solely relies on historical observations of time series. The proposed strategy has led us, in collaboration with S. Miryusupov, to a Python code (referred to as CodPy) which is now publicly available.
1. Introduction
For problems in large dimensions, we consider here extrapolation methods (referred to as learning or predictive machines) which allow us to make predictions supported by quantifiable numerical error estimates. Indeed, a major interest of the proposed "machines" is that they are endowed with uncertainty quantification estimates based on a particularly simple notion of "distance", making these machines competitive and versatile for industrial applications.
In machine learning, the first application of an error estimate is to provide a confidence criterion on any prediction, allowing one to trigger an alert when a given tolerance is reached. A numerical estimate allows one to fully understand the prediction of a learning machine, since its performance can be fully explained in terms of the training set, the test set, and the internal parameters of the method. Such methods are then explainable to the user, hence auditable, and escape the black-box effect, a major and common criticism of artificial intelligence methods. Such explainability properties are required in order to pass the qualification tests for critical applications in an industrial context.
A second benefit of a (numerical) error estimate is to provide a clear view of how efficient a learning machine is, as it allows one to study its convergence rate. Via the notion of performance profile, the convergence rate associated with a given machine allows one to estimate its algorithmic complexity, a figure that is directly linked to its electrical consumption and environmental impact. Methods endowed with error estimates and optimization procedures lead to efficient
clustering methods, or learning machines with superior convergence rates, which compute optimized sequences of points (centroids or clusters). This strategy has analogies with quasi-Monte Carlo low-discrepancy sequences (LDS), such as the popular Sobol sequences [15]. However, optimized sequences exhibit a superior convergence rate (in comparison to LDS) for numerical integration, at the expense of a heavier computational load; we refer to [11], where such sequences are analyzed under the terminology 'sharp discrepancy sequences'. This approach is at the heart of a numerical strategy developed by the authors in [10]–[9] in order to solve a wide range of partial differential equations (PDEs) in high dimensions.
Finally, as these learning machines are endowed with a notion of distance, their study relates naturally to the theory of optimal transport. Given any such machine, we can consider the polar factorization of maps [3, 13] and the Monge-Kantorovich problem which, in their discrete versions, can be expressed in terms of a linear sum assignment problem (LSAP) (as first observed in [4]). This allows us to design novel algorithms that combine elements from machine learning and optimal transportation, and rely on the computation of transition probabilities of martingale processes. This strategy [9] was first introduced for applications in mathematical finance and provides an alternative to the standard Sinkhorn algorithm [14].
This text is organized following the three paragraphs above. In Section 2 we (informally) describe our strategy and discuss how to compute error estimates. We illustrate the interest of the method numerically with two examples. The first test is a benchmark of different methods applied to the MNIST dataset, illustrating how the scores of learning machines can be explained with error estimates. The second test is a benchmark of different methods applied to the so-called Boston housing prices dataset, illustrating a reproducibility property that is of crucial importance in industrial applications.
In Section 3 we discuss distance-based clustering algorithms. We illustrate the interest of our approach with two benchmarks of unsupervised learning methods, namely the standard k-means algorithm and the design of sharp discrepancy sequences, for the problem of credit card fraud detection (a large, unbalanced, real dataset). To support our conclusions and highlight potential gains in convergence rates, we also provide a similar benchmark for the MNIST problem, which is considered in the supervised learning case; cf. Section 3.
Finally, in Section 4 we discuss algorithms based on optimal transport. We present two numerical examples of particular importance in financial applications: a benchmark of methods computing transition probabilities, and an application to time series prediction. In mathematical finance, these are the two pillars of a risk management framework based on historical observations of time series.
An extrapolation (or learning) method $m$ takes as input a training set of variables $X = (x^1, \ldots, x^{N_X}) \in \mathbb{R}^{N_X \times D}$ together with the corresponding values $f(X) = (f(x^1), \ldots, f(x^{N_X})) \in \mathbb{R}^{N_X \times D_f}$, called the training set, and a sequence of points $Z = (z^1, \ldots, z^{N_Z})$, called the test set, and is applied as follows:
$$ Z \mapsto f_Z = P_m(X, Y, Z, f(X)) \simeq f(Z). \qquad (2.1) $$
In principle, fZ predicts the ground truth values f (Z), extrapolated from the data X, f (X).
Here, Y is a set of internal parameters required by the method m. When a method m can also
compute consistent differential operators, as for instance the gradient,
$$ Z \mapsto (\nabla f)_Z = \nabla_Z P_m(X, Y, Z, f(X)) \sim (\nabla f)(Z), \qquad (2.2) $$
then we say that it is a differentiable learning machine, and it can then also be used for
numerical simulations of PDE models.
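For concreteness, the following minimal sketch implements one possible instance of the map (2.1), namely plain kernel (ridge) regression with a Gaussian kernel. The kernel choice, the regularization parameter, and all function names are our own illustrative assumptions and do not describe the CodPy interface.

```python
import numpy as np

def gaussian_kernel(X, Z, length_scale=1.0):
    """Gram matrix K(X, Z) for a Gaussian kernel (an illustrative choice of K)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def predict(X, Z, fX, length_scale=1.0, reg=1e-8):
    """One possible realization of Z -> f_Z = P_m(X, Y, Z, f(X)) in (2.1)."""
    K_XX = gaussian_kernel(X, X, length_scale)
    K_ZX = gaussian_kernel(Z, X, length_scale)
    theta = np.linalg.solve(K_XX + reg * np.eye(len(X)), fX)
    return K_ZX @ theta

# Toy usage: extrapolate f(x) = sin(x_1) from N_X = 100 points in dimension D = 2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
Z = rng.uniform(-1.0, 1.0, size=(50, 2))
fZ = predict(X, Z, np.sin(X[:, :1]))
print(np.abs(fZ - np.sin(Z[:, :1])).max())
```

The subsequent snippets in this section build on these definitions as one running toy example.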
The accuracy of the method $m$ can be measured through numerous performance indicators (or metrics) of the form $\|f(Z) - f_Z\|$, for instance the root mean squared error (RMSE) $\|f(Z) - f_Z\|_{\ell^2}$. To analyze these indicators, a standard approach is cross-validation, which relies on a statistical argument: one can compute
$$ \|f(X_2) - P_m(X_1, Y, X_2, f(X_1))\| $$
for one (or several) suitably chosen partitions of the training set $X_1 \cup X_2 = X$, and one can expect that $\|f(Z) - f_Z\|$ behaves similarly if, for instance, $Z$ and $X$ are drawn i.i.d. from the same distribution.
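A minimal sketch of this cross-validation argument, reusing the hypothetical `predict` helper above; the random half/half split and the RMSE norm are our own choices.

```python
def cross_validation_error(X, fX, n_splits=5, seed=0):
    """Average RMSE of predict() over random partitions X = X1 u X2."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(len(X))
        half = len(X) // 2
        i1, i2 = perm[:half], perm[half:]
        fX2_pred = predict(X[i1], X[i2], fX[i1])
        errors.append(np.sqrt(np.mean((fX[i2] - fX2_pred) ** 2)))
    return float(np.mean(errors))

# Used as a statistical proxy for the (unknown) test error ||f(Z) - f_Z||.
print(cross_validation_error(X, np.sin(X[:, :1])))
```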
The error estimates below rely on a kernel $K : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ and on the associated discrepancy $d_K(\mu, \nu)$ between probability measures, introduced in (2.3). Under suitable assumptions [16], this defines a metric on the set of probability measures $\mathcal{M}$; for instance, $K$ is assumed to be a positive definite kernel, that is, the matrix $K(X, X) = \big(K(x^i, x^j)\big)_{i, j = 1, \ldots, N_X}$ is positive definite for all sequences of distinct points $X$. Observe that the discrete case $\delta_X = \frac{1}{N_X} \sum_i \delta_{x^i}$ (where $\delta_x$ is the Dirac mass) was introduced in [6]. Interestingly, this definition leads to a numerically tractable formula:
$$ d_K(\delta_X, \delta_Z)^2 = \frac{1}{N_X^2} \sum_{i,j=1}^{N_X} K(x^i, x^j) + \frac{1}{N_Z^2} \sum_{i,j=1}^{N_Z} K(z^i, z^j) - \frac{2}{N_X N_Z} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Z} K(x^i, z^j), \qquad (2.4) $$
also explored numerically in [11] in the case $d_K(\mu, \delta_Y)$. Discrepancy indicators are also available for neural network (deep learning) methods, although we do not know of any framework in which they are computed in practice. Indeed, these methods can be interpreted as linear kernel methods based on a nonsymmetric feature map, obtained by composing layers of the form $x \mapsto \sigma_i(W_i x + b_i)$ and taking a final scalar product $\langle \cdot, \cdot \rangle$ with an output weight vector. Here, $W_i \in \mathbb{R}^{D_i \times D_{i-1}}$ are the weight matrices, $b_i \in \mathbb{R}^{D_i}$ the bias terms, and $\sigma_i : \mathbb{R} \to \mathbb{R}$ the activation functions applied dimension-wise, for $i = 1, \ldots, L$, $L$ being the depth of the network. In the framework (2.1), deep learning methods define their set of internal parameters as $Y = \{W_i, b_i, \ i = 1, \ldots, L\}$.
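The discrete discrepancy (2.4) is straightforward to evaluate. The sketch below does so for the same illustrative Gaussian kernel as above; any other positive definite kernel could be substituted.

```python
def discrepancy(X, Z, kernel=gaussian_kernel):
    """Kernel discrepancy d_K(delta_X, delta_Z) computed from formula (2.4)."""
    d2 = kernel(X, X).mean() + kernel(Z, Z).mean() - 2.0 * kernel(X, Z).mean()
    return np.sqrt(max(d2, 0.0))  # clip tiny negative values due to round-off

print(discrepancy(X, Z))   # small when X and Z sample the same distribution
print(discrepancy(X, X))   # zero, up to round-off
```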
Here $H_{K,X} \subset H_K$ denotes a finite-dimensional functional space of dimension $N_X$, equipped with the norm induced by $\langle \cdot, \cdot \rangle_{H_K}$, which enters the estimate (2.5). This lower bound captures all the information available in the training data set. The estimate (2.5) has a quite natural interpretation: the integration error is split into two parts, the first one being the distance between the training set and the test set variables, and the second one being measured from the training set of values.
Observe also that the estimate (2.6) is very general and relates to many other performance indicators. For instance, one can deduce the pointwise estimate
$$ \Big| \varphi(x) - \int_{\mathbb{R}^D} \varphi(y) \, d\nu(y) \Big| \le d_K(\delta_x, \nu) \, \|\varphi\|_{H_K}, $$
or the discrete RMSE error estimator
$$ \|\varphi(Z) - \varphi_Z\|_{\ell^2} \le d_K(\delta_Z, \delta_X) \, \|\varphi\|_{H_K}. \qquad (2.7) $$
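The right-hand side of (2.7) is computable from the data alone once $\|\varphi\|_{H_K}$ is estimated; a common kernel-method estimator (our assumption here, not a statement from the text) is $\|\varphi\|_{H_K}^2 \approx f(X)^\top K(X, X)^{-1} f(X)$. A sketch comparing both sides of (2.7) on the toy data above:

```python
def rkhs_norm(X, fX, kernel=gaussian_kernel, reg=1e-8):
    """Assumed estimator of ||phi||_{H_K} from the training values."""
    K_XX = kernel(X, X) + reg * np.eye(len(X))
    return np.sqrt((fX.T @ np.linalg.solve(K_XX, fX)).item())

fX, fZ_true = np.sin(X[:, :1]), np.sin(Z[:, :1])
lhs = np.sqrt(np.mean((fZ_true - predict(X, Z, fX)) ** 2))   # discrete RMSE
rhs = discrepancy(X, Z) * rkhs_norm(X, fX)                   # computable indicator
print(lhs, rhs)
```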
The MNIST benchmark is formalized as follows. Given the training set of variables $X \in \mathbb{R}^{N_X \times D}$ (images of handwritten digits, $D = 784$), the training set of values (labels) $f(X) \in \mathbb{R}^{N_X \times D_f}$, $D_f = 10$, and the test set $Z \in \mathbb{R}^{N_Z \times D}$, $N_Z = 10000$, predict the label function $f(Z) \in \mathbb{R}^{N_Z \times D_f}$.
Performance is measured here with a common indicator used to compare labelled supervised learning methods, defined as the score
$$ \frac{1}{N_Z} \, \#\big\{\, f_{z^n} = f(z^n), \ n = 1, \ldots, N_Z \,\big\}, \qquad (2.8) $$
with $N_Z = 10000$. This produces an indicator between 0 and 1, higher being better. We benchmark here the performance profile, considering scores as a function of the training set size $N_X$. Our purpose in this presentation is not to discuss each method $m$ under consideration; all of the methods we consider have public documentation that the reader can consult. All these methods are used with their standard internal parameters, and the performance profiles are reported in Figure 1.
One of these methods, referred to as CodPy, is a kernel-based method and includes a computation of the discrepancy functional (2.4); we plot the corresponding results in Figure 2. One can see, and we checked this numerically, that the indicator $1 - d_K(\delta_X, \delta_Z)$ is a strict minorant of the score for this kernel method, in accordance with the error estimate (2.5) (observing that the score function (2.8) is normalized). The discrepancy is thus a reliable indicator to explain the performance profile of this method.
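As an illustration, the score (2.8) and the lower-bound indicator $1 - d_K(\delta_X, \delta_Z)$ discussed above can both be computed in a few lines; decoding labels with an argmax over the $D_f = 10$ one-hot columns is our own assumption about the label encoding.

```python
def classification_score(fZ_pred, fZ_true):
    """Score (2.8): fraction of test points whose predicted label is exact."""
    return float(np.mean(fZ_pred.argmax(axis=1) == fZ_true.argmax(axis=1)))

def score_lower_bound(X, Z):
    """Indicator 1 - d_K(delta_X, delta_Z), compared to the scores in Figure 2."""
    return 1.0 - discrepancy(X, Z)
```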
This property is illustrated now with the so-called Boston housing price dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It describes 506 houses through 13 attributes (variables or features), together with a target column containing the housing prices. Further details can be found in [7].
The benchmark is formalized as follows. Given the test set $Z \in \mathbb{R}^{N_Z \times D}$, $N_Z = 506$, $D = 13$, consider a subset $X \in \mathbb{R}^{N_X \times D} \subset Z$ as the training set of variables, together with the training set of values (labels) $f(X) \in \mathbb{R}^{N_X \times D_f}$, $D_f = 1$, and predict the housing prices $f(Z) \in \mathbb{R}^{N_Z \times D_f}$.
We plot the results of a benchmark in Figure 3, considering a normalized version of the RMSE error
$$ \frac{\|f(Z) - P_m(X, Y, Z, f(X))\|_{\ell^2}}{\|P_m(X, Y, Z, f(X))\|_{\ell^2} + \|f(Z)\|_{\ell^2}}, \qquad (2.9) $$
so that here lower is better. Considering Figure 3, there is one method reaching zero at $N_X = 506$, this very last point being the one where the training set and the test set coincide, $X = Z$. We say that a method $m$ satisfies the reproducibility property if it satisfies $d_K(\delta_X, \delta_X) = 0$, this notion being motivated by the error estimate (2.5).
Beyond accuracy, reproducibility is important for explainability purposes, as well as for data quality. Indeed, it implies that the predictions $f_Z$ can be expressed as a linear combination of the training set. This is a useful facet of explainability, allowing one to question predictions directly in terms of the training set and to challenge the input data for potential detection of outliers. Another important aspect of the reproducibility property is to ensure that no artificial noise is added to the training set, a welcome relief when computing differential operators as in (2.2) for numerical simulations.
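A minimal numerical check of this reproducibility property, using the illustrative kernel predictor introduced earlier: when the test set equals the training set, the normalized RMSE (2.9) vanishes, up to the small regularization we added.

```python
def normalized_rmse(fZ_true, fZ_pred):
    """Normalized RMSE score (2.9); lower is better."""
    return np.linalg.norm(fZ_true - fZ_pred) / (
        np.linalg.norm(fZ_pred) + np.linalg.norm(fZ_true))

fX = np.sin(X[:, :1])
print(normalized_rmse(fX, predict(X, X, fX)))   # ~0: the training data is reproduced
```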
These distances between sets of points are usually based on simpler distances, as for instance the Euclidean, the Manhattan, or the log-entropy distance, depending upon the problem under consideration. Consider any point $x \in \mathbb{R}^D$. Then $x$ is attached naturally to the centroid $y^{\sigma(x,Y)}$, where the index function $\sigma(x, Y)$ is defined as
$$ \sigma(x, Y) = \arg\min_{i = 1, \ldots, N_Y} d(x, y^i). \qquad (3.1) $$
It also defines the (Voronoi) cells $\Omega_i = \{x \in \mathbb{R}^D : d(x, y^i) = d(x, y^{\sigma(x,Y)})\}$, quantizing (i.e. partitioning) the space $\mathbb{R}^D$. A popular choice for the distance in (3.1), called the inertia, leads to the k-means algorithm, usually considered with the Euclidean distance $d(x, y) = |x - y|$:
$$ d(X, Y) = \sum_{n=1}^{N_X} d\big(x^n, y^{\sigma(x^n, Y)}\big). \qquad (3.2) $$
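A minimal sketch of the index function (3.1) and of the objective (3.2); we use the squared Euclidean distance, the usual k-means convention, which is our own choice relative to the text.

```python
def assign(X, Y):
    """Index function sigma(x, Y): nearest centroid for each point of X, as in (3.1)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return d2.argmin(axis=1)

def inertia(X, Y):
    """Clustering objective (3.2), here with the squared Euclidean distance."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).sum())

Y_centroids = X[np.random.default_rng(1).choice(len(X), size=10, replace=False)]
print(assign(X, Y_centroids)[:5], inertia(X, Y_centroids))
```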
The link between supervised learning (extrapolation) and unsupervised learning (such as clustering) is straightforward, as one can quantize any set of observations $f(X)$ with $f(Y)$, thus defining a learning machine, with the notation (2.1), that extrapolates from the centroids $Y, f(Y)$ instead of $X, f(X)$. Sharp discrepancy sequences are defined in the same spirit, but the considered distance is the discrepancy (2.3); they are designed to minimize the numerical integration error (2.5).
Minimization problems such as (3.1)–(3.4) are expected to be quite challenging to solve. For instance, if a sequence Ȳ is a solution to (3.1), then any permutation of Ȳ is also expected to provide a solution, so the functional $d(X, Y)$ has numerous global minima. Moreover, the functional $Y \mapsto d(X, Y)$ is not expected to be convex. We plot in Figure 4 two examples of a very simple discrepancy, starting from five one-dimensional points $X \in [-1, 1]^5$ (orange dots) and plotting the function $y \mapsto d_K(X, y)$ for two different kernels, a Gaussian and a Matérn one (see [2] for a definition of these kernels). We also plot a linear interpolation between the five points, so that the reader can appreciate the non-convexity of the discrepancy functional, even in this simple case. A consequence is that standard gradient-descent algorithms usually fail to find a global minimum, and more refined algorithms are needed, such as those described in [12] or [17]. In the next paragraph, we illustrate an example of the performance gain obtained with this strategy.
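The non-convexity is easy to visualize. The sketch below evaluates $y \mapsto d_K(\delta_X, \delta_y)$ for five one-dimensional points and the illustrative Gaussian kernel, in the spirit of Figure 4; the specific points and bandwidth are our own choices.

```python
import matplotlib.pyplot as plt

X1d = np.array([-0.9, -0.4, 0.0, 0.3, 0.8]).reshape(-1, 1)     # five 1-D points
ys = np.linspace(-1.5, 1.5, 400)
vals = [discrepancy(X1d, np.array([[y]])) for y in ys]          # y -> d_K(delta_X, delta_y)

plt.plot(ys, vals)
plt.scatter(X1d.ravel(), np.zeros(len(X1d)), color="orange")    # the five points
plt.xlabel("y")
plt.ylabel("d_K(delta_X, delta_y)")
plt.show()
```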
[Figure: benchmark of the CodPy clustering method against the k-means algorithm, showing scores, discrepancy errors, execution times, and inertia as functions of the number of centroids N_y.]
where $\varphi : X \to \mathbb{R}$ and $\psi : Z \to \mathbb{R}$ are discrete functions. As stated in [4], the three discrete problems (4.3), (4.4), and (4.2) are equivalent. We observe that the discrete Monge problem (4.2) is also known as the linear sum assignment problem (LSAP), solved in the early 1950s by an algorithm due to H.W. Kuhn, known as the Hungarian method (an algorithm often credited to Jacobi, 1890).
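For reference, the LSAP can be solved directly with SciPy's implementation of this type of algorithm; the quadratic cost $c(x, z) = |x - z|^2$ and the two Gaussian point clouds below are our own illustrative choices.

```python
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
A = rng.normal(size=(64, 2))                 # samples of a source measure
B = rng.normal(size=(64, 2)) + 1.0           # samples of a target measure

C = cdist(A, B, metric="sqeuclidean")        # cost matrix c(x^n, z^m)
rows, sigma = linear_sum_assignment(C)       # Hungarian-type solver for the LSAP
print(C[rows, sigma].sum())                  # optimal assignment (transport) cost
```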
For the continuous case, under suitable conditions on $\nu$, $\mu$ (having compact, connected, smooth support), any transport map $S$, with $S_\# \nu = \mu$, can be polar factorized as
$$ S(x) = \bar{S} \circ T(x), \qquad T_\# \nu = \nu, \qquad (4.5) $$
where $\bar{S}$ is the unique solution to the Monge problem (4.1), having the property of being the gradient of a $c$-convex potential, $\bar{S}(x) = \exp_x\!\big({-\nabla h(x)}\big)$, $\exp_x$ being the standard notation for the exponential map in Riemannian geometry. A scalar function $h$ is said to be $c$-convex if $h^{cc} = h$, where $h^c(z) = \inf_x \big(c(x, z) - h(x)\big)$ is called the infimal $c$-convolution. Standard convexity coincides with $c$-convexity for convex cost functions such as the Euclidean one, in which case the following polar factorization holds: $S(x) = (\nabla h) \circ T(x)$, with $h$ convex.
Consider now a learning machine (2.1), for which the discrepancy (2.3) is defined, and take the cost function $c(x, z) = d_K(\delta_x, \delta_z)$. Consider as above two discrete measures $\mu, \nu = \delta_X, \delta_Z$,
defining the map $S(x^n) = z^n$. In this setting, determining the measure-preserving map $T$ appearing in the right-hand side of the polar factorization (4.5) amounts to finding the permutation
$$ \sigma = \arg\inf_{\sigma \in \Sigma} \sum_{n=1}^{N} d_K\big(\delta_{x^n}, \delta_{z^{\sigma(n)}}\big). \qquad (4.6) $$
Then, considering a differentiable learning machine (2.2), a discrete polar factorization consists in solving the following equation for the unknown potential $h$:
$$ Z^\sigma = \exp_X\!\big({-\nabla_X P_m(X, Y, X, h(X))}\big). \qquad (4.7) $$
Such algorithms can be implemented for any differentiable, error-based learning machine.
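Since the permutation problem (4.6) is itself an LSAP, it can be solved with the same tool as above, now with the kernel-induced cost $c(x, z) = d_K(\delta_x, \delta_z)$. The sketch below, reusing the point clouds A, B, the illustrative Gaussian kernel, and the solver imported earlier, writes this cost explicitly.

```python
def kernel_cost_matrix(X, Z, kernel=gaussian_kernel):
    """C[n, m] = d_K(delta_{x^n}, delta_{z^m}) = sqrt(K(x,x) + K(z,z) - 2 K(x,z))."""
    Kxx = np.diag(kernel(X, X))[:, None]
    Kzz = np.diag(kernel(Z, Z))[None, :]
    return np.sqrt(np.maximum(Kxx + Kzz - 2.0 * kernel(X, Z), 0.0))

def polar_permutation(X, Z):
    """Permutation sigma solving (4.6), posed as a linear sum assignment problem."""
    _, sigma = linear_sum_assignment(kernel_cost_matrix(X, Z))
    return sigma

sigma = polar_permutation(A, B)   # Z^sigma = B[sigma] is the rearranged target cloud
```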
$I_{N_X}$ being the identity matrix. The solution $t \mapsto \Pi_{X(t)}$ defines a path of stochastic matrices, that is, of transition probability matrices, and $\Pi_{X(t)}(m, n)$ represents the probability of the discrete Markov chain jumping from the state $x^m(t)$ to the state $x^n(T)$. The particular case $\varphi(t, X(t)) = X(t)$ leads to the relation $X(t) = \Pi_{X(t)} X(T)$.
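A small sanity-check sketch of these objects: given a candidate matrix Π, one can verify that it is row-stochastic and that it maps the terminal states X(T) to the intermediate states X(t), as in the relation above. The matrix and states used here are hypothetical examples, not the output of the algorithm described in the text.

```python
def is_stochastic(Pi, tol=1e-10):
    """Check that Pi has nonnegative entries and rows summing to one."""
    return bool((Pi >= -tol).all() and np.allclose(Pi.sum(axis=1), 1.0))

def conditional_expectation(Pi, X_T):
    """Relation X(t) = Pi_{X(t)} X(T): expectation of terminal states from time t."""
    return Pi @ X_T

Pi = np.full((4, 4), 0.25)                       # hypothetical transition matrix
X_T = np.array([[0.9], [1.1], [0.8], [1.2]])     # terminal states X(T)
print(is_stochastic(Pi), conditional_expectation(Pi, X_T).ravel())
```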
argument to ensure the stability of the recurrent scheme (4.17). Thus we provide an alternative method in the next paragraph.
[Figure 10. Bachelier problem. Left: training set b(Z), f(X); right: test set b(X), f(Z|X), both plotted against basket values.]
• X ∈ R^{N_X×D} is given by i.i.d. samples of the Brownian motion X(t_1) at time t_1 = 1. The reference values are f(Z|X) ∈ R^{N_X×1}, computed using (4.20).
For each method, the output is f_{Z|X} ∈ R^{N_Z×D_f}, approximating (4.15), and is compared to f(Z|X) in our experiments. We plot the generated training and test sets in Figure 10, comparing the observed variable f_Z and the reference values f(Z|X). Thus the problem can be stated as: knowing the noisy data on the left-hand side, deduce those on the right.
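The construction of such a benchmark can be sketched as follows, in a one-dimensional setting and with our own choices for the payoff, the dates, and the strike (the paper's exact specification (4.20) is not reproduced here): Brownian samples at $t_1 = 1$ serve as the variables, noisy payoffs observed at a later date $t_2$ serve as the values, and the closed-form conditional expectation of the Bachelier (normal) model serves as the reference $f(Z|X)$.

```python
from scipy.stats import norm

rng = np.random.default_rng(3)
N, t1, t2, strike = 512, 1.0, 2.0, 1.0

X_t1 = 1.0 + rng.normal(scale=np.sqrt(t1), size=N)         # Brownian samples at t1
X_t2 = X_t1 + rng.normal(scale=np.sqrt(t2 - t1), size=N)   # samples at the later date t2
payoff = np.maximum(X_t2 - strike, 0.0)                    # noisy observed values

def bachelier_call(x, tau, k):
    """E[(X_{t2} - k)^+ | X_{t1} = x] in the Bachelier (normal) model."""
    s = np.sqrt(tau)
    d = (x - k) / s
    return (x - k) * norm.cdf(d) + s * norm.pdf(d)

reference = bachelier_call(X_t1, t2 - t1, strike)          # reference values f(Z | X)
```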
[Figure 11. Exact and predicted values for sharp discrepancy sequences, plotted against basket values (five panels, Pi:sharp).]
[Figure 12. Benchmark of scores for the methods ANN, Pi:i.i.d., Pi:sharp, and codpy pred, as functions of the training set size.]
To illustrate a typical benchmark run of one of these four methods, Figure 11 shows the predicted values $f_{Z|X}$ against the exact ones $f(Z|X)$, as functions of the basket values $b(Z)$, for the last method (SDS). We show five runs of the method with $N_X = N_Z = 32, 64, 128, 256, 512$.
Figure 12 presents a benchmark of the scores, computed according to the normalized RMSE (2.9) (lower is better), for the two-dimensional case $D = 2$; the results are, however, similar whatever the dimension.
We emphasize that the horizontal axis in Figure 12 is the training set size $N_X$ in log-scale. This test shows numerically that both predictive methods based on (4.21) are not converging. The method Pi:i.i.d. (in yellow) shows a performance profile with a convergence pattern at the statistical rate $1/\sqrt{N_X}$, that is, the one expected with randomly sampled data. The method Pi:sharp (in green) illustrates the performance gains obtained when using the proposed sharp discrepancy sequences.
Bibliography
[1] I. Babuska, U. Banerjee, and J.E. Osborn, Survey of mesh-less and generalized finite element
methods: a unified approach, Acta Numer. 12 (2003), 1–125.
[2] A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics, Springer Science+Business Media, 2004.
[3] Y. Brenier, Polar factorization and monotone rearrangement of vector-valued functions, Comm.
Pure Applied Math. 44 (1991), 375–417.
[4] H. Brezis, Remarques sur le problème de Monge–Kantorovich dans le cas discret, Comptes Rendus Math. 356 (2018), 207–213.
[5] G.E. Fasshauer, Mesh-free methods, in “Handbook of Theoretical and Computational Nanotech-
nology”, Vol. 2, 2006.
[6] A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, and A.J. Smola, A kernel method for the two-sample problem, Proc. 19th Int. Conf. on Neural Information Processing Systems, 2006, pp. 513–520.
[7] D. Harrison and D.L. Rubinfeld, Hedonic prices and the demand for clean air, J. Environ.
Economics & Management 5 (1978), 81–102.
[8] T. Hofmann, B. Schölkopf, and A.J. Smola, Kernel methods in machine learning, Ann. Statist. 36 (2008), 1171–1220.
[9] P.G. LeFloch and J.-M. Mercier, A new method for solving Kolmogorov equations in math-
ematical finance, C. R. Math. Acad. Sci. Paris 355 (2017), 680–686.
[10] P.G. LeFloch and J.-M. Mercier, The Transport-based Mesh-free Method (TMM). A short
review, The Wilmott journal 109 (2020), 52–57. Available at ArXiv:1911.00992.
[11] P.G. LeFloch and J.-M. Mercier, Mesh-free error integration in arbitrary dimensions: a nu-
merical study of discrepancy functions, Comput. Methods Appl. Mech. Engrg. 369 (2020), 113245.
[12] P.G. LeFloch, J.-M. Mercier, and S. Miryusupov, CodPy: a Python library for machine learning, statistics, and numerical simulations, monograph in preparation. Code available at https://pypi.org/project/codpy/.
[13] R. McCann, Polar factorization of maps on Riemannian manifolds, Geom. Funct. Anal. 11 (2001),
589–608.
[14] R. Sinkhorn and P. Knopp, Concerning nonnegative matrices and doubly stochastic matrices,
Pacific J. Math. 21 (1967), 343–348.
[15] I.M. Sobol, Distribution of points in a cube and approximate evaluation of integrals, U.S.S.R. Comput. Math. Math. Phys. 7 (1967), 86–112.
[17] O. Teymur, J. Gorham, M. Riabiz, and C.J. Oates, Optimal quantisation of probability measures using maximum mean discrepancy, Proc. 24th Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2021), PMLR 130, pp. 1027–1035.
[18] T. Wenzel, G. Santin, and B. Haasdonk, Universality and optimality of structured deep
kernel networks, ArXiv:2105.07228.