Generic Bounds on the Approximation Error for Physics-Informed (and) Operator Learning
Tim De Ryck ∗
Siddhartha Mishra †
Abstract
We propose a very general framework for deriving rigorous bounds on the approxi-
mation error for physics-informed neural networks (PINNs) and operator learning
architectures such as DeepONets and FNOs as well as for physics-informed oper-
ator learning. These bounds guarantee that PINNs and (physics-informed) Deep-
ONets or FNOs will efficiently approximate the underlying solution or solution
operator of generic partial differential equations (PDEs). Our framework utilizes
existing neural network approximation results to obtain bounds on more involved
learning architectures for PDEs. We illustrate the general framework by deriving
the first rigorous bounds on the approximation error of physics-informed opera-
tor learning and by showing that PINNs (and physics-informed DeepONets and
FNOs) mitigate the curse of dimensionality in approximating nonlinear parabolic
PDEs.
1 Introduction
The efficient numerical approximation of partial differential equations (PDEs) is of paramount im-
portance as PDEs mathematically describe an enormous range of interesting phenomena in the sci-
ences and engineering. Machine learning techniques, particularly deep learning, are playing an
increasingly important role in this context. For instance, given their universal approximation proper-
ties, deep neural networks serve as ansatz spaces for supervised learning of a variety of (parametric)
PDEs, see [19, 70, 40, 53, 54] and references therein. In this setting, large amounts of training data
might be required. However, this data is often acquired from expensive computer simulations or
physical measurements [53], necessitating the design of learning frameworks that work with lim-
ited data. Physics-informed neural networks (PINNs), proposed by [18, 42, 41] and popularized by
[67, 68], are a prominent example of such a learning framework as the residual of the underlying
PDE is minimized within the class of neural networks and in principle, little (or even no) training
data is required. PINNs and their variants have proven to be a very powerful and computationally
efficient framework for approximating solutions to PDEs, see [69, 51, 55, 65, 76, 34, 35, 61, 59, 60, 2]
and references therein.
Often in the context of PDEs, one needs to approximate the underlying solution operator that maps
one infinite-dimensional function space into another [27, 39]. As neural networks can only map
between finite dimensional spaces, a new field of operator learning is emerging wherein novel
learning frameworks need to be designed in order to approximate operators. These include deep
operator networks (DeepONets) [9, 49] and their variants as well as neural operators [39], which
generalize neural networks to this setting. A variety of neural operators have been proposed, see
[45, 46], but arguably the most efficient form of neural operators is provided by the so-called Fourier neural operator (FNO) [44].
∗ Seminar for Applied Mathematics (SAM), D-MATH, ETH Zürich, Switzerland
† Seminar for Applied Mathematics (SAM), D-MATH and ETH AI center, ETH Zürich, Switzerland
2 Preliminaries
2.1 Setting
Given T > 0 and D ⊂ R^d compact, consider the function u : [0,T] × D → R^m, for m ≥ 1, that belongs to a function space H and solves the following (time-dependent) PDE,

$$L_a(u)(t,x) = 0 \quad\text{and}\quad u(0,x) = u_0(x) \qquad \forall\,(t,x)\in[0,T]\times D, \qquad (2.1)$$

where u_0 ∈ Y ⊂ L²(D) is the initial condition and L_a : H → L²([0,T]×D) is a differential operator that can depend on a parameter (function) a ∈ Z ⊂ L²(D). In our notation, we will often suppress the dependence of L := L_a on a for simplicity. Depending on the context, one might want to recover one of the following mathematical objects: for fixed a and u_0, one might want to approximate u(T,·) or u(·,·) with a neural network; a more challenging task would be to learn the solution operator G : X → L²(Ω) : v ↦ u, where v ∈ {u_0, a}, X ∈ {Y, Z} and Ω = D or Ω = [0,T] × D. We will use this notation consistently throughout the paper, see SM A.1 for an overview.
PINNs Physics-informed neural networks (PINNs) are neural networks that are trained with a
different, residual-based loss function. As the PDE solution u satisfies L(u) = 0, the goal of
physics-informed learning is to find a neural network uθ : [0, T ] × D → R for which the PDE
residual is approximately zero, L(uθ ) ≈ 0. To ensure uniqueness, one also needs to require that the
initial condition is satisfied, i.e. u_θ(0,x) ≈ u_0(x), and similarly for boundary conditions. In practice, one minimizes a quadrature approximation of $J(\theta) = \|L(u_\theta)\|^2_{L^2([0,T]\times D)} + \|u_\theta(0,\cdot) - u_0\|^2_{L^2(D)}$, where additional terms can be added to (approximately) impose boundary conditions and augment
the loss function using data. A desirable property of PINNs is that only very little or even no training
data is needed to construct the loss function.
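For concreteness, here is a minimal sketch of assembling such a residual-based loss, for the 1-d heat equation u_t − u_xx = 0 with the hypothetical initial condition u_0(x) = sin(πx) and Monte Carlo points in place of a quadrature rule; all of these choices are illustrative and not the paper's setting.

```python
import math
import torch

# Minimal sketch (illustrative, not the paper's construction): the PINN loss for
# the 1-d heat equation u_t - u_xx = 0 on [0,1] x [0,1], with the hypothetical
# initial condition u0(x) = sin(pi x). Monte Carlo sampling plays the role of
# the quadrature approximation of J(theta).
torch.manual_seed(0)
u_theta = torch.nn.Sequential(                 # tanh network u_theta(t, x)
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def pinn_loss(n_int=1024, n_init=256):
    # interior collocation points for the PDE residual L(u_theta)
    t = torch.rand(n_int, 1, requires_grad=True)
    x = torch.rand(n_int, 1, requires_grad=True)
    u = u_theta(torch.cat([t, x], dim=1))
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    residual = u_t - u_xx                      # L(u_theta) at the collocation points
    # points on {t = 0} for the initial-condition mismatch
    x0 = torch.rand(n_init, 1)
    u0 = torch.sin(math.pi * x0)
    init_err = u_theta(torch.cat([torch.zeros_like(x0), x0], dim=1)) - u0
    return residual.pow(2).mean() + init_err.pow(2).mean()

print(pinn_loss().item())   # value of the (untrained) physics-informed loss
```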
Operator learning In order to approximate operators, one needs to allow the input and output of the learning architecture to be infinite-dimensional. A possible approach is to use deep operator networks (DeepONets), as proposed in [9, 49]. Given m fixed sensor locations {x_j}_{j=1}^m ⊂ D and the corresponding sensor values {v(x_j)}_{j=1}^m as input, a DeepONet can be formulated in terms of two (deep) neural networks: a branch net β : R^m → R^p and a trunk net τ : D → R^{p+1}. The branch and trunk nets are then combined to approximate the underlying nonlinear operator as the following DeepONet G_θ : X → L²(D), with

$$G_\theta(v)(y) = \tau_0(y) + \sum_{k=1}^{p}\beta_k(v)\,\tau_k(y).$$

A second approach is that of neural operators, which generalize hidden layers by including a non-local integral operator [45], of which particularly Fourier neural operators (FNOs) [44] are already well-established. The practical implementation (i.e. discretization) of an FNO maps from and to the space of trigonometric polynomials of degree at most N ∈ N, denoted by L²_N, and can be identified with a finite-dimensional mapping that is a composition of affine maps and nonlinear layers of the form

$$\mathcal{L}_l(z)_j = \sigma\big(W_l z_j + b_{l,j} + F_N^{-1}\big(P_l(k)\cdot F_N(z)(k)\big)_j\big),$$

where the P_l(k) are coefficients that define a non-local convolution operator via the discrete Fourier transform F_N, see [38].
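A minimal sketch of a DeepONet forward pass with random, untrained weights follows; the sizes m, p, the hidden width and the input function v are illustrative assumptions only.

```python
import numpy as np

# Sketch of a DeepONet forward pass: branch net beta on sensor values, trunk net
# tau on evaluation points, combined as tau_0(y) + sum_k beta_k(v) tau_k(y).
rng = np.random.default_rng(0)
m, p, width = 50, 16, 64                      # sensors, basis size p, hidden width

def mlp(sizes):
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, z):
    for W, b in params[:-1]:
        z = np.tanh(z @ W + b)
    W, b = params[-1]
    return z @ W + b

branch = mlp([m, width, width, p])            # beta : R^m -> R^p
trunk = mlp([1, width, width, p + 1])         # tau : D subset of R -> R^{p+1}

x_sensors = np.linspace(0.0, 1.0, m)          # fixed sensor locations in D = [0, 1]
v = np.sin(2 * np.pi * x_sensors)             # sensor values v(x_j) of one input function
y = np.linspace(0.0, 1.0, 200)[:, None]       # evaluation points for G_theta(v)

beta = forward(branch, v[None, :])[0]         # beta(v), shape (p,)
tau = forward(trunk, y)                       # (tau_0(y), ..., tau_p(y)), shape (200, p+1)
G_v = tau[:, 0] + tau[:, 1:] @ beta           # G_theta(v)(y)
print(G_v.shape)                              # (200,)
```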
Physics-informed operator learning Both DeepONets and FNOs are trained by choosing a suitable probability measure μ on X and minimizing a quadrature approximation of $J(\theta) = \|G_\theta(v) - G(v)\|_{L^2_{\mu\times dx}(X\times\Omega)}$. Generating training sets might require many calls to an expensive PDE solver, leading to an enormous computational cost. In order to reduce or even fully eliminate the need for training data, physics-informed operator learning has been proposed in [74] for DeepONets and in [47] for FNOs. Similar to PINNs, the training procedure aims to minimize a quadrature approximation of $J(\theta) = \|L(G_\theta)\|_{L^2_{\mu\times dx}(X\times\Omega)}$.
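A toy sketch of this data-free training idea (an illustrative setup of ours, not that of [74, 47]): learn the antiderivative operator G(v)(t) = ∫_0^t v(s) ds for the parametric inputs v_a(t) = cos(at) by minimizing only the residual d/dt G_θ(v_a)(t) − v_a(t) and the initial condition, without ever generating pairs (v, G(v)).

```python
import torch

# Toy physics-informed operator learning: no solution data is used, only the
# ODE residual of G_theta(v) and the condition G_theta(v)(0) = 0.
torch.manual_seed(0)
m = 20                                            # number of sensor points on [0, 1]
s_grid = torch.linspace(0.0, 1.0, m)
net = torch.nn.Sequential(torch.nn.Linear(m + 1, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))

def G_theta(a, t):                                # a: (n,1) parameters of v_a, t: (n,1)
    v_sens = torch.cos(a * s_grid)                # sensor values v_a(s_j), shape (n, m)
    return net(torch.cat([v_sens, t], dim=1))

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    a = 1.0 + 2.0 * torch.rand(256, 1)            # sample input functions v_a, a ~ U[1, 3]
    t = torch.rand(256, 1, requires_grad=True)    # collocation points in time
    u = G_theta(a, t)
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    residual = u_t - torch.cos(a * t)             # residual of G_theta(v_a) at (a, t)
    loss = residual.pow(2).mean() + G_theta(a, torch.zeros_like(t)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# sanity check against the exact antiderivative sin(a t) / a
a0, t0 = torch.tensor([[2.0]]), torch.tensor([[0.7]])
print(G_theta(a0, t0).item(), (torch.sin(a0 * t0) / a0).item())
```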
3 General results
We propose a framework to obtain bounds on the approximation error for the various neural network
architectures introduced in Section 2.2. Figure 1 visualizes how different types of error estimates
can be obtained from one another. Every box shows the name of the network architecture, the form
of the relevant loss and the theorem which proves the corresponding estimate for the approximation
error. Every arrow in the flowchart represents a proof technique that allows one to transfer an error
estimate from one type of method to another (see the caption of Figure 1 for an overview of these techniques).

[Figure 1 (flowchart). Top row: neural network at a fixed time, ‖u(T) − u_θ(T)‖_{L^q(D)} < ε (assumed to be known); FNO, ‖G − G_θ‖_{L²(X×D)} < ε (Theorem 3.7); DeepONet, ‖G − G_θ‖_{L²(X×D)} < ε (Corollary 3.8). Bottom row: PINN, ‖L(u_θ)‖_{L^q([0,T]×D)} < ε (Theorem 3.5); physics-informed FNO, ‖L(G_θ)‖_{L²(X×Ω)} < ε (Theorem 3.9); physics-informed DeepONet, ‖L(G_θ)‖_{L²(X×Ω)} < ε (Theorems 3.9 & 3.10).]
Figure 1: Flowchart of the structure of the results in this paper, with q ∈ {2, ∞}. The letters reflect the techniques used in the proofs: A uses Taylor approximations (Section 3.1), B is based on finite difference approximations (Section 3.1), C uses trigonometric polynomial interpolation (Section 3.2) and D uses the connection between FNOs and DeepONets (Section 3.2).
We give particular attention to the case where it is known that a neural network can efficiently
approximate the solution to a time-dependent PDE at a fixed time. Such neural networks are usually
obtained by emulating a classical numerical method. Examples include finite difference schemes,
finite volume schemes, finite element methods, iterative methods and Monte Carlo methods, e.g.
[36, 63, 10, 57]. More precisely, for ε > 0, we assume to have access to an operator U ε : X ×
[0, T ] → H that for any t ∈ [0, T ] maps any initial condition/parameter function v ∈ X to a neural
network U ε (v, t) that approximates the PDE solution G(v)(·, t) = u(·, t)∈ Lq (D), q ∈ {2, ∞}, at
time t, as specified below. Moreover, we will assume that we know how its size depends on the
accuracy ε. Explicit examples of the operator U ε will be given in Section 4 and SM C.
Assumption 3.1. Let q ∈ {2, ∞}. For any B, ε > 0, ℓ ∈ N, t ∈ [0,T] and any v ∈ X with ‖v‖_{C^ℓ} ≤ B there exist a neural network U^ε(v,t) : D → R and a constant C^B_{ε,ℓ} > 0 s.t.

$$\|G(v)(\cdot,t) - U^\varepsilon(v,t)\|_{L^q(D)} \le \varepsilon \quad\text{and}\quad \|U^\varepsilon(v,t)\|_{C^\ell(D)} \le C^B_{\varepsilon,\ell}. \qquad (3.1)$$
Remark 3.2. For vanilla neural networks and PINNs one can set X := {v}, G(v) := u and v := u0
or v := a in Assumption 3.1 above and Assumption 3.4 below.
Under this assumption, we prove the existence of space-time neural networks and PINNs that effi-
ciently approximate the PDE solution (Section 3.1), as well as FNOs and DeepONets (Section 3.2)
and physics-informed FNOs and DeepONets (Section 3.3). Finally, we also prove a general result
on the generalization error (Section 3.4).
We will construct a space-time neural network uθ for which both kuθ − ukLq ([0,T ]×D) and the PINN
loss kL(uθ )kLq ([0,T ]×D) are small. To accurately approximate the time derivatives of u we emulate
Taylor expansions, whereas for the spatial derivatives we employ finite difference (FD) operators
in our proofs. Depending on whether forward, backward or central differences are used, a FD
operator might not be defined on the whole domain D, e.g. for f ∈ C([0,1]) the (forward) operator $\Delta^+_h[f](x) := f(x+h) - f(x)$ is not well-defined for x ∈ (1−h, 1]. This can be solved by resorting
to piecewise-defined FD operators, e.g. a forward operator on [0, 0.5] and a backward operator on
(0.5, 1]. In a general domain Ω one can find a well-defined piecewise FD operator if Ω satisfies the
following assumption, which is satisfied by many domains (e.g. rectangular, smooth).
Assumption 3.3. There exists a finite partition P of Ω such that for all P ∈ P there exists ε_P > 0 and v_P ∈ B^1_∞ = {x ∈ R^{dim(Ω)} : ‖x‖_∞ ≤ 1} such that for all x ∈ P it holds that x + ε_P(v_P + B^1_∞) ⊂ Ω.
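A small numerical sketch of the piecewise construction described above, on D = [0,1] with the forward/backward split at 0.5; the test function is an arbitrary choice.

```python
import numpy as np

# Piecewise first-order finite difference on [0, 1]: forward differences on
# [0, 0.5], backward differences on (0.5, 1], so the operator is well defined
# on the whole domain.
def piecewise_fd(f, x, h):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    left = x <= 0.5
    out[left] = (f(x[left] + h) - f(x[left])) / h       # forward, stays inside [0, 1]
    out[~left] = (f(x[~left]) - f(x[~left] - h)) / h    # backward, stays inside [0, 1]
    return out

f = np.exp                                              # test function with f' = exp
x = np.linspace(0.0, 1.0, 11)
for h in [1e-2, 1e-3]:
    err = np.max(np.abs(piecewise_fd(f, x, h) - f(x)))
    print(f"h = {h:g}: max error = {err:.2e}")          # decreases like O(h)
```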
Additionally, we need to assume that the PINN error can be bounded in terms of the errors related to all relevant partial derivatives, denoted by $D^{(k,\alpha)} := D_t^k D_x^\alpha := \partial_t^k\partial_{x_1}^{\alpha_1}\cdots\partial_{x_d}^{\alpha_d}$, for $(k,\alpha)\in\mathbb{N}_0^{d+1}$.
This assumption is valid for many classical solutions of PDEs. A few worked out examples can be
found in SM D.5 (gravity pendulum) and SM D.6 (Darcy flow).
Assumption 3.4. Let k, ℓ ∈ N, q ∈ {2, ∞} and C > 0 be independent of d. For all v ∈ X it holds that,

$$\|L(G_\theta(v))\|_{L^q([0,T]\times D)} \;\le\; C\cdot\mathrm{poly}(d)\cdot\!\!\sum_{\substack{(k',\alpha)\in\mathbb{N}_0^{d+1}\\ k'\le k,\ \|\alpha\|_1\le\ell}}\!\!\big\|D^{(k',\alpha)}(G - G_\theta)\big\|_{L^q([0,T]\times D)}. \qquad (3.2)$$
In this setting, we prove the following approximation result for space-time networks and PINNs.
Theorem 3.5. Let s, r ∈ N, let u ∈ C^{(s,r)}([0,T]×D) be the solution of the PDE (2.1) and let Assumption 3.1 be satisfied. There exists a constant C(s,r) > 0 such that for every M ∈ N and ε, h > 0 there exists a tanh neural network u_θ : [0,T]×D → R for which it holds that,

$$\|u_\theta - u\|_{L^q([0,T]\times D)} \le C\big(\|u\|_{C^{(s,0)}}\,M^{-s} + \varepsilon\big), \qquad (3.3)$$

and if additionally Assumption 3.3 and Assumption 3.4 hold then,

$$\|L(u_\theta)\|_{L^2([0,T]\times D)} + \|u_\theta - u\|_{L^2(\partial([0,T]\times D))} \le C\cdot\mathrm{poly}(d)\cdot\ln^k(M)\Big(\|u\|_{C^{(s,\ell)}}\,M^{k-s} + M^{2k}\big(\varepsilon h^{-\ell} + C^B_{\varepsilon,\ell}\,h^{r-\ell}\big)\Big). \qquad (3.4)$$

Moreover, depth(u_θ) ≤ C · depth(U^ε) and width(u_θ) ≤ C · M · width(U^ε).
Proof. We only provide a sketch of the full proof (SM B.2). The main idea is to divide [0, T ]
into M uniform subintervals and construct a neural network that approximates a Taylor approxi-
mation in time of u in each subinterval. In the obtained formula, we approximate the monomials
and multiplications by neural networks (SM A.7) and approximate the derivatives of u by finite
differences and use (A.2) of SM A.2 to find an error estimate in C k ([0, T ], Lq (D))-norm. We
use again finite difference operators to prove that spatial derivatives of u are accurately approxi-
mated as well. The neural network will also approximately satisfy the initial/boundary conditions, as $\|u_\theta - u\|_{L^2(\partial([0,T]\times D))} \lesssim C\,\mathrm{poly}(d)\,\|u_\theta - u\|_{H^1([0,T]\times D)}$, which follows from a Sobolev trace inequality.
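In schematic form, the construction of SM B.2 (see also Lemma B.2) reads

$$u_\theta(t,x) \;\approx\; \sum_{m=1}^{M}\Phi^M_m(t)\,\sum_{i=0}^{s-1}\frac{(t-t_m)^i}{i!}\,M^i\,\Delta^{i,s-i}_{1/M,t}\big[U^\varepsilon(u_0,t_m)\big](x),$$

where the Φ^M_m form an approximate partition of unity in time (SM Definition B.4) and the monomials (t−t_m)^i as well as the products are themselves replaced by small tanh networks (SM A.7).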
We note that the bounds (3.3) and (3.4) together imply that there exists a neural network for which the total error as well as the PINN loss can be made arbitrarily small, providing a solid theoretical foundation for PINNs approximating the PDE (2.1).
In this section, we use Assumption 3.1 to prove estimates for DeepONets and FNOs. First, we prove
a generic error estimate for FNOs. Using the known connection between FNOs and DeepONets
(SM Lemma B.6) this result can then easily be applied to DeepONets (Corollary 3.8). In order to
prove these error estimates, we need to assume that the operator U ε from Assumption 3.1 is stable
with respect to its input function, as specified in Assumption 3.6 below. Moreover, we will take the
d-dimensional torus as domain D = Td = [0, 2π)d and assume periodic boundary conditions for
simplicity in what follows. This is not a restriction, as for every Lipschitz subset of Td there exists a
(linear and continuous) Td -periodic extension operator of which also the derivatives are Td -periodic
[38, Lemma 41].
Assumption 3.6. Assumption 3.1 is satisfied and let p ∈ {2, ∞}. For every ε > 0 there exists a constant C^ε_stab > 0 such that for all v, v′ ∈ X it holds that,

$$\|U^\varepsilon(v,T) - U^\varepsilon(v',T)\|_{L^2} \le C^\varepsilon_{\mathrm{stab}}\,\|v - v'\|_{L^p}. \qquad (3.5)$$
Theorem 3.7. Let r ∈ N, T > 0, let G : C^r(T^d) → C^r(T^d) be an operator that maps a function u_0 to the solution u(·,T) of the PDE (2.1) with initial condition u_0, let Assumption 3.6 be satisfied and let p* ∈ {2, ∞} \ {p}. Then there exists a constant C > 0 such that for every ε > 0, N ∈ N there is an FNO G_θ : L²_N(T^d) → L²_N(T^d) of depth O(depth(U^ε)) and width O(N^d width(U^ε)) with accuracy,

$$\|G - G_\theta\|_{L^2} \le C\big(\varepsilon + C^\varepsilon_{\mathrm{stab}}\,B\,N^{-r+d/p^*} + C^B_{\varepsilon,r}\,N^{-r}\big). \qquad (3.6)$$
Proof. We give a sketch of the proof, details can be found in SM B.3. Given function values of v
on a uniform grid with grid size 1/N , we use trigonometric polynomial interpolation (SM A.6) to
reconstruct v and use this together with Assumption 3.1 to construct a neural network. The resulting
approximation is then projected onto the space L2N , of trigonometric polynomials of degree at most
N ∈ N, again through trigonometric polynomial interpolation.
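A minimal one-dimensional illustration of technique C, trigonometric interpolation from uniform grid values via the FFT (a toy example, not the d-dimensional construction used in the proof):

```python
import numpy as np

# Reconstruct a smooth 2*pi-periodic function from N uniform samples by
# trigonometric (Fourier) interpolation and evaluate it at arbitrary points.
def trig_interp(samples, x_eval):
    N = len(samples)                      # number of uniform grid points on [0, 2*pi)
    coeffs = np.fft.fft(samples) / N      # discrete Fourier coefficients
    k = np.fft.fftfreq(N, d=1.0 / N)      # integer wavenumbers 0..N/2-1, -N/2..-1
    return np.real(np.exp(1j * np.outer(x_eval, k)) @ coeffs)

N = 32
grid = 2 * np.pi * np.arange(N) / N
v = np.exp(np.sin(grid))                  # smooth periodic test function
x = np.linspace(0, 2 * np.pi, 200)
err = np.max(np.abs(trig_interp(v, x) - np.exp(np.sin(x))))
print(f"max interpolation error: {err:.2e}")   # spectrally accurate for smooth v
```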
A recent result, [38, Theorem 36] (SM Lemma B.6), shows that any error bound for FNOs also
implies an error bound for DeepONets, by choosing the trunk nets as neural network approximations
of the Fourier basis. We apply this result with ε ∼ poly(1/N ) to Theorem 3.7 to obtain the following
generic error bound for DeepONets.
Corollary 3.8. Assume the setting of Theorem 3.7. Then for every ε > 0, N ∈ N and every corresponding FNO G_θ from Theorem 3.7 there exists a DeepONet G*_θ : X → L²(D) with width(β) = O(N^d), depth(β) = O(depth(G_θ)), width(τ) = O(N^{d+1}) and depth(τ) ≤ 3 that satisfies (3.6).
Using the techniques from previous sections, we now present the very first theoretical result for
physics-informed operator learning. We demonstrate that if an error estimate for a DeepONet/FNO
and the growth of its derivatives are known (see SM D.1 on how to obtain these), then one can
prove an error estimate for the corresponding physics-informed DeepONet/FNO. For simplicity,
the following result focuses only on operators mapping to C r (D) but the generalization to e.g.
C r ([0, T ] × D) is immediate by considering D′ := [0, T ] × D.
Theorem 3.9. Consider an operator G : X → C^r(D), r ∈ N, that satisfies Assumption 3.3 and Assumption 3.4 with ℓ ∈ N. Let λ* ∈ (0, ∞], let λ, C(λ) > 0 with λ ≤ λ* and let σ : N → R be a function such that for all p ∈ N there is a DeepONet/FNO G_θ such that

$$\|G(v) - G_\theta(v)\|_{L^2(D)} \le C p^{-\lambda} \quad\text{and}\quad \|G_\theta(v)\|_{C^r(D)} \le C p^{\sigma(r)} \qquad \forall\, r \in \mathbb{N},\ v \in X. \qquad (3.7)$$

Then for all β ∈ R with $0 < \beta \le \frac{(r-\ell)\lambda^* - \ell\sigma(r)}{r}$ there exists a constant C* > 0 such that for all v ∈ X and p ∈ N it holds that

$$\|L(G_\theta(v))\|_{L^2(D)} \le C^*\, p^{-\beta}. \qquad (3.8)$$
Proof. For suitable D^α, use SM Lemma B.1 with q = 2, f_1 = G(v) and f_2 = G_θ(v) together with (3.7) to find

$$\|D^\alpha(G(v) - G_\theta(v))\|_{L^2(D)} \le C(r,\lambda)\big(p^{-\lambda} h^{-\ell} + p^{\sigma(r)} h^{r-\ell}\big). \qquad (3.9)$$

Let β ∈ R with $0 < \beta \le \frac{(r-\ell)\lambda^* - \ell\sigma(r)}{r}$. We carefully balance terms by setting $h = p^{-\frac{\sigma(r)+\beta}{r-\ell}}$ and $\lambda = \frac{\ell}{r-\ell}\sigma(r) + \frac{r}{r-\ell}\beta$ to find (3.8). Conclude using Assumption 3.4.
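For completeness, with this choice of h and λ both terms in (3.9) balance at the rate p^{−β}:

$$p^{-\lambda}h^{-\ell} = p^{-\frac{\ell\sigma(r)+r\beta}{r-\ell} + \frac{\ell(\sigma(r)+\beta)}{r-\ell}} = p^{-\beta}, \qquad p^{\sigma(r)}h^{r-\ell} = p^{\sigma(r) - (\sigma(r)+\beta)} = p^{-\beta},$$

and the condition β ≤ ((r−ℓ)λ* − ℓσ(r))/r is precisely what guarantees λ ≤ λ*.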
Finally, we use Theorem 3.5 to present an alternative error estimate for a physics-informed Deep-
ONet in the case that Assumption 3.1 is satisfied. As this assumption is different from assuming
access to an error bound for the corresponding DeepONet, it is interesting to use the techniques
from the previous sections rather than directly apply Theorem 3.9. The proof of the following theo-
rem can be found in SM B.4.
Theorem 3.10. Let s, r ∈ N, T > 0, let G : C^r(T^d) → C^{(s,r)}([0,T]×T^d) be an operator that maps a function u_0 to the solution u of the PDE (2.1) with initial condition u_0, let Assumption 3.1 and Assumption 3.6 be satisfied and let p* ∈ {2, ∞} \ {p}. There exists a constant C > 0 such that for every Z, N, M ∈ N and ε, ρ > 0 there is a DeepONet G_θ : C^r(T^d) → L²([0,T]×T^d) with Z^d sensors with accuracy,

$$\|G(v) - G_\theta(v)\|_{L^2([0,T]\times\mathbb{T}^d)} \le C M^\rho\Big(\|u\|_{C^{(s,0)}}\,M^{-s} + M^{s-1}\big(\varepsilon + C^\varepsilon_{\mathrm{stab}}\,Z^{-r+d/p^*} + C^B_{\varepsilon,r}\,N^{-r}\big)\Big), \qquad (3.10)$$

and if additionally Assumption 3.3 and Assumption 3.4 hold then,

$$\|L(G_\theta(v))\|_{L^2([0,T]\times\mathbb{T}^d)} \le C M^{k+\rho}\Big(\|u\|_{C^{(s,\ell)}}\,M^{-s} + M^{s-1} N^\ell\big(\varepsilon + C^\varepsilon_{\mathrm{stab}}\,Z^{-r+d/p^*} + C^B_{\varepsilon,r}\,N^{-r}\big)\Big), \qquad (3.11)$$

for all v. Moreover, it holds that depth(β) = depth(U^ε), width(β) = O(M(Z^d + N^d width(U^ε))), depth(τ) = 3 and width(τ) = O(M N^d(N + ln(N))).
Proof. The proof (SM B.5) combines standard techniques, based on covering numbers and Hoeffd-
ing’s inequality, with an error composition from [16].
For any type of neural network architecture of depth L, width W and weights bounded by R, one
finds that dΘ ∼ LW (W + d). For tanh neural networks and operator learning architectures, one
has that ln(L) ∼ L ln(dRW ), whereas for physics-informed neural networks and DeepONets one
finds that ln(L) ∼ (k + ℓ)L ln(dRW ) with k and ℓ as in Assumption 3.4 [43, 16]. Taking this
into account, one also finds that the imposed lower bound on n is not very restrictive. Moreover,
the RHS of (3.13) depends at most polynomially on L, W, R, d, k, ℓ and c. For physics-informed
architectures, however, upper bounds on c often depend exponentially on L [16, 14].
Remark 3.12. As Theorem 3.11 is an a posteriori error estimate, one can use the network sizes of
the trained networks for L, W and R. The sizes stemming from the approximation error estimates
of the previous sections can be disregarded for this result. Moreover, instead of considering the
expected values of EG and ET in (3.13), one can also prove that such an inequality holds with a
certain probability (see SM B.5).
4 Applications
We demonstrate the power and generality of the framework proposed in Section 3 by applying the
presented theory to the following case studies. First, we demonstrate how these generic bounds can
be used to overcome the curse of dimensionality (CoD) for linear Kolmogorov PDEs and nonlinear
parabolic PDEs (Section 4.1). These are the first available results that overcome the CoD for nonlin-
ear parabolic PDEs for PINNs and (physics-informed) operator learning. Next, we apply the results
of Section 3.3 to both linear and nonlinear operators and provide bounds on the approximation error
for physics-informed operator learning.
For high-dimensional PDEs, it is not possible to obtain efficient approximation results using standard
neural network approximation theory [77, 15] as they will lead to convergence rates that suffer from
the CoD, meaning that the neural network size scales exponentially in the input dimension. In
the literature, it has been shown for some PDEs that their solution at a fixed time can be approximated to accuracy ε > 0 with a network that has size O(poly(d)ε^{−β}), with β > 0 independent of d, therefore overcoming the CoD.
Linear Kolmogorov PDEs We consider linear time-dependent PDEs of the following form.
Setting 4.1. Let s, r ∈ N, u_0 ∈ C_0^2(R^d) and let u ∈ C^{(s,r)}([0,T]×R^d) be the solution of

$$L(u)(x,t) = \partial_t u(x,t) - \tfrac{1}{2}\mathrm{Tr}\big(\sigma(x)\sigma(x)^T\,\Delta_x[u](x,t)\big) - \mu(x)^T\nabla_x[u](x,t) = 0, \qquad u(0,x) = u_0(x), \qquad (4.1)$$

for all (x,t) ∈ D × [0,T], where σ : R^d → R^{d×d} and μ : R^d → R^d are affine functions and for which ‖u‖_{C^{(s,2)}} grows at most polynomially in d. For every ε > 0, there is a neural network û_0 of width O(poly(d)ε^{−β}) such that ‖u_0 − û_0‖_{L^∞(R^d)} < ε.
Prototypical examples of such linear Kolmogorov PDEs include the heat equation and the Black-
Scholes equation. In [23, 7, 36] the authors construct a neural network that approximates u(T ) and
overcomes the CoD by emulating Monte-Carlo methods based on the Feynman-Kac formula. In
[16] it was proven that PINNs overcome the CoD as well, in the sense that the network size grows as O(poly(dρ_d)ε^{−β}), with ρ_d as defined in SM (C.10). For a subclass of Kolmogorov PDEs it is
known that ρd = poly(d), such that the CoD is fully overcome.
We demonstrate that the generic bounds of Section 3 (Theorem 3.5) can be used to provide a much
shorter proof for this result. SM Lemma C.6 verifies that Assumption 3.1 is indeed satisfied. The
full proof can be found in SM C.2.
Theorem 4.2. Assume that Setting 4.1 holds. For every σ, ε > 0 and d ∈ N, there is a tanh neural network u_θ of depth O(depth(û_0)) and width $O\big(\mathrm{poly}(d\rho_d)\,\varepsilon^{-(2+\beta)\frac{r+\sigma}{r-2}\frac{s+1}{s-1} - \frac{1+\sigma}{s-1}}\big)$ such that,

$$\|L(u_\theta)\|_{L^2([0,T]\times[0,1]^d)} + \|u_\theta - u\|_{L^2(\partial([0,T]\times[0,1]^d))} \le \varepsilon. \qquad (4.2)$$
Nonlinear parabolic PDEs Next, we consider nonlinear parabolic PDEs as in Setting 4.3, which typically arise in the context of nonlinear diffusion-reaction equations that describe the change in space and time of some quantities, such as in the well-known Allen-Cahn equation [1].
Setting 4.3. Let s, r ∈ N and for u_0 ∈ X ⊂ C^r(T^d) let u ∈ C^{(s,r)}([0,T]×T^d) be the solution of

$$L(u)(x,t) = \partial_t u(t,x) - \Delta_x u(t,x) - F(u(t,x)) = 0, \qquad u(0,x) = u_0(x), \qquad (4.3)$$

for all (t,x) ∈ [0,T]×D, with periodic boundary conditions, where F : R → R is a polynomial and for which ‖u‖_{C^{(s,2)}} grows at most polynomially in d. For every ε > 0, there is a neural network û_0 of width O(poly(d)ε^{−β}) such that ‖u_0 − û_0‖_{L^∞(T^d)} < ε. Let μ, resp. μ*, be the normalized Lebesgue measure on [0,T]×T^d, resp. ∂([0,T]×T^d).
In [32] the authors have proven that ReLU neural networks overcome the CoD in the approximation
of u(T ). We have reproven this result in SM Lemma C.14 for tanh neural networks to show that
Assumption 3.1 is satisfied. Using Theorem 3.5 we can now prove that PINNs overcome the CoD
for nonlinear parabolic PDEs. The proof is analogous to that of Theorem 4.2.
Theorem 4.4. Assume Setting 4.3. For every σ, ε > 0 and d ∈ N there is a tanh neural network u_θ of depth O(depth(û_0) + poly(d) ln(1/ε)) and width $O\big(\mathrm{poly}(d)\,\varepsilon^{-(2+\beta)\frac{r+\sigma}{r-2}\frac{s+1}{s-1} - \frac{1+\sigma}{s-1}}\big)$ such that,

$$\|L(u_\theta)\|_{L^2([0,T]\times\mathbb{T}^d,\mu)} + \|u - u_\theta\|_{L^2(\partial([0,T]\times\mathbb{T}^d),\mu^*)} \le \varepsilon. \qquad (4.4)$$
Similarly, one can use the results from Section 3.2 to obtain estimates for (physics-informed)
DeepONets for nonlinear parabolic PDEs (4.3) such as the Allen-Cahn equation. In particular, a
dimension-independent convergence rate can be obtained if the solution is smooth enough, which
improves upon the result of [43], which incurred the CoD. For simplicity, we present results for
C (2,r) functions, rather than C (s,r) functions, as we found that assuming more regularity did not
necessarily further improve the convergence rate. The proof is given in SM B.4.
Theorem 4.5. Assume Setting 4.3 and let G : X → C^r(T^d) : u_0 ↦ u(T) and G* : X → C^{(2,r)}([0,T]×T^d) : u_0 ↦ u. For every σ, ε > 0, there exist DeepONets G_θ and G*_θ such that

$$\|G - G_\theta\|_{L^2(\mathbb{T}^d\times X)} \le \varepsilon, \qquad \|L(G^*_\theta)\|_{L^2([0,T]\times\mathbb{T}^d\times X)} \le \varepsilon. \qquad (4.5)$$

Moreover, for G_θ we have $O(\varepsilon^{-\frac{d+\sigma}{r}})$ sensors and,

$$\mathrm{width}(\beta) = O\big(\varepsilon^{-\frac{(d+\sigma)(2+\beta)}{r}}\big), \quad \mathrm{depth}(\beta) = O(\ln(1/\varepsilon)), \quad \mathrm{width}(\tau) = O\big(\varepsilon^{-\frac{d+1+\sigma}{r}}\big), \quad \mathrm{depth}(\tau) = 3, \qquad (4.6)$$

whereas for G*_θ we have $O(\varepsilon^{-\frac{(3+\sigma)d}{r-2}})$ sensors and,

$$\mathrm{width}(\beta) = O\big(\varepsilon^{-1-\frac{(3+\sigma)(d+r(2+\beta))}{r-2}}\big), \quad \mathrm{depth}(\beta) = O(\ln(1/\varepsilon)), \quad \mathrm{width}(\tau) = O\big(\varepsilon^{-1-\frac{(3+\sigma)(d+1)}{r-2}}\big), \quad \mathrm{depth}(\tau) = 3. \qquad (4.7)$$
We demonstrate how Theorem 3.9 can be used to generalize available error estimates for DeepONets
and FNOs, e.g. [43, 38] and SM D.1, to estimates for their physics-informed counterparts.
Linear operators In the simplest case, the operator G of interest is linear. In [43, Theorem D.2],
a general error bound for ReLU DeepONets for linear operators has been established, which still
holds for tanh DeepONets. Using Theorem 3.9 it is then straightforward to prove convergence rates
for physics-informed DeepONets for solution operators of linear PDEs (2.1).
Consider an operator G : X → L²(T^d) : v ↦ u as in Section 2.1, where v is the parameter/initial condition and u the solution of the PDE (2.1). Following [43], we fix the measure μ on L²(T^d) as a Gaussian random field, such that v allows the Karhunen-Loève expansion $v = \sum_{k\in\mathbb{Z}^d}\alpha_k X_k e_k$, where $|\alpha_k| \le \exp(-\ell|k|)$ with ℓ > 0, the X_k ∼ N(0,1) are iid Gaussian random variables and {e_k}_{k∈Z^d} is the standard Fourier basis (SM A.5). In this setting, we can prove the following approximation result, the proof of which can be found in SM D.3. The result can be generalized to other data distributions μ for which a convergence result for DeepONets can be proven, as in [43].
Theorem 4.6. Assume the setting above and that of Assumption 3.4, and assume that G(v) ∈ C^{ℓ+1}(T^d) for all v ∈ X. For all β > 0 there exists a constant C > 0 such that for any p ∈ N there exists a DeepONet G_θ with p sensors and branch and trunk nets such that

$$\|L(G_\theta)\|_{L^2(L^2(\mathbb{T}^d),\mu)} \le C p^{-\beta}. \qquad (4.8)$$

Moreover, size(τ) ≤ C p^{(d+1)/d}, depth(τ) = 3, size(β) ≤ p and depth(β) = 1.
Nonlinear operators For nonlinear PDEs, a general result like Theorem 4.6 cannot be obtained from the currently available tools. Instead, one needs to use Theorem 3.9 for every PDE of interest
on a case-by-case basis. In the SM, we demonstrate this for a nonlinear ODE (gravity pendulum
with external force, SM D.5) and an elliptic PDE (Darcy flow, SM D.6).
generalization to PINNs is not immediate as the proof involves the emulation of the forward Euler
method. We have overcome this difficulty by constructing space-time neural networks using Taylor
expansions instead (Theorem 3.5). To bound the approximation error of PINNs one can use the
generic error bounds in Sobolev norms of e.g. [25, 26] for very general activation functions or the
more concrete bounds [15] for tanh neural networks. In both approaches, the only assumption is
that the solution of the PDE has sufficient Sobolev regularity. As a consequence, these results incur
the curse of dimensionality and are not applicable to high-dimensional PDEs. The authors of [15]
analyze PINNs based on three theoretical questions related to approximation, stability and general-
ization. Other theoretical analyses of PINNs include e.g. [71, 72, 30]. For DeepONets, convergence
rates for advection-diffusion equations are presented in [17] and a clear workflow for obtaining
generic error estimates as well as worked out examples can be found in [43]. Similar results are ob-
tained for FNOs in [38]. A comprehensive comparison of DeepONets and FNOs is the topic of [50].
To the best of the authors’ knowledge, no theoretical results for physics-informed operator learning
are currently available. Unrelated to the approximation error, we also report generic bounds on the
expected value of the generalization error of all the aforementioned deep learning architectures, in
the form of an a posteriori error estimate on the generalization error.
A second goal of the paper is to prove that deep learning-based frameworks can overcome the curse
of dimensionality (CoD). PDEs for which the curse of dimensionality has been overcome include
linear Kolmogorov PDEs e.g. [23, 36], nonlinear parabolic PDEs [32] and elliptic PDEs [4, 10, 57].
By assuming that the initial data lies in a Barron class, the authors of [52] proved for elliptic PDEs
that the Deep Ritz Method [20] can overcome the CoD. Since the Barron class is a Banach algebra
[10] it is possible that our results, which mostly only involve multiplications and additions of neural
networks, can be extended to Barron functions. For PINNs, it is proven that they can overcome
the CoD for linear Kolmogorov PDEs [16]. We give an alternative proof of this result, improve the
convergence rate (Theorem 4.2) and additionally prove that PINNs can also overcome the CoD for
nonlinear parabolic PDEs (Theorem 4.4). DeepONets and FNOs can overcome the CoD in many
cases [43, 38] but we note that this does not yet include nonlinear parabolic PDEs such as the Allen-
Cahn equation. In Theorem 4.5 we prove that dimension-independent convergence rates can be
obtained if the solution is sufficiently regular. Similar results are expected to hold for e.g. elliptic
PDEs by using the results from [4, 10, 57].
It is evident that the generic bounds presented here can only be obtained under suitable assumptions.
These should always be checked to prevent misleading claims about mathematical guarantees for
the considered deep learning methods. We briefly discuss how restrictive these are and whether they
can be relaxed. Assuming the existence of a neural network that approximates the solution of the PDE
at a fixed time (Assumption 3.1) is of course essential, but such a result can usually be obtained by
emulating an existing numerical method. Proving a bound on the Sobolev norm of that network is
always possible as we only consider smooth networks. Assumption 3.3 holds for many domains,
including rectangular and smooth ones. Assumption 3.4 and Assumption 3.6 also hold for a very
broad class of PDEs, much like the assumption on the size of the neural network approximation in
Setting 4.1 and 4.3 holds for most functions of interest. Therefore, the assumption that the PDE
solution is C (s,r) -regular seems to be the most restrictive. However, results like Theorem 3.5 could
be extended to e.g. Sobolev regular functions by using the Bramble-Hilbert lemma instead of Taylor
expansions. Another restriction is that we exclusively focused on neural networks with the tanh acti-
vation function. This was only for simplicity of exposition. All results still hold for other sigmoidal
activation functions, as well as more general smooth activation functions, which might give rise
to slightly different convergence rates. A last restriction is that the obtained rates are not optimal,
but this is not the goal of our framework. In particular, for PINNs for low-dimensional PDEs it is
beneficial to use e.g. [26, 15].
Optimizing the obtained convergence rates and comparing with optimal ones is one direction for
future research. Previously mentioned possibilities include extending to more general activation
functions and less regular functions. Another direction is to make the connection between our results
and that of [10] where they prove that Barron spaces are Banach algebras and use this to obtain
dimension-independent convergence rates for PDEs with initial data in a Barron class by emulating
numerical methods.
We have considered the approximation and generalization errors in the present analysis. It is
clear that the bounds on the generalization error may not be sharp, as in traditional deep learning.
Obtaining sharper bounds will be an interesting topic for further investigation. Finally, there is no
explicit bound on the training (optimization) errors. Obtaining such bounds will be considered in
the future.
References
[1] S. M. Allen and J. W. Cahn. A microscopic theory for antiphase boundary motion and its application to
antiphase domain coarsening. Acta metallurgica, 27(6):1085–1095, 1979.
[2] G. Bai, U. Koley, S. Mishra, and R. Molinaro. Physics informed neural networks (PINNs) for approxi-
mating nonlinear dispersive PDEs. arXiv preprint arXiv:2104.05584, 2021.
[3] A. Barth, A. Jentzen, A. Lang, and C. Schwab. Numerical Analysis of Stochastic Ordinary Differential
Equations. ETH Zürich, 2018.
[4] C. Beck, L. Gonon, and A. Jentzen. Overcoming the curse of dimensionality in the numerical
approximation of high-dimensional semilinear elliptic partial differential equations. arXiv preprint
arXiv:2003.00596, 2020.
[5] C. Beck, F. Hornung, M. Hutzenthaler, A. Jentzen, and T. Kruse. Overcoming the curse of dimensionality
in the numerical approximation of Allen-Cahn partial differential equations via truncated full-history
recursive multilevel Picard approximations. Journal of Numerical Mathematics, 28(4):197–222, 2020.
[6] C. Beck, A. Jentzen, and B. Kuckuck. Full error analysis for the training of deep neural networks, 2020.
[7] J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk minimization over
deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of
Black-Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–
657, 2020.
[8] S. Cai, Z. Wang, L. Lu, T. A. Zaki, and G. E. Karniadakis. DeepM&Mnet: Inferring the electroconvec-
tion multiphysics fields based on operator approximation by neural networks. Journal of Computational
Physics, 436:110296, 2021.
[9] T. Chen and H. Chen. Universal approximation to nonlinear operators by neural networks with arbitrary
activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks,
6(4):911–917, 1995.
[10] Z. Chen, J. Lu, and Y. Lu. On the representation of solutions to elliptic PDEs in Barron spaces. arXiv
preprint arXiv:2106.07539, 2021.
[11] A. Cohen, R. Devore, and C. Schwab. Analytic regularity and polynomial approximation of parametric
and stochastic elliptic PDEs. Analysis and Applications, 9(01):11–47, 2011.
[12] G. Constantine and T. Savits. A multivariate Faa di Bruno formula with applications. Transactions of the
American Mathematical Society, 348(2):503–520, 1996.
[13] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals
and systems, 2(4):303–314, 1989.
[14] T. De Ryck, A. D. Jagtap, and S. Mishra. Error estimates for physics informed neural networks approxi-
mating the Navier-Stokes equations. arXiv preprint arXiv:2203.09346, 2022.
[15] T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural networks.
Neural Networks, 2021.
[16] T. De Ryck and S. Mishra. Error analysis for physics informed neural networks (PINNs) approximating
Kolmogorov PDEs. arXiv preprint arXiv:2106.14473, 2021.
[17] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis. Convergence rate of DeepONets for learning
operators arising from advection-diffusion equations. arXiv preprint arXiv:2102.10621, 2021.
[18] M. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential
equations. Communications in Numerical Methods in Engineering, 1994.
[19] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic
partial differential equations and backward stochastic differential equations. Communications in Mathe-
matics and Statistics, 5(4):349–380, 2017.
[20] W. E and B. Yu. The deep Ritz method: a deep learning-based numerical algorithm for solving variational
problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
[21] R. A. Fisher. The wave of advance of advantageous genes. Annals of eugenics, 7(4):355–369, 1937.
[22] S. Goswami, M. Yin, Y. Yu, and G. E. Karniadakis. A physics-informed variational DeepONet for pre-
dicting crack path in quasi-brittle materials. Computer Methods in Applied Mechanics and Engineering,
391:114587, 2022.
[23] P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural networks
overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential
equations. arXiv preprint arXiv:1809.02362, 2018.
[24] P. Grohs, F. Hornung, A. Jentzen, and P. Zimmermann. Space-time error estimates for deep neural network
approximations for differential equations. arXiv preprint arXiv:1908.03833, 2019.
[25] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep ReLU neural
networks in W s,p norms. Analysis and Applications, 18(05):803–859, 2020.
[26] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights in smoothness
spaces. Neural Networks, 134:107–130, 2021.
[27] W. H. Guss and R. Salakhutdinov. On universal approximation by neural networks with uniform guaran-
tees on approximation of infinite dimensional maps. arXiv preprint arXiv:1910.01545, 2019.
[28] P. Henry-Labordere. Counterparty risk valuation: A marked branching diffusion approach. Available at
SSRN 1995503, 2012.
[29] P. Henry-Labordere, X. Tan, and N. Touzi. A numerical algorithm for a class of BSDEs via the branching
process. Stochastic Processes and their Applications, 124(2):1112–1140, 2014.
[30] B. Hillebrecht and B. Unger. Certified machine learning: A posteriori error estimation for physics-
informed neural networks. arXiv preprint arXiv:2203.17055, 2022.
[31] F. Hornung, A. Jentzen, and D. Salimova. Space-time deep neural network approximations for high-
dimensional partial differential equations. arXiv preprint arXiv:2006.02199, 2020.
[32] M. Hutzenthaler, A. Jentzen, T. Kruse, and T. A. Nguyen. A proof that rectified deep neural networks
overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN
partial differential equations and applications, 1(2):1–34, 2020.
[33] M. Hutzenthaler, A. Jentzen, B. Kuckuck, and J. L. Padgett. Strong Lp -error analysis of nonlinear
Monte Carlo approximations for high-dimensional semilinear partial differential equations. arXiv preprint
arXiv:2110.08297, 2021.
[34] A. D. Jagtap and G. E. Karniadakis. Extended physics-informed neural networks (XPINNs): A general-
ized space-time domain decomposition based deep learning framework for nonlinear partial differential
equations. Communications in Computational Physics, 28(5):2002–2041, 2020.
[35] A. D. Jagtap, E. Kharazmi, and G. E. Karniadakis. Conservative physics-informed neural networks on dis-
crete domains for conservation laws: Applications to forward and inverse problems. Computer Methods
in Applied Mechanics and Engineering, 365:113028, 2020.
[36] A. Jentzen, D. Salimova, and T. Welti. A proof that deep artificial neural networks overcome the curse of
dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant
diffusion and nonlinear drift coefficients. arXiv preprint arXiv:1809.07321, 2018.
[37] A. N. Kolmogorov. Étude de l’équation de la diffusion avec croissance de la quantité de matière et son
application à un problème biologique. Bull. Univ. Moskow, Ser. Internat., Sec. A, 1:1–25, 1937.
[38] N. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for Fourier
Neural Operators. arXiv preprint arXiv:2107.07562, 2021.
[39] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural
operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481v3, 2021.
[40] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural networks
and parametric PDEs. Constructive Approximation, pages 1–53, 2021.
[41] I. E. Lagaris, A. Likas, and D. G. Papageorgiou. Neural-network methods for boundary value problems with irregular boundaries. IEEE Transactions on Neural Networks, 11(5):1041–1049, 2000.
[42] I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.
[43] S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning frame-
work in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022.
[44] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier
neural operator for parametric partial differential equations, 2020.
[45] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar.
Neural operator: Graph kernel network for partial differential equations. CoRR, abs/2003.03485, 2020.
[46] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, A. M. Stuart, K. Bhattacharya, and A. Anandkumar.
Multipole graph neural operator for parametric partial differential equations. In H. Larochelle, M. Ran-
zato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems
(NeurIPS), volume 33, pages 6755–6766. Curran Associates, Inc., 2020.
[47] Z. Li, H. Zheng, N. Kovachki, D. Jin, H. Chen, B. Liu, K. Azizzadenesheli, and A. Anandkumar. Physics-
informed neural operator for learning partial differential equations. arXiv preprint arXiv:2111.03794,
2021.
[48] C. Lin, Z. Li, L. Lu, S. Cai, M. Maxey, and G. E. Karniadakis. Operator learning for predicting multiscale
bubble growth dynamics. The Journal of Chemical Physics, 154(10):104118, 2021.
[49] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet
based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229,
2021.
[50] L. Lu, X. Meng, S. Cai, Z. Mao, S. Goswami, Z. Zhang, and G. E. Karniadakis. A comprehensive and
fair comparison of two neural operators (with practical extensions) based on fair data. Computer Methods
in Applied Mechanics and Engineering, 393:114778, 2022.
[51] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis. DeepXDE: A deep learning library for solving differential
equations. SIAM Review, 63(1):208–228, 2021.
[52] Y. Lu, J. Lu, and M. Wang. A priori generalization analysis of the deep Ritz method for solving high
dimensional elliptic partial differential equations. In Conference on Learning Theory, pages 3196–3241.
PMLR, 2021.
[53] K. O. Lye, S. Mishra, and D. Ray. Deep learning observables in computational fluid dynamics. Journal
of Computational Physics, page 109339, 2020.
[54] K. O. Lye, S. Mishra, D. Ray, and P. Chandrashekar. Iterative surrogate model optimization (ISMO): An
active learning algorithm for pde constrained optimization with deep neural networks. Computer Methods
in Applied Mechanics and Engineering, 374:113575, 2021.
[55] Z. Mao, A. D. Jagtap, and G. E. Karniadakis. Physics-informed neural networks for high-speed flows.
Computer Methods in Applied Mechanics and Engineering, 360:112789, 2020.
[56] Z. Mao, L. Lu, O. Marxen, T. A. Zaki, and G. E. Karniadakis. DeepM&Mnet for hypersonics: Predicting
the coupled flow and finite-rate chemistry behind a normal shock using neural-network approximation of
operators. Journal of Computational Physics, 447:110698, 2021.
[57] T. Marwah, Z. Lipton, and A. Risteski. Parametric complexity bounds for approximating pdes with neural
networks. Advances in Neural Information Processing Systems, 34:15044–15055, 2021.
[58] H. P. McKean. Application of brownian motion to the equation of Kolmogorov-Petrovskii-Piskunov.
Communications on pure and applied mathematics, 28(3):323–331, 1975.
[59] S. Mishra and R. Molinaro. Estimates on the generalization error of physics-informed neural networks
for approximating a class of inverse problems for PDEs. IMA Journal of Numerical Analysis, 2021.
[60] S. Mishra and R. Molinaro. Physics informed neural networks for simulating radiative transfer. Journal
of Quantitative Spectroscopy and Radiative Transfer, 270:107705, 2021.
[61] S. Mishra and R. Molinaro. Estimates on the generalization error of physics informed neural networks
(PINNs) for approximating PDEs. IMA Journal of Numerical Analysis, 2022.
[62] B. Øksendal. Stochastic differential equations. Springer, 2003.
[63] J. A. Opschoor, P. C. Petersen, and C. Schwab. Deep ReLU networks and high-order finite element
methods. Analysis and Applications, 18(05):715–770, 2020.
[64] J. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic maps in
high dimension. Constructive Approximation, pages 1–46, 2021.
[65] G. Pang, L. Lu, and G. E. Karniadakis. fPINNs: Fractional physics-informed neural networks. SIAM
journal of Scientific computing, 41:A2603–A2626, 2019.
[66] J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli, P. Hassanzadeh, K. Kashinath, and A. Anandkumar. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022.
[67] M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial differen-
tial equations. Journal of Computational Physics, 357:125–141, 2018.
[68] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning
framework for solving forward and inverse problems involving nonlinear partial differential equations.
Journal of Computational Physics, 378:686–707, 2019.
[69] M. Raissi, A. Yazdani, and G. E. Karniadakis. Hidden fluid mechanics: A Navier-Stokes informed deep
learning framework for assimilating flow visualization data. arXiv preprint arXiv:1808.04327, 2018.
[70] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized
polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55, 2019.
[71] Y. Shin, J. Darbon, and G. E. Karniadakis. On the convergence and generalization of physics informed
neural networks. arXiv preprint arXiv:2004.01806, 2020.
[72] Y. Shin, Z. Zhang, and G. E. Karniadakis. Error estimates of residual minimization using neural networks
for linear equations. arXiv preprint arXiv:2010.08019, 2020.
[73] S. Wang and P. Perdikaris. Long-time integration of parametric evolution equations with physics-informed
DeepONets. arXiv preprint arXiv:2106.05384, 2021.
[74] S. Wang, H. Wang, and P. Perdikaris. Learning the solution operator of parametric partial differential
equations with physics-informed DeepOnets. arXiv preprint arXiv:2103.10974, 2021.
[75] J. Yang, Q. Du, and W. Zhang. Uniform Lp -bound of the Allen-Cahn equation and its numerical dis-
cretization. International Journal of Numerical Analysis & Modeling, 15, 2018.
[76] L. Yang, X. Meng, and G. E. Karniadakis. B-PINNs: Bayesian physics-informed neural networks for
forward and inverse pde problems with noisy data. Journal of Computational Physics, 425:109913, 2021.
[77] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114,
2017.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes] See Section 3 and Section 4.
(b) Did you describe the limitations of your work? [Yes] See Section 5.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See
Section 5.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] All assump-
tions are either stated in the theorem statement or described in the text right above the
theorem statement.
(b) Did you include complete proofs of all theoretical results? [Yes] For each result we
mention where the proof can be found.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [N/A]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [N/A]
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [N/A]
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data
you’re using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifi-
able information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
A Notation and preliminaries
We introduce notation and preliminary results regarding finite differences, Sobolev spaces, the Leg-
endre basis, the Fourier basis, trigonometric polynomial interpolation and neural network approxi-
mation theory.
For h > 0, α ∈ N_0^d, r ∈ N and ℓ := ‖α‖_1, we define a finite difference operator Δ^{α,r}_h as,

$$\Delta^{\alpha,r}_h[f](t,x) = \sum_j c^{\alpha,r}_j\, f\big(t, x + h\,b^{\alpha,r}_j\big), \qquad (A.1)$$

for f ∈ C^{r+ℓ}(R^d), where the number of non-zero terms in the summation can be chosen to be finite and only dependent on ℓ and r, and where the choice of b^{α,r}_j ∈ R^d allows one to approximate D_x^α f up to accuracy O(h^r). This means that for any f ∈ C^{r+ℓ}(R^d) it holds for all x that,

$$\big|h^{-\ell}\cdot\Delta^{\alpha,r}_h[f](t,x) - D_x^\alpha f(t,x)\big| \le c_{\ell,r}\,\|f(t,\cdot)\|_{C^{r+\ell}}\,h^r \quad\text{for } h > 0, \qquad (A.2)$$

where c_{ℓ,r} > 0 does not depend on f and h. Similarly, we can define a finite difference operator Δ^{k,s}_{h,t}[f](t,x) to approximate D_t^k f(t,x) to accuracy O(h^s).
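As a quick numerical check of (A.2) for one concrete, illustrative stencil: the central difference Δ_h[f](x) = (f(x+h) − f(x−h))/2 with ℓ = 1 and r = 2, so that h^{−1}Δ_h[f] approximates f′ with error O(h²).

```python
import numpy as np

# Verify the O(h^r) accuracy in (A.2) for a central-difference stencil with
# ell = |alpha| = 1 and r = 2, applied to f = sin (so f' = cos).
f, df = np.sin, np.cos
x = np.linspace(0.0, 2.0 * np.pi, 50)
for h in [1e-1, 1e-2, 1e-3]:
    approx = (f(x + h) - f(x - h)) / (2.0 * h)       # h^{-ell} * Delta_h[f]
    print(f"h = {h:g}: max error = {np.max(np.abs(approx - df(x))):.2e}")
# the error drops by about a factor 100 per factor 10 in h, i.e. rate O(h^2)
```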
A.3 Sobolev spaces
Let d ∈ N, k ∈ N_0, 1 ≤ p ≤ ∞ and let Ω ⊆ R^d be open. For a function f : Ω → R and a (multi-)index α ∈ N_0^d we denote by

$$D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}} \qquad (A.3)$$

the classical or distributional (i.e. weak) derivative of f. We denote by L^p(Ω) the usual Lebesgue space and we define the Sobolev space W^{k,p}(Ω) as

$$W^{k,p}(\Omega) = \{f \in L^p(\Omega) : D^\alpha f \in L^p(\Omega) \text{ for all } \alpha \in \mathbb{N}_0^d \text{ with } |\alpha| \le k\}. \qquad (A.4)$$

For p < ∞, we define the following seminorms on W^{k,p}(Ω),

$$|f|_{W^{m,p}(\Omega)} = \Big(\sum_{|\alpha|=m}\|D^\alpha f\|^p_{L^p(\Omega)}\Big)^{1/p} \quad\text{for } m = 0,\dots,k. \qquad (A.5)$$

Based on these seminorms, we can define the following norm for p < ∞,

$$\|f\|_{W^{k,p}(\Omega)} = \Big(\sum_{m=0}^{k}|f|^p_{W^{m,p}(\Omega)}\Big)^{1/p}. \qquad (A.7)$$

The space W^{k,p}(Ω) equipped with the norm ‖·‖_{W^{k,p}(Ω)} is a Banach space. We denote by C^k(Ω) the space of functions that are k times continuously differentiable and equip this space with the norm ‖f‖_{C^k(Ω)} = ‖f‖_{W^{k,∞}(Ω)}.
Lemma A.1 (Continuous Sobolev embedding). Let d, ℓ ∈ N and let k ≥ d/2 + ℓ. Then there exists a constant C > 0 such that for any f ∈ H^k(T^d) it holds that

$$\|f\|_{C^\ell(\mathbb{T}^d)} \le C\,\|f\|_{H^k(\mathbb{T}^d)}. \qquad (A.9)$$
constitute an orthonormal basis of L²([−1,1]^d, λ/2^d). By considering the lexicographic order on N_0^d, of which we denote the enumeration by κ : N → N_0^d, one can define an ordered basis (L_j)_{j∈N} by setting L_j := L_{κ(j)}. From [64, eq. (2.19)] it also follows that,

$$\forall\, s \in \mathbb{N}_0,\ \nu \in \mathbb{N}_0^d: \quad \|L_\nu\|_{C^s([-1,1]^d)} \le \prod_{j=1}^{d}(1 + 2\nu_j)^{1/2 + 2s}. \qquad (A.12)$$
A.5 Notation for Standard Fourier basis
Using the notation from [43], we introduce the following "standard" real Fourier basis {e_κ}_{κ∈Z^d} in d dimensions. For κ = (κ_1, ..., κ_d) ∈ Z^d, we let σ(κ) be the sign of the first non-zero component of κ and we define

$$e_\kappa := C_\kappa\begin{cases}1, & \sigma(\kappa) = 0,\\ \cos\langle\kappa, x\rangle, & \sigma(\kappa) = 1,\\ \sin\langle\kappa, x\rangle, & \sigma(\kappa) = -1,\end{cases} \qquad (A.13)$$

where the factor C_κ > 0 ensures that e_κ is properly normalized, i.e. that ‖e_κ‖_{L²(T^d)} = 1. Next, let κ : N → Z^d be a fixed enumeration of Z^d, with the property that j ↦ |κ(j)|_∞ is monotonically increasing, i.e. such that j ≤ j′ implies that |κ(j)|_∞ ≤ |κ(j′)|_∞. This will allow us to introduce an N-indexed version of the Fourier basis,

$$e_j(x) := e_{\kappa(j)}(x), \qquad \forall\, j \in \mathbb{N}. \qquad (A.14)$$

Finally we note that

$$\|e_\kappa\|_{C^s([0,2\pi]^d)} \le \|\kappa\|_\infty^s.$$

where,

$$a_{k,j} = \begin{cases}1, & \sigma(k) = 0,\\ \cos\langle k, x_j\rangle, & \sigma(k) = 1,\\ \sin\langle k, x_j\rangle, & \sigma(k) = -1.\end{cases} \qquad (A.19)$$
A.7 Neural network approximation theory
We recall some basic results on the approximation of functions by tanh neural networks in this
section. All results are adaptations from results in [15]. The following two lemmas address the
approximation of univariate monomials and the multiplication operator.
Lemma A.3 (Approximation of univariate monomials, Lemma 3.2 in [15]). Let k ∈ N0 , s ∈ 2N− 1,
M > 0 and define fp : [−M, M ] → R : x 7→ xp for all p ∈ N. For every ε > 0, there exists a
shallow tanh neural network ψs,ε : [−M, M ] → Rs of width 3(s+1)
2 such that
Lemma A.4 (Shallow approximation of multiplication of d numbers, Corollary 3.7 in [15]). Let
d ∈ N, k ∈ N0 and M > 0. Then l mfor every ε > 0, there exist a shallow tanh neural network
b εd : [−M, M ]d → R of width 3 d+1 Pd,d (or 4 if d = 2) such that
× 2
d
Y
b εd (x) −
× xi ≤ ε. (A.23)
i=1
W k,∞
Lemma B.1. Let q ∈ [1, ∞], r, ℓ ∈ N with ℓ ≤ r and f_1, f_2 ∈ C^{(0,r)}([0,T]×D). If Assumption 3.3 holds then there exists a constant C(r) > 0 such that for any α ∈ N_0^d with ℓ := ‖α‖_1 it holds that

$$\|D_x^\alpha(f_1 - f_2)\|_{L^q} \le C\big(\|f_1 - f_2\|_{L^q}\,h^{-\ell} + \max_{j=1,2}\|f_j\|_{C^{(0,r)}}\,h^{r-\ell}\big) \qquad \forall\, h > 0. \qquad (B.1)$$

Proof. From the triangle inequality and (A.2) the existence of a constant C(r) > 0 follows such that,

$$\|D_x^\alpha(f_1 - f_2)\|_{L^q} \le \max_{j=1,2}\big\|D^\alpha f_j - h^{-\ell}\cdot\Delta^{\alpha,r}_h[f_j]\big\|_{L^q} + C(r)\,h^{-\ell}\,\|f_1 - f_2\|_{L^q}$$
Lemma B.2. Using the notation of the proof of Theorem 3.5 (SM B.2), it holds that

$$\big\|D^{(k,\alpha)}(\tilde u - \hat u)\big\|_{C^0} \le \delta. \qquad (B.2)$$

Proof. Using the Faà di Bruno formula [12] and its consequences for estimating the norms of derivatives of compositions [15, Lemma A.7] one can prove for sufficiently regular functions g_1, g_2, h_1, h_2 and a suitable multi-index β estimates of the form, assuming that the compositions are well-defined and where the constant C > 0 may depend on g_1, g_2, h_1, h_2 and their derivatives. Using this theorem we can prove that

$$\Big|D^{(k,\alpha)}\hat u - D^{(k,\alpha)}\sum_{m=1}^{M}\sum_{i=0}^{s-1}\frac{\Delta^{i,s-i}_{1/M,t}[\hat u^\varepsilon_m](t_m,x)}{M^{-i}\,i!}\,\hat\varphi^\delta_i(t-t_m)\cdot\Phi^M_m(t)\Big| < C\delta. \qquad (B.4)$$

Because the size of the neural network $\widehat{\times}^\delta$ in the definition of $\hat u$ does not depend on its accuracy δ (see Lemma A.4), we can rescale δ and therefore set C = 1/2 in the above inequality.
Next, we observe that,

$$D^{(k,\alpha)}\sum_{m=1}^{M}\sum_{i=0}^{s-1}\frac{\Delta^{i,s-i}_{1/M,t}[\hat u^\varepsilon_m](t_m,x)}{M^{-i}\,i!}\cdot(\hat\varphi^\delta_i - \varphi_i)(t-t_m)\cdot\Phi^M_m(t) = \sum_{m=1}^{M}\sum_{i=0}^{s-1}\frac{\Delta^{i,s-i}_{1/M,t}[D_x^\alpha\hat u^\varepsilon_m](t_m,x)}{M^{-i}\,i!}\cdot\sum_{n=0}^{k}\binom{k}{n}\,\partial_t^n(\hat\varphi^\delta_i - \varphi_i)(t-t_m)\cdot\partial_t^{k-n}\Phi^M_m(t), \qquad (B.5)$$

$$\big\|D^{(k,\alpha)}(\tilde u - \hat u)\big\|_{C^0} \le \|(B.4)\|_{C^0} + \|(B.5)\|_{C^0} \le \delta. \qquad (B.6)$$
Lemma B.3. Let Δ^{k,s}_{h,t} be a finite difference operator, cf. Section 3.3 and SM A.2, let 1 ≤ j ≤ d, let 1 ≤ q ≤ ∞, let ℓ ∈ N_0 and let α ∈ N_0^d with ‖α‖_1 = ℓ. Let u, û ∈ C^{(s,ℓ)}([−2h, 2h] × D) be such that for all t ∈ [−2h, 2h],

$$\big\|D_x^\alpha\big(u(t,\cdot) - \hat u(t,\cdot)\big)\big\|_{L^q(D)} \le \varepsilon. \qquad (B.7)$$

Then there exists c_s > 0 such that,

$$\Big\|D^{k,\alpha}\Big(\sum_{i=0}^{s-1}\frac{\Delta^{i,s-i}_{h,t}[\hat u](0,x)}{h^i\,i!}\,t^i - u(t,\cdot)\Big)\Big\|_{L^q} \le c_s\big(\varepsilon h^{-k} + |D_x^\alpha u|_{C^{(s,0)}}\,h^{s-k}\big). \qquad (B.8)$$
Proof. Let t ∈ [−2h, 2h], α ∈ N_0^d with ‖α‖_1 = ℓ and x ∈ R^d be arbitrary. We first observe that,

$$D^{k,\alpha}\sum_{i=0}^{s-1}\frac{\Delta^{i,s-i}_{h,t}[u](0,x)}{h^i\,i!}\,t^i = \sum_{i=k}^{s-1}\frac{\Delta^{i,s-i}_{h,t}[D_x^\alpha u](0,x)}{h^i\,(i-k)!}\,t^{i-k}. \qquad (B.9)$$

Taylor's theorem then guarantees the existence of ξ_{t,x} ∈ [−2h, 2h] such that

$$D^{k,\alpha}\Big(\sum_{i=0}^{s-1}\frac{\Delta^{i,s-i}_{h,t}[u](0,x)}{h^i\,i!}\,t^i - u(t,\cdot)\Big) = \sum_{i=0}^{s-1-k}\Big(\frac{\Delta^{i+k,s-i-k}_{h,t}[D_x^\alpha u](0,x)}{h^{i+k}\,i!}\,t^i - \frac{D^{i+k,\alpha}u(0,x)}{i!}\,t^i\Big) + \frac{D^{s,\alpha}u(\xi_{t,x},x)}{(s-k)!}\,t^{s-k}. \qquad (B.10)$$

Now observe that because of assumption (B.7) and the definition and properties (A.2) of the finite difference operator, there exists a constant C_s > 0 such that,

$$\big\|\Delta^{i+k,s-i-k}_{h,t}[D_x^\alpha\hat u](0,x) - \Delta^{i+k,s-i-k}_{h,t}[D_x^\alpha u](0,x)\big\|_{L^q} \le C_s\,\varepsilon, \qquad \Big|\frac{\Delta^{i+k,s-i-k}_{h,t}[D_x^\alpha u](0,x)}{h^{i+k}} - D^{i+k,\alpha}u(0,x)\Big| \le C_s\,|D_x^\alpha u|_{C^{(s,0)}}\,h^{s-i-k}. \qquad (B.11)$$

Combining all previous results provides us with the existence of a constant c_s > 0 such that,

$$\Big\|D^{k,\alpha}\Big(\sum_{i=0}^{s-1}\frac{\Delta^{i,s-i}_{h,t}[\hat u](0,x)}{h^i\,i!}\,t^i - u(t,\cdot)\Big)\Big\|_{L^q} \le \sum_{i=0}^{s-1-k}\Big(\frac{C_s\,\varepsilon}{h^{i+k}\,i!}\,h^i + \frac{C_s}{i!}\,|D_x^\alpha u|_{C^{(s,0)}}\,h^{s-i-k}\,h^i\Big) + \frac{1}{(s-k)!}\,|D_x^\alpha u|_{C^{(s,0)}}\,h^{s-k} \le c_s\big(\varepsilon h^{-k} + |D_x^\alpha u|_{C^{(s,0)}}\,h^{s-k}\big). \qquad (B.12)$$
20
Definition B.4. Let C > 0, N ∈ N, 0 < ε < 1 and α = ln CN k /ε . For every 1 ≤ j ≤ N , we
define the function ΦN
j : [0, T ] → [0, 1] by
!
N 1 1 T
Φ1 (t) = − σ α t − ,
2 2 N
! !
N 1 T (j − 1) 1 Tj
Φj (t) = σ α t − − σ α t− , (B.13)
2 N 2 N
!
N 1 T (N − 1) 1
ΦN (t) = σ α t − + .
2 N 2
The functions {ΦNj }j approximate a partition of unity in the sense that for every j it holds on Ij
N
f − pN
j = h max i Dtℓ (f (t, ·) − pN
j (t, ·)) ≤ Cℓ∗ N −s+ℓ + ξ.
C ℓ (JjN ,Lq (µ)) t∈
(j−2)T (j+1)T
, N Lq (µ)
N
(B.15)
Let Ck := max{max0≤ℓ≤k Cℓ∗ , kf kC k ([0,T ],Lq (µ)) , 1}. There exists a constant C(k) > 0 that only
depends on k such that for all N ≥ 3 it holds that,
N
X
Ck
f− pN N
j · Φj ≤ C lnk (N ) + ξN k
. (B.16)
j=1
N s−k
C k ([0,T ],Lq (µ))
Proof. We follow the proof of [15, Theorem 5.1]. All steps of the proofs are identical, with the only
difference being that the W k,∞ ([0, 1]d )-norm of [15] is replaced by the C k ([0, T ], L2(µ))-norm
in this work. Following [15], one divides the domain [0, T ] into intervals IiN = [ti−1 , ti ], with
ti = iT /N and N ∈ N large enough. On each of these intervals, f locally can be approximated
(in Sobolev norm) by pNj , by virtue of the assumptions of the theorem. A global approximation can
then be constructed by multiplying each pN j with an approximation of the indicator function of the
corresponding intervals and summing over all intervals.
We now highlight the main steps in the proof. Step 2a (as in [15]) results in the following estimate,
!
XN k
CN
f− f · ΦNj ≤ Ckf kC k (I N ,Lq (µ)) ε + N k+1 lnk ε . (B.17)
j=1
i ε
C k (IiN ,Lq (µ))
Step 2b results in the estimate,
N
!
X CN k Ck
(f − pN
j ) · ΦN,d
j ≤ C ln k k
+ ξN + Ck N k+1
ε , (B.18)
j=1
ε N s−k
C k (IiN ,Lq (µ))
21
In particular, if we set N k+1 ε = N −s+k and N ≥ 3, then we find that
N
" #
X kf kC k (I N ,Lq (µ)) + Ck
N N k i k
f− p j · Φj ≤ C ln (N ) + ξN . (B.20)
j=1
N s−k
C k ([0,T ],Lq (µ))
Lemma B.6. Let Gθ : X → H be a tanh FNO with grid size N ∈ N and let B > 0. For every
ε > 0, there exists a tanh DeepONet Gθ∗ : X → H with N d sensors and N d branch and trunk nets
such that
sup sup Gθ∗ (v)(x) − Gθ (v)(x) ≤ ε. (B.21)
kvkL∞ ≤B x∈Td
Furthermore, width(β) ∼ N d , depth(β) ∼ ln(N ), width(τ ) ∼ N d (N + ln N/ε ) and
depth(τ ) = 3.
Proof. This is a consequence of [38, Theorem 36] and Lemma D.1 with ε ← N d ε.
Proof. Step 1: construction. To define the approximation, we divide [0, T ] into M subintervals
of the form [tm−1 , tm ], where tm = mT /M with 1 ≤ m ≤ M . One could approximate u on
every subinterval by an s-th order accurate Taylor approximation around tm , provided that one has
access to Dti u(·, tm ) for 0 ≤ i ≤ s − 1. As those values are unknown, we resort to the finite
difference approximation Dti u(·, tm ) ≈ M i · ∆i,s−i 1/M,t [U (u0 , tm )], which is a neural network. See
ε
SM A.2 for an overview of the notation for finite difference operators. Moreover, we replace the
univariate monomials ϕi : [0, T ] → R : t 7→ ti in the Taylor approximation by neural networks
s−1
bδi : [0, T ] → R with kϕi − ϕ
ϕ bδi kC k+1 . δ. Lemma A.3 guarantees that the output of (ϕ bδi )i=1 can
be obtained using a shallow network with width 2(s + 1) (independent of δ). The multiplication
operator is replaced by a shallow neural network × b δ : [−a, a]2 → R (for suitable a > 0) for which
k × −× b δ kC k+1 . δ. By Lemma A.4 only four neurons are needed for this network. This results in
the following approximation for f ∈ C 0 ([0, T ] × D),
s−1 i,s−i
X ∆1/M,t [f ](t m , x)
Nbmδ
[f ](t, x) := bδ
× bδi (t − tm ) ∀t ∈ [0, T ], x ∈ D, 1 ≤ m ≤ M.
,ϕ
M −i i!
i=0
(B.22)
Next, we patch together these individual approximations by (approximately) multiplying them with
a NN approximation of a partition of unity, denoted by ΦM
1 , . . . , ΦM : [0, T ] → [0, 1], as introduced
M
function on [tm−1 , tm ]. For any ε, δ > 0, we then define our final neural network approximation
b : [0, T ] × D → R as,
u
M
X
b(t, x) :=
u bδ N
× bm
δ
[U ε (u0 , tm )](t, x), ΦM
m (t) ∀t ∈ [0, T ], x ∈ D. (B.23)
m=1
Step 2: error estimate. In order to facilitate the proof, we introduce the intermediate approxima-
e : [0, T ] × D → R and Nm : C 0 (D) × [0, T ] × D → R by,
tions u
M
X s−1 ∆i,s−i [b
M X
X ε
1/M,t um ](tm , x)
e(t, x) :=
u uεm ](t, x)·ΦM
Nm [b m (t) := ·ϕi (t−tm )·ΦM
m (t), (B.24)
m=1 m=1 i=0
M −i i!
22
It remains to prove that D(k,α) u
e ≈ D(k,α) u. Combining the observation that D(k,α) Nm [b uεm ] =
k α ε
Dt Nm [Dx u bm ] with Lemma B.3 lets us conclude that for all 0 ≤ k ≤ s − 1 and t ∈ [tm−2 , tm+2 ],
D(k,α) (Nm [b
uεm ](t, ·) − u(t, ·)) ≤ C(r)M k ( Dxα (b
uεm − u)(·, tm ) Lq
+ |u|C (s,ℓ) M −s )
Lq
(B.25)
We use Theorem B.5 with f ← u, pN α ε
bm ], ξ ← C(r)M k
j ← Nm [Dx u uεm − u)(·, tm ) Lq ,
Dxα (b
Cℓ ← C(s)|u|C (k,ℓ) , N ← M to find that,
∗
D(k,α) (b
u − u) ≤ C lnk (M )(kukC (s,ℓ) M k−s + M 2k Dxα (b
uεm − u)(·, tm ) Lq
), (B.26)
Lq
where C(r, s) > 0 only depend on r and s. Finally, using Lemma B.1 to bound
uεm − u)(·, tm ) Lq and combining this with Assumption 3.1 proves (3.4).
Dxα (b
Step 3: size estimate. The following holds,
u) ≤ Cdepth(U ε ), width(b
depth(b u) ≤ CM width(U ε ). (B.27)
First, we find using a Sobolev embedding result (Lemma A.1) and Lemma A.2 that,
∗
u0 − (QN ◦ EN )(u0 ) Lp
≤ u0 − (QN ◦ EN )(u0 ) ≤ C(d, r)N −r+d/p ku0 kH r ,
H d/p∗
(B.29)
where p∗ is such that 1/p + 1/p∗ = 1/2. Next, we observe that for any u0 ∈ X with ku0 kC r ≤ B
that (QN ◦ EN )(u0 ) H r (Td ) ≤ CB =: B. Hence, by applying Lemma A.2 to the second and last
term of (B.28) we find that,
∗
ε
kG − Gθ kL2 ≤ C(ε + Cstab BN −r+d/p + Cε,r
B
N −r ). (B.30)
Step 3: size estimate. As for any FNO, the width is equal to N d width(U ε ). The depth in this case
is equal to depth(U ε ).
cf. Lemma D.1. Using notation from SM A.6, let QN : R|JN | → C(Td ) be the trigonometric
polynomial interpolation operator as in (A.18) and let EN : C(Td ) → R|JN | be the encoder as in
(A.20). We define
X X
QbN : R|JN | → C(Td ) : y 7→ 1 yj ak,j b
ek , (B.32)
|KN |
k∈KN j∈JN
23
with coefficients ak,j as in (A.19), as a neural network approximation of QN .
Inspired by the proof of Theorem 3.5 (and using its notation as well), we define Gb : C(Td ) → L2 (µ)
by
s−1 ∆i,s−i [U ε (Q ◦ E ◦ u , t )](t , x)
M X
X 1/M Z Z 0 m m
b 0 )(t, x) =
G(u · ϕδi (t − tm )ΦM (B.33)
m (t),
m=1 i=0
M −i i!
Then it holds that
M Xs−1 i,s−i ε
X X X ak,j ∆1/M [U (QZ ◦ EZ ◦ u0 , tm )](tm , xj )
b 0 )(t, x) =
(QN ◦ EN ◦ G)(u · Ψi,m,k (t, x)
m=1 i=0
|KN | M −i i!
k∈KN j∈JN
Next, we observe that using Assumption 3.1, Assumption 3.6 and (B.29) it holds that for all t,
∗
(U ε (QZ ◦ EZ ◦ u0 ) − G(u0 ))(·, t) L2
ε
≤ ε + Cstab CBZ −r+d/p . (B.38)
One can then use Theorem 3.5, but by replacing ε by (B.38) in the error bound (3.4), to find that
∗
b
D(k,α) (G − G) ≤ C lnk (M )(kukC (s,ℓ) M k−s + M 2k ((ε + Cstab
ε CB r−ℓ
Z −r+d/p )h−ℓ + Cε,ℓ h ))
L2
(B.39)
Then, using the observation that D(k,α) (Id − QN ◦ EN )Gb = Dxα (Id − QN ◦ EN )Dtk Gb we find that
b 0)
D(k,α) (Id − QN ◦ EN )G(u b 0)
≤ CN −(r−ℓ) Dtk G(u , (B.40)
L2 Hr
which can be combined with the estimate
b 0)
Dtk G(u ≤ M s−1 · M k lnk (M ) U ε (QZ ◦ EZ ◦ u0 ) ≤ M s+k−1 lnk (M )Cε,r
B
, (B.41)
Hr Hr
where we used that for u0 ∈ X with ku0 kC r ≤ B it holds (QN ◦ EN )(u0 ) H r (Td )
≤ CB =: B.
Next, we make the rough estimate that,
bN − QN ) ◦ EN )G(u
D(k,α) (Q b 0) ≤ CN d M s+k−1 lnk (M ) max kek − b
ek kC r . (B.42)
L2 k
24
By setting η = N ℓ−r−d, h = 1/N and using that M 2k ≤ M k+s−1 and Cε,ℓ
B B
≤ Cε,r we find,
∗
L(G − Gθ ) ≤ C lnk (M )(kukC (s,ℓ) M k−s + M k+s−1 ((ε + Cstab
L2
ε
Z −r+d/p )N ℓ + Cε,r
B
N ℓ−r )).
(B.44)
We conclude by using that lnk (M ) ≤ CM ρ for any ρ > 0.
Step 3: size estimate. It follows immediately that depth(β) = depth(U ε ), width(β) =
O(M (Z d + N d width(U ε ))), depth(τ ) = 3 and width(τ ) = O(M N d (N + ln(N ))).
Proof. Define the random variable Y = EG (θ∗ (S))2 − ET (θ∗ (S), S)2 . Then if follows from equa-
tion (4.8) in the proof of [16, Theorem 5] that
d !
2 2RL Θ −2ε4 n
P(Y > ε ) ≤ exp , (B.45)
ε2 c2
since P(Y > ε2 ) = 1 − P (A), where A is as defined in the proof of [16, Theorem 5]. It follows that
E[Y ] = E[Y 1Y ≤ε2 ] + E[Y 1Y >ε2 ] ≤ ε2 + cP Y > ε2 . (B.46)
Setting ε2 = cP Y > ε2 leads to
v
u
u 2c2 dΘ !
c 2RL
E[Y ] ≤ 2ε2 = t ln 2 . (B.47)
n ε ε2
√
For ε < 1, and using that ln(x) ≤ x for all x > 0, this equality implies that
2c3 (2RL)dΘ /2
εdΘ +1 ≤ . (B.48)
n
Hence, we find that if n ≥ 2c2 e8 /(2RL)dΘ /2 then εdΘ +1 ≤ ce−8 (2RL)dΘ which implies that
dΘ ! −1/2
ln c 2RL 1
2 2
≤ √ . (B.49)
ε ε 2 2
Using once more that ε2 = cP Y > ε2 and (B.49) gives us,
v
u
u dΘ +1
u 2 √ dΘ ! −1/2
u 2c 2n c 2RL
E[Y ] ≤ u ln
c(2RL)dΘ ln 2
t n c ε ε 2
(B.50)
r r
2c2 √ 2c2 (dΘ + 1) √
≤ ln (aL n)dΘ +1 = ln aL n .
n n
Lemma C.1. Let ε > 0, let(Ω, F , P) be a probability space, and let X : Ω → R be a random
variable that satisfies E |X| ≤ ε. Then it holds that P(|X| ≤ ε) > 0.
25
Lemma C.2. Let γ ∈ {0, 1}, β ∈ [1, ∞), α0 , α1 , x0 , x1 , x2 , . . . ∈ [0, ∞) satisfy for all k ∈ N0 that
k−1
X h i
xk ≤ 1N (k)(α0 + α1 k)β k +
γ
(k − l) β (k−l) xl + xmax{l−1,0} . (C.1)
l=0
Lemma C.3. Let α ∈ [1, ∞), x0 , x1 , . . . ∈ [0, ∞) satisfy for all k ∈ N0 that xk ≤ αxkk−1 . Then it
holds for all k ∈ N0 that
xk ≤ α(k+1)! xk!
0 (C.3)
Proof. We provide a proof by induction. First of all, it is clear that x0 ≤ αx0 . For the induction
(k−1)!
step, assume that xk−1 ≤ αk! x0 for an arbitrary k ∈ N0 . We calculate that
k
(k−1)!
xk ≤ α αk! x0 ≤ α(k+1)! xk!
0 . (C.4)
Lemma C.4. Let ℓ ∈ N, f ∈ C ℓ (R, R), h ∈ C ℓ (Td , R) and let Bℓ denote the ℓ-th Bell number.
Then it holds that
ℓ
|f ◦ h|C ℓ (R) ≤ kf kC ℓ (R) Bℓ khkC ℓ−1 (Td ) + |h|C ℓ (Td ) . (C.5)
Proof. Let Π be the set of all partitions of the set {1, . . . , ℓ}, let α ∈ Nd0 such that kαk1 = ℓ and
ℓ
let ι : Nℓ → Nd be a map such that Dα = Qℓ ∂ x . Then the Faà di Bruno formula can be
j=1 ι(j)
reformulated as [12],
X Y ∂ |B| h(x)
Dα f (h(x)) = f (|π|) (h(x)) · Q
π∈Π B∈π j∈B ∂xι(j)
X Y ∂ |B| h(x) (C.6)
= f (|π|) (h(x)) · Q + f ′ (h(x))Dα h(x).
π∈Π, B∈π j∈B ∂xι(j)
|π|≥2
Combining this formula with the definition of the Bell number as Bℓ = |Π|, we find the following
upper bound,
X ℓ
|f ◦ h|C ℓ (R) ≤ kf kC ℓ (R) khkC ℓ−1 (R) + kf kC 1 (R) |h|C ℓ (R)
π∈Π
(C.7)
ℓ
≤ kf kC ℓ (R) Bℓ khkC ℓ−1 (R) + |h|C ℓ (R) .
Definition C.5. Let (Ω, F , µ) be a measure space and let q > 0. For every F /B(Rd )-measurable
function f : Ω → Rd , we define
ˆ 1/q
q
kf kLq (µ,k·k d ) := f (ω) Rd µ(dω) . (C.8)
R
Ω
26
Let (Ω, F , P, (Ft )t∈[0,T ] ) be a stochastic basis, D ⊆ Rd a compact set and, for every x ∈ D, let
X x : Ω × [0, T ] → Rd be the solution, in the Itô sense, of the following stochastic differential
equation,
dXtx = µ(Xtx )dt + σ(Xtx )dBt , X0x = x, x ∈ D, t ∈ [0, T ], (C.9)
where Bt is a standard d-dimensional Brownian motion on (Ω, F , P, (Ft )t∈[0,T ] ). The existence of
X x is guaranteed by [3, Theorem 4.5.1].
As in [16, Theorem 3.3] we define ρd as
kXsx − Xtx kLq (P,k·k )
Rd
ρd := max sup 1 < ∞, (C.10)
x∈D s,t∈[0,T ], |s − t| p
s<t
where X x is the solution, in the Itô sense, of the SDE (C.9), q > 2 is independent of d and
k·kLq (P,k·k d ) is as in Definition C.5.
R
Lemma C.6. In Setting 4.1, Assumption 3.1 and Assumption 3.6 are satisfied with
u(·, t) − U ε (ϕ, t) L2 (µ)
≤ ε, B
Cε,ℓ = CB · poly(dρd ), ε
Cstab = 1, p = ∞, (C.11)
where t ∈ [0, T ] and ϕ ∈ C02 (Rd ). Moreover, there exists C ∗ > 0 (independent of d) for which it
holds that depth(U ε ) ≤ C ∗ depth(ϕbε ) and {width, size}(U ε ) ≤ C ∗ ε−2 {width, size}(ϕ
bε ).
Proof. It follows from the Feynman-Kac formula that u(t, x) = E ϕ(Xtx ) [62]. Replacing ϕ by a
neural network ϕbε with kϕ − ϕbε kC 0 ≤ ε gives us for any probability measure µ that,
ε x
E ϕ(Xtx ) − E ϕ b (Xt ) ≤ kϕ − ϕ bε kC 0 . (C.12)
L2 (µ)
From [16, Lemma A.5], for all x ∈ Rd , t ∈ [0, T ] and ω ∈ Ω it holds that
d
X
Xtx (ω) = Xtei (ω) − Xt0 (ω) xi + Xt0 (ω). (C.14)
i=1
Using this equality, together with Hölder’s inequality and the boundedness of kXtx kLp [16, Lemma
A.5] we find that,
1/2
2
m
1 X α ε x
ˆ
E (IIα ) := E bε (Xtx ) −
E Dxα ϕ Dx ϕb (Xt (ωm )) µ(dx)
D m i=1
1/4
4
m
X
ε x 1
ˆ
≤ C · poly(dρd ) · E
b (Xt ) −
E ϕ bε (Xtx (ωm )) µ(dx)
ϕ
D m i=1
≤ CB · poly(dρd )
(C.15)
Combining the previous results gives us,
√ X
E m · (I) + (IIα ) ≤ CB · poly(dρd ). (C.16)
kαk1 ≤ℓ
27
If we combine this with Lemma C.1 then we find the existence of (ωi∗ )mi=1 such that for
Xm X d
1
U ε (ϕ, t)(x) = bε
ϕ Xtei (ωi∗ ) − Xt0 (ωi∗ ) xi + Xt0 (ωi∗ ) (C.17)
m i=1 i=1
it holds that
ε x 2C
b (Xt ) − U ε (ϕ, t)
E ϕ ≤√ . (C.18)
L2 (D) m
and by setting m = ε−2 and (C.12) we find that,
Proof of Theorem 4.2. We use Theorem 3.5 with k = 1 and ℓ = 2 and combine the result with
Lemma C.6. We find that for every M ∈ N and δ, h > 0 it holds that,
L(b
u − u) Lq ([0,T ]×D)
+ kb
u − ukL2 (∂([0,T ]×D))
(C.21)
≤ CB · poly(dρd ) · ln(M )(kukC (1,2) M 1−s + M 2 (δh−2 + hr−2 )).
Using that ln(M ) ≤ CM σ for arbitrarily small σ > 0, we find that we should set
r+σ s+1 −1−σ
δ = ε r−2 s−1 , M =ε s−1 (C.23)
28
C.4 Multilevel Picard approximations
In what follows, we will provide a definition of a particular kind of MLP approximation (cf. [33])
and a theorem that quantifies the accuracy of the approximation. First, we rigorously introduce the
setting of the nonlinear parabolic PDE (4.3) that is under consideration, cf. [33, Setting 3.2 with
p ← 0]. We choose the d-dimensional torus Td = [0, 2π)d as domain and impose periodic boundary
conditions. This setting allows us to use the results of [33], which are set in Rd , and yet still consider
a bounded domain so that the error can be quantified using an uniform probability measure.
Setting C.7. Let d, m ∈ N, T, L, L ∈ [0, ∞), let (Td , B(Td ), µ) be a probability space where µ is
the rescaled Lebesgue measure, let g ∈ C(Td , R) ∩ L2 (µ), let F ∈ C(R, R), assume for all x ∈ Td ,
y, z ∈ R that
F (y) − F (z) ≤ L|y − z|, max{ F (y) , g(x) } ≤ L. (C.24)
Let ud ∈ C 1,2 ([0, T ] × Rd , R) ∩ L2 (µ) satisfy for all t ∈ [0, T ], x ∈ Rd that
(∂t ud )(t, x) = (∆x ud )(t, x) + F (ud (t, x)), ud (0, x) = g(x). (C.25)
Assume that for every ε > 0 there exists a neural network Fbε , a neural network gbε and a neural
network Iε with depth depth(Iε ) = depth(Fbε ) such that
Fbε − F ≤ ε, kb
gε − gkL2 (µ) ≤ ε, kIε − IdkC 0 ([−1−L,1+L]) ≤ ε. (C.26)
C 0 (R)
Note that for some of the equations introduced in Section C.3 the nonlinearity F might not be
globally Lipschitz and hence does not satisfy (C.24). However, it is easy to argue or rescale g [43, 5]
such that ud is globally bounded by some constant C. For instance, for the Allen-Cahn equation
it holds that if kgkL∞ ≤ 1 then ud (t, ·) L∞ ≤ 1 for any t ∈ [0, T ] [75]. One can then define a
‘smooth’, globally Lipschitz, bounded function Fe : R → R such that Fe (v) = F (v) for |v| ≤ C and
such that Fe (v) = 0 for |v| > 2C. This will then also ensure the existence of a neural network Fb
that is close to Fe in C 0 (R)-norm.
In this setting, multilevel Picard approximations can be introduced. We follow the definition of [33].
S
Definition C.8 (MLP approximation). Assume Setting C.7. Let Θ = n∈N Zn , let (Ω, F , P) be a
probability space, let Y θ : Ω → [0, 1], θ ∈ Θ, be i.i.d. random variables, assume for all θ ∈ Θ,
r ∈ (0, 1) that P(Y θ ≤ r) = r, let Uθ : [0, T ] × Ω → [0, T ], θ ∈ Θ, satisfy for all t ∈ [0, T ], θ ∈ Θ
that Uθt = t + (T − t)Y θ , let W θ : [0, T ] × Ω → Rd , θ ∈ Θ, be independent standard Brownian
motions, assume that (Uθ )θ∈Θ and (W θ )θ∈Θ are independent, and let Unθ : [0, T ] × Td × Ω → R,
n ∈ Z, θ ∈ Θ, satisfy for all n ∈ N0 , θ ∈ Θ, t ∈ [0, T ], x ∈ Td that
" mn #
θ 1N (n) X (θ,0,−k)
Un (t, x) = g(x + WT −t )
mn
k=1
n−1
" n−i #
X (T − t) mX
) − 1N (i)F (Ui−1
(θ,i,k) (θ,−i,k) (θ,i,k) (θ,i,k)
+ (F (Ui ))(Ut , x + W (θ,i,k) ) .
i=0
mn−i Ut −t
k=1
(C.27)
Example C.9. In order to improve the intuition of the reader regarding Definition C.8, we provide
explicit formulas for the multilevel Picard approximation (C.27) for n = 0 and n = 1,
"m #
θ θ 1 X (θ,0,−k)
U0 (t, x) = 0 and U1 (t, x) = g(x + WT −t ) + (T − t)F (0). (C.28)
m
k=1
Finally, we provide a result on the accuracy of MLP approximations at single space-time points.
Theorem C.10. It holds for all n ∈ N0 , t ∈ [0, T ], x ∈ Td that
!1/2
0
2 L(T + 1) exp(LT )(1 + 2LT )n
E Un (t, x) − u(t, x) ≤ . (C.29)
mn/2 exp −m/2
29
C.5 Neural network approximation of nonlinear parabolic equations
In this section, we will prove that the solution of the nonlinear parabolic PDE as in Setting C.7 can
be approximated with a neural network without the curse of dimensionality. At this point, we do not
specify the activation function, with the only restriction being that the considered neural networks
should be expressive enough to satisfy (C.26). By emulating an MLP approximation and using that
F , g and the identity function can be approximated using neural networks, the following theorem
can be proven.
Theorem C.11. Assume Setting C.7. For every ε, σ > 0 and t ∈ [0, T ] there exists a neural network
bε : Td → R such that
u
bε (·) − u(t, ·) L2 (µ) ≤ ε.
u (C.30)
In addition, u
b satisfies that
gδ ) + logC2 (3C1 exp m/2 /ε)depth(Fbδ ),
uε ) ≤ depth(b
depth(b
!2+3σ
4C1 exp m/2 (C.31)
width(b
uε ), size(b gδ ) + size(Fbδ ) + size(Iδ ))
uε ) ≤ (size(b ,
ε
where
C1 = (T + 1)(1 + L exp(LT )), C2 = 5 + 3LT,
2
ε 2(1+1/σ) (C.32)
δ= , m = C2 .
9C12 exp m/2
Proof. Step 1: construction of the neural network. Let ε, δ > 0 be arbitrary and let Fb = Fbδ ,
gb = gbδ and I = Iδ as in Setting C.7. We then define for all n ∈ N and θ ∈ Θ,
" mn #
b θ 1N (n) X n−1 (θ,0,−k)
Un (t, x) = (I ◦bg)(x + WT −t )
mn
k=1
n−1
" n−i #
X (T − t) mX
) − 1N (i)(I
(θ,i,k) (θ,−i,k) (θ,i,k) (θ,i,k)
+ ((I n−i−1
◦ Fb )(U
b
i
n−i
◦ Fb )(U
b
i−1 ))(Ut , x + W (θ,i,k) ) ,
i=0
mn−i Ut −t
k=1
(C.33)
with notation and random variables cf. Definition C.8. Note that for every t ∈ [0, T ], n ∈ N, θ ∈ Θ,
bnθ (t, ·) is a neural network that maps from Td to R.
every realization of the random variable U
Let n ∈ N0 , m ∈ N and t ∈ [0, T ] be arbitrary. Integrating the square of the error bound of Theorem
C.10 and Fubini’s theorem tell us that
ˆ ˆ
2 2
E Un0 (t, x) − u(t, x) dµ(x) = E Un0 (t, x) − u(t, x) dµ(x)
Td Td
(C.34)
4L2 (T + 1)2 exp(2LT )(1 + 2LT )2n
≤ .
mn exp(−m)
From Lemma C.1 it then follows that
!
2 4L2 (T + 1)2 exp(2LT )(1 + 2LT )2n
ˆ
0
P Un (t, x) − u(t, x) dµ(x) ≤ > 0. (C.35)
Td mn exp(−m)
As a result, there exists ω = ω(t, n, m) ∈ Ω and a realization Un0 (ω) such that
L(T + 1) exp(LT )(1 + 2LT )n
Un0 (ω)(t, ·) − u(t, ·) ≤ . (C.36)
L2 (µ) mn/2 exp −m/2
We define
ω : [0, T ] × N2 → Ω : (t, n, m) 7→ ω(t, n, m) (C.37)
and set for every 1 ≤ k ≤ n,
b θ (t, x) = U
U b θ (ω(t, n, m))(t, x) θ
and Uk,ω (t, x) = Ukθ (ω(t, n, m))(t, x) (C.38)
k,ω k
30
b 0 (t, ·).
for all k ∈ N0 and all θ ∈ Θ. We then define our approximation as U n,ω
g − gkL2 (µ) +T Fb − F
and in addition we define α0 = kb , α1 = kI − IdkC 0 and β = 2+LT .
C 0 (R)
Taking the supremum over all θ ∈ Θ in (C.40) gives us for all k ∈ N0 that,
k−1
X
xk ≤ 1N (k)(α0 + α1 k)β + k
β k−i (xi + xmax{i−1,0} ). (C.42)
i=0
Therefore, we can use Lemma C.2 with γ ← 0 then gives us that for all k ∈ N0 it holds that,
bk,ω
sup U θ θ
(t, ·) − Uk,ω (t, ·) 2
θ∈Θ L (µ)
√ k
(1 + 2)
≤ 1N (k) g − gkL2 (µ) + T Fb − F
kb + kI − IdkC 0 (2 + LT )k .
2 C 0 (R)
(C.43)
Next we define
C1 = (T + 1)(1 + L exp(LT )), C2 = 5 + 3LT. (C.44)
Combining (C.36) with (C.43) then gives us that,
b 0 (t, ·) − u(t, ·)
U n,ω
L2 (µ)
≤ Ub 0 (t, ·) − U 0 (t, ·) 0
+ Un,ω (t, ·) − u(t, ·)
n,ω n,ω
L2 (µ) L2 (µ)
(C.45)
≤ C1 C2n kb g − gkL2 (µ) + Fb − F 0 + kI − IdkC 0 + m−n/2 exp m/2 .
C (R)
31
For an arbitrary σ > 0, we choose
2(1+1/σ)
m = C2 , n = σ logC2 (4C1 exp m/2 /ε) (C.46)
and if we choose gb = gbδ and Fb = Fbδ such that,
ε ε1+σ
kb
g − gkL2 (µ) ≤ δ = n = 1+σ
, (C.47)
4C1 C2 (4C1 ) exp σm/2
then we obtain that
b 0 (t, ·) − u(t, ·)
U ≤ ε. (C.48)
n,ω
L2 (µ)
Step 3: size estimate. We now provide estimates on the size of the network constructed in Step 1.
First of all, it is straightforward to see that the depth of the network can be bounded by
Lε (Ub 0 ) ≤ Lδ (b g ) + logC2 (3C1 exp m/2 /ε)Lδ (Fb ).
g ) + (n − 1)Lδ (Fb ) ≤ Lδ (b (C.49)
n,ω
Next we prove an estimate on the number of needed neurons. For notation, we write Mn =
Mε (Ub 0 ). We find that for all 0 ≤ k ≤ n,
n,ω
Mk ≤ 1N (k)mk (Mδ (b
g ) + (k − 1)Mδ (I))
k−1
X
+ mk−i (2Mδ (Fb ) + (2k − 2i − 1)Mδ (I) + Mi + Mmax{i−1,0} )
i=0 (C.50)
k−1
X
≤ 1N (k)(Mδ (b
g ) + Mδ (Fb ) + kMδ (I))(2m)k + mk−i (Mi + Mmax{i−1,0} ).
i=0
Applying Lemma C.2 to (C.50) (i.e. α0 ← Mδ (b g) + Mδ (Fb ), α1 ← Mδ (I) and β ← 2m) then
gives us that
1 √
Mn ≤ (Mδ (b g ) + Mδ (Fb) + Mδ (I))(1 + 2)n (2m)n . (C.51)
2
√ 2(1+1/σ)
Observing that 2 + 2 2 ≤ C2 and recalling that m = C2 we find that
1 (3σ+2)n/σ
Mn ≤ g ) + Mδ (Fb) + Mδ (I))C2
(Mδ (b
2
!2+3σ (C.52)
1 4C1 exp m/2
g ) + Mδ (Fb) + Mδ (I))
= (Mδ (b .
2 ε
bn,ω
For the width, we make the estimate widthε (U 0
) ≤ Mn .
Setting C.12. Assume Setting C.7, let b g ∈ C(Td , R) ∩ L2 (µ)3 and let ω : [0, T ] × N2 → Ω be
defined as in (C.37) in the proof of Theorem C.11. Let U bn,ω
θ :
[0, T ] × Td × Ω → R, n ∈ Z, θ ∈ Θ,
d
satisfy for all n ∈ N0 , ε > 0, θ ∈ Θ, t ∈ [0, T ], x ∈ T that
" mn #
b θ 1N (n) X n−1 (θ,0,−k)
Un,ω (t, x) = (Iε ◦ gb)(x + WT −t (ω(t, n, m)))
mn
k=1
" n−i
X (T − t) mX
n−1
+ (Iεn−i−1 ◦ Fbε )(U b (θ,i,k) )
n−i i,ω
i=0
m
k=1
#
− 1N (i)(I n−i ◦ Fbε )(U
b (θ,−i,k) (θ,i,k) (θ,i,k)
ε i−1,ω) U t (ω(t, n, m)), x + W (θ,i,k) (ω(t, n, m)) .
Ut −t
(C.53)
3
The function gb can but need not be the same as the function gbε , for some ε > 0, of Setting C.7.
32
Lemma C.13. Assume Setting C.12. Under the assumption that,
max Ij ≤ 2, (C.54)
1≤j≤k C k ([−L−1,L+1])
and where Bℓ denote the ℓ-th Bell number i.e., the number of possible partitions of a set with ℓ
elements.
bk,ω
sup U θ
gkC 0 (Td ) + 2T Fb
≤ kb k. (C.56)
θ∈Θ C 0 ([0,T ]×Td ) C 0 (R)
ℓ
and where (again using Lemma C.4) it holds that I j ◦ Fb ≤ 2Bℓ Fb .
C ℓ (R) Cℓ
Using this estimate and the fact that (Ck,ℓ )k≥0 is non-decreasing for any ℓ, we can make the follow-
ing calculation for every k ∈ N0 ,
bk,ω
sup U θ
θ∈Θ C (0,ℓ) ([0,T ]×Td )
k−1
X
≤ 1N (k)|b
g|C ℓ (Td ) + T sup (I k−i−1 ◦ Fb)(U
bθ )
i,ω
C (0,ℓ) (([0,T ]×Td ))
i=0 θ∈Θ
k−1
X
+T 1N (i) sup (I k−i ◦ Fb)(Ubi−1,ω
θ
)
θ∈Θ C (0,ℓ) (([0,T ]×Td ))
i=0
k−1
!
ℓ X
≤ 1N (k)|b
g|C ℓ (Td ) + 2Bℓ T Fb ℓ
Bℓ Ck,ℓ−1 + sup bθ
U i,ω
C ℓ (R) θ∈Θ C (0,ℓ) (([0,T ]×Td ))
i=0
k−1
!
ℓ X
+ 2Bℓ T Fb 1N (i) ℓ
Bℓ Ck,ℓ−1 + sup bθ
U i−1,ω
C ℓ (R) θ∈Θ C (0,ℓ) (([0,T ]×Td ))
i=0
ℓ
≤ 1N (k)(|b
g|C ℓ (Td ) + 2Bℓ T Fb ℓ
Ck,ℓ−1 k)
C ℓ (R)
k−1
!
X ℓ
+ 2Bℓ T Fb bθ
sup U i,ω + 1N (i) sup U
bθ
i−1,ω
C ℓ (R) θ∈Θ C (0,ℓ) (([0,T ]×Td )) θ∈Θ C (0,ℓ) (([0,T ]×Td ))
i=0
(C.58)
33
ℓ
Application of Lemma C.2 with α0 ← |b
g|C ℓ (Td ) , α1 ← 2Bℓ Ck,ℓ−1
ℓ
, β ← (1 + 2BℓT Fb ) and
C ℓ (R)
γ ← 0 gives us
ℓ
|b
g |C ℓ (Td ) + 2Bℓ Ck,ℓ−1 √ k ℓ
bθ
sup U ≤ (1 + 2) (1 + 2Bℓ T Fb )k
k,ω
θ∈Θ C 0 ([0,T ]×Td ) 2 C ℓ (R)
(C.59)
√ ℓ
≤ |b
g |C ℓ (Td ) + 2Bℓ (1 + 2) (1 + 2Bℓ T Fb
k k
) ℓ
Ck,ℓ−1 .
C ℓ (R)
Filling in the definition of Ck,ℓ−1 indeed gives us the formula as stated in (C.55), thereby concluding
the proof of the claim.
Lemma C.14. Let F be a polynomial. For every σ, ε > 0 there is an operator U ε as in Assumption
3.1 such that for every t ∈ [0, T ],
Proof. The three bounds are a consequence of, respectively, Theorem C.11 and Lemma C.13 and
(C.45). The size estimates follow from Theorem C.11. Note that one might have to rescale the
constant σ > 0.
In [43], numerous error estimates for DeepONets are proven, with a focus on DeepONets that use
the ReLU activation function. In order to quantify this error, the authors fix a probability measure
µ ∈ P(X ) and define the error as,
1/2
ˆ ˆ
2
Eb = G(u)(y) − Gθ (u)(y) dy dµ(u) , (D.1)
X U
assuming that there exist embeddings X ֒→L2 (D) and Y֒→L2 (U ). From [43, Lemma 3.4], it then
follows that Eb (D.1) can be bounded as,
Eb ≤ Lipα (G)Lip(R ◦ P) (EbE )α + Lip(R)EbA + EbR , (D.2)
where Lipα (·) denotes the α-Hölder coefficient of an operator and where EbE quantifies the encoding
error, where EbA is the error incurred in approximating the approximator A and where EbR quantifies
the reconstruction error. Assuming that all Hölder coefficients are finite, one can prove that Eb is
small if EbE , EbA and EbR are all small. We summarize how each of these three errors can be bounded
using the results from [43].
• The upper bound on the encoding error EbE depends on the chosen sensors and the spectral
decay rate for the covariance operator associated to the measure µ. Use bespoke sensor
points to obtain optimals bounds when possible, otherwise use random sensors to obtain
almost optimal bounds. More information can be found in [43, Section 3.5].
• The upper bound on the reconstruction error EbR depends on the smoothness of the operator
and the chosen basis functions τ i.e., neural networks, for the reconstruction operator R.
Following [43, Section 3.4], one first chooses a standard basis τe of which the properties
are well-known. We denote the corresponding reconstruction by R e and the corresponding
34
reconstruction error by EbR e . In this work, we focus on Fourier and Legendre basis func-
tion, both of which are introduced in SM A. One then proceeds by constructing the neural
network basis τ i.e., the trunk nets, that satisfy for some ε > 0 and p ≥ 1 the condition
ε
max kτk − τek kL2 ≤ , (D.3)
k=1,...,p p3/2
which is shown to imply that,
EbR ≤ EbR
e + Cε, (D.4)
where C ≥ 1 depends only on L2 kuk2 dG# µ(u). Using standard approximation theory,
´
one can calculate an upper bound on EbRe and using neural network theory one can quantify
the network size of τ needed such that (D.3) is satisfied. For the Fourier and Legendre
bases such results are presented in Lemma D.1 and Lemma D.2, respectively.
• The upper bound on the approximation error EbA depends on the regularity of the operator
G. We present the tanh counterparts of some results of [43, Section 3.6] in the following
sections, with the main result being Theorem D.6.
For bounded linear operators, these calculations are rather straightforward and are presented in [43,
SM D]. For nonlinear operators, one has to complete all the above steps for each specific case. In
[43, Section 4], this has been done for four types of differential equations.
Following Section D.1, we need results on the required neural network size to approximate the
reconstruction basis to a certain accuracy (D.3). The following lemma provides such a result for the
Fourier basis introduced in SM A.5.
Lemma D.1. Let s, d, p ∈ N. For any ε > 0, there exists a trunk net τ : Rd → Rp with 2 hidden
d+1
layers of width O(p d + ps ln psε−1 ) and such that
Proof. We note that each element in the (real) trigonometric basis e1 , . . . , ep can be expressed in
the form
ej (x) = cos(κ · x), or ej (x) = sin(κ · x), (D.6)
for κ = κ(j) ∈ Zd with |κ|∞ ≤ N , where N is chosen as the smallest natural number such that
p ≤ (2N + 1)d . We focus only focus on the first form, as the proof for the second form is entirely
similar. Define f : [0, 2π]d → R : x 7→ κ · x and g : [−2πdN, 2πdN ] → R : x 7→ cos(x).
As f ([0, 2π]d ) ⊂ [−2πdN, 2πdN ], the composition g ◦ f is well-defined and one can see that it
coincides with a trigonometric basis function ej . Moreover, the linear map f is a trivial neural
network without hidden layers. Approximating ej by a neural network τj therefore boils down to
approximating g by a suitable neural network.
From [15, Theorem 5.1] it follows that the function g there exists an independent constant R > 0
such that for large enough t ∈ N there is a tanh neural network gbt with two hidden layers and
O(t + N ) neurons such that
This can be proven from [15, eq. (74)] by setting δ ← 31 , k ← s, s ← t, N ← 2 and using
kgkC s = 1 and Stirling’s approximation to obtain
t−s t−s
1 3 1 e
≤p ≤ exp(s − t) for t > s + e2 . (D.8)
(t − s)! 2 · 2 2π(t − s) t − s
35
Setting t = O(ln δ −1 + s ln(s)) then gives a neural network b
gt with kg − gbt kC s < η. Next, it
follows from [15, Lemma A.7] that
s
kg ◦ f − gbt ◦ f kC s ([0,2π]d ) ≤ 16(e2 s4 d2 )s kg − b
gt kC s ([−2πdN,2πdN ])kf kC s ([0,2π]d )
(D.9)
≤ 16(e2 s4 d2 )s η(2πdN )s .
From this follows that we can obtain the desired accuracy (D.5) if we set τj = b
gt(η) ◦ f with
εp−3/2
η= , (D.10)
16(2πN d3 e2 s4 )s
which amounts to t = O(s ln sNε−1 ). As a consequence, the tanh neural network τj has two
hidden layers with O(s ln sN ε−1 + N ) neurons and therefore, by recalling that p ∼ N d , the
combined network τ has two hidden layers with
d+1
O(p(s ln sN ε−1 + N )) = O(ps ln psε−1 + p d ) (D.11)
neurons.
Proof. Consider the setting of Theorem 4.6. Using [43, Theorem D.3], the reasoning as in [43,
Example D.4] and Lemma D.1 we find that there exists a constant C = C(d, ℓ) > 0, such that for
any m, p, s ∈ N there exists a DeepONet with trunk net τ and branch net β, such that
d+1
size(τ ) ≤ C(p d + ps ln psε−1 ), depth(τ ) = 3, (D.12)
and where
size(β) ≤ p, depth(β) ≤ 1, (D.13)
and such that the DeepONet approximation error (D.1) is bounded by
!
c m1/d
1/d
G(v) − Gθ (v) L2 (µ×λ)
≤ ε + C exp −c p + C exp − 1/d
. (D.14)
log(m)
Moreover, it holds that
N (u)(·) C s ≤ Cps/d , (D.15)
since in this case τ approximates the Fourier basis (SM A.5). From (A.15), one can then deduce
the estimate on the C s -norm of the DeepONet. This proves that (3.7) in Theorem 3.9 holds with
σ(s) = s/d. This concludes the proof.
36
In our proofs, we require tanh counterparts to the results for DeepONets with ReLU activation
function from [43]. We present these adapted results below for completeness.
The first lemma considers the neural network approximation of the map u 7→ Yb (u), as defined in
[43, Eq. (3.59)].
Lemma D.3. Let N, d ∈ N, and denote m := (2N + 1)d . There exists a constant C > 0, in-
dependent of N , such that for every N there exists a tanh neural network Ψ : Rm → Rm , with
We can now state the following result [70, Theorem 3.10] which is the counterpart of [43, Theorem
3.32] for tanh neural networks.
Theorem D.4. Let V be a Banach space and let J be a countable index set. Let F : [−1, 1]J → V
be a (b, ε, κ)-holomorphic map for some b ∈ ℓq (N) and q ∈ (0, 1), and an enumeration κ : N → J .
Then there exists a constant C > 0, such that for every N ∈ N, there exists an index set
n Q o
ΛN ⊂ ν = (ν1 , ν2 , . . . ) ∈ j∈J N0 | νj 6= 0 for finitely many j ∈ J , (D.18)
with |ΛN | = N , a finite set of coefficients {cν }ν∈ΛN ⊂ V , and a tanh network Ψ : RN → RΛN ,
y 7→ {Ψν (y)}ν∈ΛN with
size(Ψ) ≤ C(1 + N log(N )), depth(Ψ) ≤ C(1 + log log(N )), (D.19)
and such that
X
sup F (y) − cν Ψν (yκ(1) , . . . , yκ(N ) ) ≤ CN 1−1/q . (D.20)
y∈[−1,1]J ν∈ΛN
V
Using this theorem, we can state the tanh counterpart to [43, Corollary 3.33].
Corollary D.5. Let V be a Banach space. Let F : [−1, 1]J → V be a (b, ε, κ)-holomorphic map
for some b ∈ ℓq (N) and q ∈ (0, 1), where κ : N → J is an enumeration of J . In particular, it
is assumed that {bj }j∈N is a monotonically decreasing sequence. If P : V → Rp is a continuous
linear mapping, then there exists a constant C > 0, such that for every m ∈ N, there exists a tanh
network Ψ : Rm → Rp , with
size(Ψ) ≤ C(1 + pm log(m)), depth(Ψ) ≤ C(1 + log log(m)), (D.21)
and such that
sup kP ◦ F (y) − Ψ(yκ(1) , . . . , yκ(m) )kℓ2 (Rp ) ≤ CkPk m−s , (D.22)
y∈[−1,1]J
where s := q −1 − 1 > 0 and kPk = kPkV →ℓ2 denotes the operator norm.
Proof. The proof is identical to the one presented in [43, Appendix C.18].
Finally, we use this result to state the counterpart to [43, Theorem 3.34], which considers the ap-
proximation of a parametrized version of the operator G, defined as a mapping
F : [−1, 1]J → L2 (U ) : y 7→ G(u(·; y)). (D.23)
A more detailled discussion can be found in [43, Section 3.6.2].
Theorem D.6. Let F : [−1, 1]J → L2 (U ) be (b, ε, κ)-holomorphic with b ∈ ℓq (N) and κ : N → J
an enumeration, and assume that F is given by (D.23). Assume that the encoder/decoder pair is
constructed as in [43, Section 3.5.3], so that [43, Eq. (3.69)] holds. Given an affine reconstruction
R : Rp → L2 (U ), let P : L2 (U ) → Rp denote the corresponding optimal linear projection [43, Eq.
37
(3.17)]. Then given k ∈ N, there exists a constant Ck > 0, independent of m, p and an approximator
A : Rm → Rp that can be represented by a neural network with
size(A) ≤ Ck (1 + pm log(m)), depth(A) ≤ Ck (1 + log(m)).
and such that the approximation error EbA can be estimated by
EbA ≤ Ck kPk m−k ,
where kPk = kPkL2(U)→Rp is the operator norm of P.
Next, we consider the following nonlinear ODE system, already considered in the context of approx-
imation by DeepONets in [49] and [43],
dv1 = v2 ,
dt (D.24)
dv
2 = −γ sin(v1 ) + u(t).
dt
with initial condition v(0) = 0 and where γ > 0 is a parameter. Let us denote v = (v1 , v2 ) and
v2 0
g(v) := , U (t) := , (D.25)
−γ sin(v1 ) u(t)
so that equation (D.24) can be written in the form
dv
Lu (v) := − g(v) + U = 0, v(0) = 0. (D.26)
dt
In (D.26), v1 , v2 are the angle and angular velocity of the pendulum and the constant γ denotes a
frequency parameter. The dynamics of the pendulum is driven by an external force u. With the
external force u as the input, the output of the system is the solution vector v and the underlying
nonlinear operator is given by G : L2 ([0, T ]) → L2 ([0, T ]) : u 7→ G(u) = v. Following the
discussion in [43], we choose an underlying (parametrized) measure µ ∈ P(L2 ([0, T ])) as a law of
a random field u, that can be expanded in the form
X
2πt
u(t; Y ) = Yk αk ek , t ∈ [0, T ], (D.27)
T
k∈Z
38
Proof. The proof of the statement is identical to that of [43, Theorem 4.10], with the only difference
that we consider tanh neural networks instead of ReLU neural networks. As a result, the proof
comes down to determining the size of the trunk net τ using Lemma D.2 instead of [64, Proposition
2.10], thereby proving the tanh counterpart of [43, Proposition 4.5], and replacing [43, Proposition
4.9] by Theorem D.6. The C s -bound of the DeepONet follows from the C s -bound of Legendre
polynomials (A.12) and Lemma D.2.
We can again follow Theorem 3.9 to obtain error bounds for physics-informed DeepONets. As-
sumption 3.3 is satisfied for [0, T ]. As a result, we can apply Theorem 3.9 to obtain the following
result.
Theorem D.8. Consider the setting of Lemma D.7. For every β > 0, there exists a constant C > 0
such that for any p ∈ N , there exists a DeepONet Gθ with a trunk net τ = (0, τ1 , . . . , τp ) with p
outputs and branch net β = (0, β1 , . . . , βp ), such that
size(τ ) ≤ Cp, depth(τ ) = 2, (D.32)
and
size(β) ≤ C(1 + p2 log(p)), depth(β) ≤ C(1 + log(p)), (D.33)
and such that
dGθ (u)1 dGθ (u)2
− Gθ (u)2 + + γ sin Gθ (u)1 − u(t) ≤ Cp−β . (D.34)
dt L2 (µ) dt L2 (µ)
Proof. Lemma D.7 with s ← 1, k ← r and m ← p then provides a DeepONet that satisfies
the conditions of Theorem 3.9 with r∗ = +∞ and equation (3.7) with σ(s) = d/2 + 2sd. The
smoothness of v is guaranteed by [43, Lemma 4.3]. Moreover, it holds that,
dGθ (u)1 dGθ (u)1 dG(u)1
− Gθ (u)2 ≤ − + G(u)2 − Gθ (u)2 L2 (µ)
, (D.35)
dt L2 (µ) dt dt L2 (µ)
Combining this estimate with Theorem 3.9 with k = 2 then gives the wanted result.
with notation from SM A.5, and where for simplicity a(x) ≡ 1 is assumed to be constant. Further-
more, we will consider the case of smooth coefficients x 7→ a(x; Y ), which is ensured by requiring
39
that there exist constants Cα > 0 and ℓ > 1, such that |αk | ≤ Cα exp −ℓ|k|∞ for all k ∈ Zd . Still
following [43], we define b = (b1 , b2 , . . . ) ∈ ℓ1 (N) by
bj := Cα exp −ℓ|κ(j)|∞ , (D.39)
where κ : N → Zd is the enumeration for the standard Fourier basis, (SM A.5). Note that by
assumption on the enumeration κ, we have that b is a monotonically decreasing sequence. In the
following, we will assume throughout that kbkℓ1 < 1, ensuring a uniform coercivity condition on
all random coefficients a = a( · ; Y ) in (D.37). Finally, we assume that the Yj ∈ [−1, 1] are centered
random variables and we let µ ∈ P(L2 (Td )) denote the law of the random coefficient (D.38).
The following lemma provides an error estimate for DeepONets approximating the operator G that
maps the input coefficient a into the solution field u of the PDE (D.37).
Lemma D.9. For any k, r ∈ N, there exists a constant C > 0, such that for any m, p ∈ N, there
exists a DeepONet Gθ = R ◦ A ◦ E with m sensors, a trunk net τ = (0, τ1 , . . . , τp ) with p outputs
and branch net β = (0, β1 , . . . , βp ), such that
size(β) ≤ C(1 + pm log(m)), depth(β) ≤ C(1 + log(m)), (D.40)
and
d+1
size(τ ) ≤ Cp d depth(τ ) ≤ 2 (D.41)
such that the DeepONet approximation error (D.1) satisfies
1
Eb ≤ Ce−cℓm d + Cm−k + Cp−r , (D.42)
and that for all s ∈ N
Gθ (u)(·) Cs
≤ Cps/d . (D.43)
Proof. This statement is the tanh counterpart of [43, Theorem 4.19], which addresses ReLU Deep-
ONets. We only highlight the differences in the proof. First, one should use Lemma D.1 instead of
[43, Lemma 3.13], which then results in different network sizes in [43, Lemma 3.14, Proposition
3.17, Corollary 3.18, Proposition 4.17]. Second, one needs to replace [43, Proposition 4.18] with
Theorem D.6.
Moreover, in this case the trunk net τ approximates the Fourier basis (SM A.5). From (A.15), one
can then deduce the estimate on the C s -norm of the DeepONet.
It is straightforward to verify that the conditions of Theorem 3.9 are satisfied in the current set-
ting. Applying Theorem 3.9 then results in the following theorem on the error of physics-informed
DeepONets for (D.37).
Theorem D.10. Consider the elliptic equation (D.37) with b ≥ 1. For every β > 0, there exists
a constant C > 0 such that for any p ∈ N , there exists a DeepONet Gθ with a trunk net τ =
(0, τ1 , . . . , τp ) with p outputs and branch net β = (0, β1 , . . . , βp ), such that
size(β) ≤ C(1 + p2 log(p)), depth(β) ≤ C(1 + log(p)), (D.44)
and
size(τ ) ≤ Cp2 depth(τ ) ≤ 2 (D.45)
such that
∇ · (a(x)∇Gθ (a)(x)) − f (x) L2 (µ)
≤ Cp−β . (D.46)
Proof. We first check the conditions of Theorem 3.9. Lemma D.9 with s ← 1, k ← r and m ← p
then provides a DeepONet that satisfies the conditions of Theorem 3.9 with r∗ = +∞ and equation
(3.7) with σ(s) = s/d. Moreover, the following estimate holds,
∇ · (a(x)∇Gθ (a)(x)) − f (x) L2 (µ)
40