Statistical Learning Theory For Neural Operators
niklas.reinhardt@iwr.uni-heidelberg.de, jakob.zech@uni-heidelberg.de
Institut für Mathematik, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany
sven.wang@hu-berlin.de
Abstract
We present statistical convergence results for the learning of (possibly) non-linear mappings in infinite-dimensional spaces. Specifically, given a map G0 : X → Y between two separable Hilbert spaces, we analyze the problem of recovering G0 from n ∈ N noisy input-output pairs (x_i, y_i)_{i=1}^n with y_i = G0(x_i) + ε_i; here the x_i ∈ X represent randomly drawn "design" points, and the ε_i are assumed to be either i.i.d. white noise processes or sub-Gaussian random variables in Y. We provide general convergence results for least-squares-type empirical risk minimizers over compact regression classes G ⊆ L∞(X, Y), in terms of their approximation properties and metric entropy bounds, which are derived using empirical process techniques. This generalizes classical results from finite-dimensional nonparametric regression to an infinite-dimensional setting. As a concrete application, we study an encoder-decoder based neural operator architecture termed FrameNet. Assuming G0 to be holomorphic, we prove algebraic (in the sample size n) convergence rates in this setting, thereby overcoming the curse of dimensionality. To illustrate the wide applicability, as a prototypical example we discuss the learning of the non-linear solution operator to a parametric elliptic partial differential equation.
Contents
1 Introduction
1.1 Outline and Contributions
1.2 Existing Results
1.3 Notation
3.2.2 The FrameNet Class
3.3 Approximation of Holomorphic Operators
3.3.1 Setting
3.3.2 Approximation Theory
3.3.3 Entropy Bounds for FrameNet
3.4 Statistical Theory
4 Applications
4.1 Finite Dimensional Regression
4.2 Parametric Darcy Flow
4.2.1 Setup
4.2.2 Sample Complexity
5 Conclusions
References
Appendices
B Proofs of Section 2
B.1 Proof of Theorem 2.5
B.1.1 Existence and Measurability of Ĝn
B.1.2 Concentration Inequality for Ĝn
B.2 Proof of Theorem 2.6
B.3 Proof of Corollary 2.7
B.4 Proof of Corollary 2.8
B.5 Proof of Theorem 2.10
D Proofs of Section 3
D.1 Proof of Proposition 3.9
D.2 RePU-realization of Tensorized Legendre Polynomials
D.3 Proof of Theorem 3.10
D.4 Proof of Lemma 3.12
D.5 Proof of Lemma 3.14
E Proofs of Section 4
E.1 Proof of Theorem 4.2
1 Introduction
Learning non-linear relationships of high- and infinite-dimensional data is a fundamental problem
in modern statistics and machine learning. In recent years, “Operator Learning” has emerged
as a powerful tool for analyzing and approximating mappings G0 between infinite-dimensional
spaces [59, 44, 10, 61, 86, 5, 79, 66, 51]. The primary motivation for considering truly infinite-
dimensional data stems from applications in the natural sciences, where inputs and outputs of
operators are elements in function spaces. For instance, G0 could be the operator relating an
initial condition x of a dynamical system to the state G0 (x) of the system after a certain time, or
a coefficient-to-solution map of a parametric partial differential equation (PDE).
For finite-dimensional inputs and outputs, nonparametric regression is the standard framework
for inferring general, non-linear relationships. There, one aims to reconstruct some “ground truth”
G0 : Rd → Rm , d, m ∈ N, from noisy data (xi , yi ) ∈ Rd × Rm , i = 1, . . . , n, generated via
yi = G0 (xi ) + εi , where xi are called the “design points” and εi are typically independent and
identically distributed (i.i.d.) noise variables. In the framework of empirical risk minimization
(ERM), one chooses a suitable function class G of mappings from Rd to Rm and some loss function
L : Rm ×Rm → R measuring the discrepancy between predictions G(xi ) and the data yi . Statistical
estimation is achieved by minimizing
\[
\hat{G}_n \in \operatorname*{arg\,min}_{G \in \mathcal{G}} J_n(G), \qquad J_n(G) := \frac{1}{n} \sum_{i=1}^{n} L(G(x_i), y_i), \tag{1.1}
\]
assuming that minimizers exist. In the finite-dimensional setting, statistically optimal conver-
gence rates for such estimators were established for least-squares, maximum likelihood, and more
generally “minimum contrast” estimators, in [36, 8, 11]; see also [89] where such results are shown
for ERMs over neural network classes. However, as is well-known, both approximation rates [28, 29] and statistical convergence rates [36, 38] over classical smoothness classes deteriorate exponentially in terms of the dimension d. This renders computations practically infeasible for large d. This phenomenon is referred to as the curse of dimensionality, see also Section 4.1 ahead.
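To make the finite-dimensional setting (1.1) concrete, the following minimal sketch (our illustration, not taken from the paper; the polynomial regression class and all names are assumptions made for the example) computes a least-squares ERM for d = m = 1.

import numpy as np

rng = np.random.default_rng(0)

# ground truth G0 : R -> R and noisy data y_i = G0(x_i) + eps_i
G0 = lambda x: np.sin(2 * np.pi * x)
n = 200
x = rng.uniform(0.0, 1.0, size=n)          # design points x_i ~ gamma = Unif[0,1]
y = G0(x) + 0.1 * rng.standard_normal(n)   # i.i.d. Gaussian noise

# regression class: polynomials of degree <= p, i.e. G_theta(x) = sum_k theta_k x^k
p = 10
Phi = np.vander(x, p + 1, increasing=True)  # feature matrix

# empirical risk minimizer of the squared loss in (1.1): ordinary least squares
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
G_hat = lambda t: np.vander(np.atleast_1d(t), p + 1, increasing=True) @ theta_hat

# Monte Carlo estimate of the squared L2(gamma) risk of the ERM
t = rng.uniform(0.0, 1.0, size=10_000)
print("estimated squared L2 risk:", np.mean((G_hat(t) - G0(t)) ** 2))

For larger d, retaining the same accuracy with such tensorized classes requires exponentially many basis functions, which is the curse of dimensionality referred to above.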
The framework for operator learning considered in this paper can be viewed as a direct ex-
tension of (1.1) to the infinite-dimensional case. Given Hilbert spaces X and Y, and a mapping
G0 : X → Y, the goal is to reconstruct G0 from “training data” (xi , yi ) ∈ X × Y with
yi = G0 (xi ) + εi ,
where the regression class G is a suitable set of measurable mappings between X and Y and εi
are centered noise variables, see Section 2 for details. This “supervised learning” setting underlies
popular methods such as the PCA-Net [44, 10].
We also mention the framework of “physics-informed learning” which is common in opera-
tor learning (relevant e.g. for the DeepONet [61]), but which is not considered in the present
manuscript. Here, information on the ground truth G0 is not known in the form of input-output
pairs, but instead is implicitly described via
N(G0 (x), y) = 0,
where N : X × Y → Z for a third vector space Z. Typically, N(x, ·) encodes a family of differential
operators parametrized by x which represent the underlying physical model. In this case, the loss to minimize is a residual of the form $\sum_{i=1}^{n} \|N(x_i, G(x_i))\|_Z^2$, thus leading to an "unsupervised learning problem". The two cases, supervised and unsupervised learning, can also be combined. Various
different (neural network based) architectures (i.e. regression classes G ) have been proposed in
recent years for the purpose of supervised or unsupervised operator learning.
smooth source function f : Td → (0, ∞) and let amin > 0. For a sufficiently smooth and uniformly
positive conductivity a : Td → R, denote by G0 (a) the unique solution of the elliptic PDE
\[
-\nabla \cdot (a \nabla u) = f \quad \text{on } \mathbb{T}^d \qquad \text{and} \qquad \int_{\mathbb{T}^d} u(x)\, dx = 0. \tag{1.2}
\]
for some R > 0, s > 2d + 1. Suppose we observe noisy input-output pairs (ai , yi )ni=1 given by
yi = G0 (ai ) + εi , where the εi are independent L2 (Td )-Gaussian white noise processes (Section 2).
The operator G0 can then be learned from this data as stated in the next theorem (Section 4.2); it concerns empirical risk minimizers Ĝn over the so-called FrameNet class G^FN, a neural network based class of measurable mappings from X to Y (Section 3). Formally, Ĝn is defined as a minimizer of the least squares objective
\[
\hat{G}_n \in \operatorname*{arg\,min}_{G \in \mathcal{G}^{\mathrm{FN}}} \frac{1}{n} \sum_{i=1}^{n} \|y_i - G(a_i)\|_{L^2(\mathbb{T}^d)}^2, \tag{1.4}
\]
although a suitable modification is required to make this mathematically rigorous (Section 2.1).
Theorem 1.1 (Informal). Consider the operator G0 from the Darcy problem on the d-dimensional
torus Td (d ≥ 2), and suppose that γ satisfies (1.3) for some s > 3d/2 + 1 and amin > 0. Fix
τ > 0 (arbitrarily small).
Then there exists a constant C such that for each n ∈ N there exists a FrameNet class G^FN(n) such that any empirical risk minimizer Ĝn in (1.4) satisfies¹
\[
\mathbb{E}_{G_0} \int \|\hat{G}_n(a) - G_0(a)\|_{L^2(\mathbb{T}^d)}^2 \, d\gamma(a) \;\le\; C\, n^{-\frac{2s+2-3d}{2s+2-d} + \tau}. \tag{1.5}
\]
The most significant feature of the above statement is that the convergence rate in (1.5) is
algebraic in n, and thus circumvents the curse of dimensionality. The classes G FN , whose existence
is postulated by the theorem, can be precisely characterized in terms of the sparsity, depth, width
and other network class parameters, which are chosen in terms of the statistical sample size n.
We also note that the regularity assumption s > 3d/2 + 1 was made here for convenience and can
be weakened to s > 3d/2, see Theorem 4.2 and Remark 4.3 below.
To achieve Theorem 1.1 and several other related results, we build our theory in multiple steps.
In Section 2, a general regression framework for mappings between Hilbert spaces is considered.
Our first main result, Theorem 2.5, gives a non-asymptotic concentration upper bound on the
empirical risk between Ĝn and G0 , with respect to the design points xi . The upper bound is
quantified in terms of the metric entropy of the “regression class” G and the best approximation
of G0 from G . Theorem 2.6 strengthens this statement to L2 (γ)-loss, for the case of random design
points xi ∼ γ. These results provide an operator learning analogue to classical convergence rates
in nonparametric regression. The proofs rely on probabilistic generic chaining techniques [97, 30]
and "slicing" arguments as introduced in [37], which we generalize to the current setting. We also note that our proofs contrast with existing nonparametric statistical analyses of neural networks [89] for real-valued regression, where generic chaining techniques were not required for obtaining optimal rates (up to log-factors).
In the second part of this work, we apply our statistical results to the specific deep operator network class G^FN termed "FrameNet", introduced in [43]. Together with their underlying encoder-decoder and feedforward neural network structure, FrameNet classes are defined in Section 3. These classes are known to satisfy good approximation properties for holomorphic
¹ Here and in the following, E_{G0} denotes the expectation w.r.t. the random data (x_i, y_i)_i generated by the ground truth G0. Similarly, we write P_{G0} for corresponding probabilities.
operators, a property which is fulfilled for the Darcy problem (1.2) and more broadly a wide
range of PDE based problems [21, 22, 20, 49, 41, 42, 46, 93, 24]. In Section 3, we identify such operator holomorphy as a key regularity property which allows us to derive "dimension-free" statistical convergence rates. By extending approximation theoretic results from [43], as well
as establishing metric entropy bounds for G FN based on [89], we obtain algebraic convergence
results for ERMs over FrameNet classes, for reconstructing holomorphic operators G0. Specifically, Theorem 3.15 bounds the L²-risk by $\mathbb{E}[\|\hat G_n - G_0\|_{L^2(\gamma)}^2] \lesssim n^{-\kappa/(\kappa+1)}$, where κ > 0 denotes the approximation rate established in [43]. We treat the case of ReLU and RePU [57] activation functions for both sparse and fully-connected architectures.
In Section 4, we illustrate the usefulness of our general theory in two concrete settings. First,
we show how our theory recovers well-known minimax-optimal convergence rates for real-valued
regression (i.e., Y = R) on d-dimensional domains. This proves that our abstract results from
Section 2 cannot be improved in general, although matching lower bounds are yet unknown in the
infinite-dimensional setting. Thereafter, Section 4.2 demonstrates how our theory can be used to
yield the first algebraic convergence rates for a non-linear operator arising from PDEs – see in
particular Theorem 4.2 and Remark 4.3, which underlie Theorem 1.1.
Approximation Theory for Neural Operators First theoretical results on operator learning
focused on the approximation error, establishing the existence of neural network architectures
capable of approximating G0 up to a certain accuracy, with the error decreasing algebraically in the number of learnable network parameters. For example, [92, 54, 91] showed that neural networks have sufficient expressivity to efficiently approximate certain (holomorphic) mappings G0. Such results are based on the observation that the smoothness of G0 implies that the image of this operator has moderate n-widths, i.e., that it is well approximated in moderate-dimensional linear subspaces. See for example [32, 21, 22, 20, 47, 7]. Specifically for DeepONets [61], such a result
was obtained in [56].
Statistical Theory for Neural Operators The analysis of sample complexity has received
less attention so far. In [55], the authors analysed in particular the error of PCA encoders and
decoders used for PCA-Net, but did not analyse the statistical error for the full operator.
The paper [48] provides such a result for the estimation of linear and diagonalizable mappings
from noisy data; for lower bounds see, e.g., [13]. For other work on “functional regression”, see, e.g.,
[40, 64]. An analysis for nonparametric regression of nonlinear mappings from noisy data in infinite
dimensions was provided in [60]. There, the authors considered Lipschitz continuous mappings G0 ,
and proved consistency in the large data limit. Additionally, they give convergence results, which however in general suffer from the curse of dimensionality. This is due to their very general assumption on the smoothness of G0: Mhaskar and Hahm [62] showed very early that the nonlinear n-width
of Lipschitz operators in L2 decays only logarithmically, i.e. the number of (exact) data points
needed for the reconstruction of the functional is exponential in the desired accuracy. Recently,
[52] generalized these results and showed a generic curse of dimensionality for the reconstruction of
Lipschitz operators and C^k-operators from exact data. Moreover, the authors show that if some intrinsic low-dimensionality allows for fast approximation, the data complexity improves as well. Concerning the case of noisy and holomorphic operators, we
also refer to the recent works [3, 2], which consider a setup similar to ours but, contrary to us, also
treat the more general case of Banach space valued functions. The authors derive upper bounds
and concentration inequalities for the L2 -error, and lower bounds for the approximation error, in
terms of a neural network based architecture. Key differences to our work are that [3, 2] consider network architectures that are linear in the trainable parameters, that in the noisy case their analysis does in general not imply convergence in the large-data limit n → ∞, and that they do not provide convergence rates for concrete PDE models.
1.3 Notation
We write N = {1, 2 . . . } and N0 = {0, 1, 2, . . . }. We write an . bn , an & bn for real sequences
(an )n∈N , (bn )n∈N if an is respectively upper or lower bounded by a positive multiplicative constant
which does not depend on n (but may well depend on other ambient parameters which we make
explicit whenever confusion may arise). By an ≃ bn , we mean that both an . bn and an & bn .
For a pseudometric space (T, d) and any δ > 0, let N (T, d, δ) be the δ-covering number of T ,
i.e. the minimal number of open δ-balls in d needed to cover T . We denote the metric entropy of
T by
H(T, d, δ) = log N (T, d, δ). (1.6)
Given a Borel probability measure γ on X and a subset D ⊆ X, we define the norms
\[
\|G\|_{L^2(X,\gamma;Y)}^2 := \int_X \|G(x)\|_Y^2 \, d\gamma(x), \qquad \|G\|_{\infty,D} := \sup_{x \in D} \|G(x)\|_Y,
\]
and also write k · kL2 (γ) and k · k∞ if the underlying spaces are clear from context. The space of
real-valued, square summable sequences indexed over N is denoted by ℓ2 (N). The complexification
of a real Hilbert space H is denoted by HC , see [50, 65].
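As a quick sanity check on the definition (1.6), the following small script (ours; the interval and the approximate count of balls are illustrative assumptions, accurate up to boundary effects) prints how the metric entropy of T = [0, R] grows as δ decreases.

import numpy as np

# delta-covering of T = [0, R]: centers spaced 2*delta apart suffice, so
# N(T, |.|, delta) is of order R / (2*delta) and H = log N grows like log(1/delta).
def approx_covering_number(R, delta):
    # up to boundary effects, about ceil(R / (2*delta)) balls of radius delta cover [0, R]
    return int(np.ceil(R / (2.0 * delta)))

R = 1.0
for delta in [1e-1, 1e-2, 1e-3]:
    N = approx_covering_number(R, delta)
    print(f"delta={delta:.0e}  N={N}  H=log N={np.log(N):.2f}")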
\[
x_i \overset{\mathrm{iid}}{\sim} \gamma, \qquad i = 1, \dots, n, \tag{2.1a}
\]
and
\[
y_i = G_0(x_i) + \sigma \varepsilon_i, \qquad i = 1, \dots, n, \tag{2.1b}
\]
where σ > 0 denotes a scalar “noise level”, εi are independent random noise variables and γ is a
probability distribution on X. The xi ∈ X are also referred to as the “design points”, and we write
x = (x_1, ..., x_n) ∈ X^n. We will derive both results which are conditional on the design x, as well as results for random design. To avoid confusion, we will use the notations $\mathbb{P}^{x}_{G_0}$, $\mathbb{E}^{x}_{G_0}$ to denote probabilities and expectations under the distribution (2.1) with fixed design x, and we use $\mathbb{P}_{G_0}$, $\mathbb{E}_{G_0}$ to denote probabilities and expectations with random design x_i ∼ γ.
Remark 2.1. In practice, we will often deal with scenarios in which G0 is only defined on some
measurable subset V ⊂ X, see e.g. the solution operator in the Darcy flow example in Section 4.2.1.
In this case, our results can be applied to any measurable extension of G0 on X.
White Noise Model In this article we consider two assumptions on the noise, the first being
that the (εi )ni=1 in (2.1) are independent copies of a Y-white noise process. Recall that for any
given separable Hilbert space Y, the Y-Gaussian white noise process is defined as the mean-zero
Gaussian process W_Y = (W_Y(y) : y ∈ Y) indexed by Y with "iso-normal" covariance structure $\mathbb{E}[W_Y(y) W_Y(y')] = \langle y, y' \rangle_Y$ for $y, y' \in Y$.
It is well-known that WY does not take values in Y unless dim(Y) < ∞, but is interpreted as a
stochastic process indexed by Y, see [38, p.19] for details. Nevertheless, we slightly abuse notation
and use the common notation hWY , yiY := WY (y).
Under this assumption, conditionally on x_i we interpret each observation y_i in (2.1) as a realisation of a Gaussian process (y_i(f) : f ∈ Y) with
\[
y_i(f) = \langle G_0(x_i), f \rangle_Y + \sigma \varepsilon_i(f), \qquad f \in Y,
\]
and we shall again use the notation ⟨y_i, f⟩_Y to denote y_i(f) (see also [99, 39, 67] where this common
viewpoint is explained in detail).
Example 2.2. Let O ⊆ Rd be a bounded, smooth domain. Then, for Y = L2 (O), one can show
that draws of an L2 (O)-white noise process a.s. take values in negative Sobolev spaces H −κ for
κ > d/2, see, e.g., [69, 12].
Sub-Gaussian Noise Model The second setting we consider is that of sub-Gaussian noise. We
say that a random vector X taking values in Y is sub-Gaussian with parameter η > 0 if E[X] = 0
and
\[
\mathbb{P}\big( \|X\|_Y \ge t \big) \le 2 \exp\Big( -\frac{t^2}{2\eta^2} \Big) \qquad \text{for all } t \ge 0.
\]
In the sub-Gaussian noise model, we assume that (εi )ni=1 in (2.1) are independent sub-Gaussian
variables in Y with parameter η = 1.
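For intuition, the sketch below (ours; the truncation level J, the toy operator G0 and the product design measure are illustrative assumptions) generates synthetic training data (x_i, y_i) as in (2.1) after truncating X = Y = ℓ²(N) to its first J coordinates; the Gaussian noise used here falls under the sub-Gaussian model.

import numpy as np

rng = np.random.default_rng(1)

J = 64          # truncation level: elements of X = Y are represented by their first J coefficients
n = 500         # sample size
sigma = 0.05    # noise level in (2.1b)

# a toy non-linear operator acting coefficient-wise (stands in for the unknown ground truth G0)
decay = 1.0 / np.arange(1, J + 1) ** 2
def G0(x):
    return np.tanh(x) * decay          # maps R^J -> R^J, smooth and bounded

# design points x_i drawn i.i.d. from a product measure gamma on a weighted cube (so they lie in l2)
X = rng.uniform(-1.0, 1.0, size=(n, J)) * decay

# Gaussian noise in the truncated space; Gaussian vectors are sub-Gaussian, matching the second noise model
eps = rng.standard_normal((n, J))
Y = np.array([G0(x) for x in X]) + sigma * eps   # observations y_i = G0(x_i) + sigma * eps_i

print(X.shape, Y.shape)                          # (500, 64) (500, 64)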
However, this functional takes finite values almost surely only in the sub-Gaussian noise model. In the white noise model, since y_i ∉ Y, it holds that Ĩ_n(G) = ∞ almost surely. We thus consider a modified definition of least-squares type estimators which is common in the literature on regression with white noise [67, 39]. Instead of (2.1.1), we consider
\[
I_n(G) = \frac{1}{n} \sum_{i=1}^{n} \Big( -2 \langle G(x_i), y_i \rangle_Y + \|G(x_i)\|_Y^2 \Big), \qquad I_n : \mathcal{G} \to \mathbb{R}, \tag{2.2}
\]
which takes finite values a.s. also in the white noise model. Note that the latter objective function can be obtained from (2.1.1) by formally subtracting the term $n^{-1} \sum_{i=1}^{n} \|y_i\|_Y^2$, which exhibits no dependency on G. Therefore, in the sub-Gaussian noise model the minimization of Ĩ_n and of I_n are equivalent, which is why we consider (2.2) in the following. We will denote minimizers of I_n(G) by Ĝ_n.
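Assuming Ĩ_n denotes the plain least-squares functional $n^{-1}\sum_{i=1}^n \|y_i - G(x_i)\|_Y^2$, as the preceding discussion suggests, the relation between the two objectives follows from expanding the square:
\[
\tilde{I}_n(G) = \frac{1}{n}\sum_{i=1}^n \|y_i - G(x_i)\|_Y^2
= \frac{1}{n}\sum_{i=1}^n \|y_i\|_Y^2 + \frac{1}{n}\sum_{i=1}^n \Big( -2\langle G(x_i), y_i\rangle_Y + \|G(x_i)\|_Y^2 \Big)
= \frac{1}{n}\sum_{i=1}^n \|y_i\|_Y^2 + I_n(G),
\]
so the two functionals differ by a term that does not depend on G.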
Our assumptions on the class G in the ensuing theorems will ensure that a measurable choice
of minimizers Ĝn of In exists, see Theorem 2.5 (i). However, the ERM Ĝn will in general not
be unique, since we do not impose convexity on G . The reason is that our main application, the
NN-based FrameNet class G FN , is non-convex.
Remark 2.3 (Connection to maximum likelihood). In the white noise model, it follows from the Cameron-Martin theorem (see, e.g., Theorem 2.6.13 in [39]) that $n I_n(G)/(2\sigma^2)$ constitutes, up to an additive constant, the negative log-likelihood of the (dominated) statistical model arising from (2.1) with white noise. In this case Ĝ_n can also be interpreted as a (nonparametric) maximum likelihood estimator over the class G.
Remark 2.4. Consider nonparametric regression of an unknown function f : O → R for some bounded, smooth domain O. Here, it is well-known that the white noise observation model, where data is given by Y = f + σW (with W an L²(O)-white noise process), is asymptotically equivalent in a Le Cam sense to an observation model with m "equally spaced" (random or deterministic) observation points throughout O,
\[
Y_i = f(z_i) + \eta_i, \qquad i = 1, \dots, m,
\]
with i.i.d. N(0,1) errors, where the equivalence holds for $\sigma \asymp 1/\sqrt{m}$, see [88]. Therefore, our observation model (2.1) may be viewed as a simplified proxy.
The following result provides a general convergence theorem for empirical risk minimizers with
high probability, which relates the empirical risk of ERMs over some operator class G to the
metric entropy of G . It can be viewed as a generalisation of classical convergence results for sieved
M-estimators [100] to Hilbert space valued functions. The proof can be found in Appendix B.1.
Theorem 2.5. For some measurable G0 : X → Y, let the data (xi , yi )ni=1 arise from (2.1) either
with white noise or with sub-Gaussian noise. Let G be a class of measurable maps from X → Y,
let G∗ ∈ G , and let x = (x1 , ..., xn ) ∈ Xn be such that the following holds.
(a) There exists a constant C > 0 s.t. G is compact with respect to some norm k · k satisfying
k · kn ≤ Ck · k.
(b) There exists Ψ_n : (0, ∞) → [0, ∞) s.t. Ψ_n(δ) ≥ J(G*_n(δ), ‖·‖_n) for all δ > 0 and
\[
\delta \mapsto \frac{\Psi_n(\delta)}{\delta^2} \quad \text{is non-increasing for } \delta \in (0, \infty).
\]
(ii) Fix any measurable selection Ĝ_n from part (i). Then there exists a universal constant C_Ch > 0 (see Lemma A.4) such that for any G* ∈ G as above and any positive sequence (δ_n)_{n∈N} satisfying
\[
\sqrt{n}\, \delta_n^2 \ge 32\, C_{\mathrm{Ch}}\, \sigma\, \Psi_n(\delta_n), \tag{2.5}
\]
The lower bound for R in (ii), which determines the convergence rate of kĜn −G0 kn , is typically
optimized by balancing the “stochastic term” δn and the empirical approximation error kG∗ −G0 kn ,
see, e.g., Theorem 3.15 below. Note that Theorem 2.5 gives a convergence rate with respect to
the empirical norm k · kn . Therefore, when the design points x are random, the norm itself is also
random, and the assumptions (a) and (b) in Theorem 2.5 have to be understood conditional on
possible realizations of x . Theorem 2.6 below will give a corresponding concentration inequality
on the k · kL2 (γ) -error under the assumption of i.i.d. random design xi ∼ γ. In this setting, a
sufficient condition for (b) to be satisfied almost surely is to take Ψn as an upper bound for the
entropy integral with uniform entropy H(G, ‖·‖_{∞,supp(γ)}, δ).
The compactness of G in part (a) is needed for the existence of the ERM Ĝn in (2.2). The
existence result for Ĝn (see e.g. [70, Proposition 5]) requires compact metric spaces, which is why
we assume compactness w.r.t. a norm ‖·‖ stronger than the empirical seminorm ‖·‖_n.
In particular, compactness implies separability of G with respect to k · kn , which in turn is
needed to guarantee the measurability of certain suprema of empirical processes ranging over G .
See the proof of Theorem 2.5 below, in particular (B.4) and (B.6). The technical growth restriction
in (b) on the function Ψ_n(δ) (i.e., our upper bound for the entropy integral) is required for the "peeling device" in (B.4). In particular, this assumption is satisfied in case that H(G, ‖·‖_∞, δ) ≲ δ^{-α} for some 0 < α < 2, see Corollary 2.7 below for details.
Theorem 2.5 provides a concentration inequality for the empirical norm kĜn − G0 kn . Under
the assumption of randomly chosen design points xi ∼ γ, xi ∈ X, this statement can be extended
to a convergence result for the L2 (γ)-norm. To this end, we also need slightly stronger technical
assumptions on the class G with respect to the k · k∞,supp(γ) -norm.
Assumption 1. For some probability measure γ on X, assume x = (x1 , . . . , xn ) arises from
i.i.d. draws xi ∼ γ. Let G be a class of measurable maps X → Y, G∗ ∈ G and kG0 k∞,supp(γ) < ∞.
Suppose
(a) G is compact with respect to k · k∞,supp(γ) ,
(b) There exists a (deterministic) upper bound Ψ_n : (0, ∞) → [0, ∞) such that for a.e. x ∼ γ^n it holds Ψ_n(δ) ≥ J(G*_n(δ), ‖·‖_n) for all δ > 0 and
\[
\delta \mapsto \frac{\Psi_n(\delta)}{\delta^2} \quad \text{is non-increasing for } \delta \in (0, \infty). \tag{2.7}
\]
Note that Assumption 1 is strictly stronger than assumptions (a) and (b) in Theorem 2.5,
since k · k∞,supp(γ) is stronger than k · kn and Ψn in (2.7) does not depend on x . In particular,
Assumption 1 implies the existence of a measurable ERM Ĝn by Theorem 2.5 (i).
The uniform ‖·‖_{∞,supp(γ)} assumptions are used to control the concentration of the empirical norm ‖·‖_n around the "population norm" ‖·‖_{L²(γ)}, see Lemma A.5 and also Lemma A.6
below. Let us write F = {G − G0 : G ∈ G}. As an immediate consequence of Assumption 1, there exists some F_∞ < ∞ such that
\[
\sup_{F \in \mathcal{F}} \|F\|_{\infty,\operatorname{supp}(\gamma)} \le F_\infty. \tag{2.8}
\]
We can now state our main concentration inequality for the convergence of kĜn − G0 kL2 (γ) , which
in particular provides a bound for the mean squared error EG0 [kĜn − G0 k2L2 (γ) ] as well. The proof
of Theorem 2.6 can be found in Appendix B.2.
Theorem 2.6 (L2 (γ)-Concentration under Random Design). Consider the nonparametric regres-
sion model (2.1) either with white noise or with sub-Gaussian noise and any measurable empirical
risk minimizer Ĝn from (2.2). Suppose that G0 , G , G∗ , γ and Ψn (·) are such that Assumption
1 holds. Then there exists some universal constant C > 0 such that for any positive sequences
(δn )n∈N and (δ̃n )n∈N with
\[
\sqrt{n}\, \delta_n^2 \ge C \sigma \Psi_n(\delta_n) \qquad \text{and} \qquad n \tilde{\delta}_n^2 \ge C F_\infty^2\, H\big( \mathcal{G}, \|\cdot\|_{\infty,\operatorname{supp}(\gamma)}, \tilde{\delta}_n \big), \tag{2.9}
\]
we have
\[
\mathbb{P}_{G_0}\Big( \|\hat{G}_n - G_0\|_{L^2(\gamma)} \ge R \Big) \le 2 \exp\Big( - \frac{n R^2}{C^2 (\sigma^2 + F_\infty^2)} \Big). \tag{2.11}
\]
To keep the presentation simple, we have left the numerical constants in the preceding theorem
implicit. However, they can be made explicit, see the proof for details. The following bound on the
mean squared error is obtained upon integration of the concentration inequalities from Theorem
2.5 and Lemma A.5. Note that directly integrating the L2 (γ)-concentration, cf. (2.11), gives
an approximation term in the uniform norm k · k∞,supp(γ) (following (2.10)), which has weaker
convergence properties in general, see Theorem 3.10. For the proof of Corollary 2.7, see Appendix
B.3.
Corollary 2.7 (L2 (γ)-Mean Squared Error). Consider the setting of Theorem 2.6 and assume
in addition that Assumption 1 is fulfilled for all G∗ ∈ G (with the same Ψn ). Then, for some
universal constant C > 0 and all n ∈ N,
\[
\mathbb{E}_{G_0}\Big[ \|\hat{G}_n - G_0\|_{L^2(\gamma)}^2 \Big] \le C \Big( \delta_n^2 + \tilde{\delta}_n^2 + \frac{\sigma^2 + F_\infty^2}{n} \Big) + 8 \inf_{G^* \in \mathcal{G}} \|G^* - G_0\|_{L^2(\gamma)}^2. \tag{2.12}
\]
To demonstrate the typical use-cases of our abstract results, we summarize in the following
corollary the rates which can be obtained under algebraic approximation properties of G and two
concrete scalings of the metric entropy of G . These correspond to the typical entropy bounds
satisfied by (i) some fixed n-independent, infinite-dimensional regression class, and (ii) an N -
dimensional approximation class, where N is chosen in terms of n. In the following corollary, ≲ refers to an inequality involving a constant independent of N and δ. For a proof of Corollary 2.8, see Appendix B.4.
Corollary 2.8. Consider the setting of Corollary 2.7. Let G = G(N), N ∈ N, be a sequence of regression classes² such that $\inf_{G^* \in \mathcal{G}(N)} \|G^* - G_0\|_{L^2(\gamma)}^2 \lesssim N^{-\beta}$ for some β > 0 and all N ∈ N. Denote the entropy by H(N, δ) := H(G(N), ‖·‖_{∞,supp(γ)}, δ).
(i) If H(N, δ) ≲ δ^{-α} for some 0 < α < 2, then
\[
\mathbb{E}_{G_0}\Big[ \|\hat{G}_n - G_0\|_{L^2(\gamma)}^2 \Big] \lesssim n^{-\frac{2}{2+\alpha}}.
\]
Remark 2.11. Consider α ≥ 2 in Corollary 2.8 (i). Then J(δ) in (2.4) is not necessarily finite, and hence Assumption 1 (b) need not be satisfied. Therefore Theorem 2.6 cannot be applied. In the sub-Gaussian noise case, we may still use the L²(γ)-bound in (2.15) however. Similarly as in the proof of Corollary 2.8, it can then be shown that $\delta_n^2 \lesssim n^{-\frac{1}{2+\alpha}}$ and $\tilde{\delta}_n^2 \lesssim n^{-\frac{2}{2+\alpha}}$ satisfy (2.13). This yields
\[
\mathbb{E}_{G_0}\Big[ \|\hat{G}_n - G_0\|_{L^2(\gamma)}^2 \Big] \lesssim n^{-\frac{1}{2+\alpha}},
\]
i.e. half the convergence rate of the "chaining regime" considered in Corollary 2.8.
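To indicate where the exponent 2/(2 + α) in Corollary 2.8 (i) comes from, here is a heuristic calculation (ours; it suppresses constants and assumes Ψ_n may be taken as an entropy integral, in line with the discussion after Theorem 2.5): if H(G, ‖·‖_∞, δ) ≲ δ^{-α} with 0 < α < 2, then
\[
\Psi_n(\delta) \asymp \int_0^\delta \sqrt{H(\mathcal{G}, \|\cdot\|_\infty, \varepsilon)}\, d\varepsilon \lesssim \int_0^\delta \varepsilon^{-\alpha/2}\, d\varepsilon \asymp \delta^{1-\alpha/2},
\]
and the first condition in (2.9) becomes $\sqrt{n}\,\delta_n^2 \gtrsim \delta_n^{1-\alpha/2}$, i.e. $\delta_n^{1+\alpha/2} \gtrsim n^{-1/2}$, which is satisfied by $\delta_n^2 \asymp n^{-2/(2+\alpha)}$. Similarly, $n\tilde{\delta}_n^2 \gtrsim \tilde{\delta}_n^{-\alpha}$ allows $\tilde{\delta}_n^2 \asymp n^{-2/(2+\alpha)}$, matching the rate in Corollary 2.8 (i).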
3.1.1 Frames
Definition 3.1. A family Ψ = {ψ_j : j ∈ N} ⊂ X is called a frame of X, if the analysis operator
\[
F : X \to \ell^2(\mathbb{N}), \qquad v \mapsto \big( \langle v, \psi_j \rangle_X \big)_{j \in \mathbb{N}}
\]
and finally the frame operator as T := F ′ F : X → X. The following lemma gives a characterization
of T . For a proof, see [17, Lemma 5.1.5].
Lemma 3.2. The frame operator T is boundedly invertible, self-adjoint and positive. Furthermore, it holds that ‖T‖_{X→X} = Λ_Ψ² and ‖T^{-1}‖_{X→X} = λ_Ψ^{-2}. The family Ψ̃ := T^{-1}Ψ is a frame of X, called the (canonical) dual frame of X. The analysis operator of the dual frame is F̃ := F(F′F)^{-1} and its frame bounds are λ_Ψ^{-1} and Λ_Ψ^{-1}.
The corresponding analysis operators are F_X, F̃_X, F_Y and F̃_Y. We then introduce encoder and decoder maps via
\[
E_X := \tilde{F}_X =
\begin{cases} X \to \ell^2(\mathbb{N}), \\ x \mapsto \big( \langle x, \tilde{\psi}_j \rangle_X \big)_{j \in \mathbb{N}}, \end{cases}
\qquad
D_Y := F_Y' =
\begin{cases} \ell^2(\mathbb{N}) \to Y, \\ (y_j)_{j \in \mathbb{N}} \mapsto \sum_{j \in \mathbb{N}} y_j \eta_j. \end{cases} \tag{3.3}
\]
In case Ψ^X and Ψ^Y are Riesz bases, the mappings in (3.3) are boundedly invertible.
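The following toy computation (ours; the "Mercedes-Benz" frame of R² and all variable names are assumptions made for illustration) mirrors Definition 3.1, Lemma 3.2 and the maps (3.3) in a single finite-dimensional space: analysis operator, frame operator, canonical dual frame, and the reconstruction D(E(x)) = x.

import numpy as np

# a redundant frame of X = R^2: three unit vectors at 120 degree angles
angles = np.array([np.pi / 2, np.pi / 2 + 2 * np.pi / 3, np.pi / 2 + 4 * np.pi / 3])
Psi = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2), rows are psi_j

F = Psi                             # analysis operator F x = (<x, psi_j>)_j, here a 3x2 matrix
T = F.T @ F                         # frame operator T = F' F : R^2 -> R^2
Psi_dual = Psi @ np.linalg.inv(T)   # canonical dual frame psi~_j = T^{-1} psi_j (rows)

# encoder/decoder in the spirit of (3.3): encode with the dual frame, synthesize with the primal frame
def encode(x):                      # x -> (<x, psi~_j>)_j
    return Psi_dual @ x
def decode(c):                      # (c_j)_j -> sum_j c_j psi_j
    return Psi.T @ c

x = np.array([0.3, -1.2])
print(np.allclose(decode(encode(x)), x))    # True: perfect reconstruction
print(np.linalg.eigvalsh(T))                # here T is a multiple of the identity (tight frame)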
where
\[
\|x\|_{X_r}^2 := \sum_{j \in \mathbb{N}} \langle x, \tilde{\psi}_j \rangle_X^2\, \theta_j^{-2r} \qquad \text{and} \qquad \|y\|_{Y_t}^2 := \sum_{j \in \mathbb{N}} \langle y, \tilde{\eta}_j \rangle_Y^2\, \theta_j^{-2t}.
\]
3.2 FrameNet
In this subsection we recall the FrameNet architecture from [43, Section 2]. We start by formally
introducing feedforward neural networks (NNs) following [75, 43].
3.2.1 Feedforward Neural Networks
Definition 3.5. A function f : R^{p_0} → R^{p_{L+1}} is called a neural network, if there exist σ : R → R, integers p_1, . . . , p_{L+1} ∈ N, L ∈ N and real numbers w_{i,j}^l, b_j^l ∈ R such that for all x = (x_i)_{i=1}^{p_0} ∈ R^{p_0}
\begin{align}
z_j^1 &= \sigma\Big( \sum_{i=1}^{p_0} w_{i,j}^1 x_i + b_j^1 \Big), \qquad j = 1, \dots, p_1, \tag{3.4a}\\
z_j^{l+1} &= \sigma\Big( \sum_{i=1}^{p_l} w_{i,j}^{l+1} z_i^l + b_j^{l+1} \Big), \qquad l = 1, \dots, L-1,\ j = 1, \dots, p_{l+1},\\
f(x) &= \big( z_j^{L+1} \big)_{j=1}^{p_{L+1}} = \Big( \sum_{i=1}^{p_L} w_{i,j}^{L+1} z_i^L + b_j^{L+1} \Big)_{j=1}^{p_{L+1}}. \tag{3.4b}
\end{align}
We call σ the activation function, L the depth, p := max_{l=0,...,L+1} p_l the width, w_{i,j}^l ∈ R the weights, and b_j^l ∈ R the biases of the NN.
While different NNs can realize the same function, for simplicity we refer to a function f : R^{p_0} → R^{p_{L+1}} as an NN of type (3.4), if it allows for (at least) one such representation. Additionally, our analysis will require some further terminology: For a NN f : R^{p_0} → R^{p_{L+1}} as in (3.4), its
• size is the number of nonzero parameters
\[
\operatorname{size}(f) := |\{(i,j,l) : w_{i,j}^l \neq 0\}| + |\{(j,l) : b_j^l \neq 0\}|,
\]
• maximum of parameters is
\[
\operatorname{mpar}(f) := \max\Big\{ \max_{i,j,l} |w_{i,j}^l|,\ \max_{j,l} |b_j^l| \Big\},
\]
Given q ∈ N, set σ_q(x) := max{0, x}^q. For q = 1, σ_1 is the rectified linear unit (ReLU); for q ≥ 2, σ_q is called rectified power unit (RePU).
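A minimal sketch (ours; the random parameters and layer widths are purely illustrative) of the forward pass (3.4) together with the quantities size and mpar, using σ_q:

import numpy as np

def repu(x, q=2):
    """RePU activation sigma_q(x) = max{0, x}^q; q = 1 gives ReLU."""
    return np.maximum(0.0, x) ** q

def forward(x, weights, biases, q=2):
    """Evaluate the network (3.4): hidden layers apply sigma_q, the output layer is affine."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = repu(W @ z + b, q)
    W, b = weights[-1], biases[-1]
    return W @ z + b

def size(weights, biases):
    """Number of nonzero parameters, as in Definition 3.5."""
    return sum(int(np.count_nonzero(W)) for W in weights) + sum(int(np.count_nonzero(b)) for b in biases)

def mpar(weights, biases):
    """Maximum absolute value over all weights and biases."""
    return max(max(np.abs(W).max() for W in weights), max(np.abs(b).max() for b in biases))

rng = np.random.default_rng(0)
p = [3, 8, 8, 2]                                   # widths p_0, ..., p_{L+1} with depth L = 2
weights = [rng.uniform(-1, 1, size=(p[l + 1], p[l])) for l in range(len(p) - 1)]
biases = [rng.uniform(-1, 1, size=p[l + 1]) for l in range(len(p) - 1)]

x = rng.standard_normal(p[0])
print(forward(x, weights, biases, q=2), size(weights, biases), mpar(weights, biases))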
Remark 3.6. Definition 3.5 introduces a NN as a function f : R^{p_0} → R^{p_{L+1}}. Throughout, we also understand the realization of a NN as a map f : ℓ²(N) → ℓ²(N) via extension by zeros. This is equivalent to suitably padding the weight matrices (w_{i,j}^l)_{i,j} and bias vectors (b_j^l)_j in Definition 3.5 for l ∈ {0, L + 1} with infinitely many zeros, see [43, Remark 13].
G = DY ◦ g ◦ EX .
Given σ : R → R, positive integers L, p, s ∈ N, and reals M, B ∈ R, let
\[
g^{\mathrm{FN}}(\sigma, L, p, s, M, B) := \big\{ g : \mathbb{R}^{p_0} \to \mathbb{R}^{p_{L+1}} \text{ is a NN with activation function } \sigma \text{ s.t. } \operatorname{depth}(g) \le L,\ \operatorname{width}(g) \le p,\ \operatorname{size}(g) \le s,\ \operatorname{mpar}(g) \le M,\ \operatorname{mran}_{[-1,1]^{p_0}}(g) \le B,\ p_0, p_{L+1} \le p \big\}.
\]
Instead of directly considering (3.1), for our analysis, contrary to [43], with θ from Definition 3.4, fixed R > 0, and U := [−1, 1]^N, it will be convenient to introduce the linear scaling
\[
S_r :
\begin{cases}
\times_{j \in \mathbb{N}} [-R\theta_j^r, R\theta_j^r] \to U, \\
(x_j)_{j \in \mathbb{N}} \mapsto \Big( \frac{x_j}{R\theta_j^r} \Big)_{j \in \mathbb{N}}.
\end{cases} \tag{3.5}
\]
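Schematically, a FrameNet-type surrogate composes these ingredients as G = D_Y ∘ g ∘ E_X, with the scaling S_r applied to the encoded coefficients; the sketch below is our simplified reading (finite truncation, orthonormal bases in place of general frames, an untrained toy network g), not the construction of [43] itself.

import numpy as np

rng = np.random.default_rng(0)
J = 16                                   # truncation level for encoder/decoder
theta = 1.0 / np.arange(1, J + 1) ** 2   # toy coefficient sequence (theta_j)_j
R, r = 1.0, 1.0

def encode(x_coeffs):
    """E_X: keep the first J (dual-)frame coefficients; here the frame is an orthonormal basis."""
    return x_coeffs[:J]

def scale(c):
    """S_r as in (3.5): map coefficients from [-R theta_j^r, R theta_j^r] into U = [-1, 1]^J."""
    return c / (R * theta ** r)

def decode(c):
    """D_Y: synthesize sum_j c_j eta_j; with an orthonormal basis this is again the coefficient vector."""
    return c

# a toy fully-connected ReLU network g : R^J -> R^J with random (untrained) parameters
W1, b1 = rng.uniform(-1, 1, (32, J)), rng.uniform(-1, 1, 32)
W2, b2 = rng.uniform(-1, 1, (J, 32)), rng.uniform(-1, 1, J)
def g(u):
    return W2 @ np.maximum(0.0, W1 @ u + b1) + b2

def G(x_coeffs):
    """FrameNet-type composition G = D_Y o g o S_r o E_X (our reading of the construction)."""
    return decode(g(scale(encode(x_coeffs))))

x = rng.uniform(-1, 1, J) * R * theta ** r   # an input whose coefficients lie in the admissible cube
print(G(x).shape)                            # (16,)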
(ii) the metric entropy of G, cf. the terms δ_n² and δ̃_n² in (2.12).
In this subsection we give results for both the approximation quality and the metric entropy of
the FrameNet class G FN .
3.3.1 Setting
Let us start by making the assumptions on G0 and the sampling measure γ on X more precise.
Denote in the following U := [−1, 1]^N, let r > 1/2, R > 0, and let the frames (ψ_j)_{j∈N}, (η_j)_{j∈N} be as
in Section 3, and (θj )j∈N as in Definition 3.4. Then
\[
\sigma_R^r :
\begin{cases}
U \to X, \\
y \mapsto R \sum_{j \in \mathbb{N}} \theta_j^r y_j \psi_j
\end{cases} \tag{3.7}
\]
yields a well-defined map, since by construction (yj θjr )j∈N ∈ ℓ2 (N) for every y ∈ U , and (ψj )j∈N
is a frame. Next, we introduce the “cubes”
\[
C_R^r(X) = \Big\{ a \in X : \sup_{j \in \mathbb{N}} \theta_j^{-r} |\langle a, \tilde{\psi}_j \rangle_X| \le R \Big\}. \tag{3.8}
\]
Observe that with σ_R^r(U) := {σ_R^r(y) : y ∈ U}, clearly
\[
C_R^r(X) \subseteq \sigma_R^r(U),
\]
but equality holds in general only if the (ψ_j)_{j∈N} form a Riesz basis [43, Remark 10].
Example 3.7. Let X = R, r = 1, R = 1, θ1 = 3/2 and θ2 = 1/2. Consider the frame Ψ = {1, 1}
(which is not a basis) with dual analogue Ψ̃ = {1/2, 1/2}. Then with U = [−1, 1]2
The holomorphy assumption on G0 can now be formulated as follows, [43, Assumption 1].
Recall that for a real Hilbert space X, we denote by XC its complexification.
Assumption 2. For some r > 1, R > 0, t > 0, C_{G_0} < ∞ there exists an open set O_C ⊂ X_C containing σ_R^r(U), such that
(a) $\sup_{a \in O_{\mathbb{C}}} \|G_0(a)\|_{(Y_t)_{\mathbb{C}}} \le C_{G_0}$, and G_0 : O_C → Y_C is holomorphic,
(b) γ is a probability measure on X with supp(γ) ⊆ C_R^r(X).
The assumption requires G_0 to be holomorphic on a superset of σ_R^r(U). The probability measure γ is allowed to have support on C_R^r(X), which is potentially smaller than σ_R^r(U). In particular, G_0 is then holomorphic on a superset of the support of γ.
Example 3.8. Denote by λ the Lebesgue measure. Then π := ⊗_{j∈N} (λ/2) is the uniform probability measure on U = [−1, 1]^N, and
\[
\gamma := (\sigma_R^r)_{\sharp} \pi \tag{3.9}
\]
defines a probability measure on X with support σ_R^r(U). If (ψ_j)_{j∈N} is a Riesz basis, then supp(γ) = C_R^r(X), so that supp(γ) ⊆ C_R^r(X) as required in Assumption 2.
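In practice, a draw a ∼ γ can be approximated by truncating the series in (3.7); the following sketch (ours; the choice θ_j = 1/j and the truncation level are illustrative assumptions) returns the first J frame coefficients R θ_j^r y_j of such a draw.

import numpy as np

rng = np.random.default_rng(2)
J, R, r = 200, 1.0, 1.5
theta = 1.0 / np.arange(1, J + 1)        # toy choice theta_j = 1/j (any positive, decaying sequence)

def sample_gamma(n_samples):
    """Draw y ~ pi = Unif([-1,1]^N) truncated to J coordinates and push forward through sigma_R^r.
    The output rows are the frame coefficients R * theta_j^r * y_j of a = sigma_R^r(y) in (3.7)."""
    y = rng.uniform(-1.0, 1.0, size=(n_samples, J))
    return R * theta ** r * y

A = sample_gamma(5)
print(A.shape)                            # (5, 200); each row represents one (truncated) draw a ~ gamma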
Next, let F be the set of infinite-dimensional multiindices with finite support, i.e.
\[
\mathcal{F} := \big\{ \nu = (\nu_j)_{j \in \mathbb{N}} \in \mathbb{N}_0^{\mathbb{N}} : |\operatorname{supp} \nu| < \infty \big\}, \tag{3.11}
\]
where supp ν := {j : νj 6= 0}. For finite sets of multiindices Λ ⊆ F, their effective dimension and
maximal order is defined as
Furthermore, there exists a constant C > 0 independent of Λ, d(Λ), m(Λ) and δ such that
\begin{align*}
\operatorname{depth}(f_{\Lambda,\delta}) &\le C \Big[ \log(d(\Lambda))\, d(\Lambda)\, \log(m(\Lambda))\, m(\Lambda) + m(\Lambda)^2 + \log(\delta^{-1}) \log(d(\Lambda)) + m(\Lambda) \Big], \\
\operatorname{width}(f_{\Lambda,\delta}) &\le C\, |\Lambda|\, d(\Lambda), \\
\operatorname{size}(f_{\Lambda,\delta}) &\le C \Big[ |\Lambda|\, d(\Lambda)^2 \log m(\Lambda) + m(\Lambda)^3 d(\Lambda)^2 + \log(\delta^{-1})\, m(\Lambda)^2 + |\Lambda|\, d(\Lambda) \Big], \\
\operatorname{mpar}(f_{\Lambda,\delta}) &\le 1.
\end{align*}
A similar result holds for RePU neural networks, see Proposition D.1. Given q ∈ N, and
N ∈ N, we introduce the two FrameNet classes
\begin{align}
\mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_q, N) &:= \mathcal{G}_{\mathrm{FN}}(\sigma_q, \mathrm{width}_N, \mathrm{depth}_N, \mathrm{size}_N, M, B), \tag{3.14a}\\
\mathcal{G}^{\mathrm{full}}_{\mathrm{FN}}(\sigma_q, N) &:= \mathcal{G}_{\mathrm{FN}}(\sigma_q, \mathrm{width}_N, \mathrm{depth}_N, \infty, M, B), \nonumber
\end{align}
where
\[
\mathrm{depth}_N = \max\{1, \lceil C_L \log(N) \rceil\}, \qquad \mathrm{width}_N = \lceil C_p N \rceil, \qquad \mathrm{size}_N = \lceil C_s N \rceil. \tag{3.14b}
\]
Thus G^full_FN(σ_q, N) can essentially be quadratically larger than G^sp_FN(σ_q, N).
The next theorem extends [43, Theorems 1 and 2] to the case of bounded network parameters.
The proof is provided in Appendix D.3.
Theorem 3.10 (Sparse network approximation). Let G0 , γ satisfy Assumption 2 with r > 1,
t > 0. Let q ≥ 1 be an integer and fix τ > 0 (arbitrarily small).
(i) There exists C > 0 s.t. for all N ∈ N
\[
\inf_{G \in \mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_q, N)} \big\| G - G_0 \big\|_{\infty,\operatorname{supp}(\gamma)}^2 \le C\, N^{-2\min\{r-1,\,t\} + \tau}.
\]
(ii) Let Ψ^X be a Riesz basis and let γ be as in (3.9). Then there exists C > 0 s.t. for all N ∈ N
\[
\inf_{G \in \mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_q, N)} \big\| G - G_0 \big\|_{L^2(\gamma)}^2 \le C\, N^{-2\min\{r-\frac{1}{2},\,t\} + \tau}.
\]
Remark 3.11. Theorem 3.10 provides an approximation rate for G^sp_FN(σ_q, N) in terms of N. Since G^full_FN(σ_q, N) ⊇ G^sp_FN(σ_q, N), the statement trivially remains true for G^full_FN(σ_q, N). However, as the size of G^full_FN(σ_q, N) can be quadratically larger by (3.15), the convergence rate in terms of network size is essentially halved for the fully connected architecture.
In particular, there exist constants C_H^SP, C_H^FC > 0 such that for the sparse and fully-connected FrameNet classes from (3.14) it holds
\[
H\big( \mathcal{G}^{\mathrm{sp}}_{\mathrm{FN}}(\sigma_1, N), \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta \big) \le C_H^{\mathrm{SP}}\, N \big( 1 + \log(N)^2 + \log(\max\{1, \delta^{-1}\}) \big),
\]
\[
H\big( \mathcal{G}^{\mathrm{full}}_{\mathrm{FN}}(\sigma_1, N), \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta \big) \le C_H^{\mathrm{FC}}\, N^2 \big( 1 + \log(N)^3 + \log(\max\{1, \delta^{-1}\}) \big).
\]
Remark 3.13. The entropy bound is independent of the constant B bounding the maximum range
of the network, see Subsection 3.2. However, B < ∞ will be necessary to apply Theorem 2.6.
For RePU activation, as mentioned before, the metric entropy bounds exhibit a worse depen-
dency on the network parameters, due to the lack of global Lipschitz continuity of σq if q ≥ 2.
The proof of Lemma 3.14 is given in Appendix D.5.
Lemma 3.14. Let q ∈ N, q ≥ 2, let L, p, s, M, B ≥ 1, let σ_R^r be as in (3.7) and U = [−1, 1]^N. Then G_FN = G_FN(σ_q, L, p, s, M, B) is compact with respect to ‖·‖_{∞,σ_R^r(U)} and ‖·‖_n. Furthermore, G_FN satisfies for all δ > 0
\[
H\big( \mathcal{G}_{\mathrm{FN}}, \|\cdot\|_{\infty,\sigma_R^r(U)}, \delta \big) \le (s+1) \log\Big( \big( \Lambda_{\Psi^Y}\, L\, q^{L+q}\, (2pM)^{4q}\, \max\{1, \delta^{-1}\} \big)^{2L+2} \Big). \tag{3.17}
\]
Consider the constant CL from (3.14). Then there exists CH , CH > 0 such that
G sp
H(G SP 1+2CL log(q)
FN (σq , N ), k · k∞,σR (U) , δ) ≤ CH N
r 1 + log(N )2 + log max 1, δ −1
G full
H(G FN (σq , N ), k · k∞,σR
r (U) , δ) ≤ C
FC 2+2CL log(q)
H N 1 + log(N )2 + log max 1, δ −1
For every n ∈ N, let (xi , yi )ni=1 be data generated by (2.1) either in the white noise model or in the
sub-Gaussian noise model. Then there exist constants CL , Cp , Cs , M , B ≥ 1 in (3.14) and C > 0
(all independent of n) such that
(i) Sparse FrameNet: with $N = N(n) = \lceil n^{\frac{1}{\kappa+1}} \rceil$ and G = G^sp_FN(σ_1, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies
\[
\mathbb{E}_{G_0}\big[ \|\hat{G}_n - G_0\|_{L^2(\gamma)}^2 \big] \le C\, n^{-\frac{\kappa}{\kappa+1} + \tau}, \tag{3.18}
\]
(ii) Fully connected FrameNet: with $N = N(n) = \lceil n^{\frac{1}{\kappa+2}} \rceil$ and G = G^full_FN(σ_1, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies
\[
\mathbb{E}_{G_0}\big[ \|\hat{G}_n - G_0\|_{L^2(\gamma)}^2 \big] \le C\, n^{-\frac{\kappa}{\kappa+2} + \tau}. \tag{3.19}
\]
One can further use Theorem 2.6 to show concentration inequalities for the risk kĜn −G0 kL2 (γ) .
Proof. The proof follows directly from the entropy bounds in Lemma 3.12 and Corollary 2.8
(ii): First, Lemma 3.12 in particular verifies Assumption 1 for all G∗ ∈ G , which is required for
Corollary 2.8. Applying the corollary with β = κ then gives (3.18), and with β = κ/2 we obtain
(3.19). Note that crucially F∞ (see (2.8)) does not depend on N , because kG0 k∞,supp γ < ∞ and
G FN is universally bounded by the N -independent constant B, see (3.6).
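The choice N(n) = ⌈n^{1/(κ+1)}⌉ in Theorem 3.15 (i) reflects the usual bias-variance balance: ignoring logarithmic factors and the arbitrarily small τ, the entropy contribution behaves roughly like N/n while the approximation error behaves like N^{-κ}, so that
\[
N^{-\kappa} \asymp \frac{N}{n} \quad \Longleftrightarrow \quad N \asymp n^{\frac{1}{\kappa+1}}, \qquad \text{giving the rate } N^{-\kappa} \asymp n^{-\frac{\kappa}{\kappa+1}}.
\]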
Theorem 3.16 (ERM for white noise and RePU). Consider the setting of Theorem 3.15 and let
q ∈ N, q ≥ 2. There exist constants CL , Cp , Cs , M , B ≥ 1 in (3.14) and C > 0 (all independent
of N ) such that
(i) Sparse FrameNet: with $N = N(n) = \lceil n^{\frac{1}{\kappa+1+4C_L \log(q)}} \rceil$ and G = G^sp_FN(σ_q, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies
\[
\mathbb{E}_{G_0}\big[ \|\hat{G}_n - G_0\|_{L^2(\gamma)}^2 \big] \le C\, n^{-\frac{\kappa}{\kappa+1+4C_L \log(q)} + \tau}. \tag{3.20}
\]
(ii) Fully connected FrameNet: with $N = N(n) = \lceil n^{\frac{1}{\kappa+2+4C_L \log(q)}} \rceil$ and G = G^full_FN(σ_q, N), there exists a measurable choice of an ERM Ĝ_n of (2.2). Any such Ĝ_n satisfies
\[
\mathbb{E}_{G_0}\big[ \|\hat{G}_n - G_0\|_{L^2(\gamma)}^2 \big] \le C\, n^{-\frac{\kappa}{\kappa+2+4C_L \log(q)} + \tau}. \tag{3.21}
\]
Proof. Similar to the proof of Theorem 3.15, the proof of Theorem 3.16 follows from the entropy bounds in Lemma 3.14 and Corollary 2.8 (ii) with β = κ/(1 + 2C_L log(q)) for (3.20) and β = κ/(2 + 2C_L log(q)) for (3.21).
A few remarks are in order. In the RePU case the activation function σ_q(x) = max{0, x}^q is not globally Lipschitz for q ≥ 2. As a result, the entropy bounds obtained for the corresponding FrameNet class are larger than for ReLU. This leads to worse convergence rates. Moreover, for the RePU case, the convergence rate depends on the constant C_L in (3.14b). The proof of Theorem 3.10 shows that C_L depends on the decay properties of the Legendre coefficients $(c_{\nu_i,j})_{i,j \in \mathbb{N}}$ of the function G_0 ∘ σ_R^r, i.e. C_L depends on G_0 (see (D.23) and (D.24) in Theorem D.4). Explicit bounds on C_L are possible, see [104, Lemma 1.4.15].
4 Applications
We now present two applications of our results. First, in finite dimensional regression, our analysis
recovers well-known minimax-optimal rates for standard smoothness classes. This indicates that
our main statistical result in Theorem 2.6 is in general optimal. However, we do not claim opti-
mality specifically for the approximation of holomorphic operators discussed in Section 3. Second,
as an infinite-dimensional problem, we address the learning of the solution operator to a parameter-dependent PDE.
For s ≥ 0, it is well-known that the minimax-optimal rate of recovering a ground-truth function G_0 in s-smooth function classes, such as the Sobolev space H^s(D) or the Besov space $B^s_{\infty,\infty}(D)$ (see, e.g., [33] for definitions), equals $n^{-\frac{2s}{2s+d}}$, e.g., [38, 100].
Denote now by $\mathcal{G}^s_R$ the ball of radius R > 0 around the origin in either H^s(D) or $B^s_{\infty,\infty}(D)$. Then,
\[
H\big( \mathcal{G}^s_R, \|\cdot\|_{\infty,\operatorname{supp}(\gamma)}, \delta \big) \simeq \Big( \frac{R}{\delta} \Big)^{d/s} \qquad \forall \delta \in (0,1),
\]
which holds for all s > 0 if $\mathcal{G}^s_R$ is the ball in $B^s_{\infty,\infty}(D)$, and for all s > d/2 in case $\mathcal{G}^s_R$ is the ball in H^s(D), see Theorem 4.10.3 in [98]. Corollary 2.8 (i) (with α = d/s < 2) then directly yields the following theorem. It recovers the minimax optimal rate for nonparametric least squares/maximum likelihood estimators.
Theorem 4.1. Let R > 0 and s > d/2. Then, there exists C > 0 such that for all G_0 ∈ $\mathcal{G}^s_R$, the estimator Ĝ_n in (4.2) with G = $\mathcal{G}^s_R$ and data as in (4.1) satisfies
\[
\mathbb{E}_{G_0}\big[ \|\hat{G}_n - G_0\|_{L^2(D)}^2 \big] \le C\, n^{-\frac{2}{2+\alpha}} = C\, n^{-\frac{2s}{2s+d}} \qquad \forall n \in \mathbb{N}.
\]
4.2.1 Setup
We recall the setup from [43, Sections 7.1.1, 7.1.2].
Let d ∈ N, and denote by Td ≃ [0, 1]d the d-dimensional torus. In the following, all function
spaces on Td are understood to be one-periodic in each variable. Fix ā ∈ L∞ (Td ) and f ∈
H −1 (Td )/R such that for some constant amin > 0
Then for r ≥ 0, $\{\max\{1, |j|\}^{-r} \xi_j : j \in \mathbb{N}_0^d\}$ forms an ONB of $H^r(\mathbb{T}^d)$ equipped with inner product
\[
\langle u, v \rangle_{H^r(\mathbb{T}^d)} := \sum_{j \in \mathbb{N}_0^d} \langle u, \xi_j \rangle_{L^2} \langle v, \xi_j \rangle_{L^2} \max\{1, |j|\}^{2r},
\]
so that $\Psi^X := (\psi_j)_{j \in \mathbb{N}_0^d}$, $\Psi^Y := (\eta_j)_{j \in \mathbb{N}_0^d}$ form ONBs of X, Y respectively. The encoder E_X and decoder D_Y are now as in (3.3). Direct calculation shows $X_r = H^{r_0 + rd}$ and $Y_t = H^{t_0 + td}$ for r, t ≥ 0; for more details see [43, Section 7.1.2].
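As a small numerical companion (ours; the signed-frequency indexing and the toy coefficients are simplifying assumptions, whereas the paper works with a real basis indexed by N_0^d), the following computes the weighted inner product ⟨u, v⟩_{H^r} from Fourier coefficients in dimension d = 1.

import numpy as np

def hr_inner(u_hat, v_hat, r):
    """<u, v>_{H^r} = sum_j u_hat_j * conj(v_hat_j) * max(1, |j|)^{2r},
    for coefficients indexed by frequencies j = -J, ..., J (d = 1)."""
    J = (len(u_hat) - 1) // 2
    freqs = np.arange(-J, J + 1)
    w = np.maximum(1, np.abs(freqs)) ** (2 * r)
    return np.sum(u_hat * np.conj(v_hat) * w).real

rng = np.random.default_rng(3)
J = 32
u_hat = rng.standard_normal(2 * J + 1) / np.maximum(1, np.abs(np.arange(-J, J + 1))) ** 2
print(hr_inner(u_hat, u_hat, r=1.0))      # squared H^1-norm of a smooth toy function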
4.2.2 Sample Complexity
We now analyze the sample complexity for learning the PDE solution operator G0 in Section 4.2.1.
For a proof of Theorem 4.2, see Appendix E.1.
Theorem 4.2. Let d ∈ N, d ≥ 2, s > 3d/2 and t_0 ∈ [0, 1]. Fix τ_1 > 0, τ_2 ∈ (0, min{s − 3d/2, τ_1 d/8}) (both arbitrarily small), and set
\[
r_0 =
\begin{cases}
\frac{d}{2} + \tau_2 & \text{if } s \in \big( \frac{3d}{2},\, 2d + 1 - t_0 \big], \\[2pt]
\frac{s + t_0 - 1}{2} & \text{if } s > 2d + 1 - t_0.
\end{cases}
\]
in particular, (4.8) holds for any γ with supp(γ) ⊆ BR (H s (Td )). This shows Theorem 1.1.
Similar rates can also be obtained for this PDE model on a convex, polygonal domain D ⊂ T2
with Dirichlet boundary conditions. The argument uses the Riesz basis constructed in [26], but
is otherwise similar to the torus, for details see [43, Section 7.2]. Moreover, using the RePU
activation function, (4.6) holds with convergence rate κ/(κ + 1 + 4CL log(q)), where κ is from (4.7)
and CL from (3.14b). The proof is similar to Theorem 4.2 using Theorem 3.16 instead of Theorem
3.15. Finally, rates for the fully-connected class G full
FN (σq , N ) can be established using Theorems
3.15 (ii) and 3.16 (ii).
5 Conclusions
In this work, we established convergence theorems for empirical risk minimization to learn mappings G0 between infinite-dimensional Hilbert spaces. Our setting assumes data given in the form of n input-output pairs, with an additive noise model. We discuss both the case of Gaussian white
noise and sub-Gaussian noise. Our main statistical result, Theorem 2.6, bounds the mean-squared
L²-error E[‖Ĝn − G0‖²_{L²(γ)}] in terms of the approximation error and algebraic rates in n depending only on the metric entropy of G. This provides a general framework to study operator learning
from the perspective of sample complexity.
In the second part of this work, we applied our statistical results to a specific operator learning
architecture from [43], termed FrameNet. As our main application, we showed that holomor-
phic operators G0 can be learned with NN-based surrogates without suffering from the curse of dimensionality, cf. Theorem 3.15. Such results have wide applicability, as the required holomorphy
assumption is well-established in the literature, and has been verified for a variety of models in-
cluding for example general elliptic PDEs [21, 22, 20, 41], Maxwell’s equations [49], the Calderon
projector [42], the Helmholtz equation [46, 93] and also nonlinear PDEs such as the Navier-Stokes
equations [24].
References
[1] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions with formulas,
graphs, and mathematical tables. Vol. 55. US Government printing office, 1948.
[2] Ben Adcock, Nick Dexter, and Sebastian Moraga. Optimal deep learning of holomorphic
operators between Banach spaces. 2024. url: http://arxiv.org/pdf/2406.13928.
[3] Ben Adcock et al. “Near-optimal learning of Banach-valued, high-dimensional functions via
deep neural networks”. In: Neural Networks 181 (2025), p. 106761. issn: 0893-6080. doi:
https://doi.org/10.1016/j.neunet.2024.106761. url: https://www.sciencedirect.com/science/article/pii/S0893608024006853.
[4] Sergios Agapiou and Sven Wang. “Laplace priors and spatial inhomogeneity in Bayesian
inverse problems”. In: arXiv:2112.05679 (2021).
[5] Anima Anandkumar et al. “Neural Operator: Graph Kernel Network for Partial Differential
Equations”. In: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential
Equations. 2019. url: https://openreview.net/forum?id=fg2ZFmXFO3.
[6] Ivo Babuška, Fabio Nobile, and Raúl Tempone. “A stochastic collocation method for elliptic
partial differential equations with random input data”. In: SIAM Rev. 52.2 (2010), pp. 317–
355. issn: 0036-1445,1095-7200. doi: 10.1137/100786356. url: https://doi.org/10.1137/100786356.
[7] Markus Bachmayr et al. “Sparse polynomial approximation of parametric elliptic PDEs.
Part II: Lognormal coefficients”. In: ESAIM Math. Model. Numer. Anal. 51.1 (2017),
pp. 341–363. issn: 2822-7840,2804-7214. doi: 10.1051/m2an/2016051. url: https://doi.org/10.1051/m2an/2016051.
[8] Andrew Barron, Lucien Birgé, and Pascal Massart. “Risk bounds for model selection via
penalization”. In: Probability theory and related fields 113 (1999), pp. 301–413.
[9] Sebastian Becker et al. “Learning the random variables in Monte Carlo simulations with
stochastic gradient descent: Machine learning for parametric PDEs and financial derivative
pricing”. In: Mathematical Finance 34.1 (2024), pp. 90–150.
[10] Kaushik Bhattacharya et al. “Model reduction and neural networks for parametric PDEs”.
In: SMAI J. Comput. Math. 7 (2021), pp. 121–157.
[11] Lucien Birgé and Pascal Massart. “Rates of convergence for minimum contrast estimators”.
In: Probability Theory and Related Fields 97 (1993), pp. 113–150.
[12] Ismaël Castillo and Richard Nickl. “Nonparametric Bernstein–von Mises theorems in Gaus-
sian white noise”. In: The Annals of Statistics 41.4 (2013). issn: 0090-5364. doi: 10.1214/13-AOS1133.
[13] Gaëlle Chagny, Anouar Meynaoui, and Angelina Roche. “Adaptive nonparametric estima-
tion in the functional linear model with functional output”. In: (2022).
[14] Abdellah Chkifa, Albert Cohen, and Christoph Schwab. “Breaking the curse of dimension-
ality in sparse polynomial approximation of parametric PDEs”. In: J. Math. Pures Appl.
(9) 103.2 (2015), pp. 400–428. issn: 0021-7824,1776-3371. doi: 10.1016/j.matpur.2014.04.009. url: https://doi.org/10.1016/j.matpur.2014.04.009.
[15] Abdellah Chkifa et al. “Discrete least squares polynomial approximation with random eval-
uations—application to parametric and stochastic elliptic PDEs”. In: ESAIM Math. Model.
Numer. Anal. 49.3 (2015), pp. 815–837. issn: 0764-583X. doi: 10.1051/m2an/2014050.
url: https://doi.org/10.1051/m2an/2014050.
[16] Abdellah Chkifa et al. “Sparse adaptive Taylor approximation algorithms for parametric
and stochastic elliptic PDEs”. In: ESAIM Math. Model. Numer. Anal. 47.1 (2013), pp. 253–
280. issn: 0764-583X. doi: 10.1051/m2an/2012027. url: https://doi.org/10.1051/m2an/2012027.
[17] Ole Christensen. An Introduction to Frames and Riesz Bases. Second
edition 2016. Applied and Numerical Harmonic Analysis. Cham, 2016. isbn: 978-3-319-
25613-9.
[18] Ludovica Cicci, Stefania Fresca, and Andrea Manzoni. “Deep-HyROMnet: A deep learning-
based operator approximation for hyper-reduction of nonlinear parametrized PDEs”. In:
Journal of Scientific Computing 93.2 (2022), p. 57.
[19] K. A. Cliffe et al. “Multilevel Monte Carlo methods and applications to elliptic PDEs with
random coefficients”. In: Comput. Vis. Sci. 14.1 (2011), pp. 3–15. issn: 1432-9360,1433-
0369. doi: 10.1007/s00791-011-0160-x. url: https://doi.org/10.1007/s00791-011-0160-x.
[20] Albert Cohen and Ronald DeVore. “Approximation of high-dimensional parametric PDEs”.
In: Acta Numer. 24 (2015), pp. 1–159. issn: 0962-4929. doi: 10.1017/S0962492915000033.
url: https://doi.org/10.1017/S0962492915000033.
[21] Albert Cohen, Ronald DeVore, and Christoph Schwab. “Convergence rates of best N -term
Galerkin approximations for a class of elliptic sPDEs”. In: Found. Comput. Math. 10.6
(2010), pp. 615–646. issn: 1615-3375. doi: 10.1007/s10208-010-9072-2. url: http://dx.doi.org/10.1007/s10208-010-9072-2.
[22] Albert Cohen, Ronald Devore, and Christoph Schwab. “Analytic regularity and polynomial
approximation of parametric and stochastic elliptic PDE’s”. In: Anal. Appl. (Singap.) 9.1
(2011), pp. 11–47. issn: 0219-5305. doi: 10.1142/S0219530511001728. url: http://dx.doi.org/10.1142/S0219530511001728.
[23] Albert Cohen, Giovanni Migliorati, and Fabio Nobile. “Discrete least-squares approxima-
tions over optimized downward closed polynomial spaces in arbitrary dimension”. In: Con-
str. Approx. 45.3 (2017), pp. 497–519. issn: 0176-4276. doi: 10.1007/s00365-017-9364-8.
url: https://doi.org/10.1007/s00365-017-9364-8.
[24] Albert Cohen, Christoph Schwab, and Jakob Zech. “Shape Holomorphy of the stationary
Navier-Stokes Equations”. In: SIAM J. Math. Analysis 50.2 (2018), pp. 1720–1752. doi:
https://doi.org/10.1137/16M1099406.
[25] Niccolò Dal Santo, Simone Deparis, and Luca Pegolotti. “Data driven approximation of
parametrized PDEs by reduced basis and neural networks”. In: Journal of Computational
Physics 416 (2020), p. 109550.
[26] Oleg Davydov and Rob Stevenson. “Hierarchical Riesz Bases for H s (Ω), 1 < s < 5/2”. In:
Constructive Approximation 22.3 (2005), pp. 365–394. issn: 1432-0940. doi: 10.1007/s00365-004-0593-2.
[27] Tim De Ryck, Samuel Lanthaler, and Siddhartha Mishra. “On the approximation of func-
tions by tanh neural networks”. In: Neural Networks 143 (2021), pp. 732–750.
[28] R. DeVore, R. Howard, and C. Micchelli. “Optimal nonlinear approximation.” In: Manus-
cripta mathematica 63.4 (1989), pp. 469–478. url: http://eudml.org/doc/155392.
[29] R.A. DeVore and G.G. Lorentz. Constructive Approximation. Grundlehren der mathema-
tischen Wissenschaften. Springer Berlin Heidelberg, 1993. isbn: 9783540506270. url: https://books.google.de/books?id=cDqNW6k7_ZwC.
[30] Sjoerd Dirksen. “Tail bounds via generic chaining”. In: Electronic Journal of Probability
20 (2015). issn: 1083-6489. doi: 10.1214/EJP.v20-3760.
[31] Alireza Doostan and Houman Owhadi. “A non-adapted sparse approximation of PDEs with
stochastic inputs”. In: Journal of Computational Physics 230.8 (2011), pp. 3015–3034. issn:
0021-9991. doi: https://doi.org/10.1016/j.jcp.2011.01.002. url: https://www.sciencedirect.com/science/article/pii/S0021999111000106.
[32] Dinh Dung et al. Analyticity and sparsity in uncertainty quantification for PDEs with
Gaussian random field inputs. Vol. 2334. Lecture Notes in Mathematics. Springer, Cham,
2023, pp. xv+205. isbn: 978-3-031-38383-0. doi: 10.1007/978-3-031-38384-7. url:
https://doi.org/10.1007/978-3-031-38384-7.
[33] D. E. Edmunds and H. Triebel. Function spaces, entropy numbers, differential operators.
Vol. 120. Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, 1996,
pp. xii+252. isbn: 0-521-56036-5. doi: 10.1017/CBO9780511662201.
[34] Dennis Elbrachter et al. “Deep Neural Network Approximation Theory”. In: IEEE Trans-
actions on Information Theory 67.5 (2021), pp. 2581–2623. issn: 0018-9448. doi: 10.1109
/TIT.2021.3062161.
[35] Utku Evci et al. The Difficulty of Training Sparse Neural Networks. 2019. url: http://arxiv.org/pdf/1906.10732.
[36] Sara van de Geer. Empirical Processes in M-Estimation. Cambridge U. Press, 2000.
[37] Sara van de Geer. “Least squares estimation with complexity penalties”. In: Mathematical
Methods of statistics 10 (2001), pp. 355–374.
[38] Evarist Giné and Richard Nickl. Mathematical foundations of infinite-dimensional statistical
models. Cambridge series in statistical and probabilistic mathematics. New York (NY):
Cambridge University Press, 2016. isbn: 1107043166.
[39] Evarist Giné and Richard Nickl. Mathematical foundations of infinite-dimensional statis-
tical models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge
University Press, New York, 2016, pp. xiv+690.
[40] Sonja Greven and Fabian Scheipl. “A general framework for functional regression mod-
elling”. In: Statistical Modelling 17.1-2 (2017), pp. 1–35.
[41] Helmut Harbrecht, Michael Peters, and Markus Siebenmorgen. “Analysis of the domain
mapping method for elliptic diffusion problems on random domains”. In: Numerische Math-
ematik 134.4 (2016), pp. 823–856.
[42] Fernando Henrı́quez and Christoph Schwab. “Shape holomorphy of the Calderón projector
for the Laplacian in R 2”. In: Integral Equations and Operator Theory 93.4 (2021), p. 43.
[43] Lukas Herrmann, Christoph Schwab, and Jakob Zech. “Neural and spectral operator sur-
rogates: unified construction and expression rate bounds”. In: Advances in Computational
Mathematics 50.4 (2024), pp. 1–43. issn: 1019-7168. doi: 10.1007/s10444-024-10171-2.
url: https://link.springer.com/article/10.1007/s10444-024-10171-2.
[44] J.S. Hesthaven and S. Ubbiali. “Non-intrusive reduced order modeling of nonlinear problems
using neural networks”. In: Journal of Computational Physics 363 (2018), pp. 55–78. issn:
0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.02.037. url: https://www.sciencedirect.com/science/article/pii/S0021999118301190.
[45] Jan S. Hesthaven, Gianluigi Rozza, and Benjamin Stamm. Certified reduced basis meth-
ods for parametrized partial differential equations. SpringerBriefs in Mathematics. BCAM
SpringerBriefs. Springer, Cham; BCAM Basque Center for Applied Mathematics, Bilbao,
2016, pp. xiii+131. isbn: 978-3-319-22469-5. doi: 10.1007/978-3-319-22470-1. url:
https://doi.org/10.1007/978-3-319-22470-1.
[46] R. Hiptmair et al. “Large deformation shape uncertainty quantification in acoustic scatter-
ing”. In: Advances in Computational Mathematics 44.5 (Oct. 2018), pp. 1475–1518. issn:
1572-9044. doi: 10.1007/s10444-018-9594-8. url: https://doi.org/10.1007/s10444-018-9594-8.
[47] Viet Ha Hoang and Christoph Schwab. “N-term Wiener chaos approximation rates for ellip-
tic PDEs with lognormal Gaussian random inputs”. In: Mathematical Models and Methods
in Applied Sciences 24.04 (2014), pp. 797–826. doi: 10.1142/S0218202513500681. eprint:
https://doi.org/10.1142/S0218202513500681. url: https://doi.org/10.1142/S0218202513500681.
[48] Maarten V. de Hoop et al. “Convergence Rates for Learning Linear Operators from Noisy
Data”. In: SIAM/ASA Journal on Uncertainty Quantification 11.2 (2023), pp. 480–513.
doi: 10.1137/21M1442942. eprint: https://doi.org/10.1137/21M1442942. url: https://doi.org/10.1137/21M1442942.
[49] Carlos Jerez-Hanckes, Christoph Schwab, and Jakob Zech. “Electromagnetic wave scatter-
ing by random surfaces: shape holomorphy”. In: Math. Models Methods Appl. Sci. 27.12
(2017), pp. 2229–2259. issn: 0218-2025. doi: 10.1142/S0218202517500439. url: https://doi.org/10.1142/S0218202517500439.
[50] Padraig Kirwan. “Complexifications of multilinear and polynomial mappings”. Ph.D. thesis,
National University of Ireland, Galway. PhD thesis. 1997.
[51] Nikola B Kovachki, Samuel Lanthaler, and Andrew M Stuart. “Operator learning: Algo-
rithms and analysis”. In: arXiv preprint arXiv:2402.15715 (2024).
[52] Nikola B. Kovachki, Samuel Lanthaler, and Hrushikesh Mhaskar. Data Complexity Esti-
mates for Operator Learning. 2024. url: http://arxiv.org/pdf/2405.15992.
[53] Fabian Kröpfl, Roland Maier, and Daniel Peterseim. “Operator compression with deep
neural networks”. In: Advances in Continuous and Discrete Models 2022.1 (2022), p. 29.
[54] Gitta Kutyniok et al. “A theoretical analysis of deep neural networks and parametric
PDEs”. In: Constr. Approx. 55.1 (2022), pp. 73–125. issn: 0176-4276,1432-0940. doi: 1
0.1007/s00365-021-09551-4. url: https://doi.org/10.1007/s00365-021-09551-4.
[55] Samuel Lanthaler. “Operator learning with PCA-Net: upper and lower complexity bounds”.
In: Journal of Machine Learning Research 24.318 (2023), pp. 1–67. url: http://jmlr.or
g/papers/v24/23-0478.html.
[56] Samuel Lanthaler, Siddhartha Mishra, and George E. Karniadakis. “Error estimates for
DeepONets: a deep learning framework in infinite dimensions”. In: Transactions of Math-
ematics and Its Applications 6.1 (2022). doi: 10.1093/imatrm/tnac001.
[57] Bo Li. “Better Approximations of High Dimensional Smooth Functions by Deep Neural
Networks with Rectified Power Units”. In: Communications in Computational Physics 27.2
(2020), pp. 379–411. issn: 1815-2406. doi: 10.4208/cicp.OA-2019-0168.
[58] Bo Li, Shanshan Tang, and Haijun Yu. “PowerNet: Efficient representations of polynomi-
als and smooth functions by deep neural networks with rectified power units”. In: arXiv
preprint arXiv:1909.05136 (2019).
[59] Zongyi Li et al. Fourier Neural Operator for Parametric Partial Differential Equations. cite
arxiv:2010.08895. 2020. url: http://arxiv.org/abs/2010.08895.
[60] Hao Liu et al. Deep Nonparametric Estimation of Operators between Infinite Dimensional
Spaces. Jan. 1, 2022. url: http://arxiv.org/pdf/2201.00217v1.
[61] Lu Lu et al. “Learning nonlinear operators via DeepONet based on the universal approx-
imation theorem of operators”. In: Nature Machine Intelligence 3.3 (Mar. 2021), pp. 218–
229. issn: 2522-5839. doi: 10.1038/s42256-021-00302-5. url: https://doi.org/10.10
38/s42256-021-00302-5.
[62] H. N. Mhaskar and N. Hahm. “Neural networks for functional approximation and system
identification”. In: Neural computation 9.1 (1997), pp. 143–159. issn: 0899-7667. doi: 10.1
162/neco.1997.9.1.143.
[63] Hrushikesh N Mhaskar. “Neural networks for optimal approximation of smooth and analytic
functions”. In: Neural computation 8.1 (1996), pp. 164–177.
[64] Jeffrey S Morris and Raymond J Carroll. “Wavelet-based functional mixed models”. In:
Journal of the Royal Statistical Society Series B: Statistical Methodology 68.2 (2006),
pp. 179–199.
26
[65] Gustavo A. Muñoz, Yannis Sarantopoulos, and Andrew Tonge. “Complexifications of real
Banach spaces, polynomials and multilinear maps”. In: Studia Math. 134.1 (1999), pp. 1–
33. issn: 0039-3223.
[66] Nicholas H Nelsen and Andrew M Stuart. “Operator learning using random features: A
tool for scientific computing”. In: SIAM Review 66.3 (2024), pp. 535–571.
[67] R. Nickl, S. van de Geer, and S. Wang. “Convergence rates for Penalised Least Squares
Estimators in PDE-constrained regression problems”. In: SIAM J. Uncert. Quant. 8 (2020).
[68] Richard Nickl. Bayesian Non-linear Statistical Inverse Problems. EMS press, 2023.
[69] Richard Nickl. “Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger
equation”. In: Journal of the European Mathematical Society 22.8 (2020), pp. 2697–2750.
issn: 1435-9855. doi: 10.4171/JEMS/975.
[70] Richard Nickl. “Donsker-type theorems for nonparametric maximum likelihood estimators”.
In: Probability Theory and Related Fields 138.3-4 (2007). issn: 0178-8051. doi: 10.1007/s
00440-006-0031-4.
[71] Richard Nickl and Sven Wang. “On polynomial-time computation of high-dimensional pos-
terior measures by Langevin-type algorithms”. In: Journal of the European Mathematical
Society, to appear (2020).
[72] F. Nobile, R. Tempone, and C. G. Webster. “A Sparse Grid Stochastic Collocation Method
for Partial Differential Equations with Random Input Data”. In: SIAM Journal on Numer-
ical Analysis 46.5 (2008), pp. 2309–2345. doi: 10.1137/060663660. eprint: https://doi
.org/10.1137/060663660. url: https://doi.org/10.1137/060663660.
[73] Thomas O’Leary-Roseberry et al. “Derivative-Informed Neural Operator: An efficient frame-
work for high-dimensional parametric derivative learning”. In: Journal of Computational
Physics 496 (2024), p. 112555. issn: 0021-9991. doi: https://doi.org/10.1016/j.jcp.2
023.112555. url: https://www.sciencedirect.com/science/article/pii/S00219991
23006502.
[74] Thomas O’Leary-Roseberry et al. “Derivative-informed projected neural networks for high-
dimensional parametric maps governed by PDEs”. In: Computer Methods in Applied Me-
chanics and Engineering 388 (2022), p. 114199. issn: 0045-7825. doi: https://doi.org/1
0.1016/j.cma.2021.114199. url: https://www.sciencedirect.com/science/article
/pii/S0045782521005302.
[75] J. A. A. Opschoor, Ch. Schwab, and J. Zech. “Exponential ReLU DNN Expression of Holo-
morphic Maps in High Dimension”. In: Constructive Approximation 55.1 (2022), pp. 537–
582. issn: 1432-0940. doi: 10.1007/s00365-021-09542-5. url: https://link.springer
.com/article/10.1007/s00365-021-09542-5#citeas.
[76] J. A. A. Opschoor, Ch. Schwab, and J. Zech. “Exponential ReLU DNN expression of
holomorphic maps in high dimension”. In: Constr. Approx. 55.1 (2022), pp. 537–582. issn:
0176-4276. doi: 10.1007/s00365-021-09542-5. url: https://doi.org/10.1007/s0036
5-021-09542-5.
[77] Joost A. A. Opschoor, Philipp C. Petersen, and Christoph Schwab. “Deep ReLU net-
works and high-order finite element methods”. In: Analysis and Applications 18.05 (2020),
pp. 715–770. issn: 0219-5305. doi: 10.1142/S0219530519410136.
[78] Joost A. A. Opschoor, Christoph Schwab, and Jakob Zech. “Deep learning in high dimen-
sion: ReLU neural network expression for Bayesian PDE inversion”. In: Optimization and
control for partial differential equations—uncertainty quantification, open and closed-loop
control, and shape optimization. Vol. 29. Radon Ser. Comput. Appl. Math. De Gruyter,
Berlin, 2022, pp. 419–462. isbn: 978-3-11-069596-0. doi: 10.1515/9783110695984- 015.
url: https://doi.org/10.1515/9783110695984-015.
[79] Houman Owhadi and Gene Ryan Yoo. “Kernel flows: From learning kernels from data into
the abyss”. In: Journal of Computational Physics 389 (2019), pp. 22–47.
27
[80] Philipp Petersen, Mones Raslan, and Felix Voigtlaender. “Topological Properties of the Set
of Functions Generated by Neural Networks of Fixed Size”. In: Foundations of Computa-
tional Mathematics 21.2 (2021), pp. 375–444. issn: 1615-3375. doi: 10.1007/s10208-020
-09461-0. url: https://link.springer.com/article/10.1007/s10208-020-09461-0.
[81] Philipp Petersen and Felix Voigtlaender. “Optimal approximation of piecewise smooth func-
tions using deep ReLU neural networks”. In: Neural networks : the official journal of the
International Neural Network Society 108 (2018), pp. 296–330. doi: 10.1016/j.neunet.2
018.08.019.
[82] Allan Pinkus. “Approximation theory of the MLP model in neural networks”. In: Acta
numerica, 1999. Vol. 8. Acta Numer. Cambridge Univ. Press, Cambridge, 1999, pp. 143–
195. isbn: 0-521-77088-2. doi: 10.1017/S0962492900002919. url: https://doi.org/10
.1017/S0962492900002919.
[83] Tomaso Poggio et al. “Why and when can deep-but not shallow-networks avoid the curse
of dimensionality: a review”. In: International Journal of Automation and Computing 14.5
(2017), pp. 503–519.
[84] David Pollard. Convergence of Stochastic Processes. New York, NY: Springer New York,
1984. isbn: 978-1-4612-9758-1. doi: 10.1007/978-1-4612-5254-2.
[85] Alfio Quarteroni, Andrea Manzoni, and Federico Negri. Reduced basis methods for par-
tial differential equations. Vol. 92. Unitext. An introduction, La Matematica per il 3+2.
Springer, Cham, 2016, pp. xi+296. isbn: 978-3-319-15430-5. doi: 10.1007/978-3-319-15
431-2. url: https://doi.org/10.1007/978-3-319-15431-2.
[86] Bogdan Raonic et al. “Convolutional neural operators”. In: ICLR 2023 Workshop on
Physics for Machine Learning. 2023.
[87] Holger Rauhut and Christoph Schwab. “Compressive sensing Petrov-Galerkin approxima-
tion of high-dimensional parametric operator equations”. In: Math. Comp. 86.304 (2017).
Report 2014-14, Seminar for Applied Mathematics, ETH Zürich, pp. 661–700. issn: 0025-
5718. doi: 10.1090/mcom/3113. url: http://dx.doi.org/10.1090/mcom/3113.
[88] Markus Reiß. “Asymptotic equivalence for nonparametric regression with multivariate and
random design”. In: Ann. Statist. 36.4 (2008), pp. 1957–1982. issn: 0090-5364. doi: 10.12
14/07-AOS525. url: http://dx.doi.org/10.1214/07-AOS525.
[89] Johannes Schmidt-Hieber. “Supplement to “Nonparametric regression using deep neural
networks with ReLU activation function””. In: The Annals of Statistics 48.4 (2020). issn:
0090-5364. doi: 10.1214/19-AOS1875SUPP.
[90] C. Schwab and A. M. Stuart. “Sparse deterministic approximation of Bayesian inverse
problems”. In: Inverse Problems 28.4 (2012), pp. 045003, 32. issn: 0266-5611,1361-6420.
doi: 10.1088/0266-5611/28/4/045003. url: https://doi.org/10.1088/0266-5611/28
/4/045003.
[91] Christoph Schwab and Jakob Zech. “Deep learning in high dimension: neural network ex-
pression rates for analytic functions in L2 (Rd , γd )”. In: SIAM/ASA J. Uncertain. Quantif.
11.1 (2023), pp. 199–234. issn: 2166-2525. doi: 10.1137/21M1462738. url: https://doi
.org/10.1137/21M1462738.
[92] Christoph Schwab and Jakob Zech. “Deep learning in high dimension: neural network ex-
pression rates for generalized polynomial chaos expansions in UQ”. In: Anal. Appl. (Singap.)
17.1 (2019), pp. 19–55. issn: 0219-5305,1793-6861. doi: 10.1142/S0219530518500203. url:
https://doi.org/10.1142/S0219530518500203.
[93] Euan A Spence and Jared Wunsch. “Wavenumber-explicit parametric holomorphy of Helm-
holtz solutions in the context of uncertainty quantification”. In: SIAM/ASA Journal on
Uncertainty Quantification 11.2 (2023), pp. 567–590.
[94] Andrew M. Stuart. “Inverse problems: a Bayesian perspective”. In: Acta Numer. 19 (2010),
pp. 451–559. issn: 0962-4929. doi: 10.1017/S0962492910000061.
28
[95] Taiji Suzuki. “Adaptivity of deep ReLU network for learning in Besov and mixed smooth
Besov spaces: optimal rate and curse of dimensionality”. In: he 7th International Conference
on Learning Representations (ICLR2019). Vol. 7. 2019.
[96] Michel Talagrand. The generic chaining: Upper and lower bounds for stochastic processes.
Berlin and New York: Springer, 2005. isbn: 3-540-24518-9. doi: 10.1007/3-540-27499-5.
[97] Michel Talagrand. Upper and lower bounds for stochastic processes. Vol. 60. Springer, 2014.
[98] Hans Triebel. Interpolation theory, function spaces, differential operators. 2., rev. and enl.
ed. Heidelberg and Leipzig: Barth, 1995. isbn: 3335004205.
[99] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer series in statis-
tics. Dordrecht: Springer, 2009. doi: 10.1007/b13794. url: https://cds.cern.ch/reco
rd/1315296.
[100] Sara A. de van Geer. Applications of empirical process theory. Digitally printed version.
Cambridge series in statistical and probabilistic mathematics. Cambridge: Cambridge Uni-
versity Press, 2009. isbn: 052165002X.
[101] Roman Vershynin. High-dimensional probability: An introduction with applications in data
science. Vol. 47. Cambridge series in statistical and probabilistic mathematics. Cambridge,
United Kingdom and New York, NY: Cambridge University Press, 2018. isbn: 1108415199.
[102] Dongbin Xiu and George Em Karniadakis. “The Wiener–Askey Polynomial Chaos for
Stochastic Differential Equations”. In: SIAM Journal on Scientific Computing 24.2 (2002),
pp. 619–644. doi: 10.1137/S1064827501387826. eprint: https://doi.org/10.1137/S10
64827501387826. url: https://doi.org/10.1137/S1064827501387826.
[103] Dmitry Yarotsky. “Error bounds for approximations with deep ReLU networks”. In: Neu-
ral networks : the official journal of the International Neural Network Society 94 (2017),
pp. 103–114. doi: 10.1016/j.neunet.2017.07.002.
[104] Jakob Zech. “Sparse-Grid Approximation of High-Dimensional Parametric PDEs”. PhD
thesis. ETH Zurich, 2018. doi: 10.3929/ethz-b-000340651.
29
Appendices
A Auxiliary Probabilistic Lemmas
We recall the classical Bernstein inequality.
Lemma A.1 (Bernstein’s inequality). Let X1 , . . . , Xn be independent, centered RVs with finite
second moments E[Xi2 ] < ∞ and uniform bound |Xi | ≤ M for i = 1, . . . , n. Then it holds
n
!
X t2
P Xi ≥ t ≤ 2 exp − Pn , t ≥ 0.
i=1
2 i=1 E[Xi2 ] + 32 M t
Proof. Let G∗ ∈ G be arbitrary. Using the definition of Ĝn from (2.2), it holds
kĜn − G0 k2n
n
1X
= kĜn (xi )k2Y − 2hG0 (xi ), Ĝn (xi )iY + kG0 (xi )k2Y
n i=1
n
1X
= kĜn (xi )k2Y − 2hG0 (xi ) + σεi , Ĝn (xi )iY + 2σhεi , Ĝn (xi )iY + kG0 (xi )k2Y
n i=1
n
1X ∗
≤ kG (xi )k2Y − 2hG0 (xi ) + σεi , G∗ (xi )iY + 2σhεi , Ĝn (xi )iY + kG0 (xi )k2Y
n i=1
n
1X ∗
= kG (xi )k2Y − 2hG0 (xi ), G∗ (xi )iY + kG0 (xi )k2Y + 2σhεi , Ĝn (xi ) − G∗ (xi )iY
n i=1
n
2σ X
= kG∗ − G0 k2n + hεi , Ĝn (xi ) − G∗ (xi )iY ,
n i=1
30
Proof. Since T is countable, we can write T = {tj : j ∈ N}. Using this, we define Tn = {tj : j ≤ n}
for n ∈ N. Since Tn is finite [30, Theorem 3.2, Eq. (3.2) and its proof] gives M̃ > 0 s.t.
p1
1
p
E sup |Xt − Xt0 | ≤ M̃ Jα (Tn , d) + sup d(s, t)p a (A.2)
t∈Tn s,t∈Tn
for all p ≥ 1, t0 ∈ T and n ∈ N. In (A.2), we used [30, Eq. (2.3)] to upper bound the γα -functionals
by the respective metric entropy integrals Jα . The monotone convergence theorem shows (A.2) for
T in the limit n → ∞. Applying [30, Lemma A.1] then gives the claim with M = exp(α−1 )M̃ .
We now use Lemma A.3 to establish the following concentration bound, which is tailored
towards the empirical processes appearing in our proofs, cf. Lemma A.2. Note that this lemma
can be viewed as a generalization of the key chaining Lemma 3.12 in [71] to Y-valued regression
functions; the proof follows along the same lines.
Lemma A.4 (Chaining Lemma). Let X, Y be separable Hilbert spaces, and suppose Θ is a (possibly
uncountable) set parameterizing a class of maps
H = {hθ : X → Y, θ ∈ Θ} .
where x1 , . . . , xn ∈ X are fixed elements and εi , . . . , εn are either (i) i.i.d. Gaussian white noise
processes indexed by Y, or (ii) i.i.d. sub-Gaussian random variables in Y with parameter 1.
Recall the empirical seminorm k · kn . Suppose that
Let the space (Θ, dn ) be separable. Then supθ∈Θ Zn (θ) is measurable and there exists a universal
constant CCh > 0 such that for all δ > 0 with
√
nδ ≥ CCh J(H, dn ), (A.4)
it holds
8nδ 2
P sup |Zn (θ)| ≥ δ ≤ exp − 2 2 .
θ∈Θ CCh U
Proof. In both of the cases, we will apply [30, Theorem 3.2], which we stated in Lemma A.3.
White noise case. Let θ, θ′ ∈ Θ be arbitrary. Since εi , i = 1, . . . , n are independent white
noise processes, we have Zn (θ) − Zn (θ′ ) ∼ N(0, n−1 khθ − hθ′ k2n ), i.e. the increments of Zn are
normal (recall that x is regarded as fixed here). Thus,
P (|Zn (θ) − Zn (θ′ )| ≥ t) ≤ 2 exp −nt2 /(2khθ − hθ′ k2n ) , t ≥ 0,
and
√ !
′ 2tdn (θ, θ′ ) dn (θ, θ′ )2 t2 2
P |Zn (θ) − Zn (θ )| ≥ √ ≤ 2 exp − = 2 exp −t , t ≥ 0, (A.5)
n khθ − hθ′ k2n
31
√ √
which verifies the assumption (A.1) for α = 2 and d¯n := 2dn / n.
¯ ′
√ Eq. (A.5) shows√ that the process Zn (θ) is sub-Gaussian w.r.t. the pseudometric dn (θ, θ ) =
2khθ − hθ′ kn / n. Therefore [38, Theorem 2.3.7 (a)] yields that Zn (θ) is sample bounded and
uniformly sample continuous. Since (Θ, dn ) is separable, so is (Θ, d¯n ). Thus it holds
where Θ0 ⊂ Θ denotes a countable, dense subset. The right hand side of (A.6) is measurable as
a countable supremum. Therefore also the left hand side supθ∈Θ Zn (θ) is measurable. Applying
Lemma A.3 (to the countable set Θ0 ) and using (A.6) gives that for some universal constant M
and all θ† ∈ Θ,
2
¯ tU t
P sup |Zn (θ) − Zn (θ† )| ≥ M J(H, dn ) + √ ≤ exp − , t ≥ 1.
θ∈Θ n 2
and therefore
√ ! 2
2M t
P sup |Zn (θ) − Zn (θ† )| ≥ √ (J(H, dn ) + tU ) ≤ exp − , t ≥ 1. (A.7)
θ∈Θ n 2
32
√
where we√assumed without loss of generality that M ≥ 1. Substitute δ = 3M/ n (J(H, dn ) + tU ),
i.e. tp
= ( nδ/3M − J(H, dn ))/U√. Because N (H, dn , τ ) ≥ 2 for τ ≤ U/2, we have that J(H, dn ) ≥
U/2 log(2) > U/4. Therefore nδ ≥ 15M J(H, dn ) := CCh J(H, dn ) implies t ≥ 1 and thus
√ 2
!
( nδ/(3M ) − J(H, dn )) 8nδ 2
P sup |Zn (θ)| ≥ δ ≤ 2 exp − ≤ 2 exp − 2 2 , (A.9)
θ∈Θ 2U 2 CCh U
Since the centered RVs n−1 hεi , hθ (xi )−hθ′ (xi )iY are i.i.d. sub-Gaussian with parameter n−1 khθ (xi )
− hθ′ (xi )kY , the ‘generalized’ Hoeffding inequality for sub-Gaussian variables (see [101, Theorem
2.6.2]) implies that for some universal
√ constant c > 0 the increment Zn (θ)−Zn (θ′ ) is sub-Gaussian
with parameter ckhθ − hθ kn / n. Therefore
′
√ !
2tc
P |Zn (θ) − Zn (θ )| ≥ √ dn (θ, θ ) ≤ 2 exp −t2 , t ≥ 0.
′ ′
n
From here on, the proof is similar to the white noise case and we obtain (A.9) by absorbing c into
CCh .
iid
G, k · k∞,supp(γ) , δ). Let (δ̃n )n∈N
Lemma A.5. Let xi ∼ γ and consider the entropy H(δ) = H(G
be a positive sequence with
nδ̃n2 ≥ 6F∞
2
H(δ̃n ).
√
Then for R ≥ max{8δ̃n , 18F∞ / n}, it holds that
nR2
P kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn ≤ 2 exp − 2
.
320F∞
X∞
sR
≤ PG0 sup kF kL2 (γ) − kF kn ≥ ,
s=1 Fs
F ∈F 2
33
Now consider some covering F ∗s = {Fs,j }N j=1 with N = N (F F s , k · k∞ , R/8). Thus for arbitrary
s ∈ N and F ∈ F s there exists Fs,j ∈ F ∗s s.t. kF − Fs,j k∞ ≤ R/8. We get
kF kL2 (γ) − kF kn ≤ kF kL2 (γ) − kFs,j kL2 (γ) + kFs,j kL2 (γ) − kFs,j kn
+ |kFs,j kn − kF kn |
R
≤ + kFs,j kL2 (γ) − kFs,j kn .
4
Therefore
∞
X
sR
PG0 sup kF kL2 (γ) − kF kn ≥
s=1 Fs
F ∈F 2
X∞
sR
≤ PG0 max kFs,j kL2 (γ) − kFs,j kn ≥
s=1
j=1,...,N 4
X∞
sR
≤ F s , k · k∞ , R/8) max PG0
N (F kFs,j kL2 (γ) − kFs,j kn ≥ , (A.11)
s=1
j=1,...,N 4
where we used sR/2 − R/4 ≥ sR/4 for s ∈ N. Since Fs,j ∈ F s , we have kFs,j kL2 (γ) ≥ sR and
therefore
E[Yi ] = 0,
2
2F∞
|Yi | ≤ ,
n
1 2
E[Yi2 ] = E kF k 2
s,j L2 (γ) − kF (x
s,j i Y )k 2
n2
1 h i
4 2 2 4
= E kF s,j k 2
L (γ) − 2kFs,j k 2
L (γ) kF s,j (xi )k Y + kFs,j (xi )k Y
n2
1
= E kFs,j (xi )k4Y − kFs,j k4L2 (γ)
n2
2
F∞
≤ kFs,j k2L2 (γ) .
n2
34
Applying Bernstein’s inequality (Lemma A.1) for the variables Yi yields
∞
X
s2 R 2
F s , k · k∞ , R/8) max PG0
N (F kFs,j k2L2 (γ) − kFs,j k2n ≥
s=1
j=1,...,N 4
∞
!
X ns4 R4
≤ F s , k · k∞ , R/8) max exp −
N (F s2 R2
32F∞2 (kF k2
s=1
j=1,...,N s,j L2 (γ) + 6 )
X∞
ns2 R2
≤ F s , k · k∞ , R/8) −
exp H(F 2
,
s=1
160F∞
where we used kFs,j k2L2 (γ) ≤ (s + 1)2 R2 ≤ 4s2 R2 for all j = 1, . . . , N and s ∈ N, since Fj,s ∈ F s .
Since H(δ)/δ 2 is non-increasing in δ and R ≥ 8δ̃n , we have
2
ns2 R2 n R
2
≥ 2
G, k · k∞ , R/8) ≥ 2H(F
≥ 2H(G F s , k · k∞ , R/8).
160F∞ 3F∞ 8
Therefore we get
∞
X
ns2 R2
exp H(FF s , k · k∞ , R/8) − 2
s=1
160F∞
X∞ X ∞
ns2 R2 1 nR2 nR2
≤ exp − 2
≤ exp − < 2 exp − . (A.12)
s=1
320F∞ s=1
s2 2
320F∞ 2
320F∞
δn2
nδn4 ≥C σ2 2 2
F∞ H ,
8σ 2 + δn2
√
all G∗ ∈ G and R ≥ max{δn , 2kG∗ − G0 kn }, it holds for every x = (x1 , ..., xn ) ∈ Xn
x nR4
PG0 kĜn − G0 kn ≥ R ≤ 4 exp − 2 2 2 )
. (A.13)
C σ (1 + F∞
Proof. In the following we write k · k∞ = k · k∞,supp(γ) . For R2 ≥ 2kG∗ − G0 k2n , we use the basic
inequality (Lemma A.2) to obtain
R2
PxG0 kĜn − G0 k2n ≥ R2 ≤ PxG0 kĜn − G0 k2n − kG∗ − G0 k2n ≥ .
2
n
!
x 1X ∗ R2
≤ PG0 hεi , Ĝn (xi ) − G (xi )iY ≥ . (A.14)
n i=1 4σ
35
It holds for R > 0
n
!
1X R2
PxG0 hεi , Ĝn (xi ) − G∗ (xi )iY ≥
n i=1 4σ
n
!
1X R2
≤ PxG0 sup hεi , G(xi ) − G∗ (xi )iY ≥
G n i=1
G∈G 4σ
n n
!
1X 1X R2
≤ PxG0 sup hεi , G(xi ) − G∗ (xi ) − Gj ∗ (xi )iY + max hεi , Gj (xi )iY ≥
G∈GG n i=1 j=1,...N n
i=1
4σ
n
!
1X R2
≤ PxG0 sup hεi , G(xi ) − G∗ (xi ) − Gj ∗ (xi )iY ≥
G∈GG n i=1 8σ
| {z }
(i)
n
!
1 X R2
+ PxG0 max hεi , Gj (xi )iY ≥ .
j=1,...N n i=1
8σ
| {z }
(ii)
We estimate the terms (i) and (ii) separatly. For (i), use kG−G∗ −Gj ∗ k∞ ≤ R2 /(8σE[kεi kY ]+R2 )
for all G ∈ G and estimate
n
!
x 1X ∗ R2
PG0 sup hεi , G(xi ) − G (xi ) − Gj ∗ (xi )iY ≥
G n i=1
G∈G 8σ
n
!
1X R2 R2
≤ PxG0 kεi kY ≥
n i=1 8σE[kεi kY ] + R2 8σ
n
!
1X R2 nR4
≤ PxG0 kεi kY − E[kεi kY ] ≥ ≤ 2 exp − 2 2 ,
n i=1 8σ C σ
where we used the Hoeffding inequality for sub-Gaussian random variables [101, Theorem 2.6.2].
For (ii), it holds
n
!
x 1X R2
PG0 max hεi , Gj (xi )iY ≥
j=1,...N n 8σ
i=1
n
!
x 1X R2
≤ N max PG0 hεi , Gj (xi )iY ≥
j=1,...N n i=1 8σ
2nR4 R2 2nR4
≤ 2N exp − 2 2 2 ≤ 2 exp H − ,
C σ F∞ 8σ 2 + R2 C 2 σ 2 F∞
2
where we used E[kεi kY ] ≤ σ (Cauchy-Schwarz) and [101, Theorem 2.6.2] for Yi = n−1 hεi , Gj (xi )iY ,
i = 1, . . . , n.
Since H(δ)/δ 2 is non-increasing, also H(δ/(8σ 2 + δ))/δ 2 is non-increasing and thus it holds for
R ≥ δn
nR4 R2
≥ H .
C 2 σ 2 F∞
2 8σ 2 + R2
This gives for R ≥ δn
nR4
(ii) ≤ 2 exp − . (A.15)
C σ 2 F∞
2 2
36
B Proofs of Section 2
B.1 Proof of Theorem 2.5
B.1.1 Existence and Measurability of Ĝn
Proof of Theorem 2.5 (i). White noise case. For i = 1, . . . , n, denote the probability space of
the RVs xi and the white noise processes εi as (Ω, Σ, P). Furthermore, equip the Hilbert spaces
X and Y with Borel σ-algebras BX and BY and the space G with the Borel σ-algebra BG . Then,
there exists an orthonormal basis (ψj )j∈N of Y and i.i.d. Gaussian variables Zj ∼ N(0, 1) s.t.
εi : Ω × Y → R,
∞
X
εi (ω, y) = hy, ψj iY Zj (ω)
j=1
ui : Ω × G → R,
ui (ω, G) = 2σεi (ω, G(xi (ω))) + 2hG0 (xi (ω)), G(xi (ω))iY − kG(xi (ω))k2Y . (B.1)
Pn
We aim to apply [70, Proposition 5] to u := 1/n i=1 ui in order to get existence and measurability
of Ĝn in (2.2).
Per assumption, the metric space (G G , k·k) is compact. We show that ui from (B.1) is measurable
in the first component and continuous in the second component for all i = 1, . . . , n. Then [70,
Proposition 5] shows the claim. Consider an arbitrary G ∈ G. We show that ui (. , G) is (Σ, BR )-
measurable, where BR is the Borel σ-algebra in R.
The RVs xi are (Σ, BX )-measurable by definition. The maps G and G0 are assumed to
be (BX , BY )-measurable. Furthermore, because of their continuity, the scalar product h. , .iY is
(BY , BR )-measurable in both components and also the norm k . kY is (BY , BR )-measurable. There-
fore, since the composition of measurable functions is measurable, the latter two summands in
(B.1) are (Σ, BR )-measurable.
Proceeding with the first summand, the RVs Zj are (Σ, BR )-measurable by definition for all
j ∈ N. Therefore the products h . , ψj iY Zj are (Σ ⊗ BY , BR )-measurable for all j ∈ N. Thus
εi is, as the pointwise limit, (Σ ⊗ BY , BR )-measurable. Then, as the composition of measurable
functions, the first summand in (B.1) is (Σ, BR )-measurable. Therefore ui ( . , G), i = 1, . . . , n, and
thus u( . , G) is (Σ, BR )-measurable for all G ∈ G .
We proceed and show that u(ω, . ) is continuous w.r.t. k · k. Therefore choose G, G′ ∈ G and
ω ∈ Ω. Then it holds for i = 1, . . . , n and xi = xi (ω)
|ui (ω, G) − ui (ω, G′ )| ≤ 2σ |εi (ω, G(xi ) − G′ (xi ))| + 2 |hG0 (xi ), G(xi ) − G′ (xi )iY |
+ |kG(xi )kY − kG′ (xi )kY | |kG(xi )kY + kG′ (xi )kY |
≤ 2σ |εi (ω, G(xi ) − G′ (xi ))|
√
+ n (2kG0 (xi )kY + kG(xi )kY + kG′ (xi )kY ) kG − G′ kn
≤ 2σ |εi (ω, G(xi ) − G′ (xi ))|
√
+ C n (2kG0 (xi )kY + kG(xi )kY + kG′ (xi )kY ) kG − G′ k, (B.2)
where we used that k · kn ≤ Ck · k at the last inequality. Furthermore, [38, page 40 and Proposition
2.3.7(a)] yields that the white noise processes εi are a.s. sample dεi -continuous w.r.t. their intrinsic
pseudometrics dεi : Y × Y → R, dεi (y, y ′ ) = E[hεi , y − y ′ i2Y ]1/2 = ky − y ′ kY . Since for all G, G′ ∈ G
we have
√ √
kG(xi ) − G′ (xi )kY ≤ nkG − G′ kn ≤ C nkG − G′ k,
37
the white noise processes are also a.s. sample d-continuous, where d is the metric induced by k·k.
Together with (B.2) this shows that there exists a null-set Ω0 ⊂ Ω such that for all ω ∈ Ω\Ω0 ,
u(ω, .) is continuous w.r.t. k · k. Now we choose versions x̃i and ε̃i s.t. u(ω, .) = 0 for all ω ∈ Ω0 .
Then G 7→ u(ω, G) : (G G , k · k) → R is continuous for all ω ∈ Ω. Applying [70, Proposition 5] gives
an (Σ, BG )-measurable MLSE Ĝn in (2.2) with the desired minimization property.
Sub-Gaussian case. For i = 1, . . . , n, consider the functions ui from (B.1). The measurability
of ui ( . , G), i = 1, . . . , n follows from the measurability of the scalar product h. , .iY , the norm k·kY ,
G0 and all G ∈ G . Note that in contrast to the white noise case, εi are RVs in Y for all i = 1, . . . , n
and therefore measurable without any further investigation. Also, since εi (ω) ∈ Y for all ω,
Cauchy-Schwarz immediately shows that ui (ω, .) and therefore u(ω, .) is continuous w.r.t. k · k for
all ω ∈ Ω\Ω0 . Choosing versions x̃i and ε̃i and applying [70, Proposition 5] as above gives the
existence of an (Σ, BG )-measurable LSE Ĝn in (2.2) and therefore finishes the proof.
Slicing argument. Recall the definition (2.3) of the empirical norm. For R2 ≥ 2kG∗ − G0 k2n ,
we have
PxG0 kĜn − G0 k2n ≥ R2 ≤ PxG0 2 kĜn − G0 k2n − kG∗ − G0 k2n ≥ R2 . (B.3)
Applying the basic inequality (Lemma A.2) and defining the empirical process (XG : G ∈ G )
indexed by the operator class G as
n
1X
XG = hεi , G(xi ) − G∗ (xi )iY ,
n i=1
gives
∞
X
PxG0 22s R2 ≤ 2 kĜn − G0 k2n − kG∗ − G0 k2n < 22s+2 R2
s=0
∞
X
22s−2 R2
≤ PxG0 XĜn ≥ , kĜn − G0 k2n − kG∗ − G0 k2n < 22s+1 R2
s=0
σ
X∞
x 22s−2 R2
≤ PG0 sup XG ≥ . (B.4)
s=0 G∗
G∈G n (2
s+3/2 R) σ
38
Concentration inequality for each slice. We wish to apply Lemma A.4 to bound the prob-
abilities in (B.4). Let CCh be the generic constant from this lemma and let δn satisfy (2.5). Then,
due to δ 7→ Ψn (δ)/δ 2 being non-increasing, (2.5) gives for all R ≥ δn and s ∈ N0
√ 2s+3 2
n(2 R ) ≥ 32CCh σΨn (2s+3/2 R)
√ 22s−2 R2
n ≥ CCh Ψn (2s+3/2 R) ≥ CCh J(Θ, k · kn ). (B.5)
σ
With hG := G − G∗ and H := {hG : G ∈ Θ} we have J(Θ, k · kn ) = J(H, k · kn ) and thus (B.5)
verifies assumption (A.4) of Lemma A.4 for δ = 22s−2 R2 /σ.
Furthermore, since Θ = G ∗n (2s+3/2 R) ⊂ G for s ∈ N0 , R > 0 and (G G , k · kn ) is compact, the
space (Θ, k · kn ) is separable, which verifies the last assumption of Lemma A.4. Applying this
lemma with U = 2s+3/2 R shows that the (uncountable) suprema in (B.4) are measurable and that
for all R ≥ δn ,
∞
X ∞
X
22s−2 R2 8n24s−4 R4
PxG0 sup XG ≥ ≤ exp − 2 2 2s+3 2
s=0 G∗
G∈G n (2
s+3/2 R) σ s=0
CCh σ 2 R
X∞
22s−4 nR2
≤ exp − 2 σ2
s=0
CCh
X∞
−2s nR2
≤ 2 exp − 2 σ2
s=0
16CCh
nR2
< 2 exp − 2 σ2 . (B.6)
16CCh
In (B.6)√we additionally used exp(−xy) ≤ exp(−x)/y, which holds for all x, y ≥ 1, i.e. for R ≥
4CCh σ/ n. Combining (B.3)–(B.4) and (B.6) gives (2.6) and therefore shows the claim.
39
Furthermore, Lemma A.5 gives
nR2
PG0 kĜn − G0 kL2 (γ) ≥ R, kĜn − G0 kL2 (γ) ≥ 2kĜn − G0 kn ≤ 2 exp − 2
, (B.9)
320F∞
√
since we have assumed R ≥ max{8δ̃n , 18F∞ / n}. Combining (B.8) and (B.9) shows (2.11) for
some C.
40
Proof of (i): Choosing δ̃n2 ≃ n−2/(2+α) immediately shows the second part of (2.9). For J(δ)
from (2.4) it holds
Z δp
J(δ) ≤ H(ρ, N ) dρ . δ 1−α/2 =: Ψn (δ).
0
Since the entropy term is independent of N , the limit N → ∞ gives the claim.
Proof of (ii): Choosing δ̃n2 ≃ N (1 + log(n))/n shows in particular log(n) & log(δ̃n−1 ), which in
turn shows that δ̃n satisfies the second part of (2.9). For J(δ) from (2.4) it holds
Z δ p √
J(δ) . H(ρ, N ) dρ . N δ 1 + log(δ −1 ) =: Ψn (δ).
0
Hence δn2 ≃ N (1 + log(n))/n satisfies the first part of (2.9). Therefore Corollary 2.7 shows
h i N (1 + log(n))
EG0 kĜn − G0 k2L2 (γ) . N −β + .
n
1
Choosing N (n) = ⌈n β+1 ⌉ and using log(n) ≤ nτ /τ for all τ > 0 and n ≥ 1 gives the claim.
Using (B.10) and (B.11), and minimizing over G∗ ∈ G gives the bound on the mean-squared error
(2.15) for some C2 .
41
of f and g as nf and ng and the output dimensions as mf and mg .3 Then there exists a σ-NN
(f, g), called the parallelization of f and g, which simultaneously realizes f and g, i.e.
It holds
N
X
N
size {fi }i=1 = size(fi ), (C.2)
i=1
N
X
N
sizein {fi }i=1 = sizein (fi ),
i=1
N
X
N
sizeout {fi }i=1 = sizeout (fi ),
i=1
N
depth {fi }i=1 = depth(f1 ),
N
X
N
width {fi }i=1 = width(fi ), (C.3)
i=1
N
mpar {fi }i=1 = max mpar(fi ),
i=1,...,N
2 XN
N
mranΩ {fi }i=1 = mranΩ (fi )2 .
i=1
There is no simple control over the size and the weight bound of the concatenation f • g in
Definition C.2. The reason is that the network f • g multiplies network weights and biases of
the NNs f and g at layer l = depth(g) + 1 (for details see [81, Definition 2.2]). In the following
we use sparse concatenation to get control over the size and the weights. We first introduce the
realization of the identity map and separate the analysis for the σ1 - and σq -case. The following
lemma is proven in [81, Remark 2.4].
3 Using the syntax from (3.4a)–(3.4b), it holds nf = p0 (f ), ng = p0 (g), mf = pL+1 (f ) and mg = pL+1 (g).
42
Lemma C.3 (σ1 -realization of identity map). Let d ∈ N and L ∈ N. Then there exists a σ1 -
identity network IdRd of depth L, which exactly realizes the identity map IdRd : Rd → Rd , x 7→ x .
It holds
We proceed with the analogous result for the RePU activation function. The following lemma
follows from the construction in [57, Theorem 2.5 (2)] with a parallelization argument.
Lemma C.4 (σq -realization of identity map). Let q ∈ N with q ≥ 2. Further let d ∈ N and L ∈ N
be arbitrary. Then there exists a σq -NN IdRd of depth L, which exactly realizes the identity map
IdRd . It holds
Proof. The bounds in (C.9), (C.10) and (C.12) follow from Definition C.3 with the NN calculus
from Definition C.2. The bounds on the sizes in (C.7)–(C.8) and the weight and bias bound (C.11)
follow from the specific structure of the σ1 -identity network, see [81, Remark 2.6].
We proceed with the sparse concatenation of σq -NNs.
Definition C.6 (σq -sparse concatenation). Let q ∈ N with q ≥ 2. Let f and g be two σq -NNs.
Furthermore, let the output dimension mg of g equal the input dimension nf of f . Then there
exists a σq -NN f ◦ g with
f ◦ g := f • IdRmg •g
4 The symbol ◦ does mean either the functional concatenation of f and g or the sparse concatenation of the NN
43
realizing the composition f ◦ g : x 7→ f (g(x)) of the functions f and g. It holds
Proof. For the proof of the σ1 -case, see [34, Lemma A.1]. We prove the RePU case in the following.
Without loss of generality we can choose α > 1. For α < 0 we set SMα = −SM−α . For
0 < α < 1 we set SMα = α IdRd , where we directly multiply the weights and biases of the identity
network (with depth L = 1) with α.
Therefore let α > 1. Let K be the maximum integer smaller than log2 (α), and set α̃ =
2−K+1 α < 1. Furthermore, set A1 = (IdRd , IdRd ) and A2 = Σ2 with the one-layered identity
network IdRd and the summation network Σ2 from Definition C.7. We notice that
A2 • A1 x = 2x ∀x ∈ Rd .
44
Using the bounds for Σ2 from Definition C.7 and IdRd from Definition C.4, we have depth(A2 •A1 ) =
1, width(A2 • A1 ) ≤ Cq d, size(A2 • A1 ) ≤ Cq d and mpar(A2 • A1 ) ≤ Cq . Setting A2k+1 = A1 ,
A2k+2 = A2 for k = 1, . . . , K and A2K+3 = α̃ IdRd , we get
A2K+3 ◦ A2K+2 • A2K+1 ◦ A2K • A2K−1 ◦ . . . .. ◦ A2 • A1 x = αx ∀x ∈ Rd .
Applying the σq -NN calculus for concatenation (Definition C.2) and sparse concatenation (Defi-
nition C.6) gives the desired bounds for SMα .
45
Q
Proof. Analogous to [75, Proposition 2.6] we construct e δ,D as a binary tree of × ˜ .,. -networks from
Proposition C.9. We modify the proof of [75] to get a construction with bounded weights.
Define Ñ := min{2k : k ∈ N, 2k ≥ N }. We now consider the multiplication of Ñ numbers
with yN +1 , . . . , yÑ := 1. This can be implemented by a zero-layer network with
(
1 1 i = j ≤ N,
wi,j =
0 otherwise,
(
0 j ≤ N,
b1j =
1 N < j ≤ Ñ .
l
with δ ′ := δ/(Ñ 2 D2Ñ ) and Dl := 2l D2 . We now set
Y
f
:= Rlog2 (Ñ )−1 ◦ · · · ◦ R0 . (C.27)
δ,D
Q
Eq. (C.27) shows that the map Rl describes the multiplications on level l of the binary tree e δ,D .
In order for (C.27) to be well-defined, we have to show that the outputs of the NN Rl are admissible
inputs for the NN Rl+1 .
We therefore denote with yjl , j = 1, . . . , 2log2 (Ñ)−l , l = 1, . . . , log2 (Ñ ) − 1, the output of the
network Rl−1 ◦· · ·◦R0 applied to the input yk0 = yk , k = 1, . . . , Ñ . Then we have to show |yjl | ≤ Dl
for l = 0, . . . , log2 (Ñ ) − 1 and j = 1, . . . , 2log2 (Ñ )−l . We will show this claim by induction. For
l = 0 it holds |yjl | ≤ D = D0 . Now assume |yjl | ≤ Dl for arbitrary but fixed l ∈ {0, . . . , log2 (Ñ )−2}
and all j = 1, . . . , 2log2 (Ñ )−l . Then it holds
|yjl+1 | = |× l
˜ δ′ ,Dl (y2j−1 l
, y2j )|
l l
= |y2j−1 · y2j δ ′ ≤ 2Dl2 = Dl+1
+ δ ′ | ≤ Dl2 + |{z}
≤1≤Dl2
for all j = 1, . . . , 2log2 (Ñ )−(l+1) , which shows the claim. We proceed by showing the error bound
in (C.23). Therefore define
2l
Y
l
zj := yk+2l (j−1)
k=1
log2 (Ñ)−l
for l = 0, . . . , log2 (Ñ ) and j = 1, . . . , 2 . The quantites zjl describe the exact computations
up to level l of the binary tree, i.e. the output of level l − 1, if one uses standard multiplication
instead of the multiplication networks × ˜ in the first l − 1 levels. We now prove
l+1
|yjl − zjl | ≤ 4l D2 δ′, j = 1, . . . , 2log2 (Ñ)−l (C.28)
by induction over l = 0, . . . , log2 Ñ . Inserting l = log2 (Ñ ) then shows the error bound in (C.23)
using the definition of δ ′ .
We have yj0 = yj = zj0 for j = 1, . . . , Ñ , therefore (C.28) holds for l = 0. Now assume (C.28)
46
to hold for an arbitrary but fixed l ∈ {0, . . . , log2 (Ñ ) − 1}. For j = 1, . . . , 2log2 (Ñ )−(l+1) it holds
|yjl+1 − zjl+1 | = × l
˜ Dl ,δ′ (y2j−1 l
, y2j l
) − z2j−1 l
z2j
l l
= y2j−1 y2j + δ ′ − z2j−1
l l
z2j
l l l l l l
= (y2j−1 − z2j−1 + z2j−1 ) · (y2j − z2j + z2j ) + δ ′ − z2j−1
l l
z2j
l l l l l l l
= (y2j−1 − z2j−1 ) · (y2j − z2j ) + (y2j−1 − z2j−1 )z2j
l l l
+ (y2j − z2j )z2j−1 + δ′
l l l l l l l
≤ (y2j−1 − z2j−1 ) · y2j − z2j + y2j−1 − z2j−1 · z2j
| {z } | {z } | {z } |{z}
use(C.28) ≤1 use(C.28) ≤D2l
l l l ′
+ y2j − z2j · z2j−1
+|δ |
| {z } | {z }
use(C.28) ≤D2l
l+1
l
≤ δ ′ 4l D 2 1 + 2D2 +1
l+1 l+1 l+2
≤ 4 4l D 2 D2 δ ′ ≤ 4l+1 D2 δ′ ,
47
Q
For a bound on the size size( e δ,D ) we observe that level l, l = 0, . . . , log2 (Ñ ) − 1, of the binary
Q̃
tree ˜ δ′ ,Dl . We calculate
consists of 2log2 (Ñ )−l−1 product networks ×
In (C.30) we used (C.7) to bound the size of a sparse concatenation and (C.21) for the size of the
product network × ˜ δ,D .
Q
For the bound on the weights and biases, we get mpar( e δ,D ) ≤ 1 because of mpar(× ˜ δ,D ) ≤ 1,
see (C.22), and the NN calculus for sparse concatenation in (C.11) and parallelization in (C.1).
We continue with the RePU-case.
Proposition C.12 (σq -NN of multiplication of N numbers). Let N, q ∈ N with N, q ≥ 2. Then
Q
there exists a σq -NN e : RN → R such that
Y N
Y
f
(y1 , . . . , yN ) = yj .
j=1
Q̃
Proof. The construction is similar to the ReLU case. We define as a binary tree of product
˜ see (C.26) and (C.27). The binary tree has a maximum of 2N binary networks ×,
networks ×, ˜ a
maximum height of log2 (2N ) and a maximum width of N . Therefore (C.31)–(C.34) follow with
the NN calculus rules from Definition C.6.
We proceed and state the approximation results for univariate polynomials. We start with the
ReLU case. The following proposition was shown in [34, Proposition III.5].
48
Proposition C.13 (σ1 -NN approximation of polynomials, cf. [34, Proposition III.5]). Let m ∈ N
and a = (ai )m
i=0 ∈ R
m+1
. Further let D ∈ R, D ≥ 1 and δ ∈ (0, 1/2). Define a∞ = max{1, kak∞}.
Then there exists a σ1 -NN p̃δ,D : [−D, D] → R satisfying
m
X
sup p̃δ,D (x) − ai xi ≤ δ.
x∈[−D,D] i=0
The bounds for p̃ follow from the respective bounds for Σ2 from Definition C.7, IdR from Lemma
C.4 and SMα from Definition C.8.
We now use Propositions C.13 and C.14 to get an approximation result for univariate Legendre
polynomials.
Corollary C.15 (σ1 -NN approximation of Lj ). Let j ∈ N0 and δ ∈ (0, 1/2). Then there exists a
σ1 -NN L̃j,δ : [−1, 1] → R with
sup |L̃j,δ (x) − Lj (x)| ≤ δ.
x∈[−1,1]
49
Proof. For j ∈ N, l ∈ N0 , l ≤ j, denote the coefficients of Lj with cjl . In [77, Eq. (4.17)] the bound
Pj j j j j j j
Pj j j
l=0 |cl | ≤ 4 is derived. With c = (cl )l=0 it holds kc k∞ ≤ l=0 |cl | ≤ 4 . The result now
follows with Proposition C.13.
We continue with the σq -case.
Corollary C.16 (σq -NN approximation of Lj ). Let j ∈ N0 . Then there exists a σq -NN L̃j : R → R
with
Proof. The bounds follow similar as in the σ1 -case using Proposition C.14.
D Proofs of Section 3
D.1 Proof of Proposition 3.9
We proceed analogously to the proof of [75, Proposition 2.13]. We define fΛ,δ as a composition
(1) (2) (2)
of two subnetworks, fΛ,δ := fΛ,δ ◦ fΛ,δ . The subnetwork fΛ,δ evaluates, in parallel, all relevant
univariate Legendre polynomials, i.e.
n o
(2)
fΛ,δ (yy ) := IdR ◦L̃νj ,δ′ (yj ) , (D.1)
(j,νj )∈T
where we used
T := (j, νj ) ∈ N2 : ν ∈ Λ, j ∈ supp ν , (D.2)
−1 −d(Λ)+1
δ ′ := (2d(Λ)) (2m(Λ) + 2) δ
and y = (yj )(j,νj )∈T . In (D.1) the big round brackets denote a parallelization and we use the
(1) (2)
identity networks to synchronize the depth. The subnetwork fΛ,δ takes the output of fΛ,δ as input
Q
and computes, in parallel, tensorized Legendre polynomials using the multiplication networks e .,.
introduced in Proposition C.11. With Mν := 2|νν |1 + 2 we define
(1) (1) (2)
fΛ,δ (zk )k≤|T | = fΛ,δ fΛ,δ (yy )
Y n o
f
:= IdR ◦ IdR ◦L̃νj ,δ′ (yj ) . (D.3)
δ/2,Mν j∈supp ν ν ∈Λ
50
We will first show the error bound in (3.12). Let ν ∈ Λ be arbitrary. We use the shorthand
notation k · k := k · kL∞ ([−1,1]|T | )) and calculate
Lν − L̃ν ,δ
Y Y Y n o
f
≤ Lν − L̃νj ,δ ′ + L̃νj ,δ ′ − L̃νj ,δ′
δ/2,Mν j∈supp ν
j∈supp ν j∈supp ν
X Y Y δ
≤ L̃νj ,δ′ · Lνk − L̃νk ,δ′ · Lνj +
j∈supp ν : j∈supp ν :
2
k∈supp ν
j<k j>k
d(Λ)−1
δ Mν δ δ
≤d(Λ)Mνd(Λ)−1 δ ′ + ≤ + ≤ δ,
2 2m(Λ) + 2 2 2
In (D.5) we used the depth bound for univariate Legendre polynomials, (C.36), at the first in-
(1)
equality. Furthermore, we used νj ≤ m(Λ). For the depth of fΛ,δ it holds
Y
(1) f
depth fΛ,δ = 1 + max depth
ν ∈Λ δ/2,Mν
≤ 1 + C max log (| supp ν |) log (| supp ν |) + | supp ν | log (Mν ) + log δ −1
ν ∈Λ
≤ 1 + C log (d(Λ)) log (d(Λ)) + d(Λ) log (m(Λ)) + log δ −1 , (D.6)
where we used | supp ν | ≤ d(Λ) for all ν ∈ Λ, Mν ≤ 4m(Λ) and the depth bound for σ1 -
multiplication networks from Proposition C.11. Combining the two depth bounds (D.5) and
(D.6), we get
(1) (2)
depth (fΛ,δ ) = 1 + depth fΛ,δ + depth fΛ,δ
≤ Cm(Λ) log (d(Λ)) + d(Λ) log (m(Λ)) + m(Λ) + log δ −1
+ C log (d(Λ)) log (d(Λ)) + d(Λ) log (m(Λ)) + log δ −1
h i
≤ C log(d(Λ))d(Λ) log(m(Λ))m(Λ) + m(Λ)2 + log(δ −1 ) log(d(Λ)) + m(Λ) .
(1) (2)
For the width width(fΛ,δ ) we use width(fΛ,δ ) ≤ 2 max{width(fΛ,δ ), width(fΛ,δ )}, see (C.10). This
51
(2) (1)
leaves us to calculate width(fΛ,δ ) and width(fΛ,δ ). It holds
X
(2)
width fΛ,δ ≤ width IdR ◦L̃νj ,δ′
(j,νj )∈T
X
≤2 width L̃νj ,δ′ ≤ 18|T |, (D.7)
(j,νj )∈T
where we used (C.5) for the width of the σ1 -identity network and (C.37) for the width of L̃νj ,δ′ .
(1)
For width(fΛ,δ ) it holds
X Y
(1) f
width fΛ,δ ≤ width IdR ◦
δ/2,Mν
ν ∈Λ
X Y
f
≤2 width
δ/2,Mν
ν ∈Λ
X
≤ 10d(Λ) = 10|Λ|d(Λ), (D.8)
ν ∈Λ
Q
again using (C.5) and (C.24) for the width of the multiplication network e . Combining (D.7) and
(D.8) gives
In (D.9) we used the NN calculus rules for the sizes of a sparse concatenation in (C.7) and a
parallelization in (C.2). Furthermore, we used |T | ≤ m(Λ)d(Λ) at the first equality and
size (IdR ) ≤ 4 max depth L̃νj ,δ′ (yj ) ≤ 4 max size L̃νj ,δ′ (yj ) , (D.10)
(j,νj )∈T (j,νj )∈T
which follows from (C.4). At the third inequality in (D.9) we used the size bound for the univariate
Legendre polynomials from (C.38).
52
(1)
For size(fΛ,δ ) it holds
X Y
(1) f
size fΛ,δ = size IdR ◦
δ/2,Mν
ν ∈Λ
X Y
f
≤2 size(IdR ) + size
δ/2,Mν
ν ∈Λ
Y
f
≤ 10|Λ| max size
ν ∈Λ δ/2,Mν
≤ C|Λ| max | supp ν | log (| supp ν |) + | supp ν | log(Mν ) + log δ −1
ν ∈Λ
≤ C|Λ|d(Λ) log (d(Λ)) + d(Λ) log(m(Λ)) + log δ −1 . (D.11)
Q̃
In (D.11), we used the size bound for from (C.25) and the argument from (D.10). Additionally
we used Mν = 2|νν |1 + 2 ≤ 4m(Λ). Combining (D.9) and (D.11) shows the size bound for fΛ,δ .
Q̃
The network fΛ,δ consists of sparse concatinations and parallelizations of the networks and
Q̃
L̃j . Because we have mpar( ) ≤ 1 and mpar(L̃j ) ≤ 1, the NN calculus rules (C.11) and (C.1)
yield mpar(fΛ,δ ) ≤ 1. This finishes the proof.
Proof. Similar to the proof of Proposition 3.9 we define fΛ as a composition of two subnetworks
(1) (2)
fΛ and fΛ . It holds n
o
(2)
fΛ (yy ) := IdR ◦L̃νj (yj )
(j,νj )∈T
and
(1) (1) (2)
fΛ (zk )k≤|T | = fΛ fΛ (yy )
f n
Y o
:= IdR ◦ IdR ◦L̃νj (yj )
j∈supp ν ν∈Λ
with T from (D.2) and y = (yj )(j,νj )∈T . Furthermore, we use the σq -NNs L̃j from Corollary C.15
Q̃
and from Proposition C.12. The calculations are similar to the proof of Proposition 3.9. It
53
holds
(2)
depth fΛ =1+ max depth L̃νj
ν ∈Λ
j∈supp ν
≤ Cq max νj
ν ∈Λ
j∈supp ν
≤ Cq m(Λ). (D.12)
In (D.12) we used the depth bound for univariate Legendre polynomials, (C.39). Furthermore, we
(1)
used νj ≤ m(Λ) for all ν ∈ Λ and j ∈ supp ν . For the depth of fΛ it holds
Y
(1) f
depth fΛ = 1 + max depth
ν ∈Λ
where we used (C.16) for the width of a σq -sparse concatenation and (C.40) for the width of L̃νj .
(1)
For width(fΛ ) it holds
X Y
(1) f
width fΛ ≤ width IdR ◦
ν ∈Λ
X Y
f
≤ Cq width
ν ∈Λ
X
≤ Cq d(Λ) = Cq |Λ|d(Λ) (D.15)
ν ∈Λ
Q
using (C.16) and (C.32) for the width of the multiplication network e . Combining (D.14) and
(D.15) gives
width (fΛ ) ≤ Cq |Λ|d(Λ),
where |T | ≤ |Λ|d(Λ) was used.
(1) (2)
To estimate size(fΛ ), we use (C.13) and find size(fΛ ) ≤ Cq (size(fΛ )+size(fΛ )). We calculate
n o
(2)
size fΛ = size IdR ◦L̃νj (yj )
(j,νj )∈T
X
= size IdR ◦L̃νj (yj )
(j,νj )∈T
≤ Cq m(Λ)d(Λ) max size (IdR ) + size L̃νj (yj )
(j,νj )∈T
≤ Cq m(Λ)d(Λ) max size L̃νj (yj )
(j,νj )∈T
2
≤ Cq d(Λ)m(Λ) . (D.16)
54
In (D.16) we used |T | ≤ m(Λ)d at the first inequality and
size (IdR ) ≤ Cq max depth L̃νj (yj ) ≤ Cq max size L̃νj (yj ) , (D.17)
(j,νj )∈T (j,νj )∈T
which follows from (C.6). At the third inequality in (D.16) we used the size bound for the univariate
Legendre polynomials from (C.41).
(1)
For size(fΛ ) it holds
X Y
(1) f
size fΛ = size IdR ◦
ν ∈Λ
X Y
f
≤ Cq size(IdR ) +
ν ∈Λ
Y
f
≤ Cq |Λ| max size
ν ∈Λ
Theorem D.3. Consider the setting of Theorem D.2. Let Ψ X be a Riesz basis. Additionally, let
γ be as in (3.9). Fix τ > 0 (arbitrary small). Then there exists a constant C > 0 independent of
N , such that there exists ΓN ∈ G sp
FN (σq , N ) with
We first show that Theorems D.2 and D.3 imply Theorem 3.10.
Proof of Theorem 3.10. First consider the setting of Theorem D.2. Let τ > 0. Then there exists
a constant C independent of N and a FrameNet ΓN ∈ G sp FN (σq , N ) such that for all N ∈ N
2 2
kΓN − G0 k∞,supp(γ) ≤ sup kΓN (a) − G0 (a)kY ≤ CN −2 min{r−1,t}+τ ,
r (X)
a∈CR
r
where we used (D.19) with τ /2 and supp(γ) ⊆ CR (X) by Assumption 2.
Now consider the setting of Theorem D.3. Let τ > 0. Then there exists a constant C indepen-
dent of N and a FrameNet G spFN (σq , N ) with
r
where we used supp(γ) ⊂ CR (X) (Assumption 2) and (D.20) with τ /2.
We are left to prove Theorems D.2 and D.3. We need some auxiliary results.
55
Auxiliary Results
r
For r > 1, R > 0, U = [−1, 1]N and σR form (3.7) we define
r
u : U → Y, u(yy ) := (G0 ◦ σR )(yy ).
For the proofs of Theorems D.2 and D.3 we do a Y-valued tensorized Legendre expansion of u in
the frame (ηj Lν (yy ))j,νν of L2 (U, π; Y), which reads
XX
r
u(yy ) = G0 (σR (yy )) = cν ,j ηj Lν (yy ) (D.21)
j∈N ν ∈F
The following theorem is a special case of [104, Theorem 2.2.10]. The formulation is similar to
[43, Theorem 4].
Theorem D.4. Let Assumption 2 be satisfied with r > 1 and t > 0. Fix τ > 0, p ∈ ( 1r , 1] and
t′ ∈ [0, t]. Consider F from (3.11), and let π = ⊗j∈N λ2 be the infinite product (probability) measure
on U = [−1, 1]N , where λ denotes the Lebesgue measure on [−1, 1]. Then there exists C > 0 and
a sequence (aν )ν ∈F ∈ lp (F) of positive numbers such that
(i) for each ν ∈ F Z
ωντ Lν (yy )u(yy ) dπ(yy ) ≤ Caν ,
U Yt′
(ii) there exists an enumeration (ννi )i∈N of F such that (aνi )i∈N is monotonically decreasing, the
set ΛN := {ννi : i ≤ N } ⊆ F is downward closed for each N ∈ N, and additionally
for N → ∞,
(iii) the following expansion holds with absolute and uniform convergence:
X Z
′
∀yy ∈ U : u(yy ) = Lν (yy ) x)u(x
Lν (x x ) ∈ Yt .
x ) dπ(x
ν ∈F U
The following proposition reformulates Theorem D.4 (i) into a bound for cν ,j . It was shown in
′
[43, Proposition 2]. Recall that θj denote the weights to define the spaces Yt , t′ > 0, see Definition
3.4.
56
Proposition D.5 ([43, Proposition 2]). Consider the setting of Theorem D.4. Then for each
ν ∈F X ′
ων2τ θj−2t cν2 ,j ≤ C 2 aν2 .
j∈N
Proposition D.5 gives decay of the coefficients cν ,j in both j and ν . Since θj = O(j −1+τ )
′
for all τ > 0 we have cν2 ,j = O(j −1−2t +τ̃ ) for τ̃ < 2τ t′ and every ν ∈ ΛN . Furthermore, since
p
(aν )ν ∈F ∈ l (F) the Legendre coefficients cν ,j decay algebraically in ν . We continue with a
technical lemma, which was shown in [43, Lemma 4].
Lemma D.6 ([43, Lemma 4]). Let α > 1, β > 0 and assume two sequences (ai )i∈N and (dj )j∈N in
R with ai . i−α and dj . j −β for all i, j ∈ N. Additionally assume that (dj )j∈N is monotonically
decreasing. Suppose that there exists a constant C < ∞ such that the sequence (ci,j )i,j∈N satisfies
X
∀i ∈ N : c2i,j d−2 2 2
j ≤ C ai .
j∈N
P
(ii) for all N ∈ N there exists (mi )i∈N ⊆ NN 0 monotonically decreasing such that i∈N mi ≤ N
and 21
X X
c2i,j . N − min{α− 2 ,β }+τ .
1
i∈N j>mi
In the following, we use Lemma D.6 to get a decay property for the Legendre coefficients
cνi ,j with the enumeration νi of ΛN from Theorem D.4. The sequence m = (mi )i∈N quantifies
which coefficients of the Legendre expansion are “important” and are therefore used to define the
surrogate ΓN .
We first show that Theorem D.4 yields sufficient decay on the Legendre coefficients cνi ,j s.t. the
assumptions of Lemma D.6 are satisfied.
Lemma D.7. Consider the setting of Theorem D.4. Let τ̃ > 0 such that 1/p > r − τ̃ /2. Then
′
the assumptions of Lemma D.6 are fulfilled for α = r − τ̃ /2, β = t − τ̃ /2, ai = aνi , dj = θjt and
1/2
ci,j = ωνi cνi ,j for i, j ∈ N.
1
Proof. Proposition D.5 with τ = 2 gives
21
1 X ′
τ̃
ω
2
νi θj−2t cν2i ,j = O(aνi ) = O i−r+ 2 . (D.25)
j∈N
P
The last equality in (D.25) holds because iaνpi ≤ p
j∈N aνj < ∞ (since aνi is monotonically
′
decreasing) implies aνi = O(i−1/p ) = O(i−r+τ̃ /2 ). Since (θjt )j∈N ∈ l1/(t−τ̃ /2) (see Definition 3.4)
it holds ′
θjt = O(j −t+τ̃ /2 )
with the same argument.
57
Proofs of Theorems D.2 and D.3
The proof of Theorems D.2 and D.3 is similar to [43, Sections 4.2-4.4, Proofs of Theorems 1, 2
and 5].
Proof of Theorem D.2. Let (aν )ν ∈F be the enumeration (ννi )i∈N from Theorem D.4, where we use
the case τ = 12 . Therefore (aνi )i∈N is monotonically decreasing and belongs to lp with p ∈ ( r1 , 1].
We further fix τ̃ > 0 and demand p1 > r − τ̃2 . Fix Ñ ∈ N and set ΛÑ := {ννj : j ≤ Ñ } ⊂ F, which
is downward closed by Theorem D.4. Now we approximate the tensorized Legendre polynomials
Lν on the index set ΛÑ . Let ρ ∈ (0, 21 ). In the ReLU case, Proposition 3.9 gives a NN fΛÑ ,ρ with
outputs {L̃ν ,ρ }ν ∈ΛÑ s.t.
sup max Lν (yy ) − L̃ν ,ρ (yy ) ≤ ρ.
y ∈U ν ∈ΛÑ
The constants hidden in O( . ) are independent of Ñ and ρ. For Ñ ∈ N, set the accuracy ρ :=
1
Ñ − min{r− 2 ,t} . Then it holds
depth(fΛÑ ,ρ ) = O log(Ñ )2 log(log(Ñ ))2 ,
size(fΛÑ ,ρ ) = O Ñ log(Ñ )2 log(log(Ñ )) .
Proposition D.1 shows that the ReLU bounds also hold for the RePU-case.
By Lemma D.7 the assumptions of Lemma D.6 are satisfied. Applying P Lemma D.6 (i) with
α := r − τ̃ /2 and β := t − τ̃ /2 gives a sequence (mi )i∈N ⊂ NN
0 such that i∈N mi ≤ Ñ and
12
X 1 X
ω
2
νi cν2i ,j ≤ C Ñ − min{r−1,t}+τ̃ . (D.26)
i∈N j>mi
We now define X
γ̃Ñ ,j := L̃νi ,ρ (yy )cνi ,j (D.27)
{i∈N:mi ≥j}
for j ∈ N, where empty sums are set to zero. Recall the uniform distribution π on U = [−1, 1]N
(Example 3.8). With γ̃Ñ = (γ̃Ñ ,j )j∈N it holds
X X X
r
kG0 ◦ σR (yy ) − DY ◦ γ̃Ñ (yy )kY = cνi ,j Lνi (yy )ηj − cνi ,j L̃νi ,ρ (yy )ηj
i,j∈N i∈N j≤mi
Y
X X X X
≤ Lνi (yy ) cνi ,j ηj + (Lνi (yy ) − L̃νi ,ρ (yy )) cνi ,j ηj
i∈N j>mi i∈N j≤mi
Y Y
21 12
X X X X
≤ ΛΨ Y kLνi k∞,π cν2i ,j + ΛΨ Y ρ cν2i ,j
i∈N
| {z } j>m i∈N j≤m
1 i i
≤ων2i
58
for all y ∈ U . In (D.28) we used the definition of DY , (D.27) and (D.21) at the first equality.
Furthermore, we used (D.26), the definition of ρ and
21 21
X X X X
cν2i ,j ≤ cν2i ,j
i∈N j≤mi i∈N j∈N
21
X 1 X ′
≤ C̃ ω
2
νi θj−2t cν2i ,j
i∈N j∈N
X X
≤ C̃ aνi ≤ C̃ i−r+τ̃ /2 ≤ C̃ (D.29)
i∈N i∈N
at the second-to-last inequality. We changed the constants C̃ from line to line in (D.28) and
(D.29). The last line of (D.28) shows why the RePU-case does not improve the approximation
property qualitatively. In the RePU-case, Proposition D.1 gives a σq -NN fΛ exactly realizing the
tensorized Legendre polynomials, i.e. the case ρ = 0 from above. Therefore the second summand
in the last line of (D.28) vanishes. This does not improve the approximation rate due to the first
summand. This part depends on the summability properties of the Legendre coefficients cνi ,j
following Assumption 2 and is therefore independent of the activation function σ.
Now we argue similar to [43, Proof of Theorem 1]. Consider the scaling Sr from (3.5). It holds
r
Sr ◦ EX (a) ∈ U ∀a ∈ CR (X), (D.30)
The round brackets in (D.32) denote a parallelization. The networks SMcνi ,j denote the scalar
multiplication networks from Definition C.8. Furthermore, Σnj denotes the summation network
from Definition C.7 and we use the identity networks IdR from Lemma C.3 or C.4 to synchronize
the depth. Using the respective bounds for the summation and scalar multiplication networks and
59
the NN calculus for parallelization and sparse concatenation we get
depth(γ̃Ñ ) ≤ 2 + max
2
depth L̃νi ,ρ + depth SM cνi ,j + max depth Σ nj
i,j∈N , mi ≥j j∈N
2
≤ 3 + O(log(Ñ ) log(log(Ñ ))) + max Cq log (|cνi ,j |) + 0
i,j∈N2 , mi ≥j
To get rid of the logarithmic terms, we define N = N (Ñ ) := max{1, Ñ log(Ñ )3 } and obtain a NN
γN = γ̃Ñ with
Per definition depth(γN ) = O(log(N )) and width(γN ) = O(N ) yields constants C̃L , C̃p and
N1 , N2 ∈ N s.t.
60
Cp = max{C̃p , maxN =1,...,N2 −1 width(γN )} shows
depth(γN ) ≤ CL log(N ), N ∈ N,
width(γN ) ≤ Cp N, N ∈ N.
In order to show ΓN := DY ◦ γN ◦ Sr ◦ EX ∈ G sp
FN (σq , N ), we are left to show that the maximum
Euclidean norm k · k2 of γN in U is independent of N . It holds for all y ∈ U that DX ◦ Sr−1 (yy ) ∈
r
CR (X). We get
≤ ΛΨ Y (C + cCG0 ) =: B, (D.36)
where ΛΨY denotes the upper frame bound of Ψ Y and c = θ0t , see Definition 3.4. In (D.36) we used
Assumption 2 and the approximation error from (D.35). Thus ΓN ∈ G sp FN (σq , N ) for all N ∈ N,
r
where we set Cs = 1. Using supp(γ) ⊂ CR (X) (Assumption 2) in (D.31) finalizes the proof of
Theorem D.2.
Proof of Theorem D.3. By Lemma D.7 the assumptions of Lemma D.6 are satisfied. Applying
Lemma D.6 (ii) with α := r − τ̃ /2 and β := t − τ̃ /2 gives a sequence (mi )i∈N ⊂ NN
P 0 such that
i∈N m i ≤ Ñ and
21
X X
cν2i ,j ≤ C̃ Ñ − min{r− 2 ,t}+τ̃ .
1
ων i (D.37)
i∈N j>mi
Define γ̃Ñ = (γ̃Ñ ,j )j∈N for all y ∈ U with γ̃Ñ ,j as in (D.27). Then it holds
X X
r
kG0 ◦ σR − DY ◦ γ̃Ñ kL2 (U,π;Y) ≤ cνi ,j Lνi ηj
i∈N j>mi
L2 (U,π;Y)
X X
+ cνi ,j ηj Lνi − L̃νi ,ρ
i∈N j≤mi
L2 (U,π;Y)
21
21
X X X X
≤ ΛΨY kLνi k2∞,π cν2i ,j + ΛΨY ρ cν2i ,j
i∈N
| {z } j>mi i∈N j≤mi
≤ωνi
1 1
≤ C̃ΛΨ Y Ñ − min{r− 2 ,t}+τ̃ + C̃ΛΨ Y ρ ≤ C̃ Ñ − min{r− 2 ,t}+τ̃ . (D.38)
In (D.38) we used the definition of DY , (D.27) and (D.21) at the first inequality. Additionally we
used that (Lν ηj )ν ,j is a frame of L2 (U, π; Y) at the second inequality. Finally we used (D.37), the
definition of ρ and an argument similar to (D.29) at the second-to-last inequality. Note that again
we changed the constants C̃ from line to line in (D.38).
Since Ψ X is a Riesz basis, we have (see Section 3.1.2 and (3.5))
r r r
CR (X) = {σR (yy ), y ∈ U } and EX ◦ σR (yy ) = Sr−1 (yy ). (D.39)
61
With Γ̃Ñ := DY ◦ γ̃Ñ ◦ Sr ◦ EX we calculate
r r
= kDY ◦ γ̃Ñ ◦ Sr ◦ EX ◦ σR − G0 ◦ σR kL2 (U,π;Y)
where we used (D.39) and (D.38). Defining N = N (Ñ ) := max{1, Ñ log(Ñ )3 } we can proceed
similar to the proof of Theorem D.2 from (D.31) on. The reason this works is that the NNs γ̃Ñ
are defined in the same way in the L2 - and the L∞ -case (only the sequence m changes, but not
its properties). This shows ΓN := Γ̃Ñ ∈ G sp
FN (σq , N ) for all N ∈ N and thus finishes the proof of
Theorem D.3.
where k·k∞ denotes the maximum norm in Rn . Then [80, Proposition 3.5] shows that (gg FN , k·k∞,∞ )
is compact. Since the map i : g FN → G FN , g → G = DY ◦ g ◦ EX is linear, also (G G FN , k·k∞,supp(γ) )
and hence (GG FN , k·kn ) is compact. We now show the entropy bounds for G FN .
Step 1. Recall depth(g) ≤ L, depth(g) ≤ p, size(g) ≤ s and mpar(g) ≤ M for g ∈ g FN .
We first estimate the entropy H(G GFN , k · k∞,supp(γ) , δ) against the respective entropy of g FN . For
G, G′ ∈ G FN and g, g ′ ∈ G FN with G = DY ◦ g ◦ Sr ◦ EX , G′ = DY ◦ g ′ ◦ Sr ◦ EX , it holds
r √
where we used σR = DX ◦Sr−1 and k·k∞,∞ from (D.40). Furthermore, we used kg(u)k2 ≤ pkgk∞
for all g ∈ g FN , since NNs g ∈ g FN have width(g) ≤ p. Then (D.41) yields
δ
GFN , k · k∞,σRr (U) , δ) ≤ H g FN , k · k∞,∞ ,
H(G √ . (D.42)
ΛΨY p
Step 2. It remains to bound H(gg FN , k · k∞,∞ , δ) = log(N (gg FN , k · k∞,∞ , δ)). To this end
we follow the proof and notation of [89, Lemma 5]. For l = 1, . . . L + 1, define the matrices
l
Wl = (wi,j )i,j ∈ Rpl−1 ×pl and the vectors Bl = (blj )j ∈ Rpl . Furthermore, define
A+
kg :R
p0
→ Rpk , A+
k g(x) = σ
Bk
Wk . . . σ B1 W1 x,
A−
kg :R
pk−1
→ RpL+1 , A−
k g(x) = σ
BL+1
WL+1 . . . σ Bk Wk x. (D.44)
62
Furthermore, set A+ −
0 g = IdRp0 and AL+2 g = IdR L+1 . For all 1 ≤ l ≤ L + 1 holds
p
kσ Bl (x)k∞ ≤ kxk∞ + M
kW l (x)k∞ ≤ kW l k∞ kxk∞ ≤ M pkxk∞ .
sup kA+
k g(x)k∞ ≤ (M (p + 1))
k
x∈[−1,1]p0
sup kA+
k xk∞ = sup kσ Bk Wk (σ Bk−1 Wk−1 · · · σ B1 W1 x)k∞
x∈[−1,1]p0 x∈[−1,1]p0
≤ sup kσ Bk W k xk∞
x∈[−(M(p+1))k−1 ,(M(p+1))k−1 ]pk−1
as claimed.
Moreover, for l = 1, . . . , L + 1, Wl : (Rpl−1 , k · k∞ ) → (Rpl , k · k∞ ) is Lipschitz with constant
M p and σ Bl : (Rpl , k · k∞ ) → (Rpl , k · k∞ ) is Lipschitz with constant 1. Thus we can estimate the
Lipschitz constant of A− k g for k = 1, . . . , L + 1. It holds
A− −
k g(x) − Ak g(y) ∞
= σ BL+1 WL+1 . . . σ Bk Wk x − σ BL+1 WL+1 . . . σ Bk Wk y ∞
BL Bk BL Bk
≤ Mp σ WL . . . σ Wk x − σ WL . . . σ Wk y ∞
L+2−k pk−1
≤ · · · ≤ (M p) kx − yk∞ for x, y ∈ R . (D.46)
l,∗
Now let g, g ∗ ∈ g FN be two NN such that |wi,j
l
− wi,j | < ε and |bli − bl,∗
i | < ε for all i ≤ pl+1 ,
j ≤ pl , l ≤ L + 1. Then
L+1
X ∗
kg − g ∗ k∞,∞ ≤ A−
k+1 gσ
Bk
Wk A+ ∗ −
k−1 g − Ak+1 gσ
Bk
Wk∗ A+
k−1 g
∗
∞,∞
k=1
L+1
X L+1−k ∗
≤ (M p) σ Bk Wk A+ ∗
k−1 g − σ
Bk
Wk∗ A+
k−1 g
∗
∞,∞
k=1
L+1
X
L+1−k
≤ (M p) (Wk − Wk∗ )A+
k−1 g
∗
∞,∞
+ kBk − Bk∗ k∞
k=1
L+1
X
≤ε (M p)L+1−k pM k−1 (p + 1)k−1 + 1
k=1
< ε(L + 1)M L (p + 1)L+1 , (D.47)
where we used (D.45), (D.46) and M ≥ 1. The total number of weight and biases is less than
(L + 1)(p2 + p). Therefore there are at most
(L + 1)(p2 + p)
≤ ((L + 1)(p2 + p))s
s
combinations to pick s nonzero parameters. Since all parameters are bounded by M , we choose
63
ε = δ/((L + 1)M L (p + 1)L+1 ) and obtain the covering bound for all δ > 0
( s
)
X s∗
−1 2
N (gg FN , k · k∞,∞ , δ) ≤ max 1, 2M ǫ (L + 1)(p + p)
s∗ =1
( s
)
X s∗
−1 L+1 L+1 2
≤ max 1, 2δ (L + 1)M (p + 1) (L + 1)(p + p)
s∗ =1
( s
)
X ∗
−1 2 L+1 L+3 s
≤ max 1, 2δ (L + 1) M (p + 1)
s∗ =1
s+1
≤ 2L+6 L2 M L+1 pL+3 max 1, δ −1 , (D.48)
where we used L ≥ 1 and p ≥ 1 at the last inequality. Eqs. (D.48) and (D.42) show (3.16).
Applying (3.16) to the sparse FrameNet class G spFN (σ1 , N ) gives
H G sp
FN (σ1 , N ), k · k∞,σR (U) , δ
r
depth +4
≤ (sizeN + 1) log 2depthN +6 ΛΨ Y depth2N M depthN +1 widthN N max 1, δ −1
≤ (Cs N + 1)
× log 2CL log(N )+6 ΛΨY (CL log(N ))2 M CL log(N )+1 (Cp N )CL log(N )+4 max 1, δ −1
SP
≤ CH N 1 + log(N )2 + log max 1, δ −1 , N ∈ N, δ > 0, (D.49)
where we defined
SP
CH = 2Cs (CL + 6) log(2) + log(ΛΨ Y ) + CL2 +
+ (CL + 1) log(M ) + (CL + 4)(log(Cp ) + 1) .
Applying (3.16) to the fully connected FrameNet class G full FN (σ1 , N ) gives
H G full
FN (σ1 , N ), k · k∞,σR
r (U) , δ
depth +4
≤ (sFC (N ) + 1) log 2depthN +6 ΛΨ Y depth2N M depthN +1 widthN N max 1, δ −1
≤ (depthN + 1) width2N + widthN + 1
× log 2CL log(N )+6 ΛΨY (CL log(N ))2 M CL log(N )+1 (Cp N )CL log(N )+4 max 1, δ −1
≤ (CL log(N ) + 1) Cp2 N 2 + Cp N + 1
× log 2CL log(N )+6 ΛΨY (CL log(N ))2 M CL log(N )+1 (Cp N )CL log(N )+4 max 1, δ −1
FC 2
≤ CH N 1 + log(N )3 + log max 1, δ −1 , N ∈ N, (D.50)
where we defined
FC
CH = 8CL Cp2 (CL + 6) log(2) + log(ΛΨ Y ) + CL2 +
+ (CL + 1) log(M ) + (CL + 4)(log(Cp ) + 1) .
64
D.5 Proof of Lemma 3.14
The following proof is a modification of [89, Proof of Lemma 5] to the case where the activation
function is not globally, but only locally Lipschitz continuous. The compactness of (G G , k·k∞,supp(γ) )
follows similarly to the ReLU case since [80, Proposition 3.5] holds for any continuous activation
function.
Let q ∈ N, q ≥ 2 and let σq : R → R, σq (x) = max{0, x}q denote the RePU activation
function. Recall depth(g) ≤ L, depth(g) ≤ p, size(g) ≤ s and mpar(g) ≤ M for g ∈ g FN .
We argue analogously to the ReLU-case in Lemma 3.12 and bound the entropy of the NN class
g FN (σq , L, p, s, M, B). Recall the definitions of σ Bl , A+ −
k g and Ak g from (D.43)-(D.44). Similar to
(D.45) it holds that
kA+
k gk∞,∞ = sup kA+
k g(x)k∞
x∈[−1,1]p0
≤ sup σ Bk Wk σ Bk−1 . . . W2 σq x ∞
x∈[−M(p+1),M(p+1)]p1
≤ sup σ Bk Wk σ Bk−1 . . . W2 x ∞
x∈[−M q (p+1)q ,M q (p+1)q ]p1
Pk
qj qk+1
≤ · · · ≤ (M (p + 1))) j=1
≤ (M (p + 1)) ,
where we used M ≥ 1.
In the RePU-case, A− ′
k g is only locally Lipschitz: Since |σq (x) | ≤ q|x|
q−1
it holds
A− −
k g(x) − Ak g(y) ∞
= σ BL+1 WL+1 σ BL . . . Wk x − σ BL+1 WL+1 σ BL . . . Wk y ∞
BL BL−1 BL BL−1
≤ Mp σ WL σ . . . Wk x − σ WL σ . . . Wk y ∞
!q−1
≤ M pq sup WL σ BL−1 . . . Wk x ∞
kxk∞ ≤C
× WL σ BL−1 . . . Wk x − WL σ BL−1 . . . Wk y ∞
!q−1
≤ (M pq)L+2−k sup WL σ BL−1 . . . Wk x ∞
kxk∞ ≤C
!q−1 !q−1
× sup WL−1 σ BL−2 . . . Wk x ∞
× ···× sup kWk xk∞ kx − yk∞ .
kxk∞ ≤C kxk∞ ≤C
Using
sup Wj σ Bj−1 . . . Wk x ∞
≤ sup Wj σ Bj−1 . . . Wk+1 x ∞
kxk∞ ≤C kxk∞ ≤(M(p+1)C)q
65
for j = k, . . . , L and C, M ≥ 1, we get
L
Y qj−k+1
A− −
k g(x) − Ak g(y) ∞
≤ (M pq)L+2−k M (p + 1)Ĉ kx − yk∞
j=k
qL+2−k
≤ (M pq)L+2−k M (p + 1)Ĉ kx − yk∞ , x, y ∈ Rpk−1 .(D.51)
l,∗
Now we proceed similar to (D.47). Let g, g ∗ ∈ g FN be two NN such that |wi,j l
− wi,j | < ε and
l l,∗ + −
|bi − bi | < ε for all i ≤ pl+1 , j ≤ pl , l ≤ L + 1. Then with A0 g = IdRp0 and AL+2 g = IdRpL+1 we
estimate
kg − g ∗ k∞,∞
L+1
X ∗
≤ A−
k+1 gσ
Bk
Wk A+ ∗ −
k−1 g − Ak+1 gσ
Bk
Wk∗ A+
k−1 g
∗
∞,∞
k=1
L+1
X qL+1−k
L+1−k qk+1
≤ (M pq) M (p + 1) (M (p + 1))
k=1
∗
σ Bk Wk A+ ∗
k−1 g − σ
Bk
Wk∗ A+
k−1 g
∗
∞,∞
L+1
X k+1
qL+1−k
≤ (M pq)L+1−k M (p + 1) (M (p + 1))q
k=1
q−1
(Wk − Wk∗ )A+
k−1 g
∗
∞,∞
+ kB k − B ∗ +
k ∞,∞ q M (p + 1) Ak−1 g
k ∗
∞,∞
L+1
X q L+1−k
L+2−k qk+1
≤ 2ε (M pq) M (p + 1) (M (p + 1)) A+
k−1 g
∗
∞,∞
k=1
q−1
M (p + 1) A+
k−1 g
∗
∞,∞
L
L+2 q
L+1 q
≤ 2ε(L + 1) (M (p + 1)q)L+q M (p + 1) (M (p + 1))q (M (p + 1))q
4q2L+2 √ −1
< εLq L+q (2pM ) 2M p(p2 + p)(L + 1) . (D.52)
combinations to pick s nonzero weights and biases. Since all parameters are bounded by M , we
choose
√
2M p(p2 + p)(L + 1)δ
ε= 4q2L+2
Lq L+q (2pM )
66
and obtain the covering bound
( s
)
X s∗
N (gg FN , k · k∞,∞ , δ) ≤ max 1, 2M ǫ−1 (L + 1)(p2 + p)
s∗ =1
( s
)
X √ s
∗
L+q 4q2L+2
≤ max 1, Lq (2pM ) ( pδ)−1
s∗ =1
2L+2 √ s+1
−1
≤ Lq L+q (2pM )4q p max 1, δ −1 . (D.53)
where we set
SP
CH = 2Cs log(ΛΨ Y ) + CL + (CL + q) log(q) + 4q 2 (log(2Cp M ) + 1) .
Applying (3.17) to the fully connected FrameNet class G full FN (σq , N ) gives the entropy bound
H G full
FN (σq , N ), k · k∞,σR
r (U) , δ
4q2depth N +2
≤ (sF C (N ) + 1) log ΛΨ Y depthN q depthN +q (2widthN M ) max 1, δ −1
≤ (depthN + 1) width2N + widthN + 1
4q2depth N +2
× log ΛΨ Y depthN q depthN +q (2widthN M ) max 1, δ −1
≤ (CL log(N ) + 1) Cp2 N 2 + Cp N + 1
4q2CL log(N )+2
× log ΛΨ Y CL log(N )q CL log(N )+q (2Cp N M ) max 1, δ −1
FC 2+2CL log(q)
≤ CH N 1 + log(N )2 + log max 1, δ −1 , (D.55)
where we set
FC
CH =8Cs Cp2 2
log(ΛΨ Y ) + CL + (CL + q) log(q) + 4q (log(2Cp M ) + 1) .
E Proofs of Section 4
E.1 Proof of Theorem 4.2
In [43, Proof of Proposition 3, Step 1], the holomorphy in Assumption 2 is verified for X, Y in
r
(4.5) with r0 > d/2 and t ∈ [0, (1 + r0 − d/2 − t0 )/d). Moreover, γ = (σR )# π in particular shows
r
supp(γ) ⊆ CR (X) and hence verifies the second part of Assumption 2. Substituting s = r0 + rd,
i.e. r = s−r
d , and taking t = (1 + r0 − d/2 − t0 )/d − τ with some small τ , Theorem 3.15 (i) then
0
gives
κ
EG0 [kĜn − G0 k2L2 (γ) ] ≤ Cn− κ+1 +τ ,
67
where
d
s − r0 1 1 + r0 − 2 − t0
κ = 2 min − , −τ
d 2 d
68